Automate getting data from PDFs

PDFTables has an API (Application Programming Interface) so that programmers can integrate PDF data extraction into their operations.

It is a simple web-based API, so it can be called from any programming language.

Basic usage

To convert a PDF, send a multipart HTTP request containing the file's contents to https://pdftables.com/api?key=YOUR_API_KEY.

Here's an example using cURL, a commonly available command-line tool for running HTTP requests.

curl -F f=@example.pdf "https://pdftables.com/api?key=YOUR_API_KEY&format=xml"

The name of the form variable (f= above) is ignored, and only the first file is processed.

Choosing format

The above example converts to an XML file. To specify a different format when using cURL, change the value of the format= parameter. For example, to download a single-sheet XLSX from the API, you might use:

curl -F f=@example.pdf "https://pdftables.com/api?key=YOUR_API_KEY&format=xlsx-single"

Format	URL parameter	Notes
CSV	`format=csv`	Comma Separated Values, blank row between pages.
XML	`format=xml`	Contains HTML `<table>` tags; `<td>` tags may have `colspan=` attributes. See XML format for details.
HTML	`format=html`	Table as HTML fragment. New pages are separated by `<h2>` elements that have `class="pagenumber"` and "Page X" as the element text, where X is the page number.
XLSX	`format=xlsx-single`	Excel, all PDF pages on one sheet, blank row between pages.
XLSX	`format=xlsx-multiple`	Excel, one sheet per page of the PDF.
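
Because the CSV output puts a blank row between pages, the pages can be recovered after download. The following Python sketch shows one way to do that. It assumes that a "blank row" is either an empty line or a row whose cells are all empty, so a genuinely blank row inside a table would also be treated as a page break. `split_csv_pages` is a hypothetical helper name, not part of any PDFTables library.

```python
import csv
import io


def split_csv_pages(csv_text):
    """Split CSV text into pages, where each page is a list of rows.

    Assumption: a row with no non-empty cells marks a page boundary.
    """
    pages = [[]]
    for row in csv.reader(io.StringIO(csv_text)):
        if any(cell.strip() for cell in row):
            pages[-1].append(row)
        elif pages[-1]:
            # Blank row: start a new page, ignoring repeated blank rows.
            pages.append([])
    # Drop the empty page left behind by a trailing blank row (or empty input).
    if not pages[-1]:
        pages.pop()
    return pages
```

Calling `split_csv_pages` on the text of a downloaded CSV file returns one list of rows per page, in page order.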

We plan to support other formats in the future, according to demand. If you need something else, contact us!

Choosing extractor

The above example uses the default Standard extractor to process the PDF document. To specify a different extractor when using cURL, set the value of the extractor= parameter. For example, to explicitly use the Intelligent #2 extractor, you would use:

curl -F f=@example.pdf "https://pdftables.com/api?key=YOUR_API_KEY&format=xml&extractor=ai-2"

Extractor	URL parameter	Notes	Credits per page	Max pages per PDF	Max file upload size
Standard	`extractor=standard`	Our original and fastest extractor. Not suitable for documents that contain only images (i.e. scans without text recognition).	1	4000	500 MB
Intelligent #2	`extractor=ai-2`	Possibly the best overall extractor. Handles poor scans, confusing layouts and handwriting. Uses Artificial Intelligence on the PDF as you see it. Extractor option: extraction mode (`?extract=`), either `tables` (tables only, default) or `tables-paragraphs` (tables and paragraphs).	2	250	50 MB
Intelligent #1	`extractor=ai-1`	Works better in some cases. Good for well-structured documents and cleaner scans. Uses Artificial Intelligence on the PDF as you see it. Extractor option: extraction mode (`?extract=`), either `tables` (tables only, default) or `tables-paragraphs` (tables and paragraphs).	2	250	50 MB
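
The limits in the table can be checked before uploading, which avoids sending a file that an extractor cannot accept. The sketch below copies the figures from the table and makes two assumptions that the table does not state: that every page is charged at the listed rate, and that 1 MB means 1,000,000 bytes. `EXTRACTORS` and `estimate_credits` are hypothetical names used for illustration only.

```python
# Limits copied from the extractor table above.
EXTRACTORS = {
    "standard": {"credits_per_page": 1, "max_pages": 4000, "max_upload_mb": 500},
    "ai-1": {"credits_per_page": 2, "max_pages": 250, "max_upload_mb": 50},
    "ai-2": {"credits_per_page": 2, "max_pages": 250, "max_upload_mb": 50},
}


def estimate_credits(extractor, page_count, file_size_bytes):
    """Return the estimated credit cost, or raise ValueError if a limit is exceeded."""
    limits = EXTRACTORS[extractor]
    if page_count > limits["max_pages"]:
        raise ValueError(
            f"{extractor} accepts at most {limits['max_pages']} pages per PDF"
        )
    # Assumption: 1 MB = 1,000,000 bytes; the table does not define the unit.
    if file_size_bytes > limits["max_upload_mb"] * 1_000_000:
        raise ValueError(
            f"{extractor} accepts uploads of at most {limits['max_upload_mb']} MB"
        )
    # Assumption: every page is charged at the per-page rate.
    return page_count * limits["credits_per_page"]
```

For example, a 10-page PDF sent to `ai-2` is estimated at 20 credits, while the same file sent to `standard` is estimated at 10.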

Get remaining balance

The https://pdftables.com/api/remaining endpoint returns the number of credits remaining on your account, as an integer.

$ curl "https://pdftables.com/api/remaining?key=YOUR_API_KEY"
<an integer representing your remaining balance, e.g: 40>
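
The same check can be made from a script, for example before starting a large batch. This Python sketch uses only the standard library and assumes the response body is plain text containing just the integer, as in the cURL output above. `parse_remaining` and `get_remaining` are hypothetical helper names.

```python
import urllib.parse
import urllib.request

REMAINING_URL = "https://pdftables.com/api/remaining"


def parse_remaining(body):
    """Convert the response body (for example "40") to an int."""
    return int(body.strip())


def get_remaining(api_key):
    """Fetch the remaining credit balance for the given API key."""
    url = REMAINING_URL + "?" + urllib.parse.urlencode({"key": api_key})
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(url) as response:
        return parse_remaining(response.read().decode("utf-8"))
```

A call such as `get_remaining("YOUR_API_KEY")` would return the balance as an `int`, which can then be compared with the estimated cost of a job.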

Language examples

Python

There is an official Python API for PDF to Excel on GitHub.

Alternatively, you can use the requests library or another library capable of doing multi-part HTTP requests in a straightforward manner.
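
As a rough illustration of the requests approach, the sketch below mirrors the cURL examples from earlier on this page: it uploads the file as a multipart form field and passes `key`, `format` and (optionally) `extractor` as URL parameters. The function names are hypothetical, and it assumes that failed conversions are reported with an HTTP error status, which this page does not confirm.

```python
API_URL = "https://pdftables.com/api"


def build_params(api_key, fmt="xml", extractor=None):
    """Build the URL parameters used in the cURL examples."""
    params = {"key": api_key, "format": fmt}
    if extractor is not None:
        params["extractor"] = extractor
    return params


def convert_pdf(pdf_path, out_path, api_key, fmt="xml", extractor=None):
    """Upload pdf_path for conversion and save the response body to out_path."""
    # Third-party dependency (pip install requests), imported here so that
    # build_params can be used without it.
    import requests

    with open(pdf_path, "rb") as pdf_file:
        response = requests.post(
            API_URL,
            params=build_params(api_key, fmt, extractor),
            # The form field name is ignored by the API; "f" matches the cURL examples.
            files={"f": pdf_file},
        )
    # Assumption: errors are signalled with a 4xx/5xx status code.
    response.raise_for_status()
    with open(out_path, "wb") as out_file:
        out_file.write(response.content)
```

For example, `convert_pdf("example.pdf", "example.xlsx", "YOUR_API_KEY", fmt="xlsx-single")` corresponds to the single-sheet XLSX cURL command shown under "Choosing format".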

Windows PowerShell

We have examples of Windows PowerShell PDF to Excel scripts on GitHub for batch and single PDF conversion. You can use these on Windows 7 and above (including Windows 10) with no additional software.

PHP

There is an official example PHP script for PDF to Excel or CSV conversion on GitHub.

C#

There is an official example C# PDF to Excel conversion program on GitHub.

Visual Basic for Applications (VBA)

There are official VBA macros for PDF to Excel conversion on GitHub, kindly provided by Dan Elgaard.

Java

There is an official example Java program to convert PDF to Excel on GitHub.

R

There's an unofficial R package for PDF to Excel conversion on GitHub.

C/C++

There is an official example C and C++ program to convert PDF to Excel on GitHub.

Go

There is an official Golang API for PDF to Excel conversion on GitHub.

Node.js

There is an official Node.js API for PDF to Excel conversion on GitHub.

Other languages

If your favourite language isn't listed here, and you'd like help, contact us.

Output formats

XML

The XML output format contains HTML style tables.

We strongly recommend using an XML parsing library rather than matching the text directly, because we may add attributes to existing tags and introduce new tags with different names in future.

<document>

The outermost tag is a <document> tag, which corresponds to a single PDF document.

Contains any number of <page> tags.

Attributes

page-count: the number of pages in the PDF document.

<page>

A single page from the PDF document.

Contains any number of <table> tags. In future, it may also contain text that is not part of a table.

Attributes

number: the physical page number, starting from 1. Beware: this may not correspond to the logical page number in the PDF.

<table>

A single table. At the moment, only one table is identified per page, which covers the whole page; this may change in the future.

Contains any number of <tr> tags.

Attributes

data-filename: should be ignored, internal use only.
data-page: a number matching the number of the page tag, i.e. the page number on which the table was found.
data-table: an index number for the tables on a page. Currently always 1. A future update may parse multiple tables per page.

<tr>

A single row from a table.

Contains any number of <td> tags.

Attributes

Currently none. Some attributes may be added in a future update.

<td>

A table cell.

Contains the text value of the cell, which may include <br /> tags to indicate line breaks.

Attributes

style: Currently used for formatting numbers. Deprecated and may be removed in a later update.
class: The table cell may have a class attribute to indicate the type of content, for example a number.
colspan: The width of a cell which is wider than a single column. Not always present. Should be interpreted as per HTML.
rowspan: The height of a cell which is taller than a single row. Not always present. Should be interpreted as per HTML.

The HTML 4 specification describes HTML tables in more detail.
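
The structure described above can be read with a standard XML parser. The following Python sketch uses `xml.etree.ElementTree` to turn the XML output into lists of rows per page. It looks tags up by name, so unknown tags and attributes are ignored. The sample document is a hand-written illustration of the format, not real API output. As simplifying assumptions, `colspan` is handled by padding the row with empty cells, and `rowspan` is ignored. `cell_text` and `parse_tables` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET


def cell_text(td):
    """Return the text of a <td>, turning <br /> tags into newlines."""
    parts = [td.text or ""]
    for child in td:
        if child.tag == "br":
            parts.append("\n")
        else:
            # Unknown child tag: keep its text content.
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts)


def parse_tables(xml_text):
    """Map each page number to a list of tables; each table is a list of rows."""
    root = ET.fromstring(xml_text)
    pages = {}
    for page in root.iter("page"):
        tables = []
        for table in page.iter("table"):
            rows = []
            for tr in table.iter("tr"):
                row = []
                for td in tr.iter("td"):
                    row.append(cell_text(td))
                    # Simplification: pad wide cells with empty strings.
                    row.extend([""] * (int(td.get("colspan", "1")) - 1))
                rows.append(row)
            tables.append(rows)
        pages[int(page.get("number"))] = tables
    return pages


# Hand-written sample for illustration only.
SAMPLE = """<document page-count="1">
  <page number="1">
    <table data-page="1" data-table="1">
      <tr><td colspan="2">Totals</td></tr>
      <tr><td>Line 1<br />Line 2</td><td>42</td></tr>
    </table>
  </page>
</document>"""

tables_by_page = parse_tables(SAMPLE)
```

With the sample above, `tables_by_page[1][0]` is the only table on page 1, as a list of two rows with two cells each.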
