Automate getting data from PDFs

PDFTables has an API (Application Programming Interface) so that programmers can integrate PDF data extraction into their own operations.

It's a simple web-based API, so it can be called from any programming language. The same API is also available in the on-premises version.

Log in or join to get an API key

Basic usage

To convert a PDF, send a multipart/form-data HTTP POST request containing the file to https://pdftables.com/api?key=YOUR_API_KEY.

Here's an example using cURL, a commonly available command-line tool for making HTTP requests.

curl -F f=@example.pdf "https://pdftables.com/api?key=YOUR_API_KEY&format=xml"

The name of the form variable (f= above) is ignored, and only the first file is processed.
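
The same request can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the helper names (build_url, encode_multipart, convert_pdf), the application/pdf part type and the timeout are assumptions made for the example, and the multipart body is assembled by hand to show what curl -F sends.

```python
import os
import urllib.parse
import urllib.request
import uuid

API_URL = "https://pdftables.com/api"  # endpoint from the text above


def build_url(api_key, fmt=None):
    """Return the API URL with the key and an optional format= parameter."""
    params = {"key": api_key}
    if fmt is not None:
        params["format"] = fmt
    return API_URL + "?" + urllib.parse.urlencode(params)


def encode_multipart(filename, content):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        # The form variable name ("f") is ignored by the API, as noted above.
        f'Content-Disposition: form-data; name="f"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def convert_pdf(pdf_path, api_key, fmt="xml", timeout=60):
    """Upload pdf_path and return the converted document as bytes."""
    with open(pdf_path, "rb") as fh:
        body, content_type = encode_multipart(os.path.basename(pdf_path), fh.read())
    request = urllib.request.Request(
        build_url(api_key, fmt),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Calling convert_pdf("example.pdf", "YOUR_API_KEY") would then correspond to the cURL command above.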

Choosing format

The above example converts to an XML file. To specify a different format, change the value of the format= parameter. For example, to download a single-sheet XLSX from the API, you might use:

curl -F f=@example.pdf "https://pdftables.com/api?key=YOUR_API_KEY&format=xlsx-single"

| Format | URL Parameter | Notes |
|--------|---------------|-------|
| CSV | format=csv | Comma Separated Values, blank row between pages. |
| XML | format=xml | Contains HTML <table> tags; <td> tags may have colspan= attributes. See XML format for details. |
| XLSX | format=xlsx-single | Excel, all PDF pages on one sheet, blank row between pages. |
| XLSX | format=xlsx-multiple | Excel, one sheet per page of the PDF. |

We plan to support other formats in the future, according to demand. If you need something else, contact us!
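
When scripting conversions, it can help to derive the output file name from the chosen format= value. The sketch below covers only the four values listed above; the helper name output_path and the choice to write next to the input file are assumptions made for illustration.

```python
import os

# File extension for each documented format= value.
EXTENSIONS = {
    "csv": ".csv",
    "xml": ".xml",
    "xlsx-single": ".xlsx",
    "xlsx-multiple": ".xlsx",
}


def output_path(pdf_path, fmt):
    """Return a sibling path for the converted file, e.g. report.pdf -> report.xlsx."""
    try:
        extension = EXTENSIONS[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt!r}") from None
    stem, _ = os.path.splitext(pdf_path)
    return stem + extension
```

Rejecting unknown values locally avoids spending a request on a format the API does not list.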

Get remaining balance

This endpoint returns the integer number of pages remaining.

$ curl "https://pdftables.com/api/remaining?key=YOUR_API_KEY"
<an integer representing your remaining balance, e.g. 50>
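
A script might check the balance before starting a batch. The following is a rough sketch that assumes the response body is just the integer as plain text, possibly with surrounding whitespace; the function names and the timeout are hypothetical.

```python
import urllib.parse
import urllib.request

REMAINING_URL = "https://pdftables.com/api/remaining"  # endpoint from the text above


def parse_remaining(body):
    """Parse the response body (bytes or str) into an integer page count."""
    if isinstance(body, bytes):
        body = body.decode("ascii")
    # Assumption: the body contains only the number.
    return int(body.strip())


def get_remaining(api_key, timeout=30):
    """Fetch the number of pages remaining for api_key."""
    url = REMAINING_URL + "?" + urllib.parse.urlencode({"key": api_key})
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_remaining(response.read())
```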

Language examples

Python

There is an official Python API package on GitHub.

Alternatively, you can use the requests library or another library capable of doing multi-part HTTP requests in a straightforward manner.
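
With the third-party requests library, the upload might look like the sketch below. It is not the official package's interface; prepare_upload and convert_pdf are names invented for this example, and the default format, filename and timeout are arbitrary choices.

```python
import requests

API_URL = "https://pdftables.com/api"  # endpoint from the Basic usage section


def prepare_upload(file_obj, api_key, fmt="csv", filename="upload.pdf"):
    """Build (but do not send) the multipart POST request for one PDF."""
    request = requests.Request(
        "POST",
        API_URL,
        params={"key": api_key, "format": fmt},
        # requests encodes the files= argument as multipart/form-data.
        files={"f": (filename, file_obj, "application/pdf")},
    )
    return request.prepare()


def convert_pdf(pdf_path, api_key, fmt="csv", timeout=60):
    """Upload pdf_path and return the converted document as bytes."""
    with open(pdf_path, "rb") as fh, requests.Session() as session:
        response = session.send(prepare_upload(fh, api_key, fmt), timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.content
```

Separating request preparation from sending makes it possible to inspect the URL and headers without contacting the server.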

PHP

There is an official PHP example on GitHub.

C#

There is an official C# example on GitHub.

Visual Basic for Applications

There are official VBA macros on GitHub, kindly provided by Dan Elgaard.

For instructions on how to use these macros, check out our PDF to Excel VBA walkthrough.

Java

There is an official Java example on GitHub.

R

There's an unofficial R package on GitHub.

C/C++

There is an official C and C++ example on GitHub.

Go

There is an official Golang API example on GitHub.

Other languages

If your favourite language isn't listed here, and you'd like help, contact us.

Output formats

XML

The XML output format contains HTML style tables.

We strongly recommend using an XML parsing library rather than matching the text by hand. We may add new attributes to existing tags and new tags with different names, so your code should ignore anything it does not recognise.
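
As an illustration of parsing that tolerates such additions, the sketch below uses Python's xml.etree.ElementTree and looks up tags by name, so unknown tags and attributes are skipped. It assumes the document uses no XML namespace and follows the tag nesting described in the sections below; iter_rows is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def iter_rows(xml_text):
    """Yield (page_number, tr_element) for every table row in the document."""
    root = ET.fromstring(xml_text)
    # findall() matches direct children by tag name, so any other
    # tags that appear alongside them are ignored.
    for page in root.findall("page"):
        number = int(page.get("number"))
        for table in page.findall("table"):
            for row in table.findall("tr"):
                yield number, row
```

For example, `[(n, [td.text for td in tr.findall("td")]) for n, tr in iter_rows(xml_text)]` collects the cell text of each row together with its page number.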

<document>

The outermost tag is a <document> tag, which corresponds to a single PDF document.

Contains any number of <page> tags.

Attributes

  • page-count: the number of pages in the PDF document.

<page>

A single page from the PDF document.

Contains any number of <table> tags. In the future, it may also contain text that is not part of a table.

Attributes

  • number: the physical page number, starting from 1. Beware this may not correspond to the logical page number in the PDF. A PDF file can specify multiple

<table>

A single table. At the moment, only one table is identified per page, which covers the whole page; this may change in the future.

Contains any number of <tr> tags.

Attributes

  • data-filename: should be ignored, internal use only.
  • data-page: a number matching the number of the page tag, i.e. the page number on which the table was found.
  • data-table: an index number for the tables on a page. Currently always 1. A future update may parse multiple tables per page.

<tr>

A single row from a table.

Contains any number of <td> tags.

Attributes

Currently none. Some attributes may be added in a future update.

<td>

A table cell.

Contains text — the value of the cell. May contain <br /> tags to indicate new-lines.
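
One way to read a cell's value while honouring <br /> tags is sketched below with xml.etree.ElementTree. The helper name cell_text is an assumption, as is the choice to drop the contents of any other child tag while keeping the text that follows it.

```python
import xml.etree.ElementTree as ET


def cell_text(td):
    """Return the text of a <td>, with each <br /> turned into a newline."""
    parts = [td.text or ""]
    for child in td:
        if child.tag == "br":
            parts.append("\n")
        # ElementTree stores the text after a child tag in that child's .tail.
        parts.append(child.tail or "")
    return "".join(parts)
```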

Attributes

  • style: Currently used for formatting numbers. Deprecated and may be removed in a later update.
  • class: The table cell may have a class attribute to indicate the type of content, for example a number.
  • colspan: The width of a cell which is wider than a single column. Not always present. Should be interpreted as per HTML.
  • rowspan: The height of a cell which is taller than a single row. Not always present. Should be interpreted as per HTML. Not yet implemented.

The HTML 4 specification describes HTML tables in more detail.
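
To turn a <table> into a rectangular grid, colspan can be expanded by padding with empty cells. The sketch below is one possible approach under stated assumptions: table_to_grid is a hypothetical helper, rowspan is ignored because it is documented above as not yet implemented, and <br /> tags are dropped rather than converted to newlines.

```python
import xml.etree.ElementTree as ET


def table_to_grid(table):
    """Convert a <table> element to a list of equal-length rows of strings."""
    rows = []
    for tr in table.findall("tr"):
        cells = []
        for td in tr.findall("td"):
            # A missing colspan means the cell is one column wide.
            span = max(int(td.get("colspan", "1")), 1)
            # Put the value in the first column and pad the rest of the span.
            cells.append("".join(td.itertext()))
            cells.extend([""] * (span - 1))
        rows.append(cells)
    # Pad short rows so every row has the same number of columns.
    width = max((len(row) for row in rows), default=0)
    return [row + [""] * (width - len(row)) for row in rows]
```

The resulting list of lists can be passed directly to csv.writer(...).writerows() if a CSV file is wanted.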