Automate getting data from PDFs

PDFTables has an API (Application Programming Interface) so that programmers can integrate PDF data extraction into their own operations.

It's a simple web-based API, so it can be called from any programming language.

Every request needs an API key. Log in or create an account to get one.

Basic usage

To convert a PDF, send a multipart HTTP POST request containing the content of the file to https://pdftables.com/api?key=YOUR_API_KEY.

Here's an example using cURL, a commonly available command-line tool for making HTTP requests.

curl -F f=@example.pdf "https://pdftables.com/api?key=YOUR_API_KEY&format=xml"

The name of the form variable (f= above) is ignored, and only the first file is processed.

Choosing format

The above example converts to an XML file. To specify a different format when using cURL, change the value of the format= parameter. For example, to download a single-sheet XLSX from the API, you might use:

curl -F f=@example.pdf "https://pdftables.com/api?key=YOUR_API_KEY&format=xlsx-single"

Format  URL Parameter         Notes
CSV     format=csv            Comma Separated Values, blank row between pages.
XML     format=xml            Contains HTML <table> tags; <td> tags may have colspan= attributes. See the XML section under "Output formats" for details.
HTML    format=html           Table as HTML fragment. New pages are separated by <h2> elements that have class="pagenumber" and "Page X" as the element text, where X is the page number.
XLSX    format=xlsx-single    Excel, all PDF pages on one sheet, blank row between pages.
XLSX    format=xlsx-multiple  Excel, one sheet per page of the PDF.
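
As a rough illustration, the request URL can be assembled in Python from the format values listed above. This is only a sketch: the names API_URL, FORMATS and build_convert_url are made up for this example and are not part of any official client.

```python
from urllib.parse import urlencode

API_URL = "https://pdftables.com/api"

# Values of the format= parameter documented above.
FORMATS = {"csv", "xml", "html", "xlsx-single", "xlsx-multiple"}


def build_convert_url(api_key, fmt):
    """Return the conversion URL for an API key and output format.

    Hypothetical helper: it only builds the URL, it does not call the API.
    """
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"{API_URL}?{urlencode({'key': api_key, 'format': fmt})}"
```

Rejecting unknown values locally is a design choice for this sketch; how the API itself responds to an unsupported format is not described here.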

We plan to support other formats in the future, according to demand. If you need something else, contact us!
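
Because CSV output separates pages with a blank row, the pages can be recovered after download. The sketch below assumes that a row with no non-empty cells always marks a page break; a table that legitimately contains an empty row would be split too, and pages with no rows are skipped. The function name split_csv_pages is hypothetical.

```python
import csv
import io


def split_csv_pages(csv_text):
    """Split CSV text into pages, treating each blank row as a page break.

    Returns a list of pages; each page is a list of rows (lists of strings).
    """
    pages, current = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        if any(cell.strip() for cell in row):
            current.append(row)
        elif current:
            # Blank row: close the page collected so far.
            pages.append(current)
            current = []
    if current:
        pages.append(current)
    return pages
```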

Get remaining balance

This endpoint returns the integer number of pages remaining.

$ curl "https://pdftables.com/api/remaining?key=YOUR_API_KEY"
<an integer representing your remaining balance, e.g. 40>
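
The same check could be made from Python with the standard library. This sketch assumes the response body is just the integer as plain text, possibly followed by whitespace; parse_remaining and get_remaining are names invented for this example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

REMAINING_URL = "https://pdftables.com/api/remaining"


def parse_remaining(body):
    """Parse the response body (bytes) into an integer page count."""
    return int(body.decode("ascii").strip())


def get_remaining(api_key, timeout=30.0):
    """Fetch the remaining page balance. Performs a network request."""
    url = f"{REMAINING_URL}?{urlencode({'key': api_key})}"
    with urlopen(url, timeout=timeout) as response:
        return parse_remaining(response.read())
```

Calling get_remaining("YOUR_API_KEY") before a batch job is one way to avoid starting work that the balance cannot cover.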

Language examples

Python

There is an official Python API for PDF to Excel on GitHub.

Alternatively, you can use the requests library or another library capable of doing multi-part HTTP requests in a straightforward manner.
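
For illustration, here is a sketch that uses only the Python standard library instead of requests, encoding the multipart body by hand. It is not the official client: encode_multipart and convert_pdf are hypothetical names, and the form field name "f" simply mirrors the cURL example (the API ignores it).

```python
import uuid
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://pdftables.com/api"


def encode_multipart(filename, content):
    """Encode one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="f"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def convert_pdf(pdf_path, api_key, fmt, timeout=120.0):
    """Upload a PDF and return the converted file as bytes (network call)."""
    with open(pdf_path, "rb") as pdf_file:
        body, content_type = encode_multipart("upload.pdf", pdf_file.read())
    url = f"{API_URL}?{urlencode({'key': api_key, 'format': fmt})}"
    request = Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    # urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    with urlopen(request, timeout=timeout) as response:
        return response.read()
```

A call such as convert_pdf("example.pdf", "YOUR_API_KEY", "csv") would return the CSV bytes, which can then be written to a file opened in binary mode.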

Windows PowerShell

We have examples of Windows PowerShell PDF to Excel scripts on GitHub for batch and single PDF conversion.
These scripts run on Windows 7 and above (including Windows 10) with no additional software.

PHP

There is an official example PHP script for PDF to Excel or CSV conversion on GitHub.

C#

There is an official example C# PDF to Excel conversion program on GitHub.

Visual Basic for Applications (VBA)

There are official VBA macros for PDF to Excel conversion on GitHub, kindly provided by Dan Elgaard.

Java

There is an official example Java program to convert PDF to Excel on GitHub.

R

There's an unofficial R package for PDF to Excel conversion on GitHub.

C/C++

There is an official example C and C++ program to convert PDF to Excel on GitHub.

Go

There is an official Golang API for PDF to Excel conversion on GitHub.

Node.js

There is an official Node.js API for PDF to Excel conversion on GitHub.

Other languages

If your favourite language isn't listed here, and you'd like help, contact us.

Output formats

XML

The XML output format contains HTML style tables.

We strongly recommend using an XML parsing library rather than matching the text directly, because we may add attributes to existing tags and add tags with different names to the XML document.

<document>

The outermost tag is a <document> tag, which corresponds to a single PDF document.

Contains any number of <page> tags.

Attributes

  • page-count: the number of pages in the PDF document.

<page>

A single page from the PDF document.

Contains any number of <table> tags. May in future contain text that is not part of a table.

Attributes

  • number: the physical page number, starting from 1. Beware: this may not correspond to the logical page number in the PDF.

<table>

A single table. At the moment, only one table is identified per page, which covers the whole page; this may change in the future.

Contains any number of <tr> tags.

Attributes

  • data-filename: should be ignored, internal use only.
  • data-page: a number matching the number attribute of the enclosing <page> tag, i.e. the page number on which the table was found.
  • data-table: an index number for the tables on a page. Currently always 1. A future update may parse multiple tables per page.

<tr>

A single row from a table.

Contains any number of <td> tags.

Attributes

Currently none. Some attributes may be added in a future update.

<td>

A table cell.

Contains text: the value of the cell. May contain <br /> tags to indicate line breaks.

Attributes

  • style: Currently used for formatting numbers. Deprecated and may be removed in a later update.
  • class: The table cell may have a class attribute to indicate the type of content, for example a number.
  • colspan: The width of a cell which is wider than a single column. Not always present. Should be interpreted as per HTML.
  • rowspan: The height of a cell which is taller than a single row. Not always present. Should be interpreted as per HTML. Not yet implemented.

The HTML 4 specification describes HTML tables in more detail.
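
The structure described above could be read with Python's xml.etree.ElementTree along the following lines. This is a sketch under assumptions: SAMPLE is a made-up document written to match the description (real output may carry extra attributes, tags or namespaces), the function names are hypothetical, and a colspan cell is expanded by padding with empty strings, which is only one possible interpretation.

```python
import xml.etree.ElementTree as ET

# Made-up sample that follows the tag structure described above.
SAMPLE = """<document page-count="1">
  <page number="1">
    <table data-page="1" data-table="1">
      <tr><td colspan="2">Totals</td></tr>
      <tr><td>Line one<br />Line two</td><td class="number">42</td></tr>
    </table>
  </page>
</document>"""


def cell_text(td):
    """Return a cell's text, turning <br /> tags into newlines."""
    parts = [td.text or ""]
    for child in td:
        if child.tag == "br":
            parts.append("\n")
        else:
            # Tolerate tags added in future by keeping their text.
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts)


def table_rows(table):
    """Return rows as lists of strings, padding colspan cells with blanks."""
    rows = []
    for tr in table.findall("tr"):
        row = []
        for td in tr.findall("td"):
            span = int(td.get("colspan", "1"))
            row.append(cell_text(td))
            row.extend([""] * (span - 1))
        rows.append(row)
    return rows


def parse_document(xml_text):
    """Map each page number to its list of tables (each a list of rows)."""
    root = ET.fromstring(xml_text)
    return {
        int(page.get("number")): [table_rows(t) for t in page.findall("table")]
        for page in root.iter("page")
    }
```

Looking elements up by name, rather than by position, keeps the sketch tolerant of new attributes and unknown tags, in line with the recommendation to use an XML parsing library.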