PDFTables has an API (Application Programming Interface) so that programmers can integrate PDF data extraction into their own operations.
It's a simple web-based API, so it can be called from any programming language.
To convert a PDF, send a multipart HTTP request containing the content of the file to https://pdftables.com/api?key=YOUR_API_KEY.
Here's an example using cURL, a commonly available command-line tool for running HTTP requests.
curl -F f=@example.pdf "https://pdftables.com/api?key=YOUR_API_KEY&format=xml"
The name of the form variable (f= above) is ignored, and only the first file is processed.
The above example converts to an XML file. To specify a different format when using cURL, change the value of the format= parameter. For example, to download a single-sheet XLSX from the API, you might use:
curl -F f=@example.pdf "https://pdftables.com/api?key=YOUR_API_KEY&format=xlsx-single"
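For readers working in Python, the cURL calls above could be reproduced roughly as follows. This is a minimal sketch using only the standard library, not the official client: the function names build_convert_request and convert_pdf are hypothetical, and it assumes the API accepts an ordinary multipart/form-data POST, which is what curl -F sends.

```python
import urllib.parse
import urllib.request
import uuid

API_URL = "https://pdftables.com/api"  # endpoint given in the text above


def build_convert_request(pdf_bytes, api_key, fmt="xml", filename="example.pdf"):
    """Build (but do not send) a multipart POST request for one PDF.

    Hypothetical helper; assumes a plain multipart/form-data upload.
    """
    boundary = uuid.uuid4().hex
    # One file part. The field name "f" mirrors the cURL example; the text
    # above says the name is ignored by the API.
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="f"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    query = urllib.parse.urlencode({"key": api_key, "format": fmt})
    return urllib.request.Request(
        f"{API_URL}?{query}",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def convert_pdf(pdf_path, out_path, api_key, fmt="xml"):
    """Upload pdf_path and write the converted response body to out_path."""
    with open(pdf_path, "rb") as pdf_file:
        request = build_convert_request(pdf_file.read(), api_key, fmt)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

A call such as convert_pdf("example.pdf", "example.xlsx", "YOUR_API_KEY", fmt="xlsx-single") would correspond to the second cURL command. Building the request separately from sending it makes it possible to inspect the URL and body without touching the network.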
Format | URL Parameter | Notes |
---|---|---|
CSV | format=csv | Comma Separated Values, blank row between pages. |
XML | format=xml | Contains HTML <table> tags; <td> tags may have colspan= attributes. See XML format for details. |
HTML | format=html | Table as HTML fragment. New pages are separated by <h2> elements that have class="pagenumber" and "Page X" as the element text, where X is the page number. |
XLSX | format=xlsx-single | Excel, all PDF pages on one sheet, blank row between pages. |
XLSX | format=xlsx-multiple | Excel, one sheet per page of the PDF. |
We plan to support other formats in the future, according to demand. If you need something else, contact us!
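The table above notes that CSV output has a blank row between pages. One way to recover per-page tables from that output is sketched below. The function name split_csv_pages is hypothetical, and the sketch assumes a page break appears either as an empty line or as a row whose cells are all empty; a genuinely blank row inside a table would also be treated as a page break.

```python
import csv
import io


def split_csv_pages(csv_text):
    """Split CSV text into a list of pages, each a list of rows.

    Assumption: pages are separated by a row with no non-empty cells.
    """
    pages, current = [], []
    for row in csv.reader(io.StringIO(csv_text, newline="")):
        if not any(cell.strip() for cell in row):
            # Blank row: close the current page, if it has any rows.
            if current:
                pages.append(current)
                current = []
        else:
            current.append(row)
    if current:
        pages.append(current)
    return pages
```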
The /api/remaining endpoint returns the number of pages remaining on your account, as an integer.
$ curl "https://pdftables.com/api/remaining?key=YOUR_API_KEY"
<an integer representing your remaining balance, e.g: 40>
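The same balance check might look like this in Python. It is a sketch under the assumption that the response body is a bare integer in plain text, as the cURL output above suggests; the names parse_remaining and remaining_pages are hypothetical.

```python
import urllib.parse
import urllib.request

REMAINING_URL = "https://pdftables.com/api/remaining"  # from the cURL example


def parse_remaining(body):
    """Convert the response body (bytes) to an int.

    Raises ValueError if the body is not a plain integer.
    """
    return int(body.decode("ascii").strip())


def remaining_pages(api_key):
    """Fetch the number of pages left on the account for api_key."""
    query = urllib.parse.urlencode({"key": api_key})
    with urllib.request.urlopen(f"{REMAINING_URL}?{query}") as response:
        return parse_remaining(response.read())
```

Checking the balance before a batch run, for example with remaining_pages("YOUR_API_KEY"), lets a script stop early rather than fail partway through.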
There is an official Python API for PDF to Excel on GitHub.
Alternatively, you can use the requests library or another library capable of doing multi-part HTTP requests in a straightforward manner.
We have examples of Windows PowerShell PDF to Excel scripts on GitHub for batch and single PDF conversion.
You can use this on Windows 7 and above (including Windows 10) with no additional software.
There is an official example PHP script for PDF to Excel or CSV conversion on GitHub.
There is an official example C# PDF to Excel conversion program on GitHub.
There are official VBA macros for PDF to Excel conversion on GitHub, kindly provided by Dan Elgaard.
There is an official example Java program to convert PDF to Excel on GitHub.
There's an unofficial R package for PDF to Excel conversion on GitHub.
There is an official example C and C++ program to convert PDF to Excel on GitHub.
There is an official Golang API for PDF to Excel conversion on GitHub.
There is an official Node.js API for PDF to Excel conversion on GitHub.
If your favourite language isn't listed here, and you'd like help, contact us.
The XML output format contains HTML-style tables.
We strongly recommend using an XML parsing library. We may add attributes to tags, and add tags with different names to the XML document.
The outermost tag is a <document> tag, which corresponds to a single PDF document.
Contains any number of <page> tags.
Attributes of <document>:
page-count: the number of pages in the PDF document.
The <page> tag represents a single page from the PDF document.
Contains any number of <table> tags. May in future contain text that is not part of a table.
Attributes of <page>:
number: the physical page number, starting from 1. Beware this may not correspond to the logical page number in the PDF.
The <table> tag represents a single table. At the moment, only one table is identified per page, which covers the whole page; this may change in the future.
Contains any number of <tr> tags.
Attributes of <table>:
data-filename: should be ignored, internal use only.
data-page: a number matching the number attribute of the <page> tag, i.e. the page number on which the table was found.
data-table: an index number for the tables on a page. Currently always 1. A future update may parse multiple tables per page.
The <tr> tag represents a single row from a table.
Contains any number of <td> tags.
Attributes of <tr>: currently none. Some attributes may be added in a future update.
The <td> tag represents a table cell.
Contains text: the value of the cell. May contain <br /> tags to indicate new-lines.
Attributes of <td>:
style: currently used for formatting numbers. Deprecated and may be removed in a later update.
class: the table cell may have a class attribute to indicate the type of content, for example a number.
colspan: the width of a cell which is wider than a single column. Not always present. Should be interpreted as per HTML.
rowspan: the height of a cell which is taller than a single row. Not always present. Should be interpreted as per HTML. Not yet implemented.
The HTML 4 specification describes HTML tables in more detail.
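To illustrate the advice about using an XML parsing library, here is a sketch that reads the structure described above with Python's xml.etree.ElementTree. The SAMPLE document is hand-written from this description, not real API output, and the helper names cell_text and rows_by_page are hypothetical. The sketch assumes the tags carry no XML namespace. It ignores unknown tags and attributes, so it should tolerate the additions the text warns about.

```python
import xml.etree.ElementTree as ET

# Hand-written sample based on the description above; not real API output.
SAMPLE = """<document page-count="1">
  <page number="1">
    <table data-page="1" data-table="1">
      <tr><td colspan="2">Fruit<br />prices</td></tr>
      <tr><td>Apple</td><td class="number">1.50</td></tr>
    </table>
  </page>
</document>"""


def cell_text(td):
    """Return the text of a <td>, turning <br /> tags into newlines."""
    parts = [td.text or ""]
    for child in td:
        if child.tag == "br":
            parts.append("\n")
        else:
            # Unknown nested tag: keep its text content.
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts)


def rows_by_page(xml_text):
    """Map each page number to its rows, padding colspan cells with ""."""
    root = ET.fromstring(xml_text)
    pages = {}
    for page in root.iter("page"):
        rows = []
        for tr in page.iter("tr"):
            row = []
            for td in tr.iter("td"):
                span = int(td.get("colspan", "1"))
                row.append(cell_text(td))
                # Fill the extra columns a wide cell covers.
                row.extend([""] * (span - 1))
            rows.append(row)
        pages[int(page.get("number"))] = rows
    return pages
```

Padding colspan cells with empty strings keeps every row aligned to the same columns, which is one simple reading of "interpreted as per HTML"; rowspan is not handled because the text says it is not yet implemented.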