If, like me, you want to automate processes or you need a quick and easy way to convert multiple PDFs to Excel at once, this is the tutorial for you.
I'll be calling our API from Python, using a Mac (the instructions are very similar for Windows and Linux). The API is very useful for integrating PDF extraction into your operations. Setup may take some time, however once it is done, you won't need to do any more setup for when you want to use the API again. You'll now easily be able to import data from PDF files to whichever database you are using.
Before we start
I will be converting a sample PDF bank statement from JPMorgan Chase and a set of pages from a PDF version of Nestlé Group's Consolidated Financial Statements for 2016. If you would like to extract pages from a PDF, see our tutorial on how to extract pages from a PDF document.
Download Anaconda by choosing the MacOS option and selecting the Python 3.6 version. This can take a little while to download so now is a good time to grab a cup of tea!
Once the download is complete, click the downloaded package and run through the installation window that pops up. This may also take a little while! Close when all is finished. To check it has installed, go to 'Finder' and click 'Applications' and you will see Anaconda in the list (for Windows, in Explorer, go to your C drive and search for 'Anaconda'). Downloading Anaconda means that pip will also be installed. Pip gives a simple way to install the PDFTables API Python package.
Press F4 on your keyboard to open up apps. Type 'terminal' and click the black
box logo that appears (For Windows, search for Anaconda Prompt in the Start Menu
and run it). To check Python is working, type
python and then the return key.
If it looks similar to the image below, you are all set up. If not, try uninstalling
Anaconda and installing it again.
Next, if you haven't already, you will need to create an account on PDFTables to enable you to call the API. Click 'Join' in the header of any page on our website to sign up for free. An activation link will arrive in your email inbox which you will need to click.
Open up the Terminal again, press 'Ctrl'+'Z' to exit Python (For Windows, type exit() and hit return). Run the following command to clone the PDFTables API from Github:
pip install git+https://github.com/pdftables/python-pdftables-api.git
If git is not recognised, download it here. Then, run the above command again.
Once the code editor has downloaded, open it up and create a new blank page. Copy the following code onto the page:
import pdftables_api import os c = pdftables_api.Client('MY-API-KEY') file_path = "C:\\Users\\MyName\\Documents\\PDFTablesCode\\" for file in os.listdir(file_path): if file.endswith(".pdf"): c.xlsx(os.path.join(file_path,file), file+'.xlsx')
You will need to make the following changes to the script:
MY-API-KEYwith your PDFTables API key which you will find here.
C:\\Users\\MyName\\Documents\\PDFTablesCode\\with the path to where your PDF documents are saved. If you are copying the path in Windows, replace the single '\' with '\\' or '/'.
Save the Python script as 'convert.py' in the same folder as the PDF documents.
Go back to the Terminal (or Anaconda Prompt), locate the folder in which the Python script and PDFs are
cd *folder_name* to move around the directories.
python convert.py in the Terminal. Press enter and let this run. Once it has
finished, you will see an Excel file for each PDF document in the folder.
You now know how to convert multiple PDFs to Excel using an API!
Now that everything has been setup, just repeat steps 7, 8 & 9 to convert multiple PDFs again.