If you aren't already aware, PDFTables has an API! You can interact with it in a number of languages, including Python, PHP, Java and many more.
In the past, we provided example snippets on our PDF to Excel API page for each of the languages.
We're always looking for ways to make it as easy as possible for developers to use our API, so we've decided to package it into an easy-to-use library for each language and host the libraries on GitHub!
Here are the new libraries, and links to the GitHub repositories with examples:
Python - pdftables/python-pdftables-api
Java - pdftables/java-pdftables-api
R - expersso/pdftables (Unofficial package)
C/C++ - pdftables/c-pdftables-api
So you've found a tutorial that shows you how to code a macro, but you've hit a roadblock: where's the Developer tab they keep referring to?
Excel doesn't show the Developer tab by default, so you'll need to dive into the options to find the correct setting.
In this tutorial, I'll show you how to activate the Excel Developer tab.
Open Excel from the Start menu, and create a blank workbook by double-clicking the Blank workbook option (selected by default).
Open the Excel options window (File → Options) and navigate to the Customize Ribbon tab, then on the right hand side, tick the box next to Developer. Click OK to close the dialog.
Back on the spreadsheet view, you'll see that a new Developer tab has been added to the end of the tab list.
Click DEVELOPER, and you'll see the following:
Congratulations! You've activated the Developer tab!
From now on, the tab will be visible by default whenever you create a new Excel workbook.
Want to try your hand at creating your own macro? Check out our easy-to-follow PDF to Excel using VBA guide.
Alternatively, if you'd like to learn more about VBA as a programming language, check out the Excel VBA Tutorial from EasyExcelVBA.com.
If you're a Python user, and want to be able to convert PDFs without uploading them manually to PDFTables.com, you can make use of our brand-new Python PDFTables API.
In this tutorial, I'll be showing you how to get the library set up on your local machine, and how to use it to convert a PDF in a folder to Excel.
Here's an example of a PDF that I've converted with the library. In order to properly test the library, make sure you have a PDF handy!
If you haven't already, install Python on your machine from the Python website. You can use either Python 3.5.x or 2.7.x, as the PDFTables API works with both.
For this tutorial, I'll be using the Windows Python IDLE Shell, but the instructions are almost identical for Linux and Mac. Depending on the version of Python you have, you will also need to install the pip package management system for Python.
You'll also need to create an account on PDFTables.com in order to get your free PDFTables API key.
In your terminal/command line, install the PDFTables Python library with:
pip install git+https://github.com/pdftables/python-pdftables-api.git
Or if you'd prefer to install it manually, you can download it from python-pdftables-api, and install it with:
python setup.py install
Create a new Python script, and add the following code:
import pdftables_api
c = pdftables_api.Client('my-api-key')
c.xlsx('input.pdf', 'output.xlsx')
Now, you'll need to make the following changes to the script:
Replace my-api-key with your PDFTables API key, which you can get from your PDFTables.com account.
Replace input.pdf with the PDF you want to convert.
Replace output.xlsx with the name of the converted spreadsheet.
Now, save your finished script as convert-pdf.py in the same directory as the PDF document you want to convert.
Open your command line/terminal, and change your directory (e.g. cd C:/Users/Bob) to the folder you saved your convert-pdf.py script and PDF in, then run the following command:
python convert-pdf.py
To find your converted spreadsheet, navigate to the folder in your file explorer and hey presto, you've converted a PDF to Excel with Python!
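If you have several PDFs in the same folder, the same two calls can be wrapped in a loop. The following is a minimal sketch rather than an official recipe. It assumes Python 3 (for pathlib), and it reads the API key from an environment variable that I've called PDFTABLES_API_KEY purely for this example. The helper names are hypothetical too; the only library calls it relies on are pdftables_api.Client and its xlsx method from the script above.

```python
import os
from pathlib import Path


def output_path_for(pdf_path):
    """Return the .xlsx path that sits next to the given PDF."""
    return Path(pdf_path).with_suffix(".xlsx")


def find_pdfs(folder):
    """List the PDF files in a folder, sorted by name."""
    return sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix.lower() == ".pdf")


def api_key_from_env():
    """Read the key from PDFTABLES_API_KEY (an assumed variable name)."""
    key = os.environ.get("PDFTABLES_API_KEY")
    if not key:
        raise RuntimeError("Set PDFTABLES_API_KEY to your PDFTables API key.")
    return key


def convert_folder(folder, api_key):
    """Convert every PDF in the folder to an Excel file beside it."""
    # Imported here so the helpers above work without the library installed.
    import pdftables_api

    client = pdftables_api.Client(api_key)
    for pdf in find_pdfs(folder):
        target = output_path_for(pdf)
        client.xlsx(str(pdf), str(target))
        print("Converted", pdf.name, "->", target.name)
```

To try it, you could call convert_folder(".", api_key_from_env()) at the bottom of the script and run it from the folder that holds your PDFs. Keeping the key in an environment variable means it doesn't end up saved inside the script.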
Automation realises 10X financial savings, increases efficiency and means better accuracy in our reporting.
- Warren Yancey, Marketing Director
Established in 1958, The Milner Group is a highly respected insurance brokerage in Atlanta, Georgia. Like all insurance agencies, it has been looking at ways to improve the service it offers its valued insurance agents. It also wants a laser focus on debt commitments.
The processing of agent commissions relies on information flows from many carriers like TransAmerica Life Assurance and Legal and General America. Each carrier sends its commission reports to The Milner Group in a PDF document. Processing these reports more quickly means that agents can get paid faster.
“We’ve been doing this by hand for 13 years” says Danielle Graves, Director of Internal Operations at The Milner Group. “The last five years has seen a big increase in volume and we’ve been looking to automate it. We want to take the raw data from PDFs of multiple carriers and import these to OneHQ, our customer relationship management software. Processing these PDFs by hand, copy and pasting the information every time, is a slow and laborious job. And we have a lot of PDFs to process each week”.
An example of the PDFs that are being converted on a regular basis
Josh Powell of Innovative Operations, Milner’s trusted integrator, identified PDFTables.com as a solution. He’s been working with our data engineers to create a system for Milner. The solution was developed over a few weeks.
“It’s simple!” Josh says. “We’ve written a Ruby on Rails app that calls the PDFTables API, runs a series of validation routines on the extracted PDF table, and produces a spreadsheet for each carrier that’s exactly in the form that the office staff need. In just a few mouse clicks, the operations team at Milner are ready to import carrier data to OneHQ. Our next step is to automatically push the data to OneHQ.”
Danielle added, “I am extremely pleased with the progress and joint collaboration on this project.”
The new interface for automatically converting carrier reports
Warren Yancey, Marketing Director, says, “We welcome innovation. Our ops team are saving a huge amount of time which can be better spent on direct customer service. Automation realises 10X financial savings, increases efficiency and means better accuracy in our reporting.”
About Milner Group
The Milner Group is a full service insurance brokerage agency providing impaired risk, life, annuity, health, disability and long term care insurance products to agents nationwide.
About Innovative Operations
We help small and medium sized businesses leverage technology to streamline their operations, better share information, and manage complex workflows. We personally design and build custom software when appropriate, and integrate third party services when they deliver value.
Press Contact: Josh Powell / josh (at) innovativeops.co / Tel +1 (305) 814 4878
PDFTables is made by The Sensible Code Company. We make products that turn messy information into valuable data. We work with systems integrators and corporate customers to help streamline front and back office operations that rely on external data sources. We also make QuickCode, which is a place where statisticians and economists can upskill in Python and R whilst working on their operational data.
Press Contact: Tristan Bacon / tristan (at) pdftables.com / Tel +44 (0)151 3315200
We continue to improve the algorithm that analyzes and retrieves content from your PDFs. We've recently implemented some larger updates to our algorithm and thought this would be a good opportunity to show you some of our work!
Let's consider some examples - we'll start with what we internally call a shipping-manifest-type PDF.
Customers approach us frequently with this type of document (in all its variations!) and give us feedback on how the conversion went. Having access to customer PDFs enables us to fine-tune our algorithm which results in an improvement for the customer. We make changes to the PDFTables algorithm with one aim only: to minimize the time you have to spend on post-processing the extracted data!
If you'd like to find out whether we can improve the content extraction for your PDF files, get in touch at firstname.lastname@example.org . Please don't forget to attach the relevant PDFs!
Next let's answer the question you've probably been dying to ask:
PDF files contain only the most basic information necessary to display their content. Most of the time, all we have to work with is a series of simple graphical instructions such as drawing a character at a given point, or drawing a line from point A to point B. PDFs generally do not contain higher-level data structures such as tables. While humans can easily recognize such structures visually, it is quite a different thing to teach a computer to perform this task. This is where PDFTables comes in - our algorithm analyzes the spatial information contained in a PDF to construct tables!
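To make that concrete, here is a toy sketch of the general idea. It is not the PDFTables algorithm, just an illustration of working from positions alone: it takes text fragments with page coordinates, groups fragments whose vertical positions are close into rows, and then orders each row from left to right. The Fragment type, the function name and the tolerance value are all made up for this example.

```python
from collections import namedtuple

# A piece of text and the page position where it is drawn (hypothetical type).
Fragment = namedtuple("Fragment", ["text", "x", "y"])


def group_into_rows(fragments, y_tolerance=2.0):
    """Cluster fragments into rows by y position, then sort each row by x."""
    rows = []
    # Visit fragments from the top of the page down (larger y is higher here).
    for frag in sorted(fragments, key=lambda f: -f.y):
        if rows and abs(rows[-1][0].y - frag.y) <= y_tolerance:
            rows[-1].append(frag)   # close enough to the current row's first item
        else:
            rows.append([frag])     # start a new row
    return [[f.text for f in sorted(row, key=lambda f: f.x)] for row in rows]


# Four fragments in drawing order, which need not match reading order.
fragments = [
    Fragment("Qty", 200, 700), Fragment("Item", 50, 700.5),
    Fragment("Widget", 50, 680), Fragment("3", 200, 679.4),
]
table = group_into_rows(fragments)
print(table)
```

Even this tiny version shows why the problem is fiddly: the result depends on a tolerance, and real documents have wrapped text, merged cells and uneven spacing that a single threshold can't capture.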
Consider this fairly straightforward example:
PDFTables did a great job recognizing the column header containing multiple lines of text. This is one of the improvements we've recently made: using the line information to deduce whether multiple lines of text belong in one cell.
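As a rough illustration of that idea (a simplified sketch, not the actual PDFTables implementation), suppose we know the vertical positions of the horizontal lines drawn on the page. Lines of text that fall between the same pair of neighbouring rules can then be treated as one cell. The function name and the sample coordinates are invented for the example.

```python
from bisect import bisect_left


def merge_lines_by_rules(text_lines, rule_ys):
    """Join text lines that sit between the same pair of horizontal rules.

    text_lines: list of (y, text) pairs; rule_ys: y positions of drawn rules.
    y grows downward in this sketch.
    """
    rules = sorted(rule_ys)
    cells = {}
    for y, text in sorted(text_lines):
        band = bisect_left(rules, y)          # which gap between rules holds y
        cells.setdefault(band, []).append(text)
    return [" ".join(cells[band]) for band in sorted(cells)]


# Rules at y=10 and y=40 enclose a two-line header;
# the data row sits between the rules at y=40 and y=60.
rules = [10, 40, 60]
lines = [(20, "Gross"), (30, "weight (kg)"), (50, "125")]
cells = merge_lines_by_rules(lines, rules)
print(cells)
```

Without the rule positions, "Gross" and "weight (kg)" would look like two separate rows; with them, they collapse into a single header cell.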
The above table had a clear structure, so next let's take a look at one that might not be as obvious:
That's why it is important that you send us any PDF that you'd like to see improved! Every additional PDF helps us to improve our system and to find edge cases. I hope we've given you some small insights into what makes PDFTables work, as well as hinted at some of the work that's still ahead of us.
Got a PDF for which PDFTables returns an output that you'd like to see improved? Get in touch at email@example.com or on Twitter.
More people search for "pdf" (compared to other terms) than they used to. Over twice as many now as at the low point, back in 2007. That's in addition to the increase in overall search volume of all terms! Although to geeks PDF feels like a dated format, really it is about the same age as the web, and designed for a similar purpose - sharing documents. In 1991, executives didn't read on screens, so PDF differed from HTML by concentrating on printing. John Warnock, cofounder of Adobe, describes the vision in the very first memo on PDFs:
Imagine being able to send full text and graphics documents (newspapers, magazine articles, technical manuals etc.) over electronic mail distribution networks. These documents could be viewed on any machine and any selected document could be printed locally. This capability would truly change the way information is managed.
PDF became an ISO standard as late as January 2008, so it isn't surprising that it is increasingly popular. Right now, the highest search volume for "pdf" comes from Cuba. All the rest of the top 10 are African countries.
Just over a decade earlier, the countries most interested in PDFs were very different, with Iran and North Africa at the top.
Maybe there's a particular stage of Internet access, open Government publishing or corporate reporting, which causes the search volume for "pdf" to increase in a particular country.
Here's to the PDF! Digital paper, yes, but still useful!
Henry Morris is always in a hurry. He's changing the lives of thousands of young people and he has no time to waste. When I ask what motivates him, he replies:
Working with amazing people to make a positive impact on the world
He's a serial social entrepreneur. After a short stint playing at being an investment banker, he set up his first social enterprise, called upReach. When I ask him why, he says:
It's about helping players to play the game better
upReach helps undergraduates from less-privileged backgrounds secure top jobs. It equips people with the skills they need to get jobs which might ordinarily seem out of reach, and to date it has supported 400 undergraduates from less-privileged backgrounds. When the organisation became operational, he knew it was time to move on.
About his new start-up PiC (Performance in Context) which is also about social mobility, Henry says:
This time we're trying to change the game
PiC looks at a person’s performance in relation to their educational background. People with massive potential are being ignored as the measurement system is skewed to look at the top performers from the best schools.
It’s arguably easier for a young person to be as good as the next in a highly successful school; the value of a young person being a top performer in a low-performing school can be much higher.
Henry wants to re-calibrate the measurement system and open data is his friend. His efforts are all the more critical as people fork out the equivalent of a mortgage to put themselves through tertiary level education.
Apple Contacts is his preferred digital rolodex. When I spoke to Henry he was between appointments. He’s proactive and meets lots of people at conferences. He often asks the event organiser for the delegate list to be dispatched to him electronically. So no surprise that it usually arrives as a PDF.
He uses PDFTables.com to convert the PDF to Excel. He transfers the data into a Google Sheet which has headers that map to Apple Contacts. He adds notes that give context to each entry. Once the Google Sheet is ready he exports a CSV and imports it into Apple Contacts. Job done!
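For anyone who'd rather script the middle step than use a Google Sheet, here is a minimal sketch of the same idea. It assumes the delegate list has already been converted and read into rows with Name, Organisation and Email columns. Those column names, the output headers and the sample data are all hypothetical, so adjust them to match your own list and the fields you want in Apple Contacts.

```python
import csv
import io

# Hypothetical output headers; adjust them to the fields you import.
HEADERS = ["First Name", "Last Name", "Company", "Email", "Note"]


def to_contact_rows(delegates, note):
    """Turn delegate dicts into rows matching HEADERS, adding a context note."""
    rows = []
    for d in delegates:
        # Split on the first space only; anything after it is the last name.
        first, _, last = d["Name"].partition(" ")
        rows.append([first, last, d["Organisation"], d["Email"], note])
    return rows


def write_contacts_csv(delegates, note, out):
    """Write the header line and one row per delegate to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(HEADERS)
    writer.writerows(to_contact_rows(delegates, note))


# Made-up sample data standing in for a converted delegate list.
delegates = [{"Name": "Ada Example", "Organisation": "Example Ltd",
              "Email": "ada@example.com"}]
buffer = io.StringIO()
write_contacts_csv(delegates, "Met at a conference", buffer)
print(buffer.getvalue())
```

Writing to a real file instead of the in-memory buffer gives you a CSV to import, with the note recording where and when you met each person.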
When Henry needs to make contact, he's able to identify when and where he met the person.
Thanks for using PDFTables.com, Henry, and long may your amazing work continue!