Computer Vision; Image Processing; Natural Language Processing; Scientific Libraries (Numpy/Pandas/SciKit/...); Web Servers and MicroFWs (Flask/Tornado/Nginx/...)
Portable Document Format (PDF) is commonly used to produce, publish, exchange, and
archive business and academic documents alike. Often in such PDFs there are tables
with data that you want to extract and process in some automated fashion. Unlike HTML
or other formats, PDF has no concept of tables as rows and columns with related data.
Tables in PDFs are rendered to visually resemble a table (when printed) using low-level
instructions to place the text of each table cell where it should be, while the original
tabular structure is lost.
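To illustrate the problem, here is a minimal sketch of what a PDF effectively gives you: text fragments with page coordinates and nothing else. The fragment data, the `rebuild_rows` helper, and the `y_tolerance` parameter are all hypothetical, and this naive grouping is only an illustration, not the algorithm Camelot uses.

```python
# Hypothetical text fragments as a PDF might place them: (x, y, text).
# Nothing in the file says which fragments share a row or a column,
# and they need not appear in reading order.
fragments = [
    (200.0, 700.0, "Price"),
    (50.0, 700.0, "Item"),
    (50.0, 680.0, "Apple"),
    (200.0, 680.5, "1.20"),
    (200.0, 660.0, "0.80"),
    (50.0, 660.2, "Pear"),
]


def rebuild_rows(fragments, y_tolerance=2.0):
    """Group fragments into rows by similar y, then order each row by x."""
    rows = []  # list of (row_y, [(x, text), ...])
    # PDF y coordinates grow upward, so sort descending to read top to bottom.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))  # close enough: same row
        else:
            rows.append((y, [(x, text)]))  # start a new row
    # Within each row, sort cells left to right and keep only the text.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


table = rebuild_rows(fragments)
for row in table:
    print(row)
```

Even this toy case needs a tolerance for slightly misaligned baselines; real documents add merged cells, wrapped text, and multiple tables per page, which is why dedicated tools exist.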
While there are existing solutions to extract structured data from PDFs, most of them
are expensive proprietary products or hosted online services. They are neither
Python-based nor open-source, and they give you little control over the process or
over how your sensitive PDF documents are handled.
In this talk I'll present two open-source Python tools for PDF table extraction: the
CLI tool Camelot and its web-based frontend, Excalibur. I'll show you how to
install both locally and how to use them to extract tabular data from PDFs with ease.
Extraction under your control: 1) define rules with areas on the PDF page containing
the table you want to extract; 2) save and reuse the rules to automate or batch-process
similar PDFs; 3) export the extracted tables as CSV, Excel, JSON, or HTML, or use them
directly as pandas DataFrames.
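As a rough sketch of those three steps with Camelot's Python API: the file name, the coordinates, and the `format_area` and `extract_tables` helpers below are hypothetical, while `camelot.read_pdf`, its `table_areas` argument, `.df`, and `.export` come from Camelot itself. Treat it as a starting point under those assumptions rather than a definitive recipe.

```python
def format_area(x1, y1, x2, y2):
    """Build the "x1,y1,x2,y2" string Camelot expects for a table area.

    (x1, y1) is the top-left and (x2, y2) the bottom-right corner,
    in PDF coordinate space (origin at the bottom-left of the page).
    """
    return f"{x1},{y1},{x2},{y2}"


def extract_tables(pdf_path, areas, pages="1"):
    """Extract the tables found inside the given areas of a PDF."""
    # Imported lazily so the helper above works without Camelot installed.
    import camelot

    return camelot.read_pdf(
        pdf_path,
        pages=pages,
        flavor="stream",  # tables without ruling lines; "lattice" suits ruled tables
        table_areas=[format_area(*area) for area in areas],
    )


# A reusable "rule": one area per table (hypothetical coordinates).
INVOICE_AREAS = [(50, 700, 550, 400)]

# Example usage (assumes Camelot is installed and "invoice.pdf" exists):
#   tables = extract_tables("invoice.pdf", INVOICE_AREAS)
#   df = tables[0].df                       # a pandas DataFrame
#   tables.export("invoice.csv", f="csv")   # also "excel", "json", "html"
```

Keeping the areas in a plain list like `INVOICE_AREAS` is one simple way to reuse the same rule across a batch of similarly laid-out PDFs.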
If you find Camelot and Excalibur useful, please consider supporting these projects,
or even getting involved as a contributor!
Type: Talk (30 mins); Python level: Beginner; Domain level: Beginner
I'm a software developer with 20+ years of experience. I used to work for Canonical and other, smaller
companies. In the beginning of 2017 I founded my own company, Develated Ltd., and got into
I've used and learned quite a few programming languages over the years, but Python is by far my
favorite language. My interests are in Linux and open-source software, with particular focus on data science,
testing, machine learning, and open data-driven research.