Copy link to clipboard
Copied
Hi,
I have 10,000 invoice pdfs I need to convert to excel and then extract the data into a database. I have a licenced copy of Acrobat Pro DC. Is there an API I can use to do this programatically. I have been looking at the cloud based API but at $0.05 per transaction it is going to be very expensive and given the volume of data I would rather not have the overhead of having to move it to the cloud. I had hoped I could just link MS Access to my desktop application API to do the work.
I have a nasty suspicion that Adobe will not let me run the export from my desktop PC programatically.
Could someone tell me if it is possible and if it is then point me in the direction where I can fnd out how to do it?
I would much prefer to use Acrobat to do the conversion, having done it manually on the few files the layout of the generated Excel file is easy to work with and generally I find the programatic interface with VBA easy to use.
Any advice very much appreciated
Jasper.
Copy link to clipboard
Copied
You can use the Action Wizard in Acrobat to convert multiple PDFs to Excel (or any other format), but running it on 10K files is not going to be feasible. You might be able to do it on around 500 files at a time, and then you'll have to restart the application and do the next batch, though.
Copy link to clipboard
Copied
Thanks I will look into that but I really wanted to do it programatically and the pdfs are stored in a complex(ish) folder structure that I need to replicate and that is easy to do with VBA but a pain doing it manually. Also doing it programatically will make checking for errors so much easier.
I have found an open source Java based system called Tabula which I am not looking at but given my lack of experience in Java this is going to be a much steeper learning curve that the Adobe API.
Copy link to clipboard
Copied
I'm quite familiar with Java and especially PDF-related libraries (like PDFBox, which Tabula is based on), so if you're interested in hiring someone to work on it for you, feel free to contact me privately (click my user name and then on "Send a Message") to discuss it further.
Copy link to clipboard
Copied
There's something you may not yet know: PDF files don't contain tables. This means that converting to Excel is based on guesswork, looking at text alignment, lines on the page and other factors. Sometimes it's perfect, other times useless. And it will be different for each API, and each app. So you need to test. Certainly, I would not expect to just throw 5000 PDFs at any API and get 5000 accurate scrapes.
If your layout is identical for each file, I would look for an (apparently) more complex API that allows you to programmatically define the extra zones to extract text. Forget the Excel stage, extract the text directly, validate carefully (especially by extraction of text expected to be fixed), and add to the database.
Copy link to clipboard
Copied
Hi, Thanks for taking the time to reply, it is appreciated.
I have learnt alot about pdfs since starting this project!!!!! This morning I manged to get Tabula (tabula-py) working and have started to do some test converts and so far it is doing a pretty good job. Luckily for me all the invoices are generated by the same accounts system so there is little variation in layout and an added bonus is that the output is CSV which will make reading them into the database much faster. As you pointed out in your post.
Tabula-py is really fast as well, it converted a 2000 page (yes, two thousand) invoice in under seven minutes, if I had used Acrobat I would have had to chop the file into 200 page chunks to prevent Acrobat from crashing.
It looks like Tabula-py will do the job, but if you have any recomendations for better APIs I would be very greatful.
Copy link to clipboard
Copied
Just an update on my problem.
I now have a solution to converting my 7,000 pdf converting problem using tabula-py and python to scan the file structure in which they are stored, auto convert to CSV and store the output file in a corresponding file structure. As a solution (given that I want load the data into a database) it is far better than my original idea of using the Adobe API.
I have attached the Python programe I wrote to do it here (with links on how to install tabula-py) in case anyone else has a similar need. It is my first ever Python program so I apologise if the style is horrible but it does work.
I currently have 3 i7 PCs chuntering through 662,000 pages of pdf data in about 7000 files, each PC is converting about 3 pages a second so it should take them about 24 hours to export all the data. I hate to think how long that would have taken Acrobat to do.
Anyway thanks for the help, it was appreciated!
==========Python code================
# Bulk convert pdfs to CSV using tabula-py
#
# Details of how to get the open source system and install are below
# https://blog.knoldus.com/tabula-scraping-table-data-from-pdf-files/
# https://aegis4048.github.io/parse-pdf-files-while-retaining-structure-with-tabula-py
#
# The structure I am working with is as follows;
#
# Source folder
# Customer Folder
# file_1.pdf
# file_2.pdf
# file_3.pdf
#
# The system converts the pdfs and saves into a corresponding destination folder.
# if the system crashes due to a problem file you can correct the problem and restart
# it will continue from where it left off.
#
# My apologies if the python code is offensive to purists, this is my first ever Python Program!
#
import tabula
import os
src_path = 'r:\RedSqPDFInvoices'
dst_path = 'r:\RedSqPDFExcel'
folders = os.listdir(src_path)
for folder in folders:
if os.path.isdir(os.path.join(src_path, folder)):
pdf_path = os.path.join(src_path, folder)
csv_path = os.path.join(dst_path, folder)
if not os.path.isdir(csv_path): # if destination folder does not exists then create it
os.mkdir(csv_path)
pdffiles = os.listdir(pdf_path)
for file in pdffiles:
cust_pdf = os.path.join(pdf_path, file)
if os.path.isfile(cust_pdf): # check it is a file (assume any files are pdfs)
cust_csv = os.path.join(csv_path, file[:(len(file)-4)] + '.csv')
# check if desinate file exists (might be true if we are rerunning after a crash)
# if it does exist then check file length, any zero length files should be deleted and converted
if os.path.isfile(cust_csv):
if os.path.getsize(cust_csv) == 0:
os.remove(cust_csv)
# check if file still exists and if not then convert pdf to csv
# this will allow us to restart after a crash without having to reconvert all the files
if not os.path.isfile(cust_csv):
tabula.convert_into(cust_pdf, cust_csv, output_format="csv", pages='all')
Copy link to clipboard
Copied
Excellent - good advice.