Question: For converting Technical PDFs to MS Word, is there some kind of comparison between different extraction methods?
My PDFs are complex: text, tables, and images, not always well formatted.
The extraction methods:
- MS Word: Import PDF (Local)
- Acrobat Pro: Export PDF to Word (Local)
- Adobe extraction API (Cloud)
I work with proprietary/ITAR-rated documents, so cloud-based conversion is probably not an option.
Any opinions are welcome. Python/Unix-based extraction tools (PyPDF, pdftotext, others) don't capture inline tables well, and converting tables to images and then OCR'ing them with machine learning is just terrible.
Once the extraction is in Word, all text, table, and image objects are (more) extractable.
For me, all this eventually gets stored in a dataframe.
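To illustrate that last step, here is a minimal sketch of pulling tables out of a converted .docx using only the Python standard library, so nothing leaves the local machine. It assumes the converter produced real Word tables (`w:tbl` elements) rather than images or text boxes; the function name `docx_tables` is made up for this example, and a library such as python-docx would be a reasonable alternative.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_tables(source):
    """Return every table in a .docx as a list of rows of cell strings.

    `source` is a file path or a binary file-like object. Nested tables
    are returned as separate entries. Merged cells are not expanded, so
    rows may have different lengths.
    """
    with zipfile.ZipFile(source) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    tables = []
    for tbl in root.iter(W + "tbl"):
        rows = []
        for tr in tbl.findall(W + "tr"):
            row = []
            for tc in tr.findall(W + "tc"):
                # One string per paragraph, joined with newlines per cell
                paragraphs = [
                    "".join(t.text or "" for t in p.iter(W + "t"))
                    for p in tc.findall(W + "p")
                ]
                row.append("\n".join(paragraphs))
            rows.append(row)
        tables.append(rows)
    return tables
```

Each returned table is a plain list of rows, so if the first row happens to be a header it can be loaded with something like `pandas.DataFrame(rows[1:], columns=rows[0])`. Whether that header assumption holds depends on how well the converter preserved the table.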
Thanks in advance.