Hi all,
I want to extract PDF text (along with formatting properties such as bold, underline, etc.) in Python. The open-source tools that I have tried so far are not able to do that.
I also want to extract images from PDF documents.
Does Adobe formally support any such function?
Not via Python, no.
Part of your problem may be that you are looking for something that is not there. These are NOT character attributes in PDF.
Thanks for replying.
Well, I thought that, similar to Word documents (where the underlying XML has different tags for bold, underline, etc.), Adobe might have a proprietary tool that does this for PDF documents.
Yes, but your problem is that, while these attributes exist in Word's XML and can be extracted from it, they simply do not exist as attributes in PDF. There are many tools to extract text, some proprietary, but none can extract what is not there.
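To illustrate the point: PDF text is drawn with a named font at a given size, so "bold" usually just means a different font was selected, and an underline is typically a separately drawn line rather than a property of the text. The most a tool can do is guess from the font name. Below is a minimal sketch of such a guess; the function name `infer_style` and the keyword lists are hypothetical choices for illustration, not part of any Adobe or third-party API, and fonts with unconventional names will be misclassified.

```python
import re

# Embedded subset fonts are commonly named like "ABCDEF+Arial-BoldMT":
# six uppercase letters, a plus sign, then the base font name.
_SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def infer_style(fontname: str) -> dict:
    """Guess family, bold and italic from a PDF font name (heuristic only)."""
    base = _SUBSET_PREFIX.sub("", fontname)
    lowered = base.lower()
    return {
        # Everything before the first "-" or "," is treated as the family.
        "family": re.split(r"[-,]", base, maxsplit=1)[0],
        # Keyword lists are an assumption; extend them for your documents.
        "bold": any(word in lowered for word in ("bold", "black", "heavy")),
        "italic": any(word in lowered for word in ("italic", "oblique")),
    }


if __name__ == "__main__":
    for name in ("ABCDEF+Arial-BoldMT", "Times-Italic", "Helvetica"):
        print(name, infer_style(name))
```

Note that this says nothing about underlining, which cannot be read off the font name at all.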
Actually, I did find a way to search through text according to its "format". I'm using HTMLConverter from the Python library pdfminer.six (GitHub: pdfminer/pdfminer.six, a community-maintained fork of pdfminer).
I'm not sure if their method is consistent enough, but it's solving the problem for the moment. There are other tools that do almost the same thing using XML, but I couldn't find a way to navigate their output easily.
Did you find any other solution?
Good Luck!
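For anyone who wants to try the pdfminer.six route, here is a rough sketch. It does not use HTMLConverter; instead it reads the font name of each character through pdfminer.six's layout objects (`extract_pages`, `LTChar`, `LTAnno`) and groups consecutive characters that share a font. The helper names `iter_chars` and `group_runs` and the file name `example.pdf` are made up for this example, and the font names you get back depend entirely on how the PDF was produced.

```python
def iter_chars(pdf_path):
    """Yield (text, fontname) for each character; fontname is None for
    the spaces and newlines that pdfminer.six inserts during layout."""
    # Imported here so group_runs() can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTAnno, LTChar, LTTextContainer, LTTextLine

    for page in extract_pages(pdf_path):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    if isinstance(obj, LTChar):
                        yield obj.get_text(), obj.fontname
                    elif isinstance(obj, LTAnno):
                        yield obj.get_text(), None


def group_runs(chars):
    """Merge consecutive characters with the same font into (fontname, text) runs.
    Characters without a font (None) are attached to the current run."""
    runs = []
    for text, fontname in chars:
        if runs and (fontname is None or runs[-1][0] == fontname):
            runs[-1][1].append(text)
        elif fontname is not None:
            runs.append((fontname, [text]))
    return [(font, "".join(parts)) for font, parts in runs]


if __name__ == "__main__":
    # "example.pdf" is a placeholder path.
    for font, text in group_runs(iter_chars("example.pdf")):
        print(f"{font!r}: {text!r}")
```

Runs whose font name contains something like "Bold" are then candidates for bold text, with the same caveat as above: it is a guess based on naming, not an attribute stored in the PDF.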