Extracting PDF Text in Python

New Here ,
Jul 16, 2020

Hi all,

I want to extract text from PDF documents in Python, along with formatting properties such as bold, underline, etc. The open-source tools that I have tried so far are not able to do that.

I also want to extract images from PDF documents.

Does Adobe formally support any such function?

TOPICS
How to

Most Valuable Participant ,
Jul 19, 2020

Not via Python, no.

Most Valuable Participant ,
Jul 19, 2020

Part of your problem may be that you are looking for something that is not there. Bold and underline are NOT character attributes in PDF.

  • Boldness is not an attribute. Each piece of text in a PDF has a font. Some fonts are bolder than others. 
  • Underlining is not an attribute, and not connected to text in any way. A PDF can contain text, and it can contain lines. If the lines appear just under text, we may call it an underline. To export text with "underline" would require detailed analysis of all the text positions and line attributes, and a serious amount of fuzzy logic.
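To make the two points above concrete, here is a rough sketch of the kind of guesswork involved. It assumes pdfminer.six is installed; the names `Box`, `looks_bold`, `looks_underlined`, and `iter_styled_lines`, the font-name hints, and all thresholds are made up for illustration and would need tuning per document. It is a heuristic, not something the PDF format guarantees.

```python
import re
from typing import Iterator, NamedTuple, Tuple


class Box(NamedTuple):
    """Bounding box in PDF units, origin at the bottom-left of the page."""
    x0: float
    y0: float
    x1: float
    y1: float


# Embedded subset fonts are usually named like "ABCDEF+Arial-BoldMT".
_SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def looks_bold(font_name: str) -> bool:
    """Guess boldness from the font name alone (assumed naming hints)."""
    name = _SUBSET_PREFIX.sub("", font_name).lower()
    return any(hint in name for hint in ("bold", "black", "heavy"))


def looks_underlined(text: Box, rule: Box, max_gap: float = 3.0,
                     max_thickness: float = 2.0,
                     min_overlap: float = 0.8) -> bool:
    """Guess whether a thin horizontal rule is an underline for a text box."""
    if rule.y1 - rule.y0 > max_thickness:
        return False  # too thick: more likely a table border or a filled box
    if abs(text.y0 - rule.y1) > max_gap:
        return False  # rule is not close to the bottom edge of the text
    width = text.x1 - text.x0
    overlap = min(text.x1, rule.x1) - max(text.x0, rule.x0)
    return width > 0 and overlap / width >= min_overlap


def iter_styled_lines(pdf_path: str) -> Iterator[Tuple[str, bool, bool]]:
    """Yield (text, bold_guess, underline_guess) for each text line."""
    # Imported lazily so the helpers above work without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTLine, LTRect, LTTextContainer, LTTextLine

    for page in extract_pages(pdf_path):
        # Only top-level lines and rectangles on the page are considered.
        rules = [Box(*el.bbox) for el in page if isinstance(el, (LTLine, LTRect))]
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                fonts = {ch.fontname for ch in line if isinstance(ch, LTChar)}
                bold = any(looks_bold(f) for f in fonts)
                box = Box(*line.bbox)
                underlined = any(looks_underlined(box, r) for r in rules)
                yield line.get_text().strip(), bold, underlined
```

Calling `iter_styled_lines("some.pdf")` (the file name is a placeholder) yields one tuple per line of text. A font whose name gives no hint of its weight, or an underline drawn as part of an image, will be missed.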

New Here ,
Jul 29, 2020

Thanks for replying.

 

Well, I thought that, similar to Word documents (where the underlying XML has different tags for bold, underline, etc.), Adobe might have a proprietary tool that does this for PDF documents.

Most Valuable Participant ,
Jul 29, 2020

Yes, but your problem is that, while these attributes exist in the XML and can be extracted from it, they simply do not exist as attributes in PDF. There are many tools to extract text, some proprietary, but none can extract what is not there.

joaq1
New Here ,
Sep 02, 2020

Actually, I did find a way to search through text according to its "format". I'm using HTMLConverter from the Python library pdfminer.six (GitHub: pdfminer/pdfminer.six, the community-maintained fork of pdfminer).

 

I'm not sure whether their method is consistent enough, but it's solving the problem for the moment. There are other tools that do almost the same thing using XML, but I couldn't find a way to navigate their output easily.
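For anyone who wants to try the same route, here is a minimal sketch of the idea, not necessarily what I run myself. It goes through `pdfminer.high_level.extract_text_to_fp` with `output_type="html"` (which drives the HTML converter), then scans the resulting `<span style="font-family: ...">` tags with the standard-library HTML parser. The span/style layout is an implementation detail of pdfminer.six that may differ between versions, and the names `BoldTextCollector`, `bold_runs_in_html`, and `pdf_to_html` are made up for this example.

```python
import io
import re
from html.parser import HTMLParser
from typing import List

FONT_FAMILY = re.compile(r"font-family:\s*([^;]+)", re.IGNORECASE)


class BoldTextCollector(HTMLParser):
    """Collect text found inside <span> tags whose font name contains 'bold'."""

    def __init__(self) -> None:
        super().__init__()
        self._span_is_bold: List[bool] = []  # one entry per open <span>
        self.bold_runs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        style = dict(attrs).get("style") or ""
        match = FONT_FAMILY.search(style)
        font = match.group(1).strip() if match else ""
        self._span_is_bold.append("bold" in font.lower())

    def handle_endtag(self, tag):
        if tag == "span" and self._span_is_bold:
            self._span_is_bold.pop()

    def handle_data(self, data):
        if self._span_is_bold and self._span_is_bold[-1] and data.strip():
            self.bold_runs.append(data.strip())


def bold_runs_in_html(html: str) -> List[str]:
    """Return the text runs that appear to be set in a bold font."""
    collector = BoldTextCollector()
    collector.feed(html)
    collector.close()
    return collector.bold_runs


def pdf_to_html(pdf_path: str) -> str:
    """Convert a PDF to pdfminer.six's HTML output (requires pdfminer.six)."""
    # Imported lazily so the parsing helpers work without pdfminer.six.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    buffer = io.BytesIO()
    with open(pdf_path, "rb") as pdf_file:
        extract_text_to_fp(pdf_file, buffer, output_type="html",
                           codec="utf-8", laparams=LAParams())
    return buffer.getvalue().decode("utf-8")
```

Usage would be something like `bold_runs_in_html(pdf_to_html("some.pdf"))`, where the file name is a placeholder. It only catches fonts whose name happens to contain "bold", so treat the result as a guess.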

 

Did you find any other solution?

Good Luck! 
