PDF to plain text, Some difficult pages were encountered

New Here, Sep 04, 2022


As part of my attempts to batch-convert PDFs to text, I have run into a strange error where Acrobat XI shows the following message box:

[Screenshot: Acrobat XI message box]

When I click OK, the text file is generated, but it contains only line breaks and no text.

As a workaround to get the text out, I can save the file as accessible text; however, for my needs it is crucial to be able to save it as plain text.

Here is a comparison of a document that does convert, saved both as plain text and as accessible text.

Plain text:

[Screenshot: plain text output]

Accessible text:

[Screenshot: accessible text output]

I want to avoid using accessible text because it introduces CR and LF characters (hard line breaks), which plain text does not.
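If plain-text export fails for a given PDF, one possible fallback is to export accessible text and strip the hard line breaks afterwards. The Python sketch below is only an illustration and is not part of the VBA script. It assumes that blank lines in the export mark paragraph boundaries, which may not hold for every document, and `unwrap_hard_breaks` is a hypothetical helper name.

```python
import re


def unwrap_hard_breaks(text: str) -> str:
    """Join hard-wrapped lines so each paragraph ends up on one line.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    # Normalise CRLF and bare CR to LF first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Split into paragraphs on blank (or whitespace-only) lines.
    paragraphs = re.split(r"\n\s*\n", text)
    joined = []
    for paragraph in paragraphs:
        # Drop empty fragments and join the remaining lines with a space.
        lines = [line.strip() for line in paragraph.split("\n") if line.strip()]
        if lines:
            joined.append(" ".join(lines))
    # One paragraph per line in the result.
    return "\n".join(joined)
```

This does not repair words hyphenated across a line break, and it will merge lines that were meant to stay separate (headings, list items) if no blank line divides them.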

 

I am simply after the text. As the message box says, the figures would be converted to a number, which is fine, but currently I get nothing out.

I have a VBA script that runs this as a batch conversion, but the conversion fails for certain PDFs like the one above (attached). The export call is: `jsObj.SaveAs textPath, "com.adobe.acrobat.plain-text"`
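Since the failing PDFs produce a text file that contains only line breaks, a batch run could be checked afterwards for such files so they can be routed to a fallback. Below is a minimal Python sketch, separate from the VBA script, that assumes the exported `.txt` files sit in one folder; `is_blank_export` and `find_blank_exports` are hypothetical names.

```python
from pathlib import Path


def is_blank_export(text: str) -> bool:
    """True if the exported text holds nothing but whitespace and line breaks."""
    return text.strip() == ""


def find_blank_exports(folder: str) -> list[str]:
    """Return the names of .txt files in `folder` that contain no real text."""
    blank = []
    for path in sorted(Path(folder).glob("*.txt")):
        # utf-8-sig drops a leading byte-order mark if present; undecodable
        # bytes are replaced so an odd encoding does not abort the scan.
        content = path.read_text(encoding="utf-8-sig", errors="replace")
        if is_blank_export(content):
            blank.append(path.name)
    return blank
```

The returned names identify which PDFs need another route, for example the accessible-text export.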

 

If anyone can think of a workaround or explain why this fails, that would be useful. Acrobat is the only program I have found that generates plain text documents in this way, which is especially useful for my purpose because the sentences aren't broken. It's a real shame that it fails at the final hurdle for some of the PDFs I need to convert.

Topics: Edit and convert PDFs, General troubleshooting

New Here, Sep 19, 2022


Is it possible to copy and paste a plain text document into Acrobat Pro DC and have the text run on with no page breaks, as it does in the plain text file?

New Here, Sep 19, 2022


Hi there, I'm not sure I quite understand what you mean.
Initially I found that if you convert the PDF to a doc file without the images being converted, then save that as a PDF and repeat the exercise, it works, essentially because Acrobat doesn't need to perform the tagging.

SOLUTION:

I have found a reasonable solution (though it may not be perfect). The answer is posted here:

https://stackoverflow.com/questions/73628493/extracting-whole-sentences-from-pdfs-as-best-as-possibl...

The script is in VBA and uses the somewhat dated Acrobat XI and Word; however, I think it proves this can be done (at least reasonably well). It works by using Word to identify the line breaks so that whole sentences are kept together. The reason for not loading the PDF directly into Word is that Word occasionally recognises passages of text as an image. So I use Acrobat to generate a Word document from the PDF, then use Word's plain text export to generate the text file.

Let me know if you have any further thoughts.

New Here, Sep 19, 2022


SOLUTION:

My solution to this problem, as best as possible, is on Stack Overflow:

https://stackoverflow.com/questions/73628493/extracting-whole-sentences-from-pdfs-as-best-as-possibl...

It relies on Word identifying the whole sentences and exporting them as a plain text file. Note that Word does not always achieve this perfectly, but it's pretty accurate.
