Extracting Text from Image PS or PDF File

Report · Oct 15, 2018

I have an image PDF file and need to extract text from it. I tried many OCR tools without success. I also converted it to a PS file using Adobe Acrobat Pro DC.

Is there any way I can extract simple text from it, or convert my PDF file to a simple plain-text PostScript file?

My sample PDF and PS files are attached:

Dropbox - 3.pdf

Dropbox - 3.ps

Please Guide

Thanks as Always

Report · Oct 16, 2018

Extracting from PostScript is MUCH harder than extracting from PDF. There are many tools to extract text from PDF, but only if the text is actually there. You must make the OCR work; there is no use looking for another route. But this is a terrible scan, and to make it worse, it has been badly damaged by being stored as a JPEG. I think this may be beyond hope. You may need to type in the information; there is a time to give up.

Report · Oct 16, 2018

What happens when you perform OCR?

Report · Oct 17, 2018

The original PDF file was around 250 pages, so I opened page 3 in Illustrator and shared it as the sample PDF file.

If I open the entire original PDF file in Google, I get around 70% correct text using Google OCR, but there still seems to be scope for improvement.
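A figure like "70% correct" is easier to improve on when it can be measured the same way each time. The sketch below is one hedged way to do that: type in one sample page by hand, then score each OCR attempt against it. The function name `ocr_accuracy` is hypothetical, and the score is a character-level similarity from Python's `difflib`, not an exact count of correctly recognized characters.

```python
from difflib import SequenceMatcher


def ocr_accuracy(ocr_text: str, reference_text: str) -> float:
    """Return a 0..1 similarity between OCR output and a hand-typed reference."""
    # Collapse all whitespace runs so that differences in line wrapping
    # are not counted as recognition errors.
    ocr_norm = " ".join(ocr_text.split())
    ref_norm = " ".join(reference_text.split())
    if not ocr_norm and not ref_norm:
        return 1.0  # two empty pages are trivially identical
    # autojunk=False keeps the comparison exact on long pages.
    matcher = SequenceMatcher(None, ref_norm, ocr_norm, autojunk=False)
    return matcher.ratio()


if __name__ == "__main__":
    reference = "Invoice No. 1042 dated 15 Oct 2018"
    attempt = "lnvoice No. 1O42 dated 15 0ct 2018"
    print(f"similarity: {ocr_accuracy(attempt, reference):.2f}")
```

Running the same reference page through each OCR tool or preprocessing variant gives comparable scores, which makes it clearer whether a change actually helps.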

Alternatively, is there any way I could increase the quality of my original PDF so that OCR works better, or would exporting it to some other format work better?

Is there any way I can delete the table (the horizontal and vertical lines used for each set of data) and resave it before extraction?
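One common way to drop table rules before OCR is to blank out any very long horizontal or vertical run of ink, since text strokes are short by comparison. The sketch below shows the idea on a tiny black-and-white grid in plain Python. It assumes the page has already been rasterized and thresholded to 0 (white) and 1 (ink), which is not shown here; the function name `remove_ruling_lines` and the `min_run` threshold are hypothetical. On a skewed or noisy scan the lines break into shorter runs, so this may well not be enough on its own.

```python
def remove_ruling_lines(pixels, min_run):
    """Blank out horizontal and vertical ink runs at least min_run pixels long.

    pixels: list of rows, each a list of 0 (white) or 1 (ink).
    Returns a new grid; the input grid is not modified.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    to_clear = set()  # (row, column) positions that belong to a long run

    # Horizontal pass: scan each row, one step past the end to close the last run.
    for y in range(height):
        run_start = None
        for x in range(width + 1):
            ink = x < width and pixels[y][x] == 1
            if ink and run_start is None:
                run_start = x
            elif not ink and run_start is not None:
                if x - run_start >= min_run:
                    to_clear.update((y, i) for i in range(run_start, x))
                run_start = None

    # Vertical pass: same idea down each column, reading the original grid
    # so that line crossings are still detected.
    for x in range(width):
        run_start = None
        for y in range(height + 1):
            ink = y < height and pixels[y][x] == 1
            if ink and run_start is None:
                run_start = y
            elif not ink and run_start is not None:
                if y - run_start >= min_run:
                    to_clear.update((i, x) for i in range(run_start, y))
                run_start = None

    return [
        [0 if (y, x) in to_clear else pixels[y][x] for x in range(width)]
        for y in range(height)
    ]
```

The threshold needs to be longer than the tallest and widest character on the page, otherwise strokes of the text itself get erased. Characters that touch a rule will still lose the pixels they share with it.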

I tried Acrobat DC, but it required manual work, which is not possible for thousands of sets of data.

Any ideas on the above? Please guide.

Thanks

Report · Oct 26, 2018

Here's something I discovered -

If the image is clear enough, you can export to Word.

There is a setting called "recognize text if needed" (or possibly "if necessary"; I don't remember the exact wording).

And it may just do the job. I had a source with ligatures and strange font encodings.

The export process gave me pretty clean text - all the hyphenations were hard hyphens, of course. But it recognized the ligatures and gave me the non-combined text. Ordinary text, no funny encodings.

Clean as much of the extra crap out of the PDF as you can first - headers, footers, extra graphics junk. They just clutter up the Word file.
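For the hard hyphens, a small cleanup pass can rejoin words that were split at a line end. This is a minimal sketch, assuming the exported text has been saved as plain text with its original line breaks intact; the function name `join_hyphenated_lines` is hypothetical. It cannot tell a line-end split from a genuine compound, so a word like "well-known" broken across two lines would come out as "wellknown" and needs a manual check.

```python
import re

# A hyphen directly after a word character, then optional spaces, a line
# break, optional indentation, and another word character on the next line.
_HYPHEN_BREAK = re.compile(r"(?<=\w)-[ \t]*\r?\n[ \t]*(?=\w)")


def join_hyphenated_lines(text: str) -> str:
    """Rejoin words that were hyphenated across a line break."""
    return _HYPHEN_BREAK.sub("", text)


if __name__ == "__main__":
    sample = "The export recog-\nnized the ligatures."
    print(join_hyphenated_lines(sample))
```

Hyphens in the middle of a line are left alone, so ordinary compounds that were not split by a line break survive unchanged.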

hth

Jay

Report · Nov 02, 2018

An additional note - the OP wished for a simple PostScript file with plain text. Getting that from a scanned image is no easier than any other OCR.

There are lots of paths through the OCR jungle - and Adobe's hasn't always been the best choice.

But my recent experience exporting to Word encourages me - that worked pretty well with my source.

ymmv, of course.