
Extracting Text from Image PS or PDF File

New Here,
Oct 15, 2018

I have an image-only PDF file and need to extract the text from it. I tried many OCR tools without success. I also converted it to a PS file using Adobe Acrobat Pro DC.

Is there any way to extract plain text from it, or to convert my PDF file to a simple plain-text PostScript file?

My sample PDF and PS files are attached:

Dropbox - 3.pdf

Dropbox - 3.ps

Please guide.

Thanks as always.

TOPICS
Edit and convert PDFs
7.0K
Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
1 ACCEPTED SOLUTION
LEGEND,
Oct 16, 2018

Extracting from PostScript is MUCH harder than extracting from PDF. There are many tools to extract text from PDF, but only if the text is actually there. You must make the OCR work; there is no use looking for another route! But this is a terrible scan, and to make it worse, it has been badly damaged by being stored as a JPEG. I think this may be beyond hope. You may need to type in the information; there is a time to give up.
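
To illustrate the "only if the text is there" point, here is a rough Python sketch of a check for whether a PDF is likely image-only. It relies on an assumption, not a guarantee: a page with real text normally needs a /Font resource, so a file with no /Font entry anywhere, including inside Flate-compressed streams, probably contains only scanned images. The function name looks_like_image_only_pdf is made up for this example, and encrypted or unusually structured files can fool the heuristic.

```python
import re
import zlib


def looks_like_image_only_pdf(data: bytes) -> bool:
    """Heuristic: True when no /Font resource is visible in the PDF bytes."""
    # Fonts referenced directly in the uncompressed part of the file.
    if b"/Font" in data:
        return False
    # Newer PDFs may keep dictionaries inside Flate-compressed streams,
    # so try to inflate every stream and look there too.
    for match in re.finditer(rb"stream\r?\n(.*?)endstream", data, re.DOTALL):
        try:
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (for example a JPEG image stream)
        if b"/Font" in inflated:
            return False
    return True
```

If this returns True, text extraction tools will have nothing to find and OCR is the only route. A False result only means fonts exist; it does not prove the text is usable.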

Community Expert,
Oct 16, 2018

What happens when you perform OCR?

New Here,
Oct 17, 2018

The original PDF file was around 250 pages, so I opened page 3 in Illustrator and shared it as the sample PDF file.

If I open the entire original PDF file in Google, I get around 70% correct text using Google OCR, but there still seems to be scope for improvement.

Alternatively, is there any way I could increase the quality of my original PDF so that OCR works better, or would exporting it to some other format work better?

Is there any way I can delete the table (the horizontal and vertical lines used for each set of data) and resave the file before extraction?

I tried Acrobat DC, but it required manual work, which is not feasible for thousands of sets of data.

If you have any ideas on the above, please guide me.

Thanks
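
On the question of deleting the table lines before OCR: one common idea is to erase any very long straight run of ink, since ruling lines are much longer than any stroke in a letter. The Python sketch below shows this on a tiny black-and-white bitmap held as a list of rows (1 = ink, 0 = paper). It is only an illustration: the names long_runs and remove_ruling_lines are invented here, the min_run threshold would need tuning per scan, and a real workflow would first render each PDF page to an image with an imaging library, which is not shown.

```python
def long_runs(line, min_run):
    """Yield (start, end) index pairs for runs of 1s at least min_run long."""
    start = None
    for i, value in enumerate(list(line) + [0]):  # sentinel closes a trailing run
        if value and start is None:
            start = i
        elif not value and start is not None:
            if i - start >= min_run:
                yield start, i
            start = None


def remove_ruling_lines(pixels, min_run):
    """Return a copy of the bitmap with long horizontal/vertical runs erased."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    cleaned = [row[:] for row in pixels]
    # Horizontal rules: scan each row of the ORIGINAL bitmap.
    for y in range(height):
        for start, end in long_runs(pixels[y], min_run):
            for x in range(start, end):
                cleaned[y][x] = 0
    # Vertical rules: scan each column of the ORIGINAL bitmap, so that
    # line crossings already erased above do not break a vertical run.
    for x in range(width):
        column = [pixels[y][x] for y in range(height)]
        for start, end in long_runs(column, min_run):
            for y in range(start, end):
                cleaned[y][x] = 0
    return cleaned


# A 5x5 "table cell": a border of ink around a single text pixel.
page = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
cleaned = remove_ruling_lines(page, min_run=4)
```

In this toy case the border is erased and the middle pixel survives. On a skewed or noisy scan the lines will not be perfectly straight, so deskewing first and using a generous threshold would matter.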

Participant,
Oct 26, 2018

Here's something I discovered:

If the image is clear enough, you can export to Word.

There is a setting called "recognize text if needed" (or is it "necessary"?).

And it may just do the job. I had a source with ligatures and strange font encodings.

The export process gave me pretty clean text. All the hyphenations were hard hyphens, of course, but it recognized the ligatures and gave me the non-combined text: ordinary text, no funny encodings.

Clean as much of the extra crap out of the PDF first: headers, footers, extra graphics junk. They just clutter up the Word file.

hth

Jay
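
For anyone post-processing text saved from such an export, the hard hyphens and any leftover ligature characters can be tidied with a few lines of Python. This is a minimal sketch under stated assumptions: the text is plain text with "\n" line breaks, and the function name clean_exported_text is invented for this example. Note that it will also remove the hyphen from a genuine compound that happens to break at a line end, and that NFKC normalization rewrites other compatibility characters besides ligatures.

```python
import re
import unicodedata


def clean_exported_text(text: str) -> str:
    """Expand ligature characters and rejoin words hyphenated at line ends."""
    # NFKC replaces compatibility characters such as U+FB01 (the "fi"
    # ligature) with their plain-letter equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Join "extrac-\ntion" into "extraction"; hyphens inside a line are kept.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)
```

For example, clean_exported_text("extrac-\ntion") gives "extraction", while "well-known" on a single line is left alone.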

Participant,
Nov 02, 2018
LATEST

An additional note: the OP wished for a simple PostScript file with plain text. That's no easier than any other OCR.

There are lots of paths through the OCR jungle, and Adobe's hasn't always been the best choice.

But my recent experience exporting to Word encourages me; that worked pretty well with my source.

ymmv, of course.
