
Extracting Text from an Image PS or PDF File

New Here, Oct 15, 2018


I have an image-only PDF file and need to extract the text from it. I tried many OCR tools without success. I also converted it to a PS file using Adobe Acrobat Pro DC.

Is there any way to extract simple text from it, or to convert my PDF file to a simple plain-text PostScript file?

My sample PDF and PS files are attached:

Dropbox - 3.pdf

Dropbox - 3.ps

Please guide.

Thanks as always.

TOPICS
Edit and convert PDFs

Views

6.5K

Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
community guidelines

1 Correct answer

LEGEND, Oct 16, 2018

Extracting text from PostScript is MUCH harder than extracting it from PDF. There are many tools to extract text from PDF, but only if the text is actually there. You must make the OCR work; there is no use looking for another route! But this is a terrible scan, and to make it worse it has been badly damaged by being stored as a JPEG. I think this may be beyond hope. You may need to type in the information; there is a time to give up.

Community Expert, Oct 16, 2018


What happens when you perform OCR?

New Here, Oct 17, 2018


The original PDF file was around 250 pages, so I opened page 3 in Illustrator and shared it as the sample PDF file.

If I open the entire original PDF file in Google, I get around 70% correct text using Google OCR, but there still seems to be scope for improvement.
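A figure like "around 70% correct" is easier to compare across tools if it is measured the same way each time. The following is only a rough sketch, assuming you hand-type one page as a reference and compare each tool's output against it. The function name `char_accuracy` is made up for illustration, and the score is a character-level similarity from Python's standard `difflib`, not a formal OCR error rate.

```python
from difflib import SequenceMatcher


def char_accuracy(reference: str, ocr_text: str) -> float:
    """Return a 0..1 similarity between hand-typed reference text and OCR output."""
    # Collapse runs of whitespace so layout differences are not counted as errors.
    ref = " ".join(reference.split())
    ocr = " ".join(ocr_text.split())
    if not ref and not ocr:
        return 1.0
    # autojunk=False stops the matcher from ignoring very frequent characters
    # in long texts, which would distort the score.
    return SequenceMatcher(None, ref, ocr, autojunk=False).ratio()


if __name__ == "__main__":
    typed = "Invoice number 1042"
    scanned = "Invoice nurnber 1O42"
    print(f"{char_accuracy(typed, scanned):.2%}")
```

Running the same reference page through each OCR route (Google, Acrobat, export to Word) gives numbers that can be compared directly, which is more reliable than judging by eye.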

Alternatively, is there any way I could increase the quality of my original PDF so that OCR works better, or would exporting it to some other format work better?

Is there any way I can delete the table (the horizontal and vertical lines used for each set of data) and resave the file before extraction?

I tried Acrobat DC, but it required manual work, which is not possible for thousands of sets of data.

If you have any ideas on the above, please guide.

Thanks
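On the question of deleting the table lines before OCR: one common idea is to erase long straight runs of dark pixels, since ruling lines are much longer than any stroke in a letter. The sketch below illustrates that idea only, under strong assumptions: the page has already been rendered and binarized into a grid of 0/1 values (1 = ink), the scan is not skewed, and `min_len` is chosen by hand. The names `remove_ruling_lines` and `_long_runs` are hypothetical. Image libraries usually do this with a morphological opening instead of a hand-written loop.

```python
from typing import List, Optional

Bitmap = List[List[int]]  # 1 = ink (dark pixel), 0 = background


def _long_runs(line: List[int], min_len: int) -> List[bool]:
    """Mark positions in one row or column that sit in an ink run of at least min_len."""
    mark = [False] * len(line)
    start: Optional[int] = None
    for i, value in enumerate(line + [0]):  # trailing 0 closes a run at the edge
        if value and start is None:
            start = i
        elif not value and start is not None:
            if i - start >= min_len:
                for j in range(start, i):
                    mark[j] = True
            start = None
    return mark


def remove_ruling_lines(img: Bitmap, min_len: int) -> Bitmap:
    """Return a copy of img with long horizontal and vertical ink runs erased."""
    if not img:
        return []
    height, width = len(img), len(img[0])
    row_marks = [_long_runs(row, min_len) for row in img]
    col_marks = [
        _long_runs([img[y][x] for y in range(height)], min_len)
        for x in range(width)
    ]
    return [
        [0 if (row_marks[y][x] or col_marks[x][y]) else img[y][x]
         for x in range(width)]
        for y in range(height)
    ]


if __name__ == "__main__":
    cell = [
        [1, 1, 1, 1, 1],
        [1, 0, 1, 0, 1],
        [1, 0, 0, 0, 1],
        [1, 1, 1, 1, 1],
    ]
    for row in remove_ruling_lines(cell, min_len=4):
        print(row)
```

Two caveats: a glyph that touches a ruling line loses the pixels where they overlap, and on a skewed or JPEG-damaged scan the lines break into short runs that this simple test will miss.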

Participant, Oct 26, 2018


Here's something I discovered:

If the image is clear enough, you can export to Word.

There is a setting, "Recognize text if needed" (or is it "if necessary"?).

And it may just do the job. I had a source with ligatures and strange font encodings.

The export process gave me pretty clean text. All the hyphenations were hard hyphens, of course, but it recognized the ligatures and gave me the non-combined text: ordinary text, no funny encodings.

Clean as much of the extra crap out of the PDF as you can first: headers, footers, extra graphics junk. They just clutter up the Word file.
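The hard hyphens can be cleaned up afterwards with a small script. This is a minimal sketch, assuming the Word export has been saved as plain text and that a word split across lines ends one line with a hyphen and continues in lowercase on the next. The function name `join_hyphenated` is made up for illustration.

```python
import re

# A word character, a hyphen, optional trailing spaces, a line break, optional
# indentation, then a lowercase letter: treated as one word split across lines.
_LINE_END_HYPHEN = re.compile(r"(\w)-[ \t]*\r?\n[ \t]*([a-z])")


def join_hyphenated(text: str) -> str:
    """Rejoin words that were hyphenated across a line break."""
    return _LINE_END_HYPHEN.sub(r"\1\2", text)


if __name__ == "__main__":
    sample = "The extrac-\ntion gave clean text."
    print(join_hyphenated(sample))
```

The lowercase test leaves splits like "Anglo-" / "Saxon" alone, but a genuine compound that happens to break at its hyphen ("well-" / "known") will be joined incorrectly, so the result still needs a spell-check pass.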

hth

Jay

Participant, Nov 02, 2018


An additional note: the OP wished for a simple PostScript file with plain text. That's no easier than any other OCR.

There are lots of paths through the OCR jungle, and Adobe's hasn't always been the best choice.

But my recent experience exporting to Word encourages me; that worked pretty well with my source.

YMMV, of course.
