Question

Text Extraction from PDF

June 25, 2017
22 replies
6751 views

I am a Windows application developer using Visual Studio, and I am trying to extract text from a PDF file.

Text in English extracts completely, but I am not able to extract clean text in Sanskrit and Gujarati.

I have tried several DLL libraries and functions.

In the end I found the problem, but no solution.

Problem: when text is extracted from the PDF, some characters do not come out with the proper Unicode value. See the image below.

The PDF file has:

But the text file shows:

Kindly suggest a solution.

KetankobaAuthor

Known Participant

This is the .indd file created with INDESIGN : TEST.indd - Google Drive

This is the .pdf file created from INDESIGN (Exported as .pdf) : TEST.pdf - Google Drive

Copying the content of the PDF and pasting it into the same file as a text box (using Adobe Acrobat DC Pro) gives garbage values, which are shown in that same PDF file.

Font Used : ADOBE DEVANAGARI BOLD

Kindly provide the solution...!!!

Test Screen Name

Legend

I am not going to repeat or justify my analysis, or explain in any more detail how text extraction works. You may pay someone else to repeat it if you wish; I have already spent a lot of time on this.

My considered opinion is that this PDF contains bad information for text extraction and that nothing can fix it. You do not have to accept my opinion. You can study the PDF reference yourself. This is very interesting and has occupied me for many years.

It is a basic fact of PDF files: some do not have correctly extractable text. Have you considered reporting this as a bug in the PDF creator? You may link to my analysis.

KetankobaAuthor

Known Participant

I have a lot of PDF files that were created with InDesign, but I do not have the InDesign source files (.indd) for them.

Kindly suggest a method to extract text from a PDF other than copying and pasting into Notepad.

Thank you.

KetankobaAuthor

Known Participant

Thank you for the analysis.

Point 4 of your analysis says: "DEVANAGARI NA has the Unicode value for DEVANAGARI VOWEL SIGN I"

Then why does it show the correct character the first time and an incorrect one the second time, and then the correct one again for the later characters?

Mapping is to be researched...!!!

AND ALSO NOT ONLY FOR "DEVANAGARI NA" WHICH SHOWS INCORRECT, THERE ARE MANY CHARACTERS WHICH EXTRACT WRONG TEXT IN DEVANAGARI AND OTHER LANGUAGES AS WELL.

SEE THE BELOW IMAGE

https://drive.google.com/file/d/0BzT4y2YlCY9Xclo5WENLc3JNd2M/view?usp=sharing

IN INDESIGN....TYPED TEXTS...

https://drive.google.com/file/d/0BzT4y2YlCY9XTlNKNkY1eHNMejA/view?usp=sharing
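To pin down which characters come out wrong and where, it can help to list the Unicode code points of the typed text next to those of the extracted text, rather than comparing by eye. The following is only a minimal sketch in Python using the standard `unicodedata` module. The two sample strings are hypothetical and are not taken from TEST.pdf, and the helper names `describe` and `first_mismatch` are made up for illustration.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


def first_mismatch(expected, extracted):
    """Index of the first differing character, or None if the strings are equal."""
    for i, (a, b) in enumerate(zip(expected, extracted)):
        if a != b:
            return i
    if len(expected) != len(extracted):
        # One string is a prefix of the other; they diverge where the shorter ends.
        return min(len(expected), len(extracted))
    return None


# Hypothetical strings for illustration only, NOT taken from TEST.pdf.
typed = "\u0928\u093f\u0928"      # NA, VOWEL SIGN I, NA
extracted = "\u0928\u093f\u093f"  # third character came out as VOWEL SIGN I

if __name__ == "__main__":
    print("first mismatch at index", first_mismatch(typed, extracted))
    for label, s in (("typed", typed), ("extracted", extracted)):
        print(label, describe(s))
```

Running this on real text copied from InDesign and from the extracted output would show exactly which code points differ, which is the information a bug report to the PDF creator would need.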

carlosa94449578

Participant

I have the same issue, but when I try to convert from PDF to an Excel file, some of the numbers change to letters or symbols in Excel. How can I fix this issue? I am trying to convert the PDF from Adobe Pro to Excel 2010.

Test Screen Name

Legend

I analysed your original file. Simply: if lots of different software extracts text wrong from the PDF, blame the PDF.

Technical analysis.

1. Extracting text from a PDF is a complex task, but the PDF standard gives some recommendations.

2. One recommendation is that a PDF may contain a "ToUnicode" map, and that if it is present it should take precedence.

3. Your sample file contains a ToUnicode map.

4. In the ToUnicode map, the character which is visually DEVANAGARI NA has the Unicode value for DEVANAGARI VOWEL SIGN I.

5. It seems to me that this ToUnicode map is incorrect, though Devanagari is a particularly complex part of Unicode, and I do not pretend to properly understand the rules for shaping and composing accents.

6. Adobe and all other technologies are extracting text using correct methods.

7. Your use of multiple technologies is an empirical confirmation of the points from my analysis. If the PDF is incorrect, you need to focus your attention on the technology that creates it this way. It was not created with Adobe technology: did you try that?
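The kind of check described in points 2 to 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CMap text below is a hypothetical fragment written for this example, not the ToUnicode map from the sample file, and the functions `parse_bfchar` and `duplicate_targets` are made-up helpers. It also assumes the ToUnicode stream has already been pulled out of the PDF and decompressed, and it reads only `bfchar` entries, not `bfrange` entries.

```python
import re

# Hypothetical ToUnicode CMap fragment (NOT from the sample PDF).
# Each bfchar line maps a glyph code to UTF-16BE text.
SAMPLE_CMAP = """
/CIDInit /ProcSet findresource begin
begincmap
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
3 beginbfchar
<0012> <0928>
<0047> <093F>
<0113> <093F>
endbfchar
endcmap
"""


def parse_bfchar(cmap_text):
    """Return {glyph_code: text} for every bfchar entry in the CMap."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            # The destination is UTF-16BE and may hold more than one code unit.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def duplicate_targets(mapping):
    """Group glyph codes by target text; keep targets reached from several glyphs."""
    by_text = {}
    for code, text in mapping.items():
        by_text.setdefault(text, []).append(code)
    return {text: sorted(codes) for text, codes in by_text.items() if len(codes) > 1}


if __name__ == "__main__":
    mapping = parse_bfchar(SAMPLE_CMAP)
    for text, codes in duplicate_targets(mapping).items():
        print([f"{c:04X}" for c in codes], "->", [f"U+{ord(ch):04X}" for ch in text])
```

Several glyphs mapping to the same Unicode value is not an error in itself, since a font can hold alternate forms of one character. The entries it lists are simply candidates to compare against what each glyph looks like on the page.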

KetankobaAuthor

Known Participant

ANY SOLUTION...!!!!

KetankobaAuthor

Known Participant

NOW I AM IN ADOBE TECHNOLOGY...

IS THERE ANY SOLUTION ???

KetankobaAuthor

Known Participant

I HAVE NOT USED THE APPLICATION. JUST COPIED FROM ADOBE ACROBAT DC......AND PASTED IN NOTEPAD AND MICROSOFT WORD.

LET US SAY IF I AM NOT USING 3rd PARTY PRODUCTS.. BUT....

WHAT ABOUT THE SEARCH OPTION IN ADOBE ACROBAT...!!!

THE IMAGE SHOWING SEARCHED TEXT IN ADOBE ACROBAT DC

TESTED IN PDF.PNG - Google Drive

Karl Heinz Kremer

Community Expert

It does not sound like you are actually using Adobe technology. You've mentioned that you've used "iTextSharp, PdfBox and PDFlib", these are all 3rd party products, and you will need to use their support systems to find out what's wrong with either your code or the PDF file you are using.
