Known Participant
June 25, 2017
Question

Text Extraction from PDF

  • June 25, 2017
  • 22 replies
  • 6735 views

I am a Windows application developer using Visual Studio.

I am trying to extract text from a PDF file.

Extraction works completely for English text.

However, I am not able to extract clean text in Sanskrit or Gujarati.

I have tried several different DLL libraries and functions.

I eventually identified the problem but have not found a solution.

Problem: While extracting text from the PDF, the extractor sometimes does not return the correct Unicode value for a character. See the images below.

The PDF file has: [image not preserved]

But the text file shows: [image not preserved]

Kindly suggest a solution.

    This topic has been closed for replies.

    22 replies

    Legend
    June 26, 2017

    You don't need to tell us what Unicode is, sorry if that was not clear. And you are wrong: characters do not always have a unique representation. I was asking you what specific codes were extracted for this sequence. Please also tell us which exact codes you expected, so we can see how they are different.

    For example, if you saw XY but expected XZ, I would want you to reply that you saw 0058 0059 but expected 0058 005A. I am not able to guess which Devanagari diacritics are in your pictures.
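One way to produce the kind of report requested above is sketched below. It is only an illustration: the helper names `hex_codes` and `compare` are hypothetical, and the sample strings are made up rather than taken from the PDF in question. Besides listing the hexadecimal code points of both sequences, it checks whether they are canonically equivalent, since the same visible character can sometimes be encoded in more than one way.

```python
import unicodedata


def hex_codes(text):
    """Return the code points of text as space-separated 4-digit hex values."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


def compare(seen, expected):
    """Summarise how an extracted string differs from the expected one."""
    return {
        "seen": hex_codes(seen),
        "expected": hex_codes(expected),
        # True only if both strings contain exactly the same code points.
        "identical": seen == expected,
        # True if both strings decompose to the same sequence (NFD), i.e.
        # they are two encodings of the same text.
        "canonically_equivalent": (
            unicodedata.normalize("NFD", seen)
            == unicodedata.normalize("NFD", expected)
        ),
    }


# Latin example in the spirit of the reply above.
print(compare("XY", "XZ"))

# Devanagari example (assumed sample): the precomposed letter QA versus
# KA followed by the nukta sign. Different code points, same text.
print(compare("\u0958", "\u0915\u093C"))
```

If the two sequences are reported as canonically equivalent, the extractor is arguably returning valid Unicode in a different form; if they are not, the listed code points show exactly where the extraction goes wrong.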

    Legend
    June 25, 2017

    1. Have you tried using Microsoft Word to view this text? What was your result?

    2. You say the Unicode is incorrect. What exactly are the Unicode values? Have you examined all of the code values produced?
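To examine every code value in an extracted string, a small script along these lines may help. It is a minimal sketch using Python's standard `unicodedata` module; the function name `describe` and the sample string are assumptions for illustration, not output from any particular PDF library.

```python
import unicodedata


def describe(text):
    """Return one (hex code point, Unicode character name) pair per character."""
    return [
        (f"{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Hypothetical sample: Devanagari KA followed by VOWEL SIGN I.
# In practice, paste or read in the text produced by the extractor.
extracted = "\u0915\u093F"

for code, name in describe(extracted):
    print(code, name)
```

Listing the character names next to the codes makes it easier to spot values that fall outside the Devanagari or Gujarati blocks, such as Latin letters or private-use characters.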

    Ketankoba (Author)
    Known Participant
    June 26, 2017

    Unicode contains a repertoire of over 136,000 characters covering 139 modern and historic scripts, as well as multiple symbol sets.

    EACH CHARACTER IN EACH LANGUAGE HAS A UNIQUE CODE; THAT IS UNICODE.

    YES, I TRIED TO CONVERT IN WORD AS WELL.

    IT SHOWS THE SAME RESULT.

    IN DETAIL, TAKE THE CHARACTERS IN THE PDF, FOR EXAMPLE:

    THE ENGLISH CONSONANT "X" HAS THE UNICODE VALUE 0058. THE CORRECT VALUE IS RETURNED WHEN EXTRACTING ENGLISH TEXT. NO ISSUES AT ALL.

    BUT FOR TEXT IN THE "SANSKRIT" OR "GUJARATI" LANGUAGE, IT EXTRACTS THE WRONG UNICODE VALUES (INCORRECT READING).