Text Extraction from PDF

June 25, 2017

I am a Windows application developer using Visual Studio, and I am trying to extract text from a PDF file.

Extraction works completely for English text, but I am not able to extract clean text in Sanskrit and Gujarati.

I have tried several different DLL libraries and functions.

I have finally identified the problem, but I have no solution.

Problem: While extracting text from the PDF, the extractor sometimes does not return the correct Unicode values for a character. See the images below.

The PDF file has:

But the text file shows:

Kindly suggest a solution.

This topic has been closed for replies.


Test Screen Name

Legend

You don't need to tell us what Unicode is, sorry if that was not clear. And you are wrong: characters do not always have a unique representation. I was asking you what specific codes were extracted for this sequence. Please also tell us which exact codes you expected, so we can see how they are different.

For example, if you saw XY but expected XZ, I would want you to reply that you saw 0058 0059 but expected 0058 005A. I am not able to guess what Devanagari diacritics are in your pictures.
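One way to produce the kind of code listing requested above is to dump every code point of the extracted text. The following is only a minimal sketch in Python, assuming the extracted text is already available as a string; the helper names `code_points` and `describe` are made up for illustration and do not come from any PDF library.

```python
import unicodedata

def code_points(text):
    """Return the hex code (at least four digits) of every character in text."""
    return [f"{ord(ch):04X}" for ch in text]

def describe(text):
    """Pair each hex code with its Unicode character name for easier reading."""
    return [(f"{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

print(code_points("XY"))  # ['0058', '0059']
print(code_points("XZ"))  # ['0058', '005A']
```

Running `describe` on both the extracted text and the expected text makes it easier to see which vowel signs or diacritics differ, because each code is listed next to its official character name.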


Test Screen Name

Legend

1. Have you tried using Microsoft Word to view this text? What was your result?

2. You say the Unicode is incorrect. What is the Unicode exactly? Have you examined all of the code values produced?

Ketankoba (Author)

Known Participant

Unicode contains a repertoire of over 136,000 characters covering 139 modern and historic scripts, as well as multiple symbol sets.

Each character in each language has a unique code; that is Unicode.

Yes, I tried converting in Word as well. It shows the same result.

In detail, take the characters in the PDF: the English consonant "X" has a particular Unicode value, 0058. The correct value is returned when extracting English text, with no issues at all.

But for Sanskrit or Gujarati text, the extraction returns the wrong Unicode values (an incorrect reading).
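As noted earlier in the thread, the same visible character can be stored as more than one code sequence, so a mismatch between extracted and expected codes is not always a wrong reading. The following rough sketch uses Python's standard `unicodedata` module to compare two strings after normalization. The helper name `same_after_normalization` is hypothetical, and the check only covers canonical equivalence; it will not repair text whose codes were genuinely extracted wrong.

```python
import unicodedata

def same_after_normalization(extracted, expected, form="NFC"):
    """Compare two strings after applying the same Unicode normalization form."""
    return unicodedata.normalize(form, extracted) == unicodedata.normalize(form, expected)

# DEVANAGARI LETTER QA as a single code point, versus
# DEVANAGARI LETTER KA followed by DEVANAGARI SIGN NUKTA.
precomposed = "\u0958"
decomposed = "\u0915\u093C"

print(precomposed == decomposed)                          # False
print(same_after_normalization(precomposed, decomposed))  # True
```

If the two strings still differ after normalization, the difference is a real one, and listing the exact codes on both sides is the next step.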
