Text Extraction from PDF

TESTED IN PDF.PNG - Google Drive

KetankobaAuthor

Known Participant

We copied the text content from the PDF file and pasted it into a .txt file as well as a .docx file.

The result remains the same.

Additionally, the PDF search option shows the same incorrect text.

THE TEXT FILE LINK:

TEST.txt - Google Drive

THE DOC FILE LINK:

TEST.docx - Google Drive

THE IMAGE SHOWING SEARCHED TEXT IN ADOBE ACROBAT DC

Legend

This is the Acrobat Reader forum. Please share the result you get when extracting text or copy/paste using Acrobat Reader.

KetankobaAuthor

Known Participant

THE PDF FILE LINK:

TEST.pdf - Google Drive

KetankobaAuthor

Known Participant

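/// PROGRAM IN C# USING "PDFBOX" LIBRARY
// Assumed using directives (not shown in the original post; the PDFBox namespaces
// below match the 1.8.x IKVM build and may differ in other versions):
// using System.IO; using System.Text;
// using org.apache.pdfbox.pdmodel; using org.apache.pdfbox.util;

string input = textBox1.Text; // input file path is in textBox1.Text
string output = "output.txt"; // placeholder: the original snippet never defines "output"

PDDocument doc = PDDocument.load(input);
try
{
    PDFTextStripper stripper = new PDFTextStripper();
    StringBuilder escaped = new StringBuilder();

    foreach (char c in stripper.getText(doc)) // visit each extracted character (UTF-16 code unit)
    {
        // Store each character as "\u<hex code>", taken directly from the extractor output
        escaped.Append("\\u").Append(((int)c).ToString("X4"));
    }

    File.AppendAllText(output, escaped.ToString()); // store the escapes in the text file (output)

    // Optional: store the extracted text itself in the text file (output)
    // File.WriteAllText(output, stripper.getText(doc), Encoding.UTF8);
}
finally
{
    doc.close(); // release the PDF file handle even if extraction fails
}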

Legend

Can you confirm how you obtained this list of Unicode points? Was it by adding code to the character extractor? Please do not read this value from the app opening the extracted data as this introduces the possibility of new errors.

It would be best if you could share the PDF. You must use your own file sharing for this.

KetankobaAuthor

Known Participant

For the word in the PDF

the Unicode should be: \u0939\u0930\u094D\u0937\u0928\u093F\u0927\u093E\u0928\u0938\u0942\u0930\u093F

The extracted text shows:

It shows the Unicode as: \u0939\u0930\u094D\u0937\u0928\u093F\u0927\u093E\u093F\u0938\u0942\u0930\u093F
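
To make the difference between the two sequences above easier to spot, here is a minimal sketch that parses both "\u" strings and reports the first position where they disagree. The names EscapeDiff, Parse and FirstMismatch are hypothetical helpers written for this illustration. They are not part of PDFBox or any other library, and the sketch only compares the two strings quoted above; it does not touch the PDF.

```csharp
using System;
using System.Collections.Generic;

static class EscapeDiff
{
    // Parse a string of "\uXXXX" escapes into a list of UTF-16 code units.
    public static List<int> Parse(string escapes)
    {
        var units = new List<int>();
        string[] parts = escapes.Split(new[] { "\\u" }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string part in parts)
        {
            units.Add(Convert.ToInt32(part.Trim(), 16));
        }
        return units;
    }

    // Index of the first differing code unit, or -1 if both sequences are identical.
    public static int FirstMismatch(List<int> expected, List<int> actual)
    {
        int shared = Math.Min(expected.Count, actual.Count);
        for (int i = 0; i < shared; i++)
        {
            if (expected[i] != actual[i]) return i;
        }
        return expected.Count == actual.Count ? -1 : shared;
    }
}

static class Program
{
    static void Main()
    {
        // Verbatim strings (@"...") keep the backslashes literally, exactly as posted.
        List<int> expected = EscapeDiff.Parse(
            @"\u0939\u0930\u094D\u0937\u0928\u093F\u0927\u093E\u0928\u0938\u0942\u0930\u093F");
        List<int> actual = EscapeDiff.Parse(
            @"\u0939\u0930\u094D\u0937\u0928\u093F\u0927\u093E\u093F\u0938\u0942\u0930\u093F");

        int i = EscapeDiff.FirstMismatch(expected, actual);
        if (i < 0)
            Console.WriteLine("Sequences are identical.");
        else if (i < expected.Count && i < actual.Count)
            Console.WriteLine("Index {0}: expected U+{1:X4}, got U+{2:X4}", i, expected[i], actual[i]);
        else
            Console.WriteLine("Sequences differ in length from index {0}.", i);
        // prints "Index 8: expected U+0928, got U+093F"
    }
}
```

For the two strings above, the only difference is the ninth code unit: U+0928 (DEVANAGARI LETTER NA) was expected, but U+093F (DEVANAGARI VOWEL SIGN I) was extracted.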

Legend

Yes, so please share the specific numbers for the two strings in your images. In full please.

KetankobaAuthor

Known Participant

Yes, exactly.

When the text in the PDF is "XZ", it must be extracted as "XZ".

I have built a .NET application in C#.

So far I have tried iTextSharp, PDFBox and PDFlib, but none of them gives a clean extraction.