Hello from St. Petersburg.
I need to extract the text from a page of a PDF document in the order in which it is displayed on screen. I read the words from the page sequentially, and then I want to sort the characters I received into the desired order. For this I use the PDWordGetCharQuad method, which should return a character's quad in user-space coordinates. It turns out that for all characters of one word, PDWordGetCharQuad returns a quad with the same coordinate values. Why is that?
I would be grateful for your help.
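For illustration, the sorting step described above could be sketched as follows. This assumes each character reports its own distinct quad, and it uses a hypothetical `CharBox` struct and `sortReadingOrder` function (neither is part of the Acrobat SDK) holding only the left and bottom edges of a quad. In PDF user space a larger vertical value is higher on the page, so lines are ordered by descending bottom edge and characters within a line by ascending left edge.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one character and two edges of its quad.
// Not an SDK type; values are in whatever unit the quads use.
struct CharBox {
    char ch;
    std::int32_t left;    // left edge of the quad
    std::int32_t bottom;  // bottom edge of the quad (larger = higher on the page)
};

// Returns the boxes in reading order: lines from top to bottom,
// characters from left to right within each line.
// Characters whose bottom edges differ from the first character of a line
// by at most lineTolerance are treated as belonging to that line.
std::vector<CharBox> sortReadingOrder(std::vector<CharBox> boxes,
                                      std::int32_t lineTolerance)
{
    auto byLeft = [](const CharBox& a, const CharBox& b) {
        return a.left < b.left;
    };

    // Step 1: highest characters first.
    std::stable_sort(boxes.begin(), boxes.end(),
                     [](const CharBox& a, const CharBox& b) {
                         return a.bottom > b.bottom;
                     });

    // Step 2: split into lines and sort each line from left to right.
    auto lineStart = boxes.begin();
    for (auto it = boxes.begin(); it != boxes.end(); ++it) {
        std::int64_t drop = static_cast<std::int64_t>(lineStart->bottom) - it->bottom;
        if (drop > lineTolerance) {
            std::stable_sort(lineStart, it, byLeft);
            lineStart = it;
        }
    }
    std::stable_sort(lineStart, boxes.end(), byLeft);
    return boxes;
}
```

Grouping into lines first, rather than comparing with a tolerance inside a single comparator, keeps each sort a valid strict ordering. The tolerance value is a guess that would need tuning per document.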
This is not expected. Does this Nth Word have any normal characters?
This word does have normal characters.
The most surprising thing is that I created the PDF file from scratch. I wrote several lines into it and checked which quads the PDWordGetCharQuad method returns. It turned out that for all characters of one word the quad coordinates are the same. And this is true for all words.
Here is the code with which I got these results.
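ACCB1 ASBool ACCB2 wordEnumerator(PDWordFinder wObj, PDWord pdWord, ASInt32 pgNum, void* clientData)
{
    char str[128];
    PDWordGetString(pdWord, str, sizeof(str));
    ASFixedQuad quad;
    // Append, so that output from earlier words is not overwritten on each callback.
    FILE* pOutput = fopen("1.txt", "ab");
    if (pOutput == NULL)
        return false; // stop the enumeration if the file cannot be opened
    // Do not index past the end of str for words longer than the buffer.
    for (int i = 0; i < PDWordGetLength(pdWord) && i < (int)sizeof(str) - 1; i++) {
        bool b = PDWordGetCharQuad(pdWord, i, &quad); // return value is not checked here
        // Coordinates are printed as raw ASFixed values.
        fprintf(pOutput, "%c %d, %d %d, %d %d, %d %d, %d\n",
            str[i], quad.tl.h, quad.tl.v, quad.tr.h, quad.tr.v, quad.bl.h, quad.bl.v, quad.br.h, quad.br.v);
    }
    fclose(pOutput);
    return true;
}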
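A side note on reading that output: the quad fields are ASFixed values, which are 16.16 fixed-point numbers, so printing them with %d shows the raw integer rather than the coordinate in points. A minimal sketch of the conversion, using a hypothetical helper name `fixedToDouble` and assuming ASFixed is a signed 32-bit integer (the SDK headers may already provide their own conversion helpers):

```cpp
#include <cstdint>

// Hypothetical helper, not an SDK function.
// Assumption: the raw value is a signed 32-bit 16.16 fixed-point number,
// i.e. the upper 16 bits are the integer part and the lower 16 bits
// are the fraction, so dividing by 2^16 gives the real value.
inline double fixedToDouble(std::int32_t raw)
{
    return raw / 65536.0;
}
```

With this, a raw value of 98304 corresponds to 1.5 points, which makes it easier to compare the logged quads against positions seen on the page.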
Maybe that method is broken. What is your exact Acrobat version?
I have Adobe Acrobat Pro DC version 2019.021.20061
It is not written anywhere in help that the method does not work. Can I get some advice from the company's programmers?
"It is not written anywhere in help that the method does not work." I said it might be broken, not that this was planned. Bugs happen.
"Can I get some advice from the company's programmers?" No, I'm quite sure you cannot. I never have in 20 years.
Some thoughts (though you might like to consider using PDFEdit instead, I'm sure it is much more visited by programmers).
1. You do not check the return value of PDWordGetCharQuad. I suggest you do.
2. You say you created the PDFs yourself (in a text editor?) Does it happen with PDF files you did not make?
"1. You do not check the return value of PDWordGetCharQuad. I suggest you do."
I checked. PDWordGetCharQuad always returns true.
"2. You say you created the PDFs yourself (in a text editor?) Does it happen with PDF files you did not make?" I created the PDF file in Acrobat. I checked several files (not only ones I created). The results are the same.
Test_Screen_Name, if you have the time and opportunity, please check for yourself. Create a PDF file and read all the words from it. I used the algorithm described on the page https://help.adobe.com/en_US/acrobat/acrobat_dc_sdk/2015/HTMLHelp/#t=Acro12_MasterBook%2FPlugins_Wor...
I noticed one very interesting thing. For a word with PDWordGetNumQuads(pdWord) == 1, the PDWordGetCharQuad and PDWordGetNthQuad methods return the same quad.
This happens when the PDWordFinderConfigRec.noExtCharOffset property is set to true.
I never noticed this option before. Good that you have a solution.