Boris56
Known Participant
December 20, 2019
Question

How to get the coordinates of a word character?


Hello from St. Petersburg.

I need to extract the text from a PDF page in the order in which it is displayed on screen. I read the words from the page sequentially and then want to sort the extracted characters into that order. For this I use the PDWordGetCharQuad method, which should return the character's quad in user-space coordinates. However, for all characters of a given word, PDWordGetCharQuad returns a quad with the same coordinate values. Why is that?

I would be grateful for your help.


Legend
December 20, 2019

This is not expected. Does this Nth Word have any normal characters?

Boris56 (Author)
Known Participant
December 23, 2019

Yes, this word has normal characters.

The most surprising thing is that I created the PDF file from scratch. I wrote several lines into it and checked which quads the PDWordGetCharQuad method returns. It turned out that for all characters of one word the quad coordinates are the same, and this is true for every word.

Below is the code I used to get these results.

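ACCB1 ASBool ACCB2 wordEnumerator(PDWordFinder wObj, PDWord pdWord, ASInt32 pgNum, void* clientData)
{
	char str[128];
	PDWordGetString(pdWord, str, sizeof(str));

	// Open in append mode: "w+b" truncated the file on every callback,
	// so only the last word survived.
	FILE* pOutput = fopen("1.txt", "a");
	if (pOutput == NULL)
		return false;	// stop the enumeration if the log cannot be opened

	ASFixedQuad quad;
	int len = (int)PDWordGetLength(pdWord);
	if (len > (int)sizeof(str) - 1)
		len = (int)sizeof(str) - 1;	// never index past the string buffer
	for (int i = 0; i < len; i++) {
		if (!PDWordGetCharQuad(pdWord, i, &quad))
			continue;	// no quad was returned for this character
		// ASFixed is 16.16 fixed point; divide by 65536 to get user-space units.
		fprintf(pOutput, "%c %.2f, %.2f   %.2f, %.2f   %.2f, %.2f   %.2f, %.2f\n", str[i],
			quad.tl.h / 65536.0, quad.tl.v / 65536.0, quad.tr.h / 65536.0, quad.tr.v / 65536.0,
			quad.bl.h / 65536.0, quad.bl.v / 65536.0, quad.br.h / 65536.0, quad.br.v / 65536.0);
	}
	fclose(pOutput);
	return true;
}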

 

Boris56 (Author)
Known Participant
January 20, 2020

I noticed one very interesting thing. For a word where PDWordGetNumQuads(pdWord) == 1, the PDWordGetCharQuad and PDWordGetNthQuad methods return the same quad.

This is the case when the PDWordFinderConfigRec.noExtCharOffset property is set to true.