How can I tell that characters are in different cells of a table?

Question

I am reading text from a table. Is it possible to tell whether the characters being read are in different cells of the table?

Thom Parker · Accepted Answer

I parse the content of the page. For elements whose type is kPDEPath, I call the getTblPath procedure, which shows the type and content of the element. For one of the tables I received the following data:

Stroke
Rectangle 4648338, 33683472, 36216374, 28081717
MoveTo 4654891, 31721062
LineTo 36209820, 31721062
MoveTo 4654891, 31662080
LineTo 36209820, 31662080
MoveTo 12629116, 33683472
LineTo 12629116, 28081718
MoveTo 20202520, 33683472
LineTo 20202520, 28081718
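A note on reading these numbers: they look like 16.16 fixed-point values (the Acrobat SDK's ASFixed type), so dividing by 65536 should give coordinates in points. That is an assumption based on the magnitudes, not something stated in the post. A minimal sketch of the conversion, with a hypothetical helper name:

```python
# Assumption: the dumped values are 16.16 fixed-point numbers (ASFixed),
# so one unit in user space equals 65536.
FIXED_ONE = 65536


def fixed_to_points(value):
    """Convert a 16.16 fixed-point integer to a float in points."""
    return value / FIXED_ONE


# Rectangle from the first table, as dumped: left, top, right, bottom
rect_fixed = (4648338, 33683472, 36216374, 28081717)
rect_points = tuple(round(fixed_to_points(v), 2) for v in rect_fixed)
print(rect_points)
```

Under that assumption the rectangle spans roughly 71 to 553 points horizontally, which is plausible for a table on a Letter or A4 page.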

Using the AVPageViewDrawPolygonOutline and AVPageViewDrawRectOutline methods, I drew lines from this data. They coincided exactly with the border lines of the table on the screen. I was delighted and decided the problem was solved, but then I moved to another page of the document, which had another table. For it I received the following data:

Fill
Rectangle 2786591, 4086956, 34367735, 5758910
Stroke
MoveTo 0, 0
LineTo 13730710, 0
Stroke
MoveTo 0, 0
LineTo 6485639, 0
Stroke
MoveTo 0, 0
LineTo 4086956, 0
Stroke
MoveTo 0, 0
LineTo 7245071, 0
Stroke
MoveTo 0, 0
LineTo 0, 1995047
etc.

That is, this page has many kPDEPath elements.
The first element is of type kPDEFill; the rest are kPDEStroke.
If I draw lines from this data, they do not coincide with the border lines of the table. I would really appreciate help on how to interpret this information.
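One possible explanation, offered as an assumption rather than a confirmed diagnosis: every stroke starts at 0, 0, which suggests the path data is in each element's own coordinate space and has to be mapped through that element's transformation matrix (in PDFEdit, the one returned by PDEElementGetMatrix) before drawing. The sketch below only shows the arithmetic; the matrix values are hypothetical, since the post does not list them.

```python
# Assumption: path coordinates are 16.16 fixed point and are expressed in the
# element's own space; the element matrix maps them to page space.
FIXED_ONE = 65536


def fixed_to_float(value):
    return value / FIXED_ONE


def transform_point(matrix, x, y):
    """Apply a PDF-style affine matrix (a, b, c, d, h, v) to a point."""
    a, b, c, d, h, v = matrix
    return (a * x + c * y + h, b * x + d * y + v)


# Hypothetical matrix: no scaling or rotation, only a translation in points.
matrix = (1.0, 0.0, 0.0, 1.0, 42.5, 87.9)

# First stroke of the second table: MoveTo 0, 0 / LineTo 13730710, 0
start = transform_point(matrix, fixed_to_float(0), fixed_to_float(0))
end = transform_point(matrix, fixed_to_float(13730710), fixed_to_float(0))
print(start, end)
```

If the real matrices are translations like this one, each short segment would land on a different part of the page, which would explain why the untransformed lines all pile up at the origin.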

To find page text dividers, I use only the bounding box of each graphics element. It's much faster. On a properly formatted page, text is not going to cross a graphic. You also have to look at the shape of a bounding box. Lines are obvious from their long, thin boxes.
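A rough sketch of that bounding-box idea, independent of the Acrobat API: classify each box by shape, collect the thin ones as dividers, and compare the cell indices of two character positions. All names, the thinness threshold, and the sample boxes (loosely based on the first table, in points) are assumptions for illustration.

```python
from bisect import bisect_right


def classify(bbox, thin=2.0):
    """Label a (left, bottom, right, top) box as a divider line or other."""
    left, bottom, right, top = bbox
    width, height = right - left, top - bottom
    if height <= thin < width:
        return "horizontal"
    if width <= thin < height:
        return "vertical"
    return "other"


def cell_of(x, y, col_edges, row_edges):
    """Return (column index, row index) of a point between sorted dividers."""
    return (bisect_right(col_edges, x), bisect_right(row_edges, y))


def same_cell(p, q, col_edges, row_edges):
    return cell_of(*p, col_edges, row_edges) == cell_of(*q, col_edges, row_edges)


# Hypothetical bounding boxes in points: one rule, two column lines, the frame.
boxes = [
    (71.0, 484.0, 552.5, 484.1),
    (192.7, 428.5, 192.8, 514.0),
    (308.3, 428.5, 308.4, 514.0),
    (71.0, 428.5, 552.5, 514.0),
]
col_edges = sorted((b[0] + b[2]) / 2 for b in boxes if classify(b) == "vertical")
row_edges = sorted((b[1] + b[3]) / 2 for b in boxes if classify(b) == "horizontal")

print(same_cell((100, 500), (250, 500), col_edges, row_edges))
```

Character positions would come from whatever text extraction is in use; two characters belong to different cells when a vertical or horizontal divider lies between them.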

Test Screen Name · Answer

PDPath is a subclass of PDGraphic, which in turn is made available to callbacks from PDPageEnumContents. HOWEVER, you should not use PDPageEnumContents. The documentation says: "Note: This method is provided only for backwards compatibility. It has not been updated beyond PDF Version 1.1 and may not work correctly for newly created PDF 1.2 or later files. You should use the PDFEdit API to enumerate page contents."

So, you can go into the world of PDFEdit, which requires full knowledge of the PDF graphics and text models. You might as well abandon PDWordFinder and use PDFEdit to get the text too (which may not be in reading order).
