How can I tell that characters are in different cells of a table?

Explorer, Mar 10, 2020

I am reading text from a table. Is it possible to tell that the characters being read are in different cells of the table?

TOPICS
Acrobat SDK and JavaScript

Community Expert, Mar 10, 2020

Can you explain your problem? PDF files don't have tables or cells.

Explorer, Mar 11, 2020

I get an arbitrary PDF file. It is user-generated, and I do not know how it was produced. When viewing the file, I can see that it contains tables. I understand that Acrobat Reader has no such entity as a table, but visually it looks like one. I need to recognize that the text characters I get using the PDWordFinder object and the PDWordFinderAcquireWordList method form a table, and to determine at which character the transition to a new table cell occurs. If I had the positions of the table's boundary lines, this would not be difficult, but so far I do not know how to get them.

Community Expert, Mar 11, 2020

Lines are quite easy to find by searching for paths in the page content stream. 

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often

Explorer, Mar 11, 2020

Thanks for the advice.

Do I understand correctly that this can be done using the PDPath object?

Community Expert, Mar 11, 2020

Well, yes: the code needs to search the content for path objects, then decide whether or not each one looks like a line. I just get the overall width and height. Lines are skinny, so come up with a definition of skinny; 4 points in either direction works. The problem is that lines on a page are usually made up of several paths, so the code will need to stitch together all the paths that line up.

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often
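
A minimal sketch of the "skinny bounding box" test and the stitching step described above. It is not taken from the poster's code and does not call the Acrobat SDK: `Box`, `Kind`, `classify` and `stitchHorizontal` are hypothetical names, boxes are assumed to be already converted to points, and the 1-point stitching tolerance is an assumption (only the 4-point thickness comes from the post).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical bounding box in PDF points (not an SDK type).
struct Box { double left, bottom, right, top; };

enum class Kind { Horizontal, Vertical, NotALine };

// A box counts as a line when one dimension is at most maxThickness points
// and the other dimension is larger than that.
Kind classify(const Box& b, double maxThickness = 4.0) {
    assert(b.right >= b.left && b.top >= b.bottom);
    const double w = b.right - b.left;
    const double h = b.top - b.bottom;
    if (h <= maxThickness && w > maxThickness) return Kind::Horizontal;
    if (w <= maxThickness && h > maxThickness) return Kind::Vertical;
    return Kind::NotALine;
}

// Merge horizontal line boxes that sit at about the same height and touch
// or nearly touch end to end (gap <= tol). Returns the stitched boxes.
std::vector<Box> stitchHorizontal(std::vector<Box> lines, double tol = 1.0) {
    // Process boxes left to right so a stitched line only ever grows rightwards.
    std::sort(lines.begin(), lines.end(), [](const Box& a, const Box& b) {
        if (a.left != b.left) return a.left < b.left;
        return a.bottom < b.bottom;
    });
    std::vector<Box> out;
    for (const Box& b : lines) {
        const double cy = (b.bottom + b.top) / 2.0;
        bool merged = false;
        for (Box& o : out) {
            const double ocy = (o.bottom + o.top) / 2.0;
            if (std::fabs(cy - ocy) <= tol && b.left <= o.right + tol) {
                o.right = std::max(o.right, b.right);
                o.bottom = std::min(o.bottom, b.bottom);
                o.top = std::max(o.top, b.top);
                merged = true;
                break;
            }
        }
        if (!merged) out.push_back(b);
    }
    return out;
}
```

Vertical lines can be stitched the same way with the axes swapped. The quadratic inner loop is kept for clarity; a page rarely has enough rule lines for that to matter.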

Explorer, Mar 17, 2020

As I understand it, I have to use the function
void PDPathEnum (PDPath obj, PDPathEnumMonitor mon, void * clientData)
Unfortunately, I have not found an example of working with this function anywhere.
My first question is: how do I get the PDPath obj?

Community Expert, Mar 17, 2020

I would have to go back and look at my old code to be sure, but I think the easy solution is to walk through the page content using the content enumerator callback. When a path is encountered, get its bounding box. The actual path segments aren't as important as the shape of the box.

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often

Community Expert, Mar 10, 2020

Yes, but it's not free or easy. I've written a ton of code for parsing tables, and there are a few approaches. If the document is tagged, then use the tags. Otherwise it has to be parsed. If the table format is known up front, then it's easy (or at least the easiest case to handle). But the general approach is to sort the word blocks into rows and columns and assume each block is a cell. Other approaches are refinements on this, such as finding lines and using them as hard dividers.

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often
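
A minimal sketch of the "sort the word blocks into rows and columns" idea, assuming each word block has already been reduced to a text string and a bounding box in points. `WordBlock`, `Row`, `groupIntoRows` and the 3-point row tolerance are hypothetical choices for illustration, not part of the SDK or of the poster's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// Hypothetical word block: text plus bounding box in PDF points
// (origin at bottom-left, so a larger y is higher on the page).
struct WordBlock { std::string text; double left, bottom, right, top; };

using Row = std::vector<WordBlock>;

// Group blocks into rows by vertical centre, then order each row left to right.
std::vector<Row> groupIntoRows(std::vector<WordBlock> blocks, double rowTol = 3.0) {
    assert(rowTol >= 0.0);
    auto centreY = [](const WordBlock& b) { return (b.bottom + b.top) / 2.0; };

    // Top of the page first.
    std::sort(blocks.begin(), blocks.end(),
              [&](const WordBlock& a, const WordBlock& b) { return centreY(a) > centreY(b); });

    std::vector<Row> rows;
    for (const WordBlock& b : blocks) {
        // Start a new row when this block is too far below the row's first block.
        if (rows.empty() || std::fabs(centreY(rows.back().front()) - centreY(b)) > rowTol)
            rows.emplace_back();
        rows.back().push_back(b);
    }

    for (Row& r : rows)
        std::sort(r.begin(), r.end(),
                  [](const WordBlock& a, const WordBlock& b) { return a.left < b.left; });
    return rows;
}
```

Each block in a row is then treated as one cell. Multi-line cells and merged cells break this assumption, which is where the line-based refinements mentioned in the post come in.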

LEGEND, Mar 17, 2020

PDPath is a subclass of PDGraphic, which in turn is made available to callbacks from PDPageEnumContents. HOWEVER, you should not use PDPageEnumContents. The documentation says: "Note: This method is provided only for backwards compatibility. It has not been updated beyond PDF Version 1.1 and may not work correctly for newly created PDF 1.2 or later files. You should use the PDFEdit API to enumerate page contents."

 

So, you can go into the world of PDFEdit, which requires full knowledge of the PDF graphics and text models. You might as well abandon PDWordFinder and use PDFEdit to get the text too (which may not be in reading order).

Community Expert, Mar 17, 2020

Ha ha, I use both PDFEdit and WordFinder, since each provides different data. You never could rely on WordFinder for word ordering, or even for proper text size data.

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often

Explorer, Mar 18, 2020

So PDPageEnumContents should not be used, but PDPathEnum can be. That may be easier than figuring out how to get graphics using PDFEdit.

LEGEND, Mar 18, 2020

To get a PDPath you would have to use PDPageEnumContents. They are not there waiting to be discovered, they are created dynamically by the obsolete page enumerator methods.

Explorer, Mar 18, 2020

It turns out that the PDPathEnum method cannot be used either. It is strange that this is not written anywhere in the documentation.

Explorer, Mar 18, 2020

Looks like I found an example that describes how to get graphic content.
https://github.com/datalogics/adobe-pdf-library-samples/tree/master/CPlusPlus/Sample_Source/Display/...
I will sort this out. Thank you all for your help, especially Test_Screen_Name

Explorer, Apr 07, 2020

I parse the content of the page. For elements whose type is kPDEPath, I call the getTblPath procedure, which shows the type and content of the element. For one of the tables I received the following data:

Stroke
Rectangle 4648338, 33683472, 36216374, 28081717
MoveTo 4654891, 31721062
LineTo 36209820, 31721062
MoveTo 4654891, 31662080
LineTo 36209820, 31662080
MoveTo 12629116, 33683472
LineTo 12629116, 28081718
MoveTo 20202520, 33683472
LineTo 20202520, 28081718

 

Using the AVPageViewDrawPolygonOutline and AVPageViewDrawRectOutline functions, I drew lines according to this data. The lines coincided exactly with the table's boundary lines on the screen. I was delighted and decided that the problem was solved, but then I moved to another page of the document, which had another table. I received the following data for it:

Fill
Rectangle 2786591, 4086956, 34367735, 5758910
Stroke
MoveTo 0, 0
LineTo 13730710, 0
Stroke
MoveTo 0, 0
LineTo 6485639, 0
Stroke
MoveTo 0, 0
LineTo 4086956, 0
Stroke
MoveTo 0, 0
LineTo 7245071, 0
Stroke
MoveTo 0, 0
LineTo 0, 1995047
etc.

 

That is, this page has many kPDEPath elements.
The type of the first element is kPDEFill; the rest are kPDEStroke.
If I draw lines according to this data, they do not coincide with the boundary lines of the table. I would really appreciate help on how to interpret this information.

Community Expert, Apr 07, 2020 (Correct answer)

For the purposes of finding text dividers on a page, I use only the bounding box of each graphics element. It's much faster. On a properly formatted page, text is not going to cross a graphic. You also have to look at the shape of the bounding box. Lines are obvious.

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often
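
Once divider positions are known, deciding where text moves to a new cell reduces to an interval lookup. A minimal sketch, assuming the x positions of the vertical dividers have been collected and sorted in ascending order, and that each word is represented by the x coordinate of its centre. `cellIndex` and `crossesDivider` are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Index of the interval that contains x, given divider positions sorted
// ascending. 0 means left of the first divider; sortedDividers.size()
// means right of the last one.
std::size_t cellIndex(const std::vector<double>& sortedDividers, double x) {
    assert(std::is_sorted(sortedDividers.begin(), sortedDividers.end()));
    return static_cast<std::size_t>(
        std::upper_bound(sortedDividers.begin(), sortedDividers.end(), x) -
        sortedDividers.begin());
}

// True when two consecutive words fall in different intervals,
// that is, when a divider lies between them.
bool crossesDivider(const std::vector<double>& sortedDividers,
                    double prevWordCentreX, double nextWordCentreX) {
    return cellIndex(sortedDividers, prevWordCentreX) !=
           cellIndex(sortedDividers, nextWordCentreX);
}
```

The same lookup applied to horizontal dividers and word centre y values gives the row, so a (row, column) pair identifies the cell.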

Explorer, Apr 08, 2020

Thanks. Great idea. Did I understand you correctly that you use the PDEElementGetBBox method to determine the position of the bounding box of a graphics element? I have done it. Indeed, the drawn lines coincide with the boundary lines of the table on the screen.

Explorer, Apr 08, 2020

I was not quite right. If each row in the table is presented as a separate graphic element, then this approach is correct, but there are tables implemented as a single graphic element. In this case, to get all the rows of the table, you need to analyze the data of this graphic element, as it is done in the DisplayPath function in the dpcPath.cpp module.
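
A minimal sketch of the single-element case, assuming the path data has already been flattened into a list of MoveTo/LineTo operators with coordinates in points. `Op`, `PathOp`, `Dividers` and `collectDividers` are hypothetical; this is not the SDK's data layout and not the DisplayPath code. Curves, rectangles and close-path operators are ignored here.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical flattened path operator; not the SDK's own data layout.
enum class Op { MoveTo, LineTo };
struct PathOp { Op op; double x, y; };

struct Dividers { std::vector<double> rowsY, colsX; };

// Walk the operators and record axis-aligned segments: a horizontal
// segment gives a row divider (its y), a vertical one a column divider (its x).
Dividers collectDividers(const std::vector<PathOp>& ops, double eps = 0.5) {
    assert(eps >= 0.0);
    Dividers d;
    double cx = 0.0, cy = 0.0;  // current point
    bool haveCurrent = false;
    for (const PathOp& o : ops) {
        if (o.op == Op::LineTo && haveCurrent) {
            const double dx = std::fabs(o.x - cx);
            const double dy = std::fabs(o.y - cy);
            if (dy <= eps && dx > eps) d.rowsY.push_back(cy);
            else if (dx <= eps && dy > eps) d.colsX.push_back(cx);
        }
        cx = o.x;
        cy = o.y;
        haveCurrent = true;
    }
    return d;
}
```

Fed with the first table's data converted to points, this yields the horizontal rules as row dividers and the vertical rules as column dividers, all from one path element.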

Explorer, Apr 08, 2020

Correction: that should read "If each line in the table ..."

Explorer, Apr 14, 2020

For almost all the tables, I managed to find the boundary lines. However, in one of the documents I found a table whose borders I could not find. These boundary lines on the screen look much thicker than regular boundary lines. This is visible in the picture. At first I thought it was not a Path element, but experiments showed that it was a Path element (there are no other elements on the page).
I would appreciate any help.
[Attached image: scr.PNG]
