Hi community,
I have two questions about PDF to XML conversion:
1. Sometimes Adobe cannot extract inline images inside a table, such as the button icon below.
2. To extract tables from a PDF, I think Adobe currently uses the 'stream' method by default, which may not properly identify a cell's colspan and rowspan (a table cell may span multiple rows, multiple columns, or both). I found that the 'lattice' method could be a better solution for this.
Are there any solutions available for the issues above?
Thanks
No. You will need to write your own PDF parser to do that, and in doing so you'll see how incredibly complex that process is...
Great, I finally have some evidence to show my boss how complicated my task is. Thanks :-)
This is a great opportunity to sit down with your boss and discuss your salary raise.
If your boss really sees value in this project, ask the company to enroll you in JavaScript classes at their expense.
It all depends on whether your boss sees a return on investing in you or would rather outsource the work to a developer like Try67.
I'm pretty sure that if your boss is a reasonable individual, money should be the least of his/her worries.
Hi, do you know of any workarounds in JS that can solve this issue? Could you please share some details with me?
And I think my boss will definitely prefer a developer like Try67 to me, since I am always the troublemaker.
Hi,
My workarounds are the product of trial and error, since I am still learning JavaScript.
PDF to XML parsing is very far from what I can achieve at this time, but Try67 was the first developer who answered my very first post when I joined the forums, and I follow his guidance to the best of my ability.
I have purchased some of his paid apps, for example, which are convenient and extremely powerful for getting your work done at a very affordable and reasonable price.
If it helps your boss's decision-making process, sometimes outsourcing is the way to go.
A very long time ago, I worked in Internet consulting sales. You would be surprised how many businesses opt to hire a consultant or subcontractor to develop a solution.
You have to look at it from this perspective: how much money will I pay, how fast can I get this thing running, and how much money will it save me one year, three years, or five years from now?
I believe that is what they refer to as ROI (return on investment). The business profits from the investment by serving its clients better and by having money come back almost immediately, since it doesn't have to send employees to school during production downtime, for example.
The business keeps producing while somebody else (your hired developer) takes care of your issue.
Thanks, your suggestion inspired me a lot, since I have a very limited understanding of the business/commercial field.
You're welcome .. that's good!
There are no workarounds in JavaScript for extracting table data. And extracting images is outside the capabilities of the Acrobat JS model.
OH!
Sorry, I left out Thom Parker in my reply... the Jedi Master!
Of all the developers that I've followed in these forums, I would recommend these two individuals hands down.
Re your first question: What command are you using to extract those images? Could you share a sample file with us?
Thanks for your reply! For the first question, I tried to use the 'Export PDF as XML' tool. And I've contacted Adobe's customer service. They replied that they are trying to fix this issue, so let's hope for the best!
I'm not sure that the "stream" and "lattice" methods apply to how Acrobat extracts table data. Are you talking about a different tool?
I have written table parsers for PDF. The tables you've shown above are simple ones. The lines are enough to separate out the bits. Tables can get very complicated and Acrobat doesn't come anywhere close to parsing even the simple ones. You'll either need a special tool for this (and there are free ones out there) or spend a lot of time building a table parser yourself. Acrobat JavaScript could be used, but it is poorly suited for this type of task. For example, JavaScript could not extract the button icon in the first table, because content images are not available to the scripting model. A plug-in is what you need to do this in Acrobat.
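The point that "the lines are enough to separate out the bits" can be sketched in Python. This is only a minimal illustration of the lattice idea, not how Acrobat or any particular library works. It assumes the ruling lines have already been pulled out of the page as straight segments `(x0, y0, x1, y1)`; the names `cluster` and `grid_from_lines` and the tolerance `tol` are hypothetical.

```python
from typing import List, Sequence, Tuple

# A ruling line as (x0, y0, x1, y1) in page units (assumed input format).
Segment = Tuple[float, float, float, float]


def cluster(values: Sequence[float], tol: float = 1.0) -> List[float]:
    """Collapse nearly equal coordinates into one grid position each."""
    merged: List[float] = []
    for v in sorted(values):
        # Skip a value that sits within tol of the last kept position.
        if merged and abs(v - merged[-1]) <= tol:
            continue
        merged.append(v)
    return merged


def grid_from_lines(
    segments: Sequence[Segment], tol: float = 1.0
) -> Tuple[List[float], List[float]]:
    """Return (column edges, row edges) implied by the ruling lines."""
    xs: List[float] = []
    ys: List[float] = []
    for x0, y0, x1, y1 in segments:
        if abs(x0 - x1) <= tol:      # vertical line: contributes a column edge
            xs.append(x0)
        elif abs(y0 - y1) <= tol:    # horizontal line: contributes a row edge
            ys.append(y0)
        # Slanted segments are ignored in this sketch.
    return cluster(xs, tol), cluster(ys, tol)


# A 2x2 table whose middle vertical rule is drawn as two slightly offset pieces.
segments = [
    (0, 0, 100, 0), (0, 20, 100, 20), (0, 40, 100, 40),
    (0, 0, 0, 40), (50.3, 0, 50.3, 20), (50, 20, 50, 40), (100, 0, 100, 40),
]
print(grid_from_lines(segments))  # ([0, 50, 100], [0, 20, 40])
```

Real tables get harder quickly: rules may be drawn as thin filled rectangles, may be missing entirely, or may stop short where cells are merged, which is where most of the effort in a table parser goes.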
Yeah, I've tried some other tools for table extraction, such as pdfplumber, Tabula, Camelot, and pdfminer. I also tried to write a table parser in Python. But none of them generalizes well: the tables we get vary widely, and they can be complicated.
Basically, our goal is to get the 'rowspan' and 'colspan' attributes of each cell. For example, the first cell should have attributes like 'rowspan=4, colspan=1'. Could you please share some special tools (preferably free :) for solving this problem? Or can Acrobat JavaScript or a plug-in solve this to some extent? If so, can I get more details?
Thanks
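To make the goal concrete, here is a minimal Python sketch of how 'rowspan' and 'colspan' could be derived once the table grid is known. It assumes two things that are the hard part in practice: the sorted grid edges `xs` and `ys`, and a bounding box `(x0, y0, x1, y1)` for each cell. The function name `span_of` and the tolerance `tol` are hypothetical, not part of any of the tools mentioned above.

```python
from typing import Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed format


def span_of(
    cell: BBox, xs: Sequence[float], ys: Sequence[float], tol: float = 1.0
) -> Tuple[int, int]:
    """Return (rowspan, colspan) for a cell on a grid with edges xs, ys."""
    x0, y0, x1, y1 = cell
    # Count the grid intervals that lie completely inside the cell's box.
    colspan = sum(
        1 for a, b in zip(xs, xs[1:]) if a >= x0 - tol and b <= x1 + tol
    )
    rowspan = sum(
        1 for a, b in zip(ys, ys[1:]) if a >= y0 - tol and b <= y1 + tol
    )
    return rowspan, colspan


# Grid with 3 columns and 4 rows; the first cell covers all 4 rows of column 1.
xs = [0, 50, 100, 150]
ys = [0, 20, 40, 60, 80]
print(span_of((0, 0, 50, 80), xs, ys))     # (4, 1)
print(span_of((50, 0, 150, 20), xs, ys))   # (1, 2)
```

The counting itself is simple. Whether the cell boxes can be recovered reliably from a given PDF is a separate question that this sketch does not address.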
At this time you're not going to get any better than the tools you've already mentioned, especially for free. There is a company that does high-end AI-based PDF parsing, but not only are they extremely expensive, I don't think they are much better than the open source table parsers.
If you want to hire a developer, then contact me through this forum or at www.windjack.com.
This requires a plugin written in C/C++.
Could you please give me more details about which plugin I should use and how to use it?
No, you need to learn C/C++ and write a plugin. You will be the programmer. Or pay one. Start by reading the PDF reference. This is about 1000 pages of highly technical info, but it's a feasible task for the right person with many months or years to spend on it.