convert PDF to XML

Report · Mar 15, 2020

Hi community,

I have two questions about PDF to XML conversion:

1. Sometimes Adobe cannot extract images (inline images) inside a table, such as the button icon below.

2. To extract tables from a PDF, I think Adobe currently uses the 'stream' method by default, which may not identify a cell's colspan and rowspan properly (a cell of a table may span multiple rows, multiple columns, or both). I found that the 'lattice' method could be a better solution for this.

Are there any solutions available for the above issues?

Thanks

Report · Mar 16, 2020

No. You will need to write your own PDF parser to do that, and in doing so you'll see how incredibly complex that process is...

Report · Mar 16, 2020

Great, I finally have some evidence to show my boss how complicated my task is. Thanks :-)

Report · Mar 16, 2020

This is a great opportunity to sit down with your boss and discuss your salary raise.

If your boss really sees value in this project, ask the company to enroll you in JavaScript classes at their expense.

It all depends on whether your boss sees a return on investing in you or would rather outsource to a developer like Try67.

I'm pretty sure that if your boss is a reasonable individual money should be the least of his/her worries.

Report · Mar 26, 2020

Hi, do you know of any workarounds in JS that can solve this issue? Could you please share some details with me?

And I think my boss will definitely prefer a developer like Try67 over me, since I am always the troublemaker.

Report · Mar 26, 2020

Hi,

My workarounds are the product of trial and error, since I am still learning JavaScript.

PDF to XML parsing is very far from what I can achieve at this time, but Try67 was the first developer who answered my very first post when I joined the forums, and I follow his guidance to the best of my ability.

I have purchased some of his paid-for apps, for example, which are convenient and extremely powerful to get your stuff done for a very affordable and reasonable price.

If it helps your boss's decision-making process, sometimes outsourcing is the way to go.

A very long time ago, I worked in Internet consulting sales. You would be surprised how many businesses opt to hire a consultant or subcontractor to develop a solution.

You have to look at it from this perspective: how much do I pay, how fast can I get this thing running, and how much money is it going to save me one year, three years, and five years from now?

I believe that is what they refer to as ROI (return on investment). The business profits from the investment by serving its clients better and having money come back almost immediately, since you don't have to send employees to school during production downtime, for example.

The business keeps producing while somebody else (your hired developer) takes care of your issue.

Report · Mar 29, 2020

Thanks, your suggestion inspired me a lot, since I have a very limited understanding of the business/commercial field.

Report · Mar 29, 2020

You're welcome .. that's good!

Report · Mar 27, 2020

There are no workarounds in JavaScript for extracting table data. And extracting images is outside the capabilities of the Acrobat JS model.

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often

Report · Mar 27, 2020

OH!

Sorry, I left out Thom Parker in my reply... the Jedi Master!

Of all the developers I've followed in these forums, I would definitely recommend these two individuals, hands down.

Report · Mar 17, 2020

Re your first question: What command are you using to extract those images? Could you share a sample file with us?

Report · Mar 26, 2020

Thanks for your reply! For the first question, I tried to use the 'Export PDF as XML' tool. And I've contacted Adobe's customer service. They replied that they are trying to fix this issue, so let's hope for the best!

Report · Mar 17, 2020

I'm not sure that "stream" and "lattice" methods apply to how Acrobat extracts table data. You're talking about a different tool?

I have written table parsers for PDF. The tables you've shown above are simple ones. The lines are enough to separate out the bits. Tables can get very complicated and Acrobat doesn't come anywhere close to parsing even the simple ones. You'll either need a special tool for this (and there are free ones out there) or spend a lot of time building a table parser yourself. Acrobat JavaScript could be used, but it is poorly suited for this type of task. For example, JavaScript could not extract the button icon in the first table, because content images are not available to the scripting model. A plug-in is what you need to do this in Acrobat.
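To illustrate the point that the ruling lines are enough to separate the cells, here is a minimal Python sketch (not Acrobat JavaScript, and not how Acrobat works internally). It assumes the line segments and the grid coordinates have already been pulled off the page by some other tool; the function names and tuple layouts are made up for this example. Two neighbouring grid slots are treated as one merged cell when no drawn line covers the border between them.

```python
def has_vertical_rule(v_lines, x, y0, y1, tol=1.0):
    """True if one vertical segment (x, y_start, y_end) at x covers [y0, y1]."""
    return any(abs(lx - x) <= tol and ly0 <= y0 + tol and ly1 >= y1 - tol
               for lx, ly0, ly1 in v_lines)


def has_horizontal_rule(h_lines, y, x0, x1, tol=1.0):
    """True if one horizontal segment (y, x_start, x_end) at y covers [x0, x1]."""
    return any(abs(ly - y) <= tol and lx0 <= x0 + tol and lx1 >= x1 - tol
               for ly, lx0, lx1 in h_lines)


def merged_cells(xs, ys, h_lines, v_lines):
    """Group grid slots (row, col) that are not separated by a drawn line."""
    n_rows, n_cols = len(ys) - 1, len(xs) - 1
    parent = {(r, c): (r, c) for r in range(n_rows) for c in range(n_cols)}

    def find(slot):
        # Union-find lookup with path halving.
        while parent[slot] != slot:
            parent[slot] = parent[parent[slot]]
            slot = parent[slot]
        return slot

    for r in range(n_rows):
        for c in range(n_cols):
            # No rule on the right border: join with the slot to the right.
            if c + 1 < n_cols and not has_vertical_rule(
                    v_lines, xs[c + 1], ys[r], ys[r + 1]):
                parent[find((r, c + 1))] = find((r, c))
            # No rule on the bottom border: join with the slot below.
            if r + 1 < n_rows and not has_horizontal_rule(
                    h_lines, ys[r + 1], xs[c], xs[c + 1]):
                parent[find((r + 1, c))] = find((r, c))

    groups = {}
    for slot in parent:
        groups.setdefault(find(slot), set()).add(slot)
    return sorted(groups.values(), key=min)


# Hypothetical 2x2 grid whose left column is one cell spanning both rows:
# the middle horizontal rule only runs from x=50 to x=100.
xs, ys = [0, 50, 100], [0, 20, 40]
h_lines = [(0, 0, 100), (20, 50, 100), (40, 0, 100)]
v_lines = [(0, 0, 40), (50, 0, 40), (100, 0, 40)]
groups = merged_cells(xs, ys, h_lines, v_lines)
```

A real page needs more care than this: rules are often drawn as thin rectangles or as several short segments, which this sketch does not stitch together.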

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often

Report · Mar 26, 2020

Yeah, I've tried some other tools for table extraction, like pdfplumber, tabula, camelot, pdfminer, etc. I also tried to write a table parser in Python. But they all failed on generality (the tables we get vary a lot, and they can be complicated).

Basically, our goal is to get the 'rowspan' and 'colspan' attributes of each cell. For example, the first cell should have attributes such as 'rowspan=4, colspan=1'. Could you please share some special tools (preferably free :) for solving this problem? Or can Acrobat JavaScript or a plug-in solve this to some extent? If so, can I get more details?
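As a rough illustration of that goal, here is a minimal Python sketch. It assumes some lattice-style detector has already produced one bounding box per cell as (x0, top, x1, bottom); the function names and the sample coordinates are hypothetical. The spans come from counting how many grid lines each box crosses.

```python
def cluster(values, tol=1.0):
    """Collapse nearly equal coordinates into a sorted list of grid lines."""
    grid = []
    for v in sorted(values):
        if not grid or v - grid[-1] > tol:
            grid.append(v)
    return grid


def nearest_index(grid, v):
    """Index of the grid line closest to coordinate v."""
    return min(range(len(grid)), key=lambda i: abs(grid[i] - v))


def cell_spans(cells, tol=1.0):
    """Return row, col, rowspan and colspan for each (x0, top, x1, bottom) box."""
    xs = cluster([x for c in cells for x in (c[0], c[2])], tol)
    ys = cluster([y for c in cells for y in (c[1], c[3])], tol)
    spans = []
    for x0, top, x1, bottom in cells:
        col = nearest_index(xs, x0)
        row = nearest_index(ys, top)
        spans.append({
            "row": row,
            "col": col,
            "rowspan": nearest_index(ys, bottom) - row,
            "colspan": nearest_index(xs, x1) - col,
        })
    return spans


# Hypothetical table: the first cell covers four rows of the first column,
# followed by two ordinary cells in each of the four rows.
cells = [(0, 0, 50, 80)] + [
    (x, y, x + 50, y + 20) for y in (0, 20, 40, 60) for x in (50, 100)
]
spans = cell_spans(cells)
```

The hard part in practice is getting reliable cell boxes in the first place; once they exist, the span attributes are simple arithmetic like this.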

Thanks

Report · Mar 27, 2020

At this time you're not going to get any better than the tools you've already mentioned, especially for free. There is a company that does high-end, AI-based PDF parsing, but not only are they extremely expensive, I don't think they are much better than the open source table parsers.

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often

Report · Mar 27, 2020

If you want to hire a developer, then contact me through this forum or at www.windjack.com.

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often

Report · Mar 26, 2020

This requires a plugin written in C/C++.

Report · Mar 29, 2020

Could you please give me more details about which plugin I should use and how to use it?

Report · Mar 29, 2020

No, you need to learn C/C++ and write a plugin yourself. You will be the programmer. Or pay one. Start by reading the PDF reference. This is about 1000 pages of highly technical info, but it's a possible task for the right person with many months or years to devote to it.