I am trying to find a way to convert PDF to DOCX that produces a proper table structure (the equivalent of <table>, <tr>, <td>, etc.).
The tricky part is that the conversion has to run without user interaction: I pass the PDF's location, or the document itself, via an API, and I get a DOCX back as output.
Ideally, it should be able to run directly on the server.
I think the tricky part is wanting table structure. Have you found any converter, app or service anywhere that is able to do this for you? Are the PDFs guaranteed to be correctly tagged (for accessibility)?
Part of the challenge with tables in PDFs in general is that PDF quality varies depending on how they're created, and that affects downstream conversions back into Word documents or other formats.
If you use the PDF to Word conversion API and the results aren't what you want, try running the accessibility checker in Adobe Acrobat on the PDFs to see how well they are tagged. That can indicate where some of the problems come from.
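For an unattended server, a rough first pass at "is this PDF tagged at all?" can be done without Acrobat. The sketch below is only a heuristic and not a substitute for the accessibility checker: it scans the raw bytes for the `/StructTreeRoot` and `/MarkInfo` keys that tagged PDFs carry in their document catalog. The function name `tagging_hints` is made up for illustration, and the approach assumes the catalog is stored uncompressed. If it sits inside a compressed object stream, both flags can come back `False` for a PDF that is in fact tagged, so treat a negative result as inconclusive.

```python
from pathlib import Path


def tagging_hints(pdf_bytes: bytes) -> dict:
    """Crude byte-level check for tagged-PDF markers (hypothetical helper).

    Only detects the keys when they appear uncompressed in the file;
    it does not parse the PDF or judge the quality of the tags.
    """
    return {
        # Root of the logical structure tree (where table tags would live).
        "struct_tree_root": b"/StructTreeRoot" in pdf_bytes,
        # Dictionary that normally carries the /Marked flag.
        "mark_info": b"/MarkInfo" in pdf_bytes,
    }


def tagging_hints_for_file(path: str) -> dict:
    """Convenience wrapper that reads the PDF from disk."""
    return tagging_hints(Path(path).read_bytes())
```

Something like `tagging_hints_for_file("input.pdf")` could be run before conversion to flag PDFs that are likely to give poor table results.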
Another option is the forthcoming PDF Extract API, coming later this month, which uses some of that accessibility technology and AI to interpret your PDF and return JSON data. As part of this, it can also extract the tables in the document as CSV pretty reliably.
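If the tables do come back as CSV, one possible way to get real table structure into a DOCX is to build the table markup yourself. The following is a minimal sketch, not part of any Adobe API: it turns CSV text into a WordprocessingML `<w:tbl>` fragment (the DOCX counterpart of `<table>`/`<tr>`/`<td>`) using only the Python standard library. The function name `csv_to_word_table_xml` is hypothetical, the fragment still has to be placed inside `word/document.xml` of a DOCX package with the `w:` namespace declared, and it has no borders or column widths.

```python
import csv
import io
from xml.sax.saxutils import escape


def csv_to_word_table_xml(csv_text: str) -> str:
    """Convert CSV text into a bare <w:tbl> fragment (hypothetical helper)."""
    # Skip completely blank lines, which csv.reader yields as empty lists.
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        return ""

    # Pad ragged rows so every row has the same number of cells.
    n_cols = max(len(row) for row in rows)
    grid = "<w:gridCol/>" * n_cols
    parts = [f"<w:tbl><w:tblPr/><w:tblGrid>{grid}</w:tblGrid>"]

    for row in rows:
        padded = row + [""] * (n_cols - len(row))
        cells = "".join(
            # Each cell holds one paragraph with one text run;
            # escape() protects &, < and > in the cell text.
            '<w:tc><w:p><w:r><w:t xml:space="preserve">'
            f"{escape(cell)}"
            "</w:t></w:r></w:p></w:tc>"
            for cell in padded
        )
        parts.append(f"<w:tr>{cells}</w:tr>")

    parts.append("</w:tbl>")
    return "".join(parts)
```

For example, `csv_to_word_table_xml("Name,Qty\nWidget,3\n")` yields a table with two rows of two cells each. A library such as python-docx would do the same job with less manual XML if installing a dependency on the server is acceptable.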
Hope this helps.