Hello, we use the Adobe PDF Services API heavily to convert PDF to DOCX and are looking to expand our usage. However, we have an issue with the XML in the converted DOCX files, and would like some advice on the logic you use to generate the XML paragraphs and runs.
In some cases, what looks like multiple paragraphs in the PDF is merged into a single paragraph object <w:p> in the XML. From my investigation, I cannot find anything inside that paragraph object that tells Word to display it as two or more paragraphs. For example, in the PDF it looks like two individual paragraphs:
This is the first paragraph.
This is the second paragraph.
This would normally show up as two individual paragraph objects <w:p> in the XML (document.xml), but it does not. Everything is in one paragraph object. It looks something like this:
<w:p>
<w:pPr>
...
<w:rPr> ... </w:rPr>
</w:pPr>
<w:r>
<w:rPr> ... </w:rPr>
<w:t>This is the first paragraph.</w:t>
</w:r>
<w:r>
<w:rPr> ... </w:rPr>
<w:t> </w:t>
</w:r>
<w:r>
<w:rPr> ... </w:rPr>
<w:t>This is the second paragraph.</w:t>
</w:r>
</w:p>
The only thing that "divides" the two paragraphs is a run whose text is a single space. That structure would normally result in the two paragraphs being displayed on the same line:
This is the first paragraph. This is the second paragraph.
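To make the structure above easier to inspect across many files, here is a minimal diagnostic sketch in Python. It is not Adobe's logic and not our production code: the function name `describe_runs` and the `SAMPLE` string are hypothetical, and the sample simply mirrors the simplified XML above with the elided `<w:pPr>` and `<w:rPr>` parts left out. It lists every run per paragraph and flags runs whose text is only whitespace, plus any run that contains a `<w:br>` element, since a break would be one ordinary way to get a visual line split inside a single `<w:p>`.

```python
import xml.etree.ElementTree as ET

# Standard WordprocessingML namespace used by document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"


def describe_runs(document_xml: str):
    """Return one tuple per run:
    (paragraph index, run index, text, whitespace_only, has_break).

    Only runs that are direct children of <w:p> are inspected, so runs
    nested in hyperlinks or other wrappers are skipped in this sketch.
    """
    root = ET.fromstring(document_xml)
    report = []
    for p_idx, para in enumerate(root.iter(f"{W}p")):
        for r_idx, run in enumerate(para.findall(f"{W}r")):
            # A run may hold several <w:t> elements; join their text.
            text = "".join(t.text or "" for t in run.findall(f"{W}t"))
            whitespace_only = text != "" and text.strip() == ""
            has_break = run.find(f"{W}br") is not None
            report.append((p_idx, r_idx, text, whitespace_only, has_break))
    return report


# Hypothetical sample mirroring the simplified structure in this post.
SAMPLE = """<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r><w:t>This is the first paragraph.</w:t></w:r>
      <w:r><w:t xml:space="preserve"> </w:t></w:r>
      <w:r><w:t>This is the second paragraph.</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""

if __name__ == "__main__":
    for entry in describe_runs(SAMPLE):
        print(entry)
```

On the sample, the middle run is flagged as whitespace-only and no run carries a break, which matches what we see: nothing in the run text itself explains the visual split.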
We post-process the XML data and have done so for years using XML generated by Microsoft Word. However, with PDF Services we consistently get this variation of the format, and we want to understand why and how it is produced so we can make adjustments on our side.