Known Participant
June 3, 2016
Question

Export OCR in PDF to XML


Is it possible to export the OCR in a PDF into an XML file?

I've tried using File - Save As - XML (with various settings), but that doesn't save the OCR'd text.

With Content Editing - Export, there is no XML option.

OCR seems to be output with every other file type, except for XML.

I've also tried saving as a Word document and then saving the resulting Word document as XML. That doesn't work either, as it seems to turn the page images into XML. I just want the OCR'd text.

I'm using Adobe Acrobat XI Pro.


3 replies

Known Participant
June 17, 2016

Still not having any luck exporting the OCR into any kind of file at all: XML, Word, DOC, RTF, etc.

Is it even possible?

Legend
June 17, 2016

Depending on the settings, in many PDF files you have the original scanned document (picture only) combined with invisible text (for searching and copy/paste). Anything that exports XML is quite likely to ignore invisible stuff. Instead of XML export see if there's a way to extract text that works for you. Simplest is save as TXT.
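If a text export (or a select-all copy and paste) does give you the OCR'd text, turning it into XML is a small scripting job. Here is a minimal Python sketch using only the standard library. It is not an Acrobat feature; the function name `pages_to_xml`, the `document`/`page`/`line` element names, and the assumption that pages are separated by a form feed character are all hypothetical choices for illustration, so adjust them to whatever your text export actually produces.

```python
import xml.etree.ElementTree as ET


def pages_to_xml(text: str, page_sep: str = "\f") -> str:
    """Wrap extracted plain text in a simple XML structure.

    Assumption: pages in `text` are separated by `page_sep`
    (a form feed by default). The element names are made up
    for this example, not any Adobe schema.
    """
    root = ET.Element("document")
    for number, page_text in enumerate(text.split(page_sep), start=1):
        page = ET.SubElement(root, "page", number=str(number))
        for line in page_text.splitlines():
            line = line.strip()
            if line:  # skip blank lines
                ET.SubElement(page, "line").text = line
    # encoding="unicode" returns a str rather than bytes
    return ET.tostring(root, encoding="unicode")


xml_out = pages_to_xml("Invoice 76014\nTotal: 12\fPage two")
print(xml_out)
```

ElementTree escapes characters such as `&` and `<` for you, which is the main reason to build the XML this way rather than by pasting strings together.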

Known Participant
June 19, 2016

Test Screen Name, thanks for your reply. Saving as TXT doesn't yield any content at all.

I have found a solution, though - use Abbyy FineReader instead of Adobe Acrobat Pro to export.

It looks like OCR'd text in Adobe PDFs can only be exported by using whatever software generated the OCR in the first place.

Participant
June 3, 2016

@Bernd Alheit

I am having a similar issue, except the OCR was done in Adobe DC. Essentially, I have been correcting the OCR layer using the "correct recognized text" option. Because I am using a scan of an old document, the text 76014 might have been OCR'd as "7B014." The scanned document is 1,000 pages, so I have made numerous corrections like this. However, when I export the PDF, those changes to the OCR are not exported. Instead, the export still shows 7B014.

If I select all in Adobe, I can copy and paste the corrected OCR. But is there a way to export the corrected OCR to xml?

CtDave
Participating Frequently
June 4, 2016

Try an alternative. Export an "uncorrected" PDF to a text editor (Word, whatever).

Use the text editor to do corrections.

Export the text editor's file to xml.
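For a 1,000-page document, the correction step can also be scripted once the text is out of the PDF. A rough Python sketch, assuming you keep your known misreads in a lookup table (the function name `apply_corrections` and the table contents are hypothetical, apart from the 7B014 example mentioned above):

```python
import re


def apply_corrections(text: str, corrections: dict[str, str]) -> str:
    """Replace known OCR misreads with their corrected values.

    Only whole tokens are replaced (word boundaries on both sides),
    so "7B014" is fixed but a longer token such as "7B0140" is left alone.
    """
    if not corrections:
        return text
    # Longest keys first so a longer misread wins over a shorter one.
    keys = sorted(corrections, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: corrections[m.group(1)], text)


fixed = apply_corrections("Part 7B014 and 7B0140", {"7B014": "76014"})
print(fixed)
```

The corrected text can then be saved or converted to XML with whatever tool you prefer.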

Be well...

Bernd Alheit
Community Expert
June 3, 2016

What options do you use for OCR?

Known Participant
June 3, 2016

Hi Bernd, the PDFs in question were scanned and OCR'd by a third party. I don't know what options they used. Are the options used for OCRing likely to make a difference for XML output?

Bernd Alheit
Community Expert
June 3, 2016

What do you get when you save as Word?