Export OCR in PDF to XML

Report · Jun 02, 2016

Is it possible to export the OCR in a PDF into an XML file?

I've tried using File - Save As - XML (with various settings), but that doesn't save the OCR'd text.

With Content Editing - Export, there is no XML option.

OCR seems to be output with every other file type, except for XML.

I've also tried saving as a Word document and then saving the resulting Word document as XML. That doesn't work either, as it seems to turn images into XML. I just want the OCR'd text.

I'm using Adobe Acrobat XI Pro.

Report · Jun 03, 2016

What options do you use for OCR?

Report · Jun 03, 2016

Hi Bernd, the PDFs in question are scanned and OCR'd by a third party. I don't know what options they used. Are the options used for OCRing likely to make a difference for XML output?

Report · Jun 03, 2016

What did you get when you save as Word?

Report · Jun 03, 2016

I get the same error in Word or in Excel. I think the issue is that when I export, if I uncheck "recognize text if needed," the PDF is exported as images without any OCR. If I do check that box, the changes I made to the OCR in Adobe are lost because the OCR is redone, which brings back the original errors.

I am new to Adobe so I may have made an error. I had a scanned document, used enhance, and then correct recognized text. In Adobe, the OCR is now correct. Any suggestions?

Report · Jun 03, 2016

@Bernd Alheit

I am having a similar issue, except OCR was done in Adobe DC. Essentially, I have been correcting the layer of OCR using the "correct recognized text" option. Because I am using a scan of an old document, the text of 76014 might have been OCR'd as "7B014." The scanned document is 1,000 pages so I have made numerous corrections like this. However, when I export the pdf, those changes to the OCR are not exported. Instead, the export would still show 7B014.

If I select all in Adobe, I can copy and paste the corrected OCR. But is there a way to export the corrected OCR to XML?
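Since select-all and copy does carry the corrected text, one possible stopgap is to paste it into a plain-text file and wrap it in XML outside Acrobat. The following is only a minimal sketch using Python's standard library; the element names `document` and `paragraph` and the function name `text_to_xml` are made up for illustration and are not any Adobe schema.

```python
import xml.etree.ElementTree as ET


def text_to_xml(text: str) -> str:
    """Wrap pasted plain text in a simple XML document, one element per paragraph."""
    root = ET.Element("document")  # hypothetical element name
    # Treat a blank line as a paragraph break.
    blocks = [b.strip() for b in text.replace("\r\n", "\n").split("\n\n")]
    for number, block in enumerate((b for b in blocks if b), start=1):
        para = ET.SubElement(root, "paragraph", n=str(number))
        # Rejoin lines that were wrapped inside the paragraph.
        para.text = " ".join(line.strip() for line in block.splitlines())
    # Special characters such as & and < are escaped automatically.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(text_to_xml("Part 76014\nin stock\n\nA & B"))
```

This keeps only the text and its paragraph order; page breaks and layout are not recoverable from pasted text.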

Report · Jun 03, 2016

Try an alternative. Export an "uncorrected" PDF to a text editor (Word, whatever).

Use the text editor to do corrections.

Export the text editor's file to xml.

Be well...
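If the same misreads recur across a long scan, the correction step in this workaround could be scripted rather than done by hand. Below is a rough sketch, assuming you keep your own table of misread-to-correct values; the function name `apply_corrections` and the sample table are hypothetical.

```python
import re
from typing import Mapping


def apply_corrections(text: str, corrections: Mapping[str, str]) -> str:
    """Replace each known misread with its corrected form, whole tokens only."""
    if not corrections:
        return text
    # Longest keys first so a longer misread wins over a shorter one.
    keys = sorted(corrections, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    return pattern.sub(lambda m: corrections[m.group(0)], text)


if __name__ == "__main__":
    # Hypothetical correction table built while proofreading.
    table = {"7B014": "76014"}
    print(apply_corrections("Item 7B014 shipped", table))
```

Matching whole tokens only is a deliberate choice: it avoids rewriting a longer value that merely contains a misread as a substring.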

Report · Jun 16, 2016

Still not having any luck at all in exporting OCR into any kind of file: XML, Word, DOC, RTF, etc.

Is it even possible?

Report · Jun 17, 2016

Depending on the settings, in many PDF files you have the original scanned document (picture only) combined with invisible text (for searching and copy/paste). Anything that exports XML is quite likely to ignore invisible stuff. Instead of XML export see if there's a way to extract text that works for you. Simplest is save as TXT.
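For background, invisible text in a PDF is normally drawn with text rendering mode 3 (the `3 Tr` operator in a page's content stream). The snippet below is only a rough check, assuming you already have a page's content stream decompressed into a string by some other tool; the function name is made up, and a plain pattern match like this can be fooled by `3 Tr` appearing inside a string literal.

```python
import re

# Operand 3 followed by the Tr (text rendering mode) operator.
INVISIBLE_TEXT = re.compile(r"(?<![\d.])3\s+Tr\b")


def has_invisible_text(content_stream: str) -> bool:
    """Return True if the stream appears to switch to invisible text rendering."""
    return INVISIBLE_TEXT.search(content_stream) is not None


if __name__ == "__main__":
    sample = "BT 3 Tr /F1 12 Tf (76014) Tj ET"  # hand-written sample stream
    print(has_invisible_text(sample))
```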

Report · Jun 19, 2016

Test Screen Name, thanks for your reply. Saving as TXT doesn't yield any content at all.

I have found a solution, though - use Abbyy FineReader instead of Adobe Acrobat Pro to export.

It looks like OCR'd text in Adobe PDFs can only be exported by using whatever software generated the OCR in the first place.

Report · Jun 19, 2016

Can you elaborate on the solution you found? Is it as simple as opening the pdf in Abbyy FineReader and choosing export?

(I do not currently have Abbyy FineReader so I can't see for myself. I am deciding whether to purchase it for this specific reason as I have cleaned up hundreds of pages using the invisible OCR in Adobe, which I am currently unable to export cleanly).

Report · Jun 19, 2016

Hi Alex, looks like I've made an erroneous assumption.

I thought AbbyyFR was using the existing OCR layer in OCR'd PDFs; but turns out it was scanning them anew. So my assumption that Abbyy was reading the existing OCR is not correct. Drat, it made sense at the time.

Now to try to figure out how to uncorrect that "Correct Answer".

Report · Oct 17, 2016

Hi Bernadette/Alex,

We apologize for the delay in response and the inconvenience thus caused to you.

Please try the following steps:

1. Open the PDF file

2. Go to "Tools" -> "Enhance Scans"

3. Select "Recognize Text" -> "In this File" -> "Settings"

4. Select "Editable Text and Images" from the "Output" dropdown and Click "OK"

5. Click on the "Recognize Text" button

6. Select "Recognize Text" from the menu again -> "Correct Recognized text"

7. Make the corrections and save the PDF

8. Now, try exporting the saved file to any format you want (The corrected OCR'ed text should be exported)

Please let us know if this helps.

Thanks and Regards,

Girija

Report · May 17, 2018

Hey Bernadette, Alex, Girija,

I've attempted three different methods to export corrected OCR'ed text, with three different and ultimately unsatisfactory results.

Method 1:

1. Open the PDF file
2. Go to "Tools" -> "Enhance Scans"
3. Select "Recognize Text" -> "In this File" -> "Settings"
4. Select "Searchable Image" from the "Output" dropdown and Click "OK"
5. Click on the "Recognize Text" button
6. Select "Recognize Text" from the menu again -> "Correct Recognized text"
7. Make the corrections and save the PDF
8. Export the saved file to .doc and .txt

Result: I got the same results as alexw71856384. The exported text is uncorrected. I would guess that Test Screen Name is correct. The export is ignoring the invisible layer (that contains the corrections), and just re-OCRing the entire document.

Method 2 (Based on girijaAgarwal suggestion):

I started with my corrected OCR text from Method 1 (steps 1-7)
Select "Editable Text and Images" from the "Output" dropdown and Click "OK"
Click on the "Recognize Text" button.
This successfully converted my corrected OCR text from a Searchable Image to Editable Text and Images (see: Better PDF OCR. ClearScan is smaller, looks better)
Export the saved file to .doc and .txt

Result: girijaAgarwal, this was by far the worst option. I got an unusable mess: invisible characters/words, out of order etc.

Method 3:

I started with my corrected OCR text from Method 1 (steps 1-7)
I used the "Preflight" -> "Make OCR text visible" (detailed instructions: Hidden Gems in Acrobat DC: How to Optimize Hidden OCR Text | Adobe Blog)
Open the Layer panel on the left to reveal the new layers.
Toggle the 'Invisible text' layer to on, and the 'Visible page content' layer to off.
Change layer settings so that the 'Invisible text' layer always exports, and the 'Visible page content' layer never exports
Export the saved file to .doc (without images) and .txt

Result: The good news is that this method exported the OCR'ed text with corrections. Unfortunately, it introduced new errors into the exported text, mainly missing spaces and extra spaces that weren't in Method 1's output or in the corrected OCR text in the PDF document.
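The extra spaces, at least, might be cleaned up after export with a small script run on the .txt file. This is only a sketch, with a made-up function name, that collapses runs of spaces and removes a space left before punctuation. It cannot restore missing spaces or rejoin a word that was split in the middle, so those would still need manual review.

```python
import re


def tidy_spacing(text: str) -> str:
    """Collapse repeated spaces and drop spaces left before punctuation."""
    cleaned = []
    for line in text.splitlines():
        line = re.sub(r"[ \t]+", " ", line)         # collapse runs of spaces/tabs
        line = re.sub(r" ([,.;:!?])", r"\1", line)  # no space before punctuation
        cleaned.append(line.strip())                # trim leading/trailing spaces
    return "\n".join(cleaned)


if __name__ == "__main__":
    print(tidy_spacing("Part  76014 ,  in   stock ."))
```

The punctuation rule is a guess at what the export gets wrong; drop it if your document contains values like " .5" where the space is meaningful.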