Export/OCR from an Always Partially Upside-Down Scan

Report · Sep 19, 2016

I am running Adobe Acrobat Pro XI on a Windows 8.1 PC.

I have a PDF from a scan of an original with an even number of columns, with a header centered across each pair.

The catch is that adjacent columns are upside-down with respect to each other (and each column has different content). This is intentional in the source.

So my scan looks something like this:

------ ------

▲ ▼ ▲ ▼

I ultimately want text from this. If I, say, export to Word, the first and third columns come through OK, but the OCR fails badly on the second and fourth columns. If you think about the arrangement of the icons above ... my scan is always partially upside down.

Obviously, I would not have a problem if my source had been arranged this way:

------ ------

▲ ▲ ▲ ▲

FWIW: So that I would have something reasonably human readable, I rotated each page 180 degrees when doing the scanning. So I have a duplicate of the above in this form:

------ ------

▼ ▲ ▼ ▲

Thus a normal human can read columns 1 and 3 from page 1, then columns 1 and 3 from page 2 ... and get the whole thing.

Suggestions anyone?

Report · Sep 19, 2016

Is there a reason for this up and down?

First off, I'd scan the page(s) as TIFF images.

Then I'd open the scanned images in Photoshop, marquee the text facing the wrong direction, press Command-J (Ctrl+J on Windows) to copy that selected region to a new layer, then Command-T (Ctrl+T) to Transform and rotate that layer 180 degrees. Press Return to commit the transform and, when done, re-save. Be sure to flatten the document before saving (and keep a copy of the original in case there is an issue that you have to go back and deal with).
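For anyone without Photoshop, the select-and-rotate step above can be sketched in a few lines of Python. This is only an illustration of the idea, under the assumption that the page is held as a plain grid of pixel values (a list of rows). The function name `rotate_region_180` and the box parameters are made up for this example; with a real imaging library you would crop the region, rotate it, and paste it back instead.

```python
def rotate_region_180(pixels, left, top, right, bottom):
    """Return a copy of `pixels` with the box [left, right) x [top, bottom)
    rotated 180 degrees in place; everything outside the box is untouched."""
    result = [row[:] for row in pixels]
    for y in range(top, bottom):
        for x in range(left, right):
            # The pixel at (x, y) moves to the diagonally opposite spot in the box.
            result[top + bottom - 1 - y][left + right - 1 - x] = pixels[y][x]
    return result


# A tiny 4x2 "page": letters stand in for pixels.
page = [
    list("abcd"),
    list("efgh"),
]

# Rotate only the right half (x = 2..3), leaving the left half as it is.
fixed = rotate_region_180(page, left=2, top=0, right=4, bottom=2)
print(["".join(row) for row in fixed])  # prints ['abhg', 'efdc']
```

Rotating the same box a second time restores the original, which is a quick sanity check that nothing outside the selection was disturbed.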

Now you can drag that new image onto the Acrobat icon and it will be converted to a PDF (if you have auto OCR set, you're done; otherwise you'll need to run the OCR process yourself).

Let me know if this works for you. (I'm assuming that you have Photoshop; sorry if you do not. If you do not, I do not know what other options you may have.)

Report · Sep 20, 2016

Thanks but this isn't workable.

I didn't actually create the PDF, and I'm not sure I can get hold of the original to rescan into a TIFF.

I don't own Photoshop and have rarely used it ... but I can probably find a copy around the office, and the rotating layer part of your suggestion seems straightforward.

I suppose there is a reason for the up and down in the original, but I don't know what it is. If I owned the original, I would have just cut it in half and rearranged the strips.

So I'm still at square one. I have only a PDF. It has text I'd like to OCR rather than retype, and part of the file is always upside down.

Report · Sep 20, 2016

Hmmm, just thought of a more practical approach: Take this PDF and print it. Cut it into pieces and tape them to a piece of paper so the upside down parts are right-side up.

Now scan it again and then process via Acrobat.

Personally I'd go with the PS approach, but I use PS daily so what I proposed is second nature to me (and I do not have to go looking for a copy of PS around the office).

I would suggest you scan at a high dpi to maintain the quality of the original. When you tape the thing together, try NOT to have the tape overlap any text. That could create a false line that could throw the OCR off.

How would this work for you?

Report · Sep 20, 2016

As Homer Simpson would say, "Doh!".

I was kinda' hoping for a flashy software solution ... but this might actually work. I will try this on Wednesday.

FWIW: it crossed my mind when I borrowed the original source to cut it or fold it. But I felt that was wrong. It seems I may have given myself a mental block about going down this avenue with a hard copy.

Report · Sep 21, 2016

Yes, gary_sc's suggestion works.

It is a solution, but not a very good one. I am still open to other suggestions.

The reason is that at the next stage, proper export from the OCR apparently relies fairly heavily on following rows of text across the page. It's an open question to me as to whether it performs better in the original (with half the text in excellent shape, and half upside down but repeated right-side up in excellent shape on the next page), or in the cut into strips and rearranged version (where everything is right-side up, but the rows don't line up perfectly).

Performance of this method is better when I take each strip and make a new copy by hand on its own page. I will end up with about 200 strips on separate pages, so it's not an impossible task. But it won't be fun.

Report · Oct 14, 2016

We can help you get TIFF images from the PDF you have. Open the file, go to Tools > Export PDF > Images > TIFF, and click Export.

Also, could you please share a sample document? With it we can look for a better solution that recognizes both the upright and the upside-down text.

You can use https://cloud.acrobat.com/send to share the file

a. Open this link and click on “Select files to Send”

b. Select the file to share

c. Click on Create link and Share this link

Thanks.

Report · Oct 14, 2016

I do not need a TIFF, unless it serves the purpose of ultimately getting the data into TXT.

Report · Oct 14, 2016

You are working with a very challenging format in your source document, but it should be possible to get this to OCR correctly. It will not be simple, though, and will require a few more steps than a standard document.

Take a look at this "page splitting tutorial" I wrote a while ago:

Splitting PDF Pages - KHKonsulting LLC

This was done for a completely different workflow, but you should be able to use the same approach to separate your columns, and then rotate the pages that contain the upside-down columns. You then run OCR on the final document, which contains one column per page with all columns correctly rotated.

If your columns are always in the same location, you can automate the splitting; otherwise you may have to manually crop the pages to show only one column (which, as you indicated, would not be much fun).
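To make the "same location" case concrete, here is a rough sketch of the planning step only: working out a crop box and a rotation for each column of a page. It assumes equal-width columns, PDF-style coordinates in points with the origin at the bottom-left, an optional header strip at the top that is cut off, and that columns 2 and 4 are the upside-down ones. The names `ColumnCrop` and `plan_column_crops` are hypothetical, and applying the boxes (in Acrobat or with a PDF library) is a separate step not shown here.

```python
from typing import List, NamedTuple


class ColumnCrop(NamedTuple):
    left: float
    bottom: float
    right: float
    top: float
    rotation: int  # degrees to rotate the cropped column before OCR


def plan_column_crops(page_width: float, page_height: float,
                      columns: int = 4,
                      header_height: float = 0.0) -> List[ColumnCrop]:
    """Return one crop box per column, left to right."""
    if columns <= 0:
        raise ValueError("columns must be positive")
    column_width = page_width / columns
    crops = []
    for index in range(columns):
        # Assumption: columns 1, 3, ... are upright; columns 2, 4, ... are upside down.
        rotation = 0 if index % 2 == 0 else 180
        crops.append(ColumnCrop(
            left=index * column_width,
            bottom=0.0,
            right=(index + 1) * column_width,
            top=page_height - header_height,  # drop the header strip, if any
            rotation=rotation,
        ))
    return crops


# Example: a US Letter page (612 x 792 points) with four columns.
for crop in plan_column_crops(612, 792):
    print(crop)
```

If the columns drift a little from page to page, widening each box by a small margin is probably safer than trying to hit the gutters exactly.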

Report · Oct 14, 2016

Ah ... now this makes more sense.

I have uploaded 1 file in 2 versions: there is an original of the PDF, and a zip containing a folder of TIFFs exported from Acrobat XI Pro from that PDF. (I must say I was a bit surprised at how Acrobat broke the PDF down into those TIFFs ... I can see potential with this route already).

The original PDF shows the columns I am dealing with. More or less in the same locations throughout, although some are shorter or empty on certain pages.

Here is the link: https://files.acrobat.com/a/preview/17d7b398-80a8-4d83-af3e-f3ed5df70763

FWIW: I am working on (yet another) coding of an old card-based board game. Ultimately I need to get all the data in those cards into TXT in my code somewhere. Borrowing the cards to do the scan took some doing, and probably can't be repeated without shelling out $50 or so. If your method works, I have 6 other files like this PDF. Not impossible to type by hand, but clearly something to be avoided.