Removing Black Scan Edges from PDFs Before OCR
I have been working on an 800-page scanned PDF of a book written in the 1800s. The objective: make it look as good as possible, and create a fully searchable, text-indexed PDF set that was perfect.
I have worked with Acrobat a great deal in the past, but never on this type of task. Normally I do print-based work where the content is perfect and the process I am about to explain would never be required. So this was a search and destroy mission for me. I did a lot of searching over the past few days, and a lot of downloading of software trials, only to find that they were, in short, buggy rubbish.
The biggest problem was that the entire 800 pages were scanned manually from a bound document, meaning they had very large black areas along all four edges of both the odd and even pages. The black varied in thickness from page to page all the way through the document, and that meant cropping the PDF was pointless for a few reasons. The first is that when you crop a PDF you are doing nothing like a crop command in Photoshop. You are really only putting your hand over your eye, so to speak, so that you cannot see the side of the document you have obscured. The crop in a PDF also alters the page size, meaning that if you want your document to look like a dog's breakfast, with every page a different width or height, that's exactly what you will get. I did not.
It also means that if the width of the black varies from page to page you are facing a nightmare, as each page would need to be cropped by itself, a feat requiring about 12 commands, keypresses or mouse clicks. 800 pages x 12 = 9,600 things I did not want to do, especially when it doesn't fix the problem. The reason is that I want my document to remain A4. You might want yours to stay at some other size, and either way cropping individual pages is the wrong direction to go in.
So say you do go down the route of manually cropping each page, because you only have a few pages. After cropping, you decide to be smart and resize the PDF so that the page size is right again and (you assume) the edge comes back white. Wrong. This is where the crop command fails: all it is doing is concealing the edge of the page. Uncrop or resize the page and it gives you back the original black scan edge. Fail.
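For anyone who thinks in code, here is a toy Python sketch of why cropping fails. It is not Acrobat's implementation and the ToyPage class is made up purely for illustration. It only models the idea that a PDF page keeps its full content (the media box) and that a crop merely moves a visible window (the crop box) over it. The crop values are arbitrary.

```python
from dataclasses import dataclass

# Toy model only: a box is (left, bottom, right, top) in points.
# A4 is roughly 595 x 842 points.
Box = tuple[float, float, float, float]
A4: Box = (0.0, 0.0, 595.0, 842.0)


@dataclass
class ToyPage:
    """Hypothetical stand-in for a scanned PDF page."""
    media_box: Box = A4  # the full scan, black edges included
    crop_box: Box = A4   # the window the viewer actually shows

    def crop(self, box: Box) -> None:
        # Cropping only moves the visible window; no content is deleted.
        self.crop_box = box

    def uncrop(self) -> None:
        # Resetting the crop reveals everything that was hidden.
        self.crop_box = self.media_box

    def visible_size(self) -> tuple[float, float]:
        left, bottom, right, top = self.crop_box
        return (right - left, top - bottom)


page = ToyPage()
page.crop((40.0, 30.0, 570.0, 820.0))  # hide the black edges
print(page.visible_size())             # (530.0, 790.0): no longer A4
page.uncrop()
print(page.visible_size())             # (595.0, 842.0): black edges are back
```

The page size changes with every crop, and the hidden content is never actually removed, which is exactly the pair of problems described above.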
I went lateral. Acrobat has a tool for "obscuring" anything (the NSA probably use it, or should) and it's called "Redaction". It's nicely hidden in Acrobat, on the right-hand side under "Tools". If the "Protection" tab is not visible you will need to turn it on: under the "Comment" button there is a bunch of tick marks, so turn on "Protection". It's a stupid name, but anyway, here are the steps.
Redaction allows for two types of concealment of items in a PDF: "Mark for Redaction" and "Redact Pages". Within the Mark for Redaction section there is "Redacting Text" and "Redact Blocks". Text is handy if you want to hide the names of xxx and xxx in a document you are sending to someone, as you can search and replace with this tool. But I was not. You will be using the "Mark for Redaction" tool, but only for areas or blocks. The tool automatically flips to redacting text when you get too close to some text, and the mouse arrow changes to an "I" bar. When that happens, move the mouse further away from the text until you get back to the cross-hairs target thingy.
(1) Select "Redaction Properties".
This is how I do it, but you may have a better method. I set the fill colour of my redaction tool to white and the outline of the redaction to red or something similar. The outline colour doesn't matter, as the area is going to be wiped out to white, which is the fill colour. I only set it to red so that if I miss my target I can see the area I have drawn, click on it and delete it.
(2) Masking out the black edges.
Then simply go to the "Mark for Redaction" tool. You are now going to mask out the edges of your scanned PDF that are all black and crappy. Make the PDF page sit inside the main area of the window so you can see the extents of the page. Then, to mask out the RIGHT edge of a page, drag from OUTSIDE the right of the page (the artboard, if you like) to the bottom left of that side of the page.
You cannot EDIT the selection: once you click off, it's in place. BUT you can click on the selected area and hit delete to kill it if you accidentally go over something you didn't want to hide. I often do 2 or 3 areas per side of the page, in little blocks of white, so it doesn't matter if the blackness is an irregular shape.
Now continue masking the other black edges of the PDF page, always dragging FROM outside the edge of the page TO inside the page. You will have noticed by now that the mask you have been drawing "vanishes" for any areas outside the PDF page. This is PERFECT, as it means your page size has not been changed, unlike with the crop command. It has not stuffed up your document. That's huge.
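The "vanishing" behaviour is what you would expect if each mark were simply clipped to the page rectangle. The sketch below shows that idea in Python. It is an assumption about the behaviour as seen on screen, not a description of how Acrobat does it, and the function name and coordinates are invented for the example.

```python
from typing import Optional, Tuple

Rect = Tuple[float, float, float, float]  # (left, bottom, right, top) in points


def clip_to_page(mark: Rect, page: Rect) -> Optional[Rect]:
    """Return the part of a dragged mark that lies on the page, or None."""
    left = max(mark[0], page[0])
    bottom = max(mark[1], page[1])
    right = min(mark[2], page[2])
    top = min(mark[3], page[3])
    if left >= right or bottom >= top:
        return None  # the mark missed the page entirely
    return (left, bottom, right, top)


a4 = (0.0, 0.0, 595.0, 842.0)

# A mark dragged from well outside the right edge, covering a 35 pt black strip.
right_strip = (560.0, -20.0, 640.0, 860.0)
print(clip_to_page(right_strip, a4))  # (560.0, 0.0, 595.0, 842.0)
```

Because the overshoot is thrown away, you can be as sloppy as you like outside the page and the page itself stays A4.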
(3) Cementing the redactions.
So you can go on masking areas across pages and pages before you make the redactions PERMANENT, and when I say that, I mean it. There is NO undo. It would not be much good if some smarty could open a redacted PDF, delete all the white boxes and read all the redacted information. So when you are ready to cement your redactions in place over the black scanned areas, hit the "Apply Redactions" button on the Protection toolbar. It gives you a warning, you confirm it, and bang: it races through the document, totally wiping out everything under your masked areas. At the end it offers to look for anything else that is hidden in the document, like overlapping text. That won't be the case in your scanned document, so decline.
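To see why applying a redaction cannot be undone, while a crop can, here is another toy Python sketch. It treats a scanned page as a small grid of grey values and overwrites the marked blocks with white. The grid, the mark format and the function name are all made up for illustration, and Acrobat's real processing is certainly more involved.

```python
WHITE = 255
BLACK = 0


def apply_redactions(pixels, marks, fill=WHITE):
    """Return a new grid with every marked block overwritten by `fill`.

    Each mark is (row_start, row_end, col_start, col_end), end-exclusive.
    Parts of a mark that fall outside the grid are ignored.
    """
    result = [row[:] for row in pixels]  # copy, so the input stays intact
    for r0, r1, c0, c1 in marks:
        for r in range(max(r0, 0), min(r1, len(result))):
            row = result[r]
            for c in range(max(c0, 0), min(c1, len(row))):
                row[c] = fill
    return result


# Three scan lines: white paper, one grey "text" pixel, black right-hand edge.
scan = [[255, 255, 90, 255, BLACK, BLACK] for _ in range(3)]

# One mark over the black strip, deliberately overshooting the right edge.
cleaned = apply_redactions(scan, [(0, 3, 4, 8)])
print(cleaned[0])  # [255, 255, 90, 255, 255, 255]
```

The black values no longer exist anywhere in the cleaned grid, so there is nothing for anyone to uncover later. The untouched input plays the role of the original unredacted file.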
(4) Saving.
Your redacted document is not saved yet. When you hit save on a freshly redacted document, it never saves straight away. It confirms what you want to save and inserts the word "Redacted" into the PDF name. That's damn handy. Mix up a redacted document with the original and send the wrong one to someone, and your job could suddenly become untenable.
So you save the document as a redacted PDF, and now you have the original saved PDF without redactions and the newly named Redacted PDF with no massive black scanned edges in it.
The whole process is incredibly fast, much much faster than the 30 minutes it took me to write this for you. But this is such an incredibly massive power-user tip that I figured you all needed to know the best way to get rid of scanned page edges.
Yes it works on both Mac and Windows Acrobat.
Once you have done this to a PDF the edges will be pristine white, with no artifacts to trip up the OCR system, so you will get a lot fewer OCR errors out at the page edges!
Have fun
Guy
