how to OCR and remove the crossed out lines
Hello,
Inside a PDF I have some crossed-out lines. I would like to OCR the text and keep only the text (without these crossed-out lines). How can I do it?

OCR works by having the software see and recognize certain shapes. Once text has been crossed out, as in your screenshot, those letters are no longer shapes that it can recognize. I can suggest two options.
I decided to test this last one, and the results are interesting. In this first example, I took the screenshot above, opened it up in Photoshop, and used the “Remove Tool.” I drew a mark across the offending lines and got this. OK, but no cigar.

I then took the same screenshot and ran it through Topaz Photo AI to get a better-quality, larger image. [Note: the quality of OCR increases dramatically as the resolution of the text goes up. So, a scan at 600 ppi will provide much better OCR results than a similar scan at 300 ppi.] At the same time, the Topaz software got rid of the JPG degradation in your image, so the text was much clearer. I then used the same “Remove Tool” as before, and got this:

Now, here’s the kicker: I do not know if you have Photoshop (not an old one; only the latest versions have the Remove Tool), and I kinda doubt you’ll have Topaz Photo AI. But my next question is: did you do the scan? If you did, redo the scan at 600 ppi, save it in the TIF format, and see if the Photoshop you have can remove that line. After that, good luck!
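To put rough numbers on the resolution point, here is a small Python sketch. It is not part of the test above; the function name `text_height_px` and the 10 pt example are assumptions chosen only for illustration. It converts a nominal type size in points to pixels at a given scan resolution, using the fact that there are 72 points per inch.

```python
def text_height_px(point_size: float, ppi: int) -> float:
    """Nominal size in pixels of type set at `point_size` when scanned at `ppi`."""
    # 72 points per inch, so points / 72 gives inches; inches * ppi gives pixels.
    return point_size / 72 * ppi


# Compare the two scan resolutions mentioned above for hypothetical 10 pt body text.
for ppi in (300, 600):
    print(f"{ppi} ppi: 10 pt text is about {text_height_px(10, ppi):.0f} px")
```

Doubling the scan resolution doubles the number of pixels across each letter, which gives both the OCR engine and any line-removal tool more detail to work with.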
For more suggestions on how to get a better-quality scan, see this blog post I wrote for Adobe a number of years ago. If you still have questions, please feel free to ask.
https://community.adobe.com/questions-9/scanning-clean-searchable-pdfs-1278321#M89
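If you have neither tool, a scripted approach is another thing to try. The Python sketch below is a different technique from the Photoshop Remove Tool used above, and it is only a toy: the page is assumed to be already converted to a grid of 0/1 values (1 = ink), and the function name `erase_long_runs` and the `min_run` threshold are hypothetical. The idea is that a strike-through is a much longer unbroken horizontal run of ink than any stroke inside a letter, so long runs can be erased before OCR.

```python
def erase_long_runs(img, min_run):
    """Return a copy of `img` with horizontal ink runs of at least `min_run` pixels cleared."""
    out = [row[:] for row in img]  # copy so the input grid is left untouched
    for y, row in enumerate(img):
        x = 0
        while x < len(row):
            if row[x]:
                start = x
                while x < len(row) and row[x]:
                    x += 1  # walk to the end of this run of ink
                if x - start >= min_run:
                    for i in range(start, x):
                        out[y][i] = 0  # run is too long to be a letter stroke
            else:
                x += 1
    return out


# Tiny made-up page: the middle row is a strike-through across three "letters".
page = [
    [0, 1, 0, 0, 1, 1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1, 1, 0],
]
cleaned = erase_long_runs(page, min_run=6)
for row in cleaned:
    print("".join("#" if p else "." for p in row))
```

Note the limitation: this also erases the parts of the letters that the line passed through, so the text underneath is not restored. It will not work on slanted or hand-drawn lines as written, and a real scan would need an image library to load and threshold the page first.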