Extract text line-by-line from an OCR-scanned "editable text and images" PDF file?

Report · Mar 03, 2022

I am trying to convert some photocopied bank statements into a more usable form. I can successfully use the OCR scanning tool to create a PDF file that contains editable text and images. Now I need to know how to extract the editable text from the resulting file line-by-line, like the "Read Out Loud" tool does.

If I simply use the mouse to select the main body of the page (which contains a table of transactions with an mm/dd date on the left, a description, and a dollar amount), then as I drag across the page the selected area expands upward and downward to include editable text at the top and bottom of the page, which I don't want. If I then paste the selected text into a plain text file, I get a completely jumbled result that cannot possibly be parsed into what I want. The issue seems to be that the copy operation proceeds in a kind of vertical, columnar manner from left to right, over the entire page.

It is obviously possible to process the editable text line-by-line, left to right, because the "Read Out Loud" tool does it. So, how do I extract editable text in a line-by-line fashion? Do I have to write code to parse the PDF file? God I hope not. There must be a better way. Help!

Report · Mar 04, 2022

Hi, thanks very much but I found a Java API that solves my problem, thanks to this web page:

https://www.tutorialspoint.com/how-to-read-data-from-pdf-file-and-display-on-console-in-java

Initially the sample program didn't give me the text in the order I wanted, but after reading some of the source code for the PDFTextStripper class (part of Apache PDFBox) I discovered an option called "setSortByPosition" which, when set to true, gives me the line-by-line behavior I need. Here is the modified Java program:

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfToConsole {

    public static void main(String[] args) throws IOException {
        // Load an existing document
        File file = new File("Scan 1.pdf");
        PDDocument document = PDDocument.load(file);

        // Instantiate the PDFTextStripper class
        PDFTextStripper pdfStripper = new PDFTextStripper();

        // Must set this flag to get text line by line instead of column by column
        pdfStripper.setSortByPosition(true);

        // Retrieve text from the PDF document
        String text = pdfStripper.getText(document);
        System.out.println(text);

        // Close the document
        document.close();
    }
}
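
Once the text comes out one statement row per line, the rows can probably be picked apart with a regular expression. This is only a rough sketch using the standard library: the class name StatementLineParser and the exact row layout (an mm/dd date, a free-text description, then an amount with two decimals) are assumptions based on the table described in the question, so the pattern will likely need adjusting for a real statement. Lines that don't look like a transaction (page headers, footers) are simply skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StatementLineParser {

    // Assumed row layout: "mm/dd", description, amount such as 12.50 or $1,250.00
    private static final Pattern ROW = Pattern.compile(
            "^(\\d{2}/\\d{2})\\s+(.+?)\\s+(-?\\$?[\\d,]+\\.\\d{2})$");

    // Returns one {date, description, amount} array per line that matches ROW
    static List<String[]> parse(String text) {
        List<String[]> rows = new ArrayList<>();
        for (String line : text.split("\\R")) {
            Matcher m = ROW.matcher(line.trim());
            if (m.matches()) {
                rows.add(new String[] { m.group(1), m.group(2), m.group(3) });
            }
        }
        return rows;
    }
}
```

The string returned by pdfStripper.getText(document) could be passed straight to StatementLineParser.parse, and each resulting array written out as a CSV row.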

Report · Mar 04, 2022

It might be possible with a custom-made script, but it's a complex task. It would need to read each word on the page separately and then sort them by their physical location (because the internal order does not match what you see on the page, it seems), and then split them into lines and export them to a plain-text file.
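
To make the sort-and-group idea above concrete, here is a minimal sketch in plain Java. It is not tied to any PDF library: the Word class, the LineGrouper name, and the yTolerance parameter are all hypothetical, and it assumes each word's position has already been read from the page, with y increasing downward. Words whose y values fall within the tolerance of the first word in a row are treated as the same line, and each line is then ordered left to right.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LineGrouper {

    // Hypothetical word record: the text plus its position on the page.
    // Assumes y grows downward; reverse the y sort for a bottom-up axis.
    static final class Word {
        final String text;
        final double x;
        final double y;

        Word(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    static List<String> toLines(List<Word> words, double yTolerance) {
        // Sort top to bottom so words on the same visual line become neighbors
        List<Word> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.comparingDouble((Word w) -> w.y));

        // Start a new row whenever a word is too far below the row's first word
        List<List<Word>> rows = new ArrayList<>();
        double rowY = 0;
        for (Word w : sorted) {
            if (rows.isEmpty() || Math.abs(w.y - rowY) > yTolerance) {
                rows.add(new ArrayList<>());
                rowY = w.y;
            }
            rows.get(rows.size() - 1).add(w);
        }

        // Within each row, order left to right and join with single spaces
        List<String> lines = new ArrayList<>();
        for (List<Word> row : rows) {
            row.sort(Comparator.comparingDouble((Word w) -> w.x));
            StringBuilder sb = new StringBuilder();
            for (Word w : row) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(w.text);
            }
            lines.add(sb.toString());
        }
        return lines;
    }
}
```

The tolerance matters for OCR output, where words on the same printed line rarely share exactly the same baseline; a value around half the text height is a reasonable starting guess, but that is something to tune per scan.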

I've developed similar tools for my clients in the past and would be happy to create one for you as well, for a fee of course. You can contact me privately via a Private Message to discuss it further (click my user-name and then on "Send a message"), if you're interested.

Report · Mar 04, 2022