
Extract text line-by-line from an OCR scan created "editable text and images" pdf file?

New Here, Mar 03, 2022

I am trying to convert some photocopied bank statements into a more usable form. I can successfully use the OCR scanning tool to create a PDF file that contains editable text and images. Now I need to know how to extract the editable text from the resulting file line by line, the way the "Read Out Loud" tool does.

If I simply use the mouse to select the main body of the page (a table of transactions with an mm/dd date on the left, a description, and a dollar amount), the selection expands upward and downward as I drag across the page and picks up editable text at the top and bottom of the page, which I don't want. If I then paste the selected text into a plain text file, I get a completely jumbled result that cannot possibly be parsed into what I want. The issue seems to be that the copy operation proceeds in a kind of vertical columnar manner from left to right, over the entire page.

It is obviously possible to process the editable text line by line, left to right, because the "Read Out Loud" tool does it. So, how do I extract editable text in a line-by-line fashion? Do I have to write code to parse the PDF file? God, I hope not. There must be a better way. Help!

TOPICS
Edit and convert PDFs, Scan documents and OCR
1 ACCEPTED SOLUTION
New Here, Mar 04, 2022

Hi, thanks very much, but I found a Java API (Apache PDFBox) that solves my problem, thanks to this web page:

https://www.tutorialspoint.com/how-to-read-data-from-pdf-file-and-display-on-console-in-java

Initially the sample program didn't give me the text in the order I wanted, but after reading some of the source code for the PDFTextStripper class I discovered an option called "setSortByPosition" which, if set to true, gives me the line-by-line behavior I need. Here is the modified Java program:

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfToConsole {
   public static void main(String[] args) throws IOException {
      // Load an existing document (PDDocument.load is the PDFBox 2.x API).
      // try-with-resources closes the document even if extraction throws.
      File file = new File("Scan 1.pdf");
      try (PDDocument document = PDDocument.load(file)) {
         // Instantiate PDFTextStripper class
         PDFTextStripper pdfStripper = new PDFTextStripper();

         // Must set this flag to get line by line text vs column by column
         pdfStripper.setSortByPosition(true);

         // Retrieve text from PDF document
         String text = pdfStripper.getText(document);
         System.out.println(text);
      }
   }
}
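Once the text comes out line by line, the transaction rows described in the question (an mm/dd date, a description, and a dollar amount) could be picked out with a regular expression. The following is only a minimal sketch under assumptions: the class name TransactionParser, its Transaction type, and the exact row pattern are hypothetical and not part of PDFBox or the original program, and real OCR output may need a more forgiving pattern. Lines that do not look like a transaction, such as page headers and footers, are simply skipped.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: turns line-by-line statement text into transactions. */
final class TransactionParser {

    /** One parsed row: date as printed (mm/dd), description, and amount. */
    static final class Transaction {
        final String date;
        final String description;
        final BigDecimal amount;

        Transaction(String date, String description, BigDecimal amount) {
            this.date = date;
            this.description = description;
            this.amount = amount;
        }
    }

    // Assumed row layout: "mm/dd  description  amount", where the amount has two
    // decimals and may carry a leading minus sign, a dollar sign, and commas.
    private static final Pattern ROW = Pattern.compile(
            "^(\\d{1,2}/\\d{1,2})\\s+(.+?)\\s+(-?\\$?[\\d,]+\\.\\d{2})$");

    /** Returns the rows that match the assumed layout; all other lines are ignored. */
    static List<Transaction> parse(String text) {
        List<Transaction> result = new ArrayList<>();
        for (String line : text.split("\\R")) {
            Matcher m = ROW.matcher(line.trim());
            if (!m.matches()) {
                continue; // header, footer, or a row the OCR garbled
            }
            String amount = m.group(3).replace("$", "").replace(",", "");
            result.add(new Transaction(m.group(1), m.group(2), new BigDecimal(amount)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for the output of pdfStripper.getText(document)
        String sample = "ACME BANK Statement\n03/02 COFFEE SHOP #12 4.50\nPage 1 of 3";
        for (Transaction t : parse(sample)) {
            System.out.println(t.date + " | " + t.description + " | " + t.amount);
        }
    }
}
```

In the program above, the string returned by pdfStripper.getText(document) could be passed to parse instead of being printed directly.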

Community Expert, Mar 04, 2022

It might be possible with a custom-made script, but it's a complex task. It would need to read each word on the page separately and then sort them by their physical location (because the internal order does not match what you see on the page, it seems), and then split them into lines and export them to a plain-text file.
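The sort-and-split step described above might look roughly like the sketch below. It is an illustration under assumptions, not the script actually used for clients: the LineGrouper class, its Word type, and the yTolerance parameter are hypothetical names, and the words with their page coordinates are assumed to have already been read from the PDF by some other means. Coordinates are taken to have y growing downward.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: rebuild visual lines from positioned words. */
final class LineGrouper {

    /** A word plus the page coordinates of its top-left corner (y grows downward). */
    static final class Word {
        final String text;
        final double x;
        final double y;

        Word(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Sorts words top to bottom, starts a new line whenever the vertical gap to the
     * previous word exceeds yTolerance, then orders each line left to right.
     */
    static List<String> toLines(List<Word> words, double yTolerance) {
        List<Word> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.comparingDouble((Word w) -> w.y));

        // Group vertically adjacent words into rows.
        List<List<Word>> rows = new ArrayList<>();
        double lastY = Double.NaN;
        for (Word w : sorted) {
            if (rows.isEmpty() || Math.abs(w.y - lastY) > yTolerance) {
                rows.add(new ArrayList<>());
            }
            rows.get(rows.size() - 1).add(w);
            lastY = w.y;
        }

        // Within each row, order by x and join the words with single spaces.
        List<String> lines = new ArrayList<>();
        for (List<Word> row : rows) {
            row.sort(Comparator.comparingDouble((Word w) -> w.x));
            StringBuilder sb = new StringBuilder();
            for (Word w : row) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(w.text);
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up words in a scrambled internal order, as a scanned page might store them
        List<Word> words = Arrays.asList(
                new Word("12.50", 400, 100.4),
                new Word("03/02", 50, 100.0),
                new Word("COFFEE", 120, 99.8));
        for (String line : toLines(words, 2.0)) {
            System.out.println(line);
        }
    }
}
```

The tolerance matters because OCR rarely places every word of a line on exactly the same baseline; a value that is too large would merge neighboring lines, and one that is too small would split a single line.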


I've developed similar tools for my clients in the past and would be happy to create one for you as well, for a fee of course. You can contact me privately via a Private Message to discuss it further (click my user-name and then on "Send a message"), if you're interested.
