karimulla_bashap60075824
Participant
October 17, 2018
Question

Extracting empty text or funny characters from Scanned PDF using Apache Tika Tesseract OCR in Ubuntu 16.04


Hi,

When I run my Apache Tika Tesseract OCR program on Windows, I can extract the text from multiple scanned PDFs in a given directory. But when I run the same program on Ubuntu 16.04, I get funny characters for a couple of documents, and sometimes the extracted text is empty. Can you please let me know what the reason could be and what I should use to extract the text properly?

Thanks and Regards,

Karim

This topic has been closed for replies.

1 reply

Bernd Alheit
Community Expert
October 18, 2018

Why did you post this question in the forum for Adobe Acrobat?