Extracting empty text or funny characters from Scanned PDF using Apache Tika Tesseract OCR in Ubuntu 16.04

Forum|Forum|7 years ago
October 17, 2018
1 reply
342 views

Hi,

When I use Apache Tika Tesseract OCR program in Windows I can be able to extract the text from multiple scanned PDFs from a given directory.But when I use same program in Ubuntu 16.04 OS, for couple of documents I am getting funny characters during extraction and some times empty text extraction is coming.Can you please let me know what could be the reason and what should I used to extract text properly.

Thanks and Regards,

Karim