Empty text or garbled characters when extracting from scanned PDFs using Apache Tika Tesseract OCR on Ubuntu 16.04
Hi,
When I run my Apache Tika Tesseract OCR program on Windows, I am able to extract the text from multiple scanned PDFs in a given directory. But when I run the same program on Ubuntu 16.04, for a couple of documents I get garbled characters in the extracted text, and sometimes the extraction returns empty text. Can you please let me know what the reason could be and what I should use to extract the text properly?
Thanks and Regards,
Karim