How to convert PDF documents into .txt files?
Hi,
I am working on a research project in Machine Learning, where my dataset is a large collection of PDF files. I need to convert these PDF files into a format which I can use as input in a Text Classification model, such as .txt files. I have been unable to find a tool which does this well, and would be delighted if I could be pointed in the right direction.
Thanks!
I doubt any tool will do it "well" unless you are very lucky, because PDFs don't always convert or even contain text. Do you have any paid-for Adobe services or products related to PDF?
I do not, but I wanted some advice as to whether there are any paid-for services which work relatively well, and whether there is an option to test them out before paying for them. The PDFs in question are all research reports, so all of them contain a significant amount of text, and as you rightly pointed out, they have not been converting well using the free tools available online.
Hi,
You can get a free trial of Acrobat Pro to see if it works:
Download Adobe Acrobat free trial | Acrobat Pro DC
You can also subscribe for a single month for $25. Note that the annual plan, paid monthly, has a lower price.
Plans and pricing | Adobe Acrobat DC
Whether it works depends partly on how the PDF was made. Adobe created PDF but gave it away, and PDFs are now made by lots of vendors other than Adobe. Some are poorly made.
Using Acrobat and a well-made PDF, you can also convert to Word and Excel.
Thanks a lot, I'll try these out!
The chances are many tools will do a similar job, and the limitation is in the PDF itself.
If the issue is word separators (missing or extra spaces between words), a different tool might give a different result.
If it is no text coming out, or gobbledegook, nothing will fix it.
Retyping might be a large and time-consuming part of your project.
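To find out which of these cases each file falls into before committing to any retyping, a rough triage over the whole collection may help. The sketch below is only an illustration and not something anyone in this thread has tested on your files: it assumes the open-source `pdftotext` command-line tool from Poppler is installed (it is not an Adobe product), and the function names and thresholds are made up for the example and would need tuning on real reports.

```python
import shutil
import subprocess
from pathlib import Path


def triage_text(text: str, min_chars: int = 200, min_wordlike: float = 0.6) -> str:
    """Classify extracted text as 'empty', 'garbled', or 'usable'.

    The thresholds are illustrative guesses, not tuned values.
    """
    tokens = text.split()
    if len(text.strip()) < min_chars or not tokens:
        # Little or no text came out: probably a scanned PDF without a text layer.
        return "empty"
    # Count tokens that consist mostly of letters.
    wordlike = sum(
        1 for tok in tokens if sum(ch.isalpha() for ch in tok) >= 0.7 * len(tok)
    )
    if wordlike / len(tokens) < min_wordlike:
        # Text came out, but it does not look like words ("gobbledegook").
        return "garbled"
    return "usable"


def convert_folder(pdf_dir: str, txt_dir: str) -> dict:
    """Run pdftotext on every PDF in pdf_dir and triage each result."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (Poppler) is not installed or not on PATH")
    out_dir = Path(txt_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    report = {}
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        txt = out_dir / (pdf.stem + ".txt")
        # -enc UTF-8 sets the output encoding of the .txt file.
        result = subprocess.run(
            ["pdftotext", "-enc", "UTF-8", str(pdf), str(txt)], check=False
        )
        if result.returncode != 0 or not txt.exists():
            report[pdf.name] = "failed"
            continue
        text = txt.read_text(encoding="utf-8", errors="replace")
        report[pdf.name] = triage_text(text)
    return report
```

Calling `convert_folder("reports", "reports_txt")` (hypothetical folder names) would write one .txt per PDF and return a dictionary mapping each file name to its label. Files labelled "empty" are candidates for text recognition (OCR); files labelled "garbled" are the ones where another converter is unlikely to help.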
Thank you for clearing that up!
If you had Acrobat then you could export the pages as images, then create a new PDF file from those images and run Recognize Text on it. If the results are good you'll end up with a document that has readable text and that can be exported to a text file...
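Without Acrobat, the same idea (render each page as an image, then run text recognition on the images) can be approximated in a script. This is a minimal sketch under assumptions, not Acrobat's Recognize Text feature: it assumes the third-party Python packages `pdf2image` and `pytesseract` are installed together with the Poppler and Tesseract programs they wrap, and the function names `ocr_pages` and `ocr_pdf` are made up for the example.

```python
from typing import Callable, Iterable


def ocr_pages(images: Iterable, recognize: Callable[[object], str]) -> str:
    """Run `recognize` on each page image and join the pages with form feeds."""
    return "\f".join(recognize(image).strip() for image in images)


def ocr_pdf(pdf_path: str, dpi: int = 300, lang: str = "eng") -> str:
    """Render every page of a PDF to an image and OCR it."""
    # Imported here so that ocr_pages can be used without the OCR stack installed.
    from pdf2image import convert_from_path  # needs Poppler
    import pytesseract  # needs the Tesseract binary

    images = convert_from_path(pdf_path, dpi=dpi)
    return ocr_pages(
        images, lambda image: pytesseract.image_to_string(image, lang=lang)
    )
```

The result of `ocr_pdf("report.pdf")` (a hypothetical file name) could then be written to a .txt file. As with Acrobat, the quality depends heavily on the scan, so it is worth checking a few outputs by eye before processing the whole collection.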
I use SODA PDF PREMIUM for this conversion. I am able to convert a printed (text-based) PDF file of over 100 pages; it does not work with scanned PDF files.
You must use the PREMIUM version to do this conversion.
You can download the trial version and see if it works for your files.

