How to convert PDF documents into .txt files?

New Here,
Jan 18, 2019

Hi,

I am working on a research project in Machine Learning, where my dataset is a large collection of PDF files. I need to convert these PDF files into a format which I can use as input in a Text Classification model, such as .txt files. I have been unable to find a tool which does this well, and would be delighted if I could be pointed in the right direction.

Thanks!

Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
LEGEND,
Jan 18, 2019

I doubt any tool will do it "well" unless you are very lucky, because PDFs don't always convert cleanly, and some don't contain any text at all. Do you have any paid-for Adobe services or products related to PDF?

New Here,
Jan 19, 2019

I do not, but I wanted some advice as to whether there are any paid-for services which work relatively well, and whether there is an option to test them out before paying for them. The PDFs in question are all research reports, so all of them contain a significant amount of text, and as you rightly pointed out, they have not been converting well using the free tools available online.

Community Expert,
Jan 19, 2019

Hi,

You can get a free trial of Acrobat Pro to see if it works:

Download Adobe Acrobat free trial | Acrobat Pro DC

And you can subscribe for a month for $25. Note that there is a lower price for an annual plan that is paid monthly.

Plans and pricing | Adobe Acrobat DC

Part of whether it works depends on how the PDF was made. Adobe created PDF, but gave it away and PDFs are now made by lots of vendors other than Adobe. Some are poorly made.

Using Acrobat and a well-made PDF, you can also convert to Word and Excel.

New Here,
Jan 20, 2019

Thanks a lot, I'll try these out!

LEGEND,
Jan 19, 2019

Chances are that many tools will do a similar job, and that the limitation is in the PDF itself.

If the issue is missing word separators, a different tool might give different results.

If no text comes out at all, or only gobbledegook, no tool will fix it.

Retyping might be a large and time-consuming part of your project.
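
For a large collection, it may help to sort the converted files into the three failure cases described above before deciding what to do with each one. The following is only a rough sketch: the function name `triage_extracted_text`, the category labels, and all thresholds are assumptions made for illustration, not tuned values, and it works on text that some other tool has already extracted.

```python
import string


def triage_extracted_text(text, min_chars=200, min_clean_ratio=0.85,
                          max_avg_word_len=15.0):
    """Roughly classify the text extracted from one PDF.

    Returns "no_text", "garbled", "missing_spaces", or "ok".
    All thresholds are illustrative assumptions.
    """
    stripped = text.strip()
    # Case 1: (almost) nothing came out, e.g. an image-only scan.
    if not stripped or len(stripped) < min_chars:
        return "no_text"

    # Case 2: gobbledegook. Count characters that are letters, digits,
    # whitespace, or ASCII punctuation; control characters and the
    # Unicode replacement character count against the file.
    punctuation = set(string.punctuation)
    clean = sum(
        ch.isalnum() or ch.isspace() or ch in punctuation for ch in stripped
    )
    if clean / len(stripped) < min_clean_ratio:
        return "garbled"

    # Case 3: word separators were lost, so "words" are implausibly long.
    words = stripped.split()
    avg_len = sum(len(word) for word in words) / len(words)
    if avg_len > max_avg_word_len:
        return "missing_spaces"

    return "ok"
```

Files labelled "missing_spaces" are the ones worth retrying with a different converter; "no_text" and "garbled" files would need OCR or retyping.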

New Here,
Jan 20, 2019

Thank you for clearing that up!

Community Expert,
Jan 21, 2019

If you had Acrobat then you could export the pages as images, then create a new PDF file from those images and run Recognize Text on it. If the results are good you'll end up with a document that has readable text and that can be exported to a text file...
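
For anyone who needs to do this over a whole folder without Acrobat, the same idea (turn each page into an image, run text recognition, export plain text) can be scripted. The sketch below swaps in a different tool, the open-source OCRmyPDF command-line program, and is not the Acrobat workflow itself. It assumes `ocrmypdf` is installed and on the PATH; the helper names and the output file naming are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_ocr_command(pdf_path, out_dir):
    """Build the OCRmyPDF command line for a single PDF.

    --force-ocr rasterizes every page and runs OCR on it, even when the
    page already has a text layer; --sidecar writes the recognized text
    to a plain .txt file next to the OCRed PDF.
    """
    pdf_path = Path(pdf_path)
    out_dir = Path(out_dir)
    ocr_pdf = out_dir / f"{pdf_path.stem}.ocr.pdf"   # hypothetical naming
    txt_file = out_dir / f"{pdf_path.stem}.txt"
    return ["ocrmypdf", "--force-ocr", "--sidecar", str(txt_file),
            str(pdf_path), str(ocr_pdf)]


def ocr_folder(in_dir, out_dir):
    """Run OCR on every *.pdf in in_dir; return the names that failed."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(Path(in_dir).glob("*.pdf")):
        result = subprocess.run(build_ocr_command(pdf, out),
                                capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(pdf.name)
    return failed
```

As with the Acrobat route, the quality of the resulting .txt files depends on how clean the page images are, so it is worth spot-checking a few outputs before converting the whole dataset.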

New Here,
Oct 06, 2019

I use SODA PDF PREMIUM for this conversion. I am able to convert print-generated PDF files of over 100 pages (it does not work with scanned PDF files).

You must use the PREMIUM version to do this conversion.

You can download the trial version and see whether it works for your files.