tinab11558511
Participant
June 13, 2018
Answered

Extract all text from a PDF?

Hello,

For an analysis that is part of my doctoral thesis, I would like to extract the text from companies' sustainability reports and customer magazines. The documents contain a large number of images, none of which I need. I really only need the plain text.

So far I have tried the following: saved the PDF as RTF --> the text shifts out of place, letters are missing, etc.

I have Adobe Acrobat XI Pro.

Can anyone help me?

Thanks!

1 reply

Karl Heinz Kremer
Community Expert
Correct answer
June 13, 2018

Converting from PDF to Word, Excel, or any other format is one of the most complex things you can try to do with a PDF file. It works very well in some cases; in other cases, the output has very little to do with the original file. The key to success is that the PDF file needs to be "tagged", which means that it contains structural information about the content that is displayed in the file. The best way to make sure that a PDF file is tagged correctly is to create it from Word or Excel with the PDFMaker in Acrobat (that's the Acrobat ribbon or toolbar).
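If you want a quick first impression of whether a file is tagged before you try to export it, a rough check can be scripted. The following is only a minimal sketch in Python, not an Acrobat feature: it assumes that a tagged file shows a `/StructTreeRoot` entry and a `/Marked true` flag in its uncompressed bytes. The function names `looks_tagged` and `is_probably_tagged` are made up for this example. Files that store their catalog in a compressed object stream will be reported as untagged even when they are not, so treat a negative result as "unknown" rather than as proof.

```python
import re
from pathlib import Path


def looks_tagged(pdf_bytes):
    """Heuristic: True if the raw bytes show both markers of a tagged PDF.

    Assumption: the document catalog is stored uncompressed. If it sits in
    a compressed object stream, this returns False for a tagged file.
    """
    has_struct_tree = b"/StructTreeRoot" in pdf_bytes
    is_marked = re.search(rb"/Marked\s*true", pdf_bytes) is not None
    return has_struct_tree and is_marked


def is_probably_tagged(path):
    """Read a PDF from disk and apply the byte-level heuristic above."""
    return looks_tagged(Path(path).read_bytes())
```

For example, `is_probably_tagged("report.pdf")` (a hypothetical file name) returns `True` only when both markers are found. Acrobat's own document properties remain the more reliable place to look.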

Unfortunately, there is not much you can do to improve the output without spending a lot of time (e.g. by manually tagging the file). Also, if you are using Adobe's ExportPDF service and don't have access to Acrobat, that is not even an option.

The only thing you can do is complain to the original author of the file and tell them that they used a bad PDF generator to create the PDF file.

Sometimes it helps to save the PDF file as a set of high-resolution (e.g. 600 dpi) images, then import these images back into Acrobat, run OCR, and export to Word or Excel again.
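If only the plain text is needed, the same render-then-OCR idea can also be scripted outside Acrobat. The sketch below is an assumption-laden illustration, not the Acrobat workflow described above: it swaps in the open-source command-line tools `pdftoppm` (from Poppler) and `tesseract`, both of which would have to be installed separately, along with Tesseract's German language data for `-l deu`. The function names and the `page` file prefix are hypothetical.

```python
import re
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, prefix, dpi=600):
    """Argument list that renders every page to <prefix>-<n>.png at the given dpi."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)]


def tesseract_command(image_path, lang="deu"):
    """Argument list that OCRs one image and writes the text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def page_number(image_path):
    """Page number taken from the '-<n>.png' suffix that pdftoppm appends."""
    match = re.search(r"-(\d+)\.png$", Path(image_path).name)
    return int(match.group(1)) if match else 0


def ocr_pdf(pdf_path, workdir, dpi=600, lang="deu"):
    """Render the PDF to page images, OCR each page, and return the joined text.

    Assumes pdftoppm and tesseract are on the PATH; raises CalledProcessError
    if either tool fails.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftoppm_command(pdf_path, workdir / "page", dpi), check=True)
    # Sort numerically so that page 10 does not come before page 2.
    pages = sorted(workdir.glob("page-*.png"), key=page_number)
    texts = []
    for image in pages:
        result = subprocess.run(tesseract_command(image, lang), check=True,
                                capture_output=True, encoding="utf-8")
        texts.append(result.stdout)
    return "\n".join(texts)
```

A call such as `ocr_pdf("report.pdf", "ocr_tmp")` (hypothetical names) would return the recognized text of all pages as one string. Images are ignored apart from any text they contain, but OCR errors are to be expected, so the result should be spot-checked before it is used for analysis.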