Extract all text from a PDF?

Question

Hello, for an analysis as part of my doctoral thesis I would like to extract the text from companies' sustainability reports and customer magazines. The documents contain a lot of images, none of which I need. I really only need the plain text. So far I have tried the following: saved the PDF as RTF --> the text shifts, letters are missing, etc. I have Adobe Acrobat XI Pro. Can anyone help me? Thanks!

Karl Heinz Kremer · Accepted Answer

Converting from PDF to Word, Excel or any other format is one of the most complex things you can try to do with a PDF file. It works very well in some cases; in other cases the output has very little to do with the original file. The key to success is that the PDF file needs to be "tagged", which means that it contains structural information about the content displayed in the file (for example headings, paragraphs and reading order). The best way to make sure that a PDF file is tagged correctly is to use the PDFMaker in Acrobat to create the PDF file from Word or Excel (that's the Acrobat ribbon or toolbar).

Unfortunately, there is not much you can do to improve the output without spending a lot of time (e.g. by manually tagging the file). Also, if you are using Adobe's ExportPDF service and don't have access to Acrobat, manual tagging is not even an option.

The only thing you can do is complain to the original author of the file and tell them that they used a bad PDF generator to create the PDF file.

Sometimes it helps to save the PDF file as a set of high-resolution (e.g. 600 dpi) images, import these images back into Acrobat, run OCR, and then export to Word or Excel again.
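If only the plain text is needed, and not a laid-out Word or RTF file, reading the text layer directly may sidestep the layout conversion altogether. The following is a minimal sketch of that idea in Python, not something Acrobat itself provides. It assumes the third-party pypdf library is installed and that the PDF has a real text layer (a scanned file would still need OCR first). The function names `extract_text` and `clean_text` and the file name in the usage comment are hypothetical.

```python
import re


def clean_text(raw: str) -> str:
    """Tidy extracted text: rejoin hyphenated line breaks, collapse whitespace."""
    # Rejoin words split across lines, e.g. "Nachhaltig-\nkeit" -> "Nachhaltigkeit".
    # Caveat: this also joins genuine hyphenated compounds that break at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Remove leading and trailing spaces on every line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Reduce three or more consecutive newlines to one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def extract_text(pdf_path: str) -> str:
    """Return the cleaned text layer of every page; images are simply ignored."""
    # Imported here so clean_text stays usable without the dependency.
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    # extract_text() can return an empty string for image-only pages.
    pages = [page.extract_text() or "" for page in reader.pages]
    return clean_text("\n\n".join(pages))


# Usage (hypothetical file name):
# text = extract_text("sustainability_report.pdf")
```

The quality of the result still depends on how the PDF was produced: multi-column magazine layouts in particular may come out in the wrong reading order, which is the same problem that tagging is meant to solve.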
