Hi,
With Adobe Acrobat Pro, I am able to convert a PDF into a plain TXT format.
I would like to do this conversion via an API. In the Adobe PDF Services documentation, there is only an option to convert to RTF, not TXT.
I was wondering if there was a way to convert a PDF to plain TXT using an API service.
Thanks!
You can use the Extract API to get a JSON representation of the PDF then filter it to get only the text elements. From there you can output plain text.
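As a rough illustration of the filtering step, here is a minimal Python sketch. It assumes the Extract result has already been downloaded and unzipped, and that its JSON file holds a top-level "elements" array in which text-bearing elements carry a "Text" field. The function name extract_plain_text is made up for this example and is not part of any Adobe SDK.

```python
import json


def extract_plain_text(structured):
    """Join the text of every text-bearing element, one element per line.

    Assumes `structured` is the parsed Extract JSON with a top-level
    "elements" list whose text elements have a "Text" key (assumption
    based on the Extract output format, not verified against a live call).
    """
    parts = []
    for element in structured.get("elements", []):
        text = element.get("Text")
        if text:  # figures and other non-text elements have no "Text"
            parts.append(text.strip())
    return "\n".join(parts)


def convert_file(json_path, txt_path):
    """Read an Extract JSON file and write the plain-text version."""
    with open(json_path, encoding="utf-8") as src:
        structured = json.load(src)
    with open(txt_path, "w", encoding="utf-8") as dst:
        dst.write(extract_plain_text(structured))
```

Calling convert_file("structuredData.json", "output.txt") would write one line per text element, in the order the elements appear in the JSON. The file names here are placeholders.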
Thanks for your reply!
When I convert the PDF to text via Acrobat Pro, it orders the text in a way that is useful to me -- the rows of the tables are formatted in paragraphs of text. I've attached a sample output PDF and TXT file. I don't think it would be possible to retain this format when using JSON.
This seems like a really roundabout way to complete a simple task. Why does the API not support plain text? It would be really useful to us.
The usefulness of the JSON really depends on your goals. I find the output from Extract to be far more useful than plain text because I can easily format it into whatever I need. Also, tables are represented as tables in the JSON, similar to how HTML represents them, and Extract can also output them as either .csv or .xlsx.
I have a similar problem:
I just want to do from my C# application what Acrobat does with its "Save as..." function.
Just load the PDF and save it as .txt.
How can I do this?
Joel already answered. Extract gives you a JSON representation of the PDF. You can work with the results from that to generate a txt version of the PDF. It can get complex, for example, rendering tables, but it's possible.
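To give a feel for the table part, here is a small Python sketch that puts all the cells of a table row on one line, which is close to the "rows as paragraphs" layout mentioned earlier in the thread. It assumes each element has a "Path" such as "//Document/Table/TR[2]/TD[3]/P" and a "Text" field; that path shape is an assumption about the Extract output, and rows_as_paragraphs is a hypothetical helper name. The same logic should port to C# with a JSON parser and Regex.

```python
import re

# Captures the path up to and including the row segment, e.g.
# "//Document/Table/TR[2]" from "//Document/Table/TR[2]/TD[3]/P".
# The path layout is an assumption about the Extract JSON.
ROW_RE = re.compile(r"^(.*?/Table(?:\[\d+\])?/TR(?:\[\d+\])?)/")


def rows_as_paragraphs(structured):
    """Return plain text with each table row collapsed onto one line."""
    lines = []
    current_row = None
    for element in structured.get("elements", []):
        text = element.get("Text", "").strip()
        if not text:
            continue  # skip elements without text (figures, containers)
        match = ROW_RE.match(element.get("Path", ""))
        row = match.group(1) if match else None
        if row is not None and row == current_row:
            # Same table row as the previous cell: extend that line.
            lines[-1] += " " + text
        else:
            # New row, or ordinary text outside a table: start a new line.
            lines.append(text)
        current_row = row
    return "\n".join(lines)
```

Cells are joined with a single space here; a tab or another separator may suit some tables better, and merged or nested cells would need extra handling.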
Can you give me a link to an example (ideally in C#) of how to do this?