Problem converting PDF to TEXT - only on some pages

New Here, Feb 17, 2022

Hi,

We download PDF files from Walmart with our POs.  These files are between 10-50 pages.

We export the PDF to TXT format and then import the TXT into our accounting program.

Lately, some of the pages (POs) have their data scrambled.  It might happen on 1-3 pages in a 40-page PDF.

It always happens in the section with the line items (the most important part). Instead of one short line per item, the export produces one very long line that mixes up the column headers with the data on the next line.

It isn't consistent and I can't seem to find out why this happens.

I tried this with Acrobat Pro 2019 DC, 2020 DC and even the latest 2021 DC.  I even tried the non-DC 2020 version just to see what happens, and the same scrambling of SOME sections on a few pages happens, always in the SAME place in the TXT file.

 

Strangely, I can usually use a workaround:

- open the PDF

- export to EXCEL format (option single worksheet)

- in Excel, SAVE AS ADOBE PDF (option entire workbook, fit to width)

- then open the new PDF and then export as TXT and it usually works properly

 

I tried online conversions but they are all terrible.  What is the best way to convert a PDF to TXT format?

 

Thanks

Richard

 

p.s. I can email anyone a PDF to demo this problem
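
The symptom described above (one very long line in the line-item section where several short lines are expected) can be detected automatically before the TXT is imported. The following is only a minimal sketch in Python; the function name `find_suspect_lines`, the thresholds, and the sample text are all hypothetical, and a real export may need different limits.

```python
from statistics import median


def find_suspect_lines(text, factor=3.0, min_length=40):
    """Return (line_number, length) for lines much longer than the typical line.

    A line is flagged when it is longer than both min_length and
    factor times the median length of the non-blank lines.
    """
    lines = text.splitlines()
    lengths = [len(line) for line in lines if line.strip()]
    if not lengths:
        return []
    limit = max(min_length, factor * median(lengths))
    return [
        (number, len(line))
        for number, line in enumerate(lines, start=1)
        if len(line) > limit
    ]


# Made-up sample: the last line imitates headers merged with several items.
sample = "\n".join([
    "Line Item  Qty  Description",
    "001        12   Widget A",
    "002        6    Widget B",
    "Line Item  Qty  Description 003 24 Widget C 004 8 Widget D 005 2 Widget E 006 1 Widget F",
])

for number, length in find_suspect_lines(sample):
    print(f"line {number}: {length} characters")
```

A check like this does not repair anything; it only tells the user which exported file needs the Excel round-trip workaround.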

 

TOPICS
Edit and convert PDFs , General troubleshooting
Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
1 ACCEPTED SOLUTION
Community Expert, Feb 17, 2022
quote

When I open the original Walmart PDF and use File>Properties, Description Tab, the PDF Producer shows "iTextSharp.LGPLv2.Core 2021.9.0.3737" with PDF version 1.4 (Acrobat 5.x).  That file causes the random problem.

By defaultpfk03mmrb4k1

 

Good sleuthing, Richard.

That version of the iText utility is fairly recent, but I'm concerned that Walmart is building the PDF to the 1.4 standard. That was released in 2001 (see https://en.wikipedia.org/wiki/Adobe_Acrobat_version_history), 21 years ago.

 

PDF 1.6 is from 2004, a bit better, and the industry has been using PDF 1.7 (released in 2006) ever since. Although the PDF 2.0 standard was published in 2017, I still don't see it being used by the industry. That will change in time.

quote

However, when I export the PDF to Excel (I have to adjust a few column widths, which takes only a few seconds) and then save it as a new Adobe PDF to convert back, the new file shows a new producer, "Adobe PDF Library 19.21.90", and PDF version 1.6 (Acrobat 7.x).

 

That's a better, more recent version, but you might want to check that your version of Acrobat is up to date. The current PDF Library is 21.11.71 and Acrobat Pro is at version 21.011.20039.  See Adobe's release notes at https://helpx.adobe.com/acrobat/release-note/release-notes-acrobat-reader.html 

 

It does appear that Walmart changed something in their iText workflow that creates the PDFs. It would be interesting to see what application and producer are listed in a PDF that used to work for you, and compare them with one that doesn't work.
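
Comparing the producer and version of a working PDF against a failing one does not have to be done by hand in File>Properties. Below is a rough Python sketch, using only the standard library, that reads the `%PDF-x.y` header and looks for a plain-text `/Producer (...)` entry. The function name `describe_pdf` is hypothetical, and the approach rests on assumptions: it will find nothing if the Info dictionary is stored in a compressed object stream, written as a hex or UTF-16 string, or if the file is encrypted, and the header version can be overridden by a `/Version` entry in the document catalog.

```python
import re


def describe_pdf(data: bytes) -> dict:
    """Report the header version and a literal-string /Producer entry, if visible.

    Only a quick look at the raw bytes, not a real PDF parser.
    """
    # The header is normally at the very start of the file.
    header = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    # Literal string in parentheses; backslash escapes are skipped over.
    producer = re.search(rb"/Producer\s*\(((?:\\.|[^\\)])*)\)", data)
    return {
        "version": header.group(1).decode("ascii") if header else None,
        "producer": producer.group(1).decode("latin-1") if producer else None,
    }


# Tiny stand-in for a real file, using the producer string quoted above.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Producer (iTextSharp.LGPLv2.Core 2021.9.0.3737) >>\nendobj\n"
)
print(describe_pdf(sample))

# For a real file (hypothetical name):
#   from pathlib import Path
#   print(describe_pdf(Path("walmart_po.pdf").read_bytes()))
```

Running this over a folder of old and new POs would show whether the producer string or version changed around the time the scrambling started.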

 

Looks like you're on a path that will get you something you can work with. Unfortunately, the problem lies with the PDF created by Walmart. As you explore different ways to export from Acrobat, look also at the options in the various dialogue boxes to see if any adjustments there improve your conversion.

 

Best to you!

 

 

|    Bevi Chagnon   |  Designer, Trainer, & Technologist for Accessible Documents |
|    PubCom |    Classes & Books for Accessible InDesign, PDFs & MS Office |

Community Expert, Feb 17, 2022

Without seeing a sample PDF, it's difficult to determine what's happening.

I suspect it has to do with how the PDF is either exported or interpreted.

 

1) Try some of the Export options from Acrobat.

Note that there are 2 different types of .txt files the PDF can be exported to, as well as to a spreadsheet and XML. One of these might produce a better end result for you.

 

[Screenshot: Export-To_01.png]

 

2) When you open the exported .txt versions, you'll be prompted to choose an encoding for it.  My first choice is usually Windows (Default), but depending upon the content, you might need to try a different encoding.

[Screenshot: PDF-TO-WORD_plaintext_02.png]
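
If the exported .txt is read by a program rather than opened in an application that prompts for an encoding, the same "try another encoding" idea can be sketched in Python. This is only an illustration: the function name `decode_export` and the candidate list are assumptions, and `cp1252` is assumed here to correspond to the "Windows (Default)" choice on a Western-locale system.

```python
def decode_export(raw: bytes, encodings=("utf-8-sig", "cp1252", "latin-1")):
    """Return (text, encoding) using the first candidate that decodes cleanly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# The same word stored two ways: as UTF-8 and as Windows-1252.
print(decode_export("Café".encode("utf-8")))
print(decode_export("Café".encode("cp1252")))
```

The order matters: UTF-8 is tried first because it rejects most text in other encodings, while `latin-1` accepts any byte sequence and is therefore only useful as a last resort.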

 

|    Bevi Chagnon   |  Designer, Trainer, & Technologist for Accessible Documents |
|    PubCom |    Classes & Books for Accessible InDesign, PDFs & MS Office |
New Here, Feb 17, 2022

Hi Bevi,

 

Thanks for your ideas.  I think I may have found something.  When I open the original Walmart PDF and use File>Properties, Description Tab, the PDF Producer shows "iTextSharp.LGPLv2.Core 2021.9.0.3737" with PDF version 1.4 (Acrobat 5.x).  That file causes the random problem.

 

However, when I export the PDF to Excel (I have to adjust a few column widths, which takes only a few seconds) and then save it as a new Adobe PDF to convert back, the new file shows a new producer, "Adobe PDF Library 19.21.90", and PDF version 1.6 (Acrobat 7.x).

 

With this new PDF, I can usually convert properly to a TXT file.

I can't call Walmart to ask what they changed and why (I'd have better luck reaching a human at Google or Amazon). I emailed Walmart tech support, but they claim nothing changed and it isn't their fault... big surprise.

 

>Note that there are 2 different types of .txt files the PDF can be exported to

When I tried the Accessible option, I got a single word/number on each line, and that would require a complete rewrite of the parsing routines I wrote.  (I look at the left column for key words: Purchase Order, Date, Line Item, etc. and then import the text on those lines into a custom invoicing program I wrote in dBase decades ago.)
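
For readers unfamiliar with this kind of import, the keyword-in-the-left-column approach can be sketched roughly as follows. This is a Python illustration, not the original dBase routine; the function name `parse_export`, the exact line layout, and the sample values are all made up for the example.

```python
# Key words looked for at the start of each line (taken from the post above).
KEYWORDS = ("Purchase Order", "Date", "Line Item")


def parse_export(text):
    """Return (keyword, rest_of_line) pairs for lines that start with a keyword."""
    records = []
    for line in text.splitlines():
        stripped = line.strip()
        for keyword in KEYWORDS:
            if stripped.startswith(keyword):
                # Drop the keyword and any separator characters after it.
                value = stripped[len(keyword):].lstrip(" :\t")
                records.append((keyword, value))
                break
    return records


# Hypothetical export text; a real Walmart PO will look different.
sample = """Purchase Order: 1234567
Date: 02/17/2022
Ship To: Example Store
Line Item 001  12  Widget A
"""

for keyword, value in parse_export(sample):
    print(f"{keyword!r} -> {value!r}")
```

A parser of this style depends on each record sitting on its own line, which is why a single merged line in the line-item section, or one word per line from the Accessible export, breaks the import.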

 

> When you open the exported .txt versions, you'll be prompted to choose an encoding for it.

Strange, I know what you mean, but I don't see those options.  Maybe that's because I open the TXT with Notepad (or the much better Notepad++), which doesn't give me any choice about encoding.

 

Thanks for the ideas, I will test out Exporting to Word and XML options to see if I can minimize the number of steps the users have to do.

 

Richard

 

 

 

New Here, Feb 18, 2022

Hi Bev,

 

Thanks for the info on PDF version history. I am using Acrobat 2019, so I might upgrade to 21.011.20039 (latest).  I usually am up to date with Windows and all security programs (antivirus, antispyware, etc.), but I didn't see a need to keep Acrobat up to date.  I figured if it isn't broken, then don't fix it :)

 

It is a long shot, but I will try to reach someone in Walmart IT department and get their reply.

 

Thanks again for your help!

New Here, Feb 25, 2022

Just a short followup....

 

Although I formally complained and sent all my findings to Walmart, I haven't heard back in over a week.

I offered to share my findings and help them, but as expected, there was no reply.

At least I have a workaround.

Thanks

 
