
Need help with Acrobat to Word Export :: Certain paragraphs are converting to garbage

Community Expert, Dec 14, 2020

Hi Smart Acrobat People:

 

I am tasked with converting a PDF to FrameMaker and the original Word doc used to create the PDF is long gone. Engineers have been editing the PDF over the course of several years. I thought, no problem, convert the PDF to Word or RTF, clean up the doc and I'll be off and running. 

 

Here is how a small section looks in Acrobat:

1.png

And the same section in Word:

2.png

 

I have tried:

  1. exporting using Acrobat Pro DC on my Mac and Acrobat Pro DC on Windows, 
  2. exporting to Word Document, Word 97–2003 document and RTF, and 
  3. every combination of export settings on both platforms. (The font is Arial.)

 

Copy/Paste (paste as formatted text, paste as unformatted text, paste into InDesign, Word, FrameMaker) actually gets worse, in that those crazy characters become ?s in boxes. Again, changing the font does not help.

 

This is a long document, so multiply the headings issue by hundreds of pages. Any ideas?

 

~Barb 

TOPICS
Edit and convert PDFs, Standards and accessibility


Community Expert, Dec 15, 2020

Hi again:

 

I'm still struggling to figure out a way to move forward without having to retype hundreds of pages of headings, in Spanish. As I look at the fonts, I see various encodings for Arial Bold (some Arial Bold is converting clearly and some is not), along with TrueType vs. Type 1 (CID) vs. TrueType (CID).

 

Is there a clue in this dialog box?

fonts.png

I don't know anything about "Identity-H", other than a post from @Dov Isaacs explaining that it is a perfectly valid encoding method per the PDF specification. While I am surmising that those are the paragraphs that are mismapping, I recognize that this may be entirely incorrect.
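
For readers unfamiliar with Identity-H: with this encoding, the 2-byte codes in the page content select glyphs in the font rather than characters, so text export depends on a separate ToUnicode map stored in the PDF. If that map is missing or wrong, an exporter has to guess. Whether that is what happened in this particular file is not confirmed in the thread. The Python sketch below only illustrates the mechanism; the glyph codes and the small CMap are made-up sample data, and the function names are hypothetical.

```python
import re

# Hypothetical ToUnicode fragment: glyph code -> UTF-16BE text (sample data).
SAMPLE_TOUNICODE = """
5 beginbfchar
<0035> <0052>
<0028> <0045>
<002A> <0047>
<002F> <004C>
<0024> <0041>
endbfchar
"""

PAIR_RE = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Collect {code: text} from every beginbfchar...endbfchar section."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.DOTALL):
        for src, dst in PAIR_RE.findall(section):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def split_codes(hex_string):
    """Split a hex string into the 2-byte codes that Identity-H uses."""
    raw = bytes.fromhex(hex_string)
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def decode_with_map(hex_string, to_unicode):
    """Decode via the ToUnicode map; unmapped codes become U+FFFD."""
    return "".join(to_unicode.get(code, "\ufffd") for code in split_codes(hex_string))


def decode_naively(hex_string):
    """Wrongly treat each glyph code as if it were a character code."""
    return "".join(chr(code) for code in split_codes(hex_string))


shown_text = "00350028002A002F0024"  # glyph codes as they might appear on a page
print(decode_with_map(shown_text, parse_bfchar(SAMPLE_TOUNICODE)))  # REGLA
print(decode_naively(shown_text))  # 5(*/$
```

The second line of output shows the kind of garbage that results when glyph codes are read as if they were characters.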

 

Really, this comes down to one question: is there any way to remap these characters after the fact? Again, the original Word document is gone, and this one PDF is all we have to work with.

 

~Barb 

 

Community Expert, Dec 15, 2020

It looks like I am talking to myself... but for others who encounter this issue in the future, I am able to recover the text using the following process:

  1. Export the doc to jpeg (produces 1 jpeg per page)
  2. Open a jpeg page in Acrobat
  3. Run OCR
  4. Export the OCR'd page to Word
  5. Clean up the mess
  6. Repeat

 

I'll write an action to automate this, but if someone has a better way, please tell me!
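
For anyone who wants to script the same rasterize-then-OCR loop outside Acrobat, here is a minimal Python sketch. It swaps in different tools from the ones used above: it assumes the `pdftoppm` (Poppler) and `tesseract` command-line programs are installed, along with Tesseract's Spanish language data. The function names and the `page` file prefix are hypothetical. It produces plain text per page rather than a Word file, so the clean-up step still applies.

```python
import subprocess
from pathlib import Path


def rasterize_command(pdf_path, out_dir, dpi=300):
    """pdftoppm call that writes one JPEG per page as <out_dir>/page-N.jpg."""
    return ["pdftoppm", "-jpeg", "-r", str(dpi), str(pdf_path), str(Path(out_dir) / "page")]


def ocr_command(image_path, lang="spa"):
    """tesseract call; the text lands in a .txt file next to the image."""
    image = Path(image_path)
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


def recover_text(pdf_path, out_dir):
    """Run both steps and return the recognized text, page by page."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_command(pdf_path, out_dir), check=True)
    pages = []
    for image in sorted(Path(out_dir).glob("page-*.jpg")):
        subprocess.run(ocr_command(image), check=True)
        pages.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return pages
```

Calling `recover_text("manual.pdf", "ocr_out")` (both names are placeholders) would run the whole loop in one pass.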

 

~Barb 

Community Expert, Dec 15, 2020

Oh, no, Barb...don't do that!

The problem is that the fonts in the original PDF either 1) weren't embedded into the PDF at the time it was made, 2) weren't Unicode/OpenType fonts, or 3) both.

We have this sort of problem all the time when we make older PDFs accessible. Try this workaround:

  1. If possible, do this on a Windows 10 computer with up-to-date fonts and Acrobat Pro DC.
  2. Before you do anything, check that the fonts on your computer are Unicode/OpenType versions, not older TrueTypes. Go down the list in the fonts panel (above screen capture) and check each one on your system. FYI, Acrobat defaults to the fonts on the user's computer when they are not fully or correctly embedded in the PDF.
  3. Open the PDF on your system (with the Unicode fonts) and embed the fonts into the PDF, following the guidance in "How to Embed Fonts into a PDF with Adobe Acrobat."
  4. Save the PDF, and close it.
  5. Reopen the PDF and export it to Word (not Word 97-2003, which was pre-XML and pre-Unicode).
  6. Open the new Word .docx file in Word (Windows) and see how well it converted the glyphs.
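
As a cross-check on the embedding step, outside Acrobat's Fonts panel, the rough Python sketch below scans a PDF's raw bytes for font descriptor objects and reports whether each one carries a `/FontFile`, `/FontFile2`, or `/FontFile3` entry, which is how a PDF points to an embedded font program. It is a heuristic under stated assumptions: it will miss descriptors stored inside compressed object streams, so an empty or partial result proves nothing. The function name and the sample bytes are hypothetical.

```python
import re

OBJ_RE = re.compile(rb"\d+\s+\d+\s+obj(.*?)endobj", re.DOTALL)
DESCRIPTOR_RE = re.compile(rb"/Type\s*/FontDescriptor")
NAME_RE = re.compile(rb"/FontName\s*/([^\s/<>\[\]()]+)")
FILE_RE = re.compile(rb"/FontFile[23]?\b")


def font_embedding_report(pdf_bytes):
    """Return {font name: True if any descriptor for it references a font file}."""
    report = {}
    for match in OBJ_RE.finditer(pdf_bytes):
        body = match.group(1)
        if not DESCRIPTOR_RE.search(body):
            continue
        name = NAME_RE.search(body)
        if name is None:
            continue
        font_name = name.group(1).decode("latin-1")
        embedded = FILE_RE.search(body) is not None
        report[font_name] = report.get(font_name, False) or embedded
    return report


# Synthetic sample: one embedded subset, one font with no font file entry.
sample = (
    b"1 0 obj\n<< /Type /FontDescriptor /FontName /ABCDEF+Arial-BoldMT"
    b" /FontFile2 9 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /FontDescriptor /FontName /ArialMT /Flags 32 >>\nendobj\n"
)
print(font_embedding_report(sample))
```

For a real file, the input would be something like `Path("manual.pdf").read_bytes()`, where the file name is a placeholder.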

 

This is definitely caused by non-compliant fonts in the original PDF. Are you able to open the original PDF and check which fonts are being used where?

  1. Open it in a Windows version of Acrobat Pro.
  2. Edit / Edit Text, and select a portion of the text that's not converting correctly, such as REGLA 6.
  3. Its font "call" will be in the Edit panel.
  4. Let us know which fonts are being called at the trouble spots.

 

|    Bevi Chagnon   |  Designer & Technologist for Accessible Documents
|    Classes & Books for Accessible InDesign, PDFs & MS Office |

Community Expert, Dec 15, 2020

Ok, one more diagnostic task:

Can you post a screen capture of the original PDF's File Properties / Description panel? It will show which software was used to create the original PDF.

|    Bevi Chagnon   |  Designer & Technologist for Accessible Documents
|    Classes & Books for Accessible InDesign, PDFs & MS Office |

Community Expert, Dec 15, 2020

@Bevi Chagnon - PubCom.com!

 

Thank you for stepping in. 😊

 

Oddly, the original app isn't listed. My first plan of attack was to request that they try again to locate the original file.

acro.png

 

Am in the process of making sure everything is updated on Windows—I'll work through your list once it is and will check back in. 

Community Expert, Dec 15, 2020

PDF Producer is the software utility that converts the source file into the PDF.

The screen capture is telling: they made a PDF from a PDF.

So whatever shortcomings were in the original were carried over into the newer version.

Your client needs a better workflow and training. If you'd like to do that, contact me offlist and we can be your backup tech support coaches on it.

 

|    Bevi Chagnon   |  Designer & Technologist for Accessible Documents
|    Classes & Books for Accessible InDesign, PDFs & MS Office |

Community Expert, Dec 15, 2020

@Bevi Chagnon - PubCom.com:

 

I'm stuck at #2. Windows is showing that the Arial installed on my computer is a TrueType font.
fonts.png

I can ask my client to purchase an OpenType version (i.e., https://www.linotype.com/145867/arial-family.html), but before I do that, I want to confirm that I'm understanding you correctly: even though I have access to many other OpenType fonts through my CC subscription, I can't just map Arial to another OpenType font. But if this is the best option, then we will go this route.

 

The font calls for the crazy text are Arial Black and Arial, Arial Bold, Arial Italic and Arial Bold Italic.

 

~Barb 

Community Expert, Dec 15, 2020

Barb, Microsoft uses the TTF extension on all its fonts, whether traditional TrueType or OpenType (TrueType flavored). So drill a bit deeper and see if you can see this dialogue box that confirms it. (Looking at the copyright date and file version, I'm going to assume you have an OpenType version).

Confirming OpenType status.
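
Because the .ttf extension alone does not separate the two kinds of font, another way to look deeper is to read the font file's table directory. The Python sketch below rests on one assumption: the layout tables `GSUB`, `GPOS`, and `GDEF` and the `DSIG` signature table are defined by the OpenType specification, so finding them in a .ttf file suggests an OpenType font with TrueType outlines. That is a hint, not the same check the Windows dialog performs. The function names are hypothetical and the sample bytes are synthetic.

```python
import struct

OUTLINE_FLAVORS = {
    b"\x00\x01\x00\x00": "TrueType outlines",
    b"true": "TrueType outlines",
    b"OTTO": "CFF outlines",
}
OPENTYPE_HINT_TABLES = {"GSUB", "GPOS", "GDEF", "DSIG"}


def sfnt_tables(font_bytes):
    """Return (outline flavor, sorted table tags) from the table directory."""
    magic = font_bytes[:4]
    if magic == b"ttcf":
        raise ValueError("font collection (.ttc): inspect each member font separately")
    if magic not in OUTLINE_FLAVORS:
        raise ValueError("not an sfnt font file")
    (num_tables,) = struct.unpack(">H", font_bytes[4:6])
    # Table records start at byte 12; each is 16 bytes and begins with a 4-byte tag.
    tags = [
        font_bytes[12 + 16 * i:16 + 16 * i].decode("latin-1")
        for i in range(num_tables)
    ]
    return OUTLINE_FLAVORS[magic], sorted(tags)


def opentype_hints(tags):
    """List the tables present that point toward an OpenType font."""
    return sorted(OPENTYPE_HINT_TABLES.intersection(tags))


# Synthetic two-table directory standing in for a real font file.
sample = struct.pack(">IHHHH", 0x00010000, 2, 32, 1, 0)
sample += b"GSUB" + bytes(12) + b"cmap" + bytes(12)
flavor, tags = sfnt_tables(sample)
print(flavor, tags, opentype_hints(tags))
```

On a real system the bytes would come from the installed font file, for example by reading `arial.ttf` from the Windows fonts folder; that location is an assumption and may differ per machine.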

 

|    Bevi Chagnon   |  Designer & Technologist for Accessible Documents
|    Classes & Books for Accessible InDesign, PDFs & MS Office |

Community Expert, Dec 15, 2020

Don't buy Arial (reg, ital, etc.). They're included free with MS products.

https://docs.microsoft.com/en-us/typography/font-list/arial  See if you can download them from the MS website.

 

And don't forget to get a good copy of Arial Black. https://docs.microsoft.com/en-us/typography/font-list/arial-black

 

Once you confirm you have Unicode versions (with a post-2010 copyright date), go ahead and embed the fonts.

 

|    Bevi Chagnon   |  Designer & Technologist for Accessible Documents
|    Classes & Books for Accessible InDesign, PDFs & MS Office |

Community Expert, Dec 15, 2020

@Bevi Chagnon - PubCom.com:

 

Thank you, thank you, thank you!

 

Embedding the fonts worked exactly as you said it would and solved the issue. I could never have gotten here without your help. 🙏🏼

 

~Barb 

Community Expert, Dec 15, 2020

@Barb Binder, you're welcome, my friend!

Glad to have been able to help.

 

|    Bevi Chagnon   |  Designer & Technologist for Accessible Documents
|    Classes & Books for Accessible InDesign, PDFs & MS Office |
