I am working with a PDF of a client's memoir and have converted each chapter to DOCX using a free online tool. When I open a converted file on my Mac, it opens in Pages and the text styles differ from the PDF. When I import it into an InDesign text box for the book design, some words come in run together. For example, the text looks like this:
"I saw a duck, butitranaway."
I don't want to work out manually which words to separate, especially since the text has a lot of slang I don't fully know. I'm looking for a more efficient way to fix this. I have tried googling it, but I can't find anyone who has solved it. The memoir is around 280 pages, so fixing everything by hand isn't cost-effective. Does anyone know a solution?
Have you tried exporting from the PDF to a Word Doc using Acrobat Pro?
I have not. I will try this, thank you.
The takeaway here is that not all putative Word files are actually fully Word standard or compliant — most export options from things like Google Docs and Pages and such do just a good enough job that Word itself can usually open the files. But most (I'm tempted to say all) files from these secondary sources and converters are not compatible with InDesign import.
The simple solution is to open any such files in a real copy of Word, and re-save in RTF, DOC and DOCX format under a new name. InDesign will often import one of those much more cleanly than the pretender version.
As Derek notes, using Acrobat Pro to export PDF to Word is one of the more reliable workflows, although it's still a good idea to open the result and do basic tidying and cleanup in Word before saving as a valid Word file set.
I have also seen more than one reference to "spacing issues with Word files" that trace to the file being set for full justification. Select all styles in Word and set them to left justification, and on top of the other fixes above, I'd be surprised if that doesn't clear up the spacing issue.
All that said, keep in mind that PDF does not store most text as absolutely linear flows. Depending on what tool and version created the PDF (and there are many, many second-rate ones), the text may actually have soft or hard returns at the end of each line as it was presented in the PDF. Search and replace to eliminate all soft returns (change to spaces) can help at the technical level, although you'll still have a lot of cleanup formatting and proofreading to do.
There just isn't any really clean way to get text back out of PDF, not without fairly specialized tools that are often too expensive for one-shot use.
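For anyone who wants to script the soft-return cleanup mentioned above rather than run it as a search and replace, here is a minimal Python sketch. It assumes the text has been saved as plain text, that a blank line marks a real paragraph break, and that the vertical tab and U+2028 characters stand for soft returns (an assumption that will not hold for every converter). The function name `unwrap_lines` is made up for this example.

```python
import re


def unwrap_lines(text: str) -> str:
    """Join lines that were broken at the PDF's visual line ends.

    Assumption: a blank line marks a real paragraph break; every other
    line break is treated as a leftover return and becomes a space.
    """
    # Normalize line endings, then treat vertical tab and U+2028
    # (assumed soft-return characters) as ordinary line breaks.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = text.replace("\u2028", "\n").replace("\x0b", "\n")

    cleaned = []
    # Split on blank lines so real paragraph breaks survive.
    for para in re.split(r"\n\s*\n", text):
        lines = [line.strip() for line in para.split("\n")]
        joined = " ".join(line for line in lines if line)
        joined = re.sub(r" {2,}", " ", joined)  # collapse doubled spaces
        if joined:
            cleaned.append(joined)
    return "\n\n".join(cleaned)


sample = "I saw a duck,\nbut it ran\naway.\n\nThe next day\nit came back."
print(unwrap_lines(sample))
```

This does not rejoin words hyphenated across a line end and it cannot restore spaces that are missing from the file altogether, so proofreading is still needed afterwards.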
The simple solution is to open any such files in a real copy of Word, and re-save in RTF, DOC and DOCX format under a new name. InDesign will often import one of those much more cleanly than the pretender version.
By @James Gifford—NitroPress
Yes, I was quite surprised this works - found this out a few months ago too.
But no matter what - something always goes out of kilter.
The best success I have had is saving as a .doc file, not a .docx.
MS ran out of kilter long ago. 🙂
I have found all three formats to work in varying situations, but DOC does seem to be the most consistent. I automatically create all three options for any but the most expedient workflow, just so I don't have to back up and dig for more kilter.
Yeah, different routes give different results - sometimes I import the .docx or the .doc file, and sometimes it needs to go all the way to RTF.
Sometimes even that doesn't work.
Sometimes I'll copy and paste directly, working from the .docx first (as supplied), then the .doc, then RTF as a last resort.
Things often don't import or paste in correctly, and it's a struggle.
Thank you, this is very helpful. I wanted to learn more about InDesign as I work on this project, and learning why PDFs don't work well is good.
Thank you, this is very helpful. I wanted to learn more about InDesign as I work on this project, and learning why PDFs don't work well is good.
By @defaultov3ej5blkkjy
PDF is the end result of the work done in InDesign.
It's never "the first step" or source - it's the "last resort" if you don't have access to the original file.
Here's one more thing to look at: click on the sentence with "butitranaway". Enable Type > Show Hidden Characters and then choose Edit > Edit in Story Editor. This is an unformatted view of the same story. Do you see blue dots between the words?
~Barb
I realize there's a limit to the number of blades even a mega-Swiss Army Knife like ID can have, but especially for these fairly common workflows between Adobe/Adobe-compliant formats and such... I just sigh at the need for One More D*mned Plugin. 😛
(Plugin, conversion service, outside tool, script, helper, whatever.)
You have to remember that a PDF is, essentially, merely a container of the print instructions to create a page, and as such has no idea of how the original document was constructed or laid out. More often than not, chunks of text are broken up into individual objects and, whether you use Acrobat's own export-to-Word function, or a sketchy free online tool, both have to GUESS how to put things back together, like how words are spaced and also which blocks of text are part of a paragraph. Better tools do this better than others, but NONE are perfect.
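To make the "guess" concrete, here is a toy Python sketch of how a converter might rebuild word spacing from glyph positions alone. Everything in it is an assumption made for illustration: the `Glyph` record, the `place` helper, the fixed glyph widths, and the `gap_ratio` threshold are hypothetical, and no claim is made that Acrobat or any particular tool works this way.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph on the page
    width: float  # advance width of the glyph


def rebuild_text(glyphs, gap_ratio=0.25):
    """Insert a space wherever the gap between two glyphs exceeds
    gap_ratio times the average glyph width (a guessed threshold)."""
    if not glyphs:
        return ""
    avg_width = sum(g.width for g in glyphs) / len(glyphs)
    out = [glyphs[0].char]
    for prev, cur in zip(glyphs, glyphs[1:]):
        gap = cur.x - (prev.x + prev.width)
        if gap > gap_ratio * avg_width:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


def place(words, word_gap, char_width=5.0):
    """Hypothetical helper: lay words out on one line with fixed-width
    glyphs and the given gap between words, storing no space characters."""
    glyphs, x = [], 0.0
    for word in words:
        for ch in word:
            glyphs.append(Glyph(ch, x, char_width))
            x += char_width
        x += word_gap
    return glyphs


words = ["but", "it", "ran", "away"]
loose = place(words, word_gap=3.0)  # roomy word spacing
tight = place(words, word_gap=1.0)  # tightly set line
print(rebuild_text(loose))
print(rebuild_text(tight))
```

With the roomy line the gaps clear the threshold and the spaces come back; with the tightly set line they fall below it and the words fuse, which is the same kind of result as "butitranaway".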
Excellent summary. I'll just add that it was never meant to be an editable format, either, any more than a printed sheet of paper. All of the edit/modify/extract etc. features are badly glued on to a structure that doesn't really support them.
I wanted to see if I could still use just the PDF without asking, as they want to make sure their content stays safe.
I wanted to see if I could still use just the PDF without asking, as they want to make sure their content stays safe.
By @defaultov3ej5blkkjy
I'm sorry, but I don't understand.
"Safe"??
That is fine, you don't need to understand that part. It's an NDA contract.
That is fine, you don't need to understand that part. It's an NDA contract.
By @defaultov3ej5blkkjy
You said it's just a memoir...
So you are working on a text - but you can't get access to the original source and need to recreate it from the "printed" version?
Since the primary question (all the lousy ways to get text out of PDF for editing) has been answered, I will note that—
All of the problems would be solved or minimized by getting the live material, so resolving the reasons why that hasn't been done (some/any of the above, or who-knows) is the next useful step.