The PDF-to-Word converter doesn't handle line breaks well. Hyphenations at line breaks are kept as hard hyphens. Line breaks that are a result of formatting are kept as hard line breaks. It takes a lot of work in Word afterwards to remove them.
It would be nice to have some options for the conversion process so you could make the conversion a bit more intelligent.
The fault is with the application that created the PDF. It probably added the hyphens and line-breaks to it, which it shouldn't have. The export simply maintains what it finds in the PDF.
In an untagged PDF there is no mark to say whether a hyphen is hard or soft. They are the same character. There are no paragraph marks and no soft or hard returns, just the text where it is. So it's all guesswork: sometimes Acrobat guesses the way we would like, sometimes not.
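To make the "guesswork" concrete, here is a minimal Python sketch of the kind of heuristic a converter might apply to extracted lines. This is not Acrobat's actual algorithm; the function name `reflow` and the rules in it are assumptions chosen for illustration only.

```python
def reflow(lines):
    """Join extracted lines into paragraphs using simple guesses."""
    paragraphs, current = [], ""
    for raw in lines:
        line = raw.strip()
        if not line:
            # Guess: a blank line marks a paragraph break.
            if current:
                paragraphs.append(current)
            current = ""
            continue
        if not current:
            current = line
        elif current.endswith("-") and line[0].islower():
            # Guess: a word split across lines, so drop the hyphen and join.
            current = current[:-1] + line
        else:
            # Guess: a line break caused by layout, so join with a space.
            current += " " + line
    if current:
        paragraphs.append(current)
    return paragraphs
```

With input like `["The con-", "verter keeps", "hard breaks."]` the guess works out, but `["a seven-", "day basis"]` comes back as "a sevenday basis", which is exactly the kind of wrong guess being described.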
Some PDF creators add line-break characters at the end of lines in a PDF, although it's unnecessary, and that makes the converted output incorrect.
There isn't really such a thing as a line break character in an untagged PDF. That said, some PDF creators might put CR and/or LF characters in a string, which luckily show as nothing at all in most fonts... but in the PDF they aren't line endings, or anything special.
In case anybody is interested, I found a sort of workaround for this problem (and it is definitely a problem). After creating the Word doc, I use "Replace" (as in Find/Replace) in Word to replace "- " (hyphen space) with an optional hyphen (go to More > Special > Optional Hyphen). That replaces the hard hyphen with an optional (discretionary) hyphen. Word will still break the word there, but it is now one word, and the discretionary hyphenation stays intact for future purposes. You're going to have to hit the Replace button one at a time so you can see what you're replacing as you go, because there may be legitimate instances (such as seven-day basis) that you don't want to change and that "Replace All" would catch, so pay attention. It's the best option I've figured out so far for this problem.
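If you'd rather do the same review outside Word, here is a rough Python sketch of the idea, assuming the converted text has been saved as plain text. The names `find_candidates` and `replace_selected` are made up for illustration, and using U+00AD (the Unicode soft hyphen) as the stand-in for Word's optional hyphen is an assumption; how Word treats it on re-import may vary.

```python
import re

SOFT_HYPHEN = "\u00ad"  # Unicode soft (discretionary) hyphen; an assumed stand-in
CANDIDATE = re.compile(r"(?<=\w)- (?=\w)")  # "hyphen space" between word characters

def find_candidates(text, context=15):
    """Return (position, snippet) for each candidate so it can be reviewed."""
    return [
        (m.start(), text[max(0, m.start() - context): m.end() + context])
        for m in CANDIDATE.finditer(text)
    ]

def replace_selected(text, positions):
    """Replace only the approved candidates; leave every other match alone."""
    approved = set(positions)
    return CANDIDATE.sub(
        lambda m: SOFT_HYPHEN if m.start() in approved else m.group(0),
        text,
    )
```

The two-step shape mirrors clicking Replace one at a time: look at each snippet from `find_candidates`, then pass only the positions you approve to `replace_selected`, so a legitimate hyphen is never changed automatically.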
Try searching for a "hard" hyphen followed by a line break and replacing it with the "soft" hyphen.