Errors in Converting PDF to HTML
First timer here with Acrobat, so I'm probably doing something incorrectly. But I'm trying to convert a pdf to HTML and the results don't look good, at all. And I'm doing that with a trial version of Adobe Acrobat Pro DC running on Win 10. (Plus, I'm on Day 2 of a 7 day trial, and it isn't going well so I'm getting worried.)
The pdf I'm trying to convert to HTML is shown on the PN to ID# Cros-Ref tab on my webpage here. (There's a screen grab at the top, but below that is the pdf embedded on that page.)
To the right are 4 more tabs with different HTML results from different applications, and none of them are "right". You may have to click the right arrow to get to the Adobe tab, but you can see that the formatting isn't correct. In addition, I had the OCR box checked and it turned "5" into "S" in many cases, but I can probably fix that by unchecking that box.
Anyway, what am I doing incorrectly? I am opening that file in Acrobat, clicking Export PDF, selecting HTML and then single page. I then open the file that is created, copy the HTML and put it on my webpage.
Thanks in advance!
Long time since we heard from anyone trying this. HTML conversion may preserve the text and pictures from simple pages. But it’s pretty useless for most purposes. If HTML could do what PDF does, PDF would never have been invented.
The HTML exported may be a starting point for editing, though Wodd export is usually more practical.
Well, that's disappointing. And, not to argue, but I would have thought that file I used would have qualified as a "simple page", which was why I was testing it since most of what I do is MUCH more complex than that. Oh well, glad I asked as I've been beating my head against this wall for days.
So, if HTML export won't work, then what is "Wodd export"?
And, let me tell you what I'm doing as you may well have a better idea. My site is the world's best (only?) documentation site for 1980-86 Ford trucks. I have massive quantities of documentation on them in paper format, which I'm scanning, OCR'ing, and embedding on the pages. But I've come to realize that my embedded files aren't crawled/indexed by Google, so the contents cannot be found in a search. However, HTML is crawled/indexed and can be found. So, if I was able to convert the documents reliably I'd put them on that way.
As for editing, there are just way too many documents, some of which are 40 pages long, to even consider that.
Anyway, thanks a bunch. It wasn't the answer I wanted, but at least I can stop destroying the wall.
That's a special kind of export only available when using autocorrect. Sorry. I mean Word export. Word documents support more complexity than HTML, and also set you up for editing imperfect or awful results.
The key thing is that PDF remains unique, and you cannot expect any conversion to be as powerful and flexible as PDF.
Indexing is another thing. Google SHOULD index PDFs; it has done so for more than a decade, so long as the PDF has searchable text. Maybe what you really need is OCR on your PDFs.
By the way... I just looked at Wheel Covers - Gary's Garagemahal (the Bullnose bible)
It says there is an embedded PDF, but there isn't. What it embeds is a link to https://onedrive.live.com/embed?cid=80736256535317EF&resid=80736256535317EF%2127154&authkey=APL8vf0y... which purports to be a Word document but is actually a hugely complex piece of dynamic HTML generated by OneDrive. Text seems to be extractable, but I can well imagine Google wouldn't be allowed near it.
I'd work back to what your originals really are, and where they are.
Oh, wow! This is getting complex, but I really appreciate it since you know far more than I do.
Yes, what is on the page is actually a link back to OneDrive. And, the file is only available to "people who have the link". So, from what I can tell, a/the reason Google doesn't index it is that Google doesn't know who has the link, so why bother since they won't present you with search results you can't see.
I can put the files on my Google Drive, which has an option to make files accessible to anyone on the web, and Google will then index the file. But, it will find the file on my Google Drive, either in addition to or in place of my website. But, since I'm trying to make my website THE place to go for these trucks, I don't want them finding things elsewhere. Hence my quest for finding a way to put searchable content on the site.
But, I wasn't aware that what Microsoft is doing when I embed a link is to take you to a "hugely complex piece of dynamic HTML". However, now that I think about it that makes sense.
As for my "originals", they are sheets of paper. I scan them in as a PDF, OCR them, and save them on my OneDrive. Then I tell Microsoft to generate the embed code and paste that code into an "embed" widget on my Weebly website.
I've tried exporting to a Word doc, but the results include graphics, which I can't seem to use because what I have on Weebly is a "text" widget that won't take graphics. And even then the formatting is all wonky, which would require serious editing, and I really don't want to do that.
Thoughts? Suggestions?
THANK YOU!
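One low-effort route to crawlable text on the page itself, offered only as a rough sketch: if the OCR'd PDF can be saved as plain text (this assumes a plain-text export is available and keeps the columns roughly aligned), the text can be escaped and wrapped in a `<pre>` block, then pasted into a widget that accepts raw HTML. The function name `text_to_html` and the sample part numbers below are made up for illustration; nothing here comes from Acrobat or Weebly.

```python
import html


def text_to_html(text: str, title: str) -> str:
    """Wrap plain text in a small HTML fragment that keeps its spacing."""
    # Escape &, <, > and quotes so stray symbols cannot break the markup.
    safe_title = html.escape(title)
    safe_body = html.escape(text.strip("\n"))
    # <pre> preserves whitespace, so table-like columns stay lined up.
    return f"<h2>{safe_title}</h2>\n<pre>{safe_body}</pre>"


if __name__ == "__main__":
    # Hypothetical sample rows, not real part numbers.
    sample = "PART NO.   ID#\nABC-123    5 & 6\n"
    print(text_to_html(sample, "PN to ID# Cross-Ref"))
```

The result is plain HTML text sitting on the page, so there are no graphics to lose and nothing for a crawler to be locked out of. OCR mistakes such as "5" read as "S" would still carry through and need checking at the source.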

