Sanitizing PDFs manually from WebHelp input (or worse)

Report · Aug 05, 2018

Long story short, right now (soon to change) I have zero control over the development of PDFs sent to my group. Sometimes they’ll download them from our output and edit them... and we have had folks paste PDFs from output into input. It sucks. The process is changing: we are migrating to version control and RH Server, and switching to HTML5. For now I need what little I have to make it to the finish line.

How can I manually remove coding and text searching from a PDF (we have Adobe Acrobat X)? I don’t exactly know what RoboHelp does to a PDF when it generates, so I’m lost on what to change. Basically, I just want to make sure that before anything goes into the project files it is washed... because I have no clue where the PDF came from, the only place I have control is right before we import. Do I turn off text searching? Is there a way to see code and delete it?

Report · Aug 05, 2018

I think we need more information about your workflow.

Are the pdfs included in your webhelp as pdf files, or do you import and convert them to html topics? If included as pdfs, do you generate the pdf files from RH to include in your webhelp or do you import them as baggage files and edit within Acrobat?

I'm not sure why you would want to prevent people searching in a pdf, but I suspect you'll have to post on the Acrobat forum. If you're generating the pdfs from RH, the best I could suggest is poking around the settings you can configure for PDFs in your layout.

I'm not sure what 'coding' is, so I can't offer any suggestions.

I know there's a way to prevent editing pdfs, so you could check the pdf security settings - if you're generating the pdfs from RH you can set this information in your output settings. Otherwise if you just edit the pdfs in Acrobat, there should be somewhere in the settings for pdf to configure this.

Report · Aug 05, 2018

Hi Amebr, thank you!

They are baggage files - so we get a file from someone and they want us to import it into the project and link to it (for example, a PDF of a newsletter that they want linked on a newsletter page for folks to open or download).

A PDF is different on input than output - to my knowledge, when it is generating WebHelp, RoboHelp will make the PDF searchable. Which is what I want.

So, imagine a scenario where someone has downloaded the newsletter from my output, changes something about it in Acrobat, and sends it back to my developer, who blindly pastes it on top of the original input. So two things have gone wrong: the newsletter writer didn’t update their original newsletter, and my developer overwrote an input file with what is now an edited output file.

What I am wondering is, if I had two files in front of me that look exactly the same visually except one is from the input and the other is from the output, how would I tell them apart? Similarly, let’s say my project source files are on a computer burned in a fire, and now all I have is output and I need to reconstruct my project - what, if anything, would need to happen to those PDFs?

I don’t know enough about what RoboHelp does to turn a source version of the file into a searchable version.

No, it’s not normal, and no, it’s not a good process - there is a myriad of problems with how this came to happen. I just want to explain how my people can look out for it, and how to fix it if it happens.

Report · Aug 05, 2018

I don't think RoboHelp does anything to pdf baggage files, except to copy them to the output folder when you generate the project. The only thing you can control is whether searching in WebHelp or HTML5 help will show pdf files in the search results. That's a setting in your SSL/Output settings (Search tab). You might also find the "Exclude Unreferenced Baggage Files from Search" option useful - it will make sure baggage files that aren't explicitly linked (in the TOC or a topic) aren't included in the search.

For comparing two pdf files, you would need to ask that question on the Acrobat forums.
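If RoboHelp really does only copy baggage files to the output, then an untouched output PDF should be byte-for-byte identical to its input copy. Here is a minimal sketch of that check, assuming Python is available on the machine (it may not be on a locked-down PC). The function names are made up for illustration, and nothing here is a RoboHelp or Acrobat feature. A matching hash means the two files are the same bytes; a mismatch only tells you that something changed the file, not what.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a, path_b):
    """True if both files have identical contents (hypothetical helper)."""
    return sha256_of(path_a) == sha256_of(path_b)
```

For example, `same_bytes("input/newsletter.pdf", "output/newsletter.pdf")` (both paths are placeholders) returning False would suggest the copy was edited somewhere along the way.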

If you manually change the security settings in each pdf you receive, you could at least prevent people from updating the pdf that you have published. But it would have to be for each pdf, as you are not generating the pdf files from content within RH. And from the sound of your workflow, there is no way to stop developers from updating their original copy. Again, more specific information about PDF files would be better asked on the Acrobat forum.

Another idea is to add a prefix to the pdf files you receive, e.g. you receive acrobat_file.pdf, rename it RH_acrobat_file.pdf and import it into RoboHelp. Then you could tell developers not to copy content from files called RH_, but rather follow whatever process they should be following. Use any naming convention that makes sense for your situation, and tell them not to use that naming for their original files.
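The renaming could be done by hand, but as a rough sketch (again assuming Python is available), something like this would add the prefix to every PDF in a staging folder before import. The RH_ prefix comes from the example above; the function names and the idea of a staging folder are assumptions for illustration only.

```python
from pathlib import Path

PREFIX = "RH_"  # prefix from the example above; use whatever convention suits you


def prefixed_name(filename, prefix=PREFIX):
    """Return the filename with the prefix added, unless it already has it."""
    return filename if filename.startswith(prefix) else prefix + filename


def add_prefix(folder, prefix=PREFIX):
    """Rename every *.pdf in folder to carry the prefix; return the new names."""
    renamed = []
    # sorted() builds the full list first, so renaming does not disturb the loop.
    for pdf in sorted(Path(folder).glob("*.pdf")):
        target = pdf.with_name(prefixed_name(pdf.name, prefix))
        # Skip files that already have the prefix, and never overwrite an existing file.
        if target != pdf and not target.exists():
            pdf.rename(target)
            renamed.append(target.name)
    return renamed
```

Note that a naming convention only helps if people follow it, so this marks the files rather than protecting them.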

If you need to rebuild your project from Output, Peter Grainge (www.grainge.org/index.htm) has a section about reverse engineering on his website. If you just needed the pdf files and not the RH files (e.g. you decided to start the project from scratch), you could just copy the pdfs from your output and use them as normal.

Report · Aug 05, 2018

You can compare a source HTM file with an output HTM file, as the code is markedly different; you only need to look at the first lines in a text editor. However, you are comparing PDFs.

A quick Google shows there are tools out there to compare PDF files but can you identify what to compare with what?

As to using the reverse engineering method on my site, I have never tested that it retrieves baggage files. It should but that's an unknown.

As to the fire scenario, I was taught to have two backups. One on site and one off site. At the time the offsite one meant a physical delivery but now we have the cloud. I have projects that are in the likes of Dropbox so there is an automatic backup. That wouldn't work with multiple authors on the same project but you'll get the idea.

See www.grainge.org for free RoboHelp and Authoring information.

@petergrainge

Help others by clicking Correct Answer if the question is answered. Found the answer elsewhere? Share it here. "Upvote" is for useful posts.

Report · Aug 06, 2018

Peter - Thank you! I did go through your site looking at the reverse engineering. Thankfully, I have my output files and can fix them - I did have difficulty finding the content linked on your site though. For RoboWizard it took me to a splash page, which forwarded me on to the main site - instead of Newproject.htm, index.htm in the link will take you there - here is that link, unsure if the link on your site is 404’ing for anyone else: RoboWizard. I went through the Wayback Machine to figure it out.

I did see the tools online - sadly, despite my insistence, I have “Fisher Price ABC My First Computer” access to my computer, so I cannot install any tools. I wanted to see what I could DIY first! We make weekly backups, thankfully... to a shared and then a local area. Our stuff is crippled with garbage but we’ve got it! I’m killing it all, moving us to HTML5, on to server, on to version control... using style sheets instead of literally removing zero inline formatting... I dropped a page size down to 50% of the original JUST by removing span tags... and cleared 800..... unlinked baggage files. It’s a nightmare. Once I clean this up for them they are hiring a new group of staff. ANYWAY. Your site has been so helpful and will be mandatory reading when I get a group of fresh faces with no bad habits.

Report · Aug 06, 2018

Hi there

Since the RoboWizard site is my own site, I'm naturally curious as to what you did that made you think you needed to use the wayback machine? Sounds like there may be a problem I've not seen or noticed.

I know the site was modified to use responsive HTML output from RoboHelp. So maybe something got lost in the translation. I'm not sure.

Thanks... Rick

Report · Aug 06, 2018

No 404 for me. The link went to a page that the RoboWizard site handled and diverted me to the home page of the site. I'll get the correct link but no 404.

@Rick Please let me have the correct link.

Report · Aug 06, 2018

Amebr, thank you again! Sadly all this baggage is linked - any that isn’t should be shipped off to the graveyard. They’re good at removing the link... we have that part! I’m going to ask the Acrobat folks about ways to remove OCR/recognized text... From what I’ve read, without Pro I can’t do batches... my project is dragging itself with one arm toward me asking to die LOL. We do have specific file names and prefixes... it hasn’t stopped anyone - they will save over their own with their old name. The folks writing the docs aren’t technical writers... I’m working on a version control setup whereby we will have more involvement in their doc flow and change management. Until then... it is the Wild West.

Thank you again - it sounds like no changes are made to the docs - which might explain why I couldn’t conceive of what they were. Something happened to our output... after someone carried in a bunch of .js files (not kidding), having no clue that that would be an issue. I’ve been sitting and exhaustively watching it compile, and it hangs when generating WebHelp (creating contents), and it is because it is making all the .htm files that go with the PDFs... there are 1600 files of baggage. It was taking those and making triple the .htm files (those rhA21, rhC32D, etc. type files) - before I fixed some of it and removed about 800 files, it was taking a 1600 file input and spitting 9000 files out.
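To put numbers on that kind of input versus output growth, one rough option is to count files per extension in each folder and compare the two tallies. This is only a sketch, assuming Python is available; the function name is made up, and it knows nothing about how RoboHelp names its generated files.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(folder):
    """Count files under folder (including subfolders) by lowercase extension."""
    counts = Counter()
    for path in Path(folder).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts
```

Running it on the project folder and then on the generated output folder would show, for instance, how many .htm and .js files appear in the output for a given number of .pdf files in the input.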

Report · Aug 06, 2018

I have no idea how a pdf file could pull in a bunch of javascript files into your project. 0_0 Sounds like a complete mess. Good luck! Get some sleep!!!

Report · Aug 06, 2018

Hi Rick & Peter - It diverted me to the main site - I called the handling page a 404 as I do custom 404s that forward in similar fashion.

My apologies, I feel like everything I’m saying makes zero sense. I’ve been on 70hr work weeks and exhausted.

I used the index to look up reverse engineer, which did 404 me (to a non-custom 404. I can’t grab the address right now, just go through index > reverse engineer) - could just be my iPhone. At that point I hadn’t thought to read the URL closer, or I’d have realized what to look for under the head and month, so I went to the Wayback Machine.

Adobe Community
