Participating Frequently
August 5, 2018
Question

Sanitizing PDFs manually from WebHelp input (or worse)

  • 3 replies
  • 771 views

Long story short, right now (soon to change) I have zero control over the development of PDFs sent to my group. Sometimes people download them from our output and edit them, and we have had folks paste PDFs from output into input. It sucks; the process is changing, and we are migrating to version control, RH Server, and HTML5. For now I need what little I have to make it to the finish line.

How can I manually remove coding and text searching from a PDF (we have Adobe Acrobat X)? I don’t know exactly what RoboHelp does to a PDF when it generates, so I’m lost on what to change. Basically, I just want to make sure that anything going into the project files is washed first. Because I have no clue where a PDF came from, the only place I have control is right before we import. Do I turn off text searching? Is there a way to see code and delete it?


3 replies

Participating Frequently
August 6, 2018

Hi Rick & Peter - It diverted me to the main site - I called the handling page a 404 as I do custom 404s that forward in a similar fashion.

My apologies, I feel like everything I’m saying makes zero sense. I’ve been on 70hr work weeks and exhausted.

I used the index to look up reverse engineer, which did 404 me (to a non-custom 404; I can’t grab the address right now, just go through index > reverse engineer). It could just be my iPhone. At this point I hadn’t thought to read the URL closer, or I’d have realized what to look for under the head and month, so I went to WBM.

Participating Frequently
August 6, 2018

Hi Amebr, thank you!

They are baggage files - so we get a file from someone and they want us to import it into the project and link to it (for example, a PDF of a newsletter that they want linked on a newsletter page for folks to open or download).

A PDF is different on input than output - to my knowledge, when it is generating WebHelp, RoboHelp will make the PDF searchable. Which is what I want.

So, imagine a scenario where someone has downloaded the newsletter from my output, changes something about it in Acrobat, and sends it back to my developer, who blindly pastes it on top of the original input. So two things have gone wrong: the newsletter writer didn’t update their original newsletter, and my developer brought in or overwrote an input file with what is now an edited output file.

What I am wondering is, if I had two files in front of me that look exactly the same visually except one is from the input and the other is from the output, how would I tell them apart? Similarly, let’s say my project source files are on a computer burned in a fire, and now all I have is output and I need to reconstruct my project - what, if anything, would need to happen to those PDFs?

I don’t know enough about what RoboHelp does to turn a source version of the file into a searchable version.

No, it’s not normal, and no, it’s not a good process - there are a myriad of problems with how this has come to happen. I just want to explain how my people can look out for it, and how to fix it if it happens.

Community Expert
August 6, 2018

I don't think Robohelp does anything to pdf baggage files, except to copy them to the output folder when you generate the project. The only thing you can control is whether searching in Webhelp or HTML5 help will show pdf files in the webhelp/html5 search results. That's a setting in your SSL/Output settings (Search tab). You might also find the "Exclude Unreferenced Baggage Files from Search" useful - it will make sure baggage files that aren't explicitly linked (in the TOC or a topic) aren't included in the search.
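If Robohelp really does only copy pdf baggage files, then the copy in the output folder should be byte-for-byte identical to the copy in the project, and a file hash is one way to check that. This is only a rough sketch under that assumption, using the Python standard library; the function names `sha256_of` and `same_bytes` are made up for illustration and have nothing to do with Robohelp or Acrobat.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read 64 KiB at a time so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(first_pdf: Path, second_pdf: Path) -> bool:
    """True if the two files have identical content (hypothetical helper)."""
    return sha256_of(first_pdf) == sha256_of(second_pdf)
```

A match means the incoming file is an untouched copy of the one you already have. A mismatch only tells you the bytes differ (for example, someone saved it again in Acrobat); it does not tell you what changed or which copy is the "right" one.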

For comparing two pdf files, you would need to ask that question on the Acrobat forums.

If you manually change the security settings in each pdf you receive, you could at least prevent people from updating the pdf that you have published. But it would have to be for each pdf, as you are not generating the pdf files from content within RH. And from the sound of your workflow, there is no way to stop developers from updating their original copy. Again, more specific information about PDF files would be better asked on the Acrobat forum.

Another idea is to add a prefix to the pdf files you receive. e.g. You receive acrobat_file.pdf, rename it RH_acrobat_file.pdf and import in to Robohelp. Then you could tell developers not to copy content from files called RH_, but rather follow whatever process they should be following. Use any naming convention that makes sense for your situation, and tell them not to use that naming for their original files.
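The renaming step could be scripted so nobody has to remember it. A minimal sketch, assuming the RH_ prefix from the example above; the helper names are hypothetical, and you would swap in whatever naming convention you settle on.

```python
from pathlib import Path

# Assumed convention from the example above: files imported into the
# project carry this prefix, original files from authors never do.
PREFIX = "RH_"


def is_published_copy(path: Path) -> bool:
    """True if the file name already carries the project prefix."""
    return path.name.startswith(PREFIX)


def published_name(path: Path) -> Path:
    """Return the path with the prefix added to the file name (idempotent)."""
    if is_published_copy(path):
        return path
    return path.with_name(PREFIX + path.name)


def check_incoming(path: Path) -> None:
    """Refuse an incoming file that looks like it came from the output."""
    if is_published_copy(path):
        raise ValueError(f"{path.name} looks like a published copy, ask for the original")
```

For example, `published_name(Path("acrobat_file.pdf"))` gives `RH_acrobat_file.pdf`, and `check_incoming(Path("RH_acrobat_file.pdf"))` raises an error, which flags a file that was downloaded from the output and sent back.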

If you need to rebuild your project from Output, Peter Grainge (www.grainge.org/index.htm) has a section about reverse engineering on his website. If you just needed the pdf files and not the RH files (e.g. you decided to start the project from scratch), you could just copy the pdfs from your output and use them as normal.

Peter Grainge
Community Expert
August 6, 2018

You can compare a source HTM file with an output HTM file as the code is markedly different; you only need to look at the first lines in a text editor. However, you are comparing PDFs.

A quick Google shows there are tools out there to compare PDF files but can you identify what to compare with what?

As to using the reverse engineering method on my site, I have never tested that it retrieves baggage files. It should but that's an unknown.

As to the fire scenario, I was taught to have two backups. One on site and one off site. At the time the offsite one meant a physical delivery but now we have the cloud. I have projects that are in the likes of Dropbox so there is an automatic backup. That wouldn't work with multiple authors on the same project but you'll get the idea.


See www.grainge.org for free RoboHelp and Authoring information.

@petergrainge

Community Expert
August 6, 2018

I think we need more information about your workflow.

Are the pdfs included in your webhelp as pdf files, or do you import and convert them to html topics? If included as pdfs, do you generate the pdf files from RH to include in your webhelp or do you import them as baggage files and edit within Acrobat?

I'm not sure why you would want to prevent people searching in a pdf, but I suspect you'll have to post on the Acrobat forum. If you're generating the pdfs from RH, the best I could suggest is poking around the settings you can configure for PDFs in your layout.

I'm not sure what 'coding' is, so I can't offer any suggestions.

I know there's a way to prevent editing pdfs, so you could check the pdf security settings - if you're generating the pdfs from RH you can set this information in your output settings. Otherwise, if you just edit the pdfs in Acrobat, there should be somewhere in the settings for the pdf to configure this.