I'm trying to figure out how we can prevent our HTML webhelp from being crawled by search engines like Google. I found these instructions while digging through some of the discussions on this forum: Stop search engine robots indexing Your private folders by ‘robots.txt’. | Internet marketing Blog
However, you would think we could add some code into the project itself in order to stop the search engines from crawling the help. We tried adding this code into our master page since the master page is applied on all topics, but the code didn't remain after the output was generated:
<meta name="robots" content="NOINDEX, NOFOLLOW" />
Does anyone know how we can prevent search engines from crawling our help?
Adding a meta tag and a robots.txt is only a courtesy. A search engine *may* decide to skip your site. But there is no guarantee.
If you really don't want your content to be indexed, you have to cut off access to it. If you require authentication (for example, by using a .htaccess file; see Htaccess Authentication - Htaccess Tools), the search engines are no longer able to index your content.
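To illustrate why robots.txt is only a courtesy, here is a minimal sketch of the check a well-behaved crawler performs before fetching a page, using Python's standard urllib.robotparser. The function name polite_crawler_may_fetch, the user agent ExampleBot, and the help.example.com URL are made up for illustration. Nothing on the server enforces this check; a crawler that never runs it can still fetch every page.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that asks every crawler to stay away from the whole site.
ROBOTS_TXT = """User-agent: *
Disallow: /
"""


def polite_crawler_may_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url.

    This is the voluntary check a compliant crawler runs on its own side.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A call such as polite_crawler_may_fetch(ROBOTS_TXT, "ExampleBot", "https://help.example.com/topic.htm") tells you what a compliant bot would decide. It says nothing about bots that ignore the file, which is why authentication is the only real block.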
I need to do this as well. There doesn't seem to be a way to do it within RoboHelp. Several sources have suggested Find/Replace to add the <meta> tag to each .htm file. Arduous and error prone. Any other thoughts?
A weird "feature" that might work for you.
Make sure there is a RoboHelp header section in your master page.
Switch to HTML view and paste the meta code between the "?rh_region_start type=header" and "?rh_region_end type=header" tags.
Save. RH automagically moves the code between the master page "head" tags.
When you generate, the meta tag will be in each page, but not within the "head" tags. You will find it further down the page, just above the first content in the topic (e.g. the topic H1). I'm not sure if the placement affects the webcrawlers, though.
Hi Amebr
Thanks for the info. I tried it, and it looks good...the meta tag moves up into the header section of the Master page. But, when I publish the webhelp, it is not in the <head> section of the .htm files. It is in the <body> section and appears as:
<div style="width: 100%; position: relative;" id="header">
<meta name="robots" content="noindex, nofollow" />
<p> </p>
</div>
When I look at the topics in the help, there is extra space at the top of the topic, above the breadcrumbs, so clearly something is there. But, it's not between the <head> and </head> tags in the .htm.
Too bad. That would have been easy.
This is what I did to get the meta tag in the right place:
- Publish the help to a designated folder (as usual).
- In RH, select Edit -> Find and Replace in Files.
- Specify </head> in the Find what field.
- Specify <meta name="robots" content="noindex, nofollow"/> </head> in the Replace with field.
- Specify the folder with the published webhelp output in the Look in field.
- Select Text file types (*.htm ; *.html ; *.txt) in the Files of type field.
- Check the Include Subfolders option.
- Click Find Next, and then Replace All.
I chose to do the Find/Replace at the top level, that is, the folder that contains all of the output (the resource folder, whdata folder, whgdata folder, etc.). This means that the meta tag is in all of the .htm files, not just the ones with the topic content. I don't think there's any harm in that.
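If repeating the Find/Replace after every publish feels arduous, the same edit could be scripted. The following is only a sketch and not a RoboHelp feature: the function name add_robots_meta is made up for illustration, and it assumes the output files use an ASCII-compatible encoding such as UTF-8 (it works on raw bytes so that line endings and other characters are left untouched). It skips files that already contain a robots meta tag, so running it twice should not duplicate the tag.

```python
import re
from pathlib import Path

# The tag to insert, as bytes so the files are never decoded or re-encoded.
META_TAG = b'<meta name="robots" content="noindex, nofollow" />'

# Matches the closing head tag in any letter case.
HEAD_CLOSE = re.compile(rb"</head\s*>", re.IGNORECASE)

# Matches an existing robots meta tag whose name attribute comes first or later.
HAS_ROBOTS = re.compile(rb"<meta[^>]*name\s*=\s*[\"']robots[\"']", re.IGNORECASE)


def add_robots_meta(output_dir):
    """Insert META_TAG before </head> in every .htm/.html file under output_dir.

    Subfolders are included. Returns the list of files that were changed.
    """
    changed = []
    for path in sorted(Path(output_dir).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in {".htm", ".html"}:
            continue
        data = path.read_bytes()
        if HAS_ROBOTS.search(data):
            continue  # already tagged, so leave the file alone
        match = HEAD_CLOSE.search(data)
        if match is None:
            continue  # no </head> in this file, nothing to anchor on
        pos = match.start()
        path.write_bytes(data[:pos] + META_TAG + b"\n" + data[pos:])
        changed.append(path)
    return changed
```

You would call it with the top-level folder of the published output, for example add_robots_meta("published_webhelp") (a hypothetical folder name), and it is worth trying on a copy of the output first.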
Now I need to get the meta tag in the head section of the .htm files of the responsive HTML5 output from FrameMaker. Any thoughts on that?
Yeah, as I said, not in the head, but I don't know enough about the web side to know how much of a problem that is/isn't.
You can add the code into the screen layout, although that can be a little hairy. It would need to go into every .slp file, I believe. Willam van Weelden might be able to offer more advice.
Thanks for the suggestions! Nice to have a place to knock ideas around.
I put the meta tag into the head area of the Screen Layout for topics (Topic.slp). In RH HTML view, the tag is in the correct place. When I open Topic.slp in Notepad, it's in the correct place. But, when I generate the webhelp, it is inserted in the body as:
<div style="width: 100%; position: relative;" id="header">
<p> </p>
<meta name="robots" content="noindex, nofollow" />
</div>
Perhaps Willam van Weelden will have another idea.
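When testing different places to put the tag, it can help to check a generated file automatically rather than reading the source by eye. Here is a rough sketch using Python's standard html.parser; the names RobotsMetaLocator and locate_robots_meta are made up for illustration. It only reports whether each robots meta tag sits between the head tags as written in the file. It does not tell you how any particular search engine treats a tag found in the body.

```python
from html.parser import HTMLParser


class RobotsMetaLocator(HTMLParser):
    """Record whether each robots meta tag appears inside <head> or outside it."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.locations = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag == "head":
            self.in_head = True
        elif tag == "meta":
            name = dict(attrs).get("name") or ""
            if name.lower() == "robots":
                self.locations.append("head" if self.in_head else "outside head")

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def locate_robots_meta(html_text):
    """Return a list with one entry ("head" or "outside head") per robots meta tag."""
    parser = RobotsMetaLocator()
    parser.feed(html_text)
    parser.close()
    return parser.locations
```

An empty list means the file has no robots meta tag at all, and any "outside head" entry points to the situation described above, where the tag ends up inside the header div in the body.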
Ah oops. I missed the bit about webhelp. The screen layouts are for Multiscreen or Responsive HTML5 output so updating them won't result in a change in webhelp. What you are seeing would be the code you added to the master page before.
I don't know if you can update the webhelp skin in the same way as the screen layouts, sorry.
The masterpage header won't work for this. Personally, I would also do a find and replace in the output. That's the fastest way.
Just remember that search engines not indexing your site based on meta tags is a courtesy; it doesn't block bots completely. Only the nice guys such as Google will listen. Not even a robots.txt will block crawlers. (For example, see: Learn about robots.txt files - Search Console Help.) If you really don't want unauthorised access, you have to force authentication on your server.
Thanks! This helps confirm that my company needs to work on forcing an authentication, which we're trying to do.

