
Surgically blocking crawlers w/ .htaccess

Engaged, Nov 19, 2018

Is there a way for me to tell my .htaccess file to:

- Allow only specific pages to be indexed by outside crawlers/bots

- Block all crawlers/bots except Google

Basically, I have specific pages I'd like Google to index, and no one else (like archive.org)

Thanks!

Community Expert, Nov 19, 2018 (Correct answer)

You could start with a robots.txt file: Meta Robots Tag & Robots.txt Tutorial for Google, Bing & Other Search Engines. I believe the user agent for archive.org's bot is ia_archiver.

That will at least allow you to control bots that obey the robots.txt standard. Obviously the naughty bots, like email harvesters, will ignore it and go ahead anyway; then you're in the world of trying to block by IP address and/or user-agent in .htaccess, and it's a game of cat and mouse.

Good web hosts are pretty efficient at blocking a lot of the junk. If you're on a dedicated server and doing it yourself, good luck; this resource might be handy: Bad Bots user-agent / bot

#Example blocking by user-agent in htaccess

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (BadBot|EmailGrabber|NaughtyBot) [NC]
RewriteRule (.*) - [F,L]

#Block some by IP addresses (placeholder addresses shown; replace with real ones)
RewriteCond %{REMOTE_ADDR} ^999\.999\.999\.999 [OR]
RewriteCond %{REMOTE_ADDR} ^911\.911\.911\.911 [OR]
RewriteCond %{REMOTE_ADDR} ^111\.222\.333\.444
RewriteRule (.*) - [F,L]
Paul-M - Community Expert

Engaged, Nov 21, 2018

I see there's a robots.txt generator there, but it doesn't seem to like my browser very much.

Are there no pre-filled robots.txt files out there? Maybe even recommended ones, like there are block lists for Twitter?

I would imagine "block everything known to man but Google" would be a popular one.

Community Expert, Nov 21, 2018

Say you wanted to allow only Google to crawl your entire site and disallow all other bots; robots.txt would look like this:

User-agent: Google
Disallow:

User-agent: *
Disallow: /

Paul-M - Community Expert

Engaged, Nov 22, 2018

Energize wrote:

Say you wanted to allow only Google to crawl your entire site and disallow all other bots; robots.txt would look like this:

User-agent: Google
Disallow:

User-agent: *
Disallow: /

That's a great starter to build and learn on, thanks.

Two questions:

  1. Say I wanted to modify that so that Google would only be able to index 4 specific landing pages, and not the other folders or files it might be able to learn about via browser referencing or bot crawling... how would I do that?
  2. I understand that "bad" robots won't obey anything, but with regard to archive.org specifically, will it respect a "disallow all" or does the rule have to be more specific to them?

Thanks!

Community Expert, Nov 22, 2018

Slight amendment to that: Google's user agent is 'Googlebot', so:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

Paul-M - Community Expert

Engaged, Nov 22, 2018

Energize wrote:

Slight amendment to that: Google's user agent is 'Googlebot', so:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

Thanks for the correction.

I also had 2 more questions for you in the reply just before the one you just added.

Community Expert, Nov 22, 2018

I'm pretty sure archive.org's crawler will obey the robots.txt instruction.

How many files and folders do you want to stop Google from crawling? You can do it like this:

#Ask Google NOT to crawl these areas of the website
User-agent: Googlebot
Disallow: /private-subfolder/
Disallow: /admin-subfolder/
Disallow: /private-page.php

#Tell all other robots not to crawl the website
User-agent: *
Disallow: /
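
If you want to be explicit about archive.org's crawler on top of the catch-all (earlier in the thread its user agent was said to be ia_archiver), a hedged extra stanza would look like this:

#Explicitly tell the Internet Archive's crawler to stay away (redundant with the catch-all above, but harmless)
User-agent: ia_archiver
Disallow: /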

Paul-M - Community Expert

Community Expert, Nov 22, 2018

One other caveat:

If you want to stop bots from crawling sensitive areas of the website that could be a target for hackers, like login areas, I would use the .htaccess option rather than robots.txt, which is public. Would-be hackers could use the robots.txt file to identify sensitive areas of your site.
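
A hedged sketch of that .htaccess option, reusing the user-agent rule from earlier in the thread (the folder name and bot names here are placeholders, not a vetted list):

#Keep listed bots out of a sensitive folder without ever naming it in a public robots.txt
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (BadBot|EmailGrabber|ia_archiver) [NC]
RewriteCond %{REQUEST_URI} ^/members-area/ [NC]
RewriteRule (.*) - [F,L]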

Paul-M - Community Expert

Engaged, Nov 22, 2018

Energize wrote:

One other caveat:

If you want to stop bots from crawling sensitive areas of the website that could be a target for hackers, like login areas, I would use the .htaccess option rather than robots.txt, which is public. Would-be hackers could use the robots.txt file to identify sensitive areas of your site.

If there's no point to a robots.txt file, and the same can be achieved w/ .htaccess, then let's not use one at all.

How do I use .htaccess to let only Googlebot in, and ONLY to index 12 specific files (or if I group them in a folder, only to that folder and its contents, whichever's easier)? I don't even know if this is doable, I'm venturing far out of my comfort zone even touching .htaccess

Thx!

Community Expert, Nov 22, 2018

If there's no point to a robots.txt file, and the same can be achieved w/ .htaccess, then let's not use one at all.

How do I use .htaccess to let only Googlebot in, and ONLY to index 12 specific files (or if I group them in a folder, only to that folder and its contents, whichever's easier)? I don't even know if this is doable, I'm venturing far out of my comfort zone even touching .htaccess

Group them in one folder and try this in .htaccess:

RewriteCond %{HTTP_USER_AGENT} !Googlebot [NC]
RewriteCond %{REQUEST_URI} ^/FolderName [NC]
RewriteRule .* - [R=403,L]

Change 'FolderName' to the name of the folder you group the files in.
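
If grouping everything into one folder isn't practical, a hedged variation of the same rule can match a short list of named files instead (the file names below are made up):

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} !Googlebot [NC]
RewriteCond %{REQUEST_URI} ^/(page-one|page-two|page-three)\.php$ [NC]
RewriteRule .* - [R=403,L]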

Paul-M - Community Expert

Engaged, Nov 23, 2018

Energize wrote:

If there's no point to a robots.txt file, and the same can be achieved w/ .htaccess, then let's not use one at all.

How do I use .htaccess to let only Googlebot in, and ONLY to index 12 specific files (or if I group them in a folder, only to that folder and its contents, whichever's easier)? I don't even know if this is doable, I'm venturing far out of my comfort zone even touching .htaccess

Group them in one folder and try this in .htaccess:

RewriteCond %{HTTP_USER_AGENT} !Googlebot [NC]
RewriteCond %{REQUEST_URI} ^/FolderName [NC]
RewriteRule .* - [R=403,L]

Change 'FolderName' to the name of the folder you group the files in.

So it's do-able? I honestly thought I was pushing it with that request.

What are the downsides of doing this, however obvious they may seem to you superhero-types? Hypothetically speaking, say these pages display images that are located one folder over... can Google index those images in their Image Search Results? They're not located in the 'allowed' folder, but they are requested by pages located there.

Totally unrelated follow-up question : I've been carrying a bit of code in my .htaccess file that was recommended to me by another one of you superhero types many years ago. Could you tell me how relevant it is for me to keep this in 2018? I can't even remember the original reason for it.

AddType text/x-component .htc

RewriteCond %{HTTP_USER_AGENT} Wget [OR]
RewriteCond %{HTTP_USER_AGENT} CherryPickerSE [OR]
RewriteCond %{HTTP_USER_AGENT} CherryPickerElite [OR]
RewriteCond %{HTTP_USER_AGENT} EmailCollector [OR]
RewriteCond %{HTTP_USER_AGENT} EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ExtractorPro
RewriteRule ^.*$ X.html

Community Expert, Nov 23, 2018

What are the downsides of doing this, however obvious they may seem to you superhero-types?  Hypothetically speaking, say these pages display images that are located one folder over... can Google index those images in their Image Search Results?

You're not blocking Google anyway; the rule matches all user agents that are NOT Googlebot, so no need to worry. Google will crawl pages and images OK.

Paul-M - Community Expert

Engaged, Nov 23, 2018

Energize wrote:

What are the downsides of doing this, however obvious they may seem to you superhero-types?  Hypothetically speaking, say these pages display images that are located one folder over... can Google index those images in their Image Search Results?

You're not blocking Google anyway; the rule matches all user agents that are NOT Googlebot, so no need to worry. Google will crawl pages and images OK.

What if I only want the 12 PHP files located in the 'allowed' folder to be "index-able" by Google? Wasn't that the point of this exercise? Are you saying Google can still crawl all over the place and index everything it finds, or only what the files in the protected zone are linking to? Apologies for being so slow, I'm having a hard time seeing the line of what's protected and what isn't.

My goal - if even doable - is to have ONLY Google indexing ONLY 12 specific files (.php's) in its search results. Meaning, if I can avoid having my graphics, videos or text files indexed in Google's search results, great. Unless, as a rule, all the images referenced by files in the protected folder - even if those images themselves are not located in said folder - become index-able by Google regardless of protection... in which case, I'll just eat it, as they say.

Community Expert, Nov 23, 2018

I think in the first instance it'd be easier for you to learn and start with a robots.txt file:

#Tell Google to only crawl one specific folder
User-agent: Googlebot
Allow: /SomeFolder/
Disallow: /

#All other bots go away
User-agent: *
Disallow: /
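
If the worry is images turning up in Google Images (as raised above), Google's image crawler has its own user-agent token, so a hedged extra stanza in the same robots.txt could be (the folder name is a placeholder):

#Keep Google's image crawler out of the media folder
User-agent: Googlebot-Image
Disallow: /images/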

Paul-M - Community Expert

Community Expert, Nov 23, 2018

Once you've got your robots.txt file in place, you really need to block all the bad bots specifically in .htaccess. Here's an example and list of bad bots to block in .htaccess: apache-ultimate-bad-bot-blocker/robots.txt at master · mitchellkrogza/apache-ultimate-bad-bot-blocke... The thing with blocking garbage bots is that it's an ongoing effort. Disregard my earlier .htaccess rule unless you literally want to block everything and everyone except Googlebot.
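
As a rough sketch of that kind of .htaccess block (the bot names below are examples only, not a vetted list), one common pattern on Apache 2.4 uses SetEnvIfNoCase to flag bad user agents and then deny anything flagged:

#Flag known junk crawlers by user agent (example names only)
SetEnvIfNoCase User-Agent "EmailCollector" bad_bot
SetEnvIfNoCase User-Agent "WebStripper" bad_bot
SetEnvIfNoCase User-Agent "ia_archiver" bad_bot

#Deny anything flagged above
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>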

Paul-M - Community Expert

Engaged, Nov 24, 2018

Energize wrote:

Once you've got your robots.txt file in place, you really need to block all the bad bots specifically in .htaccess. Here's an example and list of bad bots to block in .htaccess: apache-ultimate-bad-bot-blocker/robots.txt at master · mitchellkrogza/apache-ultimate-bad-bot-blocke...

Whoa, that is one long list. But it's literally a robots.txt file, I thought you said to never mind that file and use .htaccess? So confused. Do you want me to put the robots.txt content in my .htaccess file? I noticed archive.org isn't on the list, either. And that's the first one I want blocked.

I'm still not sure what the possible advantage of not blocking every bot except Google (and maybe Bing) is. What am I missing out on by allowing only those two to crawl my sites? In other words, in which ways would I be crippling myself or my business? I'm sure there's a reason, since people are going through the trouble of maintaining "bad bot" lists, thus suggesting there are good ones. I'm just wondering whether the "good bots" are worth the filter, once you've cleared Google & Bing for entry.

Community Expert, Nov 25, 2018

If you have legitimate concerns about hostile bots taking your sites down, you need better server security. You can't possibly do all this yourself. The list of potential threats is too massive and growing all the time.

I use Secure Live real-time server monitoring. When SL identifies a potential threat, the IP is blocked and a copy of the report is sent to law enforcement agencies. In the beginning, I received 3-4 threat reports per week. Now I get 1-2 reports per month. So it works.

https://securelive.com/

Nancy O'Shea— Product User, Community Expert & Moderator
Alt-Web Design & Publishing ~ Web : Print : Graphics : Media

Engaged, Nov 25, 2018

https://forums.adobe.com/people/Nancy+OShea wrote:

If you have legitimate concerns about hostile bots taking your sites down, you need better server security. You can't possibly do all this yourself. The list of potential threats is too massive and growing all the time.

I use Secure Live real-time server monitoring. When SL identifies a potential threat, the IP is blocked and a copy of the report is sent to law enforcement agencies. In the beginning, I received 3-4 threat reports per week. Now I get 1-2 reports per month. So it works. https://securelive.com/

I couldn't even tell you what threats I'm protecting myself from with a robots.txt file at all. I just couldn't think of a good reason to let ANY bot (not called Googlebot or Bing) crawl all over my websites, and asked what I'd be sacrificing by doing so. I haven't gotten an answer yet, so it looks like I'm just going to block everything but Googlebot + Bing via .htaccess and see how that goes (unless you advise against that and can tell me why).

As for commercial solutions, I'll wait until I'm actually attacked -- this is just me informing myself in a more general sense, while being proactive even if I have no immediate reason to be. (It's just going to be a graphic portfolio site, no one is actually going to care enough to harm it.)

Community Expert, Nov 25, 2018

https://forums.adobe.com/people/Under+S. wrote:

I just couldn't think of a good reason to let ANY bot (not called Googlebot or Bing) crawl all over my websites, and asked what I'd be sacrificing by doing so.

Google does gather information from other, lesser search engines. Blocking all bots could have SEO impact. How much is something you'll need to monitor.

Nancy O'Shea— Product User, Community Expert & Moderator
Alt-Web Design & Publishing ~ Web : Print : Graphics : Media

Engaged, Nov 26, 2018

https://forums.adobe.com/people/Nancy+OShea wrote:

https://forums.adobe.com/people/Under+S. wrote:

I just couldn't think of a good reason to let ANY bot (not called Googlebot or Bing) crawl all over my websites, and asked what I'd be sacrificing by doing so.

Google does gather information from other, lesser search engines. Blocking all bots could have SEO impact. How much is something you'll need to monitor.

If that's true, then blocking everything BUT Googlebot might red-flag me BY Googlebot for suspicious activity, which is what I was hoping to avoid by giving Google (and only Google) the keys to the place... right? I mean, if Google is comparing notes with lesser engines.

I wouldn't mind backtracking on the idea of only letting Google in, if I could limit the scope of what ALL the crawlers can find to just the 12 specific urls.

In other words, I don't need Google Images (or anyone else) hotlinking individual files, like JPGs or other media. If there's a zip file I temporarily placed in the root folder for someone to pick up, or a text file I forgot to clean up from an older folder on the server, I don't want them indexed for everyone to click on. Just the 12 official urls for the 12 pages on the site that are meant for public consumption.

Energize helped me remove /pages from all the urls, thus not only masking those pages' true locations, but shortening the urls to only what's necessary. However, the longer urls still work (so there are 2 ways to access each file), so I'm pretty sure any crawler's going to pick up on that unless I find a way to limit the scope of the crawlers to only those shortened urls.
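
One hedged way to close that gap (assuming the URL-shortening was done with a rewrite in .htaccess) would be to 301-redirect the long form to the short form, so crawlers only ever see one address per page:

RewriteEngine On
#Send externally requested /pages/whatever URLs to the short /whatever URL
RewriteCond %{THE_REQUEST} \s/pages/([^\s?]+) [NC]
RewriteRule ^pages/(.*)$ /$1 [R=301,L]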

(Note that I am using 12 as an arbitrary number right now, the site has 3 pages and will have 2 more by February... even with the occasional article I plan to put up, I don't see the site exceeding 12 pages in the next year.)

Community Expert, Nov 26, 2018

If you don't want bots to find your media, put it behind a password-protected barrier.
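
A minimal sketch of that kind of barrier using Apache Basic authentication (the realm name and password-file path are placeholders), placed in an .htaccess file inside the media folder:

#Ask for a username and password before serving anything in this folder
AuthType Basic
AuthName "Private media"
AuthUserFile /home/youraccount/.htpasswds/media-passwd
Require valid-user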

Give search engines an XML site map with 12 URLs to follow.
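
A minimal sitemap along those lines might look like this (the domain and paths are placeholders); it would normally be saved as sitemap.xml in the site root and can also be referenced from robots.txt with a Sitemap: line:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/portfolio.php</loc></url>
  <url><loc>https://www.example.com/contact.php</loc></url>
  <!-- ...one <url> entry per public page, up to the 12 URLs -->
</urlset>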

Nancy O'Shea— Product User, Community Expert & Moderator
Alt-Web Design & Publishing ~ Web : Print : Graphics : Media
