
Surgically blocking crawlers w/ .htaccess

  • November 20, 2018
  • 2 replies
  • 6036 views

Is there a way for me to tell my .htaccess file to:

- Allow only specific pages to be indexed by outside crawlers/bots

- Block all crawlers/bots except Google

Basically, I have specific pages I'd like Google to index, and no one else (like archive.org).

Thanks!

    This topic has been closed for replies.

    2 replies

    Legend
    November 24, 2018

    Once you've got your robots.txt file in place, you really need to block all the bad bots specifically in .htaccess. Here's an example and list of bad bots to block in .htaccess: apache-ultimate-bad-bot-blocker/robots.txt at master · mitchellkrogza/apache-ultimate-bad-bot-blocker · GitHub. The thing with blocking garbage bots is that it's an ongoing effort. Disregard my earlier rule unless you literally want to block everything and everyone except Googlebot.

    Paul-M - Community Expert
    Under S.Author
    Inspiring
    November 25, 2018

    Energize wrote:

    Once you've got your robots.txt file in place, you really need to block all the bad bots specifically in .htaccess. Here's an example and list of bad bots to block in .htaccess: apache-ultimate-bad-bot-blocker/robots.txt at master · mitchellkrogza/apache-ultimate-bad-bot-blocker · GitHub

    Whoa, that is one long list. But it's literally a robots.txt file, I thought you said to never mind that file and use .htaccess? So confused. Do you want me to put the robots.txt content in my .htaccess file? I noticed archive.org isn't on the list, either. And that's the first one I want blocked.

    I'm still not sure what the possible advantage of not blocking every bot except Google (and maybe Bing) is. What am I missing out on by allowing only those two to crawl my sites? In other words, in what ways would I be crippling myself or my business? I'm sure there's a reason, since people are going to the trouble of maintaining "bad bot" lists, which suggests there are good ones too. I'm just wondering whether the "good bots" are worth the filtering effort, once you've cleared Google and Bing for entry.

    Nancy OShea
    Community Expert
    November 25, 2018

    If you have legitimate concerns about hostile bots taking your sites down, you need better server security. You can't possibly do all this yourself. The list of potential threats is too massive and growing all the time.

    I use Secure Live real-time server monitoring. When SL identifies a potential threat, the IP is blocked and a copy of the report is sent to law enforcement agencies. In the beginning, I received 3-4 threat reports per week. Now I get 1-2 reports per month. So it works.

    https://securelive.com/

    Nancy O'Shea— Product User & Community Expert
    Paul-M (Correct answer)
    Legend
    November 20, 2018

    You could start with a robots.txt file: Meta Robots Tag & Robots.txt Tutorial for Google, Bing & Other Search Engines. I believe the user agent for the archive.org bot is: ia_archiver
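
    For archive.org specifically, the robots.txt entry for that user agent would be just the following (assuming the crawler still honors robots.txt exclusions):

    User-agent: ia_archiver
    Disallow: /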

    That will at least allow you to control bots that obey the robots.txt standard. Obviously the naughty bots, like email harvesters, will ignore it and go ahead anyway; then you're in the world of trying to block by IP address and/or user-agent in .htaccess, and it's a game of cat and mouse.

    Good web hosts are pretty efficient at blocking a lot of the junk. If you're on a dedicated server and doing it yourself, good luck; this resource might be handy: Bad Bots user-agent / bot

    #Example blocking by user-agent in .htaccess

    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (BadBot|EmailGrabber|NaughtyBot) [NC]
    RewriteRule (.*) - [F,L]

    #Block some by IP address (example addresses from the RFC 5737 documentation ranges)

    RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.10$ [OR]
    RewriteCond %{REMOTE_ADDR} ^198\.51\.100\.20$ [OR]
    RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.30$
    RewriteRule (.*) - [F,L]
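
    Since the original question was blocking every crawler except Google, a minimal .htaccess sketch in the same style might look like the one below. This is a sketch only: the keyword pattern is illustrative, not a complete list, and a user-agent string can be spoofed, so it's no guarantee.

    #Sketch: refuse requests whose user agent looks like a crawler, unless it is Googlebot
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (bot|crawler|spider|ia_archiver) [NC]
    RewriteCond %{HTTP_USER_AGENT} !Googlebot [NC]
    RewriteRule (.*) - [F,L]

    Genuine Googlebot requests can be verified with a reverse DNS lookup on the client IP, since anyone can send that user-agent string.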
    Paul-M - Community Expert
    Under S.Author
    Inspiring
    November 22, 2018

    I see there's a robots.txt generator there, but it doesn't seem to like my browser very much.

    Are there no pre-filled robots.txt files out there? Maybe even recommended ones, like there are block lists for Twitter?

    I would imagine "block everything known to man but Google" would be a popular one.

    Legend
    November 22, 2018

    Say you wanted to allow only Google to crawl your entire site and disallow all other bots; robots.txt would look like this (Google's crawler identifies itself as Googlebot):

    User-agent: Googlebot
    Disallow:

    User-agent: *
    Disallow: /
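
    To match the original ask (only specific pages indexed, and only by Google), Googlebot also understands Allow: lines, which are not part of the original robots.txt standard but are supported by the major search engines. A sketch, with hypothetical placeholder paths:

    User-agent: Googlebot
    Allow: /page-one.html
    Allow: /page-two.html
    Disallow: /

    User-agent: *
    Disallow: /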

    Paul-M - Community Expert