
Surgically blocking crawlers w/ .htaccess

  • November 20, 2018
  • 2 replies
  • 6036 views

Is there a way for me to tell my .htaccess file to:

- Allow only specific pages to be indexed by outside crawlers/bots

- Block all crawlers/bots except Google

Basically, I have specific pages I'd like Google to index, and no one else (like archive.org).

Thanks!

    This topic has been closed for replies.

    2 replies

    Legend
    November 24, 2018

    Once you've got your robots.txt file in place, you really need to block all the bad bots specifically in .htaccess. Here's an example and list of bad bots to block in .htaccess: apache-ultimate-bad-bot-blocker/robots.txt at master · mitchellkrogza/apache-ultimate-bad-bot-blocker · GitHub. The thing with blocking garbage bots is that it's an ongoing effort. Disregard my earlier rule unless you literally want to block everything and everyone except Googlebot.

    Paul-M - Community Expert
    Under S.Author
    Inspiring
    November 25, 2018

    Energize wrote:

    Once you've got your robots.txt file in place, you really need to block all the bad bots specifically in .htaccess. Here's an example and list of bad bots to block in .htaccess: apache-ultimate-bad-bot-blocker/robots.txt at master · mitchellkrogza/apache-ultimate-bad-bot-blocker · GitHub

    Whoa, that is one long list. But it's literally a robots.txt file, I thought you said to never mind that file and use .htaccess? So confused. Do you want me to put the robots.txt content in my .htaccess file? I noticed archive.org isn't on the list, either. And that's the first one I want blocked.

    I'm still not sure what the possible advantage of not blocking every bot except Google (and maybe Bing) is. What am I missing out on by allowing only those two to crawl my sites? In other words, in what ways would I be crippling myself or my business? I'm sure there's a reason, since people are going to the trouble of maintaining "bad bot" lists, which suggests there are good ones too. I'm just wondering whether the "good bots" are worth the filtering effort, once you've cleared Google and Bing for entry.

    Nancy OShea
    Community Expert
    November 25, 2018

    If you have legitimate concerns about hostile bots taking your sites down, you need better server security. You can't possibly do all this yourself. The list of potential threats is too massive and growing all the time.

    I use Secure Live real-time server monitoring. When SL identifies a potential threat, the IP is blocked and a copy of the report is sent to law enforcement agencies. In the beginning, I received 3-4 threat reports per week. Now I get 1-2 reports per month. So it works.

    https://securelive.com/

    Nancy O'Shea— Product User & Community Expert
    Paul-M (Correct answer)
    Legend
    November 20, 2018

    You could start with a robots.txt file: Meta Robots Tag & Robots.txt Tutorial for Google, Bing & Other Search Engines. I believe the user agent for the archive.org bot is: ia_archiver
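
    For archive.org specifically, the robots.txt entry for that user agent would be just the following (assuming the crawler still honors robots.txt exclusions):

    User-agent: ia_archiver
    Disallow: /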

    That will at least allow you to control bots that obey the robots.txt standard. Obviously the naughty bots, like email harvesters, will ignore it and go ahead anyway; then you're in the world of trying to block by IP address and/or user-agent in .htaccess, and it's a game of cat and mouse.

    Good web hosts are pretty efficient at blocking a lot of the junk. If you're on a dedicated server and doing it yourself, good luck; this resource might be handy: Bad Bots user-agent / bot

    #Example blocking by user-agent in .htaccess

    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (BadBot|EmailGrabber|NaughtyBot) [NC]
    RewriteRule (.*) - [F,L]

    #Block some by IP address (example addresses from the RFC 5737 documentation ranges)

    RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.10$ [OR]
    RewriteCond %{REMOTE_ADDR} ^198\.51\.100\.20$ [OR]
    RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.30$
    RewriteRule (.*) - [F,L]
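
    Since the original question was blocking every crawler except Google, a minimal .htaccess sketch in the same style might look like the one below. This is a sketch only: the keyword pattern is illustrative, not a complete list, and a user-agent string can be spoofed, so it's no guarantee.

    #Sketch: refuse requests whose user agent looks like a crawler, unless it is Googlebot
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (bot|crawler|spider|ia_archiver) [NC]
    RewriteCond %{HTTP_USER_AGENT} !Googlebot [NC]
    RewriteRule (.*) - [F,L]

    Genuine Googlebot requests can be verified with a reverse DNS lookup on the client IP, since anyone can send that user-agent string.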
    Paul-M - Community Expert
    Under S.Author
    Inspiring
    November 22, 2018

    I see there's a robots.txt generator there, but it doesn't seem to like my browser very much.

    Are there no pre-filled robots.txt files out there? Maybe even recommended ones, like there are block lists for Twitter?

    I would imagine "block everything known to man but Google" would be a popular one.

    Legend
    November 22, 2018

    Say you wanted to allow only Google to crawl your entire site and disallow all other bots; robots.txt would look like this (Google's crawler identifies itself as Googlebot):

    User-agent: Googlebot
    Disallow:

    User-agent: *
    Disallow: /
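
    To match the original ask (only specific pages indexed, and only by Google), Googlebot also understands Allow: lines, which are not part of the original robots.txt standard but are supported by the major search engines. A sketch, with hypothetical placeholder paths:

    User-agent: Googlebot
    Allow: /page-one.html
    Allow: /page-two.html
    Disallow: /

    User-agent: *
    Disallow: /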

    Paul-M - Community Expert