Is there a way for me to tell my .htaccess file to :
- Allow only specific pages to be indexed by outside crawlers/bots
- Block all crawlers/bots except Google
Basically, I have specific pages I'd like Google to index, and no one else (like archive.org)
Thanks!
You could start with a robots.txt file: Meta Robots Tag & Robots.txt Tutorial for Google, Bing & Other Search Engines. I believe the user agent for the archive.org bot is: ia_archiver
That will at least let you control the bots that obey the robots.txt standard. Obviously the naughty bots, like email harvesters, will ignore it and go ahead anyway; then you're in the world of trying to block by IP address and/or user-agent in .htaccess, and it's a game of cat and mouse.
Good web hosts are pretty efficient at blocking a lot of the junk. If you're on a dedicated server and doing it yourself, good luck; this resource might be handy: Bad Bots user-agent / bot
#Example blocking by user-agent in htaccess
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (BadBot|EmailGrabber|NaughtyBot) [NC]
RewriteRule (.*) - [F,L]
#Block some by IP address (these are placeholder addresses from the documentation ranges - replace with the real IPs you want to ban)
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.1 [OR]
RewriteCond %{REMOTE_ADDR} ^198\.51\.100\.2 [OR]
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.3
RewriteRule (.*) - [F,L]
I see there's a robots.txt generator there, but it doesn't seem to like my browser very much.
Are there no pre-filled robots.txt files out there? Maybe even recommended ones, like there are block lists for Twitter?
I would imagine "block everything known to man but Google" would be a popular one.
Say you wanted to allow only Google to crawl your entire site and disallow all other bots; robots.txt would look like this:
User-agent: Google
Disallow:
User-agent: *
Disallow: /
Energize wrote
Say you wanted to allow only Google to crawl your entire site and disallow all other bots; robots.txt would look like this:
User-agent: Google
Disallow:
User-agent: *
Disallow: /
That's a great starter to build and learn on, thanks.
Two questions:
Thanks!
Slight amendment to that: Google's user agent is 'Googlebot', so:
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
Energize wrote
Slight amendment to that: Google's user agent is 'Googlebot', so:
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
Thanks for the correction.
I also had 2 more questions for you in the reply just before the one you just added.
I'm pretty sure archive.org's crawler will obey the robots.txt instruction.
How many files and folders do you want to stop Google crawling? You can do it like this:
#Ask Google NOT to crawl these areas of the website
User-agent: Googlebot
Disallow: /private-subfolder/
Disallow: /admin-subfolder/
Disallow: /private-page.php
#Tell all other robots not to crawl the website
User-agent: *
Disallow: /
One other caveat:
If you want to stop bots crawling sensitive areas of the website that could be a target for hackers, like login areas, I would use the .htaccess option rather than robots.txt, which is public. Would-be hackers could use the robots.txt file to identify sensitive areas of your site.
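As a hedged sketch of that .htaccess option (Apache 2.4+ syntax; the folder name is hypothetical), you can drop a one-line .htaccess file inside the sensitive folder itself, so nothing in it is ever served to anyone, crawler or human:

```apache
# .htaccess placed inside the sensitive folder itself (e.g. /admin-subfolder/)
# Apache 2.4+: refuse all web access to this folder and its contents
Require all denied
```

On Apache 2.2 the equivalent is `Order deny,allow` followed by `Deny from all`. Unlike a robots.txt entry, this blocks access outright and doesn't advertise the folder's existence.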
Energize wrote
One other caveat:
If you want to stop bots crawling sensitive areas of the website that could be a target for hackers, like login areas, I would use the .htaccess option rather than robots.txt, which is public. Would-be hackers could use the robots.txt file to identify sensitive areas of your site.
If there's no point to a robots.txt file, and the same can be achieved w/ .htaccess, then let's not use one at all.
How do I use .htaccess to let only Googlebot in, and ONLY to index 12 specific files (or if I group them in a folder, only to that folder and its contents, whichever's easier)? I don't even know if this is doable, I'm venturing far out of my comfort zone even touching .htaccess
Thx!
If there's no point to a robots.txt file, and the same can be achieved w/ .htaccess, then let's not use one at all.
How do I use .htaccess to let only Googlebot in, and ONLY to index 12 specific files (or if I group them in a folder, only to that folder and its contents, whichever's easier)? I don't even know if this is doable, I'm venturing far out of my comfort zone even touching .htaccess
Group the files in one folder and try this in .htaccess:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} !Googlebot [NC]
RewriteCond %{REQUEST_URI} ^/FolderName [NC]
RewriteRule .* - [R=403,L]
Change 'FolderName' to the name of the folder you group the files in.
Energize wrote
If there's no point to a robots.txt file, and the same can be achieved w/ .htaccess, then let's not use one at all.
How do I use .htaccess to let only Googlebot in, and ONLY to index 12 specific files (or if I group them in a folder, only to that folder and its contents, whichever's easier)? I don't even know if this is doable, I'm venturing far out of my comfort zone even touching .htaccess
Group the files in one folder and try this in .htaccess:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} !Googlebot [NC]
RewriteCond %{REQUEST_URI} ^/FolderName [NC]
RewriteRule .* - [R=403,L]
Change 'FolderName' to the name of the folder you group the files in.
So it's do-able? I honestly thought I was pushing it with that request.
What are the downsides of doing this, however obvious they may seem to you superhero-types? Hypothetically speaking, say these pages display images that are located one folder over... can Google index those images in their Image Search Results? They're not located in the 'allowed' folder, but they are requested by pages located there.
Totally unrelated follow-up question : I've been carrying a bit of code in my .htaccess file that was recommended to me by another one of you superhero types many years ago. Could you tell me how relevant it is for me to keep this in 2018? I can't even remember the original reason for it.
AddType text/x-component .htc
RewriteCond %{HTTP_USER_AGENT} Wget [OR]
RewriteCond %{HTTP_USER_AGENT} CherryPickerSE [OR]
RewriteCond %{HTTP_USER_AGENT} CherryPickerElite [OR]
RewriteCond %{HTTP_USER_AGENT} EmailCollector [OR]
RewriteCond %{HTTP_USER_AGENT} EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ExtractorPro
RewriteRule ^.*$ X.html
What are the downsides of doing this, however obvious they may seem to you superhero-types? Hypothetically speaking, say these pages display images that are located one folder over... can Google index those images in their Image Search Results?
You're not blocking Google anyway; the rule matches all user-agents that are NOT Googlebot, so no need to worry. Google will crawl pages and images OK.
Energize wrote
What are the downsides of doing this, however obvious they may seem to you superhero-types? Hypothetically speaking, say these pages display images that are located one folder over... can Google index those images in their Image Search Results?
You're not blocking Google anyway; the rule matches all user-agents that are NOT Googlebot, so no need to worry. Google will crawl pages and images OK.
What if I only want the 12 PHP files located in the 'allowed' folder to be "index-able" by Google? Wasn't that the point of this exercise? Are you saying Google can still crawl all over the place and index everything it finds, or only what the files in the protected zone are linking to? Apologies for being so slow, I'm having a hard time seeing the line of what's protected and what isn't.
My goal - if even doable - is to have ONLY Google indexing ONLY 12 specific files (.php's) in its search results. Meaning, if I can avoid having my graphics, videos or text files indexed in Google's search results, great. Unless, as a rule, all the images referenced by files in the protected folder - even if those images themselves are not located in said folder - become index-able by Google regardless of protection... in which case, I'll just eat it, as they say.
I think in the first instance it'd be easier for you to learn and start with a robots.txt file:
#Tell Google to only crawl one specific folder
User-agent: Googlebot
Allow: /SomeFolder/
Disallow: /
#All other bots go away
User-agent: *
Disallow: /
Once you've got your robots.txt file in place you really need to block all the bad bots specifically in .htaccess. Here's an example and list of bad bots to block in .htaccess: apache-ultimate-bad-bot-blocker/robots.txt at master · mitchellkrogza/apache-ultimate-bad-bot-blocke... The thing with blocking garbage bots is that it's an ongoing job. Disregard my earlier .htaccess rule unless you literally want to block everything and everyone except Googlebot.
Energize wrote
Once you've got your robots.txt file in place you really need to block all the bad bots specifically in .htaccess. Here's an example and list of bad bots to block in .htaccess: apache-ultimate-bad-bot-blocker/robots.txt at master · mitchellkrogza/apache-ultimate-bad-bot-blocke...
Whoa, that is one long list. But it's literally a robots.txt file, I thought you said to never mind that file and use .htaccess? So confused. Do you want me to put the robots.txt content in my .htaccess file? I noticed archive.org isn't on the list, either. And that's the first one I want blocked.
I'm still not sure what the possible advantage of not blocking every bot except Google (and maybe Bing) is. What am I missing out on by allowing only those two to crawl my sites? In other words, in which ways would I be crippling myself or businesses? I'm sure there's a reason, since people are going through the trouble of maintaining "bad bot" lists, thus suggesting there are good ones. I'm just wondering whether the "good bots" are worth the filter, once you've cleared Google & Bing for entry.
If you have legitimate concerns about hostile bots taking your sites down, you need better server security. You can't possibly do all this yourself. The list of potential threats is too massive and growing all the time.
I use Secure Live real time server monitoring. When SL identifies a potential threat, the IP is blocked and a copy of the report is sent to law enforcement agencies. In the beginning, I received 3-4 threat reports per week. Now I get 1-2 reports per month. So it works.
https://forums.adobe.com/people/Nancy+OShea wrote
If you have legitimate concerns about hostile bots taking your sites down, you need better server security. You can't possibly do all this yourself. The list of potential threats is too massive and growing all the time.
I use Secure Live real time server monitoring. When SL identifies a potential threat, the IP is blocked and a copy of the report is sent to law enforcement agencies. In the beginning, I received 3-4 threat reports per week. Now I get 1-2 reports per month. So it works. https://securelive.com/
I couldn't even tell you what threats I'm protecting myself from with a robots.txt file at all. I just couldn't think of a good reason to let ANY bot (not called Googlebot or Bing) crawling all over my websites, and asked what I'd be sacrificing by doing so. Haven't gotten an answer yet, so looks like I'm just going to block everything but Googlebot + bing via .htaccess and see how that goes (unless you advise against that and can tell me why).
As for commercial solutions, I'll wait until I'm actually attacked -- this is just me informing myself in a more general sense, while being proactive even if I have no immediate reason to be. (It's just going to be a graphic portfolio site, no one is actually going to care enough to harm it.)
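For what it's worth, a literal "everyone but Googlebot and Bing" rule in .htaccess would also lock out ordinary browsers, since human visitors don't carry those user-agents either; in practice you deny named bots instead. A hedged sketch (ia_archiver is the archive.org crawler mentioned earlier; the other names are merely illustrative crawlers, and any bot can fake its user-agent):

```apache
#Sketch: block selected bots by user-agent, including archive.org's crawler
#ia_archiver is archive.org's crawler; the other names are illustrative
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (ia_archiver|MJ12bot|AhrefsBot|SemrushBot) [NC]
RewriteRule .* - [F,L]
```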
https://forums.adobe.com/people/Under+S. wrote
I just couldn't think of a good reason to let ANY bot (not called Googlebot or Bing) crawling all over my websites, and asked what I'd be sacrificing by doing so.
Google does gather information from other, lesser search engines. Blocking all bots could have SEO impact. How much is something you'll need to monitor.
https://forums.adobe.com/people/Nancy+OShea wrote
https://forums.adobe.com/people/Under+S. wrote
I just couldn't think of a good reason to let ANY bot (not called Googlebot or Bing) crawling all over my websites, and asked what I'd be sacrificing by doing so.
Google does gather information from other, lesser search engines. Blocking all bots could have SEO impact. How much is something you'll need to monitor.
If that's true, then blocking everything BUT Googlebot might red-flag me BY Googlebot for suspicious activity, which is what I was hoping to avoid by giving Google (and only Google) the keys to the place... right? I mean, if Google is comparing notes with lesser engines.
I wouldn't mind backtracking on the idea of only letting Google in, if I could limit the scope of what ALL the crawlers can find to just the 12 specific urls.
In other words, I don't need Google Images (or anyone else) hotlinking individual files, like JPGs or other media. If there's a zip file I temporarily placed in the root folder for someone to pick up, or a text file I forgot to clean up from an older folder on the server, I don't want them indexed for everyone to click on. Just the 12 official urls for the 12 pages on the site that are meant for public consumption.
Energize helped me remove /pages from all the urls, thus not only masking those pages' true locations, but shortening the urls to only what's necessary. However, the longer urls still work (so there are 2 ways to access each file), so I'm pretty sure any crawler's going to pick up on that unless I find a way to limit the scope of the crawlers to only those shortened urls.
(Note that I am using 12 as an arbitrary number right now, the site has 3 pages and will have 2 more by February... even with the occasional article I plan to put up, I don't see the site exceeding 12 pages in the next year.)
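One mechanism that fits this goal of keeping media and stray files out of search results, sketched here on the assumption that mod_headers is available (the file extensions are examples): an X-Robots-Tag response header set in .htaccess asks search engines not to index matching files, even when an indexed page links to them.

```apache
#Sketch: ask search engines not to index media and stray files,
#even when pages that ARE indexed link to them (requires mod_headers)
<IfModule mod_headers.c>
  <FilesMatch "\.(jpe?g|png|gif|mp4|zip|txt)$">
    Header set X-Robots-Tag "noindex"
  </FilesMatch>
</IfModule>
```

For the long/short duplicate URLs, a 301 redirect from the long form to the short one (or a rel="canonical" link in each page's head) tells crawlers which version counts.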
If you don't want bots to find your media, put it behind a password protected barrier.
Give search engines an XML site map with 12 URLs to follow.
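The sitemap idea can be sketched like this (the domain and page names are placeholders); save it as sitemap.xml in the site root and submit it in Google Search Console:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/portfolio</loc></url>
  <url><loc>https://www.example.com/contact</loc></url>
  <!-- one <url> entry per public page, up to all 12 -->
</urlset>
```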