Preventing search engines from spidering sections of your site

LEGEND, Sep 05, 2018

Hello, all,

I just recently learned that DISA (Defense Information Systems Agency) considers a robots.txt file a Category II vulnerability finding.

https://vaulted.io/library/disa-stigs-srgs/apache_site_22_for_windows/V-2260

So, I will be forced to remove the robots.txt file from our public site.  Does anyone know of another way to prevent search engines from spidering certain sections of your website?  I just want to keep spiders out of our components folders.

V/r,

^ _ ^

Community Expert, Sep 05, 2018

You can include META directives in individual pages to prevent crawling. But this probably violates your DISA requirements too. Also, it's important to note that both of these approaches only work with "well-behaved" crawlers. Any HTTP client can simply ignore robots.txt or individual directives, and the reason for the DISA requirement is presumably that a malicious user could identify sensitive URL patterns by simply reading robots.txt.
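For example, here's a rough sketch of that page-level directive, plus the roughly equivalent response header you could set for non-HTML files (Apache mod_headers syntax; treat the paths and values as placeholders):

    <!-- standard robots META directive: asks well-behaved crawlers not to index
         this page or follow its links; anything else is free to ignore it -->
    <meta name="robots" content="noindex, nofollow">

    # roughly equivalent HTTP header for things like PDFs (requires mod_headers)
    Header set X-Robots-Tag "noindex, nofollow"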

As an alternative, you can require authentication for all pages, and deny access to unauthenticated or unauthorized clients. I'm using "authentication" pretty loosely here; it could simply look for specific client networks, for instance.
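For instance, a sketch in Apache 2.2 syntax (since that's the STIG being cited), with a placeholder path and a placeholder trusted network:

    # only allow an example trusted network to reach the components folder;
    # everyone else gets a 403 instead of a robots.txt hint
    <Location /components>
        Order Deny,Allow
        Deny from all
        Allow from 10.0.0.0/8
    </Location>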

I do a lot of search work, so feel free to contact me directly if you have questions.

Dave Watts, Fig Leaf Software

LEGEND, Sep 05, 2018

https://forums.adobe.com/people/Dave+Watts wrote:

...and the reason for the DISA requirement is presumably that a malicious user could identify sensitive URL patterns by simply reading robots.txt.

That is precisely the reason.  DISA believes it acts as a roadmap to sensitive functions for hackers and script-kiddies to target, which I understand.  And I know that not all search engines are reputable, and that the disreputable ones will ignore robots.txt directives.  I wasn't under any impression that the robots.txt file was protecting anything.  But I _was_ surprised that DISA considered it a Cat II vulnerability.

As far as authentication for all pages goes, did you mean something like a logon?  If so, I don't see that happening.  This is in regard to our public pages (https://www.ustranscom.mil), and I seriously doubt the Brass will go for that.

If not, what did you have in mind?  Something like searching for "bot" in the user agent and denying access if it's found?

V/r,

^ _ ^

Community Expert, Sep 05, 2018

I don't know enough about DISA to know how serious the categories are, but yeah, whatever it is, it's probably overblown based on my prior experience. But here we are.

If you're allowed to have sitemaps, you could use one to list only the URLs you want crawled. You can't really exclude URL patterns that way, though, so it's not as good as robots.txt for keeping well-behaved crawlers out of specific sections. As mentioned before, you can include META directives in individual pages, and that will have the same overall effect as far as those individual pages go:

The Web Robots Pages

However, it does mean that your server may end up serving a lot of URLs to robots that then discover that those URLs should be discarded - at least for the first crawl of that robot. Many crawlers will track which URLs were successful, and only crawl those in the future. (Many others will not.)
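Going back to the sitemap idea: a sitemap is just an XML file listing the URLs you do want crawled, so anything you leave out (the components folders, say) simply isn't advertised. A minimal sketch, with placeholder URLs:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <!-- list only the pages you want indexed; nothing under /components/ appears -->
      <url>
        <loc>https://www.example.mil/index.cfm</loc>
      </url>
      <url>
        <loc>https://www.example.mil/news.cfm</loc>
      </url>
    </urlset>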

As for "authentication" I was leaving that pretty loose because I didn't really know what your requirements were. So, it could be requests from specific trusted networks, etc. But if it's a public site, there's nothing you can really do except rely on information from the browser, which is not really trustworthy. On the other hand, you could build something that limits repeated requests. Or, on pages which you want to crawl but which link to pages you don't want to crawl, you could conditionally show links accordingly in the event that the page is being crawled by something that looks like a robot rather than a user-driven browser. That's also not reliable, but it might be reliable enough to serve your purpose.

Dave Watts, Fig Leaf Software
