May 25, 2007
Question

spiders, robots.txt and /includes folder

I am unsure whether or not I should put my /includes folder into my robots.txt file. When search engine spiders crawl through a site, how do they deal with files that are in the /includes folder? Do they only see those files when they are "cfincluded" by a calling page or do the spiders also see them as independent pages?

I don't want to see pages of mine showing up on a search engine's rankings that are devoid of sibling content. (For example, I wouldn't want just content from "column 1" without the pages' header, footer, column 2 and sidebar also being displayed.) This could give users a poor (and obviously misleading) impression of my site and its content.

So, should I put my /includes folder into my robots.txt file (e.g. "Disallow: /includes/") or not? And would this prevent a spider from following a <cfinclude>? I definitely don't want that.

But if spiders ONLY crawl files within the /includes folder when they are called from another file, then I wouldn't have to worry about page components showing up in rankings under the guise of complete pages.

Any information on this topic would be greatly appreciated.

PS. On a separate, but slightly related note, can search engine spiders crawl JavaScript and CSS files?

    2 replies

    Inspiring
    May 29, 2007
    Spiders can read any and all content that sits in web-accessible folders.
    The big, well-behaved crawlers are not going to bother with your
    include folders; they only fetch the pages as they are
    presented by you. But if I can guess at your folder structure, and
    /includes/ is a very simple guess, I can access your includes folder
    directly, and I can have a spider do it as well.

    The best way to protect this content is to keep it outside of the web
    root. ColdFusion does not need a file to be in the web root in order to
    include it in the pages it returns for a request, and if the files are
    outside the web root then my spider and I cannot get at them so easily.

    DO
    /includes/
    /wwwroot/
    /wwwroot/css/
    /wwwroot/javascript/

    DO NOT
    /wwwroot/
    /wwwroot/includes/
    /wwwroot/css/
    /wwwroot/javascript/
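
To see why the "DO" layout helps, here is a toy model in Python of how a web server maps a URL path to a file. It is only a sketch under stated assumptions: the /site prefix, the file names, and the helper names are all hypothetical, and a real server has more rules than this. The idea is that every URL resolves to something under the web root, so a file kept next to the web root has no URL at all.

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical location of the web root (the "DO" layout under an assumed /site prefix).
WEB_ROOT = PurePosixPath("/site/wwwroot")


def url_to_file(url_path: str) -> PurePosixPath:
    """Map a URL path to a file path, the way a simple web server might."""
    # Collapse "." and ".." segments; ".." can never climb above "/".
    clean = posixpath.normpath("/" + url_path.lstrip("/"))
    return WEB_ROOT / clean.lstrip("/")


def reachable_by_url(file_path: PurePosixPath) -> bool:
    """True if the file lives under the web root, so some URL can map to it."""
    return WEB_ROOT in file_path.parents


# "DO" layout: includes sit beside the web root, so no URL maps to them.
print(reachable_by_url(PurePosixPath("/site/includes/header.cfm")))
# "DO NOT" layout: includes sit inside the web root, so a guessed URL finds them.
print(reachable_by_url(PurePosixPath("/site/wwwroot/includes/header.cfm")))
# Even a "../" trick in the URL still resolves inside the web root.
print(url_to_file("/../includes/header.cfm"))
```

In this model the first check is False and the second is True, which is the whole difference between the two layouts.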
    Inspiring
    May 29, 2007
    I'm no expert, but I imagine search engines would see files in the /includes folder as independent pages, the same as files in any other folder. Cfinclude isn't like an HTTP header. The CF code (<cfinclude>, etc.) is executed on the server and converted to HTML, so the search engine should receive plain HTML, just like a browser. The fact that a cfinclude was used to generate the HTML doesn't matter, because the spider doesn't see the CF code, just the HTML.
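
On the original robots.txt question, a quick way to check what a "Disallow: /includes/" rule would do is Python's standard urllib.robotparser module. This is a minimal sketch, not a statement about how any particular search engine behaves: the page names and the "ExampleBot" user agent are made up for illustration. It only shows which URLs a rule-following crawler would be allowed to request.

```python
from urllib.robotparser import RobotFileParser

# The rule proposed in the question, applied to all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /includes/
"""


def can_crawl(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if a crawler that obeys robots.txt may request this path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


# A calling page (hypothetical name) stays crawlable. Its cfincludes run on
# the server, so the crawler only ever requests this one URL.
print(can_crawl("/index.cfm"))
# A direct request for a fragment (hypothetical name) is disallowed.
print(can_crawl("/includes/column1.cfm"))
```

Since the rule only covers URLs the crawler requests itself, it should not stop the calling page from being indexed with its included content. Keep in mind that robots.txt is advisory, so a badly behaved spider can ignore it.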