I'm having trouble getting GoogleBot or Bing to crawl pages on a website. For some reason, when I use Google Search Console or Bing Webmaster Tools, the crawlers time out or get kicked from the site, and I receive the error "Page cannot be reached".
I've ensured that robots.txt is not blocking the crawlers, that no firewall is blocking them, and that IIS is letting them reach the page; the site's log files show the bot talking to it.
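To double-check the robots.txt logic the same way a crawler evaluates it, Python's stdlib parser can be fed the file's contents directly. A minimal sketch (the rules and paths below are placeholders, not the real site's robots.txt):

```python
from urllib.robotparser import RobotFileParser

# Placeholder rules standing in for the site's real robots.txt
rules = """
User-agent: *
Disallow: /admin/

User-agent: Googlebot
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Googlebot should be allowed on an ordinary dynamic page
print(rp.can_fetch("Googlebot", "/voodoo"))      # True
# Other bots are blocked from /admin/
print(rp.can_fetch("SomeOtherBot", "/admin/x"))  # False
```

Pasting in the live file's contents confirms the rules do what you think they do, independent of anything IIS or a firewall might be layering on top.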
ColdFusion is in lockdown.
I've looked through the log files and haven't found anything to point to the crawlers getting booted.
Any suggestions would be greatly appreciated.
Have you made a formal request to GoogleBot and Bing to crawl your site?
I have been in conversations with Google and on their forums. Not much came of the conversation.
I did, however, enable Failed Request Tracing logs in IIS and came across one error that shows up consistently on every page that runs on the site:
ModuleName:IIS Web Core
ErrorCode: The system cannot find the file specified. (0x80070002)
I'm not sure if that's what could be triggering the bot to abort, since any page loads fine when you run it in a browser.
The error also led me to a group post about CFM from way back in 2015 regarding unchecking the request restrictions checkbox. Unfortunately, that didn't seem to work. The "solution" that person found is in the second or third conversation thread from the bottom of the page.
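Failed Request Tracing can also be narrowed so the trace only covers the suspicious areas, which keeps the log files from ballooning. A web.config sketch, assuming the FRT feature is already installed in IIS (the path pattern and status-code range here are assumptions to adjust for your site):

```xml
<system.webServer>
  <tracing>
    <traceFailedRequests>
      <!-- Trace only CFM requests, and only the StaticFile/Module areas -->
      <add path="*.cfm">
        <traceAreas>
          <add provider="WWW Server" areas="StaticFile,Module" verbosity="Warning" />
        </traceAreas>
        <!-- Capture even "successful" responses so the 0x80070002 warning shows up -->
        <failureDefinitions statusCodes="200-500" />
      </add>
    </traceFailedRequests>
  </tracing>
</system.webServer>
```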
The first thing I'd do is use a crawler advertising itself as GoogleBot instead of a browser. Browsers are a lot more forgiving than crawlers, and so are their users, who typically don't ransack every page looking for links to follow. You can do this with pretty much any fetching tool, like wget or curl. I don't know how rapidly crawlers go through pages nowadays, but automated tools like wget and curl generally move pretty quickly, so if your site is slow to respond, you might see issues here.
Dave Watts, Eidolon LLC
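The same idea works from a short Python script if wget/curl aren't handy. A sketch using only the stdlib (the URL below is a placeholder; Googlebot's user-agent string is the one Google publishes):

```python
import urllib.request

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

def crawler_request(url: str, user_agent: str = GOOGLEBOT_UA) -> urllib.request.Request:
    """Build a request that advertises itself as Googlebot instead of a browser."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Fetch a page the way an impatient crawler would and return the HTTP status."""
    with urllib.request.urlopen(crawler_request(url), timeout=timeout) as resp:
        return resp.status

# Usage against the real site, e.g.:
#   fetch_status("http://www.example.com/voodoo")
```

The short timeout matters: if the page responds in a browser but takes longer than a crawler is willing to wait, this is where you'd see it.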
Thanks @Dave Watts for the wget and curl suggestion. I used wget to plow through the site, and all 1000+ dynamic pages came through with a 200 status. I still got the map_request_handler warning in the failed request logs, but the crawler didn't seem to care about it. I like that tool, though. The journey continues...
Since you appear to be using Windows, try spidering your website with Xenu Link Sleuth or HTTrack. Both programs are free and portable. We use Xenu to verify links, generate a sitemap, and identify exposed email addresses. We use HTTrack to create static, offline copies for fallback use at tradeshows (where internet access may not exist). HTTrack lets you specify a custom user agent in case you want to impersonate a specific GoogleBot.
We also use webhint (website tool, browser extension & VSCode extension) to analyze HTTP headers, accessibility, speed and cross-browser compatibility.
We really like it too. If you can't crawl your public web application with Xenu, there's a good chance that search engines can't crawl it either. (No crawl = no index.)
We checked back in August 2019: there's a command-line version of Xenu available by request, but it does require Windows, desktop access, and a one-time $300 site license.
User-agent-wise, we actively identify and block many requests that use known generic user agents (curl, Wget, scrapers, bad bots, etc.) unless they come from trusted, whitelisted IPs for testing/scanning purposes. We perform reverse DNS lookups on GoogleBots, since it's easy for any user to modify their user agent. (Fake bots are penalized unless they come from trusted, whitelisted IPs.) Most of this is blocked automatically by our StackPath WAF, but we've had to ease up on some rules because one financial gateway we use doesn't follow the RFC and doesn't send a user agent when it posts to our endpoint.
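The reverse-DNS check follows the pattern Google documents: reverse-lookup the IP, check that the name falls under googlebot.com or google.com, then forward-resolve that name and confirm it maps back to the same IP. A stdlib sketch of that logic (the example hostnames are illustrative):

```python
import socket

def looks_like_google_host(hostname: str) -> bool:
    """True if a reverse-DNS name falls under Google's published crawler domains."""
    return hostname.rstrip(".").endswith((".googlebot.com", ".google.com"))

def verify_googlebot(ip: str) -> bool:
    """Reverse-lookup the IP, check the domain, then forward-confirm it maps back."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    if not looks_like_google_host(hostname):
        return False
    try:
        # Forward-confirm: the claimed hostname must resolve back to the same IP
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
```

A request spoofing the Googlebot user agent from a random IP fails at the first step, because its reverse-DNS name won't end in googlebot.com.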
Thanks @James Moberg for those great links and suggestions. Xenu does a really great job of scanning. I found some missing links and errors I didn't know existed; this is a very large site. Every page crawled successfully in Xenu, but again the failed request log triggered on the dynamic pages in IIS. I'm not sure why GoogleBot doesn't crawl this site. It likes the index pages but won't crawl a URL when it's dynamic. I'm conflicted on whether it's CFM or IIS that's bailing on Google. Side note: we have a few other sites on this server, and they get treated the same way by Google. It doesn't like something on that server.
I recommend using Web Developer F12 tools (built into your browser) to see if you can identify anything radically different in the response headers.
The webhint Chrome browser extension will provide you with some insight too. (If you use Microsoft Edge, webhint is built in, so there's no need to install it.)
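If spot-checking headers in F12 turns something up, the comparison between a static and a dynamic URL can also be scripted. A small helper (the header values below are canned stand-ins, not real responses from the site):

```python
def diff_headers(a: dict, b: dict) -> dict:
    """Return {header: (value_in_a, value_in_b)} for headers that differ.
    Keys are compared case-insensitively, since HTTP headers are."""
    la = {k.lower(): v for k, v in a.items()}
    lb = {k.lower(): v for k, v in b.items()}
    return {k: (la.get(k), lb.get(k))
            for k in sorted(set(la) | set(lb))
            if la.get(k) != lb.get(k)}

# Canned example: a static page's headers vs. a dynamic page's
static_hdrs = {"Content-Type": "text/html", "X-Powered-By": "ASP.NET"}
dynamic_hdrs = {"Content-Type": "text/html; charset=UTF-8"}
print(diff_headers(static_hdrs, dynamic_hdrs))
# {'content-type': ('text/html', 'text/html; charset=UTF-8'), 'x-powered-by': ('ASP.NET', None)}
```

Anything that shows up only on the dynamic side (or goes missing there) is a candidate for what the bot is reacting to.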
If you have malicious JS, or something in your framework performs blind redirects (i.e., third parties using your site to redirect traffic), then search engines may not want to add your site to the index. (I'm just mentioning this in case it's an issue based on content or unintended use that has been blacklisted.)
Could you provide any public static vs dynamic URLs so I can check? (If you desire, you can contact me outside of ASC. My name is James Moberg & my Twitter handle is GamesOver.)
We haven't encountered this issue with Google. Most services don't even know that we're running ColdFusion or IIS, because we suppress certain headers, have custom error messages, and use IIS Rewrite so that ".cfm" doesn't appear in any spiderable URLs or links.
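For reference, hiding the ".cfm" extension with IIS URL Rewrite usually looks something like the sketch below: extensionless requests that don't match a real file or folder get rewritten to the corresponding .cfm template. This is a generic illustration, not our exact rules, and it assumes the URL Rewrite module is installed:

```xml
<system.webServer>
  <rewrite>
    <rules>
      <rule name="HideCfmExtension" stopProcessing="true">
        <match url="^(.*)$" />
        <conditions>
          <!-- Only rewrite when the request isn't an actual file or directory -->
          <add input="{REQUEST_FILENAME}" matchType="IsFile" negate="true" />
          <add input="{REQUEST_FILENAME}" matchType="IsDirectory" negate="true" />
        </conditions>
        <action type="Rewrite" url="{R:1}.cfm" />
      </rule>
    </rules>
  </rewrite>
</system.webServer>
```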
After looking through some lengthy Failed Request Log files from IIS, and with some great insight and help from @James Moberg, it's looking more like the failure is due to the "StaticFile" mapping handler in IIS. According to the trace in the log file, it's one of the first things that gets triggered; it sends a warning, but processing continues until the CFM handler takes care of the dynamic URL and delivers the page to the browser. A question for anyone who may have run into this: is there a way to configure the StaticFile mapping handler to ignore dynamic URLs and let CFM take over? I've also read about re-configuring the CFM connectors, but they already work, so I'm not sure that would help much.
Here's the error that is displayed internally.
Detailed Error Information:
Module: IIS Web Core
Error Code: 0x80070002
Requested URL: http://www.blah.com/voodoo
Physical Path: Drive-Letter:\cool-folder-name\voodoo
Logon Method: Anonymous
Logon User: Anonymous
Request Tracing Directory: drive-letter:\cool-folder-name\logs\FailedReqLogFiles
That's definitely an odd one. The problem may have something to do with tweaking someone did in your IIS to the ORDER of the modules, and as such, changing the module order might solve things. If I were in your shoes, I'd compare it against a working machine.
You can see the order of the modules by going to the server level in IIS, choosing Modules, and on the right choosing "View Ordered List". In a default install (as I just checked, granted on Windows 10), the StaticFileModule is about halfway down the list, and the IsapiFilterModule (key for CF) is near the top (fifth down). (And FWIW, while the UI won't let you change the order of modules at a SITE level, someone could modify the underlying config files manually, so you may want to check the order in any "problem" site even if the default order seems "fine".)
One other thing: while I'm no fan of "just trying things", I will note that a solution shared in the past for that error code on that StaticFile module was a simple web.config that's easy to try: https://stackoverflow.com/a/33278362/90802
Let us know if you make any progress. There are lots of other potential "differences" in such config settings that could be impacting you, that you might not readily consider (app pool settings and more).