I have a different take than Pete's helpful thoughts, and mine may well prove to be good news for you, Biff, if it proves to match your challenge. But I will say first that the reason CF (and the DB) are going off the rails should not be for a reason that is "magical"--nor is it a "mystery", though it does call for a "tour" through appropriate diagnostics. 🙂 (Nod to the Beatles, for any not getting my quoted references.)
TL;DR: First, confirm what OOM errors you currently find in coldfusion-error.log, as they may be different than what you mentioned. Second, don't trust GA: your web server logs may show far more traffic than GA is showing (as I will explain). That logging info will also show the user-agent of each request, and the traffic may well prove to be mostly from bots and other automated agents. You can easily block any you don't want, using that user-agent header, in IIS using "request filtering" and its simple "rules" feature, which returns a 404 to the requestor. Just be careful that blocking a given user-agent really makes sense. Further, be careful if you also have IIS set to send 404's to CF, as these "blocked" requests would still end up talking to CF. The above approach proved to be EXACTLY the problem and solution for a client I helped just yesterday. And it's been happening to a lot of people recently (indeed for weeks and months, and longer). It may not be obvious, but it need not remain hidden/mysterious.
I hope folks reading the TL;DR won't presume "that's not my problem". That's what the client thought as well, as I will explain. I do think that if you'll take just a few minutes here to read along, and then several minutes to do some assessment of your own diagnostics (which I can also help with via screenshare consulting, of course), you may well both FIND your root cause problem and resolve it. 1) You refer to seeing "gc overhead limit exceeded" indications. That's a good first step.
Was that reported in your app/on-screen, or were you finding it in the CF logs? Either way, do open that coldfusion-error.log, noting first the lines at the top, which tell you how far back in time that log goes. Then go to the bottom and search "up" for this phrase: outofmemory.
Do you find any instead reporting "java heap"? Or perhaps "metaspace"? The former would be CF hitting that heap limit (the "overhead limit" is more a warning in advance of that), while the latter would be about that maxmetaspacesize you referred to in your first note. As Carl noted (and I discuss in the blog post I referred to), most people would do better just to REMOVE that argument. I explain why in the blog post.
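To make that log check quicker, here's a minimal sketch (my own helper, not anything shipped with CF) that tallies which OutOfMemoryError variants appear in the log text. The example log path and the exact error phrasings are assumptions; check the wording in your own coldfusion-error.log.

```python
# Sketch: count the OutOfMemoryError variants in coldfusion-error.log text.
# The phrasings below are the common JVM ones; verify against your own log.
OOM_KINDS = (
    "java heap space",             # heap truly exhausted
    "gc overhead limit exceeded",  # GC thrashing near the heap limit
    "metaspace",                   # metaspace limit hit
)

def classify_oom_lines(log_text):
    """Return a dict counting each OutOfMemoryError variant in the log text."""
    counts = {kind: 0 for kind in OOM_KINDS}
    for line in log_text.lower().splitlines():
        if "outofmemory" in line:
            for kind in OOM_KINDS:
                if kind in line:
                    counts[kind] += 1
    return counts

# Usage (on the server; path is a guess for your install):
#   text = open(r"C:\ColdFusion2023\cfusion\logs\coldfusion-error.log", errors="replace").read()
#   print(classify_oom_lines(text))
```

If "gc overhead limit exceeded" dominates, you're getting the advance warning; if "java heap space" dominates, you're actually hitting the wall.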
And if instead you are getting "java heap" errors, you could keep "chasing the rabbit" of increasing the heap, which MAY help. But I sense you're wanting to "find and resolve the root cause". You may even be thinking "something's wrong with ColdFusion", but I would suggest otherwise.
2) Indeed, you mentioned in passing (in this note today) that "it's just seen an increase in traffic mostly".
That's in fact where I would have wanted to turn your attention. Indeed that's what I was thinking of when I last replied that first day you wrote (which was 2 years ago rather than 3). I'd added that "If you want to wonder later "why it [the heap increase] was needed", we can discuss that after solving the more important problem, which can be so easily solved."
You never responded again, nor did you mark any reply as an answer, so the discussion lay dormant until now. What I am sharing now is what I would have proposed to share then, if you'd have been interested. And since others may find this thread (especially now that you've revived it), and to help you now that it's happening "again", I'm elaborating.
3) So you've shared your GA chart, though you only shared a brief window of time in that screenshot (30 mins). Further, it's focused on the count of active users (and "views"), rather than request rates (let alone durations).
Still, we can't tell: is your conclusion that traffic IS up? Or perhaps that it doesn't seem to have risen enough to cause problems? Either way, I would propose that GA is not the best place to be looking for this sort of problem. It has classically under-reported traffic to servers like CF.
Why? Because GA has classically been based on the browser receiving the page (and its html) requested and then executing the little bit of GA js code you would have put into the site. (And of course, it ONLY tracks pages that DO have that little bit of js code.) But the problem may be that your server is being pounded by traffic from automated agents (search engine bots, bad guys grabbing your data, attackers, or more recently AI bots). Those typically do NOT execute any js on the page. They just say "give me the next", and "give me the next".
They could be generating 5x, 10x, or 100x the traffic rate of regular people--and that rate could have increased for you recently. And I'm not just talking theoretical possibilities: I have helped many people find and resolve this problem many times in recent days, weeks, months, and years.
In fact, just yesterday I helped someone in this same boat. CF was crashing, Task Manager showed it using high CPU and memory--and they'd already tried previously raising the memory on the box and the CF heap, which only forestalled the problem. They didn't have GA, but indeed their contention was that "this is an internal server that no outsider should be hitting". So I asked if they had anything internal beyond Task Manager to monitor things--especially perhaps FusionReactor, or the PMT, or any sort of CF monitor. Like many, they did not. That was not a show-stopper, as I will explain.
(If they did have either of those tools, they are the best at tracking "what's going on in CF, specifically". And FR even creates a great request log tracking every CF request--and ONLY cf requests--including its start time, duration, number of queries and their duration, the requesting IP, its user agent, and several more valuable metrics. Sadly, neither CF itself nor the PMT offers such a log (the PMT stores its data in an ElasticSearch DB), though the Tomcat underlying CF can be configured to create a request log.)
4) So instead I'd recommended to them what I now recommend for you: look to your web server logs. You mention being on Windows, so your web server is likely IIS, as it was for the other folks. I guided them to find where those logs are (the location is specified within IIS itself, but the default is c:/inetpub/logs/logfiles), and within that is then a folder for each site, named by its IIS "site id" number. (Again, you can find that from the IIS "sites" section.)
They had multiple sites and so multiple folders. We looked at each (rather than presume to know "this site is all we care about"). I recommend you do that also.
And within each site's log folder I had them sort the list of files by the date modified. As you may know, IIS logs are stored by day, rolling over at midnight (by default)--though the IIS log lines are timestamped in GMT. In their case the server was in US Eastern time, so we'd subtract 4 hours from that to find the equivalent local time in the logs.
4a) Anyway, the first thing I had them look at was whether any of the recent days' logs were larger than those of previous days or weeks (when there was "no problem"). Even across multiple folders, they didn't see much that stood out--so some might have been inclined to think that a waste of time. But I'll say it's often been VERY clear that SOME days' logs were indeed FAR larger than those of other days or recent weeks--and it may be in ONE site's folder that was unexpected.
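That size-per-day comparison can also be scripted. Here's a sketch (my own helper, and the default IIS log root is an assumption; check IIS's "Logging" feature for yours) that lists each site folder's logs newest-first with sizes, so an outsized day jumps out:

```python
from pathlib import Path

# Sketch: for each IIS site folder (W3SVC<site id>) under the log root,
# list its daily log files newest-first with their sizes in MB.
def summarize_site_logs(root):
    """Return {site folder name: [(file name, size in MB), ...] newest first}."""
    root = Path(root)
    summary = {}
    for site_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        files = sorted(site_dir.glob("*.log"),
                       key=lambda p: p.stat().st_mtime, reverse=True)
        summary[site_dir.name] = [
            (f.name, round(f.stat().st_size / 1_048_576, 1)) for f in files
        ]
    return summary

# Usage (on the server; default IIS log location assumed):
#   for site, files in summarize_site_logs(r"C:\inetpub\logs\LogFiles").items():
#       print(site)
#       for name, mb in files[:14]:   # roughly the last two weeks
#           print(f"  {name}  {mb} MB")
```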
4b) Moving on, and before giving up on the value of these web server logs, I had them open the most recent one. (Again, remember that midnight there would have been 8pm the night before.) I wanted to just take a look to see if we might readily spot some unusual nature of traffic.
I explained first that a real challenge with web server logs (as compared to FR's request logs) is that a web server log tracks EVERY request made to it: if a CF page served up html to the browser, that browser would then process it and make several (perhaps dozens) of requests back to the server, such as for js files, css files, image files, and so on, which can make it more challenging to "weed through the logs" to focus only on cf requests (and their rate). Further, some CF requests are made without even naming .cfm or .cfc as the file extension.
Still, a "beneficial" side-effect of how bots work is that they tend to again just say "give me this url", then "give me that url"..."I don't care about your silly images or js or css. I just want your content!"
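That "weeding" can be sketched in a few lines. This helper (mine, not a standard tool) reads the #Fields: directive that IIS writes at the top of each W3C-format log to find the URL column, then splits page requests from static-asset requests by extension; the extension list is just a starting assumption:

```python
# Sketch: separate "page" requests from static-asset requests in an IIS W3C log.
# IIS writes a "#Fields:" directive naming the columns; we use it rather than
# hard-coding positions. Extend STATIC_EXTS for your own site's assets.
STATIC_EXTS = (".js", ".css", ".png", ".jpg", ".gif", ".svg", ".ico",
               ".woff", ".woff2")

def split_requests(log_lines):
    """Return (page_uris, asset_uris) from W3C log lines."""
    fields, pages, assets = [], [], []
    for line in log_lines:
        if line.startswith("#Fields:"):
            fields = line.split()[1:]   # column names follow the directive
            continue
        if line.startswith("#") or not line.strip():
            continue                    # other directives / blank lines
        row = dict(zip(fields, line.split()))
        uri = row.get("cs-uri-stem", "").lower()
        (assets if uri.endswith(STATIC_EXTS) else pages).append(uri)
    return pages, assets
```

A screen that is nearly all "pages" and almost no assets is exactly the bot signature described above (though remember, some CF requests won't end in .cfm or .cfc, so don't filter on extension alone).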
5) And sure enough, on the first screen of their logs (showing about 40 log lines in their editor), it was nothing but cf page request after cf page request. NO requests for images, NO requests for js or css. Just one call for a CF page after another.
And the requests were clearly just going through their site asking for a url that named one product and category after another (of course, other sites might track any possible kind of content). Often sites have lists of categories for display, and within categories lists of items, and features for paging through them. That's like honey to a bear for the bots (or bad guys). They just trawl through them asking for page after page.
6) Then I pointed out how the IIS logs (and most web server logs) track also the "user-agent" making the request, which might be a "real browser" but often legit bots do identify themselves.
And indeed we saw on that one screen alone that there were 5 different bots in that timeframe of seconds: googlebot, bingbot, amazonbot, dotbot, and ahrefsbot. But the most requests were in fact from facebook. There were clearly NO requests from any REAL browsers on that screen, nor as we paged down. And remember, this was 8pm their time--but we found it was true pretty much all the time. I've just as often found chatgpt or other AI bots doing the same thing.
Finally, note also that by default the IIS logs track the duration (time-taken, in milliseconds) as the last column--and indeed these requests were taking several seconds, even dozens of seconds when things were bad, as they were in this random log we'd opened at their 8pm. Clearly it seemed we'd found the culprit(s).
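Putting the user-agent and duration columns together, here's a sketch that tallies request count and average time-taken per suspected bot. The bot substrings are just the ones from that client's logs (edit to taste), and it relies on IIS's habit of replacing spaces with "+" in logged user-agent values so a simple split works:

```python
from collections import defaultdict

# Sketch: per-bot request count and average time-taken (ms) from an IIS W3C log.
# Bot substrings are assumptions from one client's logs; adjust for yours.
BOTS = ("googlebot", "bingbot", "amazonbot", "dotbot", "ahrefsbot", "facebook")

def bot_stats(log_lines):
    """Return {bot substring: (request count, avg time-taken in ms)}."""
    fields = []
    totals = defaultdict(lambda: [0, 0])   # bot -> [count, total ms]
    for line in log_lines:
        if line.startswith("#Fields:"):
            fields = line.split()[1:]
            continue
        if line.startswith("#") or not line.strip():
            continue
        row = dict(zip(fields, line.split()))
        ua = row.get("cs(User-Agent)", "").lower()
        for bot in BOTS:
            if bot in ua:
                totals[bot][0] += 1
                totals[bot][1] += int(row.get("time-taken", 0))
    return {b: (n, t // n) for b, (n, t) in totals.items()}
```

Run it on a "bad" day's log and a quiet day's log and compare: a bot with thousands of multi-second requests is your smoking gun.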
7) So "what to do"? Some people respond thinking, "we need to block those IP addresses", but others know that's a fool's errand. Those bot frameworks (and many bad guys or thieves) are sophisticated enough to spread their load over many IPs--which may well change day to day.
And I showed how to block them instead by user-agent. But I warned first that if someone in your org WANTS that bot traffic, then you can't "just block it". More on that in a moment.
In their case, though, remember they said this was an internal server/site that they didn't think had ANY incoming outside traffic. (While we could turn our attention to that, addressing things from a firewall or other level, they just needed a quick solution because like you their CF was crashing constantly. We had confirmed in this 20 mins of work and discussion that THIS was their unexpected root cause problem.)
7a) So they were indeed interested in a solution that could block the traffic from THOSE bots (those "user-agents").
And for that I showed how easily we could use IIS to handle that. At either the site or server level is a "request filtering" feature (among the buttons in the middle of the UI). Opening that shows a UI of tabs, one of which is "rules". Right-click in there to add a new rule. Call it "block bots", and in the header field add "user-agent" (no quotes), then in the values field enter (one per line) even just a portion of the long user-agent string--enough to distinguish it. So we did dotbot, ahrefsbot, amazonbot, and facebook. Again, some people may want to think twice about that last one, or about googlebot or bingbot.
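For those who prefer config files to the UI: that rule ends up in the site's web.config. Roughly, the result looks like the fragment below--a sketch from memory, so do add the rule via the UI once and compare the elements IIS writes before hand-editing anything:

```xml
<configuration>
  <system.webServer>
    <security>
      <requestFiltering>
        <filteringRules>
          <!-- Scan the user-agent header; reject (404) any request whose
               header contains one of the deny strings. -->
          <filteringRule name="block bots" scanUrl="false" scanQueryString="false">
            <scanHeaders>
              <add requestHeader="user-agent" />
            </scanHeaders>
            <denyStrings>
              <add string="dotbot" />
              <add string="ahrefsbot" />
              <add string="amazonbot" />
              <add string="facebook" />
            </denyStrings>
          </filteringRule>
        </filteringRules>
      </requestFiltering>
    </security>
  </system.webServer>
</configuration>
```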
As soon as you submit that page, the change takes effect. If the bottom of your current IIS log showed a high rate of such traffic, wait a few seconds or minutes and re-open the log to look for these. BTW, it's not that they would no longer be logged: it's that now they would get a 404 from IIS. The Request Filtering feature literally just rejects the request with a 404: that's what the requester sees, and what the IIS log tracks--and we should see the duration is now just milliseconds, and they should no longer be going to CF.
(And of course you can do this sort of blocking by user-agent header in Apache or nginx as well.)
7b) That said, I did warn them (and would warn you and readers) to beware something else: some CF folks modify their IIS "error pages" feature (again at either the site or server level) to have 404's passed to a cf page. While I realize that can offer benefits in some cases, do beware that in this situation we would NOT be stopping the requests from affecting CF. The 404 handler setup in CF would be blown up with the same rate of requests as before. And if that handler does any sort of DB lookup--or worse, tracks the 404 failures as new records--you'd also be burdening your db: this merely changes the nature of the traffic rather than stopping it.
One could conceivably tweak their cf-based 404 handler to better accommodate this situation. Again, in their case it was not a concern, as we confirmed no one had changed their IIS "error pages" setting for 404's to go to a CF page, so we didn't have to deal with this.
(If people wonder why my answers read like blog posts--and my blog posts read like term papers--it's because of these little nuances that are often neglected when simpler answers are offered. Again I'm trying to help you and future readers who find this--and the AI bots who will read it and offer it.)
8) So, all that said, within minutes of making these changes we confirmed first that all the requests from those user agents were indeed getting 404's and taking only milliseconds. Again, they did not have FR to allow us to "see how things were going within CF".
But the most important and wonderful thing for them was that now Task Manager not only no longer showed CF as the top user of CPU, it wasn't even in the top 10! Remember: we had not restarted CF. They were very happy with the result, all achieved within an hour of investigation, explanation, and remediation. (I realize some other situations may not resolve so readily.)
8a) Indeed, though a CF restart was not necessary in their case, it MAY be in others. If CF is woefully bogged down, you may find that attempting to stop or restart the service fails. In that case, you could kill the coldfusion.exe process from the "details" tab in Task Manager. That's not what you should ALWAYS do, but in a case of CF running out of memory, using all the CPU, and unable to shut down, it's an option. That said, if you just wait a minute after Windows Services reports that it couldn't stop the service, Windows will ITSELF kill the process for you.
Either way, then you will find you can start CF again.
9) The next question will be: do things remain settled?
I'll note that you may need to do another round of assessing your web server logs: there may be new and different bots or bad guys trying to break in. Some may be harder to handle than with this simple "blocking by user agent".
9a) Indeed, I'll add one more thought on blocking that way, especially with regard to the calls from facebook, or linkedin, or perhaps apple and others. Note that those may not be their "search engines" but instead they may reflect the calls made to your server from some resource being shared in your organization's Facebook feed, or that of other folks sharing resources on your site.
In this case, folks scrolling through their posts may pass one with a link to your site, and what's happening is that FB (or linkedin, or whoever) is FETCHING your page for the user, to show it in a PREVIEW window showing what the page WOULD have looked like if browsed. (It's even a bit more pernicious in that the sites may well fetch your page IN ADVANCE of the user seeing the post, if they anticipate that the user may soon be scrolling to it.)
Would you really want to block those? Or to have it serve a 404 error as the user's preview of your page? Probably not. (And the social media folks in your org may want to string you up for causing that.)
So what could you do about that? Well, I helped one client facing this problem (who'd made that mistake in their excitement) to consider that what the preview would have shown was page content so tiny as to be really useless to the user scrolling on their device. So I proposed they could modify their CF code to detect when such a request was made (the CF variable cgi.http_user_agent holds the header), in which case they could just return their company's logo. That may not work for some.
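The detection itself is just a substring check on that header. Here's the logic sketched in Python (in CFML you'd test cgi.http_user_agent the same way); the user-agent substrings listed are ones commonly used by preview fetchers, but verify them against your own logs before relying on them:

```python
# Sketch: recognize social-site link-preview fetchers by user-agent substring,
# so a handler could serve a lightweight response (e.g. a logo) instead of
# the full page. Substrings are assumptions; confirm against your logs.
PREVIEW_FETCHERS = ("facebookexternalhit", "linkedinbot", "twitterbot", "slackbot")

def is_preview_fetch(user_agent):
    """True when the request looks like a social-site preview fetch."""
    ua = (user_agent or "").lower()
    return any(bot in ua for bot in PREVIEW_FETCHERS)
```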
10) So, all that said, and I know it's a lot, I hope you may get to finding what is your root cause, Biff. And please let us know if this sort of diagnostic approach proves helpful or not. Again hopefully you don't need to even spend as much time assessing the diagnostics and resolving the problem as it took for you to read this. Trust me: it took a lot longer for me to write it! Apologies to those who hate elaboration.