Ever since we upgraded from ColdFusion 2021 to ColdFusion 2023, we have been dealing with out-of-memory issues. ColdFusion will run fine for roughly 24-30 hours, then we start seeing CPU spikes to 100% every 30 seconds. Garbage collection can't free enough memory, so ColdFusion eventually crashes and we have to restart the server.
Things we have tried that don't seem to help:
- Downgrading Java to 17.0.11
- Tweaking the min and max heap sizes
- Tweaking the caching settings
- Changing the garbage collector algorithm to G1GC
- Tweaking our websites to cache queries for a shorter period of time (1 hour down to 15 minutes down to 5 minutes)
Here are our current settings:
Min Heap: 8192 MB
Max Heap: 8192 MB
Garbage Collector: UseParallelGC
Cached Templates: 1000
Cached Queries: 5000
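For reference, a minimal sketch of how those heap and GC settings would typically appear on the java.args line of ColdFusion's jvm.config (other flags omitted):
-Xms8192m -Xmx8192m -XX:+UseParallelGC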
We do have FusionReactor installed on all of our servers, but this is like trying to find a needle in a haystack. I really don't know what I should be looking at.
Here is a recent screenshot from two days ago that shows the eventual demise of one of our servers.
I am really at my wit's end here. If this isn't a memory leak I don't know what the heck it is. If anyone has any recommendations on what to try next I would appreciate it.
This is what I was referring to in my original post regarding CPU spikes when the garbage collector is attempting to free memory.
Could it be a session management issue? Especially with the number of bots crawling sites these days: for every bot that hits, CF will create a session. If those sessions don't expire, they keep building up and consuming memory. We had this issue not too long ago.
First, try reducing the length of time your sessions stay active. You can do this in the Application.cfc file. For instance, here is the setting for a 20-minute session:
<cfset THIS.sessiontimeout=CreateTimeSpan(0,0,20,0) />
In addition, you can effectively stop bot sessions by testing whether the session cookie exists, since most bots don't accept cookies, and giving cookieless requests a near-zero timeout. Put this in your Application.cfc file.
<cfif StructKeyExists(cookie, "cfid") or StructKeyExists(cookie, "jsessionid")>
    <!--- The client returned a session cookie, so treat it as a real user --->
    <cfset this.sessionTimeout = CreateTimeSpan(0,0,20,0) />
<cfelse>
    <!--- No session cookie: likely a bot, so let the session expire after 3 seconds --->
    <cfset this.sessionTimeout = CreateTimeSpan(0,0,0,3) />
</cfif>
Hi,
Thanks for the suggestion. Our session timeout is currently set to 30 minutes, so I don't think it is the cause of the problem, but I will try anything at this point.
Dave, if you're thinking that the change to cf2023 is the cause of your problem, I'll say that I'm not aware of any known issue in cf2023 that would make it newly susceptible to increased memory usage.
Yet it's not quite clear you have a memory "leak". The growth is steady over the course of a day, and you hit the limit (primarily in the old gen, which holds most of the heap). If you could increase the heap, that might buy you some time, but unless what's holding memory were to release it in 24 hours, it would indeed keep climbing.
You do really need to find WHAT is holding onto memory which can't be gc'ed. (Changing gc algorithms is not the solution, as you found out.)
Since you have fr, you could try to use its memory profiling feature. But it often proves hard to relate to the specific cf objects at issue. Still, it's worth a shot. See its docs or their videos on that.
Instead, I'd recommend you focus on the nature of the traffic you're getting. Paul referred to the possibility of unexpected traffic load, perhaps leading to an increase in sessions. FR can help you see that: look at the "UEM & Sessions > Sessions" page. Are sessions increasing at the same rate as memory itself over that 24-hour period? If not, then it's something else.
And some things to consider are NOT reflected in FR. It may be tough to assess things via such back and forth here. If we run out of steam, just know I can help directly, and often things become clearer in a direct screenshare session than can be communicated or anticipated here. More at carehart.org/consulting.
Judging from the FusionReactor displays, it seems to me that none of the 5 things you mention is the root cause of the problem.
Therefore, I would suggest that you return each of the 5 settings back to its original value.
I think the cause of the issue is memory-intensive code. By this I mean code that increasingly uses memory, without any pause to free memory. Think, for example, of:
1. an infinite or very long-running loop;
2. a collection (array, struct, or query) of ever-increasing size;
3. downloads or processing of large files held in memory;
4. deadlocks that keep threads, and everything they reference, alive.
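As a purely hypothetical illustration of item 2 (all names here are invented for the example), code like the following, which appends to an application-scope array on every request and never prunes it, would produce exactly this kind of slow, unreclaimable growth:
<!--- Hypothetical: each request appends an entry to an application-scope array that is never pruned, so the heap climbs until the server runs out of memory --->
<cflock scope="application" type="exclusive" timeout="5">
    <cfif NOT StructKeyExists(application, "requestLog")>
        <cfset application.requestLog = []>
    </cfif>
    <cfset ArrayAppend(application.requestLog, {path = cgi.script_name, time = Now()})>
</cflock>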
Where to start looking for the offending code? FusionReactor's Memory (MB) and CPU (%) displays offer a clue. Notice how there is a Memory (MB) dip precisely at times when there is a CPU (%) peak. The times at which these occur are approximately 12:32:16, 12:32:29, 12:32:42, 12:32:48, 12:32:58, 12:33:10, 12:33:20.
Now check FusionReactor's logs for the requests that were running at those times. Identify which of them were high-CPU. Those were the requests which actually attempted to reduce memory usage. Examine the corresponding code. Identify the processes which cost so much CPU to reduce so little memory.
Hi BK,
Thanks for your response. We did revert back to our original settings that we used on ColdFusion 2021 which were:
Min Heap: 8192 MB
Max Heap: 8192 MB
GC Algorithm: ParallelGC
Cached Templates: 1000
Cached Queries: 5000
1. I have checked FusionReactor's "Requests > Slow Requests" and "Requests > Longest Requests" and nothing is over 30 seconds, so I'm thinking we can cross off an infinite loop somewhere.
2. I'm not sure how I would check for a collection of ever increasing size, but I don't think that is happening.
3. We did identify a few downloads (Google Feeds) appearing at the top of FusionReactor's "Requests > Requests By Memory" report, and we moved them to another server. Unfortunately, that didn't help.
4. I don't see any deadlocks at the moment. If there were any, I would be getting error emails from the websites because we are using cferror and I am emailing myself.
In that screenshot, you mentioned checking the logs for what was running when those dips in memory occurred, but I think what's happening there is that Java is attempting to garbage collect, and that's the reason for the high CPU. I could be wrong, but that's how I read that image.
Since both of our front end servers were close to crashing this morning, I took the liberty of changing the version of Java we're using. We are not using the official Oracle JDK on either one. We are now testing the Amazon Corretto JDK on Server 1 and Microsoft's OpenJDK on Server 2. Will it help? Who knows. I am still searching for answers.
Do you know how to decipher a heap dump? I've taken several snapshots but I have no idea how to look at this data.
This is how both servers are looking roughly 5 hours after a ColdFusion restart.
Yes, indeed, it makes more sense that the CPU spikes would coincide with the major (mark-sweep) GCs, and FR's graphs of each of those would confirm it. (Sadly, your original GC graphs were not for the same timeframe as the CPU graph, so we couldn't conclude for sure. But you can.)
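If you want confirmation outside FR as well, one option (a sketch, assuming Java 17's unified logging, added to jvm.config and picked up after a CF restart) is to enable a GC log and correlate its timestamps with the CPU spikes:
-Xlog:gc*:file=gc.log:time,uptime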
As I said in my response (the only one you've not yet replied to), it feels like something is increasingly holding on to memory that can't be Gc'ed. You'll want to find that. I gave you a couple of approaches.
And I'll say now that heap dump analysis would often be more challenging than the other options I'd mentioned, but it certainly can be done and MAY help. But it's just a point in time.
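(For what it's worth, a sketch of capturing one, assuming a JDK on the path and that you know ColdFusion's process id: take the dump with the JDK's own jcmd tool, then open the .hprof file in a heap analyzer such as Eclipse MAT and start from its Leak Suspects report or dominator tree.
jcmd <coldfusion_pid> GC.heap_dump /path/to/heap.hprof
The <coldfusion_pid> placeholder is whatever process id your CF instance is running under.)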
The FR memory profiler, on the other hand, can be triggered OVER time, like an hour after the instance has been started and then a few hours later (in your case of slow, steady increase). Then it lets you compare the profiles to see which Java objects are increasing in size or count. Again, sometimes it's not clear what those objects equate to in CFML (same with a heap dump), but sometimes it may be clear.
But again I'd suggest you check the sessions page. I know you told Paul you have 30-minute sessions. That could be the admin default, but code could override it. Worth at least looking. If that's not it, a strong candidate is something growing in one of your application scopes, or perhaps the server scope. Or use of cf caching, and so on. A true "leak", like a cf bug is generally the least likely cause in my experience.
Usually it's some aspect of code, config, and/or load. And you may well have a difference in config between cf2021 and 2023 which has escaped notice.
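If it helps, here's a minimal sketch of a throwaway diagnostic page for counting entries in those usual suspects, the application and server scopes and CF caching (the cache call assumes the default cache region). Run it periodically and watch whether any count climbs in step with the heap:
<!--- Hypothetical diagnostic page: counts, not sizes, but a steady climb here points at the culprit --->
<cfoutput>
    Application scope keys: #StructCount(application)#<br>
    Server scope keys: #StructCount(server)#<br>
    Default cache entries: #ArrayLen(cacheGetAllIds())#<br>
</cfoutput>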
Finally, the jvm choice is also not likely at issue...though to be clear, Adobe only supports our using Oracle's jvm (which they license for our use). Using another seems a needless risk. But I'm just sharing perspective, not telling you what to do.
Hey Charlie,
Sessions have been holding steady at around 10K, and I don't see any large increases over time, so I think we can throw that out. Nothing seems too out of line.
As for memory profiling, I see that FusionReactor has this turned on by default. I don't see any data in Profile > Active Profiling, but there is some history in Profile > Profile History. When I click into that report, I don't see much of anything over a few seconds of load time, so I'm not sure what to think about this feature.
I do realize that I am testing other versions of Java right now, but I am running out of ideas and have been working on this issue for almost a month now.
I am including a couple screenshots below that show session counts on both servers.
PS - I did quadruple check that I am using the same settings as ColdFusion 2021. They are all exactly the same as before.
Charlie,
I forgot to mention what values we're using for session variables. Here they are:
Maximum Timeout: 2 hours
Default Timeout: 1 hour
Website Override: 30 minutes
As always, thanks for your insight.
OK on the admin and app session timeouts. Your previous message showing sessions weren't increasing diminishes the likelihood that they're the issue (though that doesn't account for the SIZE of any of the sessions).
But again, I'd also proposed you look at the use of the application and server scopes (FR doesn't reliably help with that). Then I'd also proposed considering your app's use of caching, though FR can't directly help assess its size or use. The memory profile might.
Finally, beware presuming you "know" that your app "doesn't use such things". There could be an app that "no one uses" and that "hasn't been touched in years", but which a spider or bot could be trolling.
And recent AI bot scans have been especially notorious for this. FR can help assess it, as can web server logs. And again, I can help assess any or all of this if you don't find it on your own or with others.
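As a hedged supplement to Paul's cookie test earlier in the thread, here's a sketch of a user-agent check for Application.cfc (the pattern list is illustrative, not exhaustive):
<!--- Hypothetical: give requests whose user agent looks like a crawler a near-zero session, so bot sessions can't pile up in the heap --->
<cfif ReFindNoCase("(bot|crawl|spider|slurp)", cgi.http_user_agent)>
    <cfset this.sessionTimeout = CreateTimeSpan(0,0,0,2)>
</cfif>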
As for the profile feature, again see the docs for how to use it. You need to create a profile, and THEN you would see it.
The docs will also help you better use the UI once you have a profile, or profiles to compare. Or again, I can help directly.
Hmmm maybe I'm looking at the wrong docs. Is this it?
No, that's the request profiler. The memory profiler is here:
https://docs.fusion-reactor.com/Memory/Overview/
I do see now that the left nav menu (at least on mobile) doesn't readily convey that that page is about the memory profiler, nor that the other profiler is about requests.
Same with the search feature: I couldn't even get it to surface the memory/heap profiling feature in a search for profile, profiler, or profiling.
Those are things I'd recommend you report to them. They're very responsive to customer concerns. (I present too many to get attention for any one.)
Here are a couple other overview resources. Note how the first one clearly shows the feature being referred to as memory profiling, in the URL and text. Just adding that in case anyone noticed that the doc page doesn't use that term and might think I was mistaken in using it.
https://fusion-reactor.com/features/performance-troubleshooting-old/memory-profiler/
The second is just a couple-minute video, but it may still prove helpful.
https://www.youtube.com/watch?v=a3iwB5zsXRM
Let me start with what I consider crucial feedback to your last post. To repeat what Charlie advised, I, too, would discourage running ColdFusion 2023 on the Amazon Corretto JDK or Microsoft's OpenJDK. In other words, in spite of the performance issues you're facing, I would suggest that you continue to run ColdFusion 2023 on the latest Java version that Adobe recommends, namely Java SE 17.0.12 (LTS). There is a reason why I say that.
ColdFusion consists of hundreds of Java applications working together. Before release, the Adobe team exhaustively tests and optimizes each, as well as the integration of all of them into one application server. Using the Java version that the team recommends stands you in good stead to take maximum advantage of the optimization.
Now, on to your answer to my previous post. It seems the strategy I recommended wasn't clear. The strategy is:
1. Use FusionReactor's Memory (MB) and CPU (%) displays to find the times at which a memory dip coincides with a CPU peak.
2. Check FusionReactor's logs for the requests running at those times, identify the high-CPU, high-memory ones, and examine their code.
To repeat, look not for the slowest processes, but for the highest-memory consumers. Hence the example I gave, showing a possible way to identify such high-memory processes/pages/requests.
On the point that CPU usage may spike during garbage collection, I of course agree with you. But you are talking of CPU peaks, whereas I am talking of high-CPU peaks.
In my experience, CPU typically peaks at around 20 to 40% during garbage collection, even in memory-intensive ColdFusion applications. Whereas, here, CPU consistently peaks at over 50% during garbage collection, frequently reaching 70 to 90%.
Combine that with the fact that: (1) the garbage collections occur within seconds of each other, and (2) the application's memory usage is hovering at between 85 and 90%. Frequent garbage collection usually indicates memory pressure, often caused by processes generating excessive objects. The high memory usage confirms this. To me, it all points to memory leaks or high object-churn. That is the reason why I think the root cause is to be found in the code.
The strategy I suggest consists of two parts: first, identify the offending requests; then, examine and optimize their code.
Bkbk, it seems your contention is that some code is creating a lot of objects, in a short period of time, right? I know we could read your last reply another way, but I think the totality of it confirms this to be your expectation.
And if so, I'll say my money is instead on the opposite: there may be zero requests running for an entire minute or hour in Dave's situation, and yet the memory will remain high. My expectation is that thousands or millions of requests--even from hours ago--might have incrementally added just a small number/amount of objects which are holding memory...but those are something which lives on LONG AFTER the request ends, and indeed seems to be living for longer than the 20 or so hours his graphs showed.
To me, that's what needs to be found. And that's why the fr memory profiler may best identify WHAT KIND of objects are increasing in size/count.
Now, COULD that be related to coding choices? Sure. Config choices? Sure. Exacerbated by spiders/bots/automated requests? Absolutely--though not necessarily. So I'm saying that a first priority seems to be to try to find what IS piling up, if possible. Then we can focus on how/why.
But I just doubt the explanation will be in what's in fr's "requests by mem" or "longest requests".
Time will tell which of us has guessed right in this case. That said, I don't mean to knock the value of what you're offering, in that it may help in OTHER cases, sure. We'll see what Dave finds.
Bkbk, it seems your contention is that some code is creating a lot of objects, in a short period of time, right? I know we could read your last reply another way, but I think the totality of it confirms this to be your expectation.
And if so, ...
By @Charlie Arehart
@Charlie Arehart , I didn't put any emphasis on "in a short period of time". My contention emphasizes "some code is creating a lot of objects". From what I have read, that is apparently your contention, too.
No, it's not. But let's let it go. I think others will discern the differences in our perspectives. And what matters most is what Dave ultimately finds to be the culprit.
I think others will discern the differences in our perspectives. And what matters most is what Dave ultimately finds to be the culprit.
By @Charlie Arehart
I couldn't agree with you more. 🙂
Remark:
Your last 2 "Memory Overview" displays look fine to me. 🙂
Some questions:
What is the maximum number of distinct users of the application at any time?
On average, how many users use the application per day?
Hey BK,
We average about 100-200 users at any one time. If there is a sale going on, it's a bit higher.
We average roughly 8,000 users per day.
Hi @davecordes , thanks for the session info.
200 simultaneous users and 8,000 users per day - that is really no sweat for ColdFusion. But I can see a likely problem when I take into account the number of active sessions at any time (10,000).
10,000 seems unusually high to me, given an average of 8,000 users per day and about 200 simultaneous users. Under normal circumstances, you would expect the number of active sessions to closely align with the number of simultaneous users, with some fluctuation depending on the session timeout settings and user activity. So, let's at least rule this out.
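To put rough numbers on that (my arithmetic, assuming evenly spread traffic and a 30-minute timeout): 8,000 users per day is about 0.09 new sessions per second, and 0.09/s x 1,800 s of session lifetime works out to only about 170 concurrent sessions. Sustaining 10,000 active sessions with a 30-minute timeout would instead require roughly 10,000 / 1,800, or about 5.6 new sessions per second, which is nearly 500,000 sessions per day. That is why bot-created or never-expiring sessions are worth ruling out.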
Potential causes of a high Active Session count, and recommended solutions:
1. Sessions that never expire. Make sure your Application.cfc sets explicit application and session timeouts, for example:
<cfset this.name="name_of_your_application">
<cfset this.applicationTimeout="#createTimeSpan(1,0,0,0)#"> <!--- assumed: 1 day --->
<cfset this.sessionManagement="yes">
<cfset this.sessionTimeout="#createTimeSpan(0,0,30,0)#">
2. Tomcat persisting sessions across restarts. If that applies to you, disable it by setting an empty Manager pathname in context.xml:
<Manager pathname="" />
@davecordes , another request for information: please share the contents of your jvm.config file(s) (typically found at cf_root/cfusion/bin/jvm.config).
Bkbk, while his session count is indeed high, it seems all that consideration is overkill (though maybe your effort will benefit other readers). Let me offer again a different perspective, and folks can weigh them together.
1) First, Dave had already shown us above that the session count was remaining stable throughout the day--and he'd shown us also the FR graph that clearly indicates the sessions are being destroyed at the same rate they are being created. That means they ARE timing out. (He'd said it was a 30-minute timeout.)
And that 10k sessions translates to about 1.8 requests per second, which jibes with an earlier screen he'd shared showing they get about 5 CF requests per second on average. That would translate to a little less than half their requests coming from bots or automated traffic (as I'd suggested previously), which is quite common for a lot of servers.
2) But the reason that alone is not THE issue is that memory (heap use) was climbing THROUGHOUT the day. More specifically, the trough to which used memory FELL was an increasingly higher number.
That suggests clearly that SOMETHING is remaining "in use" even beyond the session timeout.
And that's where my money is: something unexpected that is created to live BEYOND the life of the request (like caching, for example) and never released within the 20-hour window his screenshots showed. (It may well prove to be something set to cache for 24 hours, if the memory graphs started stabilizing at 24 hours. Sadly, they'd not go DOWN at 24 hours, because the rate at which things time out would match the rate they come in, presuming the previous pattern.)
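As a purely hypothetical illustration of that 24-hour pattern (the query name and datasource are invented), a query tag like this holds its result set in the heap for a full day after the request that ran it has ended:
<!--- Hypothetical: cachedwithin of one day keeps this result set in memory long after the request ends --->
<cfquery name="products" datasource="myDSN" cachedwithin="#CreateTimeSpan(1,0,0,0)#">
    SELECT * FROM products
</cfquery>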
3) And that then is why we need to find WHAT objects are piling up in the heap. Dave now has what he needs to use the FR heap profiler effectively, to see if comparisons of it over time might clearly spot which object or objects are the culprit.
Once we know that, we may be able to temper whatever that is. And that's the kind of unexpected memory use that I've contended from the outset is a common cause for what seems otherwise a "memory leak". But I'd argue the latter is a term better used for something unintended and NOT within our control in CF code or config. There HAVE been such on rare occasions (like in db drivers, or due to a mistake by Adobe), but they're far less likely the cause in my experience. The things I discuss above are the far more common cause.
4) Finally, to your point about Tomcat sessions, that can indeed be a surprising impact for people. But I would not see it affecting memory. Instead, the mechanism would by default save sessions to A FILE (sessions.ser), to be used by CF (Tomcat) to "persist" sessions over CF restarts. As such, it would not affect memory in my experience.
To be clear, that mechanism can even be used (assuming it's configured in the context.xml, as you note) only if one enables the "J2EE sessions" feature in the CF admin. It would not apply to normal CF sessions. The latter are controlled by CF, not Tomcat, which is the reverse for J2EE sessions--which allows those to be persisted, optionally.
FWIW, I've written and presented a lot more on this (cf/tomcat session persistence) in the past, first when it came out with cf10 (the first cf version to run natively on tomcat). Then more recently I've presented and written about cf offering session persistence via Redis session storage (new since cf2016--but which works ONLY if we do NOT use cf's j2ee sessions feature).
But all this (point 4) is separate from the main problem here. I just wanted to offer it as a PS for those who might notice your mention of Tomcat session persistence and be intrigued. 🙂