Ever since we upgraded from ColdFusion 2021 to ColdFusion 2023, we have been dealing with out-of-memory issues. ColdFusion will run fine for roughly 24-30 hours, then we start seeing CPU spikes to 100% every 30 seconds. Garbage collection can't free enough memory, so ColdFusion eventually crashes and we have to restart the server.
Things we have tried that don't seem to help:
- Downgrading Java to 17.0.11
- Tweaking the min and max heap sizes
- Tweaking the caching settings
- Changing the garbage collector algorithm to G1GC
- Tweaking our websites to cache queries for a shorter period of time (1 hour down to 15 minutes down to 5 minutes)
Here are our current settings:
Min Heap: 8192 MB
Max Heap: 8192 MB
Garbage Collector: UseParallelGC
Cached Templates: 1000
Cached Queries: 5000
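For reference, a minimal sketch of how those heap and GC settings would typically appear on the java.args line of ColdFusion's jvm.config (other flags omitted):
-Xms8192m -Xmx8192m -XX:+UseParallelGC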
We do have FusionReactor installed on all of our servers, but this is like trying to find a needle in a haystack. I really don't know what I should be looking at.
Here is a recent screenshot from two days ago that shows the eventual demise of one of our servers.
I am really at my wit's end here. If this isn't a memory leak I don't know what the heck it is. If anyone has any recommendations on what to try next I would appreciate it.
This is what I was referring to in my original post regarding CPU spikes when the garbage collector is attempting to free memory.
Could it be a session management issue? Especially with the number of bots crawling sites these days: for every bot that hits, CF will create a session. If those sessions don't expire, they keep building up and consuming memory. We had this issue not too long ago.
First, try reducing the length of time your sessions stay active. You can do this in the Application.cfc file. For instance, here is the setting for a 20-minute session:
<cfset THIS.sessiontimeout=CreateTimeSpan(0,0,20,0) />
In addition, you can effectively stop bot sessions by testing whether the session cookie exists, since most bots don't accept cookies, and giving cookieless requests a near-zero timeout. Put this in your Application.cfc file.
<cfif StructKeyExists(cookie, "cfid") or StructKeyExists(cookie, "jsessionid")>
    <!--- The client returned a session cookie, so treat it as a real user --->
    <cfset this.sessionTimeout = CreateTimeSpan(0,0,20,0) />
<cfelse>
    <!--- No session cookie: likely a bot, so let the session expire after 3 seconds --->
    <cfset this.sessionTimeout = CreateTimeSpan(0,0,0,3) />
</cfif>
Hi,
Thanks for the suggestion. Our session timeout is currently set to 30 minutes, so I don't think it is the cause of the problem, but I will try anything at this point.
Dave, if you're thinking that the change to cf2023 is the cause of your problem, I'll say that I'm not aware of any known issue in cf2023 that would make it newly susceptible to increased memory usage.
Yet it's not quite clear you have a memory "leak". The growth is steady over the course of a day, and you hit the limit (primarily in the old gen, which holds most of the heap). If you could increase the heap, that might buy you some time, but unless what's holding memory were to release it in 24 hours, it would indeed keep climbing.
You do really need to find WHAT is holding onto memory which can't be gc'ed. (Changing gc algorithms is not the solution, as you found out.)
Since you have fr, you could try to use its memory profiling feature. But it often proves hard to relate to the specific cf objects at issue. Still, it's worth a shot. See its docs or their videos on that.
Instead, I'd recommend you focus on the nature of the traffic you're getting. Paul referred to the possibility of unexpected traffic load, perhaps leading to an increase in sessions. FR can help you see that: look at the "UEM & Sessions > Sessions" page. Are sessions increasing at the same rate as memory itself over that 24-hour period? If not, then it's something else.
And some things to consider are NOT reflected in FR. It may be tough to assess things via such back and forth here. If we run out of steam, just know I can help directly, and often things become clearer in a direct screenshare session than can be communicated or anticipated here. More at carehart.org/consulting.
Judging from the FusionReactor displays, it seems to me that none of the 5 things you mention is the root cause of the problem.
Therefore, I would suggest that you return each of the 5 settings back to its original value.
I think the cause of the issue is memory-intensive code. By this I mean code that increasingly uses memory, without any pause to free memory. Think, for example, of:
1. an infinite or very long-running loop;
2. a collection (array, struct, or query) of ever-increasing size;
3. downloads or processing of large files held in memory;
4. deadlocks that keep threads, and everything they reference, alive.
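As a purely hypothetical illustration of item 2 (all names here are invented for the example), code like the following, which appends to an application-scope array on every request and never prunes it, would produce exactly this kind of slow, unreclaimable growth:
<!--- Hypothetical: each request appends an entry to an application-scope array that is never pruned, so the heap climbs until the server runs out of memory --->
<cflock scope="application" type="exclusive" timeout="5">
    <cfif NOT StructKeyExists(application, "requestLog")>
        <cfset application.requestLog = []>
    </cfif>
    <cfset ArrayAppend(application.requestLog, {path = cgi.script_name, time = Now()})>
</cflock>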
Where to start looking for the offending code? FusionReactor's Memory (MB) and CPU (%) displays offer a clue. Notice how there is a Memory (MB) dip precisely at times when there is a CPU (%) peak. The times at which these occur are approximately 12:32:16, 12:32:29, 12:32:42, 12:32:48, 12:32:58, 12:33:10, 12:33:20.
Now check FusionReactor's logs for the requests that were running at those times. Identify which of them were high-CPU. Those were the requests which actually attempted to reduce memory usage. Examine the corresponding code. Identify the processes which cost so much CPU to reduce so little memory.
Hi BK,
Thanks for your response. We did revert back to our original settings that we used on ColdFusion 2021 which were:
Min Heap: 8192 MB
Max Heap: 8192 MB
GC Algorithm: ParallelGC
Cached Templates: 1000
Cached Queries: 5000
1. I have checked FusionReactor's "Requests > Slow Requests" and "Requests > Longest Requests" and nothing is over 30 seconds, so I'm thinking we can cross off an infinite loop somewhere.
2. I'm not sure how I would check for a collection of ever increasing size, but I don't think that is happening.
3. We did identify a few downloads (Google Feeds) appearing at the top of FusionReactor's "Requests > Requests By Memory" report, and we moved them to another server. Unfortunately, that didn't help.
4. I don't see any deadlocks at the moment. If there were any, I would be getting error emails from the websites because we are using cferror and I am emailing myself.
In that screenshot, you mentioned checking the logs for what was running when those dips in memory occurred, but I think what's happening there is that Java is attempting to garbage collect, and that's the reason for the high CPU. I could be wrong, but that's how I read that image.
Since both of our front end servers were close to crashing this morning, I took the liberty of changing the version of Java we're using. We are not using the official Oracle JDK on either one. We are now testing the Amazon Corretto JDK on Server 1 and Microsoft's OpenJDK on Server 2. Will it help? Who knows. I am still searching for answers.
Do you know how to decipher a heap dump? I've taken several snapshots but I have no idea how to look at this data.
This is how both servers are looking roughly 5 hours after a ColdFusion restart.
Yes, indeed, it makes more sense that the CPU spikes would coincide with the major (mark-sweep) GCs, and FR's graphs of each of those would confirm it. (Sadly, your original GC graphs were not for the same timeframe as the CPU graph, so we couldn't conclude for sure. But you can.)
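If you want confirmation outside FR as well, one option (a sketch, assuming Java 17's unified logging, added to jvm.config and picked up after a CF restart) is to enable a GC log and correlate its timestamps with the CPU spikes:
-Xlog:gc*:file=gc.log:time,uptime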
As I said in my response (the only one you've not yet replied to), it feels like something is increasingly holding on to memory that can't be Gc'ed. You'll want to find that. I gave you a couple of approaches.
And I'll say now that heap dump analysis would often be more challenging than the other options I'd mentioned, but it certainly can be done and MAY help. But it's just a point in time.
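(For what it's worth, a sketch of capturing one, assuming a JDK on the path and that you know ColdFusion's process id: take the dump with the JDK's own jcmd tool, then open the .hprof file in a heap analyzer such as Eclipse MAT and start from its Leak Suspects report or dominator tree.
jcmd <coldfusion_pid> GC.heap_dump /path/to/heap.hprof
The <coldfusion_pid> placeholder is whatever process id your CF instance is running under.)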
The FR memory profiler, on the other hand, can be triggered OVER time, like an hour after the instance has been started and then a few hours later (in your case of slow, steady increase). Then it lets you compare the profiles to see which Java objects are increasing in size or count. Again, sometimes it's not clear what those objects equate to in CFML (same with a heap dump), but sometimes it may be clear.
But again I'd suggest you check the sessions page. I know you told Paul you have 30-minute sessions. That could be the admin default, but code could override it. Worth at least looking. If that's not it, a strong candidate is something growing in one of your application scopes, or perhaps the server scope. Or use of cf caching, and so on. A true "leak", like a cf bug is generally the least likely cause in my experience.
Usually it's some aspect of code, config, and/or load. And you may well have a difference in config between cf2021 and 2023 which has escaped notice.
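If it helps, here's a minimal sketch of a throwaway diagnostic page for counting entries in those usual suspects, the application and server scopes and CF caching (the cache call assumes the default cache region). Run it periodically and watch whether any count climbs in step with the heap:
<!--- Hypothetical diagnostic page: counts, not sizes, but a steady climb here points at the culprit --->
<cfoutput>
    Application scope keys: #StructCount(application)#<br>
    Server scope keys: #StructCount(server)#<br>
    Default cache entries: #ArrayLen(cacheGetAllIds())#<br>
</cfoutput>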
Finally, the jvm choice is also not likely at issue...though to be clear, Adobe only supports our using Oracle's jvm (which they license for our use). Using another seems a needless risk. But I'm just sharing perspective, not telling you what to do.
Hey Charlie,
Sessions have been holding steady at around 10K, and I don't see any large increases over time, so I think we can throw that out. Nothing seems too out of line.
As for memory profiling, I see that FusionReactor has this turned on by default. I don't see any data in Profile > Active Profiling, but there is some history in Profile > Profile History. When I click into that report, I don't see much of anything over a few seconds of load time, so I'm not sure what to think about this feature.
I do realize that I am testing other versions of Java right now, but I am running out of ideas and have been working on this issue for almost a month now.
I am including a couple screenshots below that show session counts on both servers.
PS - I did quadruple check that I am using the same settings as ColdFusion 2021. They are all exactly the same as before.
Charlie,
I forgot to mention what values we're using for session variables. Here they are:
Maximum Timeout: 2 hours
Default Timeout: 1 hour
Website Override: 30 minutes
As always, thanks for your insight.
OK on the admin and app session timeouts. Your previous message showing sessions weren't increasing diminishes the likelihood that they're the issue (though that doesn't account for the SIZE of any of the sessions).
But again, I'd also proposed you look at the use of the application and server scopes (FR doesn't reliably help with that). Then I'd also proposed considering your app's use of caching, though FR can't directly help assess its size or use. The memory profile might.
Finally, beware presuming you "know" that your app "doesn't use such things". There could be an app that "no one uses" and that "hasn't been touched in years", but which a spider or bot could be trolling.
And recent AI bot scans have been especially notorious for this. FR can help assess it, as can web server logs. And again, I can help assess any or all of this if you don't find it on your own or with others.
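As a hedged supplement to Paul's cookie test earlier in the thread, here's a sketch of a user-agent check for Application.cfc (the pattern list is illustrative, not exhaustive):
<!--- Hypothetical: give requests whose user agent looks like a crawler a near-zero session, so bot sessions can't pile up in the heap --->
<cfif ReFindNoCase("(bot|crawl|spider|slurp)", cgi.http_user_agent)>
    <cfset this.sessionTimeout = CreateTimeSpan(0,0,0,2)>
</cfif>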
As for the profile feature, again see the docs for how to use it. You need to create a profile, and THEN you would see it.
The docs will also help you better use the UI once you have a profile, or profiles to compare. Or again, I can help directly.
Hmmm maybe I'm looking at the wrong docs. Is this it?
No, that's the request profiler. The memory profiler is here:
https://docs.fusion-reactor.com/Memory/Overview/
I do see now that the left nav menu (at least on mobile) doesn't readily convey that that page is about the memory profiler, nor that the other profiler is about requests.
Same with the search feature: I couldn't even get it to surface the memory/heap profiling feature in a search for profile, profiler, or profiling.
Those are things I'd recommend you report to them. They're very responsive to customer concerns. (I present too many to get attention for any one.)
Here are a couple other overview resources. Note how the first one clearly shows the feature being referred to as memory profiling, in the URL and text. Just adding that in case anyone noticed that the doc page doesn't use that term and might think I was mistaken in using it.
https://fusion-reactor.com/features/performance-troubleshooting-old/memory-profiler/
The second is just a couple-minute video, but it may still prove helpful.
https://www.youtube.com/watch?v=a3iwB5zsXRM
Let me start with what I consider crucial feedback to your last post. To repeat what Charlie advised, I, too, would discourage running ColdFusion 2023 on the Amazon Corretto JDK or Microsoft's OpenJDK. In other words, in spite of the performance issues you're facing, I would suggest that you continue to run ColdFusion 2023 on the latest Java version that Adobe recommends, namely Java SE 17.0.12 (LTS). There is a reason why I say that.
ColdFusion consists of hundreds of Java applications working together. Before release, the Adobe team exhaustively tests and optimizes each, as well as the integration of all of them into one application server. Using the Java version that the team recommends stands you in good stead to take maximum advantage of the optimization.
Now, on to your answer to my previous post. It seems the strategy I recommended wasn't clear. The strategy is:
1. Use FusionReactor's Memory (MB) and CPU (%) displays to find the times at which a memory dip coincides with a CPU peak.
2. Check FusionReactor's logs for the requests running at those times, identify the high-CPU, high-memory ones, and examine their code.
To repeat, look not for the slowest processes, but for the highest-memory consumers. Hence the example I gave, showing a possible way to identify such high-memory processes/pages/requests.
On the point that CPU usage may spike during garbage collection, I of course agree with you. But you are talking of CPU peaks, whereas I am talking of high-CPU peaks.
In my experience, CPU typically peaks at around 20 to 40% during garbage collection, even in memory-intensive ColdFusion applications. Whereas, here, CPU consistently peaks at over 50% during garbage collection, frequently reaching 70 to 90%.
Combine that with the fact that: (1) the garbage collections occur within seconds of each other, and (2) the application's memory usage is hovering at between 85 and 90%. Frequent garbage collection usually indicates memory pressure, often caused by processes generating excessive objects. The high memory usage confirms this. To me, it all points to memory leaks or high object-churn. That is the reason why I think the root cause is to be found in the code.
The strategy I suggest consists of two parts: first, identify the offending requests; then, examine and optimize their code.
Bkbk, it seems your contention is that some code is creating a lot of objects, in a short period of time, right? I know we could read your last reply another way, but I think the totality of it confirms this to be your expectation.
And if so, I'll say my money is instead on the opposite: there may be zero requests running for an entire minute or hour in Dave's situation, and yet the memory will remain high. My expectation is that thousands or millions of requests--even from hours ago--might have incrementally added just a small number/amount of objects which are holding memory...but those are something which lives on LONG AFTER the request ends, and indeed seems to be living for longer than the 20 or so hours his graphs showed.
To me, that's what needs to be found. And that's why the fr memory profiler may best identify WHAT KIND of objects are increasing in size/count.
Now, COULD that be related to coding choices? Sure. Config choices? Sure. Exacerbated by spiders/bots/automated requests? Absolutely--though not necessarily. So I'm saying that a first priority seems to be to try to find what IS piling up, if possible. Then we can focus on how/why.
But I just doubt the explanation will be in what's in fr's "requests by mem" or "longest requests".
Time will tell which of us has guessed right in this case. That said, I don't mean to knock the value of what you're offering, in that it may help in OTHER cases, sure. We'll see what Dave finds.
Bkbk, it seems your contention is that some code is creating a lot of objects, in a short period of time, right? I know we could read your last reply another way, but I think the totality of it confirms this to be your expectation.
And if so, ...
By @Charlie Arehart
@Charlie Arehart , I didn't put any emphasis on "in a short period of time". My contention emphasizes "some code is creating a lot of objects". From what I have read, that is apparently your contention, too.
No, it's not. But let's let it go. I think others will discern the differences in our perspectives. And what matters most is what Dave ultimately finds to be the culprit.
I think others will discern the differences in our perspectives. And what matters most is what Dave ultimately finds to be the culprit.
By @Charlie Arehart
I couldn't agree with you more. 🙂
Remark:
Your last 2 "Memory Overview" displays look fine to me. 🙂
Some questions:
What is the maximum number of distinct users of the application at any time?
On average, how many users use the application per day?
Hey BK,
We average about 100-200 users at any one time. If there is a sale going on, it's a bit higher.
We average roughly 8,000 users per day.
Hi @davecordes , thanks for the session info.
200 simultaneous users and 8,000 users per day - that is really no sweat for ColdFusion. But I can see a likely problem when I take into account the number of active sessions at any time (10,000).
10,000 seems unusually high to me, given an average of 8,000 users per day and about 200 simultaneous users. Under normal circumstances, you would expect the number of active sessions to closely align with the number of simultaneous users, with some fluctuation depending on the session timeout settings and user activity. So, let's at least rule this out.
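To put rough numbers on that (my arithmetic, assuming evenly spread traffic and a 30-minute timeout): 8,000 users per day is about 0.09 new sessions per second, and 0.09/s x 1,800 s of session lifetime works out to only about 170 concurrent sessions. Sustaining 10,000 active sessions with a 30-minute timeout would instead require roughly 10,000 / 1,800, or about 5.6 new sessions per second, which is nearly 500,000 sessions per day. That is why bot-created or never-expiring sessions are worth ruling out.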
Potential causes of a high Active Session count, and recommended solutions:
1. Sessions that never expire. Make sure your Application.cfc sets explicit application and session timeouts, for example:
<cfset this.name="name_of_your_application">
<cfset this.applicationTimeout="#createTimeSpan(1,0,0,0)#"> <!--- assumed: 1 day --->
<cfset this.sessionManagement="yes">
<cfset this.sessionTimeout="#createTimeSpan(0,0,30,0)#">
2. Tomcat persisting sessions across restarts. If that applies to you, disable it by setting an empty Manager pathname in context.xml:
<Manager pathname="" />
@davecordes , another request for information: please share the contents of your jvm.config file(s) (typically found at cf_root/cfusion/bin/jvm.config).
Bkbk, while his session count is indeed high, it seems all that consideration is overkill (though maybe your effort will benefit other readers). Let me offer again a different perspective, and folks can weigh them together.
1) First, Dave had already shown us above that the session count was remaining stable throughout the day--and he'd shown us also the FR graph that clearly indicates the sessions are being destroyed at the same rate they are being created. That means they ARE timing out. (He'd said it was a 30-minute timeout.)
And that 10k sessions translates to about 1.8 requests per second, which jibes with an earlier screen he'd shared showing they get about 5 CF requests per second on average. That would translate to a little less than half their requests coming from bots or automated traffic (as I'd suggested previously), which is quite common for a lot of servers.
2) But the reason that alone is not THE issue is that memory (heap use) was climbing THROUGHOUT the day. More specifically, the trough to which used memory FELL was an increasingly higher number.
That suggests clearly that SOMETHING is remaining "in use" even beyond the session timeout.
And that's where my money is: something unexpected that is created to live BEYOND the life of the request (like caching, for example) and never released within the 20-hour window his screenshots showed. (It may well prove to be something set to cache for 24 hours, if the memory graphs started stabilizing at 24 hours. Sadly, they'd not go DOWN at 24 hours, because the rate at which things time out would match the rate they come in, presuming the previous pattern.)
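As a purely hypothetical illustration of that 24-hour pattern (the query name and datasource are invented), a query tag like this holds its result set in the heap for a full day after the request that ran it has ended:
<!--- Hypothetical: cachedwithin of one day keeps this result set in memory long after the request ends --->
<cfquery name="products" datasource="myDSN" cachedwithin="#CreateTimeSpan(1,0,0,0)#">
    SELECT * FROM products
</cfquery>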
3) And that then is why we need to find WHAT objects are piling up in the heap. Dave now has what he needs to use the FR heap profiler effectively, to see if comparisons of it over time might clearly spot which object or objects are the culprit.
Once we know that, we may be able to temper whatever that is. And that's the kind of unexpected memory use that I've contended from the outset is a common cause for what seems otherwise a "memory leak". But I'd argue the latter is a term better used for something unintended and NOT within our control in CF code or config. There HAVE been such on rare occasions (like in db drivers, or due to a mistake by Adobe), but they're far less likely the cause in my experience. The things I discuss above are the far more common cause.
4) Finally, to your point about Tomcat sessions, that can indeed be a surprising impact for people. But I would not see it affecting memory. Instead, the mechanism would by default save sessions to A FILE (sessions.ser), to be used by CF (Tomcat) to "persist" sessions over CF restarts. As such, it would not affect memory in my experience.
To be clear, that mechanism can even be used (assuming it's configured in the context.xml, as you note) only if one enables the "J2EE sessions" feature in the CF admin. It would not apply to normal CF sessions. The latter are controlled by CF, not Tomcat, which is the reverse for J2EE sessions--which allows those to be persisted, optionally.
FWIW, I've written and presented a lot more on this (cf/tomcat session persistence) in the past, first when it came out with cf10 (the first cf version to run natively on tomcat). Then more recently I've presented and written about cf offering session persistence via Redis session storage (new since cf2016--but which works ONLY if we do NOT use cf's j2ee sessions feature).
But all this (point 4) is separate from the main problem here. I just wanted to offer it as a PS for those who might notice your mention of Tomcat session persistence and be intrigued. 🙂