Possible Memory Leak - ColdFusion 2023 + Java 17.0.12

Report · Oct 08, 2024

Ever since we upgraded from ColdFusion 2021 to ColdFusion 2023 we have been dealing with out of memory issues. ColdFusion will run fine for roughly 24-30 hours, then we will start seeing CPU spikes to 100% every 30 seconds. Garbage collection can't free up enough memory so ColdFusion eventually crashes and we have to restart the server.

Things we have tried that don't seem to help:

- Downgrading Java to 17.0.11

- Tweaking the min and max heap sizes

- Tweaking the caching settings

- Changing the garbage collector algorithm to G1GC

- Tweaking our websites to cache queries for a shorter period of time (1 hour down to 15 minutes down to 5 minutes)

Here are our current settings:

Min Heap: 8192

Max Heap: 8192

Garbage Collector: UseParallelGC

Cached Templates: 1000

Cached Queries: 5000

We do have FusionReactor installed on all of our servers, but this is like trying to find a needle in a haystack. I really don't know what I should be looking at.

Here is the most recent screenshot, from 2 days ago, showing the eventual demise of one of our servers.

I am really at my wit's end here. If this isn't a memory leak I don't know what the heck it is. If anyone has any recommendations on what to try next I would appreciate it.

Report · Oct 12, 2024

@Charlie Arehart , you call my suggestions "overkill" and yet, instead of justifying why they are overkill, you systematically proceed to use each as a springboard for your own comments. Which is flattering in a way.

In any case, this is a discussion forum. No one has a monopoly of ideas here.

What we're all offering are suggestions. Some may be good, some not so good. Some may turn out to be right, others not. That's okay. The forum is all the better for the diversity of ideas. No single contributor should aim to orchestrate the discussion.

There is definitely an issue, the root cause of which we don't yet know. The more suggestions we offer Dave, and the more diverse they are, the more likely he will solve the problem.

Report · Oct 12, 2024

The overkill is that your post focused again on session management, at length this time. And I simply demonstrated that (since session counts are not changing while memory is rising) Dave's problem is not seemingly about sessions, per se. That WAS my "justification".

And I then added related content for context. I don't think it was overstated. And it was not meant to compliment what you wrote, though it complements it.

As such, my words don't propose a monopoly: they propose a counterpoint. And as always I sincerely hope they're something that motivated readers would learn from. (I can't control how it's perceived, but in trying to avoid offense I may sometimes have to add more words than some might prefer.)

/Charlie (troubleshooter, carehart. org)

Report · Oct 12, 2024

... I simply demonstrated that (since session counts are not changing while memory is rising) ...
That WAS my "justification".

By @Charlie Arehart

"Session counts are not changing while memory is rising"? Where do you draw that conclusion from?

As far as I can see, there is no evidence of the relation between session count and rising memory in all of the preceding discussion. You can therefore not demonstrate or justify it.

Report · Oct 12, 2024

Bkbk, I can't help but think you're either not reading every word I've written, or you've got some sort of blinders on. I literally gave links to the specific posts of Dave's which offered the info on which I based my assertions.

On that point, the defense rests its case, leaving the case in the hands of the jury.

I really would love to hear if others might want to say they see your point or mine, but I realize some will not want to venture into the debate at all. Others may be like Homer Simpson, backing away (into the bush) right about now:

/Charlie (troubleshooter, carehart. org)

Report · Oct 12, 2024

Bkbk, I can't help but think you're either not reading every word I've written, or you've got some sort of blinders on. I literally gave links to the specific posts of Dave's which offered the info on which I based my assertions.

By @Charlie Arehart

I did read what you had written. You are mistaken, @Charlie Arehart . Here again is what you wrote, including the links you mention:

1) First, Dave had already shown us above that the session count was remaining stable throughout the day--and he'd shown us also the FR graph that clearly indicates the sessions are being destroyed at the same rate they are being created. That means they ARE timing out. (He'd said it was a 30 minute timeout.)

And that 10k sessions translates to about 1.8 requests per second, which jives with an earlier screen he'd shared showing they get about 5 cf requests per second on average. That would translate to a little less than half their requests coming from bots or automated traffic (as I'd suggested previously), which is quite common for a lot of servers.

By @Charlie Arehart

The first link is incorrect. It points away from this discussion, to a different thread of months ago. But let's ignore that.

To return to the subject at hand, I shall say it again. As far as I can see, there is no evidence of the relation between session count and rising memory in all of the preceding discussion. You can therefore not demonstrate or justify it.

Report · Oct 12, 2024

Bk, first, the mistaken link was simply that, a copy/paste mistake. I've corrected it.

Second, are you saying I can't "justify" a contention that Dave's info indicates a relationship "between session count and rising memory"? If so, I agree. Indeed, I'm saying I DO NOT SEE ONE.

And specifically, my first reply today was in response TO YOUR message that seemed excessively focused on addressing session management. You said "10 000 [sessions] seems unusually high to me" and you went on to detail 6 reasons that could be so. (And it seemed a lot like points an AI would indicate, but good on you if it was all your own).

And I acknowledged that could be useful to someone, but since Dave's problem is about memory, I retorted that it seemed overkill. So I offered a different perspective. I still stand 100% behind what I said, every word of every response I've written, trying to advance the discussion.

I hope your confusion is cleared up. If not, please let it go. Let others chime in. Otherwise we're going in circles here. I keep hoping to get off the ride, but you keep putting in tokens.

/Charlie (troubleshooter, carehart. org)

Report · Oct 12, 2024

There is a reason for exploring sessions comprehensively. Imagine that they are heavy sessions, each weighing in at an average of 1 MB. Then 10 000 active sessions will pack quite a punch.

Report · Oct 12, 2024

Bkbk,

Dave said towards the beginning of this thread:

"Sessions have been holding steady at around 10K and I don't see any large increases over time so I think we can throw that out. Nothing seems too out of line."

Since the sessions are holding steady but the memory is growing, that would indicate to me the memory issue isn't tied directly to the number of active sessions. I believe this is also what Charlie is pointing out.

Report · Oct 12, 2024

@Scott_PALADEM , I understand what Dave, Charlie and you say. My questions and suggestions about sessions are meant to rule sessions out as high-memory user. The fact that the session count has been holding steady at 10 K says nothing about how much memory the sessions actually use. Your session may hold 1 kilobyte of data now, and 1 megabyte a minute later.

As I said, 10 000 active sessions seem to me to be excessive, given 200 simultaneous users and 8000 daily users. Hence my questions.

Is session management working as it should? What are the 10K sessions? Are they all legitimate? You say the sessions are holding steady at 10 K - then, tell me, what level of memory is that? I think that those are questions we have to answer, to rule out sessions.

Long story short: we're software engineers. A robust way to show that the sessions are not responsible for the high memory usage is to determine how much memory is involved in generating and maintaining the 10 000 sessions, or find what else is responsible for the high memory usage.

Report · Oct 12, 2024

Bk, I'll step back in since you ask a question in saying, "You say the sessions are holding steady at 10 K - then, tell me, what level of memory is that?"

Again, what I'd pointed to originally is Dave's graph above showing cf memory/heap use RISING all day (again more specifically, the point to which a gc FALLS is increasing all day). And if sessions are always at 10k level all day then THAT is what seems to "rule sessions out as high-memory user".

Do you see the logic now? And this is why I've said it seems SOMETHING else which should explain his issue, like caching.

Even if you may wonder if "some sessions are lasting long", I'd pointed out also how the fr session graph showed them being destroyed (timed out) at the same rate they were being created. That's an issue, sure. And I'd said that days ago. It just doesn't seem to explain the steady memory rise.

We won't know for sure until Dave confirms what the profile/heap analysis identifies.

If you reread what I wrote first today, perhaps now it will click that this is what I've been saying all along today.

If not, I now IMPLORE you to please let go debating it further for now. Let's instead wait to see what Dave says/reports. I'm not playing list policeman. I'm just a neighbor asking you to "turn the music down for the rest of the night".

/Charlie (troubleshooter, carehart. org)

Report · Oct 13, 2024

Bk, I'll step back in since you ask a question in saying, "You say the sessions are holding steady at 10 K - then, tell me, what level of memory is that?"

Again, what I'd pointed to originally is Dave's graph above showing cf memory/heap use RISING all day (again more specifically, the point to which a gc FALLS is increasing all day). And if sessions are always at 10k level all day then THAT is what seems to "rule sessions out as high-memory user".

Do you see the logic now?

By @Charlie Arehart

Charlie, I saw the logic from the very beginning. As I said in my answer to @Scott_PALADEM , I can understand why Dave, you and he interpret a steady session count of 10 000 the way you do. It is worth repeating that that is beside the point of the suggestions I am making to Dave.

Here is the crux of my suggestions, in graphical form:

The pictured scenario is possible. A session, s1, can use 5 KB of memory at the start of the application, and 1 MB later on. Multiply that by 10 000. I hope you now understand why I am suggesting to Dave to rule sessions out.

Sessions might indeed turn out to have nothing to do with the memory issue. But then, let our investigations provide the evidence to eliminate them as suspects.
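
To put rough numbers on that scenario, here is a small back-of-the-envelope sketch. The per-session sizes (5 KB and 1 MB) are the hypothetical figures from this post, not measurements; the session count and max heap are the values Dave reported.

```cfml
<cfscript>
// Hypothetical figures from the discussion above, not measurements.
sessionCount = 10000;      // session count reported by FusionReactor
maxHeapMB    = 8192;       // configured max heap, in MB
perSessionKB = [5, 1024];  // 5 KB at application start, 1 MB later on

for (kb in perSessionKB) {
    // Total memory held by all sessions at this per-session size
    totalMB = sessionCount * kb / 1024;
    writeOutput("#kb# KB per session: about #round(totalMB)# MB, or #round(totalMB / maxHeapMB * 100)#% of the heap<br>");
}
</cfscript>
```

At 5 KB each, the sessions would total roughly 49 MB, which is negligible. At 1 MB each they would total 10 000 MB, more than the whole 8192 MB heap. Whether real sessions come anywhere near that size is exactly what would need to be measured.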

Report · Oct 13, 2024

@BKBK

I understand what you are saying and certainly @davecordes and @Charlie Arehart have also understood your point. In order to check this out Dave can analyze whether there is any functionality on the application that adds additional data to the session scope on various screens.

In a scenario where the application is adding additional data to the session as the user navigates the app like you are suggesting, I personally would expect to see the memory usage on the server to have more peaks and valleys throughout the day as various sessions start and end. This is because normal traffic doesn't all follow the same patterns and timing. However, FusionReactor is showing a steady increase in memory usage, which I personally would interpret as pointing towards something to do with data being added to the application scope, or caching as others have suggested.

I think that everyone understands everyone's suggestions and there is no need to keep belaboring your point. Dave has plenty of information to check on both possibilities.

-Scott

Report · Oct 13, 2024

@BKBK

Just a picture of how I personally interpret the graphs. In case that adds clarity.

The old generation memory supports my theory as well, because typically session data doesn't cause this to grow unless you have all your users maintaining their sessions over a long period of time.

The peaks and valleys here are what I would expect from longer sessions starting and ending but the steady orange is typically application level data.
Very unlikely in my opinion that would be caused by session data, but certainly Dave will want to check all avenues.

Report · Oct 13, 2024

Ok all, first let me say thank you to all who are helping me get through this. As a ColdFusion developer with 25+ years of experience (since ColdFusion 3.1), this has been one of the most difficult challenges I have faced.

I am getting super frustrated in attempting to understand heap snapshots. I have been taking a bunch of them but every time I run a comparison I really don't know what I'm looking at. Also, researching these class names has been terrible as well. I can't find any information on classes like "coldfusion.sql.InParameter".

That being said, I have decided to take a different approach. I want to see what happens if I can isolate the website with the most traffic. So I have decided to fire back up our ColdFusion 2021 servers to perform a test.

I have moved the website with the most traffic back to the previous version of ColdFusion, where we did not notice any problems for 2+ years.

- This server is running ColdFusion 2021, Update 16 with Java 11.0.24

- This server has the same min and max heap sizes as the ColdFusion 2023 servers (8192 for both)

- This server has the same cached templates and cached queries (1000 and 5000 respectively)

This test has been running for over 22 hours and I am not seeing any indications of a memory leak on either server.

Here is a Resources > Memory Overview screenshot for the last 24 hours on the ColdFusion 2021 server.

Here is a Resources > Memory Overview screenshot for the last 24 hours on the ColdFusion 2023 server.

Now, I realize that I could be jumping the gun here and may need to wait a few more hours for the memory leak to show itself, but right now, I am not seeing it.

I just realized that there is a small difference in the JDBC drivers we are using on ColdFusion 2021 vs 2023.

ColdFusion 2021 - Configured with the "Other" datasource type.

- Version 42.3.3

ColdFusion 2023 - Configured with the "PostgreSQL" datasource type.

- Version 42.5.1

Could there be a memory leak related to the 42.5.1 PostgreSQL JDBC driver?

PS - I am seeing the "coldfusion.sql.InParameter" class name in the second position most of the time on the ColdFusion 2021 heaps. I assume this is because we rely heavily on IN statements in our queries. Also, we are using cfqueryparam everywhere so I am wondering if this is creating a lot of objects.

PPS - Regarding sessions. Remember that FusionReactor effectively doubles the session count because of how J2EE sessions work inside of the Java engine. So when I say there are 10K sessions, it's really half that.

Here is more information:

https://docs.fusionreactor.io/UEM-and-Sessions/User-Experience-Monitoring/

Root session tracking in ColdFusion

When tracking sessions in ColdFusion you'll see a session with the name /root. This is caused by the way ColdFusion uses the J2EE sessions. When J2EE sessions are enabled in ColdFusion, the sessions are stored within the tomcat /root context. There is also then a ColdFusion session created that wraps this J2EE session. By default, ColdFusion sessions will expire based on a worker thread, but the J2EE sessions will expire based on time since last used. So the ColdFusion session can be destroyed by the worker thread, but the J2EE session will remain. For example, if a user comes back, a new ColdFusion session is created, but the same J2EE session will be used.

Report · Oct 13, 2024

Dave, your cf2021 graph DOES reflect the same memory issue. Don't you see the increasing low point of the oldgen over the 22 hours? Let's see indeed how things look in a couple or few hours.

/Charlie (troubleshooter, carehart. org)

Report · Oct 13, 2024

Yeah, it does look like it might happen but the troughs are still low enough to make me say that it has a chance to recover. We shall see. Here is the last hour.

Report · Oct 13, 2024

Is that the 2021 server? I was referring to that. And the low points of the GCs over the 22 hours shown were indeed steadily increasing, at least on average.

So how does it look now over that last hour (compared to that one from your last reply 3 hours ago)? Or how does the day view appear?

Is the trough to which they fall somehow settling down or still increasing?
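
For anyone who would rather track that trough as a number than read it off the graph, here is a minimal sketch that logs old-generation usage as of the most recent garbage collection, using the JVM's standard `java.lang.management` API. It assumes Java object access is enabled for the application; the log file name `oldgenTrough` is just a placeholder. Run it on a schedule and the logged values should show whether the post-GC floor is still creeping up.

```cfml
<cfscript>
// Assumption: Java object access is enabled for this application.
pools = createObject("java", "java.lang.management.ManagementFactory").getMemoryPoolMXBeans();

for (i = 0; i < pools.size(); i++) {
    pool = pools.get(javaCast("int", i));

    // Old-generation pools are named "PS Old Gen" (ParallelGC) or "G1 Old Gen" (G1GC)
    if (findNoCase("Old Gen", pool.getName()) == 0) continue;

    // Usage measured right after the most recent collection of this pool; may be null
    usage = pool.getCollectionUsage();
    if (isNull(usage)) continue;

    usedMB = round(usage.getUsed() / 1048576);
    writeLog(file="oldgenTrough", type="information", text="#pool.getName()# after last GC: #usedMB# MB");
}
</cfscript>
```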

/Charlie (troubleshooter, carehart. org)

Report · Oct 13, 2024

Yes, that was the 2021 server. Here is the day view. Things seem to be holding for now.

Report · Oct 13, 2024

This is the class name we've been talking about. It goes up and down with the GC but it has consistently been number 2 in the heap. Maybe this is the culprit.

Report · Oct 14, 2024

@davecordes , Thanks for the update and for the clarification on sessions. Your returning to test on ColdFusion 2021 is an inspired move.

Let's all now have a look at the new information. Hang on in there. I can sense that we are close to finding the cause of the problem.

By the way...

@davecordes wrote:

As a ColdFusion developer with 25+ years of experience (since ColdFusion 3.1),...

Report · Oct 14, 2024

@davecordes , while we continue to look for the root cause, I've just had a simple idea to rule out session-scoped and application-scoped variables.

Store the following code as a CFM test-page, under the webroot of the application. Launch the page when memory peaks at a problematic level.

<!--- Dump session scope and application scope as HTML within the current directory --->
<cfdump var="#session#" label="Session Scope Dump" format="html" output="#expandPath('sessionScopeDump.html')#">

<cfdump var="#application#" label="Application Scope Dump" format="html" output="#expandPath('applicationScopeDump.html')#">

The size of each file will be indicative of how much memory the scope uses.
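
A lighter-weight variation, sketched below under some assumptions, reports an approximate size per top-level key instead of writing full dumps. It uses the length of the JSON serialization as a rough proxy, which is not the same as heap usage, and some values may not serialize at all. The function name `approxScopeSizes` is made up for this example. Keep in mind that the session scope a page sees is only the session of the visitor requesting it, not all sessions on the server, and that serializing large scopes itself costs CPU and memory.

```cfml
<cfscript>
// Rough proxy only: JSON length is not heap size, but large keys stand out.
function approxScopeSizes(required struct scope) {
    var sizes = {};
    for (var key in scope) {
        try {
            sizes[key] = len(serializeJSON(scope[key]));
        } catch (any e) {
            sizes[key] = -1; // value could not be read or serialized
        }
    }
    return sizes;
}

writeDump(var=approxScopeSizes(application), label="Approx. characters per application key");
writeDump(var=approxScopeSizes(session), label="Approx. characters per session key (current session only)");
</cfscript>
```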

Report · Oct 14, 2024

... I can't find any information on classes like "coldfusion.sql.InParameter".

...

I just realized that there is a small difference in the JDBC drivers we are using on ColdFusion 2021 vs 2023.

ColdFusion 2021 - Configured with the "Other" datasource type.

- Version 42.3.3

ColdFusion 2023 - Configured with the "PostgreSQL" datasource type.

- Version 42.5.1

Could there be a memory leak related to the 42.5.1 PostgreSQL JDBC driver?

PS - I am seeing the "coldfusion.sql.InParameter" class name in the second position most of the time on the ColdFusion 2021 heaps. I assume this is because we rely heavily on IN statements in our queries. Also, we are using cfqueryparam everywhere so I am wondering if this is creating a lot of objects.

By @davecordes

You are right to call into question the class coldfusion.sql.InParameter and the PostgreSQL v42.5.1 JDBC driver.

coldfusion.sql.InParameter

This is of course one of ColdFusion's. It is related to cfqueryparam (or, the equivalent for stored procedures, cfprocparam). The fact that objects of coldfusion.sql.InParameter and java.lang.Integer occur so frequently on the heap might indeed point to a problem.

In some tests I've done, coldfusion.sql.InParameter and java.lang.Integer occur together when there is a cfqueryparam error. See attached example.

So, question is: are there cfqueryparam errors in your application's application.log and exception log? None? A few? Loads and loads?

In the example, I assigned a non-numeric value in <cfqueryparam cfsqltype="cf_sql_integer"> to a column of integer type. You can reproduce my findings yourself using:

<!--- Here the datatype of id is integer --->
<cfquery name="q_result" datasource="your_datasource_name">
select *
from yourTable
where id = <cfqueryparam cfsqltype="cf_sql_integer" value="abc">
</cfquery>
<cfdump var="#q_result#" >

The error message Invalid data abc for CFSQLTYPE CF_SQL_INTEGER. should appear in application.log and exception.log.
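
To get a quick sense of how often this error occurs, one could count the matching lines in a log. The sketch below assumes the default log location under `server.coldfusion.rootdir`; adjust the path if your logs live elsewhere. The search text is taken from the error message above.

```cfml
<cfscript>
// Assumption: logs are in the default location under the ColdFusion root directory.
logPath = server.coldfusion.rootdir & "/logs/application.log";
needle  = "for CFSQLTYPE"; // fragment of the "Invalid data ... for CFSQLTYPE ..." message
hits    = 0;

if (fileExists(logPath)) {
    logFile = fileOpen(logPath, "read");
    try {
        // Read line by line so a large log is never loaded into memory at once
        while (!fileIsEOF(logFile)) {
            if (findNoCase(needle, fileReadLine(logFile)) > 0) {
                hits++;
            }
        }
    } finally {
        fileClose(logFile);
    }
    writeOutput("Lines mentioning an invalid cfqueryparam value: #hits#");
} else {
    writeOutput("Log file not found: #logPath#");
}
</cfscript>
```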

PostgreSQL v42.5.1 JDBC driver

It happens that the PostgreSQL 42.5.1 driver has a security vulnerability. The vulnerability allows an attacker, under certain circumstances ('preferQueryMode=simple'), to inject SQL. So, you should look into it.

It might in fact be an idea to upgrade to a more recent PostgreSQL JDBC driver. In any case, you should consider it.

Report · Oct 14, 2024

BK,

We get hundreds, if not thousands, of those "invalid data" cfqueryparam errors per day. Usually it's Google or some other bot attempting to load a non-numeric value into the URL scope. Or it's some sort of penetration test by a bad actor. Either way, if this is part of the reason for the memory leak, hopefully the change below will have some effect.

In an attempt to calm those down, I am now prevalidating that the value is indeed numeric so that the "invalid data" error is no longer thrown. It looks similar to this.

select * from tbl_test where id = <cfif IsNumeric(URL.test)><cfqueryparam value="#URL.test#" cfsqltype="CF_SQL_INTEGER"><cfelse>0</cfif>

I also updated the PostgreSQL JDBC driver to 42.7.4 which is the latest version for Java 8+.

https://jdbc.postgresql.org/download/

Here is the last screenshot of the ColdFusion 2021 server before I restarted ColdFusion. It was only a matter of time.

I did check around for that "preferQueryMode=simple" in the JVM arguments but I couldn't find it.

I am now considering the most recent Tomcat upgrade as the cause of the memory leak, but I need to do more research.

Report · Oct 14, 2024

@davecordes

In the code you mentioned (below), I would suggest that you validate the data before the cfquery. Then if it isn't numeric, don't run the query at all. I would return some user-friendly "invalid request" error and avoid unnecessary load on the db.

"select * from tbl_test where id = <cfif IsNumeric(URL.test)>

<cfqueryparam value="#URL.test#" cfsqltype="CF_SQL_INTEGER"><cfelse>0</cfif>"
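
As a rough illustration of validating before the query, something along these lines could work. It is only a sketch: the datasource name, the query name `q_test`, the 400 status and the error text are placeholders. It uses a strict digits-only pattern rather than IsNumeric, since IsNumeric also accepts values such as decimals that are not integers.

```cfml
<!--- Treat a missing parameter the same as an invalid one --->
<cfparam name="URL.test" default="">

<!--- Accept only an optional minus sign followed by 1 to 9 digits, which stays within integer range --->
<cfif reFind("^-?[0-9]{1,9}$", URL.test) EQ 0>
    <cfheader statuscode="400">
    <cfoutput>Invalid request.</cfoutput>
    <cfabort>
</cfif>

<!--- Only reached when URL.test is a valid integer, so the database is never hit with bad input --->
<cfquery name="q_test" datasource="your_datasource_name">
    select * from tbl_test where id = <cfqueryparam value="#URL.test#" cfsqltype="cf_sql_integer">
</cfquery>
```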

Report · Oct 14, 2024

Hi Scott,

Yep I will certainly do that. This was just a quick fix before I restarted ColdFusion.

In your opinion, what scenario is better?

- Allowing an invalid URL parameter to throw a cfqueryparam error (Current situation)

- Stripping the invalid value before it gets to the database and using cfthrow to throw a manual error

- Stripping the invalid value before it gets to the database and not throwing any sort of error