Possible Memory Leak - ColdFusion 2023 + Java 17.0.12

Report · Oct 08, 2024

Ever since we upgraded from ColdFusion 2021 to ColdFusion 2023 we have been dealing with out of memory issues. ColdFusion will run fine for roughly 24-30 hours, then we will start seeing CPU spikes to 100% every 30 seconds. Garbage collection can't free up enough memory so ColdFusion eventually crashes and we have to restart the server.

Things we have tried that don't seem to help:

- Downgrading to 17.0.11

- Tweaking the min and max heap sizes

- Tweaking the caching settings

- Changing the garbage collector algorithm to G1GC

- Tweaking our websites to cache queries for a shorter period of time (1 hour down to 15 minutes down to 5 minutes)

Here are our current settings:

Min Heap: 8192

Max Heap: 8192

Garbage Collector: UseParallelGC

Cached Templates: 1000

Cached Queries: 5000
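
For anyone who wants to confirm at runtime which heap size and collector are actually in effect, here is a minimal sketch using the standard java.lang.management API from CFML. The variable names are only illustrative, and the page would need to run on the affected instance.

```cfml
<cfscript>
// Standard JDK management API; nothing here is ColdFusion-specific.
mgmt = createObject("java", "java.lang.management.ManagementFactory");
heap = mgmt.getMemoryMXBean().getHeapMemoryUsage();
mb = 1024 * 1024;

// Current heap use against the configured maximum (-Xmx).
writeOutput("Heap used: " & int(heap.getUsed() / mb) & " MB of max " & int(heap.getMax() / mb) & " MB<br>");

// One line per collector: name, number of collections, total time spent.
gcBeans = mgmt.getGarbageCollectorMXBeans();
for (i = 0; i < gcBeans.size(); i++) {
    gc = gcBeans.get(javaCast("int", i));
    writeOutput(gc.getName() & ": " & gc.getCollectionCount() & " collections, " & gc.getCollectionTime() & " ms total<br>");
}
</cfscript>
```

With UseParallelGC the collector names normally start with "PS", and with G1GC they start with "G1", so the output is a quick check that a changed setting has really been picked up after a restart.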

We do have Fusion Reactor installed on all of our servers but this is like trying to find a needle in a haystack. I really don't know what I should be looking at.

Here is the most recent screenshot, from 2 days ago, that shows the eventual demise of one of our servers.

I am really at my wit's end here. If this isn't a memory leak I don't know what the heck it is. If anyone has any recommendations on what to try next I would appreciate it.

Report · Oct 14, 2024

In your opinion, what scenario is better?

- Allowing an invalid URL parameter to throw a cfqueryparam error (Current situation)

- Stripping the invalid value before it gets to the database and using cfthrow to throw a manual error

- Stripping the invalid value before it gets to the database and not throwing any sort of error

By @davecordes

Making a connection to the database and executing a query are expensive processes. Does the application know who is making the URL request? If not, I wouldn't advise a trip to the database.

I would therefore do it as follows:

<!--- 
It is assumed that the user has been validated. 
That is, the application has code for a boolean such as isValidUser.
--->

<cfif isValidUser and isNumeric(URL.test)>
    <cfquery>
    select *
    from tbl_test
    where id=<cfqueryparam cfsqltype="CF_SQL_INTEGER" value="#URL.test#">
    </cfquery>
<cfelseif isValidUser and not isNumeric(URL.test)>
    <!--- Ensure that an appropriate error page is displayed to the user --->
<cfelse>
    <!--- Ignore --->
    <!--- If necessary, log details of requester, for example, IP, location, visit frequency, etc., for forensic purposes ---> 
</cfif>
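
One caveat with the check above: isNumeric() also accepts values such as 3.7, which are numeric but not valid integer ids. A stricter variant could look like the sketch below. The function names isSafeIntegerId and getTestRow are hypothetical, and the query assumes a default datasource is set in Application.cfc, as the cfquery above also does.

```cfml
<cfscript>
// Hypothetical helper: true only for 1 to 9 plain digits,
// which always fits a 32-bit SQL integer.
boolean function isSafeIntegerId(required string value) {
    return reFind("^[0-9]{1,9}$", arguments.value) eq 1;
}

// Hypothetical wrapper: rejects bad input before any database work is done.
query function getTestRow(required string rawId) {
    if (!isSafeIntegerId(arguments.rawId)) {
        throw(type="InvalidId", message="The id parameter must be a whole number.");
    }
    return queryExecute(
        "select * from tbl_test where id = :id",
        { id: { value: arguments.rawId, cfsqltype: "cf_sql_integer" } }
    );
}
</cfscript>
```

A caller would pass URL.test to getTestRow() after checking structKeyExists(URL, "test"), and catch the InvalidId type to show the error page or silently ignore the request.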

Report · Oct 14, 2024

Let's revisit the heap histogram from FusionReactor:

The classes coldfusion.sql.InParameter and java.lang.Integer led us to cfqueryparam errors, which in turn led us to bots. Given what you've just discovered about the impact of bot traffic, I think there is more where that came from.

Errors are transient. So you wouldn't expect ColdFusion to maintain objects pertaining to errors in memory. The one thing pertaining to cfqueryparam that makes sense for ColdFusion to maintain in memory is a query. In other words, a cached query.

Therefore, we could hypothesize that those coldfusion.sql.InParameter and java.lang.Integer objects on the heap represent cached queries containing cfqueryparam of integer cfsqltype.

Suppose this hypothesis is correct. Then drastically reducing bot traffic - in essence, the number of cached queries currently in memory and bot access to cached-query code - will lead to a reduction in the number of coldfusion.sql.InParameter and java.lang.Integer objects on the heap. Hence, to a reduction of memory usage.

Needless to add, it makes sense to review the cachedWithin attribute of every cached query. Assign to each the smallest feasible time-span value. In fact, only cache where necessary.
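
To illustrate, here is a minimal sketch of a cached query with a short time span. The function name getCachedRow is hypothetical, tbl_test is borrowed from the earlier example, and a default datasource is assumed.

```cfml
<cfscript>
// Hypothetical helper; caches each result for 5 minutes only.
query function getCachedRow(required numeric id) {
    return queryExecute(
        "select * from tbl_test where id = :id",
        { id: { value: arguments.id, cfsqltype: "cf_sql_integer" } },
        { cachedWithin: createTimeSpan(0, 0, 5, 0) } // days, hours, minutes, seconds
    );
}
</cfscript>
```

As far as I understand it, the parameter value is part of the cache key, so every distinct id requested gets its own cache entry. A bot walking through thousands of ids could therefore fill the query cache up to the server-wide limit, whatever the time span.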

Report · Oct 14, 2024

@davecordes , Thanks for the update. Sorry to see that ColdFusion 2021 is also giving problems.

PreferQueryMode is a PostgreSQL setting. That is why you couldn't find it among ColdFusion's settings. To look for it, open the ColdFusion Administrator and navigate to the Data Sources page. Then click on the link of the PostgreSQL datasource.

There are then two possible places to look:

(1) the JDBC URL field, if that is what the datasource uses to connect, or alternatively,

(2) the Connection String field, via the button Show Advanced Settings.

Make sure that the setting, if it is there, does NOT include preferQueryMode=simple. That is the unsafe value.

The default value of the setting, preferQueryMode=extended, is the safe option. This includes the case where no value is explicitly set.

I would second @Scott_PALADEM 's suggestion that you avoid querying the database when URL.test is not of numeric type.

Report · Oct 14, 2024

BK,

Ahh ok, cool. I checked the datasource and didn't see that string under advanced settings so I think we are good there. I am working on the other adjustment to bypass the query.

Report · Oct 14, 2024

Dave, on your last point regarding tomcat, that's possible--though since that was part of the August cf updates, it could have been part of that or other cf updates.

Indeed, I'd wondered whether the cf2021 you "moved back to" was at the same update level it was at BEFORE you moved to cf2023. If not, that may be why the memory problem is happening now also on cf2021. You can look in the cfusion/hf-updates folder to find when cf updates were installed (or un-installed). Might prove interesting for you to confirm, regardless of expectations/recollections.
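
If it helps, that folder can also be listed from CFML. This is only a sketch: it assumes server.coldfusion.rootdir points at the cfusion directory of the instance and that the updates folder sits directly beneath it.

```cfml
<cfscript>
// Assumption: hf-updates lives directly under the instance root.
updatesDir = server.coldfusion.rootdir & "/hf-updates";
writeOutput("Running version: " & server.coldfusion.productversion & "<br>");

if (directoryExists(updatesDir)) {
    // List entries oldest first, so the install order is easy to read.
    entries = directoryList(updatesDir, false, "query", "", "dateLastModified asc");
    for (entry in entries) {
        writeOutput(encodeForHTML(entry.name) & " : " & dateTimeFormat(entry.dateLastModified, "yyyy-mm-dd HH:nn") & "<br>");
    }
} else {
    writeOutput("No hf-updates folder found at " & encodeForHTML(updatesDir));
}
</cfscript>
```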

And to this and MANY recent discussions here, I'd had this and other thoughts to share, but I've opted to remain quiet other than to respond to what you present (or if anyone asks me), as there have been plenty of ideas for you to consider already. I'm sure it's overwhelming.

Given that cf2021 is seemingly now exhibiting the same issue as 2023, this new info should help direct our focus (and it supports my very first contention that I wasn't aware of anything in cf2023 that was a new source of memory use). The driver may be the issue, or it may be something else. As bk said, it feels we're getting closer.

/Charlie (troubleshooter, carehart.org)

Report · Oct 14, 2024

Hey Charlie,

We were running ColdFusion 2021, Update 12 and Java 11.0.21 before updating it to Update 16, Java 11.0.24 when I moved that one website over a couple days ago.

I believe Update 13 was the Tomcat update (and scope updates), so maybe that's why we didn't notice any memory leaks for over 2 years. I see that Update 15 also had a Tomcat update.

https://helpx.adobe.com/coldfusion/kb/coldfusion-2021-update-13.html

https://helpx.adobe.com/coldfusion/kb/coldfusion-2021-update-15.html

I do believe we are getting closer.

Report · Oct 14, 2024

Well, yes, update 13 (from March) has a tomcat update and MUCH more, while update 15 (from August) had ONLY a tomcat update.

But yes, bottom line, it seems that SOME update (or perhaps any other change) made to your cf2021 is causing the issue now (and would explain the issue in 2023, assuming similar changes made there).

Since you changed both the cf update AND Java on the cf2021 server, a seemingly optimal route to identifying the key change would be to a) set the cf2021 server back to however it was configured before the problem.

(And you would want to make sure over a few days that there is no memory issue. That would prove there's not some OTHER issue, if otherwise the problem DOES still happen on that setup of cf2021.)

Then, b) make one change at a time (apply only one cf update at a time) to see if the memory issue returns. If not, then c) make the jvm change.

I realize that could take days to wait between tests. I just said it was a seemingly optimal approach. You may prefer not to expend that effort, or you may put your hope in the bot management. This, too, is something I alluded to originally--and I wondered if it might even be that your traffic had changed RIGHT around your move to cf2023. That's what could explain both the "new problem" AND also would seemingly follow you to cf2021 if all the same traffic (for all the apps) was running there...perhaps even if it was configured as when all was well before.

But let's see what you find by whatever approach you follow, perhaps including things others have suggested here, or that you come up with on your own.

/Charlie (troubleshooter, carehart.org)

Report · Oct 14, 2024

Charlie,

Yeah, I kind of screwed that up by updating both ColdFusion and Java at the same time when I should have left it alone.

At this point I am willing to leave it alone while I explore other avenues.

Report · Oct 14, 2024

coldfusion.sql.InParameter

This is of course one of ColdFusion's. It is related to cfqueryparam (or, the equivalent for stored procedures, cfprocparam).

By @BKBK

I later added the part in bold text for completeness.

Report · Oct 14, 2024

BK,

Gotcha. We are not using cfprocparam so we are good there.

Report · Oct 14, 2024

@BKBK

I understand what you are saying and certainly @davecordes and @Charlie Arehart have also understood your point.

By @Scott_PALADEM

That wasn't clear to me. You and Charlie, in particular, presumed that a steady session count of 10 000 could not result in an increase in memory. Look back at the posts.

In a scenario where the application is adding additional data to the session as the user navigates the app like you are suggesting, I personally would expect to see the memory usage on the server to have more peaks and valleys throughout the day as various sessions start and end.

By @Scott_PALADEM

I agree with you. However, I wondered what would happen if sessions never ended. Hence my suggestion that Dave look into this.

I think that everyone understands everyone's suggestions and there is no need to keep belaboring your point.

By @Scott_PALADEM

Isn't it because I belaboured the point that you now understand it?

Report · Oct 14, 2024

@BKBK

I understood your points from your first comment. I simply don't think that session data is the most likely cause of the symptoms I am seeing.

If what you're suggesting was the issue, then this graph would look completely different:

You will note that 30 minutes into this chart, the sessions are being destroyed after the 30 minute timeout. There is then a steady level of sessions being created and destroyed.

If there were 10k active sessions being maintained by bots for 24 hours, there would be less orange and a lot more green on this chart.

Report · Oct 14, 2024

@davecordes

I am curious if you have a robots.txt file configured on your app. I had a client recently whose application was being bombarded by various bots and causing the database to max out the CPU. Adding some disallows to robots.txt eased the pressure on the database.

There has been a significant increase in bots scraping data from websites (likely looking for data to train AI models and stuff).

It would probably be a good idea to turn on the Log User-Agent setting in FusionReactor's request logging settings. Then you can get a good feel for what kind of bot traffic you might want to start blocking.
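
If you also want the same information in a ColdFusion log, a minimal sketch is below. The function name logRequestAgent and the log file name userAgents are made up for illustration, and the function is meant to be called from onRequestStart in Application.cfc.

```cfml
<cfscript>
// Hypothetical helper: writes one line per request to userAgents.log
// in the ColdFusion logs directory.
void function logRequestAgent() {
    writeLog(
        file = "userAgents",
        type = "information",
        text = cgi.remote_addr & " | " & cgi.script_name & " | " & cgi.http_user_agent
    );
}
</cfscript>
```

Logging every request adds some disk I/O, so this is probably best left on for a day or two of sampling rather than permanently.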

Report · Oct 14, 2024

Hi Scott,

We do have a robots.txt file but it's not very restrictive. We are considering adding the following line item to see if we can suppress some of the bot traffic we are seeing right now.

Disallow: /*?*

Using Disallow: /*?* can be an effective way to manage the crawling of URLs with parameters, but it’s essential to ensure it aligns with your overall SEO strategy.

Do you use this strategy on any of your client websites?

Report · Oct 14, 2024

It really depends on your preferences. You can go the "whitelist" route of denying all by default, then allow specific ones that you want to crawl your site, or you can go the "blacklist" route of specifically denying ones that are troublesome. For my recent client we went the blacklist route because they have an e-commerce site and didn't want to have any negative impact on their SEO, so we just did some logging of user agents and then researched the various bots and blocked specific ones that were troublesome. We ended up with the following robots.txt, which had an immediate effect on his server resources not being hammered (we also implemented a crawl delay to slow down the good bots):

User-agent: SemrushBot
Disallow: /

User-agent: SerpStat
Disallow: /

User-agent: barkrowler
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: *
Crawl-Delay: 5
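
Keep in mind that robots.txt is only advisory, so bots that ignore it will keep coming. As a rough sketch of an application-level fallback, the same list of agents could be checked in CFML. The function name isBlockedAgent is hypothetical, and the list simply repeats the agents named above.

```cfml
<cfscript>
// Agents taken from the robots.txt above.
blockedAgents = ["SemrushBot", "SerpStat", "barkrowler", "Meta-ExternalAgent", "Amazonbot"];

// Hypothetical helper: true if the User-Agent header contains any blocked name.
boolean function isBlockedAgent(required string userAgent, required array agents) {
    for (var agent in arguments.agents) {
        if (findNoCase(agent, arguments.userAgent) gt 0) {
            return true;
        }
    }
    return false;
}
</cfscript>
```

Called from onRequestStart with cgi.http_user_agent, a true result could be answered with cfheader(statuscode="403", statustext="Forbidden") followed by abort, so the request never reaches a query. Blocking at the web server or CDN would be cheaper still, since the request then never reaches ColdFusion.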

Report · Oct 14, 2024

I have also just recently started using Cloudflare for DNS, which has some bot protection settings as well. I haven't been using it long enough to have a firm recommendation, but early evidence suggests that it is at least reducing certain types of bots that scan websites for common vulnerabilities. (For example, I am seeing fewer errors from bots futilely trying to find WordPress administrator URLs on my CF applications so they can try to brute-force attack them or whatever.)

Report · Oct 14, 2024

@Scott_PALADEM , Take another look at the displays:

Your reading of the graphs at the top chimes with mine. Sessions are indeed being destroyed about as fast as they are being created.

However, whereas your focus is on the topmost graphs, mine is on the ones below. That is, the blue. I also took into account the table underneath.

The first ColdFusion instance created, on average, 22757 sessions per hour and destroyed, on average, 25667 sessions per hour. The second ColdFusion instance created, on average, 22735 sessions per hour and destroyed, on average, 22394 sessions per hour.
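
Putting those averages side by side, the net change per hour can be worked out as in the sketch below. The figures are the ones quoted above; the variable names are only illustrative.

```cfml
<cfscript>
// Average sessions created and destroyed per hour, as read from the table.
instances = [
    { name: "instance 1", created: 22757, destroyed: 25667 },
    { name: "instance 2", created: 22735, destroyed: 22394 }
];

netPerHour = {};
for (inst in instances) {
    // Positive means sessions are accumulating, negative means they are draining.
    netPerHour[inst.name] = inst.created - inst.destroyed;
    writeOutput("#inst.name#: net #netPerHour[inst.name]# sessions per hour<br>");
}
</cfscript>
```

That gives a net of -2910 per hour for the first instance and +341 per hour for the second. These are averages over the displayed window only, so they hint at a trend rather than prove one.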

With 200 simultaneous users and 8000 daily users, I wondered how the 10 000 active sessions arose and what their impact might be.

Dave has offered some clarification in his recent post. If I understand it correctly, we have to read that as 5 000 active J2EE sessions. Even so, I am still at a loss to know what's going on. I hope my cfdump suggestion will provide a more definitive clue.

To be honest, I have no answers on this subject ("sessions"), only questions. That explains why my list of suggestions was extensive.

Report · Oct 14, 2024

@BKBK

I was taking into consideration all of the graphs that have been shared. I am simply saying that the amount of data stored in the session scope is not a likely factor.

It IS certainly possible the sheer volume of traffic may be at play, especially if it is bot traffic designed to scrape all the data it can out of the clients database. This kind of unnatural traffic can sometimes cause stress on the database and if it is pulling large amounts of data into memory and caching it that may also be a factor. I don't know enough about this particular app, but analyzing bot traffic is something I would definitely investigate.

Report · Oct 11, 2024

Hi @davecordes , is it possible to share the previous and current JVM parameters so we can compare them?

Do you use CFTHREAD tag?

Report · Oct 11, 2024

Hi Paolo,

I do have the previous and current JVM arguments to compare but those aren't very helpful since we are using the same parameters.

Min Heap is the same on both servers.

Max Heap is the same on both servers.

We do not use CFTHREAD.

Report · Oct 12, 2024

@davecordes , have you used Spotify's online thread dump analyzer yet? If so, what were the results?

In case that didn't help, here is another thread dump tool, FastThread. It is free for limited use.

Report · Oct 14, 2024

Hey BK,

I have not tried that yet. I've been busy working on a strategy to suppress some bot traffic.

Report · Oct 14, 2024

@davecordes , Have you tried my cfdump suggestion? If so what were the file sizes?
(Given that the code is all there, the test should take you all of 45 seconds.)

Could you also share the contents of jvm.config?

Report · Oct 14, 2024

BK,

Yes I did try your dump suggestions. The files were pretty small.

- applicationScopeDump.html was 118 KB

- sessionScopeDump.html was 10 KB
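
For anyone following along, the test code itself was posted earlier in the thread and is not repeated here. A test of this kind might look roughly like the sketch below, which assumes session management is enabled and writes the two files to the server's temp directory.

```cfml
<cfscript>
// Assumption: the temp directory is writable by the ColdFusion service account.
appFile = getTempDirectory() & "/applicationScopeDump.html";
sessFile = getTempDirectory() & "/sessionScopeDump.html";

// Dump each scope to its own HTML file instead of the browser.
writeDump(var=application, output=appFile, format="html");
writeDump(var=session, output=sessFile, format="html");

// Report the resulting file sizes in bytes.
writeOutput("applicationScopeDump.html: " & getFileInfo(appFile).size & " bytes<br>");
writeOutput("sessionScopeDump.html: " & getFileInfo(sessFile).size & " bytes<br>");
</cfscript>
```

The size of a dump file is only a rough proxy for how much heap the scope holds, but a file of a few kilobytes is a fair sign that the scope is not hoarding data.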

JVM Config

#
# VM configuration
#
# Where to find JVM, if {java.home}/jre exists then that JVM is used
# if not then it must be the path to the JRE itself

java.home=D:/Java/jdk-11.0.24

#
# If no java.home is specified a VM is located by looking in these places in this
# order:
#
#  1) ../runtime/jre
#  2) registry (windows only)
#  3) JAVA_HOME env var plus jre (ie $JAVA_HOME/jre)
#  4) java.exe in path
#

# Arguments to VM

java.args=-server  -Xms8192m -Xmx8192m --add-opens=java.rmi/sun.rmi.transport=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/sun.util.cldr=ALL-UNNAMED --add-opens=java.base/sun.util.locale.provider=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED -XX:+UseParallelGC -Djdk.attach.allowAttachSelf=true -Dcoldfusion.home={application.home} -Duser.language=en -Dcoldfusion.rootDir={application.home} -Dcoldfusion.libPath={application.home}/lib -Dorg.apache.coyote.USE_CUSTOM_STATUS_MSG_IN_HEADER=true -Dcoldfusion.jsafe.defaultalgo=FIPS186Random -Dorg.eclipse.jetty.util.log.class=org.eclipse.jetty.util.log.JavaUtilLog -Djava.util.logging.config.file={application.home}/lib/logging.properties -Dtika.config=tika-config.xml -Djava.locale.providers=COMPAT,SPI -Dsun.font.layoutengine=icu -Dcom.sun.media.jai.disableMediaLib=true -Dcoldfusion.datemask.useDasdayofmonth=true -Dcoldfusion.classPath={application.home}/lib/updates,{application.home}/lib/,{application.home}/gateway/lib/,{application.home}/wwwroot/WEB-INF/cfform/jars,{application.home}/bin/cf-osgicli.jar -javaagent:D:/FusionReactor/instance/cfusion.cf2021/fusionreactor.jar=name=cfusion.cf2021,address=8088 -agentpath:D:/FusionReactor/instance/cfusion.cf2021/frjvmti_x64.dll

# Comma separated list of shared library path
java.library.path={application.home}/lib,{application.home}/jintegra/bin,{application.home}/jintegra/bin/international

# Comma separated list of shared library path for non-windows
java.nixlibrary.path={application.home}/lib

java.class.path=

Report · Oct 14, 2024

@davecordes , thanks for the update on session and application scope, and for the JVM settings. The small file sizes tell us that application-scoped and session-scoped variables are unlikely to be the culprits.

Now looking into the JVM settings.