CF/Jrun Memory Usage Problems: could fileexists() be part of the problem?

Report · Jun 06, 2008

We recently launched a website that has caused our CF 8.x server to continuously run out of memory. Under both load testing in development and in production, Jrun's memory profile just keeps rising and rising, and rarely seems to release any memory before eventually hitting its max heap size (as defined in CFadmin). This is happening on the production server without there being a lot of traffic on the site, save for the Yahoo, MSN and Google bots, which are fairly aggressively indexing the site's content.

Originally we were thinking we had one of the usual problems: a coding mistake causing an infinite loop, possibly loading too much data into session scope, general site application errors, or the site spawning too many sessions (e.g., by having search engines trawling all site pages and links; the site is an online museum collections database, and there are literally thousands of links throughout the application, as users drill down into the site and browse the collection by various topical trees).

We did:

* An intensive code review and subsequent fixes (the cf logfiles show basically no application errors now)

* Logged pages that were taking a lot of time, optimized code and SQL business logic accordingly.

* Added a robots.txt, site XML file and a special content indexing .cfm page for search engines, added index no-follow directives to site pages to keep bots out of the website, except on pages we wanted them to index.

* Optimized the session management on the site, to minimize the memory footprints of user sessions, and to also eliminate the possibility of search engine bots causing CF to set a new session on every request.

Still no luck.

We have two remaining things we are looking at:

* Further SQL query optimization: some of the queries can return a few thousand records (displayed via a typical web paging navigation system, i.e., next/previous N records). We're looking at using SQL 2005's record paging functionality to further reduce the amount of data that gets loaded into memory on each request (although one would think CF would eventually do garbage collection to release this, no? Especially if you're not caching the queries explicitly, and the number of cached queries in CFADMIN is set to a low number?)

* fileexists(): we're using CF's fileexists() function in several places in the application to detect if an artifact image exists and, if not, to display a placeholder image and/or generate one on the fly (for each artifact in the database there can be up to five different images of various sizes: the application auto-generates some of the image versions, with the client only uploading the primary artifact image when they add new artifacts to the database). Even though we've created separate directories for various image types, this still means that CF is having to run the fileexists function on folders that have thousands of artifact image files in them. I'm wondering if this could be causing some of our memory problems? Does the fileexists() function basically do recursion on the directory that it scans, and could this be causing server issues?
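
For reference, a rough sketch of the paging idea in the first bullet, written in Java rather than CFML purely for illustration. The table and column names (artifacts, artifact_id, title) and the class name are made up; the only real feature assumed is SQL 2005's ROW_NUMBER(). The two ? placeholders would be bound to the values from rowBounds(), so only one page of rows ever reaches the application server.

```java
final class ResultPager {
    // Hypothetical table and column names, for illustration only.
    // ROW_NUMBER() numbers the ordered rows so the outer query can keep one page.
    static final String PAGED_SQL =
        "SELECT artifact_id, title FROM ("
      + " SELECT artifact_id, title,"
      + " ROW_NUMBER() OVER (ORDER BY title, artifact_id) AS row_num"
      + " FROM artifacts"
      + ") AS numbered WHERE row_num BETWEEN ? AND ?";

    // Returns the 1-based, inclusive first and last row numbers for a 1-based page.
    static long[] rowBounds(int page, int pageSize) {
        if (page < 1 || pageSize < 1) {
            throw new IllegalArgumentException("page and pageSize must be >= 1");
        }
        long first = (long) (page - 1) * pageSize + 1;
        long last = first + pageSize - 1;
        return new long[] { first, last };
    }

    public static void main(String[] args) {
        long[] bounds = rowBounds(3, 25);
        // Page 3 at 25 rows per page covers rows 51 to 75.
        System.out.println("rows " + bounds[0] + " to " + bounds[1]);
        System.out.println(PAGED_SQL);
    }
}
```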

Also, the server was completely stable before we published this new site to it. All other sites on it are developed by us, so there isn't any third party code to worry about. Testing in the development/staging environment generates identical problems (running a custom search bot on it, using Microsoft's stress testing tool, other...), with the added note that on our staging server we are occasionally getting recursion and memory/stack overflow errors returned to browsers as developers are working on their projects and testing. I haven't seen that specific error in a browser on production, but that could just be a reflection of the number of times during the course of a day we're looking at staging versus production. There are of course a lot of other sites on the dev/staging box, so it could be unrelated.

Server/App Specs:

* Dual Quadcore Dell Servers
* 4 gigs RAM in server
* Mirrored RAID (Ultra SCSI 320, not SATA)

Web Server:

* Win 2003, latest service pack
* CF 8,0,0,176276
* Native SQL Server database connection (cf datasource basically using default datasource settings, with the exception of the Allowed SQL Permits)

Database Server:

* Win 2003, latest service pack
* SQL 2005 Standard Edition, latest service pack

CF Settings:

* CF's JVM has 512 megs min heap, 1024 megs max, maxpermsize set to 256 megs
* CF is already configured to minimize other possible memory usage (max number of simultaneous requests is 12; the number of cached queries, templates etc. has been lowered below the defaults to see if that would help, which it doesn't)

Any ideas/suggestions?

Thanks in advance,

Sean

Report · Jun 06, 2008

Good analysis / summary of the situation.

Maybe try FusionReactor or something like that to monitor what's running
and when, and see if anything sticks out at you. CF8 (Enterprise) itself
has some sort of monitoring stuff... I haven't looked at it, so can't say
how useful it is, but it could help out.

If you force a GC, does any memory free up?
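
A minimal sketch of what I mean, in plain Java (the class and method names are just made up for illustration). It reads the used-heap figure from java.lang.Runtime before and after requesting a collection. Bear in mind System.gc() is only a request, so the "freed" number can be small, zero or even negative on any given run.

```java
final class GcProbe {
    // Heap currently occupied by objects (live, or dead but not yet collected), in bytes.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Requests a collection and returns how far the used figure dropped.
    // System.gc() is a hint to the JVM, not a guarantee, so the result may be <= 0.
    static long bytesFreedByGc() {
        long before = usedHeapBytes();
        System.gc();
        long after = usedHeapBytes();
        return before - after;
    }

    public static void main(String[] args) {
        System.out.printf("used before: %d KB%n", usedHeapBytes() / 1024);
        System.out.printf("freed by GC: %d KB%n", bytesFreedByGc() / 1024);
        System.out.printf("used after:  %d KB%n", usedHeapBytes() / 1024);
    }
}
```

If the used figure barely moves after repeated collections while the app is idle, something is still holding references to that memory.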

I dunno about the fileExists() theory. It doesn't do any *recursion* as
far as I can tell: it looks at the specific location you give it;
it doesn't search for anything. You could always swap the call out for a
Java equivalent, or - temporarily, and in your lab environment - take out
the code and let it fail if the file's not there, or try/catch it or
something. That should help factor it out, anyhow.
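
Roughly what the Java equivalent looks like, as a sketch only. The class and method names here are invented, and I'm assuming (not asserting) that CF's own fileExists() does something similar under the hood. The point is that java.io.File asks the OS about the one path you hand it; there's no call here that lists or walks the parent directory.

```java
import java.io.File;
import java.io.IOException;

final class FileCheck {
    // True only if this exact path names an existing regular file.
    // The check is made against the single path given; no directory listing is requested.
    static boolean imageExists(String absolutePath) {
        return new File(absolutePath).isFile();
    }

    public static void main(String[] args) throws IOException {
        // Create a throwaway file so the example has something real to find.
        File tmp = File.createTempFile("artifact", ".jpg");
        try {
            System.out.println(imageExists(tmp.getAbsolutePath()));
            System.out.println(imageExists(tmp.getAbsolutePath() + ".missing"));
        } finally {
            tmp.delete();
        }
    }
}
```

From CFML you'd get at the same class with createObject("java", "java.io.File"), which makes it easy to swap in for a test run.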

Have you actively *checked* all existing sessions when RAM ramps up to
make sure there's not something you don't expect in there? Ditto
application-scope. What if you specifically wipe out all sessions and the
application scope and then wait for a GC to happen?

Good idea re the paging of the queries. You really only ought to be
returning the data you use for a given request. I presume you're not using
SELECT * anywhere? That stuff should clear itself up though, even if you
were.

Are all your local variables in your CFCs correctly VARed, especially ones
with persisted instances?

One thing we had in the past which consumed a ridiculous amount of RAM was
storing CFC instances in session. They seemed to be taking... from
memory... about 400kB each, even for fairly lightweight objects. This
*was* a few years ago under CFMX6.1, so might not be such a consideration
now.

Are you doing anything slightly off-the-wall, code-wise? That's possibly
hard to quantify, I know. But something different in this app compared
with your other apps that aren't giving you gyp.

Maybe try *increasing* the number of templates to cache in RAM. The
management of the caching of them might itself be leaking. Count your
*.cf? files and set the value to be around that.

Try the same with the queries.

Needless to say: only try one thing at a time.

Are your JDBC drivers completely up to date? (I'd assume "yes" with CF8.)

Is it possible to just load test sections of your site, to see if you can
narrow down some unexpectedly errant piece of functionality? Or the
reverse - factor out parts of the functionality which *don't* cause the
problem.

That's all I can think of for the moment. Feed back your further findings
to see if they flag anything else.

Good luck. These are the "fun" ones to troubleshoot.

--
Adam

Report · Jun 08, 2008

Thanks Adam,

Between your post and the responses I got to my post on CF-Talk we've got a number of things to explore. I'll report back when we've finally tracked down the source(s) of our problems.

Cheers,

Sean

Report · Jun 12, 2008

Hey everyone,

An update:

* We installed FusionReactor

* Deployed the site on a development box running CF Enterprise

* Ran a number of load tests and tests on discrete parts of the application

* Did various application and memory monitoring using both FusionReactor and CF's Server Monitor

* Tweaked some more code based on the tests, including:

** fixing a few application errors;

** optimizing a few cfc methods and queries that were slow;

** reconfirming that we don't have any improperly scoped variables in the app, and that there wasn't any weird recursion or referencing of objects

** converting some pages that require complex but more or less static database queries into includes that get regenerated whenever a new data import changes the data in the database;

** temporarily commenting out some code that invokes cfx_image3 to generate artifact thumbnails.

* Changed CF's Query Caching settings from 100 to 50 to zero.

* Retested

* Played with various JVM settings that people suggested.

No luck. Memory still keeps climbing. In fact, on some of the search results requests that return a large number of records you can watch Jrun grab a good 15-25 megs and not let it go. If you keep hitting the same page you can quickly force Jrun to hit its buffer and stop responding. If you shut this site down and restart Jrun in either dev or production, the server runs beautifully, with no memory problems.

Here's the weird thing: when you analyze these requests and the server state while running the load test, the total amount of memory being consumed by application and session variables, objects created and destroyed during the request cycle, even the size of the data returned from the queries, is completely insignificant:

* The total amount of memory used by the application's objects stored in the application scope is less than a meg.

* User sessions rarely take up more than a few kilobytes.

* There's nothing in the server scope.

* The largest query object clocks in at well under a megabyte.

Moreover, in the CF Server Monitor it shows that CF is actually destroying the objects as expected and releasing memory back into the pool. However, if you actually look at Jrun in the Task Manager, it's not. And, hitting Run GC to invoke garbage collection in the Server Monitor doesn't get the memory back either.

I'm running out of ideas. The last couple of things on our list are to rewrite some of the SQL to use SQL 2005's paging functionality (haven't got to that yet), and to try eliminating one technique we've been using. Some of our application-scoped singletons store a reference to our app's SessionFacade within themselves. This should be fine, because the SessionFacade has NO private/public properties, and when we invoke the SessionFacade's methods inside the other objects we're doing so according to best practices: declaring a variable that is private, i.e., <cfset otherLang = variables.sessionFacade.getOtherLanguage("en") />

Might we have run into a memory leak in the native SQL Server database driver? Or some kind of obscure JVM problem? Or maybe some kind of recursion hidden away in some cfc or view code that isn't causing a request to hang, but is somehow chewing up memory?

My next steps after this might be to engage Adobe or maybe Charlie Arehart's company. Anyone have experience dealing with either?

Thanks,

Sean

Report · Jun 12, 2008

> Moreover, in the CF Server Monitor it shows that CF is actually destroying the
> objects as expected and releasing memory back into the pool. However, if you
> actually look at Jrun in the Task Manager, it's not.

No, you wouldn't see this. The JVM will grab memory from the OS up to the
max it is allowed to, and doesn't ever release memory back to the OS once
it's allocated. GC is an internal Java operation, not an OS one, so it'll
free up memory *within* the memory allocation that Java has from the OS.

A case in point is if - for example - your JVM settings are like
-Xms1500m -Xmx1500m, then Java will grab the whole lot at once, so your
system memory usage will jump by 1500MB... that's not to say Java is using
it all, it's just told the OS that it is.

When checking what RAM is free for Java to use, you can't check any
OS-level measuring tools (like task manager). Use something like
FusionReactor, or query the JVM directly.
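
By way of illustration, here's a sketch of querying the JVM directly using the standard java.lang.management API (the HeapReport class name is just something I've made up). "committed" is roughly what the OS has handed over and what task manager reflects; "used" is the figure a GC actually brings down.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

final class HeapReport {
    // Snapshot of the heap as the JVM itself sees it.
    static MemoryUsage heap() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    }

    // used      = bytes occupied by objects right now
    // committed = bytes the JVM has reserved from the OS for the heap
    // max       = upper limit (the -Xmx value); reported as -1 if undefined
    static String describe(MemoryUsage u) {
        long mb = 1024L * 1024L;
        long max = u.getMax();
        return String.format("used=%dMB committed=%dMB max=%s",
                u.getUsed() / mb,
                u.getCommitted() / mb,
                max < 0 ? "undefined" : (max / mb) + "MB");
    }

    public static void main(String[] args) {
        System.out.println(describe(heap()));
    }
}
```

Watching "used" over time under load (rather than the process size) is what tells you whether memory is really being held on to.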

How are you finding the CF server monitoring stuff compared to F/R, btw?
I've only used F/R, myself.

I'd probably engage Charlie Arehart over Adobe, to be honest. If only
because I imagine he's easier to get hold of than all the hoop-jumping that
seems to be required to attract Adobe Support's attention. I have only
limited experience in either, though, I hasten to add.

--
Adam

Report · Jun 12, 2008

> Maybe try *increasing* the number of templates to cache in RAM. The
> management of the caching of them might itself be leaking. Count your
> *.cf? files and set the value to be around that.

Did you try this?

--
Adam

Report · Jun 12, 2008

> In fact, on some of the search results

Search results. Verity or DB? Or other?

> requests that return a large number of records you can watch Jrun grab a good
> 15-25 megs and not let it go. If you keep hitting the same page you can quickly
> force Jrun to hit its buffer and stop responding.

That seems a bit odd.

And you're just putting these search results in local variables, which
should be cleaned up @ the end of the request?

Can you post some code?

--
Adam

Report · Jun 13, 2008

Hey Adam,

>No, you wouldn't see this. The JVM will grab memory from the OS up to the
>max it is allowed to, and doesn't ever release memory back to the OS once
>it's allocated. GC is an internal Java operation, not an OS one, so it'll
>free up memory *within* the memory allocation that Java has from the OS.

To coin a political phrase, I 'misspoke' a bit 🙂 In the Task Manager the JVM has grabbed its full allocation (which indeed doesn't get released), but what is still happening within the JVM is that it is only releasing a small portion of the memory originally used to process the request.

> Maybe try *increasing* the number of templates to cache in RAM. The
> management of the caching of them might itself be leaking. Count your
> *.cf? files and set the value to be around that.

>>Did you try this?

I did bump the number for this.

>>Search results. Verity or DB? Or other?

DB only. Some complex queries, but the performance now is very optimized and the amount of RAM the query objects take up is very minimal, even with a large returned dataset.

> requests that return a large number of records you can watch Jrun grab a good
> 15-25 megs and not let it go. If you keep hitting the same page you can quickly
> force Jrun to hit its buffer and stop responding.

>>That seems a bit odd.

>>And you're just putting these search results in local variables, which
>>should be cleaned up @ the end of the request?

Yep. That's what's so frustrating. The actual total memory and processing footprint of the app and the processing of the app's most complex page requests is quite small. There's just this additional continuously growing memory overhead that isn't accounted for when we look at the application level. Even if I run garbage collection or reload the site's application scope we don't get anything more back than the footprint of the application, yet it's the application seemingly that is causing Jrun's overall memory usage to eventually max out.

What we did do today that in fact did help somewhat was run Mike Schierberl's varScoper. This is a fantastic utility. Despite many passes through the code varScoper still found some unscoped variables in our application. Those are fixed now. This may have helped a bit, but not enough to prevent the memory usage from still climbing.

We're still working on a couple of other things. I'll post some code if they don't resolve the issue.

Thanks for your help,

Sean