
CF MX 8 CPU Spikes

New Here,
Oct 24, 2008
System:
CF MX 8 Ent. MultiServer
Windows Server 2003 Enterprise x64
Dual socket Quad core Xeon L5335
2.00GHz 4GB RAM

Problem:
Within 24 hours of instance startup, JRun reaches max heap and CPU usage begins spiking sharply across all eight cores. It is not a steady CPU ramp, but spikes. The spikes appear to correlate with high PF Delta and Mem Delta values in Task Manager: PF Delta hits 1k-2k just before a spike, and memory is allocated and deallocated just before, during, and after it.


Background:
We currently have one instance for dev and one for live. The live instance gets about 500,000 hits per day, so a generous amount of traffic. Nearly all the pages are driven by 1-5 cached SQL queries to a neighboring SQL Server inches away. None of them are long-running queries; all execute within 16-95ms.

The most frequently used queries are built in onApplicationStart() and scoped to the Application. Any page using these wraps the access in cflock and uses Duplicate() to copy them to the variables scope.

There are 3 cfobject calls for CFCs in onApplicationStart(), also scoped to the Application. Any page using these wraps the access in cflock and uses Duplicate() to copy them to the variables scope.
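
The lock-and-duplicate pattern described above can be sketched on the JVM side as follows. This is a rough Java analogue, not ColdFusion's implementation: `SharedQuery`, `load`, and `copyForRequest` are hypothetical names, the read/write lock stands in for cflock, and the list copy stands in for Duplicate(). The point it illustrates is that every request allocates its own full copy of the shared data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class SharedQuery {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private List<String> rows = new ArrayList<String>();

    /** Replaces the shared rows, as application startup would when it builds the query. */
    void load(List<String> fresh) {
        lock.writeLock().lock();
        try {
            rows = new ArrayList<String>(fresh);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Returns a private copy taken under a read lock; the caller may modify it freely. */
    List<String> copyForRequest() {
        lock.readLock().lock();
        try {
            return new ArrayList<String>(rows);
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        SharedQuery shared = new SharedQuery();
        List<String> initial = new ArrayList<String>();
        initial.add("row 1");
        initial.add("row 2");
        shared.load(initial);

        List<String> mine = shared.copyForRequest();
        mine.add("request-local row"); // does not touch the shared rows
        System.out.println("request copy: " + mine.size()
                + ", shared: " + shared.copyForRequest().size());
    }
}
```

Each copy becomes garbage when the request ends, so at several requests per second the pattern adds steady allocation pressure even when the shared data itself never changes.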

After a fresh startup of the live instance, everything runs fine for maybe a day. Memory climbs to about 700-900MB within a few minutes, everything is stable, and the site is very responsive. Server Monitor shows response times near 10-20ms and requests per second bouncing between 6 and 30, with a narrow band around 15. The template cache eventually hits its limit and we get a 100% cache hit rate. The query cache hits its limit and we get a 90%+ cache hit rate. Things are still very stable.

A fresh start yesterday morning and it ran great all day (after learning NOT to leave SM memory tracking on). I got up this morning and JRun had taken 1.25GB of RAM and was spiking all over the CPU cores. SM response times now jump between 0 and 200ms, with occasional 1000ms responses. Requests per second now jump between 0 and 40, as if requests are queuing up and then finally being delivered. Site responsiveness mirrors what I see in SM requests per second: lots of delay, then quick access, back and forth.

The site has been deployed for 1 week and I've had to restart it every morning to clean it up. It really behaves like a memory leak.

JMC Settings:
Max Heap Size: 1GB (I've tried 2GB and it only delays the problem.)

VM Args:
-server -Dsun.io.useCanonCaches=false -XX:MaxPermSize=192m -XX:+UseParallelGC -Xbatch -Dcoldfusion.rootDir={application.home}/ -Djava.security.policy={application.home}/servers/cfusion/cfusion-ear/cfusion-war/WEB-INF/cfusion/lib/coldfusion.policy -Djava.security.auth.policy={application.home}/servers/cfusion/cfusion-ear/cfusion-war/WEB-INF/cfusion/lib/neo_jaas.policy
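
One way to confirm what heap limit a JVM actually ended up with, and how close to it the process is running, is the standard `java.lang.management` API. The sketch below is a minimal illustration that reports on the JVM it runs in; to inspect the JRun instance itself, equivalent code would have to run inside that instance, which is an assumption about the setup rather than something described in this thread. `HeapReport`, `toMb`, and `usedPercent` are hypothetical names.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

final class HeapReport {
    /** Formats a byte count as whole megabytes. */
    static String toMb(long bytes) {
        return (bytes / (1024 * 1024)) + " MB";
    }

    /** Returns used heap as a percentage of max heap, or -1 if max is undefined. */
    static double usedPercent(long used, long max) {
        if (max <= 0) {
            return -1.0;
        }
        return 100.0 * used / max;
    }

    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();

        System.out.println("heap used:      " + toMb(heap.getUsed()));
        System.out.println("heap committed: " + toMb(heap.getCommitted()));
        // getMax() returns -1 when the maximum is undefined
        System.out.println("heap max:       "
                + (heap.getMax() < 0 ? "undefined" : toMb(heap.getMax())));
        System.out.printf("heap used %%:    %.1f%n",
                usedPercent(heap.getUsed(), heap.getMax()));
    }
}
```

Sampling these numbers over a day would show whether used heap climbs steadily toward the maximum or plateaus well below it.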


Instance Settings:
Max Templates Requests: 400
Max Running JRun Threads: 25
Max Queued JRun Threads: 600
Cached Templates: 400
Cached Queries: 400

What should I look for? Any direction on how to proceed with troubleshooting would be appreciated.

-Dan
Explorer,
Nov 15, 2008
Ever resolve this? Have you tried to debug it with a third-party tool like FusionReactor? You are really going to need some third-party tools to help. Do you have ProcessExplorer.exe, and have you tracked down the culprit threads? Other tools to consider are SeeFusion, JProfiler, or JProbe.

Also if you have that many hits and running CF Enterprise, why aren't you clustering them? Seems you should have 3-4 instances set up on a cluster within the administrator. Oh, and put some more memory on that box.

Are there possibly other instances running on the box? Other IIS sites? Do you map out your cfide directory under your IIS site, and do you have more than one IIS site mapped to the same cfide directory? I mention this because we were having a similar CPU spiking issue. FusionReactor and ProcessExplorer helped us track down the issue, showing some cfgrid threads locking up and using about 12% CPU per thread. We have about 10 instances, 10 IIS sites, and about 300 websites on a dual quad-core box. We had all the cfide directories added as virtual directories pointing to the same cfide directory under the root. It seemed almost as if the calls to the cfide directory for cfgrid would run into some sort of conflict. We ended up splitting the cfide directories out into their own, so now I have 10 cfide directories. This appears to have resolved our issue, but I'm still working on how and why it was happening.
Community Expert,
Nov 16, 2008
Dan, besides what NeoRye offered, given the "24 hour nature", I would wonder first whether you have the CF Admin client variable purge interval set to 24 hours (it defaults to 1 hour 7 minutes, and people often raise it without considering the implications). And don't ignore this because you "don't use client vars". I could elaborate, but just check first.

Also, do you have CF's debugging output enabled, even if restricted to some IP addresses? Try turning that off, too.

Finally, it may pay to analyze the web server logs to see whether the 24-hour mark is really preceded by a spike of certain requests or of a certain type of requester (scheduled task, load test, spiders, bots), or whether the DB has its own glitch that makes CF a victim instead.

As Neo proposes, FR may help here. It has tremendously valuable logging (low-overhead) that can help you spot such issues. More at fusion-reactor.com.

/Charlie (troubleshooter, carehart.org)
New Here,
Nov 19, 2008
NeoRye,

I was able to clean it up some. I found some areas where we didn't dump some Duplicate()'d structures and queries. I also discovered that you DO NOT leave memory tracking on in the Server Monitor. Lastly, I believe the biggest culprit was wanting to cache lots of queries and thinking 2GB was a lot of memory (remember, Windows keeps 2GB of the 4GB we have for itself). I removed the cachedWithin attribute from the most volatile queries and things have really calmed down. CFIDE won't be an issue; this box is dedicated to this one site.
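
Why caching volatile queries can hurt once the cache limit is reached can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of ColdFusion's internals: it assumes a cache bounded at 400 entries (the "Cached Queries" setting from the first post) that drops its oldest entry when full, and an artificial workload that cycles through a fixed set of distinct queries. `QueryCacheSim`, `BoundedCache`, and `hitRatio` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class QueryCacheSim {
    /** Bounded cache that drops its oldest entry once capacity is exceeded. */
    static final class BoundedCache extends LinkedHashMap<Integer, String> {
        private final int capacity;

        BoundedCache(int capacity) {
            super(16, 0.75f, false); // false = insertion order, so eviction is oldest-first
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
            return size() > capacity;
        }
    }

    /** Cycles through distinctQueries keys for the given number of passes; returns the hit ratio. */
    static double hitRatio(int capacity, int distinctQueries, int passes) {
        BoundedCache cache = new BoundedCache(capacity);
        long hits = 0;
        long lookups = 0;
        for (int p = 0; p < passes; p++) {
            for (int key = 0; key < distinctQueries; key++) {
                lookups++;
                if (cache.containsKey(key)) {
                    hits++;
                } else {
                    cache.put(key, "result-" + key); // stands in for a query result set
                }
            }
        }
        return lookups == 0 ? 0.0 : (double) hits / lookups;
    }

    public static void main(String[] args) {
        System.out.println("300 distinct queries: " + hitRatio(400, 300, 10));
        System.out.println("800 distinct queries: " + hitRatio(400, 800, 10));
    }
}
```

With 300 distinct queries everything fits and only the warm-up pass misses. With 800 distinct queries cycling through a 400-entry cache, every lookup misses, so the cache holds 400 result sets in memory and constantly replaces them without ever serving a hit. Real traffic is less regular than this worst case, but the effect of too many distinct cached queries points in the same direction.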



C. Arehart,

The client variable purge interval is at the default 1h 7m. I didn't mean to imply that it went sour right at 24h; that was just a rough time window before I was forced to restart JRun. I'm now at about 1 week before it goes haywire. I have looked at FR and was very enticed.

After my recent changes, I believe two things. First, more memory would help, since caching volatile queries means LOTS of cached queries. However, with those queries uncached the site is no less responsive than with them cached, so I'm leaving only the more static queries cached. Caching static queries is nicer to code than scoping them to the Application for persistence. Secondly, I've found that the GC might be what is causing the spikes, as it tries to purge cached queries to make room for more recently executed ones. The effect looks the same as when I run GC from Statistics | Mem Usage Summary in Server Monitor: for that very moment, requests per second drop to zero, average response time leaps, and I get a CPU spike on one of the cores. Why would GC impact an instance that drastically? Obviously, I can see memory thrashing taking place if you have a lot of queries trying to replace each other in cache. We really need a way of evaluating memory usage ahead of deployment.
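
The suspicion that garbage collection lines up with the stalls can be checked against the JVM's own counters. With `-XX:+UseParallelGC`, as in the VM args from the first post, application threads are paused while a collection runs, so long or frequent collections would show up as moments with no requests served. The sketch below is a minimal illustration using the standard `GarbageCollectorMXBean` API; it samples the JVM it runs in, so watching the JRun instance would require running equivalent code inside it (an assumption about the setup). `GcWatch` and `averageMs` are hypothetical names.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

final class GcWatch {
    /** Average time per collection in ms; 0 when nothing has been collected yet. */
    static double averageMs(long totalTimeMs, long count) {
        return count <= 0 ? 0.0 : (double) totalTimeMs / count;
    }

    public static void main(String[] args) throws InterruptedException {
        List<GarbageCollectorMXBean> collectors =
                ManagementFactory.getGarbageCollectorMXBeans();
        for (int sample = 0; sample < 3; sample++) {
            for (GarbageCollectorMXBean gc : collectors) {
                long count = gc.getCollectionCount(); // -1 if undefined
                long timeMs = gc.getCollectionTime(); // cumulative, approximate ms
                System.out.printf("%s: collections=%d totalMs=%d avgMs=%.1f%n",
                        gc.getName(), count, timeMs, averageMs(timeMs, count));
            }
            Thread.sleep(500); // sampling interval
        }
    }
}
```

If the cumulative time of the old-generation collector jumps at the same moments that Server Monitor shows requests per second dropping to zero, that would support the GC explanation over a leak in application code.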

I have found another variable in this equation. We have a mass mailer that goes out at 3am (13k subscribers) and another at 4am (5k subscribers). I happened to pull an all-nighter one night and noticed that right after each mailing, memory increased significantly (over 600MB) without ever decreasing. Nothing complicated here: read a list of subscribers, build the email content, cfmail a copy to each subscriber. I'm thinking of offloading this to a single instance tailored specifically for scheduled tasks. Why should cfmail nuke memory like this?

Lastly, what dictates when to employ multiple instances for a single site, other than failover? How much traffic is too much for a single instance? The hit rate I have here barely impacts the server, other than memory usage. Can I deploy and cluster multiple instances on the same physical server, or is that just plain dumb? I don't have budget to lease another server and buy another CF MX 8 Ent. ugh....
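
A rough way to reason about "how much traffic is too much" is Little's law: the average number of requests in flight equals the arrival rate multiplied by the average response time. The sketch below plugs in figures quoted earlier in this thread (500,000 hits per day, about 15 requests per second at 10-20ms when healthy, up to 40 requests per second and 1000ms when degraded, and 25 running JRun threads). Pairing the worst rate with the worst response time is a deliberately pessimistic assumption, and `ConcurrencyEstimate`, `inFlight`, and `perSecond` are hypothetical names.

```java
final class ConcurrencyEstimate {
    /** Little's law: average requests in flight = arrival rate * average time in system. */
    static double inFlight(double requestsPerSecond, double responseTimeMs) {
        return requestsPerSecond * (responseTimeMs / 1000.0);
    }

    /** Average request rate implied by a daily hit count. */
    static double perSecond(long hitsPerDay) {
        return hitsPerDay / 86400.0;
    }

    public static void main(String[] args) {
        int maxRunningThreads = 25; // "Max Running JRun Threads" from the first post

        System.out.printf("average rate: %.1f req/s%n", perSecond(500000));
        System.out.printf("healthy (15 req/s, 20 ms): %.2f in flight%n",
                inFlight(15, 20));
        System.out.printf("degraded (40 req/s, 1000 ms): %.1f in flight, limit %d%n",
                inFlight(40, 1000), maxRunningThreads);
    }
}
```

On these numbers the healthy site keeps well under one request in flight on average, which fits the observation that the hit rate barely impacts the server. Only when response times stretch toward a second does the estimate exceed the 25-thread limit, so the pressure here appears to come from memory and pauses rather than from raw request volume.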

-Dan
