Participating Frequently
November 27, 2008
Question

Server crashes

Hi,

We are running CF 7 on a Linux server. The issue is that our server keeps crashing, and the techs at Rackspace can't help as they have no CF techs available.

Here is the info they provided me:

"Thank you for your patients. As per my response I simply restarted cold fusion to get the site responsive again. It seems that jrun was consuming most of the resources not allowing the site to resolve and after restarting cold fusion your site began to respond. If you have any further questions or concerns, please feel free to update this ticket or contact us here directly. Thank you again."

Can anyone shed any light on this? It seems to happen every few weeks and has done so for months.

Thanks again,

Wladimir

    9 replies

    BKBK
    Community Expert
    December 5, 2008
    @Carehart
    I don't argue the point that a crash happens, by definition, at runtime. Neither did I say or suggest that a missing file is the cause of this crash.

    My point is that you should first eliminate program errors before looking for the cause of a crash. I advised Wladimir to study the logs and to rule things out. In any case, one can easily create a hypothetical use-case that involves a missing file and a server crash.

    Suppose page1.cfm errors because a file is missing. If you don't attend to that, it might lead to a crash if page2.cfm contains code that enters an infinite loop as a result of the missing file. That's all academic, I agree. Just an illustration.

    @Wwbr,
    While you wait, here are some stabs in the dark that hit the mark in the past. Prime suspect: your Application file. Examine it for loops, includes, cflocation, scheduled jobs, gateway calls, read/write processes and object-creation processes.





    Participant
    December 8, 2008
    Thanks for the useful info.
    wwbrAuthor
    Participating Frequently
    December 9, 2008
    Hi Guys,

    Are these logs any good?

    90977_cfserverlog.txt
    90977_coldfusion-eventlog.txt
    90977_SAR_report.txt

    Or do I need to ask Rackspace for something else?

    Thanks again,

    Wladimir
    BKBK
    Community Expert
    December 4, 2008
    You also mention I should take a look at the "cf/runtime/logs" which one is that?

    If you cannot find anything in the CF logs, then looking at the runtime logs is indeed the next step. Application, exception and server runtime logs are all relevant. The key is to extract the entries at and just before the server crash.

    BKBK
    Community Expert
    December 4, 2008
    Carehart wrote:
    I can't see how that would cause a crash, no. He was focused on your cf/logs.

    The hierarchy is page-request => application => server. An error in a request can escalate into a server crash. For example, an endless loop or a growing factory of objects will eventually bring down the server.

    The moral is clear. You cannot begin to debug a server crash when there are errors or bugs in your application.

    Charlie Arehart
    Community Expert
    December 4, 2008
    BKBK, if you've seen a server crash because of this, then thanks for sharing your perspective. I don't disagree with your depiction of the processing hierarchy. I'm just saying that in the hundreds of instances per year where I've helped people in positions like Wladimir's, a missing file has not caused the sort of escalation you propose. Rather, it's always been something else that's crashed the server. And it's not taken ruling out every CFML error in the application.log to find and resolve the problem. Still, there's nothing wrong with trying that as an alternative when lacking other information or techniques.

    Further, and I should have said this to Wladimir, in my experience, it's almost never that the server is really "crashing". Instead, it's usually that something is tying up all the request threads, so no more requests get in. That appears to the users (and admins) as a "hung" server. With tools like FusionReactor and SeeFusion (for CF 6, 7, and 8), or the CF8 Monitor, you can actually see what requests are running in a given moment. Usually it's something causing the requests to hang.

    It may be that they're making a call to the DB and it's locked, or they're making a call to a web service, or a CFHTTP call, or something else like that, which is hanging up, and therefore preventing any new requests. Think of it like a cashier line: if the credit card processing system goes down, everyone's going to start piling up.

    And sadly, in situations like this, you may not have any messages in the logs saying anything's "wrong". Even without the tools above, if you use the JRun metrics, CFSTAT, or Perfmon (as I mentioned in the first note), you can at least see *if and how many requests* are running or queued. They start queuing when they can't run, just like people lining up behind the cashiers, though it's more like a bank teller line in that all are in one line and would go to the next available window. Indeed, sometimes it's really that all but one window is locked up, so some few people are indeed getting through, but again it looks and feels like the system's hung.
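To make the teller-line picture concrete, here is a minimal sketch in Python of a fixed pool of request threads fed from one shared queue. It is a toy tick-based model, not a description of how JRun schedules requests internally; the function name `simulate` and all of its parameters are hypothetical and chosen only for illustration.

```python
from collections import deque

def simulate(workers, ticks, arrivals_per_tick, service_ticks, hang_from=None):
    """Tick-based model of a fixed pool of request threads.

    Each tick, `arrivals_per_tick` requests join one shared queue. A free
    thread takes the next request and holds it for `service_ticks` ticks.
    From tick `hang_from` onward, every request that starts never finishes
    (modelling a locked database or a stalled CFHTTP call).
    Returns a list of (tick, busy_threads, queued_requests).
    """
    remaining = []   # ticks left for each running request; None means hung
    queue = deque()
    history = []
    for tick in range(ticks):
        # Age the running requests; hung ones (None) never complete.
        remaining = [r if r is None else r - 1 for r in remaining]
        remaining = [r for r in remaining if r is None or r > 0]
        # New requests line up behind whatever is already waiting.
        queue.extend([tick] * arrivals_per_tick)
        # Free threads pick up waiting requests.
        while queue and len(remaining) < workers:
            queue.popleft()
            hung = hang_from is not None and tick >= hang_from
            remaining.append(None if hung else service_ticks)
        history.append((tick, len(remaining), len(queue)))
    return history

healthy = simulate(workers=4, ticks=10, arrivals_per_tick=2, service_ticks=2)
stalled = simulate(workers=4, ticks=10, arrivals_per_tick=2, service_ticks=2,
                   hang_from=5)
for tick, busy, queued in stalled:
    print(tick, busy, queued)
```

In the healthy run the pool is fully busy but the queue stays empty. In the stalled run the busy count looks exactly the same, yet the queue grows every tick once all threads are held by requests that never return, which is why a hung server often shows no error in the logs.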

    Even if this is not W's case, perhaps this description may help others.
    /Charlie (troubleshooter, carehart. org)
    wwbrAuthor
    Participating Frequently
    December 5, 2008
    Hi Guys,

    Once again a big thank you to everyone. I'm busy speaking to Rackspace and I will get to the bottom of this! :)

    They've just updated me again:

    "I now attached the Cold Fusion event log file to this ticket to review.

    I also included a SAR report of the server’s resource details that include CPU utilization, memory and swap space utilization, queue length and load averages.
    More details as per the sar linux command line tool can be found in url (linux.die.net/man/1/sar).

    I also included the last 600 entries from the Cold Fusion log file (/opt/coldfusionmx7/logs/cfserver.log) that logged a number of fatal errors (IE Fatal: Stack size too small. Use 'java -Xss' to increase default stack size)

    If you have any more questions, please update the following ticket or contact the Rackspace Managed Hosting helpdesk."
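A 600-line excerpt like the one Rackspace describes is easier to read once it has been tallied. The following is a minimal sketch in Python, assuming the timestamped layout of the cfserver.log lines quoted elsewhere in this thread (`MM/DD HH:MM:SS Severity [thread] - message`). Whether the "Fatal: Stack size too small" line follows that layout is not known from the thread, so lines that do not match are counted under a made-up severity label, `raw`. The function name `summarize` and the sample lines are illustrative only.

```python
import re
from collections import Counter

# Layout assumed from the cfserver.log lines quoted in this thread.
LINE = re.compile(r"^(\d\d/\d\d) (\d\d:\d\d:\d\d) (\w+) \[([^\]]+)\] - (.*)$")

def summarize(lines, prefix_len=40):
    """Count log lines by (severity, first `prefix_len` chars of message)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        match = LINE.match(line)
        if match:
            severity, message = match.group(3), match.group(5)
        else:
            # Unstructured output (for example, text written by the JVM).
            severity, message = "raw", line
        counts[(severity, message[:prefix_len])] += 1
    return counts

# Shortened sample lines; the repetition of the Fatal line is invented.
sample = [
    "11/22 23:55:18 Error [jrpp-84358] - Could not find the included template ../layouts/.",
    "11/22 23:55:02 Error [jrpp-50036] - Invalid list index 3.",
    "Fatal: Stack size too small. Use 'java -Xss' to increase default stack size",
    "Fatal: Stack size too small. Use 'java -Xss' to increase default stack size",
]
top = summarize(sample).most_common(1)[0]
print(top)
```

Sorting by frequency tends to separate the routine noise from the handful of messages worth investigating around the time of the crash.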

    Still waiting for some more info and will post again.

    Thanks again,

    Wladimir
    Charlie Arehart
    Community Expert
    December 2, 2008
    I meant to comment on that, Wladimir. I can't see how that would cause a crash, no. He was focused on your cf/logs. I recommended you look at your cf/runtime/logs. Any news from that? Or any of the other info I offered? Have you considered enabling JRun metrics? Have you looked at CFSTAT? The answers are often in the diagnostics, either those provided by default or those you can enable.
    /Charlie (troubleshooter, carehart. org)
    wwbrAuthor
    Participating Frequently
    December 3, 2008
    Hi carehart,

    Our server just died again and this is the info from Rackspace:

    "Per your request I have pulled the logs from your server preceding the issue and can see that there is a file missing from your directory which is triggering this alert.


    Usage: file [-bciknsvzL] [-f namefile] [-m magicfiles] file...
    Usage: file -C [-m magic]
    Try `file --help' for more information.
    Usage: file [-bciknsvzL] [-f namefile] [-m magicfiles] file...
    Usage: file -C [-m magic]
    Try `file --help' for more information.
    [Wed Dec 03 12:43:01 2008] [error] [client 193.108.87.5] File does not exist: /var/www/vhosts/default/htdocs/department
    [Wed Dec 03 12:44:57 2008] [error] [client 84.9.112.110] File does not exist: /var/www/vhosts/default/htdocs/favicon.ico
    [Wed Dec 03 12:45:00 2008] [error] [client 84.9.112.110] File does not exist: /var/www/vhosts/default/htdocs/favicon.ico"

    -------

    You also mention I should take a look at the "cf/runtime/logs" which one is that?

    http://www.dpivision.com/screenshot.jpg

    Thanks again,

    Wladimir



    Charlie Arehart
    Community Expert
    December 4, 2008
    Wladimir, your screenshot shows that you're still looking in the cf/logs directory. Please reread my first note above. I said you need to look instead in the runtime logs. You ask where those are. Again, I referred to them in my first note above:

    "check out the [cf]/runtime/logs/ as well (if in multiserver/multi-instance mode, see [jrun4]/logs/). These other logs (including ones named [server]-out.log and [server]-event.log) are often far more helpful in understanding the cause of errors."

    I used [cf] since the location varies by version and OS. So if on CF7 (on a Server install), they're in cfusionmx7/runtime/logs. On a multiserver/multiinstance deployment, it's jrun4/logs (or wherever those equivalents are stored on your Linux server).

    You say the host says, "there is a file missing from your directory which is triggering this alert". What alert is he talking about? You said the server is crashing. I honestly have never heard of a server crashing because of a file missing. Messages like that are very common in the cf/logs, but the runtime/logs may tell a far different story. Even then, though, the answer may still not be obvious from those. As I said in my first note, though, the information to solve the problem is there, or can be added to make it be there.
    /Charlie (troubleshooter, carehart. org)
    Charlie Arehart
    Community Expert
    December 1, 2008
    I have a few thoughts that may help you, Wladimir. It's a long-ish reply, but I hope it has some value for you or others.

    There's been some discussion of looking at logs, and that may help, but I think you'll need to look at far more than what's been mentioned. For instance, BKBK said that "application, exception and server logs all matter", and that may be true, as far as the [cf]/logs/ are concerned.

    But you want to be sure to check out the [cf]/runtime/logs/ as well (if in multiserver/multi-instance mode, see [jrun4]/logs/). These other logs (including ones named [server]-out.log and [server]-event.log) are often far more helpful in understanding the cause of errors.

    Even then, though, they're often still not enough. There may also be hs*.log files in [cf]/runtime/bin/ (or [jrun4]/bin/) that offer additional info on JVM crashes, if that's what's happening.

    Sometimes, it's not that CF crashes but that it's simply hung up as all request threads are busy. In that case, you need to know what's going on in the CF engine when things go bad. One thing that helps is to enable JRun metrics, which log status info at a chosen interval (such as every few seconds). The CFSTAT command (built into CF, in the [cf]/bin directory) can help as well, as can Perfmon stats (though not on Linux).

    I discussed these and other sorts of resources for troubleshooting in a talk I gave at Max (at the CF Unconference) called, "CF911: Tools and techniques for Troubleshooting", which you can find online at http://www.carehart.org/presentations/#cf911. Hope that may be helpful.

    I'll just add in conclusion that there's always an explanation to CF hanging up. It's not "just broken", so it's a shame when hosts (and others) just "restart CF" to make the "problem go away". There's always a root cause, and as in your case, it repeats, so the problem will come back.

    The challenge is to find that root cause, when it "goes rogue". The issue may be due to CF config, jvm config, jvm version (there's a known issue with the built-in jvm in CF8, but you're on 7). It may be due to load (perhaps unexpected). You may be running out of memory or CPU. Your rackspace techs don't clarify.

    Since you're on 7, there's also a known issue of file uploads being a potential killer in that they use up memory (equal to the size of the file uploaded) that's never released. There's a hotfix for that. See http://www.adobe.com/go/kb401239 (and in my experience, it has nothing to do with CFCs, as suggested in the title and description). If you're running out of memory in CF, I'd highly recommend this hotfix (note that it's not included even in the latest cumulative hotfix for CF7, so it has to be applied separately).

    And speaking of hotfixes, I find many shops still running on the original release of whatever they have (such as 7.0). You should at least move up to 7.0.1 or 7.0.2. Even shops that have done so have often not applied the cumulative hotfixes (or individual ones). Many times there are problems that are solved with these.

    Going back to the discussion of memory, are you (or they) tracking CF's memory use, whether by watching the memory used by the jrun process (less effective) or by watching memory use within CF (more effective)? The JRun metrics can show you, or there are available Java methods you can call (for instance, see http://www.petefreitag.com/item/115.cfm).
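Once metrics logging is enabled, the resulting lines can be scanned for the moments when threads or memory ran short. Below is a minimal sketch in Python. It assumes metrics lines that contain `Web threads (busy/total): B/T` and `Total Memory=X Free=Y`, which is the layout of a commonly seen JRun metrics format string; that layout is an assumption here, so adjust the pattern to whatever format your own configuration writes. The sample lines, the function name `warning_signs`, and the 10% threshold are invented for illustration.

```python
import re

# ASSUMED layout of a JRun metrics line; change to match your metrics format.
METRICS = re.compile(
    r"Web threads \(busy/total\): (\d+)/(\d+).*?Total Memory=(\d+) Free=(\d+)"
)

def warning_signs(lines, min_free_ratio=0.10):
    """Yield (line, reasons) for metrics samples that look unhealthy."""
    for line in lines:
        match = METRICS.search(line)
        if not match:
            continue  # not a metrics line in the assumed layout
        busy, total, mem_total, mem_free = map(int, match.groups())
        reasons = []
        if total and busy >= total:
            reasons.append("all request threads busy")
        if mem_total and mem_free / mem_total < min_free_ratio:
            reasons.append("low free JVM memory")
        if reasons:
            yield line, reasons

# Invented sample data, not taken from the poster's server.
sample = [
    "12/03 12:40:00 metrics Web threads (busy/total): 3/25 Sessions: 40 Total Memory=524288 Free=210000",
    "12/03 12:42:00 metrics Web threads (busy/total): 25/25 Sessions: 41 Total Memory=524288 Free=30000",
]
flagged = list(warning_signs(sample))
for line, reasons in flagged:
    print(line[:14], reasons)
```

The "total" figure is simply whatever the log reports, so it is worth comparing it against the configured thread limit before concluding that the pool is exhausted.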

    Finally, there are also useful commercial tools like FusionReactor and SeeFusion which can help, and they're more than "just monitors", in that they track information that you can review after a crash (and especially more in FusionReactor, which does tremendous yet lightweight logging of lots of details about running requests, queries, and more).

    I use all these tools and logs (and more) when I help people solve these kinds of problems. Half the battle is knowing the tools and how to connect the dots in the diagnostic info they provide. I hope the info above may help, and of course this forum is a great resource so ask away.

    Note as well that there are various companies that can help also, whether on-site or over-the-web. See http://www.cf411.com/#cfconsult for a list of several. Some require days at a minimum, while some (like myself) have no minimum. Sorry if that sounds like a sales pitch to some. It's really not, and I've tried to offer a lot of info for free above and on my site (carehart.org). But sometimes people just want to make the pain go away as fast as possible, and I just want them to know they don't need to suffer if they'd rather pull in some help.
    /Charlie (troubleshooter, carehart. org)
    BKBK
    Community Expert
    December 1, 2008
    That's a start. I would verify these for a start.

    The application calls ListGetAt with index 3 on line 169 in /httpdocs/site/product.cfm, whereas the list has just 2 elements. ColdFusion also couldn't find a template included on line 92 in /httpdocs/site/product.cfm.


    wwbrAuthor
    Participating Frequently
    December 2, 2008
    Hi Guys,

    Firstly a big thank you to everyone for their help, it's all really appreciated.

    @ BKBK

    "That's a start. I would verify these for a start.

    The application calls a list with index 3 on line 169 in /httpdocs/site/product.cfm, whereas the list has just 2 elements. ColdFusion couldn't find a template included on line 92 in /httpdocs/site/product.cfm."

    Would this cause the server to crash over and over again though? Surely a 404 can't do that much damage?

    Thanks again,

    Wladimir
    November 29, 2008
    same problem here --- issue is JRun running out of virtual memory space immediately on loading... if I get it resolved I'll post again.
    BKBK
    Community Expert
    November 29, 2008
    26MB isn't a big deal. They should be able to copy it for you.

    Application, exception and server logs all matter. However, you can narrow the search down to just a few lines of text. The key is to look for clues on the date and time of the server crash, and in the minutes preceding the crash. Bring the errors to the forum.



    wwbrAuthor
    Participating Frequently
    December 1, 2008
    Thank you BKBK, and I've asked Rackspace just that:

    -----------------

    Rackspace said:
    2008-12-01 11:16:05 (UTC+0)

    Hi Wladimir,

    I understand now, thanks. It's going to be really tricky for us to help here - we just don't have the understanding of ColdFusion to be able to identify problems from background noise in the logfiles the way we can with supported systems like Apache and PHP.

    You are going to have to identify the time of one of the crashes and then go through the logfiles to pull out any alerts that occurred in the minutes preceding the crash, then try to identify anything that shouldn't be happening there. It's a highly time-consuming and manual task and really best performed by someone familiar with the application and ColdFusion.

    The kind of commands that I would use to extract the information would be this kind of query..

    grep "11/22/08" application.log | grep Error | grep -v "File not found"

    Which pulls application errors such as this:

    "Error","jrpp-50036","11/22/08","23:55:02","vhsdirect","Invalid list index 3.In function ListGetAt(list, index [, delimiters]), the value of index, 3, is not a valid as the first argument (this list has 2 elements). Valid indexes are in the range 1 through the number of elements in the list. The specific sequence of files included or processed is: /var/www/vhosts/vhsdirect.co.uk/httpdocs/site/product.cfm, line: 169 "

    Then cross reference this with any errors from server logfile like this..

    grep "11/22" cfserver.log | grep "23:55"
    11/22 23:55:18 Error [jrpp-84358] - Could not find the included template ../layouts/.Note: If you wish to use an absolute template path (e.g. TEMPLATE=""/mypath/index.cfm"") with CFINCLUDE then you must create a mapping for the path using the ColdFusion Administrator. Using relative paths (e.g. TEMPLATE=""index.cfm"" or TEMPLATE=""../index.cfm"") does not require the creation of any special mappings. It is therefore recommended that you use relative paths with CFINCLUDE whenever possible. The specific sequence of files included or processed is: /var/www/vhosts/ktduk.com/httpdocs/site/product.cfm, line: 92
    11/22 23:55:02 Error [jrpp-50036] - Invalid list index 3.In function ListGetAt(list, index [, delimiters]), the value of index, 3, is not a valid as the first argument (this list has 2 elements). Valid indexes are in the range 1 through the number of elements in the list. The specific sequence of files included or processed is: /var/www/vhosts/vhsdirect.co.uk/httpdocs/site/product.cfm, line: 169

    The process is something that you or your developers would need to complete though.

    Kind Regards,

    Andrew

    --------------------
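The grep approach in the message above can also be expressed as a small script that keeps only the entries in the minutes before a crash. This is a minimal sketch in Python, assuming application.log is comma-separated with quoted fields in the order severity, thread, date, time, application, message, as in the line Rackspace quoted. The crash time, the 10-minute window, the function name `entries_before`, and every sample line except the shortened first error are made up for illustration.

```python
import csv
import io
from datetime import datetime, timedelta

def entries_before(log_text, crash_time, minutes=10, skip=("File not found",)):
    """Return (timestamp, severity, thread, app, message) tuples logged in
    the `minutes` before `crash_time`, dropping messages that contain any
    of the `skip` strings."""
    window_start = crash_time - timedelta(minutes=minutes)
    hits = []
    for row in csv.reader(io.StringIO(log_text)):
        if len(row) < 6:
            continue  # blank or malformed line
        severity, thread, date, time_, app, message = row[:6]
        try:
            stamp = datetime.strptime(f"{date} {time_}", "%m/%d/%y %H:%M:%S")
        except ValueError:
            continue  # e.g. a column-header row
        in_window = window_start <= stamp <= crash_time
        if in_window and not any(text in message for text in skip):
            hits.append((stamp, severity, thread, app, message))
    return hits

sample = '''"Severity","ThreadID","Date","Time","Application","Message"
"Error","jrpp-50036","11/22/08","23:55:02","vhsdirect","Invalid list index 3."
"Error","jrpp-50040","11/22/08","23:56:10","vhsdirect","File not found: /x.cfm"
"Information","jrpp-1","11/22/08","20:00:00","vhsdirect","Started"
'''
crash = datetime(2008, 11, 22, 23, 59, 0)  # hypothetical crash time
hits = entries_before(sample, crash)
for hit in hits:
    print(hit)
```

Unlike the grep pipeline, this does not filter on severity, so warnings and informational entries near the crash are kept as well. Passing `skip=()` keeps the "File not found" entries too.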

    BKBK
    Community Expert
    November 27, 2008
    Ask for and study the log files.

    wwbrAuthor
    Participating Frequently
    November 28, 2008
    Hi BKBK,

    Thank you very much for getting back to me. I've asked Rackspace for my logs but the only issue is there is 26MB worth of them! :)

    I've linked a screenshot below which shows how many files there are:

    http://www.dpivision.com/screenshot.jpg

    Any help would be greatly appreciated.

    Thanks again,

    Wladimir