We have an ongoing problem where our application pool randomly fails, bringing our site down until we stop and start the application pool to bring the site back up. This happens randomly, but usually at least once every 4 or 5 days.
We are running Coldfusion 10 with Update 18 applied; the connector was rebuilt after applying the update. The web server is IIS 7.5 running on Windows Server 2008 R2. We are using the following settings in our workers.properties file:
Our isapi_redirect.dll file shows version 184.108.40.206.
When the site fails, requests appear to just hang. Using Fusion Reactor we see that requests are no longer being passed on to Coldfusion. If we attempt to stop the application pool in IIS, it will sometimes shut down properly and other times throw an error that it is unable to respond to the control request. If the application pool does stop successfully, I'm able to manually start it and everything is fine until the application pool fails again. If the application pool does not respond to the attempt to stop it, I need to stop and start the website, or IIS as a whole, to recover the sites.
In our application pool settings we have Rapid-Fail Protection disabled. Recycling is set to occur at specific times: 22:00:00 and 06:00:00.
When the server stops responding, we see entries in our httperr file for all requests that end in 2 Client_Reset followed by the application pool name. When I stop the application pool, subsequent requests end in 503 2 Disabled, as the application pool is no longer running. As soon as I start the application pool, the server runs as expected until the issue returns. It may be a day or two, or up to a week, before it fails again.
Coldfusion-out.log appears normal until the Client_Reset errors start in the httperr log, at which point the output to Coldfusion-out.log stops, as does the output to the IIS log file. Nothing appears out of the ordinary leading up to the start of the Client_Reset errors. We don't see any relevant entries in the Coldfusion exception.log around the time that the site goes down, but we do see many of the following errors when users attempt to download PDF files through our CMS. It is my understanding that this error is harmless, just a notification that the client ended the request before the entire response was completed.
"Error","ajp-bio-8012-exec-3138","01/10/16","10:41:56","MURA904069D3142238F45DE68D824E76D9BC","coldfusion.tagext.OutputException: The cause of this output exception was that: org.apache.catalina.connector.ClientAbortException: java.net.SocketException: Connection reset by peer: socket write error.
I'm hoping for any advice on how to troubleshoot this further so that our server is stable and does not randomly crash the application pool.
Might be worth knowing what values you have in the AJP section of the CF runtime conf server.xml EG:
<Connector port="8012" redirectPort="8447" protocol="AJP/1.3" tomcatAuthentication="false" maxThreads="900" connectionTimeout="60000"/>
Sometimes folks have benefited by using Tomcat native ajp-apr rather than ajp-bio as you currently have.
Sometimes, over a few days of uptime, the problem can be Java. Do you know whether the memory settings under Server Settings > Java and JVM are configured well, that is, Maximum JVM Heap Size and, in JVM Arguments, -XX:MaxPermSize (CF10, so I am guessing Java 7 rather than Java 8, where it would be -XX:MaxMetaspaceSize)?
Thanks for responding! We have the following in our server.xml:
<Connector protocol="AJP/1.3" port="8012" redirectPort="8445" maxThreads="900" connectionTimeout="60000" tomcatAuthentication="false"></Connector>
I think our configuration for the JVM is good; we tuned it a while back to resolve issues with Coldfusion hanging. The current issue is just the failure of the application pool, and Coldfusion is still fine when the application pool is restarted.
Our minimum and maximum JVM heap size is 3000mb and our MaxPermSize is 400mb. We are actually on Java 8.
"Sometimes folks have benefited by using Tomcat native ajp-apr rather than ajp-bio as you currently have."
I'm not familiar with using the native ajp-apr instead of ajp-bio, but will look into this now.
One other thing to consider with Tomcat threads and pools: it can be a good idea to specify an initial setting. EG considering your max setting:
server.xml AJP section -
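For illustration only, a connector with an initial thread count might look like the sketch below. It reuses the port, redirectPort, maxThreads and connectionTimeout values posted earlier in this thread. minSpareThreads is a standard Tomcat connector attribute, but the value 100 is an assumption and not a tuned recommendation.

```xml
<!-- Sketch: minSpareThreads="100" is an assumed starting value, not taken from this thread -->
<Connector protocol="AJP/1.3" port="8012" redirectPort="8445"
           tomcatAuthentication="false"
           maxThreads="900" minSpareThreads="100"
           connectionTimeout="60000"/>
```

Restart CF after editing server.xml so that the connector picks up the change.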
If you are interested in trying Tomcat native ajp-apr, copy the 64-bit tcnative-1.dll to CF\cfusion\lib, then restart CF.
Check the content of coldfusion-error.log to confirm which AJP type is in use (ajp-apr rather than ajp-bio).
Like you, I guess the CF Java end is OK, since recycling the application pool overcomes the issue. Of note: since you are using Java 8, it consumes Metaspace, not PermSize. Having PermSize present is a syntax error, though not a "stop the world" one, because Java 8 ignores the option with a warning. Metaspace, if not defined, should auto-size, but in doing so can trigger a full garbage collection each time it sizes up to a new "watermark". As a Java reminder, a full GC has a pause effect. Having said all that, I doubt a full GC pause is causing the overall problem, but then again it is hard to say with certainty.
Thanks again for the suggestions for resolving this issue. I'm going to look into and test setting initial values for the Tomcat threads and pool size.
Good point about using Metaspace and not PermSize with Java 8. Would I define Metaspace similar to how I've set MaxPermSize previously?
I tend to find that if you know a Java 7 value for MaxPermSize that worked well, you can use the same value for Java 8 Metaspace. It can be a good idea to configure an initial value as well. EG syntax:
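As a sketch, assuming the 400 MB MaxPermSize value mentioned earlier in this thread is carried over unchanged, the Java 8 arguments could look like this, replacing -XX:MaxPermSize in the JVM Arguments field:

```text
-XX:MetaspaceSize=400m -XX:MaxMetaspaceSize=400m
```

-XX:MetaspaceSize sets the initial high-water mark at which a Metaspace-triggered collection first occurs, and -XX:MaxMetaspaceSize caps growth. Treat the sizes as a starting point to check against your own monitoring rather than as tuned values.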
HTH again, Carl.
Thanks Carl. I'm going to make these JVM changes and connector configuration changes and then update our connectors one more time. I'm hoping that the isapi_redirect.log provides some insight into the issue the next time our application pool stops responding. I had been thinking this was really an IIS issue, as restarting the application pool always resolves it, but it does appear that a problem with the connector triggers the initial failure of the application pool.
I think your issue is a little different as we don't have those errors in our isapi_redirect.log, but thanks for sharing your issue in case they were the same. Hoping we both find solutions to our recent issues quickly.