Hi
First post here, and I'm very new to CF/Java. I have inherited a somewhat legacy application that runs fine on Windows/IIS. However, I am trying to deploy the same application in a Docker container (Linux) and am running into some strange slowdown issues.
The application runs, but half the time (or more) it doesn't seem to be able to cache all its classes and compiles them every time.
For testing, I remove the container and deploy everything again. Sometimes the system runs fine and is responsive after deployment. Once it's responsive, I can use the application all day without any slowdowns.
At other times (more often than not), when I deploy the container, it slows down from the first request and never picks up speed. It feels as if the classes are being created every time or loaded into memory with every click. The difference is large: browsing through all the links in the application takes about 2 minutes when the container is running properly, and up to 2 hours for the same sequence when it is not.
There is no load on the server, as it's only one user. The same code works perfectly fine on our Windows deployment. What confuses me is the inconsistency of the issue and the lack of any apparent pattern.
Does anyone have any idea what's going on? I am not sure what logs to provide, but I can do so if someone can point me to what's needed.
I have tried :
Thanks in advance
Faraz, this is certainly an interesting challenge, and there could be many explanations for it. Before I share any "guesses", I'd ask first:
Since it's sometimes slow from the start, even with "no load", that would seem to suggest there's no cf configuration problem (since it works fine "all day" in some cases).
I really think the best way to solve your issue is to get diagnostics in place, to tell you WHY any one request is slow, as well as what else is going on when this happens.
And sadly, the logs may or may not help in this case. First, have you looked at ALL the cf logs within the container (those recently modified)? As you may know, the docker logs show only the equivalent of the coldfusion-out.log and coldfusion-error.log files, not the other cf logs. Second, in a case like you're describing, the logs often won't help anyway, as some problems don't lead to anything showing in the cf logs--or they don't show what we need to actually diagnose the problem.
And as you may know, there are indeed various solutions that can help do that (some better than others, depending on the problem), from traditional Java diagnostic tools, to popular APMs (like New Relic, Datadog, and Dynatrace), to more cf-specific tools like the cf PMT (new since cf2018) or FusionReactor.
All these can be used with containers also, though there can be new challenges doing that, even for folks familiar with the tools (and with running cf in a container). There are also tools specific to your container runtime (Docker) and to the container and host OS which could help, depending on where the problem is (may be cf, may be docker/container config, may be host resource issues). All that's a lot to consider.
And I'm not saying one has to assess them ALL. But I'm saying that one would investigate a few key diagnostics, and go from there to consider others.
But if one may be new to BOTH running cf in containers AND using such diagnostic tools, I'd argue that most folks in that situation would have a really tough time going it alone. There are just so many variables, from container issues to tool use to interpreting what the tools report. Or they may overlook a vital clue. And there's just no way to convey here all one would need to consider.
Then there are questions of how you went about setting up your images/containers. There's no "one way", and any one choice could be at issue here. Of course, the fact that things seem to change for you midstream is another wrinkle that only complicates the diagnosis.
So while it may be possible that someone else may chime in with JUST the right answer based solely on what you've shared here (or folks may start tossing out guesses of things to try), I'd recommend, given all the above, that the fastest and most effective way to resolve your specific problem would be to have me (or someone similarly experienced) join you in a remote screenshare session (like Zoom), where together these diagnostics could be considered.
And it may surprise you to hear that I think we could be done within an hour (maybe less, maybe more), since I do this sort of cf troubleshooting daily. And most folks learn a lot as we go. You can learn about my rates, approach, satisfaction guarantee, and more at carehart.org/consulting. I offer there also my online calendar, with slots today and any day this week or coming ones.
I wish I had just "the answer" for you. If you follow these forums, I do usually reply with that or some specific things to consider. There are just far too many variables in this scenario for me to see any to recommend without the diagnostics above. There is likely ONE problem and solution. The challenge is finding it, and I would look forward to helping solve this for you.
My answer will be shorter than @Charlie Arehart 's. It seems like you haven't been able to narrow down the problem very much - this isn't your fault, but basically you're presenting this as "I'm running on Docker and having this problem." I suggest you try a broader set of troubleshooting measures to see if you can find the root cause. What happens if you just run the app on Linux, for testing purposes? Just throw up an EC2 instance in AWS and put it on there, if you can do that - it'll save you a lot of OS configuration stuff. Put your container in a different Docker environment and see what happens.
As for upgrading Java, I recommend upgrading to the latest minor version. This usually isn't listed in the CF docs as a supported Java version, but it's always worked for me. Unfortunately, that doesn't work for major versions. So, for example, if your version of CF supports Java 11.0.14 you can safely upgrade it to 11.0.15 (a minor version) but not to 12 or higher (a major version).
Also, I doubt this has anything to do with compilation. CF actually started as an interpreter, not a compiler, and it doesn't do everything exactly as an ideal compiler would. In fact, you should be able to run CF fine as a single user without compiling anything.
Finally, I have no idea why you wouldn't be able to change the GC algorithm to something other than parallel GC - CF itself does support all kinds of GC options. But you might just want to take a look at the jvm.config and the startup log after you've relaunched in that Docker instance to see if your change picked up.
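One quick way to confirm which GC flag the config actually contains is a small script along these lines. This is only a sketch: it assumes jvm.config keeps its JVM arguments on a single line beginning with `java.args=`, and the sample text and the `gc_flags` helper are made up for illustration. The startup log is still the authority on what the JVM really used.

```python
import re


def gc_flags(jvm_config_text):
    """Return the -XX:+Use...GC flags found on the java.args line.

    Assumes the arguments live on one line starting with 'java.args='.
    """
    for line in jvm_config_text.splitlines():
        line = line.strip()
        if line.startswith("java.args="):
            args = line[len("java.args="):].split()
            return [a for a in args if re.fullmatch(r"-XX:\+Use\w+GC", a)]
    return []


# Hypothetical sample content, not taken from any real server.
sample = (
    "java.home=/opt/java\n"
    "java.args=-server -Xms1g -Xmx1g -XX:+UseG1GC -Dfoo=bar\n"
)
print(gc_flags(sample))
```

If the list is empty, no explicit collector is set and the JVM picks its own default.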
Dave Watts, Eidolon LLC
Thanks @Charlie Arehart and @Dave Watts for your replies
I should clarify a few things
- I only updated to the latest minor version of Java. I did not go to Java 12.
- Please ignore the garbage collection issue. I realised I made a mistake in the settings. However, changing the GC as suggested doesn't help.
- When I say compilation, it just shows my ignorance of Java/CF. All I mean to say is that it seems to be loading everything from scratch.
For diagnostics, I have FusionReactor installed. All I can tell from it is that heap memory usage is unusual when the server is slow.
Each dip occurs after I click a page, and this makes a sawtooth pattern until the page is loaded. However, when the server is working fine I get a simple straight line for heap usage.
The pattern is slightly different when I change the GC to G1, and the server is still slow.
Following @Dave Watts's suggestion, I installed CF on a fresh EC2 instance (Amazon Linux 2) and the exact same thing happens. The server was responsive and fast on first boot and then slowed down when I restarted CF. I did run into a small issue with a few errors but resolved it by setting heartbeat_interval to 0 as per "CF2018 sporadic Crash".
I am fairly comfortable with docker containers and linux though new to Java/CF as mentioned earlier.
Thanks again for any help
Hi @Faraz25024317f5d8 , from what you've shared, my thinking is as follows:
Hi @BKBK, thanks for the input. All cache settings are properly set. I am not that concerned about the sawtooth in FR, as it was just something I noticed; I don't think it is indicative of anything. The real difficulty is the intermittent nature of the problem, which makes it harder to pinpoint. If it were always present it would be easier to track. The fact that it sometimes, randomly, behaves fine is what's confusing.
I'm guessing that your CF on Linux app is doing something that is both producing the pattern you see in FusionReactor and interfering with page execution. Unfortunately, I have no idea what it is. There's nothing wrong with the pattern you're seeing, except you're also seeing the server lag at the same time.
At this point, I'd recommend going through all the normal issues that can cause lag, and just go back to FR when you think you solved the lag.
Dave Watts, Eidolon LLC
Thanks for the input @Dave Watts. I am not sure how well CF supports Linux in general. It could be one of the libraries, but would you know if it's actually not CF but the way httpd works with Tomcat? I'm just grasping at straws really, because dumping execution timestamps in the application code itself shows that the application slows down and takes longer to process.
My other guess is the database layer or Hibernate creating these random issues. But once again, that's grasping at straws.
Faraz, I really doubt anyone is going to find the magic bullet for you here, using the modest info you are able to share this way. Again there is far more you can directly and specifically diagnose, but that too can't be shared effectively here.
So if you want to make the problem go away, I can help via remote screenshare. If I can't, you won't pay for time that you don't find valuable. See my first comment for more.
And once we solve it, you could relate as much as you'd like to help others who may find this thread. It may well be just one thing, but the challenge is finding it. And that may not take long, together.
Thanks for the option mate. I will get back to you.
We are also in the middle of evaluating lucee so will have to see how that goes.
I'll keep at it in the background and post here if I find a solution
Ok, but if you are considering that because you think this is a cf problem that won't happen with Lucee, I seriously doubt it's something calling for that move. It would seem to make more economic sense (assuming you already have the cf license) to spend perhaps an hour to see if this can be solved, rather than the time (perhaps much more) trying to convert to Lucee.
Of course I'm not discouraging that move, nor denying that in SOME cases someone running their app on ACF COULD find it runs on Lucee with less than an hour of effort. But I know also of some apps that people felt could not at all be easily converted, while others took days or weeks.
And all that is what I say just as well about the prospect of migrating an app from one cf version to another.
Bottom line, this problem may seem somehow an intractable mystery, and you'd not be the first to feel that there was no ready solution. But I help solve such problems literally every day.
I realize you may still opt to forego my direct help, for any number of reasons. In that case I leave this as a plea to others who may find this thread. Such problems nearly always can be solved, and often quickly. Better still, you would learn of tools or techniques that should help you solve the next problem which arises (as happens with nearly all tech platforms), and many which could be used with Lucee as well.
As always, just trying to help. 🙂
Hi @Charlie Arehart, I think I have solved this. It was my own doing 😞. I will post the resolution as a reply to the original post.
Thanks for your offer of direct help. I asked around and it seems our company has been in touch with you for other CF-related help and has had a good experience working with you :).
Our evaluating Lucee instead of opting for direct help from you has nothing to do with this particular problem. We have been planning to move to Lucee for a while, though our initial goal was to move to Linux containers first. Lucee was a secondary (though highly desirable) goal. With this issue, we were just thinking of exploring Lucee a bit early in case it solved our problem (it doesn't; Lucee is giving its own problems, but that's a different story).
Thanks again for your help (past and present).
So it seems this problem was introduced early on in the project by a bad JVM config.
There was an entry in the config with -Dcom.sun.xml.bind.v2.bytecode.ClassTailor.noOptimize=true. Removing it seems to resolve the issue.
I believe someone familiar with JVM options would have found this issue easily, but its random nature threw me off. Hope the solution sticks 🙂
So it seems this problem was introduced early on in the project by a bad JVM config.
There was an entry in the config with -Dcom.sun.xml.bind.v2.bytecode.ClassTailor.noOptimize=true. Removing it seems to resolve the issue.
By @Faraz25024317f5d8
While I am glad that this resolves the issue, I am still surprised. You call -Dcom.sun.xml.bind.v2.bytecode.ClassTailor.noOptimize=true a bad JVM config. Well, it isn't. It is a good JVM config. 🙂 In fact, if your application does a lot of XML processing, the config will be indispensable.
Now that you mention it, could you please show us all the JVM configs? You say you've moved the application from Windows to Linux. You might have carried over certain JVM settings that are not relevant to Linux.
A test of yet another idea:
1) Return the following setting back to jvm.config:
-Dcom.sun.xml.bind.v2.bytecode.ClassTailor.noOptimize=true
2) Add to jvm.config the revised setting for Linux random number generation:
-Djava.security.egd=file:/dev/./urandom
3) Restart ColdFusion.
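Before restarting, it may be worth checking that both settings are present exactly as written above. The following is only a sketch: `missing_flags` is a made-up helper, and the `args` string is a hypothetical example rather than anyone's real config. It compares whole arguments, so a setting with the right name but a different value is reported as missing.

```python
def missing_flags(java_args, wanted):
    """Return the wanted flags that do not appear verbatim in java_args."""
    present = set(java_args.split())
    return [flag for flag in wanted if flag not in present]


wanted = [
    "-Dcom.sun.xml.bind.v2.bytecode.ClassTailor.noOptimize=true",
    "-Djava.security.egd=file:/dev/./urandom",
]

# Hypothetical argument string for illustration only.
args = "-server -XX:+UseG1GC -Djava.security.egd=/dev/urandom"

for flag in missing_flags(args, wanted):
    print("not set as suggested:", flag)
```

In this example both flags are reported, because the egd entry in `args` has a different value from the one suggested in step 2.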
Point taken. I only refer to it as a bad config because it was bad for me. As for your other suggestion, it's already there (but does not help if the BAD CONFIG is there).
Here is the JVM config as currently set up.
-server -XX:ReservedCodeCacheSize=1024m -XX:+UseG1GC --add-opens=java.rmi/sun.rmi.transport=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/sun.util.cldr=ALL-UNNAMED --add-opens=java.base/sun.util.locale.provider=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/jdk.internal.loader=ALL-UNNAMED --add-opens=java.base/jdk.internal.reflect=ALL-UNNAMED --add-opens=java.base/jdk.internal.module=ALL-UNNAMED --add-opens=java.base/java.lang.module=ALL-UNNAMED --add-opens=java.base/jdk.internal.util.jar=ALL-UNNAMED --add-opens=java.base/jdk.internal.math=ALL-UNNAMED --add-opens=jdk.management/com.sun.management.internal=ALL-UNNAMED --add-opens=jdk.management.jfr/jdk.management.jfr=ALL-UNNAMED --add-opens=java.base/jdk.internal.platform.cgroupv1=ALL-UNNAMED -XX:MaxMetaspaceSize=1g -Djdk.attach.allowAttachSelf=true -Dcoldfusion.home={application.home} -Djava.security.egd=/dev/urandom -Duser.language=en -Dcoldfusion.rootDir={application.home} -Dcoldfusion.libPath={application.home}/lib -Dorg.apache.coyote.USE_CUSTOM_STATUS_MSG_IN_HEADER=true -Dcoldfusion.jsafe.defaultalgo=FIPS186Random -Dorg.eclipse.jetty.util.log.class=org.eclipse.jetty.util.log.JavaUtilLog -Dnet.sf.ehcache.sizeof.filter=/app/java_ehCacheOpenSource/sizeOfExclusions.config -Djava.locale.providers=COMPAT,SPI -Dsun.font.layoutengine=icu -javaagent:/opt/fusionreactor/fusionreactor.jar=name=cf-fr,address=8088
How much RAM does your Operating System have?
What are the values of Xmx and Xms? In any case, as I suggested earlier, they should have the same value. If the value is 4096 MB or less, you should use -XX:+UseParallelGC
I would delete the following flags:
-XX:ReservedCodeCacheSize=1024m
-XX:MaxMetaspaceSize=1g
After you delete them the JVM will use default values. Such defaults are optimized, based on your application and on the environment. It is well-nigh impossible to keep track of the sheer number of factors that the JVM considers when setting the value of a default. As such, the defaults will almost always be more optimal than the values chosen by the developer.
Currently the OS runs with 16 GB (Windows), with 11 GB allocated to the application via the CF admin. I ported these settings from the app currently running on Windows, as they have been found over the years to be the sweet spot for it to run stably. With just one person using the system, it sometimes goes up to 6 GB in memory consumption, so I am not that keen to tweak these just yet. Once we get some load on it, we can modify the settings and see how it performs. I found this https://coldfusion.adobe.com/2018/03/coldfusion-performance-issues-and-optimization/ which seems to support some of the settings we have.
Faraz, are you saying you're still having trouble? And your removal of "-Dcom.sun.xml.bind.v2.bytecode.ClassTailor.noOptimize=true" hasn't been the solution you'd hoped?
As for grasping at other jvm tweaks being discussed here, I feel it's really the tail wagging the dog. You show having FusionReactor in your jvm args. It should be able to clearly indicate what's going on with CF. Again I can help you do that. (Thanks for your kind regards above on Tuesday.)
But before that, here's something you've not made clear yet. You just said here that "the OS runs with 16gb ( windows ) with 11gb allocated to the application via the CF admin". That may be a clue.
So you're saying your Docker is running on Windows. Are you using Docker Desktop? If so, is it using Hyper-V or WSL 2? I ask because if Hyper-V, the docs say that by default containers will be given only 2g of memory. If WSL, it will be given 8g. Both can be edited. It sounds like the latter. See if changing it to 12g helps:
https://docs.microsoft.com/en-us/windows/wsl/wsl-config#configuration-setting-for-wslconfig
Or, consider lowering that "11gb allocated to the application via the CF admin" (which is also the xmx that bkbk was asking for). I know you may think it "needs it". You're trying to prove you can run the same thing in docker as outside it. But let's see you walk before you run.
Maybe changing either of these will work, for extended testing. If so, you can then consider increasing them both to support what you seek.
But I will say finally that it's not a given that you should HAVE TO have the heap max as the 11g that your experience suggests. I have helped many shops dramatically reduce that memory "requirement", nearly always solely through configuration or very little code changing.
Let's hear how things may go first.
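To see whether the heap max even fits in what the container has been given, something like the following could be run inside the container. This is a rough sketch, not a CF or Docker tool: the helper names are made up, and it assumes the usual Linux cgroup file locations (v2 first, then v1), which may differ on your setup.

```python
from pathlib import Path

# Usual locations of the container memory limit (an assumption; paths can vary).
CGROUP_FILES = (
    "/sys/fs/cgroup/memory.max",                    # cgroup v2
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",  # cgroup v1
)


def container_memory_limit(paths=CGROUP_FILES):
    """Return the memory limit in bytes, or None if unlimited or unknown."""
    for p in paths:
        f = Path(p)
        if f.is_file():
            raw = f.read_text().strip()
            if raw == "max":  # cgroup v2 spelling of "no limit"
                return None
            return int(raw)
    return None


def heap_fits(xmx_bytes, limit_bytes):
    """True if the heap max is below the limit, or if there is no known limit."""
    return limit_bytes is None or xmx_bytes < limit_bytes


GIB = 1024 ** 3
limit = container_memory_limit()
print("container limit:", limit, "| 11g heap fits:", heap_fits(11 * GIB, limit))
```

Keep in mind the JVM needs memory beyond the heap (metaspace, code cache, threads), so a heap that only just fits is still too tight.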
Hi Charlie
Let me clarify.
First up, the problem is resolved after removing "-Dcom.sun.xml.bind.v2.bytecode.ClassTailor.noOptimize=true". So the issue is resolved.
Second, we are CURRENTLY running on Windows, hence my comment about the RAM and the system running with the memory configuration I posted. I am porting this to Docker containers on Linux (images built on top of the official CF image and running on Amazon Linux), so I have copied most settings across. The issues I was facing were on the containers running on Linux, not on native Windows, and were due to the config I had updated. I don't know whether adding that config on Windows would introduce the same issue.
Hope this clarifies. Thanks for your help
As your Xmx/Xms value is 11 GB, G1GC is indeed to be preferred. Also, 11 GB out of 16 GB of server RAM is 69%, which is comfortably within the customary limit of 80%.
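For anyone wanting to redo that arithmetic with their own numbers, here is a trivial sketch. The `heap_share` helper is made up for illustration, and the 80% figure is just the rule of thumb mentioned above, not a hard JVM limit.

```python
def heap_share(heap_gb, ram_gb):
    """Fraction of total RAM taken by the JVM heap."""
    return heap_gb / ram_gb


share = heap_share(11, 16)          # values from this thread
print(f"heap is {share:.0%} of RAM, within 80% rule of thumb: {share <= 0.80}")
```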
- I ported these settings from the app currently running on Windows, as they have been found over the years to be the sweet spot for it to run stably. With just one person using the system, it sometimes goes up to 6 GB in memory consumption, so I am not that keen to tweak these just yet. Once we get some load on it, we can modify the settings and see how it performs. I found this https://coldfusion.adobe.com/2018/03/coldfusion-performance-issues-and-optimization/ which seems to support some of the settings we have.
By @Faraz25024317f5d8
Quite. Did you notice that the link you provided actually recommends that you use -Dcom.sun.xml.bind.v2.bytecode.ClassTailor.noOptimize=true for improved performance?
JVM settings are convenient to test. Even live, in production. As long as the application can tolerate it.
Just create an extra copy of the file jvm.config that contains the settings you wish to test. Run ColdFusion using the test config file.
If the test result is not to your satisfaction, then simply stop ColdFusion, put back the original jvm.config, and restart.
My suggestions for obtaining your test jvm.config file (using the current file as basis):
Delete the following flags:
-XX:ReservedCodeCacheSize=1024m
-XX:MaxMetaspaceSize=1g
Let me emphasize: You should let the JVM decide these values!
For 2 reasons: (i) they affect the performance of the entire JVM; (ii) the JVM takes into account a lot of factors and fuzzy logic - certainly many more than you and I ever could - when determining the optimal values.
Add the following flag:
-Dcom.sun.xml.bind.v2.bytecode.ClassTailor.noOptimize=true
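The edits above can be applied to the argument string mechanically. What follows is a minimal sketch, assuming the arguments are space-separated on one line; `make_test_args` is a made-up helper and `current` is a shortened, hypothetical example rather than the full config.

```python
# Flags to drop (matched by prefix, so any value is removed) and the flag to add.
REMOVE_PREFIXES = ("-XX:ReservedCodeCacheSize=", "-XX:MaxMetaspaceSize=")
ADD_FLAG = "-Dcom.sun.xml.bind.v2.bytecode.ClassTailor.noOptimize=true"


def make_test_args(java_args):
    """Return a copy of java_args with the two size flags removed and the XML flag added."""
    kept = [a for a in java_args.split() if not a.startswith(REMOVE_PREFIXES)]
    if ADD_FLAG not in kept:
        kept.append(ADD_FLAG)
    return " ".join(kept)


# Shortened, hypothetical example of a current argument string.
current = "-server -XX:ReservedCodeCacheSize=1024m -XX:+UseG1GC -XX:MaxMetaspaceSize=1g"
print(make_test_args(current))
```

Paste the result into the test copy of jvm.config only, so the original file remains available to put back.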
Yes, I am aware that setting is suggested in the link I posted. I think that's where I got it from in the first place.
I tried the suggestions you posted and it still slows down with -Dcom.sun.xml.bind.v2.bytecode.ClassTailor.noOptimize=true. This was a setting I had introduced and it did not exist in the original Windows setup, so I am comfortable leaving it off.
So bottom line, Faraz, you no longer have a problem, right?
Can you help bkbk understand that, so he can hold off on making more suggestions? Or am I misunderstanding something that you still need to improve?
@Faraz25024317f5d8 , thanks for testing the suggestions and sharing the result.
@Charlie Arehart , I was aware that leaving out the XML setting appeared to solve the problem. My suggestions were meant to exclude any other possible causes.