Performance Monitoring Toolset keeps crashing
I've been having a lot of problems with the Performance Monitoring Toolset (PMT) since putting my CF2018 Enterprise intranet into production about a month ago. I have the PMT installed on my Test/Dev server (Windows Server 2019, 64 GB RAM, running a Development instance of CF2018 plus the PMT), and it's monitoring my Dev site plus my production site on another server (same hardware/setup).
What keeps happening, every couple of days or so, is that the Elasticsearch datastore service that underlies the PMT crashes, and the first symptom is that I can't log in. When I find that, I restart the Datastore service and then the PMT service itself, but the PMT service won't start because it fails to connect to the Datastore. After several attempts, I usually get frustrated and uninstall and reinstall the whole thing, which is clearly not a viable long-term solution.
The files in datastore/logs show errors like:
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
java.lang.OutOfMemoryError: Java heap space
I recognize the latter, so on my most recent install I bumped the Datastore heap size up to 4 GB, thinking maybe the default just isn't enough to keep track of my workload (though that seems unlikely).
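For anyone wanting to confirm that a heap change like this actually landed in the file, here's a rough Python sketch that reads the heap flags out of an Elasticsearch-style jvm.options file. The -Xms/-Xmx flags are standard JVM options, and Elasticsearch's own guidance is to set both to the same value. Where the PMT Datastore keeps its jvm.options is an assumption on my part, so the sample text below is hypothetical rather than copied from a real install.

```python
import re

# Standard JVM heap flags, e.g. -Xms4g or -Xmx4096m.
HEAP_FLAG = re.compile(r"^-(Xms|Xmx)(\d+)([kKmMgG]?)$")
UNIT_BYTES = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_heap_settings(jvm_options_text):
    """Return a dict like {'Xms': bytes, 'Xmx': bytes} for heap flags found."""
    found = {}
    for raw in jvm_options_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = HEAP_FLAG.match(line)
        if match:
            flag, number, unit = match.groups()
            found[flag] = int(number) * UNIT_BYTES[unit.lower()]
    return found


def heap_warnings(settings):
    """Return a list of human-readable problems with the heap settings."""
    if "Xms" not in settings or "Xmx" not in settings:
        return ["-Xms and/or -Xmx is not set explicitly"]
    if settings["Xms"] != settings["Xmx"]:
        return ["-Xms and -Xmx differ; Elasticsearch recommends equal values"]
    return []


if __name__ == "__main__":
    # Hypothetical file contents, not taken from a real PMT install.
    sample = "# heap settings\n-Xms1g\n-Xmx4g\n"
    settings = parse_heap_settings(sample)
    print(settings)
    print(heap_warnings(settings))
```

To use it against a real file, read the file's text and pass it to parse_heap_settings(); the Datastore service would presumably need a restart before any change takes effect.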
The PMT logs show this when it won't start:
[ERROR] 2019-06-27 11:45:52.658 com.adobe.pms.es.client.ElasticSearchClient - Datastore Service not available. Retrying to connect...
[ERROR] 2019-06-27 11:46:27.884 com.adobe.pms.es.client.ElasticSearchClient - Datastore Service not available. Shutting down Performance Management Suite...
This happens even though the Datastore service appears to be running and its own logs show it as having recovered.
I was hoping to find more mentions of these issues online, or a patch, but I can't find anything related. It's concerning because I keep needing to see what's running on the Production server while the PMT is down, which leaves me with no visibility into that server's active jobs. It seems like the PMT was put together a bit hastily on a new stack of technologies that don't work especially well together, and no longer having the Server Monitor leaves us a bit blind as to what our servers are doing.
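One way to check whether the Datastore is really answering, independent of the PMT, is to ask Elasticsearch for its cluster health directly. Below is a minimal Python sketch, assuming the Datastore listens on http://localhost:9200 without authentication (both are assumptions; adjust to whatever host and port your install uses). The /_cluster/health endpoint and its status and unassigned_shards fields are part of Elasticsearch itself; the function names are just mine.

```python
import json
import urllib.request


def fetch_cluster_health(base_url="http://localhost:9200", timeout=5):
    """GET /_cluster/health; return the parsed JSON, or None if unreachable.

    The default base_url is an assumption, not a confirmed PMT setting.
    """
    try:
        url = base_url + "/_cluster/health"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP errors;
        # ValueError covers a response body that is not valid JSON.
        return None


def describe_health(health):
    """Turn a cluster-health response into a one-line summary."""
    if health is None:
        return "unreachable: nothing answered the health request"
    status = health.get("status", "unknown")
    unassigned = health.get("unassigned_shards", 0)
    if status == "red":
        return "red: at least one primary shard is unavailable (%d unassigned)" % unassigned
    if status == "yellow":
        return "yellow: primaries are up, some replicas unassigned (%d)" % unassigned
    return "%s: %d unassigned shards" % (status, unassigned)


if __name__ == "__main__":
    print(describe_health(fetch_cluster_health()))
```

If this reports red while the Windows service shows as running, that would at least be consistent with the "all shards failed" error above, though I can't say that is what the PMT itself checks before it gives up.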
If anybody's had similar experience with the PMT or any insight on how to manage it, I'd sure appreciate it.
