We recently migrated our .NET 4.6 Web API to netcoreapp2.0. We deploy our services as Docker containers on AWS ECS.
Short load tests work fine, but long-running load tests show that the Docker containers get recycled with exit code 137.
During the entire load test, memory and CPU utilization stay normal, at about 30% each.
Since exit code 137 (128 + 9, meaning the process was killed with SIGKILL) is usually memory related, we have tried the following fixes:
- Changed the garbage collection mode:
<ServerGarbageCollection>true</ServerGarbageCollection>
<ConcurrentGarbageCollection>true</ConcurrentGarbageCollection>
- Migrated to .NET Core 2.0.3, as it has some fixes for memory management.
FROM microsoft/dotnet:2.0.3-runtime
- Configured cgroups, because the following errors appeared in the Docker logs:
cgroup: docker-runc (3365) created nested cgroup for controller "memory" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
[ 23.104548] cgroup: "memory" requires setting use_hierarchy to 1 on the root
Our ECS task configuration is as follows:
- Number of running tasks: 2, on 2 c4.xlarge EC2 instances in the ECS cluster.
- Memory soft limit: 2 GB
- We have also validated our health check endpoint; it has no issues and responds quickly. We even tried hard-coding the health check to return 200 OK.
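For reference, the memory setting above might map onto an ECS container definition roughly as follows. This is only a sketch: the container name `web-api` and the image name are hypothetical, and only the soft limit (`memoryReservation`, expressed in MiB) comes from the values listed above. No hard limit (`memory`) is shown because none is stated here.

```python
import json

SOFT_LIMIT_GIB = 2  # soft limit from the task configuration above


def gib_to_mib(gib: int) -> int:
    """ECS expresses container memory settings in MiB."""
    return gib * 1024


container_definition = {
    "name": "web-api",           # hypothetical container name
    "image": "example/web-api",  # hypothetical image name
    # Soft limit: memory reserved for the container, in MiB.
    "memoryReservation": gib_to_mib(SOFT_LIMIT_GIB),
    # A hard limit would be a separate "memory" key (also in MiB);
    # it is omitted because no hard limit is given above.
}

print(json.dumps(container_definition, indent=2))
```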
Some Docker logs (notice that OOMKilled is false, and there are no kernel-level logs either):
"State": {
    "Status": "exited",
    "Running": false,
    "Paused": false,
    "Restarting": false,
    "OOMKilled": false,
    "Dead": false,
    "Pid": 0,
    "ExitCode": 137,
    "Error": "",
    "StartedAt": "2018-02-12T06:15:00.481719209Z",
    "FinishedAt": "2018-02-12T07:13:02.962733905Z"
},
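To make the reading of this state explicit, here is a minimal sketch that decodes the `ExitCode` and `OOMKilled` fields. It assumes the usual convention that an exit code above 128 means "128 + signal number"; the helper name `explain_state` is made up for illustration, and the JSON is an excerpt of the `State` object shown above.

```python
import json
import signal

# Excerpt of the "State" object from the docker inspect output above.
STATE_JSON = """
{
  "Status": "exited",
  "OOMKilled": false,
  "ExitCode": 137,
  "StartedAt": "2018-02-12T06:15:00.481719209Z",
  "FinishedAt": "2018-02-12T07:13:02.962733905Z"
}
"""


def explain_state(state: dict) -> str:
    """Summarize how the container's main process ended."""
    code = state["ExitCode"]
    if code > 128:
        # Convention: exit code = 128 + number of the terminating signal.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        cause = f"terminated by {name}"
    else:
        cause = f"exited on its own with status {code}"

    if state.get("OOMKilled"):
        oom = "flagged by Docker as OOM-killed"
    else:
        oom = "not flagged by Docker as OOM-killed"
    return f"{cause}; {oom}"


state = json.loads(STATE_JSON)
print(explain_state(state))
```

Here 137 - 128 = 9, which is SIGKILL on Linux. A SIGKILL with `OOMKilled` set to false suggests the signal may have come from something other than the kernel OOM killer (for example, the container being stopped from outside), but this output alone does not identify the sender.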
One odd observation: if we run the load test directly against the Docker container's IP and port, it works just fine. If we run it through the ALB, the crash behavior is observed.
Please let me know of any other Linux command that can show the actual reason for the process termination, or any possible fixes for the above case.