
We recently migrated our .NET 4.6 Web API to netcoreapp2.0. We use AWS ECS Docker containers to deploy our services.

A short load test works fine, but a long-running load test shows that the Docker containers recycle with exit code 137.

During the entire load test, memory and CPU utilization are normal, at around 30% each.

As exit code 137 is memory related, the following fixes have been tried:

  1. Changed the garbage collection mode:

         <ServerGarbageCollection>true</ServerGarbageCollection>
         <ConcurrentGarbageCollection>true</ConcurrentGarbageCollection>

  2. Migrated to .NET Core 2.0.3, as it includes some fixes for memory management:

         FROM microsoft/dotnet:2.0.3-runtime

  3. Configured cgroups, as there were errors like the following in the Docker logs:

         cgroup: docker-runc (3365) created nested cgroup for controller "memory" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
         [ 23.104548] cgroup: "memory" requires setting use_hierarchy to 1 on the root

Our ECS task configuration is as follows:

  1. Number of running tasks: 2, on 2 c4.xlarge EC2 instances behind ECS.
  2. Memory soft limit: 2 GB.
  3. We have also validated our health-check endpoint, which has no issues and responds fast. We even tried hard-coding the health check to return 200 OK (see the sketch below).
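
For reference, a minimal sketch of that hard-coded health check in Startup.cs, assuming plain ASP.NET Core 2.0 middleware (the /api/health path is an assumption; substitute whatever path the ALB target group actually probes):

    using Microsoft.AspNetCore.Builder;
    using Microsoft.AspNetCore.Http;

    public class Startup
    {
        public void Configure(IApplicationBuilder app)
        {
            // Hard-coded health check: always answer 200 OK regardless of
            // application state, to rule the endpoint out as the cause of the
            // ALB/ECS health-check failures.
            app.Map("/api/health", healthApp =>
            {
                healthApp.Run(async context =>
                {
                    context.Response.StatusCode = 200;
                    await context.Response.WriteAsync("OK");
                });
            });
        }
    }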

Some Docker logs (notice that OOMKilled is false, and there are no kernel-level logs either):

 "State": {
        "Status": "exited",
        "Running": false,
        "Paused": false,
        "Restarting": false,
        "OOMKilled": false,
        "Dead": false,
        "Pid": 0,
        "ExitCode": 137,
        "Error": "",
        "StartedAt": "2018-02-12T06:15:00.481719209Z",
        "FinishedAt": "2018-02-12T07:13:02.962733905Z"
    },

One odd observation: if we run the load test directly against the Docker container's IP and port, it works just fine. If we run it through the ALB, the crash behavior is observed.

Please let me know of any other Linux command that can give me the actual reason for the process termination, or any possible fixes for the above case.

  • Did it work for you? I'm facing this issue now; let me know if you managed to fix it. Commented Jul 26, 2018 at 23:14
  • After many tries, I asked our DevOps team to create the Docker cluster from scratch on AWS ECS, after which it stopped happening. Either I was just lucky, or there was some configuration issue in the previous cluster. Commented Aug 2, 2018 at 9:27

2 Answers


The actual issue was thread starvation. Our application uses threads intensively, and as the load increased during the test, the containers were starving for threads and new requests were being queued, the health check among them. The threads already executing could not finish because they were also blocked for some reason, so Docker was killing our containers due to health-check timeouts.

We increased the minimum thread count to 100 in Startup.cs to fix this issue.
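
A minimal sketch of that change, assuming it is done with ThreadPool.SetMinThreads at application startup (the completion-port value of 100 is an assumption; the fix above only mentions the minimum thread count):

    using System.Threading;

    public class Startup
    {
        public Startup()
        {
            // Raise the minimum number of thread-pool threads so the pool injects
            // threads immediately under bursty load instead of ramping up slowly,
            // which was starving queued requests such as the health check.
            ThreadPool.SetMinThreads(workerThreads: 100, completionPortThreads: 100);
        }
    }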


Does your health check respond correctly to the ALB's requests?

Exit code 137 (128 + SIGKILL) means your container was killed, in this case by the ECS agent.

2 Comments

Our Docker logs had health-check timeout and failure entries when the containers were rolling. But we hard-coded our health-check API to always return 200 OK to eliminate this reason, and the container is still being stopped with exit code 137.
Have you enabled logging to AWS CloudWatch to see the requests sent to your container (in the container definition inside the task definition, select awslogs as the log driver)? In my case the ALB request was rejected because the Host HTTP header was an IP address; I had to add a hack to allow this host in the website configuration.
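
The comment does not say where that host allow-list lives; if the rejection happened inside the ASP.NET Core application itself, a comparable tweak might look like the sketch below, using the host-filtering middleware from later ASP.NET Core versions (the middleware choice, host name, and IP address are all assumptions for illustration, not the commenter's actual configuration):

    using Microsoft.AspNetCore.Builder;
    using Microsoft.Extensions.DependencyInjection;

    public class Startup
    {
        public void ConfigureServices(IServiceCollection services)
        {
            services.AddHostFiltering(options =>
            {
                // Accept the normal site host name plus the raw IP address that
                // shows up in the Host header (both values are hypothetical).
                options.AllowedHosts.Add("myservice.example.com");
                options.AllowedHosts.Add("10.0.0.42");
            });
        }

        public void Configure(IApplicationBuilder app)
        {
            app.UseHostFiltering();
        }
    }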
