
We recently migrated our .NET 4.6 Web API to netcoreapp2.0. We use AWS ECS Docker containers to deploy our services.

A short load test works fine, but a long-running load test shows that the Docker containers recycle with exit code 137.

During the entire load test, memory and CPU utilization are normal, at around 30% each.

As exit code 137 is memory related, the following fixes have been tried:

  1. Changed the garbage collection mode:

         <ServerGarbageCollection>true</ServerGarbageCollection>
         <ConcurrentGarbageCollection>true</ConcurrentGarbageCollection>

  2. Migrated to .NET Core 2.0.3, as it includes some fixes for memory management:

         FROM microsoft/dotnet:2.0.3-runtime

  3. Configured cgroups, as there were errors like the following in the Docker logs:

         cgroup: docker-runc (3365) created nested cgroup for controller "memory" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
         [ 23.104548] cgroup: "memory" requires setting use_hierarchy to 1 on the root

Our ECS task configuration is as follows:

  1. Number of running tasks: 2, on 2 c4.xlarge EC2 instances behind ECS.
  2. Memory soft limit: 2 GB.
  3. We have also validated our health-check endpoint, which has no issues and responds fast. We even tried hard-coding the health check to return 200 OK (see the sketch below).
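
For reference, a minimal sketch of that hard-coded health check in Startup.cs, assuming plain ASP.NET Core 2.0 middleware (the /api/health path is an assumption; substitute whatever path the ALB target group actually probes):

    using Microsoft.AspNetCore.Builder;
    using Microsoft.AspNetCore.Http;

    public class Startup
    {
        public void Configure(IApplicationBuilder app)
        {
            // Hard-coded health check: always answer 200 OK regardless of
            // application state, to rule the endpoint out as the cause of the
            // ALB/ECS health-check failures.
            app.Map("/api/health", healthApp =>
            {
                healthApp.Run(async context =>
                {
                    context.Response.StatusCode = 200;
                    await context.Response.WriteAsync("OK");
                });
            });
        }
    }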

Some Docker logs (notice that OOMKilled is false, and there are no kernel-level logs either):

 "State": {
        "Status": "exited",
        "Running": false,
        "Paused": false,
        "Restarting": false,
        "OOMKilled": false,
        "Dead": false,
        "Pid": 0,
        "ExitCode": 137,
        "Error": "",
        "StartedAt": "2018-02-12T06:15:00.481719209Z",
        "FinishedAt": "2018-02-12T07:13:02.962733905Z"
    },

One odd observation: if we run the load test directly against the Docker container's IP and port, it works just fine. If we run it through the ALB, the crash behavior is observed.

Please let me know of any other Linux command that can give me the actual reason for the process termination, or any possible fixes for the above case.

  • Did it work for you? I'm facing this issue now; let me know if you managed to fix it. Commented Jul 26, 2018 at 23:14
  • After many tries, I asked our DevOps team to create the Docker cluster from scratch on AWS ECS, after which it stopped happening. Either I was just lucky, or there was some configuration issue in the previous cluster. Commented Aug 2, 2018 at 9:27

2 Answers


The actual issue was thread starvation. Our application uses threads intensively, and as the load increased during the test, the containers were starving for threads and new requests were being queued, the health check among them. The threads already executing could not finish because they were also blocked for some reason, so Docker was killing our containers due to health-check timeouts.

We increased the minimum thread count to 100 in Startup.cs to fix this issue.
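
A minimal sketch of that change, assuming it is done with ThreadPool.SetMinThreads at application startup (the completion-port value of 100 is an assumption; the fix above only mentions the minimum thread count):

    using System.Threading;

    public class Startup
    {
        public Startup()
        {
            // Raise the minimum number of thread-pool threads so the pool injects
            // threads immediately under bursty load instead of ramping up slowly,
            // which was starving queued requests such as the health check.
            ThreadPool.SetMinThreads(workerThreads: 100, completionPortThreads: 100);
        }
    }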


Does your health check respond correctly to the ALB's requests?

Exit code 137 (128 + SIGKILL) means your container was killed, in this case by the ECS agent.

2 Comments

Our Docker logs had health-check timeout and failure entries when the containers were rolling. But we hard-coded our health-check API to always return 200 OK to eliminate this reason, and the container is still being stopped with exit code 137.
Have you enabled logging to AWS CloudWatch to see the requests sent to your container (in the container definition inside the task definition, select awslogs as the log driver)? In my case the ALB request was rejected because the Host HTTP header was an IP address; I had to add a hack to allow this host in the website configuration.
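
The comment does not say where that host allow-list lives; if the rejection happened inside the ASP.NET Core application itself, a comparable tweak might look like the sketch below, using the host-filtering middleware from later ASP.NET Core versions (the middleware choice, host name, and IP address are all assumptions for illustration, not the commenter's actual configuration):

    using Microsoft.AspNetCore.Builder;
    using Microsoft.Extensions.DependencyInjection;

    public class Startup
    {
        public void ConfigureServices(IServiceCollection services)
        {
            services.AddHostFiltering(options =>
            {
                // Accept the normal site host name plus the raw IP address that
                // shows up in the Host header (both values are hypothetical).
                options.AllowedHosts.Add("myservice.example.com");
                options.AllowedHosts.Add("10.0.0.42");
            });
        }

        public void Configure(IApplicationBuilder app)
        {
            app.UseHostFiltering();
        }
    }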
