
I have an AWS Lambda function that is invoked by another service. The function calls external API endpoints and sometimes gets a network timeout or no response. What is the best way to implement a retry mechanism in AWS Lambda to handle failures when the external API is not responding or returns server-side errors? Also, what fallback strategy should I use?

I followed the article below, which suggests using Step Functions to implement the retry-with-backoff pattern. Is there any sample code for this, and what cost considerations should I keep in mind when using these services?

I followed these articles for the solution approach

  • Does your Lambda return a response to the service that calls it, or does that service just invoke the Lambda and move on to its next steps? Commented Dec 22, 2023 at 8:14
  • Yes, the Lambda has to return the response it receives from the external API to the calling upstream service. That is why I need the retry and fallback approach. Also, do we need AWS Step Functions for the retry logic, or can it be done within the Lambda itself? Commented Dec 22, 2023 at 11:50
  • Lambda has no built-in retry for synchronous invocations. Also, expecting retries to make this work is very optimistic for something that fails due to an external factor, because in the worst case the retries will fail too. Step Functions can be used, but debugging is not easy there. It's better to wrap the API call in that Lambda with a method that retries on failure before returning the final response. Commented Dec 22, 2023 at 17:54
  • Hi @BhaveshParvatkar - In your last statement you say it's better to wrap the API call in that Lambda with a method that retries on failure before returning the final response. Could you please elaborate on that with an example of how to do it? Also, what are the drawbacks/tradeoffs of using AWS Step Functions for retry-with-backoff logic? Any pointers would help. Commented Dec 26, 2023 at 8:11
  • Sure. Before that, can you tell me what this "another service" is? I will answer with examples based on it. Commented Dec 26, 2023 at 18:52

1 Answer


Your Lambda has three main stages:

Stage 1 (Pre-API execution): This stage runs before the API call is made.

Stage 2 (API execution): You wait for the response from the API.

Stage 3 (Post-API execution): You continue your Lambda execution with the data from Stage 2, or halt if an error occurred.

Things to consider:

Lambda has a hard timeout limit of 15 minutes.

Solution 1: Wrap the external API call with retry logic

Assumption: Your Lambda can complete all retries within its timeout and wait before processing the next stage.

Example for Node.js

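const axios = require('axios');

// Resolves after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls `url` up to `maxAttempts` times, waiting `delayBetweenAttempts` ms between attempts.
// `requestTimeout` (ms) caps each individual request.
async function retryApiCall(url, maxAttempts = 3, delayBetweenAttempts = 1000, requestTimeout = 5000) {
  let lastError;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      // Without a timeout, a hung request would block until the Lambda itself times out.
      // By default axios rejects on non-2xx statuses, so error responses land in `catch`.
      const response = await axios.get(url, { timeout: requestTimeout });
      return response.data;
    } catch (error) {
      lastError = error;
      console.error(`Error on attempt ${attempt}:`, error.message);
    }

    // No need to wait after the final attempt.
    if (attempt < maxAttempts) {
      await sleep(delayBetweenAttempts);
    }
  }

  const reason = lastError ? lastError.message : 'no attempts were made';
  throw new Error(`Failed after ${maxAttempts} attempts: ${reason}`);
}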

Pros:

  1. Easy to debug: Everything runs within a single Lambda invocation, so there is no need to go through separate logs.
  2. No Lambda breakdown needed: Since everything is handled within the Lambda itself, the code does not have to be split up the way it would be for a Step Functions workflow.

Cons:

  1. In the worst case, the Lambda has to wait until all retries of the API call have finished. You are billed for the entire duration.
  2. Won't work if the retries need more than 15 minutes in total.
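
Since the question asks about backoff, the fixed delay in the example above could be swapped for exponential backoff with jitter, and the loop could stop early when the invocation is about to run out of time. The following is only a minimal sketch under those assumptions: `backoffDelay`, `withBackoff`, and all option names are hypothetical, and `operation` stands for whatever async call you make (for example the axios request above).

```javascript
// Full-jitter exponential backoff: a random delay in
// [0, min(capMs, baseMs * 2^(attempt - 1))). `random` is injectable for testing.
function backoffDelay(attempt, baseMs = 200, capMs = 5000, random = Math.random) {
  const ceiling = Math.min(capMs, baseMs * 2 ** (attempt - 1));
  return Math.floor(random() * ceiling);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retries `operation` until it succeeds, the attempts run out, or waiting
// again would leave less than `safetyMarginMs` of the time budget.
// `remainingMs` is a function so that a Lambda handler can pass
// () => context.getRemainingTimeInMillis().
async function withBackoff(operation, options = {}) {
  const {
    maxAttempts = 3,
    baseMs = 200,
    capMs = 5000,
    remainingMs = () => Infinity,
    safetyMarginMs = 1000,
  } = options;

  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      const delay = backoffDelay(attempt, baseMs, capMs);
      const outOfTime = remainingMs() - delay < safetyMarginMs;
      if (attempt === maxAttempts || outOfTime) break;
      await sleep(delay);
    }
  }
  const reason = lastError ? lastError.message : 'no attempts were made';
  throw new Error(`Retries exhausted: ${reason}`);
}
```

Inside a handler this could be called as `withBackoff(() => axios.get(url, { timeout: 3000 }), { remainingMs: () => context.getRemainingTimeInMillis() })`. Keep in mind that the upstream caller has its own timeout too, so the total retry time should stay below it.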

Solution 2: Step function path

If you go down this path, your Lambda should not wait for all stages to complete. It completes its first stage, starts the Step Functions execution, and ends.

Phase 1: Instead of calling the API, your Lambda starts a Step Functions execution and ends. (It only does Stage 1 as described above.)

Phase 2: Step Functions handles the retries and the delay mechanism for you. Each retry can use the full 15-minute Lambda timeout, because every retry is a new Lambda invocation. The delays are handled separately as well. (Stage 2)

Phase 3: A new Lambda. Once Phase 2 completes, its output is sent to a new Lambda, and you continue with Stage 3 from there.

Pros:

  1. Each attempt can use the full 15-minute Lambda timeout, and the delay between retries does not count against it.

Cons:

  1. Harder debugging: Your process is now split across stages. First you look at the Lambda from Phase 1, then you find the Step Functions execution from Phase 2, then you go through the logs of each retry (they are separate for each invocation), and finally you check the logs of the new Lambda.
  2. Added billing: Step Functions billing is based on state transitions. Check out the examples here
  3. Additional data: If you want to pass data on from Phase 1, you need a strategy for carrying it through all phases of the workflow. Also remember that Step Functions has a hard payload limit of 256 KB. If your data is larger than that, you need something like an S3 file or another alternative. If it fits, you can pass it as input to all your Lambda functions in Phase 2 and Phase 3.
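
To illustrate Phase 2, here is a rough sketch of what the state machine definition might look like, written as a JavaScript object that serializes to Amazon States Language JSON. The `Retry` and `Catch` fields (`ErrorEquals`, `IntervalSeconds`, `MaxAttempts`, `BackoffRate`, `Next`) are standard Amazon States Language fields; the state names, the ARNs, and the `retryWaits` helper are placeholders made up for this example.

```javascript
// Hypothetical workflow: state names and Lambda ARNs are placeholders.
const definition = {
  Comment: 'Call an external API with retries and exponential backoff',
  StartAt: 'CallExternalApi',
  States: {
    CallExternalApi: {
      Type: 'Task',
      Resource: 'arn:aws:lambda:REGION:ACCOUNT_ID:function:call-external-api',
      TimeoutSeconds: 30,
      Retry: [
        {
          ErrorEquals: ['States.TaskFailed', 'States.Timeout'],
          IntervalSeconds: 2, // wait before the first retry
          MaxAttempts: 3,     // retries, not counting the first attempt
          BackoffRate: 2.0,   // multiplier applied to the wait on each retry
        },
      ],
      // Once the retries are exhausted, route to a failure-handling state.
      Catch: [{ ErrorEquals: ['States.ALL'], ResultPath: '$.error', Next: 'HandleFailure' }],
      Next: 'ProcessResponse',
    },
    ProcessResponse: {
      Type: 'Task',
      Resource: 'arn:aws:lambda:REGION:ACCOUNT_ID:function:process-response',
      End: true,
    },
    HandleFailure: {
      Type: 'Task',
      Resource: 'arn:aws:lambda:REGION:ACCOUNT_ID:function:handle-failure',
      End: true,
    },
  },
};

// Waits (in seconds) before each retry: IntervalSeconds * BackoffRate^(n - 1).
function retryWaits({ IntervalSeconds, MaxAttempts, BackoffRate }) {
  return Array.from({ length: MaxAttempts }, (_, i) => IntervalSeconds * BackoffRate ** i);
}

// JSON that could be supplied as the state machine definition.
const asl = JSON.stringify(definition, null, 2);
```

With these values the workflow waits 2, 4, and 8 seconds before the three retries. On cost: as far as I know, retries count as state transitions in Standard workflows, so a task that regularly exhausts its retries multiplies the transitions you pay for.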

Fallback

Neither solution handles a fallback.

What does a fallback mean for you? If there is no guarantee that the external API will work on a retry, there is no point in going the complex route. That decision is left to you.

Fallback solution: Add a DLQ (dead-letter queue) to handle the failed messages. If all attempts fail, it is better to handle those messages separately.
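
If the caller needs a response even when every retry fails, one possible shape for a fallback inside the Lambda is a small wrapper like the one below. This is only a sketch: `withFallback` and the `source` field are made-up names, and what the fallback returns (a cached value, a default, or an explicit error payload) depends entirely on your use case.

```javascript
// Runs `primary`; if it rejects, returns the result of `fallback(error)` instead.
// The `source` field lets the caller tell a real response from a degraded one.
async function withFallback(primary, fallback) {
  try {
    return { source: 'primary', data: await primary() };
  } catch (error) {
    console.error('Primary call failed, using fallback:', error.message);
    return { source: 'fallback', data: await fallback(error) };
  }
}
```

Here `primary` would typically be the retrying call from Solution 1, so the fallback only runs after all retries are exhausted.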

Solution 3: Add SQS to your Kafka topic instead of hitting Lambda directly

SQS has retry policies too.

Retry strategy: SQS does not offer a dynamic (for example, exponential) delay, so the retries are static. It does have a visibility timeout, which you can use as the delay between retries.

Implementation: Say your Lambda has the maximum timeout of 15 minutes and you set the visibility timeout to 15 minutes as well. If the Lambda fails, the message is retried only once 15 minutes have passed since the Lambda received it, even if the Lambda terminated earlier.

DLQ: Once all retries are exhausted, the message can be moved to a DLQ. Add an alert so that you know a message failed to process and can take action later.
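
On the Lambda side, an SQS-triggered handler could look roughly like the sketch below. It assumes that `ReportBatchItemFailures` is enabled on the event source mapping, so that only the messages listed in `batchItemFailures` return to the queue, and that the queue has a redrive policy (`maxReceiveCount`) that moves repeatedly failing messages to the DLQ. `createSqsHandler` and `processMessage` are hypothetical names, and the message body is assumed to be JSON.

```javascript
// Builds an SQS-triggered handler. `processMessage` is a hypothetical async
// function that calls the external API for one message and throws on failure.
function createSqsHandler(processMessage) {
  return async function handler(event) {
    const batchItemFailures = [];

    for (const record of event.Records) {
      try {
        await processMessage(JSON.parse(record.body));
      } catch (error) {
        console.error(`Message ${record.messageId} failed:`, error.message);
        // Reported messages become visible again after the visibility timeout.
        batchItemFailures.push({ itemIdentifier: record.messageId });
      }
    }

    // An empty list tells Lambda that the whole batch succeeded.
    return { batchItemFailures };
  };
}
```

No retry loop is needed inside this handler: a failed message is simply redelivered by SQS until it succeeds or reaches the receive limit.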

SQS to Step Functions? Yes, you can put SQS in front of the Step Functions workflow too.
