Your Lambda essentially has three stages:
Stage 1 (Pre-API execution): This stage runs before the API call is made.
Stage 2 (API execution): You wait for the response from the API.
Stage 3 (Post-API execution): You continue your Lambda execution with the data from Stage 2, or halt if an error occurs.
Things to consider:
Lambda has a hard timeout limit of 15 minutes.
Solution 1: Wrap the external API call with retry logic
Assumption: Your Lambda can complete all retries, and still process the next step, within its timeout.
Example for Node.js:
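const axios = require('axios');

// Calls the URL up to maxAttempts times, waiting delayBetweenAttempts ms between tries.
async function retryApiCall(url, maxAttempts = 3, delayBetweenAttempts = 1000) {
  let attempt = 1;
  while (attempt <= maxAttempts) {
    try {
      // By default axios rejects on non-2xx responses, so those end up in the catch block.
      const response = await axios.get(url);
      if (response.status === 200) {
        return response.data;
      }
    } catch (error) {
      console.error(`Error on attempt ${attempt}:`, error.message);
    }
    // Only wait if another attempt will follow.
    if (attempt < maxAttempts) {
      await new Promise(resolve => setTimeout(resolve, delayBetweenAttempts));
    }
    attempt++;
  }
  throw new Error(`Failed after ${maxAttempts} attempts`);
}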
Pros:
- Easy to debug: Everything runs in a single Lambda invocation, so there are no separate logs to go through.
- No Lambda breakdown needed: Because everything is handled within the Lambda itself, you don't have to split the code up the way Step Functions would require.
Cons:
- In the worst case, the Lambda has to wait until the API call has exhausted all retries, and you are billed for the entire duration.
- Won't work if the retries need more than 15 minutes in total.
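To make the 15-minute constraint concrete, here is a minimal sketch of how you might budget retries against the time the Lambda has left. The helper names, the safety margin, and the example numbers are assumptions for illustration, not part of the original code.

```javascript
const LAMBDA_MAX_TIMEOUT_MS = 15 * 60 * 1000;

// Hypothetical helper: upper bound on how long the retry loop can take if
// every attempt runs until its own timeout and is followed by a delay.
function worstCaseDurationMs(maxAttempts, callTimeoutMs, delayMs) {
  return maxAttempts * (callTimeoutMs + delayMs);
}

// Hypothetical helper: true if one more delay plus one more call still fits
// in the remaining time, keeping a safety margin for Stage 3.
function hasTimeForRetry(remainingMs, delayMs, expectedCallMs, safetyMarginMs = 5000) {
  return remainingMs > delayMs + expectedCallMs + safetyMarginMs;
}

// Example: 3 attempts, 30 s call timeout, 1 s delay fits easily in 15 minutes.
const fitsInOneLambda = worstCaseDurationMs(3, 30000, 1000) <= LAMBDA_MAX_TIMEOUT_MS;
```

Inside a handler you could pass `context.getRemainingTimeInMillis()` as `remainingMs` before each retry and stop early when `hasTimeForRetry` returns false.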
Solution 2: Step Functions path
On this path, your Lambda shouldn't wait for all stages to complete. It completes its first stage, starts the Step Functions execution, and ends.
Phase 1: Instead of calling the API, your Lambda starts a Step Functions execution and ends. (It only does Stage 1 as described above.)
Phase 2: Step Functions handles the retries and the delay mechanism for you. Because each retry is a new Lambda invocation, every retry gets the full 15-minute timeout, and the delays between retries are handled outside the Lambda. (Stage 2)
Phase 3: A new Lambda. Once Phase 2 completes, its output is sent to a new Lambda, and you continue Stage 3 from there.
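As a rough sketch of Phases 2 and 3, the state machine could look like the Amazon States Language definition below, built here as a plain JavaScript object. The state names, the function name, the ARNs, and the retry numbers are placeholders chosen for illustration; only the `Retry` fields (`ErrorEquals`, `IntervalSeconds`, `MaxAttempts`, `BackoffRate`) are standard.

```javascript
// Hypothetical builder: returns a state machine definition with one retried
// task (Stage 2) followed by a post-processing task (Stage 3).
function buildStateMachineDefinition(callApiLambdaArn, postProcessLambdaArn) {
  return {
    Comment: 'Call the external API with retries, then run Stage 3',
    StartAt: 'CallExternalApi',
    States: {
      CallExternalApi: {
        Type: 'Task',
        Resource: callApiLambdaArn,
        // Retry up to 3 times, waiting 30 s, 60 s, then 120 s between attempts.
        Retry: [
          {
            ErrorEquals: ['States.TaskFailed'],
            IntervalSeconds: 30,
            MaxAttempts: 3,
            BackoffRate: 2.0,
          },
        ],
        Next: 'PostProcess',
      },
      PostProcess: {
        Type: 'Task',
        Resource: postProcessLambdaArn,
        End: true,
      },
    },
  };
}

// Placeholder ARNs; replace with your own functions.
const definition = buildStateMachineDefinition(
  'arn:aws:lambda:us-east-1:123456789012:function:call-api',
  'arn:aws:lambda:us-east-1:123456789012:function:post-process'
);
const definitionJson = JSON.stringify(definition, null, 2);
```

The serialized `definitionJson` is what you would supply as the state machine definition when creating it. The waits between retries happen inside Step Functions, so no Lambda is running (or billed) during them.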
Pros:
- Each attempt gets the full 15-minute Lambda timeout, and the delay between retries doesn't count against it.
Cons:
- Harder to debug: Because your process is broken down into different stages, you have to trace it across several places. First you look at the Lambda in Phase 1, then find the Step Functions execution from Phase 2, then go through the logs of each retry (they are separate for each execution), and finally check the logs of the new Lambda.
- Added billing: Step Functions billing is based on state transitions. See the examples on the AWS Step Functions pricing page.
- Additional data: If there is data you want to pass on from Phase 1, you need a strategy for passing it across all phases of the state machine. Also remember that Step Functions has a hard limit of 256 KB on the data passed between states. If your data is larger than that, you need something like an S3 file or an alternative solution. Otherwise, you can pass it as input to all your Lambda functions in Phase 2 and Phase 3.
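One way to handle the 256 KB limit is to decide up front whether the data travels inline or as a pointer to S3. The sketch below only makes that decision; the helper name, the `inline`/`s3` envelope, and the bucket and key are assumptions, and the actual S3 upload is left out.

```javascript
const STEP_FUNCTIONS_PAYLOAD_LIMIT_BYTES = 256 * 1024;

// Hypothetical helper: returns the execution input as a string. If the data
// fits under the limit it is passed inline; otherwise only an S3 pointer is
// passed and the caller is told to upload the data first.
function buildExecutionInput(data, bucket, key, limitBytes = STEP_FUNCTIONS_PAYLOAD_LIMIT_BYTES) {
  const inlineInput = JSON.stringify({ inline: true, data });
  if (Buffer.byteLength(inlineInput, 'utf8') <= limitBytes) {
    return { input: inlineInput, needsUpload: false };
  }
  const pointerInput = JSON.stringify({ inline: false, s3: { bucket, key } });
  return { input: pointerInput, needsUpload: true };
}

const small = buildExecutionInput({ orderId: 42 }, 'my-bucket', 'payloads/42.json');
const large = buildExecutionInput({ blob: 'x'.repeat(300 * 1024) }, 'my-bucket', 'payloads/43.json');
```

The Lambdas in Phase 2 and Phase 3 would then check the `inline` flag and fetch the object from S3 when it is false. In practice you may want a lower threshold than the hard limit, since each state's output also has to fit.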
Fallback
Neither solution handles a fallback.
What is the fallback for you?
If there is no guarantee that the external API will work on retries, there is no point in going the complex route. That decision is left to you.
Fallback solution: Add a DLQ to handle these failed messages. If all attempts fail, it's better to handle them separately.
Solution 3: Send messages from your Kafka topic to SQS instead of invoking Lambda directly
SQS has retry policies too.
Retry strategy: SQS doesn't have dynamic delays; retries happen at a static interval. You can leverage the visibility timeout for the delay.
Implementation: Let's say your Lambda has a maximum timeout of 15 minutes and you set the visibility timeout to 15 minutes too. If the Lambda fails, the message is retried only once 15 minutes have passed since the Lambda received it, even if the Lambda terminated earlier.
DLQ: Once all retries are exhausted, the message can be moved to a DLQ. Add an alert for messages that failed to process so you can take action later.
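The visibility timeout and the DLQ are both configured as queue attributes. Here is a minimal sketch of what those attributes could look like; the helper names, the DLQ ARN, and the numbers are placeholders, while `VisibilityTimeout` and `RedrivePolicy` (with `deadLetterTargetArn` and `maxReceiveCount`) are the standard SQS attribute names.

```javascript
// Hypothetical helper: builds the attribute map for the source queue.
// SQS expects attribute values as strings, and RedrivePolicy as a JSON string.
function buildQueueAttributes(dlqArn, visibilityTimeoutSeconds, maxReceiveCount) {
  return {
    VisibilityTimeout: String(visibilityTimeoutSeconds),
    RedrivePolicy: JSON.stringify({
      deadLetterTargetArn: dlqArn,
      maxReceiveCount: maxReceiveCount,
    }),
  };
}

// Rough estimate of how long an always-failing message stays in the source
// queue: one visibility timeout per allowed receive.
function approxSecondsUntilDlq(visibilityTimeoutSeconds, maxReceiveCount) {
  return visibilityTimeoutSeconds * maxReceiveCount;
}

// Placeholder ARN; 15-minute visibility timeout, 3 receives before the DLQ.
const attributes = buildQueueAttributes(
  'arn:aws:sqs:us-east-1:123456789012:my-dlq',
  900,
  3
);
```

With these example values, a message that keeps failing would reach the DLQ after roughly 45 minutes, which is the static retry schedule described above.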
SQS to Step Functions?
Yes, you can put SQS in front of Step Functions too.