
I created an XGBoost model with AWS SageMaker. Now I'm trying to use it through a Batch Transform job, and it all goes pretty well for small batches.

However, there's a slightly bigger batch of 600,000 rows in a ~16 MB file, and I can't manage to run it in one go. I tried two things:

1. Setting 'Max payload size' of the transform job to its maximum (100 MB):

transformer = sagemaker.transformer.Transformer(
    model_name=config.model_name,
    instance_count=config.inference_instance_count,
    instance_type=config.inference_instance_type,
    output_path="s3://{}/{}".format(config.bucket, config.s3_inference_output_folder),
    sagemaker_session=sagemaker_session,
    base_transform_job_name=config.inference_job_prefix,
    max_payload=100,
)

However, I still get an error (via the CloudWatch logs in the console):

413 Request Entity Too Large
The data value transmitted exceeds the capacity limit.

2. Setting max_payload to 0, which, by the specification, Amazon SageMaker should interpret as no limit on the payload size.

In that case the job finishes successfully, but the output file is empty (0 bytes).

Any ideas either what I'm doing wrong, or how to run a bigger batch?

  • Have you been able to solve this issue? My batch prediction keeps outputting a 0-byte object. I have tried the solutions below; they still don't work... Commented Nov 27, 2020 at 6:56

5 Answers

6

Most SageMaker algorithms set their own default execution parameters, with 6 MB for MaxPayloadInMB, so if you are getting a 413 from a SageMaker algorithm, you are likely exceeding the maximum payload it can support. Assuming each row in the file is less than 6 MB, you can fix this by leaving MaxPayloadInMB unset, so it falls back to the algorithm's default size, and setting SplitType to "Line" instead, so the data can be split into smaller batches (https://docs.aws.amazon.com/sagemaker/latest/dg/API_TransformInput.html#SageMaker-Type-TransformInput-SplitType).
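A minimal sketch of that approach, reusing the transformer configuration from the question; the input S3 path and the CSV content type are assumptions, not values from the original post:

transformer = sagemaker.transformer.Transformer(
    model_name=config.model_name,
    instance_count=config.inference_instance_count,
    instance_type=config.inference_instance_type,
    output_path="s3://{}/{}".format(config.bucket, config.s3_inference_output_folder),
    sagemaker_session=sagemaker_session,
    base_transform_job_name=config.inference_job_prefix,
    # no max_payload here -> the algorithm's default (6 MB) applies
)

transformer.transform(
    "s3://your-bucket/path/to/input.csv",  # placeholder input location
    content_type="text/csv",               # assumption: CSV input
    split_type="Line",                     # split the file into per-line mini-batches
)
transformer.wait()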



3

This helped me resolve the issue: set strategy='SingleRecord' in the transformer. You can also use a stronger instance via instance_type and distribute the work via instance_count, as in the sketch below.
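A hedged sketch of that combination; the instance type and count here are illustrative assumptions, not recommendations:

transformer = sagemaker.transformer.Transformer(
    model_name=config.model_name,
    strategy="SingleRecord",       # send one record per request
    instance_count=2,              # assumption: distribute across two instances
    instance_type="ml.m5.xlarge",  # assumption: a larger instance type
    output_path="s3://{}/{}".format(config.bucket, config.s3_inference_output_folder),
    sagemaker_session=sagemaker_session,
)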

  • I have tried this method, but the .transform() process returns "Internal server error"... Have you encountered this error by any chance?
  • @Bilguun how did you resolve the "Internal server error"? I am facing the same issue.
1

I have tried the above solutions, but unfortunately they didn't work for me.

Here is what worked for me: https://stackoverflow.com/a/55920737/7091978

Basically, I set "max_payload" from 0 to 1.
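For reference, a minimal sketch of that one-parameter change applied to the transformer from the question:

transformer = sagemaker.transformer.Transformer(
    model_name=config.model_name,
    instance_count=config.inference_instance_count,
    instance_type=config.inference_instance_type,
    output_path="s3://{}/{}".format(config.bucket, config.s3_inference_output_folder),
    sagemaker_session=sagemaker_session,
    base_transform_job_name=config.inference_job_prefix,
    max_payload=1,  # 1 MB instead of 0; 0 ("no limit") produced an empty output file
)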


0

I think the max payload is capped on the SageMaker side anyway, so your maximum payload size is limited somewhere below 100 MB (as far as I know, for KNN the default was 6 MB). A workaround is to use SplitType when executing the transform job, depending on your content type (TFRecord or RecordIO):

transformer.transform(
    [your data path],
    content_type='application/x-recordio-protobuf',
    split_type='TFRecord',
)

https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TransformInput.html

Then you can seamlessly transform data bigger than 6 MB.


0

I got errors like the following when running a batch transform job with a Mistral-7B model:

io.netty.handler.codec.PrematureChannelClosureException: Channel closed while still aggregating message
    at io.netty.handler.codec.MessageAggregator.channelInactive(MessageAggregator.java:436) [netty-codec-4.1.112.Final.jar:4.1.112.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:303) [netty-transport-4.1.112.Final.jar:4.1.112.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281) [netty-transport-4.1.112.Final.jar:4.1.112.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274) [netty-transport-4.1.112.Final.jar:4.1.112.Final]
    at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelInactive(CombinedChannelDuplexHandler.java:418) [netty-transport-4.1.112.Final.jar:4.1.112.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:412) [netty-codec-4.1.112.Final.jar:4.1.112.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:377) [netty-codec-4.1.112.Final.jar:4.1.112.Final]
    at io.netty.channel.CombinedChannelDuplexHandler.channelInactive(CombinedChannelDuplexHandler.java:221) [netty-transport-4.1.112.Final.jar:4.1.112.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:303) [netty-transport-4.1.112.Final.jar:4.1.112.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281) [netty-transport-4.1.112.Final.jar:4.1.112.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274) [netty-transport-4.1.112.Final.jar:4.1.112.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1402) [netty-transport-4.1.112.Final.jar:4.1.112.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:301) [netty-transport-4.1.112.Final.jar:4.1.112.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281) [netty-transport-4.1.112.Final.jar:4.1.112.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:900) [netty-transport-4.1.112.Final.jar:4.1.112.Final]
    at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:811) [netty-transport-4.1.112.Final.jar:4.1.112.Final]
    at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173) [netty-common-4.1.112.Final.jar:4.1.112.Final]
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166) [netty-common-4.1.112.Final.jar:4.1.112.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469) [netty-common-4.1.112.Final.jar:4.1.112.Final]
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:405) [netty-transport-classes-epoll-4.1.112.Final.jar:4.1.112.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994) [netty-common-4.1.112.Final.jar:4.1.112.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.112.Final.jar:4.1.112.Final]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-common-4.1.112.Final.jar:4.1.112.Final]
    at java.base/java.lang.Thread.run(Thread.java:840) [?:?]

In the data log, I got something like this:

[sagemaker logs]: input.csv: Model server did not respond to /invocations request within 60 seconds

It turns out that setting MaxPayloadInMB to 0 caused the issue, and changing it to 1 fixed it. (I also verified that the payload that kept failing the batch transform job is only about 5 KB. I thought setting the parameter to 0 meant no limit on payload size, but somehow that's not the case.)
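If you create the transform job through boto3 rather than the SageMaker Python SDK, the same fix looks roughly like the sketch below; the job name, model name, S3 paths, content type, and instance type are all placeholders/assumptions:

import boto3

sm = boto3.client("sagemaker")

sm.create_transform_job(
    TransformJobName="mistral-7b-batch-example",  # placeholder
    ModelName="your-model-name",                  # placeholder
    MaxPayloadInMB=1,  # 1 instead of 0; 0 caused the failures above
    TransformInput={
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://your-bucket/input/",  # placeholder
            }
        },
        "ContentType": "application/json",  # assumption: JSON-lines input
        "SplitType": "Line",
    },
    TransformOutput={"S3OutputPath": "s3://your-bucket/output/"},  # placeholder
    TransformResources={
        "InstanceType": "ml.g5.2xlarge",  # assumption: GPU instance for a 7B model
        "InstanceCount": 1,
    },
)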

