I am following the gist of this tutorial:

https://aws.amazon.com/blogs/machine-learning/preprocess-input-data-before-making-predictions-using-amazon-sagemaker-inference-pipelines-and-scikit-learn/

where I am using a custom sklearn transformer to pre-process data before passing it to XGBoost. When I get to this point:

transformer = sklearn_preprocessor.transformer(
    instance_count=1,
    instance_type='ml.m4.xlarge',
    assemble_with='Line',
    accept='text/csv')

# Preprocess training input
transformer.transform('s3://{}/{}'.format(input_bucket, input_key), content_type='text/csv')
print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)
transformer.wait()
preprocessed_train = transformer.output_path

The training data lives in S3, split across multiple files. I get an error that the max payload has been exceeded, and it appears you can only set it as high as 100 MB. Does this mean that SageMaker cannot batch-transform larger data as input into another process?

1 Answer

In SageMaker batch transform, MaxConcurrentTransforms * MaxPayloadInMB cannot exceed 100 MB. However, a payload is only the data portion of a single request sent to your model, not the whole input. In your case, since the input is CSV, you can pass split_type='Line' to the transform call, and each CSV line will be treated as a record.

If the batch_strategy is "MultiRecord" (the default), each payload will contain as many records/lines as fit under the payload limit.

If the batch_strategy is "SingleRecord", each payload will contain a single CSV line, and you need to ensure that no line is larger than the max_payload_size_in_MB.

In short, if split_type is specified (not 'None'), the max_payload_size_in_MB limits each request, not the total size of your input files.
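To make the record/payload distinction concrete, here is a small pure-Python sketch. This is not the SageMaker internals, just an illustration of the documented behavior: with split_type='Line' the input is split into line records, and under "MultiRecord" as many records as fit under the payload cap are packed into each request.

```python
def pack_payloads(lines, max_payload_bytes, strategy="MultiRecord"):
    """Illustration only: group CSV line records into payloads the way a
    'Line'-split batch transform would, honoring a per-request size cap."""
    if strategy == "SingleRecord":
        # One record per payload; each line must fit under the cap on its own.
        return [[line] for line in lines]
    payloads, current, size = [], [], 0
    for line in lines:
        line_size = len(line.encode("utf-8"))
        if current and size + line_size > max_payload_bytes:
            payloads.append(current)        # flush the full payload
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        payloads.append(current)
    return payloads

# A 100-line "file" with a tiny payload cap still transforms fine,
# because the cap applies per request, not to the file as a whole:
records = ["col1,col2\n"] * 100             # 10 bytes per record
payloads = pack_payloads(records, max_payload_bytes=35)
print(len(payloads))                         # 34 payloads of up to 3 records
```

The point: no single payload exceeds the cap, regardless of how large the total input is, which is why splitting the input by line sidesteps the 100 MB error.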

https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTransformJob.html#SageMaker-CreateTransformJob-request-MaxPayloadInMB

Hope this helps!

2 Comments

It's odd, then, that it fails in the case of one file. When I split it up into files smaller than the max size, it succeeds.
What do you mean by split it up? Call .transform on several CSVs instead? Also, for the sizing, do you mean each of the CSV files you created is < MaxPayloadInMB in size?
