I am following the gist of this tutorial:

https://aws.amazon.com/blogs/machine-learning/preprocess-input-data-before-making-predictions-using-amazon-sagemaker-inference-pipelines-and-scikit-learn/

where I am using a custom sklearn transformer to pre-process data before passing it to XGBoost. When I get to this point:

transformer = sklearn_preprocessor.transformer(
    instance_count=1,
    instance_type='ml.m4.xlarge',
    assemble_with='Line',
    accept='text/csv')

# Preprocess training input
transformer.transform('s3://{}/{}'.format(input_bucket, input_key), content_type='text/csv')
print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)
transformer.wait()
preprocessed_train = transformer.output_path

The training data lives in S3, split across multiple files. I get an error that the max payload has been exceeded, and it appears you can only set it as high as 100 MB. Does this mean that SageMaker cannot batch-transform larger data as input into another process?

1 Answer

In SageMaker batch transform, MaxConcurrentTransforms * MaxPayloadInMB cannot exceed 100 MB. However, a payload is only the data portion of a single request sent to your model, not the whole input. In your case, since the input is CSV, you can pass split_type='Line' to the transform call, and each CSV line will be treated as a record.

If the batch_strategy is "MultiRecord" (the default), each payload will contain as many records/lines as fit under the payload limit.

If the batch_strategy is "SingleRecord", each payload will contain a single CSV line, and you need to ensure that no line is larger than the max_payload_size_in_MB.

In short, if split_type is specified (not 'None'), the max_payload_size_in_MB limits each request, not the total size of your input files.
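To make the record/payload distinction concrete, here is a small pure-Python sketch. This is not the SageMaker internals, just an illustration of the documented behavior: with split_type='Line' the input is split into line records, and under "MultiRecord" as many records as fit under the payload cap are packed into each request.

```python
def pack_payloads(lines, max_payload_bytes, strategy="MultiRecord"):
    """Illustration only: group CSV line records into payloads the way a
    'Line'-split batch transform would, honoring a per-request size cap."""
    if strategy == "SingleRecord":
        # One record per payload; each line must fit under the cap on its own.
        return [[line] for line in lines]
    payloads, current, size = [], [], 0
    for line in lines:
        line_size = len(line.encode("utf-8"))
        if current and size + line_size > max_payload_bytes:
            payloads.append(current)        # flush the full payload
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        payloads.append(current)
    return payloads

# A 100-line "file" with a tiny payload cap still transforms fine,
# because the cap applies per request, not to the file as a whole:
records = ["col1,col2\n"] * 100             # 10 bytes per record
payloads = pack_payloads(records, max_payload_bytes=35)
print(len(payloads))                         # 34 payloads of up to 3 records
```

The point: no single payload exceeds the cap, regardless of how large the total input is, which is why splitting the input by line sidesteps the 100 MB error.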

https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTransformJob.html#SageMaker-CreateTransformJob-request-MaxPayloadInMB

Hope this helps!

2 Comments

It's odd, then, that it fails in the case of one file. When I split it up into files smaller than the max size, it succeeds.
What do you mean by split it up? Call .transform on several CSVs instead? Also, for the sizing, do you mean each of the CSV files you created is < MaxPayloadInMB in size?
