I am following the gist of this tutorial:
where I am using a custom sklearn transformer to pre-process data before passing it to XGBoost. When I get to this point:
transformer = sklearn_preprocessor.transformer(
    instance_count=1,
    instance_type='ml.m4.xlarge',
    assemble_with='Line',
    accept='text/csv'
)
# Preprocess training input
transformer.transform('s3://{}/{}'.format(input_bucket, input_key), content_type='text/csv')
print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)
transformer.wait()
preprocessed_train = transformer.output_path
The training data lives in S3 and is split across multiple files. The transform job fails with an error saying the maximum payload size has been exceeded, and it appears the limit can only be raised to 100 MB. Does this mean SageMaker cannot batch-transform larger datasets as input to another process?
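From reading the SDK docs, my understanding (an assumption on my part, not something the tutorial covers) is that max_payload caps a single request sent to the container, not the total dataset size, and that split_type on the transform call chunks each S3 object into records. Here is a minimal sketch of what I think that would look like; the bucket and key variables are placeholders from my own setup:

# Sketch (my assumption): split each input file on newlines so every
# per-request payload stays under the limit, regardless of total size.
transformer = sklearn_preprocessor.transformer(
    instance_count=1,
    instance_type='ml.m4.xlarge',
    strategy='MultiRecord',   # batch many line-records into one request
    assemble_with='Line',     # reassemble the output line by line
    accept='text/csv',
    max_payload=100           # per-request cap in MB
)

transformer.transform(
    's3://{}/{}'.format(input_bucket, input_key),  # placeholder location
    content_type='text/csv',
    split_type='Line'         # split each S3 object into line records
)
transformer.wait()

Is this the intended way to feed larger data through a batch transform, or is there some other mechanism I am missing?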