I've got an XGBoost model trained via a SageMaker Hyperparameter Tuning job. Now I want to generate predictions for about 182 GB of CSV files. I've been testing different combinations of instance types, instance counts, MaxPayloadInMB, and MaxConcurrentTransforms, but I haven't been able to get the batch transform to finish faster than about 30 minutes. Am I missing anything that would speed this up? Here is my current boto3 call:
response = client.create_transform_job(
    TransformJobName=transform_name,
    ModelName=model_name,
    BatchStrategy="MultiRecord",
    TransformInput={
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": f"s3://{bucket}/{prefix}/csv_prediction",
            }
        },
        "ContentType": "text/csv",
        "CompressionType": "None",
        "SplitType": "Line",
    },
    MaxPayloadInMB=1,
    MaxConcurrentTransforms=100,
    DataProcessing={
        "InputFilter": "$[1:]",      # Use all columns except the first (containing the ID)
        "JoinSource": "Input",
        "OutputFilter": "$[0,-1]",   # Return ID and prediction only
    },
    TransformOutput={
        "S3OutputPath": f"s3://{bucket}/{prefix}/batch_transform_results/{model_name}",
        "Accept": "text/csv",
        "AssembleWith": "Line",
    },
    TransformResources={
        "InstanceType": "ml.c5.xlarge",
        "InstanceCount": 16,
    },
)
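In case it's relevant, model_name refers to a model I registered from the tuning job's best training job, roughly like this (tuning_job_name, model_name, and role_arn are placeholders for my actual values):

# Look up the best training job from the completed tuning job
best_job = client.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name
)["BestTrainingJob"]["TrainingJobName"]

# Pull the container image and model artifact location from that training job
training_job = client.describe_training_job(TrainingJobName=best_job)

# Register the model so the batch transform job can use it
client.create_model(
    ModelName=model_name,
    PrimaryContainer={
        "Image": training_job["AlgorithmSpecification"]["TrainingImage"],
        "ModelDataUrl": training_job["ModelArtifacts"]["S3ModelArtifacts"],
    },
    ExecutionRoleArn=role_arn,
)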