
I've got an XGBoost model trained using a SageMaker Hyperparameter Tuning job. Now I want to generate predictions for about 182 GB of CSV files. I've been testing different combinations of instance types, counts, MaxPayloadInMB, and MaxConcurrentTransforms, but I haven't been able to get this to run faster than about 30 minutes. Am I missing anything that could speed it up? Here is my current boto3 call:

response = client.create_transform_job(
  TransformJobName=transform_name,
  ModelName=model_name,
  BatchStrategy='MultiRecord',
  TransformInput={
    "DataSource": {
      "S3DataSource":{
        "S3DataType": "S3Prefix",
        "S3Uri": f"s3://{bucket}/{prefix}/csv_prediction"
      }
    },
    "ContentType": "text/csv",
    "CompressionType": "None",
    "SplitType": "Line"
  },
  MaxPayloadInMB=1,
  MaxConcurrentTransforms=100,
  DataProcessing={
    "InputFilter": "$[1:]",  # Use all columns except first (containing ID)
    "JoinSource": "Input",
    "OutputFilter": "$[0,-1]"  # Return ID and Prediction only 
  },
  TransformOutput={
    "S3OutputPath": f"s3://{bucket}/{prefix}/batch_transform_results/{model_name}",
    "Accept": "text/csv",
    "AssembleWith": "Line"
  },
  TransformResources={
    "InstanceType": "ml.c5.xlarge",
    "InstanceCount": 16
  }
)
  • What’s the largest instance type you tried? Can you share a few examples of instance combinations you tried? Commented Dec 18, 2022 at 4:37
  • The above is the fastest I was able to get: instance type ml.c5.xlarge, count 16, MaxPayloadInMB 1, MaxConcurrentTransforms 100. I changed MaxPayloadInMB and MaxConcurrentTransforms to 2/50 and got the same results. Tried ml.c5.xlarge with 16 instances at MaxConcurrentTransforms = 0 and payload = 1 to let SageMaker decide on optimal values, but that was slower. ml.c5.xlarge with 10 instances at 1/100 payload/concurrency was also slower. Commented Dec 19, 2022 at 15:09

2 Answers


When you use an instance type with more CPU cores, you can generally increase MaxConcurrentTransforms, which controls the number of concurrent /invocations requests in flight to the model server at any given time. The rule of thumb is to set MaxConcurrentTransforms equal to the number of cores, although it takes some empirical testing to find out whether your particular model implementation can keep up with a faster request rate without breaking. Model servers generally do match this rule of thumb, setting the number of webserver workers equal to the number of cores.
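That rule of thumb can be sketched as a lookup. The vCPU counts below follow the EC2 c5 family sizes and should be verified against current AWS documentation before relying on them:

```python
# Rule-of-thumb starting point: one concurrent /invocations request per vCPU.
# vCPU counts follow the EC2 c5 family; verify against AWS docs before use.
C5_VCPUS = {
    "ml.c5.xlarge": 4,
    "ml.c5.2xlarge": 8,
    "ml.c5.4xlarge": 16,
    "ml.c5.9xlarge": 36,
    "ml.c5.18xlarge": 72,
}

def suggested_max_concurrent_transforms(instance_type: str) -> int:
    """Return a starting value for MaxConcurrentTransforms."""
    # Fall back to 1 for unknown instance types; tune empirically from there.
    return C5_VCPUS.get(instance_type, 1)
```

From that starting point, nudge the value up or down per run and watch whether total job time (and model-server error rate) improves.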

There may also be room to tune BatchStrategy and MaxPayloadInMB for better throughput: passing larger multi-record payloads lets the model complete the same amount of work with fewer total requests, reducing any overhead that builds up from frequent HTTP communication. Again, it depends on how large a request payload the model server can handle, which may also depend on how much memory is needed and available on the given instance type.
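One way to explore this empirically is to launch the same job under a few payload/concurrency combinations and compare wall-clock times. A minimal sketch, where `base_args` is assumed to hold the keyword arguments from the question's `create_transform_job` call:

```python
# Sketch: derive job arguments for a small grid of payload/concurrency settings.
# `base_args` is assumed to be the kwargs dict from the question's call.
def with_overrides(base_args, max_payload_mb, max_concurrent, suffix):
    args = dict(base_args)  # leave the original untouched
    args["MaxPayloadInMB"] = max_payload_mb
    args["MaxConcurrentTransforms"] = max_concurrent
    args["TransformJobName"] = f"{base_args['TransformJobName']}-{suffix}"
    return args

# Candidate settings to time against each other. Per the CreateTransformJob
# API, MaxConcurrentTransforms * MaxPayloadInMB must not exceed 100 MB.
grid = [(1, 4), (6, 8), (16, 4)]
```

Each variant would then be passed to `client.create_transform_job(**args)` and the job durations compared.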


1 Comment

Got it! Super helpful. I'm using a built-in algorithm and the docs seem to indicate it can optimize in the most efficient way. In practice, I seem to get better performance by tuning as you mentioned above. Is that your experience as well?

Sometimes using a larger instance is not only faster but also more cost-effective: if the job finishes much faster, the overall cost may be lower even though the instance is more expensive per hour.
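To make that concrete, here's a toy calculation. The hourly rates are made up for illustration and are not real SageMaker pricing:

```python
# Toy cost comparison; hourly rates below are made up for illustration only.
def fleet_cost(hourly_rate_per_instance, instance_count, minutes):
    """Total cost of a fleet running for the given wall-clock time."""
    return hourly_rate_per_instance * instance_count * minutes / 60.0

# 16 small instances for 30 minutes vs. 4 larger (4x the rate) for 10 minutes:
small_fleet = fleet_cost(0.20, 16, 30)  # -> 1.60 at the illustrative rate
large_fleet = fleet_cost(0.80, 4, 10)   # -> ~0.53 at the illustrative rate
```

The larger fleet costs 4x as much per instance-hour but consumes far fewer instance-hours, so it comes out cheaper overall in this (hypothetical) scenario.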

With that said, have you considered using something larger than an xlarge? That's the third smallest compute-optimized instance type. You can go all the way up to 24xlarge with the c5 instance type, with 5 other sizes in between. Plus, there's a newer Graviton-based generation, c6g.

However, XGBoost is a memory-bound, not compute-bound algorithm. So, a general-purpose compute instance (for example, M5) is a better choice than a compute-optimized instance (for example, C5).

Have you tried using AWS's built-in algorithm for XGBoost, which has some optimizations for the environment? For XGBoost, the docs say that, "[the built-in] implementation has a smaller memory footprint, better logging, improved hyperparameter validation, and an expanded set of metrics than the original versions."

Finally -- and this may be the solution in combination with using the built-in algorithm -- have you checked AWS's "EC2 Instance Recommendation for the XGBoost Algorithm"? Here's an excerpt from that (with my emphasis):

SageMaker XGBoost version 1.2 or later supports single-instance GPU training. Despite higher per-instance costs, GPUs train more quickly, making them more cost effective. SageMaker XGBoost version 1.2 or later supports P2 and P3 instances.

SageMaker XGBoost version 1.2-2 or later supports P2, P3, G4dn, and G5 GPU instance families.

To take advantage of GPU training, specify the instance type as one of the GPU instances (for example, P3) and set the tree_method hyperparameter to gpu_hist in your existing XGBoost script. SageMaker XGBoost currently does not support multi-GPU training.
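In training code, the switch the docs describe is just a hyperparameter change. A sketch of what that might look like; everything except `tree_method` is an illustrative placeholder:

```python
# Sketch: hyperparameters for GPU training with built-in XGBoost >= 1.2.
# Only tree_method is the documented GPU switch; the rest are placeholders.
hyperparameters = {
    "objective": "binary:logistic",  # assumed objective, for illustration
    "num_round": "100",              # placeholder value
    "tree_method": "gpu_hist",       # enables GPU training per the excerpt above
}
# Pair this with a GPU instance type (e.g. a P3 instance) in the estimator
# or tuning-job configuration.
```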

2 Comments

Thank you for the detailed answer! I will review the documentation linked and run some further tests. For clarification, I've trained a model using AWS's built-in XGBoost (via hyperparameter tuning). Now I'm using that model in a batch-transform operation.
Some feedback on your suggestions! I tested 24xlarge instances along with memory-optimized instances, and both were slower than the configuration I posted above (45 min to 1 hour batch time vs. 30 min). I will test the GPU instance, but again, this is for batch prediction, not training, so I don't expect it to change things much.
