
I am using SageMaker for distributed TensorFlow model training and serving. I am trying to get the shape of the pre-processed datasets from the ScriptProcessor so I can provide it to the TensorFlow Environment.

from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

script_processor = ScriptProcessor(command=['python3'],
                                   image_uri=preprocess_img_uri,
                                   role=role,
                                   instance_count=1,
                                   sagemaker_session=sm_session,
                                   instance_type=preprocess_instance_type)

script_processor.run(code=preprocess_script_uri,
                     inputs=[ProcessingInput(
                             source=source_dir + username + '/' + dataset_name,
                             destination='/opt/ml/processing/input')],
                     outputs=[
                             ProcessingOutput(output_name="train_data", source="/opt/ml/processing/train"),
                             ProcessingOutput(output_name="test_data", source="/opt/ml/processing/test")
                     ],
                     arguments=['--filepath', dataset_name, '--labels', 'labels',
                                '--test_size', '0.2', '--shuffle', 'False', '--lookback', '5'])

preprocessing_job_description = script_processor.jobs[-1].describe()

output_config = preprocessing_job_description["ProcessingOutputConfig"]
for output in output_config["Outputs"]:
    if output["OutputName"] == "train_data":
        preprocessed_training_data = output["S3Output"]["S3Uri"]
    if output["OutputName"] == "test_data":
        preprocessed_test_data = output["S3Output"]["S3Uri"]

I would like to get the following data:

pre_processed_train_data_shape = script_processor.train_data_shape?

I am just not sure how to get the value out of the Docker container. I have reviewed the documentation here: https://sagemaker.readthedocs.io/en/stable/api/training/processing.html

1 Answer


There are a few options:

  1. Write some data to a text file at /opt/ml/output/message, then call DescribeProcessingJob (using Boto3, the AWS CLI, or the API directly) and retrieve the ExitMessage value (there is a Python sketch of this after the list):

    aws sagemaker describe-processing-job \
      --processing-job-name foo \
      --output text \
      --query ExitMessage
    
  2. Add a new output to your processing job and send data there

  3. If your train_data is in CSV, JSON, or Parquet format, then run an S3 Select query on train_data to get its number of rows/columns (a boto3 version also follows the list):

    aws s3api select-object-content \
      --bucket foo \
      --key 'path/to/train_data.csv' \
      --expression "SELECT count(*) FROM s3object" \
      --expression-type 'SQL' \
      --input-serialization '{"CSV": {}}' \
      --output-serialization '{"CSV": {}}' /dev/stdout
    

Set the expression to select * from s3object limit 1 to get the column names instead.
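For option 1, here is a minimal Python sketch of the round trip. The container side assumes the shape comes from a pandas DataFrame named train_df and that JSON is an acceptable encoding; both are illustrative assumptions, not part of your current script.

    # Inside the pre-processing script, running in the container.
    # Whatever is written to /opt/ml/output/message (up to 1 KB) is surfaced
    # as the ExitMessage of the processing job.
    import json

    shape_info = {'train_data_shape': list(train_df.shape)}  # train_df is hypothetical
    with open('/opt/ml/output/message', 'w') as f:
        json.dump(shape_info, f)

Back in the notebook, the describe() call you already make returns the DescribeProcessingJob response, so the value can be read from its ExitMessage field:

    import json

    preprocessing_job_description = script_processor.jobs[-1].describe()
    exit_message = preprocessing_job_description.get('ExitMessage', '{}')
    pre_processed_train_data_shape = json.loads(exit_message).get('train_data_shape')

For option 3, a boto3 equivalent of the CLI call above would look roughly like this (the bucket and key are placeholders):

    import boto3

    s3 = boto3.client('s3')
    response = s3.select_object_content(
        Bucket='foo',
        Key='path/to/train_data.csv',
        Expression='SELECT count(*) FROM s3object',
        ExpressionType='SQL',
        InputSerialization={'CSV': {}},
        OutputSerialization={'CSV': {}},
    )
    # The result comes back as an event stream; Records events carry the CSV payload.
    for event in response['Payload']:
        if 'Records' in event:
            print(event['Records']['Payload'].decode())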


1 Comment

Neil, thanks for your response. It seems that all of those options require writing the data to a source somewhere and then finding and loading that data again. I guess I was hoping for an easier server-side code solution. I will mark your solution as the answer.
