
I am using SageMaker for distributed TensorFlow model training and serving. I am trying to get the shape of the pre-processed datasets from the ScriptProcessor so I can provide it to the TensorFlow Environment.

from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

script_processor = ScriptProcessor(command=['python3'],
                                   image_uri=preprocess_img_uri,
                                   role=role,
                                   instance_count=1,
                                   sagemaker_session=sm_session,
                                   instance_type=preprocess_instance_type)

script_processor.run(code=preprocess_script_uri,
                     inputs=[ProcessingInput(
                             source=source_dir + username + '/' + dataset_name,
                             destination='/opt/ml/processing/input')],
                     outputs=[
                             ProcessingOutput(output_name="train_data", source="/opt/ml/processing/train"),
                             ProcessingOutput(output_name="test_data", source="/opt/ml/processing/test")
                     ],
                     arguments=['--filepath', dataset_name, '--labels', 'labels',
                                '--test_size', '0.2', '--shuffle', 'False', '--lookback', '5'])

preprocessing_job_description = script_processor.jobs[-1].describe()

output_config = preprocessing_job_description["ProcessingOutputConfig"]
for output in output_config["Outputs"]:
    if output["OutputName"] == "train_data":
        preprocessed_training_data = output["S3Output"]["S3Uri"]
    if output["OutputName"] == "test_data":
        preprocessed_test_data = output["S3Output"]["S3Uri"]

I would like to get the following data:

pre_processed_train_data_shape = script_processor.train_data_shape?

I am just not sure how to get the value out of the Docker container. I have reviewed the documentation here: https://sagemaker.readthedocs.io/en/stable/api/training/processing.html

1 Answer


There are a few options:

  1. Write some data to a text file at /opt/ml/output/message, then call DescribeProcessingJob (using Boto3, the AWS CLI, or the API directly) and retrieve the ExitMessage value (there is a Python sketch of this after the list):

    aws sagemaker describe-processing-job \
      --processing-job-name foo \
      --output text \
      --query ExitMessage
    
  2. Add a new output to your processing job and send data there

  3. If your train_data is in CSV, JSON, or Parquet format, then run an S3 Select query on train_data to get its number of rows/columns (a boto3 version also follows the list):

    aws s3api select-object-content \
      --bucket foo \
      --key 'path/to/train_data.csv' \
      --expression "SELECT count(*) FROM s3object" \
      --expression-type 'SQL' \
      --input-serialization '{"CSV": {}}' \
      --output-serialization '{"CSV": {}}' /dev/stdout
    

Set the expression to select * from s3object limit 1 to get the column names instead.
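For option 1, here is a minimal Python sketch of the round trip. The container side assumes the shape comes from a pandas DataFrame named train_df and that JSON is an acceptable encoding; both are illustrative assumptions, not part of your current script.

    # Inside the pre-processing script, running in the container.
    # Whatever is written to /opt/ml/output/message (up to 1 KB) is surfaced
    # as the ExitMessage of the processing job.
    import json

    shape_info = {'train_data_shape': list(train_df.shape)}  # train_df is hypothetical
    with open('/opt/ml/output/message', 'w') as f:
        json.dump(shape_info, f)

Back in the notebook, the describe() call you already make returns the DescribeProcessingJob response, so the value can be read from its ExitMessage field:

    import json

    preprocessing_job_description = script_processor.jobs[-1].describe()
    exit_message = preprocessing_job_description.get('ExitMessage', '{}')
    pre_processed_train_data_shape = json.loads(exit_message).get('train_data_shape')

For option 3, a boto3 equivalent of the CLI call above would look roughly like this (the bucket and key are placeholders):

    import boto3

    s3 = boto3.client('s3')
    response = s3.select_object_content(
        Bucket='foo',
        Key='path/to/train_data.csv',
        Expression='SELECT count(*) FROM s3object',
        ExpressionType='SQL',
        InputSerialization={'CSV': {}},
        OutputSerialization={'CSV': {}},
    )
    # The result comes back as an event stream; Records events carry the CSV payload.
    for event in response['Payload']:
        if 'Records' in event:
            print(event['Records']['Payload'].decode())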


1 Comment

Neil, thanks for your response. It seems that all of those options require writing the data to a source somewhere and then finding and loading that data again. I guess I was hoping for an easier server-side code solution. I will mark your solution as the answer.
