
Unable to use ProcessingInput and ProcessingOutput with Spark-based SageMaker ScriptProcessor (ProcessingJob) #994

Description (by @cfregly)

@ChuyangDeng

I am trying to convert the following example...

https://github.com/awslabs/amazon-sagemaker-examples/blob/714652a4f96fd764a247a4cee30425293db86c59/sagemaker_processing/feature_transformation_with_sagemaker_processing/feature_transformation_with_sagemaker_processing.ipynb

...to use ProcessingInput and ProcessingOutput.

After several attempts, I realized the following:

  1. ProcessingInput and ProcessingOutput are not efficient due to the excessive data copying involved.
  2. Spark tries to use HDFS, which would require copying the data from /opt/ml/processing/input to HDFS (also very inefficient).
  3. Spark supports s3a:// directly, but this bypasses the ProcessingInput and ProcessingOutput convention for SageMaker Processing Jobs (a sketch of the setup I mean follows this list).
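
For context, here is roughly the setup I am describing. This is only a sketch: the image URI, script name, and S3 paths are placeholders, not the actual values from the notebook.

```python
from sagemaker import get_execution_role
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

role = get_execution_role()

# Hypothetical Spark container image and entrypoint; substitute your own.
processor = ScriptProcessor(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/sagemaker-spark:latest",
    command=["/opt/program/submit"],
    role=role,
    instance_count=1,
    instance_type="ml.r5.xlarge",
)

processor.run(
    code="preprocess.py",  # the PySpark script (placeholder name)
    inputs=[
        ProcessingInput(
            source="s3://my-bucket/raw/reviews/",                 # S3 prefix (placeholder)
            destination="/opt/ml/processing/input/raw/reviews/",  # copied here by SageMaker
        )
    ],
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output/",              # uploaded to S3 after the job
            destination="s3://my-bucket/processed/reviews/",  # S3 prefix (placeholder)
        )
    ],
)
```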

I understand this is only a sample, but can you provide guidance on how to best use ProcessingInput and ProcessingOutput with Spark-based Processing Jobs? This is a relatively popular combination of technologies.

Should we just use Spark's native s3a:// support for the Spark reads and writes, as this sample does? Is there anything in progress to improve this integration?
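
In other words, something like this inside the processing script, which skips the /opt/ml/processing channels entirely (bucket names and prefixes are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feature-transformation").getOrCreate()

# Read and write S3 directly via s3a://, bypassing ProcessingInput/ProcessingOutput.
df = spark.read.csv("s3a://my-bucket/raw/reviews/", header=True, inferSchema=True)
transformed = df  # ...feature transformation would go here...
transformed.write.csv("s3a://my-bucket/processed/reviews/", header=True)
```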

For reference, here is a list of errors that I found with this integration:

spark.read.csv("/opt/ml/processing/input/raw/reviews/xxx.csv") resolves the path against HDFS, but the file is not in HDFS: SageMaker copied it to the local filesystem. This is easily fixed by prepending file://, but it is worth highlighting:

Path does not exist: hdfs://x.x.x.x/opt/ml/processing/input/raw/reviews/xxx.csv;
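
The fix, for the record:

```python
# Prepend file:// so Spark resolves the path on the local filesystem
# instead of the cluster's default HDFS filesystem.
df = spark.read.csv("file:///opt/ml/processing/input/raw/reviews/", header=True)
```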

df.write.csv("/opt/ml/processing/output/") fails in the default save mode (no overwrite) because SageMaker pre-creates the output directory:

org.apache.spark.sql.AnalysisException: path file:/opt/ml/processing/output already exists.;
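
Switching to Spark's overwrite save mode clears this particular error, but leads straight to the next one:

```python
# "overwrite" replaces the default "errorifexists" save mode, so the
# pre-existing /opt/ml/processing/output directory no longer aborts the write.
df.write.mode("overwrite").csv("file:///opt/ml/processing/output/")
```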

df.write.csv("/opt/ml/processing/output/") in overwrite mode fails because Spark cannot delete the SageMaker-mounted output directory:

fs.FileUtil: Failed to delete file or dir [/opt/ml/processing/output]: it still exists.
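
One workaround that should sidestep both write failures (an assumption on my part, not something the sample does): write to a subdirectory of the output path, so Spark creates and deletes only the subdirectory rather than the mounted directory itself.

```python
# "csv" is an arbitrary subdirectory name; SageMaker still uploads everything
# under /opt/ml/processing/output/ to the ProcessingOutput destination.
df.write.mode("overwrite").csv("file:///opt/ml/processing/output/csv/")
```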

I also tried using hadoop fs -cp to move data between the local filesystem and HDFS, but this proved very inefficient, so I abandoned it.

Any suggestions?
