
Unable to use ProcessingInput and ProcessingOutput with Spark-based SageMaker ScriptProcessor (ProcessingJob) #994

Description (by @cfregly)

@ChuyangDeng

I am trying to convert the following example...

https://github.com/awslabs/amazon-sagemaker-examples/blob/714652a4f96fd764a247a4cee30425293db86c59/sagemaker_processing/feature_transformation_with_sagemaker_processing/feature_transformation_with_sagemaker_processing.ipynb

...to use ProcessingInput and ProcessingOutput.

After several attempts, I realized the following:

  1. ProcessingInput and ProcessingOutput are not efficient due to the excessive data copying involved.
  2. Spark tries to use HDFS, which would require copying the data from /opt/ml/processing/input to HDFS (also very inefficient).
  3. Spark supports s3a:// directly, but this bypasses the ProcessingInput and ProcessingOutput convention for SageMaker Processing Jobs (a sketch of the setup I mean follows this list).
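
For context, here is roughly the setup I am describing. This is only a sketch: the image URI, script name, and S3 paths are placeholders, not the actual values from the notebook.

```python
from sagemaker import get_execution_role
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

role = get_execution_role()

# Hypothetical Spark container image and entrypoint; substitute your own.
processor = ScriptProcessor(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/sagemaker-spark:latest",
    command=["/opt/program/submit"],
    role=role,
    instance_count=1,
    instance_type="ml.r5.xlarge",
)

processor.run(
    code="preprocess.py",  # the PySpark script (placeholder name)
    inputs=[
        ProcessingInput(
            source="s3://my-bucket/raw/reviews/",                 # S3 prefix (placeholder)
            destination="/opt/ml/processing/input/raw/reviews/",  # copied here by SageMaker
        )
    ],
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output/",              # uploaded to S3 after the job
            destination="s3://my-bucket/processed/reviews/",  # S3 prefix (placeholder)
        )
    ],
)
```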

I understand this is only a sample, but can you provide guidance on how to best use ProcessingInput and ProcessingOutput with Spark-based Processing Jobs? This is a relatively popular combination of technologies.

Should we just use Spark's native s3a:// support for the Spark reads and writes, as this sample does? Is there anything in progress to improve this integration?
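
In other words, something like this inside the processing script, which skips the /opt/ml/processing channels entirely (bucket names and prefixes are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feature-transformation").getOrCreate()

# Read and write S3 directly via s3a://, bypassing ProcessingInput/ProcessingOutput.
df = spark.read.csv("s3a://my-bucket/raw/reviews/", header=True, inferSchema=True)
transformed = df  # ...feature transformation would go here...
transformed.write.csv("s3a://my-bucket/processed/reviews/", header=True)
```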

For reference, here is a list of errors that I found with this integration:

spark.read.csv("/opt/ml/processing/input/raw/reviews/xxx.csv") resolves the path against HDFS, but the file is not in HDFS: SageMaker copied it to the local filesystem. This is easily fixed by prepending file://, but it is worth highlighting:

Path does not exist: hdfs://x.x.x.x/opt/ml/processing/input/raw/reviews/xxx.csv;
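
The fix, for the record:

```python
# Prepend file:// so Spark resolves the path on the local filesystem
# instead of the cluster's default HDFS filesystem.
df = spark.read.csv("file:///opt/ml/processing/input/raw/reviews/", header=True)
```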

df.write.csv("/opt/ml/processing/output/") fails in the default save mode (no overwrite) because SageMaker pre-creates the output directory:

org.apache.spark.sql.AnalysisException: path file:/opt/ml/processing/output already exists.;
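
Switching to Spark's overwrite save mode clears this particular error, but leads straight to the next one:

```python
# "overwrite" replaces the default "errorifexists" save mode, so the
# pre-existing /opt/ml/processing/output directory no longer aborts the write.
df.write.mode("overwrite").csv("file:///opt/ml/processing/output/")
```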

df.write.csv("/opt/ml/processing/output/") in overwrite mode fails because Spark cannot delete the SageMaker-mounted output directory:

fs.FileUtil: Failed to delete file or dir [/opt/ml/processing/output]: it still exists.
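
One workaround that should sidestep both write failures (an assumption on my part, not something the sample does): write to a subdirectory of the output path, so Spark creates and deletes only the subdirectory rather than the mounted directory itself.

```python
# "csv" is an arbitrary subdirectory name; SageMaker still uploads everything
# under /opt/ml/processing/output/ to the ProcessingOutput destination.
df.write.mode("overwrite").csv("file:///opt/ml/processing/output/csv/")
```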

I also tried using hadoop fs -cp to move data between the local filesystem and HDFS, but this proved very inefficient, so I abandoned it.

Any suggestions?
