@ChuyangDeng
I am trying to convert the following example ... to use ProcessingInput and ProcessingOutput.
After several attempts, I realized the following:
- ProcessingInput and ProcessingOutput are not efficient here due to the excessive data copying involved.
- Spark tries to use HDFS, which would require copying the data from /opt/ml/processing/input into HDFS (also very inefficient).
- Spark supports s3a:// directly, but this bypasses the ProcessingInput and ProcessingOutput convention for SageMaker Processing Jobs.
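For context, here is the convention I am referring to, sketched with the SageMaker Python SDK's ScriptProcessor (the image, role, bucket, and script names below are placeholders, not taken from the sample):

```python
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor

# Placeholder image/role/bucket values for illustration only.
processor = ScriptProcessor(
    image_uri="<spark-container-image-uri>",
    command=["python3"],
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

processor.run(
    code="preprocess.py",
    # SageMaker copies the S3 data down to the container's local filesystem...
    inputs=[ProcessingInput(source="s3://<bucket>/raw/reviews/",
                            destination="/opt/ml/processing/input")],
    # ...and copies whatever the job writes under this path back up to S3.
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://<bucket>/processed/")],
)
```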
I understand this is only a sample, but can you provide guidance on how best to use ProcessingInput and ProcessingOutput with Spark-based Processing Jobs? This is a relatively popular combination of technologies.
Should we just use Spark's native s3a:// support, as this sample does, for the Spark reads and writes? Is there anything in progress to improve this integration?
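If the answer is s3a://, I assume the script side would look roughly like this (a minimal sketch; it assumes the container's Spark classpath already includes the hadoop-aws and matching AWS SDK jars, and the bucket and paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("preprocess").getOrCreate()

# Read and write directly against S3, bypassing ProcessingInput/ProcessingOutput.
df = spark.read.csv("s3a://<bucket>/raw/reviews/", header=True)
transformed = df.dropna()  # placeholder transformation
transformed.write.mode("overwrite").csv("s3a://<bucket>/processed/reviews/")
```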
For reference, here is a list of errors that I found with this integration:
- spark.read.csv("/opt/ml/processing/input/...") tries to read from HDFS, but the data is not in HDFS; SageMaker copied it to the local filesystem. This is easily fixed by prepending file:// (see the sketch below the list), but worth highlighting:
  Path does not exist: hdfs://x.x.x.x/opt/ml/processing/input/raw/reviews/xxx.csv;
- spark.write.csv("/opt/ml/processing/output/") fails with the default save mode (error if the path exists), because the output directory already exists:
  org.apache.spark.sql.AnalysisException: path file:/opt/ml/processing/output already exists.;
- spark.write.csv("/opt/ml/processing/output/", mode="overwrite") fails because Spark cannot delete the existing output directory:
  fs.FileUtil: Failed to delete file or dir [/opt/ml/processing/output]: it still exists.
I tried using hadoop fs -cp, but this proved very inefficient, so I skipped it.
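For completeness, here is roughly what worked around the errors above inside the processing script (a sketch only; the reviews subdirectory under the output path is a made-up name, and I have not verified this is the intended pattern):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("preprocess").getOrCreate()

# Prefix file:// so Spark reads from the local filesystem instead of HDFS.
df = spark.read.csv("file:///opt/ml/processing/input/raw/reviews/", header=True)

# Write to a subdirectory that does not exist yet, rather than to
# /opt/ml/processing/output itself, so neither the "already exists" error
# nor the failed delete of the SageMaker-created directory is triggered.
df.write.mode("overwrite").csv("file:///opt/ml/processing/output/reviews/")
```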
Any suggestions?