Describe the bug
A processing job fails intermittently when using ProcessingOutput in Continuous mode and the destination S3 path is consumed within the same job to load data into Redshift. We use sqlalchemy to run the COPY command against Redshift; logs are below.
We have had successful runs of this processing job, but it sometimes fails with an error saying the S3 prefix is not present. The same container image was used for all runs.
To reproduce
Create a processing job that uses ProcessingOutput with s3_upload_mode set to Continuous (instead of the default EndOfJob), then try to use the S3 file path within the same job. In our case, we load data from the file into Redshift.
Expected behavior
In Continuous mode, we expect the S3 path to be accessible and ready as soon as the write to the local path completes.
Also, the exact behaviour of Continuous mode in ProcessingOutput appears to be undocumented.
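Pending clarification of the Continuous semantics, the workaround we are considering is to poll for the uploaded object before issuing the COPY. A minimal sketch (the helper below is ours, not part of the SDK; timeout values are illustrative):

```python
import time

def wait_for_object(exists, timeout=300.0, poll=5.0):
    """Poll exists() until it returns True or timeout (seconds) elapses.

    exists: zero-argument callable that checks for the object,
            e.g. a boto3 head_object call wrapped in try/except.
    Returns True if the object appeared in time, else False.
    """
    deadline = time.monotonic() + timeout
    while True:
        if exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

Inside the processing container this would wrap an S3 existence check (for example, `boto3` `head_object` on the destination key) before running the COPY command, so the job fails fast with a clear timeout instead of an intermittent Redshift error.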
Screenshots or logs
Truncated log lines (full logs below):
psycopg2.errors.InternalError_: The specified S3 prefix '<job-base-name>-2020-03-09-19-01-48-893/output/processing-output/<file-name>.csv' does not exist
...
---------------------------------------------------------------------------
UnexpectedStatusException Traceback (most recent call last)
<ipython-input-7-9e65ad912afd> in <module>
4 s3_upload_mode='Continuous')
5
----> 6 etl_processor.run(outputs=[po])
...
UnexpectedStatusException: Error for Processing job <base-name>-2020-03-09-19-01-48-893: Failed. Reason: AlgorithmError: See job logs for more information
System information
A description of your system. Please provide:
- SageMaker Python SDK version: 1.50.10
- Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): N/A (custom container)
- Framework version: N/A
- Python version: 3.7
- CPU or GPU: CPU
- Custom Docker image (Y/N): Y
Additional context
Traceback (most recent call last):
File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1193, in _execute_context
context)
File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 509, in do_execute
cursor.execute(statement, parameters)
psycopg2.errors.InternalError_: The specified S3 prefix '<job-base-name>-2020-03-09-19-01-48-893/output/processing-output/<file-name>.csv' does not exist
DETAIL:
-----------------------------------------------
error: The specified S3 prefix '<job-base-name>-2020-03-09-19-01-48-893/output/processing-output/<file-name>.csv' does not exist
code: 8001
context:
query: 54196251
location: s3_utility.cpp:604
process: padbmaster [pid=26108]
-----------------------------------------------
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/app/<custom-processing-code-file>.py", line 339, in <module>
res_redshift_load = conn.execution_options(autocommit=True).execute(sql_bulk_load)
File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 942, in execute
return self._execute_text(object, multiparams, params)
File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1104, in _execute_text
statement, parameters
File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1200, in _execute_context
context)
File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1413, in _handle_dbapi_exception
exc_info
File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 265, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)
File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 248, in reraise
raise value.with_traceback(tb)
File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1193, in _execute_context
context)
File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 509, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.InternalError: (psycopg2.errors.InternalError_) The specified S3 prefix '<job-base-name>-2020-03-09-19-01-48-893/output/processing-output/<file-name>.csv' does not exist
DETAIL:
-----------------------------------------------
error: The specified S3 prefix '<job-base-name>-2020-03-09-19-01-48-893/output/processing-output/<file-name>.csv' does not exist
code: 8001
context:
query: 54196251
location: s3_utility.cpp:604
process: padbmaster [pid=26108]
-----------------------------------------------
[SQL: "copy <table-name> from 's3://sagemaker-<bucket>/<job-base-name>-2020-03-09-19-01-48-893/output/processing-output/<file-name>.csv' iam_role 'arn:aws:iam::<iam-role>' format as CSV;"] (Background on this error at: http://sqlalche.me/e/2j85)
---------------------------------------------------------------------------
UnexpectedStatusException Traceback (most recent call last)
<ipython-input-7-9e65ad912afd> in <module>
4 s3_upload_mode='Continuous')
5
----> 6 etl_processor.run(outputs=[po])
~/code/trfx/ml-workflow-test/venv/lib/python3.7/site-packages/sagemaker/processing.py in run(self, inputs, outputs, arguments, wait, logs, job_name, experiment_config)
162 self.jobs.append(self.latest_job)
163 if wait:
--> 164 self.latest_job.wait(logs=logs)
165
166 def _generate_current_job_name(self, job_name=None):
~/code/trfx/ml-workflow-test/venv/lib/python3.7/site-packages/sagemaker/processing.py in wait(self, logs)
721 """
722 if logs:
--> 723 self.sagemaker_session.logs_for_processing_job(self.job_name, wait=True)
724 else:
725 self.sagemaker_session.wait_for_processing_job(self.job_name)
~/code/trfx/ml-workflow-test/venv/lib/python3.7/site-packages/sagemaker/session.py in logs_for_processing_job(self, job_name, wait, poll)
3109
3110 if wait:
-> 3111 self._check_job_status(job_name, description, "ProcessingJobStatus")
3112 if dot:
3113 print()
~/code/trfx/ml-workflow-test/venv/lib/python3.7/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
2613 ),
2614 allowed_statuses=["Completed", "Stopped"],
-> 2615 actual_status=status,
2616 )
2617
UnexpectedStatusException: Error for Processing job <job-base-name>-2020-03-09-19-01-48-893: Failed. Reason: AlgorithmError: See job logs for more information