
Intermittent failures on Processing job that depends on ProcessingOutput in Continuous mode #1339

@ram-nadella


Describe the bug
A Processing job fails intermittently when it uses a ProcessingOutput in Continuous mode and the destination S3 path is consumed later in the same job to load data into Redshift. We run the Redshift COPY command through sqlalchemy. Logs below.

We have had successful runs of this processing job, but it sometimes fails with an error that the S3 prefix does not exist. The same container image was used for all runs.

To reproduce
Create a processing job that uses a ProcessingOutput with the output mode set to Continuous (instead of EndOfJob), and try to use the S3 file path within the same job. In our case, we are loading data from the file into Redshift.
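For reference, a sketch of the ProcessingOutputConfig this setup corresponds to in the CreateProcessingJob request — bucket, job, and path names here are placeholders, not the real values from the logs:

```python
# Sketch of the output config produced by a ProcessingOutput with
# s3_upload_mode='Continuous'. All names are illustrative placeholders.
output_config = {
    "Outputs": [
        {
            "OutputName": "processing-output",
            "S3Output": {
                "S3Uri": "s3://example-bucket/example-job/output/processing-output",
                "LocalPath": "/opt/ml/processing/output",
                # Continuous: upload files as they are written to LocalPath,
                # rather than in one batch at job end (EndOfJob).
                "S3UploadMode": "Continuous",
            },
        }
    ]
}
```

The job script writes the CSV under LocalPath and then issues the Redshift COPY against the S3Uri; the failure suggests the upload has not yet completed (or become visible) when the COPY runs.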

Expected behavior
In Continuous mode, we expect the S3 path to be accessible as soon as the file write to the local path completes.

Also, the exact behaviour of Continuous mode in ProcessingOutput appears to be undocumented.
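As a stopgap until the semantics are clarified, one workaround is to poll S3 for the object before issuing the COPY. A minimal sketch — the existence check is injected as a callable (e.g. a wrapper around boto3 `head_object`) so the retry logic itself is testable offline; the function name and parameters are ours, not part of the SDK:

```python
import time


def wait_for_s3_object(exists, timeout=120, poll=5):
    """Poll until exists() reports the object is visible, or time out.

    exists: zero-argument callable returning True/False, e.g. a wrapper
            around boto3 head_object that catches ClientError 404.
    Returns True if the object appeared within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if exists():
            return True
        time.sleep(poll)
    return False
```

Usage would look something like `wait_for_s3_object(lambda: s3_object_exists(bucket, key))` before running the COPY statement, where `s3_object_exists` is a hypothetical helper calling `head_object` and treating a 404 as False.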

Screenshots or logs

Truncated log lines (full logs below):

psycopg2.errors.InternalError_: The specified S3 prefix '<job-base-name>-2020-03-09-19-01-48-893/output/processing-output/<file-name>.csv' does not exist

...

 ---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-7-9e65ad912afd> in <module>
      4     s3_upload_mode='Continuous')
      5
----> 6 etl_processor.run(outputs=[po])

...

UnexpectedStatusException: Error for Processing job <base-name>-2020-03-09-19-01-48-893: Failed. Reason: AlgorithmError: See job logs for more information

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 1.50.10
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): N/A (custom container)
  • Framework version: N/A
  • Python version: 3.7
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): Y

Additional context

Traceback (most recent call last):
  File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1193, in _execute_context
    context)
  File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 509, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.InternalError_: The specified S3 prefix '<job-base-name>-2020-03-09-19-01-48-893/output/processing-output/<file-name>.csv' does not exist
DETAIL:
  -----------------------------------------------
  error:  The specified S3 prefix '<job-base-name>-2020-03-09-19-01-48-893/output/processing-output/<file-name>.csv' does not exist
  code:      8001
  context:
  query:     54196251
  location:  s3_utility.cpp:604
  process:   padbmaster [pid=26108]
  -----------------------------------------------



The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/<custom-processing-code-file>.py", line 339, in <module>
    res_redshift_load = conn.execution_options(autocommit=True).execute(sql_bulk_load)
  File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 942, in execute
    return self._execute_text(object, multiparams, params)
  File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1104, in _execute_text
    statement, parameters
  File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1200, in _execute_context
    context)
  File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1413, in _handle_dbapi_exception
    exc_info
  File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 265, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 248, in reraise
    raise value.with_traceback(tb)
  File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1193, in _execute_context
    context)
  File "/opt/conda/envs/market_performance_features/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 509, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.InternalError: (psycopg2.errors.InternalError_) The specified S3 prefix '<job-base-name>-2020-03-09-19-01-48-893/output/processing-output/<file-name>.csv' does not exist
DETAIL:
  -----------------------------------------------
  error:  The specified S3 prefix '<job-base-name>-2020-03-09-19-01-48-893/output/processing-output/<file-name>.csv' does not exist
  code:      8001
  context:
  query:     54196251
  location:  s3_utility.cpp:604
  process:   padbmaster [pid=26108]
  -----------------------------------------------

 [SQL: "copy <table-name> from 's3://sagemaker-<bucket>/<job-base-name>-2020-03-09-19-01-48-893/output/processing-output/<file-name>.csv' iam_role 'arn:aws:iam::<iam-role>' format as CSV;"] (Background on this error at: http://sqlalche.me/e/2j85)



 ---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-7-9e65ad912afd> in <module>
      4     s3_upload_mode='Continuous')
      5
----> 6 etl_processor.run(outputs=[po])

~/code/trfx/ml-workflow-test/venv/lib/python3.7/site-packages/sagemaker/processing.py in run(self, inputs, outputs, arguments, wait, logs, job_name, experiment_config)
    162         self.jobs.append(self.latest_job)
    163         if wait:
--> 164             self.latest_job.wait(logs=logs)
    165
    166     def _generate_current_job_name(self, job_name=None):

~/code/trfx/ml-workflow-test/venv/lib/python3.7/site-packages/sagemaker/processing.py in wait(self, logs)
    721         """
    722         if logs:
--> 723             self.sagemaker_session.logs_for_processing_job(self.job_name, wait=True)
    724         else:
    725             self.sagemaker_session.wait_for_processing_job(self.job_name)

~/code/trfx/ml-workflow-test/venv/lib/python3.7/site-packages/sagemaker/session.py in logs_for_processing_job(self, job_name, wait, poll)
   3109
   3110         if wait:
-> 3111             self._check_job_status(job_name, description, "ProcessingJobStatus")
   3112             if dot:
   3113                 print()

~/code/trfx/ml-workflow-test/venv/lib/python3.7/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   2613                 ),
   2614                 allowed_statuses=["Completed", "Stopped"],
-> 2615                 actual_status=status,
   2616             )
   2617

UnexpectedStatusException: Error for Processing job <job-base-name>-2020-03-09-19-01-48-893: Failed. Reason: AlgorithmError: See job logs for more information

