Closed
Labels: component: pipelines (Relates to the SageMaker Pipeline Platform), type: bug
Description
Describe the bug
Step caching does not seem to work when a ProcessingStep is built with the PySparkProcessor.
To reproduce
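    from sagemaker.processing import ProcessingOutput
    from sagemaker.spark.processing import PySparkProcessor
    from sagemaker.workflow.steps import ProcessingStep

    # role_arn, pipeline_session, static_bucket, static_prefix and cache_config
    # are defined elsewhere in the pipeline definition.
    pyspark_processor = PySparkProcessor(
        base_job_name="sm-spark",
        framework_version="3.1",
        role=role_arn,
        instance_type="ml.m5.xlarge",
        instance_count=8,
        sagemaker_session=pipeline_session,
        max_runtime_in_seconds=2400,
    )
    step_process_args = pyspark_processor.run(
        submit_app="steps/preprocess.py",
        outputs=[
            ProcessingOutput(
                output_name="train",
                source="/opt/ml/processing/output",
                destination=f"s3://{static_bucket}/{static_prefix}",
            )
        ],
    )
    step_process = ProcessingStep(
        name="PySparkPreprocessing",
        step_args=step_process_args,
        cache_config=cache_config,
    )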
System information
A description of your system. Please provide:
- SageMaker Python SDK version: 2.109.0
Additional context
I think #2790 does not solve the caching problem when using the "PySparkProcessor" class.
As far as I understand, this piece of code (src/sagemaker/workflow/steps.py):
    if code:
        code_url = urlparse(code)
        if code_url.scheme == "" or code_url.scheme == "file":
            # By default, Processor will upload the local code to an S3 path
            # containing a timestamp. This causes cache misses whenever a
            # pipeline is updated, even if the underlying script hasn't changed.
            # To avoid this, hash the contents of the script and include it
            # in the job_name passed to the Processor, which will be used
            # instead of the timestamped path.
            self.job_name = self._generate_code_upload_path()
is only executed when the "code" argument is provided. This is not the case when using PySparkProcessor since we only provide a "submit_app" argument.
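To make the condition concrete, here is a minimal standalone sketch of the check quoted above. It is not the SDK's code: `takes_hashed_upload_path` is a hypothetical helper that only mirrors the `if code:` / `urlparse` test, under the assumption (taken from this report) that `code` is left unset when only `submit_app` is passed.

```python
from typing import Optional
from urllib.parse import urlparse


def takes_hashed_upload_path(code: Optional[str]) -> bool:
    """Hypothetical mirror of the quoted condition in steps.py.

    Returns True only when `code` is set and points to a local file,
    i.e. the case in which the content-hashed job name is generated.
    """
    if not code:
        # Assumption from the report: with PySparkProcessor only
        # `submit_app` is given, so `code` is empty and the branch is skipped.
        return False
    scheme = urlparse(code).scheme
    return scheme == "" or scheme == "file"


# A local script passed as `code` takes the hashed path ...
print(takes_hashed_upload_path("steps/preprocess.py"))
# ... an S3 URI does not need it ...
print(takes_hashed_upload_path("s3://my-bucket/code/preprocess.py"))
# ... and a missing `code` (the submit_app-only case) never reaches it.
print(takes_hashed_upload_path(None))
```

Under that assumption, the last case is the one hit by the reproduction above, which would explain why the upload path keeps its timestamp.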
I found a temporary workaround: provide a static S3 path for the "submit_app" parameter. Otherwise a time-stamped path is always generated, which makes the cache fail.
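One way to build such a static path is sketched below. This is an illustration, not part of the SDK: `stable_submit_app_uri` is a hypothetical helper, the bucket and prefix are placeholders, and including a content hash in the key is my own addition so that the path changes only when the script changes. The script still has to be uploaded to the resulting URI separately before it is passed as "submit_app".

```python
import hashlib


def stable_submit_app_uri(
    script_bytes: bytes, filename: str, bucket: str, prefix: str
) -> str:
    """Build a deterministic S3 URI for a script (hypothetical helper).

    The key depends only on the script contents, never on the current
    time, so an unchanged script always maps to the same URI.
    """
    digest = hashlib.sha256(script_bytes).hexdigest()
    clean_prefix = prefix.strip("/")
    return f"s3://{bucket}/{clean_prefix}/code/{digest}/{filename}"


# In-memory stand-ins for two versions of steps/preprocess.py.
script_v1 = b"print('preprocess v1')\n"
script_v2 = b"print('preprocess v2')\n"

uri_a = stable_submit_app_uri(script_v1, "preprocess.py", "my-bucket", "my-prefix")
uri_b = stable_submit_app_uri(script_v1, "preprocess.py", "my-bucket", "my-prefix")
uri_c = stable_submit_app_uri(script_v2, "preprocess.py", "my-bucket", "my-prefix")

print(uri_a == uri_b)  # same contents, same URI
print(uri_a == uri_c)  # different contents, different URI
```

A fully fixed path (without the hash) matches the workaround as described, but it may keep producing cache hits after the script is edited, since the step arguments would look unchanged. Keying the path on the contents is one way to avoid that.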