Cache does not work with the PySparkProcessor class #3384

@HarryPommier

Description

Describe the bug
It seems that the step caching mechanism does not work with the PySparkProcessor class: the processing step is re-executed even when the script and its inputs are unchanged.

To reproduce

from sagemaker.processing import ProcessingOutput
from sagemaker.spark.processing import PySparkProcessor
from sagemaker.workflow.steps import CacheConfig, ProcessingStep

cache_config = CacheConfig(enable_caching=True, expire_after="30d")

pyspark_processor = PySparkProcessor(
    base_job_name="sm-spark",
    framework_version="3.1",
    role=role_arn,
    instance_type="ml.m5.xlarge",
    instance_count=8,
    sagemaker_session=pipeline_session,
    max_runtime_in_seconds=2400,
)

step_process_args = pyspark_processor.run(
    submit_app="steps/preprocess.py",
    outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/output", destination=f"s3://{static_bucket}/{static_prefix}")],
)

step_process = ProcessingStep(
    name="PySparkPreprocessing",
    step_args=step_process_args,
    cache_config=cache_config,
)

System information

  • SageMaker Python SDK version: 2.109.0

Additional context
I think #2790 does not solve the caching problem when using the PySparkProcessor class.
As far as I understand, this piece of code (from src/sagemaker/workflow/steps.py):

    if code:
        code_url = urlparse(code)
        if code_url.scheme == "" or code_url.scheme == "file":
            # By default, Processor will upload the local code to an S3 path
            # containing a timestamp. This causes cache misses whenever a
            # pipeline is updated, even if the underlying script hasn't changed.
            # To avoid this, hash the contents of the script and include it
            # in the job_name passed to the Processor, which will be used
            # instead of the timestamped path.
            self.job_name = self._generate_code_upload_path()

is only executed when the "code" argument is provided. That is never the case with PySparkProcessor, where the script is passed through the "submit_app" argument instead.

I found a temporary workaround: provide a static S3 path for the "submit_app" parameter. Otherwise a timestamped upload path is generated on every run, which changes the step arguments and makes the cache miss.
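To make the static path stable across pipeline updates without going stale when the script changes, one option is to content-address it yourself, mimicking what _generate_code_upload_path does for the "code" argument: hash the script and embed the digest in the S3 key. A minimal sketch of that idea (static_submit_app_uri, the bucket/prefix layout, and the digest length are all illustrative, not part of the SDK):

import hashlib
import os


def static_submit_app_uri(local_script: str, bucket: str, prefix: str) -> str:
    """Build a deterministic S3 URI for a PySpark script by hashing its
    contents. Re-running the pipeline with an unchanged script yields the
    same URI, so the ProcessingStep cache can hit; editing the script
    changes the URI, so the step re-runs.
    """
    with open(local_script, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()[:16]
    # e.g. s3://my-bucket/code/0a1b2c3d4e5f6a7b/preprocess.py
    return f"s3://{bucket}/{prefix}/{digest}/{os.path.basename(local_script)}"

You would then upload the script to that URI yourself (e.g. with sagemaker.s3.S3Uploader or boto3) and pass the resulting s3:// path as submit_app instead of the local file path.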
