
Feature proposal: Session can download from S3 #1000

@andremoeller

Description

Describe the problem

sagemaker.session.Session has an upload_data method that allows users to upload a local file or directory to S3:

def upload_data(self, path, bucket=None, key_prefix="data", extra_args=None):
    """Upload local file or directory to S3.

    If a single file is specified for upload, the resulting S3 object key is
    ``{key_prefix}/{filename}`` (filename does not include the local path, if any is specified).
    If a directory is specified for upload, the API uploads all content, recursively,
    preserving the relative structure of subdirectories. The resulting object key names are:
    ``{key_prefix}/{relative_subdirectory_path}/filename``.

    Args:
        path (str): Path (absolute or relative) of local file or directory to upload.
        bucket (str): Name of the S3 Bucket to upload to (default: None). If not specified, the
            default bucket of the ``Session`` is used (if the default bucket does not exist, the
            ``Session`` creates it).
        key_prefix (str): Optional S3 object key name prefix (default: 'data'). S3 uses the
            prefix to create a directory structure for the bucket content that it displays in
            the S3 console.
        extra_args (dict): Optional extra arguments that may be passed to the upload operation.
            Similar to the ExtraArgs parameter in the S3 upload_file function. Please refer to the
            ExtraArgs parameter documentation here:
            https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html#the-extraargs-parameter

    Returns:
        str: The S3 URI of the uploaded file(s). If a file is specified in the path argument,
            the URI format is: ``s3://{bucket name}/{key_prefix}/{original_file_name}``.
            If a directory is specified in the path argument, the URI format is
            ``s3://{bucket name}/{key_prefix}``.
    """

But there's no corresponding way to use Session to download files from S3 into a local directory. Training jobs put model artifacts in S3, transform jobs put batch transform output in S3, and any job (or even an Endpoint) may write output to S3 during its execution, but right now users have to fall back to boto3 instead of sagemaker_session.

Proposal

Add sagemaker.session.Session.download_data with the following signature and behavior:

def download_data(self, s3_uri, local_path):
    """Download all objects under an S3 prefix into a local directory.

    Args:
        s3_uri (str): An S3 URI. All objects under this prefix will be downloaded into
          ``local_path``.
        local_path (str): A local path. "." means the current directory. Directories will be
          created as needed.
    """

Thoughts / feedback on this proposal are welcome. Thanks!
