diff --git a/doc/using_mxnet.rst b/doc/using_mxnet.rst index d97640f726..0ecf5bff6c 100644 --- a/doc/using_mxnet.rst +++ b/doc/using_mxnet.rst @@ -1,39 +1,33 @@ -========================================= +######################################### Using MXNet with the SageMaker Python SDK -========================================= - -.. contents:: +######################################### With the SageMaker Python SDK, you can train and host MXNet models on Amazon SageMaker. -Supported versions of MXNet: ``0.12.1``, ``1.0.0``, ``1.1.0``, ``1.2.1``, ``1.3.0``, ``1.4.0``, ``1.4.1``. +For information about supported versions of MXNet, see the `MXNet README `__. -Supported versions of MXNet for Elastic Inference: ``1.3.0``, ``1.4.0``, ``1.4.1``. +For general information about using the SageMaker Python SDK, see :ref:`overview:Using the SageMaker Python SDK`. -Training with MXNet -------------------- +.. contents:: -Training MXNet models using ``MXNet`` Estimators is a two-step process. First, you prepare your training script, then second, you run this on SageMaker via an ``MXNet`` Estimator. You should prepare your script in a separate source file than the notebook, terminal session, or source file you're using to submit the script to SageMaker via an ``MXNet`` Estimator. +************************ +Train a Model with MXNet +************************ -Suppose that you already have an MXNet training script called -``mxnet-train.py``. You can run this script in SageMaker as follows: +To train an MXNet model by using the SageMaker Python SDK: -.. code:: python +.. |create mxnet estimator| replace:: Create a ``sagemaker.mxnet.MXNet`` Estimator +.. 
_create mxnet estimator: #create-an-estimator - from sagemaker.mxnet import MXNet - mxnet_estimator = MXNet('mxnet-train.py', - role='SageMakerRole', - train_instance_type='ml.p3.2xlarge', - train_instance_count=1, - framework_version='1.3.0') - mxnet_estimator.fit('s3://bucket/path/to/training/data') - -Where the S3 url is a path to your training data, within Amazon S3. The constructor keyword arguments define how SageMaker runs your training script and are discussed, in detail, in a later section. +.. |call fit| replace:: Call the estimator's ``fit`` method +.. _call fit: #call-the-fit-method -In the following sections, we'll discuss how to prepare a training script for execution on SageMaker, then how to run that script on SageMaker using an ``MXNet`` Estimator. +1. `Prepare a training script <#prepare-an-mxnet-training-script>`_ +2. |create mxnet estimator|_ +3. |call fit|_ -Preparing the MXNet training script -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Prepare an MXNet Training Script +================================ .. warning:: The structure for training scripts changed starting at MXNet version 1.3. @@ -41,8 +35,8 @@ Preparing the MXNet training script For information on how to upgrade an old script to the new format, see `"Updating your MXNet training script" <#updating-your-mxnet-training-script>`__. For versions 1.3 and higher -^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Your MXNet training script must be a Python 2.7 or 3.5 compatible source file. +--------------------------- +Your MXNet training script must be a Python 2.7 or 3.6 compatible source file. The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, including the following: @@ -95,9 +89,9 @@ If you want to use, for example, boolean hyperparameters, you need to specify `` For more on training environment variables, please visit `SageMaker Containers `_. 
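Taken together, a minimal 1.3-style entry script parses hyperparameters from the command line and reads data and model locations from the ``SM_*`` environment variables. The following is only a sketch of that pattern: the hyperparameter names and fallback paths are hypothetical, and the actual training logic is omitted.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker environment settings."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments (names are examples).
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--learning-rate", type=float, default=0.1)
    # SageMaker exposes input and output locations through SM_* environment
    # variables; the fallback defaults let the script run outside SageMaker too.
    parser.add_argument("--model-dir",
                        default=os.environ.get("SM_MODEL_DIR", "/tmp/model"))
    parser.add_argument("--train",
                        default=os.environ.get("SM_CHANNEL_TRAIN", "/tmp/train"))
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    # ... build the network, train for args.epochs at args.learning_rate,
    # then serialize the model to args.model_dir ...
```

Because the script is run as a standalone program, the ``__main__`` guard is where training starts; SageMaker passes the hyperparameters you set on the estimator as the command-line arguments parsed above.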
For versions 1.2 and lower
-^^^^^^^^^^^^^^^^^^^^^^^^^^
+--------------------------

-Your MXNet training script must be a Python 2.7 or 3.5 compatible source file. The MXNet training script must contain a function ``train``, which SageMaker invokes to run training. You can include other functions as well, but it must contain a ``train`` function.
+Your MXNet training script must be a Python 2.7 or 3.6 compatible source file, and it must contain a function ``train``, which SageMaker invokes to run training. You can include other functions as well.

When you run your script on SageMaker via the ``MXNet`` Estimator, SageMaker injects information about the training environment into your training function via Python keyword arguments. You can choose to take advantage of these by including them as keyword arguments in your train function. The full list of arguments is:

@@ -144,18 +138,8 @@ When SageMaker runs your training script, it imports it as a Python module and t

If you want to run your training script locally via the Python interpreter, look at using a ``__name__ == '__main__'`` guard, discussed in more detail here: https://stackoverflow.com/questions/419163/what-does-if-name-main-do .

-Distributed training
-''''''''''''''''''''
-
-When writing a distributed training script, you will want to use an MXNet kvstore to store and share model parameters.
-During training, SageMaker automatically starts an MXNet kvstore server and scheduler processes on hosts in your training job cluster.
-Your script runs as an MXNet worker task, with one server process on each host in your cluster.
-One host is selected arbitrarily to run the scheduler process.
-
-To learn more about writing distributed MXNet programs, please see `Distributed Training `__ in the MXNet docs.
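For these older versions, a skeletal ``train`` function might look like the following. This is only a sketch: the channel name and hyperparameter are examples, and the returned dictionary stands in for a real MXNet network object.

```python
def train(hyperparameters, channel_input_dirs, num_gpus, **kwargs):
    """Invoked by SageMaker with keyword arguments describing the
    training environment; **kwargs absorbs any arguments not listed."""
    # Hyperparameters arrive as strings, so convert them explicitly.
    epochs = int(hyperparameters.get("epochs", 10))
    # Each input channel maps to a local directory containing the data.
    data_dir = channel_input_dirs.get("train")
    # Choose a GPU or CPU context based on num_gpus, then build and fit
    # the network here; a dict stands in for the real model object.
    model = {"epochs": epochs, "data_dir": data_dir, "num_gpus": num_gpus}
    return model  # the return value is later passed to save()
```

Any environment arguments your function does not declare are collected by ``**kwargs``, so you only need to name the ones you use.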
-
-Saving models
-'''''''''''''
+Save the Model
+--------------

Just as you enable training by defining a ``train`` function in your training script, you enable model saving by defining a ``save`` function in your script. If your script includes a ``save`` function, SageMaker will invoke it with the return value of ``train``. Model saving is a two-step process: first, you return the model you want to save from ``train``; then, you define your model-serialization logic in ``save``.

@@ -211,7 +195,7 @@ After your ``train`` function completes, SageMaker will invoke ``save`` with the

If your train function returns a Gluon API ``net`` object as its model, you'll need to write your own ``save`` function. You will want to serialize the ``net`` parameters. Saving ``net`` parameters is covered in the `Serialization section `__ of the collaborative Gluon deep-learning book `"The Straight Dope" `__.

Updating your MXNet training script
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+-----------------------------------

The structure for training scripts changed with MXNet version 1.3. The ``train`` function is no longer required; instead, the training script must be runnable as a standalone script.

@@ -297,11 +281,11 @@ If there are other packages you want to use with your script, you can include a

A ``requirements.txt`` file is a text file that contains a list of items that are installed by using ``pip install``. You can also specify the version of an item to install. For information about the format of a ``requirements.txt`` file, see `Requirements Files `__ in the pip documentation.

-Running an MXNet training script in SageMaker
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Create an Estimator
+===================

-You run MXNet training scripts on SageMaker by creating an ``MXNet`` estimators.
-When you call ``fit`` on an ``MXNet`` estimator, a SageMaker training job with your script is started.
+You run MXNet training scripts on SageMaker by creating an ``MXNet`` estimator. +When you call ``fit`` on an ``MXNet`` estimator, SageMaker starts a training job using your script as training code. The following code sample shows how you train a custom MXNet script "train.py". .. code:: python @@ -315,104 +299,17 @@ The following code sample shows how you train a custom MXNet script "train.py". 'learning-rate': 0.1}) mxnet_estimator.fit('s3://my_bucket/my_training_data/') -MXNet Estimators -^^^^^^^^^^^^^^^^ - -The ``MXNet`` constructor takes both required and optional arguments. - -Required arguments -'''''''''''''''''' - -The following are required arguments to the ``MXNet`` constructor. When you create an MXNet object, you must include these in the constructor, either positionally or as keyword arguments. - -- ``entry_point`` Path (absolute or relative) to the Python file which - should be executed as the entry point to training. -- ``role`` An AWS IAM role (either name or full ARN). The Amazon - SageMaker training jobs and APIs that create Amazon SageMaker - endpoints use this role to access training data and model artifacts. - After the endpoint is created, the inference code might use the IAM - role, if accessing AWS resource. -- ``train_instance_count`` Number of Amazon EC2 instances to use for - training. -- ``train_instance_type`` Type of EC2 instance to use for training, for - example, 'ml.c4.xlarge'. - -Optional arguments -'''''''''''''''''' - -The following are optional arguments. When you create an ``MXNet`` object, you can specify these as keyword arguments. - -- ``source_dir`` Path (absolute or relative) to a directory with any - other training source code dependencies including the entry point - file. Structure within this directory will be preserved when training - on SageMaker. 
-- ``dependencies (list[str])`` A list of paths to directories (absolute or relative) with - any additional libraries that will be exported to the container (default: ``[]``). - The library folders will be copied to SageMaker in the same folder where the entrypoint is copied. - If the ``source_dir`` points to S3, code will be uploaded and the S3 location will be used - instead. For example, the following call - - >>> MXNet(entry_point='train.py', dependencies=['my/libs/common', 'virtual-env']) +For more information about the sagemaker.mxnet.MXNet estimator, see `sagemaker.mxnet.MXNet Class`_. - results in the following inside the container: - .. code:: - - opt/ml/code - ├── train.py - ├── common - └── virtual-env - -- ``hyperparameters`` Hyperparameters that will be used for training. - Will be made accessible as a dict[str, str] to the training code on - SageMaker. For convenience, accepts other types besides str, but - str() will be called on keys and values to convert them before - training. -- ``py_version`` Python version you want to use for executing your - model training code. Valid values: 'py2' and 'py3'. -- ``train_volume_size`` Size in GB of the EBS volume to use for storing - input data during training. Must be large enough to store training - data if input_mode='File' is used (which is the default). -- ``train_max_run`` Timeout in seconds for training, after which Amazon - SageMaker terminates the job regardless of its current status. -- ``input_mode`` The input mode that the algorithm supports. Valid - modes: 'File' - Amazon SageMaker copies the training dataset from the - S3 location to a directory in the Docker container. 'Pipe' - Amazon - SageMaker streams data directly from S3 to the container via a Unix - named pipe. -- ``output_path`` Location where you want the training result (model artifacts and optional output files) saved. - This should be an S3 location unless you're using Local Mode, which also supports local output paths. 
- If not specified, results are stored to a default S3 bucket. -- ``output_kms_key`` Optional KMS key ID to optionally encrypt training - output with. -- ``job_name`` Name to assign for the training job that the fit() - method launches. If not specified, the estimator generates a default - job name, based on the training image name and current timestamp -- ``image_name`` An alternative docker image to use for training and - serving. If specified, the estimator will use this image for training and - hosting, instead of selecting the appropriate SageMaker official image based on - framework_version and py_version. Refer to: `SageMaker MXNet Docker Containers - <#sagemaker-mxnet-docker-containers>`_ for details on what the Official images support - and where to find the source code to build your custom image. -- ``distributions`` For versions 1.3 and above only. - Specifies information for how to run distributed training. - To launch a parameter server during training, set this argument to: -.. code:: - - { - 'parameter_server': { - 'enabled': True - } - } - -Calling fit -^^^^^^^^^^^ +Call the fit Method +=================== You start your training script by calling ``fit`` on an ``MXNet`` Estimator. ``fit`` takes both required and optional arguments. fit Required argument -''''''''''''''''''''' +--------------------- - ``inputs``: This can take one of the following forms: A string S3 URI, for example ``s3://my-bucket/my-training-data``. In this @@ -431,16 +328,28 @@ For example: .. optional-arguments-1: fit Optional arguments -'''''''''''''''''''''' +---------------------- - ``wait``: Defaults to True, whether to block and wait for the training script to complete before returning. - ``logs``: Defaults to True, whether to show logs produced by training job in the Python session. Only meaningful when wait is True. +Distributed training +==================== + +When writing a distributed training script, use an MXNet kvstore to store and share model parameters. 
+During training, SageMaker automatically starts an MXNet kvstore server and scheduler processes on hosts in your training job cluster. +Your script runs as an MXNet worker task, with one server process on each host in your cluster. +One host is selected arbitrarily to run the scheduler process. + +To learn more about writing distributed MXNet programs, please see `Distributed Training `__ in the MXNet docs. + -Deploying MXNet models ----------------------- + +******************* +Deploy MXNet models +******************* After an MXNet Estimator has been fit, you can host the newly created model in SageMaker. @@ -474,7 +383,7 @@ MXNet on SageMaker has support for `Elastic Inference `__. -Model serving -^^^^^^^^^^^^^ +Serve an MXNet Model +-------------------- After the SageMaker model server loads your model by calling either the default ``model_fn`` or the implementation in your script, SageMaker serves your model. Model serving is the process of responding to inference requests received by SageMaker ``InvokeEndpoint`` API calls. @@ -540,7 +449,7 @@ Defining how to handle these requests can be done in one of two ways: - writing your own ``transform_fn`` for handling input processing, prediction, and output processing Using ``input_fn``, ``predict_fn``, and ``output_fn`` -''''''''''''''''''''''''''''''''''''''''''''''''''''' +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The SageMaker MXNet model server breaks request handling into three steps: @@ -592,8 +501,8 @@ If you rely solely on the SageMaker MXNet model server defaults, you get the fol In the following sections we describe the default implementations of input_fn, predict_fn, and output_fn. We describe the input arguments and expected return types of each, so you can define your own implementations. 
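As a concrete illustration of the three-step flow, here is a sketch of a hosting script that overrides all three handlers. It is deliberately framework-agnostic — plain Python lists stand in for NDArrays, and only JSON is handled — so treat it as a shape to follow rather than the default implementation.

```python
import json

JSON_CONTENT_TYPE = "application/json"


def input_fn(request_body, request_content_type):
    """Deserialize the request body into an object for prediction."""
    if request_content_type == JSON_CONTENT_TYPE:
        return json.loads(request_body)
    raise ValueError("Unsupported content type: {}".format(request_content_type))


def predict_fn(input_data, model):
    """Run inference; `model` is whatever object model_fn returned."""
    return model(input_data)


def output_fn(prediction, response_content_type):
    """Serialize the prediction to the requested content type."""
    if response_content_type == JSON_CONTENT_TYPE:
        return json.dumps(prediction)
    raise ValueError("Unsupported content type: {}".format(response_content_type))
```

In a real script, ``input_fn`` would typically build an ``NDArrayIter`` (or similar) from the request, and ``predict_fn`` would call the loaded ``Module`` or Gluon network rather than a plain callable.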
-Input processing
-""""""""""""""""
+Process Model Input
+"""""""""""""""""""

When an InvokeEndpoint operation is made against an Endpoint running a SageMaker MXNet model server, the model server receives two pieces of information:

@@ -633,8 +542,8 @@ If you provide your own implementation of input_fn, you should abide by the ``in

        # if the content type is not supported.
        pass

-Prediction
-""""""""""
+Get Predictions from a Deployed Model
+"""""""""""""""""""""""""""""""""""""

After the inference request has been deserialized by ``input_fn``, the SageMaker MXNet model server invokes ``predict_fn``. As with ``input_fn``, you can define your own ``predict_fn`` or use the SageMaker MXNet default.

@@ -666,8 +575,8 @@ If you implement your own prediction function, you should take care to ensure th

   first argument to ``output_fn``. If you use the default ``output_fn``, this should be an ``NDArrayIter``.

-Output processing
-"""""""""""""""""
+Process Model Output
+""""""""""""""""""""

After invoking ``predict_fn``, the model server invokes ``output_fn``, passing in the return value from ``predict_fn`` and the InvokeEndpoint requested response content type.

@@ -683,7 +592,7 @@ The function should return an array of bytes serialized to the expected content

The default implementation expects ``prediction`` to be an ``NDArray`` and can serialize the result to either JSON or CSV. It accepts response content types of "application/json" and "text/csv".

Using ``transform_fn``
-''''''''''''''''''''''
+^^^^^^^^^^^^^^^^^^^^^^

If you would rather not structure your code around the three methods described above, you can instead define your own ``transform_fn`` to handle inference requests. An error will be thrown if a ``transform_fn`` is present in conjunction with any ``input_fn``, ``predict_fn``, and/or ``output_fn``.
``transform_fn`` has the following signature: @@ -710,10 +619,10 @@ For versions 1.3 and lower: You can find examples of hosting scripts using this structure in the example notebooks, such as the `mxnet_gluon_sentiment `__ notebook. Working with existing model data and training jobs --------------------------------------------------- +================================================== -Attaching to existing training jobs -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Attach to Existing Training Jobs +-------------------------------- You can attach an MXNet Estimator to an existing training job using the ``attach`` method. @@ -734,8 +643,8 @@ The ``attach`` method accepts the following arguments: - ``sagemaker_session (sagemaker.Session or None):`` The Session used to interact with SageMaker -Deploying Endpoints from model data -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Deploy Endpoints from Model Data +-------------------------------- As well as attaching to existing training jobs, you can deploy models directly from model data in S3. The following code sample shows how to do this, using the ``MXNetModel`` class. @@ -786,8 +695,9 @@ This uploads the contents of my_model to a gzip compressed tar file to S3 in the To run this command, you'll need the aws cli tool installed. Please refer to our `FAQ <#FAQ>`__ for more information on installing this. +******** Examples --------- +******** Amazon provides several example Jupyter notebooks that demonstrate end-to-end training on Amazon SageMaker using MXNet. Please refer to: @@ -795,37 +705,105 @@ https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-pytho These are also available in SageMaker Notebook Instance hosted Jupyter notebooks under the "sample notebooks" folder. -SageMaker MXNet Containers --------------------------- +*************************** +sagemaker.mxnet.MXNet Class +*************************** + +The following are the most commonly used ``MXNet`` constructor arguments. 
-When training and deploying training scripts, SageMaker runs your Python script in a Docker container with several libraries installed. When creating the Estimator and calling deploy to create the SageMaker Endpoint, you can control the environment your script runs in. +Required arguments +================== + +The following are required arguments to the ``MXNet`` constructor. When you create an MXNet object, you must include these in the constructor, either positionally or as keyword arguments. -SageMaker runs MXNet Estimator scripts in either Python 2.7 or Python 3.5. You can select the Python version by passing a ``py_version`` keyword arg to the MXNet Estimator constructor. Setting this to ``py2`` (the default) will cause your training script to be run on Python 2.7. Setting this to ``py3`` will cause your training script to be run on Python 3.5. This Python version applies to both the Training Job, created by fit, and the Endpoint, created by deploy. +- ``entry_point`` Path (absolute or relative) to the Python file which + should be executed as the entry point to training. +- ``role`` An AWS IAM role (either name or full ARN). The Amazon + SageMaker training jobs and APIs that create Amazon SageMaker + endpoints use this role to access training data and model artifacts. + After the endpoint is created, the inference code might use the IAM + role, if accessing AWS resource. +- ``train_instance_count`` Number of Amazon EC2 instances to use for + training. +- ``train_instance_type`` Type of EC2 instance to use for training, for + example, 'ml.c4.xlarge'. -Your MXNet training script will be run on version 1.2.1 by default. (See below for how to choose a different version, and currently supported versions.) The decision to use the GPU or CPU version of MXNet is made by the ``train_instance_type``, set on the MXNet constructor. If you choose a GPU instance type, your training job will be run on a GPU version of MXNet. 
If you choose a CPU instance type, your training job will be run on a CPU version of MXNet. Similarly, when you call deploy, specifying a GPU or CPU deploy_instance_type, will control which MXNet build your Endpoint runs. +Optional arguments +================== -The Docker images have the following dependencies installed: +The following are optional arguments. When you create an ``MXNet`` object, you can specify these as keyword arguments. -+-------------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+ -| Dependencies | MXNet 0.12.1 | MXNet 1.0.0 | MXNet 1.1.0 | MXNet 1.2.1 | MXNet 1.3.0 | MXNet 1.4.0 | MXNet 1.4.1 | -+-------------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+ -| Python | 2.7 or 3.5 | 2.7 or 3.5| 2.7 or 3.5| 2.7 or 3.5| 2.7 or 3.5| 2.7 or 3.6| 2.7 or 3.6| -+-------------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+ -| CUDA (GPU image only) | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.2 | 10.0 | -+-------------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+ -| numpy | 1.13.3 | 1.13.3 | 1.13.3 | 1.14.5 | 1.14.6 | 1.16.3 | 1.14.5 | -+-------------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+ -| onnx | N/A | N/A | N/A | 1.2.1 | 1.2.1 | 1.4.1 | 1.4.1 | -+-------------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+ -| keras-mxnet | N/A | N/A | N/A | N/A | 2.2.2 | 2.2.4.1 | 2.2.4.1 | -+-------------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+ +- ``source_dir`` Path (absolute or relative) to a directory with any + other training source code dependencies including the entry point + file. 
Structure within this directory will be preserved when training + on SageMaker. +- ``dependencies (list[str])`` A list of paths to directories (absolute or relative) with + any additional libraries that will be exported to the container (default: ``[]``). + The library folders will be copied to SageMaker in the same folder where the entrypoint is copied. + If the ``source_dir`` points to S3, code will be uploaded and the S3 location will be used + instead. For example, the following call -The Docker images extend Ubuntu 16.04. + >>> MXNet(entry_point='train.py', dependencies=['my/libs/common', 'virtual-env']) -You can select version of MXNet by passing a ``framework_version`` keyword arg to the MXNet Estimator constructor. Currently supported versions are listed in the above table. You can also set ``framework_version`` to only specify major and minor version, e.g ``1.4``, which will cause your training script to be run on the latest supported patch version of that minor version, which in this example would be 1.4.1. -Alternatively, you can build your own image by following the instructions in the SageMaker MXNet containers repository, and passing ``image_name`` to the MXNet Estimator constructor. + results in the following inside the container: -You can visit the SageMaker MXNet container repositories here: + .. code:: + + opt/ml/code + ├── train.py + ├── common + └── virtual-env + +- ``hyperparameters`` Hyperparameters that will be used for training. + Will be made accessible as a dict[str, str] to the training code on + SageMaker. For convenience, accepts other types besides str, but + str() will be called on keys and values to convert them before + training. +- ``py_version`` Python version you want to use for executing your + model training code. Valid values: 'py2' and 'py3'. +- ``train_volume_size`` Size in GB of the EBS volume to use for storing + input data during training. 
Must be large enough to store training + data if input_mode='File' is used (which is the default). +- ``train_max_run`` Timeout in seconds for training, after which Amazon + SageMaker terminates the job regardless of its current status. +- ``input_mode`` The input mode that the algorithm supports. Valid + modes: 'File' - Amazon SageMaker copies the training dataset from the + S3 location to a directory in the Docker container. 'Pipe' - Amazon + SageMaker streams data directly from S3 to the container via a Unix + named pipe. +- ``output_path`` Location where you want the training result (model artifacts and optional output files) saved. + This should be an S3 location unless you're using Local Mode, which also supports local output paths. + If not specified, results are stored to a default S3 bucket. +- ``output_kms_key`` Optional KMS key ID to optionally encrypt training + output with. +- ``job_name`` Name to assign for the training job that the fit() + method launches. If not specified, the estimator generates a default + job name, based on the training image name and current timestamp +- ``image_name`` An alternative docker image to use for training and + serving. If specified, the estimator will use this image for training and + hosting, instead of selecting the appropriate SageMaker official image based on + framework_version and py_version. Refer to: `SageMaker MXNet Docker Containers + <#sagemaker-mxnet-docker-containers>`_ for details on what the Official images support + and where to find the source code to build your custom image. +- ``distributions`` For versions 1.3 and above only. + Specifies information for how to run distributed training. + To launch a parameter server during training, set this argument to: + +.. 
code:: + + { + 'parameter_server': { + 'enabled': True + } + } + +************************** +SageMaker MXNet Containers +************************** + +For information about SageMaker MXNet containers, see the following topics: - training: https://github.com/aws/sagemaker-mxnet-container - serving: https://github.com/aws/sagemaker-mxnet-serving-container + +For information about the dependencies installed in SageMaker MXNet containers, see https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/mxnet/README.rst#sagemaker-mxnet-containers.