MLOps on Amazon EKS

This project defines a prototypical solution for MLOps on Amazon Elastic Kubernetes Service (EKS). Key use cases for this solution are:

Building a comprehensive sandbox environment for MLOps experimentation on EKS.
Defining a canonical prototype for building custom MLOps platforms on EKS.

This solution takes a modular approach to MLOps: you can enable or disable individual MLOps modules as needed. Supported MLOps modules include Airflow, Kubeflow, KServe, Kueue, MLFlow, and Slinky Slurm.

For distributed training, the solution works with popular machine learning libraries such as Nemo, Hugging Face Accelerate, PyTorch Lightning, DeepSpeed, Megatron-DeepSpeed, Ray Train, and Neuronx Distributed. For distributed inference, the solution supports Ray Serve with vLLM, Triton Inference Server with Python, TensorRT-LLM and vLLM backends, and Deep Java Library (DJL) Large Model Inference (LMI) with all supported backends.

Legacy Note: This project started as a companion to the Mask R-CNN distributed training blog, and that part of the project is documented in this README.

Conceptual Overview

The key novel concept to understand is that this solution uses Helm charts to execute MLOps pipeline steps. See tutorials for a quick start.

Helm charts are commonly used to deploy applications within an Amazon EKS cluster. While typical applications deployed using Helm charts are services that run until stopped, this solution uses Helm charts to execute discrete steps within arbitrary MLOps pipelines. The Helm chart based pipeline steps can be executed directly via the Helm CLI, or can be used to compose MLOps pipelines using Kubeflow Pipelines or Apache Airflow.

Any MLOps pipeline in this solution can be conceptualized as a series of Helm chart installations managed within a single Helm Release. Each step in the pipeline is executed by installing a Helm chart with a specific YAML recipe, where the YAML recipe acts as the Helm values file. Once the step completes, the Helm chart is uninstalled, and the next Helm chart in the pipeline is deployed within the same Helm Release. Using a single Helm Release for a given ML pipeline ensures that the discrete steps in the pipeline are executed atomically and that the dependencies among the steps are maintained.
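
The install-then-uninstall sequence described above can be sketched as a short script. This is a minimal illustration, not the project's own tooling: the release name, chart paths, and recipe file names are hypothetical, and the script only prints the Helm commands so that the sequence can be reviewed before anything is run.

```shell
#!/bin/sh
# Sketch: run pipeline steps one after another inside a single Helm release.
# RELEASE, the chart paths, and the recipe file names are hypothetical.
RELEASE="my-pipeline"
NAMESPACE="kubeflow-user-example-com"

# Print the Helm commands for one step: install the chart with its YAML
# recipe as the values file, then uninstall it once the step has finished.
step_commands() {
  chart="$1"; recipe="$2"
  printf 'helm install %s %s -f %s -n %s\n' "$RELEASE" "$chart" "$recipe" "$NAMESPACE"
  printf 'helm uninstall %s -n %s\n' "$RELEASE" "$NAMESPACE"
}

# Print the full command sequence for a pipeline given chart=recipe pairs.
pipeline_commands() {
  for pair in "$@"; do
    step_commands "${pair%%=*}" "${pair#*=}"
  done
}

pipeline_commands \
  "charts/data-prep=recipes/preprocess.yaml" \
  "charts/train=recipes/train.yaml"
```

When running the printed commands by hand, wait for the step's pods to finish between each install and its uninstall. How completion is detected depends on the chart used for that step.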

The included tutorials provide examples of MLOps pipelines that use pre-defined Helm charts, along with YAML recipe files that model each pipeline step. Typically, in order to define a new pipeline in this solution, you do not need to write new Helm Charts. The solution comes with a library of pre-defined machine learning Helm charts. However, you are required to write a YAML recipe file for each step in your MLOps pipeline.

The tutorials walk you through each pipeline, step by step, where you manually execute each pipeline step by installing, and uninstalling, a pre-defined Helm chart with its associated YAML recipe.

For complete end-to-end automation, we also provide an example where you can execute all the pipeline steps automatically using Kubeflow Pipelines. This option requires you to enable the Kubeflow platform module.

If you are a platform engineer, you may be interested in the system architecture overview of this solution.

Tutorials

After completing prerequisites, use the directory below to navigate the tutorials.

Category	Frameworks/Libraries
Inference	DJL Serving, RayServe, Triton Inference Server
Training	Hugging Face Accelerate, Lightning, Megatron-DeepSpeed, Nemo, Neuronx Distributed, Neuronx Distributed Training, RayTrain
Legacy	Mask R-CNN (TensorFlow), Neuronx Nemo Megatron

Prerequisites

Create and activate an AWS Account
Select your AWS Region. For the tutorial below, we assume the region to be us-west-2.
Manage your Amazon EC2 service limits in your selected AWS Region. Increase service limits to at least 8 each for p4d.24xlarge, g6.xlarge, g6.2xlarge, g6.48xlarge, inf2.xlarge, inf2.48xlarge and trn1.32xlarge. If you use other Amazon EC2 GPU or AWS Trainium/Inferentia instance types in the tutorials, ensure your EC2 service limits are increased appropriately.
If you do not already have an Amazon EC2 key pair, create a new Amazon EC2 key pair. You will need the key pair name to specify the KeyName parameter when launching the build machine desktop.
You will need an Amazon S3 bucket. If you don't have one, create a new Amazon S3 bucket in the AWS region you selected. The S3 bucket can be empty at this point.
Use AWS check ip to find the public IP address of your laptop. You will need this IP address to specify the DesktopAccessCIDR parameter when creating the build machine desktop.
Clone this Git repository on your laptop using git clone.
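
As a small aid for the DesktopAccessCIDR step, the sketch below turns a single IP address into a CIDR block. It assumes that access is restricted to exactly one address (a /32 block) and that the AWS check ip service is reachable at https://checkip.amazonaws.com; the function names are illustrative only.

```shell
#!/bin/sh
# Turn a single IPv4 address into a /32 CIDR block.
to_cidr() {
  printf '%s/32\n' "$1"
}

# Fetch the public IP from the AWS check ip endpoint and print it as a CIDR.
# Requires curl and network access, so it is only defined here, not called.
my_access_cidr() {
  to_cidr "$(curl -s https://checkip.amazonaws.com)"
}

to_cidr "203.0.113.10"   # prints 203.0.113.10/32
```

Call my_access_cidr from your laptop to get the value for the DesktopAccessCIDR parameter.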

Launch Build Machine Desktop

To launch the build machine, you will need Administrator job function access to AWS Management Console. Use the AWS CloudFormation template ml-ops-desktop.yaml from your cloned repository to create a new CloudFormation stack using the AWS Management console, or using the AWS CLI.

The template ml-ops-desktop.yaml creates AWS Identity and Access Management (IAM) resources. If you are creating the CloudFormation stack using the console, in the review step, you must check I acknowledge that AWS CloudFormation might create IAM resources. If you use the aws cloudformation create-stack CLI, you must use --capabilities CAPABILITY_NAMED_IAM.
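
For the CLI route, the following sketch assembles a create-stack command from the values gathered in the prerequisites. It is an illustration under assumptions: the stack name and key pair name are placeholders, and the template may require parameters beyond KeyName and DesktopAccessCIDR, which are the only two named in this README. The function prints the command instead of running it.

```shell
#!/bin/sh
# Print an `aws cloudformation create-stack` command for ml-ops-desktop.yaml.
# Arguments: stack name, EC2 key pair name, access CIDR, AWS Region.
create_stack_command() {
  stack_name="$1"; key_name="$2"; access_cidr="$3"; region="$4"
  printf 'aws cloudformation create-stack --region %s --stack-name %s' "$region" "$stack_name"
  printf ' --template-body file://ml-ops-desktop.yaml'
  printf ' --capabilities CAPABILITY_NAMED_IAM'
  printf ' --parameters ParameterKey=KeyName,ParameterValue=%s' "$key_name"
  printf ' ParameterKey=DesktopAccessCIDR,ParameterValue=%s\n' "$access_cidr"
}

# Hypothetical values, for illustration only.
create_stack_command ml-ops-desktop my-key-pair 203.0.113.10/32 us-west-2
```

Run the printed command from the root of your cloned repository so that file://ml-ops-desktop.yaml resolves.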

Connect to Build Machine Desktop using SSH

Once the stack status in CloudFormation console is CREATE_COMPLETE, find the ML Ops desktop instance launched in your stack in the Amazon EC2 console, and connect to the instance using SSH as user ubuntu, using your SSH key pair.
When you connect using SSH, and you see the message "Cloud init in progress! Logs: /var/log/cloud-init-output.log", disconnect and try later after about 15 minutes. The desktop installs the Amazon DCV server on first-time startup, and reboots after the install is complete.
If you see the message Amazon DCV server is enabled!, run the command sudo passwd ubuntu to set a new password for user ubuntu. Now you are ready to connect to the desktop using the Amazon DCV client.
The build machine desktop uses EC2 user-data to initialize the desktop. Most transient failures in the desktop initialization can be fixed by rebooting the desktop.

Clone Git Repository

Clone this git repository on the build machine using the following commands:

cd ~
git clone https://github.com/aws-samples/amazon-eks-machine-learning-with-terraform-and-kubeflow.git

Install Kubectl

Install kubectl on the build machine using the following commands:

cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
./eks-cluster/utils/install-kubectl-linux.sh

Use Terraform to Create ML Ops Platform

We use Terraform to create the ML Ops platform.

Enable S3 Backend for Terraform

Replace S3_BUCKET and S3_PREFIX with your S3 bucket name and S3 prefix (no leading or trailing /), and execute the commands below:

cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
./eks-cluster/utils/s3-backend.sh S3_BUCKET S3_PREFIX

Initialize Terraform

cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster/terraform/aws-eks-cluster-and-nodegroup
terraform init

Apply Terraform

Log out from AWS Public ECR; otherwise, the terraform apply commands below may fail:

docker logout public.ecr.aws

Specify at least three AWS Availability Zones from your AWS Region in azs below, ensuring that you have access to your desired EC2 instance types. Replace S3_BUCKET with your S3 bucket name and execute:

terraform apply -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var='azs=["us-west-2a","us-west-2b","us-west-2c"]' -var="import_path=s3://S3_BUCKET/ml-platform/"

If you need to use AWS GPU accelerated instances with AWS Elastic Fabric Adapter (EFA), you must specify an AWS Availability Zone for running these instances using cuda_efa_az variable, as shown in the example below:

terraform apply -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var='azs=["us-west-2d","us-west-2b","us-west-2c"]' -var="import_path=s3://S3_BUCKET/ml-platform/" -var="cuda_efa_az=us-west-2c"

If you need to use AWS Trainium instances, you must specify an AWS Availability Zone for running Trainium instances using neuron_az variable, as shown in the example below:

terraform apply -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var='azs=["us-west-2d","us-west-2b","us-west-2c"]' -var="import_path=s3://S3_BUCKET/ml-platform/" -var="neuron_az=us-west-2d"

Note: Ensure that the AWS Availability Zone you specify for neuron_az or cuda_efa_az variable above supports requested instance types, and this zone is included in the azs variable.
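
The note above can be checked mechanically before running terraform apply. The sketch below is a hedged helper, not part of this repository: it formats a zone list as the list literal used for the azs variable and verifies that the zone chosen for neuron_az is in that list. The function names are illustrative.

```shell
#!/bin/sh
# Format zone names as the Terraform list literal expected by the azs variable.
azs_literal() {
  out=""
  for az in "$@"; do
    out="${out:+$out,}\"$az\""
  done
  printf '[%s]\n' "$out"
}

# Succeed only if the zone in $1 is among the remaining arguments.
zone_in_azs() {
  target="$1"; shift
  for az in "$@"; do
    [ "$az" = "$target" ] && return 0
  done
  return 1
}

AZS="us-west-2d us-west-2b us-west-2c"
NEURON_AZ="us-west-2d"

# Word splitting of $AZS is intended here: one argument per zone.
if zone_in_azs "$NEURON_AZ" $AZS; then
  echo "-var='azs=$(azs_literal $AZS)' -var=\"neuron_az=$NEURON_AZ\""
else
  echo "neuron_az=$NEURON_AZ is not listed in azs" >&2
fi
```

This only checks list membership. To confirm that a zone actually offers an instance type, you can query aws ec2 describe-instance-type-offerings with --location-type availability-zone and a filter on instance-type.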

Enabling Modular Components

This solution offers a suite of modular components for MLOps. All are disabled by default, and are not needed to work through the included examples. You may toggle the modular components using the following Terraform variables:

Component	Terraform Variable	Default Value
Airflow	airflow_enabled	false
Kubeflow	kubeflow_platform_enabled	false
KServe	kserve_enabled	false
Kueue	kueue_enabled	false
MLFlow	mlflow_enabled	false
Nvidia DCGM Exporter	dcgm_exporter_enabled	false
SageMaker controller	ack_sagemaker_enabled	false
Slinky Slurm	slurm_enabled	false
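
To enable a component, set its variable to true on the terraform apply command line. The sketch below builds the extra -var flags from a list of variable names taken from the table above; the helper name enable_flags is hypothetical.

```shell
#!/bin/sh
# Build -var flags that set each named Terraform variable to true.
enable_flags() {
  flags=""
  for var in "$@"; do
    flags="${flags:+$flags }-var=\"$var=true\""
  done
  printf '%s\n' "$flags"
}

enable_flags kubeflow_platform_enabled mlflow_enabled
# prints: -var="kubeflow_platform_enabled=true" -var="mlflow_enabled=true"
```

Append the printed flags to the terraform apply command you used to create the platform, keeping all of the other -var arguments unchanged.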

Retrieve Static User Password

The static user's password is marked sensitive in the Terraform output. To show your static password, execute:

terraform output static_password

This password is used for Admin user for all web applications deployed within this solution.

Create Home Folder on EFS and FSx for Lustre

Attach to the shared file-systems by executing the following steps:

cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
kubectl apply -f eks-cluster/utils/attach-pvc.yaml  -n kubeflow
kubectl exec -it -n kubeflow attach-pvc -- /bin/bash

Inside the attach-pvc pod, for the EFS file-system, execute:

cd /efs
mkdir home
chown 1000:100 home
exit

For the FSx for Lustre file-system, execute:

cd /fsx
mkdir home
chown 1000:100 home
exit

FSx for Lustre File-system Eventual Consistency with S3

The FSx for Lustre file-system is configured to automatically import and export content from and to the configured S3 bucket. By default, /fsx is mapped to the ml-platform top-level folder in the S3 bucket. This automatic importing and exporting of content maintains eventual consistency between the FSx for Lustre file-system and the configured S3 bucket.
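
To locate the S3 object that corresponds to a file on /fsx, the default mapping can be written as a small path translation. This sketch assumes the default mapping of /fsx to the ml-platform folder described above; the function name and the bucket placeholder are illustrative.

```shell
#!/bin/sh
# Map a path under /fsx to its S3 URI, assuming the default mapping of /fsx
# to the ml-platform top-level folder. The bucket name is a placeholder.
fsx_to_s3() {
  bucket="$1"; path="$2"
  case "$path" in
    /fsx)   printf 's3://%s/ml-platform/\n' "$bucket" ;;
    /fsx/*) printf 's3://%s/ml-platform/%s\n' "$bucket" "${path#/fsx/}" ;;
    *)      echo "not under /fsx: $path" >&2; return 1 ;;
  esac
}

fsx_to_s3 S3_BUCKET /fsx/home/data/train.csv
# prints: s3://S3_BUCKET/ml-platform/home/data/train.csv
```

Because the synchronization is eventually consistent, a file written to /fsx may take some time to appear at the printed S3 location.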

Access Kubeflow Central Dashboard (Optional)

This section only applies if you enable the Kubeflow platform module.

If your web browser client machine is not the same as your build machine, before you can access the Kubeflow Central Dashboard in a web browser, you must execute the following steps on your client machine:

Install the kubectl client.
Enable IAM access to your EKS cluster. Before you execute this step, it is highly recommended that you back up your current configuration by executing the following command on your build machine:

kubectl get configmap aws-auth -n kube-system -o yaml > ~/aws-auth.yaml

After you have enabled IAM access to your EKS cluster, open a terminal on your client machine and start kubectl port-forwarding using the local and remote ports described below. Because we need to forward the HTTPS port (443), we need root access to execute the steps below:

For Mac:

sudo kubectl port-forward svc/istio-ingressgateway -n ingress 443:443

For Linux:

Ensure kubectl is configured for the root user by executing the following commands (one time only):

sudo su -
aws eks update-kubeconfig --region us-west-2 --name my-eks-cluster

Connect using kubectl port-forward:

kubectl port-forward svc/istio-ingressgateway -n ingress 443:443

Note: Leave the kubectl port-forward terminal open for the next step below.

Next, modify your /etc/hosts file to add the following entry:

127.0.0.1 	istio-ingressgateway.ingress.svc.cluster.local
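
If you prefer to script the hosts file change, the sketch below appends the entry only when the host name is not already present, so repeated runs do not create duplicates. It is a minimal illustration: the function name is hypothetical, and the demonstration runs against a temporary file instead of /etc/hosts.

```shell
#!/bin/sh
# Append the ingress gateway entry to the given hosts file unless the host
# name is already present in it.
ensure_hosts_entry() {
  hosts_file="$1"
  host="istio-ingressgateway.ingress.svc.cluster.local"
  if ! grep -qF "$host" "$hosts_file"; then
    printf '127.0.0.1\t%s\n' "$host" >> "$hosts_file"
  fi
}

# Demonstration against a temporary file: the second call changes nothing.
tmp="$(mktemp)"
ensure_hosts_entry "$tmp"
ensure_hosts_entry "$tmp"
cat "$tmp"
rm -f "$tmp"
```

To apply it for real, call ensure_hosts_entry /etc/hosts from a root shell, since writing to /etc/hosts requires root access.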

Open your web browser to the Kubeflow Central Dashboard URL to access the dashboard. For login, use the static username user@example.com, and retrieve the static password from Terraform.

Note: When you are not using the Kubeflow Central Dashboard, you can close the kubectl port-forward terminal.

Use Terraform to Destroy ML Ops Platform

If you want to preserve any content from your EFS file-system, you must upload it to your Amazon S3 bucket manually. The content stored on the FSx for Lustre file-system is automatically exported to your Amazon S3 bucket under the ml-platform top-level folder.

Please verify your content in the Amazon S3 bucket before destroying the ML Ops platform. You can recreate your ML Ops platform using the same S3 bucket.

Use the following command to check and uninstall all Helm releases:

for x in $(helm list -q -n kubeflow-user-example-com); do echo $x; helm uninstall $x -n kubeflow-user-example-com; done

Wait at least 5 minutes for Helm uninstall to shut down all pods. Use the following commands to check and delete all remaining pods in the kubeflow-user-example-com namespace:

kubectl get pods -n kubeflow-user-example-com
kubectl delete --all pods -n kubeflow-user-example-com

Run the following commands to delete the attach-pvc pod:

cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
kubectl delete -f eks-cluster/utils/attach-pvc.yaml  -n kubeflow

Wait 15 minutes to allow the infrastructure to automatically scale down to zero.

Finally, to destroy all the infrastructure created in this tutorial, execute the following commands:

cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster/terraform/aws-eks-cluster-and-nodegroup

terraform destroy -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var='azs=["us-west-2d","us-west-2b","us-west-2c"]' -var="import_path=s3://S3_BUCKET/ml-platform/"

Reference

YAML Recipes

The YAML recipe file is a Helm values file that defines the runtime environment for an MLOps step. The key fields in the Helm values file that are common to all charts are described below:

The image field specifies the required docker container image.
The resources field specifies the required infrastructure resources.
The git field describes the code repository used for running the job. The git repository is cloned into an implicitly defined location under the HOME directory, and the location is made available in the environment variable GIT_CLONE_DIR.
The inline_script field is used to define an arbitrary script file.
The pre_script field defines the shell script executed after cloning the git repository, but before launching the job.
The optional post-script section defines a script executed after training.
The training launch command and arguments are defined in the train field, and the data processing launch command and arguments are defined in the process field.
The pvc field specifies the persistent volumes and their mount paths. EFS and FSx for Lustre persistent volumes are available by default at the /efs and /fsx mount paths, respectively, but these mount paths can be changed.
The ebs field specifies optional Amazon EBS volume storage capacity and mount path. By default, no EBS volume is attached.
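
When reviewing a recipe, it can help to see at a glance which of the top-level fields listed above it defines. The sketch below is a plain-text scan, not a YAML parser, and the demonstration recipe is hypothetical: its nested keys and values are placeholders and do not reflect the schema of any chart in this solution.

```shell
#!/bin/sh
# Print the top-level keys of a YAML recipe, that is, lines that start a
# mapping key in column one. This is a text scan, not a YAML parser.
recipe_fields() {
  sed -n 's/^\([A-Za-z_][A-Za-z0-9_]*\):.*$/\1/p' "$1"
}

# Hypothetical recipe used only to demonstrate the scan. The nested keys and
# values are placeholders, not a real chart schema.
demo="$(mktemp)"
cat > "$demo" <<'EOF'
image: IMAGE_URI
resources:
  gpus: 8
pre_script:
  - echo "preparing"
train:
  command: python3
EOF

recipe_fields "$demo"
rm -f "$demo"
```

For the actual structure of each field, consult the YAML recipes in the included tutorials and the default values of the chart you are using.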

System Architecture

The solution uses Terraform to deploy modular ML platform components on top of Amazon EKS. The hardware infrastructure is managed by Karpenter and Cluster Autoscaler. Nvidia GPUs or AWS AI Chips (AWS Trainium and AWS Inferentia) based machines are automatically managed by Karpenter, while CPU-only machines are automatically managed by Cluster Autoscaler.

The Kubeflow platform version that may be optionally deployed in this project is 1.9.2, and includes Kubeflow Notebooks, Kubeflow Tensorboard, Kubeflow Pipelines, Kubeflow Katib, and Kubeflow Central Dashboard.

The solution makes extensive use of Amazon EFS and Amazon FSx for Lustre shared file-systems to store the machine learning artifacts. Code, configuration, log files, and training checkpoints are stored on the EFS file-system. Data and pre-trained model checkpoints are stored on the FSx for Lustre file-system. The FSx for Lustre file-system is configured to automatically import and export content from and to the configured S3 bucket. Any data stored on FSx for Lustre is automatically backed up to your S3 bucket.

Contributing

See CONTRIBUTING for more information.

Security

See CONTRIBUTING for more information.

License

See the LICENSE file.
