# Amazon SageMaker Faster Autoscaling

To demonstrate the newer, faster SageMaker autoscaling features, we deploy Meta's **Llama3-8B-Instruct** model to an Amazon SageMaker real-time endpoint using the Text Generation Inference (TGI) Deep Learning Container (DLC).

To trigger autoscaling, we need to generate traffic to the endpoint. We use [LLMPerf](https://github.com/philschmid/llmperf) to generate this sample traffic.

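For orientation, a minimal deployment sketch using the SageMaker Python SDK is shown below. The endpoint name, instance type, TGI container version, and environment values are illustrative assumptions rather than the exact settings used in the notebooks:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Resolve the TGI (Text Generation Inference) DLC image for the current region.
# The container version is an assumption; pick one supported in your region.
image_uri = get_huggingface_llm_image_uri("huggingface", version="2.0.0")

model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct",
        "HUGGING_FACE_HUB_TOKEN": "<YOUR_HF_TOKEN>",  # gated model, see Prerequisites
        "SM_NUM_GPUS": "1",          # tensor-parallel degree
        "MAX_INPUT_LENGTH": "4096",  # illustrative token limits
        "MAX_TOTAL_TOKENS": "8192",
    },
)

# Deploy to a real-time endpoint; name and instance type are assumptions.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name="llama3-8b-instruct",
    container_startup_health_check_timeout=600,  # large models start slowly
)

print(predictor.predict({"inputs": "Hello, Llama!"}))
```
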
## Prerequisites

Before using this notebook, ensure you have an active HuggingFace access token and have accepted Meta's license agreement for the model.

- Step 1: Create a user access token in HuggingFace (HF). Refer [here](https://huggingface.co/docs/hub/security-tokens) for how to create HF tokens (a token-login sketch follows this list).
- Step 2: Log in to HuggingFace and navigate to the [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/tree/main) home page.
- Step 3: Accept the META LLAMA 3 COMMUNITY LICENSE AGREEMENT by following the instructions [here](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/tree/main).
- Step 4: Wait for the approval email from Meta (approval may take anywhere between 1 and 3 hours).

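Once the token is approved, you can authenticate your environment before deploying. A minimal sketch using the `huggingface_hub` library; the token value is a placeholder:

```python
from huggingface_hub import login

# Authenticate against the HuggingFace Hub so gated model files can be pulled.
# Replace the placeholder with your own token and keep it out of source control.
login(token="<YOUR_HF_TOKEN>")
```
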
---

>NOTE: LLMPerf spins up a Ray cluster to generate traffic to the Amazon SageMaker endpoint.\
>When running this on an Amazon SageMaker Notebook Instance, ensure you use an **m5.2xlarge** or larger instance type.
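
A typical LLMPerf load-generation run, wrapped in Python, might look like the sketch below. The endpoint name, region, and token counts are illustrative assumptions, and flag names may vary across LLMPerf versions and forks, so check the README of the repository you install:

```python
import os
import subprocess

# AWS credentials and region for the endpoint are taken from the environment.
env = {**os.environ, "AWS_REGION": "us-east-1"}  # region is an assumption

# token_benchmark_ray.py is LLMPerf's load-generation entry point; it spins up
# a local Ray cluster and fires concurrent requests at the endpoint.
subprocess.run(
    [
        "python", "token_benchmark_ray.py",
        "--model", "llama3-8b-instruct",  # SageMaker endpoint name (assumed)
        "--llm-api", "sagemaker",
        "--mean-input-tokens", "512",
        "--stddev-input-tokens", "128",
        "--mean-output-tokens", "128",
        "--stddev-output-tokens", "16",
        "--num-concurrent-requests", "25",
        "--max-num-completed-requests", "500",
        "--timeout", "600",
        "--results-dir", "results",
    ],
    env=env,
    check=True,
)
```
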
## Autoscaling on real-time endpoints

### Amazon SageMaker real-time endpoints

- For an Application Auto Scaling example on Amazon SageMaker real-time endpoints, refer to the [FasterAutoscaling-SME-Llama3-8B-AppAutoScaling.ipynb](./realtime-endpoints/FasterAutoscaling-SME-Llama3-8B-AppAutoScaling.ipynb) notebook (a minimal policy sketch follows this list).

- For a step scaling example on Amazon SageMaker real-time endpoints, refer to the [FasterAutoscaling-SME-Llama3-8B-StepScaling.ipynb](./realtime-endpoints/FasterAutoscaling-SME-Llama3-8B-StepScaling.ipynb) notebook.

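As orientation for what the Application Auto Scaling notebook sets up, a minimal target-tracking sketch with boto3 is shown below. The endpoint and variant names, capacities, and target value are illustrative assumptions, and the high-resolution predefined metric name comes from the faster-autoscaling launch, so verify it against the current SageMaker documentation:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Scalable target: the endpoint variant's instance count (names are assumptions).
resource_id = "endpoint/llama3-8b-instruct/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=2,
)

# Target tracking on the sub-minute concurrency metric enables faster scale-out
# than the classic per-minute invocation metrics.
autoscaling.put_scaling_policy(
    PolicyName="llama3-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,  # target concurrent requests per model (assumed)
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantConcurrentRequestsPerModelHighResolution",
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```
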
### Amazon SageMaker Inference Components

- For an autoscaling example using Amazon SageMaker inference components, refer to the [FasterAutoscaling-IC-Llama3-8B-AppAutoScaling.ipynb](./realtime-endpoints/FasterAutoscaling-IC-Llama3-8B-AppAutoScaling.ipynb) notebook (a sketch of the inference-component variant follows).

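For inference components, the same Application Auto Scaling calls apply with a different resource ID, scalable dimension, and metric. Roughly, with the component name and target value as assumptions and the metric name again to be verified against current docs:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Scale the number of copies of an inference component rather than instances.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId="inference-component/llama3-8b-ic",  # component name is assumed
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=1,
    MaxCapacity=2,
)

autoscaling.put_scaling_policy(
    PolicyName="llama3-ic-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId="inference-component/llama3-8b-ic",
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,  # target concurrent requests per copy (assumed)
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentConcurrentRequestsPerCopyHighResolution",
        },
    },
)
```
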
---

## References

- [LLMPerf](https://github.com/philschmid/llmperf)
- [Llama3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
- [Create HF Access Token](https://huggingface.co/docs/hub/security-tokens)
- [Amazon SageMaker Inference Components - blog post](https://aws.amazon.com/blogs/machine-learning/reduce-model-deployment-costs-by-50-on-average-using-sagemakers-latest-features/)