Commit ffec3d5

SageMaker FasterAutoscaling Llama3-8B TGI, real-time endpoints (#4712)

* SageMaker FasterAutoscaling Llama3-8B TGI, real-time endpoints
* Moved trigger autoscaling to shell script. Removed shell=True in subprocess.Popen

Authored by pchamart and Aditi2424
Co-authored-by: Aditi Sharma <[email protected]>

1 parent 0a6cd56 commit ffec3d5

File tree: 9 files changed, +2648 -0 lines changed

Lines changed: 41 additions & 0 deletions
# Amazon SageMaker Faster Autoscaling
To demonstrate the newer, faster SageMaker autoscaling features, we deploy Meta's **Llama3-8B-Instruct** model to an Amazon SageMaker real-time endpoint using the Text Generation Inference (TGI) Deep Learning Container (DLC).
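
A minimal deployment sketch with the SageMaker Python SDK is below; the TGI image version, instance type, and endpoint name are assumptions, so defer to the notebooks for the exact configuration:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Resolve the TGI (text-generation-inference) DLC image; version is an assumption.
image_uri = get_huggingface_llm_image_uri("huggingface", version="2.0.0")

model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct",
        "HUGGING_FACE_HUB_TOKEN": "<your-hf-token>",  # gated model; see Prerequisites
        "SM_NUM_GPUS": "1",                           # shard across available GPUs
        "MAX_INPUT_LENGTH": "4096",
        "MAX_TOTAL_TOKENS": "8192",
    },
)

# Real-time endpoint; instance type and endpoint name are illustrative only.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name="llama3-8b-tgi-ep",
    container_startup_health_check_timeout=600,
)
```

A quick `predictor.predict({"inputs": "..."})` call can serve as a smoke test before generating load.
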
To trigger autoscaling, we need to generate traffic to the endpoint. We use [LLMPerf](https://github.com/philschmid/llmperf) to generate sample traffic, as sketched below.
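
A sketch of what the traffic-generation step might look like, mirroring the commit's shell-script trigger (an argument list, no `shell=True`). The flags and the `sagemaker` API option follow the LLMPerf README, and the endpoint name and region are placeholders; verify both against your checkout of the repo:

```python
import os
import subprocess

# Run LLMPerf's benchmark script against the SageMaker endpoint.
# Assumes the llmperf repo is checked out locally and AWS credentials
# are available in the environment.
cmd = [
    "python", "token_benchmark_ray.py",
    "--model", "llama3-8b-tgi-ep",          # SageMaker endpoint name (placeholder)
    "--llm-api", "sagemaker",
    "--mean-input-tokens", "1024",
    "--mean-output-tokens", "256",
    "--num-concurrent-requests", "25",      # sustained load to trip scaling alarms
    "--max-num-completed-requests", "5000",
    "--timeout", "1800",
    "--results-dir", "results",
]

env = {**os.environ, "AWS_REGION": "us-east-1"}  # region is an assumption

# As in this commit: pass an argument list, not a shell string.
with subprocess.Popen(cmd, env=env, cwd="llmperf") as proc:
    proc.wait()
```
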
## Prerequisites
Before using this notebook, ensure you have an active Hugging Face access token and have accepted Meta's license agreement for the model. A quick access check is sketched after the steps below.
- Step 1: Create a user access token in Hugging Face (HF). See [here](https://huggingface.co/docs/hub/security-tokens) for how to create HF tokens.
- Step 2: Log in to Hugging Face and navigate to the [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/tree/main) home page.
- Step 3: Accept the META LLAMA 3 COMMUNITY LICENSE AGREEMENT by following the instructions [here](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/tree/main).
- Step 4: Wait for the approval email from Meta (approval may take anywhere between 1 and 3 hours).
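
Once the approval email arrives, you can confirm that your token can see the gated repo; a sketch using `huggingface_hub`, with the token read from an environment variable of your choosing:

```python
import os

from huggingface_hub import HfApi

# Raises a gated-repo/authorization error until Meta's approval goes through.
api = HfApi(token=os.environ["HF_TOKEN"])  # env var name is an assumption
info = api.model_info("meta-llama/Meta-Llama-3-8B-Instruct")
print(f"Access OK: {info.id}")
```
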
---
>NOTE: LLMPerf spins up a Ray cluster to generate traffic to the Amazon SageMaker endpoint.\
>When running this on an Amazon SageMaker Notebook Instance, ensure you use an **ml.m5.2xlarge** or larger instance type.
## Autoscaling on real-time endpoints
### Amazon SageMaker real-time endpoints
- For an Application Auto Scaling example on Amazon SageMaker real-time endpoints, refer to the [FasterAutoscaling-SME-Llama3-8B-AppAutoScaling.ipynb](./realtime-endpoints/FasterAutoscaling-SME-Llama3-8B-AppAutoScaling.ipynb) notebook; a minimal target-tracking sketch follows this list.
- For a step scaling example on Amazon SageMaker real-time endpoints, refer to the [FasterAutoscaling-SME-Llama3-8B-StepScaling.ipynb](./realtime-endpoints/FasterAutoscaling-SME-Llama3-8B-StepScaling.ipynb) notebook.
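
A minimal target-tracking sketch with boto3, assuming an existing endpoint named `llama3-8b-tgi-ep` with variant `AllTraffic`; the high-resolution predefined metric name comes from the faster-autoscaling announcement, so verify it against the AppAutoScaling notebook:

```python
import boto3

client = boto3.client("application-autoscaling")
resource_id = "endpoint/llama3-8b-tgi-ep/variant/AllTraffic"  # placeholder names

# Make the endpoint variant's instance count scalable.
client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=2,
)

# Target-tracking policy on the sub-minute concurrency metric.
client.put_scaling_policy(
    PolicyName="llama3-concurrency-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantConcurrentRequestsPerModelHighResolution"
        },
        "TargetValue": 5.0,   # target concurrent requests per instance (assumption)
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 300,
    },
)
```
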
### Amazon SageMaker Inference Components
- For an autoscaling example using Amazon SageMaker inference components, refer to the [FasterAutoscaling-IC-Llama3-8B-AppAutoScaling.ipynb](./realtime-endpoints/FasterAutoscaling-IC-Llama3-8B-AppAutoScaling.ipynb) notebook; the sketch below notes what changes for inference components.
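
For inference components, the scalable resource is the component's copy count rather than the variant's instance count; a sketch of the pieces that differ from the endpoint-variant case (the component name is hypothetical):

```python
import boto3

client = boto3.client("application-autoscaling")

# Inference components scale copies, not instances: note the ResourceId format
# and the ScalableDimension.
client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId="inference-component/llama3-ic",
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=1,
    MaxCapacity=2,
)
```
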
---
## References
- [LLMPerf](https://github.com/philschmid/llmperf)
- [Llama3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
- [Create HF Access Token](https://huggingface.co/docs/hub/security-tokens)
- [Amazon SageMaker Inference Components - blog post](https://aws.amazon.com/blogs/machine-learning/reduce-model-deployment-costs-by-50-on-average-using-sagemakers-latest-features/)
