Using the Hugging Face Accelerate or DeepSpeed engine for inference on SageMaker:
- Introductory LLM deployment notebook (Accelerate / DeepSpeed): https://github.com/aws/amazon-sagemaker-examples/blob/main/inference/generativeai/llm-workshop/lab1-deploy-llm/intro_to_llm_deployment.ipynb
- Compare with this DeepSpeed model-parallel example for GPT-J-6B: https://github.com/aws/amazon-sagemaker-examples/blob/main/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb
- Both are packaged and served with DJL Serving: https://github.com/deepjavalibrary/djl-serving
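
As a rough illustration of the two engines referenced above (a minimal sketch, not taken from the notebooks), the snippet below loads a causal LM either sharded across devices with Accelerate's `device_map="auto"`, or wrapped in DeepSpeed's inference engine for tensor-parallel kernels. The model name, GPU count, and generation settings are placeholder assumptions.

```python
# Sketch, assuming `transformers`, `accelerate`, and (optionally) `deepspeed` are installed
# and a GPU is available. Model id and tensor-parallel degree are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6B"  # placeholder checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Option 1: Hugging Face Accelerate -- shard layers across available GPUs/CPU automatically.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # requires `accelerate`; places layers across devices
)

# Option 2 (alternative): DeepSpeed inference engine over the same model.
# import deepspeed
# model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
# model = deepspeed.init_inference(
#     model,
#     mp_size=2,                       # assumed tensor-parallel degree (2 GPUs)
#     dtype=torch.float16,
#     replace_with_kernel_inject=True, # swap in DeepSpeed's fused inference kernels
# )

prompt = "Large language models are"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

In the SageMaker notebooks, the same choice is expressed in DJL Serving's model configuration (e.g. which engine to use and the tensor-parallel degree) rather than in inline Python, with DJL Serving hosting the resulting handler behind an endpoint.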