Added Triton deployment instructions to documentation #1116

Deploying a Torch-TensorRT model (to Triton)
============================================

Optimization and deployment go hand in hand in any discussion of Machine
Learning infrastructure. For a Torch-TensorRT user, network-level optimization
to get the maximum performance is already an area of expertise.

However, serving this optimized model comes with its own set of considerations
and challenges, like building an infrastructure to support concurrent model
executions, supporting clients over HTTP or gRPC, and more.

The `Triton Inference Server <https://github.com/triton-inference-server/server>`__
solves these challenges and more. Let's walk through, step by step, the process of
optimizing a model with Torch-TensorRT, deploying it on Triton Inference
Server, and building a client to query the model.

Step 1: Optimize your model with Torch-TensorRT
-----------------------------------------------

Most Torch-TensorRT users will be familiar with this step. For the purpose of
this demonstration, we will be using a ResNet50 model from Torch Hub.

Let's first pull the NGC PyTorch Docker container. You may need to create
an account and get the API key from `here <https://ngc.nvidia.com/setup/>`__.
Sign up and log in with your key (follow the instructions
`here <https://ngc.nvidia.com/setup/api-key>`__ after signing up).

::

    # <xx.xx> is the yy.mm publishing tag for NVIDIA's PyTorch
    # container; e.g. 22.04

    docker run -it --gpus all -v /path/to/folder:/resnet50_eg nvcr.io/nvidia/pytorch:<xx.xx>-py3

Once inside the container, we can proceed to download a ResNet model from
Torch Hub and optimize it with Torch-TensorRT.

::

    import torch
    import torch_tensorrt
    torch.hub._validate_not_a_forked_repo = lambda a, b, c: True

    # load model
    model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True).eval().to("cuda")

    # Compile with Torch-TensorRT
    trt_model = torch_tensorrt.compile(model,
        inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
        enabled_precisions={torch.half}  # Run with FP16
    )

    # Save the model
    torch.jit.save(trt_model, "model.pt")
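
Optionally, you can sanity-check the saved module before moving on. Below is a
minimal sketch (not part of the compilation step itself); it assumes the FP32
``(1, 3, 224, 224)`` input shape used at compile time, and the variable names are
only illustrative.

::

    import torch
    import torch_tensorrt  # importing registers the runtime needed to deserialize the embedded TensorRT engine

    # reload the TorchScript module we just saved and run a dummy input on the GPU
    reloaded = torch.jit.load("model.pt")
    dummy = torch.randn(1, 3, 224, 224, device="cuda")
    print(reloaded(dummy).shape)  # ResNet50 logits; expected shape [1, 1000]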

The next step in the process is to set up a Triton Inference Server.

Step 2: Set Up Triton Inference Server
--------------------------------------

If you are new to the Triton Inference Server and want to learn more, we
highly recommend checking out our `GitHub
Repository <https://github.com/triton-inference-server>`__.

To use Triton, we need to make a model repository. A model repository, as the
name suggests, is a repository of the models the Inference Server hosts. While
Triton can serve models from multiple repositories, in this example, we will
discuss the simplest possible form of the model repository.

The structure of this repository should look something like this:

::

    model_repository
    |
    +-- resnet50
        |
        +-- config.pbtxt
        +-- 1
            |
            +-- model.pt

There are two files that Triton requires to serve the model: the model itself
and a model configuration file which is typically provided in ``config.pbtxt``.
For the model we prepared in step 1, the following configuration can be used:

::

    name: "resnet50"
    platform: "pytorch_libtorch"
    max_batch_size : 0
    input [
      {
        name: "input__0"
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]
        reshape { shape: [ 1, 3, 224, 224 ] }
      }
    ]
    output [
      {
        name: "output__0"
        data_type: TYPE_FP32
        dims: [ 1, 1000, 1, 1 ]
        reshape { shape: [ 1, 1000 ] }
      }
    ]

The ``config.pbtxt`` file is used to describe the exact model configuration,
with details like the names and shapes of the input and output layer(s),
datatypes, scheduling and batching details, and more. If you are new to Triton,
we highly encourage you to check out this `section of our
documentation <https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md>`__
for more details.
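
If you prefer to script the layout above rather than create it by hand, a minimal
sketch is shown below. The paths are assumptions: it expects ``model.pt`` from
Step 1 and a ``config.pbtxt`` file containing the configuration shown above to sit
in the current directory.

::

    import os
    import shutil

    # create the version directory and copy the model and its configuration into place
    os.makedirs("model_repository/resnet50/1", exist_ok=True)
    shutil.copy("model.pt", "model_repository/resnet50/1/model.pt")
    shutil.copy("config.pbtxt", "model_repository/resnet50/config.pbtxt")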

With the model repository set up, we can proceed to launch the Triton server
with the docker command below.

::

    # Make sure that the TensorRT version in the Triton container
    # and the TensorRT version in the environment used to optimize the model
    # are the same.

    docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /full/path/to/docs/examples/model_repository:/models nvcr.io/nvidia/tritonserver:<xx.yy>-py3 tritonserver --model-repository=/models

This should spin up a Triton Inference Server. The next step is to build a simple
HTTP client to query the server.
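
Before writing the client, you can optionally confirm that the server started and
loaded the model. A minimal sketch using the ``tritonclient`` package installed in
Step 3 below; the URL assumes the default HTTP port mapping from the command above.

::

    import tritonclient.http as httpclient

    # connect over the HTTP endpoint exposed on port 8000 above
    client = httpclient.InferenceServerClient(url="localhost:8000")
    print(client.is_server_ready())           # True once the server is up
    print(client.is_model_ready("resnet50"))  # True once the model has loaded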

Step 3: Building a Triton Client to Query the Server
-----------------------------------------------------

Before proceeding, make sure to have a sample image on hand. If you don't
have one, download an example image to test inference. In this section, we
will be going over a very basic client. For a variety of more fleshed-out
examples, refer to the `Triton Client Repository <https://github.com/triton-inference-server/client/tree/main/src/python/examples>`__.

::

    wget -O img1.jpg "https://www.hakaimagazine.com/wp-content/uploads/header-gulf-birds.jpg"

We then need to install the dependencies for building a Python client. These will
change from client to client. For a full list of all languages supported by Triton,
please refer to `Triton's client repository <https://github.com/triton-inference-server/client>`__.

::

    pip install torchvision
    pip install attrdict
    pip install nvidia-pyindex
    pip install tritonclient[all]

Let's jump into the client. We first write a small preprocessing function to
resize and normalize the query image.

::

    import numpy as np
    from torchvision import transforms
    from PIL import Image
    import tritonclient.http as httpclient
    from tritonclient.utils import triton_to_np_dtype

    # preprocessing function
    def rn50_preprocess(img_path="img1.jpg"):
        img = Image.open(img_path)
        preprocess = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ])
        return preprocess(img).numpy()

    transformed_img = rn50_preprocess()

Building a client requires three basic steps. First, we set up a connection
with the Triton Inference Server.

::

    # Setting up client
    triton_client = httpclient.InferenceServerClient(url="localhost:8000")

Second, we specify the names of the input and output layer(s) of our model.

::

    test_input = httpclient.InferInput("input__0", transformed_img.shape, datatype="FP32")
    test_input.set_data_from_numpy(transformed_img, binary_data=True)

    test_output = httpclient.InferRequestedOutput("output__0", binary_data=True, class_count=1000)

Finally, we send an inference request to the Triton Inference Server.

::

    # Querying the server
    results = triton_client.infer(model_name="resnet50", inputs=[test_input], outputs=[test_output])
    test_output_fin = results.as_numpy('output__0')
    print(test_output_fin[:5])

The output should look like the following:

::

    [b'12.468750:90' b'11.523438:92' b'9.664062:14' b'8.429688:136'
     b'8.234375:11']

The output format here is ``<confidence_score>:<classification_index>``.
To learn how to map these to the label names and more, refer to our
`documentation <https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_classification.md>`__.
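
As a quick illustration of that format, here is a minimal sketch that splits each
returned entry into its confidence score and class index. Mapping the indices to
human-readable names would additionally require an ImageNet label file, which this
tutorial does not include.

::

    # each entry is a byte string of the form b"<confidence_score>:<classification_index>"
    for entry in test_output_fin[:5]:
        score, class_idx = entry.decode("utf-8").split(":")[:2]
        print(f"class index {class_idx} with score {float(score):.3f}")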