# Profiling
## Introduction
To improve the performance of a model, we should evaluate the latency of each operator (op) during inference.
NLP Toolkit supports profiling the latency of each operator.
## Usage
### Example
Run with Python:
```shell
ENGINE_PROFILING=1 python run_executor.py --input_model=./model_and_tokenizer/int8-model.onnx --mode=performance --batch_size=8 --seq_len=128
```
or run with C++:
```shell
export ENGINE_PROFILING=1
<NLP_Toolkit_folder>/nlp_toolkit/backends/neural_engine/bin/neural_engine --batch_size=<batch_size> --iterations=<iterations> --w=<warmup> --seq_len=128 --config=./ir/conf.yaml --weight=./ir/model.bin
```

## Result
We will get a profiling form like the one below, divided into three parts.

#### Part 1
- Arguments for sparsity, including weight shape, sparse ratio, and target performance ratio.
- Users can set the perf ratio parameters independently to calculate sparse op performance, as shown in the sketch after the table below.
| Arguments | Weight shape | 90% 4x1 perf ratio | 80% 4x1 perf ratio | 70% 4x1 perf ratio |
| -------- | :-----: | :----: | :----: | :----: |
| value | 256x256 | **4 (optional)** | **2.5 (optional)** | **2 (optional)** |
| value | 256x1024 | **4.5 (optional)** | **3 (optional)** | **2.5 (optional)** |
| value | 1024x256 | **5 (optional)** | **3.5 (optional)** | **3 (optional)** |
| description | Shape of weight for "matmul" or "innerproduct" | The op's sparse ratio is 90%, and the perf ratio is "dense op latency" / "sparse op latency", representing the performance improvement of the op after sparsification. This parameter can be set by the user. | Same as 90% 4x1 perf ratio | Same as 90% 4x1 perf ratio |
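
To make the arithmetic concrete, here is a minimal sketch of how these perf ratios could be used to estimate sparse op latency; the lookup table and function below are illustrative, not part of the toolkit's API:

```python
# Illustrative only: how a Part 1 perf ratio turns a measured dense latency
# into an estimated sparse latency. Values mirror the example table above.
PERF_RATIOS = {
    # weight shape -> {target sparse ratio -> perf ratio}
    "256x256":  {"90%": 4.0, "80%": 2.5, "70%": 2.0},
    "256x1024": {"90%": 4.5, "80%": 3.0, "70%": 2.5},
    "1024x256": {"90%": 5.0, "80%": 3.5, "70%": 3.0},
}

def estimate_sparse_latency_ms(dense_latency_ms, weight_shape, target_ratio):
    """Target sparse latency = dense op latency / perf ratio."""
    return dense_latency_ms / PERF_RATIOS[weight_shape][target_ratio]

# The InnerProduct op from Part 2: 0.075 ms dense, 256x256 weight, 70% target.
print(estimate_sparse_latency_ms(0.075, "256x256", "70%"))  # 0.0375
```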
#### Part 2
- Profiling of every operator, including operator type, input tensors, output tensors, and latency. Let's take InnerProduct as an example.
- In this form, the sparse op performance is calculated automatically from the customized sparse ratio; a sketch of the calculation follows the table below.
| Argument | Value | Additional description |
| :--------: | :-----: | :----: |
| operator type | InnerProduct | None |
| post op | gelu_tanh | Multiple ops are fused into one op to improve inference performance |
| operator name | Add_37 | None |
| input tensor name | 116:0;641:0;bert.encoder.layer.0.attention.self.key.bias:0 | The names of the input tensors (may include multiple inputs) |
| input shape | 1024x256;256x256;256 | The shapes of the input tensors (may include multiple inputs) |
| input dtype | fp32;fp32;fp32 | None |
| output tensor name | Add_37:0 | None |
| output shape | 1024x256 | The shape of the output tensor |
| output dtype | fp32 | None |
| weight shape | 256x256 | Shape of weight for "matmul" or "innerproduct" |
| weight sparse ratio | 0.00% | The current sparse ratio of the weight |
| sparse support | TRUE | Whether the op supports sparsity |
| operator latency (ms) | 0.075 | The latency before sparsification |
| **aim to weight sparse ratio** | **70% (optional)** | **Target weight sparse ratio; options: 90%, 80%, 70%, etc.** |
| perf ratio | 2 | Looked up automatically from the Part 1 form |
| aim to sparse latency (ms) | 0.0375 | Target sparse latency = "operator latency (0.075)" / "perf ratio (2)" (calculated automatically) |
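
For readers who want to redo this calculation themselves, a hedged sketch using pandas is shown below; the file name `profiling.csv` and the column names are assumptions based on the table above, so check them against your actual export:

```python
import pandas as pd

# Assumed file/column names; verify against the actual exported CSV.
df = pd.read_csv("profiling.csv")

perf_ratio = 2.0  # e.g. the 70% 4x1 perf ratio for a 256x256 weight (Part 1)

# Estimate post-sparsification latency only for ops that support sparsity.
mask = df["sparse support"] == True
df.loc[mask, "aim to sparse latency (ms)"] = (
    df.loc[mask, "operator latency (ms)"] / perf_ratio
)
print(df.loc[mask, ["operator name", "operator latency (ms)",
                    "aim to sparse latency (ms)"]])
```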

#### Part 3
- Performance comparison of the dense and sparse networks.
|Arguments|Value|Description|
|-----------|:--------:|:--------:|
|total latency (ms)|4.512|The latency of one inference pass of the entire network before sparsification|
|total aim to sparse latency (ms)|2.185|The latency of one inference pass of the entire network after sparsification|
|sparse support latency (ms)|3.127|The total latency of all sparsity-supporting operators before sparsification|
|aim to sparse support latency (ms)|0.801|The total latency of all sparsity-supporting operators after sparsification|
|sparse support latency ratio|0.693|The ratio of the sparsity-supporting operators' latency (before sparsification) to the total network latency before sparsification|
|aim to sparse support latency ratio|0.366|The ratio of the sparsity-supporting operators' latency (after sparsification) to the total network latency after sparsification|
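
As a sanity check, the example totals above are consistent with simply swapping each sparsity-supporting op's dense latency for its estimated sparse latency. The relationship below is inferred from the example numbers rather than documented behavior:

```python
# Reproducing the Part 3 example numbers (illustrative, values from the table).
total_dense = 4.512         # total latency (ms) before sparsification
sparse_support = 3.127      # dense latency of the sparsity-supporting ops
aim_sparse_support = 0.801  # estimated latency of those ops after sparsification

# Ops without sparse support keep their dense latency.
total_sparse = total_dense - sparse_support + aim_sparse_support
print(round(total_sparse, 3))                        # 2.186 (~2.185 in the table)
print(round(sparse_support / total_dense, 3))        # 0.693
print(round(aim_sparse_support / total_sparse, 3))   # 0.366
```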

## Cautions
- The profiling form is generated in CSV format. You can modify the Part 1 content to explore the desired performance, but after modification you need to save the file in "xlsx" format.
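
A minimal sketch of that conversion with pandas (assuming `openpyxl` is installed; the file names are placeholders, not the toolkit's actual output names):

```python
import pandas as pd

# Placeholder file names; substitute your actual profiling output.
df = pd.read_csv("profiling.csv")
# ... adjust the Part 1 perf ratios here or in a spreadsheet editor ...
df.to_excel("profiling.xlsx", index=False)  # writing .xlsx requires openpyxl
```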