Commit 59232b9

add a profiling's README (#195)
1 parent 9edbd4a commit 59232b9

2 files changed: +67 -1 lines changed

docs/profiling.md

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
# Profiling

## Introduction

To improve the performance of the model, we should evaluate the performance of each operator (op) during inference.

NLP Toolkit supports profiling the latency of each operator.
## Usage

### Example

Run with Python:

```shell
ENGINE_PROFILING=1 python run_executor.py --input_model=./model_and_tokenizer/int8-model.onnx --mode=performance --batch_size=8 --seq_len=128
```

or run the C++ executable:

```shell
export ENGINE_PROFILING=1
<NLP_Toolkit_folder>/nlp_toolkit/backends/neural_engine/bin/neural_engine --batch_size=<batch_size> --iterations=<iterations> --w=<warmup> --seq_len=128 --config=./ir/conf.yaml --weight=./ir/model.bin
```

## Result

Profiling produces a form such as the one below, divided into three parts.

#### Part 1

- Arguments for sparsity, including the weight shape, sparse ratio, and target performance ratio.
- Users can set the perf ratio parameter independently to estimate sparse op performance; see the sketch after the table below.

| Arguments | Weight shape | 90% 4x1 perf ratio | 80% 4x1 perf ratio | 70% 4x1 perf ratio |
| -------- | :-----: | :----: | :----: | :----: |
| value | 256x256 | **4 (optional)** | **2.5 (optional)** | **2 (optional)** |
| value | 256x1024 | **4.5 (optional)** | **3 (optional)** | **2.5 (optional)** |
| value | 1024x256 | **5 (optional)** | **3.5 (optional)** | **3 (optional)** |
| description | Shape of the weight for "matmul" or "innerproduct" | The op's sparse ratio is 90%, and the perf ratio is "dense op latency" / "sparse op latency", representing the performance improvement of the op after sparsification. This parameter can be set by the user. | Same as 90% 4x1 perf ratio | Same as 90% 4x1 perf ratio |
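
As a minimal sketch of how a chosen perf ratio translates into an expected sparse latency (the numbers are taken from the example tables in this document; `bc` is assumed to be available):

```shell
# Estimate the target sparse latency of one op from its dense latency
# and the perf ratio chosen in the Part 1 form.
dense_latency=0.075   # operator latency (ms) before sparsification
perf_ratio=2          # e.g. 70% 4x1 perf ratio for a 256x256 weight
echo "scale=4; $dense_latency / $perf_ratio" | bc   # prints .0375
```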
#### Part 2

- Profiling of every operator, including its type, input tensors, output tensors, and latency. We take "InnerProduct" as an example.
- From this form, the sparse op performance can be calculated automatically for a customized sparse ratio. A sketch of querying the form follows the table below.

| Argument | Value | Additional description |
| :--------: | :-----: | :----: |
| operator type | InnerProduct | None |
| post op | gelu_tanh | To improve inference performance, multiple ops are fused and executed as one op |
| operator name | Add_37 | None |
| input tensor name | 116:0;641:0;bert.encoder.layer.0.attention.self.key.bias:0 | Names of the input tensors (may include multiple inputs) |
| input shape | 1024x256;256x256;256 | Shapes of the input tensors (may include multiple inputs) |
| input dtype | fp32;fp32;fp32 | None |
| output tensor name | Add_37:0 | None |
| output shape | 1024x256 | Shape of the output tensor |
| output dtype | fp32 | None |
| weight shape | 256x256 | Shape of the weight for "matmul" or "innerproduct" |
| weight sparse ratio | 0.00% | The current sparse ratio of the weight |
| sparse support | TRUE | Whether the op supports sparsity |
| operator latency (ms) | 0.075 | The latency before sparsification |
| **aim to weight sparse ratio** | **70% (optional)** | **Target weight sparse ratio; options: 90%, 80%, 70%, etc.** |
| perf ratio | 2 | Looked up automatically from the Part 1 form |
| aim to sparse latency (ms) | 0.0375 | Target sparse latency = "operator latency (0.075)" / "perf ratio (2)" (calculated automatically) |
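
Because the form is a plain CSV whose columns follow the header written by `Model::Profiling` (see the model.cpp diff below), it can also be queried directly from the shell. A minimal sketch; the file name `profiling.csv` is a placeholder for the generated file:

```shell
# Sum "operator latency (ms)" (column 13 of the CSV header) over all
# InnerProduct rows to get their total pre-sparsification latency.
awk -F',' '$1 == "InnerProduct" { total += $13 } END { printf "InnerProduct latency: %.3f ms\n", total }' profiling.csv
```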
#### Part 3

- Performance comparison of the dense and sparse networks. The ratios can be reproduced as shown after the table below.

| Arguments | Value | Description |
| ----------- | :--------: | :--------: |
| total latency (ms) | 4.512 | Latency of one inference pass of the entire network before sparsification |
| total aim to sparse latency (ms) | 2.185 | Latency of one inference pass of the entire network after sparsification |
| sparse support latency (ms) | 3.127 | Latency of one inference pass of all sparse-capable operators before sparsification |
| aim to sparse support latency (ms) | 0.801 | Latency of one inference pass of all sparse-capable operators after sparsification |
| sparse support latency ratio | 0.693 | Ratio of the sparse-capable operators' latency to the entire network's latency, before sparsification |
| aim to sparse support latency ratio | 0.366 | Ratio of the sparse-capable operators' latency to the entire network's latency, after sparsification |
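
Both ratios follow directly from the values above (a quick check with `bc`):

```shell
# sparse support latency ratio: before sparsification
echo "scale=3; 3.127 / 4.512" | bc   # prints .693
# aim to sparse support latency ratio: after sparsification
echo "scale=3; 0.801 / 2.185" | bc   # prints .366
```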
## Cautions

- The form is generated in CSV format. You can modify the Part 1 content to model the desired performance, but after modification the form needs to be saved in "xlsx" format; one way to convert it is sketched below.
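
For example, assuming LibreOffice is installed and `profiling.csv` stands in for the generated file, the conversion can be done from the command line:

```shell
# Convert the edited profiling CSV to xlsx with LibreOffice in headless mode
# (profiling.csv is a placeholder name for the generated form).
libreoffice --headless --convert-to xlsx profiling.csv
```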

nlp_toolkit/backends/neural_engine/executor/src/model.cpp

Lines changed: 1 addition & 1 deletion
```diff
@@ -344,7 +344,7 @@ void Model::Profiling(char* space_name, char* count_name, char* mtx_name, int wa
   fprintf(fp, "%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s\n", "operator type", "post op",
           "operator name", "input tensor name", "input shape", "input dtype", "output tensor name", "output shape",
           "output dtype", "weight shape", "weight sparse ratio", "sparse support", "operator latency (ms)",
-          "aim to weight sparse ratio", "sparse kernel perf improve", "aim to sparse latency(ms)");
+          "aim to weight sparse ratio", "sparse kernel pref ratio", "aim to sparse latency(ms)");
   float total_latency = 0;
   float enable_sparse_latency = 0.;
   // skip input and output node
```
