diff --git a/README.md b/README.md
index ab9c037a8e4..91690432918 100644
--- a/README.md
+++ b/README.md
@@ -116,54 +116,58 @@ quantized_model = fit(model=float_model, conf=static_quant_conf, calib_dataloade
-        Architecture
-        Workflow
-        APIs
-        LLMs Recipes
-        Examples
+        Architecture
+        Workflow
+        APIs
+        LLMs Recipes
+        Examples
-        Python-based APIs
+        PyTorch Extension APIs
-        Quantization
-        Advanced Mixed Precision
-        Pruning (Sparsity)
-        Distillation
+        Overview
+        Static Quantization
+        Dynamic Quantization
+        Smooth Quantization
-        Orchestration
-        Benchmarking
-        Distributed Compression
-        Model Export
+        Weight-Only Quantization
+        MX Quantization
+        Mixed Precision
-        Advanced Topics
+        Tensorflow Extension APIs
-        Adaptor
-        Strategy
-        Distillation for Quantization
-        SmoothQuant
+        Overview
+        Static Quantization
+        Smooth Quantization
-        Weight-Only Quantization (INT8/INT4/FP4/NF4)
-        FP8 Quantization
-        Layer-Wise Quantization
+        Other Modules
+        Auto Tune
+        Benchmark
-> **Note**:
-> Further documentations can be found at [User Guide](https://github.com/intel/neural-compressor/blob/master/docs/source/user_guide.md).
+> **Note**:
+> Starting from the 3.0 release, we recommend using the 3.X API. Compression techniques applied during training, such as QAT, pruning, and distillation, are currently only available in the [2.X API](https://github.com/intel/neural-compressor/blob/master/docs/source/2x_user_guide.md).
 
 ## Selected Publications/Events
 * Blog by Intel: [Neural Compressor: Boosting AI Model Efficiency](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Neural-Compressor-Boosting-AI-Model-Efficiency/post/1604740) (June 2024)
diff --git a/docs/3x/PT_MixPrecision.md b/docs/3x/PT_MixedPrecision.md
similarity index 100%
rename from docs/3x/PT_MixPrecision.md
rename to docs/3x/PT_MixedPrecision.md
diff --git a/docs/3x/PyTorch.md b/docs/3x/PyTorch.md
index cafc306d9be..a3004f6bcfb 100644
--- a/docs/3x/PyTorch.md
+++ b/docs/3x/PyTorch.md
@@ -194,6 +194,21 @@ def load(output_dir="./saved_results", model=None):
     ✔
     link
+    MX Quantization
+    Microscaling Data Formats for Deep Learning
+    PyTorch eager mode
+    ✔
+    link
+    Mixed Precision
+    Mixed precision
+    PyTorch eager mode
+    ✔
+    link
     Quantization Aware Training
     Quantization Aware Training
diff --git a/docs/3x/design.md b/docs/3x/design.md
new file mode 100644
index 00000000000..73d4d2e5568
--- /dev/null
+++ b/docs/3x/design.md
@@ -0,0 +1,16 @@
+Design
+=====
+
+## Architecture
+
+Architecture
+
+## Workflows
+
+Intel® Neural Compressor provides two workflows: Quantization and Auto-tune.
+
+Workflow
diff --git a/docs/3x/get_started.md b/docs/3x/get_started.md
new file mode 100644
index 00000000000..76a43c60924
--- /dev/null
+++ b/docs/3x/get_started.md
@@ -0,0 +1,88 @@
+# Getting Started
+
+1. [Quick Samples](#quick-samples)
+
+2. [Feature Matrix](#feature-matrix)
+
+## Quick Samples
+
+```shell
+# Install Intel Neural Compressor
+pip install neural-compressor-pt
+```
+```python
+from transformers import AutoModelForCausalLM
+from neural_compressor.torch.quantization import RTNConfig, prepare, convert
+
+user_model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")
+quant_config = RTNConfig()
+prepared_model = prepare(model=user_model, quant_config=quant_config)
+quantized_model = convert(model=prepared_model)
+```
+
+## Feature Matrix
+Intel Neural Compressor 3.X extends the PyTorch and TensorFlow APIs to support compression techniques.
+The table below provides a quick overview of the APIs available in Intel Neural Compressor 3.X.
+Intel Neural Compressor 3.X mainly focuses on quantization-related features, especially on algorithms that benefit LLM accuracy and inference.
+It also provides some common modules across different frameworks. For example, Auto-tune supports accuracy-driven quantization and mixed precision, and Benchmark measures the performance of multiple instances of the quantized model.
+| Section | Contents |
+|---------|----------|
+| **Overview** | Architecture, Workflow, APIs, LLMs Recipes, Examples |
+| **PyTorch Extension APIs** | Overview, Static Quantization, Dynamic Quantization, Smooth Quantization, Weight-Only Quantization, MX Quantization, Mixed Precision |
+| **Tensorflow Extension APIs** | Overview, Static Quantization, Smooth Quantization |
+| **Other Modules** | Auto Tune, Benchmark |
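+
+As a short illustration of how the Auto Tune module drives accuracy-aware quantization, the following is a minimal sketch assuming the `autotune` and `TuningConfig` entry points of `neural_compressor.torch.quantization`; the `eval_fn` body and the searched RTN options are illustrative placeholders to adapt to your model and validation set, not a definitive recipe.
+
+```python
+from transformers import AutoModelForCausalLM
+from neural_compressor.torch.quantization import RTNConfig, TuningConfig, autotune
+
+user_model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")
+
+
+def eval_fn(model) -> float:
+    # Placeholder: return the accuracy of `model` measured on your validation set.
+    return 1.0
+
+
+# Search a small set of RTN weight-only configurations and stop at the first
+# trial whose accuracy drop stays within the tolerable loss (1% here).
+tune_config = TuningConfig(
+    config_set=[RTNConfig(use_sym=[False, True], group_size=[32, 128])],
+    tolerable_loss=0.01,
+)
+best_model = autotune(model=user_model, tune_config=tune_config, eval_fn=eval_fn)
+```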
+
+> **Note**:
+> Starting from the 3.0 release, we recommend using the 3.X API. Compression techniques applied during training, such as QAT, pruning, and distillation, are currently only available in the [2.X API](https://github.com/intel/neural-compressor/blob/master/docs/source/2x_user_guide.md).
diff --git a/docs/3x/imgs/INC3_WORKFLOW.png b/docs/3x/imgs/workflow.png
similarity index 100%
rename from docs/3x/imgs/INC3_WORKFLOW.png
rename to docs/3x/imgs/workflow.png
diff --git a/docs/3x/llm_recipes.md b/docs/3x/llm_recipes.md
new file mode 100644
index 00000000000..e69de29bb2d
diff --git a/docs/source/user_guide.md b/docs/source/2x_user_guide.md
similarity index 96%
rename from docs/source/user_guide.md
rename to docs/source/2x_user_guide.md
index 6272eef2ecb..941e80d6a39 100644
--- a/docs/source/user_guide.md
+++ b/docs/source/2x_user_guide.md
@@ -1,10 +1,10 @@
-User Guide
+2.X API User Guide
 ===========================
 Intel® Neural Compressor aims to provide popular model compression techniques such as quantization, pruning (sparsity), distillation, and neural architecture search to help the user optimize their model. The below documents could help you to get familiar with concepts and modules in Intel® Neural Compressor. Learn how to utilize the APIs in Intel® Neural Compressor to conduct quantization, pruning (sparsity), distillation, and neural architecture search on mainstream frameworks.
 
 ## Overview
-This part helps user to get a quick understand about design structure and workflow of Intel® Neural Compressor. We provided broad examples to help users get started.
+This part helps users get a quick understanding of the design structure and workflow of the 2.X Intel® Neural Compressor. We provide broad examples to help users get started.
@@ -53,7 +53,7 @@ In 2.X API, it's very important to create the `DataLoader` and `Metrics` for you
 ## Advanced Topics
-This part provides the advanced topics that help user dive deep into Intel® Neural Compressor.
+This part provides advanced topics that help users dive deep into the Intel® Neural Compressor 2.X API.
diff --git a/docs/source/NAS.md b/docs/source/NAS.md
deleted file mode 100644
index e922b971c97..00000000000
--- a/docs/source/NAS.md
+++ /dev/null
@@ -1,86 +0,0 @@
-# Neural Architecture Search
-
-1. [Introduction](#introduction)
-
-   1.1. [Basic NAS](#basic-nas)
-
-   1.2. [Dynamic NAS](#dynamic-nas)
-
-2. [NAS Support Matrix](#nas-support-matrix)
-3. [Get Started with NAS API](#get-started-with-nas-api)
-
-   3.1. [Basic Usage](#basic-usage)
-
-   3.2. [Advanced Usage (Custom NAS)](#advanced-usage-custom-nas)
-
-4. [Examples](#examples)
-
-## Introduction
-Neural Architecture Search (NAS) is the process of automating the design of artificial neural networks (ANN) architecture. NAS has been used to design networks that are on par with or outperform hand-designed architectures. Intel® Neural Compressor has supported two different NAS methods: Basic NAS and Dynamic NAS.
-
-### Basic NAS
-Our Basic NAS method leverages a specific search algorithm from built-in search algorithms (grid search, random search, and Bayesian optimization are supported in Intel® Neural Compressor now) or user-defined search algorithms to propose the model architecture based on the given search space, then performs the train evaluation process to evaluate the potential of the proposed model architecture, after several iterations of such procedure, best-performing model architectures which lie in Pareto front will be returned.
-
-### Dynamic NAS
-Dynamic Neural Architecture Search (DyNAS) is a super-network-based NAS approach that uses the metric predictors for predicting the metrics of the model architecture, it is >4x more sample efficient than typical one-shot predictor-based NAS approaches.
-
-The flow of the DyNAS approach is shown in the following figure. In the first phase of the search, a small population of sub-networks is randomly sampled from the super-network and evaluated (validation measurement) to provide the initial training set for the inner predictor loop. After the predictors are trained, a multi-objective evolutionary search is performed in the predictor objective space. After this extensive search is performed, the best-performing sub-network configurations are selected to be the next iteration's validation population. The cycle continues until the search concludes when the user-defined evaluation count is met.
-
-![DyNAS Workflow](./imgs/dynas.png)
-
-## NAS Support Matrix
-
-|NAS Algorithm |PyTorch |TensorFlow |
-|------------------|:--------:|:---------:|
-|Basic NAS |✔ |Not supported yet|
-|Dynamic NAS |✔ |Not supported yet|
-
-## Get Started with NAS API
-
-### Basic Usage
-
-#### 1. Python code + YAML
-
-Simplest launcher code if NAS configuration is defined in user-defined yaml.
-
-```python
-from neural_compressor.experimental import NAS
-
-agent = NAS("/path/to/user/yaml")
-results = agent.search()
-```
-
-#### 2. Python code only
-
-NAS class also support `NASConfig` class as it's argument.
-
-```python
-from neural_compressor.conf.config import NASConfig
-from neural_compressor.experimental import NAS
-
-config = NASConfig(approach="dynas", search_algorithm="nsga2")
-config.dynas.supernet = "ofa_mbv3_d234_e346_k357_w1.2"
-config.dynas.metrics = ["acc", "macs"]
-config.dynas.population = 50
-config.dynas.num_evals = 250
-config.dynas.results_csv_path = "search_results.csv"
-config.dynas.batch_size = 64
-config.dynas.dataset_path = "/datasets/imagenet-ilsvrc2012"  # example
-agent = NAS(config)
-results = agent.search()
-```
-
-### Advanced Usage (Custom NAS)
-
-Intel® Neural Compressor NAS API is defined under `neural_compressor.experimental.nas`, which takes a user defined yaml file or a [NASConfig](../../neural_compressor/conf/config.py#NASConfig) object as input. The user defined yaml or the [NASConfig](../../neural_compressor/conf/config.py#NASConfig) object defines necessary configuration of the NAS process. The [NAS](../../neural_compressor/experimental/nas/nas.py#NAS) class aims to create an object according to the defined NAS approach in the configuration, please note this NAS approach should be registered in the Intel® Neural Compressor.
-
-Currently, Intel® Neural Compressor supported two built-in NAS methods: [Basic NAS](../../neural_compressor/experimental/nas/basic_nas.py#BasicNAS) and [Dynamic NAS](../../neural_compressor/experimental/nas/dynas.py#DyNAS). Both methods are inherited from a base class called [NASBase](../../neural_compressor/experimental/nas/nas.py#NASBase). User can also customize their own NAS approach in Intel® Neural Compressor just by decorating their NAS approach class with function [nas_registry](../../neural_compressor/experimental/nas/nas_utils.py#nas_registry) as well as following the API in [NASBase](../../neural_compressor/experimental/nas/nas.py#NASBase), like the way used in the two built-in NAS methods.
-
-## Examples
-
-Following examples are supported in Intel® Neural Compressor:
-
-- DyNAS MobileNetV3 supernet Example:
-  - [DyNAS MobileNetV3 supernet Example](../../examples/notebook/dynas/MobileNetV3_Supernet_NAS.ipynb): DyNAS with MobileNetV3 supernet on ImageNet dataset.
-- DyNAS Transformer LT supernet Example:
-  - [DyNAS Transformer LT supernet Example](../../examples/notebook/dynas/Transformer_LT_Supernet_NAS.ipynb): DyNAS with Transformer LT supernet on WMT En-De dataset.
diff --git a/docs/source/imgs/dynas.png b/docs/source/imgs/dynas.png
deleted file mode 100644
index e82a1c03160..00000000000
Binary files a/docs/source/imgs/dynas.png and /dev/null differ
diff --git a/docs/source/imgs/release_data.png b/docs/source/imgs/release_data.png
deleted file mode 100644
index 5e06eb067e1..00000000000
Binary files a/docs/source/imgs/release_data.png and /dev/null differ
diff --git a/docs/source/imgs/tensorboard_baseline_v0_cg_conv0.png b/docs/source/imgs/tensorboard_baseline_v0_cg_conv0.png
deleted file mode 100644
index daa3036c516..00000000000
Binary files a/docs/source/imgs/tensorboard_baseline_v0_cg_conv0.png and /dev/null differ
diff --git a/docs/source/imgs/tensorboard_tune_1_v0_cg_conv0.png b/docs/source/imgs/tensorboard_tune_1_v0_cg_conv0.png
deleted file mode 100644
index 6086a6eb837..00000000000
Binary files a/docs/source/imgs/tensorboard_tune_1_v0_cg_conv0.png and /dev/null differ
diff --git a/docs/source/imgs/tensorboard_v0_cg_conv0_histogram.png b/docs/source/imgs/tensorboard_v0_cg_conv0_histogram.png
deleted file mode 100644
index 6f5d52a5cb8..00000000000
Binary files a/docs/source/imgs/tensorboard_v0_cg_conv0_histogram.png and /dev/null differ
diff --git a/docs/source/imgs/terminal-ops.jpg b/docs/source/imgs/terminal-ops.jpg
deleted file mode 100644
index 32bfee29110..00000000000
Binary files a/docs/source/imgs/terminal-ops.jpg and /dev/null differ
diff --git a/docs/source/imgs/terminal-profiling.jpg b/docs/source/imgs/terminal-profiling.jpg
deleted file mode 100644
index 642a64ab446..00000000000
Binary files a/docs/source/imgs/terminal-profiling.jpg and /dev/null differ
diff --git a/docs/source/imgs/terminal-weights.jpg b/docs/source/imgs/terminal-weights.jpg
deleted file mode 100644
index 3fe8501017b..00000000000
Binary files a/docs/source/imgs/terminal-weights.jpg and /dev/null differ
diff --git a/docs/source/imgs/tutorial.png b/docs/source/imgs/tutorial.png
deleted file mode 100644
index 1c57041b0ce..00000000000
Binary files a/docs/source/imgs/tutorial.png and /dev/null differ
diff --git a/docs/source/imgs/workflow.jpg b/docs/source/imgs/workflow.jpg
deleted file mode 100644
index c40f02b99b9..00000000000
Binary files a/docs/source/imgs/workflow.jpg and /dev/null differ
diff --git a/docs/source/infrastructure.md b/docs/source/infrastructure.md
index 96f0ea3fca8..cfa1912c571 100644
--- a/docs/source/infrastructure.md
+++ b/docs/source/infrastructure.md
@@ -182,19 +182,6 @@ Intel® Neural Compressor has unified interfaces which dispatch tasks to differe
-
-
-
-[Neural architecture search](NAS.md):
-|Approach |Framework |
-|------------------------------------------------|:-----------:|
-|Basic |PyTorch |
-|DyNas |PyTorch |
-
-
-
-
-
 [Mixed precision](mixed_precision.md):
 |Framework | |
 |--------------|:-----------:|
diff --git a/docs/source/installation_guide.md b/docs/source/installation_guide.md
index c99e09e0340..a0e7ad5e47c 100644
--- a/docs/source/installation_guide.md
+++ b/docs/source/installation_guide.md
@@ -59,8 +59,8 @@ The following prerequisites and requirements must be satisfied for a successful
     cd neural-compressor
     pip install -r requirements.txt
     python setup.py install
-    [optional] pip install requirements_pt.txt # for PyTorch framework extension API
-    [optional] pip install requirements_tf.txt # for TensorFlow framework extension API
+    [optional] pip install -r requirements_pt.txt # for PyTorch framework extension API
+    [optional] pip install -r requirements_tf.txt # for TensorFlow framework extension API
     ```
 
 ### Install from AI Kit
diff --git a/docs/source/tensorboard.md b/docs/source/tensorboard.md
deleted file mode 100644
index 670f7930417..00000000000
--- a/docs/source/tensorboard.md
+++ /dev/null
@@ -1,205 +0,0 @@
-TensorBoard
-===========
-1. [Introduction](#introduction)
-2. [Supported Feature Matrix](#supported-feature-matrix)
-3. [Get Started with Tensorboard](#get-started-with-tensorboard)
-4. [Examples](#examples)
-
-## Introduction
-
-TensorBoard is a suite of web applications that provide measurements and visualizations used to inspect and understand your machine learning workflow for [TensorFlow TensorBoard](https://github.com/tensorflow/tensorboard) and [PyTorch TensorBoard](https://github.com/pytorch/pytorch/tree/master/torch/utils/tensorboard). Intel® Neural Compressor performs accuracy-driven quantization; the tuning process quantizes the tensor and performs graph transformation and optimization to achieve optimal performance under accuracy requirement. If you want to observe the behaviors of the optimizations, or if you want to discover why an accuracy target cannot be met, TensorBoard can provide you with some valuable information. You can inspect the graph and tensor after each tuning run. If a model cannot meet accuracy requirements, you can analyze the comparison of FP32 and the INT8 tensor histogram.
-
-We collect the TensorBoard event summary during evaluation. The first time is on the baseline FP32 model and later on at the end of each tuning runs are based on the quantized model. The TensorBoard log directory is named baseline_acc_ and tune__acc_, to indicate the stage and accuracy of the data that is generated. Users can select their data of interest to observe with TensorBoard.
-
-## Supported Feature Matrix
-| Optimized Framework | Tensorboard Support |
-|---------------------|:-------------------:|
-| PyTorch | ✔ |
-| TensorFlow | ✔ |
-
-## Get Started with TensorBoard
-### PyTorch TensorBoard
-
-PyTorch TensorBoard implementation includes three steps:
-
-* Before evaluation in the _pre_eval_hook() where instruments observers are placed in the model.
-* During evaluation where observers collect tensor information in a dict data structure.
-* After evaluation where the graph and tensor information is dumped with the TensorBoard summary writer in _post_eval_hook().
-
-
-The detailed algorithm can be described by the Pseudo code:
-```python
-
-def evaluate(self, model, dataloader, postprocess=None, \
-             metric=None, measurer=None, iteration=-1, tensorboard=False):
-    # The tensorboard summary is collected in the evaluation function of adaptor
-
-    if tensorboard:
-        model = self._pre_eval_hook(model)
-    # evaluation code
-    ....
-    acc = metric.result()
-    if tensorboard:
-        self._post_eval_hook(model, accuracy=acc, input=input)
-
-def _pre_eval_hook(self, model):
-    # Insert observer submodule into each module in whitelist in order to collect tensor information
-
-    class _RecordingObserver(ABC, torch.nn.Module):
-        # Define the Observer class
-
-        def forward(self, x):
-            # Record the tensor information in a dict structure
-            self.output_tensors_dict[self.current_iter] = x.to("cpu")
-
-        @torch.jit.export
-        def get_tensor_value(self):
-            return self.output_tensors_dict
-
-    def _observer_forward_hook(module, input, output):
-        # Forward hook that calls observer on the output
-        return module.activation_post_process(output)
-
-    def _add_observer_(module, op_list=None, prefix=""):
-
-        # Add observer for each child module
-        for name, child in module.named_children():
-            _add_observer_(child, op_list, op_name)
-
-        if module is a leaf:
-            module.add_module(
-                'activation_post_process',
-                module.qconfig.activation())
-            module.register_forward_hook(_observer_forward_hook)
-
-def _post_eval_hook(self, model, **args):
-    # Dump tensor and graph information with TensorBoard summary writer
-    if self.dump_times == 0:
-        writer = SummaryWriter('runs/eval/baseline' +
-                               '_acc' + str(accuracy), model)
-    else:
-        writer = SummaryWriter('runs/eval/tune_' +
-                               str(self.dump_times) +
-                               '_acc' + str(accuracy), model)
-
-    if args is not None and 'input' in args and self.dump_times == 0:
-        writer.add_graph(model, args['input'])
-
-    from torch.quantization import get_observer_dict
-    get_observer_dict(model, observer_dict)
-    for key in observer_dict:
-        ......
-        op_name = key.strip(".activation_post_process")
-        summary[op_name + ".output"] = observer_dict[key].get_tensor_value()
-
-        for iter in summary[op_name + ".output"]:
-            # Record output tensor, for fused op only record the parent op output
-            ......
-            if summary[op_name + ".output"][iter].is_quantized:
-                writer.add_histogram(
-                    op + "/Output/int8",
-                    torch.dequantize(summary[op_name + ".output"][iter]))
-            else:
-                writer.add_histogram(
-                    op + "/Output/fp32",
-                    summary[op_name + ".output"][iter])
-
-    state_dict = model.state_dict()
-    for key in state_dict:
-        # Record weight tensor, fused child tensorBoard tag will be merge
-        if state_dict[key].is_quantized:
-            writer.add_histogram(op + "/int8",
-                                 torch.dequantize(state_dict[key]))
-        else:
-            writer.add_histogram(op + "/fp32", state_dict[key])
-
-```
-
-
-#### Usage
-
-1. Add "tensorboard: true" in the yaml file.
-2. Run quantization tuning; a "./runs" folder is generated in the working folder.
-3. Start tensorboard:
-
-   ```shell
-   tensorboard --bind_all --logdir_spec baseline:./runs/eval/tune_0_acc0.80,tune_1:./runs/eval/tune_1_acc0.79
-   ```
-
-
-### TensorFlow Tensorboard
-
-TensorFlow TensorBoard implementation includes four steps:
-
-1. Before evaluation where we create the TensorBoard summary write and write graph, collect FP32 and node names for inspection, and dump the histogram of weights and bias tensor directly from graph_def.
-2. Run get_tensor_by_name_with_import() where we get data output tensors.
-3. Run session.run() to predict and get the inference result of the output tensor list collected in the previous step.
-4. Enumerate the output tensor and write the histogram.
-
-See the [tensorflow.py](https://github.com/intel/neural-compressor/tree/master/neural_compressor/adaptor/tensorflow.py) evaluate() function for details.
-
-#### Usage
-
-1. Add "tensorboard: true" in the yaml file.
-2. Run quantization tuning; a "./runs" folder is generated in the working folder. For example:
-
-   ```shell
-   ls ./runs/eval
-   baseline_acc_0.776 tune_1_acc_0.095
-   ```
-   The baseline_acc_0.776 folder contains the FP32 event log and 0.776 is the FP32 accuracy. tune_1_acc_0.095 contains the evaluation event log of the first run of tuning.
-
-3. Start tensorboard:
-
-   ```shell
-   tensorboard --bind_all --logdir_spec baseline:./runs_v3/eval/baseline_acc_0.776/,tune_1:./runs_v3/eval/tune_1_acc_0.095/
-   ```
-## Examples
-
-### PyTorch Examples
-
-```shell
- examples/pytorch/eager/image_recognition/imagenet/cpu/ptq/run_tuning_dump_tensor.sh
-```
-
-### TensorFlow Examples
-
-1. Add "tensorboard: true" into examples/tensorflow/image_recognition/inceptionv3.yaml. In order to demonstrate the usage of TensorBoard, remove the following lines which are added to skip the quantization of 'v0/cg/conv0/conv2d/Conv2D' to avoid a known limitation.
-
-   ```yaml
-   op_wise: {
-     'v0/cg/conv0/conv2d/Conv2D': {
-       'activation': {'dtype': ['fp32']},
-     }
-   }
-   ```
-
-2. Run tuning:
-
-   ```shell
-   bash run_quant.sh --topology=inception_v3 --dataset_location= \
-   --input_model=./inceptionv3_fp32_pretrained_model.pb --output_model=./nc_inceptionv3.pb --config=./inceptionv3_dump_tensor.yaml
-   ```
-
-3. Start TensorBoard
-
-   ```shell
-   tensorboard --bind_all --logdir_spec baseline:./runs_v3/eval/baseline_acc_0.776/,tune_1:./runs_v3/eval/tune_1_acc_0.095/
-   ```
-
-4. In order to find the reason why tune_1 got such poor accuracy, we can observe the TensorBoard.
-
-* From the **GRAPHS** tab, select "baseline/." in the "Run" box and find the first 'Conv2d' op after 'input' op. The op name is "v0/cg/conv0/Relu":
-
-![TensorBoard Baseline](./imgs/tensorboard_baseline_v0_cg_conv0.png "TensorBoard Baseline")
-
-* From the **GRAPHS** tab, select "tune_1/." in the "Run" box and find the first 'Conv2d' op after 'input' op. The tensor name is 'v0/cg/conv0/conv2d/Conv2D_eightbit_requantize':
-
-![TensorBoard Tuning](./imgs/tensorboard_tune_1_v0_cg_conv0.png "TensorBoard Tuning")
-
-
-* Switch to the **HISTOGRAMS** tab. Click the 'v0/cg/conv0' op name in the search box. TensorBoard groups the tensors with the same op name together so you can compare the tensor of baseline 'v0/cg/conv0/Relu' with the tensor of tune_1 'v0/cg/conv0/conv2d/Conv2D_eightbit_requantize_int8.output'. Note that the tensor name can be changed after quantization, so group the tensor by op name and compare. From the chart below, we can see that the histogram of the first conv2d output tensor are different. This is due to a known TensorFlow issue. After filtering the 'v0/cg/conv0/conv2d/Conv2D' op by adding "op_wise" in the yaml file, the issue disappears.
-
-![TensorBoard Histogram](./imgs/tensorboard_v0_cg_conv0_histogram.png "TensorBoard Histogram")
diff --git a/docs/source/user_yaml.md b/docs/source/user_yaml.md
deleted file mode 100644
index 14d4157ab78..00000000000
--- a/docs/source/user_yaml.md
+++ /dev/null
@@ -1,166 +0,0 @@
-User YAML Configuration Files
-=====
-1. [Introduction](#introduction)
-2. [Supported Feature Matrix](#supported-feature-matrix)
-3. [Get Started with User YAML Files](#get-started-with-user-yaml-files)
-
-
-## Introduction
-
-Intel® Neural Compressor uses YAML files for quick and user-friendly configurations. There are two types of YAML files - user YAML files and framework YAML files, which are used in running user cases and setting up framework capabilities, respectively.
-
-First, let's take a look at a user YAML file. It defines the model, tuning strategies, tuning calibrations and evaluations, and performance benchmarking of the passing model vs. original model.
-
-## Supported Feature Matrix
-
-| Optimization Techniques | YAML Configuration Files |
-|-------------------------|:------------------------:|
-| Quantization | ✔ |
-| Pruning | ✔ |
-| Distillation | ✔ |
-
-
-## Get started with User YAML Files
-
-
-A complete user YAML file is organized logically into several sections:
-
-* ***model***: The model specifications define a user model's name, inputs, outputs and framework.
-
-
-```yaml
-model: # mandatory. used to specify model specific information.
-  name: mobilenet_v1
-  framework: tensorflow # mandatory. supported values are tensorflow, pytorch, pytorch_ipex, onnxrt_integer, onnxrt_qlinear or mxnet; allow new framework backend extension.
-  inputs: image_tensor # optional. inputs field is only required in tensorflow.
-  outputs: num_detections,detection_boxes,detection_scores,detection_classes # optional. outputs field is only required in tensorflow.
-```
-* ***quantization***: The quantization specifications define quantization tuning space and related calibrations. To calibrate, users can specify *sampling_size* (optional) and use the subsection *dataloader* to specify the dataset location using *root* and transformation using *transform*. To implement tuning space constraints, users can use the subsections *model_wise* and *op_wise* for specific configurations.
-
-```yaml
-quantization: # optional. tuning constraints on model-wise for advance user to reduce tuning space.
-  calibration:
-    sampling_size: 20 # optional. default value is 100. used to set how many samples should be used in calibration.
-    dataloader:
-      dataset:
-        ImageRecord:
-          root: /path/to/imagenet/ # NOTE: modify to calibration dataset location if needed
-      transform:
-        BilinearImagenet:
-          height: 224
-          width: 224
-  model_wise: # optional. tuning constraints on model-wise for advance user to reduce tuning space.
-    weight:
-      granularity: per_channel
-      scheme: asym
-      dtype: int8
-      algorithm: minmax
-    activation:
-      granularity: per_tensor
-      scheme: asym
-      dtype: int8, fp32
-      algorithm: minmax, kl
-  op_wise: { # optional. tuning constraints on op-wise for advance user to reduce tuning space.
-    'conv1': {
-      'activation': {'dtype': ['uint8', 'fp32'],
-                     'algorithm': ['minmax', 'kl'],
-                     'scheme':['sym']},
-      'weight': {'dtype': ['int8', 'fp32'],
-                 'algorithm': ['minmax']}
-    }
-  }
-```
-
-* ***pruning***: The pruning specifications define pruning tuning space. To define the training behavior, users can use the subsection *train* to specify the training hyper-parameters and the training dataloader. To define the pruning approach, users can use the subsection *approach* to specify pruning target, choose the type of pruning algorithm, and the way to apply it during training process.
-
-```yaml
-pruning:
-  train:
-    dataloader:
-      ...
-    epoch: 40
-    optimizer:
-      Adam:
-        learning_rate: 1e-06
-        beta_1: 0.9
-        beta_2: 0.999
-        epsilon: 1e-07
-    criterion:
-      SparseCategoricalCrossentropy:
-        reduction: sum_over_batch_size
-        from_logits: False
-  approach:
-    weight_compression:
-      initial_sparsity: 0.0
-      target_sparsity: 0.54
-      start_epoch: 0
-      end_epoch: 19
-      pruners:
-        - !Pruner
-          start_epoch: 0
-          end_epoch: 19
-          prune_type: basic_magnitude
-```
-* ***distillation***: The distillation specifications define distillation's tuning space.
-Similar to pruning, to define the training behavior, users can use the subsection *train* to specify the training hyper-parameters and the training dataloader; it is optional if users implement *train_func* and set the attribute of the distillation instance to *train_func*. For criterion, Intel® Neural Compressor provides a built-in knowledge distillation loss class to calculate distillation loss.
-```yaml
-distillation:
-  train:
-    start_epoch: 0
-    end_epoch: 90
-    iteration: 1000
-    frequency: 1
-    dataloader:
-      ...
-    optimizer:
-      SGD:
-        learning_rate: 0.001
-        momentum: 0.1
-        nesterov: True
-        weight_decay: 0.001
-    criterion:
-      KnowledgeDistillationLoss:
-        temperature: 1.0
-        loss_types: ['CE', 'CE']
-        loss_weights: [0.5, 0.5]
-```
-* ***evaluation***: The evaluation specifications define the dataloader and metric for accuracy evaluation as well as dataloader and configurations for performance benchmarking.
-```yaml
-evaluation: # optional. required if user doesn't provide eval_func in neural_compressor.Quantization.
-  accuracy:
-    metric:
-      ...
-    dataloader:
-      ...
-```
-* ***tuning***: The tuning specifications define overall tuning targets. Users can use *accuracy_criterion* to specify the target of accuracy loss percentage and use *exit_policy* to specify the tuning timeout in seconds. The random seed can be specified using *random_seed*.
-
-```yaml
-tuning:
-  accuracy_criterion:
-    relative: 0.01 # the tuning target of accuracy loss percentage: 1%
-    higher_is_better: True
-  exit_policy:
-    timeout: 0 # tuning timeout (seconds), 0 means early stop
-  random_seed: 9527 # random seed
-```