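diff --git a/docs/release_note_cn.md b/docs/release_note_cn.md
index f27bdf550d5..676bcdd8169 100644
--- a/docs/release_note_cn.md
+++ b/docs/release_note_cn.md
@@ -1,4 +1,120 @@
+# 2.3.1 Release Note
+
+## 1. Important Updates
+
+- Version 2.3.1 fixes known issues in version 2.3 and releases installation packages that support CUDA 11.6.
+
+## 2. Training Framework (Including Distributed)
+
+### (1) Function Optimization
+
+#### API
+
+- Modify the `paddle.nn.initializer.KaimingUniform` and `paddle.nn.initializer.KaimingNormal` initializers so that they support multiple types of activation functions. ([#43721](https://github.com/PaddlePaddle/Paddle/pull/43721), [#43827](https://github.com/PaddlePaddle/Paddle/pull/43827))
+- Optimize data prefetching in `paddle.io.DataLoader` so that `prefetch_factor` sets the amount of prefetched data to cache, which avoids IO blocking when reading large blocks of data. ([#43674](https://github.com/PaddlePaddle/Paddle/pull/43674))
+
+#### New dynamic graph execution mechanism
+
+- Modify how optional-type Tensors are initialized in the new dynamic graph API logic, so that early destruction no longer causes data errors. ([#42561](https://github.com/PaddlePaddle/Paddle/pull/42561))
+
+#### New static graph executor
+
+- Defer initialization of the executor's thread pools, so that a `program` that runs only once (e.g., `save, load, startup_program`) does not create thread pools. ([#43768](https://github.com/PaddlePaddle/Paddle/pull/43768))
+
+#### Mixed precision training
+
+- Disable the `state_dict` hook in `set_state_dict` of `paddle.nn.Layer`. ([#43407](https://github.com/PaddlePaddle/Paddle/pull/43407))
+
+#### Distributed training
+
+- Support tensor model parallelism in `paddle.incubate.nn.functional.fused_attention` and `paddle.incubate.nn.functional.fused_feedforward`. ([#43505](https://github.com/PaddlePaddle/Paddle/pull/43505))
+
+#### Others
+
+- Adjust the format of the strings printed for framework operator kernels to make automated splitting and parsing easier. ([#42931](https://github.com/PaddlePaddle/Paddle/pull/42931))
+- Update the model quantization API to support the `rounding to nearest ties to even` rounding mode and the quantization value range [-128, 127]. ([#43829](https://github.com/PaddlePaddle/Paddle/pull/43829))
+- Support AMP mixed precision training in quantization-aware training. ([#43689](https://github.com/PaddlePaddle/Paddle/pull/43689))
+- Add a `progress bar` at the start of quantization-aware training to show the progress of quantization initialization, and skip the scale op when collecting out_threshold statistics to speed up initialization. ([#43454](https://github.com/PaddlePaddle/Paddle/pull/43454))
+- Support `conv` and `bn` fusion in dynamic graph quantization training, and support setting `skip_tensor_list` in static graph offline quantization to skip quantization for certain layers. ([#43301](https://github.com/PaddlePaddle/Paddle/pull/43301))
+
+### (2) Performance Optimization
+
+- 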
Optimize the `paddle.incubate.nn.functional.fused_attention` and `paddle.incubate.nn.functional.fused_feedforward` operators by adding an `add_residual` attribute that controls whether the last step adds the `residual`. CAE model performance improves by 7.7%. ([#43719](https://github.com/PaddlePaddle/Paddle/pull/43719))
+- Optimize the `linspace` operator by initializing the three input Tensors `start`, `stop`, and `num` on the CPU, which avoids GPU -> CPU copies inside the operator. SOLOv2 model performance improves by 6%. ([#43746](https://github.com/PaddlePaddle/Paddle/pull/43746))
+
+### (3) Bug Fixes
+
+#### API
+
+- Fix an error that `paddle.io.DataLoader` occasionally reports when `return_list=True`, caused by a multi-thread conflict. ([#43691](https://github.com/PaddlePaddle/Paddle/pull/43691))
+- Fix the error where the `to` method reports that NoneType has no device attribute when a `paddle.nn.Layer` has parameters of `None` type. ([#43597](https://github.com/PaddlePaddle/Paddle/pull/43597))
+- Fix incorrect results of the cumsum op for some `shape` values. ([#42500](https://github.com/PaddlePaddle/Paddle/pull/42500), [#43777](https://github.com/PaddlePaddle/Paddle/pull/43777))
+- Fix the issue that, in static graph mode, the output of `Tensor.__getitem__` has dimension 0 at the network-building stage when a `bool` index is used. ([#43246](https://github.com/PaddlePaddle/Paddle/pull/43246))
+- Fix exceptions in `paddle.slice` and `paddle.strided_slice` when handling negative arguments. ([#43432](https://github.com/PaddlePaddle/Paddle/pull/43432))
+- Fix incorrect assignment results of the set_value op when the slice `step` is negative. ([#43694](https://github.com/PaddlePaddle/Paddle/pull/43694))
+- Fix the issue that the C++ `copy` interface cannot copy between devices in multi-card setups. ([#43728](https://github.com/PaddlePaddle/Paddle/pull/43728))
+- Fix an inference-time issue caused by attribute naming in `paddle.incubate.nn.functional.fused_attention` and `paddle.incubate.nn.functional.fused_feedforward`. ([#43505](https://github.com/PaddlePaddle/Paddle/pull/43505))
+- Fix an exception in the ConditionalBlockGrad op when processing Tensors that do not require `grad`. ([#43034](https://github.com/PaddlePaddle/Paddle/pull/43034))
+- Fix the increase in device memory caused by the backward speed optimization of the C++ einsum op, and enable this backward optimization by default. ([#43397](https://github.com/PaddlePaddle/Paddle/pull/43397))
+- Fix the issue that, on a single card, data read by multi-process `paddle.io.DataLoader` is not reproducible even when the random seed is fixed. ([#43702](https://github.com/PaddlePaddle/Paddle/pull/43702))
+- Fix the CUDNN_STATUS_NOT_SUPPORTED error triggered by the softmax op when the Tensor has more than 2G elements. ([#43719](https://github.com/PaddlePaddle/Paddle/pull/43719))
+- Fix the issue that trace op `Event` strings do not distinguish between operators, which makes performance analysis inconvenient. ([#42789](https://github.com/PaddlePaddle/Paddle/pull/42789))
+
+#### Others
+
+- Fix the device memory overflow caused by repeated deepcopy and saving in dynamic-to-static conversion. ([#43141](https://github.com/PaddlePaddle/Paddle/pull/43141))
+- Fix the incorrect device id in multi-card scenarios, introduced by the upgrade of the PlaceType type used in custom operators. ([#43830](https://github.com/PaddlePaddle/Paddle/pull/43830))
+- Optimize the timeline visualization logic of `paddle.profiler.Profiler`: events customized in Python scripts are now displayed in the Python fold layer instead of the C++ fold layer. ([#42790](https://github.com/PaddlePaddle/Paddle/pull/42790))
+
+## 3. Deployment Direction (Paddle Inference)
+
+### (1) New Features
+
+#### New functions
+
+- Add support for PaddleSlim quantized models in the ONNX Runtime backend on CPU. ([#43774](https://github.com/PaddlePaddle/Paddle/pull/43774), [#43796](https://github.com/PaddlePaddle/Paddle/pull/43796))
+
+### (2) Underlying Optimization
+
+#### CPU performance optimization
+
+- Remove `gpu_cpu_reshape2_matmul_fuse_pass` from the EnableMkldnn configuration to fix the ResNet50 performance degradation. ([#43750](https://github.com/PaddlePaddle/Paddle/pull/43750))
+
+#### GPU performance optimization
+
+- Add TensorRT convert support for `bilinear_interp_v2`. ([#43618](https://github.com/PaddlePaddle/Paddle/pull/43618))
+- Add `matmul_scale_fuse_pass` and `multihead_matmul_fuse_pass_v3` to the GPU passes, with unit tests. ([#43765](https://github.com/PaddlePaddle/Paddle/pull/43765))
+- Add support for deferred initialization of GPU handles. ([#43661](https://github.com/PaddlePaddle/Paddle/pull/43661))
+
+### (3) Bug Fixes
+
+#### Framework and API fixes
+
+- Fix the compilation error when building jointly with Paddle-Lite XPU. ([#43178](https://github.com/PaddlePaddle/Paddle/pull/43178))
+- Fix the false triggering of the ERNIE 3.0 pass. ([#43948](https://github.com/PaddlePaddle/Paddle/pull/43948))
+- Fix the issue that the int8 quantization attribute cannot be read in the multihead op. ([#43020](https://github.com/PaddlePaddle/Paddle/pull/43020))
+
+#### Backend capability fixes
+
+- Fix crashes of the elementwise_mul and matmul ops in MKLDNN during quantized inference. ([#43725](https://github.com/PaddlePaddle/Paddle/pull/43725))
+- Fix the issue that TensorRT subgraph serialization files are generated repeatedly for the same model during inference. ([#42945](https://github.com/PaddlePaddle/Paddle/pull/43945), [#42633](https://github.com/PaddlePaddle/Paddle/pull/42633))
+- Fix the conflict between the ONNX Runtime backend and an externally used protobuf. ([#43159](https://github.com/PaddlePaddle/Paddle/pull/43159), 
[#43742](https://github.com/PaddlePaddle/Paddle/pull/43742))
+- Fix the inference error of the ONNX Runtime backend in the Python inference library when the model has multiple inputs. ([#43621](https://github.com/PaddlePaddle/Paddle/pull/43621))
+
+## 4. Environment Adaptation
+
+### Compile and install
+
+- Complete verification and adaptation for CUDA 11.6, and release CUDA 11.6 installation packages on the official website. ([#43935](https://github.com/PaddlePaddle/Paddle/pull/43935), [#44005](https://github.com/PaddlePaddle/Paddle/pull/44005))
+- Fix the cub error when compiling with CUDA 11.6 on Windows. ([#43935](https://github.com/PaddlePaddle/Paddle/pull/43935), [#44005](https://github.com/PaddlePaddle/Paddle/pull/44005))
+- Fix the long compilation time of the elementwise and reduce ops. ([#43202](https://github.com/PaddlePaddle/Paddle/pull/43202), [#42779](https://github.com/PaddlePaddle/Paddle/pull/42779), [#43205](https://github.com/PaddlePaddle/Paddle/pull/43205))
+
+### New hardware adaptation
+
+- Cambricon MLU supports the PaddlePaddle Profiler. ([#42115](https://github.com/PaddlePaddle/Paddle/pull/42115))
+- GraphCore IPU supports displaying compilation progress. ([#42078](https://github.com/PaddlePaddle/Paddle/pull/42078))
+
 # 2.3.0 Release Note
 
 ## 1. Important Updates
 
diff --git a/docs/release_note_en.md b/docs/release_note_en.md
index de55218567b..be25861cb45 100644
--- a/docs/release_note_en.md
+++ b/docs/release_note_en.md
@@ -1,4 +1,120 @@
+# 2.3.1 Release Note
+
+## **1. Important Updates**
+
+- V2.3.1 builds on V2.3 by fixing known issues and releasing precompiled binaries that support CUDA 11.6.
+
+## **2. Training Framework (distributed included)**
+
+### **(1) Function Optimization**
+
+#### API
+
+- Modify the `paddle.nn.initializer.KaimingUniform` and `paddle.nn.initializer.KaimingNormal` initializers so that they support multiple types of activation functions. ([#43721](https://github.com/PaddlePaddle/Paddle/pull/43721), [#43827](https://github.com/PaddlePaddle/Paddle/pull/43827))
+- Optimize data prefetching in `paddle.io.DataLoader` so that `prefetch_factor` sets the amount of prefetched data to cache. This avoids IO blocking when reading large blocks of data. 
([#43674](https://github.com/PaddlePaddle/Paddle/pull/43674))
+
+#### **New dynamic graph execution mechanism**
+
+- Modify how optional-type Tensors are initialized in the new dynamic graph API logic, so that early destruction no longer causes data errors. ([#42561](https://github.com/PaddlePaddle/Paddle/pull/42561))
+
+#### **New static graph executor**
+
+- Defer initialization of the executor's thread pools, so that a `program` that runs only once (e.g., `save, load, startup_program`) does not create thread pools. ([#43768](https://github.com/PaddlePaddle/Paddle/pull/43768))
+
+#### **Mixed precision training**
+
+- Disable the `state_dict` hook in `set_state_dict` of `paddle.nn.Layer`. ([#43407](https://github.com/PaddlePaddle/Paddle/pull/43407))
+
+#### **Distributed training**
+
+- Support tensor model parallelism in `paddle.incubate.nn.functional.fused_attention` and `paddle.incubate.nn.functional.fused_feedforward`. ([#43505](https://github.com/PaddlePaddle/Paddle/pull/43505))
+
+#### **Others**
+
+- Adjust the print format of framework operator kernels to make automated splitting and parsing easier. ([#42931](https://github.com/PaddlePaddle/Paddle/pull/42931))
+- Update the model quantization API to support the `rounding to nearest ties to even` rounding mode and the quantization value range [-128, 127]. ([#43829](https://github.com/PaddlePaddle/Paddle/pull/43829))
+- Support AMP mixed precision training in quantization-aware training. ([#43689](https://github.com/PaddlePaddle/Paddle/pull/43689))
+- Add a `progress bar` at the start of quantization-aware training to show the progress of quantization initialization. Skip the scale op when collecting out_threshold statistics to speed up initialization. ([#43454](https://github.com/PaddlePaddle/Paddle/pull/43454))
+- Support `conv` and `bn` fusion in dynamic graph quantization training. 
Support setting `skip_tensor_list` in static graph offline quantization to skip quantization for certain layers. ([#43301](https://github.com/PaddlePaddle/Paddle/pull/43301))
+
+### **(2) Performance Optimization**
+
+- Optimize the `paddle.incubate.nn.functional.fused_attention` and `paddle.incubate.nn.functional.fused_feedforward` operators by adding an `add_residual` attribute that controls whether the last step adds the `residual`. CAE model performance improves by 7.7%. ([#43719](https://github.com/PaddlePaddle/Paddle/pull/43719))
+- Optimize the `linspace` operator by initializing the three input Tensors `start`, `stop`, and `num` on the CPU, which avoids GPU -> CPU copies inside the operator. SOLOv2 model performance improves by 6%. ([#43746](https://github.com/PaddlePaddle/Paddle/pull/43746))
+
+### **(3) Bug Fixes**
+
+#### API
+
+- Fix an error that `paddle.io.DataLoader` occasionally reports when `return_list=True`, caused by a multi-thread conflict. ([#43691](https://github.com/PaddlePaddle/Paddle/pull/43691))
+- Fix the error where the `to` method reports that NoneType has no device attribute when a `paddle.nn.Layer` has parameters of `None` type. ([#43597](https://github.com/PaddlePaddle/Paddle/pull/43597))
+- Fix incorrect results of the cumsum op for some `shape` values. ([#42500](https://github.com/PaddlePaddle/Paddle/pull/42500), [#43777](https://github.com/PaddlePaddle/Paddle/pull/43777))
+- Fix the bug that, in static graph mode, the output of `Tensor.__getitem__` has dimension 0 at the network-building stage when a `bool` index is used. ([#43246](https://github.com/PaddlePaddle/Paddle/pull/43246))
+- Fix exceptions in `paddle.slice` and `paddle.strided_slice` when handling negative arguments. ([#43432](https://github.com/PaddlePaddle/Paddle/pull/43432))
+- Fix incorrect assignment results of the set_value op when the slice `step` is negative. 
([#43694](https://github.com/PaddlePaddle/Paddle/pull/43694))
+- Fix the bug that the C++ `copy` interface cannot copy between devices in multi-card setups. ([#43728](https://github.com/PaddlePaddle/Paddle/pull/43728))
+- Fix an inference-stage bug caused by attribute naming in `paddle.incubate.nn.functional.fused_attention` and `paddle.incubate.nn.functional.fused_feedforward`. ([#43505](https://github.com/PaddlePaddle/Paddle/pull/43505))
+- Fix an exception in the ConditionalBlockGrad op when processing Tensors that do not require `grad`. ([#43034](https://github.com/PaddlePaddle/Paddle/pull/43034))
+- Fix the increase in device memory caused by the backward speed optimization of the C++ einsum op, and enable this backward optimization by default. ([#43397](https://github.com/PaddlePaddle/Paddle/pull/43397))
+- Fix the bug that, on a single card, data read by multi-process `paddle.io.DataLoader` is not reproducible even when the random seed is fixed. ([#43702](https://github.com/PaddlePaddle/Paddle/pull/43702))
+- Fix the bug that the softmax op triggers CUDNN_STATUS_NOT_SUPPORTED when the Tensor has more than 2G elements. ([#43719](https://github.com/PaddlePaddle/Paddle/pull/43719))
+- Fix the bug that trace op `Event` strings do not distinguish between operators, which makes performance analysis inconvenient. ([#42789](https://github.com/PaddlePaddle/Paddle/pull/42789))
+
+#### **Others**
+
+- Fix the device memory overflow caused by repeated deepcopy and saving in dynamic-to-static conversion. ([#43141](https://github.com/PaddlePaddle/Paddle/pull/43141))
+- Fix the incorrect device id in multi-card scenarios, introduced by the upgrade of the PlaceType type used in custom operators. ([#43830](https://github.com/PaddlePaddle/Paddle/pull/43830))
+- Optimize the timeline visualization logic of `paddle.profiler.Profiler`: events customized in Python scripts are now displayed in the Python fold layer instead of the C++ fold layer. 
([#42790](https://github.com/PaddlePaddle/Paddle/pull/42790))
+
+## **3. Deployment Direction (Paddle Inference)**
+
+### **(1) New Features**
+
+#### **New functions**
+
+- Add support for PaddleSlim quantized models in the ONNX Runtime backend on CPU. ([#43774](https://github.com/PaddlePaddle/Paddle/pull/43774), [#43796](https://github.com/PaddlePaddle/Paddle/pull/43796))
+
+### **(2) Underlying Optimization**
+
+#### **CPU performance optimization**
+
+- Remove `gpu_cpu_reshape2_matmul_fuse_pass` from the EnableMkldnn configuration to fix the ResNet50 performance degradation. ([#43750](https://github.com/PaddlePaddle/Paddle/pull/43750))
+
+#### **GPU performance optimization**
+
+- Add TensorRT convert support for `bilinear_interp_v2`. ([#43618](https://github.com/PaddlePaddle/Paddle/pull/43618))
+- Add `matmul_scale_fuse_pass` and `multihead_matmul_fuse_pass_v3` to the GPU passes, with unit tests. ([#43765](https://github.com/PaddlePaddle/Paddle/pull/43765))
+- Add support for deferred initialization of GPU handles. ([#43661](https://github.com/PaddlePaddle/Paddle/pull/43661))
+
+### **(3) Bug Fixes**
+
+#### **Framework and API fixes**
+
+- Fix the compilation error when building jointly with Paddle-Lite XPU. ([#43178](https://github.com/PaddlePaddle/Paddle/pull/43178))
+- Fix the false triggering of the ERNIE 3.0 pass. ([#43948](https://github.com/PaddlePaddle/Paddle/pull/43948))
+- Fix the bug that the int8 quantization attribute cannot be read in the multihead op. ([#43020](https://github.com/PaddlePaddle/Paddle/pull/43020))
+
+#### **Backend capability fixes**
+
+- Fix crashes of the elementwise_mul and matmul ops in MKLDNN during quantized inference. ([#43725](https://github.com/PaddlePaddle/Paddle/pull/43725))
+- Fix a bug where TensorRT subgraph serialization files are generated repeatedly for the same model during inference. 
([#42945](https://github.com/PaddlePaddle/Paddle/pull/43945), [#42633](https://github.com/PaddlePaddle/Paddle/pull/42633))
+- Fix a conflict between the ONNX Runtime backend and an externally used protobuf. ([#43159](https://github.com/PaddlePaddle/Paddle/pull/43159), [#43742](https://github.com/PaddlePaddle/Paddle/pull/43742))
+- Fix the inference error of the ONNX Runtime backend in the Python inference library when the model has multiple inputs. ([#43621](https://github.com/PaddlePaddle/Paddle/pull/43621))
+
+## **4. Environment Adaptation**
+
+### **Compile and install**
+
+- Complete verification and adaptation for CUDA 11.6, and release CUDA 11.6 precompiled binaries. ([#43935](https://github.com/PaddlePaddle/Paddle/pull/43935), [#44005](https://github.com/PaddlePaddle/Paddle/pull/44005))
+- Fix the cub error when compiling with CUDA 11.6 on Windows. ([#43935](https://github.com/PaddlePaddle/Paddle/pull/43935), [#44005](https://github.com/PaddlePaddle/Paddle/pull/44005))
+- Fix the long compilation time of the elementwise and reduce ops. ([#43202](https://github.com/PaddlePaddle/Paddle/pull/43202), [#42779](https://github.com/PaddlePaddle/Paddle/pull/42779), [#43205](https://github.com/PaddlePaddle/Paddle/pull/43205))
+
+### **New hardware adaptation**
+
+- Cambricon MLU supports the PaddlePaddle Profiler. ([#42115](https://github.com/PaddlePaddle/Paddle/pull/42115))
+- GraphCore IPU supports displaying compilation progress. ([#42078](https://github.com/PaddlePaddle/Paddle/pull/42078))
+
 # 2.3.0 Release Note
 
 ## 1. **Important Updates**