diff --git a/docs/api/paddle/matrix_power_cn.rst b/docs/api/paddle/linalg/matrix_power_cn.rst
similarity index 86%
rename from docs/api/paddle/matrix_power_cn.rst
rename to docs/api/paddle/linalg/matrix_power_cn.rst
index 210b41e61c9..c1f771a92f0 100644
--- a/docs/api/paddle/matrix_power_cn.rst
+++ b/docs/api/paddle/linalg/matrix_power_cn.rst
@@ -3,7 +3,7 @@
matrix_power
-------------------------------
-.. py:function:: paddle.matrix_power(x, n, name=None)
+.. py:function:: paddle.linalg.matrix_power(x, n, name=None)
Computes the ``n``-th power of a square matrix or a batch of square matrices.
@@ -41,17 +41,17 @@ matrix_power
x = paddle.to_tensor([[1, 2, 3],
[1, 4, 9],
[1, 8, 27]], dtype='float64')
- print(paddle.matrix_power(x, 2))
+ print(paddle.linalg.matrix_power(x, 2))
# [[6. , 34. , 102.],
# [14. , 90. , 282.],
# [36. , 250., 804.]]
- print(paddle.matrix_power(x, 0))
+ print(paddle.linalg.matrix_power(x, 0))
# [[1., 0., 0.],
# [0., 1., 0.],
# [0., 0., 1.]]
- print(paddle.matrix_power(x, -2))
+ print(paddle.linalg.matrix_power(x, -2))
# [[ 12.91666667, -12.75000000, 2.83333333 ],
# [-7.66666667 , 8. , -1.83333333 ],
- # [ 1.80555556 , -1.91666667 , 0.44444444 ]]
\ No newline at end of file
+ # [ 1.80555556 , -1.91666667 , 0.44444444 ]]
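To make the semantics of ``paddle.linalg.matrix_power`` concrete, here is a minimal pure-Python sketch of what it computes for a single square matrix. It is not Paddle's implementation: it uses exponentiation by squaring for positive ``n``, returns the identity for ``n = 0``, and raises the inverse to ``-n`` for negative ``n``. Exact ``Fraction`` arithmetic keeps the inverse exact. The helper names (``matmul``, ``identity``, ``inverse``, ``matrix_power``) are hypothetical.

```python
from fractions import Fraction


def matmul(a, b):
    """Multiply two square matrices given as lists of rows (hypothetical helper)."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def identity(size):
    """Identity matrix with exact Fraction entries."""
    return [[Fraction(int(i == j)) for j in range(size)] for i in range(size)]


def inverse(m):
    """Gauss-Jordan elimination with exact fractions; raises on a singular matrix."""
    size = len(m)
    # Build the augmented matrix [m | I]
    aug = [[Fraction(v) for v in row] + identity(size)[i] for i, row in enumerate(m)]
    for col in range(size):
        pivot = next((r for r in range(col, size) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot becomes 1
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row
        for r in range(size):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]


def matrix_power(m, n):
    """Sketch of the n-th power of one square matrix (n may be negative)."""
    base = [[Fraction(v) for v in row] for row in m]
    if n < 0:
        base = inverse(base)
        n = -n
    result = identity(len(base))
    # Exponentiation by squaring: O(log n) matrix multiplications
    while n > 0:
        if n & 1:
            result = matmul(result, base)
        base = matmul(base, base)
        n >>= 1
    return result


x = [[1, 2, 3], [1, 4, 9], [1, 8, 27]]
print([[float(v) for v in row] for row in matrix_power(x, 2)])
# [[6.0, 34.0, 102.0], [14.0, 90.0, 282.0], [36.0, 250.0, 804.0]]
```

The results for ``n = 2``, ``0`` and ``-2`` agree with the example output above; for ``n = -2`` the first entry is exactly 155/12, which prints as 12.91666667.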
diff --git a/docs/api/paddle/multi_dot_cn.rst b/docs/api/paddle/linalg/multi_dot_cn.rst
similarity index 97%
rename from docs/api/paddle/multi_dot_cn.rst
rename to docs/api/paddle/linalg/multi_dot_cn.rst
index 8dc63f4a419..e6200eecbdd 100755
--- a/docs/api/paddle/multi_dot_cn.rst
+++ b/docs/api/paddle/linalg/multi_dot_cn.rst
@@ -3,7 +3,7 @@
multi_dot
-------------------------------
-.. py:function:: paddle.multi_dot(x, name=None)
+.. py:function:: paddle.linalg.multi_dot(x, name=None)
Multi_dot is an operator that computes the product of multiple matrices.
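Multiplying a chain of matrices gives the same result however the product is parenthesized, but the number of scalar multiplications can differ by orders of magnitude. The sketch below is the classic matrix-chain-ordering dynamic program, which NumPy's ``numpy.linalg.multi_dot`` uses for longer chains; this section does not say whether ``paddle.linalg.multi_dot`` uses the same strategy, so treat it as an illustration of the idea only. ``chain_order`` is a hypothetical helper.

```python
def chain_order(dims):
    """Cheapest multiplication order for a chain of matrices (hypothetical helper).

    Matrix i has shape (dims[i], dims[i + 1]), so n matrices need n + 1 dims.
    Returns (minimum scalar-multiplication count, parenthesization string).
    """
    n = len(dims) - 1
    # cost[i][j]: cheapest cost to multiply matrices i..j (inclusive)
    cost = [[0] * n for _ in range(n)]
    split = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):  # length of the sub-chain
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = float("inf")
            for k in range(i, j):  # split the sub-chain between k and k + 1
                c = cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                if c < cost[i][j]:
                    cost[i][j], split[i][j] = c, k

    def paren(i, j):
        # Rebuild the optimal parenthesization from the recorded splits
        if i == j:
            return f"M{i}"
        k = split[i][j]
        return f"({paren(i, k)} {paren(k + 1, j)})"

    return cost[0][n - 1], paren(0, n - 1)


# M0: 10x100, M1: 100x5, M2: 5x50
print(chain_order([10, 100, 5, 50]))  # (7500, '((M0 M1) M2)')
```

Here ``(M0 M1) M2`` costs 5000 + 2500 = 7500 multiplications, while ``M0 (M1 M2)`` would cost 25000 + 50000 = 75000.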
diff --git a/docs/guides/01_paddle2.0_introduction/basic_concept/amp_cn.ipynb b/docs/guides/01_paddle2.0_introduction/basic_concept/amp_cn.ipynb
new file mode 100644
index 00000000000..e5a5b2106b8
--- /dev/null
+++ b/docs/guides/01_paddle2.0_introduction/basic_concept/amp_cn.ipynb
@@ -0,0 +1,463 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
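+ "# Automatic Mixed Precision Training\n",
+ "\n",
+ "In general, deep learning models are trained with single-precision floating point (FP32). In 2018, Baidu and NVIDIA jointly published the paper [MIXED PRECISION TRAINING](https://arxiv.org/pdf/1710.03740.pdf), which proposed mixed precision training. Mixed precision training uses both single precision (FP32) and half precision (FP16) during training, with the goal of speeding up training while keeping accuracy on par with pure FP32 training. This tutorial shows how to implement automatic mixed precision training with PaddlePaddle."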
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
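+ "## 1. Half-Precision Floating Point (FP16)\n",
+ "\n",
+ "As shown in Figure 1, half precision (FP16) is a relatively new floating-point type stored in 2 bytes (16 bits). The IEEE 754-2008 standard also calls it binary16. Compared with the single-precision (FP32) and double-precision (FP64) types commonly used in computing, FP16 is better suited to scenarios with lower precision requirements.\n",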
+ "\n",
+ "\n",
+ "Figure 1. Half precision (FP16) and single precision (FP32) data formats"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
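+ "## 2. FP16 Compute Performance on NVIDIA GPUs\n",
+ "With the same hyperparameters, mixed precision training with FP16 and FP32 can reach the same accuracy as pure single-precision training while speeding up training. This is mainly due to the following properties of FP16 computation on NVIDIA Volta and Turing GPUs:\n",
+ "- FP16 halves memory bandwidth and storage requirements, so researchers can use larger, more complex models and larger batch sizes on the same hardware.\n",
+ "- FP16 can take full advantage of the Tensor Cores in NVIDIA Volta and Turing GPUs. On the same GPU, Tensor Core FP16 throughput is up to 8 times that of FP32."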
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
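+ "## 3. Automatic Mixed Precision with PaddlePaddle\n",
+ "PaddlePaddle's APIs ``paddle.amp.auto_cast``, ``paddle.amp.decorate`` and ``paddle.amp.GradScaler`` implement automatic mixed precision (AMP) training: for each relevant OP, FP16 or FP32 is chosen automatically according to a set of rules. PaddlePaddle's AMP offers two modes:\n",
+ "- level='O1': mixed precision training based on black and white lists; the lists of OPs computed in FP16 and in FP32 are given in this [document](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/amp/Overview_cn.html).\n",
+ "- level='O2': pure FP16 training; every OP runs in FP16 except those in the user-defined black list and those that do not support FP16.\n",
+ "\n",
+ "The following example shows how to implement mixed precision training with PaddlePaddle."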
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "### 3.1 Helper Functions\n",
+ "First, define helper functions to measure the training time."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "import time\n",
+ "\n",
+ "# Start time\n",
+ "start_time = None\n",
+ "\n",
+ "def start_timer():\n",
+ "    # Record the start time\n",
+ "    global start_time\n",
+ "    start_time = time.time()\n",
+ "\n",
+ "def end_timer_and_print(msg):\n",
+ "    # Print the message and the elapsed training time\n",
+ "    end_time = time.time()\n",
+ "    print(\"\\n\" + msg)\n",
+ "    print(\"total time = {:.3f} sec\".format(end_time - start_time))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
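+ "### 3.2 Building a Simple Network\n",
+ "\n",
+ "Build a simple network to compare the training speed of the default method with that of mixed precision training. The network consists of three ``Linear`` layers, and each of the first two ``Linear`` layers is followed by a ``ReLU`` activation."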
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "import paddle\n",
+ "import paddle.nn as nn\n",
+ "\n",
+ "class SimpleNet(nn.Layer):\n",
+ "\n",
+ "    def __init__(self, input_size, output_size):\n",
+ "\n",
+ "        super(SimpleNet, self).__init__()\n",
+ "        self.linear1 = nn.Linear(input_size, output_size)\n",
+ "        self.relu1 = nn.ReLU()\n",
+ "        self.linear2 = nn.Linear(output_size, output_size)\n",
+ "        self.relu2 = nn.ReLU()\n",
+ "        self.linear3 = nn.Linear(output_size, output_size)\n",
+ "\n",
+ "    def forward(self, x):\n",
+ "\n",
+ "        x = self.linear1(x)\n",
+ "        x = self.relu1(x)\n",
+ "        x = self.linear2(x)\n",
+ "        x = self.relu2(x)\n",
+ "        x = self.linear3(x)\n",
+ "\n",
+ "        return x"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
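+ "Set the training parameters. To make the speedup from mixed precision training clearly visible, ``input_size`` and ``output_size`` are set to large values; to make use of the GPU's ``Tensor Core`` units, ``batch_size`` should also be a multiple of 8."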
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "W1110 18:42:02.362493 104 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1\n",
+ "W1110 18:42:02.367755 104 device_context.cc:465] device: 0, cuDNN Version: 7.6.\n"
+ ]
+ }
+ ],
+ "source": [
+ "epochs = 5\n",
+ "input_size = 4096  # set to a large value\n",
+ "output_size = 4096  # set to a large value\n",
+ "batch_size = 512  # batch_size is a multiple of 8\n",
+ "nums_batch = 50\n",
+ "\n",
+ "train_data = [paddle.randn((batch_size, input_size)) for _ in range(nums_batch)]\n",
+ "labels = [paddle.randn((batch_size, output_size)) for _ in range(nums_batch)]\n",
+ "\n",
+ "mse = paddle.nn.MSELoss()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "### 3.3 Training with the Default Method"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Tensor(shape=[1], dtype=float32, place=CUDAPlace(0), stop_gradient=False,\n",
+ " [1.24519622])\n",
+ "\n",
+ "Default time:\n",
+ "total time = 2.926 sec\n"
+ ]
+ }
+ ],
+ "source": [
+ "model = SimpleNet(input_size, output_size)  # define the model\n",
+ "\n",
+ "optimizer = paddle.optimizer.SGD(learning_rate=0.0001, parameters=model.parameters())  # define the optimizer\n",
+ "\n",
+ "start_timer()  # record the training start time\n",
+ "\n",
+ "for epoch in range(epochs):\n",
+ "    datas = zip(train_data, labels)\n",
+ "    for i, (data, label) in enumerate(datas):\n",
+ "\n",
+ "        output = model(data)\n",
+ "        loss = mse(output, label)\n",
+ "\n",
+ "        # backpropagation\n",
+ "        loss.backward()\n",
+ "\n",
+ "        # update the parameters\n",
+ "        optimizer.step()\n",
+ "        optimizer.clear_grad()\n",
+ "\n",
+ "print(loss)\n",
+ "end_timer_and_print(\"Default time:\")  # print the message and the total time"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
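+ "### 3.4 Training with AMP\n",
+ "\n",
+ "Automatic mixed precision training in PaddlePaddle takes four steps:\n",
+ "\n",
+ "- Step1: Define a ``GradScaler``, which scales the ``loss`` to avoid floating-point underflow.\n",
+ "- Step2: Call ``decorate``. In level='O1' mode this API does nothing and need not be called; in level='O2' mode it converts the network parameters from FP32 to FP16.\n",
+ "- Step3: Use ``auto_cast`` to create an AMP context, in which the input data type (FP16 or FP32) of each OP is determined automatically.\n",
+ "- Step4: Use the ``GradScaler`` defined in Step1 to scale the ``loss``, then backpropagate with the scaled ``loss`` to complete the training step.\n",
+ "\n",
+ "Training in level='O1' mode:"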
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Tensor(shape=[1], dtype=float32, place=CUDAPlace(0), stop_gradient=False,\n",
+ " [1.24815702])\n",
+ "\n",
+ "AMP time in O1 mode:\n",
+ "total time = 1.294 sec\n"
+ ]
+ }
+ ],
+ "source": [
+ "model = SimpleNet(input_size, output_size)  # define the model\n",
+ "\n",
+ "optimizer = paddle.optimizer.SGD(learning_rate=0.0001, parameters=model.parameters())  # define the optimizer\n",
+ "\n",
+ "# Step1: define a GradScaler to scale the loss and avoid floating-point underflow\n",
+ "scaler = paddle.amp.GradScaler(init_loss_scaling=1024)\n",
+ "\n",
+ "start_timer()  # record the training start time\n",
+ "\n",
+ "for epoch in range(epochs):\n",
+ "    datas = zip(train_data, labels)\n",
+ "    for i, (data, label) in enumerate(datas):\n",
+ "\n",
+ "        # Step3: create the AMP context (Step2, decorate, is not needed in level='O1' mode)\n",
+ "        with paddle.amp.auto_cast():\n",
+ "            output = model(data)\n",
+ "            loss = mse(output, label)\n",
+ "\n",
+ "        # Step4: scale the loss with the GradScaler from Step1 and backpropagate the scaled loss\n",
+ "        scaled = scaler.scale(loss)\n",
+ "        scaled.backward()\n",
+ "\n",
+ "        # update the parameters\n",
+ "        scaler.minimize(optimizer, scaled)\n",
+ "        optimizer.clear_grad()\n",
+ "\n",
+ "print(loss)\n",
+ "end_timer_and_print(\"AMP time in O1 mode:\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "Training in level='O2' mode:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "in ParamBase copy_to func\n",
+ "in ParamBase copy_to func\n",
+ "in ParamBase copy_to func\n",
+ "in ParamBase copy_to func\n",
+ "in ParamBase copy_to func\n",
+ "in ParamBase copy_to func\n",
+ "Tensor(shape=[1], dtype=float32, place=CUDAPlace(0), stop_gradient=False,\n",
+ " [1.25423336])\n",
+ "\n",
+ "AMP time in O2 mode:\n",
+ "total time = 0.890 sec\n"
+ ]
+ }
+ ],
+ "source": [
+ "model = SimpleNet(input_size, output_size)  # define the model\n",
+ "\n",
+ "optimizer = paddle.optimizer.SGD(learning_rate=0.0001, parameters=model.parameters())  # define the optimizer\n",
+ "\n",
+ "# Step1: define a GradScaler to scale the loss and avoid floating-point underflow\n",
+ "scaler = paddle.amp.GradScaler(init_loss_scaling=1024)\n",
+ "\n",
+ "# Step2: in level='O2' mode, convert the network parameters from FP32 to FP16\n",
+ "model, optimizer = paddle.amp.decorate(models=model, optimizers=optimizer, level='O2', master_weight=None, save_dtype=None)\n",
+ "\n",
+ "start_timer()  # record the training start time\n",
+ "\n",
+ "for epoch in range(epochs):\n",
+ "    datas = zip(train_data, labels)\n",
+ "    for i, (data, label) in enumerate(datas):\n",
+ "\n",
+ "        # Step3: create the AMP context and enable automatic mixed precision\n",
+ "        with paddle.amp.auto_cast(enable=True, custom_white_list=None, custom_black_list=None, level='O2'):\n",
+ "            output = model(data)\n",
+ "            loss = mse(output, label)\n",
+ "\n",
+ "        # Step4: scale the loss with the GradScaler from Step1 and backpropagate the scaled loss\n",
+ "        scaled = scaler.scale(loss)\n",
+ "        scaled.backward()\n",
+ "\n",
+ "        # update the parameters\n",
+ "        scaler.minimize(optimizer, scaled)\n",
+ "        optimizer.clear_grad()\n",
+ "\n",
+ "print(loss)\n",
+ "end_timer_and_print(\"AMP time in O2 mode:\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
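+ "## 4. Advanced Usage\n",
+ "### 4.1 Gradient Accumulation\n",
+ "Gradient accumulation means that after computing the gradients for one batch, the model parameters are not updated immediately; instead, training continues on the next batch and the gradients keep accumulating. Once a set number of batches has been processed, the accumulated gradients are used to update the parameters. This effectively enlarges the batch_size.\n",
+ "\n",
+ "Automatic mixed precision training also supports gradient accumulation, as follows:"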
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Tensor(shape=[1], dtype=float32, place=CUDAPlace(0), stop_gradient=False,\n",
+ " [1.25602019])\n",
+ "\n",
+ "AMP time:\n",
+ "total time = 1.026 sec\n"
+ ]
+ }
+ ],
+ "source": [
+ "model = SimpleNet(input_size, output_size)  # define the model\n",
+ "\n",
+ "optimizer = paddle.optimizer.SGD(learning_rate=0.0001, parameters=model.parameters())  # define the optimizer\n",
+ "\n",
+ "accumulate_batchs_num = 10  # number of batches over which gradients are accumulated\n",
+ "\n",
+ "# define the GradScaler\n",
+ "scaler = paddle.amp.GradScaler(init_loss_scaling=1024)\n",
+ "\n",
+ "start_timer()  # record the training start time\n",
+ "\n",
+ "for epoch in range(epochs):\n",
+ "    datas = zip(train_data, labels)\n",
+ "    for i, (data, label) in enumerate(datas):\n",
+ "\n",
+ "        # create the AMP context and enable automatic mixed precision\n",
+ "        with paddle.amp.auto_cast():\n",
+ "            output = model(data)\n",
+ "            loss = mse(output, label)\n",
+ "\n",
+ "        # scale the loss with the GradScaler and backpropagate the scaled loss\n",
+ "        scaled = scaler.scale(loss)\n",
+ "        scaled.backward()\n",
+ "\n",
+ "        # once accumulate_batchs_num batches have been accumulated, update the model parameters\n",
+ "        if (i + 1) % accumulate_batchs_num == 0:\n",
+ "\n",
+ "            # update the parameters\n",
+ "            scaler.minimize(optimizer, scaled)\n",
+ "            optimizer.clear_grad()\n",
+ "\n",
+ "print(loss)\n",
+ "end_timer_and_print(\"AMP time:\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
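+ "## 5. Summary\n",
+ "The example above shows that with automatic mixed precision training, O1 mode takes about 1.294 s in total and O2 mode about 0.890 s, while the default training method takes 2.926 s. That is a speedup of roughly 2.3x in O1 mode and 3.3x in O2 mode in this run. For more mixed precision training examples, see the PaddlePaddle model zoo: [paddlepaddle/models](https://github.com/PaddlePaddle/models)."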
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "py35-paddle1.2.0"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.4"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
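Why does ``GradScaler`` scale the loss? The sketch below uses Python's ``struct`` half-precision format code (``'e'``) to round values to FP16 and shows that a tiny gradient underflows to zero in FP16 but survives once it is multiplied by the loss scale and divided back in higher precision. It also includes a simplified dynamic loss-scaling policy. ``to_fp16`` and ``ToyLossScaler`` are hypothetical; the parameter names mirror ``paddle.amp.GradScaler``'s ``init_loss_scaling``, ``incr_ratio``, ``decr_ratio``, ``incr_every_n_steps`` and ``decr_every_n_nan_or_inf`` arguments, but this is an illustration under those assumptions, not Paddle's implementation.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 binary16 value (hypothetical helper)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # Values beyond the FP16 range (max about 65504) become infinity
        return math.copysign(math.inf, x)


class ToyLossScaler:
    """Simplified dynamic loss scaling; illustrative only, not Paddle's code."""

    def __init__(self, init_loss_scaling=1024.0, incr_ratio=2.0, decr_ratio=0.5,
                 incr_every_n_steps=1000, decr_every_n_nan_or_inf=2):
        self.scale = init_loss_scaling
        self.incr_ratio = incr_ratio
        self.decr_ratio = decr_ratio
        self.incr_every_n_steps = incr_every_n_steps
        self.decr_every_n_nan_or_inf = decr_every_n_nan_or_inf
        self.good_steps = 0
        self.bad_steps = 0

    def unscale_and_check(self, fp16_grads):
        """Divide grads by the scale in Python float precision; None if any is inf/nan."""
        if any(not math.isfinite(g) for g in fp16_grads):
            return None
        return [g / self.scale for g in fp16_grads]

    def update(self, found_inf):
        """Adjust the scale after a step; return True if the update should be applied."""
        if found_inf:
            self.good_steps = 0
            self.bad_steps += 1
            if self.bad_steps == self.decr_every_n_nan_or_inf:
                self.scale *= self.decr_ratio  # shrink after repeated overflows
                self.bad_steps = 0
            return False  # skip this parameter update
        self.bad_steps = 0
        self.good_steps += 1
        if self.good_steps == self.incr_every_n_steps:
            self.scale *= self.incr_ratio  # grow after a run of finite steps
            self.good_steps = 0
        return True


true_grad = 1e-8                        # smaller than the smallest FP16 subnormal
print(to_fp16(true_grad))               # 0.0: the gradient underflows
scaler = ToyLossScaler(init_loss_scaling=1024.0)
scaled = to_fp16(true_grad * scaler.scale)
print(scaler.unscale_and_check([scaled]))  # close to 1e-8 again
```

The growth and shrink rules let the scale settle near the largest value that does not overflow, which is why a fixed ``init_loss_scaling`` is only a starting point.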
diff --git a/docs/guides/01_paddle2.0_introduction/basic_concept/amp_cn.md b/docs/guides/01_paddle2.0_introduction/basic_concept/amp_cn.md
index bc96b6736a4..646e01ecd37 100644
--- a/docs/guides/01_paddle2.0_introduction/basic_concept/amp_cn.md
+++ b/docs/guides/01_paddle2.0_introduction/basic_concept/amp_cn.md
@@ -1,6 +1,6 @@
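# Automatic Mixed Precision Training
-In general, deep learning models are trained with single-precision floating point (FP32). In 2018, Baidu and NVIDIA jointly published the paper [MIXED PRECISION TRAINING](https://arxiv.org/pdf/1710.03740.pdf), which proposed mixed precision training. Mixed precision training uses both single precision (FP32) and half precision (FP16) during training, with the goal of speeding up training while keeping accuracy on par with pure FP32 training. This tutorial shows how to implement automatic mixed precision training with PaddlePaddle.
+In general, deep learning models are trained with single-precision floating point (FP32). In 2018, Baidu and NVIDIA jointly published the paper [MIXED PRECISION TRAINING](https://arxiv.org/pdf/1710.03740.pdf), which proposed mixed precision training. Mixed precision training uses both single precision (FP32) and half precision (FP16) during training, with the goal of speeding up training while keeping accuracy on par with pure FP32 training. This tutorial shows how to implement automatic mixed precision training with PaddlePaddle.
## 1. Half-Precision Floating Point (FP16)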
@@ -57,6 +57,7 @@ import paddle.nn as nn
class SimpleNet(nn.Layer):
    def __init__(self, input_size, output_size):
+
        super(SimpleNet, self).__init__()
        self.linear1 = nn.Linear(input_size, output_size)
        self.relu1 = nn.ReLU()
@@ -91,6 +92,10 @@ labels = [paddle.randn((batch_size, output_size)) for _ in range(nums_batch)]
mse = paddle.nn.MSELoss()
```
+ W1110 18:42:02.362493 104 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
+ W1110 18:42:02.367755 104 device_context.cc:465] device: 0, cuDNN Version: 7.6.
+
+
### 3.3 Training with the Default Method
@@ -120,10 +125,10 @@ end_timer_and_print("Default time:") # get the end time and print the related information
```
Tensor(shape=[1], dtype=float32, place=CUDAPlace(0), stop_gradient=False,
- [1.24609220])
-
+ [1.24519622])
+
Default time:
- total time = 2.819 sec
+ total time = 2.926 sec
### 3.4 Training with AMP
@@ -138,6 +143,7 @@ end_timer_and_print("Default time:") # get the end time and print the related information
Training in level='O1' mode:
+
```python
model = SimpleNet(input_size, output_size) # 定义模型
@@ -170,14 +176,15 @@ end_timer_and_print("使用AMP-O1模式耗时:")
```
Tensor(shape=[1], dtype=float32, place=CUDAPlace(0), stop_gradient=False,
- [1.24609900])
-
+ [1.24815702])
+
AMP time in O1 mode:
- total time = 1.324 sec
+ total time = 1.294 sec
Training in level='O2' mode:
+
```python
model = SimpleNet(input_size, output_size) # 定义模型
@@ -212,11 +219,17 @@ print(loss)
end_timer_and_print("AMP time in O2 mode:")
```
+ in ParamBase copy_to func
+ in ParamBase copy_to func
+ in ParamBase copy_to func
+ in ParamBase copy_to func
+ in ParamBase copy_to func
+ in ParamBase copy_to func
Tensor(shape=[1], dtype=float32, place=CUDAPlace(0), stop_gradient=False,
- [1.24997652])
-
+ [1.25423336])
+
AMP time in O2 mode:
- total time = 0.933 sec
+ total time = 0.890 sec
## 4. Advanced Usage
@@ -263,10 +276,11 @@ end_timer_and_print("AMP time:")
```
Tensor(shape=[1], dtype=float32, place=CUDAPlace(0), stop_gradient=False,
- [1.24623466])
-
+ [1.25602019])
+
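AMP time:
- total time = 1.020 sec
+ total time = 1.026 sec
+
## 5. Summary
-The example above shows that with automatic mixed precision training, O1 mode takes about 1.324 s in total and O2 mode about 0.933 s, while the default training method takes 2.819 s; the training speedup is about 2.1x in O1 mode and about 3.0x in O2 mode. For more mixed precision training examples, see the PaddlePaddle model zoo: [paddlepaddle/models](https://github.com/PaddlePaddle/models).
+The example above shows that with automatic mixed precision training, O1 mode takes about 1.294 s in total and O2 mode about 0.890 s, while the default training method takes 2.926 s. That is a speedup of roughly 2.3x in O1 mode and 3.3x in O2 mode in this run. For more mixed precision training examples, see the PaddlePaddle model zoo: [paddlepaddle/models](https://github.com/PaddlePaddle/models).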
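The speedup figures in the summary are the ratio of the default training time to the AMP training time. A quick check with the timings recorded in this run (wall-clock times will differ on other hardware); ``speedup`` is a hypothetical helper:

```python
# Wall-clock timings (seconds) recorded in the run above
timings = {"default": 2.926, "O1": 1.294, "O2": 0.890}


def speedup(baseline, amp_time):
    """How many times faster the AMP run is than the baseline run (hypothetical helper)."""
    return baseline / amp_time


for mode in ("O1", "O2"):
    ratio = speedup(timings["default"], timings[mode])
    print(f"{mode}: {ratio:.1f}x")
# O1: 2.3x
# O2: 3.3x
```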
diff --git a/docs/guides/01_paddle2.0_introduction/basic_concept/amp_en.ipynb b/docs/guides/01_paddle2.0_introduction/basic_concept/amp_en.ipynb
new file mode 100644
index 00000000000..22c12fcfed1
--- /dev/null
+++ b/docs/guides/01_paddle2.0_introduction/basic_concept/amp_en.ipynb
@@ -0,0 +1,453 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "# Automatic Mixed Precision Training\n",
+ "\n",
+ "In general, deep learning models are trained with the single-precision floating-point format (FP32). In 2018, Baidu and NVIDIA jointly published the paper [MIXED PRECISION TRAINING](https://arxiv.org/pdf/1710.03740.pdf), which proposed mixed precision training: during training, some operators use FP32 while others use half precision (FP16). The goal is to speed up training while maintaining the same accuracy as FP32 training. This tutorial shows how to use automatic mixed precision training with PaddlePaddle."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "## 1. Half Precision (FP16)\n",
+ "\n",
+ "As shown in Figure 1, FP16 occupies 16 bits (2 bytes) of memory. The IEEE 754-2008 standard calls it binary16. Compared with the commonly used FP32 and double precision (FP64), FP16 is better suited to scenarios with lower precision requirements.\n",
+ "\n",
+ "\n",
+ "Figure 1. Half precision (FP16) and single precision (FP32)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "## 2. FP16 Computing Power of NVIDIA GPU\n",
+ "\n",
+ "With the same hyperparameters, mixed precision training with FP16 and FP32 can reach the same accuracy as pure single-precision training while speeding up training. This is mainly attributable to the following properties of FP16 computation on NVIDIA Volta and Turing GPUs:\n",
+ "- FP16 halves memory bandwidth and storage requirements, which allows researchers to use larger, more complex models and larger batch sizes on the same hardware.\n",
+ "- FP16 can take full advantage of the Tensor Cores in NVIDIA Volta and Turing GPUs. On the same GPU, Tensor Core FP16 throughput is up to 8 times that of FP32."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "## 3. Automatic Mixed Precision Training with PaddlePaddle\n",
+ "\n",
+ "PaddlePaddle's APIs ``paddle.amp.auto_cast``, ``paddle.amp.decorate`` and ``paddle.amp.GradScaler`` implement automatic mixed precision (AMP) training, which automatically chooses FP16 or FP32 for each operator according to a set of rules. PaddlePaddle's AMP offers two modes: level='O1' uses black and white lists to decide which operators run in FP16 and which in FP32 (the lists are given in this [document](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/amp/Overview_cn.html)), while level='O2' is pure FP16 training, in which every operator runs in FP16 except those in the user-defined black list and those that do not support FP16. The following example shows how to use PaddlePaddle for mixed precision training."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "### 3.1 Helper Functions\n",
+ "First, define helper functions to measure the training time."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "import time\n",
+ "\n",
+ "# start time\n",
+ "start_time = None\n",
+ "\n",
+ "def start_timer():\n",
+ "    # get start time\n",
+ "    global start_time\n",
+ "    start_time = time.time()\n",
+ "\n",
+ "def end_timer_and_print(msg):\n",
+ "    # print message and total training time\n",
+ "    end_time = time.time()\n",
+ "    print(\"\\n\" + msg)\n",
+ "    print(\"total time = {:.3f} sec\".format(end_time - start_time))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "### 3.2 A Simple Network\n",
+ "\n",
+ "Define a simple network to compare the training speed of the default method with that of mixed precision training. The network consists of three ``Linear`` layers, and each of the first two ``Linear`` layers is followed by a ``ReLU`` activation."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "import paddle\n",
+ "import paddle.nn as nn\n",
+ "\n",
+ "class SimpleNet(nn.Layer):\n",
+ "\n",
+ "    def __init__(self, input_size, output_size):\n",
+ "\n",
+ "        super(SimpleNet, self).__init__()\n",
+ "        self.linear1 = nn.Linear(input_size, output_size)\n",
+ "        self.relu1 = nn.ReLU()\n",
+ "        self.linear2 = nn.Linear(output_size, output_size)\n",
+ "        self.relu2 = nn.ReLU()\n",
+ "        self.linear3 = nn.Linear(output_size, output_size)\n",
+ "\n",
+ "    def forward(self, x):\n",
+ "\n",
+ "        x = self.linear1(x)\n",
+ "        x = self.relu1(x)\n",
+ "        x = self.linear2(x)\n",
+ "        x = self.relu2(x)\n",
+ "        x = self.linear3(x)\n",
+ "\n",
+ "        return x"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "Set the training parameters. To clearly show the speedup from mixed precision training, set ``input_size`` and ``output_size`` to large values. To make use of the GPU's ``Tensor Core`` units, ``batch_size`` should be a multiple of 8."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "epochs = 5\n",
+ "input_size = 4096 # set to a larger value\n",
+ "output_size = 4096 # set to a larger value\n",
+ "batch_size = 512 # batch_size is a multiple of 8\n",
+ "nums_batch = 50\n",
+ "\n",
+ "train_data = [paddle.randn((batch_size, input_size)) for _ in range(nums_batch)]\n",
+ "labels = [paddle.randn((batch_size, output_size)) for _ in range(nums_batch)]\n",
+ "\n",
+ "mse = paddle.nn.MSELoss()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "### 3.3 Training with Default Method"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Tensor(shape=[1], dtype=float32, place=CUDAPlace(0), stop_gradient=False,\n",
+ " [1.24072289])\n",
+ "\n",
+ "Default time:\n",
+ "total time = 2.935 sec\n"
+ ]
+ }
+ ],
+ "source": [
+ "model = SimpleNet(input_size, output_size) # define model\n",
+ "\n",
+ "optimizer = paddle.optimizer.SGD(learning_rate=0.0001, parameters=model.parameters()) # define optimizer\n",
+ "\n",
+ "start_timer() # get the start time of training\n",
+ "\n",
+ "for epoch in range(epochs):\n",
+ "    datas = zip(train_data, labels)\n",
+ "    for i, (data, label) in enumerate(datas):\n",
+ "\n",
+ "        output = model(data)\n",
+ "        loss = mse(output, label)\n",
+ "\n",
+ "        # backpropagation\n",
+ "        loss.backward()\n",
+ "\n",
+ "        # update parameters\n",
+ "        optimizer.step()\n",
+ "        optimizer.clear_grad()\n",
+ "\n",
+ "print(loss)\n",
+ "end_timer_and_print(\"Default time:\")  # print message and total time"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "### 3.4 Training with AMP\n",
+ "\n",
+ "Using automatic mixed precision training with PaddlePaddle requires four steps:\n",
+ "\n",
+ "- Step1: Define a ``GradScaler``, which scales the ``loss`` to avoid underflow.\n",
+ "- Step2: Call ``decorate``. In level='O1' mode it does nothing and need not be called; in level='O2' mode it converts the network parameters from FP32 to FP16.\n",
+ "- Step3: Use ``auto_cast`` to create an AMP context, in which the input data type (FP16 or FP32) of each operator is determined automatically.\n",
+ "- Step4: Use the ``GradScaler`` defined in Step1 to scale the ``loss``, then backpropagate with the scaled ``loss`` to complete the training step.\n",
+ "\n",
+ "In level='O1' mode:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Tensor(shape=[1], dtype=float32, place=CUDAPlace(0), stop_gradient=False,\n",
+ " [1.24848151])\n",
+ "\n",
+ "AMP time in O1 mode:\n",
+ "total time = 1.299 sec\n"
+ ]
+ }
+ ],
+ "source": [
+ "model = SimpleNet(input_size, output_size) # define model\n",
+ "\n",
+ "optimizer = paddle.optimizer.SGD(learning_rate=0.0001, parameters=model.parameters()) # define optimizer\n",
+ "\n",
+ "# Step1: define the GradScaler\n",
+ "scaler = paddle.amp.GradScaler(init_loss_scaling=1024)\n",
+ "\n",
+ "start_timer()  # get start time\n",
+ "\n",
+ "for epoch in range(epochs):\n",
+ "    datas = zip(train_data, labels)\n",
+ "    for i, (data, label) in enumerate(datas):\n",
+ "\n",
+ "        # Step3: create the AMP context (Step2, decorate, is not needed in level='O1' mode)\n",
+ "        with paddle.amp.auto_cast():\n",
+ "            output = model(data)\n",
+ "            loss = mse(output, label)\n",
+ "\n",
+ "        # Step4: use the GradScaler to scale the loss and backpropagate\n",
+ "        scaled = scaler.scale(loss)\n",
+ "        scaled.backward()\n",
+ "\n",
+ "        # update parameters\n",
+ "        scaler.minimize(optimizer, scaled)\n",
+ "        optimizer.clear_grad()\n",
+ "\n",
+ "print(loss)\n",
+ "end_timer_and_print(\"AMP time in O1 mode:\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "In level='O2' mode:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "in ParamBase copy_to func\n",
+ "in ParamBase copy_to func\n",
+ "in ParamBase copy_to func\n",
+ "in ParamBase copy_to func\n",
+ "in ParamBase copy_to func\n",
+ "in ParamBase copy_to func\n",
+ "Tensor(shape=[1], dtype=float32, place=CUDAPlace(0), stop_gradient=False,\n",
+ " [1.25075114])\n",
+ "\n",
+ "AMP time in O2 mode:\n",
+ "total time = 0.888 sec\n"
+ ]
+ }
+ ],
+ "source": [
+ "model = SimpleNet(input_size, output_size) # define model\n",
+ "\n",
+ "optimizer = paddle.optimizer.SGD(learning_rate=0.0001, parameters=model.parameters()) # define optimizer\n",
+ "\n",
+ "# Step1:define GradScaler\n",
+ "scaler = paddle.amp.GradScaler(init_loss_scaling=1024)\n",
+ "\n",
+ "# Step2:in level='O2' mode, convert network parameters from FP32 to FP16\n",
+ "model, optimizer = paddle.amp.decorate(models=model, optimizers=optimizer, level='O2', master_weight=None, save_dtype=None)\n",
+ "\n",
+ "start_timer() # get start time\n",
+ "\n",
+ "for epoch in range(epochs):\n",
+ "    datas = zip(train_data, labels)\n",
+ "    for i, (data, label) in enumerate(datas):\n",
+ "\n",
+ "        # Step3: create the AMP context\n",
+ "        with paddle.amp.auto_cast(enable=True, custom_white_list=None, custom_black_list=None, level='O2'):\n",
+ "            output = model(data)\n",
+ "            loss = mse(output, label)\n",
+ "\n",
+ "        # Step4: use the GradScaler to scale the loss and backpropagate\n",
+ "        scaled = scaler.scale(loss)\n",
+ "        scaled.backward()\n",
+ "\n",
+ "        # update parameters\n",
+ "        scaler.minimize(optimizer, scaled)\n",
+ "        optimizer.clear_grad()\n",
+ "\n",
+ "print(loss)\n",
+ "end_timer_and_print(\"AMP time in O2 mode:\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "## 4. Advanced Usage\n",
+ "### 4.1 Gradient Accumulation\n",
+ "\n",
+ "Gradient accumulation means running a configured number of steps without updating the model parameters; once that number of steps is reached, the accumulated gradients are used to update the parameters. This effectively enlarges the batch size.\n",
+ "\n",
+ "In automatic mixed precision training, gradient accumulation is also supported, and the usage is as follows:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Tensor(shape=[1], dtype=float32, place=CUDAPlace(0), stop_gradient=False,\n",
+ " [1.25853443])\n",
+ "\n",
+ "AMP time:\n",
+ "total time = 1.034 sec\n"
+ ]
+ }
+ ],
+ "source": [
+ "model = SimpleNet(input_size, output_size) # define model\n",
+ "\n",
+ "optimizer = paddle.optimizer.SGD(learning_rate=0.0001, parameters=model.parameters()) # define optimizer\n",
+ "\n",
+ "accumulate_batchs_num = 10  # number of batches over which gradients are accumulated\n",
+ "\n",
+ "# define GradScaler\n",
+ "scaler = paddle.amp.GradScaler(init_loss_scaling=1024)\n",
+ "\n",
+ "start_timer()  # get start time\n",
+ "\n",
+ "for epoch in range(epochs):\n",
+ "    datas = zip(train_data, labels)\n",
+ "    for i, (data, label) in enumerate(datas):\n",
+ "\n",
+ "        # create the AMP context\n",
+ "        with paddle.amp.auto_cast():\n",
+ "            output = model(data)\n",
+ "            loss = mse(output, label)\n",
+ "\n",
+ "        # use the GradScaler to scale the loss and backpropagate\n",
+ "        scaled = scaler.scale(loss)\n",
+ "        scaled.backward()\n",
+ "\n",
+ "        # once accumulate_batchs_num batches have been accumulated, update the model parameters\n",
+ "        if (i + 1) % accumulate_batchs_num == 0:\n",
+ "\n",
+ "            # update parameters\n",
+ "            scaler.minimize(optimizer, scaled)\n",
+ "            optimizer.clear_grad()\n",
+ "\n",
+ "print(loss)\n",
+ "end_timer_and_print(\"AMP time:\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "## 5. Conclusion\n",
+ "\n",
+ "As the example above shows, with automatic mixed precision training the total time is about 1.299 s in O1 mode and about 0.888 s in O2 mode, while the default training method takes 2.935 s. That is a speedup of roughly 2.3x in O1 mode and 3.3x in O2 mode in this run. For more examples of mixed precision training, see PaddlePaddle's model zoo: [paddlepaddle/models](https://github.com/PaddlePaddle/models)."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "py35-paddle1.2.0"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.4"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
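The gradient accumulation loop in section 4.1 adds up gradients from several batches before a single parameter update. A minimal pure-Python sketch with a one-parameter linear model and mean-squared-error loss shows why this acts like a larger batch: if each batch loss is divided by the number of accumulated batches, the accumulated gradient equals the gradient on the combined batch (for equal-sized batches). The tutorial's loop does not divide the loss, so its accumulated gradient is that many times larger, which under SGD behaves like a proportionally larger learning rate. The model, data and helper names here are hypothetical.

```python
def mse_grad(w, xs, ys):
    """d/dw of mean((w * x - y) ** 2) over one batch (hypothetical one-parameter model)."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, batches, normalize=True):
    """Sum per-batch gradients, optionally dividing each by the number of batches."""
    k = len(batches)
    total = 0.0
    for xs, ys in batches:
        g = mse_grad(w, xs, ys)
        # Summing mirrors how repeated backward() calls accumulate gradients
        total += g / k if normalize else g
    return total


# Four equal-sized batches of hypothetical data
batches = [([1.0, 2.0], [2.0, 4.1]), ([3.0, 4.0], [5.9, 8.2]),
           ([5.0, 6.0], [10.1, 11.8]), ([7.0, 8.0], [14.2, 15.9])]
all_x = [x for xs, _ in batches for x in xs]
all_y = [y for _, ys in batches for y in ys]

w = 0.5
print(accumulated_grad(w, batches))                   # matches the full-batch gradient
print(mse_grad(w, all_x, all_y))
print(accumulated_grad(w, batches, normalize=False))  # 4 times larger
```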
diff --git a/docs/guides/01_paddle2.0_introduction/basic_concept/amp_en.md b/docs/guides/01_paddle2.0_introduction/basic_concept/amp_en.md
index ee31dc70ba1..6c5f15edfae 100644
--- a/docs/guides/01_paddle2.0_introduction/basic_concept/amp_en.md
+++ b/docs/guides/01_paddle2.0_introduction/basic_concept/amp_en.md
@@ -1,6 +1,6 @@
# Automatic Mixed Precision Training
-In general, the datatype of training deep learning models is single-precision floating-point format(also called FP32). In 2018, Baidu and NVIDIA jointly published the paper: [MIXED PRECISION TRAINING](https://arxiv.org/pdf/1710.03740.pdf), which proposed mixed precision training. During the process of training, some operators use FP32 and other operators use half precision(also called FP16) in the same time. Its purpose is to speed up training, while compared with the FP32 training model, the same accuracy is maintained. This tutorial will introduce how to use automatic mixed precision training with PaddlePaddle.
+In general, deep learning models are trained in single-precision floating-point format (FP32). In 2018, Baidu and NVIDIA jointly published the paper [MIXED PRECISION TRAINING](https://arxiv.org/pdf/1710.03740.pdf), which proposed mixed precision training: during training, some operators compute in FP32 while others compute in half precision (FP16). The goal is to speed up training while maintaining the same accuracy as FP32 training. This tutorial introduces how to use automatic mixed precision training with PaddlePaddle.
## 1. Half Precision (FP16)
@@ -55,6 +55,7 @@ import paddle.nn as nn
class SimpleNet(nn.Layer):
def __init__(self, input_size, output_size):
+
super(SimpleNet, self).__init__()
self.linear1 = nn.Linear(input_size, output_size)
self.relu1 = nn.ReLU()
@@ -118,19 +119,22 @@ end_timer_and_print("Default time:") # print massage and total time
```
Tensor(shape=[1], dtype=float32, place=CUDAPlace(0), stop_gradient=False,
- [1.25010288])
-
+ [1.24072289])
+
Default time:
- total time = 2.943 sec
+ total time = 2.935 sec
### 3.4 Training with AMP
-Using automatic mixed precision training with PaddlePaddle requires three steps:
+Using automatic mixed precision training with PaddlePaddle requires four steps:
+
+- Step1: Define a ``GradScaler``, which scales the ``loss`` (and therefore the gradients) to avoid underflow
+- Step2: Use ``decorate``. In level='O1' mode this API has no effect and can be omitted; in level='O2' mode it converts the network parameters from FP32 to FP16
+- Step3: Use ``auto_cast`` to create an AMP context, in which the input data type (FP16 or FP32) of each operator is determined automatically
+- Step4: Use the ``GradScaler`` defined in Step1 to scale the ``loss``, then use the scaled ``loss`` for backpropagation to complete the training step
-- Step1: Define ``GradScaler``, which is used to scale the ``loss`` and ``gradients``to avoid underflow
-- Step2: Use ``auto_cast`` to create an AMP context, in which the input datatype(FP16 or FP32) of each oprator will be automatically determined
-- Step3: Use ``GradScaler`` defined in Step1 to complete the scaling of ``loss``, and use the scaled ``loss`` for backpropagation to complete the training
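To see why Step1 matters, the following NumPy snippet is a minimal illustration, not Paddle's ``GradScaler`` implementation. A hypothetical gradient value of 1e-8 underflows to zero when cast to FP16, but it survives once multiplied by a scale factor of 1024 (the ``init_loss_scaling`` value used in the examples below) and can be recovered by dividing by the same factor in FP32. Because the loss is scaled before ``backward()``, the chain rule scales every gradient by the same factor.

```python
import numpy as np

grad = 1e-8    # a hypothetical tiny FP32 gradient value
scale = 1024.0 # same value as init_loss_scaling in the examples

# Casting the raw gradient to FP16 underflows to zero
unscaled_fp16 = np.float16(grad)
# Scaling first keeps the value inside FP16's (subnormal) range
scaled_fp16 = np.float16(grad * scale)
# Unscaling in FP32 recovers approximately the original value
recovered = np.float32(scaled_fp16) / np.float32(scale)

print(unscaled_fp16, scaled_fp16, recovered)
```

Loss scalers such as ``GradScaler`` typically also unscale the gradients before the optimizer step and adjust the scale dynamically when overflow is detected; the snippet only demonstrates the underflow problem itself.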
+In level='O1' mode:
```python
@@ -161,14 +165,64 @@ for epoch in range(epochs):
optimizer.clear_grad()
print(loss)
-end_timer_and_print("AMP time:")
+end_timer_and_print("AMP time in O1 mode:")
```
Tensor(shape=[1], dtype=float32, place=CUDAPlace(0), stop_gradient=False,
- [1.23644269])
+ [1.24848151])
+
+ AMP time in O1 mode:
+ total time = 1.299 sec
- AMP time:
- total time = 1.222 sec
+
+In level='O2' mode:
+
+
+```python
+model = SimpleNet(input_size, output_size) # define model
+
+optimizer = paddle.optimizer.SGD(learning_rate=0.0001, parameters=model.parameters()) # define optimizer
+
+# Step1: define GradScaler
+scaler = paddle.amp.GradScaler(init_loss_scaling=1024)
+
+# Step2: in level='O2' mode, convert network parameters from FP32 to FP16
+model, optimizer = paddle.amp.decorate(models=model, optimizers=optimizer, level='O2', master_weight=None, save_dtype=None)
+
+start_timer() # get start time
+
+for epoch in range(epochs):
+ datas = zip(train_data, labels)
+ for i, (data, label) in enumerate(datas):
+
+ # Step3: create the AMP context
+ with paddle.amp.auto_cast(enable=True, custom_white_list=None, custom_black_list=None, level='O2'):
+ output = model(data)
+ loss = mse(output, label)
+
+ # Step4: use GradScaler to scale the loss
+ scaled = scaler.scale(loss)
+ scaled.backward()
+
+ # update parameters
+ scaler.minimize(optimizer, scaled)
+ optimizer.clear_grad()
+
+print(loss)
+end_timer_and_print("AMP time in O2 mode:")
+```
+
+ Tensor(shape=[1], dtype=float32, place=CUDAPlace(0), stop_gradient=False,
+ [1.25075114])
+
+ AMP time in O2 mode:
+ total time = 0.888 sec
## 4. Advanced Usage
@@ -204,7 +258,7 @@ for epoch in range(epochs):
scaled = scaler.scale(loss)
scaled.backward()
- # when the accumulated batch is accumulate_batchs_num, update the model parameters
+ # when the number of accumulated batches reaches accumulate_batchs_num, update the model parameters
if (i + 1) % accumulate_batchs_num == 0:
# update parameters
@@ -216,12 +270,12 @@ end_timer_and_print("AMP time:")
```
Tensor(shape=[1], dtype=float32, place=CUDAPlace(0), stop_gradient=False,
- [1.25127280])
-
+ [1.25853443])
+
AMP time:
- total time = 1.006 sec
+ total time = 1.034 sec
## 5. Conclusion
-As can be seen from the above example, using the automatic mixed precision training, the total time is about 1.222s, while the ordinary training method takes 2.943s, and the training speed is increased by about 2.4 times. For more examples of using mixed precision training, please refer to paddlepaddle's models: [paddlepaddle/models](https://github.com/PaddlePaddle/models).
+As the examples above show, with automatic mixed precision training the total time is about 1.299s in O1 mode and about 0.888s in O2 mode, compared with 2.935s for ordinary training. This is a speedup of about 2.3x in O1 mode and about 3.3x in O2 mode. For more examples of mixed precision training, please refer to PaddlePaddle's model repository: [paddlepaddle/models](https://github.com/PaddlePaddle/models).
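The speedup figures follow directly from the measured totals. The short Python check below recomputes them from the timings printed in this tutorial; these are the numbers from this particular run and will differ on other hardware.

```python
# Total training times (seconds) reported by end_timer_and_print in this tutorial
timings = {"Default (FP32)": 2.935, "AMP O1": 1.299, "AMP O2": 0.888}

baseline = timings["Default (FP32)"]
# Speedup relative to ordinary FP32 training
speedups = {name: baseline / seconds for name, seconds in timings.items()}

for name, ratio in speedups.items():
    print(f"{name}: {timings[name]:.3f} s, speedup {ratio:.2f}x")
```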
diff --git a/docs/guides/01_paddle2.0_introduction/basic_concept/autograd_cn.rst b/docs/guides/01_paddle2.0_introduction/basic_concept/autograd_cn.rst
index 3951f03c09d..fcf36e1d774 100644
--- a/docs/guides/01_paddle2.0_introduction/basic_concept/autograd_cn.rst
+++ b/docs/guides/01_paddle2.0_introduction/basic_concept/autograd_cn.rst
@@ -35,7 +35,7 @@ PaddlePaddle的神经网络核心是自动微分,本篇文章主要为你介
.. parsed-literal::
- 2.1.1
+ 2.2.0
本案例首先定义网络。因为本示例着重展示如何使用飞桨进行自动微分,故组网部分不过多展开,直接使用高层API中封装好的模型\ ``vgg11``\ 。
@@ -291,4 +291,4 @@ PaddlePaddle的神经网络核心是自动微分,本篇文章主要为你介
五、总结
------------------------
-本文章主要介绍了如何使用飞桨的自动微分,以及飞桨的自动微分机制。
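+This article introduced how to use automatic differentiation in PaddlePaddle and how PaddlePaddle's automatic differentiation mechanism works.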
\ No newline at end of file
diff --git a/docs/guides/01_paddle2.0_introduction/basic_concept/gradient_clip_cn.rst b/docs/guides/01_paddle2.0_introduction/basic_concept/gradient_clip_cn.rst
index 5f32441212d..7d5cd89b959 100644
--- a/docs/guides/01_paddle2.0_introduction/basic_concept/gradient_clip_cn.rst
+++ b/docs/guides/01_paddle2.0_introduction/basic_concept/gradient_clip_cn.rst
@@ -20,6 +20,8 @@ Paddle提供了三种梯度裁剪方式:
.. code:: ipython3
+ import paddle
+
linear = paddle.nn.Linear(10, 10)
clip = paddle.nn.ClipGradByValue(min=-1, max=1)
sdg = paddle.optimizer.SGD(learning_rate=0.1, parameters=linear.parameters(), grad_clip=clip)
diff --git a/docs/guides/01_paddle2.0_introduction/basic_concept/gradient_clip_en.rst b/docs/guides/01_paddle2.0_introduction/basic_concept/gradient_clip_en.rst
index b6d58570b4f..31fd73f8b11 100644
--- a/docs/guides/01_paddle2.0_introduction/basic_concept/gradient_clip_en.rst
+++ b/docs/guides/01_paddle2.0_introduction/basic_concept/gradient_clip_en.rst
@@ -20,6 +20,8 @@ By default, Gradients of all parameters in SGD optimizer will be clipped:
.. code:: ipython3
+ import paddle
+
linear = paddle.nn.Linear(10, 10)
clip = paddle.nn.ClipGradByValue(min=-1, max=1)
sdg = paddle.optimizer.SGD(learning_rate=0.1, parameters=linear.parameters(), grad_clip=clip)
diff --git a/docs/guides/01_paddle2.0_introduction/basic_concept/tensor_introduction_cn.md b/docs/guides/01_paddle2.0_introduction/basic_concept/tensor_introduction_cn.md
index 3eb03db37b8..00efa373a39 100644
--- a/docs/guides/01_paddle2.0_introduction/basic_concept/tensor_introduction_cn.md
+++ b/docs/guides/01_paddle2.0_introduction/basic_concept/tensor_introduction_cn.md
@@ -81,8 +81,8 @@ array([[1., 2., 3.],
**Tensor**不仅支持 floats、ints 类型数据,也支持 complex numbers数据,如果输入为复数数据,则**Tensor**的dtype为 ``complex64`` 或 ``complex128`` ,其每个元素均为1个复数:
```python
-ndim_2_tensor = paddle.to_tensor([[1.0, 2.0, 3.0],
- [4.0, 5.0, 6.0]])
+ndim_2_tensor = paddle.to_tensor([[(1+1j), (2+2j)],
+ [(3+3j), (4+4j)]])
print(ndim_2_tensor)
```
@@ -473,7 +473,6 @@ x.logical_not(y) #对两个bool型tensor逐元素进行逻辑非操
### 线性代数相关
```python
-x.cholesky() #矩阵的cholesky分解
x.t() #矩阵转置
x.transpose([1, 0]) #交换axis 0 与axis 1的顺序
x.norm('fro') #矩阵的Frobenius 范数
diff --git a/docs/guides/01_paddle2.0_introduction/basic_concept/tensor_introduction_en.md b/docs/guides/01_paddle2.0_introduction/basic_concept/tensor_introduction_en.md
index 9e44ad029a7..f9dfcde4c58 100644
--- a/docs/guides/01_paddle2.0_introduction/basic_concept/tensor_introduction_en.md
+++ b/docs/guides/01_paddle2.0_introduction/basic_concept/tensor_introduction_en.md
@@ -80,8 +80,8 @@ array([[1., 2., 3.],
**Tensor** supports not only floats and ints but also complex numbers data, If input complex number data, the dtype of **Tensor** is ``complex64`` or ``complex128`` :
```python
-ndim_2_tensor = paddle.to_tensor([[1.0, 2.0, 3.0],
- [4.0, 5.0, 6.0]])
+ndim_2_tensor = paddle.to_tensor([[(1+1j), (2+2j)],
+ [(3+3j), (4+4j)]])
print(ndim_2_tensor)
```
@@ -482,7 +482,6 @@ x.logical_not(y) #logic not operation for two bool tensor
### linear algebra operators
```python
-x.cholesky() #cholesky decomposition of a matrix
x.t() #matrix transpose
x.transpose([1, 0]) #swap axis 0 with axis 1
x.norm('fro') #Frobenius Norm of matrix
diff --git a/docs/guides/01_paddle2.0_introduction/load_old_format_model.rst b/docs/guides/01_paddle2.0_introduction/load_old_format_model_cn.rst
similarity index 100%
rename from docs/guides/01_paddle2.0_introduction/load_old_format_model.rst
rename to docs/guides/01_paddle2.0_introduction/load_old_format_model_cn.rst
diff --git a/docs/guides/01_paddle2.0_introduction/migration_cn.rst b/docs/guides/01_paddle2.0_introduction/migration_cn.rst
index f04a2ee8835..94f9e2ee60d 100644
--- a/docs/guides/01_paddle2.0_introduction/migration_cn.rst
+++ b/docs/guides/01_paddle2.0_introduction/migration_cn.rst
@@ -66,7 +66,7 @@ paddle_upgrade_tool 可以使用下面的方式,快速使用:
开始
^^^^
-在使用paddle_upgrade_tool前,需要确保已经安装了Paddle 2.0.0版本。
+Before using paddle_upgrade_tool, make sure that Paddle 2.0.0 or later is installed.
.. code:: ipython3
diff --git a/docs/guides/01_paddle2.0_introduction/update_cn.md b/docs/guides/01_paddle2.0_introduction/update_cn.md
index 2e1c44ab4ac..7f367547d13 100644
--- a/docs/guides/01_paddle2.0_introduction/update_cn.md
+++ b/docs/guides/01_paddle2.0_introduction/update_cn.md
@@ -558,5 +558,5 @@ https://github.com/PaddlePaddle/paddle_upgrade_tool
### 2.0文档教程
以下提供了2.0版本的一些示例教程:
-你可以在官网[应用实践](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/tutorial/index_cn.html)栏目内进行在线浏览,也可以下载在这里提供的源代码:
-https://github.com/PaddlePaddle/book/tree/develop/paddle2.0_docs
+You can browse them online in the [Practices](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/practices/index_cn.html) section of the official website, or download the source code provided here:
+https://github.com/PaddlePaddle/docs/tree/develop/docs/practices
diff --git a/docs/guides/02_paddle2.0_develop/05_train_eval_predict_cn.rst b/docs/guides/02_paddle2.0_develop/05_train_eval_predict_cn.rst
index 789be2a9394..3c2182c9b33 100644
--- a/docs/guides/02_paddle2.0_develop/05_train_eval_predict_cn.rst
+++ b/docs/guides/02_paddle2.0_develop/05_train_eval_predict_cn.rst
@@ -7,7 +7,7 @@
.. note::
- 高层API实现的模型训练与预测如\ ``Model.fit()、Model.evaluate()、Model.predict()``\ 都可以通过基础API实现,本文先介绍高层API的训练方式,然后会将高层API拆解为基础API的方式,方便对比学习。最后会补充介绍如何使用paddle inference进行预测。
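+ Model training and prediction with the high-level API, such as ``Model.fit()``, ``Model.evaluate()`` and ``Model.predict()``, can all be implemented with the basic API. This article first introduces training with the high-level API, then breaks the high-level API down into basic API calls for side-by-side comparison.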
一、训练前准备
---------------------
@@ -137,11 +137,6 @@ numpy_ndarray_n是对应原始数据经过模型计算后得到的预测数据
除了通过第一部分的高层API实现模型的训练与预测,飞桨框架也同样支持通过基础API对模型进行训练与预测。简单来说,\ ``Model.prepare()、Model.fit()、Model.evaluate()、Model.predict()``\ 都是由基础API封装而来。下面通过拆解高层API到基础API的方式,来了解如何用基础API完成模型的训练与预测。
-
-.. note::
-
- 对于网络模型的创建你依旧可以选择Sequential组网方式,也可以采用SubClass组网方式,为方便后续使用paddle inference进行预测,我们使用SubClass组网方式创建网络,若后续使用paddle inference预测,需通过paddle.jit.save保存适用于预测部署的模型,并在forward函数前加@paddle.jit.to_static装饰器,将函数内的动态图API转化为静态图API。
-
.. code:: ipython3
# 定义网络结构( 采用SubClass 组网 )
@@ -153,9 +148,7 @@ numpy_ndarray_n是对应原始数据经过模型计算后得到的预测数据
self.linear_2 = paddle.nn.Linear(512, 10)
self.relu = paddle.nn.ReLU()
self.dropout = paddle.nn.Dropout(0.2)
-
- #后续若不使用paddle inferece,可对 @paddle.jit.to_static 进行注释
- @paddle.jit.to_static
+
def forward(self, inputs):
y = self.flatten(inputs)
y = self.linear_1(y)
@@ -214,9 +207,6 @@ numpy_ndarray_n是对应原始数据经过模型计算后得到的预测数据
# 梯度清零
optim.clear_grad()
- ##保存模型,会生成*.pdmodel、*.pdiparams、*.pdiparams.info三个模型文件
- path='./mnist/inference_model'
- paddle.jit.save(layer=mnist,path=path)
.. parsed-literal::
@@ -284,101 +274,3 @@ numpy_ndarray_n是对应原始数据经过模型计算后得到的预测数据
.. parsed-literal::
predict finished
-
-
-部署预测模型
-=====================
-其中预测方法除以上两种外,还可采用原生推理库paddle inference 进行推理部署,该方法支持TeansorRT加速,支持第三方框架模型,支持量化、裁剪后的模型,适合于工业部署或对推理性能、通用性有要求的用户。
-
-
-四、通过paddle inference实现预测
------------------------------------------
-
-paddle inference与model.predict()以及基础API的预测相比,可使用MKLDNN、CUDNN、TensorRT进行预测加速,同时支持用 X2Paddle 工具从第三方框架(TensorFlow、Pytorh 、 Caffe 等)产出的模型,可联动PaddleSlim,支持加载量化、裁剪和蒸馏后的模型部署。针对不同平台不同的应用场景进行了深度的适配优化,保证模型在服务器端即训即用,快速部署。在这里,我们只简单的展示如何用paddle inference实现该模型的部署预测。
-
-4.1 准备预测部署模型
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-要使用paddle inference预测需得到paddle预测格式的模型,所以你需要在训练过程中通过 paddle.jit.save(layer=mnist,path=path) 来保存模型,注意在训练时在forward函数前加@paddle.jit.to_static装饰器,将函数内的动态图API转化为静态图API。在第三章节基础API模型的训练中已加入相关配置。
-
-.. code:: ipython3
-
- #模型目录如下:
- mnist/
- ├── inference.pdmodel
- ├── inference.pdiparams.info
- └── inference.pdiparams
-4.2 准备预测部署程序
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-将以下代码保存为python_demo.py文件:
-
-.. code:: ipython3
-
- import argparse
- import numpy as np
- from skimage import transform,data
-
- # 引用 paddle inference 预测库
- import paddle.inference as paddle_infer
- from PIL import Image
-
- def main():
- args = parse_args()
-
- # 创建 config
- config = paddle_infer.Config(args.model_file, args.params_file)
-
- # 根据 config 创建 predictor
- predictor = paddle_infer.create_predictor(config)
-
- # 获取输入的名称
- input_names = predictor.get_input_names()
- input_handle = predictor.get_input_handle(input_names[0])
-
- # 设置输入,自定义一张输入照片,图片大小为28*28
- im=Image.open('./img3.png').convert('L')
- im=np.array(im).reshape(1,1,28,28).astype(np.float32)
- input_handle.copy_from_cpu(im)
-
- # 运行predictor
- predictor.run()
-
- # 获取输出
- output_names = predictor.get_output_names()
- output_handle = predictor.get_output_handle(output_names[0])
- output_data = output_handle.copy_to_cpu() # numpy.ndarray类型,是10个分类的概率
- print(output_data)
- print("Output data size is {}".format(output_data.size))
- print("Output data shape is {}".format(output_data.shape))
- pred=np.argmax(output_data) #选出概率最大的一个
- print("The predicted data is : {}".format(pred.item()))
-
- def parse_args():
- parser = argparse.ArgumentParser()
- parser.add_argument("--model_file", type=str, help="model filename")
- parser.add_argument("--params_file", type=str, help="parameter filename")
- parser.add_argument("--batch_size", type=int, default=1, help="batch size")
- return parser.parse_args()
-
- if __name__ == "__main__":
- main()
-
-
-4.3 执行预测程序
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. code:: ipython3
-
- python python_demo.py --model_file ./mnist/inference_model.pdmodel --params_file ./mnist/inference_model.pdiparams --batch_size 2
-
-.. parsed-literal::
-
- #输出如下
-
- [[-1347.5923 -1156.918 -774.73865 3387.0623 -1553.3696 107.96879
- -2631.2185 -701.50323 -1094.3896 206.71666]]
- Output data size is 10
- Output data shape is (1, 10)
- The predicted data is : 3
-
-详细教程可参照paddle inference文档:https://paddle-inference.readthedocs.io/en/latest/quick_start/python_demo.html
-
diff --git a/docs/guides/performance_improving/index_cn.rst b/docs/guides/performance_improving/index_cn.rst
index 241893eca6b..64faa2caf93 100644
--- a/docs/guides/performance_improving/index_cn.rst
+++ b/docs/guides/performance_improving/index_cn.rst
@@ -2,6 +2,11 @@
性能调优
########
+The following content covers performance tuning in the PaddlePaddle framework:
+
+- `Model quantization <./quantization.html>`_ : quantize models with the PaddlePaddle framework.
+
.. toctree::
- :maxdepth: 1
+ :hidden:
+
+ quantization.md
\ No newline at end of file
diff --git a/docs/install/docker/fromdocker.rst b/docs/install/docker/fromdocker.rst
index aa25d82d3d7..62905f664d7 100644
--- a/docs/install/docker/fromdocker.rst
+++ b/docs/install/docker/fromdocker.rst
@@ -5,5 +5,4 @@
.. toctree::
:maxdepth: 1
- linux-docker.md
macos-docker.md
diff --git a/docs/install/docker/fromdocker_en.rst b/docs/install/docker/fromdocker_en.rst
index c0b2b487411..af6a1a7fafe 100644
--- a/docs/install/docker/fromdocker_en.rst
+++ b/docs/install/docker/fromdocker_en.rst
@@ -5,5 +5,4 @@
.. toctree::
- linux-docker_en.md
macos-docker_en.md
diff --git a/docs/practices/cv/image_ocr.ipynb b/docs/practices/cv/image_ocr.ipynb
new file mode 100644
index 00000000000..d3b9c516c16
--- /dev/null
+++ b/docs/practices/cv/image_ocr.ipynb
@@ -0,0 +1,722 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "# CAPTCHA Recognition with OCR\n",
+ "\n",
+ "**Author:** [GT_老张](https://github.com/GT-ZhangAcer) \n",
+ "\n",
+ "**Date:** 2021.11\n",
+ "\n",
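+ "**Abstract:** This tutorial shows how to use PaddlePaddle to build a simple CRNN+CTC OCR model on a custom dataset. The data are the 9453 images in the OCR part of [CaptchaDataset](https://github.com/GT-ZhangAcer/CaptchaDataset); in this example the last 1024 images are held out as the validation set and the remaining 8429 are used for training. \n",
+ "For more complex scenarios, [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) is recommended for producing industrial-grade models that are lightweight and considerably more accurate. \n",
+ "PaddleOCR can also be used quickly through [PaddleHub](https://www.paddlepaddle.org.cn/hubdetail?name=chinese_ocr_db_crnn_mobile&en_category=TextRecognition)."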
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
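+ "## 1. Environment Setup\n",
+ "\n",
+ "This tutorial is based on Paddle 2.2.0. If your environment uses a different version, please first [install](https://www.paddlepaddle.org.cn/install/quick) PaddlePaddle 2.2 by following the official guide."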
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "2.2.0\n"
+ ]
+ }
+ ],
+ "source": [
+ "import paddle\n",
+ "print(paddle.__version__)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
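+ "## 2. Custom Dataset Reader\n",
+ "\n",
+ "In practice, data rarely arrives in a standard format. Fortunately, a custom Reader lets us read the data in whatever form we need. \n",
+ "\n",
+ "A well-designed Reader often improves performance. Necessary setup work, such as reading the label file and building the list of image files, can be done in the `__init__` special method, so that it is loaded into memory once when the `Reader` is instantiated instead of being re-read on every access. Per-sample operations such as image augmentation and normalization can be done in the `__getitem__` special method, and that memory is released once the sample has been read. \n",
+ "Note that if you cannot guarantee your data is completely clean, you can use `try` and `except` to catch exceptions and report where the faulty sample is. Alternatively, you can define a policy that lets training continue normally even when reading a sample fails. "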
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "### 2.1 Data Overview\n",
+ "\n",
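+ "Click [here to download the dataset for this section](https://aistudio.baidu.com/aistudio/datasetdetail/57285). Once the download finishes, extract it with the `!unzip OCR_Dataset.zip -d data/` command or any archive tool you prefer. When the data is ready, set `DATA_PATH` in the Training Preparation section of this tutorial to the path of the extracted dataset."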
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "# Download the dataset \n",
+ "!wget -O OCR_Dataset.zip https://bj.bcebos.com/v1/ai-studio-online/c91f50ef72de43b090298a38281e9c59a2d741eadd334f1cba7c710c5496e342?responseContentDisposition=attachment%3B%20filename%3DOCR_Dataset.zip&authorization=bce-auth-v1%2F0ef6765c1e494918bc0d4c3ca3e5c6d1%2F2020-10-27T09%3A50%3A21Z%2F-1%2F%2Fddc4aebed803af6c57dac46abba42d207961b78e7bc81744e8388395979b66fa"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "# Extract the dataset\n",
+ "!unzip OCR_Dataset.zip -d data/"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "\n",
+ "import PIL.Image as Image\n",
+ "import numpy as np\n",
+ "from paddle.io import Dataset\n",
+ "\n",
+ "# Image configuration: channels, height, width\n",
+ "IMAGE_SHAPE_C = 3\n",
+ "IMAGE_SHAPE_H = 30\n",
+ "IMAGE_SHAPE_W = 70\n",
+ "# Maximum label length in the dataset: every image contains exactly 4 characters, so 4 is used here\n",
+ "LABEL_MAX_LEN = 4\n",
+ "\n",
+ "\n",
+ "class Reader(Dataset):\n",
+ " def __init__(self, data_path: str, is_val: bool = False):\n",
+ " \"\"\"\n",
+ " Data Reader\n",
+ " :param data_path: path to the dataset\n",
+ " :param is_val: whether this is the validation set\n",
+ " \"\"\"\n",
+ " super().__init__()\n",
+ " self.data_path = data_path\n",
+ " # Read the label dictionary\n",
+ " with open(os.path.join(self.data_path, \"label_dict.txt\"), \"r\", encoding=\"utf-8\") as f:\n",
+ " self.info = eval(f.read())\n",
+ " # Get the list of file names\n",
+ " self.img_paths = [img_name for img_name in self.info]\n",
+ " # Use the last 1024 images as the validation set; when is_val is True, img_paths switches to those 1024 images\n",
+ " self.img_paths = self.img_paths[-1024:] if is_val else self.img_paths[:-1024]\n",
+ "\n",
+ " def __getitem__(self, index):\n",
+ " # Get the file name and path of the index-th image\n",
+ " file_name = self.img_paths[index]\n",
+ " file_path = os.path.join(self.data_path, file_name)\n",
+ " # Catch exceptions and stop training if an image cannot be read\n",
+ " try:\n",
+ " # Read the image data with Pillow\n",
+ " img = Image.open(file_path)\n",
+ " # Convert to a NumPy array and divide by 255 to normalize\n",
+ " img = np.array(img, dtype=\"float32\").reshape((IMAGE_SHAPE_C, IMAGE_SHAPE_H, IMAGE_SHAPE_W)) / 255\n",
+ " except Exception as e:\n",
+ " raise Exception(file_name + \"\\tfailed to open; please check that the path is correct and that the image file is intact. Error message:\\n\" + str(e))\n",
+ " # Read the label string for this image and process it\n",
+ " label = self.info[file_name]\n",
+ " label = list(label)\n",
+ " # Convert the label to a NumPy array\n",
+ " label = np.array(label, dtype=\"int32\")\n",
+ "\n",
+ " return img, label\n",
+ "\n",
+ " def __len__(self):\n",
+ " # Return the number of images per epoch\n",
+ " return len(self.img_paths)"
+ ]
+ },
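As a consistency check on the split, the arithmetic below shows how `Reader` divides the data, and why the training log later reports 526 training steps and 64 evaluation steps per epoch with `BATCH_SIZE = 16` and `drop_last=True`. It is a sketch that assumes `label_dict.txt` lists all 9453 images mentioned in the abstract.

```python
TOTAL_IMAGES = 9453  # assumed size of label_dict.txt (the OCR part of CaptchaDataset)
VAL_IMAGES = 1024    # Reader keeps the last 1024 file names for validation
BATCH_SIZE = 16      # same value as in the Training Preparation section

train_images = TOTAL_IMAGES - VAL_IMAGES
# drop_last=True discards the final partial batch of the training set
train_steps = train_images // BATCH_SIZE
# 1024 is an exact multiple of 16, so no validation sample is left over
val_steps = VAL_IMAGES // BATCH_SIZE

print(train_images, train_steps, val_steps)
```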
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "## 3. Model Configuration"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
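+ "### 3.1 Defining the Model Structure and Input\n",
+ "\n",
+ "The model is a simple CRNN-CTC architecture: an input image in CHW format passes through CNN -> Flatten -> Linear -> RNN -> Linear, and the output gives the character probabilities for each position in the image. Because a CTC decoder cannot align correctly when the number of characters varies between images or when adjacent characters repeat, an extra class representing a separator (the CTC blank) is added to address this.\n",
+ "\n",
+ "CTC paper: [Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks](http://people.idsia.ch/~santiago/papers/icml2006.pdf) \n",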
+ "\n",
+ "\n",
+ "\n",
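+ "As for the network, the dataset used here is fairly simple and its images are small, so a deep network is not appropriate. When building a model for larger images, consider a deeper network or an attention mechanism. You can also first locate the text with object detection and then build the OCR model on the detected regions.\n",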
+ "\n",
+ "\n",
+ "*Figure: PaddleOCR sample output*"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "import paddle\n",
+ "\n",
+ "# Number of classes: the dataset contains the 10 digits 0-9 plus a separator, so this is an 11-class task\n",
+ "CLASSIFY_NUM = 11\n",
+ "\n",
+ "# Define the input layer; using -1 for dimension 0 of shape allows any batch size at prediction time\n",
+ "input_define = paddle.static.InputSpec(shape=[-1, IMAGE_SHAPE_C, IMAGE_SHAPE_H, IMAGE_SHAPE_W],\n",
+ " dtype=\"float32\",\n",
+ " name=\"img\")\n",
+ "\n",
+ "# Define the network structure\n",
+ "class Net(paddle.nn.Layer):\n",
+ " def __init__(self, is_infer: bool = False):\n",
+ " super().__init__()\n",
+ " self.is_infer = is_infer\n",
+ "\n",
+ " # A 3x3 convolution + BatchNorm\n",
+ " self.conv1 = paddle.nn.Conv2D(in_channels=IMAGE_SHAPE_C,\n",
+ " out_channels=32,\n",
+ " kernel_size=3)\n",
+ " self.bn1 = paddle.nn.BatchNorm2D(32)\n",
+ " # A 3x3 convolution with stride 2 for downsampling + BatchNorm\n",
+ " self.conv2 = paddle.nn.Conv2D(in_channels=32,\n",
+ " out_channels=64,\n",
+ " kernel_size=3,\n",
+ " stride=2)\n",
+ " self.bn2 = paddle.nn.BatchNorm2D(64)\n",
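+ " # A 1x1 convolution to compress channels; setting the output channels to a fixed value slightly larger than LABEL_MAX_LEN gives better results, though LABEL_MAX_LEN also works\n",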
+ " self.conv3 = paddle.nn.Conv2D(in_channels=64,\n",
+ " out_channels=LABEL_MAX_LEN + 4,\n",
+ " kernel_size=1)\n",
+ " # Fully connected layer to compress and extract features (optional)\n",
+ " self.linear = paddle.nn.Linear(in_features=429,\n",
+ " out_features=128)\n",
+ " # RNN layer to better extract sequence features; this bidirectional LSTM outputs 2 x hidden_size, and other RNNs such as GRU can be tried\n",
+ " self.lstm = paddle.nn.LSTM(input_size=128,\n",
+ " hidden_size=64,\n",
+ " direction=\"bidirectional\")\n",
+ " # Output layer, with output size equal to the number of classes\n",
+ " self.linear2 = paddle.nn.Linear(in_features=64 * 2,\n",
+ " out_features=CLASSIFY_NUM)\n",
+ "\n",
+ " def forward(self, ipt):\n",
+ " # Convolution + ReLU + BN\n",
+ " x = self.conv1(ipt)\n",
+ " x = paddle.nn.functional.relu(x)\n",
+ " x = self.bn1(x)\n",
+ " # Convolution + ReLU + BN\n",
+ " x = self.conv2(x)\n",
+ " x = paddle.nn.functional.relu(x)\n",
+ " x = self.bn2(x)\n",
+ " # Convolution + ReLU\n",
+ " x = self.conv3(x)\n",
+ " x = paddle.nn.functional.relu(x)\n",
+ " # Convert the 3-D features into 2-D features (per sample); reshape could be used here instead\n",
+ " x = paddle.tensor.flatten(x, 2)\n",
+ " # Fully connected + ReLU\n",
+ " x = self.linear(x)\n",
+ " x = paddle.nn.functional.relu(x)\n",
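+ " # Bidirectional LSTM: [0] is the output sequence (forward and backward outputs concatenated); [1] is the final states (h, c). See 'LSTM' in the official documentation for details\n",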
+ " x = self.lstm(x)[0]\n",
+ " # 输出层 - Shape = (Batch Size, Max label len, Signal) \n",
+ " x = self.linear2(x)\n",
+ "\n",
+ " # ctc_loss applies softmax internally when computing the loss, so prediction mode needs an explicit softmax to obtain label probabilities\n",
+ " if self.is_infer:\n",
+ " # 输出层 - Shape = (Batch Size, Max label len, Prob) \n",
+ " x = paddle.nn.functional.softmax(x)\n",
+ " # Convert to labels\n",
+ " x = paddle.argmax(x, axis=-1)\n",
+ " return x"
+ ]
+ },
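The hard-coded `in_features=429` of `self.linear` and the sequence length `LABEL_MAX_LEN + 4` later passed to the CTC loss as `input_lengths` both follow from the feature-map size. The sketch below applies the standard convolution output-size formula to the 30x70 input, assuming no padding and no dilation (the `Conv2D` defaults, since the network above does not set them).

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of one spatial dimension after a convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


IMAGE_SHAPE_H, IMAGE_SHAPE_W = 30, 70
LABEL_MAX_LEN = 4

# conv1: 3x3 kernel, stride 1
h, w = conv_out(IMAGE_SHAPE_H, 3), conv_out(IMAGE_SHAPE_W, 3)
# conv2: 3x3 kernel, stride 2 (downsampling)
h, w = conv_out(h, 3, stride=2), conv_out(w, 3, stride=2)
# conv3 is 1x1 with stride 1, so the spatial size is unchanged;
# flatten(x, 2) then merges H and W into a single feature axis
flattened = h * w
# conv3 has LABEL_MAX_LEN + 4 output channels, which become the sequence (time) axis
seq_len = LABEL_MAX_LEN + 4

print(h, w, flattened, seq_len)
```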
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "## 4. Training Preparation"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "### 4.1 Defining the Label Input and Hyperparameters\n",
+ "Supervised training needs a label definition; prediction does not."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "# Dataset path\n",
+ "DATA_PATH = \"./data/OCR_Dataset\"\n",
+ "# Number of training epochs\n",
+ "EPOCH = 10\n",
+ "# Batch size\n",
+ "BATCH_SIZE = 16\n",
+ "\n",
+ "label_define = paddle.static.InputSpec(shape=[-1, LABEL_MAX_LEN],\n",
+ " dtype=\"int32\",\n",
+ " name=\"label\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
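+ "### 4.2 Defining the CTC Loss\n",
+ "\n",
+ "Having seen how the CTC decoder behaves, we want the model to produce output of this form during training, so we define a CTC Loss to compute the model's loss. Conveniently, PaddlePaddle has many built-in losses, so there is no need to implement it by hand.\n",
+ " \n",
+ "Documentation: [CTCLoss](https://www.paddlepaddle.org.cn/documentation/docs/zh/2.0-beta/api/paddle/nn/functional/loss/ctc_loss_cn.html#ctc-loss)"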
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "class CTCLoss(paddle.nn.Layer):\n",
+ " def __init__(self):\n",
+ " \"\"\"\n",
+ " Define CTCLoss\n",
+ " \"\"\"\n",
+ " super().__init__()\n",
+ "\n",
+ " def forward(self, ipt, label):\n",
+ " input_lengths = paddle.full(shape=[BATCH_SIZE],fill_value=LABEL_MAX_LEN + 4,dtype= \"int64\")\n",
+ " label_lengths = paddle.full(shape=[BATCH_SIZE],fill_value=LABEL_MAX_LEN,dtype= \"int64\")\n",
+ " # Transpose to the dimension order required by ctc_loss: (batch, time, classes) -> (time, batch, classes)\n",
+ " ipt = paddle.tensor.transpose(ipt, [1, 0, 2])\n",
+ " # Compute the loss\n",
+ " loss = paddle.nn.functional.ctc_loss(ipt, label, input_lengths, label_lengths, blank=10)\n",
+ " return loss"
+ ]
+ },
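`blank=10` tells the loss that class 10 (the separator) is the CTC blank. At prediction time, the per-position labels returned by `argmax` in `Net.forward` still have to be decoded into the final string. A common approach is greedy (best-path) decoding: collapse consecutive repeats, then drop blanks. The following is a minimal pure-Python sketch of that technique; `ctc_greedy_decode` is a hypothetical helper, not a Paddle API.

```python
from itertools import groupby
from typing import List

BLANK = 10  # matches blank=10 in the CTCLoss above


def ctc_greedy_decode(path: List[int], blank: int = BLANK) -> List[int]:
    """Greedy CTC decoding: collapse consecutive repeats, then remove blanks."""
    # groupby yields one key per run of identical consecutive tokens
    collapsed = [token for token, _ in groupby(path)]
    return [token for token in collapsed if token != blank]


# A hypothetical 8-step argmax path for one image
print(ctc_greedy_decode([1, 1, 10, 2, 2, 10, 2, 3]))
```

Because repeats are collapsed before blanks are removed, a genuinely repeated digit such as `22` can only be recovered when the model emits a blank between the two 2s, which is why the separator class is needed.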
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "### 4.3 Instantiating the Model and Configuring the Optimization Strategy"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "# Instantiate the model\n",
+ "model = paddle.Model(Net(), inputs=input_define, labels=label_define)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "# Define the optimizer\n",
+ "optimizer = paddle.optimizer.Adam(learning_rate=0.0001, parameters=model.parameters())\n",
+ "\n",
+ "# Configure the model's runtime environment with this optimization strategy\n",
+ "model.prepare(optimizer=optimizer,\n",
+ " loss=CTCLoss())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "## 5. Training\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "The loss value printed in the log is the current step, and the metric is the average value of previous steps.\n",
+ "Epoch 1/10\n",
+ "step 526/526 [==============================] - loss: 0.2182 - 13ms/step \n",
+ "save checkpoint at /home/aistudio/output/0\n",
+ "Eval begin...\n",
+ "step 64/64 [==============================] - loss: 0.1953 - 6ms/step \n",
+ "Eval samples: 1024\n",
+ "Epoch 2/10\n",
+ "step 526/526 [==============================] - loss: 0.1394 - 10ms/step \n",
+ "save checkpoint at /home/aistudio/output/1\n",
+ "Eval begin...\n",
+ "step 64/64 [==============================] - loss: 0.0416 - 5ms/step \n",
+ "Eval samples: 1024\n",
+ "Epoch 3/10\n",
+ "step 526/526 [==============================] - loss: 0.0296 - 9ms/step \n",
+ "save checkpoint at /home/aistudio/output/2\n",
+ "Eval begin...\n",
+ "step 64/64 [==============================] - loss: 0.0327 - 6ms/step \n",
+ "Eval samples: 1024\n",
+ "Epoch 4/10\n",
+ "step 526/526 [==============================] - loss: 0.0150 - 9ms/step \n",
+ "save checkpoint at /home/aistudio/output/3\n",
+ "Eval begin...\n",
+ "step 64/64 [==============================] - loss: 0.0228 - 5ms/step \n",
+ "Eval samples: 1024\n",
+ "Epoch 5/10\n",
+ "step 526/526 [==============================] - loss: 0.0102 - 9ms/step \n",
+ "save checkpoint at /home/aistudio/output/4\n",
+ "Eval begin...\n",
+ "step 64/64 [==============================] - loss: 0.0161 - 6ms/step \n",
+ "Eval samples: 1024\n",
+ "Epoch 6/10\n",
+ "step 526/526 [==============================] - loss: 0.1300 - 10ms/step \n",
+ "save checkpoint at /home/aistudio/output/5\n",
+ "Eval begin...\n",
+ "step 64/64 [==============================] - loss: 0.0164 - 5ms/step \n",
+ "Eval samples: 1024\n",
+ "Epoch 7/10\n",
+ "step 526/526 [==============================] - loss: 0.0199 - 9ms/step \n",
+ "save checkpoint at /home/aistudio/output/6\n",
+ "Eval begin...\n",
+ "step 64/64 [==============================] - loss: 0.0121 - 5ms/step \n",
+ "Eval samples: 1024\n",
+ "Epoch 8/10\n",
+ "step 526/526 [==============================] - loss: 0.0060 - 9ms/step \n",
+ "save checkpoint at /home/aistudio/output/7\n",
+ "Eval begin...\n",
+ "step 64/64 [==============================] - loss: 0.0133 - 5ms/step \n",
+ "Eval samples: 1024\n",
+ "Epoch 9/10\n",
+ "step 526/526 [==============================] - loss: 0.0084 - 11ms/step \n",
+ "save checkpoint at /home/aistudio/output/8\n",
+ "Eval begin...\n",
+ "step 64/64 [==============================] - loss: 0.0098 - 5ms/step \n",
+ "Eval samples: 1024\n",
+ "Epoch 10/10\n",
+ "step 526/526 [==============================] - loss: 0.0100 - 9ms/step \n",
+ "save checkpoint at /home/aistudio/output/9\n",
+ "Eval begin...\n",
+ "step 64/64 [==============================] - loss: 0.0109 - 10ms/step \n",
+ "Eval samples: 1024\n",
+ "save checkpoint at /home/aistudio/output/final\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Run training\n",
+ "model.fit(train_data=Reader(DATA_PATH),\n",
+ " eval_data=Reader(DATA_PATH, is_val=True),\n",
+ " batch_size=BATCH_SIZE,\n",
+ " epochs=EPOCH,\n",
+ " save_dir=\"output/\",\n",
+ " save_freq=1,\n",
+ " verbose=1,\n",
+ " drop_last=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "## 6. Preparing for Prediction"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "### 6.1 Defining a Prediction Reader, Like the Training Reader"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "# Similar to the training Reader, but without labels\n",
+ "class InferReader(Dataset):\n",
+ " def __init__(self, dir_path=None, img_path=None):\n",
+ " \"\"\"\n",
+ " Data Reader (prediction)\n",
+ " :param dir_path: folder to predict on (use one of the two)\n",
+ " :param img_path: single image to predict on (use one of the two)\n",
+ " \"\"\"\n",
+ " super().__init__()\n",
+ " if dir_path:\n",
+ " # Get the paths of all images in the folder\n",
+ " self.img_names = [i for i in os.listdir(dir_path) if os.path.splitext(i)[1] == \".jpg\"]\n",
+ " self.img_paths = [os.path.join(dir_path, i) for i in self.img_names]\n",
+ " elif img_path:\n",
+ " self.img_names = [os.path.split(img_path)[1]]\n",
+ " self.img_paths = [img_path]\n",
+ " else:\n",
+ " raise Exception(\"请指定需要预测的文件夹或对应图片路径\")\n",
+ "\n",
+ " def get_names(self):\n",
+ " \"\"\"\n",
+ " 获取预测文件名顺序 \n",
+ " \"\"\"\n",
+ " return self.img_names\n",
+ "\n",
+ " def __getitem__(self, index):\n",
+ " # 获取图像路径\n",
+ " file_path = self.img_paths[index]\n",
+ " # 使用Pillow来读取图像数据并转成Numpy格式\n",
+ " img = Image.open(file_path)\n",
+ " img = np.array(img, dtype=\"float32\").reshape((IMAGE_SHAPE_C, IMAGE_SHAPE_H, IMAGE_SHAPE_W)) / 255\n",
+ " return img\n",
+ "\n",
+ " def __len__(self):\n",
+ " return len(self.img_paths)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "### 6.2 参数设置"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "# 待预测目录 - 可在测试数据集中挑出3张图像放在该目录中进行推理\n",
+ "INFER_DATA_PATH = \"./sample_img\"\n",
+ "# 训练后存档点路径 - final 代表最终训练所得模型\n",
+ "CHECKPOINT_PATH = \"./output/final.pdparams\"\n",
+ "# 每批次处理数量\n",
+ "BATCH_SIZE = 32"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "### 6.3 展示待预测数据"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "import matplotlib.pyplot as plt\n",
+ "plt.figure(figsize=(10, 10))\n",
+ "# 逐张展示待预测目录中的图片\n",
+ "\n",
+ "for img_id, img_name in enumerate(os.listdir(INFER_DATA_PATH)):\n",
+ " plt.subplot(1, 3, img_id + 1)\n",
+ " plt.xticks([])\n",
+ " plt.yticks([])\n",
+ " im = Image.open(os.path.join(INFER_DATA_PATH, img_name))\n",
+ " plt.imshow(im, cmap=plt.cm.binary)\n",
+ " plt.xlabel(\"Img name: \" + img_name)\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": [
+ "## 七、开始预测\n",
+ "> 飞桨2.2 CTC Decoder 相关API正在迁移中,本节暂时使用简易版解码器。"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Predict begin...\n",
+ "step 1/1 [==============================] - 7ms/step\n",
+ "Predict samples: 3\n",
+ "文件名:9451.jpg,推理结果为:[3, 4, 6, 3]\n",
+ "文件名:9450.jpg,推理结果为:[8, 2, 0, 5]\n",
+ "文件名:9452.jpg,推理结果为:[0, 3, 0, 0]\n"
+ ]
+ }
+ ],
+ "source": [
+ "# 编写简易版解码器\n",
+ "def ctc_decode(text, blank=10):\n",
+ " \"\"\"\n",
+ " 简易CTC解码器\n",
+ " :param text: 待解码数据\n",
+ " :param blank: 分隔符索引值\n",
+ " :return: 解码后数据\n",
+ " \"\"\"\n",
+ " result = []\n",
+ " cache_idx = -1\n",
+ " for char in text:\n",
+ " if char != blank and char != cache_idx:\n",
+ " result.append(char)\n",
+ " cache_idx = char\n",
+ " return result\n",
+ "\n",
+ "\n",
+ "# 实例化推理模型\n",
+ "model = paddle.Model(Net(is_infer=True), inputs=input_define)\n",
+ "# 加载训练好的参数模型\n",
+ "model.load(CHECKPOINT_PATH)\n",
+ "# 设置运行环境\n",
+ "model.prepare()\n",
+ "\n",
+ "# 加载预测Reader\n",
+ "infer_reader = InferReader(INFER_DATA_PATH)\n",
+ "img_names = infer_reader.get_names()\n",
+ "results = model.predict(infer_reader, batch_size=BATCH_SIZE)\n",
+ "index = 0\n",
+ "for text_batch in results[0]:\n",
+ " for prob in text_batch:\n",
+ " out = ctc_decode(prob, blank=10)\n",
+ " print(f\"文件名:{img_names[index]},推理结果为:{out}\")\n",
+ " index += 1"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "py35-paddle1.2.0"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.4"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
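The prediction cell above decodes each image with a hand-written `ctc_decode` helper (blank index 10). The notebook notes that the Paddle 2.2 CTC decoder APIs were still being migrated at the time. Below is a minimal standalone sketch of that greedy (best-path) decoding. It assumes the model has already turned its softmax output into one class id per frame with `argmax`, as `Net(is_infer=True)` does. The function name `greedy_ctc_decode` and the frame sequences are hypothetical illustrations, not output captured from the model.

```python
# Greedy (best-path) CTC decoding, sketched after the notebook's ctc_decode helper.
# Assumptions: blank index 10 (digits 0-9 plus blank, CLASSIFY_NUM = 11) and
# per-frame class ids already produced by argmax over the softmax output.
from typing import List, Sequence

BLANK = 10


def greedy_ctc_decode(frame_ids: Sequence[int], blank: int = BLANK) -> List[int]:
    """Collapse consecutive repeats, then drop blanks."""
    result: List[int] = []
    prev = -1  # sentinel: no previous frame yet
    for idx in frame_ids:
        # Emit only when the frame is not blank and differs from the previous frame.
        # `prev` is also updated on blank frames, so a blank between two equal ids
        # lets the second one through (e.g. 1, blank, 1 -> [1, 1]).
        if idx != blank and idx != prev:
            result.append(idx)
        prev = idx
    return result


# Hypothetical 8-frame outputs (T = LABEL_MAX_LEN + 4 = 8 in the notebook's Net)
examples = {
    "repeats collapse": [3, 3, 10, 4, 6, 6, 10, 3],
    "blank separates a doubled digit": [0, 10, 3, 10, 0, 10, 0, 10],
}
for name, frames in examples.items():
    print(name, greedy_ctc_decode(frames))
```

The order of the two steps matters. If blanks were removed before repeats were collapsed, a genuine double digit such as `0, blank, 0` would merge into a single `0`.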
diff --git a/docs/practices/cv/image_ocr/image_ocr.ipynb b/docs/practices/cv/image_ocr/image_ocr.ipynb
deleted file mode 100644
index 95f6699855b..00000000000
--- a/docs/practices/cv/image_ocr/image_ocr.ipynb
+++ /dev/null
@@ -1,739 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {
- "collapsed": false
- },
- "source": [
- "# 通过OCR实现验证码识别\n",
- "\n",
- "**作者:** [GT_老张](https://github.com/GT-ZhangAcer) \n",
- "\n",
- "**时间:** 2021.11\n",
- "\n",
- "**摘要:** 本篇将介绍如何通过飞桨实现简单的CRNN+CTC自定义数据集OCR识别模型,数据集采用[CaptchaDataset](https://github.com/GT-ZhangAcer/CaptchaDataset)中OCR部分的9453张图像,其中前8453张图像在本案例中作为训练集,后1000张则作为测试集。 \n",
- "在更复杂的场景中推荐使用[PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)产出工业级模型,模型轻量且精度大幅提升。 \n",
- "同样也可以在[PaddleHub](https://www.paddlepaddle.org.cn/hubdetail?name=chinese_ocr_db_crnn_mobile&en_category=TextRecognition)中快速使用PaddleOCR。"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "collapsed": false
- },
- "source": [
- "## 一、环境配置\n",
- "\n",
- "本教程基于Paddle 2.2.0 编写,如果你的环境不是本版本,请先参考官网[安装](https://www.paddlepaddle.org.cn/install/quick) PaddlePaddle 2.2 。"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "2.2.0\n"
- ]
- }
- ],
- "source": [
- "import paddle\n",
- "print(paddle.__version__)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "collapsed": false
- },
- "source": [
- "## 二、自定义数据集读取器\n",
- "\n",
- "常见的开发任务中,我们并不一定会拿到标准的数据格式,好在我们可以通过自定义Reader的形式来随心所欲读取自己想要数据。 \n",
- "\n",
- "设计合理的Reader往往可以带来更好的性能,我们可以将读取标签文件列表、制作图像文件列表等必要操作在`__init__`特殊方法中实现。这样就可以在实例化`Reader`时装入内存,避免使用时频繁读取导致增加额外开销。同样我们可以在`__getitem__`特殊方法中实现如图像增强、归一化等个性操作,完成数据读取后即可释放该部分内存。 \n",
- "需要我们注意的是,如果不能保证自己数据十分纯净,可以通过`try`和`expect`来捕获异常并指出该数据的位置。当然也可以制定一个策略,使其在发生数据读取异常后依旧可以正常进行训练。 "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "collapsed": false
- },
- "source": [
- "### 2.1 数据展示\n",
- "\n",
- "\n",
- "点此[快速获取本节数据集](https://aistudio.baidu.com/aistudio/datasetdetail/57285),待数据集下载完毕后可使用`!unzip OCR_Dataset.zip -d data/`命令或熟悉的解压软件进行解压,待数据准备工作完成后修改本文“训练准备”中的`DATA_PATH = 解压后数据集路径`。"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "# 下载数据集 \n",
- "!wget -O OCR_Dataset.zip https://bj.bcebos.com/v1/ai-studio-online/c91f50ef72de43b090298a38281e9c59a2d741eadd334f1cba7c710c5496e342?responseContentDisposition=attachment%3B%20filename%3DOCR_Dataset.zip&authorization=bce-auth-v1%2F0ef6765c1e494918bc0d4c3ca3e5c6d1%2F2020-10-27T09%3A50%3A21Z%2F-1%2F%2Fddc4aebed803af6c57dac46abba42d207961b78e7bc81744e8388395979b66fa"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "# 解压数据集\n",
- "!unzip OCR_Dataset.zip -d data/"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "import os\n",
- "\n",
- "import PIL.Image as Image\n",
- "import numpy as np\n",
- "from paddle.io import Dataset\n",
- "\n",
- "# 图片信息配置 - 通道数、高度、宽度\n",
- "IMAGE_SHAPE_C = 3\n",
- "IMAGE_SHAPE_H = 30\n",
- "IMAGE_SHAPE_W = 70\n",
- "# 数据集图片中标签长度最大值设置 - 因图片中均为4个字符,故该处填写为4即可\n",
- "LABEL_MAX_LEN = 4\n",
- "\n",
- "\n",
- "class Reader(Dataset):\n",
- " def __init__(self, data_path: str, is_val: bool = False):\n",
- " \"\"\"\n",
- " 数据读取Reader\n",
- " :param data_path: Dataset路径\n",
- " :param is_val: 是否为验证集\n",
- " \"\"\"\n",
- " super().__init__()\n",
- " self.data_path = data_path\n",
- " # 读取Label字典\n",
- " with open(os.path.join(self.data_path, \"label_dict.txt\"), \"r\", encoding=\"utf-8\") as f:\n",
- " self.info = eval(f.read())\n",
- " # 获取文件名列表\n",
- " self.img_paths = [img_name for img_name in self.info]\n",
- " # 将数据集后1024张图片设置为验证集,当is_val为真时img_path切换为后1024张\n",
- " self.img_paths = self.img_paths[-1024:] if is_val else self.img_paths[:-1024]\n",
- "\n",
- " def __getitem__(self, index):\n",
- " # 获取第index个文件的文件名以及其所在路径\n",
- " file_name = self.img_paths[index]\n",
- " file_path = os.path.join(self.data_path, file_name)\n",
- " # 捕获异常 - 在发生异常时终止训练\n",
- " try:\n",
- " # 使用Pillow来读取图像数据\n",
- " img = Image.open(file_path)\n",
- " # 转为Numpy的array格式并整体除以255进行归一化\n",
- " img = np.array(img, dtype=\"float32\").reshape((IMAGE_SHAPE_C, IMAGE_SHAPE_H, IMAGE_SHAPE_W)) / 255\n",
- " except Exception as e:\n",
- " raise Exception(file_name + \"\\t文件打开失败,请检查路径是否准确以及图像文件完整性,报错信息如下:\\n\" + str(e))\n",
- " # 读取该图像文件对应的Label字符串,并进行处理\n",
- " label = self.info[file_name]\n",
- " label = list(label)\n",
- " # 将label转化为Numpy的array格式\n",
- " label = np.array(label, dtype=\"int32\")\n",
- "\n",
- " return img, label\n",
- "\n",
- " def __len__(self):\n",
- " # 返回每个Epoch中图片数量\n",
- " return len(self.img_paths)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "collapsed": false
- },
- "source": [
- "## 三、模型配置"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "collapsed": false
- },
- "source": [
- "### 3.1 定义模型结构以及模型输入\n",
- "\n",
- "模型方面使用的简单的CRNN-CTC结构,输入形为CHW的图像在经过CNN->Flatten->Linear->RNN->Linear后输出图像中每个位置所对应的字符概率。考虑到CTC解码器在面对图像中元素数量不一、相邻元素重复时会存在无法正确对齐等情况,故额外添加一个类别代表“分隔符”进行改善。\n",
- "\n",
- "CTC相关论文:[Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neu](http://people.idsia.ch/~santiago/papers/icml2006.pdf) \n",
- "\n",
- "\n",
- "\n",
- "网络部分,因本篇采用数据集较为简单且图像尺寸较小并不适合较深层次网络。若在对尺寸较大的图像进行模型构建,可以考虑使用更深层次网络/注意力机制来完成。当然也可以通过目标检测形式先检出文本位置,然后进行OCR部分模型构建。\n",
- "\n",
- "\n",
- "\n",
- "PaddleOCR效果图\n",
- ""
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "import paddle\n",
- "\n",
- "# 分类数量设置 - 因数据集中共包含0~9共10种数字+分隔符,所以是11分类任务\n",
- "CLASSIFY_NUM = 11\n",
- "\n",
- "# 定义输入层,shape中第0维使用-1则可以在预测时自由调节batch size\n",
- "input_define = paddle.static.InputSpec(shape=[-1, IMAGE_SHAPE_C, IMAGE_SHAPE_H, IMAGE_SHAPE_W],\n",
- " dtype=\"float32\",\n",
- " name=\"img\")\n",
- "\n",
- "# 定义网络结构\n",
- "class Net(paddle.nn.Layer):\n",
- " def __init__(self, is_infer: bool = False):\n",
- " super().__init__()\n",
- " self.is_infer = is_infer\n",
- "\n",
- " # 定义一层3x3卷积+BatchNorm\n",
- " self.conv1 = paddle.nn.Conv2D(in_channels=IMAGE_SHAPE_C,\n",
- " out_channels=32,\n",
- " kernel_size=3)\n",
- " self.bn1 = paddle.nn.BatchNorm2D(32)\n",
- " # 定义一层步长为2的3x3卷积进行下采样+BatchNorm\n",
- " self.conv2 = paddle.nn.Conv2D(in_channels=32,\n",
- " out_channels=64,\n",
- " kernel_size=3,\n",
- " stride=2)\n",
- " self.bn2 = paddle.nn.BatchNorm2D(64)\n",
- " # 定义一层1x1卷积压缩通道数,输出通道数设置为比LABEL_MAX_LEN稍大的定值可获取更优效果,当然也可设置为LABEL_MAX_LEN\n",
- " self.conv3 = paddle.nn.Conv2D(in_channels=64,\n",
- " out_channels=LABEL_MAX_LEN + 4,\n",
- " kernel_size=1)\n",
- " # 定义全连接层,压缩并提取特征(可选)\n",
- " self.linear = paddle.nn.Linear(in_features=429,\n",
- " out_features=128)\n",
- " # 定义RNN层来更好提取序列特征,此处为双向LSTM输出为2 x hidden_size,可尝试换成GRU等RNN结构\n",
- " self.lstm = paddle.nn.LSTM(input_size=128,\n",
- " hidden_size=64,\n",
- " direction=\"bidirectional\")\n",
- " # 定义输出层,输出大小为分类数\n",
- " self.linear2 = paddle.nn.Linear(in_features=64 * 2,\n",
- " out_features=CLASSIFY_NUM)\n",
- "\n",
- " def forward(self, ipt):\n",
- " # 卷积 + ReLU + BN\n",
- " x = self.conv1(ipt)\n",
- " x = paddle.nn.functional.relu(x)\n",
- " x = self.bn1(x)\n",
- " # 卷积 + ReLU + BN\n",
- " x = self.conv2(x)\n",
- " x = paddle.nn.functional.relu(x)\n",
- " x = self.bn2(x)\n",
- " # 卷积 + ReLU\n",
- " x = self.conv3(x)\n",
- " x = paddle.nn.functional.relu(x)\n",
- " # 将3维特征转换为2维特征 - 此处可以使用reshape代替\n",
- " x = paddle.tensor.flatten(x, 2)\n",
- " # 全连接 + ReLU\n",
- " x = self.linear(x)\n",
- " x = paddle.nn.functional.relu(x)\n",
- " # 双向LSTM - [0]代表取双向结果,[1][0]代表forward结果,[1][1]代表backward结果,详细说明可在官方文档中搜索'LSTM'\n",
- " x = self.lstm(x)[0]\n",
- " # 输出层 - Shape = (Batch Size, Max label len, Signal) \n",
- " x = self.linear2(x)\n",
- "\n",
- " # 在计算损失时ctc-loss会自动进行softmax,所以在预测模式中需额外做softmax获取标签概率\n",
- " if self.is_infer:\n",
- " # 输出层 - Shape = (Batch Size, Max label len, Prob) \n",
- " x = paddle.nn.functional.softmax(x)\n",
- " # 转换为标签\n",
- " x = paddle.argmax(x, axis=-1)\n",
- " return x"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "collapsed": false
- },
- "source": [
- "## 四、训练准备"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "collapsed": false
- },
- "source": [
- "### 4.1 定义label输入以及超参数\n",
- "监督训练需要定义label,预测则不需要该步骤。"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "# 数据集路径设置\n",
- "DATA_PATH = \"./data/OCR_Dataset\"\n",
- "# 训练轮数\n",
- "EPOCH = 10\n",
- "# 每批次数据大小\n",
- "BATCH_SIZE = 16\n",
- "\n",
- "label_define = paddle.static.InputSpec(shape=[-1, LABEL_MAX_LEN],\n",
- " dtype=\"int32\",\n",
- " name=\"label\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "collapsed": false
- },
- "source": [
- "### 4.2 定义CTC Loss\n",
- "\n",
- "了解CTC解码器效果后,我们需要在训练中让模型尽可能接近这种类型输出形式,那么我们需要定义一个CTC Loss来计算模型损失。不必担心,在飞桨框架中内置了多种Loss,无需手动复现即可完成损失计算。\n",
- " \n",
- "使用文档:[CTCLoss](https://www.paddlepaddle.org.cn/documentation/docs/zh/2.0-beta/api/paddle/nn/functional/loss/ctc_loss_cn.html#ctc-loss)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "class CTCLoss(paddle.nn.Layer):\n",
- " def __init__(self):\n",
- " \"\"\"\n",
- " 定义CTCLoss\n",
- " \"\"\"\n",
- " super().__init__()\n",
- "\n",
- " def forward(self, ipt, label):\n",
- " input_lengths = paddle.full(shape=[BATCH_SIZE],fill_value=LABEL_MAX_LEN + 4,dtype= \"int64\")\n",
- " label_lengths = paddle.full(shape=[BATCH_SIZE],fill_value=LABEL_MAX_LEN,dtype= \"int64\")\n",
- " # 按文档要求进行转换dim顺序\n",
- " ipt = paddle.tensor.transpose(ipt, [1, 0, 2])\n",
- " # 计算loss\n",
- " loss = paddle.nn.functional.ctc_loss(ipt, label, input_lengths, label_lengths, blank=10)\n",
- " return loss"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "collapsed": false
- },
- "source": [
- "### 4.3 实例化模型并配置优化策略"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "# 实例化模型\n",
- "model = paddle.Model(Net(), inputs=input_define, labels=label_define)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "# 定义优化器\n",
- "optimizer = paddle.optimizer.Adam(learning_rate=0.0001, parameters=model.parameters())\n",
- "\n",
- "# 为模型配置运行环境并设置该优化策略\n",
- "model.prepare(optimizer=optimizer,\n",
- " loss=CTCLoss())"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "collapsed": false
- },
- "source": [
- "## 五、开始训练\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {
- "collapsed": false
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "The loss value printed in the log is the current step, and the metric is the average value of previous steps.\n",
- "Epoch 1/10\n",
- "step 529/529 [==============================] - loss: 0.0891 - 9ms/step \n",
- "save checkpoint at /home/aistudio/output/0\n",
- "Eval begin...\n",
- "step 63/63 [==============================] - loss: 0.0830 - 6ms/step \n",
- "Eval samples: 1000\n",
- "Epoch 2/10\n",
- "step 529/529 [==============================] - loss: 0.0199 - 10ms/step \n",
- "save checkpoint at /home/aistudio/output/1\n",
- "Eval begin...\n",
- "step 63/63 [==============================] - loss: 0.0353 - 6ms/step \n",
- "Eval samples: 1000\n",
- "Epoch 3/10\n",
- "step 529/529 [==============================] - loss: 0.2133 - 10ms/step \n",
- "save checkpoint at /home/aistudio/output/2\n",
- "Eval begin...\n",
- "step 63/63 [==============================] - loss: 0.0259 - 6ms/step \n",
- "Eval samples: 1000\n",
- "Epoch 4/10\n",
- "step 529/529 [==============================] - loss: 0.0133 - 9ms/step \n",
- "save checkpoint at /home/aistudio/output/3\n",
- "Eval begin...\n",
- "step 63/63 [==============================] - loss: 0.0210 - 6ms/step \n",
- "Eval samples: 1000\n",
- "Epoch 5/10\n",
- "step 529/529 [==============================] - loss: 0.0110 - 10ms/step \n",
- "save checkpoint at /home/aistudio/output/4\n",
- "Eval begin...\n",
- "step 63/63 [==============================] - loss: 0.0130 - 5ms/step \n",
- "Eval samples: 1000\n",
- "Epoch 6/10\n",
- "step 529/529 [==============================] - loss: 0.0150 - 9ms/step \n",
- "save checkpoint at /home/aistudio/output/5\n",
- "Eval begin...\n",
- "step 63/63 [==============================] - loss: 0.0111 - 6ms/step \n",
- "Eval samples: 1000\n",
- "Epoch 7/10\n",
- "step 529/529 [==============================] - loss: 0.0039 - 9ms/step \n",
- "save checkpoint at /home/aistudio/output/6\n",
- "Eval begin...\n",
- "step 63/63 [==============================] - loss: 0.0093 - 6ms/step \n",
- "Eval samples: 1000\n",
- "Epoch 8/10\n",
- "step 529/529 [==============================] - loss: 0.0100 - 9ms/step \n",
- "save checkpoint at /home/aistudio/output/7\n",
- "Eval begin...\n",
- "step 63/63 [==============================] - loss: 0.0059 - 5ms/step \n",
- "Eval samples: 1000\n",
- "Epoch 9/10\n",
- "step 529/529 [==============================] - loss: 0.0096 - 9ms/step \n",
- "save checkpoint at /home/aistudio/output/8\n",
- "Eval begin...\n",
- "step 63/63 [==============================] - loss: 0.0061 - 5ms/step \n",
- "Eval samples: 1000\n",
- "Epoch 10/10\n",
- "step 529/529 [==============================] - loss: 0.0066 - 10ms/step \n",
- "save checkpoint at /home/aistudio/output/9\n",
- "Eval begin...\n",
- "step 63/63 [==============================] - loss: 0.0054 - 6ms/step \n",
- "Eval samples: 1000\n",
- "save checkpoint at /home/aistudio/output/final\n"
- ]
- }
- ],
- "source": [
- "# 执行训练\n",
- "model.fit(train_data=Reader(DATA_PATH),\n",
- " eval_data=Reader(DATA_PATH, is_val=True),\n",
- " batch_size=BATCH_SIZE,\n",
- " epochs=EPOCH,\n",
- " save_dir=\"output/\",\n",
- " save_freq=1,\n",
- " verbose=1,\n",
- " drop_last=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "collapsed": false
- },
- "source": [
- "## 六、预测前准备"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "collapsed": false
- },
- "source": [
- "### 6.1 像定义训练Reader一样定义预测Reader"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "# 与训练近似,但不包含Label\n",
- "class InferReader(Dataset):\n",
- " def __init__(self, dir_path=None, img_path=None):\n",
- " \"\"\"\n",
- " 数据读取Reader(预测)\n",
- " :param dir_path: 预测对应文件夹(二选一)\n",
- " :param img_path: 预测单张图片(二选一)\n",
- " \"\"\"\n",
- " super().__init__()\n",
- " if dir_path:\n",
- " # 获取文件夹中所有图片路径\n",
- " self.img_names = [i for i in os.listdir(dir_path) if os.path.splitext(i)[1] == \".jpg\"]\n",
- " self.img_paths = [os.path.join(dir_path, i) for i in self.img_names]\n",
- " elif img_path:\n",
- " self.img_names = [os.path.split(img_path)[1]]\n",
- " self.img_paths = [img_path]\n",
- " else:\n",
- " raise Exception(\"请指定需要预测的文件夹或对应图片路径\")\n",
- "\n",
- " def get_names(self):\n",
- " \"\"\"\n",
- " 获取预测文件名顺序 \n",
- " \"\"\"\n",
- " return self.img_names\n",
- "\n",
- " def __getitem__(self, index):\n",
- " # 获取图像路径\n",
- " file_path = self.img_paths[index]\n",
- " # 使用Pillow来读取图像数据并转成Numpy格式\n",
- " img = Image.open(file_path)\n",
- " img = np.array(img, dtype=\"float32\").reshape((IMAGE_SHAPE_C, IMAGE_SHAPE_H, IMAGE_SHAPE_W)) / 255\n",
- " return img\n",
- "\n",
- " def __len__(self):\n",
- " return len(self.img_paths)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "collapsed": false
- },
- "source": [
- "### 6.2 参数设置"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 16,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "# 待预测目录 - 可在测试数据集中挑出\\b3张图像放在该目录中进行推理\n",
- "INFER_DATA_PATH = \"./sample_img\"\n",
- "# 训练后存档点路径 - final 代表最终训练所得模型\n",
- "CHECKPOINT_PATH = \"./output/final.pdparams\"\n",
- "# 每批次处理数量\n",
- "BATCH_SIZE = 32"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "collapsed": false
- },
- "source": [
- "### 6.3 展示待预测数据"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 18,
- "metadata": {
- "collapsed": false
- },
- "outputs": [
- {
- "data": {
- "image/png": "",
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "import matplotlib.pyplot as plt\n",
- "plt.figure(figsize=(10, 10))\n",
- "sample_idxs = np.random.choice(50000, size=25, replace=False)\n",
- "\n",
- "for img_id, img_name in enumerate(os.listdir(INFER_DATA_PATH)):\n",
- " plt.subplot(1, 3, img_id + 1)\n",
- " plt.xticks([])\n",
- " plt.yticks([])\n",
- " im = Image.open(os.path.join(INFER_DATA_PATH, img_name))\n",
- " plt.imshow(im, cmap=plt.cm.binary)\n",
- " plt.xlabel(\"Img name: \" + img_name)\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "collapsed": false
- },
- "source": [
- "## 七、开始预测\n",
- "> 飞桨2.1 CTC Decoder 相关API正在迁移中,本节暂时使用简易版解码器。"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 19,
- "metadata": {
- "collapsed": false
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "WARNING: Detect dataset only contains single fileds, return format changed since Paddle 2.1. In Paddle <= 2.0, DataLoader add a list surround output data(e.g. return [data]), and in Paddle >= 2.1, DataLoader return the single filed directly (e.g. return data). For example, in following code: \n",
- "\n",
- "import numpy as np\n",
- "from paddle.io import DataLoader, Dataset\n",
- "\n",
- "class RandomDataset(Dataset):\n",
- " def __getitem__(self, idx):\n",
- " data = np.random.random((2, 3)).astype('float32')\n",
- "\n",
- " return data\n",
- "\n",
- " def __len__(self):\n",
- " return 10\n",
- "\n",
- "dataset = RandomDataset()\n",
- "loader = DataLoader(dataset, batch_size=1)\n",
- "data = next(loader())\n",
- "\n",
- "In Paddle <= 2.0, data is in format '[Tensor(shape=(1, 2, 3), dtype=float32)]', and in Paddle >= 2.1, data is in format 'Tensor(shape=(1, 2, 3), dtype=float32)'\n",
- "\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Predict begin...\n",
- "step 1/1 [==============================] - 10ms/step\n",
- "Predict samples: 3\n",
- "文件名:9451.jpg,推理结果为:[3, 4, 6, 3]\n",
- "文件名:9452.jpg,推理结果为:[0, 3, 0, 0]\n",
- "文件名:9450.jpg,推理结果为:[8, 2, 0, 5]\n"
- ]
- }
- ],
- "source": [
- "# 编写简易版解码器\n",
- "def ctc_decode(text, blank=10):\n",
- " \"\"\"\n",
- " 简易CTC解码器\n",
- " :param text: 待解码数据\n",
- " :param blank: 分隔符索引值\n",
- " :return: 解码后数据\n",
- " \"\"\"\n",
- " result = []\n",
- " cache_idx = -1\n",
- " for char in text:\n",
- " if char != blank and char != cache_idx:\n",
- " result.append(char)\n",
- " cache_idx = char\n",
- " return result\n",
- "\n",
- "\n",
- "# 实例化推理模型\n",
- "model = paddle.Model(Net(is_infer=True), inputs=input_define)\n",
- "# 加载训练好的参数模型\n",
- "model.load(CHECKPOINT_PATH)\n",
- "# 设置运行环境\n",
- "model.prepare()\n",
- "\n",
- "# 加载预测Reader\n",
- "infer_reader = InferReader(INFER_DATA_PATH)\n",
- "img_names = infer_reader.get_names()\n",
- "results = model.predict(infer_reader, batch_size=BATCH_SIZE)\n",
- "index = 0\n",
- "for text_batch in results[0]:\n",
- " for prob in text_batch:\n",
- " out = ctc_decode(prob, blank=10)\n",
- " print(f\"文件名:{img_names[index]},推理结果为:{out}\")\n",
- " index += 1"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "py35-paddle1.2.0"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.4"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 1
-}
\ No newline at end of file
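The deleted notebook hard-codes `in_features=429` for `self.linear` and gives `conv3` `LABEL_MAX_LEN + 4` output channels. Those channels become the sequence length that `CTCLoss` passes as `input_lengths`. Both constants follow from the convolution output-size formula, out = floor((in + 2*padding - kernel) / stride) + 1, assuming `paddle.nn.Conv2D`'s default `padding=0`.

The sketch below reproduces these numbers and the steps-per-epoch figures. It assumes `label_dict.txt` lists all 9453 images mentioned in the notebook's abstract. Under that assumption the step counts agree with the 526 training steps and 64 eval steps in the re-added notebook's training log above. The helper name `conv_out` is hypothetical.

```python
# Worked arithmetic for constants hard-coded in the OCR notebook's Net and fit() call.
# Assumes Conv2D defaults as used there: padding=0, dilation=1.


def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of one spatial dimension of a 2-D convolution."""
    return (size + 2 * padding - kernel) // stride + 1


IMAGE_SHAPE_H, IMAGE_SHAPE_W = 30, 70
LABEL_MAX_LEN = 4

# conv1: 3x3, stride 1 -> 28 x 68
h1, w1 = conv_out(IMAGE_SHAPE_H, 3), conv_out(IMAGE_SHAPE_W, 3)
# conv2: 3x3, stride 2 -> 13 x 33
h2, w2 = conv_out(h1, 3, 2), conv_out(w1, 3, 2)
# conv3: 1x1 keeps the spatial size; its channel count becomes the sequence length T
h3, w3 = conv_out(h2, 1), conv_out(w2, 1)
seq_len = LABEL_MAX_LEN + 4

# flatten(x, 2) merges H and W: (B, T, 13, 33) -> (B, T, 429), the Linear input size
linear_in = h3 * w3

# Steps per epoch: 9453 images (assumed), last 1024 held out, batch 16.
TOTAL, VAL, BATCH = 9453, 1024, 16
train_steps = (TOTAL - VAL) // BATCH  # drop_last=True discards the partial batch
eval_steps = VAL // BATCH             # 1024 is a multiple of 16, no partial batch

print(h1, w1, h2, w2, linear_in, seq_len, train_steps, eval_steps)
```

Because the `CTCLoss` wrapper builds `input_lengths` and `label_lengths` with a fixed `shape=[BATCH_SIZE]`, every training batch must be full. That is consistent with the notebook passing `drop_last=True` to `model.fit`.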
diff --git a/docs/practices/cv/image_ocr/images/image1.png b/docs/practices/cv/image_ocr/images/image1.png
deleted file mode 100644
index 8163e6d5df9..00000000000
Binary files a/docs/practices/cv/image_ocr/images/image1.png and /dev/null differ
diff --git a/docs/practices/cv/image_ocr/images/image2.png b/docs/practices/cv/image_ocr/images/image2.png
deleted file mode 100644
index 715085e1b99..00000000000
Binary files a/docs/practices/cv/image_ocr/images/image2.png and /dev/null differ
diff --git a/docs/practices/cv/image_ocr/images/image3.png b/docs/practices/cv/image_ocr/images/image3.png
deleted file mode 100644
index 66c7d3758f6..00000000000
Binary files a/docs/practices/cv/image_ocr/images/image3.png and /dev/null differ
diff --git a/docs/practices/cv/index_cn.rst b/docs/practices/cv/index_cn.rst
index 2a7aa441493..a432fec5e74 100644
--- a/docs/practices/cv/index_cn.rst
+++ b/docs/practices/cv/index_cn.rst
@@ -9,7 +9,7 @@
- `图像分类 <./convnet_image_classification.html>`_ :介绍使用 PaddlePaddle 在Cifar10数据集上完成图像分类。
- `以图搜图 <./image_search.html>`_ : 介绍使用 PaddlePaddle 实现以图搜图。
- `图像分割 <./image_segmentation.html>`_ : 介绍使用 PaddlePaddle 实现U-Net模型完成图像分割。
- - `OCR <./image_ocr/image_ocr.html>`_ : 介绍使用 PaddlePaddle 实现 OCR。
+ - `OCR <./image_ocr.html>`_ : 介绍使用 PaddlePaddle 实现 OCR。
- `图像超分 <./super_resolution_sub_pixel.html>`_ : 介绍使用 PaddlePaddle 完成图像超分。
- `人脸关键点检测 <./landmark_detection.html>`_ : 介绍使用 PaddlePaddle 完成人脸关键点检测。
- `点云分类 <./pointnet.html>`_ :介绍使用 PaddlePaddle 完成点云分类。
@@ -23,7 +23,7 @@
convnet_image_classification.ipynb
image_search.ipynb
image_segmentation.ipynb
- image_ocr/image_ocr.ipynb
+ image_ocr.ipynb
super_resolution_sub_pixel.ipynb
landmark_detection.ipynb
- pointnet.ipynb
\ No newline at end of file
+ pointnet.ipynb
diff --git a/docs/practices/cv/image_ocr/sample_img/9450.jpg b/docs/practices/cv/sample_img/9450.jpg
similarity index 100%
rename from docs/practices/cv/image_ocr/sample_img/9450.jpg
rename to docs/practices/cv/sample_img/9450.jpg
diff --git a/docs/practices/cv/image_ocr/sample_img/9451.jpg b/docs/practices/cv/sample_img/9451.jpg
similarity index 100%
rename from docs/practices/cv/image_ocr/sample_img/9451.jpg
rename to docs/practices/cv/sample_img/9451.jpg
diff --git a/docs/practices/cv/image_ocr/sample_img/9452.jpg b/docs/practices/cv/sample_img/9452.jpg
similarity index 100%
rename from docs/practices/cv/image_ocr/sample_img/9452.jpg
rename to docs/practices/cv/sample_img/9452.jpg
diff --git a/docs/practices/index_cn.rst b/docs/practices/index_cn.rst
index 08f43970778..fe8a731775c 100644
--- a/docs/practices/index_cn.rst
+++ b/docs/practices/index_cn.rst
@@ -19,7 +19,7 @@
- `图像分类 <./cv/convnet_image_classification.html>`_ :介绍使用 PaddlePaddle 在Cifar10数据集上完成图像分类。
- `以图搜图 <./cv/image_search.html>`_ : 介绍使用 PaddlePaddle 实现以图搜图。
- `图像分割 <./cv/image_segmentation.html>`_ : 介绍使用 PaddlePaddle 实现U-Net模型完成图像分割。
- - `OCR <./cv/image_ocr/image_ocr.html>`_ : 介绍使用 PaddlePaddle 实现 OCR。
+ - `OCR <./cv/image_ocr.html>`_ : 介绍使用 PaddlePaddle 实现 OCR。
- `图像超分 <./cv/super_resolution_sub_pixel.html>`_ : 介绍使用 PaddlePaddle 完成图像超分。
- `人脸关键点检测 <./cv/landmark_detection.html>`_ : 介绍使用 PaddlePaddle 完成人脸关键点检测。
- `点云分类 <./cv/pointnet.html>`_ :介绍使用 PaddlePaddle 完成点云分类。
diff --git a/docs/release_note_cn.md b/docs/release_note_cn.md
index fc56944eac8..781b1b76957 100644
--- a/docs/release_note_cn.md
+++ b/docs/release_note_cn.md
@@ -1,13 +1,13 @@
-# 2.2.0 rc0 Release Note
+# Release Note
## 1. 重要更新
-我们很高兴的发布飞桨框架2.2.0-rc0版本,本版本包含如下重要更新。
+我们很高兴地发布飞桨框架2.2.0版本,本版本包含如下重要更新。
### API
-- 新增100+个API,包含24个傅里叶变换API、14个线性代数计算 API 等,更好地支持科学计算类、信号处理类模型。
+- 新增100+个API,包含24个傅里叶变换API、17个线性代数计算 API 等,更好地支持科学计算类、信号处理类模型。
- 新增多种索引类型的支持,新增的索引类型包括:省略号(…)、维度扩增(None)、布尔类型数组(Bool Mask)、整数数组((list),以及张量(Tensor) ),可以更加方便的对张量(Tensor)进行操作。
- 新增 `paddle.einsum` API,可以以更加简洁的方式来表达多维张量(Tensor)的计算。
- 动态图混合精度功能增强,新增整个任务使用半精度(float16)训练的方式,主要任务下的计算效率提升20%左右。
@@ -290,7 +290,9 @@ paddle.int64
- 新增 ``paddle.linalg.multi_dot``,支持多个矩阵连乘的计算。([#35224](https://github.com/PaddlePaddle/Paddle/pull/35224))
- 新增 ``paddle.linalg.solve``,支持计算线性方程组的解。([#35715](https://github.com/PaddlePaddle/Paddle/pull/35715))
- 新增``paddle.linalg.matrix_power``,支持矩阵的幂运算操作。([#34667](https://github.com/PaddlePaddle/Paddle/pull/34667))
-
+ - 新增`paddle.linalg.eigvalsh`,用于计算厄米特矩阵或者实数对称矩阵的特征值。([#36680](https://github.com/PaddlePaddle/Paddle/pull/36680))
+ - 新增`paddle.linalg.eig`,用于计算一般方阵的特征值和特征向量。([#35674](https://github.com/PaddlePaddle/Paddle/pull/35674))
+ - 新增`paddle.linalg.qr`,用于计算矩阵的QR分解(暂不支持反向)。([#36627](https://github.com/PaddlePaddle/Paddle/pull/36627))
- 新增傅里叶变换相关API ([#35665](https://github.com/PaddlePaddle/Paddle/pull/35665))
- 新增快速傅立叶变换系列函数
- 可微分的 1d 到 nd 复数到复数快速傅里叶变换。(``paddle.fft.fft``, ``paddle.fft.fft2``, ``paddle.fft.fftn``, ``paddle.fft.ifft``, ``paddle.fft.ifft2``, ``paddle.fft.ifftn``)
@@ -303,19 +305,21 @@ paddle.int64
- 短时傅里叶逆变换。(``paddle.signal.istft``)
- 新增高层API
- - 新增 ``paddle.vision.ops.roi_pool`` 和 ``paddle.vision.ops.RoIPool``,支持检测任务中 RoI 区域池化操作。 ([#36154](https://github.com/PaddlePaddle/Paddle/pull/36154))
- - 新增 ``paddle.vision.ops.roi_align`` 和 ``paddle.vision.ops.RoIAlign``,支持检测任务中 RoI 区域 Align 操作。([#36207](https://github.com/PaddlePaddle/Paddle/pull/36207))
- - 新增 ``paddle.vision.ops.psroi_pool`` 和 ``paddle.vision.ops.PSRoIPool``,支持检测任务中位置敏感的 RoI 区域池化操作。 ([#36111](https://github.com/PaddlePaddle/Paddle/pull/36111))
- - 新增 ``paddle.vision.models.vgg19`` 预训练权重。 ([#35788](https://github.com/PaddlePaddle/Paddle/pull/35788))
- - 新增 ``paddle.vision.datasets.*`` 中数据集 API 下载进度条。([#33302](https://github.com/PaddlePaddle/Paddle/pull/33302))
- - 新增 ``paddle.Model.predict`` 参数 ``verbose``,支持是否显示日志。([#33405](https://github.com/PaddlePaddle/Paddle/pull/33405))
- - 新增 ``paddle.hub`` 下载选项 `wget` 方式。([#33379](https://github.com/PaddlePaddle/Paddle/pull/33379))
- - 新增 ``paddle.Model`` 动态图模式下梯度累加功能。([#32702](https://github.com/PaddlePaddle/Paddle/pull/32702))
- - 新增 ``paddle.Model.fit`` 和 ``paddle.Model.evaluate`` 动态图模式下 ``num_iters`` 参数,控制训练迭代轮数。([#33986](https://github.com/PaddlePaddle/Paddle/pull/33986))
- - 新增 ``paddle.vision.ops.yolo_box`` 参数 ``iou_aware`` 和 ``iou_aware_factor``,支持 YoloBox 使用预测的 IOU 作为置信度的因子。([#33400](https://github.com/PaddlePaddle/Paddle/pull/33400))
- - 新增 ``paddle.summary`` 参数``input``,支持给定输入。([#34165](https://github.com/PaddlePaddle/Paddle/pull/34165))
+ - 新增 ``paddle.vision.ops.roi_pool`` 和 ``paddle.vision.ops.RoIPool``,支持检测任务中 RoI 区域池化操作。 ([#36154](https://github.com/PaddlePaddle/Paddle/pull/36154))
+ - 新增 ``paddle.vision.ops.roi_align`` 和 ``paddle.vision.ops.RoIAlign``,支持检测任务中 RoI 区域 Align 操作。([#36207](https://github.com/PaddlePaddle/Paddle/pull/36207))
+ - 新增 ``paddle.vision.ops.psroi_pool`` 和 ``paddle.vision.ops.PSRoIPool``,支持检测任务中位置敏感的 RoI 区域池化操作。 ([#36111](https://github.com/PaddlePaddle/Paddle/pull/36111))
+ - 新增 ``paddle.vision.models.vgg19`` 预训练权重。 ([#35788](https://github.com/PaddlePaddle/Paddle/pull/35788))
+ - 新增 ``paddle.vision.datasets.*`` 中数据集 API 下载进度条。([#33302](https://github.com/PaddlePaddle/Paddle/pull/33302))
+ - 新增 ``paddle.Model.predict`` 参数 ``verbose``,支持是否显示日志。([#33405](https://github.com/PaddlePaddle/Paddle/pull/33405))
+ - 新增 ``paddle.hub`` 下载选项 `wget` 方式。([#33379](https://github.com/PaddlePaddle/Paddle/pull/33379))
+ - 新增 ``paddle.Model`` 动态图模式下梯度累加功能。([#32702](https://github.com/PaddlePaddle/Paddle/pull/32702))
+ - 新增 ``paddle.Model.fit`` 和 ``paddle.Model.evaluate`` 动态图模式下 ``num_iters`` 参数,控制训练迭代轮数。([#33986](https://github.com/PaddlePaddle/Paddle/pull/33986))
+ - 新增 ``paddle.vision.ops.yolo_box`` 参数 ``iou_aware`` 和 ``iou_aware_factor``,支持 YoloBox 使用预测的 IOU 作为置信度的因子。([#33400](https://github.com/PaddlePaddle/Paddle/pull/33400))
+ - 新增 ``paddle.summary`` 参数``input``,支持给定输入。([#34165](https://github.com/PaddlePaddle/Paddle/pull/34165))
+ - 新增`paddle.text.viterbi_decode`,支持动态图下CPU、GPU的Viterbi解码功能。([#35778](https://github.com/PaddlePaddle/Paddle/pull/35778))
- 新增组网类 API
+ - 新增`paddle.nn.functional.sparse_attention`,用于计算稀疏的Transformer Attention模块。([#35757](https://github.com/PaddlePaddle/Paddle/pull/35757))
- 新增 ``paddle.nn.MaxUnPool2D`` 和 ``paddle.nn.functional.max_unpool2d``,支持根据输入的input和最大值位置计算出池化的逆结果。([#35056](https://github.com/PaddlePaddle/Paddle/pull/35056))
- 新增 ``paddle.nn.functional.gumbel_softmax``,支持 ``gumbel softmax`` 采样。([#35506](https://github.com/PaddlePaddle/Paddle/pull/35506), [#36065](https://github.com/PaddlePaddle/Paddle/pull/36065), [#36094](https://github.com/PaddlePaddle/Paddle/pull/36094))
- 新增 ``paddle.nn.functional.class_center_sample``,支持 PartialFC 类中心采样功能。([#34106](https://github.com/PaddlePaddle/Paddle/pull/34106))
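`paddle.text.viterbi_decode`, listed above, performs Viterbi decoding. The plain-Python sketch below shows roughly what such a decoder computes for a single, unbatched sequence. It is not Paddle's implementation. The function name `viterbi_decode_single` and the `emissions`/`transitions` layout are our own assumptions, and the sketch ignores batching, sequence lengths and BOS/EOS tags.

```python
def viterbi_decode_single(emissions, transitions):
    """Return (best_score, best_path) for one sequence (illustrative sketch).

    emissions[t][j]   -- score of tag j at time step t
    transitions[i][j] -- score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    # score[j] = best total score of any path that ends in tag j so far
    score = list(emissions[0])
    backpointers = []
    for step in emissions[1:]:
        new_score, pointers = [], []
        for j in range(num_tags):
            # choose the previous tag that maximises score + transition into j
            best_i = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + step[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # start from the best final tag and follow the back-pointers
    last = max(range(num_tags), key=lambda j: score[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path
```

The nested loop costs O(T·K²) for T steps and K tags. Framework kernels typically vectorise the inner maximisation over tags.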
@@ -332,9 +336,13 @@ paddle.int64
- 新增 ``paddle.device.cuda.empty_cache``,支持清理空闲的显存。([#35427](https://github.com/PaddlePaddle/Paddle/pull/35427))
- 新增 ``paddle.device.cuda.get_device_properties``,支持返回给定的设备属性。([#35875](https://github.com/PaddlePaddle/Paddle/pull/35875))
- 新增 ``paddle.device.cuda.stream_guard``,用于动态图下 CUDA Stream的灵活切换。([#35623](https://github.com/PaddlePaddle/Paddle/pull/35623))
-
+ - 新增`paddle.device.cuda.get_device_name`,支持返回给定设备的名称。([#36172](https://github.com/PaddlePaddle/Paddle/pull/36172))
+ - 新增`paddle.device.cuda.get_device_capability`,支持返回给定设备计算能力的版本号。([#36172](https://github.com/PaddlePaddle/Paddle/pull/36172))
+ - 新增`paddle.framework.core.async_read`和`paddle.framework.core.async_write`,可支持非默认 CUDA `Stream`下`CUDAPinnedPlace` 和 `CUDAPlace` 的 `Tensor` 数据异步读写。([#36501](https://github.com/PaddlePaddle/Paddle/pull/36501))
- 新增Tensor操作API
+ - 新增`paddle.tensordot`,支持对高维张量做缩并(Tensor Contraction)运算。([#36454](https://github.com/PaddlePaddle/Paddle/pull/36454))
+ - 新增`paddle.bincount`,支持对一维张量内元素进行计数。([#36709](https://github.com/PaddlePaddle/Paddle/pull/36709))
- 新增 `paddle.broadcast_tensors` ,支持对一组 `Tensor` 进行广播操作。([#33294](https://github.com/PaddlePaddle/Paddle/pull/33294), [#34874](https://github.com/PaddlePaddle/Paddle/pull/34874))
- 新增 `paddle.einsum` 。([#33821](https://github.com/PaddlePaddle/Paddle/pull/34874))
- 增强``paddle.tensor.gradient``接口,支持sigmoid_op的二阶求导算子。([#32971](https://github.com/PaddlePaddle/Paddle/pull/32971))
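`paddle.bincount` counts how often each value occurs in a one-dimensional integer tensor. The sketch below illustrates these counting semantics in plain Python. It assumes NumPy-style behaviour: non-negative inputs, optional per-element `weights`, and a `minlength` lower bound on the output size. The helper `bincount_list` is hypothetical and is not the Paddle API.

```python
def bincount_list(values, weights=None, minlength=0):
    """Count each non-negative integer in `values` (optionally weighted)."""
    if any(v < 0 for v in values):
        raise ValueError("values must be non-negative integers")
    # output length: largest value + 1, but at least `minlength`
    size = max(max(values, default=-1) + 1, minlength)
    counts = [0] * size
    for i, v in enumerate(values):
        # each occurrence adds 1, or its weight when weights are given
        counts[v] += 1 if weights is None else weights[i]
    return counts
```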
@@ -373,6 +381,7 @@ paddle.int64
- 新增 ``paddle.static.ExponentialMovingAverage``,支持用指数衰减计算参数的滑动平均值。([#35673](https://github.com/PaddlePaddle/Paddle/pull/35673))
- 新增 `` paddle::Tensor::slice`` C++ API, 支持 slice 操作,允许用户对外部 Tensor 切片操作。([#34227](https://github.com/PaddlePaddle/Paddle/pull/34227))
- 新增``paddle.incubate.segment_*``系列API,包含 ``paddle.incubate.segment_sum, paddle.incubate.segment_mean, paddle.incubate.segment_max, paddle.incubate.segment_min``。支持对`Tensor`按照分段求和、求均值、求最大值、求最小值。 ([#35759](https://github.com/PaddlePaddle/Paddle/pull/35759))
+ - 新增`paddle.version.cuda`和`paddle.version.cudnn`,用于获取 paddle 安装包所使用的 `CUDA`和 `cuDNN`的版本号。([#36556](https://github.com/PaddlePaddle/Paddle/pull/36556))
#### IR(Intermediate Representation)
- 动态图转静态图
@@ -388,13 +397,15 @@ paddle.int64
- 提供分析 `Program` 中控制流需要的依赖辅助函数。 ([#33439](https://github.com/PaddlePaddle/Paddle/pull/33439))
- `Program` 和 `Graph` 相互转换后保留训练所需要的 `stop_gradient` , `persistable` 属性值。([#33771](https://github.com/PaddlePaddle/Paddle/pull/33771))
- 原 `Pass` 只处理主`Graph`,忽略子图,现`Pass` 支持处理主 `Graph`及其所有子图。 ([#34158](https://github.com/PaddlePaddle/Paddle/pull/34158))
- - 处理了在预测情况下 `Program` 和 `Graph` 互转的一些拓扑排序问题。([#34121](https://github.com/PaddlePaddle/Paddle/pull/34121), [#34521](https://github.com/PaddlePaddle/Paddle/pull/34521)). **《== **
+ - 处理了在预测情况下 `Program` 和 `Graph` 互转的一些拓扑排序问题。([#34121](https://github.com/PaddlePaddle/Paddle/pull/34121), [#34521](https://github.com/PaddlePaddle/Paddle/pull/34521))
- Pass开发
- 新增 Python 侧针对 fusion 等子图替换场景下的 Pass 开发方式。([#35708](https://github.com/PaddlePaddle/Paddle/pull/35708), [#35602](https://github.com/PaddlePaddle/Paddle/pull/35602))
- Kernel Primitive API
- 对算子 Kernel 实现中的底层代码进行了抽象与功能封装,提供高性能的 Block 级 IO 运算和 Compute 运算。使用 Kernel Primitive API 进行 Kernel 开发可以更加专注计算逻辑的实现,在保证性能的同时大幅减少代码量,同时实现了算子计算与硬件解耦。([#34672](https://github.com/PaddlePaddle/Paddle/pull/34672), [#35075](https://github.com/PaddlePaddle/Paddle/pull/35075), [#34456](https://github.com/PaddlePaddle/Paddle/pull/34456), [#35282](https://github.com/PaddlePaddle/Paddle/pull/35282), [#35743](https://github.com/PaddlePaddle/Paddle/pull/35743), [#34208](https://github.com/PaddlePaddle/Paddle/pull/34208))
+ - 在 Kernel Primitive API中添加一元和二元计算Functor共13个。 ([#36418](https://github.com/PaddlePaddle/Paddle/pull/36418))
+ - 修改 Kernel Primitive API 中 ReadData 实现方式,修复`NX !=1`访存越界的问题。 ([#36373](https://github.com/PaddlePaddle/Paddle/pull/36373))
#### 混合精度训练
- 动态图混合精度功能增强,新增整个任务使用半精度(float16)训练的方式,主要任务下的计算效率提升20%左右。 ([#35521](https://github.com/PaddlePaddle/Paddle/pull/35521))
@@ -512,7 +523,13 @@ paddle.int64
- 优化``l2_normalize``,``p_norm``,``elementwise_max``,``prelu``,``clip_by_norm``,``lars optimizer``算子支持float16计算。 ([#35576](https://github.com/PaddlePaddle/Paddle/pull/35576), [#35888](https://github.com/PaddlePaddle/Paddle/pull/35888), [#35888](https://github.com/PaddlePaddle/Paddle/pull/35888), [35532](https://github.com/PaddlePaddle/Paddle/pull/35532), [#35446](https://github.com/PaddlePaddle/Paddle/pull/35446), [#33280](https://github.com/PaddlePaddle/Paddle/pull/33280))
- 优化flowers数据集的读取速度,从每批次数分钟优化至1~3秒。([#31408](https://github.com/PaddlePaddle/Paddle/pull/31408))
- 支持`paddle.distributed.fleet.DistributedStrategy` 中 `without_graph_optimize` 开关打开后的fuse allreduce sum功能。FP32下性能提升3%,AMP下性能提升8%。([#34446](https://github.com/PaddlePaddle/Paddle/pull/34446))
-
+- `paddle.matmul` 将底层算子由 matmul op 切换为 matmul_v2 op。 ([#36374](https://github.com/PaddlePaddle/Paddle/pull/36374))
+- `paddle.fft` 模块添加了 mkl_cdft 和 hipfft 两个计算后端。 ([#36537](https://github.com/PaddlePaddle/Paddle/pull/36537))
+- `paddle.roll` 的参数 `shifts` 支持 `Tensor` 作为输入。 ([#36537](https://github.com/PaddlePaddle/Paddle/pull/36537))
+- `paddle.shape` 支持复数类型的输入。([#36835](https://github.com/PaddlePaddle/Paddle/pull/36835))
+- matmul_v2 支持量化。([#36469](https://github.com/PaddlePaddle/Paddle/pull/36469))
+- 新增 `clip_op` 对 `float16` 的支持。 ([#36672](https://github.com/PaddlePaddle/Paddle/pull/36672))
+- `paddle.fft` 模块为 cufft 后端添加了缓存 plan 的功能,优化性能。([#36537](https://github.com/PaddlePaddle/Paddle/pull/36537))
#### IR(Intermediate Representation)
- 动态图转静态图
@@ -521,6 +538,9 @@ paddle.int64
- 优化了动转静训练代码逻辑,升级内部 ``Program`` 缓存机制,新增输入 ``Tensor`` 的提前 copy 策略,提升训练性能。 ([#34181](https://github.com/PaddlePaddle/Paddle/pull/34181), [#33796](https://github.com/PaddlePaddle/Paddle/pull/33796))
- 优化动转静内部执行器显存回收策略,减少训练时显存占用量。 ([#34177](https://github.com/PaddlePaddle/Paddle/pull/34177))
- 集成了 ``Gast`` 三方依赖库的源码,解耦了版本依赖。 ([#34556](https://github.com/PaddlePaddle/Paddle/pull/34556))
+ - 动转静报错时显示部分框架层报错信息,使得定位问题更加容易。([#36765](https://github.com/PaddlePaddle/Paddle/pull/36765))
+ - 移除动转静报错模块中重复的临时文件删除函数`remove_static_file()`。([#36375](https://github.com/PaddlePaddle/Paddle/pull/36375))
+ - 优化对RegisterPass中`input_specs`参数处理,支持图优化时作为匹配子图条件。([#36453](https://github.com/PaddlePaddle/Paddle/pull/36453))
#### 分布式训练
@@ -534,7 +554,13 @@ paddle.int64
- `paddle.io.Dataset` 支持动态库解析数据。 ([#33969](https://github.com/PaddlePaddle/Paddle/pull/33969))
- 新增 `paddle.distributed.fleet.dataset.DatasetBase` 中对`use_var_list`和 `pipe_command` 生成数据的一致性检查函数。 ([#34463](https://github.com/PaddlePaddle/Paddle/pull/34463))
- 新增 `paddle.fluid.layers.embedding` 的 `emd` 维度与 `fleet` 中` sparse table` 的 `emb` 维度的一致性检查。 ([#34249](https://github.com/PaddlePaddle/Paddle/pull/34249))
-
+ - 动态图混合并行支持Pure FP16训练。([#36707](https://github.com/PaddlePaddle/Paddle/pull/36707))
+ - 静态图混合并行支持dropout使用固定随机种子生成器,以确保模型并行中全局变量的一致性与局部变量的随机性。([#36682](https://github.com/PaddlePaddle/Paddle/pull/36682))
+
+ - 实现了CPU并行,并支持调用 spawn 或 launch 时可以添加自定义的backend参数。可用的backend选择为 "gloo", "nccl", "bkcl", "auto" ,分别表示CPU并行,GPU并行,XPU并行和按照Paddle版本自动选择。([#35745](https://github.com/PaddlePaddle/Paddle/pull/35745))
+ - 优化动态图混合并行 HybridParallelClipGrad 策略,支持4D混合并行+Pure FP16训练。([#36707](https://github.com/PaddlePaddle/Paddle/pull/36707))
+ - 添加 SlotRecordDataset 类支持GPU参数服务器训练。([#36710](https://github.com/PaddlePaddle/Paddle/pull/36710))
+ - GPU参数服务器构建阶段支持使用SlotRecordDataset。([#36723](https://github.com/PaddlePaddle/Paddle/pull/36723))
- 静态图混合并行
- 优化混合并行 loss scale,减少 scale op 插入个数。([#35775](https://github.com/PaddlePaddle/Paddle/pull/35775))
@@ -555,6 +581,14 @@ paddle.int64
- 修正 ``paddle.jit.save`` 接口和模型裁剪的逻辑,不再为输出变量增加一个关联的 ``scale_op``,可以正确导出含有 ``bool``,``float16`` 类型输出的模型。([#35730](https://github.com/PaddlePaddle/Paddle/pull/35730), [#36132](https://github.com/PaddlePaddle/Paddle/pull/36132))
- 自定义OP
- 移除 ``paddle::Tensor`` 的 ``copy`` 方法中不必要的 ``cudaStreamSynchronize`` 操作,以提升性能。([#35802](https://github.com/PaddlePaddle/Paddle/pull/35802))
+- 新增C++对GeneratePass开发注册的支持,开发方式与Python侧对齐。([#36302](https://github.com/PaddlePaddle/Paddle/pull/36302))
+- 自动稀疏化训练(Automatic SParsity)
+ - 新增`paddle.static.sparsity`,支持生成`n:m`稀疏模式的稀疏参数,目前只支持静态图ASP训练。A100上FP32、FP16分别设置`1:2`、`2:4`的稀疏模式,训练保存的稀疏模型,可通过调用TensorRT 8利用Ampere架构的稀疏Tensor Core加速推理任务。当前版本共提供了5个API:([#32995](https://github.com/PaddlePaddle/Paddle/pull/32995)、[#33132](https://github.com/PaddlePaddle/Paddle/pull/33132)、[#33558](https://github.com/PaddlePaddle/Paddle/pull/33558)、[#36525](https://github.com/PaddlePaddle/Paddle/pull/36525))
+ - `paddle.static.sparsity.calculate_density`,计算输入Tensor的密度。
+ - `paddle.static.sparsity.decorate`,将给定的优化器包装为`OptimizerWithSparsityGuarantee`,在调用 `optimizer.minimize()`时自动为ASP工作流插入必要的操作。
+ - `paddle.static.sparsity.prune_model`,依据`mask_algo`指定的掩码生成函数裁剪`main_program`中支持的层的参数。
+ - `paddle.static.sparsity.set_excluded_layers`,设置不会被裁剪的层的参数名称。
+ - `paddle.static.sparsity.reset_excluded_layers`,重置与`main_program`相对应的`excluded_layers`设置。
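The sketch below makes the `n:m` sparsity pattern and the idea behind `calculate_density` more concrete. It prunes a flat list of weights so that each group of `m` consecutive values keeps only its `n` largest-magnitude entries (2:4 by default), then reports the fraction of non-zero entries. The helpers `prune_n_m` and `density` are hypothetical. Three things are assumptions: reading `n:m` as "keep n of every m", grouping along a flat list, and the magnitude criterion. The sketch does not describe the masks that `paddle.static.sparsity.prune_model` actually produces.

```python
def prune_n_m(weights, n=2, m=4):
    """Zero all but the n largest-magnitude values in each group of m (sketch)."""
    if len(weights) % m != 0:
        raise ValueError("len(weights) must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # positions of the n entries with the largest absolute value
        keep = set(sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n])
        pruned.extend(group[i] if i in keep else 0.0 for i in range(m))
    return pruned


def density(values):
    """Fraction of entries that are non-zero."""
    return sum(1 for v in values if v != 0) / len(values)
```

Under this reading, both `1:2` and `2:4` yield 50% density. They differ only in the group size over which the constraint is enforced.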
@@ -594,6 +628,18 @@ paddle.int64
- 优化动态图性能,将只在静态图执行的逻辑从动态图的执行路径中剥离。([#34024](https://github.com/PaddlePaddle/Paddle/pull/34024))
- IR Pass优化能力作为通用能力露出,同时支持单机和分布式优化。在GPT混合并行场景性能提升3%-5%。([#34955](https://github.com/PaddlePaddle/Paddle/pull/34955), [#35704](https://github.com/PaddlePaddle/Paddle/pull/35704), [#34730](https://github.com/PaddlePaddle/Paddle/pull/34730), [#34524](https://github.com/PaddlePaddle/Paddle/pull/34524))
- 优化 ctc loss grad 计算速度,提速~3x,但相应增加了GPU显存占用。([#34729](https://github.com/PaddlePadle/Paddle/pull/34729))
+- transformer encoder 性能优化
+ - 优化思路:通过新增 `paddle.incubate.nn.FusedMultiHeadAttention` 和 `paddle.incubate.nn.FusedFeedForward` 的方式,在实现中采用 q, k, v gemm融合及多种kernel融合优化技术,提升transformer encoder的性能。
+ - FusedAttention
+ - 新增 `paddle.incubate.nn.functional.fused_multi_head_attention` ,支持multi-head attention的融合计算。([#35905](https://github.com/PaddlePaddle/Paddle/pull/35905), [#35903](https://github.com/PaddlePaddle/Paddle/pull/35903), [#36803](https://github.com/PaddlePaddle/Paddle/pull/36803), [#36793](https://github.com/PaddlePaddle/Paddle/pull/36793), [#36185](https://github.com/PaddlePaddle/Paddle/pull/36185))
+ - 新增 `paddle.incubate.nn.FusedMultiHeadAttention` ,用于融合multi-head attention的layer层组网。 ([#36498](https://github.com/PaddlePaddle/Paddle/pull/36498))
+ - 该模块使用q, k, v gemm融合和bias add + dropout + residual add + layer_norm kernel融合优化技术,可带来1.08x-1.45x加速。
+
+ - FusedFeedForward
+ - 新增 `paddle.incubate.nn.functional.fused_feedforward` ,支持 feedforward的融合计算。([#36729](https://github.com/PaddlePaddle/Paddle/pull/36729) [#36730](https://github.com/PaddlePaddle/Paddle/pull/36730))
+ - 新增 `paddle.incubate.nn.FusedFeedForward` ,用于融合feedforward的layer层组网。 ([#36776](https://github.com/PaddlePaddle/Paddle/pull/36776))
+ - 性能较优化前有1.04x~1.22x左右的提升。
+ - 新增 `paddle.incubate.nn.FusedTransformerEncoderLayer`,支持使用融合multi-head attention和融合feedforward计算的layer层组网。 ([#36776](https://github.com/PaddlePaddle/Paddle/pull/36776))
### (4)问题修复
@@ -687,12 +733,27 @@ paddle.int64
- 迁移``paddle.nn.functional.dice_loss``API中的`one_hot`算子到`one_hot_v2`算子。([#35734](https://github.com/PaddlePaddle/Paddle/pull/35734))
- 修复 ``paddle.summary`` 静态图模式下使用 bug。([#35303](https://github.com/PaddlePaddle/Paddle/pull/35303))
- 修复 ``paddle.Model.prepare`` 静态图模式下多卡启动的 bug。([#34311](https://github.com/PaddlePaddle/Paddle/pull/34311))
+- 修复`paddle.nn.functional.cross_entropy` 给定`weight`,且指定`axis`为除-1外的其他合法维度时会报错的问题。([#36647](https://github.com/PaddlePaddle/Paddle/pull/36647))
+- 修复`paddle.utils.dlpack.to_dlpack`无法编码多维 `Tensor` 的问题,修复其所生成的 DLPack 对象无法进行跨深度学习框架共享的问题。([#36177](https://github.com/PaddlePaddle/Paddle/pull/36177))
+- 修复使用`paddle.distribution.Categorical`的`sample`方法报错的问题,具体原因是multinomial op的cuda kernel中数组访问越界,该bug会导致访问超出数组下标的值,引起报错。 ([#36511](https://github.com/PaddlePaddle/Paddle/pull/36511))
+- 修复动态图`_BatchNormBase`基类中修改了 default_dtype,导致后续组网参数类型错误的问题,受影响的API有`paddle.nn.BatchNorm1D`,`paddle.nn.BatchNorm2D`,`paddle.nn.BatchNorm3D`,`paddle.nn.SyncBatchNorm`。具体原因是当 `get_default_dtype() == 'float16'` 时,通过 `set_default_dtype('float32')`修改默认参数数据类型,动态图组网的参数类型是通过 default_dtype 来创建的,因此当默认参数类型被修改后导致后续的组网参数类型错误。 ([#36376](https://github.com/PaddlePaddle/Paddle/pull/36376))
+- 修复`paddle.nn.functional.grid_sample`因特殊输入导致的异常问题。([#36625](https://github.com/PaddlePaddle/Paddle/pull/36625))
+- 修复 `paddle.fft.fft`, `paddle.fft.ifft`, `paddle.fft.rfft` , `paddle.fft.irfft`, `paddle.fft.hfft`, `paddle.fft.ihfft` 在输入 `axis=0` 情况下的计算错误问题。([#36537](https://github.com/PaddlePaddle/Paddle/pull/36537))
+- 修复 `paddle.fft.fftshift` 和 `paddle.fft.ifftshift` 在静态图下出错的问题。([#36537](https://github.com/PaddlePaddle/Paddle/pull/36537))
+- 修复 `paddle.fft.ifftshift` 计算结果不正确的问题。([#36835](https://github.com/PaddlePaddle/Paddle/pull/36835))
+- 修复`paddle.nn.functional.pad`在`replicate`模式下的报错信息提示。([#36531](https://github.com/PaddlePaddle/Paddle/pull/36531))
+
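Several of the fixes above concern `paddle.fft.fftshift` and `paddle.fft.ifftshift`. The expected 1-D behaviour can be sketched in plain Python. The sketch assumes the NumPy convention: `fftshift` rolls right by `n // 2` and `ifftshift` rolls back by the same amount. For odd lengths the two functions are not interchangeable. We do not know whether this was the cause of the bug fixed in #36835. The helper names below are hypothetical.

```python
def fftshift_1d(values):
    """Move the zero-frequency entry to the centre: roll right by n // 2."""
    k = len(values) // 2
    return values[-k:] + values[:-k] if k else list(values)


def ifftshift_1d(values):
    """Undo fftshift_1d: roll left by n // 2."""
    k = len(values) // 2
    return values[k:] + values[:k]
```

For an odd length such as 5, `fftshift_1d` rolls by 2 to the right while `ifftshift_1d` rolls by 2 to the left. Applying `fftshift_1d` twice therefore does not restore the input.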
#### IR(Intermediate Representation)
- 动态图转静态图
- 修复了动转静后,在 ``paddle.no_grad`` 语义下显存异常增长的问题。([#35725](https://github.com/PaddlePaddle/Paddle/pull/35725))
- 修复了对 ``paddle.no_grad`` 接口的错误识别和转换问题。([#34136](https://github.com/PaddlePaddle/Paddle/pull/34136))
+ - 修复了部分场景下模型中间设置 stop_gradient=True 时,动转静训练报错的问题。([#36353](https://github.com/PaddlePaddle/Paddle/pull/36353))
+ - 修复了在控制流 if 的部分场景转换时,对返回结果检查会报错的问题。([#36830](https://github.com/PaddlePaddle/Paddle/pull/36830))
+ - 修复了在 ifelse 分支返回不等长结果时,动转静会额外对齐返回长度导致返回类型意外改变的问题。([#36565](https://github.com/PaddlePaddle/Paddle/pull/36565))
+ - 修复使用 jit.save/load 接口加载模型后,在 train 模式和 no_grad 上下文中,显存会一直增长的问题。([#36463](https://github.com/PaddlePaddle/Paddle/pull/36463))
+
#### 分布式训练
@@ -727,6 +788,10 @@ paddle.int64
- 修复 GPU 参数服务器使用非0卡训练报错问题。([#33078](https://github.com/PaddlePaddle/Paddle/pull/33078))
- 修复 GPU 参数服务器 delta score,scale show问题。([#33492](https://github.com/PaddlePaddle/Paddle/pull/33078), [#33492](https://github.com/PaddlePaddle/Paddle/pull/33492))
- 修复 GPU 参数服务器训练结束后未 merge dense,g2sum 计算有误,data norm 添加了optimize op 等问题。 ([#35029](https://github.com/PaddlePaddle/Paddle/pull/35029))
+ - 修复使用 fuse all reduce ops 开关时,如果梯度出现 empty 时会报错的问题。([#36231](https://github.com/PaddlePaddle/Paddle/pull/36231))
+ - 修复 dist_transformer 文件出现未定义的变量问题。([#36211](https://github.com/PaddlePaddle/Paddle/pull/36211))
+
+
- 动态图混合并行
- 修复流水线并行计算错误的问题。([#35556](https://github.com/PaddlePaddle/Paddle/pull/35556))
@@ -767,6 +832,8 @@ paddle.int64
- 子图通过支持Paddle-Lite NNAdapter接入ascend310硬件预测 [#35226](https://github.com/PaddlePaddle/Paddle/pull/35226), 示例可参考[demo](https://github.com/PaddlePaddle/Paddle-Inference-Demo/tree/master/c%2B%2B/ascend310_lite_subgraph/image_classification_demo)。
- 新增晟腾910 推理支持 [#34101](https://github.com/PaddlePaddle/Paddle/pull/34101)
+- 新增pool3d算子支持TensorRT的功能。([#36545](https://github.com/PaddlePaddle/Paddle/pull/36545))
+
### (2)功能优化
#### 框架及API更新
@@ -774,6 +841,7 @@ paddle.int64
- 量化支持
- 动态图量化推理 pass 的重构,支持非模拟量化的 OP和模拟量化的 OP。([#35907](https://github.com/PaddlePaddle/Paddle/pull/35907))
- 增加 int8 的模拟量化OP matmul(权重乘以 tensor的情况)。([#34359](https://github.com/PaddlePaddle/Paddle/pull/34359))
+ - 修复MobileNetV3模型在量化训练过程中因量化参数为0导致的Loss出NAN问题。([#36763](https://github.com/PaddlePaddle/Paddle/pull/36763))
- API 增强
@@ -810,16 +878,18 @@ paddle.int64
- 增加TensorRT `qkv_context` plugin 对int8的支持([#34917](https://github.com/PaddlePaddle/Paddle/pull/34917), [#35504](https://github.com/PaddlePaddle/Paddle/pull/35504))
- 增加TensorRT conv3d的支持。([#35507](https://github.com/PaddlePaddle/Paddle/pull/35507))
- 增加对 `multihead_matmul` 融合算子的输入进行广播的支持。([#35780](https://github.com/PaddlePaddle/Paddle/pull/35780))
+ - Inference 支持 TensorRT8 稀疏推理,[测试环境](https://github.com/PaddlePaddle/Paddle-Inference-Demo/tree/master/c%2B%2B/sparsity)下,ERNIE 模型变长输入在不同的 batch_size 下性能提升10%-30%,ResNeXt101_32x4d模型在不同的batch_size下性能提升10%。([#36659](https://github.com/PaddlePaddle/Paddle/pull/36659))
- Nvidia Jetson 原生支持能力增强
- 新增 Op 支持,针对Jetson Nano/TX2这两款算力较低的设备,我们做了针对性的优化,目前新增了 `pool2d`, `pool_max`, `conv3d_transpose` 等 17个OP的支持。([#35378](https://github.com/PaddlePaddle/Paddle/pull/35378))
- 针对Jetson Nano,新增模型:DPN68, EfficientNetB0, ttfnet, fcn_hrnetw18, hardnet。([#35378](https://github.com/PaddlePaddle/Paddle/pull/35378))
- 针对Jetson TX2,新增模型:deeplabv3p_resnet50, deeplabv3_resnet50, fcn_hrnetw18, hardnet, pspnet, ttfnet, unet。([#35378](https://github.com/PaddlePaddle/Paddle/pull/35378))
-
- 昆仑XPU接口功能扩展
- 新增 `set_xpu_device_id` 接口,支持设置推理时的昆仑芯片的设备号([#35572](https://github.com/PaddlePaddle/Paddle/pull/35572))
+- Inference Python `copy_from_cpu` 接口加入输入类型检查,输入类型错误时提前报错。([#36552](https://github.com/PaddlePaddle/Paddle/pull/36552))
+
### (3)问题修复
#### 框架及API修复
@@ -842,6 +912,16 @@ paddle.int64
- 修复ernie变长情况下,输入的顺序不一致导致输出不对的问题。([#33575](https://github.com/PaddlePaddle/Paddle/pull/33575))
- 修复多流状态下分配器功能异常的问题。([#32932](https://github.com/PaddlePaddle/Paddle/pull/33575))
+- 修复 ERNIE 模型在 TRT8 下可能出现的崩溃问题。([#36769](https://github.com/PaddlePaddle/Paddle/pull/36769))
+- 修复使用 Pool, Slice 时可能出现的崩溃及精度问题。([#36666](https://github.com/PaddlePaddle/Paddle/pull/36666))
+- 修复 yolo_box op因为计算公式错误导致的精度问题。([#36365](https://github.com/PaddlePaddle/Paddle/pull/36365))
+- 修复量化后的 matmul_v2 在TRT下无法正常推理的问题。([#36821](https://github.com/PaddlePaddle/Paddle/pull/36821))
+- 修复了量化 matmul_v2 时错误地添加量化op的问题。([#36820](https://github.com/PaddlePaddle/Paddle/pull/36820))
+- 修复算子 batch_norm 和 elementwise_add 在3D应用场景下开启 TRT 报错的问题。([#36446](https://github.com/PaddlePaddle/Paddle/pull/36446))
+- 修复高层 linear api保存得到的预测模型无法被 Pass 融合优化的问题。([#36500](https://github.com/PaddlePaddle/Paddle/pull/36500))
+- 修改 MatmulV2ToMul 的 Pass,重新限定 (matmul_v2 to mul) 映射的 Pass,增加 MatmulV2ToMatmul 的 Pass,限定 (matmul_v2 to matmul) 映射的 Pass条件(不支持广播),修改 (matmul, mul) 的 op_teller 映射条件。([#36652](https://github.com/PaddlePaddle/Paddle/pull/36652))
+
+
#### 后端能力修复
- TensorRT 子图引擎修复
@@ -907,4 +987,5 @@ paddle.int64
This release contains contributions from:
-0x45f, 123malin, Adam Osewski, Aganlengzi, Aurelius84, Baibaifan, Bo Liu, CheQiXiao, Chen Long, Chen Weihang, CtfGo, Double\_V, Ethanzjp, Fan Zhang, Feiyu Chan, Feng Xing, From00, GT-Zhang, Guanghua Yu, Guoxia Wang, Haipeng Wang, Hao Lin, Haohongxiang, Hui Zhang, Huihuang Zheng, HydrogenSulfate, IMMORTAL, JYChen, JZ-LIANG, Jacek Czaja, Jack Zhou, Jackwaterveg, Jeng Bai-Cheng, Jiangxinz, Jiaqi Liu, Jiawei Wang, JingZhuangzhuang, June Weng, Kaipeng Deng, Kqnonrime, LJQ❤️, Leo Chen, Li Min, LielinJiang, Lijunhui, Linjie Chen, Liu-xiandong, LiuWei, Ming-Xu Huang, MissPenguin, PaddlePM, Pei Yang, Peihan, Qi Li, QingshuChen, Ren Wei (任卫), Roc, Shang Zhizhou, ShenLiang, Shibo Tao, Siming Dai, Sing\_chan, TCChenLong, TTerror, TeslaZhao, Thomas Young, Thunderbrook, Tongxin Bai, WJJ1995, WangXi, Wangzheee, Wei Shengyu, WeiXin, Weilong Wu, Wenyu, Wilber, XGZhang, XYZ, XYZ916829, XiangGao, Xiaoxu Chen, YUNSHEN XIE, Yanxing Shi, Yiqun Liu, YuanRisheng, Yuang Liu, Yulong Ao, Zeng Jinle, Zhang Ting, Zhang Zheng, Zhanlue Yang, Zhen Wang, Zhong Hui, Zhou Wei, andreazanetti, andyjpaddle, arlesniak, baoachun, cc, ceci3, chajchaj, chenenquan, chenjian, chentianyu03, crystal, cuicheng01, danleifeng, denglin-github, duanboqiang, dyning, feng626, feng_shuai, furnace, gongweibao, heliqi, hlygit66666, hong, hong19860320, houj04, huangjun12, huangxu96, huzhiqiang, iducn, jakpiase, jiangcheng, joanna.wozna.intel, jzhang533, kuizhiqing, levi131, lidanqing, lilong12, limingshu, littletomatodonkey, liu zhengxi, liutiexing, liuyuhui, liym27, lyuwenyu, lzzyzlbb, niuliling123, pangyoki, parap1uie-s, ronnywang, root, seemingwang, shangliang Xu, shiyutang, smallv0221, sunli, sunzhongkai588, taixiurong, tangwei12, tianshuo78520a, veyron95, wangguanqun, wangguanzhong, wanghuancoder, wangna11BD, wangxinxin08, wangzhen38, wangzhuang01, wawltor, wenbin, whs, will-jl944, wuhuachaocoding, wuhuanzhou, xiaoting, xiaoxiaohehe001, xiayanming, xiegegege, xiemoyuan, xiongkun, yaoxuefeng, yeliang2258, 
yingyibiao, zhangbo9674, zhangchunle, zhangkaihuo, zhaoyingli, zhiboniu, zhoujun, zhouzj, zhulei, zhupengyang, zlsh80826, zmx, zyfncg, 李季, 津, 王明冬, 石晓伟
\ No newline at end of file
+0x45f, 123malin, Adam Osewski, Aganlengzi, Aurelius84, Baibaifan, Bo Liu, CheQiXiao, Chen Long, Chen Weihang, CtfGo, Double\_V, Ethanzjp, Fan Zhang, Feiyu Chan, Feng Xing, From00, GT-Zhang, Guanghua Yu, Guoxia Wang, Haipeng Wang, Hao Lin, Haohongxiang, Hui Zhang, Huihuang Zheng, HydrogenSulfate, IMMORTAL, JYChen, JZ-LIANG, Jacek Czaja, Jack Zhou, Jackwaterveg, Jeng Bai-Cheng, Jiangxinz, Jiaqi Liu, Jiawei Wang, JingZhuangzhuang, June Weng, Kaipeng Deng, Kqnonrime, LJQ❤️, Leo Chen, Li Min, LielinJiang, Lijunhui, Linjie Chen, Liu-xiandong, LiuWei, Ming-Xu Huang, MissPenguin, PaddlePM, Pei Yang, Peihan, Qi Li, QingshuChen, Ren Wei (任卫), Roc, Shang Zhizhou, ShenLiang, Shibo Tao, Siming Dai, Sing\_chan, TCChenLong, TTerror, TeslaZhao, Thomas Young, Thunderbrook, Tongxin Bai, WJJ1995, WangXi, Wangzheee, Wei Shengyu, WeiXin, Weilong Wu, Wenyu, Wilber, XGZhang, XYZ, XYZ916829, XiangGao, Xiaoxu Chen, YUNSHEN XIE, Yanxing Shi, Yiqun Liu, YuanRisheng, Yuang Liu, Yulong Ao, Zeng Jinle, Zhang Ting, Zhang Zheng, Zhanlue Yang, Zhen Wang, Zhong Hui, Zhou Wei, andreazanetti, andyjpaddle, arlesniak, baoachun, cc, ceci3, chajchaj, chenenquan, chenjian, chentianyu03, crystal, cuicheng01, danleifeng, denglin-github, duanboqiang, dyning, feng626, feng_shuai, furnace, gongweibao, heliqi, hlygit66666, hong, hong19860320, houj04, huangjun12, huangxu96, huzhiqiang, iducn, jakpiase, jiangcheng, joanna.wozna.intel, jzhang533, kuizhiqing, levi131, lidanqing, lilong12, limingshu, littletomatodonkey, liu zhengxi, liutiexing, liuyuhui, liym27, lyuwenyu, lzzyzlbb, niuliling123, pangyoki, parap1uie-s, ronnywang, root, seemingwang, shangliang Xu, shiyutang, smallv0221, sunli, sunzhongkai588, taixiurong, tangwei12, tianshuo78520a, veyron95, wangguanqun, wangguanzhong, wanghuancoder, wangna11BD, wangxinxin08, wangzhen38, wangzhuang01, wawltor, wenbin, whs, will-jl944, wuhuachaocoding, wuhuanzhou, xiaoting, xiaoxiaohehe001, xiayanming, xiegegege, xiemoyuan, xiongkun, yaoxuefeng, yeliang2258, 
yingyibiao, zhangbo9674, zhangchunle, zhangkaihuo, zhaoyingli, zhiboniu, zhoujun, zhouzj, zhulei, zhupengyang, zlsh80826, zmx, zyfncg, 李季, 津, 王明冬, 石晓伟
+
diff --git a/docs/release_note_en.md b/docs/release_note_en.md
index 9848c8de754..349796fdebb 100644
--- a/docs/release_note_en.md
+++ b/docs/release_note_en.md
@@ -1,13 +1,13 @@
-# 2.2.0 rc0 Release Note
+# Release Note
## **1. Highlights**
-We are excited to release the PaddlePaddle Framework V2.2.0-rc0. This version contains the following highlights.
+We are excited to release the PaddlePaddle Framework V2.2.0. This version contains the following highlights.
### API
-- Added 100+ APIs, including 24 Fourier transform APIs, 14 linear algebra APIs, etc., to better facilitate developing of scientific computing and signal processing models.
+- Added 100+ APIs, including 24 Fourier transform APIs, 17 linear algebra APIs, etc., to better facilitate the development of scientific computing and signal processing models.
- Added the support for multiple indexing syntax, including ellipsis (...), dimension expansion (None), boolean arrays (Bool Mask), and integer arrays (list and tensor), making it easier to operate on tensor.
- Added the `paddle.einsum` API, to express multi-dimensional tensor computation in a more concise way.
- Enhanced the dynamic graph mixed precision. Added a way to use half-precision (float16) training for the whole task. The computational efficiency under the main tasks increased by 20%.
@@ -289,6 +289,9 @@ paddle.int64
- Add the ``paddle.linalg.multi_dot``, to support the computing of concatenated multiplication of multiple matrices. ([#35224](https://github.com/PaddlePaddle/Paddle/pull/35224))
- Add the ``paddle.linalg.solve``, to support the computing of the solutions of linear equations. ([#35715](https://github.com/PaddlePaddle/Paddle/pull/35715))
- Add the ``paddle.linalg.matrix_power``, to support the power operations on matrices. ([#34667](https://github.com/PaddlePaddle/Paddle/pull/34667))
+ - Add `paddle.linalg.eigvalsh` for computing eigenvalues of Hermitian matrices or real symmetric matrices. ([#36680](https://github.com/PaddlePaddle/Paddle/pull/36680))
+ - Add `paddle.linalg.eig` for computing eigenvalues and eigenvectors of general square matrices. ([#35674](https://github.com/PaddlePaddle/Paddle/pull/35674))
+ - Add `paddle.linalg.qr` for computing the QR decomposition of matrices (backward computation is not supported yet). ([#36627](https://github.com/PaddlePaddle/Paddle/pull/36627))
- Add new Fourier transform related API ([#35665](https://github.com/PaddlePaddle/Paddle/pull/35665))
- Add fast Fourier transform family functions
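`paddle.linalg.eigvalsh` computes the eigenvalues of Hermitian or real symmetric matrices. For a 2 x 2 real symmetric matrix `[[a, b], [b, c]]` the eigenvalues have a closed form, `(a + c) / 2 ± sqrt(((a - c) / 2)^2 + b^2)`, which makes a quick sanity check for results. The helper below is a hypothetical illustration, not Paddle code. It returns the eigenvalues in ascending order, which we assume here as the usual convention.

```python
import math


def eigvalsh_2x2(a, b, c):
    """Eigenvalues of the real symmetric matrix [[a, b], [b, c]], ascending."""
    mean = (a + c) / 2
    # the radius is always real for symmetric input, so no complex math is needed
    radius = math.hypot((a - c) / 2, b)
    return [mean - radius, mean + radius]
```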
@@ -303,18 +306,20 @@ paddle.int64
- Add new high-level APIs
- Add the ``paddle.vision.ops.roi_pool`` and ``paddle.vision.ops.RoIPool``, support RoI region pooling operations in detection tasks. ([#36154](https://github.com/PaddlePaddle/Paddle/pull/36154))
- - Add the ``paddle.vision.ops.roi_align`` and ``paddle.vision.ops.RoIAlign``, to support RoI region Align operations in detection tasks. ([#36207](https://github.com/PaddlePaddle/Paddle/pull/36207))
- - Add the ``paddle.vision.ops.psroi_pool`` and ``paddle.vision.ops.PSRoIPool``, to support location-sensitive RoI region pooling operations in detection tasks. ([#36111](https://github.com/PaddlePaddle/Paddle/pull/36111))
- - Add the ``paddle.vision.models.vgg19`` pre-training weights. ([#35788](https://github.com/PaddlePaddle/Paddle/pull/35788))
- - Add thedatasets API download progress bar in ``paddle.vision.datasets.*``. ([#33302](https://github.com/PaddlePaddle/Paddle/pull/33302))
- - Add the ``paddle.Model.predict`` parameter ``verbose``, to support whether to show logs or not. ([#33405](https://github.com/PaddlePaddle/Paddle/pull/33405))
- - Add the ``paddle.hub`` download option ``wget`` method. ([#33379](https://github.com/PaddlePaddle/Paddle/pull/33379))
- - Add the ``paddle.Model`` gradient accumulation in dynamic graph mode. ([#32702](https://github.com/PaddlePaddle/Paddle/pull/32702))
- - Add the ``paddle.Model.fit`` and ``paddle.Model.evaluate`` ``num_iters`` parameters in dynamic graph mode to control the number of training iterations. ([#33986](https://github.com/PaddlePaddle/Paddle/pull/33986))
- - Add the ``paddle.vision.ops.yolo_box`` parameters ``iou_aware`` and ``iou_aware_factor``, to support YoloBox using predicted IOUs as confidence factors. ([#33400](https://github.com/PaddlePaddle/Paddle/pull/33400))
- - Add the ``paddle.summary`` parameter input to support the given ``input``. ([#34165](https://github.com/PaddlePaddle/Paddle/pull/34165))
+ - Add the ``paddle.vision.ops.roi_align`` and ``paddle.vision.ops.RoIAlign``, to support RoI region Align operations in detection tasks. ([#36207](https://github.com/PaddlePaddle/Paddle/pull/36207))
+ - Add the ``paddle.vision.ops.psroi_pool`` and ``paddle.vision.ops.PSRoIPool``, to support location-sensitive RoI region pooling operations in detection tasks. ([#36111](https://github.com/PaddlePaddle/Paddle/pull/36111))
+ - Add the ``paddle.vision.models.vgg19`` pre-trained weights. ([#35788](https://github.com/PaddlePaddle/Paddle/pull/35788))
+ - Add a download progress bar to the dataset APIs in ``paddle.vision.datasets.*``. ([#33302](https://github.com/PaddlePaddle/Paddle/pull/33302))
+ - Add the ``paddle.Model.predict`` parameter ``verbose``, to control whether logs are shown. ([#33405](https://github.com/PaddlePaddle/Paddle/pull/33405))
+ - Add the ``wget`` download method to ``paddle.hub``. ([#33379](https://github.com/PaddlePaddle/Paddle/pull/33379))
+ - Add gradient accumulation to ``paddle.Model`` in dynamic graph mode. ([#32702](https://github.com/PaddlePaddle/Paddle/pull/32702))
+ - Add the ``num_iters`` parameter to ``paddle.Model.fit`` and ``paddle.Model.evaluate`` in dynamic graph mode, to control the number of training iterations. ([#33986](https://github.com/PaddlePaddle/Paddle/pull/33986))
+ - Add the ``paddle.vision.ops.yolo_box`` parameters ``iou_aware`` and ``iou_aware_factor``, so that YoloBox can use the predicted IoU as a confidence factor. ([#33400](https://github.com/PaddlePaddle/Paddle/pull/33400))
+ - Add the ``paddle.summary`` parameter ``input``, to support a given input. ([#34165](https://github.com/PaddlePaddle/Paddle/pull/34165))
+ - Add `paddle.text.viterbi_decode`, to support Viterbi decoding for CPU and GPU under dynamic graphs. ([#35778](https://github.com/PaddlePaddle/Paddle/pull/35778))
- Add networking class APIs
+ - Add `paddle.nn.functional.sparse_attention` for computing sparse Transformer Attention modules. ([#35757](https://github.com/PaddlePaddle/Paddle/pull/35757))
- Add the ``paddle.nn.MaxUnPool2D`` and ``paddle.nn.functional.max_unpool2d``, to support the computing of the inverse of the pooling result based on the input and maximum position. ([#35056](https://github.com/PaddlePaddle/Paddle/pull/35056))
- Add the ``paddle.nn.functional.gumbel_softmax``, to support ``gumbel softmax`` sampling. ([#35506](https://github.com/PaddlePaddle/Paddle/pull/35506), [#36065](https://github.com/PaddlePaddle/Paddle/pull/36065), [#36094](https://github.com/PaddlePaddle/Paddle/pull/36094))
- Add the ``paddle.nn.functional.class_center_sample``, to support PartialFC class center sampling. ([#34106](https://github.com/PaddlePaddle/Paddle/pull/34106))
@@ -331,9 +336,14 @@ paddle.int64
- Add the ``paddle.device.cuda.empty_cache``, to support for clearing free GPU memory. ([#35427](https://github.com/PaddlePaddle/Paddle/pull/35427))
- Add the ``paddle.device.cuda.get_device_properties``, to support for returning the given device properties. ([#35875](https://github.com/PaddlePaddle/Paddle/pull/35875))
- Add the ``paddle.device.cuda.stream_guard`` for flexible switching of CUDA Streams under dynamic graphs. ([#35623](https://github.com/PaddlePaddle/Paddle/pull/35623))
+ - Add `paddle.device.cuda.get_device_name`, to support returning the name of a given device. ([#36172](https://github.com/PaddlePaddle/Paddle/pull/36172))
+ - Add `paddle.device.cuda.get_device_capability`, to support returning the version number of the compute capability of a given device. ([#36172](https://github.com/PaddlePaddle/Paddle/pull/36172))
+ - Add `paddle.framework.core.async_read` and `paddle.framework.core.async_write`, to support asynchronous reading and writing of `Tensor` data in `CUDAPinnedPlace` and `CUDAPlace` on a non-default CUDA `Stream`. ([#36501](https://github.com/PaddlePaddle/Paddle/pull/36501))
- Add Tensor operation APIs
+ - Add `paddle.tensordot`, to support tensor contraction on high-dimensional tensors. ([#36454](https://github.com/PaddlePaddle/Paddle/pull/36454))
+ - Add `paddle.bincount`, to support counting elements in a one-dimensional tensor. ([#36709](https://github.com/PaddlePaddle/Paddle/pull/36709))
- Add the `paddle.broadcast_tensors`, to support broadcast operations on a set of `Tensors`. ([#33294](https://github.com/PaddlePaddle/Paddle/pull/33294), [#34874](https://github.com/PaddlePaddle/Paddle/pull/34874))
- Add the `paddle.einsum`. ([#33821](https://github.com/PaddlePaddle/Paddle/pull/34874))
- Enhance the ``paddle.tensor.gradient`` interface to support second-order derivative operators for sigmoid_op. ([#32971](https://github.com/PaddlePaddle/Paddle/pull/32971))
@@ -372,6 +382,8 @@ paddle.int64
- Add the ``paddle.static.ExponentialMovingAverage``, to support the computing of the sliding average of parameters with exponential decay. ([#35673](https://github.com/PaddlePaddle/Paddle/pull/35673))
- Add the ``paddle::Tensor::slice`` C++ API, to support the slice operation, and allow users to perform slice operations for the external Tensor. ([#34227](https://github.com/PaddlePaddle/Paddle/pull/34227))
- Add the ``paddle.incubate.segment_*`` series APIs, including ``paddle.incubate.segment_sum``, ``paddle.incubate.segment_mean``, ``paddle.incubate.segment_max``, and ``paddle. incubate.segment_min``. Support the summing, averaging, maximizing, and minimizing of ``Tensor`` by segment. ([#35759](https://github.com/PaddlePaddle/Paddle/pull/35759))
+ - Add `paddle.version.cuda` and `paddle.version.cudnn` to get the versions of `CUDA` and `cuDNN` that the installed Paddle package was built with. ([#36556](https://github.com/PaddlePaddle/Paddle/pull/36556))
+
#### IR(Intermediate Representation)
@@ -388,13 +400,15 @@ paddle.int64
- Provide dependent helper functions needed to analyze the control flow in `Program`. ([#33439](https://github.com/PaddlePaddle/Paddle/pull/33439))
- `Program` and `Graph` retain the values of the `stop_gradient` and `persistable` attributes needed for training after converting each other. ([#33771](https://github.com/PaddlePaddle/Paddle/pull/33771))
- `Pass` now supports processing the main `Graph` and all its sub-graphs, while the original `Pass` only processed the main `Graph` and ignored the sub-graphs. ([#34158](https://github.com/PaddlePaddle/Paddle/pull/34158))
- - Handle some topological ordering problems for `Program` and `Graph` inter-conversion in the prediction cases. ([#34121](https://github.com/PaddlePaddle/Paddle/pull/34121), [#34521](https://github.com/PaddlePaddle/Paddle/pull/34521)). **《== **
+ - Handle some topological ordering problems in `Program` and `Graph` inter-conversion in inference scenarios. ([#34121](https://github.com/PaddlePaddle/Paddle/pull/34121), [#34521](https://github.com/PaddlePaddle/Paddle/pull/34521))
- Pass development
- Add the Pass development for subgraph replacement scenarios such as fusion on the Python side. ([#35708](https://github.com/PaddlePaddle/Paddle/pull/35708), [#35602](https://github.com/PaddlePaddle/Paddle/pull/35602))
- Kernel Primitive API
- Abstract and encapsulate the underlying codes in the operator Kernel implementation, to provide high-performance Block-level IO and Compute operations. The Kernel development using the Kernel Primitive API allows you to focus more on the implementation of the computational logic, significantly reducing the amount of codes while ensuring performance, and decoupling operator computation from hardware. ([#34672](https://github.com/PaddlePaddle/Paddle/pull/34672), [#35075](https://github.com/PaddlePaddle/Paddle/pull/35075), [#34456](https://github.com/PaddlePaddle/Paddle/pull/34456), [#35282](https://github.com/PaddlePaddle/Paddle/pull/35282), [#35743](https://github.com/PaddlePaddle/Paddle/pull/35743), [#34208](https://github.com/PaddlePaddle/Paddle/pull/34208))
+ - Add 13 unary and binary computation Functors to the Kernel Primitive API. ([#36418](https://github.com/PaddlePaddle/Paddle/pull/36418))
+ - Modify the ReadData implementation in the Kernel Primitive API to fix an out-of-bounds memory access when `NX != 1`. ([#36373](https://github.com/PaddlePaddle/Paddle/pull/36373))
#### **Mixed Precision Training**
@@ -513,8 +527,16 @@ paddle.int64
- `paddle.equal`: Add the support for `int`, `float`, and `bool` types for the second input. ([#35695](https://github.com/PaddlePaddle/Paddle/pull/35695))
- ``paddle.io.DataLoader``: Add the support for persistent_worker mode. ([#34017](https://github.com/PaddlePaddle/Paddle/pull/34017))
- Optimize ``l2_normalize``, ``p_norm``, ``elementwise_max``, ``prelu,clip_by_norm``, ``lars optimizer`` operators support the float16 computation. ([#35576](https://github.com/PaddlePaddle/Paddle/pull/35576), [#35888](https://github.com/PaddlePaddle/Paddle/pull/35888), [#35888](https://github.com/PaddlePaddle/Paddle/pull/35888), [35532](https://github.com/PaddlePaddle/Paddle/pull/35532), [#35446](https://github.com/PaddlePaddle/Paddle/pull/35446), [#33280](https://github.com/PaddlePaddle/Paddle/pull/33280))
-- Optimize the reading speed of flowers dataset from several minutes per batch to 1~3 seconds per batch. ([#31408](https://github.com/PaddlePaddle/Paddle/pull/31408))
-- Support the fuse allreduce sum function in `paddle.distributed.fleet.DistributedStrategy` when the `without_graph_optimize` switch is on.In the FP32, the performance increases by 3%. In the AMP, the performance increases by 8%. ([#34446](https://github.com/PaddlePaddle/Paddle/pull/34446))
+- Optimize the reading speed of flowers dataset from several minutes per batch to 1~3 seconds per batch. ([#31408](https://github.com/PaddlePaddle/Paddle/pull/31408))
+- Support the fuse allreduce sum function in `paddle.distributed.fleet.DistributedStrategy` when the `without_graph_optimize` switch is on. Performance improves by 3% with FP32 and by 8% with AMP. ([#34446](https://github.com/PaddlePaddle/Paddle/pull/34446))
+- Switch the underlying op of `paddle.matmul` from `matmul` to `matmul_v2`. ([#36374](https://github.com/PaddlePaddle/Paddle/pull/36374))
+- Add two computational backends, mkl_cdft and hipfft, to the `paddle.fft` module. ([#36537](https://github.com/PaddlePaddle/Paddle/pull/36537))
+- The `shifts` parameter of `paddle.roll` now accepts a `Tensor` as input. ([#36537](https://github.com/PaddlePaddle/Paddle/pull/36537))
+- `paddle.shape` supports complex-type inputs. ([#36835](https://github.com/PaddlePaddle/Paddle/pull/36835))
+- `matmul_v2` supports quantization. ([#36469](https://github.com/PaddlePaddle/Paddle/pull/36469))
+- Add `float16` support to `clip_op`. ([#36672](https://github.com/PaddlePaddle/Paddle/pull/36672))
+- Add plan caching to the cuFFT backend of the `paddle.fft` module to improve performance. ([#36537](https://github.com/PaddlePaddle/Paddle/pull/36537))
+
#### IR(Intermediate Representation)
@@ -525,7 +547,9 @@ paddle.int64
- Optimize the logic of dynamic to static training codes, upgrade the internal ``Program`` cache mechanism, and add an advance copy policy for input ``Tensor`` to improve training performance. ([#34181](https://github.com/PaddlePaddle/Paddle/pull/34181), [#33796](https://github.com/PaddlePaddle/Paddle/pull/33796))
- Optimize the internal actuator memory recycling strategy for dynamic to static graphs, reducing the GPU memory usage during training. ([#34177](https://github.com/PaddlePaddle/Paddle/pull/34177))
- Integrate the source codes of ``Gast`` triple dependency library, decoupling version dependencies. ([#34556](https://github.com/PaddlePaddle/Paddle/pull/34556))
-
+ - When a dynamic-to-static conversion error occurs, display part of the frame-level error information, making the problem easier to locate. ([#36765](https://github.com/PaddlePaddle/Paddle/pull/36765))
+ - Remove the duplicate temporary-file cleanup function `remove_static_file()` from the dynamic-to-static error reporting module. ([#36375](https://github.com/PaddlePaddle/Paddle/pull/36375))
+ - Optimize the handling of the `input_specs` parameter in RegisterPass so that it can be used as a matching condition for subgraphs during graph optimization. ([#36453](https://github.com/PaddlePaddle/Paddle/pull/36453))
#### **Distributed training**
@@ -539,6 +563,12 @@ paddle.int64
- `paddle.io.Dataset`: Support the dynamic library parsing data. ([#33969](https://github.com/PaddlePaddle/Paddle/pull/33969))
- In the `paddle.distributed.fleet.dataset.DatasetBase`, add the consistency check function for generated data of the `use_var_list` and `pipe_command`. ([#34463](https://github.com/PaddlePaddle/Paddle/pull/34463))
- Add the consistency check between the `emd` dimension of `paddle.fluid.layers.embedding` and `emb` dimension of `sparse table` in `fleet`. ([#34249](https://github.com/PaddlePaddle/Paddle/pull/34249))
+ - Dynamic graph hybrid parallel supports Pure FP16 training. ([#36707](https://github.com/PaddlePaddle/Paddle/pull/36707))
+ - In static graph hybrid parallel, dropout supports a fixed random seed generator, ensuring consistency of global variables and randomness of local variables under model parallelism. ([#36682](https://github.com/PaddlePaddle/Paddle/pull/36682))
+ - Implement CPU parallelism, and support a custom `backend` parameter when calling spawn or launch. Available options are "gloo", "nccl", "bkcl", and "auto", corresponding to CPU parallel, GPU parallel, XPU parallel, and automatic selection based on the Paddle version, respectively. ([#35745](https://github.com/PaddlePaddle/Paddle/pull/35745))
+ - Optimize the HybridParallelClipGrad policy of dynamic graph hybrid parallel to support 4D hybrid parallel + Pure FP16 training. ([#36707](https://github.com/PaddlePaddle/Paddle/pull/36707))
+ - Add the `SlotRecordDataset` class to support GPU parameter server training. ([#36710](https://github.com/PaddlePaddle/Paddle/pull/36710))
+ - Support using `SlotRecordDataset` in the GPU parameter server building phase. ([#36723](https://github.com/PaddlePaddle/Paddle/pull/36723))
- Static graph hybrid parallel
@@ -561,7 +591,15 @@ paddle.int64
- Fix the ``paddle.jit.save`` interface and model pruning logic. It is unnecessary to add an associated ``scale_op`` for output variables, and to properly export models containing outputs of type ``bool`` and ``float16``. ([#35730](https://github.com/PaddlePaddle/Paddle/pull/35730), [#36132](https://github.com/PaddlePaddle/Paddle/pull/36132))
- Custom OP
- Remove unnecessary ``cudaStreamSynchronize`` operations from ``paddle::Tensor's`` ``copy`` method, to improve performance. ([#35802](https://github.com/PaddlePaddle/Paddle/pull/35802))
+- Add C++ support for GeneratePass development and registration, with a development mode aligned with the Python side. ([#36302](https://github.com/PaddlePaddle/Paddle/pull/36302))
+- Automatic SParsity (ASP)
+- Add `paddle.static.sparsity` to support generating sparse parameters in the `n:m` sparse pattern. Currently, only static graph ASP training is supported. On A100, FP32 and FP16 use the `1:2` and `2:4` sparse patterns, respectively, to train and save sparse models, which can then accelerate inference by calling TensorRT 8 on the sparse Tensor Cores of the Ampere architecture. The current version provides 5 APIs ([#32995](https://github.com/PaddlePaddle/Paddle/pull/32995), [#33132](https://github.com/PaddlePaddle/Paddle/pull/33132), [#33558](https://github.com/PaddlePaddle/Paddle/pull/33558), [#36525](https://github.com/PaddlePaddle/Paddle/pull/36525)):
+ - `paddle.static.sparsity.calculate_density`: calculates the density of the input Tensor.
+ - `paddle.static.sparsity.decorate`: wraps the given optimizer as `OptimizerWithSparsityGuarantee`, automatically inserting necessary operations for the ASP workflow when calling `optimizer.minimize()`.
+ - `paddle.static.sparsity.prune_model`: prunes the parameters of the supported layers in `main_program` based on the mask generator function specified by `mask_algo`.
+ - `paddle.static.sparsity.set_excluded_layers`: sets the parameter names of the layers that will not be pruned.
+ - `paddle.static.sparsity.reset_excluded_layers`: resets the `excluded_layers` setting corresponding to `main_program`.
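To make the `n:m` pattern concrete, the following sketch prunes a flat row of weights so that each group of `m` consecutive values keeps only its `n` largest-magnitude entries, and computes density as described for `calculate_density` above (non-zeros over total). It assumes a simple 1-D, magnitude-based mask; the helper names `nm_prune_row` and `density` are hypothetical and are not part of `paddle.static.sparsity`.

```python
def nm_prune_row(row, n, m):
    """Keep the n largest-magnitude values in every group of m consecutive entries.

    Hypothetical helper: a 1-D magnitude-based n:m mask, for illustration only.
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    pruned = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n entries with the largest absolute value in this group.
        keep = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


def density(values):
    """Fraction of non-zero entries (hypothetical stand-in for calculate_density)."""
    return sum(1 for v in values if v != 0) / len(values)


row = [0.1, -0.9, 0.4, 0.05, 0.3, 0.2, -0.7, 0.6]
sparse = nm_prune_row(row, n=2, m=4)  # 2:4 pattern
print(sparse, density(sparse))
```

With the `2:4` pattern every group of four keeps exactly two values, so the density of a fully pruned row is always 0.5; the `1:2` pattern gives the same density with a finer grouping.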
### **(3) Performance optimization**
@@ -600,6 +638,20 @@ paddle.int64
- Optimize the dynamic graph performance by stripping logic executed only on static graphs from the execution path of dynamic graphs. ([#34024](https://github.com/PaddlePaddle/Paddle/pull/34024))
- For the IR Pass, optimize the capability exposed as a general-purpose capability. Support both single machine and distributed optimization.The performance improves by 3%-5% in GPT mixed parallel scenarios. ([#34955](https://github.com/PaddlePaddle/Paddle/pull/34955), [#35704](https://github.com/PaddlePaddle/Paddle/pull/35704), [#34730](https://github.com/PaddlePaddle/Paddle/pull/34730), [#34524](https://github.com/PaddlePaddle/Paddle/pull/34524))
- Optimize the ctc loss grad computation, increase the speed by ~3x. Correspondingly, the GPU memory usage increases. ([#34729](https://github.com/PaddlePadle/Paddle/pull/34729))
+- Transformer encoder performance optimization
+ - Optimization method: add `paddle.incubate.nn.FusedMultiHeadAttention` and `paddle.incubate.nn.FusedFeedForward`. The implementation uses q, k, v GEMM fusion and multi-kernel fusion to improve the performance of the transformer encoder.
+ - FusedAttention
+ - Add `paddle.incubate.nn.functional.fused_multi_head_attention` to support fused multi-head attention computation. ([#35905](https://github.com/PaddlePaddle/Paddle/pull/35905), [#35903](https://github.com/PaddlePaddle/Paddle/pull/35903), [#36803](https://github.com/PaddlePaddle/Paddle/pull/36803), [#36793](https://github.com/PaddlePaddle/Paddle/pull/36793), [#36185](https://github.com/PaddlePaddle/Paddle/pull/36185))
+ - Add `paddle.incubate.nn.FusedMultiHeadAttention`, a Layer-level interface for building networks with fused multi-head attention. ([#36498](https://github.com/PaddlePaddle/Paddle/pull/36498))
+ - This module uses q, k, v GEMM fusion and bias add + dropout + residual add + layer_norm kernel fusion, resulting in a 1.08x-1.45x speedup.
+
+ - FusedFeedForward
+ - Add `paddle.incubate.nn.functional.fused_feedforward` to support fused feedforward computation. ([#36729](https://github.com/PaddlePaddle/Paddle/pull/36729), [#36730](https://github.com/PaddlePaddle/Paddle/pull/36730))
+ - Add `paddle.incubate.nn.FusedFeedForward`, a Layer-level interface for building networks with fused feedforward. ([#36776](https://github.com/PaddlePaddle/Paddle/pull/36776))
+ - Performance improves by about 1.04x~1.22x compared with the implementation before optimization.
+ - Add `paddle.incubate.nn.FusedTransformerEncoderLayer`, which builds an encoder layer from fused multi-head attention and fused feedforward computation. ([#36776](https://github.com/PaddlePaddle/Paddle/pull/36776))
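For reference, the unfused sequence that the attention epilogue is described as merging into one kernel (bias add + dropout + residual add + layer_norm) can be written out in pure Python for a single row. This sketch assumes `upscale_in_train` dropout driven by an explicit keep mask and omits layer_norm's learnable scale and shift; the function names are hypothetical and the code shows only the math, not how the fused kernel is implemented.

```python
import math


def layer_norm(v, eps=1e-5):
    """Normalize a row to zero mean and unit variance (no learnable scale/shift)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def bias_dropout_residual_ln(x, bias, residual, keep_mask, p):
    """Unfused reference: layer_norm(residual + dropout(x + bias)).

    Assumes upscale_in_train dropout: kept values are scaled by 1 / (1 - p),
    so p must be < 1. keep_mask holds 1 for kept and 0 for dropped positions.
    """
    out = []
    for xi, bi, ri, ki in zip(x, bias, residual, keep_mask):
        h = (xi + bi) * ki / (1.0 - p)  # bias add, then dropout
        out.append(h + ri)              # residual add
    return layer_norm(out)


print(bias_dropout_residual_ln([1.0, 2.0, 3.0], [0.0] * 3, [0.0] * 3, [1, 1, 1], 0.0))
```

Each of the four steps would normally read and write the full activation once; fusing them lets the intermediate values stay in registers, which is the general motivation for this kind of kernel fusion.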
+
+
### **(4) Troubleshooting**
@@ -693,12 +745,27 @@ paddle.int64
- Migrate the one_hot operator in ``paddle.nn.functional.dice_loss`` API to the ``one_hot_v2`` operator. ([#35734](https://github.com/PaddlePaddle/Paddle/pull/35734))
- Fix the bug of usage in the static graph mode in ``paddle.summary``. ([#35303](https://github.com/PaddlePaddle/Paddle/pull/35303))
- Fix the multi-card startup bug in ``paddle.Model.prepare`` static graph mode. ([#34311](https://github.com/PaddlePaddle/Paddle/pull/34311))
+- Fix an error raised by `paddle.nn.functional.cross_entropy` when `weight` is given and `axis` is set to a valid dimension other than -1. ([#36647](https://github.com/PaddlePaddle/Paddle/pull/36647))
+- Fix a bug where `paddle.utils.dlpack.to_dlpack` could not encode multidimensional `Tensor`s, and a bug where the DLPack objects it generated could not be shared across deep learning frameworks. ([#36177](https://github.com/PaddlePaddle/Paddle/pull/36177))
+- Fix a bug in the `sample` method of `paddle.distribution.Categorical`, caused by an out-of-bounds array access in the CUDA kernel of the multinomial op, which read values beyond the end of the array and raised an error. ([#36511](https://github.com/PaddlePaddle/Paddle/pull/36511))
+- Fix a bug in the dynamic graph `_BatchNormBase` base class where the default dtype is modified, giving parameters of subsequently built networks the wrong type. Affected APIs are `paddle.nn.BatchNorm1D`, `paddle.nn.BatchNorm2D`, `paddle.nn.BatchNorm3D`, and `paddle.nn.SyncBatchNorm`. Specifically, when `get_default_dtype() == 'float16'`, the default parameter data type is changed by `set_default_dtype('float32')`. Because dynamic graph parameters are created with the default dtype, parameters of any network built afterwards get the wrong type. ([#36376](https://github.com/PaddlePaddle/Paddle/pull/36376))
+- Fix an exception in `paddle.nn.functional.grid_sample` caused by certain special inputs. ([#36625](https://github.com/PaddlePaddle/Paddle/pull/36625))
+- Fix incorrect results of `paddle.fft.fft`, `paddle.fft.ifft`, `paddle.fft.rfft`, `paddle.fft.irfft`, `paddle.fft.hfft`, and `paddle.fft.ihfft` when `axis=0` is given. ([#36537](https://github.com/PaddlePaddle/Paddle/pull/36537))
+- Fix errors raised by `paddle.fft.fftshift` and `paddle.fft.ifftshift` in static graph mode. ([#36537](https://github.com/PaddlePaddle/Paddle/pull/36537))
+- Fix a bug where `paddle.fft.ifftshift` computes incorrect results. ([#36835](https://github.com/PaddlePaddle/Paddle/pull/36835))
+- Fix the error message of `paddle.nn.functional.pad` in `replicate` mode. ([#36531](https://github.com/PaddlePaddle/Paddle/pull/36531))
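Several fixes above concern `fftshift` and `ifftshift`. A pure-Python sketch of the usual 1-D definitions (the same ones NumPy uses) shows why the two functions must differ: `fftshift` rolls by `n // 2` and `ifftshift` rolls by `-(n // 2)`, so they coincide for even lengths, but only `ifftshift` undoes `fftshift` for odd lengths. This illustrates the semantics only; the helper names are hypothetical and the code does not call Paddle.

```python
def roll(seq, shift):
    """Rotate a list to the right by `shift` positions (negative shifts rotate left)."""
    n = len(seq)
    if n == 0:
        return []
    shift %= n
    return seq[-shift:] + seq[:-shift] if shift else list(seq)


def fftshift(seq):
    """Move the zero-frequency entry to the center of a 1-D spectrum."""
    return roll(list(seq), len(seq) // 2)


def ifftshift(seq):
    """Inverse of fftshift: note the shift is -(n // 2), not +(n // 2)."""
    return roll(list(seq), -(len(seq) // 2))


odd = [0, 1, 2, 3, 4]
print(fftshift(odd))             # zero frequency moved to the middle
print(ifftshift(fftshift(odd)))  # round-trips back to the original order
print(fftshift(fftshift(odd)))   # applying fftshift twice does not
```

For even lengths `n // 2` and `-(n // 2)` are the same rotation modulo `n`, which is why a mistake that treats `ifftshift` as `fftshift` only shows up on odd-length inputs.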
+
+
#### IR(Intermediate Representation)
- Dynamic graph to static graph
- Fix an abnormal growth of GPU memory under ``paddle.no_grad`` semantics after dynamic to static. ([#35725](https://github.com/PaddlePaddle/Paddle/pull/35725))
- Fix a misidentification and conversion bug in the ``paddle.no_grad`` interface. ([#34136](https://github.com/PaddlePaddle/Paddle/pull/34136))
+ - Fix an error raised during dynamic-to-static training in some scenarios when `stop_gradient=True` is set in the middle of the model. ([#36353](https://github.com/PaddlePaddle/Paddle/pull/36353))
+ - Fix an error raised when checking the return value in some scenarios where an `if` control flow is converted. ([#36830](https://github.com/PaddlePaddle/Paddle/pull/36830))
+ - Fix a bug where the return type changed unexpectedly because dynamic-to-static conversion aligned the number of return values when the branches of an `ifelse` return different numbers of results. ([#36565](https://github.com/PaddlePaddle/Paddle/pull/36565))
+ - Fix a bug where GPU memory keeps growing in train mode and in `no_grad` contexts after loading a model via the `jit.save`/`jit.load` interfaces. ([#36463](https://github.com/PaddlePaddle/Paddle/pull/36463))
#### **Distributed training**
@@ -733,6 +800,8 @@ paddle.int64
- Fix the GPU parameter server error reported by using non-0 card training. ([#33078](https://github.com/PaddlePaddle/Paddle/pull/33078))
- Fix the bug of the delta score and scale show in the GPU Parameter Server. ([#33492](https://github.com/PaddlePaddle/Paddle/pull/33078), [#33492](https://github.com/PaddlePaddle/Paddle/pull/33492))
- Fix the bug with GPU Parameter Server not merging dense after training, in incorrect g2sum calculation. For data norm, add the optimize op. ([#35029](https://github.com/PaddlePaddle/Paddle/pull/35029))
+ - Fix an error raised when a gradient is empty while the fuse all reduce ops switch is on. ([#36231](https://github.com/PaddlePaddle/Paddle/pull/36231))
+ - Fix undefined variables in the dist_transformer files. ([#36211](https://github.com/PaddlePaddle/Paddle/pull/36211))
- Dynamic graph hybrid parallel
- Fix the precision error in pipeline parallel due to communication asynchronization. [#35556](https://github.com/PaddlePaddle/Paddle/pull/35556)
@@ -774,6 +843,7 @@ paddle.int64
- Add native support for Ascend series hardware
- sub-graphs are accessed to ascend310 hardware [#35226](https://github.com/PaddlePaddle/Paddle/pull/35226) by supporting Paddle-Lite NNAdapter. For the example, see the [demo](https://github.com/PaddlePaddle/Paddle-Inference-Demo/tree/master/c%2B%2B/ascend310_lite_subgraph/image_classification_demo).
- New Ascend 910 inference support [#34101](https://github.com/PaddlePaddle/Paddle/pull/34101)
+- Add TensorRT support for the pool3d OP. ([#36545](https://github.com/PaddlePaddle/Paddle/pull/36545))
### **(2) Function optimization**
@@ -782,7 +852,7 @@ paddle.int64
- Quantification support
- Refactor dynamic graph quantization inference pass, to support non-analog quantization OP and analog quantization OP. ([#35907](https://github.com/PaddlePaddle/Paddle/pull/35907))
- Add int8 for analog quantized OP matmul (the case where weights are multiplied by tensor). ([#34359](https://github.com/PaddlePaddle/Paddle/pull/34359))
-
+ - Fix a bug where the loss of the MobileNetV3 model became NaN during quantization training because the quantization parameter was 0. ([#36763](https://github.com/PaddlePaddle/Paddle/pull/36763))
- API enhancements
- Refactor GO API based on new version of CAPI, [#33113](https://github.com/PaddlePaddle/Paddle/pull/33113). For the example, see the [demo](https://github.com/PaddlePaddle/Paddle-Inference-Demo/tree/master/go/resnet50).
@@ -818,6 +888,7 @@ paddle.int64
- Add support for int8 in TensorRT `qkv_context` plugin ([#34917](https://github.com/PaddlePaddle/Paddle/pull/34917), [#35504](https://github.com/PaddlePaddle/Paddle/pull/35504))
- Add support for TensorRT conv3d. ([#35507](https://github.com/PaddlePaddle/Paddle/pull/35507))
- Add support for broadcasting the input of the `multihead_matmul` fusion operator. ([#35780](https://github.com/PaddlePaddle/Paddle/pull/35780))
+ - Support TensorRT 8 sparse inference. In the test environment, performance improves by 10%-30% for the ERNIE model with variable-length input and by 10% for the ResNeXt101_32x4d model, across different batch sizes. ([#36659](https://github.com/PaddlePaddle/Paddle/pull/36659))
- Nvidia Jetson native support enhancements
- Add the Op support, for the Jetson Nano/TX2, two devices with lower arithmetic power. We made targeted optimizations. Now add the support for 17 OPs such as `pool2d`, `pool_max`, `conv3d_transpose`, etc. ([#35378](https://github.com/PaddlePaddle/Paddle/pull/35378))
@@ -827,6 +898,7 @@ paddle.int64
- Kunlun XPU interface feature extensions
- Add the `set_xpu_device_id` interface to support setting the device number of the Kunlun chip in the inference ([#35572](https://github.com/PaddlePaddle/Paddle/pull/35572))
+- Add an input type check to the Python inference interface `copy_from_cpu`, so that inputs of the wrong type are reported early. ([#36552](https://github.com/PaddlePaddle/Paddle/pull/36552))
### **(3) Troubleshooting**
@@ -849,6 +921,14 @@ paddle.int64
- Fix a possible accuracy bug in the running of the ernie model FP16 with precision. ([#34771](https://github.com/PaddlePaddle/Paddle/pull/34711))
- Fix the incorrect output bug due to an inconsistent order of inputs when the ernie becomes longer. ([#33575](https://github.com/PaddlePaddle/Paddle/pull/33575))
- Fix a bug where the allocator function is abnormal in multi-stream state. ([#32932](https://github.com/PaddlePaddle/Paddle/pull/33575))
+- Fix a possible crash of the ERNIE model under TRT 8. ([#36769](https://github.com/PaddlePaddle/Paddle/pull/36769))
+- Fix crash and accuracy bugs when Pool and Slice are used. ([#36666](https://github.com/PaddlePaddle/Paddle/pull/36666))
+- Fix an accuracy bug in the yolo_box op caused by a wrong formula. ([#36365](https://github.com/PaddlePaddle/Paddle/pull/36365))
+- Fix a bug where quantized matmul_v2 cannot run inference properly under TRT. ([#36821](https://github.com/PaddlePaddle/Paddle/pull/36821))
+- Fix a bug where quantization ops were incorrectly added when quantizing matmul_v2. ([#36820](https://github.com/PaddlePaddle/Paddle/pull/36820))
+- Fix an error reported by the batch_norm and elementwise_add operators when TRT is enabled in 3D application scenarios. ([#36446](https://github.com/PaddlePaddle/Paddle/pull/36446))
+- Fix a bug where the inference model saved by the high-level linear API cannot be optimized by Pass fusion. ([#36500](https://github.com/PaddlePaddle/Paddle/pull/36500))
+- Fix the MatmulV2ToMul pass by re-qualifying its (matmul_v2 to mul) mapping condition, add a MatmulV2ToMatmul pass with a (matmul_v2 to matmul) mapping condition (broadcast not supported), and modify the (matmul, mul) op_teller mapping condition. ([#36652](https://github.com/PaddlePaddle/Paddle/pull/36652))
#### **Back-end capability fixing**