From 47e2031880272dcf9ba5d0f754e20d4938232886 Mon Sep 17 00:00:00 2001 From: Chen Lai Date: Tue, 23 Feb 2021 11:53:24 -0800 Subject: [PATCH 1/6] test Please enter the commit message for your changes. Lines starting --- prototype_source/lite_interpreter.rst | 1 + 1 file changed, 1 insertion(+) create mode 100644 prototype_source/lite_interpreter.rst diff --git a/prototype_source/lite_interpreter.rst b/prototype_source/lite_interpreter.rst new file mode 100644 index 00000000000..76d4bb83f8d --- /dev/null +++ b/prototype_source/lite_interpreter.rst @@ -0,0 +1 @@ +add From 26077e09a8a5960c31f6389d61e4eb53ec75be15 Mon Sep 17 00:00:00 2001 From: Chen Lai Date: Tue, 23 Feb 2021 11:53:24 -0800 Subject: [PATCH 2/6] [Lite Interpreter] Add lite interpreter workflow in Android and iOS Please enter the commit message for your changes. Lines starting --- prototype_source/lite_interpreter.rst | 221 ++++++++++++++++++++++++++ 1 file changed, 221 insertions(+) create mode 100644 prototype_source/lite_interpreter.rst diff --git a/prototype_source/lite_interpreter.rst b/prototype_source/lite_interpreter.rst new file mode 100644 index 00000000000..bb3efe44f46 --- /dev/null +++ b/prototype_source/lite_interpreter.rst @@ -0,0 +1,221 @@ +(Prototype) Introduce lite interpreter workflow in Android and iOS +================================================================== + +**Author**: `Chen Lai `_, `Martin Yuan `_ + +Introduction +------------ + +This tutorial introduces the steps to use lite interpreter on iOS and Android. We'll be using the ImageSegmentation demo app as an example. Since lite interpreter is currently in the prototype stage, a custom pytorch binary from source is required. + + +Android +------------------- +Get ImageSegmentation demo app in Android: https://github.com/pytorch/android-demo-app/tree/master/ImageSegmentation + +1. **Prepare model**: Prepare the lite interpreter version of model by run the script below to generate the scripted model `deeplabv3_scripted.pt` and `deeplabv3_scripted.ptl` + +.. code:: python + + import torch + + model = torch.hub.load('pytorch/vision:v0.7.0', 'deeplabv3_resnet50', pretrained=True) + model.eval() + + scripted_module = torch.jit.script(model) + # Export full jit version model (not compatible lite interpreter), leave it here for comparison + scripted_module.save("deeplabv3_scripted.pt") + # Export lite interpreter version model (compatible with lite interpreter) + scripted_module._save_for_lite_interpreter("deeplabv3_scripted.ptl") + +2. **Build libtorch lite for android**: Build libtorch for android for all 4 android abis (``armeabi-v7a``, ``arm64-v8a``, ``x86``, ``x86_64``) ``BUILD_LITE_INTERPRETER=1 ./scripts/build_pytorch_android.sh``. For example, if it will be tested on Pixel 4 emulator with ``x86``, use cmd ``BUILD_LITE_INTERPRETER=1 ./scripts/build_pytorch_android.sh x86`` to specify abi to save built time. After the build finish, it will show the library path: + + +.. code-block:: bash + + BUILD SUCCESSFUL in 55s + 134 actionable tasks: 22 executed, 112 up-to-date + + find /Users/chenlai/pytorch/android -type f -name '*aar' + + xargs ls -lah + -rw-r--r-- 1 chenlai staff 13M Feb 11 11:48 /Users/chenlai/pytorch/android/pytorch_android/build/outputs/aar/pytorch_android-release.aar + -rw-r--r-- 1 chenlai staff 36K Feb 9 16:45 /Users/chenlai/pytorch/android/pytorch_android_torchvision/build/outputs/aar/pytorch_android_torchvision-release.aar + +3. 
**Use the PyTorch Android libraries built from source in the ImageSegmentation app**: Create a folder `libs` in the path, the path from repository root will be `ImageSegmentation/app/libs`. Copy `pytorch_android-release` to the path ``ImageSegmentation/app/libs/pytorch_android-release.aar``. Copy `pytorch_android_torchvision` (downloaded from `Pytorch Android Torchvision Nightly `_) to the path ``ImageSegmentation/app/libs/pytorch_android_torchvision.aar``. Update the `dependencies` part of ``ImageSegmentation/app/build.gradle`` to + +.. code:: gradle + + dependencies { + implementation 'androidx.appcompat:appcompat:1.2.0' + implementation 'androidx.constraintlayout:constraintlayout:2.0.2' + testImplementation 'junit:junit:4.12' + androidTestImplementation 'androidx.test.ext:junit:1.1.2' + androidTestImplementation 'androidx.test.espresso:espresso-core:3.3.0' + + + implementation(name:'pytorch_android-release', ext:'aar') + implementation(name:'pytorch_android_torchvision', ext:'aar') + + implementation 'com.android.support:appcompat-v7:28.0.0' + implementation 'com.facebook.fbjni:fbjni-java-only:0.0.3' + } + +Update `all projects` part in ``ImageSegmentation/build.gradle`` to + + +.. code:: gradle + + allprojects { + repositories { + google() + jcenter() + flatDir { + dirs 'libs' + } + } + } + +4. **Update model loader api**: Update ``ImageSegmentation/app/src/main/java/org/pytorch/imagesegmentation/MainActivity.java`` by + + 4.1 Add new import: `import org.pytorch.LiteModuleLoader` + + 4.2 Replace the way to load pytorch lite model + +.. code:: java + + // mModule = Module.load(MainActivity.assetFilePath(getApplicationContext(), "deeplabv3_scripted.pt")); + mModule = LiteModuleLoader.load(MainActivity.assetFilePath(getApplicationContext(), "deeplabv3_scripted.ptl")); + +5. **Test app**: Build and run the `ImageSegmentation` app in Android Studio + +iOS +------------------- +Get ImageSegmentation demo app in iOS: https://github.com/pytorch/ios-demo-app/tree/master/ImageSegmentation + +1. **Prepare model**: Same as Android. + +2. **Build libtorch lite for iOS**: + +.. code-block:: bash + + BUILD_PYTORCH_MOBILE=1 IOS_PLATFORM=SIMULATOR BUILD_LITE_INTERPRETER=1 ./scripts/build_ios.sh + + +3. **Remove Cocoapods from the project** (this step is only needed if you ran `pod install`): + +.. code-block:: bash + + pod deintegrate + +4. **Link ImageSegmentation demo app with the custom built library**: +Open your project in XCode, go to your project Target’s **Build Phases - Link Binaries With Libraries**, click the **+** sign and add all the library files located in `build_ios/install/lib`. Navigate to the project **Build Settings**, set the value **Header Search Paths** to `build_ios/install/include` and **Library Search Paths** to `build_ios/install/lib`. +In the build settings, search for **other linker flags**. Add a custom linker flag below +``` +-all_load +``` +Finally, disable bitcode for your target by selecting the Build Settings, searching for Enable Bitcode, and set the value to **No**. + +5. **Update library and api** + + 5.1 Update ``TorchModule.mm``: To use the custom built libraries the project, replace `#import ` (in ``TorchModule.mm``) which is needed when using LibTorch via Cocoapods with the code below: + +.. code-block:: swift + + //#import + #include "ATen/ATen.h" + #include "caffe2/core/timer.h" + #include "caffe2/utils/string_utils.h" + #include "torch/csrc/autograd/grad_mode.h" + #include "torch/script.h" + #include + #include + #include + #include + #include + +.. 
code-block:: swift + + @implementation TorchModule { + @protected + // torch::jit::script::Module _impl; + torch::jit::mobile::Module _impl; + } + + - (nullable instancetype)initWithFileAtPath:(NSString*)filePath { + self = [super init]; + if (self) { + try { + _impl = torch::jit::_load_for_mobile(filePath.UTF8String); + // _impl = torch::jit::load(filePath.UTF8String); + // _impl.eval(); + } catch (const std::exception& exception) { + NSLog(@"%s", exception.what()); + return nil; + } + } + return self; + } + + +5.2 Update ``ViewController.swift`` + +.. code-block:: swift + + // if let filePath = Bundle.main.path(forResource: + // "deeplabv3_scripted", ofType: "pt"), + // let module = TorchModule(fileAtPath: filePath) { + // return module + // } else { + // fatalError("Can't find the model file!") + // } + if let filePath = Bundle.main.path(forResource: + "deeplabv3_scripted", ofType: "ptl"), + let module = TorchModule(fileAtPath: filePath) { + return module + } else { + fatalError("Can't find the model file!") + } + +6. Build and test the app in Xcode. + +How to use lite interpreter + custom build +------------------------------------------ +1. To dump the operators in your model, say `deeplabv3_scripted`, run the following lines of Python code: + +.. code-block:: python + + # Dump list of operators used by deeplabv3_scripted: + import torch, yaml + model = torch.jit.load('deeplabv3_scripted.ptl') + ops = torch.jit.export_opnames(model) + with open('deeplabv3_scripted.yaml', 'w') as output: + yaml.dump(ops, output) + +In the snippet above, you first need to load the ScriptModule. Then, use export_opnames to return a list of operator names of the ScriptModule and its submodules. Lastly, save the result in a yaml file. The yaml file can be generated for any PyTorch 1.4.0 or above version. You can do that by checking the value of `torch.__version__`. + +2. To run the build script locally with the prepared yaml list of operators, pass in the yaml file generate from the last step into the environment variable SELECTED_OP_LIST. Also in the arguments, specify BUILD_PYTORCH_MOBILE=1 as well as the platform/architechture type. + +**iOS**: Take the simulator build for example, the command should be: + +.. code-block:: bash + + SELECTED_OP_LIST=deeplabv3_scripted.yaml BUILD_PYTORCH_MOBILE=1 IOS_PLATFORM=SIMULATOR BUILD_LITE_INTERPRETER=1 ./scripts/build_ios.sh + +**Android**: Take the x86 build for example, the command should be: + +.. code-block:: bash + + SELECTED_OP_LIST=deeplabv3_scripted.yaml BUILD_LITE_INTERPRETER=1 ./scripts/build_pytorch_android.sh x86 + + +Conclusion +---------- + +In this tutorial, we demonstrated how to use lite interpreter in Android and iOS app. We walked through an Image Segmentation example to show how to dump the model, build torch library from source and use the new api to run model. Please be aware of that lite interpreter is still under development, more library size reduction will be introduced in the future. APIs are subject to change in the future versions. + +Thanks for reading! As always, we welcome any feedback, so please create an issue `here `_ if you have any. 
+ +Learn More +---------- + +- To learn more about PyTorch Mobile, please refer to `PyTorch Mobile Home Page `_ +- To learn more about Image Segmentation, please refer to the `Image Segmentation DeepLabV3 on Android Recipe `_ From 5e51680dd0d9428437b5ed774e269e55d7f0c7df Mon Sep 17 00:00:00 2001 From: Chen Lai Date: Wed, 9 Jun 2021 12:22:26 -0700 Subject: [PATCH 3/6] Update mobile interpreter to beta --- prototype_source/lite_interpreter.rst | 221 ------------------------ prototype_source/mobile_interpreter.rst | 198 +++++++++++++++++++++ 2 files changed, 198 insertions(+), 221 deletions(-) delete mode 100644 prototype_source/lite_interpreter.rst create mode 100644 prototype_source/mobile_interpreter.rst diff --git a/prototype_source/lite_interpreter.rst b/prototype_source/lite_interpreter.rst deleted file mode 100644 index bb3efe44f46..00000000000 --- a/prototype_source/lite_interpreter.rst +++ /dev/null @@ -1,221 +0,0 @@ -(Prototype) Introduce lite interpreter workflow in Android and iOS -================================================================== - -**Author**: `Chen Lai `_, `Martin Yuan `_ - -Introduction ------------- - -This tutorial introduces the steps to use lite interpreter on iOS and Android. We'll be using the ImageSegmentation demo app as an example. Since lite interpreter is currently in the prototype stage, a custom pytorch binary from source is required. - - -Android -------------------- -Get ImageSegmentation demo app in Android: https://github.com/pytorch/android-demo-app/tree/master/ImageSegmentation - -1. **Prepare model**: Prepare the lite interpreter version of model by run the script below to generate the scripted model `deeplabv3_scripted.pt` and `deeplabv3_scripted.ptl` - -.. code:: python - - import torch - - model = torch.hub.load('pytorch/vision:v0.7.0', 'deeplabv3_resnet50', pretrained=True) - model.eval() - - scripted_module = torch.jit.script(model) - # Export full jit version model (not compatible lite interpreter), leave it here for comparison - scripted_module.save("deeplabv3_scripted.pt") - # Export lite interpreter version model (compatible with lite interpreter) - scripted_module._save_for_lite_interpreter("deeplabv3_scripted.ptl") - -2. **Build libtorch lite for android**: Build libtorch for android for all 4 android abis (``armeabi-v7a``, ``arm64-v8a``, ``x86``, ``x86_64``) ``BUILD_LITE_INTERPRETER=1 ./scripts/build_pytorch_android.sh``. For example, if it will be tested on Pixel 4 emulator with ``x86``, use cmd ``BUILD_LITE_INTERPRETER=1 ./scripts/build_pytorch_android.sh x86`` to specify abi to save built time. After the build finish, it will show the library path: - - -.. code-block:: bash - - BUILD SUCCESSFUL in 55s - 134 actionable tasks: 22 executed, 112 up-to-date - + find /Users/chenlai/pytorch/android -type f -name '*aar' - + xargs ls -lah - -rw-r--r-- 1 chenlai staff 13M Feb 11 11:48 /Users/chenlai/pytorch/android/pytorch_android/build/outputs/aar/pytorch_android-release.aar - -rw-r--r-- 1 chenlai staff 36K Feb 9 16:45 /Users/chenlai/pytorch/android/pytorch_android_torchvision/build/outputs/aar/pytorch_android_torchvision-release.aar - -3. **Use the PyTorch Android libraries built from source in the ImageSegmentation app**: Create a folder `libs` in the path, the path from repository root will be `ImageSegmentation/app/libs`. Copy `pytorch_android-release` to the path ``ImageSegmentation/app/libs/pytorch_android-release.aar``. 
Copy `pytorch_android_torchvision` (downloaded from `Pytorch Android Torchvision Nightly `_) to the path ``ImageSegmentation/app/libs/pytorch_android_torchvision.aar``. Update the `dependencies` part of ``ImageSegmentation/app/build.gradle`` to - -.. code:: gradle - - dependencies { - implementation 'androidx.appcompat:appcompat:1.2.0' - implementation 'androidx.constraintlayout:constraintlayout:2.0.2' - testImplementation 'junit:junit:4.12' - androidTestImplementation 'androidx.test.ext:junit:1.1.2' - androidTestImplementation 'androidx.test.espresso:espresso-core:3.3.0' - - - implementation(name:'pytorch_android-release', ext:'aar') - implementation(name:'pytorch_android_torchvision', ext:'aar') - - implementation 'com.android.support:appcompat-v7:28.0.0' - implementation 'com.facebook.fbjni:fbjni-java-only:0.0.3' - } - -Update `all projects` part in ``ImageSegmentation/build.gradle`` to - - -.. code:: gradle - - allprojects { - repositories { - google() - jcenter() - flatDir { - dirs 'libs' - } - } - } - -4. **Update model loader api**: Update ``ImageSegmentation/app/src/main/java/org/pytorch/imagesegmentation/MainActivity.java`` by - - 4.1 Add new import: `import org.pytorch.LiteModuleLoader` - - 4.2 Replace the way to load pytorch lite model - -.. code:: java - - // mModule = Module.load(MainActivity.assetFilePath(getApplicationContext(), "deeplabv3_scripted.pt")); - mModule = LiteModuleLoader.load(MainActivity.assetFilePath(getApplicationContext(), "deeplabv3_scripted.ptl")); - -5. **Test app**: Build and run the `ImageSegmentation` app in Android Studio - -iOS -------------------- -Get ImageSegmentation demo app in iOS: https://github.com/pytorch/ios-demo-app/tree/master/ImageSegmentation - -1. **Prepare model**: Same as Android. - -2. **Build libtorch lite for iOS**: - -.. code-block:: bash - - BUILD_PYTORCH_MOBILE=1 IOS_PLATFORM=SIMULATOR BUILD_LITE_INTERPRETER=1 ./scripts/build_ios.sh - - -3. **Remove Cocoapods from the project** (this step is only needed if you ran `pod install`): - -.. code-block:: bash - - pod deintegrate - -4. **Link ImageSegmentation demo app with the custom built library**: -Open your project in XCode, go to your project Target’s **Build Phases - Link Binaries With Libraries**, click the **+** sign and add all the library files located in `build_ios/install/lib`. Navigate to the project **Build Settings**, set the value **Header Search Paths** to `build_ios/install/include` and **Library Search Paths** to `build_ios/install/lib`. -In the build settings, search for **other linker flags**. Add a custom linker flag below -``` --all_load -``` -Finally, disable bitcode for your target by selecting the Build Settings, searching for Enable Bitcode, and set the value to **No**. - -5. **Update library and api** - - 5.1 Update ``TorchModule.mm``: To use the custom built libraries the project, replace `#import ` (in ``TorchModule.mm``) which is needed when using LibTorch via Cocoapods with the code below: - -.. code-block:: swift - - //#import - #include "ATen/ATen.h" - #include "caffe2/core/timer.h" - #include "caffe2/utils/string_utils.h" - #include "torch/csrc/autograd/grad_mode.h" - #include "torch/script.h" - #include - #include - #include - #include - #include - -.. 
code-block:: swift - - @implementation TorchModule { - @protected - // torch::jit::script::Module _impl; - torch::jit::mobile::Module _impl; - } - - - (nullable instancetype)initWithFileAtPath:(NSString*)filePath { - self = [super init]; - if (self) { - try { - _impl = torch::jit::_load_for_mobile(filePath.UTF8String); - // _impl = torch::jit::load(filePath.UTF8String); - // _impl.eval(); - } catch (const std::exception& exception) { - NSLog(@"%s", exception.what()); - return nil; - } - } - return self; - } - - -5.2 Update ``ViewController.swift`` - -.. code-block:: swift - - // if let filePath = Bundle.main.path(forResource: - // "deeplabv3_scripted", ofType: "pt"), - // let module = TorchModule(fileAtPath: filePath) { - // return module - // } else { - // fatalError("Can't find the model file!") - // } - if let filePath = Bundle.main.path(forResource: - "deeplabv3_scripted", ofType: "ptl"), - let module = TorchModule(fileAtPath: filePath) { - return module - } else { - fatalError("Can't find the model file!") - } - -6. Build and test the app in Xcode. - -How to use lite interpreter + custom build ------------------------------------------- -1. To dump the operators in your model, say `deeplabv3_scripted`, run the following lines of Python code: - -.. code-block:: python - - # Dump list of operators used by deeplabv3_scripted: - import torch, yaml - model = torch.jit.load('deeplabv3_scripted.ptl') - ops = torch.jit.export_opnames(model) - with open('deeplabv3_scripted.yaml', 'w') as output: - yaml.dump(ops, output) - -In the snippet above, you first need to load the ScriptModule. Then, use export_opnames to return a list of operator names of the ScriptModule and its submodules. Lastly, save the result in a yaml file. The yaml file can be generated for any PyTorch 1.4.0 or above version. You can do that by checking the value of `torch.__version__`. - -2. To run the build script locally with the prepared yaml list of operators, pass in the yaml file generate from the last step into the environment variable SELECTED_OP_LIST. Also in the arguments, specify BUILD_PYTORCH_MOBILE=1 as well as the platform/architechture type. - -**iOS**: Take the simulator build for example, the command should be: - -.. code-block:: bash - - SELECTED_OP_LIST=deeplabv3_scripted.yaml BUILD_PYTORCH_MOBILE=1 IOS_PLATFORM=SIMULATOR BUILD_LITE_INTERPRETER=1 ./scripts/build_ios.sh - -**Android**: Take the x86 build for example, the command should be: - -.. code-block:: bash - - SELECTED_OP_LIST=deeplabv3_scripted.yaml BUILD_LITE_INTERPRETER=1 ./scripts/build_pytorch_android.sh x86 - - -Conclusion ----------- - -In this tutorial, we demonstrated how to use lite interpreter in Android and iOS app. We walked through an Image Segmentation example to show how to dump the model, build torch library from source and use the new api to run model. Please be aware of that lite interpreter is still under development, more library size reduction will be introduced in the future. APIs are subject to change in the future versions. - -Thanks for reading! As always, we welcome any feedback, so please create an issue `here `_ if you have any. 
- -Learn More ----------- - -- To learn more about PyTorch Mobile, please refer to `PyTorch Mobile Home Page `_ -- To learn more about Image Segmentation, please refer to the `Image Segmentation DeepLabV3 on Android Recipe `_ diff --git a/prototype_source/mobile_interpreter.rst b/prototype_source/mobile_interpreter.rst new file mode 100644 index 00000000000..62e0449324d --- /dev/null +++ b/prototype_source/mobile_interpreter.rst @@ -0,0 +1,198 @@ +(beta) Efficient mobile interpreter in Android and iOS +================================================================== + +**Author**: `Chen Lai `_, `Martin Yuan `_ + +Introduction +------------ + +This tutorial introduces the steps to use PyTorch's efficient interpreter on iOS and Android. We will be using an Image Segmentation demo application as an example. + +This application will take advantage of the pre-built interpreter libraries available for Android and iOS, which can be used directly with Maven (Android) and CocoaPods (iOS). It is important to note that the pre-built libraries are the available for simplicity, but further size optimization can be achieved with by utilizing PyTorch's custom build capabilities. + +.. note:: If you see the error message: `PytorchStreamReader failed locating file bytecode.pkl: file not found ()`, likely you are using a torch script model that requires the use of the PyTorch JIT interpreter (a version of our PyTorch interpreter that is not as size-efficient). In order to leverage our efficient interpreter, please regenerate the model by running: `module._save_for_lite_interpreter(${model_path})`. + + - If `bytecode.pkl` is missing, likely the model is generated with the api: `module.save(${model_psth})`. + - The api `_load_for_lite_interpreter(${model_psth}) can be helpful to validate model with the efficient mobile interpreter. + +Android +------------------- +Get the Image Segmentation demo app in Android: https://github.com/pytorch/android-demo-app/tree/master/ImageSegmentation + +1. **Prepare model**: Prepare the mobile interpreter version of model by run the script below to generate the scripted model `deeplabv3_scripted.pt` and `deeplabv3_scripted.ptl` + +.. code:: python + + import torch + from torch.utils.mobile_optimizer import optimize_for_mobile + model = torch.hub.load('pytorch/vision:v0.7.0', 'deeplabv3_resnet50', pretrained=True) + model.eval() + + scripted_module = torch.jit.script(model) + # Export full jit version model (not compatible mobile interpreter), leave it here for comparison + scripted_module.save("deeplabv3_scripted.pt") + # Export mobile interpreter version model (compatible with mobile interpreter) + optimized_scripted_module = optimize_for_mobile(scripted_module) + optimized_scripted_module._save_for_lite_interpreter("deeplabv3_scripted.ptl") + +2. **Use the PyTorch Android library in the ImageSegmentation app**: Update the `dependencies` part of ``ImageSegmentation/app/build.gradle`` to + +.. 
code:: gradle + + repositories { + maven { + url "https://oss.sonatype.org/content/repositories/snapshots" + } + } + + dependencies { + implementation 'androidx.appcompat:appcompat:1.2.0' + implementation 'androidx.constraintlayout:constraintlayout:2.0.2' + testImplementation 'junit:junit:4.12' + androidTestImplementation 'androidx.test.ext:junit:1.1.2' + androidTestImplementation 'androidx.test.espresso:espresso-core:3.3.0' + implementation 'org.pytorch:pytorch_android_lite:1.9.0' + implementation 'org.pytorch:pytorch_android_torchvision:1.9.0' + + implementation 'com.facebook.fbjni:fbjni-java-only:0.0.3' + } + + + +3. **Update model loader api**: Update ``ImageSegmentation/app/src/main/java/org/pytorch/imagesegmentation/MainActivity.java`` by + + 4.1 Add new import: `import org.pytorch.LiteModuleLoader` + + 4.2 Replace the way to load pytorch lite model + +.. code:: java + + // mModule = Module.load(MainActivity.assetFilePath(getApplicationContext(), "deeplabv3_scripted.pt")); + mModule = LiteModuleLoader.load(MainActivity.assetFilePath(getApplicationContext(), "deeplabv3_scripted.ptl")); + +4. **Test app**: Build and run the `ImageSegmentation` app in Android Studio + +iOS +------------------- +Get ImageSegmentation demo app in iOS: https://github.com/pytorch/ios-demo-app/tree/master/ImageSegmentation + +1. **Prepare model**: Same as Android. + +2. **Build the project with Cocoapods and prebuilt interpreter** Update the `PodFile` and run `pod install`: + +.. code-block:: podfile + + target 'ImageSegmentation' do + # Comment the next line if you don't want to use dynamic frameworks + use_frameworks! + + # Pods for ImageSegmentation + pod 'LibTorch_Lite', '~>1.9.0' + end + +3. **Update library and API** + + 3.1 Update ``TorchModule.mm``: To use the custom built libraries project, use `` (in ``TorchModule.mm``): + +.. code-block:: swift + + #import + // If it's built from source with xcode, comment out the line above + // and use following headers + // #include + // #include + // #include + +.. code-block:: swift + + @implementation TorchModule { + @protected + // torch::jit::script::Module _impl; + torch::jit::mobile::Module _impl; + } + + - (nullable instancetype)initWithFileAtPath:(NSString*)filePath { + self = [super init]; + if (self) { + try { + _impl = torch::jit::_load_for_mobile(filePath.UTF8String); + // _impl = torch::jit::load(filePath.UTF8String); + // _impl.eval(); + } catch (const std::exception& exception) { + NSLog(@"%s", exception.what()); + return nil; + } + } + return self; + } + +3.2 Update ``ViewController.swift`` + +.. code-block:: swift + + // if let filePath = Bundle.main.path(forResource: + // "deeplabv3_scripted", ofType: "pt"), + // let module = TorchModule(fileAtPath: filePath) { + // return module + // } else { + // fatalError("Can't find the model file!") + // } + if let filePath = Bundle.main.path(forResource: + "deeplabv3_scripted", ofType: "ptl"), + let module = TorchModule(fileAtPath: filePath) { + return module + } else { + fatalError("Can't find the model file!") + } + +4. Build and test the app in Xcode. + +How to use mobile interpreter + custom build +------------------------------------------ +A custom PyTorch interpreter library can be created to reduce binary size, by only containing the operators needed by the model. In order to do that follow these steps: + +1. To dump the operators in your model, say `deeplabv3_scripted`, run the following lines of Python code: + +.. 
code-block:: python + + # Dump list of operators used by deeplabv3_scripted: + import torch, yaml + model = torch.jit.load('deeplabv3_scripted.ptl') + ops = torch.jit.export_opnames(model) + with open('deeplabv3_scripted.yaml', 'w') as output: + yaml.dump(ops, output) + +In the snippet above, you first need to load the ScriptModule. Then, use export_opnames to return a list of operator names of the ScriptModule and its submodules. Lastly, save the result in a yaml file. The yaml file can be generated for any PyTorch 1.4.0 or above version. You can do that by checking the value of `torch.__version__`. + +2. To run the build script locally with the prepared yaml list of operators, pass in the yaml file generate from the last step into the environment variable SELECTED_OP_LIST. Also in the arguments, specify BUILD_PYTORCH_MOBILE=1 as well as the platform/architechture type. + +**iOS**: Take the simulator build for example, the command should be: + +.. code-block:: bash + + SELECTED_OP_LIST=deeplabv3_scripted.yaml BUILD_PYTORCH_MOBILE=1 IOS_PLATFORM=SIMULATOR ./scripts/build_ios.sh + +**Android**: Take the x86 build for example, the command should be: + +.. code-block:: bash + + SELECTED_OP_LIST=deeplabv3_scripted.yaml ./scripts/build_pytorch_android.sh x86 + + + +Conclusion +---------- + +In this tutorial, we demonstrated how to use PyTorch's efficient mobile interpreter, in an Android and iOS app. + +We walked through an Image Segmentation example to show how to dump the model, build a custom torch library from source and use the new api to run model. + +Our efficient mobile interpreter is still under development, and we will continue improving its size in the future. Note, however, that the APIs are subject to change in future versions. + +Thanks for reading! As always, we welcome any feedback, so please create an issue `here ` - if you have any. + +Learn More +---------- + +- To learn more about PyTorch Mobile, please refer to `PyTorch Mobile Home Page `_ +- To learn more about Image Segmentation, please refer to the `Image Segmentation DeepLabV3 on Android Recipe `_ From 2c25c73d9a7cd8a78631d64ceb43beae26bb2e50 Mon Sep 17 00:00:00 2001 From: Chen Lai Date: Wed, 9 Jun 2021 12:56:37 -0700 Subject: [PATCH 4/6] fix type --- prototype_source/mobile_interpreter.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/prototype_source/mobile_interpreter.rst b/prototype_source/mobile_interpreter.rst index 62e0449324d..ec23b28100e 100644 --- a/prototype_source/mobile_interpreter.rst +++ b/prototype_source/mobile_interpreter.rst @@ -13,7 +13,7 @@ This application will take advantage of the pre-built interpreter libraries avai .. note:: If you see the error message: `PytorchStreamReader failed locating file bytecode.pkl: file not found ()`, likely you are using a torch script model that requires the use of the PyTorch JIT interpreter (a version of our PyTorch interpreter that is not as size-efficient). In order to leverage our efficient interpreter, please regenerate the model by running: `module._save_for_lite_interpreter(${model_path})`. - If `bytecode.pkl` is missing, likely the model is generated with the api: `module.save(${model_psth})`. - - The api `_load_for_lite_interpreter(${model_psth}) can be helpful to validate model with the efficient mobile interpreter. + - The api `_load_for_lite_interpreter(${model_psth})` can be helpful to validate model with the efficient mobile interpreter. 
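For reference, one way to apply that check is to load the ``.ptl`` file from Python on your development machine and run a single forward pass before bundling it into the app. The snippet below is a minimal sketch, assuming a PyTorch build that exposes ``_load_for_lite_interpreter`` under ``torch.jit.mobile`` (the exact import location may differ across versions) and using a random tensor purely as a placeholder input.

.. code-block:: python

    import torch
    from torch.jit.mobile import _load_for_lite_interpreter

    # Load the mobile interpreter (.ptl) model on the host machine.
    lite_module = _load_for_lite_interpreter("deeplabv3_scripted.ptl")

    # Run one forward pass with a dummy image-shaped input to confirm the
    # bytecode executes end to end before shipping it inside the app.
    dummy_input = torch.rand(1, 3, 224, 224)
    output = lite_module(dummy_input)
    print(type(output))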
Android ------------------- From c46e675b235442a060d30080ca322bce935efebc Mon Sep 17 00:00:00 2001 From: Chen Lai Date: Wed, 9 Jun 2021 12:58:59 -0700 Subject: [PATCH 5/6] move from prototype_resources to recipe_resources --- prototype_source/mobile_interpreter.rst | 198 ------------- recipes_source/mobile_perf.rst | 365 +++++++++--------------- 2 files changed, 137 insertions(+), 426 deletions(-) delete mode 100644 prototype_source/mobile_interpreter.rst diff --git a/prototype_source/mobile_interpreter.rst b/prototype_source/mobile_interpreter.rst deleted file mode 100644 index ec23b28100e..00000000000 --- a/prototype_source/mobile_interpreter.rst +++ /dev/null @@ -1,198 +0,0 @@ -(beta) Efficient mobile interpreter in Android and iOS -================================================================== - -**Author**: `Chen Lai `_, `Martin Yuan `_ - -Introduction ------------- - -This tutorial introduces the steps to use PyTorch's efficient interpreter on iOS and Android. We will be using an Image Segmentation demo application as an example. - -This application will take advantage of the pre-built interpreter libraries available for Android and iOS, which can be used directly with Maven (Android) and CocoaPods (iOS). It is important to note that the pre-built libraries are the available for simplicity, but further size optimization can be achieved with by utilizing PyTorch's custom build capabilities. - -.. note:: If you see the error message: `PytorchStreamReader failed locating file bytecode.pkl: file not found ()`, likely you are using a torch script model that requires the use of the PyTorch JIT interpreter (a version of our PyTorch interpreter that is not as size-efficient). In order to leverage our efficient interpreter, please regenerate the model by running: `module._save_for_lite_interpreter(${model_path})`. - - - If `bytecode.pkl` is missing, likely the model is generated with the api: `module.save(${model_psth})`. - - The api `_load_for_lite_interpreter(${model_psth})` can be helpful to validate model with the efficient mobile interpreter. - -Android -------------------- -Get the Image Segmentation demo app in Android: https://github.com/pytorch/android-demo-app/tree/master/ImageSegmentation - -1. **Prepare model**: Prepare the mobile interpreter version of model by run the script below to generate the scripted model `deeplabv3_scripted.pt` and `deeplabv3_scripted.ptl` - -.. code:: python - - import torch - from torch.utils.mobile_optimizer import optimize_for_mobile - model = torch.hub.load('pytorch/vision:v0.7.0', 'deeplabv3_resnet50', pretrained=True) - model.eval() - - scripted_module = torch.jit.script(model) - # Export full jit version model (not compatible mobile interpreter), leave it here for comparison - scripted_module.save("deeplabv3_scripted.pt") - # Export mobile interpreter version model (compatible with mobile interpreter) - optimized_scripted_module = optimize_for_mobile(scripted_module) - optimized_scripted_module._save_for_lite_interpreter("deeplabv3_scripted.ptl") - -2. **Use the PyTorch Android library in the ImageSegmentation app**: Update the `dependencies` part of ``ImageSegmentation/app/build.gradle`` to - -.. 
code:: gradle - - repositories { - maven { - url "https://oss.sonatype.org/content/repositories/snapshots" - } - } - - dependencies { - implementation 'androidx.appcompat:appcompat:1.2.0' - implementation 'androidx.constraintlayout:constraintlayout:2.0.2' - testImplementation 'junit:junit:4.12' - androidTestImplementation 'androidx.test.ext:junit:1.1.2' - androidTestImplementation 'androidx.test.espresso:espresso-core:3.3.0' - implementation 'org.pytorch:pytorch_android_lite:1.9.0' - implementation 'org.pytorch:pytorch_android_torchvision:1.9.0' - - implementation 'com.facebook.fbjni:fbjni-java-only:0.0.3' - } - - - -3. **Update model loader api**: Update ``ImageSegmentation/app/src/main/java/org/pytorch/imagesegmentation/MainActivity.java`` by - - 4.1 Add new import: `import org.pytorch.LiteModuleLoader` - - 4.2 Replace the way to load pytorch lite model - -.. code:: java - - // mModule = Module.load(MainActivity.assetFilePath(getApplicationContext(), "deeplabv3_scripted.pt")); - mModule = LiteModuleLoader.load(MainActivity.assetFilePath(getApplicationContext(), "deeplabv3_scripted.ptl")); - -4. **Test app**: Build and run the `ImageSegmentation` app in Android Studio - -iOS -------------------- -Get ImageSegmentation demo app in iOS: https://github.com/pytorch/ios-demo-app/tree/master/ImageSegmentation - -1. **Prepare model**: Same as Android. - -2. **Build the project with Cocoapods and prebuilt interpreter** Update the `PodFile` and run `pod install`: - -.. code-block:: podfile - - target 'ImageSegmentation' do - # Comment the next line if you don't want to use dynamic frameworks - use_frameworks! - - # Pods for ImageSegmentation - pod 'LibTorch_Lite', '~>1.9.0' - end - -3. **Update library and API** - - 3.1 Update ``TorchModule.mm``: To use the custom built libraries project, use `` (in ``TorchModule.mm``): - -.. code-block:: swift - - #import - // If it's built from source with xcode, comment out the line above - // and use following headers - // #include - // #include - // #include - -.. code-block:: swift - - @implementation TorchModule { - @protected - // torch::jit::script::Module _impl; - torch::jit::mobile::Module _impl; - } - - - (nullable instancetype)initWithFileAtPath:(NSString*)filePath { - self = [super init]; - if (self) { - try { - _impl = torch::jit::_load_for_mobile(filePath.UTF8String); - // _impl = torch::jit::load(filePath.UTF8String); - // _impl.eval(); - } catch (const std::exception& exception) { - NSLog(@"%s", exception.what()); - return nil; - } - } - return self; - } - -3.2 Update ``ViewController.swift`` - -.. code-block:: swift - - // if let filePath = Bundle.main.path(forResource: - // "deeplabv3_scripted", ofType: "pt"), - // let module = TorchModule(fileAtPath: filePath) { - // return module - // } else { - // fatalError("Can't find the model file!") - // } - if let filePath = Bundle.main.path(forResource: - "deeplabv3_scripted", ofType: "ptl"), - let module = TorchModule(fileAtPath: filePath) { - return module - } else { - fatalError("Can't find the model file!") - } - -4. Build and test the app in Xcode. - -How to use mobile interpreter + custom build ------------------------------------------- -A custom PyTorch interpreter library can be created to reduce binary size, by only containing the operators needed by the model. In order to do that follow these steps: - -1. To dump the operators in your model, say `deeplabv3_scripted`, run the following lines of Python code: - -.. 
code-block:: python - - # Dump list of operators used by deeplabv3_scripted: - import torch, yaml - model = torch.jit.load('deeplabv3_scripted.ptl') - ops = torch.jit.export_opnames(model) - with open('deeplabv3_scripted.yaml', 'w') as output: - yaml.dump(ops, output) - -In the snippet above, you first need to load the ScriptModule. Then, use export_opnames to return a list of operator names of the ScriptModule and its submodules. Lastly, save the result in a yaml file. The yaml file can be generated for any PyTorch 1.4.0 or above version. You can do that by checking the value of `torch.__version__`. - -2. To run the build script locally with the prepared yaml list of operators, pass in the yaml file generate from the last step into the environment variable SELECTED_OP_LIST. Also in the arguments, specify BUILD_PYTORCH_MOBILE=1 as well as the platform/architechture type. - -**iOS**: Take the simulator build for example, the command should be: - -.. code-block:: bash - - SELECTED_OP_LIST=deeplabv3_scripted.yaml BUILD_PYTORCH_MOBILE=1 IOS_PLATFORM=SIMULATOR ./scripts/build_ios.sh - -**Android**: Take the x86 build for example, the command should be: - -.. code-block:: bash - - SELECTED_OP_LIST=deeplabv3_scripted.yaml ./scripts/build_pytorch_android.sh x86 - - - -Conclusion ----------- - -In this tutorial, we demonstrated how to use PyTorch's efficient mobile interpreter, in an Android and iOS app. - -We walked through an Image Segmentation example to show how to dump the model, build a custom torch library from source and use the new api to run model. - -Our efficient mobile interpreter is still under development, and we will continue improving its size in the future. Note, however, that the APIs are subject to change in future versions. - -Thanks for reading! As always, we welcome any feedback, so please create an issue `here ` - if you have any. - -Learn More ----------- - -- To learn more about PyTorch Mobile, please refer to `PyTorch Mobile Home Page `_ -- To learn more about Image Segmentation, please refer to the `Image Segmentation DeepLabV3 on Android Recipe `_ diff --git a/recipes_source/mobile_perf.rst b/recipes_source/mobile_perf.rst index 2e7e7c17f73..ec23b28100e 100644 --- a/recipes_source/mobile_perf.rst +++ b/recipes_source/mobile_perf.rst @@ -1,289 +1,198 @@ -Pytorch Mobile Performance Recipes -================================== +(beta) Efficient mobile interpreter in Android and iOS +================================================================== -Introduction ----------------- -Performance (aka latency) is crucial to most, if not all, -applications and use-cases of ML model inference on mobile devices. - -Today, PyTorch executes the models on the CPU backend pending availability -of other hardware backends such as GPU, DSP, and NPU. - -In this recipe, you will learn: - -- How to optimize your model to help decrease execution time (higher performance, lower latency) on the mobile device. -- How to benchmark (to check if optimizations helped your use case). - - -Model preparation ------------------ - -We will start with preparing to optimize your model to help decrease execution time -(higher performance, lower latency) on the mobile device. - - -Setup -^^^^^^^ - -First we need to installed pytorch using conda or pip with version at least 1.5.0. 
- -:: - - conda install pytorch torchvision -c pytorch - -or - -:: - - pip install torch torchvision - -Code your model: - -:: - - import torch - from torch.utils.mobile_optimizer import optimize_for_mobile - - class AnnotatedConvBnReLUModel(torch.nn.Module): - def __init__(self): - super(AnnotatedConvBnReLUModel, self).__init__() - self.conv = torch.nn.Conv2d(3, 5, 3, bias=False).to(dtype=torch.float) - self.bn = torch.nn.BatchNorm2d(5).to(dtype=torch.float) - self.relu = torch.nn.ReLU(inplace=True) - self.quant = torch.quantization.QuantStub() - self.dequant = torch.quantization.DeQuantStub() - - def forward(self, x): - x.contiguous(memory_format=torch.channels_last) - x = self.quant(x) - x = self.conv(x) - x = self.bn(x) - x = self.relu(x) - x = self.dequant(x) - return x - - model = AnnotatedConvBnReLUModel() - - -``torch.quantization.QuantStub`` and ``torch.quantization.DeQuantStub()`` are no-op stubs, which will be used for quantization step. - - -1. Fuse operators using ``torch.quantization.fuse_modules`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Do not be confused that fuse_modules is in the quantization package. -It works for all ``torch.nn.Module``. - -``torch.quantization.fuse_modules`` fuses a list of modules into a single module. -It fuses only the following sequence of modules: - -- Convolution, Batch normalization -- Convolution, Batch normalization, Relu -- Convolution, Relu -- Linear, Relu - -This script will fuse Convolution, Batch Normalization and Relu in previously declared model. - -:: - - torch.quantization.fuse_modules(model, [['conv', 'bn', 'relu']], inplace=True) +**Author**: `Chen Lai `_, `Martin Yuan `_ +Introduction +------------ -2. Quantize your model -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +This tutorial introduces the steps to use PyTorch's efficient interpreter on iOS and Android. We will be using an Image Segmentation demo application as an example. -You can find more about PyTorch quantization in -`the dedicated tutorial `_. +This application will take advantage of the pre-built interpreter libraries available for Android and iOS, which can be used directly with Maven (Android) and CocoaPods (iOS). It is important to note that the pre-built libraries are the available for simplicity, but further size optimization can be achieved with by utilizing PyTorch's custom build capabilities. -Quantization of the model not only moves computation to int8, -but also reduces the size of your model on a disk. -That size reduction helps to reduce disk read operations during the first load of the model and decreases the amount of RAM. -Both of those resources can be crucial for the performance of mobile applications. -This code does quantization, using stub for model calibration function, you can find more about it `here `__. +.. note:: If you see the error message: `PytorchStreamReader failed locating file bytecode.pkl: file not found ()`, likely you are using a torch script model that requires the use of the PyTorch JIT interpreter (a version of our PyTorch interpreter that is not as size-efficient). In order to leverage our efficient interpreter, please regenerate the model by running: `module._save_for_lite_interpreter(${model_path})`. -:: + - If `bytecode.pkl` is missing, likely the model is generated with the api: `module.save(${model_psth})`. + - The api `_load_for_lite_interpreter(${model_psth})` can be helpful to validate model with the efficient mobile interpreter. 
- model.qconfig = torch.quantization.get_default_qconfig('qnnpack') - torch.quantization.prepare(model, inplace=True) - # Calibrate your model - def calibrate(model, calibration_data): - # Your calibration code here - return - calibrate(model, []) - torch.quantization.convert(model, inplace=True) +Android +------------------- +Get the Image Segmentation demo app in Android: https://github.com/pytorch/android-demo-app/tree/master/ImageSegmentation +1. **Prepare model**: Prepare the mobile interpreter version of model by run the script below to generate the scripted model `deeplabv3_scripted.pt` and `deeplabv3_scripted.ptl` +.. code:: python -3. Use torch.utils.mobile_optimizer -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + import torch + from torch.utils.mobile_optimizer import optimize_for_mobile + model = torch.hub.load('pytorch/vision:v0.7.0', 'deeplabv3_resnet50', pretrained=True) + model.eval() -Torch mobile_optimizer package does several optimizations with the scripted model, -which will help to conv2d and linear operations. -It pre-packs model weights in an optimized format and fuses ops above with relu -if it is the next operation. + scripted_module = torch.jit.script(model) + # Export full jit version model (not compatible mobile interpreter), leave it here for comparison + scripted_module.save("deeplabv3_scripted.pt") + # Export mobile interpreter version model (compatible with mobile interpreter) + optimized_scripted_module = optimize_for_mobile(scripted_module) + optimized_scripted_module._save_for_lite_interpreter("deeplabv3_scripted.ptl") -First we script the result model from previous step: +2. **Use the PyTorch Android library in the ImageSegmentation app**: Update the `dependencies` part of ``ImageSegmentation/app/build.gradle`` to -:: +.. code:: gradle - torchscript_model = torch.jit.script(model) + repositories { + maven { + url "https://oss.sonatype.org/content/repositories/snapshots" + } + } -Next we call ``optimize_for_mobile`` and save model on the disk. + dependencies { + implementation 'androidx.appcompat:appcompat:1.2.0' + implementation 'androidx.constraintlayout:constraintlayout:2.0.2' + testImplementation 'junit:junit:4.12' + androidTestImplementation 'androidx.test.ext:junit:1.1.2' + androidTestImplementation 'androidx.test.espresso:espresso-core:3.3.0' + implementation 'org.pytorch:pytorch_android_lite:1.9.0' + implementation 'org.pytorch:pytorch_android_torchvision:1.9.0' -:: + implementation 'com.facebook.fbjni:fbjni-java-only:0.0.3' + } - torchscript_model_optimized = optimize_for_mobile(torchscript_model) - torch.jit.save(torchscript_model_optimized, "model.pt") -4. Prefer Using Channels Last Tensor memory format -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Channels Last(NHWC) memory format was introduced in PyTorch 1.4.0. It is supported only for four-dimensional tensors. This memory format gives a better memory locality for most operators, especially convolution. Our measurements showed a 3x speedup of MobileNetV2 model compared with the default Channels First(NCHW) format. +3. **Update model loader api**: Update ``ImageSegmentation/app/src/main/java/org/pytorch/imagesegmentation/MainActivity.java`` by -At the moment of writing this recipe, PyTorch Android java API does not support using inputs in Channels Last memory format. But it can be used on the TorchScript model level, by adding the conversion to it for model inputs. + 4.1 Add new import: `import org.pytorch.LiteModuleLoader` -.. 
code-block:: python + 4.2 Replace the way to load pytorch lite model - def forward(self, x): - x.contiguous(memory_format=torch.channels_last) - ... +.. code:: java + // mModule = Module.load(MainActivity.assetFilePath(getApplicationContext(), "deeplabv3_scripted.pt")); + mModule = LiteModuleLoader.load(MainActivity.assetFilePath(getApplicationContext(), "deeplabv3_scripted.ptl")); -This conversion is zero cost if your input is already in Channels Last memory format. After it, all operators will work preserving ChannelsLast memory format. +4. **Test app**: Build and run the `ImageSegmentation` app in Android Studio -5. Android - Reusing tensors for forward -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +iOS +------------------- +Get ImageSegmentation demo app in iOS: https://github.com/pytorch/ios-demo-app/tree/master/ImageSegmentation -This part of the recipe is Android only. +1. **Prepare model**: Same as Android. -Memory is a critical resource for android performance, especially on old devices. -Tensors can need a significant amount of memory. -For example, standard computer vision tensor contains 1*3*224*224 elements, -assuming that data type is float and will need 588Kb of memory. +2. **Build the project with Cocoapods and prebuilt interpreter** Update the `PodFile` and run `pod install`: -:: +.. code-block:: podfile - FloatBuffer buffer = Tensor.allocateFloatBuffer(1*3*224*224); - Tensor tensor = Tensor.fromBlob(buffer, new long[]{1, 3, 224, 224}); + target 'ImageSegmentation' do + # Comment the next line if you don't want to use dynamic frameworks + use_frameworks! + # Pods for ImageSegmentation + pod 'LibTorch_Lite', '~>1.9.0' + end -Here we allocate native memory as ``java.nio.FloatBuffer`` and creating ``org.pytorch.Tensor`` which storage will be pointing to the memory of the allocated buffer. +3. **Update library and API** -For most of the use cases, we do not do model forward only once, repeating it with some frequency or as fast as possible. + 3.1 Update ``TorchModule.mm``: To use the custom built libraries project, use `` (in ``TorchModule.mm``): -If we are doing new memory allocation for every module forward - that will be suboptimal. -Instead of this, we can reuse the same memory that we allocated on the previous step, fill it with new data, and run module forward again on the same tensor object. +.. code-block:: swift -You can check how it looks in code in `pytorch android application example `_. + #import + // If it's built from source with xcode, comment out the line above + // and use following headers + // #include + // #include + // #include -:: +.. 
code-block:: swift - protected AnalysisResult analyzeImage(ImageProxy image, int rotationDegrees) { - if (mModule == null) { - mModule = Module.load(moduleFileAbsoluteFilePath); - mInputTensorBuffer = - Tensor.allocateFloatBuffer(3 * 224 * 224); - mInputTensor = Tensor.fromBlob(mInputTensorBuffer, new long[]{1, 3, 224, 224}); + @implementation TorchModule { + @protected + // torch::jit::script::Module _impl; + torch::jit::mobile::Module _impl; } - TensorImageUtils.imageYUV420CenterCropToFloatBuffer( - image.getImage(), rotationDegrees, - 224, 224, - TensorImageUtils.TORCHVISION_NORM_MEAN_RGB, - TensorImageUtils.TORCHVISION_NORM_STD_RGB, - mInputTensorBuffer, 0); - - Tensor outputTensor = mModule.forward(IValue.from(mInputTensor)).toTensor(); - } - -Member fields ``mModule``, ``mInputTensorBuffer`` and ``mInputTensor`` are initialized only once -and buffer is refilled using ``org.pytorch.torchvision.TensorImageUtils.imageYUV420CenterCropToFloatBuffer``. - -Benchmarking ------------- - -The best way to benchmark (to check if optimizations helped your use case) - is to measure your particular use case that you want to optimize, as performance behavior can vary in different environments. - -PyTorch distribution provides a way to benchmark naked binary that runs the model forward, -this approach can give more stable measurements rather than testing inside the application. - - -Android - Benchmarking Setup -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -This part of the recipe is Android only. - -For this you first need to build benchmark binary: - -:: - - - rm -rf build_android - BUILD_PYTORCH_MOBILE=1 ANDROID_ABI=arm64-v8a ./scripts/build_android.sh -DBUILD_BINARY=ON - -You should have arm64 binary at: ``build_android/bin/speed_benchmark_torch``. -This binary takes ``--model=``, ``--input_dim="1,3,224,224"`` as dimension information for the input and ``--input_type="float"`` as the type of the input as arguments. - -Once you have your android device connected, -push speedbenchark_torch binary and your model to the phone: + - (nullable instancetype)initWithFileAtPath:(NSString*)filePath { + self = [super init]; + if (self) { + try { + _impl = torch::jit::_load_for_mobile(filePath.UTF8String); + // _impl = torch::jit::load(filePath.UTF8String); + // _impl.eval(); + } catch (const std::exception& exception) { + NSLog(@"%s", exception.what()); + return nil; + } + } + return self; + } -:: +3.2 Update ``ViewController.swift`` + +.. code-block:: swift + + // if let filePath = Bundle.main.path(forResource: + // "deeplabv3_scripted", ofType: "pt"), + // let module = TorchModule(fileAtPath: filePath) { + // return module + // } else { + // fatalError("Can't find the model file!") + // } + if let filePath = Bundle.main.path(forResource: + "deeplabv3_scripted", ofType: "ptl"), + let module = TorchModule(fileAtPath: filePath) { + return module + } else { + fatalError("Can't find the model file!") + } - adb push /data/local/tmp - adb push /data/local/tmp +4. Build and test the app in Xcode. +How to use mobile interpreter + custom build +------------------------------------------ +A custom PyTorch interpreter library can be created to reduce binary size, by only containing the operators needed by the model. In order to do that follow these steps: -Now we are ready to benchmark your model: +1. To dump the operators in your model, say `deeplabv3_scripted`, run the following lines of Python code: -:: +.. 
code-block:: python - adb shell "/data/local/tmp/speed_benchmark_torch --model=/data/local/tmp/model.pt" --input_dims="1,3,224,224" --input_type="float" - ----- output ----- - Starting benchmark. - Running warmup runs. - Main runs. - Main run finished. Microseconds per iter: 121318. Iters per second: 8.24281 + # Dump list of operators used by deeplabv3_scripted: + import torch, yaml + model = torch.jit.load('deeplabv3_scripted.ptl') + ops = torch.jit.export_opnames(model) + with open('deeplabv3_scripted.yaml', 'w') as output: + yaml.dump(ops, output) +In the snippet above, you first need to load the ScriptModule. Then, use export_opnames to return a list of operator names of the ScriptModule and its submodules. Lastly, save the result in a yaml file. The yaml file can be generated for any PyTorch 1.4.0 or above version. You can do that by checking the value of `torch.__version__`. -iOS - Benchmarking Setup -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +2. To run the build script locally with the prepared yaml list of operators, pass in the yaml file generate from the last step into the environment variable SELECTED_OP_LIST. Also in the arguments, specify BUILD_PYTORCH_MOBILE=1 as well as the platform/architechture type. -For iOS, we'll be using our `TestApp `_ as the benchmarking tool. +**iOS**: Take the simulator build for example, the command should be: -To begin with, let's apply the ``optimize_for_mobile`` method to our python script located at `TestApp/benchmark/trace_model.py `_. Simply modify the code as below. +.. code-block:: bash -:: + SELECTED_OP_LIST=deeplabv3_scripted.yaml BUILD_PYTORCH_MOBILE=1 IOS_PLATFORM=SIMULATOR ./scripts/build_ios.sh - import torch - import torchvision - from torch.utils.mobile_optimizer import optimize_for_mobile +**Android**: Take the x86 build for example, the command should be: - model = torchvision.models.mobilenet_v2(pretrained=True) - model.eval() - example = torch.rand(1, 3, 224, 224) - traced_script_module = torch.jit.trace(model, example) - torchscript_model_optimized = optimize_for_mobile(traced_script_module) - torch.jit.save(torchscript_model_optimized, "model.pt") +.. code-block:: bash -Now let's run ``python trace_model.py``. If everything works well, we should be able to generate our optimized model in the benchmark directory. + SELECTED_OP_LIST=deeplabv3_scripted.yaml ./scripts/build_pytorch_android.sh x86 -Next, we're going to build the PyTorch libraries from source. -:: - BUILD_PYTORCH_MOBILE=1 IOS_ARCH=arm64 ./scripts/build_ios.sh +Conclusion +---------- -Now that we have the optimized model and PyTorch ready, it's time to generate our XCode project and do benchmarking. To do that, we'll be using a ruby script - `setup.rb` which does the heavy lifting jobs of setting up the XCode project. +In this tutorial, we demonstrated how to use PyTorch's efficient mobile interpreter, in an Android and iOS app. -:: +We walked through an Image Segmentation example to show how to dump the model, build a custom torch library from source and use the new api to run model. - ruby setup.rb +Our efficient mobile interpreter is still under development, and we will continue improving its size in the future. Note, however, that the APIs are subject to change in future versions. -Now open the `TestApp.xcodeproj` and plug in your iPhone, you're ready to go. Below is an example result from iPhoneX +Thanks for reading! As always, we welcome any feedback, so please create an issue `here ` - if you have any. 
-:: +Learn More +---------- - TestApp[2121:722447] Main runs - TestApp[2121:722447] Main run finished. Milliseconds per iter: 28.767 - TestApp[2121:722447] Iters per second: : 34.762 - TestApp[2121:722447] Done. +- To learn more about PyTorch Mobile, please refer to `PyTorch Mobile Home Page `_ +- To learn more about Image Segmentation, please refer to the `Image Segmentation DeepLabV3 on Android Recipe `_ From 4c8107aba9fdf3224efca125760411399ab0fc07 Mon Sep 17 00:00:00 2001 From: Chen Lai Date: Wed, 9 Jun 2021 13:05:12 -0700 Subject: [PATCH 6/6] revert the overwrite and move mobile interpreter to recipes --- recipes_source/mobile_interpreter.rst | 198 ++++++++++++++ recipes_source/mobile_perf.rst | 365 ++++++++++++++++---------- 2 files changed, 426 insertions(+), 137 deletions(-) create mode 100644 recipes_source/mobile_interpreter.rst diff --git a/recipes_source/mobile_interpreter.rst b/recipes_source/mobile_interpreter.rst new file mode 100644 index 00000000000..ec23b28100e --- /dev/null +++ b/recipes_source/mobile_interpreter.rst @@ -0,0 +1,198 @@ +(beta) Efficient mobile interpreter in Android and iOS +================================================================== + +**Author**: `Chen Lai `_, `Martin Yuan `_ + +Introduction +------------ + +This tutorial introduces the steps to use PyTorch's efficient interpreter on iOS and Android. We will be using an Image Segmentation demo application as an example. + +This application will take advantage of the pre-built interpreter libraries available for Android and iOS, which can be used directly with Maven (Android) and CocoaPods (iOS). It is important to note that the pre-built libraries are the available for simplicity, but further size optimization can be achieved with by utilizing PyTorch's custom build capabilities. + +.. note:: If you see the error message: `PytorchStreamReader failed locating file bytecode.pkl: file not found ()`, likely you are using a torch script model that requires the use of the PyTorch JIT interpreter (a version of our PyTorch interpreter that is not as size-efficient). In order to leverage our efficient interpreter, please regenerate the model by running: `module._save_for_lite_interpreter(${model_path})`. + + - If `bytecode.pkl` is missing, likely the model is generated with the api: `module.save(${model_psth})`. + - The api `_load_for_lite_interpreter(${model_psth})` can be helpful to validate model with the efficient mobile interpreter. + +Android +------------------- +Get the Image Segmentation demo app in Android: https://github.com/pytorch/android-demo-app/tree/master/ImageSegmentation + +1. **Prepare model**: Prepare the mobile interpreter version of model by run the script below to generate the scripted model `deeplabv3_scripted.pt` and `deeplabv3_scripted.ptl` + +.. code:: python + + import torch + from torch.utils.mobile_optimizer import optimize_for_mobile + model = torch.hub.load('pytorch/vision:v0.7.0', 'deeplabv3_resnet50', pretrained=True) + model.eval() + + scripted_module = torch.jit.script(model) + # Export full jit version model (not compatible mobile interpreter), leave it here for comparison + scripted_module.save("deeplabv3_scripted.pt") + # Export mobile interpreter version model (compatible with mobile interpreter) + optimized_scripted_module = optimize_for_mobile(scripted_module) + optimized_scripted_module._save_for_lite_interpreter("deeplabv3_scripted.ptl") + +2. 
**Use the PyTorch Android library in the ImageSegmentation app**: Update the `dependencies` part of ``ImageSegmentation/app/build.gradle`` to
+
+.. code:: gradle
+
+    repositories {
+        maven {
+            url "https://oss.sonatype.org/content/repositories/snapshots"
+        }
+    }
+
+    dependencies {
+        implementation 'androidx.appcompat:appcompat:1.2.0'
+        implementation 'androidx.constraintlayout:constraintlayout:2.0.2'
+        testImplementation 'junit:junit:4.12'
+        androidTestImplementation 'androidx.test.ext:junit:1.1.2'
+        androidTestImplementation 'androidx.test.espresso:espresso-core:3.3.0'
+        implementation 'org.pytorch:pytorch_android_lite:1.9.0'
+        implementation 'org.pytorch:pytorch_android_torchvision:1.9.0'
+
+        implementation 'com.facebook.fbjni:fbjni-java-only:0.0.3'
+    }
+
+3. **Update the model loader API**: Update ``ImageSegmentation/app/src/main/java/org/pytorch/imagesegmentation/MainActivity.java`` by
+
+    3.1 Adding the new import: `import org.pytorch.LiteModuleLoader`
+
+    3.2 Replacing the way the PyTorch lite model is loaded
+
+.. code:: java
+
+    // mModule = Module.load(MainActivity.assetFilePath(getApplicationContext(), "deeplabv3_scripted.pt"));
+    mModule = LiteModuleLoader.load(MainActivity.assetFilePath(getApplicationContext(), "deeplabv3_scripted.ptl"));
+
+4. **Test app**: Build and run the `ImageSegmentation` app in Android Studio.
+
+iOS
+-------------------
+Get the ImageSegmentation demo app for iOS: https://github.com/pytorch/ios-demo-app/tree/master/ImageSegmentation
+
+1. **Prepare model**: Same as Android.
+
+2. **Build the project with CocoaPods and the prebuilt interpreter**: Update the `Podfile` and run `pod install`:
+
+.. code-block:: podfile
+
+    target 'ImageSegmentation' do
+      # Comment the next line if you don't want to use dynamic frameworks
+      use_frameworks!
+
+      # Pods for ImageSegmentation
+      pod 'LibTorch_Lite', '~>1.9.0'
+    end
+
+3. **Update library and API**
+
+    3.1 Update ``TorchModule.mm``: To use the prebuilt libraries from CocoaPods, use the import below in ``TorchModule.mm``:
+
+.. code-block:: swift
+
+    #import
+    // If it's built from source with xcode, comment out the line above
+    // and use following headers
+    // #include
+    // #include
+    // #include
+
+.. code-block:: swift
+
+    @implementation TorchModule {
+    @protected
+      // torch::jit::script::Module _impl;
+      torch::jit::mobile::Module _impl;
+    }
+
+    - (nullable instancetype)initWithFileAtPath:(NSString*)filePath {
+      self = [super init];
+      if (self) {
+        try {
+          _impl = torch::jit::_load_for_mobile(filePath.UTF8String);
+          // _impl = torch::jit::load(filePath.UTF8String);
+          // _impl.eval();
+        } catch (const std::exception& exception) {
+          NSLog(@"%s", exception.what());
+          return nil;
+        }
+      }
+      return self;
+    }
+
+    3.2 Update ``ViewController.swift``:
+
+.. code-block:: swift
+
+    // if let filePath = Bundle.main.path(forResource:
+    //     "deeplabv3_scripted", ofType: "pt"),
+    //     let module = TorchModule(fileAtPath: filePath) {
+    //     return module
+    // } else {
+    //     fatalError("Can't find the model file!")
+    // }
+    if let filePath = Bundle.main.path(forResource:
+        "deeplabv3_scripted", ofType: "ptl"),
+        let module = TorchModule(fileAtPath: filePath) {
+        return module
+    } else {
+        fatalError("Can't find the model file!")
+    }
+
+4. Build and test the app in Xcode.
+
+How to use mobile interpreter + custom build
+----------------------------------------------
+A custom PyTorch interpreter library can be created to reduce binary size by including only the operators needed by the model. To do that, follow these steps:
+
+1.
To dump the operators in your model, say `deeplabv3_scripted`, run the following lines of Python code: + +.. code-block:: python + + # Dump list of operators used by deeplabv3_scripted: + import torch, yaml + model = torch.jit.load('deeplabv3_scripted.ptl') + ops = torch.jit.export_opnames(model) + with open('deeplabv3_scripted.yaml', 'w') as output: + yaml.dump(ops, output) + +In the snippet above, you first need to load the ScriptModule. Then, use export_opnames to return a list of operator names of the ScriptModule and its submodules. Lastly, save the result in a yaml file. The yaml file can be generated for any PyTorch 1.4.0 or above version. You can do that by checking the value of `torch.__version__`. + +2. To run the build script locally with the prepared yaml list of operators, pass in the yaml file generate from the last step into the environment variable SELECTED_OP_LIST. Also in the arguments, specify BUILD_PYTORCH_MOBILE=1 as well as the platform/architechture type. + +**iOS**: Take the simulator build for example, the command should be: + +.. code-block:: bash + + SELECTED_OP_LIST=deeplabv3_scripted.yaml BUILD_PYTORCH_MOBILE=1 IOS_PLATFORM=SIMULATOR ./scripts/build_ios.sh + +**Android**: Take the x86 build for example, the command should be: + +.. code-block:: bash + + SELECTED_OP_LIST=deeplabv3_scripted.yaml ./scripts/build_pytorch_android.sh x86 + + + +Conclusion +---------- + +In this tutorial, we demonstrated how to use PyTorch's efficient mobile interpreter, in an Android and iOS app. + +We walked through an Image Segmentation example to show how to dump the model, build a custom torch library from source and use the new api to run model. + +Our efficient mobile interpreter is still under development, and we will continue improving its size in the future. Note, however, that the APIs are subject to change in future versions. + +Thanks for reading! As always, we welcome any feedback, so please create an issue `here ` - if you have any. + +Learn More +---------- + +- To learn more about PyTorch Mobile, please refer to `PyTorch Mobile Home Page `_ +- To learn more about Image Segmentation, please refer to the `Image Segmentation DeepLabV3 on Android Recipe `_ diff --git a/recipes_source/mobile_perf.rst b/recipes_source/mobile_perf.rst index ec23b28100e..55d992148a3 100644 --- a/recipes_source/mobile_perf.rst +++ b/recipes_source/mobile_perf.rst @@ -1,198 +1,289 @@ -(beta) Efficient mobile interpreter in Android and iOS -================================================================== - -**Author**: `Chen Lai `_, `Martin Yuan `_ +Pytorch Mobile Performance Recipes +================================== Introduction ------------- +---------------- +Performance (aka latency) is crucial to most, if not all, +applications and use-cases of ML model inference on mobile devices. -This tutorial introduces the steps to use PyTorch's efficient interpreter on iOS and Android. We will be using an Image Segmentation demo application as an example. +Today, PyTorch executes the models on the CPU backend pending availability +of other hardware backends such as GPU, DSP, and NPU. -This application will take advantage of the pre-built interpreter libraries available for Android and iOS, which can be used directly with Maven (Android) and CocoaPods (iOS). It is important to note that the pre-built libraries are the available for simplicity, but further size optimization can be achieved with by utilizing PyTorch's custom build capabilities. +In this recipe, you will learn: -.. 
note:: If you see the error message: `PytorchStreamReader failed locating file bytecode.pkl: file not found ()`, likely you are using a torch script model that requires the use of the PyTorch JIT interpreter (a version of our PyTorch interpreter that is not as size-efficient). In order to leverage our efficient interpreter, please regenerate the model by running: `module._save_for_lite_interpreter(${model_path})`. +- How to optimize your model to help decrease execution time (higher performance, lower latency) on the mobile device. +- How to benchmark (to check if optimizations helped your use case). - - If `bytecode.pkl` is missing, likely the model is generated with the api: `module.save(${model_psth})`. - - The api `_load_for_lite_interpreter(${model_psth})` can be helpful to validate model with the efficient mobile interpreter. -Android -------------------- -Get the Image Segmentation demo app in Android: https://github.com/pytorch/android-demo-app/tree/master/ImageSegmentation +Model preparation +----------------- -1. **Prepare model**: Prepare the mobile interpreter version of model by run the script below to generate the scripted model `deeplabv3_scripted.pt` and `deeplabv3_scripted.ptl` +We will start with preparing to optimize your model to help decrease execution time +(higher performance, lower latency) on the mobile device. -.. code:: python - import torch - from torch.utils.mobile_optimizer import optimize_for_mobile - model = torch.hub.load('pytorch/vision:v0.7.0', 'deeplabv3_resnet50', pretrained=True) - model.eval() +Setup +^^^^^^^ - scripted_module = torch.jit.script(model) - # Export full jit version model (not compatible mobile interpreter), leave it here for comparison - scripted_module.save("deeplabv3_scripted.pt") - # Export mobile interpreter version model (compatible with mobile interpreter) - optimized_scripted_module = optimize_for_mobile(scripted_module) - optimized_scripted_module._save_for_lite_interpreter("deeplabv3_scripted.ptl") +First we need to installed pytorch using conda or pip with version at least 1.5.0. -2. **Use the PyTorch Android library in the ImageSegmentation app**: Update the `dependencies` part of ``ImageSegmentation/app/build.gradle`` to +:: -.. code:: gradle + conda install pytorch torchvision -c pytorch - repositories { - maven { - url "https://oss.sonatype.org/content/repositories/snapshots" - } - } +or - dependencies { - implementation 'androidx.appcompat:appcompat:1.2.0' - implementation 'androidx.constraintlayout:constraintlayout:2.0.2' - testImplementation 'junit:junit:4.12' - androidTestImplementation 'androidx.test.ext:junit:1.1.2' - androidTestImplementation 'androidx.test.espresso:espresso-core:3.3.0' - implementation 'org.pytorch:pytorch_android_lite:1.9.0' - implementation 'org.pytorch:pytorch_android_torchvision:1.9.0' +:: - implementation 'com.facebook.fbjni:fbjni-java-only:0.0.3' - } + pip install torch torchvision +Code your model: +:: -3. 
**Update model loader api**: Update ``ImageSegmentation/app/src/main/java/org/pytorch/imagesegmentation/MainActivity.java`` by + import torch + from torch.utils.mobile_optimizer import optimize_for_mobile - 4.1 Add new import: `import org.pytorch.LiteModuleLoader` + class AnnotatedConvBnReLUModel(torch.nn.Module): + def __init__(self): + super(AnnotatedConvBnReLUModel, self).__init__() + self.conv = torch.nn.Conv2d(3, 5, 3, bias=False).to(dtype=torch.float) + self.bn = torch.nn.BatchNorm2d(5).to(dtype=torch.float) + self.relu = torch.nn.ReLU(inplace=True) + self.quant = torch.quantization.QuantStub() + self.dequant = torch.quantization.DeQuantStub() - 4.2 Replace the way to load pytorch lite model + def forward(self, x): + x.contiguous(memory_format=torch.channels_last) + x = self.quant(x) + x = self.conv(x) + x = self.bn(x) + x = self.relu(x) + x = self.dequant(x) + return x -.. code:: java + model = AnnotatedConvBnReLUModel() - // mModule = Module.load(MainActivity.assetFilePath(getApplicationContext(), "deeplabv3_scripted.pt")); - mModule = LiteModuleLoader.load(MainActivity.assetFilePath(getApplicationContext(), "deeplabv3_scripted.ptl")); -4. **Test app**: Build and run the `ImageSegmentation` app in Android Studio +``torch.quantization.QuantStub`` and ``torch.quantization.DeQuantStub()`` are no-op stubs, which will be used for quantization step. -iOS -------------------- -Get ImageSegmentation demo app in iOS: https://github.com/pytorch/ios-demo-app/tree/master/ImageSegmentation -1. **Prepare model**: Same as Android. +1. Fuse operators using ``torch.quantization.fuse_modules`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -2. **Build the project with Cocoapods and prebuilt interpreter** Update the `PodFile` and run `pod install`: +Do not be confused that fuse_modules is in the quantization package. +It works for all ``torch.nn.Module``. -.. code-block:: podfile +``torch.quantization.fuse_modules`` fuses a list of modules into a single module. +It fuses only the following sequence of modules: - target 'ImageSegmentation' do - # Comment the next line if you don't want to use dynamic frameworks - use_frameworks! +- Convolution, Batch normalization +- Convolution, Batch normalization, Relu +- Convolution, Relu +- Linear, Relu - # Pods for ImageSegmentation - pod 'LibTorch_Lite', '~>1.9.0' - end +This script will fuse Convolution, Batch Normalization and Relu in previously declared model. -3. **Update library and API** +:: - 3.1 Update ``TorchModule.mm``: To use the custom built libraries project, use `` (in ``TorchModule.mm``): + torch.quantization.fuse_modules(model, [['conv', 'bn', 'relu']], inplace=True) -.. code-block:: swift - #import - // If it's built from source with xcode, comment out the line above - // and use following headers - // #include - // #include - // #include +2. Quantize your model +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -.. code-block:: swift +You can find more about PyTorch quantization in +`the dedicated tutorial `_. - @implementation TorchModule { - @protected - // torch::jit::script::Module _impl; - torch::jit::mobile::Module _impl; - } +Quantization of the model not only moves computation to int8, +but also reduces the size of your model on a disk. +That size reduction helps to reduce disk read operations during the first load of the model and decreases the amount of RAM. +Both of those resources can be crucial for the performance of mobile applications. 
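To see that on-disk size reduction concretely for your own model, a quick comparison like the following can help. This is a minimal sketch, not part of the original recipe; it assumes you have already run the quantization steps shown next and saved both the scripted float model and the scripted quantized model with ``torch.jit.save`` under the hypothetical names ``model_fp32.pt`` and ``model_int8.pt``:

.. code-block:: python

    import os

    def report_size(label, path):
        # Print the on-disk size of a saved TorchScript file in megabytes.
        size_mb = os.path.getsize(path) / 1e6
        print(f"{label}: {size_mb:.2f} MB")

    # Hypothetical file names; substitute whatever paths you saved your
    # float and quantized TorchScript models to.
    report_size("float32 model", "model_fp32.pt")
    report_size("int8 (quantized) model", "model_int8.pt")

The quantization steps themselves follow below.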
+This code does quantization, using stub for model calibration function, you can find more about it `here `__. - - (nullable instancetype)initWithFileAtPath:(NSString*)filePath { - self = [super init]; - if (self) { - try { - _impl = torch::jit::_load_for_mobile(filePath.UTF8String); - // _impl = torch::jit::load(filePath.UTF8String); - // _impl.eval(); - } catch (const std::exception& exception) { - NSLog(@"%s", exception.what()); - return nil; - } - } - return self; - } +:: -3.2 Update ``ViewController.swift`` - -.. code-block:: swift - - // if let filePath = Bundle.main.path(forResource: - // "deeplabv3_scripted", ofType: "pt"), - // let module = TorchModule(fileAtPath: filePath) { - // return module - // } else { - // fatalError("Can't find the model file!") - // } - if let filePath = Bundle.main.path(forResource: - "deeplabv3_scripted", ofType: "ptl"), - let module = TorchModule(fileAtPath: filePath) { - return module - } else { - fatalError("Can't find the model file!") - } + model.qconfig = torch.quantization.get_default_qconfig('qnnpack') + torch.quantization.prepare(model, inplace=True) + # Calibrate your model + def calibrate(model, calibration_data): + # Your calibration code here + return + calibrate(model, []) + torch.quantization.convert(model, inplace=True) + + + +3. Use torch.utils.mobile_optimizer +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Torch mobile_optimizer package does several optimizations with the scripted model, +which will help to conv2d and linear operations. +It pre-packs model weights in an optimized format and fuses ops above with relu +if it is the next operation. + +First we script the result model from previous step: + +:: + + torchscript_model = torch.jit.script(model) -4. Build and test the app in Xcode. +Next we call ``optimize_for_mobile`` and save model on the disk. -How to use mobile interpreter + custom build ------------------------------------------- -A custom PyTorch interpreter library can be created to reduce binary size, by only containing the operators needed by the model. In order to do that follow these steps: +:: -1. To dump the operators in your model, say `deeplabv3_scripted`, run the following lines of Python code: + torchscript_model_optimized = optimize_for_mobile(torchscript_model) + torch.jit.save(torchscript_model_optimized, "model.pt") + +4. Prefer Using Channels Last Tensor memory format +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Channels Last(NHWC) memory format was introduced in PyTorch 1.4.0. It is supported only for four-dimensional tensors. This memory format gives a better memory locality for most operators, especially convolution. Our measurements showed a 3x speedup of MobileNetV2 model compared with the default Channels First(NCHW) format. + +At the moment of writing this recipe, PyTorch Android java API does not support using inputs in Channels Last memory format. But it can be used on the TorchScript model level, by adding the conversion to it for model inputs. .. code-block:: python - # Dump list of operators used by deeplabv3_scripted: - import torch, yaml - model = torch.jit.load('deeplabv3_scripted.ptl') - ops = torch.jit.export_opnames(model) - with open('deeplabv3_scripted.yaml', 'w') as output: - yaml.dump(ops, output) + def forward(self, x): + x.contiguous(memory_format=torch.channels_last) + ... + + +This conversion is zero cost if your input is already in Channels Last memory format. After it, all operators will work preserving ChannelsLast memory format. + +5. 
Android - Reusing tensors for forward
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This part of the recipe is Android only.
+
+Memory is a critical resource for Android performance, especially on older devices.
+Tensors can require a significant amount of memory.
+For example, a standard computer vision tensor contains 1*3*224*224 elements;
+assuming the data type is float, it will need 588 KB of memory.
+
+::
+
+    FloatBuffer buffer = Tensor.allocateFloatBuffer(1*3*224*224);
+    Tensor tensor = Tensor.fromBlob(buffer, new long[]{1, 3, 224, 224});
+
+
+Here we allocate native memory as ``java.nio.FloatBuffer`` and create an ``org.pytorch.Tensor`` whose storage will point to the memory of the allocated buffer.
+
+For most use cases, we do not run the model forward only once; we repeat it with some frequency or as fast as possible.
+
+Allocating new memory for every module forward would be suboptimal.
+Instead, we can reuse the memory that we allocated in the previous step, fill it with new data, and run module forward again on the same tensor object.
+
+You can check how this looks in code in the `pytorch android application example `_.
+
+::
+
+    protected AnalysisResult analyzeImage(ImageProxy image, int rotationDegrees) {
+      if (mModule == null) {
+        mModule = Module.load(moduleFileAbsoluteFilePath);
+        mInputTensorBuffer =
+            Tensor.allocateFloatBuffer(3 * 224 * 224);
+        mInputTensor = Tensor.fromBlob(mInputTensorBuffer, new long[]{1, 3, 224, 224});
+      }
+
+      TensorImageUtils.imageYUV420CenterCropToFloatBuffer(
+          image.getImage(), rotationDegrees,
+          224, 224,
+          TensorImageUtils.TORCHVISION_NORM_MEAN_RGB,
+          TensorImageUtils.TORCHVISION_NORM_STD_RGB,
+          mInputTensorBuffer, 0);
+
+      Tensor outputTensor = mModule.forward(IValue.from(mInputTensor)).toTensor();
+    }
+
+Member fields ``mModule``, ``mInputTensorBuffer`` and ``mInputTensor`` are initialized only once,
+and the buffer is refilled using ``org.pytorch.torchvision.TensorImageUtils.imageYUV420CenterCropToFloatBuffer``.
+
+Benchmarking
+------------
+
+The best way to benchmark (to check whether the optimizations helped your use case) is to measure the particular use case you want to optimize, as performance behavior can vary between environments.
+
+The PyTorch distribution provides a way to benchmark a bare binary that runs the model forward;
+this approach can give more stable measurements than testing inside the application.
+
+
+Android - Benchmarking Setup
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This part of the recipe is Android only.
+
+For this, you first need to build the benchmark binary:
+
+::
+
+    rm -rf build_android
+    BUILD_PYTORCH_MOBILE=1 ANDROID_ABI=arm64-v8a ./scripts/build_android.sh -DBUILD_BINARY=ON
+
+You should then have the arm64 binary at ``build_android/bin/speed_benchmark_torch``.
+This binary takes ``--model=``, ``--input_dims="1,3,224,224"`` as dimension information for the input, and ``--input_type="float"`` as the type of the input as arguments.
+
+Once your Android device is connected,
+push the speed_benchmark_torch binary and your model to the phone:
+
+::
+
+    adb push /data/local/tmp
+    adb push /data/local/tmp
+
+
+Now we are ready to benchmark your model:
+
+::
+
+    adb shell "/data/local/tmp/speed_benchmark_torch --model=/data/local/tmp/model.pt --input_dims=1,3,224,224 --input_type=float"
+    ----- output -----
+    Starting benchmark.
+    Running warmup runs.
+    Main runs.
+    Main run finished. Microseconds per iter: 121318.
Iters per second: 8.24281 -In the snippet above, you first need to load the ScriptModule. Then, use export_opnames to return a list of operator names of the ScriptModule and its submodules. Lastly, save the result in a yaml file. The yaml file can be generated for any PyTorch 1.4.0 or above version. You can do that by checking the value of `torch.__version__`. -2. To run the build script locally with the prepared yaml list of operators, pass in the yaml file generate from the last step into the environment variable SELECTED_OP_LIST. Also in the arguments, specify BUILD_PYTORCH_MOBILE=1 as well as the platform/architechture type. +iOS - Benchmarking Setup +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -**iOS**: Take the simulator build for example, the command should be: +For iOS, we'll be using our `TestApp `_ as the benchmarking tool. -.. code-block:: bash +To begin with, let's apply the ``optimize_for_mobile`` method to our python script located at `TestApp/benchmark/trace_model.py `_. Simply modify the code as below. - SELECTED_OP_LIST=deeplabv3_scripted.yaml BUILD_PYTORCH_MOBILE=1 IOS_PLATFORM=SIMULATOR ./scripts/build_ios.sh +:: -**Android**: Take the x86 build for example, the command should be: + import torch + import torchvision + from torch.utils.mobile_optimizer import optimize_for_mobile -.. code-block:: bash + model = torchvision.models.mobilenet_v2(pretrained=True) + model.eval() + example = torch.rand(1, 3, 224, 224) + traced_script_module = torch.jit.trace(model, example) + torchscript_model_optimized = optimize_for_mobile(traced_script_module) + torch.jit.save(torchscript_model_optimized, "model.pt") - SELECTED_OP_LIST=deeplabv3_scripted.yaml ./scripts/build_pytorch_android.sh x86 +Now let's run ``python trace_model.py``. If everything works well, we should be able to generate our optimized model in the benchmark directory. +Next, we're going to build the PyTorch libraries from source. +:: -Conclusion ----------- + BUILD_PYTORCH_MOBILE=1 IOS_ARCH=arm64 ./scripts/build_ios.sh -In this tutorial, we demonstrated how to use PyTorch's efficient mobile interpreter, in an Android and iOS app. +Now that we have the optimized model and PyTorch ready, it's time to generate our XCode project and do benchmarking. To do that, we'll be using a ruby script - `setup.rb` which does the heavy lifting jobs of setting up the XCode project. -We walked through an Image Segmentation example to show how to dump the model, build a custom torch library from source and use the new api to run model. +:: -Our efficient mobile interpreter is still under development, and we will continue improving its size in the future. Note, however, that the APIs are subject to change in future versions. + ruby setup.rb -Thanks for reading! As always, we welcome any feedback, so please create an issue `here ` - if you have any. +Now open the `TestApp.xcodeproj` and plug in your iPhone, you're ready to go. Below is an example result from iPhoneX -Learn More ----------- +:: -- To learn more about PyTorch Mobile, please refer to `PyTorch Mobile Home Page `_ -- To learn more about Image Segmentation, please refer to the `Image Segmentation DeepLabV3 on Android Recipe `_ + TestApp[2121:722447] Main runs + TestApp[2121:722447] Main run finished. Milliseconds per iter: 28.767 + TestApp[2121:722447] Iters per second: : 34.762 + TestApp[2121:722447] Done.
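As a rough cross-check of the on-device numbers, you can also time the optimized TorchScript model on your development machine with ``torch.utils.benchmark``. This is a minimal sketch that is not part of the original recipe: desktop CPU timings will not match mobile latencies, so use it only to confirm the model runs and to compare relative changes between model variants. It assumes the ``model.pt`` produced by ``trace_model.py`` is in the current directory:

.. code-block:: python

    import torch
    from torch.utils.benchmark import Timer

    # Load the optimized TorchScript model saved by trace_model.py.
    model = torch.jit.load("model.pt")
    model.eval()

    example = torch.rand(1, 3, 224, 224)

    # Time the forward pass on the host CPU; Timer runs the statement
    # repeatedly and reports the mean time per run.
    timer = Timer(stmt="model(example)", globals={"model": model, "example": example})
    print(timer.timeit(20))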