
Commit f3531fd

Merge branch 'master' into master
2 parents bbcc282 + 0445e9c commit f3531fd


6 files changed, +25 -23 lines changed


advanced_source/cpp_extension.rst

Lines changed: 14 additions & 14 deletions
@@ -115,13 +115,13 @@ PyTorch has no knowledge of the *algorithm* you are implementing. It knows only
 of the individual operations you use to compose your algorithm. As such, PyTorch
 must execute your operations individually, one after the other. Since each
 individual call to the implementation (or *kernel*) of an operation, which may
-involve launch of a CUDA kernel, has a certain amount of overhead, this overhead
-may become significant across many function calls. Furthermore, the Python
-interpreter that is running our code can itself slow down our program.
+involve the launch of a CUDA kernel, has a certain amount of overhead, this
+overhead may become significant across many function calls. Furthermore, the
+Python interpreter that is running our code can itself slow down our program.
 
 A definite method of speeding things up is therefore to rewrite parts in C++ (or
 CUDA) and *fuse* particular groups of operations. Fusing means combining the
-implementations of many functions into a single functions, which profits from
+implementations of many functions into a single function, which profits from
 fewer kernel launches as well as other optimizations we can perform with
 increased visibility of the global flow of data.
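As an aside (not part of this commit), a minimal Python sketch of the overhead argument in this hunk: the same pointwise math written as separate ops launches one kernel per op when run eagerly, while ``torch.jit.script`` may fuse such chains on supported backends::

    import torch

    # Three separate pointwise ops -> three separate kernel launches when run eagerly.
    def unfused(x, y):
        return torch.tanh(x) * torch.sigmoid(y) + x

    # Scripting may fuse simple pointwise chains like this; whether it does,
    # and how much it helps, depends on the backend and PyTorch version.
    fused = torch.jit.script(unfused)

    x, y = torch.randn(1024, 1024), torch.randn(1024, 1024)
    print(torch.allclose(unfused(x, y), fused(x, y)))  # same result either way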

@@ -509,12 +509,12 @@ and with our new C++ version::
   Forward: 349.335 us | Backward 443.523 us
 
 We can already see a significant speedup for the forward function (more than
-30%). For the backward function a speedup is visible, albeit not major one. The
-backward pass I wrote above was not particularly optimized and could definitely
-be improved. Also, PyTorch's automatic differentiation engine can automatically
-parallelize computation graphs, may use a more efficient flow of operations
-overall, and is also implemented in C++, so it's expected to be fast.
-Nevertheless, this is a good start.
+30%). For the backward function, a speedup is visible, albeit not a major one.
+The backward pass I wrote above was not particularly optimized and could
+definitely be improved. Also, PyTorch's automatic differentiation engine can
+automatically parallelize computation graphs, may use a more efficient flow of
+operations overall, and is also implemented in C++, so it's expected to be
+fast. Nevertheless, this is a good start.
 
 Performance on GPU Devices
 **************************
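As an aside, numbers like the ones quoted above come from a simple wall-clock loop. A minimal sketch (not the tutorial's exact benchmark; ``module`` and ``inputs`` stand in for whichever implementations you are comparing)::

    import time
    import torch

    def benchmark(module, inputs, iters=1000):
        forward_t = backward_t = 0.0
        for _ in range(iters):
            start = time.time()
            out = module(*inputs)
            forward_t += time.time() - start

            # Reduce the output to a scalar so the backward pass can be timed too.
            loss = out.sum() if torch.is_tensor(out) else sum(o.sum() for o in out)
            start = time.time()
            loss.backward()
            backward_t += time.time() - start
        print('Forward: {:.3f} us | Backward {:.3f} us'.format(
            forward_t / iters * 1e6, backward_t / iters * 1e6))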
@@ -571,7 +571,7 @@ And C++/ATen::
 
 That's a great overall speedup compared to non-CUDA code. However, we can pull
 even more performance out of our C++ code by writing custom CUDA kernels, which
-we'll dive into soon. Before that, let's dicuss another way of building your C++
+we'll dive into soon. Before that, let's discuss another way of building your C++
 extensions.
 
 JIT Compiling Extensions
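As an aside, the JIT route that section covers boils down to ``torch.utils.cpp_extension.load``, which compiles and imports an extension at runtime instead of through a ``setup.py`` build. A minimal sketch, where ``lltm.cpp`` is a placeholder for your own source file::

    from torch.utils.cpp_extension import load

    # Compiles the sources on first use and caches the resulting module.
    lltm_cpp = load(name="lltm_cpp", sources=["lltm.cpp"], verbose=True)
    # The loaded module exposes whatever functions the C++ sources bind,
    # e.g. lltm_cpp.forward(...) if a "forward" binding was registered.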
@@ -851,7 +851,7 @@ and ``Double``), you can use ``AT_DISPATCH_ALL_TYPES``.
 
 Note that we perform some operations with plain ATen. These operations will
 still run on the GPU, but using ATen's default implementations. This makes
-sense, because ATen will use highly optimized routines for things like matrix
+sense because ATen will use highly optimized routines for things like matrix
 multiplies (e.g. ``addmm``) or convolutions which would be much harder to
 implement and improve ourselves.
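As an aside, the ``addmm`` mentioned there is itself a small fusion: a bias add folded into a matrix multiply. A quick illustration with made-up shapes (not from this commit)::

    import torch

    bias = torch.randn(3)
    x = torch.randn(4, 5)
    w = torch.randn(3, 5)
    out = torch.addmm(bias, x, w.t())            # bias + x @ w.t() in one call
    assert torch.allclose(out, bias + x @ w.t())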

@@ -903,7 +903,7 @@ You can see in the CUDA kernel that we work directly on pointers with the right
 type. Indeed, working directly with high level type agnostic tensors inside cuda
 kernels would be very inefficient.
 
-However, this comes at a cost of ease of use and readibility, especially for
+However, this comes at a cost of ease of use and readability, especially for
 highly dimensional data. In our example, we know for example that the contiguous
 ``gates`` tensor has 3 dimensions:
 
@@ -920,7 +920,7 @@ arithmetic.
   gates.data<scalar_t>()[n*3*state_size + row*state_size + column]
 
 
-In addition to being verbose, this expression needs stride to be explicitely
+In addition to being verbose, this expression needs stride to be explicitly
 known, and thus passed to the kernel function within its arguments. You can see
 that in the case of kernel functions accepting multiple tensors with different
 sizes you will end up with a very long list of arguments.
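As an aside, the index arithmetic quoted in that hunk is just the row-major flattening of a contiguous ``(batch, 3, state_size)`` tensor. A quick Python check with arbitrary example sizes::

    import torch

    batch, state_size = 4, 5
    gates = torch.randn(batch, 3, state_size)
    flat = gates.reshape(-1)                     # contiguous, so this is just a view

    n, row, column = 2, 1, 3
    # Same element, addressed flat vs. with multi-dimensional indexing.
    assert flat[n * 3 * state_size + row * state_size + column] == gates[n, row, column]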

beginner_source/blitz/cifar10_tutorial.py

Lines changed: 3 additions & 3 deletions
@@ -43,15 +43,15 @@
 
 We will do the following steps in order:
 
-1. Load and normalizing the CIFAR10 training and test datasets using
+1. Load and normalize the CIFAR10 training and test datasets using
    ``torchvision``
 2. Define a Convolutional Neural Network
 3. Define a loss function
 4. Train the network on the training data
 5. Test the network on the test data
 
-1. Loading and normalizing CIFAR10
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+1. Load and normalize CIFAR10
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Using ``torchvision``, it’s extremely easy to load CIFAR10.
 """
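As an aside, a minimal sketch of that first step; the normalization constants and batch size are illustrative choices, not taken from this commit::

    import torch
    import torchvision
    import torchvision.transforms as transforms

    # Convert images to tensors and normalize each RGB channel.
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ])

    trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                            download=True, transform=transform)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                              shuffle=True, num_workers=2)

    testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                           download=True, transform=transform)
    testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                             shuffle=False, num_workers=2)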

beginner_source/blitz/neural_networks_tutorial.py

Lines changed: 2 additions & 2 deletions
@@ -58,7 +58,7 @@ def __init__(self):
     def forward(self, x):
         # Max pooling over a (2, 2) window
         x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
-        # If the size is a square you can only specify a single number
+        # If the size is a square, you can specify with a single number
         x = F.max_pool2d(F.relu(self.conv2(x)), 2)
         x = x.view(-1, self.num_flat_features(x))
         x = F.relu(self.fc1(x))
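As an aside, a quick check of what that comment says: for a square pooling window, passing a single number is equivalent to passing the tuple::

    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 1, 8, 8)
    assert torch.equal(F.max_pool2d(x, 2), F.max_pool2d(x, (2, 2)))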
@@ -176,7 +176,7 @@ def num_flat_features(self, x):
 # -> loss
 #
 # So, when we call ``loss.backward()``, the whole graph is differentiated
-# w.r.t. the loss, and all Tensors in the graph that has ``requires_grad=True``
+# w.r.t. the loss, and all Tensors in the graph that have ``requires_grad=True``
 # will have their ``.grad`` Tensor accumulated with the gradient.
 #
 # For illustration, let us follow a few steps backward:
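As an aside, a tiny standalone illustration of that sentence: ``backward()`` populates ``.grad`` on leaf tensors with ``requires_grad=True``, and repeated calls accumulate into it::

    import torch

    w = torch.randn(3, requires_grad=True)
    (w * 2).sum().backward()
    print(w.grad)            # gradient of sum(2*w) w.r.t. w: all twos
    (w * 2).sum().backward()
    print(w.grad)            # accumulated: all fours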

beginner_source/nlp/README.txt

Lines changed: 2 additions & 2 deletions
@@ -14,9 +14,9 @@ Deep Learning for NLP with Pytorch
    https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html
 
 4. sequence_models_tutorial.py
-   Sequence Models and Long-Short Term Memory Networks
+   Sequence Models and Long Short-Term Memory Networks
    https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html
 
 5. advanced_tutorial.py
    Advanced: Making Dynamic Decisions and the Bi-LSTM CRF
-   https://pytorch.org/tutorials/beginner/nlp/advanced_tutorial.html
+   https://pytorch.org/tutorials/beginner/nlp/advanced_tutorial.html

beginner_source/nlp/sequence_models_tutorial.py

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # -*- coding: utf-8 -*-
 r"""
-Sequence Models and Long-Short Term Memory Networks
+Sequence Models and Long Short-Term Memory Networks
 ===================================================
 
 At this point, we have seen various feed-forward networks. That is,

beginner_source/nlp/word_embeddings_tutorial.py

Lines changed: 3 additions & 1 deletion
@@ -268,6 +268,8 @@ def forward(self, inputs):
     losses.append(total_loss)
 print(losses)  # The loss decreased every iteration over the training data!
 
+# To get the embedding of a particular word, e.g. "beauty"
+print(model.embeddings.weight[word_to_ix["beauty"]])
 
 ######################################################################
 # Exercise: Computing Word Embeddings: Continuous Bag-of-Words
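As an aside, a self-contained sketch of the lookup those added lines perform; the toy vocabulary and embedding size here are made up::

    import torch
    import torch.nn as nn

    word_to_ix = {"beauty": 0, "truth": 1}
    embeddings = nn.Embedding(len(word_to_ix), 10)   # 2 words, 10-dim vectors

    # Reading the weight row directly, as the added lines do...
    print(embeddings.weight[word_to_ix["beauty"]])
    # ...gives the same vector the module returns for that index.
    print(embeddings(torch.tensor([word_to_ix["beauty"]]))[0])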
@@ -277,7 +279,7 @@ def forward(self, inputs):
 # learning. It is a model that tries to predict words given the context of
 # a few words before and a few words after the target word. This is
 # distinct from language modeling, since CBOW is not sequential and does
-# not have to be probabilistic. Typcially, CBOW is used to quickly train
+# not have to be probabilistic. Typically, CBOW is used to quickly train
 # word embeddings, and these embeddings are used to initialize the
 # embeddings of some more complicated model. Usually, this is referred to
 # as *pretraining embeddings*. It almost always helps performance a couple
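As an aside, a rough sketch of the CBOW model that paragraph describes; the tutorial leaves it as an exercise, so this is one possible shape, not the official solution::

    import torch
    import torch.nn as nn

    class CBOW(nn.Module):
        def __init__(self, vocab_size, embedding_dim):
            super().__init__()
            self.embeddings = nn.Embedding(vocab_size, embedding_dim)
            self.linear = nn.Linear(embedding_dim, vocab_size)

        def forward(self, context_idxs):                       # (context_size,)
            embeds = self.embeddings(context_idxs).sum(dim=0)  # (embedding_dim,)
            return self.linear(embeds)                         # (vocab_size,) scores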
