Machine Learning
Older informal definition:
Machine Learning: A field of study that gives computers the ability to learn without being explicitly programmed. Arthur Samuel (1959)
Modern definition:
Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. Tom Mitchell (1998)
Example: playing checkers:
- E = the experience of playing many games of checkers
- T = the task of playing checkers.
- P = the probability that the program will win the next game.
- Supervised learning - teach the computer to do something
- Unsupervised learning - let the computer learn by itself
- Reinforcement learning
- Recommender systems
In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.
Can scale up to an infinite number of input features or properties.
Supervised learning problems categories:
- Regression problem: trying to predict results within a continuous output (e.g. a price), i.e. map input variables to some continuous function.
- Classification problem: trying to predict results in a discrete output (e.g. 0 or 1), i.e. map input variables into discrete categories.
Unsupervised learning approaches problems with little or no idea what our results should look like. We derive structure from data where we do not necessarily know the effect of the variables. There is no feedback based on the prediction results.
- Clustering: Derive structure in the data based on relationships among the variables. Finds a way to automatically create groups that are somehow similar or related by different variables.
- Non-clustering: The "Cocktail Party Algorithm", find structure in a chaotic environment. (i.e. identifying individual voices and music from a mesh of sounds).
To describe the supervised learning problem slightly more formally: given a training set, learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y.
Notation:
- m - number of training examples (the size of the training set)
- x's - input variables / features
- y's - output variables
- (x, y) - one training example
- (x^(i), y^(i)) - the i-th training example
Training Set => Learning Algorithm => h (hypothesis function); x => h(x) => estimated value of y
h maps from x's to y's
Linear regression with one variable is also called univariate linear regression.
Choose \theta_0, \theta_1 so that h_\theta(x) is close to y for our training examples (x, y).
h_\theta(x) = \theta_0 + \theta_1 x
We measure the accuracy of the hypothesis function with a cost function, also called the squared error function. It takes an average of the squared differences between the hypothesis's results on inputs from the x's and the actual outputs y's.
J(\theta_0, \theta_1) = \frac {1}{2m} \displaystyle \sum _{i=1}^m \left ( \hat{y}_{i}- y_{i} \right)^2 = \frac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2
- It is \frac{1}{2}\bar{x}, where \bar{x} is the mean of the squares of h_\theta(x_i) - y_i, or the difference between the predicted value and the actual value.
The objective is to get the best possible line: the line for which the average squared vertical distance of the scattered points from the line is smallest. Ideally the line passes through every point of the training data set, in which case the value of J(\theta_0, \theta_1) is 0. The goal is to minimize the cost function.
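A minimal Octave sketch of this cost function (the function name `computeCost` and the argument layout are illustrative assumptions; `X` is an m x (n+1) design matrix whose first column is all ones):

```octave
% Cost function for linear regression.
% X: m x (n+1) design matrix (first column all ones), y: m x 1 targets,
% theta: (n+1) x 1 parameter vector.
function J = computeCost(X, y, theta)
  m = length(y);                      % number of training examples
  predictions = X * theta;            % h_theta(x) for every example
  sqErrors = (predictions - y) .^ 2;  % squared errors
  J = (1 / (2 * m)) * sum(sqErrors);  % averaged and halved, as in J(theta_0, theta_1)
end
```

For example, `computeCost([1 1; 1 2; 1 3], [1; 2; 3], [0; 1])` returns 0, since the line y = x passes through all three points.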
Gradient descent is used to estimate the parameters in the hypothesis function, i.e. to minimize the cost function, and it works with any number of input variables.
The way we do this is by taking the derivative (the tangential line to a function) of our cost function. The slope of the tangent is the derivative at that point, and it gives us a direction to move towards. We make steps down the cost function in the direction with the steepest descent. The size of each step is determined by the parameter α, which is called the learning rate. The direction in which the step is taken is determined by the partial derivative of J(\theta_0, \theta_1). Depending on where one starts on the graph, one could end up at different points.
The gradient descent algorithm: repeat until convergence:
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
- j = 0, 1 represents the feature index number.
- At each iteration, one should simultaneously update the parameters \theta_0, \theta_1, ..., \theta_n. Updating a specific parameter prior to calculating another one on the same iteration would yield a wrong implementation.
- Adjust the parameter \alpha to ensure that the gradient descent algorithm converges in a reasonable time. Failure to converge, or taking too much time to obtain the minimum value, implies that the step size is wrong.
- Gradient descent can converge to a local minimum, even with the learning rate \alpha fixed.
- As we approach a local minimum, gradient descent automatically takes smaller steps, so there is no need to decrease \alpha over time.
- Combining gradient descent and linear regression.
- "Batch": Each step of gradient descent uses all the training examples.
Repeat until convergence:
\begin{align*} \text{repeat until convergence: } \lbrace & \newline \theta_0 := & \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}(h_\theta(x_{i}) - y_{i}) \newline \theta_1 := & \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x_{i}) - y_{i}) x_{i}\right) \newline \rbrace& \end{align*}
While gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global, and no other local, optima; thus gradient descent always converges (assuming the learning rate α is not too large) to the global minimum.
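A hedged Octave sketch of this batch update for the univariate case (names and the fixed iteration count are illustrative, not part of the course code):

```octave
% Batch gradient descent for univariate linear regression,
% following the two simultaneous update rules above.
function [theta0, theta1] = gradientDescentUni(x, y, theta0, theta1, alpha, num_iters)
  m = length(y);
  for iter = 1:num_iters
    h = theta0 + theta1 * x;                              % predictions for all examples
    temp0 = theta0 - alpha * (1 / m) * sum(h - y);        % update for theta_0
    temp1 = theta1 - alpha * (1 / m) * sum((h - y) .* x); % update for theta_1
    theta0 = temp0;                                       % apply both updates simultaneously
    theta1 = temp1;
  end
end
```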
- Matrices are two dimensional arrays.
- If a matrix has 4 rows and 3 columns, it's a 4 x 3 matrix.
- A vector is a matrix with one column and many rows; vectors are a subset of matrices.
- A_{ij} refers to the element in the i-th row and j-th column of matrix A.
- A vector with n rows is referred to as an n-dimensional vector.
- v_i refers to the element in the i-th row of the vector.
- Matrices can be 1-indexed or, as in some programming languages, 0-indexed.
- Matrices are usually denoted by uppercase names while vectors are lowercase.
- "Scalar" means that an object is a single value, not a vector or matrix.
- \mathbb{R} refers to the set of scalar real numbers.
- \mathbb{R}^n refers to the set of n-dimensional vectors of real numbers.
- Addition and subtraction are element-wise, so you simply add or subtract each corresponding element.
- To add or subtract two matrices, their dimensions must be the same.
- In scalar multiplication, we simply multiply every element by the scalar value.
- In scalar division, we simply divide every element by the scalar value.
- We map the column of the vector onto each row of the matrix, multiplying each element and summing the result.
- The result is a vector. The number of columns of the matrix must equal the number of rows of the vector.
- An m x n matrix multiplied by an n x 1 vector results in an m x 1 vector.
- We multiply two matrices by breaking the second matrix into columns, performing several matrix-vector multiplications, and concatenating the results.
- an m x n matrix multiplied by an n x o matrix results in an m x o matrix.
- To multiply two matrices, the number of columns of the first matrix must equal the number of rows of the second matrix.
\begin{bmatrix} a & b \newline c & d \newline e & f \end{bmatrix} *\begin{bmatrix} w & x \newline y & z \newline \end{bmatrix} =\begin{bmatrix} a*w + b*y & a*x + b*z \newline c*w + d*y & c*x + d*z \newline e*w + f*y & e*x + f*z\end{bmatrix}
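These rules can be checked directly in Octave; a small illustrative session with arbitrary values:

```octave
A = [1 2; 3 4; 5 6];   % 3 x 2 matrix
B = [1 0; 2 1];        % 2 x 2 matrix
v = [7; 8];            % 2 x 1 vector

A + A    % element-wise addition
3 * A    % scalar multiplication
A * v    % (3 x 2) * (2 x 1) gives a 3 x 1 vector: [23; 53; 83]
A * B    % (3 x 2) * (2 x 2) gives a 3 x 2 matrix
```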
- Matrix multiplication is not commutative: A * B ≠ B * A
- Matrix multiplication is associative: (A * B) * C = A * (B * C)
- The identity matrix: when multiplied by any matrix of the same dimensions, results in the original matrix.
- It's just like multiplying numbers by 1. The identity matrix simply has 1's on the diagonal and 0's elsewhere.
- When multiplying the identity matrix after some matrix (A∗I), the square identity matrix's dimension should match the other matrix's columns.
- When multiplying the identity matrix before some other matrix (I∗A), the square identity matrix's dimension should match the other matrix's rows.
\begin{bmatrix} 1 & 0 & 0 \newline 0 & 1 & 0 \newline 0 & 0 & 1 \newline \end{bmatrix}
- The inverse of a matrix A is denoted A^{-1}. Multiplying a matrix by its inverse results in the identity matrix.
- A non-square matrix does not have an inverse.
- We can compute inverses of matrices in Octave with the pinv(A) function and in MATLAB with the inv(A) function.
- Matrices that don't have an inverse are singular or degenerate.
- The transposition of a matrix is like rotating the matrix 90° in the clockwise direction and then reversing it.
- We can compute the transposition of matrices in MATLAB with the transpose(A) function or with A'.
A_{ij} = A^T_{ji}
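A short Octave sketch of the inverse and transpose operations (the matrix values are arbitrary):

```octave
A = [1 2; 3 4];    % square, invertible matrix

Ainv = pinv(A);    % inverse; pinv also returns a pseudo-inverse for singular matrices
A * Ainv           % approximately the 2 x 2 identity matrix

A'                 % transpose using the quote operator
transpose(A)       % same result using the transpose function
```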
Linear regression with multiple variables is also known as "multivariate linear regression".
Notation for equations where we can have any number of input variables:
- x_j^{(i)} - value of feature j in the i-th training example
- x^{(i)} - the input (features) of the i-th training example
- m - the number of training examples
- n - the number of features
The multivariable form of the hypothesis function accommodating these multiple features is as follows:
h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n
Using the definition of matrix multiplication, the multivariable hypothesis function can be concisely represented as \theta^T x. This is the vectorization of the hypothesis function for one training example.
\begin{align*}h_\theta(x) =\begin{bmatrix}\theta_0 \hspace{2em} \theta_1 \hspace{2em} ... \hspace{2em} \theta_n\end{bmatrix}\begin{bmatrix}x_0 \newline x_1 \newline \vdots \newline x_n\end{bmatrix}= \theta^T x\end{align*}
For convenience, we assume x_{0}^{(i)} = 1 \text{ for } (i \in \{ 1, \dots, m \}). This allows us to do matrix operations with theta and x, making the two vectors \theta and x^{(i)} match each other element-wise (that is, have the same number of elements: n+1).
The gradient descent equation itself is generally the same form; we just have to repeat it for our 'n' features:
\begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)}\newline \; & \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_1^{(i)} \newline \; & \theta_2 := \theta_2 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_2^{(i)} \newline & \cdots \newline \rbrace \end{align*}
In other words:
\begin{align*}& \text{repeat until convergence:} \; \lbrace \newline \; & \theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} \; & \text{for j := 0...n}\newline \rbrace\end{align*}
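A minimal vectorized Octave sketch of one such update, assuming X is the m x (n+1) design matrix with x_0 = 1 in its first column (the function name is illustrative):

```octave
% One gradient descent step that updates all theta_j simultaneously.
% X: m x (n+1) design matrix, y: m x 1, theta: (n+1) x 1, alpha: learning rate.
function theta = gradientStep(X, y, theta, alpha)
  m = length(y);
  errors = X * theta - y;                       % h_theta(x^(i)) - y^(i) for every example
  theta = theta - (alpha / m) * (X' * errors);  % simultaneous update of theta_0..theta_n
end
```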
Speed up gradient descent by normalizing each input value to roughly the same range. This is because θ descends quickly on small ranges and slowly on large ranges, and will oscillate inefficiently down to the optimum when the variables are very uneven.
The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same:
-1 <= x_{(i)} <= 1
or -0.5 <= x_{(i)} <= 0.5
- Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1.
- Mean normalization involves subtracting the average value for an input variable from the values for that input variable resulting in a new average value for the input variable of just zero.
x_i := \dfrac{x_i - \mu_i}{s_i}
where \mu_i is the average of all the values for feature i and s_i is the range of values (max minus min), or s_i is the standard deviation.
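A hedged Octave sketch of mean normalization with scaling by the standard deviation (`featureNormalize` is an illustrative name, not course-provided code):

```octave
% Normalize each feature (column of X) to zero mean and unit standard deviation.
% Keep mu and sigma so the same transform can be applied to new examples later.
function [X_norm, mu, sigma] = featureNormalize(X)
  mu = mean(X);                 % 1 x n row vector of column means
  sigma = std(X);               % 1 x n row vector of column standard deviations
  X_norm = (X - mu) ./ sigma;   % broadcasting applies mu and sigma column-wise
end
```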
- Debugging gradient descent. Make a plot with number of iterations on the x-axis.
- Plot the cost function J(θ) over the number of iterations of gradient descent. If J(θ) ever increases, you probably need to decrease α.
- Automatic convergence test: declare convergence if J(θ) decreases by less than E in one iteration, where E is some small value such as 10^{-3}. In practice it's difficult to choose this threshold value.
- It has been proved that if the learning rate α is sufficiently small, then J(θ) will decrease on every iteration.
- If α is too small: slow convergence.
- If α is too large: J(θ) may not decrease on every iteration and thus may not converge.
We can improve our features and the form of our hypothesis by combining multiple features into one. For example, we can combine x_1 and x_2 into a new feature x_3 by taking x_1 \cdot x_2.
Our hypothesis function need not be linear (a straight line) if that does not fit the data well. We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
Hypothesis function:
h_\theta(x) = \theta_0 + \theta_1 x_1
Create additional features based on x_1 to get a quadratic function:
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2
Cubic function
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3
Square root function
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 \sqrt{x_1}
If you choose your features this way then feature scaling becomes very important. If x_1 has range 1 - 1000, then the range of x_1^2 becomes 1 - 1,000,000 and that of x_1^3 becomes 1 - 1,000,000,000.
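One way to build and scale such polynomial features in Octave (a sketch; the column layout and example range are assumptions):

```octave
% Build polynomial features from a single input column x1, then normalize them
% so their very different ranges do not slow down gradient descent.
x1 = (1:1000)';                                    % example feature with range 1..1000
X_poly = [x1, x1 .^ 2, x1 .^ 3];                   % x1, x1^2, x1^3 as separate columns
X_poly = (X_poly - mean(X_poly)) ./ std(X_poly);   % feature scaling / mean normalization
X = [ones(length(x1), 1), X_poly];                 % prepend the x0 = 1 bias column
```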
The normal equation performs the minimization explicitly, without resorting to an iterative algorithm. In the "Normal Equation" method we minimize J by explicitly taking its derivatives with respect to the θj's and setting them to zero. This allows us to find the optimum theta without iteration.
The normal equation formula:
\theta = (X^T X)^{-1}X^T y
There is no need to do feature scaling with the normal equation.
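In Octave the normal equation is a one-liner; a minimal sketch assuming X already contains the x_0 = 1 column:

```octave
% Closed-form solution for linear regression.
% X: m x (n+1) design matrix, y: m x 1 target vector.
theta = pinv(X' * X) * X' * y;   % pinv is preferred over inv (see the note below)
```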
| Gradient Descent | Normal Equation |
|---|---|
| Need to choose alpha | No need to choose alpha |
| Needs many iterations | No need to iterate |
| O(kn^2) | O(n^3), need to calculate the inverse of X^TX |
| Works well when n is large | Slow if n is very large |
If we have a very large number of features, the normal equation will be slow. In practice, when n exceeds 10,000 it might be a good time to go from a normal solution to an iterative process.
When implementing the normal equation in Octave we want to use the pinv function rather than inv. The pinv function will give you a value of θ even if X^TX is not invertible.
If X^TX is noninvertible, the common causes are:
- Redundant features, where two features are very closely related (linearly dependent).
- Too many features (e.g. m ≤ n). In this case, delete some features or use "regularization".
Solutions to the above problems include deleting a feature that is linearly dependent with another or deleting one or more features when there are too many features.
To attempt classification, one method is to use linear regression and map all predictions greater than 0.5 as a 1 and all less than 0.5 as a 0. However, this method doesn't work well because classification is not actually a linear function.
- The classification problem is just like the regression problem, except that the values we now want to predict take on only a small number of discrete values
- Binary classification problem in which y can take on only two values, 0 and 1.
- Values y ∈ {0,1}: 0 is also called the negative class and 1 the positive class; they are sometimes denoted by the symbols "-" and "+".
- Given x^{(i)}, the corresponding y^{(i)} is called the label for the training example.
- Change the form of our hypotheses hθ(x) to satisfy 0 ≤ hθ(x) ≤ 1.
- Plug the hypothesis into the Logistic Function / Sigmoid Function:
\begin{align*}& h_\theta (x) = g ( \theta^T x ) \newline \newline& z = \theta^T x \newline& g(z) = \dfrac{1}{1 + e^{-z}}\end{align*}
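A minimal Octave sketch of the sigmoid function, written so it also works element-wise on vectors and matrices:

```octave
% Sigmoid / logistic function g(z), applied element-wise to z.
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));
end
```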
- The function g(z) maps any real number to the (0, 1) interval, making it useful for transforming an arbitrary-valued function into a function better suited for classification.
- hθ(x) gives us the probability that our output is 1.
- hθ(x) = 0.7 gives us a probability of 70% that our output is 1.
- The probability that our prediction is 0 is just the complement of the probability that it is 1.
\begin{align*}& h_\theta(x) = P(y=1 | x ; \theta) = 1 - P(y=0 | x ; \theta) \newline& P(y = 0 | x;\theta) + P(y = 1 | x ; \theta) = 1\end{align*}
In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:
\begin{align*}& h_\theta(x) \geq 0.5 \rightarrow y = 1 \newline& h_\theta(x) < 0.5 \rightarrow y = 0 \newline\end{align*}
The decision boundary is the line that separates the area where y = 0 and where y = 1. It is created by our hypothesis function.
\begin{align*}& \theta^T x \geq 0 \Rightarrow y = 1 \newline& \theta^T x < 0 \Rightarrow y = 0 \newline\end{align*}
theta = [5; -1; 0]
% y = 1 if 5 + (-1) * x(1) + 0 * x(2) >= 0
% 5 - x(1) >= 0
% -x(1) >= -5
% x(1) <= 5
The boundary is a straight vertical line placed on the graph where x(1) == 5; everything to the left of that denotes y == 1, while everything to the right denotes y == 0.
We cannot use the same cost function that we use for linear regression because the Logistic Function will cause the output to be wavy, causing many local optima. In other words, it will not be a convex function.
\begin{align*}& J(\theta) = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) \newline & \mathrm{Cost}(h_\theta(x),y) = -\log(h_\theta(x)) \; & \text{if y = 1} \newline & \mathrm{Cost}(h_\theta(x),y) = -\log(1-h_\theta(x)) \; & \text{if y = 0}\end{align*}
Which translates to:
- If our correct answer y is 0, then the cost function will be 0 if our hypothesis function also outputs 0. If our hypothesis approaches 1, then the cost function will approach infinity.
- If our correct answer y is 1, then the cost function will be 0 if our hypothesis function outputs 1. If our hypothesis approaches 0, then the cost function will approach infinity.
\begin{align*}& \mathrm{Cost}(h_\theta(x),y) = 0 \text{ if } h_\theta(x) = y \newline & \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 0 \; \mathrm{and} \; h_\theta(x) \rightarrow 1 \newline & \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 1 \; \mathrm{and} \; h_\theta(x) \rightarrow 0 \newline \end{align*}
Compress our cost function's two conditional cases into one case:
\mathrm{Cost}(h_\theta(x),y) = - y \; \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))
When y is equal to 1, then the second term will be zero and will not affect the result. If y is equal to 0, then the first term will be zero and will not affect the result.
Entire cost function:
J(\theta) = - \frac{1}{m} \displaystyle \sum_{i=1}^m [y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))]
Vectorized implementation:
\begin{align*} & h = g(X\theta)\newline & J(\theta) = \frac{1}{m} \cdot \left(-y^{T}\log(h)-(1-y)^{T}\log(1-h)\right) \end{align*}
This algorithm is identical to the one we used in linear regression. We still have to simultaneously update all values in theta.
\begin{align*} & Repeat \; \lbrace \newline & \; \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \newline & \rbrace \end{align*}
Vectorized implementation:
\theta := \theta - \frac{\alpha}{m} X^{T} (g(X \theta ) - \vec{y})
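A hedged Octave sketch of running this vectorized update in a loop (the learning rate, iteration count, and the `sigmoid` helper sketched earlier are illustrative assumptions):

```octave
% Gradient descent for logistic regression (vectorized).
% Assumes X (m x (n+1) design matrix with a ones column) and y (m x 1 labels in {0,1}).
alpha = 0.01;                  % learning rate (illustrative value)
num_iters = 1000;              % number of iterations (illustrative value)
m = length(y);
theta = zeros(size(X, 2), 1);  % initialize parameters to zero
for iter = 1:num_iters
  h = sigmoid(X * theta);                        % h_theta(x) for every example
  theta = theta - (alpha / m) * (X' * (h - y));  % simultaneous update of all theta_j
end
```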
"Conjugate gradient", "BFGS", and "L-BFGS" are more sophisticated, faster ways to optimize θ
that can be used instead of gradient descent.
First need to provide a function that evaluates the J value of theta, and the gradient functions for a given input value theta
There is no need for alpha when using fminunc
, therefore the gradient formula is simpler:
grad = (1 / m) * X' * (hypothesis - y)
A single function that returns both of these:
function [jVal, gradient] = costFunction(theta)
jVal = [...code to compute J(theta)...];
gradient = [...code to compute derivative of J(theta)...];
end
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
Then we can use Octave's fminunc() optimization algorithm along with the optimset() function, which creates an object containing the options we want to send to fminunc().
We give the fminunc() function our cost function, our initial vector of theta values, and the "options" object that we created beforehand.
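For concreteness, a hedged sketch of a costFunction body that fills in the placeholders above with the vectorized cost and gradient; passing X and y as extra arguments and binding them with an anonymous function is an illustrative choice, not the only way:

```octave
% Vectorized logistic regression cost and gradient for use with fminunc.
function [jVal, gradient] = costFunction(theta, X, y)
  m = length(y);
  h = sigmoid(X * theta);                                   % predictions in (0, 1)
  jVal = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));  % cost J(theta)
  gradient = (1 / m) * X' * (h - y);                        % gradient of J(theta)
end

% Usage: bind X and y, then optimize.
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(size(X, 2), 1);
[optTheta, functionVal, exitFlag] = fminunc(@(t) costFunction(t, X, y), initialTheta, options);
```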
Classification of data when we have more than two categories: instead of y = {0,1} we expand our definition so that y = {0,1...n}.
\begin{align*}& y \in \lbrace0, 1 ... n\rbrace \newline& h_\theta^{(0)}(x) = P(y = 0 | x ; \theta) \newline& h_\theta^{(1)}(x) = P(y = 1 | x ; \theta) \newline& \cdots \newline& h_\theta^{(n)}(x) = P(y = n | x ; \theta) \newline& \mathrm{prediction} = \max_i( h_\theta ^{(i)}(x) )\newline\end{align*}
Choosing one class and then lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returned the highest value as our prediction.
- Train a logistic regression classifier hθ(x) for each class to predict the probability that y = i.
- To make a prediction on a new x, pick the class that maximizes hθ(x) (see the sketch below).
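A hedged Octave sketch of the prediction step, assuming `all_theta` stores one row of trained logistic regression parameters per class:

```octave
% One-vs-all prediction.
% all_theta: K x (n+1) matrix, one row of parameters per class.
% X: m x (n+1) design matrix. Returns the index of the most probable class per example.
function p = predictOneVsAll(all_theta, X)
  probs = sigmoid(X * all_theta');   % m x K matrix of class probabilities
  [~, p] = max(probs, [], 2);        % pick the class with the highest probability
end
```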
Underfitting - the data clearly shows structure not captured by the model. Underfitting, or high bias, is when the form of our hypothesis function h maps poorly to the trend of the data. It is usually caused by a function that is too simple or uses too few features.
Overfitting - high variance, is caused by a hypothesis function that fits the available data but does not generalize well to predict new data. It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.
This terminology is applied to both linear and logistic regression. There are two main options to address the issue of overfitting:
- Reduce the number of features:
  - Manually select which features to keep.
  - Use a model selection algorithm.
- Regularization:
  - Keep all the features, but reduce the magnitude of parameters θj.
  - Regularization works well when we have a lot of slightly useful features.
If we have overfitting from our hypothesis function, we can reduce the weight that some of the terms in our function carry by increasing their cost.
We could also regularize all of our theta parameters in a single summation as:
min_\theta\ \dfrac{1}{2m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\ \sum_{j=1}^n \theta_j^2
The λ, or lambda, is the regularization parameter. It determines how much the costs of our theta parameters are inflated. With the extra summation, we can smooth the output of our hypothesis function to reduce overfitting. If lambda is chosen to be too large, it may smooth out the function too much and cause underfitting.
We can apply regularization to both linear regression and logistic regression.
In the update below, the factor 1 − α(λ/m) multiplying θj will always be less than 1. Intuitively you can see it as reducing the value of θj by some amount on every update.
\theta_j := \theta_j(1 - \alpha\frac{\lambda}{m}) - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}
Regularization using the alternate method of the non-iterative normal equation. To add in regularization, the equation is the same as our original, except that we add another term inside the parentheses:
Where L = [0 0 0; 0 1 0; 0 0 1]. L is a matrix with 0 at the top left and 1's down the diagonal, with 0's everywhere else. It should have dimension (n+1)×(n+1). Intuitively, this is the identity matrix (though we are not including x0), multiplied by a single real number λ.
\begin{align*}& \theta = \left( X^TX + \lambda \cdot L \right)^{-1} X^Ty \newline& \text{where}\ \ L = \begin{bmatrix} 0 & & & & \newline & 1 & & & \newline & & 1 & & \newline & & & \ddots & \newline & & & & 1 \newline\end{bmatrix}\end{align*}
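A minimal Octave sketch of this regularized normal equation (lambda is assumed to be chosen beforehand):

```octave
% Regularized normal equation for linear regression.
% X: m x (n+1) design matrix, y: m x 1 targets, lambda: regularization parameter.
n = size(X, 2) - 1;    % number of features (excluding x0)
L = eye(n + 1);        % start from the (n+1) x (n+1) identity matrix
L(1, 1) = 0;           % do not regularize the bias term theta_0
theta = pinv(X' * X + lambda * L) * X' * y;
```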
Recall that if m < n, then X^TX is non-invertible. However, when we add the term λ⋅L, then X^TX + λ⋅L becomes invertible.
We can regularize logistic regression in a similar way that we regularize linear regression. As a result, we can avoid overfitting.
We can regularize this equation by adding a term to the end:
J(\theta) = - \frac{1}{m} \sum_{i=1}^m \large[ y^{(i)}\ \log (h_\theta (x^{(i)})) + (1 - y^{(i)})\ \log (1 - h_\theta(x^{(i)}))\large] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2
The second sum explicitly excludes the bias term: it skips θ0 by running from 1 to n. Thus, when computing the equation we should continuously update the two following equations:
\begin{align*} & \text{Repeat}\ \lbrace \newline & \ \ \ \ \theta_0 := \theta_0 - \alpha\ \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)} \newline & \ \ \ \ \theta_j := \theta_j - \alpha\ \left[ \left( \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \right) + \frac{\lambda}{m}\theta_j \right] &\ \ \ \ \ \ \ \ \ \ j \in \lbrace 1,2...n\rbrace\newline & \rbrace \end{align*}
Neurons are computational units that take inputs (dendrites) as electrical inputs (called "spikes") that are channeled to outputs (axons). In our model, our dendrites are like the input features x_1 ... x_n, and the output is the result of our hypothesis function.
In this model our x0 input node is sometimes called the "bias unit." It is always equal to 1. In neural networks, we use the same logistic function as in classification, yet it is sometimes called the sigmoid (logistic) activation function. Our Theta parameters are sometimes called weights.
[x_0; x_1; x_2] => [ ] => h_\theta(x)
- Input nodes (layer 1), also known as the "input layer", go into another node (layer 2), which finally outputs the hypothesis function, known as the "output layer".
- Intermediate layers of nodes between the input and output layers are called the "hidden layers."
- We label the intermediate or hidden layer nodes a^{(2)}_0 ... a^{(2)}_n and call them "activation units".
\begin{align*}& a_i^{(j)} = \text{"activation" of unit $i$ in layer $j$} \newline& \Theta^{(j)} = \text{matrix of weights controlling function mapping from layer $j$ to layer $j+1$}\end{align*}
The values for each of the "activation" nodes is obtained as follows:
[x_0; x_1; x_2; x_3] => [a^(2)_1; a^(2)_2; a^(2)_3] => h_\Theta(x)
\begin{align*} a_1^{(2)} = g(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3) \newline a_2^{(2)} = g(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3) \newline a_3^{(2)} = g(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3) \newline h_\Theta(x) = a_1^{(3)} = g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)}) \newline \end{align*}
- We compute our activation nodes by using a 3×4 matrix of parameters.
- We apply each row of the parameters to our inputs to obtain the value for one activation node.
- Our hypothesis output is the logistic function applied to the sum of the values of our activation nodes, which have been multiplied by yet another parameter matrix Θ(2) containing the weights for our second layer of nodes.
- Each layer gets its own matrix of weights, Θ(j).
\text{If network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, then $\Theta^{(j)}$ will be of dimension $s_{j+1} \times (s_j + 1)$.}
The +1 comes from the addition in Θ(j) of the "bias nodes," x0 and Θ(j)0. In other words, the output nodes will not include the bias nodes.
Example: if layer 1 has 2 input nodes and layer 2 has 4 activation nodes, the dimension of Θ(1) is going to be 4×3, where s_j = 2 and s_{j+1} = 4, so s_{j+1} × (s_j + 1) = 4 × 3.
Vectorized implementation of the above functions.
- We're going to define a new variable z_k^{(j)} that encompasses the parameters inside our g function.
\begin{align*}a_1^{(2)} = g(z_1^{(2)}) \newline a_2^{(2)} = g(z_2^{(2)}) \newline a_3^{(2)} = g(z_3^{(2)}) \newline \end{align*}
For layer j = 2 and node k, the variable z will be:
z_k^{(2)} = \Theta_{k,0}^{(1)}x_0 + \Theta_{k,1}^{(1)}x_1 + \cdots + \Theta_{k,n}^{(1)}x_n
Setting x = a^{(1)}, we can rewrite the equation as:
we can rewrite the equation as:
z^{(j)} = \Theta^{(j-1)}a^{(j-1)}
- We are multiplying our matrix Θ(j−1), with dimensions s_j × (n+1) (where s_j is the number of our activation nodes), by our vector a(j−1) with height (n+1).
- This gives us our vector z(j) with height s_j.
Now we can get a vector of our activation nodes for layer j as follows, where our function g is applied element-wise to our vector z(j):
a^{(j)} = g(z^{(j)})
- Add a bias unit (equal to 1) to layer j after we have computed a(j). This will be element a(j)0 and will be equal to 1.
- To compute our final hypothesis, first compute another z vector:
z^{(j+1)} = \Theta^{(j)}a^{(j)}
- We get this final z vector by multiplying the next theta matrix after Θ(j−1) with the values of all the activation nodes we just got.
- This last theta matrix Θ(j) will have only one row, which is multiplied by one column a(j) so that our result is a single number.
- We then get our final result with:
h_\Theta(x) = a^{(j+1)} = g(z^{(j+1)})
- In this last step, between layer j and layer j+1, we are doing exactly the same thing as we did in logistic regression.
- Adding all these intermediate layers in neural networks allows us to more elegantly produce interesting and more complex non-linear hypotheses. (A vectorized Octave sketch of this forward pass follows.)
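A hedged Octave sketch of this forward pass for a three-layer network, assuming Theta1 (s_2 x (n+1)) and Theta2 (1 x (s_2+1)) are already-trained weight matrices and `sigmoid` is defined as earlier:

```octave
% Forward propagation for a 3-layer network (input, one hidden layer, output).
% X: m x n matrix of examples without the bias column.
a1 = [ones(size(X, 1), 1), X];              % add the bias unit to the input layer
z2 = a1 * Theta1';                          % weighted inputs to the hidden layer
a2 = [ones(size(z2, 1), 1), sigmoid(z2)];   % hidden activations plus bias unit
z3 = a2 * Theta2';                          % weighted inputs to the output layer
h  = sigmoid(z3);                           % h_Theta(x) for every example
```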
- L = total number of layers in the network
- s_l = number of units (not counting the bias unit) in layer l
- K = number of output units/classes
We denote h_\Theta(x)_k as the hypothesis that results in the k-th output. The cost function for neural networks is a generalization of the one we used for logistic regression.
\begin{gather*} J(\Theta) = - \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[y^{(i)}_k \log ((h_\Theta (x^{(i)}))_k) + (1 - y^{(i)}_k)\log (1 - (h_\Theta(x^{(i)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} ( \Theta_{j,i}^{(l)})^2\end{gather*}
We have added a few nested summations to account for our multiple output nodes. In the first part of the equation, before the square brackets, we have an additional nested summation that loops through the number of output nodes.
In the regularization part, after the square brackets, we must account for multiple theta matrices. The number of columns in our current theta matrix is equal to the number of nodes in our current layer (including the bias unit). The number of rows in our current theta matrix is equal to the number of nodes in the next layer (excluding the bias unit). As before with logistic regression, we square every term.
- The double sum simply adds up the logistic regression costs calculated for each cell in the output layer.
- The triple sum simply adds up the squares of all the individual Θ's in the entire network.
- The i in the triple sum does not refer to training example i.
To minimize the cost function, we need to compute J(\Theta) and its derivatives.
"Backpropagation" is neural-network terminology for minimizing our cost function, just like what we were doing with gradient descent in logistic and linear regression. Our goal is to compute \min_\Theta J(\Theta)
. That is, we want to minimize our cost function J using an optimal set of parameters in theta.
To compute the partial derivative of J(Θ):
\dfrac{\partial}{\partial \Theta_{i,j}^{(l)}}J(\Theta)
Given training set \{(x^{(1)}, y^{(1)}) \dots (x^{(m)}, y^{(m)})\}
- Set \Delta^{(l)}_{i,j} := 0 for all (l, i, j)
- For training example t = 1 to m:
  - Set a^{(1)} := x^{(t)}
  - Perform forward propagation to compute a^{(l)} for l = 2, 3, ..., L
  - Using y^{(t)}, compute \delta^{(L)} = a^{(L)} - y^{(t)}
Where L is our total number of layers and a^{(L)} is the vector of outputs of the activation units for the last layer. So our "error values" for the last layer are simply the differences of our actual results in the last layer and the correct outputs in y. To get the delta values of the layers before the last layer, we can use an equation that steps us back from right to left:
- Compute \delta^{(L-1)}, \delta^{(L-2)}, \dots, \delta^{(2)}, using:
\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)})\ .*\ a^{(l)}\ .*\ (1 - a^{(l)})
The delta values of layer l are calculated by multiplying the delta values in the next layer with the theta matrix of layer l. We then element-wise multiply that with a function called g', or g-prime, which is the derivative of the activation function g evaluated with the input values given by z^{(l)}.
The g-prime derivative terms can also be written out as:
g'(z^{(l)}) = a^{(l)}\ .*\ (1 - a^{(l)})
Combining individual deltas into Delta:
\Delta^{(l)}_{i,j} := \Delta^{(l)}_{i,j} + a_j^{(l)} \delta_i^{(l+1)}
or with vectorization:
\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T
Compute the new D matrix from the accumulated \Delta values:
D^{(l)}_{i,j} := \dfrac{1}{m}\left(\Delta^{(l)}_{i,j} + \lambda\Theta^{(l)}_{i,j}\right), if j \neq 0
D^{(l)}_{i,j} := \dfrac{1}{m}\Delta^{(l)}_{i,j}, if j = 0
The capital-delta matrix D is used as an "accumulator" to add up our values as we go along and eventually compute our partial derivative
\frac \partial {\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}
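For concreteness, a hedged Octave sketch of these backpropagation steps for a three-layer network, looping over the training examples as in the algorithm above (all names are illustrative; Y is assumed to hold one-hot labels, and a fully vectorized version is equally valid):

```octave
% Backpropagation gradients for a 3-layer network (one hidden layer).
% Theta1: s2 x (n+1), Theta2: K x (s2+1), X: m x n, Y: m x K one-hot labels,
% lambda: regularization parameter, sigmoid() as defined earlier.
m = size(X, 1);
Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));
for t = 1:m
  a1 = [1; X(t, :)'];                               % forward pass for example t
  z2 = Theta1 * a1;
  a2 = [1; sigmoid(z2)];
  z3 = Theta2 * a2;
  a3 = sigmoid(z3);

  delta3 = a3 - Y(t, :)';                           % error at the output layer
  delta2 = (Theta2' * delta3) .* (a2 .* (1 - a2));  % propagate the error back
  delta2 = delta2(2:end);                           % drop the bias-unit error

  Delta1 = Delta1 + delta2 * a1';                   % accumulate gradients
  Delta2 = Delta2 + delta3 * a2';
end
% Average and regularize (skip the bias column, j = 0).
D1 = Delta1 / m;  D1(:, 2:end) = D1(:, 2:end) + (lambda / m) * Theta1(:, 2:end);
D2 = Delta2 / m;  D2(:, 2:end) = D2(:, 2:end) + (lambda / m) * Theta2(:, 2:end);
```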