Lately, I was into the studying process of machine learning, and outputting(taking notes) is a vital step of it. Here, I am using Andrew Ng’s Stanford Machine Learning course in Coursera with the language of MATLAB.

`So the rest of the code I will write in this post by default are based on MATLAB.`

**What is ML?**

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

Tom Mitchell

**Supervised Learning&Unsupervised Learning**

SL: with labels; direct feedback; predict

Under SL, there are **regression** and **classification**

USL: without labels; no feedback; finding the hidden structure

Under USL, there are **clustering** and **non-clustering**

For now, I would focus on these two but not **reinforcement learning**.

**The Basic Model & Notation**

We use x^{(i)} to represent the “input” value, with the variable x represent the value at the position i in a matrix , or vector in most of the time. And y^{(i)} is the actual “output” when we have a input x at position variable i. A pair of (x^{(i)}, y^{(i)}) is called a training sample. Then we have a list of such samples with i=1,...,m—is called a training set. And the purpose of ML is to have a “good” hypothesis function h(x) which could predict the output while only knowing the input x. If we only want to have a simple linear form of h(x), then it looks like: h(x)=\theta_0 + \theta_1x, which both \theta_0 and \theta_1 is the parameter we want to find that letting h(x) to predict “better”.

**Linear Algebra Review**

Matrix-Vector Multiplication:\begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix} *\begin{bmatrix} x\\y \end{bmatrix} =\begin{bmatrix} a*x + b*y \\ c*x + d*y \\ e*x + f*y \end{bmatrix}

Matrix-Matrix Multiplication: \begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix} * \begin{bmatrix} w & x \\ y & z \\ \end{bmatrix}=\begin{bmatrix} a*w + b*y & a*x + b*z \\ c*w + d*y & c*x + d*z \\e*w + f*y & e*x + f*z \end{bmatrix}

**Identity Matrix** looks like this—with 1 on the diagonal and the rest of the elements are zeros: \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ \end{bmatrix}

`eye(3) `

**Multiplication Properties**

Matrices are not commutative: A∗B \neq B∗A

Matrices are associative: (A∗B)∗C = A∗(B∗C)

**Inverse and Transpose**

**Inverse**: A matrix **A** mutiply with its inverse** A_inv** results to be a identity matrix **I**:

`I = A*inv(A)`

**Transposition** is like rotating the matrix 90 degrees, for a matrix A with dimension m * n, its transpose is with dimension n * m:

A = \begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix}, A^T = \begin{bmatrix} a & c & e \\ b & d & f \\ \end{bmatrix}

Also we can get:

A_{ij}=A_{ji}^{T}

**Cost Function**

A cost function shows how accurate our hypothesis function predict while output the error (the deviation between y(x) and h(x)). And it looks like this:

J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x^{(i)}) - y^{(i)} \right)^2

For people who are familier with statistics, it is called “Squared error funtion” while the square makes each error becomes a positice value, and the \frac{1}{2} helps to simplify the expression later when we do derivative during the process of gradient descent. Now, we turn the question to “How to find the \theta_0&\theta_1 that minilize J(\theta_0, \theta_1)?”

**Contour Plot**

A contour plot is actually an alternative way to show 3D graphs in 2D, in which the color blue represents low points while red means the high. So the J(\theta_0, \theta_1) that gives the red point is the set of the parameter gives h(x) with the lowest error with the actual output y(x)

**Gradient Descent**

Gradient Descent is one of the most basic ML tools. The basic idea is to “move some small steps which lead to minimizing the cost function J(\theta). And it looks like this:

```
Repeat until convergence:
{
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
}
```

Here, the operator := just means assign the latter part to the former part while we know it could be the same as = in many languages. We say the former \theta_j as the “next step” while the latter one as the “current position”, \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) shows the “direction” that make the move increase J(\theta) the most, so that we could just add a negative sign to make it becomes the fastest decrease direction.\alpha gives the length of step we want it to take for each step. And it’s important to make the update of each \theta be simultaneous.

If we take the code above apart, then we have:

repeat until convergence: { \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)}) \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}((h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}) }

The term x_i is nothing but a result of the derivative, there is no x_i for \theta_0 because we defined x^{(i)}_0 as 1.

Then here is a full derivative process to show the partial dervative of the cost function J(\theta):

\begin{aligned}\frac{\partial }{\partial \theta_j}J(\theta) &= \frac{\partial }{\partial \theta_j}\frac{1}{2}(h_\theta(x)-y)^{2}\\&=2 \cdot \frac{1}{2}(h_\theta(x)-y) \cdot \frac{\partial }{\partial \theta_j}(h_\theta(x)-y)\\&= (h_\theta(x)-y)\frac{\partial }{\partial \theta_j}\left ( \sum\limits_{i=0}^n\theta^{(i)}x^{(i)}-y \right )\\&=(h_\theta(x)-y)x_j\end{aligned}

And such basic method is called **batch gradient descent** while it uses all the training set we provide, and just saying for future reference, J(\theta)is convex which means it only has only one global minima and has no chance to be affected by local minima.

**Multivariate Linear Regression**

So saying we have not only one variables of input, but many of them. Then we use j in x_j from 1 to n to represents the index of it just like we use i to represents the index of the training example from 1 to m.

x_{j}^{(i)} = value of, in i^{th} training example, feature j

For convenience of notation, we have to define x_0 = 1, since we have \theta_0 in the hypothesis function, and the matrix mutiplication thing:

x = \begin{bmatrix} x_1\\x_2 \\\vdots\\x_n \end{bmatrix} \in\mathbb{R}^{n} , \theta = \begin{bmatrix}\theta_0\\\theta_1\\\theta_2\\\vdots\\\theta_n \end{bmatrix}\in\mathbb{R}^{n+1} \rightarrow x = \begin{bmatrix}x_0\\x_1\\x_2\\\vdots\\x_n\end{bmatrix}\in\mathbb{R}^{n+1}, \theta = \begin{bmatrix}\theta_0\\\theta_1\\\theta_2\\\vdots\\\theta_n \end{bmatrix}\in\mathbb{R}^{n+1}

And the cool thing here we can do now is using vectorization to represents the long mutivariable hypothesis function:

h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + ... + \theta_n x_n = \begin{bmatrix}\theta_0&\theta_1&\cdots&\theta_n\end{bmatrix}\begin{bmatrix}x_0 \\ x_1 \\ \vdots \\ x_n\end{bmatrix}=\theta^T x

**Feature Scaling**

If the input set x contains features that have very large difference on their data range, the process of getting \theta could oscillating, being slow, or even failed, and **feature scaling**, or **mean normalization** is a technique to make the range of data in each feature more even, and the process is very familir if knowing statistics:

x_j:=\frac{x_j-\mu_j}{s_j}

So the input x with feature index j minus the mean of the data in this feature then divided by the standard deviation(or range in some cases)

### Normal Equation

Other than gradient descent, there is another way to find the minimized cost functionJ(\theta). We first need to construct a matrix X which is a another way to show the input data set of x:

x = \begin{bmatrix}x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \rightarrow X = \begin{bmatrix}x^{(1)}_0 & x^{(1)}_1 & \cdots&x^{(1)}_n \\ x^{(2)}_0&x^{(2)}_1 & \cdots & x^{(2)}_n \\ \vdots & \vdots & \ddots & \vdots \\x^{(m)}_0&x^{(m)}_1 & \cdots & x^{(m)}_n \end{bmatrix} = \begin{bmatrix} 1&x^{(1)}_1&\cdots&x^{(1)}_n \\ 1&x^{(2)}_1&\cdots&x^{(2)}_n \\ \vdots &\vdots &\ddots &\vdots \\ 1&x^{(m)}_1 &\cdots &x^{(m)}_n \end{bmatrix}

Actually, each row of the matrix X is the transpose of each element in x_j^{(i)}, contains the data set for all features in one iteration. And the normal equation itself looks like:

\theta = (X^{T}X)^{-1}X^{T}y

I am not going to show how it comes but comparing to gradient descent, the normal equation: 1. no need to choose \alpha 2. no need to iterate 3. but slow.

**Classification**

Not only we need to solve some continuous problems(linear regression), but also a lot of discrete problems like if someone gets cancer(YES/NO) by the size of one’s tumor. Normally we use 1 and 0 to represent the two outcomes. And the new form of the function we need to use to better shows the concept of classification is called the sigmoid function:

h(x) = \dfrac{1}{1 + e^{-\theta^{T}x}}

So what we did here is basically put the original hypothesis function \theta^{T}x into the standard sigmoid function:

g(z) = \dfrac{1}{1 + e^{-z}}

So that the new hypothesis function will output the probability toward one of the binary output(1/0) without overlapping.

**Decision Boundary**

We consider:

h(x) \geq 0.5 \rightarrow y = 1 \\ h(x) < 0.5 \rightarrow y = 0

Becuase of the bahavior of the logistic function:

\theta^{T}x=0, e^{0}=1 \Rightarrow h(x)=1/2 \\ \theta^{T}x \to \infty, e^{-\infty} \to 0 \Rightarrow h(x)=1 \\ \theta^{T}x \to -\infty, e^{\infty}\to \infty \Rightarrow h(x)=0

So that:

\theta^T x \geq 0 \Rightarrow h(x) = 1 \\ \theta^T x < 0 \Rightarrow h(x) = 0

Then you can just set h(x) to 1 or 0 to get the decision boundary. For example:

\theta = \begin{bmatrix}5 \\ -1 \\ 0\end{bmatrix} \\ y = 1 \; \mathbf{if} \; 5 + (-1) x_1 + 0 x_2 \geq 0 \\ Desicion Boundary: x_1 \leq 5

The plot should looks like: