
*This note is still open

Machine Learning Notes I

Machine Learning Notes II

The Primal Problem of Optimization

A general optimization problem can usually be rewritten as maximizing or minimizing a certain function subject to several constraints. For example, maybe you want the optimized value to be non-negative. The most basic form of such a problem is called the primal problem, which looks like this:

\begin{matrix} \underset{x}{min}f(x), x \in \mathbb{R}^n \\ s.t. \\ g_i(x)\leq 0, i=1,2,\dots,m \\ h_i(x)= 0, i=1,2,\dots,p \end{matrix}

We call f(x) the objective function and the other two the constraint functions.

The Lagrangian Function

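To solve such a primal problem with constraints, we can use Lagrange multipliers to encode the three functions into one:

L(x,u,v)=f(x)+\sum_{i=1}^{m}u_ig_i(x)+\sum_{i=1}^{p}v_ih_i(x)

We call u_i, v_i the Lagrange multipliers.

We require u_i \geq 0, v_i \in \mathbb{R}. For any feasible x, g_i(x) \leq 0, so the maximum of \sum_{i=1}^{m}u_ig_i(x) over u_i \geq 0 is 0. And since h_i(x)=0, \sum_{i=1}^{p}v_ih_i(x) is also equal to 0. (For an infeasible x the maximum is +\infty, so the outer minimization never picks it.) Thus, for feasible x:

\underset{u,v}{max}L(x,u,v)=f(x)

In this way, we find:

\underset{x}{min}f(x)=\underset{x}{min}\underset{u,v}{max}L(x,u,v)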

The Lagrangian Dual Function

But the expression above is hard to solve directly, so we transform it into a dual function as follows.

Let D be the domain of the problem:

g(u,v)=\underset{x\in D}{inf}L(x,u,v)=\underset{x\in D}{inf}[f(x)+\sum_{i=1}^{m}u_ig_i(x)+\sum_{i=1}^{p}v_ih_i(x)]
the Lagrangian dual function g(u,v)

When L has no lower bound in x, we define the value as -\infty. The dual function is a pointwise infimum of functions that are affine in (u,v), so it is concave, and since we are trying to find its maximum, it can be treated as a convex optimization problem.


\underset{x}{min}f(x)=\underset{x}{min}\underset{u,v}{max}L(x,u,v)\geq \underset{u,v}{max}\underset{x}{min}L(x,u,v)= \underset{u,v}{max}g(u,v)

Proof of the Minimax Theorem


Dual gap

We use p^* to represent the optimal value of the primal problem:

p^*=\underset{x}{min}\underset{u,v}{max}L(x,u,v)

And d^* to represent the optimal value of the Lagrangian dual:

d^*=\underset{u,v}{max}\underset{x}{min}L(x,u,v)

And because of the minimax theorem:

p^* \geq d^*

We use d^* to approach p^* and call p^*-d^* the dual gap (also known as the duality gap).

When the dual gap is zero, we say that strong duality holds; otherwise only weak duality holds.
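As a small numerical illustration of p^* and d^*, here is a sketch in Python. The toy problem (minimize x^2 subject to 1 - x \leq 0) is an assumption chosen for this example, not something from the notes above. Its Lagrangian is L(x,u) = x^2 + u(1-x), and the inner infimum over x is attained at x = u/2.

```python
def dual_function(u):
    # g(u) = inf_x [x^2 + u * (1 - x)]; the infimum is attained at x = u / 2.
    x = u / 2.0
    return x * x + u * (1.0 - x)

# Primal optimum: the smallest feasible x is 1, so p* = f(1) = 1.
p_star = 1.0 ** 2

# Dual optimum: maximize g(u) over a grid of multipliers u >= 0.
grid = [i / 1000.0 for i in range(0, 5001)]
d_star = max(dual_function(u) for u in grid)

# Every dual value is a lower bound on p*, and here the gap closes.
print(p_star, d_star, p_star - d_star)
```

Every value of g(u) stays below p^*, which is weak duality. For this convex toy problem the best dual value reaches p^*, so the dual gap is zero.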

I am not going to write about KKT&Slater for now…*


Link to Machine Learning Notes I

The least squares estimates of α and β

For simple linear regression:

E\left ( Y|X=x \right ) = \alpha +\beta x

we have:

\hat{\beta } = \frac{cov\left ( X, Y \right )}{var\left ( X \right )}

\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}
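The two estimates above can be computed directly from data. The following is a minimal Python sketch; the function name and the sample data are made up for illustration.

```python
def least_squares(xs, ys):
    """Return (alpha_hat, beta_hat) for the line y = alpha + beta * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Covariance and variance sums; the normalizing factors cancel in the ratio.
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = cov_xy / var_x
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat

# Made-up data lying exactly on y = 1 + 2x.
alpha_hat, beta_hat = least_squares([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(alpha_hat, beta_hat)
```

Because the sample points lie exactly on a line, the estimates recover its intercept and slope.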

Linear Regression way

We could always use a neural network (NN) to solve the regression problem, but then it is nearly impossible to tell exactly which layer captures which feature of the data. Thus, maybe the better way is to raise the dimension of the linear regression method: we use not only x but also x^{2}, x^{1/2}, ... to approach the true curve.

Classification method

Parametric methods:

Direct way: E\left ( Y|X=x \right ) = Sigmoid(\beta ^{T}x)

Bayes way *TODO:needs elaboration

Nonparametric methods:

KNN *TODO:needs elaboration

Elastic Net *TODO:needs elaboration

PCA *TODO:needs elaboration

Spline *TODO:needs elaboration

Local Regression *TODO:needs elaboration

Tree Methods *TODO:needs elaboration

Convex Analysis&Optimization

For a convex function, it is easy to find the global minimum, but for a non-convex function the situation is a little more complex. To analyze such a situation, we need to discuss it under different assumptions.

Lipschitz continuous

Definition, for all x_{0}, x_{1}:

\left | f(x_{0})-f(x_{1}) \right |\leq L\left | x_{0}-x_{1} \right |

For clarity, we use 2-dimensional space.

If the function is only Lipschitz continuous, its values can vary only within a certain range, but it need not be smooth, so we cannot apply gradient descent.

Let’s define a domain [0,1] and divide it into k cuts, so the cutting points are \frac{1}{k}, \frac{2}{k},...,\frac{k}{k}, and suppose the target is \chi \in [\frac{i}{k}, \frac{i+1}{k}]

Now, how about the distance |\chi -\frac{i}{k}|? Each cut has length \frac{1}{k}, and since we assume the function is Lipschitz continuous, we have |f(\chi) -f(\frac{i}{k})|\leq L|\chi -\frac{i}{k}| \leq \frac{L}{k}

And we have a concept called tolerance, \epsilon, which describes the precision of the result you want, for example, 1e-6.
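One way to connect the grid bound \frac{L}{k} with the tolerance \epsilon is to pick k so that \frac{L}{k} \leq \epsilon and then just evaluate the function at every cutting point. The sketch below does this in Python; the function name `grid_minimize` and the test function are assumptions made for illustration.

```python
import math

def grid_minimize(f, lipschitz, eps):
    """Approximate the minimum of an L-Lipschitz f on [0, 1] to within eps."""
    # With k cuts, every point is within 1/k of a grid point, so the best grid
    # value is at most L/k above the true minimum. Choose k so that L/k <= eps.
    k = math.ceil(lipschitz / eps)
    points = [i / k for i in range(k + 1)]
    best_x = min(points, key=f)
    return best_x, f(best_x), k

# Hypothetical test function: |x - 0.3| is Lipschitz continuous with L = 1.
best_x, best_val, k = grid_minimize(lambda x: abs(x - 0.3), lipschitz=1.0, eps=1e-3)
print(k, best_x, best_val)
```

The number of evaluations grows like L/\epsilon, which is the price of assuming nothing beyond Lipschitz continuity.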

*TODO: Here remains a question



When the gradient is Lipschitz continuous with constant L:

\left | \bigtriangledown f(x_{0})-\bigtriangledown f(x_{1}) \right |\leq L\left | x_{0}-x_{1} \right |

Which implies:

f(y)\leq f(x)+\bigtriangledown f(x)^{T}(y-x)+\frac{L}{2}\left | y-x \right |_{2}^{2}

So such an assumption implies a certain relationship between f(y) and f(x)


Set Convexity

convex set G
non-convex set G

Function Convexity

convex function f(G)

“A function is convex if and only if its epigraph, the region (in green) above its graph (in blue), is a convex set.”

So for a convex function, any local minimum is also a global minimum.

The Convergence rate of Gradient Descent

Suppose the function f:\mathbb{R}^{n}\rightarrow \mathbb{R} is convex and differentiable, and that its gradient is Lipschitz continuous with constant L > 0, i.e. we have that \left | \bigtriangledown f(x_{0})-\bigtriangledown f(x_{1}) \right |\leq L\left | x_{0}-x_{1} \right | for any x_{0}, x_{1}. Then if we run gradient descent for k iterations with a fixed step size t ≤ \frac{1}{L}, it will yield a solution x^{(k)} which satisfies:

f(x^{(k)})-f(x^{*})\leq \frac{\left | x^{(0)}-x^{*} \right |_{2}^{2}}{2tk}

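This bound shows how gradient descent guarantees improvement: the gap to the optimum shrinks at a rate of O(1/k), so the method converges.

Epochs to run: to reach a tolerance \epsilon, it is enough that \frac{\left | x^{(0)}-x^{*} \right |_{2}^{2}}{2tk}\leq \epsilon, i.e.

k\geq \frac{\left | x^{(0)}-x^{*} \right |_{2}^{2}}{2t\epsilon}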
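The convergence bound can be checked numerically. The sketch below runs gradient descent on a simple convex quadratic, which is an assumed example and not from the notes, and compares f(x^{(k)})-f(x^{*}) with the bound at every iteration.

```python
def f(x):
    # Convex quadratic with minimizer x* = (0, 0) and f(x*) = 0.
    return x[0] ** 2 + 2.0 * x[1] ** 2

def grad_f(x):
    return [2.0 * x[0], 4.0 * x[1]]

L = 4.0          # Lipschitz constant of the gradient (largest Hessian eigenvalue)
t = 1.0 / L      # fixed step size t <= 1/L
x0 = [3.0, -2.0]
dist_sq = x0[0] ** 2 + x0[1] ** 2   # ||x0 - x*||^2 with x* = 0

x = list(x0)
gaps, bounds = [], []
for k in range(1, 51):
    g = grad_f(x)
    x = [xi - t * gi for xi, gi in zip(x, g)]
    gaps.append(f(x))                        # f(x_k) - f(x*)
    bounds.append(dist_sq / (2.0 * t * k))   # right-hand side of the bound

print(gaps[-1], bounds[-1])
```

On this example the actual gap shrinks much faster than the bound. The bound is a worst-case guarantee over all convex functions with the same L.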




Bayes’ Rule

When I was in high school learning AP Statistics, I learned the formula:

P(A|B)=\frac{P(A\cap B)}{P(B)}, P(B|A)=\frac{P(A\cap B)}{P(A)}

Which can be rearranged as:

P(A\cap B)=P(A)\cdot P(B|A)=P(B)\cdot P(A|B)

P(A|B) is called the “conditional probability”, which pretty much explains itself. Back then I only knew the meaning of each element but not the whole idea, so all I did was plug in numbers, because the formula is kind of abstract to understand on its own: “The probability that event A happens given that event B happened = the probability that events A and B both happen divided by the probability that event B happens”

Before we talk about Bayes’ Rule, I want to discuss why conditional probability satisfies such a relationship:

In the Venn diagram, the outer space represents the whole sample space. We want to know P(A|B), which by definition is the probability of A given that B is true. Naturally, we take the portion of circle B in which A is true, A\cap B, and divide it by B: \frac{P(A\cap B)}{P(B)}. So the condition B restricts the space to the blue circle B, and what we do is just find the part of it in which A is true.

People who dig deep might ask: “That can’t be right. A and B are just sets of outcomes, not probabilities, so the reasoning above does not match the formula!” Well, the fact is that we omitted the sample space, which we call S. What we call “probability” is actually, using A as an example, P(A)=\frac{|A|}{|S|} (assuming equally likely outcomes). The S cancels out in the ratio, so I omitted it and directly used the A and B portions of the diagram.

P(A|B)=\frac{ P(B|A) }{P(B)}\cdot P(A)

So, what does Bayes’ Rule mean?

Based on the formula, it asks for the probability that event A happens under the restriction of event B. The formula itself can easily be derived from the conditional probability formula, so here we focus on how to interpret it.

The basic idea of Bayes’ Rule is to adjust the general probability, in this case P(A), by a factor \frac{P(B|A)}{P(B)} to get a better estimate of the probability of A under a restriction, or a piece of new evidence, B. The result is P(A|B).

More related equations (the law of total probability, where the events A_{1},...,A_{n} partition the sample space):

P(B)=\sum_{i=1}^{n}P(B|A_{i})\cdot P(A_{i})
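Putting Bayes’ Rule and the total probability formula together, here is a minimal Python sketch for the two-event partition (A and not A). The function name and all the numbers are made up for illustration.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) using Bayes' rule and the law of total probability."""
    # P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    # P(A|B) = P(B|A) / P(B) * P(A)
    return p_b_given_a / p_b * prior_a

# Made-up numbers: A = "has the condition", B = "the test is positive".
p = posterior(prior_a=0.01, p_b_given_a=0.9, p_b_given_not_a=0.05)
print(round(p, 4))
```

With these numbers the evidence B raises the probability of A from the prior 0.01 to roughly 0.15, because the adjusting factor \frac{P(B|A)}{P(B)} is much larger than 1.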


My Graduation Speech

My dear teachers and parents, my lovely fellow class of 2020: it is my great honor to speak here as a student representative, and thank you for choosing to join today’s graduation ceremony.

You know, this might be the last time you see most of the people around you, your classmates, your teachers. I mean, you might not even get in touch with them anymore. Maybe in the near future you will call them, you will WeChat them, but as long as your life track does not match theirs, eventually you will drift apart.

20,000 days: this is the number of days you last in this world. Have you figured out who you are yet? And what does “you” even mean? Are you still you if you lose your arms, or your legs? If you have a third-degree burn and lose your original skin and look, are you still you? And actually the human cells in your body are fully replaced every seven years. Are you just the physical body? Well, yes, but no. You see, if we begin to rip off your body parts, when do you stop being you? If I replace your brain with someone else’s brain, would this body still be considered you? I don’t think so. OK, the brain: if we say the brain is you, then what is it? Can two random babies who grow up with the exact same experience have the same consciousness? Can two identical babies with the exact same brain structure who grow up with different experiences have the same consciousness? Obviously not. So how about two babies with the exact same brain structure who grow up with the exact same experience? Yes, they definitely would. You are the current consciousness, which is the combination of your brain structure and your memory.

According to the second law of thermodynamics, the entropy of an isolated system never decreases. This implies that one day our universe will be completely homogeneous, the same everywhere. It also implies that no one can reverse the process, because the information is already lost. Our entire existence will be nothing and cannot even be traced. What do we live for? What do we want? What is the goal? What is the fuel that drives your existence?

Our life, or our consciousness, is about making choices. This is a very general idea, which ranges from which eye you are going to blink in the next second to the college experience you choose to have. There is a standard that helps us make decisions, a feeling that we expect to get, and we can call it happiness. Happiness here is not about typically positive emotional feelings like “fun”, “chill”, or “relaxed”. It is the general thing we expect to get after choosing to do something. We want happiness; this is the only goal, and the only fuel that keeps us burning. Let’s take an example: if you choose to save another person by sacrificing yourself, you do so just because you expect to get more happiness that way, and that’s it.

Happiness is the only reason. At any moment we choose the option we expect to bring us more happiness. So you cannot blame your past self; you are always making the best decision you can think of.

Our universe might have started with a simple set of parameters, and theoretically it could be calculated. Does that mean we do not have free will? Is our destiny a sure thing? Yes, but no. It is true that our universe could be calculated, but only the universe itself has the huge computing power to do so, so when we talk about prediction, it is actually a real-time calculation.

Sadly, even though every time we make the choice we expect to bring us more happiness, it still does not come. The problem is that many people do not have a clear and clean standard for choosing. Do you understand who you are? What do you want? The so-called feelings of guilt, confusion, and regret all come from having no long-term goal. But if you have one, then it is OK to offend people, or to be offended. You will not regret short-term failures, and you will not be confused right after making choices. Once you have the standard, every choice you make is a trajectory regressing toward that target. Maybe the instant happiness does not come, but every adjustment is bringing you out of the comfort zone that is dragging you down.

The above are my answers to: “Who am I?”, “Where am I?”, and “Where am I going?”

So, my schoolmates, my speech is over and your new life is beginning. Please answer these questions; I hope you all can make the right choices.


Lately I have been studying machine learning, and outputting (taking notes) is a vital step of that process. Here, I am using Andrew Ng’s Stanford Machine Learning course on Coursera, which uses MATLAB.

So the code I write in this post is by default based on MATLAB.

What is ML?

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

Tom Mitchell

Supervised Learning&Unsupervised Learning

SL: with labels; direct feedback; predict

Under SL, there are regression and classification

USL: without labels; no feedback; finding the hidden structure

Under USL, there are clustering and non-clustering 

For now, I will focus on these two and not on reinforcement learning.

The Basic Model & Notation

We use x^{(i)} to represent the “input” value, where x is the variable and i is its position in a matrix, or most of the time a vector. And y^{(i)} is the actual “output” for the input x^{(i)} at position i. A pair (x^{(i)}, y^{(i)}) is called a training sample, and a list of such samples with i=1,...,m is called a training set. The purpose of ML is to find a “good” hypothesis function h(x) which can predict the output while knowing only the input x. If we only want a simple linear form of h(x), then it looks like h(x)=\theta_0 + \theta_1x, where \theta_0 and \theta_1 are the parameters we want to find so that h(x) predicts “better”.

Linear Algebra Review

Matrix-Vector Multiplication:\begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix} *\begin{bmatrix} x\\y \end{bmatrix} =\begin{bmatrix} a*x + b*y \\ c*x + d*y \\ e*x + f*y \end{bmatrix}

Matrix-Matrix Multiplication: \begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix} * \begin{bmatrix} w & x \\ y & z \\ \end{bmatrix}=\begin{bmatrix} a*w + b*y & a*x + b*z \\ c*w + d*y & c*x + d*z \\e*w + f*y & e*x + f*z \end{bmatrix}

The identity matrix looks like this, with 1 on the diagonal and zeros everywhere else: \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ \end{bmatrix}


Multiplication Properties

Matrix multiplication is not commutative in general:  A∗B \neq B∗A

Matrix multiplication is associative: (A∗B)∗C = A∗(B∗C)
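Both properties are easy to check on small matrices. The sketch below uses plain Python lists rather than the MATLAB used in the course; the helper `matmul` and the example matrices are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [0, 3]]

print(matmul(A, B))   # [[2, 1], [4, 3]]
print(matmul(B, A))   # [[3, 4], [1, 2]]

# Associativity: grouping does not change the product.
print(matmul(matmul(A, B), C) == matmul(A, matmul(B, C)))
```

Here A∗B swaps the columns of A while B∗A swaps its rows, so the two products differ.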

Inverse and Transpose

Inverse: A matrix A multiplied by its inverse inv(A) gives the identity matrix I:

I = A*inv(A)

Transposition flips the matrix across its main diagonal, so rows become columns. For a matrix A with dimension m * n, its transpose has dimension n * m:

A = \begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix}, A^T = \begin{bmatrix} a & c & e \\ b & d & f \\ \end{bmatrix}

Also we can get:

A_{ij} = A^{T}_{ji}

Cost Function

A cost function shows how accurately our hypothesis function predicts by outputting the error (the deviation between the prediction h(x) and the actual output y). It looks like this:

J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x^{(i)}) - y^{(i)} \right)^2

For people who are familiar with statistics, it is called the “squared error function”: the square makes each error a positive value, and the \frac{1}{2} helps to simplify the expression later when we take the derivative during gradient descent. Now the question becomes “How do we find the \theta_0 and \theta_1 that minimize J(\theta_0, \theta_1)?”

Contour Plot

J(\theta_0, \theta_1) in contour plot From Andrew Ng

A contour plot is an alternative way to show 3D graphs in 2D, in which the color blue represents low points while red means high ones. So the pair (\theta_0, \theta_1) at the lowest point of the plot, the center of the innermost contour, is the set of parameters that gives h(x) the lowest error against the actual output y.

Gradient Descent

Gradient descent is one of the most basic ML tools. The basic idea is to “move in small steps” that lead toward minimizing the cost function J(\theta). It looks like this:

Repeat until convergence: 
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)

Here, the operator := means assigning the latter part to the former part; in many programming languages this is written as =. We call the \theta_j on the left the “next step” and the one on the right the “current position”. \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) gives the “direction” in which a move increases J(\theta) the most, so we just add a negative sign to make it the direction of fastest decrease. \alpha gives the length of each step. It is important to update every \theta_j simultaneously.

If we expand the update rule above for linear regression, then we have:

repeat until convergence: 
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)}) 
\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}((h_\theta(x^{(i)}) - y^{(i)}) x^{(i)})

The factor x^{(i)} is nothing but a result of the derivative; there is no x^{(i)} in the update for \theta_0 because we defined x^{(i)}_0 as 1.

Here is the full derivation of the partial derivative of the cost function J(\theta), written for a single training example:

\begin{aligned}\frac{\partial }{\partial \theta_j}J(\theta) &= \frac{\partial }{\partial \theta_j}\frac{1}{2}(h_\theta(x)-y)^{2}\\&=2 \cdot \frac{1}{2}(h_\theta(x)-y) \cdot \frac{\partial }{\partial \theta_j}(h_\theta(x)-y)\\&= (h_\theta(x)-y)\frac{\partial }{\partial \theta_j}\left ( \sum\limits_{i=0}^n\theta_{i}x_{i}-y \right )\\&=(h_\theta(x)-y)x_j\end{aligned}

This basic method is called batch gradient descent because each step uses the whole training set we provide. Just for future reference, for linear regression J(\theta) is convex, which means it has only one global minimum and cannot get stuck in other local minima.
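The update rule for \theta_0 and \theta_1 can be sketched in a few lines. This version is written in Python rather than the MATLAB used in the course, and the data, the learning rate \alpha = 0.1 and the iteration count are assumptions picked for the example.

```python
def cost(theta0, theta1, xs, ys):
    """Squared error cost J(theta0, theta1) with the 1/(2m) factor."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Errors h(x) - y over the whole training set (batch gradient descent).
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients use the old parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Made-up training set lying on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1, cost(theta0, theta1, xs, ys))
```

Computing both gradients before assigning either parameter is what makes the update simultaneous. A larger \alpha can make the iteration diverge on this data, while a smaller one needs more iterations.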

Convex function&non-convex function

Multivariate Linear Regression

Now say we have not just one input variable but many of them. Then we use j in x_j, from 1 to n, as the index of the feature, just like we use i, from 1 to m, as the index of the training example.

x_{j}^{(i)} = value of feature j in the i^{th} training example

For convenience of notation, we define x_0 = 1, since we have \theta_0 in the hypothesis function and want the matrix multiplication to work:

x = \begin{bmatrix} x_1\\x_2 \\\vdots\\x_n \end{bmatrix} \in\mathbb{R}^{n} , \theta = \begin{bmatrix}\theta_0\\\theta_1\\\theta_2\\\vdots\\\theta_n \end{bmatrix}\in\mathbb{R}^{n+1} \rightarrow x = \begin{bmatrix}x_0\\x_1\\x_2\\\vdots\\x_n\end{bmatrix}\in\mathbb{R}^{n+1}, \theta = \begin{bmatrix}\theta_0\\\theta_1\\\theta_2\\\vdots\\\theta_n \end{bmatrix}\in\mathbb{R}^{n+1}

And the cool thing we can do now is use vectorization to represent the long multivariable hypothesis function:

h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + ... + \theta_n x_n = \begin{bmatrix}\theta_0&\theta_1&\cdots&\theta_n\end{bmatrix}\begin{bmatrix}x_0 \\ x_1 \\ \vdots \\ x_n\end{bmatrix}=\theta^T x
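As a small illustration of \theta^T x, here is a Python sketch (not the course's MATLAB) that prepends x_0 = 1 and takes the dot product. The function name and the numbers are made up.

```python
def hypothesis(theta, features):
    """Compute theta^T x after prepending x_0 = 1 to the feature list."""
    x = [1.0] + list(features)
    return sum(t * xj for t, xj in zip(theta, x))

theta = [1.0, 2.0, 3.0]
value = hypothesis(theta, [4.0, 5.0])   # 1*1 + 2*4 + 3*5
print(value)
```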

Feature Scaling

If the input set x contains features whose data ranges differ greatly, the process of finding \theta can oscillate, be slow, or even fail. Feature scaling, or mean normalization, is a technique to make the range of data in each feature more even, and the process is very familiar if you know statistics:

x_j := \dfrac{x_j - \mu_j}{s_j}

So the input x with feature index j minus the mean \mu_j of the data in this feature, then divided by the standard deviation s_j (or the range in some cases)
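A minimal Python sketch of this normalization for one feature, using the standard deviation as the divisor. The function name and the sample values are made up, and using the population standard deviation is an assumption; the range max - min would work as well.

```python
import statistics

def scale_feature(values):
    """Mean-normalize one feature: subtract the mean, divide by the std."""
    mu = statistics.mean(values)
    s = statistics.pstdev(values)   # population standard deviation
    return [(v - mu) / s for v in values]

# Made-up feature, e.g. house sizes with a large range.
sizes = [1000.0, 1500.0, 2000.0, 2500.0]
scaled = scale_feature(sizes)
print(scaled)
```

After scaling, the feature has mean 0 and standard deviation 1, so it sits on the same scale as any other feature treated the same way.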

Normal Equation

Other than gradient descent, there is another way to find the \theta that minimizes the cost function J(\theta). We first need to construct a matrix X, which is another way to represent the input data set:

x = \begin{bmatrix}x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \rightarrow X = \begin{bmatrix}x^{(1)}_0 & x^{(1)}_1 & \cdots&x^{(1)}_n \\ x^{(2)}_0&x^{(2)}_1 & \cdots & x^{(2)}_n \\ \vdots & \vdots & \ddots & \vdots \\x^{(m)}_0&x^{(m)}_1 & \cdots & x^{(m)}_n \end{bmatrix} = \begin{bmatrix} 1&x^{(1)}_1&\cdots&x^{(1)}_n \\ 1&x^{(2)}_1&\cdots&x^{(2)}_n \\ \vdots &\vdots &\ddots &\vdots \\ 1&x^{(m)}_1 &\cdots &x^{(m)}_n \end{bmatrix}

Each row of the matrix X is the transpose of one training example x^{(i)} and contains the values of all features for that example. The normal equation itself looks like:

\theta = (X^{T}X)^{-1}X^{T}y

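I am not going to show where it comes from, but compared to gradient descent, the normal equation: 1. needs no choice of \alpha; 2. needs no iteration; 3. but is slow when the number of features n is large, because it has to compute the inverse of X^{T}X.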
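For a single feature, X^{T}X is only a 2 * 2 matrix, so the normal equation can be written out by hand. The sketch below does that in plain Python (again not the course's MATLAB); the function name and the data are made up for illustration.

```python
def normal_equation(xs, ys):
    """Solve theta = (X^T X)^{-1} X^T y for X with rows [1, x]."""
    m = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[m, sx], [sx, sxx]] and X^T y = [sy, sxy].
    det = m * sxx - sx * sx
    # The inverse of the 2 x 2 matrix is (1/det) * [[sxx, -sx], [-sx, m]].
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (m * sxy - sx * sy) / det
    return theta0, theta1

# Made-up training set lying on y = 1 + 2x.
theta0, theta1 = normal_equation([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(theta0, theta1)
```

There is no \alpha and no loop over iterations. For many features the explicit inverse would be replaced by a linear solver.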


Not only do we need to solve continuous problems (linear regression), but also a lot of discrete problems, like whether someone has cancer (YES/NO) given the size of the tumor. Normally we use 1 and 0 to represent the two outcomes. The new form of hypothesis that better expresses the concept of classification uses the sigmoid function:

h(x) = \dfrac{1}{1 + e^{-\theta^{T}x}}

So what we did here is basically put the original hypothesis function \theta^{T}x into the standard sigmoid function:

g(z) = \dfrac{1}{1 + e^{-z}}

A standard sigmoid function

So the new hypothesis function outputs a value between 0 and 1, which we read as the probability that the output is 1 rather than 0.
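The hypothesis h(x) = g(\theta^{T}x) can be sketched as below, in Python rather than the course's MATLAB. The parameter values are hypothetical, and x is assumed to already include x_0 = 1.

```python
import math

def sigmoid(z):
    """Standard sigmoid g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    """Logistic hypothesis: sigmoid of theta^T x."""
    z = sum(t * xj for t, xj in zip(theta, x))
    return sigmoid(z)

# Hypothetical parameters and input: theta^T x = -4 + 1 * 6 = 2.
p = h([-4.0, 1.0], [1.0, 6.0])
print(sigmoid(0.0), p)
```

Note that `math.exp` overflows for very negative z (below about -709), so a production version would need to guard against that.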

Decision Boundary

We consider:

h(x) \geq 0.5 \rightarrow y = 1 \\ h(x) < 0.5 \rightarrow y = 0

Because of the behavior of the logistic function:

\theta^{T}x=0, e^{0}=1 \Rightarrow h(x)=1/2 \\ \theta^{T}x \to \infty, e^{-\infty} \to 0 \Rightarrow h(x) \to 1 \\ \theta^{T}x \to -\infty, e^{\infty}\to \infty \Rightarrow h(x) \to 0

So that:

\theta^T x \geq 0 \Rightarrow h(x) \geq 0.5 \Rightarrow y = 1 \\ \theta^T x < 0 \Rightarrow h(x) < 0.5 \Rightarrow y = 0

Then you can just set \theta^T x = 0 to get the decision boundary. For example:

\theta = \begin{bmatrix}5 \\ -1 \\ 0\end{bmatrix} \\ y = 1 \; \mathbf{if} \; 5 + (-1) x_1 + 0 x_2 \geq 0 \\ Decision Boundary: x_1 \leq 5

The plot should look like:

The green portion is “1” while the red(x_1>5) is “0”
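The same example can be checked with a few lines of Python (a sketch, not the course's MATLAB). The function name and the test points are made up; \theta is the one from the example above.

```python
def predict(theta, x1, x2):
    """Return 1 when theta^T x >= 0 (that is, h(x) >= 0.5), else 0."""
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return 1 if z >= 0 else 0

theta = [5.0, -1.0, 0.0]
labels = [predict(theta, x1, 0.0) for x1 in (3.0, 5.0, 7.0)]
print(labels)   # [1, 1, 0]
```

Since the weight on x_2 is 0, the prediction depends only on x_1, and the boundary is the vertical line x_1 = 5.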


There was a big explosion; time was created, the world was created. The particles appeared, they acted on each other through forces, they interacted with each other, and then the future was determined. It is just like starting a game of billiards: at the moment the cue hits the first ball, the whole future is determined. If the initial conditions of the universe were typed into a supercomputer, the computer could precisely show (but not predict) the future. Or there is no future anyway…

By the second law of thermodynamics, entropy increases as a by-product of time. Everything will end up the same in every direction.

Then, if our “fate” is pre-determined and will come to nothing,

why do we not just give up our lives immediately?

Because human emotion is preventing it.

Trying to give a fake funny cover.


The Green Night

It’s summer night, hot, but cold inside.

The castle is full of people, ladies and knights. At that time, I might have been the youngest one there. “What is a real knight?” King Arthur suddenly asked, looking at me with his deep green eyes. I did not expect that, but quick as is my nature, I answered with the words I had called out thousands of times when I was an initiate knight: “Honor, wisdom, humility, never afraid!” He looked at me with satisfaction. I firmly believed those words; I had already engraved them on my heart as my motto. But until that man came, I had no clear understanding of them.

It should have been another banquet to celebrate another triumphant return of King Arthur. He had conquered the East and the West and had never failed. But today, the real star was the Green Knight. He was like a green sword stuck into the crowd; the ladies were shocked and ran around like sheep invaded by a green wolf. He called himself a knight, but the Green Knight had not brought any equipment, not even a sword. That was not the most striking part: he wanted someone to cut his head off! The other knights looked at King Arthur, but I thought it was my chance. Just chop a head! After my request, the Green Knight looked at me with a strange expression.

The Green Knight’s head was off, but he was still alive. He grasped it by the hair and looked at me. The thing is, he asked me to come to his place in one year so that he could cut my head off. It is unfair! I was raging in my mind: he has some kind of magic, and I don’t! But I could not say that out loud. King Arthur was looking at me, deeply, with a smile.

Time in the castle can pass very quickly. I tried not to think about the promise, but the days passed, and finally the day came. I had to go: this was a promise not only to King Arthur and the Green Knight, but also to myself, to my chivalry.

The Green Knight lived on the top of a snowy mountain, but I decided to take a rest in a castle at the foot of the mountain. The master of the castle was a hunter; he and his wife, who had crystal blue eyes, long smooth black hair, and a lithe body, welcomed me. The wife made an unspeakably delicious borscht for me, and even at that late hour she put shaped basil on top of the soup as a decoration. The hunter was out all day, so naturally the lady and I had a great time. It was a joy for both of us, and I do consider it a violation of the motto. The night before I went to meet the Green Knight, I got a special gift: a sash that could protect its wearer from harm. I did not think much about it.

I climbed to the top without the slightest fear; I could almost see the banquet of a triumphant return, held only for me. I saw him, took a big step in front of him, and let him chop my head off. He looked at me and slowly transformed into the hunter. He looked at me, and I could not make any sound. With just a slight wave of his hand, my sash was off. Then he turned and slowly walked away.

I know that I may not be the smartest or the strongest knight, or the humblest, and I may never be. But here, now, at this moment, I cannot turn around; I cannot return to King Arthur. I can only either live the rest of my life in shame or die in honor, die without fear.

“Wait! There is a head you forgot to take with you.”

He stopped and turned around, still with that smile.

I had my return, and I also realized that the Green Knight could give his power to others.

Years later, there is another Green Knight living in that castle.



AI face swapping is a product of neural network and machine learning technologies. This article aims to be a tutorial that lets people without experience in machine learning replace the face in one video with another face.

Key Words: #deepfakes #faceswap #face-swap #deep-learning #deeplearning #deep-neural-networks #deepface #deep-face-swap #fakeapp #fake-app #neural-networks #neural-nets

What is DeepFaceLab?

DeepFaceLab is a tool that utilizes machine learning to replace faces in videos.  Github Project: Click here. DeepFaceLab does not have a GUI, but it does not require much RAM (at least 2 GB).

1. Environment

Hardware requirement:

I. Graphics Card: Nvidia graphics card with CC (Compute Capability) > 3.0 (see the CC Chart), or at least a GTX 1060 6G. AMD is not supported 🙁

II. CPU: Intel or AMD CPU recommended, with RAM > 8G

III. OS: Windows 10 x64

Software requirement:

I. Microsoft Visual Studio 2015: Download

In the custom options, select Visual C++

II. CUDA 9.0 For Windows 10: Download

III. CuDNN 7.0.5: Download

Open the downloaded CuDNN zip, and copy the three files in it to CUDA’s install location

Red files were copied from the CuDNN file

The Environment in this tutorial:

Graphics Card: Nvidia GTX GeForce 1070 8G

RAM: 16.0GB

CPU: Intel(R) Core(TM) i7-7700k 4.2GHz

OS: Windows 10 Pro 1903 – 18362.10024

2. DeepFaceLab

I. Install:

DeepFaceLab: Download

II. Preparation

1. Prepare two videos, and name them:

The video with the face that needs to be replaced: data_dst

The video with the face that will be put onto the other video: data_src

2. Put those two into the workspace

III. Processing

There are many .bat files; run them in the order shown below:

All .bat needed

(1) clear workspace

Before building a new face-swap model, you want to clear the previous data

(2) extract images from video data_src

Press Enter to skip (use the defaults)

(3.2) extract images from video data_dst FULL FPS

Press Enter to skip (use the defaults)

(4) data_src extract faces MT all GPU

Detect the faces in data_src and extract them

(4.2.2) data_src sort by similar histogram

(4.1) data_src check result

Delete unwanted faces

(4.2.1) data_src sort by blur

(4.1) data_src check result

Delete unwanted faces

(5) data_dst extract faces MT all GPU

(5.2) data_dst sort by similar histogram

(5.1) data_dst check results

Delete unwanted faces

(5.3) data_dst sort by blur

(5.1) data_dst check results

Delete unwanted faces

(6) train H128

Press Enter to skip (use the defaults)

Press P to update previews; press Enter to save and quit

We wait until the loss value is close to 0.02

Training state with around 2.5 loss rate
Training state with around 1.0 loss rate

(7) convert H128



Erode mask
Blur Mask
Hist match
Face Scale
Factors in this case

Press Enter to convert all the frames

(8) converted to mp4

The whole video will be uploaded later on
