Optimisation | Gradient descent

And a refresher on derivatives

Pierre Portal
3 min read · Nov 9, 2018

To better understand what’s going on underneath gradient descent, it has been useful for me to really understand what the derivative of a function is and how to calculate it.

So we know that the derivative is the rate of change of a function at a particular point. Basically, it’s the slope of a function calculated with two points really, really close to each other.

Here are the steps of the calculation:

Let’s take the equation y = f(x).
When x increases by △x, y increases by △y:

y + △y = f(x + △x)

If we subtract the first formula from the second:

(y + △y) − y = f(x + △x) − f(x)

And simplify the resulting equation:

△y = f(x + △x) − f(x)

Divide by △x to get the rate of change:

△y / △x = ( f(x + △x) − f(x) ) / △x

And shrink △x really, really close to zero, renaming it dx:

dy / dx = lim (△x → 0) ( f(x + △x) − f(x) ) / △x = f ’(x)
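To check this numerically, here is a tiny Python sketch (my own, not part of the original derivation) that approximates f ’(x) with a small △x, using f(x) = x² as an example:

```python
# Approximate the derivative with a small step dx: (f(x + dx) - f(x)) / dx
def numerical_derivative(f, x, dx=1e-6):
    return (f(x + dx) - f(x)) / dx

f = lambda x: x ** 2                  # f(x) = x², so f'(x) = 2x
print(numerical_derivative(f, 3.0))   # ≈ 6.0, matching f'(3) = 2 * 3
```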

Some differentiation rules are good to know, for example the derivative of a constant, of x, the power rule and — of course — the chain rule.
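As a reminder (written in my own notation, the original post only names them), those rules look like this:

```latex
\frac{d}{dx}\,c = 0 \qquad
\frac{d}{dx}\,x = 1 \qquad
\frac{d}{dx}\,x^n = n\,x^{n-1} \qquad
\frac{d}{dx}\,f\big(g(x)\big) = f'\big(g(x)\big)\,g'(x)
```

For example, the chain rule gives d/dx (3x + 1)² = 2(3x + 1) · 3 = 18x + 6.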

The gradient, now, is what allows us to combine calculus and linear algebra: it stores the partial derivatives of a function of multiple variables (like f(x,y) = x²y, for example) in a vector.
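For the f(x,y) = x²y above, for instance, taking the partial derivative with respect to each variable gives (my worked example):

```latex
\nabla f(x, y) =
\begin{pmatrix} \dfrac{\partial f}{\partial x} \\[6pt] \dfrac{\partial f}{\partial y} \end{pmatrix}
=
\begin{pmatrix} 2xy \\ x^2 \end{pmatrix}
```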

Here is the definition of a gradient from Wikipedia:

The gradient of a scalar function f(x₁, x₂, …, xₙ) is denoted ▽f where ▽ (the nabla symbol) denotes the vector differential operator, del. The notation grad f is also commonly used for the gradient. The gradient of f is defined as the unique vector field whose dot product with any vector v at each point x is the directional derivative of f along v. That is, (▽f(x)) · v = D_v f(x).

ref : Wikipedia

If we “dot” our grad vector with another vector r, we have:

▽f · r

So by doing this, a dot product of our grad f and a vector r, we get the directional derivative of f along r. For example:
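Here is a worked example (my own numbers, standing in for the original figure), again with f(x,y) = x²y, so ▽f = (2xy, x²), and a direction r = (1, 1):

```latex
\nabla f \cdot r = (2xy,\; x^2) \cdot (1,\, 1) = 2xy + x^2
```

At the point (1, 2) this gives 2·1·2 + 1² = 5. Strictly speaking, to get the directional derivative along r we would first normalise r to unit length, which here just divides the result by √2.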

To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point. If, instead, one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function.
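As a minimal sketch of that update rule (my own example, using the simple bowl-shaped f(x, y) = x² + y² as a stand-in for a cost function):

```python
# Gradient descent on f(x, y) = x^2 + y^2, whose gradient is (2x, 2y)
# and whose minimum is at (0, 0).
learning_rate = 0.1
x, y = 3.0, -4.0                        # arbitrary starting point

for step in range(100):
    grad_x, grad_y = 2 * x, 2 * y       # gradient at the current point
    x = x - learning_rate * grad_x      # step against the gradient
    y = y - learning_rate * grad_y

print(x, y)                             # both coordinates end up very close to 0
```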

If we already know briefly what gradient descent is and why we use it to minimise a function (our cost function, for example), then the description of gradient descent on Wikipedia is quite easy to understand now.

