Neural network introduction | Part 3
Backward propagation
Once we have a model ready to perform forward propagation and a function that returns a cost J, we can perform backward propagation and use gradient descent to update our parameters W and b and minimise our cost function.
To perform backward propagation we need to compute the partial derivatives of our cost function J with respect to our parameters, because we need to understand how a change in their values will affect our cost. To do so, we can first compute the error for each layer:
With the subscript j representing the j-th neuron in the l-th layer.
To compute the error of the output layer L, we multiply grad J (a vector whose components are the partial derivatives ∂J/∂a_j[L]) by the derivative of the activation function (here sigmoid) of z[L].
The operation ⊙ between those two components is a Hadamard product.
The Hadamard product is a binary operation that takes two matrices of the same dimensions, and produces another matrix where each element i,j is the product of elements i,j of the original two matrices. (Wikipedia)
σ'(z) = σ(z)(1-σ(z)).
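As a minimal numpy sketch of this formula (the arrays dAL and ZL below are hypothetical values of ∂J/∂a[L] and z[L] for an output layer with one neuron and three examples):

import numpy as np

dAL = np.array([[0.5, -1.2, 0.8]])   # components ∂J/∂a[L] of grad J
ZL = np.array([[0.1, 0.7, -0.3]])    # pre-activations z[L]

# δ[L] = grad J ⊙ σ'(z[L]); in numpy, * is the element-wise (Hadamard) product
sigma = 1 / (1 + np.exp(-ZL))
delta_L = dAL * sigma * (1 - sigma)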
To compute the error of the other layers, we "move the error" backward through the neural network: we take the dot product of the transposed weight matrix W[l+1] with the error of layer l+1, then multiply it element-wise by the derivative of sigmoid(z[l]).
We can see the pattern of moving the error backward through a deep network:
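A minimal numpy sketch of one step of this pattern, with hypothetical shapes (4 neurons in layer l, 2 neurons in layer l+1, 3 examples) and a sigmoid activation on layer l:

import numpy as np

np.random.seed(0)
W_next = np.random.randn(2, 4)        # W[l+1]
delta_next = np.random.randn(2, 3)    # error δ[l+1] of layer l+1
Zl = np.random.randn(4, 3)            # pre-activations z[l]

# δ[l] = W[l+1]ᵀ · δ[l+1] ⊙ σ'(z[l])
sigma = 1 / (1 + np.exp(-Zl))
delta_l = np.dot(W_next.T, delta_next) * sigma * (1 - sigma)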
Now, we can compute the partial derivatives of J with respect to W and b:
And we can now update the parameters. We need a new hyperparameter here: the learning rate, noted α, which is the size of the steps gradient descent takes to reduce J.
In the Neural Networks and Deep Learning course on Coursera we can see the following set of equations to compute ∂J/∂W and ∂J/∂b:
With its vectorised version, where m is the number of examples (x, y):
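From memory of the course, those vectorised equations take roughly the following form for a layer l with activation function g:

∂Z[l] = ∂A[l] ⊙ g'(Z[l])
∂W[l] = (1/m) · ∂Z[l] · A[l-1]ᵀ
∂b[l] = (1/m) · Σ (over the m examples) of ∂Z[l]
∂A[l-1] = W[l]ᵀ · ∂Z[l]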
We also have this Python code to compute ∂A[L], which is the derivative of J with respect to A[L] (or ŷ):
-(np.divide(Y,AL)-np.divide(1-Y, 1-AL))
It's easier to understand this way, but we can see we are computing exactly the same thing: dZ is our layer's error. Even if this last example is easier to understand, I think it's important to be able to read the more compact notation used in more advanced courses and documentation. It is also easier to manipulate, and it becomes easier to reuse these concepts in other fields or for other purposes than building a "simple" neural network.
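For reference, that expression follows directly from differentiating the cross-entropy cost for a single example:

J = -( y·log(a) + (1-y)·log(1-a) )
∂J/∂a = -( y/a - (1-y)/(1-a) )

which is exactly what the numpy line above computes element-wise over the whole batch.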
We can now build a functioning L=2 neural network by assembling the building blocks: parameter initialisation, then a loop of forward propagation, computation of the cost, backward propagation and parameter updates.
In the Coursera deep learning course, we implement the following code:
import numpy as np

def init_parameters(n_x, n_h, n_y):
    # small random weights and zero biases for a 2-layer network
    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
    param = {"W1": W1,
             "b1": b1,
             "W2": W2,
             "b2": b2}
    return param
With init_parameters, by giving as input the number of neurons in each layer, we create the parameters W1, b1, W2 and b2.
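For example, with a hypothetical network of 2 input features, 4 hidden neurons and 1 output neuron, we can check the shapes of the created parameters:

params = init_parameters(2, 4, 1)
print(params['W1'].shape, params['b1'].shape)   # (4, 2) (4, 1)
print(params['W2'].shape, params['b2'].shape)   # (1, 4) (1, 1)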
def linear_activation_forward(A_prev, W, b, activation):
    # Z = W·A_prev + b, then A = g(Z); the cache keeps the values needed for backprop
    if activation == 'sigmoid':
        Z, linear_cache = np.dot(W, A_prev) + b, (A_prev, W, b)
        A, activation_cache = sigmoid(Z)
    if activation == 'relu':
        Z, linear_cache = np.dot(W, A_prev) + b, (A_prev, W, b)
        A, activation_cache = relu(Z)
    cache = (linear_cache, activation_cache)
    return A, cache
linear_activation_forward computes Z = W·A_prev + b and A = g(Z), then returns the activation A together with a cache holding the intermediate values used to compute Z.
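linear_activation_forward assumes sigmoid and relu helper functions that are not shown in the course extract above. Here is a minimal sketch, assuming the convention that each helper returns the activation and keeps Z as its cache:

def sigmoid(Z):
    # A = 1 / (1 + e^-Z); Z is kept as the activation cache
    A = 1 / (1 + np.exp(-Z))
    return A, Z

def relu(Z):
    # A = max(0, Z); Z is kept as the activation cache
    A = np.maximum(0, Z)
    return A, Z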
def compute_cost(AL, Y):
    m = Y.shape[1]
    # cross-entropy cost, averaged over the m examples
    cost = -np.sum(np.multiply(np.log(AL), Y) + np.multiply(np.log(1 - AL), 1 - Y)) / m
    cost = np.squeeze(cost)
    return cost
This returns the cross-entropy cost.
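As a quick sanity check with hypothetical predictions and labels for three examples:

AL = np.array([[0.8, 0.1, 0.6]])   # predicted probabilities ŷ
Y = np.array([[1, 0, 1]])          # true labels
print(compute_cost(AL, Y))         # ≈ 0.28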
def linear_activation_backward(dA, cache, activation):
    linear_cache, activation_cache = cache
    # dZ = dA ⊙ g'(Z), then the linear step gives dA_prev, dW, db
    if activation == 'sigmoid':
        dZ = sigmoid_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
    if activation == 'relu':
        dZ = relu_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
    return dA_prev, dW, db
linear_activation_backward takes as input ∂A, the cache values from linear_activation_forward, and the type of activation function ('sigmoid' or 'relu'). It computes ∂Z, then passes ∂Z and the values of A_prev, W and b from cache to the linear_backward function, which computes ∂A_prev, ∂W and ∂b.
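Like sigmoid and relu, the sigmoid_backward and relu_backward helpers are assumed to exist (in the course they live in a utility file). A minimal sketch consistent with the caches used here:

def sigmoid_backward(dA, activation_cache):
    # dZ = dA ⊙ σ'(Z), where the cache holds Z
    Z = activation_cache
    s = 1 / (1 + np.exp(-Z))
    return dA * s * (1 - s)

def relu_backward(dA, activation_cache):
    # the derivative of relu is 1 where Z > 0 and 0 elsewhere
    Z = activation_cache
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0
    return dZ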
def linear_backward(dZ, cache):
    A_prev, W, b = cache
    m = A_prev.shape[1]
    # gradients of the linear step, averaged over the m examples
    dW = np.dot(dZ, A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)
    return dA_prev, dW, db
Now, linear_backward takes as input ∂Z and cache to compute ∂A_prev, ∂W and ∂b.
def update_parameters(parameters, grads, learning_rate):
    # one gradient descent step on every W[l] and b[l]
    L = len(parameters) // 2
    for l in range(L):
        parameters['W' + str(l+1)] = parameters['W' + str(l+1)] - learning_rate * grads['dW' + str(l+1)]
        parameters['b' + str(l+1)] = parameters['b' + str(l+1)] - learning_rate * grads['db' + str(l+1)]
    return parameters
And now we update the parameters with the function update_parameters, which takes as input the parameters, the gradients (∂W and ∂b) and the learning_rate, and returns the updated parameters:
W := W - α ∂W ; b := b - α ∂b
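As a small illustration with hypothetical values for a single layer:

parameters = {'W1': np.array([[0.5, -0.3]]), 'b1': np.array([[0.1]])}
grads = {'dW1': np.array([[0.2, 0.4]]), 'db1': np.array([[0.05]])}
parameters = update_parameters(parameters, grads, learning_rate=0.1)
print(parameters['W1'])   # [[ 0.48 -0.34]]
print(parameters['b1'])   # [[0.095]]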
def two_layer_model(X, Y, layers_dims, lr=0.001, num_iterations=1500):
    grads = {}
    m = X.shape[1]
    (n_x, n_h, n_y) = layers_dims
    parameters = init_parameters(n_x, n_h, n_y)
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
    for i in range(0, num_iterations):
        # forward propagation: relu hidden layer, sigmoid output layer
        A1, cache1 = linear_activation_forward(X, W1, b1, activation='relu')
        A2, cache2 = linear_activation_forward(A1, W2, b2, activation='sigmoid')
        cost = compute_cost(A2, Y)
        # backward propagation, starting from dJ/dA2
        dA2 = -(np.divide(Y, A2) - np.divide(1 - Y, 1 - A2))
        dA1, dW2, db2 = linear_activation_backward(dA2, cache2, activation='sigmoid')
        dA0, dW1, db1 = linear_activation_backward(dA1, cache1, activation='relu')
        grads['dW1'] = dW1
        grads['db1'] = db1
        grads['dW2'] = dW2
        grads['db2'] = db2
        # gradient descent step, then refresh the local copies of the parameters
        parameters = update_parameters(parameters, grads, lr)
        W1 = parameters['W1']
        b1 = parameters['b1']
        W2 = parameters['W2']
        b2 = parameters['b2']
    return parameters
We do a forward propagation, compute the cost, do a backward propagation and update the parameters inside the loop for i in range(0, num_iterations).
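As a hypothetical end-to-end example, we could train the model on toy random data and measure the training accuracy with one forward pass and a 0.5 threshold (the data, layer sizes and hyperparameters below are made up for illustration):

np.random.seed(1)
X = np.random.randn(2, 100)                              # 2 features, 100 examples
Y = (X[0, :] + X[1, :] > 0).astype(int).reshape(1, 100)  # toy labels

parameters = two_layer_model(X, Y, layers_dims=(2, 4, 1), lr=0.01, num_iterations=2500)

# predictions: one forward pass, then threshold the output probabilities at 0.5
A1, _ = linear_activation_forward(X, parameters['W1'], parameters['b1'], activation='relu')
A2, _ = linear_activation_forward(A1, parameters['W2'], parameters['b2'], activation='sigmoid')
predictions = (A2 > 0.5).astype(int)
print('train accuracy:', np.mean(predictions == Y))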