Neural network introduction | Part 3
Backward propagation
Once we have a model ready to perform forward propagation and a function that returns a cost J, we can do backward propagation with gradient descent to update our parameters W and b and minimise our cost function.
To perform back propagation we need to compute the partial derivatives of our cost function J with respect to our parameters because we need to understand how a change in their values will affect our cost. To do so, we can first compute the error for each layer :
With the subscript j representing the j-th neuron in the l-th layer.
To compute the error of the output layer L, we multiply grad J (a vector whose components are the partial derivatives ∂J/∂a[L](j)) by the derivative of the activation function (here sigmoid) of z[L].
The operation ⊙ between those two components is a Hadamard product.
The Hadamard product is a binary operation that takes two matrices of the same dimensions, and produces another matrix where each element i,j is the product of elements i,j of the original two matrices. (Wikipedia)
σ′(z) = σ(z)(1 − σ(z)).
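As a quick sanity check, here is a minimal sketch (the sample points are arbitrary) comparing this closed-form derivative of the sigmoid against a numerical finite-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Analytical derivative: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# Central finite-difference check at a few arbitrary points
z = np.array([-2.0, 0.0, 3.0])
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.allclose(numeric, sigmoid_prime(z)))  # True
```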
To compute the error of other layers, we will “move the error” backward through the neural network by multiplying the dot product of the transposed weight matrix W[l+1] and the error of layer l+1 by the derivative of sigmoid(z[l]).
We can see the pattern of moving the error backward through a deep network :
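This backward recursion can be sketched in a few lines of NumPy; the layer sizes below (4 → 3 → 2 neurons, one example) and the gradient vector are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
# Hypothetical pre-activations and weights for a 4 -> 3 -> 2 network
z = {1: rng.standard_normal((3, 1)), 2: rng.standard_normal((2, 1))}
W = {2: rng.standard_normal((2, 3))}
grad_J = rng.standard_normal((2, 1))   # stands in for the dJ/da vector at the output

# Output-layer error: delta[L] = grad_J (Hadamard) sigma'(z[L])
delta = {2: grad_J * sigmoid_prime(z[2])}
# Move the error backward: delta[l] = (W[l+1].T @ delta[l+1]) (Hadamard) sigma'(z[l])
delta[1] = np.dot(W[2].T, delta[2]) * sigmoid_prime(z[1])
print(delta[1].shape)  # (3, 1): one error component per neuron in layer 1
```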
Now, we can compute the partial derivatives of J with respect to W and b :
And we can now update the parameters. We need a new hyperparameter here : the learning rate, noted α, which is the size of the steps gradient descent takes to reduce J.
In the Neural Networks and Deep Learning course on Coursera we can see the following set of equations to compute ∂J/∂W and ∂J/∂b :
With its vectorised version, where m is the number of examples (x, y):
We also have this Python code to compute ∂A[L], which is the derivative of J with respect to A[L] (or ŷ) :
-(np.divide(Y,AL)-np.divide(1-Y, 1-AL))
It’s easier to understand this way, but we can see we are computing the exact same thing: dz is our layer’s error. Even if this last example is easier to understand, I think it’s important to be able to read the more formal notation used in more advanced courses or documentation. It is also easier to manipulate, and it becomes easier to use these concepts in other fields or for other purposes than building a “simple” neural network.
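As a sanity check, a minimal sketch (the AL and Y values are made up) verifying that this expression really is the derivative of the cross-entropy cost with respect to AL, by comparing it against a finite-difference estimate:

```python
import numpy as np

def cross_entropy(AL, Y):
    # Un-averaged binary cross-entropy, summed over examples
    return -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))

AL = np.array([[0.8, 0.3]])   # made-up predictions
Y = np.array([[1.0, 0.0]])    # made-up labels
eps = 1e-6

# The formula from the course code
analytic = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

# Central finite differences, one component at a time
numeric = np.zeros_like(AL)
for i in range(AL.shape[1]):
    plus, minus = AL.copy(), AL.copy()
    plus[0, i] += eps
    minus[0, i] -= eps
    numeric[0, i] = (cross_entropy(plus, Y) - cross_entropy(minus, Y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))  # True
```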
We can now build a functioning L=2 neural network by assembling the blocks : parameter initialisation, then a loop of forward propagation, computation of the cost function, backward propagation and parameter update.
In the Coursera deep learning course, we implement the following code :
def init_parameters(n_x, n_h, n_y):
    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
    param = {"W1": W1,
             "b1": b1,
             "W2": W2,
             "b2": b2}
    return param
With init_parameters, by giving as input the number of neurons in each layer, we create the parameters W1, b1, W2 and b2.
def linear_activation_forward(A_prev, W, b, activation):
    if activation == 'sigmoid':
        Z, linear_cache = np.dot(W, A_prev) + b, (A_prev, W, b)
        A, activation_cache = sigmoid(Z)
    if activation == 'relu':
        Z, linear_cache = np.dot(W, A_prev) + b, (A_prev, W, b)
        A, activation_cache = relu(Z)
    cache = (linear_cache, activation_cache)
    return A, cache
linear_activation_forward computes Z = WA + b and A = g(Z), and stores in cache the intermediate values used to compute Z.
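The forward pass relies on sigmoid and relu helpers that return both the activation and a cache; the course assignment provides these, but a minimal sketch of plausible implementations (my own, not the assignment’s exact code) could look like:

```python
import numpy as np

def sigmoid(Z):
    # Returns the activation and caches Z for the backward pass
    A = 1.0 / (1.0 + np.exp(-Z))
    return A, Z

def relu(Z):
    A = np.maximum(0, Z)
    return A, Z

def sigmoid_backward(dA, cache):
    # dZ = dA * sigma'(Z)
    Z = cache
    s = 1.0 / (1.0 + np.exp(-Z))
    return dA * s * (1 - s)

def relu_backward(dA, cache):
    # ReLU passes the gradient through only where Z > 0
    Z = cache
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0
    return dZ
```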
def compute_cost(AL, Y):
    m = Y.shape[1]
    cost = -np.sum(np.multiply(np.log(AL), Y) +
                   np.multiply(np.log(1 - AL), 1 - Y)) / m
    cost = np.squeeze(cost)
    return cost
This returns the cross-entropy cost, averaged over the m examples.
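For instance, with made-up predictions and labels for three examples, the same formula gives:

```python
import numpy as np

# Made-up predictions and labels for three examples
AL = np.array([[0.9, 0.2, 0.7]])
Y = np.array([[1, 0, 1]])
m = Y.shape[1]

# Cross-entropy cost averaged over the m examples
cost = -np.sum(np.multiply(np.log(AL), Y) +
               np.multiply(np.log(1 - AL), 1 - Y)) / m
print(float(cost))  # ~0.228
```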
def linear_activation_backward(dA, cache, activation):
    linear_cache, activation_cache = cache
    if activation == 'sigmoid':
        dZ = sigmoid_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
    if activation == 'relu':
        dZ = relu_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
    return dA_prev, dW, db
linear_activation_backward takes as input ∂A, the cache values from linear_activation_forward and a type of activation function (“sigmoid” or “relu”), computes ∂Z, then passes ∂Z and the values of A_prev, W, b from cache through the linear_backward function, which computes ∂A_prev, ∂W and ∂b.
def linear_backward(dZ, cache):
    A_prev, W, b = cache
    m = A_prev.shape[1]
    dW = np.dot(dZ, A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)
    return dA_prev, dW, db
Now, linear_backward takes as input ∂Z and cache to compute ∂A_prev, ∂W and ∂b.
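A useful habit is checking that each gradient has the same shape as the value it differentiates. A minimal sketch with hypothetical shapes (a layer of 3 units fed by 4 units, over m = 5 examples):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical shapes: layer of 3 units fed by 4 units, m = 5 examples
dZ = rng.standard_normal((3, 5))
A_prev = rng.standard_normal((4, 5))
W = rng.standard_normal((3, 4))
m = A_prev.shape[1]

dW = np.dot(dZ, A_prev.T) / m                 # same shape as W
db = np.sum(dZ, axis=1, keepdims=True) / m    # same shape as b
dA_prev = np.dot(W.T, dZ)                     # same shape as A_prev
print(dW.shape, db.shape, dA_prev.shape)  # (3, 4) (3, 1) (4, 5)
```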
def update_parameters(parameters, grads, learning_rate):
    L = len(parameters) // 2
    for l in range(L):
        parameters['W' + str(l+1)] = parameters['W' + str(l+1)] - learning_rate * grads['dW' + str(l+1)]
        parameters['b' + str(l+1)] = parameters['b' + str(l+1)] - learning_rate * grads['db' + str(l+1)]
    return parameters
And now we update the parameters with the function update_parameters, which takes as input the parameters, the gradients (∂W and ∂b) and the learning_rate, and returns the updated parameters :
W := W − α ∂W ; b := b − α ∂b
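Concretely, with α = 0.1 and made-up values for W and ∂W, one update step works out as:

```python
import numpy as np

alpha = 0.1                       # learning rate
W = np.array([[0.5, -0.2]])       # made-up parameter values
dW = np.array([[0.3, 0.1]])       # made-up gradient

W = W - alpha * dW                # one gradient descent step
print(W)  # [[ 0.47 -0.21]]
```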
def two_layer_model(X, Y, layers_dims, lr=0.001, num_iterations=1500):
    grads = {}
    m = X.shape[1]
    (n_x, n_h, n_y) = layers_dims
    parameters = init_parameters(n_x, n_h, n_y)
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
    for i in range(0, num_iterations):
        A1, cache1 = linear_activation_forward(X, W1, b1, activation='relu')
        A2, cache2 = linear_activation_forward(A1, W2, b2, activation='sigmoid')
        cost = compute_cost(A2, Y)
        dA2 = -(np.divide(Y, A2) - np.divide(1 - Y, 1 - A2))
        dA1, dW2, db2 = linear_activation_backward(dA2, cache2, activation='sigmoid')
        dA0, dW1, db1 = linear_activation_backward(dA1, cache1, activation='relu')
        grads['dW1'] = dW1
        grads['db1'] = db1
        grads['dW2'] = dW2
        grads['db2'] = db2
        parameters = update_parameters(parameters, grads, lr)
        # Refresh the local references so the next iteration uses the updated weights
        W1 = parameters['W1']
        b1 = parameters['b1']
        W2 = parameters['W2']
        b2 = parameters['b2']
    return parameters
We do a forward propagation, compute the cost, do a backward propagation and update the parameters inside the loop for i in range(0, num_iterations).
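Putting all the blocks together, here is a condensed, self-contained sketch of the same two-layer training loop on made-up data. It inlines the helpers and uses the well-known shortcut dZ2 = A2 − Y for sigmoid combined with cross-entropy; the data, layer sizes and learning rate are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
# Tiny synthetic problem: 2 features, 20 examples, label = 1 when x1 + x2 > 0
X = rng.standard_normal((2, 20))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)

n_x, n_h, n_y, lr = 2, 5, 1, 0.5
W1 = rng.standard_normal((n_h, n_x)) * 0.01; b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((n_y, n_h)) * 0.01; b2 = np.zeros((n_y, 1))

sig = lambda z: 1.0 / (1.0 + np.exp(-z))
m = X.shape[1]
costs = []
for i in range(500):
    # Forward propagation: relu hidden layer, sigmoid output
    Z1 = W1 @ X + b1; A1 = np.maximum(0, Z1)
    Z2 = W2 @ A1 + b2; A2 = sig(Z2)
    costs.append(-np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m)
    # Backward propagation (dZ2 = A2 - Y is the sigmoid + cross-entropy shortcut)
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m; db2 = dZ2.sum(axis=1, keepdims=True) / m
    dA1 = W2.T @ dZ2
    dZ1 = dA1 * (Z1 > 0)
    dW1 = dZ1 @ X.T / m; db1 = dZ1.sum(axis=1, keepdims=True) / m
    # Parameter update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(costs[0] > costs[-1])  # the cost decreases over training
```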