Neural network introduction | Part 3
Backward propagation
Once we have a model ready to perform forward propagation and a function that returns a cost J, we can perform backward propagation and use gradient descent to update our parameters W and b and minimise our cost function.
To perform backward propagation we need to compute the partial derivatives of our cost function J with respect to our parameters, because we need to understand how a change in their values will affect our cost. To do so, we can first compute the error for each layer:
With the subscript j representing the j-th neuron in the l-th layer.
To compute the error of the output layer L, we multiply grad J (a vector whose components are the partial derivatives ∂J/∂a_j[L]) by the derivative of the activation function (here sigmoid) of z[L].
The operation ⊙ between those two components is a Hadamard product.
The Hadamard product is a binary operation that takes two matrices of the same dimensions, and produces another matrix where each element i,j is the product of elements i,j of the original two matrices. (Wikipedia)
σ'(z) = σ(z)(1-σ(z)).
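As a minimal numpy sketch of this formula (the arrays dAL and ZL below are hypothetical values of ∂J/∂a[L] and z[L] for an output layer with one neuron and three examples):

import numpy as np

dAL = np.array([[0.5, -1.2, 0.8]])   # components ∂J/∂a[L] of grad J
ZL = np.array([[0.1, 0.7, -0.3]])    # pre-activations z[L]

# δ[L] = grad J ⊙ σ'(z[L]); in numpy, * is the element-wise (Hadamard) product
sigma = 1 / (1 + np.exp(-ZL))
delta_L = dAL * sigma * (1 - sigma)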
To compute the error of the other layers, we "move the error" backward through the neural network: we take the dot product of the transposed weight matrix W[l+1] with the error of layer l+1, then multiply it element-wise by the derivative of sigmoid(z[l]).
We can see the pattern of moving the error backward through a deep network:
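A minimal numpy sketch of one step of this pattern, with hypothetical shapes (4 neurons in layer l, 2 neurons in layer l+1, 3 examples) and a sigmoid activation on layer l:

import numpy as np

np.random.seed(0)
W_next = np.random.randn(2, 4)        # W[l+1]
delta_next = np.random.randn(2, 3)    # error δ[l+1] of layer l+1
Zl = np.random.randn(4, 3)            # pre-activations z[l]

# δ[l] = W[l+1]ᵀ · δ[l+1] ⊙ σ'(z[l])
sigma = 1 / (1 + np.exp(-Zl))
delta_l = np.dot(W_next.T, delta_next) * sigma * (1 - sigma)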
Now, we can compute the partial derivatives of J with respect to W and b:
And we can now update the parameters. We need a new hyperparameter here: the learning rate, noted α, which is the size of the steps gradient descent takes to reduce J.
In the Neural Networks and Deep Learning course on Coursera we can see the following set of equations to compute ∂J/∂W and ∂J/∂b:
With its vectorised version, where m is the number of examples (x, y):
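From memory of the course, those vectorised equations take roughly the following form for a layer l with activation function g:

∂Z[l] = ∂A[l] ⊙ g'(Z[l])
∂W[l] = (1/m) · ∂Z[l] · A[l-1]ᵀ
∂b[l] = (1/m) · Σ (over the m examples) of ∂Z[l]
∂A[l-1] = W[l]ᵀ · ∂Z[l]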
We also have this Python code to compute ∂A[L], which is the derivative of J with respect to A[L] (or ŷ):
-(np.divide(Y,AL)-np.divide(1-Y, 1-AL))
It's easier to understand this way, but we can see we are computing exactly the same thing: dZ is our layer's error. Even if this last example is easier to understand, I think it's important to be able to read the more compact notation used in more advanced courses and documentation. It is also easier to manipulate, and it becomes easier to reuse these concepts in other fields or for other purposes than building a "simple" neural network.
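For reference, that expression follows directly from differentiating the cross-entropy cost for a single example:

J = -( y·log(a) + (1-y)·log(1-a) )
∂J/∂a = -( y/a - (1-y)/(1-a) )

which is exactly what the numpy line above computes element-wise over the whole batch.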
We can now build a functioning L=2 neural network by assembling the building blocks: parameter initialisation, then a loop of forward propagation, computation of the cost, backward propagation and parameter updates.
In the Coursera deep learning course, we implement the following code:
import numpy as np

def init_parameters(n_x, n_h, n_y):
    # small random weights and zero biases for a 2-layer network
    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
    param = {"W1": W1,
             "b1": b1,
             "W2": W2,
             "b2": b2}
    return param
With init_parameters, by giving as input the number of neurons in each layer, we create the parameters W1, b1, W2 and b2.
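For example, with a hypothetical network of 2 input features, 4 hidden neurons and 1 output neuron, we can check the shapes of the created parameters:

params = init_parameters(2, 4, 1)
print(params['W1'].shape, params['b1'].shape)   # (4, 2) (4, 1)
print(params['W2'].shape, params['b2'].shape)   # (1, 4) (1, 1)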
def linear_activation_forward(A_prev, W, b, activation):
    # Z = W·A_prev + b, then A = g(Z); the cache keeps the values needed for backprop
    if activation == 'sigmoid':
        Z, linear_cache = np.dot(W, A_prev) + b, (A_prev, W, b)
        A, activation_cache = sigmoid(Z)
    if activation == 'relu':
        Z, linear_cache = np.dot(W, A_prev) + b, (A_prev, W, b)
        A, activation_cache = relu(Z)
    cache = (linear_cache, activation_cache)
    return A, cache
linear_activation_forward computes Z = W·A_prev + b and A = g(Z), then returns the activation A together with a cache holding the intermediate values used to compute Z.
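linear_activation_forward assumes sigmoid and relu helper functions that are not shown in the course extract above. Here is a minimal sketch, assuming the convention that each helper returns the activation and keeps Z as its cache:

def sigmoid(Z):
    # A = 1 / (1 + e^-Z); Z is kept as the activation cache
    A = 1 / (1 + np.exp(-Z))
    return A, Z

def relu(Z):
    # A = max(0, Z); Z is kept as the activation cache
    A = np.maximum(0, Z)
    return A, Z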
def compute_cost(AL, Y):
    m = Y.shape[1]
    # cross-entropy cost, averaged over the m examples
    cost = -np.sum(np.multiply(np.log(AL), Y) + np.multiply(np.log(1 - AL), 1 - Y)) / m
    cost = np.squeeze(cost)
    return cost
This returns the cross-entropy cost.
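As a quick sanity check with hypothetical predictions and labels for three examples:

AL = np.array([[0.8, 0.1, 0.6]])   # predicted probabilities ŷ
Y = np.array([[1, 0, 1]])          # true labels
print(compute_cost(AL, Y))         # ≈ 0.28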
def linear_activation_backward(dA, cache, activation):
    linear_cache, activation_cache = cache
    # dZ = dA ⊙ g'(Z), then the linear step gives dA_prev, dW, db
    if activation == 'sigmoid':
        dZ = sigmoid_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
    if activation == 'relu':
        dZ = relu_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
    return dA_prev, dW, db
linear_activation_backward takes as input ∂A, the cache values from linear_activation_forward, and the type of activation function ('sigmoid' or 'relu'). It computes ∂Z, then passes ∂Z and the values of A_prev, W and b from cache to the linear_backward function, which computes ∂A_prev, ∂W and ∂b.
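Like sigmoid and relu, the sigmoid_backward and relu_backward helpers are assumed to exist (in the course they live in a utility file). A minimal sketch consistent with the caches used here:

def sigmoid_backward(dA, activation_cache):
    # dZ = dA ⊙ σ'(Z), where the cache holds Z
    Z = activation_cache
    s = 1 / (1 + np.exp(-Z))
    return dA * s * (1 - s)

def relu_backward(dA, activation_cache):
    # the derivative of relu is 1 where Z > 0 and 0 elsewhere
    Z = activation_cache
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0
    return dZ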
def linear_backward(dZ, cache):
    A_prev, W, b = cache
    m = A_prev.shape[1]
    # gradients of the linear step, averaged over the m examples
    dW = np.dot(dZ, A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)
    return dA_prev, dW, db
Now, linear_backward takes as input ∂Z and cache to compute ∂A_prev, ∂W and ∂b.
def update_parameters(parameters, grads, learning_rate):
    # one gradient descent step on every W[l] and b[l]
    L = len(parameters) // 2
    for l in range(L):
        parameters['W' + str(l+1)] = parameters['W' + str(l+1)] - learning_rate * grads['dW' + str(l+1)]
        parameters['b' + str(l+1)] = parameters['b' + str(l+1)] - learning_rate * grads['db' + str(l+1)]
    return parameters
And now we update the parameters with the function update_parameters, which takes as input the parameters, the gradients (∂W and ∂b) and the learning_rate, and returns the updated parameters:
W := W - α ∂W ; b := b - α ∂b
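As a small illustration with hypothetical values for a single layer:

parameters = {'W1': np.array([[0.5, -0.3]]), 'b1': np.array([[0.1]])}
grads = {'dW1': np.array([[0.2, 0.4]]), 'db1': np.array([[0.05]])}
parameters = update_parameters(parameters, grads, learning_rate=0.1)
print(parameters['W1'])   # [[ 0.48 -0.34]]
print(parameters['b1'])   # [[0.095]]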
def two_layer_model(X, Y, layers_dims, lr=0.001, num_iterations=1500):
    grads = {}
    m = X.shape[1]
    (n_x, n_h, n_y) = layers_dims
    parameters = init_parameters(n_x, n_h, n_y)
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
    for i in range(0, num_iterations):
        # forward propagation: relu hidden layer, sigmoid output layer
        A1, cache1 = linear_activation_forward(X, W1, b1, activation='relu')
        A2, cache2 = linear_activation_forward(A1, W2, b2, activation='sigmoid')
        cost = compute_cost(A2, Y)
        # backward propagation, starting from dJ/dA2
        dA2 = -(np.divide(Y, A2) - np.divide(1 - Y, 1 - A2))
        dA1, dW2, db2 = linear_activation_backward(dA2, cache2, activation='sigmoid')
        dA0, dW1, db1 = linear_activation_backward(dA1, cache1, activation='relu')
        grads['dW1'] = dW1
        grads['db1'] = db1
        grads['dW2'] = dW2
        grads['db2'] = db2
        # gradient descent step, then refresh the local copies of the parameters
        parameters = update_parameters(parameters, grads, lr)
        W1 = parameters['W1']
        b1 = parameters['b1']
        W2 = parameters['W2']
        b2 = parameters['b2']
    return parameters
We do a forward propagation, compute the cost, do a backward propagation and update the parameters inside the loop for i in range(0, num_iterations).
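As a hypothetical end-to-end example, we could train the model on toy random data and measure the training accuracy with one forward pass and a 0.5 threshold (the data, layer sizes and hyperparameters below are made up for illustration):

np.random.seed(1)
X = np.random.randn(2, 100)                              # 2 features, 100 examples
Y = (X[0, :] + X[1, :] > 0).astype(int).reshape(1, 100)  # toy labels

parameters = two_layer_model(X, Y, layers_dims=(2, 4, 1), lr=0.01, num_iterations=2500)

# predictions: one forward pass, then threshold the output probabilities at 0.5
A1, _ = linear_activation_forward(X, parameters['W1'], parameters['b1'], activation='relu')
A2, _ = linear_activation_forward(A1, parameters['W2'], parameters['b2'], activation='sigmoid')
predictions = (A2 > 0.5).astype(int)
print('train accuracy:', np.mean(predictions == Y))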