Neural network introduction | Part 1

🧠 Building the model

Pierre Portal
7 min readNov 4, 2018

Forward propagation and cost function 👉 Neural Network Introduction | part 2

Backward propagation and learning 👉 Neural Network Introduction | part 3

Here is a summary of my notes from the first module of the Deep Learning Specialization by Andrew Ng on Coursera, which I recommend to everyone interested in the subject. This first part only covers the structure of a neural network; in future notes I will talk about forward propagation and the cost function, then the gradient descent algorithm and backward propagation. Sharing my notes is the best way for me to self-study and to channel the excitement of learning something as fascinating as AI and machine learning. It’s also a good way for me to reorganise my thoughts. I’m really open to discussing my notes, any errors I may have made, or whether they could help anyone who’s learning. I complete my courses with the help of other sources because I like having a solid understanding of each element. This is why the notation may change a little bit sometimes, and I will show a lot of equations and mathematical notation because, even without a mathematical background, a little bit of vocabulary makes it easy to manipulate these objects and concepts and understand how things work without having to compute everything by hand.

There are different ways to build an artificial neural network, but they all seem to be designed with the same basic elements, which we can divide into two groups: the parameters and the hyper-parameters. The hyper-parameters include the number of layers in the network, the number of neurons per layer, the activation functions and the learning rate (which determines the size of the steps gradient descent takes to minimise the cost function during training). The parameters include the weights and the biases. Their values will determine the quality of our model. If we take for example the linear regression equation y = mx + b, we can see m as a weight and b as a bias. This equation can also be written as h𝜃(x) = 𝜃0 + 𝜃1x.
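
A tiny sketch of this hypothesis in Python (the function name h and the example values are made up, just for illustration):

import numpy as np

def h(x, theta0, theta1):
    # theta1 plays the role of the weight m, theta0 the role of the bias b
    return theta0 + theta1 * x

print(h(2.0, theta0=1.0, theta1=3.0))   # -> 7.0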

In a situation where we have multiple variables x(n), each variable is multiplied by a weight.
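
With n input variables, the hypothesis becomes h𝜃(x) = 𝜃0 + 𝜃1x1 + 𝜃2x2 + … + 𝜃nxn : one weight 𝜃i per variable xi.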

The weights are variables by which the outputs of the nodes (neurons) of a layer l are multiplied before being sent to the nodes of the next layer l+1 (see the structure of a neural network below). Without them, it would be impossible for our model to fit the training data, just as it would be impossible to fit any data with a simple linear regression model without the variable m. When we build the model, we initialise the weight matrices with random values.

For a better understanding, we can see our linear regression equation where the variable m is replaced by the variable weight :
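
y = weight · x + b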

The biases are also variables, but they don’t connect layers. They only affect the nodes of layer l by adding a value to them, which can have a big influence on the activation of the neurons. When we build the model, our biases will be initialised with a value of 0.

Now we can see our linear regression equation where the variable b is now replaced by the variable bias :
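
y = weight · x + bias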

An easy way to understand how a neural network actually works is to visualise how the perceptron works. The process is the following: calculate the sum of every variable x(i) multiplied by its weight 𝜃(i), then apply an activation function. (More about activation functions later.)
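
A minimal sketch of a perceptron in Python (using a simple step function as the activation and made-up numbers, just for illustration):

import numpy as np

def step(z):
    # a very simple activation: fires (1) if the weighted sum is positive
    return 1.0 if z > 0 else 0.0

def perceptron(x, theta, b):
    # sum of every x(i) multiplied by its weight theta(i), plus the bias
    z = np.sum(x * theta) + b
    return step(z)

x = np.array([0.5, -1.0])       # two input variables
theta = np.array([0.8, 0.3])    # one weight per input
print(perceptron(x, theta, b=0.1))   # -> 1.0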

The structure of a simple L=2 neural network (where L is the number of layers l in the network) is the following: 1 input layer, 1 hidden layer, 1 output layer. (L=2 because we don’t count the input layer).

Our input layer will have as many neurons as there are independent variables x. In our case, we have 2 variables, x1 and x2, so our input layer will be made of 2 nodes.

We are building a model for binary classification so our output layer will have only 1 node, our prediction ŷ will be a value between 0 and 1.

The number of nodes in the hidden layer is more difficult to determine. Let’s take 3 for our example.

In a neural network, the weights between layers are matrices with dimensions W(l) = (number-of-nodes-in-layer-(l) × number-of-nodes-in-layer-(l-1)), and the biases are (number-of-nodes-in-layer-(l) × 1) vectors.

A vectorised version of our equation :
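
z = W · x + b (one weighted sum per node of the layer, all computed at once as a matrix product)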

If our input layer l0 (or x) has two nodes corresponding to the variables x1 and x2, and if our hidden layer has three nodes, our weight matrix W1 will be a 3 × 2 matrix and our bias b1 will be a 3 × 1 vector. The operation between our input x and our weight matrix W1 is a matrix product. Here is the definition of the matrix product:

If A is an n × m matrix and B is an m × p matrix, the matrix product C = AB is defined to be the n × p matrix such that :

c(i,j) = a(i,1)·b(1,j) + a(i,2)·b(2,j) + … + a(i,m)·b(m,j)

for i = 1, …, n and j = 1, …, p.

So :
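
W1 · x = (w11·x1 + w12·x2, w21·x1 + w22·x2, w31·x1 + w32·x2)ᵀ, writing wij for the entry of W1 in row i and column j.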

The matrix product between W1 and x results in a 3 × 1 vector.
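
We can check this quickly with NumPy (a small sketch with random values, using the 2-3-1 architecture described above):

import numpy as np

W1 = np.random.randn(3, 2)    # 3 hidden nodes × 2 input nodes
x = np.random.randn(2, 1)     # one training example with 2 features
print(np.dot(W1, x).shape)    # -> (3, 1)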

We then add the bias b1 to the product of W1 and x. Their sum gives a new vector z1 :
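
z1 = W1 · x + b1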

We have now built (almost) half of our neural network, from the input layer x to the hidden layer l1. We just have to reproduce the same process from the hidden layer l1 to the output layer l2 (or ŷ), by taking the matrix product of W2 with the output of the hidden layer and then adding b2. We can visualise the whole process this way :

But before going further, we need an activation function g for each node (i) of the hidden layer l1, which takes z1 as input. We will choose the ReLU function (Rectified Linear Unit), which returns 0 if z1(i) < 0 and z1(i) otherwise, i.e. g(z) = max(0, z).

Once the activation function is applied to the nodes, z1 becomes g(z1).

Together, these activations form the a1 vector.
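
For example, if z1 = (-1.2, 0.4, 2.0)ᵀ, then a1 = g(z1) = (0, 0.4, 2.0)ᵀ.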

Now we can multiply a1 by W2 and add b2 to form the z2 vector.

The activation function that we will now use to compute our output a2 (which will be our prediction ŷ) is the sigmoid function (𝜎). It returns a value between 0 and 1, which we can read as the probability that the example belongs to class 1 of our binary classified data. (y ∈ {0,1})
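
Putting the whole forward pass together: ŷ = a2 = 𝜎(z2) = 𝜎(W2 · a1 + b2), where 𝜎(z) = 1 / (1 + e⁻ᶻ).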

We now have all the necessary elements to build a two-layer neural network, initialise the parameters and perform forward propagation. In a later note about backward propagation, we’ll see why we need to initialise our weight matrices with random values.

With Python and Numpy :

import numpy as np

def init_L2_parameters(X, Y, h_layer_size=3):
    # Layer sizes: number of input features, hidden nodes and outputs
    x_layer_size = X.shape[0]
    y_layer_size = Y.shape[0]

    # Weights are initialised with random values, biases with zeros
    W1 = np.random.randn(h_layer_size, x_layer_size)
    b1 = np.zeros((h_layer_size, 1))
    W2 = np.random.randn(y_layer_size, h_layer_size)
    b2 = np.zeros((y_layer_size, 1))

    parameters = {'W1': W1,
                  'b1': b1,
                  'W2': W2,
                  'b2': b2}
    return parameters

def forward_propagation(X, parameters):
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']

    # Input layer -> hidden layer
    Z1 = np.dot(W1, X) + b1
    A1 = g(Z1)
    # Hidden layer -> output layer
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)
    return A2

def g(z):
    # ReLU: returns 0 for negative values, z otherwise
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))
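
A quick usage sketch (assuming X has shape (2, m) and Y has shape (1, m) for m training examples; the values here are random and only for illustration):

X = np.random.randn(2, 5)                # 5 examples with 2 features each
Y = np.random.randint(0, 2, (1, 5))      # 5 binary labels
parameters = init_L2_parameters(X, Y)    # random weights, zero biases
A2 = forward_propagation(X, parameters)
print(A2.shape)                          # -> (1, 5): one prediction between 0 and 1 per example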

