How Neural Networks Actually Work — Python Implementation Part 2 (Simplified)

In this article, we continue to debunk the idea that a Neural Network is a black box whose inner workings we cannot understand. We aim to implement Neural Nets in an easily understandable way. If some concepts are not immediately clear, please check the previous articles in the reference section.

At the end of the article, you should be able to implement a single forward pass of data through a network given input data with many features and training examples.

Let’s start with the data and the Neural Network architecture we will use.

Parameter Initialization

In order to perform a forward pass, we need to define the initial values for the parameters (weights and biases). We will randomly generate the required weights from a standard normal distribution and set the biases to zero. The most important concept to grasp when conducting parameter initialization is to ensure that the weights and bias matrices are of the required dimensions. Getting dimensions wrong can get us into problems when performing matrix multiplication.

Recap on dimensions of weights and bias matrices (refer to link)

Here are two rules to always consider when initializing parameters:

Equation 1: Dimensions for matrices of weights and biases (Source: Post).
  • The weights matrix for layer l (wˡ) will be of dimension (nˡ, n^(l-1)); that is, the number of rows equals the number of neurons in the current layer, l, and the number of columns equals the number of neurons in the previous layer, l-1 (citation: previous post).
  • The bias matrix for layer l (bˡ) will have nˡ rows and 1 column. This is just a vector (citation: previous post).
  • For example, in our case, we need a 4 by 3 weights matrix at the hidden layer. Remember that we don’t need weights at the input layer because no computation happens there; we are just passing the input values.

Note 1: The input data must be of dimension (n, m), where n is the number of features and m is the number of training examples. That is to say, each training example will come as a column in the input matrix, x.

Notations: In this article, dim(A)=(r, c) means that the dimension of matrix A is r by c, where r is the number of rows and c is the number of columns. nˡ is the number of neurons in layer l. In the architecture shown above, n⁰=3, n¹=4, and n²=1.
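
The two rules can be checked quickly in code. Below is a minimal sketch, assuming NumPy and the 3-4-1 architecture described above; the helper name init_params and the fixed seed are our own choices for illustration and are not taken from the original listing.

```python
import numpy as np

def init_params(layer_sizes, seed=0):
    """Return {layer index: (w, b)} for layers 1..L, following the two shape rules."""
    rng = np.random.default_rng(seed)  # fixed seed only so the run is repeatable
    params = {}
    for l in range(1, len(layer_sizes)):
        n_curr, n_prev = layer_sizes[l], layer_sizes[l - 1]
        w = rng.standard_normal((n_curr, n_prev))  # rule 1: dim(w^l) = (n^l, n^(l-1))
        b = np.zeros((n_curr, 1))                  # rule 2: dim(b^l) = (n^l, 1)
        params[l] = (w, b)
    return params

# n0 = 3 input features, n1 = 4 hidden neurons, n2 = 1 output neuron
params = init_params([3, 4, 1])
for l, (w, b) in params.items():
    print(f"layer {l}: w shape {w.shape}, b shape {b.shape}")
# layer 1: w shape (4, 3), b shape (4, 1)
# layer 2: w shape (1, 4), b shape (1, 1)
```

Note that the dictionary has no entry for layer 0, which matches the point that the input layer needs no weights.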

After parameter initialization, the next thing to do is to perform actual computations.

Actual Computations

At any given layer l, the following two computations are done:

Equation 2: Computations done at layer l. In Equation 2–1, the weights matrix is multiplied with the input matrix and the bias is added; in Equation 2–2, the activation is applied.

  zˡ = wˡ·f^(l-1) + bˡ        (Equation 2–1)
  fˡ = activation(zˡ)         (Equation 2–2)

Equation 2–1: The weights matrix is multiplied with the input, and the bias is added. Here f^(l-1) is the input to layer l: the actual input values x when l=1 (so f⁰=x), or the output of the previous layer, layer l-1, otherwise. Computing these equations requires us to understand matrix multiplication and matrix addition. Two matrices, A and B, can be multiplied as A·B only if the number of columns in A equals the number of rows in B. Remember that A·B is not the same as B·A (matrix multiplication is not commutative). In addition, two matrices being added must be of the same dimension.

Equation 2–2: The activation function is applied to the result of Equation 2–1, and the result becomes the output of layer l. Note that applying the activation to the resulting matrix/vector does not affect its dimension.
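
To make the two computations concrete, here is a minimal sketch of one layer, assuming NumPy and the sigmoid activation that this article applies later. The function name layer_forward and the toy values are hypothetical and chosen only to make the shapes easy to follow.

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid; the output has the same shape as z."""
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(w, b, f_prev, activation=sigmoid):
    """Compute z = w·f_prev + b, then apply the activation to z."""
    if w.shape[1] != f_prev.shape[0]:
        raise ValueError("columns of w must equal rows of the layer input")
    z = w @ f_prev + b   # weights times input, plus bias
    f = activation(z)    # activation does not change the shape
    return z, f

w = np.full((4, 3), 0.1)   # 4 neurons in this layer, 3 in the previous one
b = np.zeros((4, 1))
x = np.ones((3, 5))        # 3 features, 5 training examples

z, f = layer_forward(w, b, x)
print(z.shape, f.shape)    # (4, 5) (4, 5)

# The order of multiplication matters: x·w is not even defined here.
try:
    x @ w                  # (3, 5) times (4, 3): 5 columns against 4 rows
except ValueError as err:
    print("x @ w failed:", type(err).__name__)
```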

Let's Put All That into Python Code Now

For each layer, we will initialize the parameters and then perform the required computation. There is no computation at the input layer, layer 0, so we go straight to the hidden layer, layer 1.

Computations at Layer 1

(Refer to the rules in Equation 1.) For the hidden layer, layer 1, we need a matrix of weights with the dimension (n¹, n⁰), that is, the number of neurons in the current layer by the number of neurons in the previous layer. This is 4 by 3. We need (n¹, 1) for the bias, which is 4 by 1. To make it easy to print out data, we are not using the data defined at the beginning of the article at this point (we will do that later in this article). We will instead use the following subset.

Figure 2: The input data converted into an input matrix. Each row of the input matrix contains the data for one feature, and each column is one training example. If instead each row contained a training example, we would need to transpose x before multiplying it with the weights matrix (Source: Author).


X:  [[ 9 14 10 11  8]
 [ 9 16  8 12  9]
 [ 9 16  7 10  9]]
X shape: (3, 5)
w1: [[-0.14441 -0.05045  0.016  ]
 [ 0.08762  0.03156 -0.20222]
 [-0.03062  0.0828   0.02301]
 [ 0.0762  -0.02223 -0.02008]]
w1 shape: (4, 3)
b1: [[0.]
 [0.]
 [0.]
 [0.]]

Line 1–5: The required package is imported, and input data is defined. The input data, X, is of dimension (3, 5), 3 features and 5 training examples.

Recall (The number of features influences the number of neurons in the input layer): The number of neurons in the input layer is always equal to the number of features, and therefore, we have 3 neurons in the input layer.

Line 7–12: Initializing parameters based on the two rules in Equation 1. dim(w¹) = (n¹, n⁰) = (4,3).
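
The initialization step for layer 1 can be sketched as follows, assuming NumPy. The input subset X is copied from the printed output above; the seed and the 0.1 scaling factor are our assumptions (the printed weights are small, which suggests some scaling, but the exact factor used by the original code is not known), so the random values will differ from those shown.

```python
import numpy as np

# Input subset: 3 features (rows) by 5 training examples (columns)
X = np.array([[9, 14, 10, 11, 8],
              [9, 16, 8, 12, 9],
              [9, 16, 7, 10, 9]])
print("X shape:", X.shape)            # (3, 5)

rng = np.random.default_rng(42)       # assumed seed, only for repeatability

n0 = X.shape[0]                       # neurons in the input layer = number of features
n1 = 4                                # neurons in the hidden layer

# dim(w1) = (n1, n0); the 0.1 scale is an assumption, not taken from the source
w1 = rng.standard_normal((n1, n0)) * 0.1
# dim(b1) = (n1, 1), biases start at zero
b1 = np.zeros((n1, 1))

print("w1 shape:", w1.shape)          # (4, 3)
print("b1 shape:", b1.shape)          # (4, 1)
```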


f11 shape:  (4, 5)
w2: [[0.01866 0.04101 0.01983 0.0119 ]]
w2 weights: [[0.01866 0.04101 0.01983 0.0119 ]]
b2: [[0.]]
y0: [[0.03629 0.03134 0.04005 0.03704 0.03582]]
y0 shape: (1, 5)
y_hat: [[0.50907 0.50783 0.51001 0.50926 0.50895]]
y_hat shape: (1, 5)

Line 1–2: w1 is multiplied by the input matrix, x. dim(w1)=(4, 3) and dim(x)=(3, 5) and therefore dim(w1·x)=(4,5) based on the rules of matrix multiplication.

Note 2: Earlier, we said (and it is a rule of matrix addition) that two matrices, A and B, can only be added if they are of the same dimension. However, dim(w¹·x)=(4, 5) while dim(b¹)=(n¹, 1)=(4, 1), and yet we are adding them in line 1. Why does that work? The NumPy package comes to our rescue here.

“The term broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.” — Source: Numpy documentation.

After array broadcasting, dim(b¹) becomes (4,5), and therefore, z11=w¹·x+b¹ becomes a 4 by 5 matrix as well.
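
Here is a small sketch of what broadcasting does in this case, assuming NumPy. The values are made up so that the effect is easy to see; np.tile is used only to spell out the copy that broadcasting performs implicitly.

```python
import numpy as np

wx = np.arange(20, dtype=float).reshape(4, 5)   # stands in for w1·x, shape (4, 5)
b1 = np.array([[1.0], [2.0], [3.0], [4.0]])     # bias, shape (4, 1)

z = wx + b1                                      # b1 is broadcast across the 5 columns
explicit = wx + np.tile(b1, (1, 5))              # same addition with b1 copied by hand

print(z.shape)                                   # (4, 5)
print(np.array_equal(z, explicit))               # True

# The (n, 1) shape of the bias matters: a flat vector of shape (4,)
# cannot be broadcast against a (4, 5) matrix.
try:
    wx + b1.ravel()
except ValueError as err:
    print("shape (4,) failed:", type(err).__name__)
```

This is one reason the second rule in Equation 1 asks for a bias of dimension (nˡ, 1) rather than a flat vector.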

Line 4 through 10: Sigmoid activation is applied to get the output of layer 1, f11. That is the output for all the neurons in the layer.

Line 12–24: Shows the computations at the output layer. As expected dim(w²)=(n², n¹)=(1, 4) and dim(b²)=(n², 1)=(1, 1). We also need to apply the concept of array broadcasting here.

Note 3: The output is a 1 by 5 matrix. Did you expect a single number? Remember that this output is a vector of predictions for all 5 training examples in our data. In other words, using NumPy allowed us to pass all the training examples through the network at once, which is very efficient. Don’t worry about the values in the output vector, because this is just a single pass of the data through the network based on randomly initialized weights.
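
The whole forward pass on the subset can be sketched in a few lines, assuming NumPy. Instead of drawing new random weights, this sketch plugs in the rounded values of w1 and w2 from the printed output, so y_hat should agree with the printed predictions to about four decimal places. The variable names follow the printed output, but the code itself is our reconstruction, not the original listing.

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[9, 14, 10, 11, 8],
              [9, 16, 8, 12, 9],
              [9, 16, 7, 10, 9]])

# Rounded weights copied from the printed output; biases are zero
w1 = np.array([[-0.14441, -0.05045, 0.016],
               [0.08762, 0.03156, -0.20222],
               [-0.03062, 0.0828, 0.02301],
               [0.0762, -0.02223, -0.02008]])
b1 = np.zeros((4, 1))
w2 = np.array([[0.01866, 0.04101, 0.01983, 0.0119]])
b2 = np.zeros((1, 1))

# Layer 1 (hidden layer)
z11 = w1 @ X + b1          # (4, 3)·(3, 5) + (4, 1) -> (4, 5)
f11 = sigmoid(z11)         # (4, 5)

# Layer 2 (output layer)
y0 = w2 @ f11 + b2         # (1, 4)·(4, 5) + (1, 1) -> (1, 5)
y_hat = sigmoid(y0)        # (1, 5): one prediction per training example

print("f11 shape:", f11.shape)
print("y_hat:", np.round(y_hat, 5))
print("y_hat shape:", y_hat.shape)
```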

We need to optimize the parameter values during training for the model to learn. Training the network entails a forward pass/propagation (as discussed in this article) and backpropagation. The latter allows us to evaluate the model using some loss function, compute the partial derivatives of that function with respect to the parameters, and adjust the parameter values accordingly. This process is iterative: passing the entire dataset through the network (forward pass) and back (backpropagation) makes up one complete iteration, which is called an epoch. We will discuss all of this in the next articles.

Before we end, let’s write some code on a complete forward pass of the dataset mentioned at the start (we will load the data straight from


w1 shape:  (4, 3)
w2 shape: (1, 4)
X shape: (3, 395)
w1 shape: (4, 3)
b1 shape (4, 1)
w2 shape: (1, 4)
b2 shape (1, 1)
f1 shape (4, 395)
z2.shape (1, 395)
yhat shape (1, 395)

A few points to note about the above code:

  • We need to pass input features data with the dimension (# features, # training examples). Typically, and as in the pandas dataframe (df), we have training examples in rows and features in columns, but the input matrix needs the opposite layout, and therefore we transpose X in line 107.
  • The output of the forward pass, yhat, is a vector of predictions for all the training examples. For that reason, we have dim(yhat)=(1, 395) for the 395 training examples in the data.
  • At this point, we are not using the target variable (y). We will use it during backpropagation and model evaluation.
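
The points above can be sketched as follows, assuming NumPy and pandas. The dataframe here is synthetic stand-in data with 395 rows; the column names feature_1, feature_2, feature_3 and target, the seeds, and the 0.1 weight scale are all hypothetical and are not taken from the original dataset or code.

```python
import numpy as np
import pandas as pd

def sigmoid(z):
    """Element-wise sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic stand-in for the dataset: 395 training examples in rows
rng = np.random.default_rng(0)
m = 395
df = pd.DataFrame({
    "feature_1": rng.integers(0, 20, size=m),
    "feature_2": rng.integers(0, 20, size=m),
    "feature_3": rng.integers(0, 20, size=m),
    "target": rng.integers(0, 2, size=m),   # not used in the forward pass
})

# The dataframe holds examples in rows, so transpose to (# features, # examples)
X = df[["feature_1", "feature_2", "feature_3"]].to_numpy().T
print("X shape:", X.shape)                  # (3, 395)

n0, n1, n2 = X.shape[0], 4, 1
w1 = rng.standard_normal((n1, n0)) * 0.1    # assumed 0.1 scale
b1 = np.zeros((n1, 1))
w2 = rng.standard_normal((n2, n1)) * 0.1
b2 = np.zeros((n2, 1))

f1 = sigmoid(w1 @ X + b1)                   # hidden layer output, (4, 395)
z2 = w2 @ f1 + b2                           # (1, 395)
yhat = sigmoid(z2)                          # one prediction per training example

print("f1 shape", f1.shape)
print("z2.shape", z2.shape)
print("yhat shape", yhat.shape)
```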


