Neural Network Design (Recap)
A neural network (NN) is a system made up of many neurons stacked into layers. The first layer, the input layer (we will call it layer 0), performs no computation; it only passes the input values on. For this reason, when counting the number of layers in a NN, we ignore the input layer. The network in Figure 1 below is therefore a 2-layer network.
The output layer computes the final output of the network. The layer(s) between the input and the output layer are called hidden layers. The NN in Figure 1 below is described as a 3–4–1 network: 3 units in the input layer, 4 units in the hidden layer, and a single-valued output.
The number of layers in a NN determines the depth of the network. Based on this, NNs with many hidden layers are termed as Deep Neural Networks (DNNs).
The design of a NN, described by the number of layers and the number of neurons in each layer, is often called the architecture of the NN. We use these terms (NN design and NN architecture) interchangeably going forward in the series.
Feed-Forward Neural Network (FF-NN)
A feed-forward network approximates some function y=f(x|θ) for input values, x, and known output, y. Passing an input through the network to produce an output is called a forward pass. The network learns the parameters in θ that best approximate function f to establish a good mapping ŷ=f(x|θ), where ŷ is the prediction of the model. In a NN, w, b ∈ θ; that is, the parameters we optimize during model training are the weights (w) and biases (b).
Key Feature of Feed-forward NN — Feed-forward network allows information to flow only in one direction (no feedback loop or connection).
Definition: Multi-layer Perceptron (MLP)
An MLP is a special case of a feed-forward NN. In an MLP, all nodes are densely connected; that is, each neuron/node is connected to all nodes in the immediately preceding layer. In fact, the NN in Figure 1 is a Multi-Layer Perceptron.
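As a rough illustration of what dense connectivity implies, the sketch below counts the parameters of the 3–4–1 network in Figure 1. The helper name `count_parameters` is hypothetical, and it assumes one bias per non-input neuron, which matches the notation used later in this article.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a densely connected network.

    layer_sizes lists the number of units per layer, input layer first,
    e.g. [3, 4, 1] for a 3-4-1 network.
    """
    n_weights = 0
    n_biases = 0
    # Each pair of adjacent layers is fully connected.
    for n_prev, n_curr in zip(layer_sizes[:-1], layer_sizes[1:]):
        n_weights += n_prev * n_curr  # one weight per connection
        n_biases += n_curr            # one bias per neuron in the current layer
    return n_weights, n_biases


weights, biases = count_parameters([3, 4, 1])
print(weights, biases)  # 16 5
```

The 16 weights are the 3×4 connections into the hidden layer plus the 4×1 connections into the output neuron.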
Feed-Forward Neural Network (FF-NN) — Example
This section shows how to perform the computation done by an FF-NN. The essential concepts to grasp here are the notations describing the different parameters and variables, and how the actual computation is carried out. We will use the NN architecture shown in Figure 1.
1. The Architecture and the Notations
Let’s redraw Figure 1 to show the essential variables and parameters of the NN. Note that in this figure (Figure 2 below), other connections between nodes are eliminated just to make the plot less cluttered; otherwise, this NN is a Multi-layer Perceptron (all nodes in any two adjacent layers are interconnected).
- x⁰ᵢ — iᵗʰ input value. Value for feature i at the input layer (layer 0),
- wˡⱼᵢ — The weight of the connection from neuron i in layer l-1 to neuron j in the current layer l,
- fˡⱼ — Output of unit j in layer l. This becomes the input to the units in the next layer, layer l+1,
- zˡⱼ — weighted input for jᵗʰ neuron in layer l,
- bˡⱼ — The bias on the jᵗʰ neuron of layer l,
- nˡ — Number of neurons in layer l,
- fˡⱼ = gˡ(zˡⱼ+bˡⱼ) — gˡ is the activation function at layer l. As is common practice, we will apply one activation function to all the neurons in a given layer; therefore, we do not need to specify the activation function for each neuron, hence no subscript for g.
- l — a given layer, for l = 0, 1, …, L. In our case, the neural network has 2 layers, therefore L=2 (remember that we do not count the input layer).
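The notation above can be read as a small computation: for each neuron j in layer l, form the weighted input zˡⱼ by summing wˡⱼᵢ times the output of neuron i in layer l-1, then apply gˡ to zˡⱼ+bˡⱼ. A minimal pure-Python sketch follows; the function name `layer_forward` and the nested-list storage of the weights (row j holds the weights into neuron j) are assumptions made for illustration.

```python
def layer_forward(W, b, f_prev, g):
    """Compute the outputs of one dense layer.

    W[j][i] : weight from neuron i in the previous layer to neuron j here
    b[j]    : bias of neuron j
    f_prev  : outputs of the previous layer (the inputs x for layer 1)
    g       : activation function applied to every neuron in the layer
    """
    outputs = []
    for w_j, b_j in zip(W, b):
        # Weighted input z_j: sum of weight * previous-layer output.
        z_j = sum(w_ji * f_i for w_ji, f_i in zip(w_j, f_prev))
        outputs.append(g(z_j + b_j))
    return outputs


# Toy usage with an identity activation and made-up numbers.
identity = lambda z: z
print(layer_forward([[1.0, 2.0], [0.0, 1.0]], [0.5, 0.0], [3.0, 4.0], identity))
# [11.5, 4.0]
```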
Here are some examples of interpreted parameters and variables:
- x⁰₂ — second input value,
- w¹₄₃ — weight of the connection from neuron 3 in layer 0 into neuron 4 in layer 1,
- z¹₃ — weighted input of unit 3 in layer 1,
- g¹(z¹₃) — applying activation function g¹ on z¹₃.
- Number of units in each layer — the input layer has 3 units (n⁰=3), the hidden layer has 4 units (n¹=4), and the last layer has one neuron (n²=1), and
- b²₁ — bias applied to neuron 1 in layer 2 (the output layer in our case).
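When these parameters are stored in arrays, the 1-based indices of the notation shift by one in a 0-based language such as Python. The sketch below assumes the layer-1 weights are kept in a nested list `W1` with one row per hidden neuron; the values are the initial weights used later in this example, and the variable names are illustrative only.

```python
# Layer-1 weights: row j-1 holds the weights into hidden neuron j,
# column i-1 holds the weight coming from input neuron i.
W1 = [
    [0.3, 0.8, 0.62],   # w¹₁₁, w¹₁₂, w¹₁₃
    [0.9, 0.1, 0.1],    # w¹₂₁, w¹₂₂, w¹₂₃
    [0.7, 0.2, 0.4],    # w¹₃₁, w¹₃₂, w¹₃₃
    [0.01, 0.5, 0.2],   # w¹₄₁, w¹₄₂, w¹₄₃
]

# w¹₄₃ (from input neuron 3 into hidden neuron 4) sits at row 3, column 2.
w1_43 = W1[4 - 1][3 - 1]
print(w1_43)  # 0.2

# n⁰ (inputs) and n¹ (hidden units) can be read off the shape.
n0, n1 = len(W1[0]), len(W1)
print(n0, n1)  # 3 4
```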
2. The Data
The data that we will use in this example contains 3 features, G1, G2, and G3 (this is the reason we chose 3 neurons in the input layer of the architecture, by the way), with a target, pass (we will call it y), which is either 0 or 1 for fail and pass, respectively. This means that we are dealing with a binary classification problem.
The original data contains 30 features and 395 rows (data points), but we will use the 3 features only to show the necessary computations efficiently. We will be dealing with the entire dataset later in the series.
Again, to make the concepts easy to grasp, we will show how to pass a single data point (training example) through the network in a single Forward Pass.
3. Parameter Initialization
We will initialize the weights randomly with values between 0 and 1, whereas the biases will always be initialized to 0. We will discuss this further later in the series; for now, just keep in mind that weights are best initialized to non-zero values between 0 and 1, and biases are best kept at 0 at the start.
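A minimal sketch of this initialization scheme for the 3–4–1 network is shown below. The helper name `init_layer`, the uniform distribution, and the fixed seed are assumptions for illustration; the worked example that follows uses hand-picked weight values instead.

```python
import random


def init_layer(n_prev, n_curr, rng):
    """Return (W, b) for a dense layer with n_prev inputs and n_curr neurons."""
    # Weights drawn uniformly from [0, 1); drawing exactly 0.0 is extremely unlikely.
    W = [[rng.random() for _ in range(n_prev)] for _ in range(n_curr)]
    # Biases start at zero.
    b = [0.0] * n_curr
    return W, b


rng = random.Random(0)          # fixed seed so the run is repeatable
W1, b1 = init_layer(3, 4, rng)  # hidden layer of the 3-4-1 network
W2, b2 = init_layer(4, 1, rng)  # output layer
print(len(W1), len(W1[0]), b1)  # 4 3 [0.0, 0.0, 0.0, 0.0]
```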
4. Computations on the Neurons of the Hidden Layer
For the first neuron in the hidden layer, we need to compute f¹₁, meaning we need initial values for the three weights w¹₁₁, w¹₁₂, and w¹₁₃.
Let’s initialize them as follows: w¹₁₁=0.3, w¹₁₂=0.8, and w¹₁₃=0.62. As said earlier, we will set the bias b¹₁=0. Let’s introduce an activation function called the Rectified Linear Unit (ReLU), which we will use as the function g¹ (this is just an arbitrary choice). We will discuss activation functions in more detail later, but for now we can simply define ReLU and use it: ReLU(z) = max(0, z), that is, it returns z when z is positive and 0 otherwise.
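The text does not list the input values of the training example, so the sketch below assumes x = (7, 8, 10) for (G1, G2, G3). This choice is an assumption, picked because it reproduces the hidden-layer outputs quoted in this section. Under that assumption, the computation of f¹₁ with ReLU could look like this:

```python
def relu(z):
    """Rectified Linear Unit: returns z for positive z, otherwise 0."""
    return max(0.0, z)


x = [7.0, 8.0, 10.0]      # ASSUMED input values (G1, G2, G3); not given in the text
w1_1 = [0.3, 0.8, 0.62]   # w¹₁₁, w¹₁₂, w¹₁₃ from the text
b1_1 = 0.0                # bias b¹₁

# Weighted input z¹₁ = 0.3*7 + 0.8*8 + 0.62*10
z1_1 = sum(w * xi for w, xi in zip(w1_1, x))
f1_1 = relu(z1_1 + b1_1)
print(round(f1_1, 2))  # 14.7
```

Because the weighted input is positive, ReLU passes it through unchanged.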
You can proceed to compute f¹₂, f¹₃, and f¹₄ in the same way, given that
- w¹₂₁=0.9, w¹₂₂=0.1, w¹₂₃=0.1, and b¹₂=0, for the calculation of f¹₂,
- w¹₃₁=0.7, w¹₃₂=0.2, w¹₃₃=0.4, and b¹₃=0, for the calculation of f¹₃, and,
- w¹₄₁=0.01, w¹₄₂=0.5, w¹₄₃=0.2, and b¹₄=0, for the calculation of f¹₄.
Confirm that f¹₂=8.1, f¹₃=10.5, and f¹₄=6.07.
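One way to check these values is the short script below. The input x = (7, 8, 10) is an assumption, since the text does not state the input values; the weights and biases are those listed above.

```python
def relu(z):
    """Rectified Linear Unit: returns z for positive z, otherwise 0."""
    return max(0.0, z)


x = [7.0, 8.0, 10.0]  # ASSUMED input values (G1, G2, G3); not given in the text

# Row j-1 holds the weights into hidden neuron j.
W1 = [
    [0.3, 0.8, 0.62],
    [0.9, 0.1, 0.1],
    [0.7, 0.2, 0.4],
    [0.01, 0.5, 0.2],
]
b1 = [0.0, 0.0, 0.0, 0.0]

# For each hidden neuron: weighted input plus bias, then ReLU.
f1 = [
    relu(sum(w * xi for w, xi in zip(row, x)) + b)
    for row, b in zip(W1, b1)
]
print([round(v, 2) for v in f1])  # [14.7, 8.1, 10.5, 6.07]
```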
5. Calculations in the Last Layer
In the final layer, let’s initialize the parameters (weights) and biases: w²₁₁=0.58, w²₁₂=0.1, w²₁₃=0.1, w²₁₄=0.42, and b²₁=0.
On the activation function in this layer: since we are dealing with binary classification, we can use the sigmoid (logistic) function, because it outputs values between 0 and 1 and can therefore be interpreted as the predicted probability of a class. The sigmoid function is defined as σ(z) = 1 / (1 + e⁻ᶻ).
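A minimal sketch of the output-layer computation follows. Three of the hidden-layer outputs are the values quoted earlier (f¹₂=8.1, f¹₃=10.5, f¹₄=6.07); f¹₁=14.7 is not stated in the text and is an assumption here, derived from the layer-1 weights with an assumed input of (7, 8, 10). The 0.5 decision threshold is also an assumption. The exact trailing digits of the result depend on how intermediate values are rounded.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hidden-layer outputs: the last three are quoted in the text;
# f1[0] = 14.7 is ASSUMED (computed from an assumed input of (7, 8, 10)).
f1 = [14.7, 8.1, 10.5, 6.07]
w2_1 = [0.58, 0.1, 0.1, 0.42]  # w²₁₁ .. w²₁₄ from the text
b2_1 = 0.0                     # bias b²₁

# Weighted input of the single output neuron, then sigmoid.
z2_1 = sum(w * f for w, f in zip(w2_1, f1))
y_hat = sigmoid(z2_1 + b2_1)

# With an assumed decision threshold of 0.5, this is a "pass" prediction.
print(y_hat > 0.5)  # True
```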
Since the output of the final layer is generated by the sigmoid function (which ranges from 0 to 1), the result can be interpreted as a probability. The output f²₁ = ŷ = 0.999997502 means that the predicted probability of a pass is almost one. From the data, the true value is 1, meaning the forward pass got the prediction correct. Note, however, that this is a mere coincidence because no training has been done: the weights were randomly generated and all bias values were set to 0.