Machine Learning Models
Our dataset is now balanced! This ensures that our model sees the same number of examples from each class. We can now build machine learning models for classification. Two models will be used:
- Random Forest
- Logistic Regression
Before we start building models, it is crucial to define the performance metrics we will be using. Because our validation data is imbalanced, a model that simply predicted the majority class every time would still achieve a high accuracy while ignoring the minority class entirely. Hence, we will be using precision, recall, and F1 score instead.
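These metrics are available directly in scikit-learn. The labels below are purely hypothetical and chosen only to illustrate the computation on an imbalanced validation set:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical true labels and predictions for an imbalanced validation set
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) -> 0.5
print(recall_score(y_true, y_pred))     # TP / (TP + FN) -> 0.5
print(f1_score(y_true, y_pred))         # harmonic mean  -> 0.5
```

Note that with one true positive, one false positive, and one false negative, all three metrics come out to 0.5 here, even though plain accuracy would be a misleading 80%.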
First, we will build our Random Forest model, with 15 estimators. Keep in mind that, due to the size of the dataset, it will take a couple of minutes to run.
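A minimal sketch of such a model with scikit-learn is shown below. The article's dataset is not reproduced here, so `make_classification` generates a synthetic balanced stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic balanced stand-in for the article's dataset
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.5, 0.5], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=42)

# Random forest with 15 estimators, as described in the article
rf = RandomForestClassifier(n_estimators=15, random_state=42)
rf.fit(X_train, y_train)

print(f1_score(y_val, rf.predict(X_val)))
```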
Great! Our model had both high precision and high recall, meaning it produced relatively few false positives (precision) and few false negatives (recall) compared to true positives. Because of this, the F1 score was high as well.
Second, we will also build a Logistic Regression Model to solve this classification problem.
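A comparable sketch with scikit-learn's `LogisticRegression` follows, again using synthetic stand-in data since the original dataset is not shown:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic balanced stand-in for the article's dataset
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.5, 0.5], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=42)

# Logistic regression baseline; max_iter raised to ensure convergence
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_val)
print(precision_score(y_val, y_pred))
print(recall_score(y_val, y_pred))
```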
Well…in this case, while recall was high (the model selected more true positives than false negatives), the precision was very low, meaning the model selected more false positives than true positives. Hence, the F1 score suffered.
In comparing these two models, it is important to consider the metrics by which we evaluate them. Our logistic regression model had a higher recall (by approximately 5%), while the random forest classifier had a higher precision (by approximately 78%). Hence, in terms of F1 score, which is the harmonic mean of precision and recall, the random forest model performed better.
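The pull that a low precision exerts on F1 follows directly from the harmonic-mean formula. The values below are purely hypothetical and chosen only to illustrate the effect:

```python
# F1 is the harmonic mean of precision (p) and recall (r)
def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

# Hypothetical values: high recall alone cannot rescue F1
print(f1(0.9, 0.8))  # high precision and recall -> F1 of about 0.85
print(f1(0.1, 0.9))  # low precision drags F1 down to 0.18 despite high recall
```

Unlike the arithmetic mean, the harmonic mean is dominated by the smaller of the two values, which is why a model with very low precision scores poorly on F1 even with excellent recall.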
Finally, we will also make a neural network with two hidden layers to solve this classification problem.
Before creating our model, it is important to define the hyperparameters that we will be using. Here are the ones I used, but feel free to tune them even further with the goal of achieving improved results!
As shown in the diagram, we have:
- An input layer of 64 neurons with ReLU activation
- Two dense hidden layers of 32 and 16 neurons respectively, each with a ReLU activation
- An output layer of 1 neuron with a sigmoid activation
Additionally, we have:
- An Adam optimizer with a learning rate of 1e-4 and a decay of 1e-6
- Loss of Binary Crossentropy and metrics of precision and recall
The model will be trained for 5 epochs.
Below is the code I used to create the model, with the Keras API of TensorFlow.
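The original code listing is not reproduced here, but the architecture described above can be sketched as follows. The input width of 20 features and the random training data are assumptions for illustration; the article also passes `decay=1e-6` to Adam, a legacy argument omitted here for compatibility with recent TensorFlow versions:

```python
import numpy as np
import tensorflow as tf

# Random stand-in data; the real feature count depends on the dataset
X_train = np.random.rand(256, 20).astype("float32")
y_train = np.random.randint(0, 2, size=(256,))

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),   # input layer, 64 neurons
    tf.keras.layers.Dense(32, activation="relu"),   # hidden layer 1
    tf.keras.layers.Dense(16, activation="relu"),   # hidden layer 2
    tf.keras.layers.Dense(1, activation="sigmoid"), # output layer
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
)

model.fit(X_train, y_train, epochs=5, verbose=0)
```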
Now that we built the model, let’s run it to see our results!
As we can see, our validation precision ranged from approximately 19.52% to 61.26%, while the validation recall ranged from 86.03% to 90.44%. Epoch 4 had the best validation recall, and epoch 1 had the best validation precision. The model did not appear to overfit, as validation recall was still increasing in the later epochs. Feel free to change the number of epochs to see how the model performs over longer training runs! Do note, however, that it will take longer to train.