Mitigate Your Model With Data




Part №1: Our Model

For our model, I went over to FiveThirtyEight to grab this data on airline safety. FiveThirtyEight is also a great platform for finding data to practice with in general: for those who are studying, or constantly improving, these data-sets are clean enough to use but not so clean that no work goes into them. In other words, I think they are great practice.

Data Wrangling

For our first instance, we will not be modeling according to the process I outlined in that article. This is just so we can get a particularly bad model and then bring it back to life by fixing up the features. Now let's go ahead and start the wrangling step, one of the few steps we are actually going to do here, beginning with importing our dependencies and grabbing a clone of our CSV files with Git via Bash:

# deps
import pandas as pd
import numpy as np
!git clone https://github.com/fivethirtyeight/data

This data is licensed under the Creative Commons Attribution license by FiveThirtyEight, so it is pretty much free to use anywhere you want.

Now we will continue by reading in our CSV with Pandas via read_csv():

df = pd.read_csv("data/airline-safety/airline-safety.csv")
df.head(5)
(image by author)

Okay, I do not know who Aeroflot is, but I would not get on their planes.

Preprocessing

This processing step is more like pre-pre-processing, as our mission is just to get the data frame readable enough. Once it is, we will be skipping the next few steps I would normally take, as we are going to see just how important and effective those steps are towards making a great prediction. For this data frame, the first thing I will be doing is just checking for missing values by getting a sum of the observations which are null.

df.isnull().sum()

airline                   0
avail_seat_km_per_week    0
incidents_85_99           0
fatal_accidents_85_99     0
fatalities_85_99          0
incidents_00_14           0
fatal_accidents_00_14     0
fatalities_00_14          0
dtype: int64

How clean and convenient: there are no missing values. Given that, I am going to head straight into feature selection; however, I am not going to separate these features based on their strengths yet, as we will do that after we get an initial little-to-no-effort metric on a model for this data.
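Had the check turned up missing values, by the way, a minimal way to handle them (just a sketch with standard Pandas calls, not a step we actually need here) would look something like this:

# drop any rows that contain a missing value
df = df.dropna()
# or, instead, fill numeric gaps with each column's median:
# df = df.fillna(df.median(numeric_only=True))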

Feature and Target Selection

Now that we know we can actually put these features through some data processing, let's go straight to selecting our features and our target. Since this is my low-effort example, where the data is not the focus, I am not going to put much effort in here. Instead I will just select a target and move on.

So… I could not help myself, as the features here are quite alluring.

(image by author)

According to the README.md, as well as just assumptions I can make about this data, we have incidents, fatal accidents (incidents that resulted in one or more deaths), and then the number of deaths. The way I saw it, we could make a few different modeling examples that would be really cool. One idea I had initially is that we could predict how many fatalities we might run into depending on how many kilometers of travel our planes go through on average per week. Another cool idea is to provide some of these other values and have a binary classifier try to predict whether an observation took place between 1985–1999 or 2000–2014. Given that 1999 was the year I was born, I think it would be interesting to find out whether planes were deathtraps before then.

There is also a real-world change that happened in September of 2001 that could very plausibly have changed how these airlines are monitored by the government. With more security, more responsibility, and a more watchful eye on these companies, I think it is interesting to imagine how much of an impact an event like that had on something only indirectly related to it. I suspect the same phenomenon might be observable in similar data from 2021 as well, which I do think would be an interesting study.

Look at what happened to me.

This is why Data Science is so much fun, but for now let us just select a target to work through; maybe in the future, when I get some free time, I will try a different target on this data-set, and maybe I will use it for another article here on my blog! Anyway, today I am going the safe route, because an unfortunate thing about this data-set is…

len(df)

56

With so few observations, split amongst two separate sets of years, there is a serious risk of underfitting the fewer features we retain here. With that in mind, I think the best thing to do might be the binary classifier. The binary classifier is going to require us to engineer a target, however, and that is problematic because this is supposed to be a very low-effort model. As I have discussed in the conclusions of many articles like this, though, no two data-sets are the same, so there are always cases where we need to do out-of-the-ordinary things. That is part of what makes doing this so engaging. Still, with only 56 total observations, we might want to approach things differently. For example, if we were to do a train/test/val split on this, the training data would be

len(df) * .75 * .75

31.5

observations long. In other words, we would be training on 31 observations and predicting on 11 validation observations and 14 testing observations, all while training our model on a small amount of data. A better choice is probably to just do a regular train/test split here. Let us first engineer our target, which is basically just a boolean, though we will store it as an integer. We also need to engineer some of the features to go along with it. This will help our lack-of-data problem substantially: given that our length is so low, we can effectively double our data by choosing this target. I will start by getting our features out into an array of labels:

features = ["incidents", "fatal_accidents", "fatalities"]

Notice that these features do not include the year. This is because I am now going to use a little one-line comprehension to concatenate the two year ranges for each feature, so that we no longer have the year contained within the feature names. We essentially need to translate these names into new columns on our data frame. This neat little one-liner does exactly that; you could also use lambda and map together to accomplish the same goal.

z = [list(df[feature + "_85_99"]) + list(df[feature + "_00_14"]) for feature in features]
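For reference, the lambda-and-map version of the same concatenation might look like this (an equivalent sketch, not something we need in addition to the line above):

z = list(map(lambda feature: list(df[feature + "_85_99"]) + list(df[feature + "_00_14"]), features))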

Now we have effectively doubled our observations when we put this together into a DataFrame:

len(z[1])

112

features = pd.DataFrame({"incidents" : z[0],
                         "fatal_accidents" : z[1],
                         "fatalities" : z[2]})

I wrote a cruddy little function to generate y to go along with this: just a simple condition that returns one number or another depending on the period a value represents. Since all the values past the length of our original data frame, which has not been mutated, are our appended values, we can use that to our advantage. Unfortunately, this means that many built-in functions will not work for this, as we MUST know the index of each value. With that in mind, I wrote this elementary function to do all of this for me:

def yearcategory(x):
    # Rows at indices past the original frame's length are the appended
    # 2000-2014 values; everything before that came from 1985-1999.
    z = []
    for count, i in enumerate(x):
        if count >= len(df):
            z.append(2014)
        else:
            z.append(1985)
    return z

years = yearcategory(features["fatalities"])
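Since NumPy is already imported, an index-based alternative could do the same labeling in one line; this is just a sketch of an equivalent approach, not something the rest of the walkthrough depends on:

# rows at indices past the original frame's length are the appended
# 2000-2014 values, so label those 2014 and the rest 1985
years = np.where(np.arange(len(features)) >= len(df), 2014, 1985)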

And now we have all we need for modeling this target! Along the way, we also did some light feature engineering. We will come back and iterate on this to make our model even better in the future. One feature I did get rid of, though perhaps I should not have, since we might not end up needing to drop any features at all in this example, is the kilometers-traveled feature. However, it is clearly not significant here, just like the airline labels: they are not important to this research, so we might as well get rid of them.

It is hard for me to build a model without engaging with the data. It is probably entirely impossible, in all honesty, but hopefully this is a further step towards proving that even doing this much is not always going to be sufficient.

Random Elements

There are some elements of each step's outline that are usually done a bit out of order. For example, splitting the data is generally considered part of feature engineering, but it makes sense to process your training X and testing X together so that they receive the same sort of processing. We will start this off by sending years into our features data frame.

features["years"] = years

Given that our other features are all continuous, we could very easily just lump all of these values together into a model and get a prediction right now, no processing necessary. Since this is our dry run, that is of course what we are going to do. Here is our train/test split, followed by putting those values into the proper format for our model to interpret:

from sklearn.model_selection import train_test_split

target = 'years'
train, test = train_test_split(features)
trainX = train.drop(target, axis = 1)
trainy = train[target]
testX = test.drop(target, axis = 1)
testy = test[target]
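Before modeling, we can sanity-check the split sizes; a quick sketch (train_test_split holds out 25 percent by default):

# with the default 25% holdout on our 112 rows, the shapes should come
# out to (84, 3) for training and (28, 3) for testing
trainX.shape, testX.shape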

Modeling

The good news is: we still do not really know much about these features. It seems weird to be modeling these values as we speak without even knowing what the mean of each of them is, but regardless, this is the path I have selected in life. We need a classifier that will be very effective when it comes to predicting a binary target like this. We could use essentially any classifier; however, a general-purpose classifier, such as a random forest splitting on the Gini index, has a tendency to over-think things. During this whole process, especially given our smaller amount of data and the various restrictions we have on our target, we might want to select a model that is more directed towards binary targets.

from sklearn.linear_model import LogisticRegression

I ended up going with a logistic regression here. I chose it because logistic regression is quite good at predicting binary values.

# note the parentheses: we need an instance, not the class itself
model = LogisticRegression()
model.fit(trainX, trainy)
yhat = model.predict(testX)

Now let's see just how well that model did:

from sklearn.metrics import accuracy_score
accuracy_score(testy, yhat)
0.4642857142857143

Wow, this model did terribly.
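How terribly? Since the engineered target is perfectly balanced, a quick check (just a sketch) shows that always guessing a single class would already land around fifty percent, so we are doing worse than a coin flip:

# each period accounts for exactly half of the 112 rows, so a
# majority-class guess would score about 0.5
features["years"].value_counts(normalize=True)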

No surprises here, but we can now improve this model’s accuracy by demonstrating some processing and mitigation techniques in this sort of situation!
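As a small preview of the kind of mitigation we will dig into, here is a minimal sketch of one such technique, standardizing the features before refitting the model. I am not claiming this one step is the whole fix; it just illustrates the shape of what is coming:

from sklearn.preprocessing import StandardScaler

# fit the scaler on the training features only, then apply that same
# transformation to the test features
scaler = StandardScaler()
trainX_scaled = scaler.fit_transform(trainX)
testX_scaled = scaler.transform(testX)

model = LogisticRegression()
model.fit(trainX_scaled, trainy)
accuracy_score(testy, model.predict(testX_scaled))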


