Step №1: Data Wrangling
The first stop in the data science process is data wrangling. In this context, “wrangling” simply means collecting: we gather data from various sources. This data is usually not well organized, but an important precaution is to make sure it is in a format that your software can actually read.
Data wrangling is usually pretty straightforward, but there are some nuances that can save a lot of time and effort. My first piece of advice is to wrangle only the data you need, and to always output it to a non-proprietary, traditional data format such as .CSV. This makes the data in your project reproducible, which matters if you want to be rejecting or accepting null hypotheses and doing other science things. In some cases this step might not even be required, because the data set you are working with already comes in these sorts of files.
Generally, whenever we wrangle data, the goal is not to get the data clean, but just to get it into some sort of readable input that we can then process. However, you could save yourself time, and potentially storage space or memory, by putting a little extra effort into cleaning the data as you wrangle it. This is by no means a necessity or what everyone does, but it can save you a good bit of time, possibly including the entire next step.
For those working in Python, the tools I would look at for this step are data collection libraries such as the requests module or Scrapy. If you are entirely new, I would recommend skipping this portion and downloading a .CSV or .JSON file from the internet.
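As a rough illustration of wrangling only what you need and saving it to .CSV, here is a minimal sketch using just the standard library. The JSON payload, the field names, and the wrangle_to_csv helper are all made up for this example; in a real project the text might instead come from something like requests.get(url).text or a downloaded .JSON file.

```python
import csv
import json
import os
import tempfile

# Hypothetical raw payload standing in for downloaded data.
raw_json = """
[
  {"id": 1, "species": "pine", "height_cm": 10.0, "notes": "found near trail"},
  {"id": 2, "species": "spruce", "height_cm": 12.5, "notes": ""},
  {"id": 3, "species": "pine", "height_cm": 8.2, "notes": "damaged"}
]
"""

# Only wrangle the fields the project actually needs (assumed names).
NEEDED_FIELDS = ["species", "height_cm"]


def wrangle_to_csv(json_text, csv_path, fields=NEEDED_FIELDS):
    """Parse JSON records, keep only `fields`, and write them to a CSV file."""
    records = json.loads(json_text)
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for record in records:
            writer.writerow({name: record[name] for name in fields})
    return len(records)


csv_path = os.path.join(tempfile.mkdtemp(), "pinecones.csv")
n_written = wrangle_to_csv(raw_json, csv_path)

# Read the file back to confirm it is in a format other tools can load.
with open(csv_path, newline="") as f:
    rows = list(csv.DictReader(f))

print(n_written, rows[0])
```

Note that everything read back from a .CSV is text, so numeric columns need to be converted again when the file is loaded.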
Step №2: Preprocessing
The initial preprocessing of data should not involve very much. If there are features like “date”, “name”, “id”, or similar features that are entirely useless, then it might be a good idea to go ahead and get rid of them. The fewer features you are working with, the fewer steps you have to do. However, the more statistically significant features you have, the better your chance of getting a great model out of your project.
Probably the biggest thing that needs to be done in the preprocessing stage is dropping missing values from the DataFrame. If we put this off, many of our functions are likely to return errors when they encounter a missing value, and after we split our features up we would have multiple variables littered with missing values.
For working in Python, you might want to get familiar with the Pandas df.drop() and df.dropna() methods. These can be used to drop columns and to drop observations with missing values, respectively.
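Here is a small sketch of what those two calls might look like together. The DataFrame and its column names are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two useless columns and some missing values.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["a", "b", "c", "d"],
    "height": [10.0, np.nan, 8.2, 9.1],
    "width": [6.0, 5.5, np.nan, 4.8],
})

# Drop columns that carry no predictive information.
df = df.drop(columns=["id", "name"])

# Drop every observation (row) that has at least one missing value.
clean = df.dropna()

print(clean)
```

Both methods return a new DataFrame by default, which is why the results are assigned back to a variable.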
Step №3: Analysis
The third step that I usually take is analysis. Now that the data can at least be looked at without throwing errors, we should take a deep dive into each feature. If our data already has a target in mind, which is usually the case, then try to analyze the features that seem likely to be correlated with your target, or that prove to be. Find oddities in them, find the mean, the most common value, how many categories there are, things like this. By the time all of the features are analyzed, you should have some features that you believe correlate really well with the target, which will help in the next step.
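To make that a bit more concrete, here is a minimal sketch of those per-feature checks in Pandas. The data is invented, and treating "weight" as the target is purely an assumption for the example.

```python
import pandas as pd

# Hypothetical observations; "weight" plays the role of the target.
df = pd.DataFrame({
    "species": ["pine", "spruce", "pine", "fir", "pine"],
    "height": [10.0, 12.5, 8.2, 9.1, 11.0],
    "weight": [50.0, 64.0, 41.0, 44.0, 57.0],
})

mean_height = df["height"].mean()        # average of a continuous feature
most_common = df["species"].mode()[0]    # most common value of a categorical feature
n_categories = df["species"].nunique()   # how many distinct categories there are

# Pearson correlation between one feature and the target.
corr_with_target = df["height"].corr(df["weight"])

print(mean_height, most_common, n_categories, round(corr_with_target, 3))
```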
For Python programmers, it is probably a good idea to familiarize yourself with both numpy and scipy.stats before analyzing any data. Visualizations via matplotlib, seaborn, Plotly, etc. are also an excellent way to learn more about features quickly. During this time, it is also a great idea to go ahead and fit a baseline model. A baseline model will make it a lot easier to know just how hard the target is to predict. More importantly, it will give you a solid starting place.
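One possible way to get a baseline, assuming a continuous target, is scikit-learn's DummyRegressor, which simply predicts the mean of the target every time. The numbers below are invented, and this is only one of several reasonable baselines.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Hypothetical feature (height) and continuous target (weight).
X = np.array([[10.0], [12.5], [8.2], [9.1], [11.0]])
y = np.array([50.0, 64.0, 41.0, 44.0, 57.0])

# The baseline ignores the features and always predicts the mean of y.
baseline = DummyRegressor(strategy="mean")
baseline.fit(X, y)
baseline_pred = baseline.predict(X)

# Any real model should beat this error to be worth keeping.
baseline_mae = mean_absolute_error(y, baseline_pred)

print(baseline_pred[0], baseline_mae)
```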
Step №4: Feature Selection
The next step is feature selection. Feature selection is probably one of the most important steps in the entire modeling process, because the features are the one absolute key to building a model that can successfully predict your target. Bad features make bad predictions, so feature selection and feature processing are probably two of the hardest parts, and the parts that will affect your model most dramatically.
During this time, the tests from your analysis of the data can be used to identify the most valuable features. This is also where you can engineer features. Why should you engineer features? Engineering features is a great way to reduce the dimensionality of your input data. For example, say we were studying pinecones and we had three features: width, height, and depth. We could multiply those together into volume, and volume would likely capture the strength of those features a lot better than the separate features do. Just think about it this way.
How big is a pinecone with a 10cm height and a 6cm width? Compare that to saying the pinecone has a volume of 60 cubic centimeters. The second is a single value that we can immediately assess and compare. There are examples like this all over machine learning, and feature selection is important because it creates these values. All of this is typically done by hand, or via indexing. Filtering values can be done by indexing a pd.DataFrame with a boolean mask, as in df[mask]. Indexing a DataFrame with a mask keeps only the rows where the condition is true and removes the rest. You can also use mapping functions here to map each value to a mask entry. The mask just needs to hold True/False values (0/1 values can be converted with .astype(bool)).
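The pinecone idea and the mask filtering can be sketched together like this. The measurements and the thresholds are made up, and multiplying the three dimensions gives a bounding-box volume rather than a true pinecone volume.

```python
import pandas as pd

# Hypothetical pinecone measurements in centimeters.
cones = pd.DataFrame({
    "height": [10.0, 7.0, 12.0],
    "width": [6.0, 4.0, 5.0],
    "depth": [5.0, 4.0, 5.0],
})

# Engineer one feature out of three by multiplying the dimensions.
cones["volume"] = cones["height"] * cones["width"] * cones["depth"]

# A conditional on a column produces a boolean mask (one True/False per row).
mask = cones["volume"] > 200

# Indexing with the mask keeps only the rows where it is True.
large = cones[mask]

# A mapping function can also build the mask, as long as it returns booleans.
tall_mask = cones["height"].map(lambda h: h >= 10)
tall = cones[tall_mask]

print(large)
print(tall)
```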
Step №5: Feature Processing
The next step in my data science process is feature processing. Features often require encoding, normalization, and other important steps like this. In some cases, the model will not be able to make predictions from our input data at all if we have not processed it.
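As a minimal sketch of encoding and normalization, assuming scikit-learn, one categorical feature could be one-hot encoded and one continuous feature standardized like so. The feature names and values are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical features: one categorical, one continuous.
species = np.array([["pine"], ["spruce"], ["pine"], ["fir"]])
volume = np.array([[300.0], [112.0], [300.0], [150.0]])

# Encoding: turn each category into its own 0/1 column.
encoder = OneHotEncoder()
species_encoded = encoder.fit_transform(species).toarray()

# Normalization: rescale to zero mean and unit standard deviation.
scaler = StandardScaler()
volume_scaled = scaler.fit_transform(volume)

# Stack the processed features into a single input matrix.
X = np.hstack([species_encoded, volume_scaled])

print(encoder.categories_[0], X.shape)
```

The fitted encoder and scaler are worth keeping around, since new data has to go through exactly the same transformations later.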
In Python, some tools you might want to look at for this are scikit-learn, as well as some that come with TensorFlow for batching and these sorts of things. Encoders and scalers are probably the most popular choices for these operations, but really your processors could be anything. Usually these objects come together in some sort of pipeline wrapper, or Python file, as typically we would then serialize this model and automate the feature processing. This is another good reason to put more effort into some of the testing. We also need to do one more portion of feature processing, as these lines are somewhat blurred. Now we will want to do a train/validation/test split, most likely using the train_test_split() function. This subsamples observations randomly and then separates them into two different sets with the same features for us. The reason why we want to do this to our data AFTER it is processed, rather than doing it when our
Step №6: Modeling
The step that might feel the biggest and most engaging is modeling. In the modeling step, we take the data that we have carefully put together and provide it as input to our machine-learning model. Depending on your model, hyper-parameters might also need to be tuned during this part of the process.
This part is relatively straightforward, as the inputs for most libraries are mapped to two simple positional arguments. Ensure your dimensions are correct, and send your features into the model. Get a prediction, check it against your validation set, then go back and see if there is more that can be done to achieve better accuracy. Eventually, with enough work, you will get a fairly confident model that might be ready for pipelining.
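Here is a rough sketch of that loop with scikit-learn. The data is synthetic, and LinearRegression is only a stand-in for whatever model you are actually using. Since train_test_split() returns two sets per call, one way to get train, validation, and test sets is to call it twice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: 100 observations, 2 features, a nearly linear target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, size=100)

# First call: 60% train, 40% held out.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0
)
# Second call: split the held-out part evenly into validation and test.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

# The two positional arguments: features first, then the target.
model = LinearRegression()
model.fit(X_train, y_train)

# Check predictions on the validation set (R^2 for regressors).
val_score = model.score(X_val, y_val)

print(X_train.shape, X_val.shape, X_test.shape, round(val_score, 3))
```

The test set stays untouched until you are done going back and forth with the validation score.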
Step №7: Pipelining
The last step is pipelining everything together. You’ll want to include any pre-processing methods you used inside of this pipeline. It is vital that the processing is the same, so that the inputs of the model always remain in the same format, and therefore the outputs for a given feature set remain the same as well.
Inside of most ML modules, there is usually a somewhat robust pipelining interface. In the case of scikit-learn, which you will likely be using for your first couple of models, you can use the Pipeline constructor to create a new pipeline.
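A minimal sketch of such a pipeline might look like the following. The step names, the synthetic data, and the choice of StandardScaler plus LinearRegression are assumptions for the example, and pickle is just one way to serialize the result.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with an exactly linear target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1]

# Each step is a (name, object) pair; the last step is the model.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
pipe.fit(X, y)

# New observations go through the same scaling before prediction.
new_obs = np.array([[1.0, 2.0]])
pipe_pred = pipe.predict(new_obs)

# The whole pipeline can be serialized and restored as one object.
restored = pickle.loads(pickle.dumps(pipe))
restored_pred = restored.predict(new_obs)

print(pipe_pred, restored_pred)
```

Because the scaler lives inside the pipeline, raw inputs can be passed straight to predict() and the processing is guaranteed to match what the model saw during training.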