In the real world, data comes in many forms. Sometimes the values vary so widely and arbitrarily that they are difficult to interpret. If we organize unstructured data systematically, it becomes more useful, and this has a large impact on data-driven work such as machine learning and data analysis. We must be careful, however, that no information is altered during the transformation. Real-world data should therefore be cleaned and transformed into a useful format. Among the many transformation techniques, normalization is one of the most widely used.
What is Normalization?
Normalization is a key part of data pre-processing that transforms features onto a similar scale. The goal is to change the values of numeric columns in the dataset to a common scale without losing information about the differences between values. Not every dataset in machine learning requires normalization; it is needed when features have different ranges.
For example, suppose we have two variables, height and weight. Height is measured in cm (e.g. 160cm, 171cm, etc.) and weight is measured in kg (e.g. 40kg, 80kg, etc.). Can you find any inconsistencies between the two variables? One problem is the different measurement units, and another is the different range and spread of the values. If we can bring the two variables onto the same scale without any information loss, it becomes much easier to compare them and to build efficient models. Normalization does exactly this.
Normalization can help improve the performance and training stability of a model.
Why is normalization needed in real-life scenarios?
In a real-world scenario, one feature might be fractional and range between zero and one, while another might range between zero and a thousand. This large difference in scale can cause problems when the values are combined as features during modeling.
In the above example, we can see a dataset with two features: Age and Salary. In the first scenario, the two variables are on such different scales that their distributions cannot be compared easily. In the second scenario, the distributions of the two variables look quite similar. The first scenario shows the distribution before normalization and the second shows the distribution after normalization.
It also suggests that a model trained on normalized data will often outperform a model trained on the same data without normalization.
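To make the scale problem concrete, here is a small sketch with made-up numbers. The three rows, the assumed feature ranges, and the helper `euclidean` are illustrative assumptions, not part of the Age and Salary example. It shows how a distance between rows is dominated by the large-scale feature until the features are rescaled.

```python
import numpy as np

# Hypothetical rows: [fraction in 0..1, amount in 0..1000]
a = np.array([0.10, 200.0])
b = np.array([0.90, 210.0])  # very different fraction, similar amount
c = np.array([0.12, 400.0])  # similar fraction, very different amount


def euclidean(u, v):
    """Plain Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((u - v) ** 2)))


# On the raw values the second feature dominates the distance.
d_ab = euclidean(a, b)  # about 10.03
d_ac = euclidean(a, c)  # about 200.0

# Divide each feature by its assumed range so both lie in 0..1.
ranges = np.array([1.0, 1000.0])
d_ab_scaled = euclidean(a / ranges, b / ranges)  # about 0.80
d_ac_scaled = euclidean(a / ranges, c / ranges)  # about 0.20

print(d_ab, d_ac)
print(d_ab_scaled, d_ac_scaled)
```

Before rescaling, row `a` looks closer to `b`; after rescaling, it looks closer to `c`. Any distance-based model would see the same flip.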
Difference between Normalization and Standardization
Normalization rescales the values into a range of [0,1]. This might be useful in some cases where all parameters need to have the same positive scale.
Standardization rescales data to have a mean (μ) of 0 and standard deviation (σ) of 1 (unit variance).
In the business world, “Normalization” typically means that the range of values is “normalized to be from 0.0 to 1.0”.
“Standardization” typically means that each value is “standardized” to express how many standard deviations it lies from the mean.
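The two definitions can be compared side by side on a tiny array. This is a minimal NumPy sketch; the sample values are made up for illustration.

```python
import numpy as np

values = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # made-up sample

# Normalization (min-max): rescale into the range [0, 1].
normalized = (values - values.min()) / (values.max() - values.min())

# Standardization (z-score): mean 0 and standard deviation 1.
standardized = (values - values.mean()) / values.std()

print(normalized)
print(standardized.mean(), standardized.std())
```

The normalized values are bounded by 0 and 1, while the standardized values are centered on 0 and have no fixed bounds.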
Different normalization techniques with hands-on examples.
There are multiple normalization techniques in statistics. Let’s have a look at four of them:
- The maximum absolute scaling
- The min-max feature scaling
- The z-score method
- The robust scaling
Now let us walk through each technique with a hands-on example.
To understand the normalization techniques, we create a simple dataframe with two columns. In the real world the dataset will be larger, but the technique remains the same.
import pandas as pd
df = pd.DataFrame(
    [[10500, 11.2], [20500, 11.0], [17500, 14.5], [30500, 14.3], [40000, 10.1]],
    columns=['col_1', 'col_2'],
)
If we look at the dataframe, col_1 ranges from 10500 to 40000, while col_2 ranges from 10.1 to 14.5. If we train a model on this dataframe as it is, many models (for example, distance-based or gradient-based ones) will effectively give more weight to col_1 than to col_2 simply because its numbers are larger, which makes no sense. To solve this problem, we normalize both variables.
The maximum absolute scaling
This estimator scales each feature individually so that the maximal absolute value of each feature in the training set is 1.0. It rescales each feature to the range [-1, 1] by dividing every value by the feature's maximum absolute value; it does not shift or center the data.
We can use the Scikit-learn library to apply maximum absolute scaling. First, we create an abs_scaler object with the MaxAbsScaler class. Then we use the fit method to find the maximum absolute value of each feature, and the transform method to divide each feature by it.
When we inspect scaled_dataframe, we find that both variables now lie between 0 and 1, because every value in this dataset is positive.
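A minimal sketch of the steps just described, assuming the same five-row dataframe. The names `abs_scaler` and `scaled_dataframe` follow the text; `scaled_data` is an intermediate name introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

df = pd.DataFrame(
    [[10500, 11.2], [20500, 11.0], [17500, 14.5], [30500, 14.3], [40000, 10.1]],
    columns=['col_1', 'col_2'],
)

# create the scaler object
abs_scaler = MaxAbsScaler()

# learn the maximum absolute value of each column
abs_scaler.fit(df)
print(abs_scaler.max_abs_)

# divide every column by its maximum absolute value
scaled_data = abs_scaler.transform(df)
scaled_dataframe = pd.DataFrame(scaled_data, columns=df.columns)
print(scaled_dataframe)
```

Each column's largest value becomes exactly 1.0, and the other values keep their proportions to it.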
The min-max feature scaling
Min-max scaling is similar to z-score normalization in that it replaces every value in a column with a new value using a formula. It rescales the feature to the fixed range [0,1] by subtracting the minimum value of the feature and then dividing by the range. In this case, the formula is Xnorm = (x - Xmin) / (Xmax - Xmin), where:
- Xnorm is our new value
- x is the original cell value
- Xmin is the minimum value of the column
- Xmax is the maximum value of the column
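The formula can be applied directly with pandas before reaching for a library. This is a sketch on the same five-row dataframe; `df_manual` is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[10500, 11.2], [20500, 11.0], [17500, 14.5], [30500, 14.3], [40000, 10.1]],
    columns=['col_1', 'col_2'],
)

# Xnorm = (x - Xmin) / (Xmax - Xmin), computed column by column
df_manual = (df - df.min()) / (df.max() - df.min())
print(df_manual)
```

Each column's smallest value maps to 0 and its largest to 1.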
We can use the Scikit-learn library to apply min-max scaling. First, we create a scaler object with the MinMaxScaler class. Then we fit the scaler parameters, meaning we calculate the minimum and maximum value of each feature. Finally, we transform the data using those parameters.
from sklearn.preprocessing import MinMaxScaler
# create a scaler object
scaler = MinMaxScaler()
# fit and transform the data
df_norm = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
df_norm
Maximum absolute scaling and min-max scaling are both very sensitive to outliers, because a single extreme value changes the maximum or minimum and therefore changes every scaled value.
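The effect of one outlier can be seen in a short sketch. It reuses the col_1 values and appends a single hypothetical extreme value (1,000,000), which is an assumption added for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

clean = np.array([[10500.0], [20500.0], [17500.0], [30500.0], [40000.0]])

# Same column plus one hypothetical extreme value
with_outlier = np.vstack([clean, [[1_000_000.0]]])

scaled_clean = MinMaxScaler().fit_transform(clean)
scaled_outlier = MinMaxScaler().fit_transform(with_outlier)

print(scaled_clean.ravel())        # spread across the full 0..1 range
print(scaled_outlier.ravel()[:5])  # original rows squeezed close to 0
```

With the outlier present, the five original values all fall below 0.05, so most of the [0,1] range is spent on a single point.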
The z-score method
In the z-score method, each value is expressed as its deviation from the mean divided by the standard deviation, so the transformed feature has a mean of 0 and a standard deviation of 1. In other words, each standardized value is computed by subtracting the mean of the corresponding feature and then dividing by its standard deviation: z = (x - μ) / σ.
The z-score typically ranges from -3.00 to 3.00 (more than 99% of the data) if the input is normally distributed. However, the standardized values can also be higher or lower, as shown in the picture below.
We can use the Scikit-learn library to apply z-score standardization. First, we create a std_scaler object with the StandardScaler class. The .fit() method calculates the parameters of the transformation (in this case the mean and the standard deviation), and the .transform() method uses those parameters to apply the standardization to the dataframe. Here we call .fit_transform(), which performs both steps at once.
from sklearn.preprocessing import StandardScaler
# create a scaler object
std_scaler = StandardScaler()
# fit and transform the data
df_std = pd.DataFrame(std_scaler.fit_transform(df), columns=df.columns)
df_std
This transformed distribution has a mean of 0 and a standard deviation of 1. It will be a standard normal distribution (see the image above) if and only if the input feature follows a normal distribution.
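As a cross-check, the same result can be computed by hand with pandas. This is a sketch on the same five-row dataframe; `df_manual` and `df_sklearn` are names introduced here for illustration. One detail to be aware of: StandardScaler divides by the population standard deviation, so the pandas call needs `ddof=0` (the pandas default is `ddof=1`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame(
    [[10500, 11.2], [20500, 11.0], [17500, 14.5], [30500, 14.3], [40000, 10.1]],
    columns=['col_1', 'col_2'],
)

# z = (x - mean) / std, using the population standard deviation (ddof=0)
df_manual = (df - df.mean()) / df.std(ddof=0)

# the same transformation via scikit-learn
df_sklearn = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

print(np.allclose(df_manual.values, df_sklearn.values))
print(df_manual.mean().round(6), df_manual.std(ddof=0).round(6))
```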
The robust scaling
In robust scaling, the median is subtracted from each value of a feature, and the result is divided by the interquartile range (IQR). Q2 refers to the median, and the interquartile range is the difference between Q3 (the 75th percentile) and Q1 (the 25th percentile).
Because it relies on percentiles rather than on the minimum, maximum, or mean, robust scaling can be used to scale numerical input variables that contain outliers.
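The median and IQR calculation can be sketched by hand with pandas on the same five-row dataframe. The names `median`, `iqr`, and `df_manual` are introduced here for illustration, and the 25th and 75th percentiles are assumed as the quartiles.

```python
import pandas as pd

df = pd.DataFrame(
    [[10500, 11.2], [20500, 11.0], [17500, 14.5], [30500, 14.3], [40000, 10.1]],
    columns=['col_1', 'col_2'],
)

median = df.median()                       # Q2 of each column
iqr = df.quantile(0.75) - df.quantile(0.25)  # Q3 - Q1 of each column

# subtract the median, then divide by the interquartile range
df_manual = (df - median) / iqr
print(df_manual)
```

The row holding a column's median maps to 0, and the other values are measured in units of that column's IQR.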
We use the Scikit-learn library to apply robust scaling.
from sklearn.preprocessing import RobustScaler
# create a scaler object
scaler = RobustScaler()
# fit and transform the data
df_robust = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
df_robust
Although there are multiple normalization techniques, each has its own use cases. The following table gives guidance on when to use each type of normalization technique.
The best normalization method depends on the data to be normalized. The z-score is the most commonly used. Min-max normalization guarantees that all features will have the same scale, but it does not handle outliers well. Robust scaling is helpful if your dataset has numerous outliers. It is always better to visualize each feature to gain insight into its distribution, skewness, and so on. Based on this analysis, you should choose the normalization technique to apply.
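As a starting point for that analysis, summary statistics and skewness can be inspected with pandas before choosing a scaler. This is a rough sketch on the same five-row dataframe; `summary` and `skewness` are names introduced here, and a real dataset would need more rows for these numbers to be meaningful.

```python
import pandas as pd

df = pd.DataFrame(
    [[10500, 11.2], [20500, 11.0], [17500, 14.5], [30500, 14.3], [40000, 10.1]],
    columns=['col_1', 'col_2'],
)

# count, mean, std, min, quartiles and max for each column
summary = df.describe()

# sample skewness of each column (0 for a symmetric distribution)
skewness = df.skew()

print(summary)
print(skewness)
```

A large gap between the maximum and the 75th percentile, or a skewness far from 0, hints that outliers are present and that robust scaling may be the safer choice.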