Missing Value Handling — Imputation and Advanced Models


Imputation is an effective tool for handling missing values. The problems caused by different types of missingness are mitigated by inserting a descriptive constant or by computing a replacement from the remaining known values.

While no approach is perfect, and no imputed value is as good as the actual data, imputation can be better than removing the instance entirely.

There are many different approaches to missing value imputation; this article focuses on three:

  • Simple imputation
  • KNN Imputation
  • Iterative Imputation

These methods are found in the commonly used scikit-learn package and are compatible with standard data formats in Python. The basic process to impute missing values into a dataframe with a given imputer is written in the code block below.

import pandas as pd
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
# df is a pandas dataframe with missing values
# fit_transform returns a numpy array
df_imputed = imputer.fit_transform(df)
# Convert back to a pandas dataframe
df_imputed = pd.DataFrame(df_imputed, columns=df.columns)

Simple Imputation

The most basic imputation method fills each missing data point with a constant value. Alternatively, you can have the imputer calculate and insert the mean, median, or most frequent value of each feature.

When the number of features is relatively large and missing values are few, this is a practical approach, as the few imputed values may have a negligible effect on overall model performance.
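As a minimal sketch of these strategies, the snippet below (with a small made-up dataframe) compares the mean strategy against a constant fill; the column names and values are purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# A tiny example frame with missing values (made up for illustration)
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0],
                   "income": [50_000.0, 62_000.0, np.nan, 58_000.0]})

# strategy can be 'mean', 'median', 'most_frequent', or 'constant';
# statistics are computed per column, not over the whole frame
mean_imputer = SimpleImputer(strategy="mean")
filled = mean_imputer.fit_transform(df)  # numpy array

# 'constant' fills every gap with the same fill_value
const_imputer = SimpleImputer(strategy="constant", fill_value=0)
filled_const = const_imputer.fit_transform(df)
```

The missing age is replaced by the column mean of the known ages, while the constant imputer simply writes 0 into every gap.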

KNN Imputation

KNN Imputation provides a more detailed approach than simple imputation. Using the K-most similar records to the instance with the missing value, some dependencies between missing and non-missing values can be modeled.

Thus, this method is more flexible and can somewhat handle data that is missing at random.

KNN Imputation is more computationally expensive than simple imputation. Still, if your dataset is not in the range of tens of millions of records, this method works fine.
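A minimal sketch of this approach with scikit-learn's KNNImputer is shown below; the dataframe and the choice of two neighbors are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# A tiny example frame with missing values (made up for illustration)
df = pd.DataFrame({"height": [160.0, 172.0, np.nan, 181.0],
                   "weight": [55.0, 70.0, 68.0, np.nan]})

# Each missing value is filled with the mean of that feature over the
# k most similar rows, where similarity is measured on the features
# that are present in both rows
imputer = KNNImputer(n_neighbors=2)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```

Because the fill is derived from similar records rather than a global statistic, it can capture some dependency between features.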

Iterative Imputation

Similar to KNN imputation, iterative imputation can model complex relationships between known values and predict missing features. This method is a multi-step process that creates a series of models to predict missing features based on the known values of other features.

Iterative imputation is a complicated algorithm, but the overall approach is relatively straightforward.

  1. Impute all missing values with simple imputation. This initial fill gives the models in the following steps complete inputs to fit on.
  2. Determine an order of imputation. The implementation offers several ordering options, and this choice can affect the final result because earlier predictions feed into later ones.
  3. Impute one feature by training a model on all other features, using that feature's known values as the training target.
  4. Repeat this process for each feature.
  5. Repeat the full pass over all features several times, or until the change between complete iterations falls below a tolerance threshold.
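The steps above map directly onto the parameters of scikit-learn's IterativeImputer, sketched below on a made-up dataframe; the specific parameter values are illustrative, not recommendations.

```python
import numpy as np
import pandas as pd
# IterativeImputer is still marked experimental; this import enables it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# A tiny example frame with missing values (made up for illustration)
df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0],
                   "b": [2.1, 4.2, 6.1, np.nan]})

# initial_strategy is the simple-imputation fill (step 1),
# imputation_order controls step 2, and max_iter / tol bound
# the repeated passes and convergence check (step 5)
imputer = IterativeImputer(initial_strategy="mean",
                           imputation_order="ascending",
                           max_iter=10, tol=1e-3, random_state=0)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```

Each feature with missing values is modeled as a regression target on the other features, round-robin, until the estimates stabilize.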

Iterative imputation uses Bayesian Ridge regression as the default estimator; however, you can substitute an estimator of your choice.
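Swapping the estimator is a one-line change, sketched below with a random forest; the choice of regressor and its settings here are assumptions, and any scikit-learn regressor with fit/predict would work.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Tiny made-up matrix with one missing value per column
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [np.nan, 30.0],
              [4.0, 40.0]])

# Replace the default BayesianRidge with a random forest
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    random_state=0,
)
X_imputed = imputer.fit_transform(X)
```

A nonlinear estimator like this can help when the relationships between features are not well approximated by a linear model.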

One drawback to iterative imputation is that it is more computationally expensive than the other imputation methods. Thus, for enormous datasets, KNN imputation may be preferable.

