Classification.




We all know that linear regression is for quantitative scenarios, and when the response is categorical we go for classification. As the name suggests, the process classifies observations into groups in order to predict a qualitative response. Often the process first predicts the probability of each category before making the classification, thereby behaving like a regression algorithm.

However, the divide is not always clean. Least squares linear regression is used with a quantitative response, yet we find logistic regression being used on a two-class qualitative response; because it involves estimating class probabilities in order to classify, it is still referred to as regression. Similarly, K-NN and boosting can be used for both qualitative and quantitative responses. We tend to base our model selection on whether the response is qualitative or quantitative, choosing linear regression or logistic regression accordingly, whereas the type of the predictors is of less significance.

There are many ways by which we can classify and predict, and depending on the data, the ultimate goal and the scenario involved we may choose one. No matter which model we choose, the focus is accurate prediction. While we are still designing our models, the way to gauge accuracy is to measure the errors on the training dataset and then revalidate on the test dataset. The most common way of quantifying the accuracy of the estimate is the training error rate, the proportion of mistakes made when we apply our estimate to the training observations; the success of a model, however, is judged by the minimum test error.

One such classifier is the Naïve Bayes classifier. Naïve Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions between the features. It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of the other features. It can be used when the dataset is moderate to large in size and has several attributes, the main criterion being that the attributes should be conditionally independent given the class.

A typical R code to implement Naïve Bayes classification:
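The sketch below is a minimal example, assuming the e1071 package and R's built-in iris dataset; a real project would substitute its own data frame and class column.

# A minimal Naive Bayes sketch, assuming the e1071 package and the iris data
library(e1071)

set.seed(123)
idx   <- sample(nrow(iris), 0.7 * nrow(iris))   # 70/30 train-test split
train <- iris[idx, ]
test  <- iris[-idx, ]

# Fit Naive Bayes: class column as response, all other columns as features
nb_model <- naiveBayes(Species ~ ., data = train)

# Predict classes on the test set and inspect a confusion matrix
pred <- predict(nb_model, newdata = test)
table(Predicted = pred, Actual = test$Species)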

Now the steps involved in performing Naïve Bayes remain the same as for any classifier:

Ø Split the data into train and test sets

Ø Do feature scaling if required

Ø Fit Naïve Bayes to the training set

Ø Predict on the test set

Ø Perform model evaluation.

Though it is easy to implement and gives good results, it can be less accurate at times because the conditional independence assumption rarely holds exactly; in practice there are dependencies among the variables.

Having said all this, we can use Naïve Bayes classification in the following business scenarios:

Ø Sentiment Analysis

Ø Document Categorization

Ø Email spam filtering

Ø Classification of news articles into subgroups.

Now it might not always be the case that we know the conditional probabilities of Y given X, in which case we cannot apply the Bayes classifier directly. In such cases our approach is to estimate the conditional distribution of Y given X and then classify a given observation to the class with the highest estimated probability. KNN, or K-nearest neighbors, does exactly that. For a given integer value of K, the KNN classifier identifies the K points in the training data that are closest to x0 and estimates the conditional probability of each class as the fraction of those K neighbors that belong to it. It then applies Bayes' rule and classifies the test observation to the class with the highest estimated probability.

If we were to implement a KNN model, we would take the following steps (a short code sketch follows the list):

1. Choose the number K of neighbors

2. Take the K nearest neighbors of the unknown data point according to a distance measure, typically Euclidean distance

3. Among these K neighbors, count the number of data points in each category

4. Assign the new data point to the category with the most neighbors.
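To make the steps concrete, here is a minimal sketch assuming the class package and the same iris data; knn() uses Euclidean distance on numeric features, so the features are scaled first.

# A minimal KNN sketch, assuming the class package and the iris data
library(class)

feats <- scale(iris[, 1:4])          # scale features so no attribute dominates the distance
labs  <- iris$Species

set.seed(123)
idx <- sample(nrow(iris), 0.7 * nrow(iris))

# K = 5 nearest neighbours by Euclidean distance
pred <- knn(train = feats[idx, ], test = feats[-idx, ], cl = labs[idx], k = 5)
table(Predicted = pred, Actual = labs[-idx])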

As is apparent, the key is choosing the right value of K. While choosing the optimal value of K we need to keep in mind that if K is too small, the neighborhood will be sensitive to noise points; then again, if K is too large, the neighborhood will include points from other classes.

R is a powerful language in which the optimal value of K can be determined with a few lines of code. For this specific example the accuracy of prediction increases as the value of K is increased, but after K = 73 the accuracy drops again; hence for this specific example K = 73 gives the best result. In general there needs to be a bit of trial and error to determine the best K.
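A sketch of that trial-and-error loop, reusing the scaled features and split from the KNN snippet above; note that K = 73 is specific to the author's dataset and will not reproduce on other data.

# Try a range of K values and record test accuracy for each
accuracy <- sapply(1:100, function(k) {
  pred <- knn(train = feats[idx, ], test = feats[-idx, ], cl = labs[idx], k = k)
  mean(pred == labs[-idx])
})
best_k <- which.max(accuracy)                       # K with the highest test accuracy
plot(1:100, accuracy, type = "l", xlab = "K", ylab = "Test accuracy")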

Another point to remember is scaling. Attributes may have to be scaled to prevent distance measures from being dominated by any one attribute; to avoid this bias we should normalize the feature variables. The whole principle of KNN is based on the distance between points, and hence scaling is pivotal in this model.
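One common choice is min-max normalization to the [0, 1] range, sketched below on the iris features; scale(), which standardizes to zero mean and unit variance, is an equally valid alternative.

# Min-max normalization: every feature ends up in [0, 1]
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
iris_norm <- as.data.frame(lapply(iris[, 1:4], normalize))
summary(iris_norm)   # check that each column now ranges from 0 to 1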

Logistic Regression

Logistic regression is a very common approach for classification wherein, instead of modeling the response Y directly, we model the probability of Y belonging to a particular category. The logistic function always produces an S-shaped sigmoid curve regardless of the value of X. As with any equation we need to estimate the coefficients: the regression coefficients are unknown and are estimated from the available training data. Least squares is a common approach, but maximum likelihood is the preferred method.
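A minimal sketch using R's glm(), which fits the coefficients by maximum likelihood; the data frames train and test and the two-level factor response y here are hypothetical placeholders.

# Logistic regression sketch; `train`, `test` and the binary factor `y` are placeholders
fit <- glm(y ~ ., data = train, family = binomial)
summary(fit)                                    # coefficients, null and residual deviance, AIC

# Predicted probabilities follow the S-shaped logistic curve
p_hat <- predict(fit, newdata = test, type = "response")
pred_class <- ifelse(p_hat > 0.5, "Yes", "No")  # "Yes"/"No" stand in for the two class labels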

Now if the classes are well separated, the parameter estimates for logistic regression turn out to be rather unstable. Logistic regression also does not perform best when there are more than two response classes, or when n is small and the distribution of the predictors is approximately normal in each of the classes. As an alternative, LDA, or Linear Discriminant Analysis, gains prominence. In LDA we model the distribution of the predictors separately in each response class and then use Bayes' theorem to flip these around into estimates of the class probabilities. The LDA classifier results from the assumption that the observations within each class are drawn from a normal distribution with a class-specific mean and a common variance.

Like LDA, QDA, or Quadratic Discriminant Analysis, also assumes a Gaussian distribution, but unlike LDA it assumes that each class has its own covariance matrix. In effect, if there are K classes and p predictors, LDA assumes that all K classes share a common covariance matrix based on p(p+1)/2 parameters, while QDA estimates a separate covariance matrix for each class, for a total of K*p(p+1)/2 parameters. By assuming a common covariance matrix LDA becomes linear; it is therefore a much less flexible classifier than QDA and has lower variance, which can lead to improved prediction performance. There is a flip side: if the assumption that all K classes share a common covariance matrix is badly off, LDA suffers high bias. This suggests that with few training observations LDA might be a better choice than QDA, while a larger training set is more suitable for QDA.
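A sketch contrasting the two in R, assuming the MASS package and the same hypothetical train/test data frames with class column y used above.

# LDA vs QDA sketch; `train`, `test` and the class column `y` are placeholders
library(MASS)

lda_fit <- lda(y ~ ., data = train)    # one covariance matrix shared by all classes
qda_fit <- qda(y ~ ., data = train)    # a separate covariance matrix per class (needs more data per class)

lda_pred <- predict(lda_fit, newdata = test)$class
qda_pred <- predict(qda_fit, newdata = test)$class

mean(lda_pred == test$y)               # compare test accuracies of the two models
mean(qda_pred == test$y)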

It’s worth mentioning that approaches like logistic regression and SVM are best suited for linear problems, while for non-linear problems we prefer Naïve Bayes, KNN, decision trees, ANN, random forests, etc.

A bit on model evaluation: accuracy is not always the best measure of model quality. There are some standard criteria we check to evaluate a model:

  • AIC — Akaike Information Criterion
  • Null Deviance and Residual Deviance
  • Confusion Matrix — Accuracy, Sensitivity, Specificity
  • ROC — Receiver Operating Characteristics
  • AUC — Area Under Curve
TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative

Accuracy alone should not be the judge of whether a model is acceptable; sensitivity and specificity are important too.
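In terms of the counts defined above, the three measures are straightforward to compute; a sketch, with TP, TN, FP and FN taken from a 2x2 table of predictions against actual labels.

# Metrics from a 2x2 confusion matrix; TP, TN, FP, FN are counts from table(pred, actual)
accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)   # true positive rate (recall)
specificity <- TN / (TN + FP)   # true negative rate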

A simple R code to determine the AUC:
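The sketch below assumes the pROC package (one option among several) and reuses the predicted probabilities p_hat and the test labels from the logistic regression snippet above.

# ROC and AUC sketch, assuming the pROC package
library(pROC)

roc_obj <- roc(response = test$y, predictor = p_hat)
plot(roc_obj)   # ROC curve: sensitivity vs specificity over all thresholds
auc(roc_obj)    # area under the curve; the closer to 1, the better the classifier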

The ROC curve is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds. The curve alone might not always give a clear-cut comparison, so it is suggested to also look at the AUC: the overall performance of the classifier, summarized over all possible thresholds, is given by the AUC, and the larger the AUC the better the classifier. ROC curves are also useful for comparing different classifiers, since they take into account all possible thresholds.

If we were to compare the models:

Logistic regression and LDA differ only in their fitting procedures, so it might look as if both models would give the same output. LDA, however, assumes a Gaussian distribution with a common covariance matrix in each class; if the Gaussian assumption is not met, logistic regression outperforms LDA. KNN, on the other hand, is a non-parametric approach with no assumption about the shape of the decision boundary. Therefore an obvious scenario where KNN outperforms LDA and logistic regression is when the decision boundary is complex and highly non-linear. On the flip side, KNN does not tell us which predictors are significant. QDA serves as a middle ground between the non-parametric KNN and the linear approaches of LDA and logistic regression by assuming a quadratic decision boundary, and it can perform well even with a limited number of training observations. It also has to be kept in mind that no single method wins over the others in every practical scenario. When the true decision boundaries are linear, LDA and logistic regression tend to perform well; when the boundaries are moderately non-linear, QDA is a better choice; and for more complicated decision boundaries, a non-parametric approach such as KNN would be best.
