MLShot of the Day — Brush up ML concepts in less than 5 mins
Handling Continuous Attributes in Decision Trees
Discretization of Continuous Attributes for training optimal Decision trees
A Crash Course on Decision Trees and Splitting Measures:
- Decision Trees and the tree-based ensembles built on them, such as Random Forests, XGBoost, and CatBoost, are widely used in Machine Learning competitions.
- Training a Decision Tree for a classification problem involves recursively splitting the data into smaller subsets, ideally until each node contains data belonging to a single class.
- Different measures (Information Gain, Gini Index, Gain ratio) are used for determining the best possible split at each node of the decision tree.
Splitting Measures for growing Decision Trees:
- Recursively growing a tree involves selecting an attribute and a test condition that divides the data at a given node into smaller but pure subsets.
- The measures used for determining the best split compute the degree of impurity of the child nodes.
- The Gain (G) of a split is the reduction in impurity: the impurity of the parent node minus the weighted average impurity of the child nodes. The higher the Gain, the better the split.
- Let pₖ be the proportion of records belonging to class k at a given node. The impurity measures are given by:
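For reference, the commonly used textbook forms of these measures are sketched below, on the assumption that they are the ones meant here. The symbols I (any impurity measure), N (records at the parent), Nⱼ (records in child j), and m (number of children) are notation introduced for this sketch.

```latex
\begin{aligned}
\text{Entropy} &= -\sum_{k} p_k \log_2 p_k \\
\text{Gini} &= 1 - \sum_{k} p_k^{2} \\
\text{Classification error} &= 1 - \max_{k} p_k \\
G &= I(\text{parent}) - \sum_{j=1}^{m} \frac{N_j}{N}\, I(\text{child}_j) \\
\text{Gain ratio} &= \frac{G_{\text{entropy}}}{-\sum_{j=1}^{m} \frac{N_j}{N} \log_2 \frac{N_j}{N}}
\end{aligned}
```

When I is the entropy, G is the Information Gain. The Gain ratio divides it by the entropy of the split itself, which penalizes splits that produce many small children.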
The curious case of Continuous Attributes:
It can be seen that the computation of splitting measures assumes a finite (read: discrete) set of attribute values. This raises the question: how are continuous-valued attributes handled in decision trees?
Take some time to think about it (not long, though; it's an ML shot).
The test condition for a continuous-valued attribute can be expressed as a comparison (≥, ≤) against a threshold value. Alternatively, the continuous-valued attribute can be split into a finite set of range buckets. It is important to note that a comparison-based test condition gives us a binary split, whereas range buckets give us a multiway split.
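Both options can be illustrated with a minimal sketch in plain Python. It assumes entropy as the impurity measure and takes candidate thresholds at the midpoints between consecutive distinct values, which is one common choice rather than the only one. The function names, the toy `ages` data, and the bucket edges are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

def best_threshold(values, labels):
    """Binary split: pick the midpoint threshold t with the highest gain for x <= t."""
    distinct = sorted(set(values))
    best_t, best_g = None, -1.0
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2  # assumed candidate: midpoint of adjacent values
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        g = gain(labels, [left, right])
        if g > best_g:
            best_t, best_g = t, g
    return best_t, best_g

def bucket_split(values, labels, edges):
    """Multiway split: group labels into range buckets given sorted edges."""
    buckets = [[] for _ in range(len(edges) + 1)]
    for x, y in zip(values, labels):
        idx = sum(x > e for e in edges)  # count of edges strictly below x
        buckets[idx].append(y)
    return buckets

# Hypothetical toy data: a continuous "age" attribute and a yes/no class.
ages = [22, 25, 30, 35, 40, 45, 50, 60]
bought = ["no", "no", "no", "yes", "yes", "yes", "yes", "no"]

# Binary split: best test condition is age <= 32.5, with gain of about 0.549.
threshold, g_binary = best_threshold(ages, bought)

# Multiway split: three range buckets (<= 32.5, 32.5 to 55, > 55), all pure here.
buckets = bucket_split(ages, bought, edges=[32.5, 55])
g_multi = gain(bought, buckets)  # 1.0 for this toy data
```

On this toy data, no single threshold separates the classes, so the binary split leaves one impure child, while the three hand-picked buckets isolate each run of labels. In practice a tree can reach the same result with binary splits alone by testing the same attribute again deeper in the tree.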