Concerning the crucial set of group_1 features, they are almost evenly split between left- and right-skewed distributions. However, the dominant factor is the large number of outliers beyond the respective upper boundary. Many players often perform well above expectations, which is in line with our initial conclusion:
Induction #1: We have to study group_1 in depth, in a way that not only guarantees high levels for the respective features, but also does not compromise (the greatest possible number of) the rest.
With that in mind, we start with a naive approach: sorting the dataset by a master feature (AST_PCT), taking its upper segment (95th percentile) and evaluating the plays ‘horizontally’ (across all features).
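The naive approach can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebook's actual code: the helper name top_segment_vs_population and the FG_PCT column are assumptions introduced here, and only AST_PCT and DEF_RATING are names taken from the text.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 'plays' dataset (values are random, for illustration only).
rng = np.random.default_rng(0)
plays = pd.DataFrame({
    "AST_PCT": rng.random(1000),
    "FG_PCT": rng.random(1000),             # hypothetical column name
    "DEF_RATING": rng.normal(105, 5, 1000),
})

def top_segment_vs_population(df, master, q=0.95):
    """Compare feature means of the plays above the q-th quantile of `master`
    with the feature means of the whole population."""
    cutoff = df[master].quantile(q)
    top = df[df[master] >= cutoff]
    table = pd.DataFrame({
        "population_mean": df.mean(),
        "top_segment_mean": top.mean(),
    })
    table["delta"] = table["top_segment_mean"] - table["population_mean"]
    return table

comparison = top_segment_vs_population(plays, "AST_PCT")
print(comparison)
```

A negative delta for a feature we want to be high (or a positive one for a feature we want to be low) is the kind of side effect this 'horizontal' evaluation is meant to expose.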
The outcome is disappointing. Comparing the population averages with those of the 95th percentile segment, we see that maximising along AST_PCT makes many of the remaining features worse, thereby violating Assumption #2. Besides, we wouldn’t like to buy an SG with a great Assist ratio but poor Field Goal performance.
Therefore, it becomes clear that we cannot accomplish our mission of building the optimum SG’s profile based on plain exploratory techniques. Thus:
Induction #2: We have to build better intuition on the available data and use more advanced techniques, to effectively segment it and capture the underlying patterns, which may lead us to the best SG’s profile.
Clustering picks up the torch…
[Refer to 01_clustering[kmeans_gmm].ipynb]
We begin with the popular K-Means algorithm, but first apply PCA in order to reduce the dataset’s dimensions while retaining most of the original features’ variance.
We opt for a 4-component solution, as it explains at least 80% of the population’s variance. Next, we find the optimum # of clusters (k), by using the Elbow Method and plotting the WCSS line:
The optimal number of clusters is 4 and we are ready to fit K-Means.
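The PCA and Elbow steps above could look roughly like this. It is a sketch on synthetic blobs, assuming the features are standardised before PCA (the text does not say whether the notebook does so); the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real feature matrix.
X, _ = make_blobs(n_samples=600, n_features=10, centers=4, random_state=0)
X_scaled = StandardScaler().fit_transform(X)  # assumption: scale before PCA

# Smallest number of components whose cumulative explained variance reaches 80%.
cum_var = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
n_components = int(np.searchsorted(cum_var, 0.80) + 1)
X_pca = PCA(n_components=n_components).fit_transform(X_scaled)

# Elbow Method: WCSS (exposed by scikit-learn as `inertia_`) for k = 1..10.
wcss = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_pca).inertia_
    for k in range(1, 11)
]

# Fit the final model with the k read off the elbow plot (4 in the article).
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_pca)
labels = kmeans.labels_
```

Plotting wcss against k and looking for the point where the curve flattens gives the 'elbow'; reading it off remains a visual judgement rather than an exact rule.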
The resulting clustering is decent; however, there are many overlapping points between two of the clusters (the turquoise one and cluster_3, in blue). Seeking a potential enhancement, we are going to examine another clustering algorithm, this time not a distance-based but a distribution-based one: Gaussian Mixture Models.
In general, GMM can handle a greater variety of shapes without assuming the clusters to be spherical (like K-Means does). Also, as a probabilistic algorithm, it assigns probabilities to the datapoints, expressing how strongly each one is associated with a specific cluster. Yet, there’s no free lunch; GMM may converge quickly to a local optimum, which degrades the results. To tackle this, we can initialize it with K-Means, through the respective class parameter.
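As an illustration of the two properties just mentioned (soft assignments and K-Means initialisation), here is a minimal sketch on synthetic data. The covariance type and n_init value are assumptions; init_params="kmeans" is in fact scikit-learn's default for GaussianMixture, spelled out here for clarity.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

gmm = GaussianMixture(
    n_components=4,
    covariance_type="full",  # each cluster gets its own, non-spherical covariance
    init_params="kmeans",    # start from a K-Means solution
    n_init=5,                # several restarts to reduce the risk of a poor local optimum
    random_state=0,
)
labels = gmm.fit_predict(X)    # hard assignment: most probable cluster per point
proba = gmm.predict_proba(X)   # soft assignment: one probability per cluster

print(proba[:3].round(3))
```

Each row of proba sums to 1, so points lying between two clusters show up as rows without a single dominant entry.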
In order to pick a suitable number of clusters, we can use the BayesianGaussianMixture class in Scikit-Learn, which assigns a weight to each cluster and pushes the weights of the superfluous ones to or near zero.
array([0.07, 0.19, 0.03, 0.14, 0.19, 0.09, 0.06, 0.18, 0.05, 0.01])
Only 4 clusters (with weights 0.19, 0.14, 0.19 and 0.18) surpass the 0.1 threshold.
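A weights array like the one above can be obtained along these lines. This is a sketch on synthetic data; the number of candidate components, the iteration settings and the cut-off value are assumptions, and choosing the cut-off is a judgement call.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Deliberately ask for more components than we expect to need.
bgm = BayesianGaussianMixture(
    n_components=10, n_init=3, max_iter=500, random_state=0
).fit(X)

weights = np.round(bgm.weights_, 2)
threshold = 0.1  # hypothetical cut-off; adjust to the data at hand
n_effective = int((bgm.weights_ > threshold).sum())

print(weights)
print("clusters above threshold:", n_effective)
```

The weights sum to 1, so components the model does not need end up with a small share of the total.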
cluster_3 (blue) is better separated this time, while cluster_2 (turquoise) is better contained, too.
To enhance the assessment of the clusters, we introduce a new variable which expresses the net score of the examined features. Each group is weighted to reflect its impact on the final performance, and the algebraic sum is calculated. I allocate the weights as follows:
NET_SCORE = 0.5*group_1 + 0.3*group_2 + 0.2*group_3 - 0.3*group_5
# group_4 (START_POSITION) shouldn't be scored (categorical feature)
# being a center ‘5’ doesn't mean to be ‘more’ of something a guard ‘1’ stands for!
# group_5 (DEF_RATING) is negative in nature
# it should be subtracted from the Net Score
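To make the formula concrete, here is one way it might be computed per play and then averaged per cluster. This is only a sketch: it assumes a group's value is the plain sum of its features, that AST_PCT belongs to group_1, and it uses placeholder names (feat_a, feat_b, feat_c) for features the text does not list.

```python
import pandas as pd

# Weights from the formula above; group_5 enters with a negative sign.
WEIGHTS = {"group_1": 0.5, "group_2": 0.3, "group_3": 0.2, "group_5": -0.3}

# Hypothetical group membership (only AST_PCT and DEF_RATING appear in the text).
GROUPS = {
    "group_1": ["AST_PCT", "feat_a"],
    "group_2": ["feat_b"],
    "group_3": ["feat_c"],
    "group_5": ["DEF_RATING"],
}

def net_score(df):
    """Weighted algebraic sum of the group totals, one value per row (play)."""
    return sum(w * df[GROUPS[g]].sum(axis=1) for g, w in WEIGHTS.items())

# Tiny made-up dataset with cluster labels already attached.
plays = pd.DataFrame({
    "AST_PCT": [30.0, 10.0, 25.0, 5.0],
    "feat_a": [20.0, 12.0, 22.0, 9.0],
    "feat_b": [15.0, 14.0, 16.0, 11.0],
    "feat_c": [8.0, 7.0, 9.0, 6.0],
    "DEF_RATING": [100.0, 110.0, 102.0, 112.0],
    "cluster": [3, 1, 3, 1],
})

cluster_scores = (
    plays.assign(NET_SCORE=net_score(plays))
    .groupby("cluster")["NET_SCORE"]
    .mean()
    .sort_values(ascending=False)
)
print(cluster_scores)
```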
So, let’s score and evaluate clusters.
cluster_3 outperforms the rest with a NET_SCORE of approx. 662.49, while cluster_1 comes next. What is worth highlighting here, though, is the quantified comparison between the 95th percentile and the newly introduced
It becomes visually clear that cluster_3 dominates the 95th percentile segment, with an increase of 146.5 NET_SCORE units! Consequently:
cluster_3 encapsulates those ‘plays’ which derive from great SG performance, in a really balanced way: group_1 features reach high levels, while most of the rest keep a decent average. This analysis takes into account more features than the initial attempt (ref. EDA), which leveraged a single dominant one (AST_PCT). Which proves the point that…
Induction #4: Clustering promotes a more comprehensive separation of the data, drawing on signals from more components; along these lines, we managed to reveal a clearer indication of what performance to anticipate from a top-class SG.
Now, we are able to manipulate the labelled (with clusters) dataset and develop a way to predict the cluster a new sample (unlabelled ‘play’) belongs to.
[Refer to 02_classifying[logres_rf_xgboost].ipynb]
Our problem belongs to the category of Multi-Class Classification and the first step to take is choosing a validation strategy to tackle potential overfitting.
# check for the clusters' balance
The skewed dataset implies that stratified K-fold cross-validation should be chosen over a random one. This keeps the labels’ ratio constant in each fold, so whatever metric we choose to evaluate gives similar results across all of them. And speaking of metrics, the F1 score (harmonic mean of precision and recall) looks more appropriate than accuracy, since the targets are skewed.
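The effect of stratification on the label ratios can be checked with a small sketch like the one below. The class proportions and the number of folds are made up for the illustration and are not those of the actual dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 60% / 25% / 10% / 5% across four clusters.
y = np.array([0] * 60 + [1] * 25 + [2] * 10 + [3] * 5)
X = np.zeros((len(y), 1))  # features are irrelevant for this check

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_ratios = []
for train_idx, valid_idx in skf.split(X, y):
    counts = np.bincount(y[valid_idx], minlength=4)
    fold_ratios.append(counts / counts.sum())

for i, ratios in enumerate(fold_ratios):
    print(f"fold {i}: {ratios.round(2)}")
```

Every validation fold reproduces the overall label proportions, which a plain random split would only do approximately, and poorly for the rarest cluster.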
Next, we normalise the data in order to train our (baseline) Logistic Regression model. Be mindful here to fit the scaler on the training dataset only and then use it to transform both the training and the testing data. This is crucial to avoid data leakage!
Mean F1 Score = 0.9959940207018171
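The leakage-safe pattern described above (fit the scaler on the training fold, then transform both parts) can be sketched as follows. The data is synthetic, so the score will not match the figure reported above; StandardScaler and macro-averaged F1 are assumptions, since the text specifies neither the scaler nor the averaging mode.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced 4-class problem standing in for the labelled dataset.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=4,
    weights=[0.5, 0.25, 0.15, 0.10], random_state=0,
)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, valid_idx in skf.split(X, y):
    # Fit the scaler on the training fold only, to avoid data leakage.
    scaler = StandardScaler().fit(X[train_idx])
    X_train = scaler.transform(X[train_idx])
    X_valid = scaler.transform(X[valid_idx])

    model = LogisticRegression(max_iter=1000).fit(X_train, y[train_idx])
    scores.append(f1_score(y[valid_idx], model.predict(X_valid), average="macro"))

mean_f1 = float(np.mean(scores))
print("Mean F1 Score =", mean_f1)
```

Wrapping the scaler and the model in a scikit-learn Pipeline achieves the same effect with less bookkeeping; the explicit loop is used here only to make the fit/transform order visible.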
Such a high score from the very beginning is suspicious. Among the available ways to check the features’ importance (e.g. MDI), I choose Permutation Feature Importance, which is model-agnostic, so its conclusions can be applied to all the models.
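A permutation importance check might look like the sketch below. It runs on synthetic data with illustrative settings (scoring function, number of repeats, hold-out split), none of which are taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and measure the drop in score.
result = permutation_importance(
    model, X_test, y_test, scoring="f1_macro", n_repeats=10, random_state=0
)

ranking = result.importances_mean.argsort()[::-1]  # most important first
for idx in ranking:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```

A single feature whose permutation collapses the score while the others barely matter would be one possible explanation for a suspiciously high baseline.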