Can a Data Scientist Replace a NBA Scout? ML App Development for Best Transfer Suggestion


The crucial group_1 features are almost evenly split between left- and right-skewed distributions. Their dominant characteristic, however, is the large number of outliers beyond the upper boundary. Many players often perform well above expectations, which is in line with our initial conclusion:

Induction #1: We have to study group_1 in depth, in a way that not only guarantees significant levels for its features, but also compromises as few of the remaining features as possible.

With that in mind, we start with a naive approach: sort the dataset by a master feature (AST_PCT), take its upper segment (95th percentile) and evaluate the plays ‘horizontally’ (across all features).
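
The naive approach above can be sketched roughly as follows. This is only an illustration on synthetic data: the real plays_df is not reproduced here, and the three columns and their distributions are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for plays_df (assumption: the real dataset has many more features).
rng = np.random.default_rng(0)
plays_df = pd.DataFrame({
    "AST_PCT": rng.beta(2, 8, size=1000),
    "EFG_PCT": rng.beta(5, 5, size=1000),
    "DEF_RATING": rng.normal(105, 10, size=1000),
})

# Keep only the plays at or above the 95th percentile of the master feature.
threshold = plays_df["AST_PCT"].quantile(0.95)
top_segment = plays_df[plays_df["AST_PCT"] >= threshold]

# Compare the feature averages 'horizontally': population vs. top segment.
comparison = pd.DataFrame({
    "population_mean": plays_df.mean(),
    "p95_mean": top_segment.mean(),
})
comparison["delta"] = comparison["p95_mean"] - comparison["population_mean"]
print(comparison.round(3))
```

A negative delta on a feature where higher is better would signal the trade-off discussed next.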

plays_df Descriptive Stats (Population)
plays_df Descriptive Stats (95th Percentile)

The outcome is disappointing. Comparing the population averages with those of the 95th percentile, we see that maximising along AST_PCT makes many of the remaining features worse, thereby violating Assumption #2. Besides, we wouldn’t want to buy an SG with a great assist ratio but poor field goal performance (EFG_PCT)!

It is therefore clear that we cannot accomplish our mission of building the optimum SG profile with plain exploratory techniques alone. Thus:

Induction #2: We have to build better intuition on the available data and use more advanced techniques to segment it effectively and capture the underlying patterns, which may lead us to the best SG profile.

Clustering picks up the torch…

3. Clustering

[Refer to 01_clustering[kmeans_gmm].ipynb]


We begin with the popular K-Means algorithm, but first apply PCA to reduce the dataset’s dimensions while retaining most of the original features’ variance [1].
PCA ~ Explained Variance
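
A minimal sketch of how the number of components could be read off the explained variance is given below. It runs on synthetic data, and standardising the features before PCA is an assumption of this example, not something stated in the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 500 samples, 10 correlated features driven by 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.3 * rng.normal(size=(500, 10))

# Assumption: features are standardised so that no single scale dominates the PCA.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches 80%.
n_components = int(np.searchsorted(cumulative, 0.80) + 1)
print(n_components, cumulative.round(3))
```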

We opt for a 4-component solution, as it explains at least 80% of the population’s variance. Next, we find the optimum number of clusters (k) by using the Elbow Method and plotting the WCSS (within-cluster sum of squares) line:
WCSS ~ Clusters Plot
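
The WCSS values behind such a plot could be collected as in the sketch below. The data comes from `make_blobs` as a stand-in for the PCA-reduced plays, and the range of k is an assumption of the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the 4-component PCA output.
X_pca, _ = make_blobs(n_samples=600, centers=4, n_features=4, random_state=0)

# inertia_ is scikit-learn's name for the within-cluster sum of squares (WCSS).
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_pca)
    wcss.append(km.inertia_)

for k, value in enumerate(wcss, start=1):
    print(k, round(value, 1))
```

The 'elbow' is the k after which the WCSS stops dropping sharply.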

The optimal number of clusters is 4, so we are ready to fit K-Means.
K-Means Clusters

The resulting clustering is decent; however, many points of cluster_2 and cluster_3 (turquoise and blue, respectively) overlap. Seeking potential improvement, we examine another clustering algorithm, this time not a distance-based but a distribution-based one: Gaussian Mixture Models [2].


In general, GMM can handle a greater variety of shapes, without assuming the clusters to be roughly circular (as K-Means does). Also, as a probabilistic algorithm, it assigns probabilities to the datapoints, expressing how strongly each one is associated with a specific cluster. Yet, there’s no free lunch; GMM may converge quickly to a local optimum, which degrades the results. To tackle this, we can initialize it with K-Means, by setting the respective class parameter [3].
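
As an illustration, a GMM initialised with K-Means could be fitted as below. The data is synthetic, and `covariance_type` and `n_init` are choices made for this sketch only. In scikit-learn, `init_params="kmeans"` is already the default for `GaussianMixture`; it is spelled out here for clarity.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the PCA-reduced plays.
X_pca, _ = make_blobs(n_samples=600, centers=4, n_features=4, random_state=0)

gmm = GaussianMixture(
    n_components=4,
    covariance_type="full",  # each cluster gets its own full covariance (non-circular shapes)
    init_params="kmeans",    # start EM from a K-Means solution
    n_init=5,                # several restarts reduce the risk of a poor local optimum
    random_state=0,
).fit(X_pca)

labels = gmm.predict(X_pca)       # hard assignment per sample
probs = gmm.predict_proba(X_pca)  # soft assignment: one probability per cluster
print(probs[:3].round(3))
```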

In order to pick a suitable number of clusters, we can use the BayesianGaussianMixture class in Scikit-Learn, which weights the clusters and pushes the weights of the superfluous ones to or near zero.
# returns
array([0.07, 0.19, 0.03, 0.14, 0.19, 0.09, 0.06, 0.18, 0.05, 0.01])

Evidently, only 4 clusters surpass a 0.10 weight threshold (0.19, 0.19, 0.18 and 0.14).
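
The weight-based selection can be sketched as follows. The data is synthetic, so the printed weights will differ from the array above; the upper bound of 10 components and the 0.10 cut-off are assumptions of this example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

# Synthetic stand-in for the PCA-reduced plays.
X_pca, _ = make_blobs(n_samples=600, centers=4, n_features=4, random_state=0)

# Deliberately offer more components than needed; the model shrinks the unused ones.
bgm = BayesianGaussianMixture(
    n_components=10, n_init=5, max_iter=500, random_state=0
).fit(X_pca)

weights = np.round(bgm.weights_, 2)
n_effective = int((bgm.weights_ > 0.10).sum())  # components above the chosen cut-off
print(weights, n_effective)
```
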
GMM Clusters

That’s it! cluster_3 (blue) is better separated this time, while cluster_2 (turquoise) is better contained, too.

Clusters Evaluation

To enhance the assessment of the clusters, we introduce a new variable that captures the net score of the examined features. Each group is weighted to better express its impact on the final performance, and the algebraic sum of the weighted groups is calculated. I allocate the weights as follows:

NET_SCORE = 0.5*group_1 + 0.3*group_2 + 0.2*group_3 - 0.3*group_5

# group_4 (START_POSITION) shouldn't be scored (categorical feature)
# being a center ‘5’ doesn't mean to be ‘more’ of something a guard ‘1’ stands for!
# group_5 (DEF_RATING) is negative in nature
# it should be subtracted from the Net Score
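
A small sketch of this scoring on toy data is shown below. Apart from DEF_RATING, the column names (`g1_a`, `g1_b`, and so on) are hypothetical placeholders, and treating a group's value as the plain sum of its features is an assumption of this example.

```python
import pandas as pd

# Hypothetical feature-to-group mapping (placeholders, except DEF_RATING for group_5).
groups = {
    "group_1": ["g1_a", "g1_b"],
    "group_2": ["g2_a"],
    "group_3": ["g3_a"],
    "group_5": ["DEF_RATING"],
}
# Weights from the formula above; group_5 carries a negative sign.
weights = {"group_1": 0.5, "group_2": 0.3, "group_3": 0.2, "group_5": -0.3}

plays = pd.DataFrame({
    "g1_a": [10.0, 20.0, 30.0, 40.0],
    "g1_b": [1.0, 2.0, 3.0, 4.0],
    "g2_a": [5.0, 5.0, 10.0, 10.0],
    "g3_a": [2.0, 4.0, 6.0, 8.0],
    "DEF_RATING": [100.0, 110.0, 100.0, 90.0],
    "cluster": [0, 0, 1, 1],
})

# Weighted algebraic sum of the group totals, computed row by row.
plays["NET_SCORE"] = sum(w * plays[groups[g]].sum(axis=1) for g, w in weights.items())

# Average NET_SCORE per cluster, used to rank the clusters.
cluster_scores = plays.groupby("cluster")["NET_SCORE"].mean()
print(cluster_scores)
```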

So, let’s score and evaluate clusters.
GMM Clusters scored by NET_SCORE

Evidently, cluster_3 outperforms the rest with a NET_SCORE of approx. 662.49, while cluster_1 comes second. What is worth highlighting here, though, is the quantified comparison between the 95th percentile and the newly introduced cluster_3:

NET_SCORE Whisker Box Plots for 95th percentile & cluster_3

It is visually clear that cluster_3 dominates the 95th percentile segment, with an increase of 146.5 NET_SCORE units! Consequently:

Induction #3: Cluster_3 encapsulates those ‘plays’ which derive from great SG performance, in a really balanced way: group_1 features reach high levels, while most of the rest keep a decent average. This analysis takes into account more features than the one initially attempted (ref. EDA), which leveraged a single dominant one (AST_PCT). This proves the point that…

Induction #4: Clustering promotes a more comprehensive separation of the data, drawing on signals from more components. Along these lines, we managed to reveal a clearer indication of what performance to anticipate from a top-class SG.

Now we can use the cluster-labelled dataset to develop a way of predicting the cluster a new sample (an unlabelled ‘play’) belongs to.

4. Classifiers

[Refer to 02_classifying[logres_rf_xgboost].ipynb]

Our problem belongs to the category of Multi-Class Classification and the first step to take is choosing a validation strategy to tackle potential overfitting.

# check for the clusters' balance
0 27508
1 17886
3 11770
2 5729

The imbalanced labels imply that Stratified K-fold cross-validation should be chosen over a random split. It keeps the labels’ ratio constant in each fold, so whatever metric we choose to evaluate gives similar results across all folds [4]. Speaking of metrics, the F1 score (harmonic mean of precision and recall) looks more appropriate than accuracy, since the targets are skewed [5].
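
To see what stratification buys, the sketch below rebuilds a label vector with the class counts listed above and checks the class ratios in each validation fold. The number of folds and the random seed are assumptions; the feature matrix is a dummy, since only the labels matter for the split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Labels with the same class counts as the cluster balance check above.
y = np.repeat([0, 1, 2, 3], [27508, 17886, 5729, 11770])
X = np.zeros((len(y), 1))  # dummy features: irrelevant for the split itself

overall = np.bincount(y) / len(y)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

max_dev = 0.0
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    fold_ratio = np.bincount(y[val_idx], minlength=4) / len(val_idx)
    max_dev = max(max_dev, float(np.abs(fold_ratio - overall).max()))
    print(fold, fold_ratio.round(4))

print("overall", overall.round(4), "max deviation", max_dev)
```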

Next, we normalise the data in order to train our (baseline) Logistic Regression model. Be mindful here to fit the scaler on the training dataset only and then use it to transform both the training and the testing data. This is crucial to avoid data leakage [6]!
# returns
Mean F1 Score = 0.9959940207018171
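
One way to keep the scaling leakage-free is to put the scaler and the model in a single pipeline, so the scaler is refitted on the training part of every fold. The sketch below assumes synthetic data, `StandardScaler` as the normalisation step and weighted averaging for the multi-class F1; none of these is confirmed by the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class dataset with an imbalance similar to the clusters' balance.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=6, n_classes=4,
    weights=[0.44, 0.28, 0.09, 0.19], random_state=0,
)

# Inside cross-validation, the scaler is fitted on each training fold only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(model, X, y, cv=cv, scoring="f1_weighted")
mean_f1 = scores.mean()
print("Mean F1 Score =", mean_f1)
```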

Feature Importance

Such a high score from the very beginning is suspicious. Among the available ways to check the features’ importance (e.g. MDI), I choose Permutation Feature Importance, which is model-agnostic, so any conclusions apply to all the models [7].
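
A minimal sketch of Permutation Feature Importance with scikit-learn follows. The dataset, the hold-out split and the scoring choice are assumptions made for illustration; the idea is that shuffling an important feature should visibly lower the score, while shuffling a noise feature should not.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 4 informative features and 4 pure-noise features.
X, y = make_classification(
    n_samples=1500, n_features=8, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and record the drop in F1.
result = permutation_importance(
    model, X_val, y_val, scoring="f1_weighted", n_repeats=10, random_state=0
)

ranking = result.importances_mean.argsort()[::-1]
for i in ranking:
    print(f"feature_{i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```

A feature whose permutation barely changes the score contributes little to the model, whereas one that dominates the ranking is a candidate for closer inspection.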

