The Newest Package for Instantly Evaluating ML Models — Deepchecks



Deepchecks is an ML suite used to evaluate the performance of scikit-learn models. How does it compare to scikit-learn's built-in solutions?

Photo by Chris Liverani on Unsplash

Evaluating ML models, especially in production environments, has long been, and will likely continue to be, one of the most difficult subtopics in machine learning research. It is no surprise that a multitude of startups, researchers, and big tech companies are pouring resources into this ongoing problem.

One of these companies, Deepchecks, has just open-sourced its machine learning evaluation tool. The tool is simple to use and works on any scikit-learn model.

This all sounds great, but scikit-learn provides a multitude of built-in tools for model evaluation. Is it worth switching?

To test the suite, I set up a fairly simple toy problem: given the following dataframe schema and 550 samples, I want to predict the column WasTheLoanApproved.

Index: CustomerID
LoanPayoffPeriodInMonths object
LoanReason object
RequestedAmount int64
InterestRate object
Co-Applicant object
YearsAtCurrentEmployer object
YearsInCurrentResidence int64
Age int64
RentOrOwnHome object
TypeOfCurrentEmployment object
NumberOfDependantsIncludingSelf int64
CheckingAccountBalance object
DebtsPaid object
SavingsAccountBalance object
CurrentOpenLoanApplications int64
WasTheLoanApproved object

I then set up a purposely imperfect random forest classifier to solve this problem, with some simple parameter tuning such as minimal cost complexity pruning, and a train/test split of 80/20. This model is purposely unoptimized to evaluate the ease of spotting common issues.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X = df[df.columns.values[:-1]]
Y = df["WasTheLoanApproved"]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
clf = RandomForestClassifier(n_estimators=100, ccp_alpha=0.008)
clf.fit(X_train, y_train)

Scikit-learn simple evaluation

Now that I have my simple model trained, I can use some simple evaluation techniques provided by scikit-learn to identify my model’s performance. I will look at my train vs test accuracy; my precision, recall, and F1 scores; and my feature importance. Here are the lines of code required to do so:

from sklearn import metrics
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
import seaborn as sns

#Generate predictions for the train and test sets
y_pred_train = clf.predict(X_train)
y_pred = clf.predict(X_test)
#Let us look at train and test accuracy
print("Accuracy Train:", metrics.accuracy_score(y_train, y_pred_train))
print("Accuracy Test:", metrics.accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
#and feature importance
feature_imp = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
#A chart
sns.barplot(x=feature_imp, y=feature_imp.index)
plt.xlabel('Feature Importance Score')
plt.title("Visualizing Feature Importance")

Not bad: in roughly 10 lines of code I was able to get the following results. (If you need a refresher on the scores below, I recommend this medium article.)

Accuracy Train: 0.7985436893203883
Accuracy Test: 0.7184466019417476
precision recall f1-score support

0 0.50 0.31 0.38 29
1 0.76 0.88 0.82 74

accuracy 0.72 103
macro avg 0.63 0.59 0.60 103
weighted avg 0.69 0.72 0.70 103
Image by Author

I think it is pretty easy to interpret these results — the F1 score for 0’s is very low due to a low recall, meaning the model is mislabeling many 0’s as 1’s. Support also indicates that there are many more instances of 1’s in the dataset than 0’s, which may or may not be an issue. The feature importance graph looks good, indicating that there does not seem to be one overarching feature controlling the classifier. An experienced ML engineer could take this information and deduce that we have an imbalanced classification problem and will likely need to subsample a more even ratio of 0’s and 1’s for a better model, but this may not be apparent to everyone.
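The subsampling fix suggested above can be sketched in a few lines. This is a minimal illustration, not part of the original workflow: it downsamples the majority class to match the minority class size (the helper name `undersample_majority` is my own).

```python
import pandas as pd
from sklearn.utils import resample

def undersample_majority(df, label_col, random_state=42):
    """Downsample every class to the size of the smallest class."""
    counts = df[label_col].value_counts()
    minority_label = counts.idxmin()
    n_minority = counts.min()
    parts = []
    for label, group in df.groupby(label_col):
        if label == minority_label:
            parts.append(group)  # keep the minority class whole
        else:
            parts.append(resample(group, replace=False,
                                  n_samples=n_minority,
                                  random_state=random_state))
    # Shuffle so classes are not grouped in blocks
    return pd.concat(parts).sample(frac=1, random_state=random_state)
```

Retraining on the balanced frame would then let you compare recall on the 0 class before and after.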

Deepchecks out-of-the-box evaluation

Deepchecks’ strength is that it lets anyone run a suite of evaluation metrics with very little code to get started. All you have to do is create Dataset objects, specifying the label column, the index (if meaningful), and the categorical features (recommended, not required).

from deepchecks import Dataset
from deepchecks.suites import full_suite

ds_train = X_train.merge(y_train, left_index=True, right_index=True)
ds_test = X_test.merge(y_test, left_index=True, right_index=True)
ds_train = Dataset(ds_train, label="WasTheLoanApproved", cat_features=["TypeOfCurrentEmployment", "LoanReason", "DebtsPaid"])
ds_test = Dataset(ds_test, label="WasTheLoanApproved", cat_features=["TypeOfCurrentEmployment", "LoanReason", "DebtsPaid"])
suite = full_suite()
suite.run(train_dataset=ds_train, test_dataset=ds_test, model=clf)

With just a few lines of code, the suite then ran over 30 checks on my dataset. These checks fall into four major categories: Data Distribution (is my test data similar to my training data?), Data Integrity (are there erroneous values in my data that may cause issues?), Methodology (are my train/test sets sized correctly and free of leakage?), and Performance (fairly obvious — does my model perform well?).

Image by Author

Here is the Conditions summary. As you can see, this is an incredibly powerful evaluation tool. Not only did it produce results similar to the few scikit-learn tests I ran, but it also covers problems like data integrity issues and data drift, which can easily go unnoticed and take a significant amount of code to test for. Having a platform that checks for these alone makes it worth using.

Now let us return to my model. All of these tests were run, and the current model failed only two checks, both performance-related. The tests with hyperlinks have associated graphs. Let’s take a look at the failures.

Image by Author

The performance report failure tells me that the model is slightly overfitting. Especially when looking at class 0 (loan not approved), the test set precision degradation is significant.

The simple model comparison tells me that the current model’s predictions are not significantly better than just guessing a constant “1” every time. This is extremely problematic and a clear sign that 1’s are overrepresented in the dataset. I can confirm this by looking at the ROC report for the test set, which came in barely acceptable because we are misclassifying the underrepresented class.

AUC < 0.7 is considered a failure | Image by Author

Overall, this led me to about the same conclusion as the built-in scikit-learn methods, but Deepchecks provided this information in a manner that was easier to both produce and understand.


Now that I have shown the power of this testing suite, I would like to dive into some features I would love to see. Some of these are more nitpicky, but they are weaknesses I identified in the overall platform.

Inferring categorical features is inconsistent

The documentation states that it is highly suggested to explicitly declare any categorical features, but that the platform can try to infer which are categorical if not specified. The current inference process uses a simple heuristic, and in my tests I found it unreliable.

Testing Inferred Categorical Features | Image by Author

As you can see, the first issue is that it does not necessarily detect the same categorical features in the train and test sets. The second issue is a lack of accuracy: the training set identified 4 categorical features, only 1 of which was actually categorical. I would like to see a statistical method replace this heuristic for possibly better results. In the meantime, when using this package, please define your categorical variables explicitly.
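One simple way to build that explicit list, rather than relying on inference, is to collect the object/category-dtype columns yourself and review them before passing them in. This is a sketch under an assumption: that your categorical columns are stored as object or category dtype, as they are in the schema above (the helper name `likely_categorical` is my own).

```python
import pandas as pd

def likely_categorical(df, label_col=None):
    """Columns stored as object/category dtype: candidates for
    cat_features. Review the list by hand before using it."""
    cols = df.select_dtypes(include=["object", "category"]).columns.tolist()
    if label_col in cols:
        cols.remove(label_col)  # the label is declared separately
    return cols
```

The output is still just a starting point — numeric codes that are really categories (and vice versa) need a human eye.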

Declaring the Dataset class could be made easier

This falls into the nitpicking category. The first issue is with declaring an index. To declare an index (which enables more data integrity checks), your index must be made a column (using reset_index) and then explicitly declared. It would be a nice enhancement to simply have a use_dataframe_index=True/False parameter that would easily bypass an explicit declaration and use the index of the dataframe itself.

The second small enhancement would be the ability to pass in the X (input) and Y (output/label) dataframes separately instead of only accepting a full dataframe with an explicitly declared label. This would better match the way many people naturally use scikit-learn’s train-test split. You can see above that I had to rejoin my X and Y dataframes before passing them into the suite.
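Until something like that exists, the rejoin is a one-liner with pandas. This is just a convenience sketch (the helper name `rejoin` is my own), equivalent to the merge used earlier as long as the feature frame and label series share an index:

```python
import pandas as pd

def rejoin(X, y):
    """Reattach a label Series to its feature frame on the shared index."""
    return pd.concat([X, y], axis=1)
```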

Inconsistent explanations and interpretation resources

A good portion of the graphs provide a good explanation of what the test is, why it is important, and then some external documentation for further explanation. This can easily be seen in the calibration metric test.

Calibration Metric Test | Image by Author

For this reason, I was disappointed when graphs/tests were presented with little to no explanation, for example the ROC report.

Image by Author

I think this can leave someone newer to ML with a sense of information overload, and a lack of direction in how to debug their model when something is not right.

Lenient pass-fail out-of-the-box

The out-of-the-box suite comes pre-tuned with some lenient definitions of what makes a “good” model. Take for example my algorithm’s test ROC. Sure, the test passed, but it was hardly above failure. This, without further investigation, could lead to a user neglecting a problem that may need to be addressed. If there were predefined performance tiers (for example, a “strict” mode), it could throw more warnings, letting the user know that this is not a definitive problem, but could be something worth considering optimizing for.

This is a nitpick because Deepchecks has already partially addressed it with custom suites. For any of the multitude of tests, the performance boundaries and metrics can be adjusted, allowing fine-tuning of what constitutes a passing model. This suggestion would purely serve as a “middle ground” between the single out-of-the-box suite and a fully customized suite.
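To make the “strict mode” idea concrete, here is a hand-rolled gate outside of Deepchecks entirely. Everything here is my own assumption: 0.7 mirrors the failure bound shown in the report above, while 0.8 is a hypothetical stricter warning level, and `grade_auc` is an illustrative helper, not a Deepchecks API.

```python
from sklearn.metrics import roc_auc_score

FAIL_AUC = 0.7  # mirrors the suite's out-of-the-box failure bound
WARN_AUC = 0.8  # hypothetical stricter "warn" level

def grade_auc(y_true, y_score):
    """Return ('fail'|'warn'|'pass', auc) for a binary classifier."""
    auc = roc_auc_score(y_true, y_score)
    if auc < FAIL_AUC:
        return "fail", auc
    if auc < WARN_AUC:
        return "warn", auc
    return "pass", auc
```

A model like mine, sitting just above 0.7, would land in the “warn” tier instead of silently passing.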

Limited to scikit-learn

It would be nice if the package could handle different ML packages. For example, Gensim for NLP, or TensorFlow for deep learning. At the very least the ability to detect data drifts in these other packages could be worth exploring.


This package, for as new as it is, is fantastic and will only serve to make machine learning model evaluation an easier experience. Running all of those tests by hand would take hundreds of lines of code; having it reduced to a simple function call will easily move this package into close integration with my development process. I cannot wait for this package to expand past scikit-learn and develop more complex insights.

If you enjoyed what you read, feel free to follow me and read more of what I write. I tend to deep-dive into a topic once per month, and this normally involves new packages, tips and tricks, or whatever else in the ML space I feel needs to be explained better!


