1. Overview of the AutoML App
The AutoML App that we are building today is compact, coming in at well under 200 lines of code (171 lines, to be exact).
1.1. Tech Stacks
The web app is going to be built in Python using the following libraries:
streamlit — the web framework
pandas — handle dataframes
numpy — numerical data processing
base64 — encoding data to be downloaded
scikit-learn — perform hyperparameter optimization and build the machine learning model
1.2. User Interface
The web app has a simple interface comprising 2 panels: (1) the Left Panel accepts the input CSV data and the parameter settings, while (2) the Main Panel displays the output, consisting of the dataframe of the input dataset, the model's performance metrics, the best parameters from hyperparameter tuning, and a 3D contour plot of the tuned hyperparameters.
1.3. Demo of the AutoML App
Let’s take a glimpse of the web app as shown in the 2 screenshots below so that you can get a feel of the app that you are going to be building.
1.3.1. AutoML App using the Example Dataset
The easiest way to try out the web app is to use the supplied Example dataset by clicking on the
Press to use Example Dataset button in the Main Panel, which will load the Diabetes Dataset as an example dataset.
1.3.2. AutoML App using Uploaded CSV Data
Alternatively, you can also upload your own CSV datasets either by dragging and dropping the file directly into the upload box (as shown in the screenshot below) or by clicking on the
Browse files button and choosing the input file to be uploaded.
In both of the above screenshots, upon providing either the example dataset or an uploaded CSV dataset, the App prints out the dataframe of the dataset and automatically builds several machine learning models, using the supplied learning parameters to perform hyperparameter optimization, before printing out the model performance metrics. Finally, an interactive 3D contour plot of the tuned hyperparameters is provided at the bottom of the Main Panel.
You can also take the App for a test drive by clicking on the following link:
2. The Code
Let's now take a dive into the inner workings of the AutoML App. As you can see, the entire App consists of only 171 lines of code.
It should be noted that the comments provided in the code (lines containing the hash symbol #) make the code more readable by documenting what each code block is doing.
The first few lines import the necessary libraries: streamlit, pandas, numpy, base64 and scikit-learn. The st.set_page_config() function allows us to specify the webpage title via page_title='The Machine Learning Hyperparameter Optimization App' as well as set the page layout to full-width mode via the layout='wide' input argument.
Here, we use the st.write() function together with Markdown syntax to write the webpage header text, as done on line 20 via the # tag in front of the header text The Machine Learning Hyperparameter Optimization App. On subsequent lines we write the description of the web app.
These blocks of code pertain to the input widgets in the Left Panel that accept the user input CSV data and model parameters.
- Lines 29–33 — Line 29 prints the header text for the Left sidebar panel via the st.sidebar.header() function, where sidebar in the function name dictates that the input widget should be placed in the Left sidebar panel. Line 30 accepts the user input CSV data via the st.sidebar.file_uploader() function. As we can see, there are 2 input arguments: the first is the text label Upload your input CSV file, while the second, type=["csv"], restricts the widget to accepting CSV files only. Lines 31–33 print the link to the example dataset in Markdown syntax.
- Line 36 — Prints the header text Set Parameters.
- Line 37 — Displays a slider bar via the st.sidebar.slider() function, which allows the user to specify the data split ratio by simply adjusting the slider. The first input argument prints the widget label text Data split ratio (% for Training Set), while the next 4 values represent the minimum value, maximum value, default value and increment step size. Finally, the specified value is assigned to the split_size variable.
- Lines 39–47 — Display the input widgets for the Learning Parameters, while Lines 49–54 display the input widgets for the General Parameters. As explained for Line 37, these lines also make use of st.sidebar.slider() as the input widget for accepting the user-specified values for the model parameters. Lines 56–58 combine the user-specified slider values into an aggregated form that then serves as input to the GridSearchCV() function, which is responsible for the hyperparameter tuning.
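To make the aggregation step concrete, here is a minimal sketch of how two-value range sliders can be combined into a param_grid for GridSearchCV(). The slider values and step size below are hypothetical stand-ins for what st.sidebar.slider() would return in the actual app:

```python
import numpy as np

# Hypothetical values as they might come from the range sliders
# (min, max) tuples plus a step size; in the app these come from
# st.sidebar.slider() widgets
parameter_n_estimators = (10, 50)
parameter_n_estimators_step = 10
parameter_max_features = (1, 3)

# Expand the slider ranges into the candidate values to explore
n_estimators_range = np.arange(
    parameter_n_estimators[0],
    parameter_n_estimators[1] + parameter_n_estimators_step,
    parameter_n_estimators_step)
max_features_range = np.arange(
    parameter_max_features[0], parameter_max_features[1] + 1, 1)

# param_grid is the aggregated form consumed by GridSearchCV()
param_grid = dict(max_features=max_features_range,
                  n_estimators=n_estimators_range)
```

With the assumed values above, the grid spans n_estimators of 10 through 50 in steps of 10 and max_features of 1 through 3.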
A sub-header text saying Dataset is added above the input dataframe via the st.subheader() function.
This block of code encodes the model performance results via the base64 library and serves them as a downloadable CSV file.
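A sketch of that download helper, assuming the common Streamlit pattern of embedding a base64-encoded CSV in an HTML anchor tag (in the app, the returned string would be rendered with st.markdown()):

```python
import base64
import pandas as pd

def filedownload(df):
    # Convert the dataframe to CSV, then to base64 (string <-> bytes conversion),
    # and wrap it in a download link the browser understands
    csv = df.to_csv(index=False)
    b64 = base64.b64encode(csv.encode()).decode()
    href = (f'<a href="data:file/csv;base64,{b64}" '
            f'download="model_performance.csv">Download CSV File</a>')
    return href

# Example: encode a small (hypothetical) results dataframe
results = pd.DataFrame({'max_features': [2], 'n_estimators': [100], 'R2': [0.45]})
link = filedownload(results)
```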
At a high level, this block of code is the build_model() custom function, which takes the input dataset together with the user-specified parameters and then performs the model building and hyperparameter tuning.
- Lines 76–77 — The input dataframe is separated into the X (deletes the last column, which is the Y variable) and Y (specifically selects the last column) variables.
- Line 79 — Here we notify the user via the st.markdown() function that the model is being built. Then on Line 80 the column name of the Y variable is printed inside an info box.
- Line 83 — Data splitting is performed via the train_test_split() function, using the X and Y variables as input data, while the user-specified split ratio is taken from the split_size variable, which gets its value from the slider bar described on Line 37.
- Lines 87–95 — Instantiate the random forest model via the RandomForestRegressor() function, which is assigned to the rf variable. As you can see, all of the model parameters defined inside the RandomForestRegressor() function take their values from the input widgets discussed above on Lines 29–58.
- Lines 97–98 — Perform the hyperparameter tuning.
→ Line 97 — The random forest model stored in the rf variable is assigned as the estimator input argument of the GridSearchCV() function, which performs the hyperparameter tuning. The range of hyperparameter values to explore is specified in the param_grid variable, which in turn takes its values directly from the user-specified slider values (Lines 40–43), pre-processed into the param_grid variable (Lines 56–58).
→ Line 98 — The hyperparameter tuning process begins by fitting on the X_train and Y_train input data.
- Line 100 — Prints the Model Performance sub-header text via the st.subheader() function. The following lines then print the model performance metrics.
- Line 102 — The best model from the hyperparameter tuning process, stored in the grid variable, is applied for making predictions on the test set.
- Lines 103–104 — Print the R2 score via the r2_score() function, which takes Y_test and Y_pred_test as input arguments.
- Lines 106–107 — Print the MSE score via the mean_squared_error() function, which takes Y_test and Y_pred_test as input arguments.
- Lines 109–110 — Print the best parameters, rounded to 2 decimal places. The best parameter values are obtained from grid.best_params_.
- Lines 112–113 — Line 112 prints the sub-header Model Parameters via the st.subheader() function. Line 113 prints the model parameters.
- Lines 116–125 — Model performance metrics are obtained from grid.cv_results_ and reshaped for the contour plot.
→ Line 116 — We selectively extract some data from grid.cv_results_ to create a dataframe containing the 2 hyperparameter combinations along with their corresponding performance metric, which in this case is the R2 score. In particular, the pd.concat() function is used to combine the 2 hyperparameters (params) and the performance metric (the R2 score).
→ Line 118 — Data reshaping is now performed to prepare the data in a suitable format for creating the contour plot. In particular, the groupby() function from the pandas library is used to group the dataframe according to the 2 hyperparameter columns (max_features and n_estimators), whereby the contents of the first column (max_features) are merged.
- Lines 120–122 — The data is now pivoted into an m ⨯ n matrix whereby the rows and columns correspond to the max_features and n_estimators values, respectively.
- Lines 123–125 — Finally, the reshaped data is assigned to the respective x, y and z variables that will then be used for making the contour plot.
- Lines 128–146 — These code blocks create the 3D contour plot from the x, y and z variables.
- Lines 149–152 — The x, y and z variables are then combined into a dataframe.
- Line 153 — The model performance results stored in the grid_results variable are made downloadable via the filedownload() custom function (Lines 69–73).
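Putting the pieces above together, here is a minimal, self-contained sketch of the build_model() steps: the X/Y split, data splitting, random forest, grid search, performance metrics, and the cv_results_ reshaping. The Diabetes dataset stands in for the input CSV, and a deliberately tiny toy grid is used so it runs quickly; the actual app wires all of these values to the sidebar widgets:

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Build a dataframe whose last column is the Y variable, as the app assumes
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df['Y'] = diabetes.target

# Lines 76-77: X drops the last column, Y selects it
X = df.iloc[:, :-1]
Y = df.iloc[:, -1]

# Line 83: data splitting; split_size would come from the slider bar
split_size = 80
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=(100 - split_size) / 100, random_state=42)

# Lines 87-95: instantiate the random forest (toy settings, not the
# app's full widget-driven parameter set)
rf = RandomForestRegressor(random_state=42)

# Lines 97-98: hyperparameter tuning via grid search on a toy param_grid
param_grid = dict(max_features=[2, 4], n_estimators=[10, 20])
grid = GridSearchCV(estimator=rf, param_grid=param_grid)
grid.fit(X_train, Y_train)

# Lines 102-107: predictions and performance metrics of the best model
Y_pred_test = grid.predict(X_test)
r2 = r2_score(Y_test, Y_pred_test)
mse = mean_squared_error(Y_test, Y_pred_test)

# Lines 116-125: extract and reshape grid.cv_results_ for the contour plot
grid_results = pd.concat(
    [pd.DataFrame(grid.cv_results_['params']),
     pd.DataFrame(grid.cv_results_['mean_test_score'], columns=['R2'])],
    axis=1)
grid_contour = grid_results.groupby(['max_features', 'n_estimators']).mean()
grid_contour.reset_index(inplace=True)
grid_pivot = grid_contour.pivot(index='max_features',
                                columns='n_estimators', values='R2')
x = grid_pivot.columns.values  # n_estimators axis
y = grid_pivot.index.values    # max_features axis
z = grid_pivot.values          # R2 surface for the 3D contour plot
```

The resulting x, y and z arrays are exactly the inputs a 3D contour plot needs: the two hyperparameter axes and the R2 surface above them.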
At a high level, these code blocks perform the logic of the App, comprising 2 code blocks: the if block (Lines 156–159) and the else block (Lines 160–171). Every time the web app loads, it defaults to running the else block, while the if block is activated upon the upload of an input CSV file.
For both code blocks, the logic is the same; the only difference is the contents of the df dataframe (whether it comes from the input CSV data or from the example data). The contents of the df dataframe are displayed via the st.write() function, and finally the model building process is initiated via the build_model() custom function.
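Stripped of the Streamlit calls, that if/else logic can be sketched as follows. The uploaded_file variable and the build_model() stub are stand-ins for the file-uploader widget and the real custom function:

```python
import pandas as pd
from sklearn.datasets import load_diabetes

def build_model(df):
    # Stand-in for the app's build_model() custom function, which
    # trains and tunes the random forest on df
    return df.shape

uploaded_file = None  # would come from st.sidebar.file_uploader() in the app

# Lines 156-171: the if block handles an uploaded CSV, while the
# else block falls back to the Diabetes example dataset
if uploaded_file is not None:
    df = pd.read_csv(uploaded_file)
    build_model(df)
else:
    diabetes = load_diabetes()
    df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
    df['Y'] = diabetes.target
    build_model(df)
```

Since uploaded_file is None here, the else branch runs and df holds the 442-row Diabetes dataset with its 10 features plus the Y column.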
3. Running the AutoML App
Now that we have coded the App, let’s proceed to launching it.
3.1. Create the conda environment
Let's first start by creating a new conda environment (in order to ensure reproducibility of the code).
Firstly, create a new conda environment called automl as follows in a terminal command line:
conda create -n automl python=3.7.9
Secondly, we will activate the automl environment:
conda activate automl
3.2. Install prerequisite libraries
Firstly, download the requirements.txt file.
Secondly, install the libraries as shown below
pip install -r requirements.txt
3.3. Download the App files
You can either download the web app files that are hosted on the GitHub repo of the Data Professor or use the 171 lines of code found above.
Next, unzip the file contents and enter the main directory via the cd command.
Now that you're inside the main directory, you should be able to see the ml-opt-app.py file.
3.4. Launching the web app
To launch the App, type the following command into a terminal prompt (i.e. ensure that the ml-opt-app.py file is in the current working directory):
streamlit run ml-opt-app.py
In a few seconds, the following message will appear in the terminal prompt:
> streamlit run ml-opt-app.py
You can now view your Streamlit app in your browser.
Local URL: http://localhost:8501
Network URL: http://10.0.0.11:8501
Finally, a browser should pop up and the App appears.
You can also test the AutoML App at the following link:
Now that you have created the AutoML App as described in this article, what next? You can perhaps tweak the App by swapping in another machine learning algorithm. Additional features, such as a feature importance plot, could also be added to the App. The possibilities are endless, so have fun customizing the App! Please feel free to drop a comment on how you've modified the App for your own projects.