How to Build an AutoML App in Python




1. Overview of the AutoML App

The AutoML App that we are building today comes in at well under 200 lines of code (171 lines, to be exact).

1.1. Tech Stacks

The web app is going to be built in Python using the following libraries:

  • streamlit — the web framework
  • pandas — handling dataframes
  • numpy — numerical data processing
  • base64 — encoding data to be downloaded
  • plotly — creating the interactive 3D contour plot
  • scikit-learn — building the machine learning model and performing hyperparameter optimization

1.2. User Interface

The web app has a simple interface comprising 2 panels: (1) the Left Panel accepts the input CSV data and the parameter settings, while (2) the Main Panel displays the output, consisting of the dataframe of the input dataset, the model’s performance metrics, the best parameters from hyperparameter tuning, and the 3D contour plot of the tuned hyperparameters.

Screenshot of the AutoML App.

1.3. Demo of the AutoML App

Let’s take a glimpse of the web app in the 2 screenshots below so that you can get a feel for the app that you are going to build.

1.3.1. AutoML App using the Example Dataset

The easiest way to try out the web app is to use the supplied Example dataset by clicking on the Press to use Example Dataset button in the Main Panel, which will load the Diabetes Dataset as an example dataset.

Screenshot of the AutoML App using the example dataset.

1.3.2. AutoML App using Uploaded CSV Data

Alternatively, you can also upload your own CSV datasets either by dragging and dropping the file directly into the upload box (as shown in the screenshot below) or by clicking on the Browse files button and choosing the input file to be uploaded.

Screenshot of the AutoML App using the input CSV dataset.

In both of the above screenshots, upon receiving either the example dataset or an uploaded CSV dataset, the App prints out the dataframe of the dataset, automatically builds several machine learning models using the supplied learning parameters to perform hyperparameter optimization, and then prints out the model performance metrics. Finally, an interactive 3D contour plot of the tuned hyperparameters is provided at the bottom of the Main Panel.

You can also take the App for a test drive by clicking on the following link:

2. The Code

Let’s now take a dive into the inner workings of the AutoML App. As you can see, the entire App uses only 171 lines of code.

It should be noted that the comments provided in the code (lines containing the hash symbol #) are there to make the code more readable by documenting what each code block does.

Lines 1–10

Imports the necessary libraries consisting of streamlit, pandas, numpy, base64, plotly and scikit-learn.
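For reference, a minimal sketch of what this import block could look like (the exact lines may differ slightly from the App’s source):

import streamlit as st
import pandas as pd
import numpy as np
import base64
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.datasets import load_diabetes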

Lines 15–16

The set_page_config() function allows us to specify the webpage title via page_title='The Machine Learning Hyperparameter Optimization App' as well as set the page layout to full-width mode via the layout='wide' input argument.
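In code, this corresponds to a call along these lines:

st.set_page_config(page_title='The Machine Learning Hyperparameter Optimization App',
    layout='wide')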

Lines 19–25

Here, we use the st.write() function together with Markdown syntax to write the webpage header text, as done on line 20 via the # tag placed in front of the header text The Machine Learning Hyperparameter Optimization App. On subsequent lines we write the description of the web app.
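As a sketch (the description text here is an illustrative placeholder):

st.write("""
# The Machine Learning Hyperparameter Optimization App
In this implementation, the RandomForestRegressor() function is used
to build a regression model via the Random Forest algorithm.
""")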

Lines 29–58

These blocks of code pertain to the input widgets in the Left Panel that accept the user input CSV data and the model parameters (a sketch of these widgets is given after the list below).

  • Lines 29–33 — Line 29 prints the header text for the Left sidebar panel via the st.sidebar.header() function, where sidebar in the function name dictates that the widget be placed in the Left sidebar panel. Line 30 accepts the user input CSV data via the st.sidebar.file_uploader() function. As we can see, there are 2 input arguments: the first is the text label Upload your input CSV file, while the second input argument type=["csv"] restricts the uploader to accept only CSV files. Lines 31–33 print the link to the example dataset in Markdown syntax via the st.sidebar.markdown() function.
  • Line 36 — Prints the header text Set Parameters via the st.sidebar.header() function.
  • Line 37 displays a slider bar via the st.sidebar.slider() function, which allows the user to specify the data split ratio by simply adjusting the slider. The first input argument prints the widget label text Data split ratio (% for Training Set), while the next 4 values represent the minimum value, maximum value, default value and increment step size. Finally, the specified value is assigned to the split_size variable.
  • Lines 39–47 display the input widgets for the Learning Parameters, while Lines 49–54 display the input widgets for the General Parameters. As in the explanation for Line 37, these lines of code also make use of st.sidebar.slider() as the input widget for accepting user-specified values for the model parameters. Lines 56–58 combine the user-specified values from the slider inputs into an aggregated form that then serves as input to the GridSearchCV() function, which is responsible for the hyperparameter tuning.
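Below is a condensed sketch of these sidebar widgets. The example dataset link is a placeholder and the slider value ranges are assumptions; the point is to illustrate the pattern rather than reproduce the App’s exact settings.

# Sidebar: CSV upload and link to an example dataset
st.sidebar.header('Upload your CSV data')
uploaded_file = st.sidebar.file_uploader('Upload your input CSV file', type=['csv'])
st.sidebar.markdown('[Example CSV input file](https://example.com/example.csv)')  # placeholder link

# Sidebar: data split ratio
st.sidebar.header('Set Parameters')
split_size = st.sidebar.slider('Data split ratio (% for Training Set)', 10, 90, 80, 5)

# Sidebar: learning parameters (value ranges are assumptions)
st.sidebar.subheader('Learning Parameters')
parameter_n_estimators = st.sidebar.slider('Number of estimators (n_estimators)', 0, 500, (10, 50), 50)
parameter_n_estimators_step = st.sidebar.slider('Step size for n_estimators', 10, 50, 10)
parameter_max_features = st.sidebar.slider('Max features (max_features)', 1, 50, (1, 3), 1)

# Aggregate the slider values into the grid that GridSearchCV() will explore
n_estimators_range = np.arange(parameter_n_estimators[0],
    parameter_n_estimators[1] + parameter_n_estimators_step,
    parameter_n_estimators_step)
max_features_range = np.arange(parameter_max_features[0], parameter_max_features[1] + 1, 1)
param_grid = dict(max_features=max_features_range, n_estimators=n_estimators_range)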

Line 64

A sub-header text saying Dataset is added above the input dataframe via the st.subheader() function.
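In code, this is simply (the exact label may differ):

st.subheader('Dataset')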

Lines 69–73

This block of code defines the filedownload() custom function, which encodes the model performance results via the base64 library so that they can be served as a downloadable CSV file.
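A sketch of this helper, assuming the common Streamlit pattern of embedding a base64-encoded CSV in an HTML download link:

def filedownload(df):
    csv = df.to_csv(index=False)
    # strings <-> bytes conversions for the base64 encoding
    b64 = base64.b64encode(csv.encode()).decode()
    href = f'<a href="data:file/csv;base64,{b64}" download="model_performance.csv">Download CSV File</a>'
    return href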

Lines 75–153

At a high level, this block of code is the build_model() custom function, which takes the input dataset together with the user-specified parameters and then performs the model building and hyperparameter tuning (a condensed sketch of this function is given after the list below).

  • Lines 76–77 — The input dataframe is separated into the X variable (all columns except the last, which is the Y variable) and the Y variable (specifically, the last column).
  • Line 79 — Here we notify the user via the st.markdown() function that the model is being built. Then on Line 80, the column name of the Y variable is printed inside an info box via the st.info() function.
  • Line 83 — Data splitting is performed via the train_test_split() function using the X and Y variables as the input data, while the user-specified split ratio is given by the split_size variable, which takes its value from the slider bar described on Line 37.
  • Lines 87–95 — Instantiates the random forest model via the RandomForestRegressor() function, which is assigned to the rf variable. As you can see, all of the model parameters defined inside the RandomForestRegressor() function take their values from the input widgets that the user specifies, as discussed above for Lines 29–58.
  • Lines 97–98 — Performs the Hyperparameter tuning.
    → Line 97 — The above random forest model as specified in the rf variable is assigned as an input argument to the estimator parameter inside the GridSearchCV() function, which will perform the hyperparameter tuning. The hyperparameter value range to explore in the hyperparameter tuning is specified in the param_grid variable that in turn takes its value directly from the user specified value as obtained from the slider bar (Lines 40–43) and pre-processed as the param_grid variable (Lines 56–58).
    → Line 98 — The hyperparameter tuning process begins by taking in X_train and Y_train as input data.
  • Line 100 — Prints the Model Performance sub-header text via the st.subheader() function. The following lines then print the model performance metrics.
  • Line 102 — The best model from the hyperparameter tuning process as stored in the grid variable is applied for making predictions on the X_test data.
  • Lines 103–104 — Prints the R2 score via the r2_score() function, which uses Y_test and Y_pred_test as input arguments.
  • Lines 106–107 — Prints the MSE score via the mean_squared_error() function, which uses Y_test and Y_pred_test as input arguments.
  • Lines 109–110 — Prints the best parameters along with the best score rounded to 2 decimal places. These values are obtained from the grid.best_params_ and grid.best_score_ variables.
  • Lines 112–113 — Line 112 prints the sub-header Model Parameters via the st.subheader() function. Line 113 prints the model parameters stored in grid.get_params() via the st.write() function.
  • Lines 116–125 — Model performance metrics are obtained from grid.cv_results_ and reshaped to x, y and z.
    → Line 116 — We’re going to selectively extract some data from grid.cv_results_ that will be used to create a dataframe containing the 2 hyperparameter combinations along with their corresponding performance metric, which in this case is the R2 score. Particularly, the pd.concat() function will be used to combine the 2 hyperparameters (params) and the performance metric (mean_test_score).
    → Line 118 — Data reshaping will now be performed to prepare the data in a suitable format for creating the contour plot. In particular, the groupby() function from the pandas library will be used to group the dataframe according to 2 columns (max_features and n_estimators), whereby the contents of the first column (max_features) are merged.
  • Lines 120–122 — Data will now be pivoted into an m ⨯ n matrix whereby the rows and columns correspond to the max_features and n_estimators, respectively.
  • Lines 123–125 — Finally, the reshaped data is assigned to the respective x, y and z variables, which are then used for making the contour plot.
  • Lines 128–146 — These code blocks will now create the 3D contour plot using the x, y and z variables via the plotly library.
  • Lines 149–152 — The x, y and z variables are then combined into a df dataframe.
  • Line 153 — The model performance results stored in the grid_results variable will now be made downloadable via the filedownload() custom function (Lines 69–73).
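The following is a condensed sketch of build_model(), reusing the split_size and param_grid variables sketched earlier. The 5-fold GridSearchCV() and the simplified RandomForestRegressor() instantiation are assumptions; in the full App, every general parameter is wired in from the sidebar widgets.

def build_model(df):
    # Separate the dataframe into X (all columns except the last) and Y (the last column)
    X = df.iloc[:, :-1]
    Y = df.iloc[:, -1]

    st.markdown('A model is being built to predict the following **Y** variable:')
    st.info(Y.name)

    # Data splitting; split_size comes from the sidebar slider
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=(100 - split_size) / 100)

    # Simplified instantiation; the App passes all user-specified general parameters here
    rf = RandomForestRegressor(random_state=42, n_jobs=-1)

    # Hyperparameter tuning over the user-defined grid (cv=5 is an assumption)
    grid = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5)
    grid.fit(X_train, Y_train)

    st.subheader('Model Performance')
    Y_pred_test = grid.predict(X_test)
    st.write('Coefficient of determination (R2):')
    st.info(r2_score(Y_test, Y_pred_test))
    st.write('Error (MSE):')
    st.info(mean_squared_error(Y_test, Y_pred_test))
    st.write('The best parameters are %s with a score of %0.2f'
             % (grid.best_params_, grid.best_score_))

    st.subheader('Model Parameters')
    st.write(grid.get_params())

    # Combine the 2 hyperparameters with the R2 performance metric
    grid_results = pd.concat(
        [pd.DataFrame(grid.cv_results_['params']),
         pd.DataFrame(grid.cv_results_['mean_test_score'], columns=['R2'])],
        axis=1)

    # Group, then pivot into an m x n matrix of R2 values
    grid_contour = grid_results.groupby(['max_features', 'n_estimators']).mean()
    grid_pivot = grid_contour.reset_index().pivot(
        index='max_features', columns='n_estimators')
    x = grid_pivot.columns.levels[1].values  # n_estimators
    y = grid_pivot.index.values              # max_features
    z = grid_pivot.values                    # R2

    # Interactive 3D plot of the tuned hyperparameters
    fig = go.Figure(data=[go.Surface(x=x, y=y, z=z)])
    fig.update_layout(scene=dict(xaxis_title='n_estimators',
                                 yaxis_title='max_features',
                                 zaxis_title='R2'))
    st.plotly_chart(fig)

    # Make the results downloadable via the filedownload() helper
    st.markdown(filedownload(grid_results), unsafe_allow_html=True)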

Lines 156–171

At a high level, these code blocks perform the logic of the App. There are 2 code blocks: the if code block (Lines 156–159) and the else code block (Lines 160–171). Every time the web app loads, it defaults to running the else code block, while the if code block is activated upon the upload of an input CSV file.

For both code blocks the logic is the same; the only difference is the contents of the df dataframe (whether it comes from the input CSV data or from the example data). Next, the contents of the df dataframe are displayed via the st.write() function. Finally, the model building process is initiated via the build_model() custom function.
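A minimal sketch of this if/else logic (the widget text here is illustrative):

# Default to the else block until a CSV file has been uploaded
if uploaded_file is not None:
    df = pd.read_csv(uploaded_file)
    st.write(df)
    build_model(df)
else:
    st.info('Awaiting CSV file to be uploaded.')
    if st.button('Press to use Example Dataset'):
        # Load the Diabetes dataset as the example
        diabetes = load_diabetes()
        X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
        Y = pd.Series(diabetes.target, name='response')
        df = pd.concat([X, Y], axis=1)
        st.write(df)
        build_model(df)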

3. Running the AutoML App

Now that we have coded the App, let’s proceed to launching it.

3.1. Create the conda environment

Let’s first start by creating a new conda environment (in order to ensure reproducibility of the code).

Firstly, create a new conda environment called automl as follows in a terminal command line:

conda create -n automl python=3.7.9

Secondly, we will log in to the automl environment

conda activate automl

3.2. Install prerequisite libraries

Firstly, download the requirements.txt file

wget https://raw.githubusercontent.com/dataprofessor/ml-opt-app/main/requirements.txt

Secondly, install the libraries as shown below

pip install -r requirements.txt

3.3. Download the App files

You can either download the web app files hosted on the GitHub repo of the Data Professor or use the 171 lines of code shown above.

wget https://github.com/dataprofessor/ml-opt-app/archive/main.zip

Next, unzip the file contents

unzip main.zip

Now enter the unzipped directory via the cd command (GitHub appends the branch name to the folder, so the archive unzips to ml-opt-app-main)

cd ml-opt-app-main

Now that you’re inside the ml-opt-app-main directory, you should be able to see the ml-opt-app.py file.

3.4. Launching the web app

To launch the App, type the following commands into a terminal prompt (i.e. ensure that the ml-opt-app.py file is in the current working directory):

streamlit run ml-opt-app.py

In a few seconds, the following message will appear in the terminal prompt.

> streamlit run ml-opt-app.py

You can now view your Streamlit app in your browser.

Local URL: http://localhost:8501
Network URL: http://10.0.0.11:8501

Finally, a browser should pop up and the App appears.

Screenshot of the AutoML App launched locally.

You can also test the AutoML App at the following link:

4. Conclusion

Now that you have created the AutoML App as described in this article, what’s next? Perhaps you can tweak the App to use another machine learning algorithm. Additional features, such as a feature importance plot, could also be added to the App. The possibilities are endless, so have fun customizing the App! Please feel free to drop a comment on how you’ve modified the App for your own projects.
