From Jupyter Notebook to Deployment — A Straightforward Example




A step-by-step example of taking typical machine learning research code and building a production-ready microservice.

This article is intended to serve as a consolidated example of the journey I took in my work as a Data Scientist, beginning from a typical solved problem in Jupyter Notebook format and developing it into a deployed microservice. Although there will be code, this is not intended as a tutorial but rather a step-by-step illustrated example of the challenges and solutions I faced in bringing a data science solution to production. I also don't claim that the choices I make along the way are the only way to do things; rather, I hope this serves as a useful example for others who may face a similar task in the future.

The companion code to complement this article can be found on my GitHub account, with the original Jupyter Notebook here and the eventual, complete project here. I will include snippets of these changes to demonstrate various topics as we discuss them, but this will not be a complete account of every change made. In this article I will instead focus on the process and motivation behind each step that I have taken.

The Jupyter Notebook Solution (code)

This toy solution is designed to act as a highly simplified example of a data scientist's modelling output. We can imagine that the task was to build a classifier that, given some measurements of an iris flower, predicts the species of that observation. The data scientist was given Fisher's famous iris dataset and built a pipeline that visualises and preprocesses the data and then trains and evaluates a simple model. This model can now be used to make predictions on any new observations. The goal of this article is to describe the steps that one might take in bringing this typical research output to production.

The famous Iris flower dataset introduced by Ronald Fisher in 1936.

Step 1: Refactoring, code style and testing (code)

The first step is to modularise the notebook into a reasonable folder structure. This effectively means converting files from .ipynb format to .py format, ensuring each script has a clear, distinct purpose, and organising these files in a coherent way. I've taken the following standard layout, but this is flexible and should be adapted to suit different use cases.

What our file structure will eventually look like.
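As a rough, illustrative sketch using file names that appear later in this article (the actual repository layout may differ slightly):

project_root/
    run_app.py          entry point for the web application
    requirements.txt    Python package dependencies
    Dockerfile          instructions for building the container
    pre-commit.sh       git hook that runs the test suite before each commit
    test-all.sh         runs the unit tests and the pycodestyle check
    src/
        template_app/   the refactored .py modules (including config.py)
        design.json     configuration values such as NUM_FOLDS and SCORING
    tests/              unit tests, e.g. for stratified_split()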

Once the project is nicely structured we can tidy up or refactor the code. For example, consider the following block of code that does the task of taking a stratified train-test split of the data:

Stratified split logic from our original Jupyter Notebook.

In its current form, it is not clear exactly what this code is doing; it's not reusable, and it can't be tested. To alleviate this we can rewrite it as a function with documentation and type hints.

The refactored stratified split function.
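For illustration, the refactored function could look something like the sketch below; the name stratified_split comes from the article, but the exact signature and the use of scikit-learn's train_test_split are assumptions rather than the repository's exact implementation.

from typing import Tuple

import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_split(
    df: pd.DataFrame,
    target_col: str,
    test_size: float = 0.3,
    random_state: int = 42,
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Split a dataframe into train and test sets, stratified on the target column.

    Args:
        df: The full dataset, including the target column.
        target_col: Name of the column to stratify on.
        test_size: Fraction of rows to place in the test set.
        random_state: Seed for reproducibility.

    Returns:
        A (train, test) tuple of dataframes.
    """
    train_df, test_df = train_test_split(
        df,
        test_size=test_size,
        random_state=random_state,
        stratify=df[target_col],
    )
    return train_df, test_df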

This looks much better and follows the PEP 8 style guide. In fact, we can ensure that the entire project conforms to PEP 8 by using a linting package such as pycodestyle. Once this package is installed we can navigate to the project's root directory and, on the command line, run pycodestyle src. This will list the location and details of any PEP 8 issues. We can even go one step further and use a package like autopep8, which automatically reformats any code flagged by pycodestyle.

We should also briefly touch on testing. There is extensive literature on code testing (see here for example), but we will only implement unit tests in this project. These test the smallest possible units of code independently from one another. For the stratified_split() function from the previous example, a unit test might look something like this:

A unit test for our stratified_split function.
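As a minimal sketch (assuming the signature from the snippet above; the import path and exact assertions are purely illustrative, runnable with pytest):

import pandas as pd

# Illustrative import path; adjust to wherever stratified_split actually lives.
from src.template_app.preprocess import stratified_split


def test_stratified_split_preserves_class_balance():
    # Build a small, perfectly balanced toy dataset.
    df = pd.DataFrame({
        "sepal_length": range(20),
        "species": ["setosa"] * 10 + ["versicolor"] * 10,
    })

    train, test = stratified_split(df, target_col="species", test_size=0.5)

    # The split sizes should match the requested proportion...
    assert len(train) == 10
    assert len(test) == 10

    # ...and each class should appear equally in both splits.
    assert sorted(train["species"].value_counts().tolist()) == [5, 5]
    assert sorted(test["species"].value_counts().tolist()) == [5, 5]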

By running this code we check that the function output is as expected in a number of predefined cases. This confirms that our existing code is working as it should, but it also ensures that if any future change causes this function to stop working as intended, we catch the error early when these tests fail.

Step 2: Collaboration (code)

It is unlikely that any project at scale will be the product of just a single individual's work, so for this reason we are going to discuss version control. A version control tool such as git is a system that records changes to a project over time, enabling multiple users to branch and merge changes to a repository. There are plenty of articles on the principles and best practices of using version control; however, since we are assuming that the bulk of the data science work has already been completed, we will instead focus on a useful file called pre-commit.sh. This is what is known as a git hook, a feature offered by git that allows custom scripts to be run when certain important actions occur, in this case the git commit command. We can use this feature to conveniently run the test-all.sh script before any commit; this, in turn, runs all of our unit tests as well as a pycodestyle check, only allowing the commit to proceed if there are no failures. This process ensures that any changes to the repository don't break the existing functionality and conform to PEP 8.

Another useful file for collaboration is a requirements file named requirements.txt. Saved under the project root directory, this file lists all of the Python packages used throughout the project. Any new user to the project can simply use pip to install these requirements with the following command.

pip install -r requirements.txt

Creating a requirements.txt is also pretty straightforward. Assuming we have been using a virtual environment such as venv or conda just for this project with all of the packages installed, we can dump their names and versions into requirements.txt by activating the environment and running

pip freeze > requirements.txt

Step 3: Prepare to deploy (code)

Now that we have the project in good shape for local usage, it's time to make some changes to prepare it for cloud usage. One such change is logging. When working locally, keeping track of a programme and debugging are relatively straightforward. A combination of print statements, debugging tools and the console output usually does the trick; however, once components are deployed we need something a little more sophisticated. That's where Python's logging module can help. Once the module is configured, we can simply replace print statements like

print('Splitting train and test sets')

with

logger.info('Splitting train and test sets')

Now logs can be stored in a log file along with user-defined metadata such as the time, module, function name and line number. This information not only allows us to track the progress of our programme but also gives detailed information for debugging purposes, all stored away safely in the log file.
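A minimal sketch of such a configuration (the log file name and format string below are illustrative choices, not necessarily those used in the companion code):

import logging

# Write logs to a file, tagging each record with the time, module,
# function name and line number that produced it.
logging.basicConfig(
    filename="template_app.log",
    level=logging.INFO,
    format="%(asctime)s | %(module)s | %(funcName)s:%(lineno)d | %(levelname)s | %(message)s",
)

logger = logging.getLogger(__name__)
logger.info("Splitting train and test sets")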

This would also be a good time to move away from storing data and files locally and instead move to remote storage. The training data, for example, is currently stored in an online CSV file, and if we obtain additional data the process of updating this file is awkward. One alternative is to use MySQL, an open-source relational database management system, and its Python connector. We won't dwell on the details here, but once set up we can easily read in data from our database using SQL queries.
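As a rough sketch of what that could look like, assuming a database and table have been set up for the iris data (all connection details and names below are placeholders):

import mysql.connector
import pandas as pd

# Placeholder credentials; in practice these would come from configuration or secrets.
connection = mysql.connector.connect(
    host="localhost",
    user="iris_user",
    password="change-me",
    database="iris_db",
)

# Read the training data straight into a dataframe with an SQL query.
iris_df = pd.read_sql("SELECT * FROM iris_measurements", con=connection)
connection.close()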

We also want to progress from saving files locally. When it comes time to make predictions we might want to use a model that was trained months ago, and one of the easiest, cheapest ways to get this type of object storage is through Amazon S3. Again we will skip over some of the configuration details here, but boto3 offers a convenient Python interface for accessing an S3 bucket. Uploading and downloading files to this remote bucket is a breeze, giving us easy access to all of our models. I haven't implemented MySQL or S3 in the sample code accompanying this article, but there are plenty of other guides for taking these steps online (e.g. here and here respectively).
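A brief sketch of the idea (the bucket name, object keys and local paths are placeholders, and credentials are assumed to be configured separately):

import boto3

s3 = boto3.client("s3")

# Upload a newly trained model to the remote bucket.
s3.upload_file("models/model.joblib", "my-model-bucket", "iris/model.joblib")

# Later, download that model again to make predictions.
s3.download_file("my-model-bucket", "iris/model.joblib", "models/model.joblib")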

One final change we should make to this repository before we are ready to start deploying is abstracting away all of the hard-coded variables. Once this project is deployed remotely it will be quite a task to make changes to the code. For that reason we will move all of the parameters from hard-coded values into JSON configuration files. Now when it comes to changing a parameter we only need to change the values in this JSON file rather than in the code itself, which is much more straightforward. A line like

gscv = GridSearchCV(pipeline, parameters, cv=3, scoring="accuracy")

might move the number of cross validation folds and scoring metric into the configuration file and become

gscv = GridSearchCV(pipeline, parameters, cv=NUM_FOLDS, scoring=SCORING)

where the variables NUM_FOLDS and SCORING are loaded from the design.json configuration file and a get_default() function is added to config.py to easily access these values.
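A sketch of how this might be wired up (the JSON keys and file layout shown here are assumptions; the companion repository may organise this differently):

# config.py
import json
from pathlib import Path

CONFIG_PATH = Path(__file__).parent / "design.json"


def get_default() -> dict:
    """Load the default design parameters from the JSON configuration file."""
    with open(CONFIG_PATH) as f:
        return json.load(f)


# Elsewhere, in the training code:
# config = get_default()
# NUM_FOLDS = config["num_folds"]   # e.g. 3
# SCORING = config["scoring"]       # e.g. "accuracy"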

Step 4: Deployment (code)

With the previous three steps complete, we are finally ready to deploy our code. We will deploy this service as a web application which we will communicate with via a set of API endpoints. To do this we will use a web framework to automate a lot of the overhead. Although there are many we could use, for this simplified application Flask seems like a good choice since it is easy to get up and running and is extremely flexible. The main idea is that running python3 run_app.py accesses the functionality of the project via the src/template_app folder and starts the application at http://localhost:5000. Once the application is up and running, we can then use GET and POST requests to access its train, predict or visualise functionality through a browser or an API development platform such as Postman.

Running the flask app locally to visualise the data.
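A stripped-down sketch of the idea (the endpoint bodies below are placeholders; the real run_app.py and src/template_app code is more involved):

from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/train", methods=["GET"])
def train():
    # Would call into the training pipeline in src/template_app.
    return jsonify({"status": "training started"})


@app.route("/predict", methods=["POST"])
def predict():
    # Would load the saved model and score the measurements sent in the request body.
    measurements = request.get_json()
    return jsonify({"received": measurements, "prediction": "setosa"})  # placeholder


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)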

Even for this very simple project the requirements list has begun to grow considerably. For another developer to get up and running with our code would require setting up virtual environments and downloading packages. In order to automate this process and avoid the risk of running into dependency issues we will use a tool called Docker. Docker effectively creates an empty container that installs the entire project from scratch. The advantage is that anyone, on any machine, can get our code up and running by installing Docker and simply running docker build -t sample_app:v.1.0 . from the root folder of the project. Once the local image is built we can run and expose it with docker run -p 5000:5000 sample_app:v.1.0. And that's it! The Dockerfile includes the instructions to download Python, pip and the requirements. It then runs the tests and the pycodestyle check and, if that finishes successfully, it runs the application via the run_app.py script.

Building a Docker container (note that this runs quicker than usual as the container is being built for a second time so most of the steps are in cache).
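For reference, a hedged sketch of what such a Dockerfile might contain (the base image, paths and exact steps are assumptions; the real Dockerfile in the companion repository may differ):

FROM python:3.8-slim

WORKDIR /app

# Install the project dependencies first so this layer can be cached.
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy the rest of the project into the image.
COPY . .

# Fail the build if the unit tests or the pycodestyle check fail.
RUN bash test-all.sh

EXPOSE 5000
CMD ["python3", "run_app.py"]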

We have now converted our original Jupyter Notebook into a fully fledged microservice. Our final code is modularised, it adheres to PEP 8, it includes logging and testing, and it is ready to deploy as a Dockerised Flask application. We have also discussed data management with SQL and Amazon S3. Where you should deploy your application will depend on your project's needs; projects of scale often use AWS, Azure or Google Cloud, while platforms such as Heroku offer free hosting for small hobby projects. For bigger projects with numerous microservices, Kubernetes is an open-source system for automating the deployment, scaling and management of containerised applications. Once a large project is live, continuous integration can become a major challenge; in this case an automation server such as Jenkins can manage tasks related to building, testing and deploying software.
