Keep Your ML Models out of Your Application Servers
Deploying your Machine Learning (ML) models is never a job to take lightly. Sure, some great tools can help you share your models with the world quickly, like Gradio and Streamlit, but if we’re talking about anything more than a proof of concept, you have some decisions to make!
Gradio and Streamlit are great at what they do but provide limited frontend flexibility. There is only so much you can do, and the frontend of your ML application will look like so many other prototypes out there. Furthermore, they are not built to scale to many concurrent requests. Your model will become the bottleneck, eating into the server’s resources.
In this story, I will give you one good reason to keep your model inside your application server, and then show you where this approach breaks down and why you shouldn’t stick with it in production.
Learning Rate is a newsletter for those who are curious about the world of AI and MLOps. You’ll hear from me on the first Saturday of every month with updates and thoughts on the latest AI news and articles. Subscribe here!
To begin with, let’s look at how a typical web application is structured. The figure below depicts the entities that participate in a usual scenario where a user interacts with a web application:
The client is your user. Your user can either be a real-life person or an application making a request. The server is typically where most of your code runs. It accepts requests, processes them, and responds to the client. The database stores your application’s data. It can take many forms and shapes, but this is not our current concern.
In this context, the client (e.g., a user or an application) makes a request to the server over the network. The server looks in a database system to gather the necessary information and returns a response to the user.
The procedure I briefly described is a very simple scenario with many alternatives. However, this is arguably the most common sequence of steps when a client makes a request.
Now, if you choose to go with the model-in-server architecture, your server is responsible for loading your model, processing the request’s data, running the forward pass of your model, transforming the predictions, and returning the response to the user. If it seems like a lot of work, it’s because it is!
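To make this concrete, here is a minimal sketch of the model-in-server pattern. The names and the dummy model are hypothetical stand-ins; the point is that the model is loaded into the web server process itself and every step happens inside the request handler.

```python
import json


class DummyModel:
    """Hypothetical stand-in for a real ML model; 'predicts' the sum of each row."""

    def predict(self, features):
        return [sum(row) for row in features]


# Loaded once at server startup -- the model now lives in the
# application server's memory and shares its resources.
MODEL = DummyModel()


def handle_predict(request_body: str) -> str:
    """A request handler that does everything in-process."""
    payload = json.loads(request_body)          # 1. process the request's data
    preds = MODEL.predict(payload["features"])  # 2. run the forward pass
    return json.dumps({"predictions": preds})   # 3. transform predictions and respond
```

In a real server the handler would be wired to a route (e.g., in Flask or FastAPI), but the shape is the same: parsing, inference, and serialization all compete for the same process's CPU and memory.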
Why should you start there?
Seeing how much work your server has to do may be discouraging. However, there is a very good reason to start with this approach: you need to fail fast and fail often.
When you are prototyping a new ML application, it is good to have a few things in place:
- Have a basic UI to make it easier for your users to interact with the application
- Put your application behind a URL, so it is easier to share it with your friends and testers
Tools like Streamlit and Gradio can do the heavy lifting for you. So, during your early days, when you should deploy early and often, you should definitely leverage their features. Moreover, while you are still developing your model, your focus should stay on the model itself. Keep the architecture simple and add complexity later.
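To see how little code this takes, here is a sketch of a prototype's core: a plain Python function wrapped in a UI. The sentiment rule is a hypothetical stand-in for a real model's inference call, and the Gradio wiring is shown commented out so the sketch stays self-contained.

```python
def predict_sentiment(text: str) -> str:
    """Hypothetical stand-in for a real model's forward pass."""
    return "positive" if "good" in text.lower() else "negative"


# Wrapping the function in a shareable web UI takes two more lines with Gradio,
# which also gives you a public URL to send to friends and testers:
#   import gradio as gr
#   gr.Interface(fn=predict_sentiment, inputs="text", outputs="text").launch(share=True)
```

Both requirements from the list above — a basic UI and a shareable URL — come essentially for free, which is exactly why these tools shine during prototyping.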
If you’re working at a company, re-using existing infrastructure is another good reason to follow this approach. Your company may already have established processes to reliably deploy code to servers, and you can piggyback on them.
Why you shouldn’t stop there
Having said this, there are many reasons why you wouldn’t want to follow this architecture when deploying in production.
First, your model in production will change frequently. Performance degradation, concept drift, and new data will make you update your model often. Your web server code, on the other hand, will not change nearly as frequently. If you follow the model-in-server architecture, you’ll have to redeploy the entire application every time you update your model — hardly an optimized roll-out process.
Next, your server’s hardware may not be optimized for ML workloads. For example, it might not have access to GPU devices that your model could leverage to make predictions faster. You can argue that you might not need GPUs during inference, but in a later article on performance optimization, you may find that, depending on your use case, you do need them after all.
Finally, your model and your web server will most likely scale differently. If your model is complicated or large, you may want to host it on GPUs and distribute the load across many machines. You wouldn’t want to bring all that complexity to your web server. Plus, your model will eat into your web server’s resources, leaving less capacity to serve ordinary requests.
In this story, we saw why you’d want to use tools like Streamlit and Gradio to deploy versions of your ML application fast and often. We covered the advantages and disadvantages of the model-in-server architecture and why you wouldn’t want to take it to production.
So, what can you do? In the next story, we’ll see how to separate your model from your web server and, finally, use a well-known tool to serve your model in a production-ready environment.
About the Author
My name is Dimitris Poulopoulos, and I’m a machine learning engineer working for Arrikto. I have designed and implemented AI and software solutions for major clients such as the European Commission, Eurostat, IMF, the European Central Bank, OECD, and IKEA.
Opinions expressed are solely my own and do not express the views or opinions of my employer.