ML Infrastructure at Flo

Some of our models require huge amounts of memory. Others are very GPU-hungry. Still others use large-scale distributed frameworks like Spark. Yet all of them are connected, often passing data or artifacts to each other. To support these use cases, we created a unified development environment where every engineer can easily access the required resources. All the components in this environment are well-known and unsurprising, but they work surprisingly well together.

Kubernetes — Resource allocation

Kubernetes has been the default option for production deployments at Flo for quite a while. But its use is not limited to long-running tasks like web services — with a proper user interface, Kubernetes acts as a perfect resource manager, allowing engineers to launch development containers with as much CPU, GPU, and memory as they need.
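
To make the idea concrete, here is a minimal sketch of the kind of pod manifest such a UI could submit on an engineer's behalf. The function name, image, and resource figures are illustrative assumptions rather than Flo's actual configuration; `nvidia.com/gpu` is the resource name registered by the NVIDIA device plugin.

```python
def dev_pod_manifest(name, image, cpu, memory, gpus=0):
    """Build a Kubernetes Pod manifest (as a plain dict) for a dev container."""
    resources = {"cpu": str(cpu), "memory": memory}
    if gpus:
        # GPUs are requested through an extended resource; this name is the
        # one registered by the NVIDIA device plugin.
        resources["nvidia.com/gpu"] = str(gpus)
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": "dev",
                    "image": image,
                    # Equal requests and limits give the pod a fixed allocation.
                    "resources": {
                        "requests": dict(resources),
                        "limits": dict(resources),
                    },
                }
            ],
        },
    }


# Hypothetical example: 8 CPUs, 64 GiB of memory, and one GPU.
manifest = dev_pod_manifest(
    "jupyter-alice", "example/ml-dev:latest", cpu=8, memory="64Gi", gpus=1
)
```

The point of hiding this behind a UI is that the engineer only chooses the numbers; everything else in the manifest stays under the platform team's control.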

So why not give ML engineers direct access to the AWS management console? We considered this option in the early stages of our ML platform, but it turns out to be extremely hard to let someone create EC2 instances while preventing other potentially dangerous actions, such as granting the wrong services access to data or destroying existing resources. Moreover, manually created instances aren’t managed by Terraform (the infrastructure-as-code tool used at Flo) and can easily be forgotten or lost.

The real challenge is to give engineers just enough flexibility to do their job efficiently without harming others. Kubernetes with a proper UI fits this purpose perfectly.

JupyterLab — Quick start

JupyterLab provides both the type of UI we need and a set of tools for a quick start. Usually when people mention Jupyter, they mean notebooks: a web-based interactive application where you write code, run it, and visualize the results. But JupyterLab is more than that. By launching a container, you create a complete development environment with a terminal, a multi-project structure, pre-installed packages, and everything else you need to get started.

A typical session in JupyterLab at Flo looks like this: A user — typically an engineer or an analyst — logs in to the JupyterLab UI, launches one of the predefined machine configurations, and starts hacking around.

Although each session may use a different hardware configuration, changes to the installed software and project files stored in the home directory are preserved on a mounted disk. Users can also exchange data files via a common S3 bucket configured in all containers.

In addition to S3, all JupyterLab containers share configuration for Git, the experiment tracking and model registry server, the SQL engine (we use Presto), the Spark cluster, the feature store, etc. Upon login, the user gets not only the hardware resources needed for their task but also a fully configured environment.

When the day is over and the container is not needed anymore, either the user stops it explicitly or the JupyterLab server culls it automatically after a configured timeout.
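
The culling rule boils down to comparing a container's last activity against a timeout. A minimal sketch of that check follows; the function name and the one-hour timeout are assumptions made for illustration, and in practice the culler is normally configured rather than written by hand.

```python
from datetime import datetime, timedelta, timezone


def should_cull(last_activity, timeout, now=None):
    """Return True when a container has been idle for at least `timeout`."""
    now = now or datetime.now(timezone.utc)
    return now - last_activity >= timeout


# Hypothetical example: last activity two hours ago, one-hour timeout.
last_seen = datetime.now(timezone.utc) - timedelta(hours=2)
print(should_cull(last_seen, timedelta(hours=1)))  # prints True
```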

VSCode — Complete IDE, completely remote

JupyterLab is perfect for quick experimentation, but as your codebase grows, things start getting complicated. Code refactoring, automatic formatting, working with multiple branches, and many other practices that software developers are used to are much easier in “traditional” IDEs. But traditional IDEs can only run locally or via annoyingly slow remote desktop software, can’t they?

It turns out that modern IDEs — or at least one of them, Visual Studio Code — have very rich support for remote development. With just a few official extensions, one can run VSCode UI locally but keep the source code and run the interpreter on a remote machine via SSH, in a Docker container locally or in Kubernetes. And yes, the remote container can be the one created via the JupyterLab server.

Here’s what a new ML engineer should do to connect their VSCode to the remote container:

  • Start the container via JupyterLab UI
  • Install kubectl and set up kubeconfig with enough permissions to view the list of running pods
  • Install VSCode extensions “Kubernetes” and “Remote - Containers” (since renamed “Dev Containers”)
  • In the Kubernetes extension menu, select “Workloads → Pods → <name of their container>”
  • In the context menu, select “Attach Visual Studio Code”
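
A quick way to confirm that the kubeconfig from the second step grants enough access is to list the running pods and check that the container shows up. The sketch below parses the output of `kubectl get pods -o json`; the helper name and the pod names are made up for illustration.

```python
import json


def running_pod_names(kubectl_json):
    """Return names of pods in the Running phase.

    Expects the text produced by `kubectl get pods -o json`.
    """
    pods = json.loads(kubectl_json)["items"]
    return [
        pod["metadata"]["name"]
        for pod in pods
        if pod.get("status", {}).get("phase") == "Running"
    ]


# Hypothetical sample in the shape kubectl returns.
sample = json.dumps({"items": [
    {"metadata": {"name": "jupyter-alice"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "jupyter-bob"}, "status": {"phase": "Pending"}},
]})
print(running_pod_names(sample))  # prints ['jupyter-alice']
```

With kubectl installed, the input can come from `subprocess.run(["kubectl", "get", "pods", "-o", "json"], capture_output=True, text=True, check=True).stdout`.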

At this point, the engineer can open a terminal, create a virtual environment, clone a repository, and open it as a directory. Except for a few minor differences, the UX is almost identical to working on a local machine.

What amazed us the first time we ran this configuration is how well all the components are integrated. For example, VSCode automatically detects scientific Python code and proposes to launch a remote TensorBoard server and forward the needed ports.

One concern we had was the latency of user input. And indeed, the integrated terminal has a slight delay between typing a character and the character making it to the terminal. But we got used to it pretty quickly, and typing the code in the editor doesn’t have any delay at all.

Despite excellent integration, some habits still have to be adjusted. For example, since the interpreter runs on the remote server, a script that opens a matplotlib plot window doesn’t work. Several alternatives are available, though, such as saving the plot to a file, or plotting in the JupyterLab UI or in VSCode’s built-in notebooks, as in the image below.

All in all, VSCode has incredibly flexible capabilities for remote development. If it sounds like something that could be useful to you, take a look at the official Remote Development overview.

Model registry

Training a good model is one thing. Handling hundreds of model versions is quite another. At Flo, we have come a long way in formalizing and simplifying the way we manage ML models.

With the very first models, we did everything manually. We trained a model, serialized it to disk under what we hoped was a descriptive name and version, and uploaded it to S3 together with a short description of the data and code used for training. We then updated the source code of every service that used the model so that it downloaded the new artifact version.

Needless to say, it was a really tedious and error-prone process. In the next iteration we started to automate it. To create a new version of a model, an engineer would do the following:

  • Create or modify a training script
  • Update the Dockerfile of the training image to use this script
  • Commit the changes and push to a branch
  • Launch training via the CI/CD pipeline

CI/CD would then record the Git hash of the commit, run training, and save the generated artifact to S3 using the hash as the model version.
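
The versioning convention in this step can be sketched in a few lines. The key layout and file name below are hypothetical, since the post does not describe the actual S3 structure; only the idea of using the commit hash as the version comes from the text.

```python
import subprocess


def current_git_hash():
    """Return the full hash of the checked-out commit."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def artifact_key(model_name, version):
    """S3 object key for a model artifact (the layout is an assumed example)."""
    return f"models/{model_name}/{version}/model.bin"


# Inside a Git checkout, the pipeline would combine the two:
# key = artifact_key("spam-filter", current_git_hash())
```

Because the version is derived from the commit, re-running the pipeline on the same commit writes to the same key, which is what makes a run easy to reproduce.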

It solved some of the pain points we had at that time: most of the steps were automated, and reproducing them became as easy as re-running a CI/CD pipeline. At the same time, experimentation became slower and less convenient because we now had to update the Dockerfile every now and then, and training ran on a remote server, which made the process harder to monitor. This solution also didn’t address model serialization: every time we added a new ML framework or changed preprocessing, we had to update the storage format.

Surely, we were not alone in these problems. After some exploration, we discovered that MLflow — a popular tool for managing the ML model lifecycle — covered most of our needs. In particular, MLflow Model Registry provides a unified API for serializing models created in one of the popular ML frameworks and storing them in a specified location (local disk or S3). For example, to save a PyTorch model, one can simply write:

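import mlflow.pytorch
my_model = train(..., data)  # train() stands in for your own training code
mlflow.pytorch.log_model(my_model, "model", registered_model_name="my_model")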

MLflow handles the artifact and assigns it a version number. Another project in a completely different place can then retrieve it by name and version.
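
As a hedged illustration of the retrieval side: MLflow addresses registered models with URIs of the form `models:/<name>/<version>`, which the framework-specific loaders accept. The helper names, model name, and version below are placeholders, and the sketch assumes the tracking and registry servers are already configured in the environment.

```python
def model_uri(name, version):
    """Build the MLflow URI that points at one registered model version."""
    return f"models:/{name}/{version}"


def load_pytorch_model(name, version):
    """Fetch a registered PyTorch model by name and version."""
    # Imported lazily so that building URIs does not require MLflow or PyTorch.
    import mlflow.pytorch

    return mlflow.pytorch.load_model(model_uri(name, version))


# Example (requires a reachable registry and an existing model version):
# model = load_pytorch_model("my_model", 3)
```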


In the next section, we will also cover our MLflow setup and experiment reproducibility.

Metric tracking and reproducibility

Model versioning actually makes little sense without tracking the metrics of each version. Fortunately, MLflow covers this part too. In our setup, we run MLflow Tracking Server backed by PostgreSQL (to store the logged metrics) and S3 (to store the model artifacts).

The Tracking Server provides UI for both models and metrics. For example, here’s the screen for one of our models:

The metrics are bound to so-called runs, which in turn are combined into experiments.

With MLflow, you can log any metric as simply as mlflow.log_metric(name, value). But in most cases, we log the same set of values, such as hyperparameters, loss, number of epochs, etc. MLflow automates this further by providing autologging for the most widespread frameworks. For example, the table in the screenshot above was generated using the following code:

import mlflow
mlflow.autolog()
with mlflow.start_run():
    train(..., data)  # train() stands in for your own training code
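
For values that autologging does not capture, the explicit calls are just as short. The sketch below logs hyperparameters once and a loss value per epoch inside a single run. It takes the tracking API as an argument (in real use, the `mlflow` module itself) so that it can be exercised without a server; the function name and the sample values are assumptions.

```python
def log_training_run(tracking, params, losses):
    """Log hyperparameters and per-epoch losses within one run."""
    with tracking.start_run():
        tracking.log_params(params)
        for epoch, loss in enumerate(losses):
            # `step` lets the UI plot the metric as a curve over epochs.
            tracking.log_metric("loss", loss, step=epoch)


# Real use (requires MLflow and a configured tracking server):
# import mlflow
# log_training_run(mlflow, {"lr": 1e-3, "epochs": 3}, [0.9, 0.5, 0.3])
```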

One of the most important parts of a good ML experiment is its reproducibility. MLflow has this covered with its Projects feature. However, MLflow Projects require either a Conda environment or a Docker image. Since not all teams at Flo use this configuration, we didn’t want to force them to switch, rewrite their codebase, or otherwise restrict their style of working. Instead, we developed a small library that logs the current state of the environment to the MLflow Tracking Server. At each experiment run, this library records:

  • Current Git commit or the word “dirty” if there are uncommitted changes
  • List of the installed libraries
  • The file (i.e., module or script) from which the experiment was launched
  • Command line arguments, if any

This way, any experiment that has been logged to MLflow from a clean Git state can be easily repeated, no matter how or in which environment it was launched.
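
The library itself is internal, but the four items listed above can be collected with the standard library alone. The following is a rough sketch under that assumption, not Flo's implementation; all function and key names are hypothetical.

```python
import subprocess
import sys
from importlib import metadata


def git_state():
    """Return the current commit hash, or "dirty" if the work tree has changes."""
    def git(*args):
        result = subprocess.run(
            ["git", *args], capture_output=True, text=True, check=True
        )
        return result.stdout.strip()

    try:
        # Any output means modified or untracked files are present.
        if git("status", "--porcelain"):
            return "dirty"
        return git("rev-parse", "HEAD")
    except (OSError, subprocess.CalledProcessError):
        return "unknown"  # not a Git checkout, or git is not installed


def installed_packages():
    """List installed distributions as sorted "name==version" strings."""
    return sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
    )


def environment_snapshot():
    """Collect the commit, packages, entry point, and arguments in one dict."""
    return {
        "git_commit": git_state(),
        "packages": installed_packages(),
        "entry_point": sys.argv[0],
        "arguments": sys.argv[1:],
    }
```

The resulting dictionary could then be attached to the active run, for example with `mlflow.set_tag` for the commit and `mlflow.log_dict` for the package list.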

What’s next

There are two things that we didn’t talk about here — dataset management and model serving. We don’t have established solutions for these yet, but they are still important parts of the ML model lifecycle and are worth mentioning.

Unlike many other companies working on a single AI product, Flo has dozens of ML-related tasks. We predict symptoms, detect signs of health issues, filter out spam in Secret Chats, optimize push notifications, and so on. At the same time, we collect data about user well-being, search queries, behavior in the content library, activity patterns, etc. One interesting thing is that any of these data sources may help in any of these tasks. For example, if a user searches for “painkillers” in the middle of their cycle, it’s more likely to be related to painful ovulation or other causes and less to PMS symptoms. A content ranker can use the data about cycle state to prioritize the most relevant content for this specific person at this specific point in time.

To enable such cross-domain scenarios, we are starting to adopt so-called feature stores — specialized services working as feature hubs. After one team with domain knowledge creates a feature definition, all other teams can readily reuse it in their models, putting nearly zero effort into feature engineering.

In addition to a feature registry, many feature stores provide dual storage (offline for dataset collection and online for serving user requests in real time), dataset versioning, feature discovery, and many other cool things. We’ve already chosen a feature store implementation, but we haven’t fully migrated to it yet. So stay tuned!

The ML model lifecycle doesn’t end when the model is trained. To become actually useful, the model must be deployed to production and serve user requests. But unlike typical microservices, which are usually I/O-bound, ML models are CPU-bound and thus require special treatment.

We have already learned a few new tricks to make web frameworks handle heavy computations efficiently, but that’s only one part of the problem. It turns out that ML engineers don’t like web programming and usually have a hard time keeping up with the latest trends in this field.

Thus we aim to develop a serving infrastructure that would allow ML engineers to do what they are best at and also create services with the best performance possible. Come back in a few months to see how it looks!

To summarize, here’s the diagram of the most important components of our ML infrastructure:

