Empowering End-to-End Data Science at Oda
TL;DR: We are building world-class capabilities for unleashing the power of data science in grocery shopping and logistics at Oda. If you’re already sold and want to help us with this, feel free to scroll all the way down to the section on why you should join our Data Platform team right away. But if you want to know all our thoughts on how we want to empower end-to-end data science at scale, please read on from here!
Unleashing the potential in data and algorithms is a core capability of our company and our product at Oda. To make this a reality in our everyday work, having a large team of skilled and creative data scientists with a desire for impact is obviously extremely important, but it is not enough.
Without the infrastructure and tooling that allows data scientists to be productive and focus on actually solving the data science tasks at hand as well as pursuing their own ideas, they will be starved for most of the impact they crave and unable to reach their full potential.
The company may still benefit from their efforts to some degree as long as they are solving the right problems, but the road not taken is paved with huge opportunity costs. You are missing out on many of the innovative and transformative capabilities that your product-savvy data scientists would surely be capable of delivering if they were able to iterate quickly and end-to-end; all the way from raw data to A/B test in production.
Of course, there is a lot more to this than the actual data science infrastructure and tooling. You need an organizational structure, culture, and way of working that lend themselves well to learning by doing, being comfortable with ambiguity, and balancing long- and short-term returns.
We will likely write more about those topics later. In this blog post we focus on the tooling, because with the right tooling in place, data scientists will be able to explore ideas, generate hypotheses, and uncover evidence to substantiate business impact for their prototypes at a rapid pace. This can then inform the product development process, increasing the likelihood that you maximize your ability to deliver end-user value using data science.
Our enablement philosophy
At Oda, our Data Platform team is responsible for providing such tooling and infrastructure to the data scientists across the entire company.
We prefer enablement over handovers, and believe that data scientists should be empowered to take solutions, ideas and experiments all the way from raw data to production, while having minimal exposure to lower-level infrastructure components and esoteric languages. Data scientists should spend their time processing and analyzing data and writing code, not fighting with project environments, containers, and cloud infrastructure. That should primarily be the Data Platform team’s job.
Sure, it’s terrific to have data scientists who are both willing and able to learn everything they need about Docker and Kubernetes (which we use extensively), and we have actually benefited greatly from this throughout our scale-up phase at Oda. But if this becomes a requirement for data scientists to generate value autonomously, you can be absolutely sure that you are leaving money on the table.
Data scientists come from a wide variety of backgrounds and have different skillsets and specializations, which we have chosen to truly embrace at Oda. Sure, everybody needs to operate at a certain technical level. But as long as you are a data scientist who knows Python and SQL, it is the Data Platform team’s job to technically enable you to translate your analytical and developer skills into business value for the company.
Just think about it: When your mandate is to unleash the potential in data and algorithms by building new and innovative capabilities, there is a huge opportunity cost to having data scientists that are primarily skilled in applying data analysis and algorithms spend a lot of time upskilling on technologies they should not really need to know. The cognitive load experienced by data scientists should be primarily related to solving the actual data task at hand, and you want to avoid having them pulled out of their flow state to deal with extraneous technical problems.
In short, you need an excellent data science platform that allows data scientists to do what they’re actually meant to be good at. And while we certainly won’t claim to have nailed this at Oda yet, we believe we are heading in the right direction. Read on to learn more about how we are doing things today and the challenges we want to tackle within the MLOps space going forward.
How we pick and choose technologies
At Oda, we strive to make pragmatic, proven, and sustainable technology choices that solve the problems we need to solve without incurring large operational costs. We try to avoid jumping on the “flavor of the month” bandwagon and resist the urge to pick up something just because it is new and shiny.
This is a big challenge in the machine learning engineering and MLOps field, which is advancing at a very rapid pace and where today’s golden goose could easily be tomorrow’s legacy. There is very little truly “proven” technology to support the type of data science you have to do to build innovative and transformative capabilities across a brand new type of value chain.
In such conditions, it is hard to know which technologies and frameworks are truly here to stay, and by extension which ones you should pick to solve your problems. But since we want to be at the forefront of creating innovative capabilities using data science, we do need to make some bets here so that the technology — or lack thereof — does not hold us back!
We won’t claim to be doing everything right all the time, but we believe we have done pretty well so far using the following rules of thumb:
- We try to postpone decisions about technology until we actually have to make them. If you’re looking for a library to package machine learning models as APIs, you don’t have to establish a standard until you know that you actually have use cases that require this capability in the foreseeable future. If there is no use case, it’s also very likely that you’ll pick the wrong thing and implement it in the wrong way.
- We prefer slim solutions that solve specific problems very well over big suites of tools and big platform solutions. There may come a time when all of MLOps can be solved really, really well with a single, mature platform, but that time has not come yet.
- Following from the above point, we also prefer technologies and solutions with a low level of lock-in. We do of course want to commit for some reasonable period of time, but we also don’t want replacing a component of our stack once something better comes along to be a huge hassle. This also means we necessarily favor technologies that integrate really well with the rest of our stack using open standards and protocols.
- Being a tech company doesn’t mean you have to build everything yourself or exclusively use open source, so we always consider “as-a-Service” solutions as well. But all the aforementioned rules of thumb also apply to these types of solutions.
How we actually empower our data scientists today
The requirements for data scientists to be able to do their magic will of course vary based on the context they are working in. For end-to-end data science generalists, like we mostly have at Oda, their requirements are usually something along these lines:
- Easy and hassle-free setup for new and existing projects
- Easy access to company data sources
- Tools that simplify common tasks and operations in data science projects
- Ability to draw on extensive compute resources when required
- Cloud infrastructure for storage of artifacts (like datasets, models, etc.)
- Easily run and deploy data and machine learning pipelines
- Easily package and deploy an inference API, Streamlit app, or some other type of lightweight application
- Automated testing and deployment (CI/CD) to safely make changes to existing projects
- Ability to monitor and troubleshoot pipelines and APIs
So what kind of tools and technologies does the Data Platform team actually provide to our data scientists at Oda, and how do we want to evolve this moving forward?
Our data science monorepo: Fabrica
Almost all of our data science happens in Python and currently lives in a single repository which we call Fabrica. As we have scaled from 3 to a total of 13 data scientists who work in different parts of the company, this approach has been very valuable for establishing best practices and templates, facilitating knowledge sharing, and keeping an overview of pain points and excessive boilerplate code so we can build the right core functionality for common data science tasks. There may come a time when a monorepo is not the right way to do things anymore, but we are not there yet.
Having everything in a single repo does not mean we treat data science as a monolith. All our projects are completely standalone and function separately from one another. We don’t allow cross-project dependencies in production, but having very easy access to functionality from other projects has actually proven to be a great boost for quickly running experiments that may lead to new projects down the line.
If a project is quite big and has characteristics that don’t really fit well with the rather standardized data science project setup, and generally feels like it definitely needs to be “its own thing”, we will of course move it to its own repo. Our in-house route planner is an example of such a project — which is actually written in Rust, by the way!
In Fabrica, we currently use a combination of Conda and Docker to set up and isolate project environments. Conda was our stable workhorse for a long time, but as we have started adopting other tools that are container-based, we have decided to phase it out and base both development and runtime environments for all projects on Docker and pip-tools. Conda does not really play well with Docker (those images become huge!), and it’s a pain to use with certain Linux setups.
We are currently working on this new setup, where the end goal is for data scientists to just run a single command (maybe develop?) to spin up a development container and start working on a new or existing project.
As alluded to earlier, Fabrica also contains some core functionality we have built to make certain data science tasks easier. These are mostly tasks that are specific to our own company setup and cannot easily be solved by other open-source libraries, or where we simply have had trouble finding the right tool for the job. This core functionality lives in a set of packages that we host on our private Gemfury repo, which means projects can install and version them exactly like any public package.
This core functionality is what allows our data scientists to very easily pull data from our Snowflake DWH, store and retrieve objects in Google Cloud Storage, implement logging in our preferred format in any project, track metadata from pipeline runs and artifacts, and a whole lot more. While these tools are primarily owned and maintained by the Data Platform team, all data scientists are actively encouraged to contribute useful functionality, bug fixes, etc. This ensures that the development of our internal tools is primarily use-case-driven and that we build a proper community around our data platform ecosystem where everybody contributes.
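To make the pattern concrete, here is a self-contained sketch of what one of these helpers might look like. The real packages talk to Snowflake and Google Cloud Storage; this stand-in uses the local filesystem so it runs anywhere, and all names (ArtifactStore, the key layout) are hypothetical, not our actual API:

```python
import pickle
from pathlib import Path


class ArtifactStore:
    """Sketch of an internal artifact-store helper. The real version would
    read and write a Google Cloud Storage bucket; this stand-in uses the
    local filesystem so the pattern is runnable anywhere."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, key: str, obj) -> None:
        # Serialize any Python object under a project-scoped key.
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(pickle.dumps(obj))

    def get(self, key: str):
        return pickle.loads((self.root / key).read_bytes())


store = ArtifactStore("/tmp/fabrica-artifacts")
store.put("demand_model/v1.pkl", {"coef": [0.4, 1.2]})
model = store.get("demand_model/v1.pkl")
```

The point of wrapping this in a versioned internal package is that every project gets the same storage conventions for free, instead of each data scientist hand-rolling their own cloud-storage calls.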
Compute resources and data pipelines
We don’t really have a lot of “big data” at Oda yet, but that does not mean data scientists should be limited to what they can easily run on their laptops. And since we are now going international and generally growing our business very fast, our data scientists will definitely be working with bigger and bigger datasets in the years to come.
Still, the current main motivations for providing data scientists with extensive compute resources at Oda are to increase the speed at which they can iterate, to avoid limiting them by lack of access to CPU and GPU cycles when considering and pursuing different solutions to problems, and to give them the freedom to spend less time writing extremely performant code.
On the subject of performant code: While this is of course something to strive for, it’s not necessarily most data scientists’ strength. Again, consider the opportunity cost of having creative, analytically minded data scientists spend a lot of time optimizing code runtime when they could instead be making their models better or tackling the next big data science problem.
If a data scientist requires a GPU or more resources than they have available on their laptop to work efficiently on a project, they can spin up a virtual machine on Google Cloud Compute Engine with their required specifications. This is done using a very thin wrapper we have built on top of the Google Cloud CLI, which ensures that everything is set up correctly and that the data scientist can start working on and running the code on the new machine right away.
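A thin wrapper of this kind can be sketched in a few lines of Python. The flags below are standard gcloud flags, but the function name, the defaults, and the surrounding setup are illustrative assumptions, not our actual implementation:

```python
import shlex


def build_create_vm_command(name: str,
                            machine_type: str = "n1-standard-8",
                            zone: str = "europe-west1-b",
                            gpu: str = None) -> list:
    """Build a `gcloud compute instances create` invocation for a data
    science VM. A minimal sketch of a thin CLI wrapper; the real one would
    also handle images, disks, and project configuration."""
    cmd = [
        "gcloud", "compute", "instances", "create", name,
        f"--machine-type={machine_type}",
        f"--zone={zone}",
    ]
    if gpu:
        # Attach a single accelerator of the requested type.
        cmd.append(f"--accelerator=type={gpu},count=1")
    return cmd


# In the real wrapper this list would be handed to subprocess.run(...).
cmd = build_create_vm_command("ds-sandbox", gpu="nvidia-tesla-t4")
print(shlex.join(cmd))
```

Keeping the wrapper this thin is deliberate: it encodes the correct defaults once, while leaving the heavy lifting to the official CLI.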
While this setup works, it is quite hacky and has some drawbacks, and since it is heavily based on Conda it would need to be refactored significantly to play well with our work-in-progress Docker setup for development environments. With our adoption of Docker and Kubernetes, we see an opportunity to double down on this stack and run remote development on Kubernetes as well. We are looking into the Okteto CLI for this purpose, and our experiments so far show great promise!
All our Python data pipelines, ML pipelines, and other data science jobs and workflows — both for production and ad-hoc purposes — run on Kubernetes Engine using Argo Workflows. We considered several options when we had to adopt a tool for running container workloads, and we are very happy with our choice of Argo Workflows since it does exactly this one thing, and it does it really well.
While we have a lot of Argo workflows running today, we have not yet determined how to implement workflow standards and best practices that allow data scientists to efficiently create and deploy pipelines for new and existing projects without a ton of “YAML-ing.” Argo workflows are custom Kubernetes resources that require quite a bit of boilerplate to set up, and since their contents usually map well to abstractions that we have to build in Python in our projects anyway (functions or methods like train_model, etc.), there is a lot of potential to make the creation and deployment of Argo workflows super easy using some additional tooling.
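As a sketch of what such tooling could look like, the snippet below generates an Argo Workflow manifest from a list of step names, so nobody hand-writes the YAML. The manifest fields follow the Argo Workflows CRD, but the module layout, image name, and entrypoint convention are hypothetical:

```python
import json


def workflow_from_steps(name: str, image: str, steps: list) -> dict:
    """Turn a list of Python step names (e.g. functions like train_model)
    into an Argo Workflow manifest that runs them sequentially. Sketch only;
    a real version would handle parameters, retries, resources, etc."""
    templates = [
        {
            "name": step.replace("_", "-"),
            "container": {
                "image": image,
                # Hypothetical convention: each step is invoked as a module
                # entrypoint inside the project's container image.
                "command": ["python", "-m", f"fabrica.{name}.steps", step],
            },
        }
        for step in steps
    ]
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": f"{name}-"},
        "spec": {
            "entrypoint": "main",
            "templates": templates + [
                {
                    "name": "main",
                    # One step group per template, so the steps run in order.
                    "steps": [[{"name": t["name"], "template": t["name"]}]
                              for t in templates],
                }
            ],
        },
    }


manifest = workflow_from_steps(
    "delivery-eta", "eu.gcr.io/example/fabrica:latest",
    ["build_features", "train_model"],
)
print(json.dumps(manifest, indent=2))
```

The generated dict can be serialized and submitted like any other Kubernetes resource, which is exactly the kind of boilerplate we would like to hide from data scientists.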
Deploying and hosting machine learning models
Not all data science projects need to be APIs. We have several projects that deliver lots of value to our company and previously ran very well as batch cron jobs on a weak, old-fashioned virtual machine in the cloud (they now run on Kubernetes using Argo, obviously).
While this is important to recognize, we definitely cannot limit our data science efforts to these kinds of “offline” use cases. We have several projects — like our ML model that predicts how long it will take to make a delivery at a customer — where real-time inference has been a requirement. We have packaged and implemented the backends of these APIs using a library called BentoML, which does exactly what it says on the tin: make model serving easy. And because it is Python-based, it is easy to integrate directly into our machine learning pipelines, so newly trained models can be automatically deployed without changes to the code in the repository. This helps us cover the often overlooked continuous training (CT) aspect of MLOps.
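The continuous-training hand-off can be illustrated with a minimal local sketch. In our real setup the packaging step is handled by BentoML and artifacts land in cloud storage; everything below (paths, names, the registry layout) is hypothetical:

```python
import json
import pickle
import time
from pathlib import Path

# Local stand-in for a cloud-storage model registry (hypothetical path).
REGISTRY = Path("/tmp/model-registry")


def publish_model(name: str, model) -> str:
    """Called at the end of a training pipeline: register the freshly
    trained model under a new version and point `latest` at it."""
    version = time.strftime("%Y%m%d%H%M%S")
    path = REGISTRY / name / version
    path.mkdir(parents=True, exist_ok=True)
    (path / "model.pkl").write_bytes(pickle.dumps(model))
    (REGISTRY / name / "latest.json").write_text(json.dumps({"version": version}))
    return version


def load_latest(name: str):
    """What the inference service does at startup: resolve `latest`
    and load that model version."""
    version = json.loads((REGISTRY / name / "latest.json").read_text())["version"]
    return pickle.loads((REGISTRY / name / version / "model.pkl").read_bytes())


publish_model("delivery-eta", {"weights": [0.1, 0.9]})
model = load_latest("delivery-eta")
```

The key property is that deployment follows from the training run itself: no code change in the repository is needed for a new model version to go live.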
The actual inference endpoints are implemented as vanilla Kubernetes deployments without a lot of bells and whistles. We are currently experimenting with different combinations of Argo Workflows, Google Cloud Build and/or Spinnaker — or other technologies, for that matter — for updating these deployments with newly trained models as automatically and smoothly as possible.
Running our pipelines and applications obviously also requires managing a set of Google Cloud resources like service accounts, file storage buckets, application secrets, IAM groups, etc. We practice infrastructure as code, so all of these resources are defined in and managed by Terraform. This also goes for the setup of our Snowflake DWH, which is used extensively by almost all of our data science projects.
Testing and monitoring
Nothing kills productivity and creativity like not feeling safe about implementing and deploying changes to existing code. This speaks to the importance of continuous integration (CI) and continuous deployment (CD), which is no less relevant for data science applications than for any other type of application.
That said, it’s worth mentioning that you can often deliver a lot of business value with data science without an extensive CI/CD setup when you’re operating on a rather small scale. Many data science projects can be implemented by a single developer in a pretty uncomplicated manner with code that does not often need to change.
Still, this scales poorly and is obviously not a sustainable way to operate data science in the long run. We have had automated tests for most of our projects for quite a while, which has made it a lot easier for data scientists to contribute to projects they are not intimately familiar with and for the Data Platform team to roll out changes and fixes to our internal packages that are used by almost all projects, and of course it has generally made everyone a lot more comfortable rolling out changes in production.
We write our tests using Pytest, orchestrate them using GitHub Actions when code is pushed to remote, and the actual test code for a project runs on Argo using one or several test workflows implemented by that project. Data scientists are strongly incentivized to create tests for the projects they develop, because if they do not, their teams are the ones that feel the pain.
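For illustration, a project test file might look like the sketch below. The function under test and its expected values are made up; in a real suite the function would be imported from the project package, and pytest.raises would replace the manual try/except used here to keep the sketch dependency-free:

```python
# tests/test_trip_features.py -- illustrative shape of a project test file.

def trip_duration_minutes(start_ts: float, end_ts: float) -> float:
    """Toy stand-in for a project function; real tests would import this
    from the project package instead of defining it inline."""
    if end_ts < start_ts:
        raise ValueError("end before start")
    return (end_ts - start_ts) / 60.0


def test_happy_path():
    # 600 seconds is exactly 10 minutes.
    assert trip_duration_minutes(0.0, 600.0) == 10.0


def test_rejects_reversed_timestamps():
    # Reversed timestamps should fail loudly rather than return a negative.
    try:
        trip_duration_minutes(600.0, 0.0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

Pytest discovers the `test_` functions automatically, so the same file runs unchanged locally, in GitHub Actions, and inside a project's Argo test workflow.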
Both our Kubernetes deployments and Argo workflows are monitored in Datadog. Here we can track endpoint health and dig through logs and tracebacks to figure out what’s going on when something isn’t behaving as it should. Of course, the quality of this troubleshooting is dependent on properly implemented logging in our data science projects.
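As an example of what structured logging for this purpose can look like, here is a minimal JSON formatter using only the standard library; the field names are illustrative, not the ones our Datadog setup actually parses:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Minimal sketch of a JSON log formatter of the kind an internal
    logging helper could provide, so every project emits logs that a
    log aggregator can parse and filter on."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("fabrica.delivery_eta")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("model training finished")
```

Shipping this as part of a shared internal package means every project logs in the same machine-readable format without any per-project setup.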
The ultimate data science platform? We are not there yet!
While we think we have a decently functional setup, we believe the greatest, most challenging, and most interesting work in our MLOps journey lies ahead of us. We believe our future is truly enabling data scientists to work end-to-end doing what they do best. This is how we foster and scale the diversity in backgrounds and specializations that we believe makes our data science group so strong, and it is also how we truly unleash the power of data and algorithms in our product.
Being a product-oriented organization extends to our internal tooling. We view the data science platform as a product we offer to our data scientists. It should be delightful to use, it should make their lives simpler and easier, and it should be built to last. This is how we build for our customers, and it is also how we want to build for our employees.
While MLOps has been the main subject of this blog post and is a big part of what the Data Platform team does, there is obviously a lot more to it. The Data Platform team also owns all the setup and tooling around our Snowflake DWH, both our generic and custom data integrations, our event tracking toolkit, and parts of our core ELT (we use dbt) and business intelligence capabilities (we use Looker).
Want to join us on our data platform journey? As of the time of writing, we have several positions open in the Data Platform team. Check out our career pages to learn more about what it’s like working at Oda, read up on the Data Engineer roles on our jobs page, and don’t hesitate to apply if you feel it could be a good match.
If you have questions about MLOps at Oda, feel free to reach out to Kjetil Åmdal-Sævik directly. If you have questions about the Data Platform team in general or about the other data engineering roles, feel free to reach out to Stian Tokheim directly. We would love to hear from you!