Data science workflow: It’s not as linear as you might think

Finding a problem

As a data scientist in an organization, you don’t search for a nail just because you have a hammer. There are a couple of standard starting points for most data science projects.

Most projects start with a business problem, either one that is already known or one you have to figure out. Some teams will have well-defined problems at hand, such as: How do you segment customers to make email marketing more cost-effective? How do you improve the credit approval process with machine learning (ML)? How do you identify fraud in customer transactions?

Once you’ve identified the problem, you start to think about a solution.

Sometimes businesses will already have a data science/ML solution in production (e.g. a fraud detection algorithm). In those situations, your responsibility is either to maintain the system and monitor it for possible drift, or to introduce a new solution that improves upon the existing one (e.g. with a higher accuracy rate).
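
As a rough illustration of what monitoring for drift can involve, the sketch below computes a Population Stability Index (PSI) between the data a model was trained on and newer data. PSI is only one common choice and is an assumption here, not something this workflow prescribes; the `psi` helper, the equal-width binning, and the synthetic "transaction amounts" are all made up for illustration.

```python
import math
import random


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline sample and a newer sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def shares(values):
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range are clamped into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # eps avoids log(0) when a bin is empty.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


rng = random.Random(0)
baseline = [rng.gauss(100, 15) for _ in range(5000)]  # feature values at training time
same = [rng.gauss(100, 15) for _ in range(5000)]      # new data, same distribution
shifted = [rng.gauss(120, 15) for _ in range(5000)]   # new data whose mean has moved

psi_same = psi(baseline, same)
psi_shifted = psi(baseline, shifted)
print(f"PSI, no drift: {psi_same:.3f}")
print(f"PSI, shifted:  {psi_shifted:.3f}")
```

A frequently quoted convention treats PSI values above roughly 0.2 to 0.25 as a sign of a meaningful shift, but the right threshold depends on the feature and the business context.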

A second possible starting point is a given dataset. Organizations often generate a wealth of information that never gets used. So as a data scientist in the organization, you could mine that data for actionable insights. These may sound like entry-level analytics problems, but businesses often benefit immensely from such insights.

A data science workflow (Source: author)

Working on the problem

Now that you have a problem to solve, naturally, the next step is to find out how to solve it.

You could always break the problem into smaller pieces. For example, if you are trying to make a revenue forecast for the next year, you do not immediately install Facebook Prophet and jump into forecasting.

Instead, you search for relevant information to understand the historical patterns in sales: how sales are currently trending, how market demand is trending, what competitors are doing, and so on.
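
For the revenue forecast example, a first look at historical patterns can be as simple as the sketch below, which computes yearly totals, year-over-year growth, and the average share of sales per quarter. The quarterly figures and the helper names (`yearly_totals`, `yoy_growth`, `seasonal_shares`) are hypothetical and exist only to illustrate the idea.

```python
# Hypothetical quarterly sales figures (made up for illustration).
sales = {
    2021: [120, 135, 150, 180],
    2022: [132, 149, 162, 205],
    2023: [140, 160, 171, 230],
}


def yearly_totals(data):
    """Total sales per year."""
    return {year: sum(quarters) for year, quarters in data.items()}


def yoy_growth(totals):
    """Year-over-year growth rate, keyed by the later year."""
    years = sorted(totals)
    return {
        year: (totals[year] - totals[prev]) / totals[prev]
        for prev, year in zip(years, years[1:])
    }


def seasonal_shares(data):
    """Average share of annual sales that falls in each quarter."""
    shares = [0.0] * 4
    for quarters in data.values():
        total = sum(quarters)
        for i, value in enumerate(quarters):
            shares[i] += value / total / len(data)
    return shares


totals = yearly_totals(sales)
growth = yoy_growth(totals)
shares = seasonal_shares(sales)

for year, rate in growth.items():
    print(f"{year}: {rate:+.1%} vs. previous year")
print("Average quarterly shares:", [f"{s:.1%}" for s in shares])
```

Even a summary this small surfaces questions worth asking before any modeling, such as whether growth is slowing or whether one quarter dominates the year.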

You gather as much background information as possible from different sources to understand the problem from different angles, zooming out before zooming in. A good data scientist will spend a good portion of their time on this background work before jumping into a Jupyter notebook.

Thinking through the methodology

Once you’ve done your due diligence on background research, you are still not thinking about models, tools, or visualization techniques.

You are thinking about a methodological process that will guide you through answering the question. You lay out a list of the datasets needed, where to find them, and how to get hold of the data. Getting data from someone else’s computer to yours is more difficult than you think!

Having an overall process in your mind or written on paper helps a lot. It’s similar to an academic setting, where you write your research proposal before actually executing the research; things may change along the way, but you have the big picture in mind right from the start.

What tools are needed?

In this step, you are thinking about what tools can help answer the question. If it’s a forecasting problem, would a time series forecasting model work? Or is it a linear regression problem instead? Do you need GIS technology? Is there a good package in R or Python that you can rely upon?

Once you have explored the available options and decided on a particular set of tools, you are ready to go hunting for data. The data you need can be a multi-million-row dataset or may very well be a hundred data points, depending on your problem and the model you choose.

Have you found the required data for the chosen model? If yes, you are good to go build your model. But if you don’t have all the required inputs, you should stop here and go back to the methodology step. Maybe there are other tools or methods that don’t require time series data? How about a system dynamics model that does not require a large dataset?


If you haven’t heard it before, get ready to be shocked: modeling is the easiest part of the data science workflow. A common industry rule of thumb is that roughly 80% of the time spent on a project goes into data cleaning, feature engineering and similar preparation work.

Most of the model building and model testing process is pretty standardized. For example, if you are working on a classification problem:

  • first, you define your independent and dependent variables and then split the data into training, validation and testing sets
  • then you run a baseline model (e.g. Logistic Regression) so you can compare the performance of other models against that baseline
  • next, you run several other models (SVM, Decision Trees, Random Forests, Bagging, AdaBoost, XGBoost, as many as you wish), all with default hyperparameters
  • then you select the one or two best-performing models for hyperparameter tuning (can you guess why you are not tuning every model? Tuning is computationally expensive, so you narrow the field first)
  • finally, you run several versions of the selected model with different combinations of hyperparameters, chosen either manually or through a GridSearch process
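
The steps in this list can be sketched with scikit-learn roughly as follows. This is a minimal sketch under several assumptions: the data is synthetic (`make_classification`), accuracy is used as the metric, XGBoost is left out so the example depends on scikit-learn only, and the parameter grids are arbitrary small examples rather than recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for real data: X holds the independent variables, y the dependent one.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)

# Step 1: split into training (60%), validation (20%) and test (20%) sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0, stratify=y_tmp
)

# Step 2: baseline model to compare everything else against.
baseline = LogisticRegression().fit(X_train, y_train)
baseline_acc = baseline.score(X_val, y_val)

# Step 3: several other models with default hyperparameters
# (random_state is fixed only to make the run reproducible).
candidates = {
    "svm": SVC(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
    "bagging": BaggingClassifier(random_state=0),
    "adaboost": AdaBoostClassifier(random_state=0),
}
val_scores = {
    name: model.fit(X_train, y_train).score(X_val, y_val)
    for name, model in candidates.items()
}

# Step 4: keep only the best performer on the validation set for tuning.
best_name = max(val_scores, key=val_scores.get)

# Step 5: grid search over a small, illustrative grid for the selected model.
param_grids = {
    "svm": {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
    "decision_tree": {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]},
    "random_forest": {"n_estimators": [100, 300], "max_depth": [5, None]},
    "bagging": {"n_estimators": [10, 50]},
    "adaboost": {"n_estimators": [50, 200], "learning_rate": [0.5, 1.0]},
}
search = GridSearchCV(candidates[best_name], param_grids[best_name], cv=5)
search.fit(X_train, y_train)

# The held-out test set is touched once, at the very end.
test_acc = search.best_estimator_.score(X_test, y_test)

print(f"Baseline validation accuracy: {baseline_acc:.3f}")
print(f"Best candidate: {best_name} ({val_scores[best_name]:.3f} on validation)")
print(f"Tuned parameters: {search.best_params_}")
print(f"Test accuracy after tuning: {test_acc:.3f}")
```

Tuning only the top candidate keeps the grid search cheap: each extra model multiplies the number of fits by the size of its grid times the number of cross-validation folds.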

The process above, with some iteration, gives you the best-performing model you are looking for. You are now ready to hand your model over to the engineering team for production.

Production and post-production

You’ll probably use a Jupyter notebook or a similar environment for model experimentation, but that code is not what goes into production. Mid- to large-sized organizations will have separate engineering teams who are responsible for turning your code into more efficient, production-quality code and deploying it.

One important point to note here: the best-performing model according to your experimentation may not make it into production. The final choice depends on several factors, such as model explainability, complexity, and how difficult the codebase is to maintain.

