6 Things You Did Not Learn In Your Data Science Course



1) Software Architecture

There is a good reason why engineers hate Jupyter notebooks: they are the exact opposite of a “modular approach.”

Good software design upholds three basic principles: high cohesion, low coupling, and low redundancy. In other words, each module specializes in a single problem, modules are highly independent of one another, and there is little to no code duplication. For instance, the code that loads a dataset shouldn’t do anything else (like data cleaning), shouldn’t depend on any other module (such as a data augmentation module), and should be the only place in the codebase designed to load data.
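As a minimal sketch of these principles, the snippet below keeps loading and cleaning in separate functions. The function names, the `label` column, and the tiny inline CSV are assumptions made up for illustration; they are not from any particular project.

```python
import csv
import io


def load_rows(source):
    """Only loads: parse CSV text from a file-like object into dicts."""
    return list(csv.DictReader(source))


def clean_rows(rows):
    """Only cleans: strip whitespace from 'label' and drop rows without one."""
    cleaned = []
    for row in rows:
        label = (row.get("label") or "").strip()
        if label:
            cleaned.append({**row, "label": label})
    return cleaned


# Each step is usable on its own; the pipeline is just their composition.
raw = io.StringIO("text,label\nhello, greeting \nbye,\n")
rows = clean_rows(load_rows(raw))
print(rows)
```

Because neither function knows about the other, either one can be swapped, tested, or reused in another project without touching the rest.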

Most data science tutorials put everything in a single notebook, a massive no-no engineering-wise. All-in-one-file means the dataset download, cleaning, and preparation sit together with the code that serves and consumes the data. The resulting file has multiple intertwined responsibilities, and, likely, several cells were copied from other notebooks.

Software architecture is about the big picture: how each component relates to the others and how code is arranged in a project. There is no better way to learn it than thinking critically about all the code you touch daily. For example, consider how effortless it is to use several packages in Python, how you can mix and match PyTorch with NumPy, Pandas, SkLearn, etc. Each package solves its niche of problems, and, together, they solve yours.

Now, consider the functionalities you have coded recently. Did you code them for the first time, or are they recurring features that you end up recoding, more or less, in every project? If it is the latter, you might not be writing reusable code. Often, code ends up not being reused because it was designed too close to the problem or is too coupled to the complete solution.

As a rule of thumb, the more focused, independent, and unique each module is, the better the architecture. Maybe most of what you have written so far is too spaghetti for you to bother fixing. However, this doesn’t mean all future code you will come to write has to be that way too. It all starts with the right mindset: writing independent and reusable pieces.

2) UI Development

Developing interfaces is an entirely different business. Many of the challenges a React or Vue developer faces are utterly foreign to data professionals. Learning UI development will, first and foremost, improve your coding skills tremendously. However, the real deal is being able to code your own tools, and tools mean productivity.

The reality is most tools suck. The data labeling app you use was designed to label anything, not custom-made to label what you label. The plotting library you use was designed to support most plots, not your specific designs. The same goes for IDEs and programming languages. They are made for the general case, not your particular uses. So it is a no-brainer why we use so many VS Code extensions or why we use TensorFlow, not pure NumPy and C.

Knowing how to code UIs is not about making the next Excel or Tableau. Coding your own tools is about solving a very particular task really well. A job no mainstream tool would ever bother to implement. It can be a data exploration UI or a custom data labeler. I have had fantastic results over the years with custom data labeling solutions. Plus, you can bake your models as pre-labelers with little to no effort (you coded them already).

In this topic, consider how a what-you-see-is-what-you-get interface could help what you are doing right now. How complex would it be? Would it be that painful to code? How many hours of work could it save? Etc.

Sticking with Python, here is an excellent introduction to the available GUI packages. I recommend starting with Tkinter out of these. However, today’s real deal for UI development is learning a web framework, such as ReactJS or Vue. If you are into mobile, I suggest Flutter.

3) Software Engineering

No data science course I ever watched covers fundamental topics of software engineering. We are taught how to solve training and inference tasks but not how to code them properly. The lack of software engineering skills is the main reason data teams are often divided into scientists and (you guessed it) engineers. The former ships models, insights, analyses, etc. The latter assembles the pieces for production.

Earlier, I mentioned software architecture, which loosely means how complex software is split into smaller, more manageable pieces. However, software engineering is a much broader area, with software architecture being one of its subtopics. Informally, the term is often used in place of two of its most essential subtopics: how we can better write software and how we can better arrange software teams.

My favorite definition of engineering is “to optimize, to make better.” Software engineering, then, covers anything about how to create software better. This means making software faster, more maintainable, more secure, less buggy, etc., and also learning how to write it more quickly, split work among people, structure teams, etc. It is about the product and the process. Better software and better software making.

So why study that if we have a whole engineering team? Because almost all we do is software or is accomplished by software. No team has ever complained their code is too good. No team will ever complain that they are coding too fast or the code is too maintainable. The complaint is always about how terribly written something is or how slowly their progress has been. Not the other way around.

Where to begin? Best practices.

Follow your language style guide (PEP-8 for Python) and learn about general and big data design patterns. Learn how to unit test your code or how to parallelize it efficiently. You are (hopefully) aware of Git, but have you ever wondered about using an experiment tracker or an automated tuning system? Your team likely follows some form of Scrum or Kanban strategy, but have you ever read the original Agile Manifesto?
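To make the unit testing pointer concrete, here is a small sketch using Python's built-in `unittest` module. The `min_max_scale` helper is a hypothetical example function, chosen only because it is the kind of utility data code tends to contain.

```python
import unittest


def min_max_scale(values):
    """Scale numbers to the [0, 1] range; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


class MinMaxScaleTest(unittest.TestCase):
    def test_range(self):
        self.assertEqual(min_max_scale([2, 4, 6]), [0.0, 0.5, 1.0])

    def test_constant_input(self):
        # Edge case: avoid dividing by zero when all values are equal.
        self.assertEqual(min_max_scale([3, 3]), [0.0, 0.0])
```

Saved in a file, these tests can be run with `python -m unittest <file>`. The second test is the more valuable one: edge cases like constant input are where untested data code usually breaks.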

These are pointers. Each is a world in itself that deserves an entire article. However, if I am to choose one, read the Agile Manifesto. All four values and twelve principles. Being agile is not about being fast; it is about being flexible. About adapting to change. It is the opposite of rigid, stiff. This applies to everything in life. Adapt to things. Change what needs changing. This goes from training models to your entire career and the coffee you drink.

4) How Databases Work

Knowing SQL and being able to query data is one thing. Understanding indexes and how data is stored and fed to you by the database system is another. A big part of training (massive) models on (monumental) datasets is the input pipeline. If it takes longer to fetch a batch than to process it, you won’t use your GPU rig to its full potential. Database theory can teach you a lot about how to handle and service data at scale.

The point here is not to have data scientists do database work or replace database administrators. Instead, the goal is to learn the fundamental techniques behind servicing data. Mainly, this means how to hash, sort, cache, and page data (the four main avenues for software optimization) and how reordering operations can tremendously affect the running time.
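The four techniques can be sketched in a few lines of standard-library Python. Everything here (the `records` list, `load_record`, `pages`) is a toy stand-in invented for illustration, not how any specific database implements them.

```python
import bisect
from functools import lru_cache
from itertools import islice

records = [{"id": i, "label": i % 3} for i in range(10)]

# Hash: build an index once so lookups by id avoid scanning the whole list.
by_id = {r["id"]: r for r in records}

# Sort: sorted keys allow binary search instead of a linear scan.
sorted_ids = sorted(by_id)
position = bisect.bisect_left(sorted_ids, 7)


# Cache: remember the result of an expensive call (stand-in for a slow read).
@lru_cache(maxsize=128)
def load_record(record_id):
    return by_id[record_id]["label"]


# Page: hand out fixed-size chunks instead of materializing everything.
def pages(items, size):
    it = iter(items)
    while True:
        page = list(islice(it, size))
        if not page:
            return
        yield page


page_sizes = [len(p) for p in pages(records, 4)]
```

With ten records and a page size of four, the generator yields pages of four, four, and two items, so memory use stays bounded by the page size rather than by the dataset.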

The TensorFlow documentation has two fantastic articles on optimizing and analyzing input pipelines. They show how operations such as pre-fetching data, processing it in parallel, or vectorizing it can significantly enhance performance. Sometimes, it is just a matter of changing the order of operations, much like how databases rewrite queries for faster execution. Most of what both articles teach is equally applicable in other contexts, such as processing Pandas data frames.
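The idea behind pre-fetching does not depend on TensorFlow. The sketch below is a simplified, standard-library illustration (not the `tf.data` implementation): a background thread keeps a small buffer of batches ready while the consumer works. The `prefetch` and `slow_batches` names are made up for this example, and errors raised by the producer are not propagated to the consumer.

```python
import queue
import threading
import time


def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable` while a background thread loads ahead."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        try:
            for item in iterable:
                buffer.put(item)  # blocks while the buffer is full
        finally:
            buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


def slow_batches(n):
    for i in range(n):
        time.sleep(0.01)  # stand-in for disk or network latency
        yield [i] * 4


batches = list(prefetch(slow_batches(3)))
```

While the consumer processes one batch, the producer is already waiting on the next read, so fetch time and compute time overlap instead of adding up.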

As an example, a core principle of query rewriting is to filter data first. Say you need to service a 1TB image dataset to train a neural network. The dataset has 100 classes, but you only want to learn how to distinguish 10. You can either (1) load each image blindly and then check its label, or (2) work out which files belong to the classes you want and skip the others altogether. The obvious answer is (2), as it bypasses loading data you won’t use. In database terminology, the WHERE filter is applied as early as possible.
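Option (2) can be sketched as follows. The `load_image` function is a hypothetical stand-in for an expensive read, and the tiny in-memory `index` is an assumption; in practice the path and label pairs would come from a manifest or directory listing.

```python
def load_image(path):
    """Hypothetical stand-in for an expensive image read; counts its calls."""
    load_image.calls += 1
    return f"pixels of {path}"


load_image.calls = 0

# (path, label) pairs: cheap metadata that can be inspected without loading.
index = [("a.png", "cat"), ("b.png", "dog"), ("c.png", "car"), ("d.png", "cat")]
wanted = {"cat", "dog"}

# Filter on the metadata first, then load only what survives the filter.
selected = [(path, label) for path, label in index if label in wanted]
dataset = [(load_image(path), label) for path, label in selected]
```

Here only three of the four files are ever read. On a 100-class dataset where 10 classes are wanted, the same reordering would skip most of the reads, assuming the classes are roughly balanced.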

5) Basics of Design

We do ugly plots. It doesn’t hurt to read some “design for dummies” content. A few basic rules can teach you how to color your charts and size the text therein. Knowing a bit more about design can also help you be more creative in how you show data. A good visualization is worth a million images.

An excellent place to start is following some presets, such as using Seaborn, Matplotlib’s colormaps, or Plotly’s themes. Then, you can further improve your understanding of colors and how to find good color combinations. However, the most important thing you should learn is how to compose a visual hierarchy. A good plot guides the reader’s eyes towards what the reader needs to know or realize first. If it doesn’t, it will look confusing and intimidating.

Nonetheless, no matter how you design your visualizations, always consider that plots should be self-explanatory. Never rely on a text that is not within the plot itself to clarify what the plot shows.

6) Machine Learning Besides Neural Networks

There is a whole world of ML algorithms that many data scientists are unaware of. While linear/logistic regression and neural networks are popular, many problems could be solved much more quickly by simple SVMs under an RBF kernel or an XGBoost model. Traditional decision trees are often good enough on tabular data and have the additional benefit of being completely explainable. We shouldn’t be ignoring the past approaches as much. There are many gems out there.

While most of you are familiar with these algorithms, the growing dominance of neural networks has consistently reduced the time devoted to classic techniques. As a result, many professionals today are far more familiar with Transformers and ResNets than good old Scikit-Learn.

In this spirit, I highlight three models all data professionals should be aware of and know how to use:

  1. Decision Trees: trees are one of the simplest classifiers available. Yet, they are one of the most powerful tools you can have. Their power is being incredibly explainable. You can easily print how your decision tree thinks. This is not something most models can do. Their limitation is not being very suited to complex problems. You can read more here.
  2. Boosters: the most famous is XGBoost. These are decision trees on steroids. They can still explain themselves, but they trade a lot of simplicity for classification power. When working with massive tabular datasets, boosters are some of the most efficient methods available. Some alternatives to XGBoost are LightGBM and CatBoost.
  3. Support Vector Machines (SVMs): based on a beautiful mathematical formulation, SVMs are among the most successful models ever devised. They shine at complex problems with small datasets but lots of features. While other solutions struggle at learning under these harsh circumstances, SVMs thrive. Here are the SkLearn docs on SVMs.
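To illustrate the claim that a decision tree can print how it thinks, here is a short sketch using scikit-learn's `export_text` on the bundled Iris dataset. The dataset choice and the `max_depth=2` setting are assumptions picked to keep the printed rules short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short enough to read at a glance.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# export_text renders the learned splits as nested, human-readable rules.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

The output is a set of indented threshold rules on the input features, which can be read top to bottom like a flowchart.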

An honorable mention goes to Naive Bayes. These models are relatively fast and straightforward but can often solve problems with many features quite well. In addition, they were pretty popular in the Natural Language Processing (NLP) community for accounting for each word in a document quite effectively.


