Top 4 things you need to know to get started in Data Science — 2021

A complete overview of the most essential components of data science and how to master them

Photo by Lukas Blazek on Unsplash

If you are reading this article, you are probably hoping to become a great data scientist and just don’t know where to start. Although I am not an industry expert, I have been carving out my own data science path for over two years now: nine-month-long research projects at University College London, real-life AI projects at a hospital, multiple Kaggle competitions, and finally writing data science articles here. And I think it’s often better to get advice from someone who isn’t too far down the line, someone who went through these challenges recently and can give you tips to get through them faster.

In this article, I will do my best to lay out the most essential areas you need to focus on in order to start diving deep into Data Science, along with the best learning resources for each.

1. Essential theoretical knowledge of statistics and calculus

I think you kind of expected this to be the first one, but before you skip to the next section (or another article entirely), let me tell you why. An okay data scientist learns how to use a bunch of tools like Power BI, Scikit-learn, etc. That is fine for building baseline models, but you will eventually find that it’s not enough and you need to improve your model. This brings us to reading ML research papers, and trust me on this: you will not understand most ML papers if you don’t understand essential statistics. And if you can’t understand the papers, you probably won’t be able to implement or improve on them, which is a big issue.

I remember struggling to understand ML papers at university; it used to take me a few days, if not weeks, to fully grasp one. That all changed when I spent a few weeks learning the fundamentals of statistics and calculus. Now I can digest those papers in an hour or two. If you haven’t been through this yourself, you would not believe how heavily those papers rely on these foundations.

One very important note I want to stress here: I am not asking you to be an expert in these foundations. What most people struggled with in high school was being good enough at math and statistics to get through an exam. You don’t need that here; you just need to understand the foundations well enough to digest the research papers. Understanding them is much easier than actually being good at solving theoretical math problems (a good skill to have, but a hard one to acquire).

Khan Academy is an excellent place to start. You can check out their algebra course here and their statistics course here.
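To give a taste of how these foundations show up in ML papers, here is a minimal, hypothetical sketch: the gradient descent updates you see in nearly every paper come straight from basic calculus. Below, the analytical derivative of a toy squared-error loss is checked against the finite-difference definition of a derivative (the toy point x=2, y=10 is made up for illustration):

```python
def loss(w):
    # Squared error of a one-parameter model y = w * x on a toy point (x=2, y=10)
    x, y = 2.0, 10.0
    return (w * x - y) ** 2

def analytical_grad(w):
    # Calculus: d/dw (w*x - y)^2 = 2 * (w*x - y) * x
    x, y = 2.0, 10.0
    return 2 * (w * x - y) * x

def numerical_grad(w, eps=1e-6):
    # Finite-difference approximation straight from the definition of a derivative
    return (loss(w + eps) - loss(w - eps)) / (2 * eps)

print(analytical_grad(3.0), numerical_grad(3.0))  # both ≈ -16.0
```

If you can follow why those two numbers agree, you already have the kind of foundation that makes the math in ML papers readable.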

2. Essential programming basics

Okay, you have got your math and stats knowledge; now it’s time to move into something more practical and hands-on. A lot of people get into data science from non-technical backgrounds (which is actually quite impressive). Believe me when I tell you this: the worst way to learn programming is to keep watching courses endlessly. I know there are tons of articles and videos about learning to program, and I don’t want this to be just another duplicate, but I do want to give you the most important tips to save you a lot of time.

When I was learning the programming basics, I used to watch tons of tutorials, which is somewhat useful. But a lot of people (including me) think that watching those tutorials is actually improving our skills as programmers. It’s not! It only tells you how to do something; you will never learn until you do it yourself. Although this seems straightforward, it’s much harder to do than to say. So, for the sake of brevity, here is my advice:

For every few tutorials you watch or articles you read, make sure you implement at least one of them. If you aren’t doing this, you are wasting your time.

If you don’t believe me, feel free to check out the Traversy Media and freeCodeCamp articles that affirm this concept. A lot of programmers realize this, but usually a bit later than they should.

I am not going to point you to a course. Instead, I am going to point you to one of the best places to improve your programming skills and, more importantly, your problem-solving skills. This is the advice I wish I had received at university, because programming languages change all the time; problem-solving skills don’t. And when you actually start applying to jobs, especially at FAANG companies, a decent interviewer will be examining your problem-solving skills, not your syntax accuracy.

Start by integrating at least 2–3 hours of easy HackerRank or LeetCode problems into your schedule every week. If you are struggling, watch some tutorials, but approach the problems first (not the other way around).
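To make this concrete, here is the kind of easy problem you would start with (a classic "two sum": find the indices of two numbers that add up to a target). The point isn’t the code itself but the problem-solving step behind it, trading a brute-force double loop for a hash map:

```python
def two_sum(nums, target):
    # Return indices of the two numbers that add up to target.
    # A hash map gives O(n) time instead of the O(n^2) brute force.
    seen = {}  # value -> index
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i
    return []

print(two_sum([2, 7, 11, 15], 9))  # [0, 1]
```

Try to reach this kind of insight on your own before looking at solutions; that struggle is exactly what interviewers are probing for.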

3. Experience, experience, experience

Photo by Jo Szczepanska on Unsplash

At this point, you know your theory and you have good programming and problem-solving skills, so you are ready to start gaining data science experience. The best way to do this is to start developing end-to-end data science projects. From my experience, the best projects have at least a few of these components:

  1. Data gathering, filtering, and engineering: This can be as simple as an online search or as complex as building a web-scraping server that aggregates certain websites and saves the required data into a database. This is actually the most significant stage, because if you don’t have data, you don’t have a data science project! It is also the reason why a lot of AI startups fail. Once I realized this, it was quite eye-opening for me, even though it’s kind of obvious!

Model training is only the tip of the iceberg. What most users and AI/ML companies overlook is the massive hidden cost of acquiring appropriate datasets, cleaning, storing, aggregating, labeling, and building reliable data flow as well as infrastructure pipeline.

Source: The Single Biggest Reason Why AI/ML Companies Fail to Scale?
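A data-gathering stage like the one described above can be sketched as a tiny pipeline: fetch a page, extract what you need, persist it to a database. The snippet below is a minimal sketch using only the standard library; the `<h2 class="title">` markup and the example URL are made-up assumptions, so adapt the parser to whatever site you actually scrape (and respect its robots.txt):

```python
import sqlite3
import urllib.request
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside hypothetical <h2 class="title"> tags."""
    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles.append(data.strip())

def extract_titles(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.titles

def save_titles(titles, db_path):
    # Persist the scraped titles so later pipeline stages can query them
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS articles (title TEXT)")
    con.executemany("INSERT INTO articles VALUES (?)", [(t,) for t in titles])
    con.commit()
    con.close()

def fetch_and_store(url, db_path):
    # Network step kept in its own function; call it with a real site, e.g.
    # fetch_and_store("https://example.com", "articles.db")
    html = urllib.request.urlopen(url).read().decode()
    save_titles(extract_titles(html), db_path)

sample = '<h2 class="title">First post</h2><p>…</p><h2 class="title">Second post</h2>'
print(extract_titles(sample))  # ['First post', 'Second post']
```

In a real project you would add rate limiting, error handling, and a scheduler, but even this skeleton shows why the data stage is its own engineering effort.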

2. Model training (this is obvious)
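Obvious or not, it helps to have seen a training loop stripped to its bones at least once. Here is a minimal sketch, in pure Python on made-up toy data, of logistic regression trained with plain stochastic gradient descent:

```python
import math

def train_logistic(xs, ys, lr=0.1, epochs=500):
    # One-feature logistic regression trained with per-sample gradient descent
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(w * x + b)))  # sigmoid prediction
            # Gradient of the log-loss with respect to w and b
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

def predict(w, b, x):
    return 1 if 1 / (1 + math.exp(-(w * x + b))) >= 0.5 else 0

# Toy, linearly separable data: negatives below 0, positives above 0
xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)
print([predict(w, b, x) for x in xs])  # [0, 0, 0, 1, 1, 1]
```

In practice you would reach for a library like Scikit-learn, but knowing what `fit()` is doing underneath pays off when things go wrong.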

3. Gathering metrics & exploring model interpretability: One of the biggest mistakes I made during my first few ML projects was not giving this point the attention it deserves. I was extremely eager to learn, so I kept jumping from model to model too quickly. Don’t do this: when you train a model, fully evaluate it, explore its hyperparameters, check out interpretability techniques (such as DeepDream for CNNs), and, more importantly, figure out why it works well and why it doesn’t.
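"Fully evaluate it" starts with looking past plain accuracy. As a minimal sketch on made-up predictions, here are the confusion-matrix counts and the basic binary-classification metrics computed by hand:

```python
def evaluate(y_true, y_pred):
    # Confusion-matrix counts for a binary classifier
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,  # how trustworthy are positive calls
        "recall": tp / (tp + fn) if tp + fn else 0.0,      # how many positives did we catch
    }

metrics = evaluate([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1])
print(metrics)  # accuracy ≈ 0.667, precision ≈ 0.667, recall ≈ 0.667
```

Sitting with numbers like these for each model, instead of hopping to the next architecture, is where the actual learning happens.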

One of the best places to learn these concepts (except data gathering) is Kaggle, I can’t stress enough how much you will learn from doing a few Kaggle competitions.

4. Model deployment & data storage

This is a very important step that a lot of people skip. You will need basic web development skills at this point. You don’t have to build a complete app around your model, but at least try to deploy it as a Heroku web app; you will learn so much.
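Deployment can be as small as wrapping your model in a web endpoint. Here is a minimal sketch using only the standard library’s WSGI support (the kind of `app` callable a server like gunicorn would serve on Heroku); the hard-coded weight and bias stand in for a real trained model loaded from disk:

```python
import json
from urllib.parse import parse_qs

# Stand-in for a trained model loaded from disk (hypothetical weights)
WEIGHT, BIAS = 2.0, -1.0

def app(environ, start_response):
    # Read "x" from the query string and return the model's score as JSON
    params = parse_qs(environ.get("QUERY_STRING", ""))
    x = float(params.get("x", ["0"])[0])
    body = json.dumps({"prediction": WEIGHT * x + BIAS}).encode()
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]

# To serve locally: wsgiref.simple_server.make_server("", 8000, app).serve_forever()
# On Heroku, you would point gunicorn at this `app` callable instead.
demo = app({"QUERY_STRING": "x=3"}, lambda status, headers: None)
print(json.loads(demo[0]))  # {'prediction': 5.0}
```

Even a toy endpoint like this forces you to think about inputs, serialization, and error cases, which is exactly the experience this step is meant to give you.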

A central piece of your data science project is selecting the correct data storage framework. Keep in mind that your production model will be constantly using and updating this data; if you choose the wrong storage framework, your whole app will suffer in quality and performance.

One of the fastest-growing storage frameworks is data lakes.

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics — from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.

Source: Amazon

Data lakes are widely used by top companies today to manage the insane amount of data being generated. If you are interested, I suggest checking out the talk by Raji Easwaran, a manager at Microsoft Azure, on “Lessons Learned from Operating an Exabyte Scale Data Lake at Microsoft.”

There are also frameworks, such as Apache Spark, that operate on data lakes and ease the consumption of data by machine learning models. I used to think that adding these layers wasn’t worth it, but separating these groups of operations into different layers saves you tons of time debugging your models in the long run. This layering is actually the backbone of most high-quality web applications and software projects.

Final Thoughts

The biggest misconception I had going into data science was that it’s all about model fitting and data engineering. Although that is of course an important part, it’s not the most difficult or significant one. Multiple other factors (as discussed above) come into play when getting into data science and developing high-quality ML projects.


Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot
