I’ve officially accepted a role as a Data Scientist at Facebook. I wrote my first line of code nearly four years ago to the day, and I had never taken calculus or linear algebra in undergrad. So my work was cut out for me…
A solid foundation is super important, so here are the resources that helped me the most.
Part 1: Resources
Start with Automate the Boring Stuff with Python. The emphasis here is on automating boring stuff. There’s something magical about web scraping as a (previously) non-coder. The best thing you can do for yourself is get hooked on Python early on!
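To give a flavor of the kind of thing ATBS teaches, here’s a minimal sketch (my own example, not from the book) of automating a tedious task: pulling phone numbers out of a blob of text with a regular expression instead of scanning it by eye.

```python
import re

# Hypothetical text you might have copied out of an email or document.
text = "Call Alice at 415-555-1234 or Bob at 650-555-9999 by Friday."

# Match the pattern ###-###-#### anywhere in the text.
phone_pattern = re.compile(r"\d{3}-\d{3}-\d{4}")
numbers = phone_pattern.findall(text)
print(numbers)  # ['415-555-1234', '650-555-9999']
```

Ten lines of Python replacing a chore — that’s the hook.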
After you’ve gotten 5–10 chapters’ worth of ATBS under your belt, it’s worth having a more detailed view of how Python is structured as a programming language. Corey Schafer has resources on virtually everything — leave Flask and Django alone for now! What you need now is Object-Oriented Programming (OOP), generators, decorators, etc. We need to thoroughly understand Python’s built-in data structures before building our own!
Data Structures and Algorithms
Back to Back SWE discusses every algorithm in terms of pseudo-code (read: not Python!). The emphasis is on understanding the data structures and algorithms (DS&A) themselves, not memorizing code. It’s a good idea to get familiar with recursion, tree structures, and dynamic programming — these ideas pop up all over the place, such as in reinforcement learning, hidden Markov models, etc.
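Two of those ideas fit in a few lines each. This is my own sketch: Fibonacci done recursively with memoization (the essence of dynamic programming — cache the subproblems), and the depth of a tree represented as nested tuples.

```python
from functools import lru_cache

# Naive recursive Fibonacci is exponential; caching subproblem
# results (dynamic programming) makes it linear.
@lru_cache(maxsize=None)
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

# A binary tree as nested tuples: (value, left_subtree, right_subtree).
def depth(node):
    if node is None:
        return 0
    _, left, right = node
    return 1 + max(depth(left), depth(right))

tree = (1, (2, None, None), (3, (4, None, None), None))
print(fib(30))      # 832040
print(depth(tree))  # 3
```

The recursive structure, not the Python syntax, is the part worth internalizing.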
To test your understanding, try solving problems on LeetCode. This will help you get ready for coding interviews. The site is geared towards software engineers, not data scientists; you don’t need to bother with the “hard” questions. But handling virtually every “easy” question and a moderate sampling of “medium” questions will make the coding portion of a data scientist interview manageable.
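To calibrate what an “easy” question looks like, here’s the classic Two Sum problem solved with a one-pass hash map — a pattern that recurs constantly at this level.

```python
# Return the indices of the two numbers that add up to target.
def two_sum(nums, target):
    seen = {}  # value -> index of where we saw it
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i
    return []

print(two_sum([2, 7, 11, 15], 9))  # [0, 1]
```

If the hash-map trick (trading memory for a single pass) feels obvious, the “easy” tier is within reach.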
Probability Theory & Statistics
Most people begin their journeys into statistics from the well-established, de facto approach: Frequentism. This is where p-values, confidence intervals, and ANOVA come from. The algorithms are computationally simple, but the assumptions required of the analyst are very nuanced. For a probability and statistics novice, this lends itself to using a given procedure out of context and/or misinterpreting its results. Frequentism is extremely common in the workplace, so you have to learn about it at some point. But I believe it’s important to build intuitions early. For this reason, I recommend starting with Bayesian statistics. The algorithms are computationally more complex; however, the interpretation of results is much more straightforward. This is ideal, as the computations can be automated and abstracted away by software, but the intuitions cannot.
To this end, start with Think Bayes. The author implements all algorithms from scratch and starts with simple intuitions; this is ideal, as it will help you get your hands dirty with probability theory, which is gold! You need not finish the whole book, but it’s important to get a feel for the terms prior, likelihood, evidence, and posterior.
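All four terms show up in one tiny computation. Here’s my own grid-approximation sketch (in the spirit of Think Bayes, not the book’s code) for the bias of a coin after observing 7 heads in 10 flips:

```python
# Candidate biases p on a grid, with a uniform prior.
grid = [i / 100 for i in range(101)]
prior = [1 / len(grid)] * len(grid)

heads, tails = 7, 3  # observed data

# Likelihood of the data under each candidate bias.
likelihood = [p**heads * (1 - p)**tails for p in grid]

# posterior = prior * likelihood / evidence
unnorm = [pr * lk for pr, lk in zip(prior, likelihood)]
evidence = sum(unnorm)
posterior = [u / evidence for u in unnorm]

map_estimate = grid[posterior.index(max(posterior))]
print(map_estimate)  # 0.7 -- the posterior mode matches 7 heads in 10 flips
```

Prior times likelihood, normalized by the evidence, gives the posterior — the whole Bayesian workflow in miniature.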
After building some basic intuitions, it’s time to gain experience with something more robust and practical in real-world settings. Statistical Rethinking is the best book on probability and (Bayesian) statistics in my opinion. The textbook isn’t free — but the lectures are! The code was originally written in R and Stan. However, with our growing knowledge of Python, it’s better to stay in the Python ecosystem, hence I’ve linked the PyMC3 port of the textbook’s code. PyMC3 is a “probabilistic programming language” — in other words, you specify networks of probability distributions with various observed and/or latent variables, and infer how the system behaves as a function of its parameters.
Lastly, you do need to know some amount of Frequentist statistics. To this end, I recommend StatQuest — as it will cover topics as intuitively as possible, while emphasizing the assumptions made.
Calculus and Linear Algebra
Much of statistics is built on calculus and linear algebra. For example, the normalization of the multivariate Gaussian distribution involves the determinant of its covariance matrix. What does that mean?? And parameters are typically estimated (by Frequentists!) using Maximum Likelihood Estimation — in other words, derivatives. The best resource here for high-level intuitions is 3Blue1Brown. If you want some hands-on experience working with differentiation and integration, I recommend the Organic Chemistry Tutor. He doesn’t have a playlist on Linear Algebra, so I’ve linked PatrickJMT, who does.
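To make the MLE-and-derivatives connection concrete, here’s a numerical sketch of my own: for Gaussian data with known variance, setting the derivative of the log-likelihood to zero gives the sample mean, and a crude grid search over the log-likelihood finds exactly that.

```python
import math

data = [2.1, 1.9, 2.4, 2.0, 2.6]  # made-up observations

def log_likelihood(mu, sigma=1.0):
    """Gaussian log-likelihood of the data as a function of the mean."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (x - mu)**2 / (2 * sigma**2) for x in data)

# Search mu over [1.0, 3.0] for the maximum-likelihood estimate.
candidates = [i / 1000 for i in range(1000, 3001)]
mu_hat = max(candidates, key=log_likelihood)

print(mu_hat)                 # 2.2
print(sum(data) / len(data))  # the sample mean, also 2.2
```

Calculus tells you in one line what the brute-force search confirms: the likelihood peaks at the sample mean.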
It might be tempting to jump right into Pandas. We’re almost there! But first, it’s important to get a good handle on SQL. Web Dev Simplified offers a great 1-hr crash course on SQL. What’s important is that you master the basics (SELECT, FROM, WHERE, GROUP BY, HAVING, and JOINs) and then get a handle on correlated subqueries. After this, it’s really a matter of “practice makes perfect!” Refer to LeetCode (introduced previously) to test your SQL mastery. Again, target “easy” and “medium” questions.
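You can practice those basics without installing anything, using Python’s built-in sqlite3 module. Here’s a small sketch of mine with a made-up orders table exercising SELECT, WHERE, GROUP BY, HAVING, and ORDER BY in one query:

```python
import sqlite3

# An in-memory database with a toy orders table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('alice', 30.0), ('alice', 45.0), ('bob', 10.0), ('bob', 5.0);
""")

# Total spend per customer, keeping only customers over 20.
rows = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 0
    GROUP BY customer
    HAVING SUM(amount) > 20
    ORDER BY total DESC
""").fetchall()

print(rows)  # [('alice', 75.0)]
```

Note the order of operations: WHERE filters rows before grouping, HAVING filters groups after — a distinction interviewers love.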
After understanding the SQL approach to Data Wrangling, you’re ready for Pandas. Again, refer to Corey Schafer for help here. Like SQL, the real litmus test is handling practice problems. I recommend StrataScratch, which is sort of the LeetCode of data wrangling interview questions.
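Once the SQL version of an aggregation is second nature, translating it into Pandas is mechanical. Here’s my own sketch of the same GROUP BY / HAVING pattern on a made-up orders table:

```python
import pandas as pd

orders = pd.DataFrame({
    "customer": ["alice", "alice", "bob", "bob"],
    "amount": [30.0, 45.0, 10.0, 5.0],
})

totals = (orders[orders["amount"] > 0]        # WHERE
          .groupby("customer")["amount"]      # GROUP BY
          .sum()                              # SUM(amount)
          .loc[lambda s: s > 20]              # HAVING
          .sort_values(ascending=False))      # ORDER BY ... DESC

print(totals.to_dict())  # {'alice': 75.0}
```

Thinking of each Pandas method as a SQL clause is a useful crutch until the idioms become natural on their own.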
If you’ve followed my recommendations on sequencing your learning, then you must’ve noticed that I’ve saved Machine Learning (ML) for the end. Too many aspiring data scientists jump directly into ML without a firm foundation on everything else required to properly motivate it. StatQuest (introduced previously) has good content on the high level intuitions regarding ML. However, the best way to master an ML algorithm is to implement it yourself. With your newfound knowledge of OOP, calculus, linear algebra and probability theory — believe it or not — you’re actually ready for this! To this end, I highly recommend Sentdex.
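As a taste of what “implement it yourself” looks like, here’s a from-scratch sketch of mine: simple linear regression fit by gradient descent on mean squared error, using nothing but the calculus covered above.

```python
# Made-up data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.1, 5.9, 8.2, 9.9]

w, b, lr = 0.0, 0.0, 0.01  # weight, bias, learning rate
n = len(xs)
for _ in range(5000):
    # Partial derivatives of MSE with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # slope near 2, intercept near 0
```

Every ingredient — the loss, the derivatives, the update rule — is something you’ve already built intuition for; the ML algorithm is just their composition.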
Product Sense is your ability to take ambiguous product problems, formulate assumptions, identify consequential metrics (aka Key Performance Indicators — KPI), design experiments, and interpret results. Most data scientists aren’t going to be tasked with taking a given ML model and improving its accuracy from 90% to 95%. More likely, a data scientist will be given a problem by a stakeholder and need to structure an analysis from the ground up. To this end, I highly recommend Data Interview Pro.
Not every job is centered on “products” in the technology sense, like Instagram Stories or Google Search. Some products really are tangible goods. To this end, a lighthearted view of economics will be helpful. Crash Course has a great playlist on all things economics. No need to watch beyond microeconomics, unless of course, you’re curious.
Part 2: General Tips
Here are some general tips for your study efforts.
It’s key to lay solid foundations; the material data scientists are expected to know is pretty vast. If you jump into ML directly, you’ll be referred back to the fundamentals continuously, which is a poor way to structure your learning. Better to build a proper foundation on the fundamentals so that a reference to gradient descent doesn’t take you on an unexpected journey through differentiation.
Similarly, master SQL and Pandas. Nobody will be impressed that you have a handle on PySpark if basic SQL syntax is unclear to you…
Integration is a “nice to have”
Differentiation is a must! But integration is tedious and troublesome. The bottom line is that integration is reverse differentiation, and often a tractable (solvable) solution doesn’t exist. Integration comes up frequently in probability theory (generally in proofs, such as deriving the Poisson PMF from the Binomial PMF). However, as you learned from Think Bayes and Statistical Rethinking, we frequently approximate integration through simulation. It’s better to get a proper handle on how Monte Carlo methods (and especially Markov Chain Monte Carlo methods) work. The resources I’ve outlined will get you there!
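The core Monte Carlo idea fits in a few lines. Here’s a minimal sketch of mine: estimate the integral of x² over [0, 1] (which calculus says is exactly 1/3) by averaging the function at uniform random points.

```python
import random

random.seed(0)  # reproducible draws

# E[f(X)] for X ~ Uniform(0, 1) equals the integral of f over [0, 1],
# so a sample average approximates the integral.
n = 100_000
estimate = sum(random.random() ** 2 for _ in range(n)) / n

print(estimate)  # close to 1/3
```

No antiderivative required — just sampling and averaging, which is exactly why simulation can stand in for intractable integrals.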
Forget Deep Learning
Deep Learning (DL) is on fire right now! But a little-known secret — the supply of data scientists who can use DL outnumbers the demand for such data scientists. Another little-known secret — most data scientists don’t use DL on the job whatsoever. A third little-known secret — domain knowledge can compensate for a lack of technical know-how. If you know how a system works, you have good prior beliefs on how to structure your experiments and analysis therein. You don’t need to use a hyper-flexible ML model if you know how to engineer features that are truly consequential. Focus on using simpler statistical and ML models to answer questions, and don’t shy away from reducing the search space via domain knowledge.
Automation is Overrated
Every 2–3 years Google, Facebook, and Amazon make headlines with new advancements in Computer Vision (CV), Natural Language Processing (NLP), or Reinforcement Learning (RL). All three have one major theme in common — they’re black-box models with no tangible interpretation. In plain English — the model is useless if it’s not actively sucking electricity and using computational resources. Linear regression will spit out parameters; you can use these to make business decisions whether or not your model is loaded into memory. Not so for CV, NLP, and RL. These models are best suited for automating tasks which humans can typically do (translating English to Italian, recognizing faces, driving a car).
There will be some DS roles that say, “design an algorithm that dynamically prices products based on real-time demand.” But those jobs are few and far between. Most jobs will ask you to “optimize a fixed price of a product for all consumers.” Say you work for a grocery store: you’ve used logistic regression to estimate the demand curve, and differentiation to determine the optimal price of a gallon of milk. Now say you expect a thousand customers today and you have 500 gallons, which need to be sold or thrown out. How much should you lower the price such that you expect every gallon to be sold?
There’s no reason to design a reinforcement learning model to answer this question! You already have the demand curve, via logistic regression (a slope and an intercept fed into a sigmoid function.) Say the current price corresponds to 40% of customers purchasing milk. 40% of a thousand customers is 400 customers. If 400 customers buy 400 gallons of milk, then you expect that you’ll have to throw out 100 units. However, if you simply search the demand curve for the price corresponding to 50% chance of purchase and set the price of milk to this value, then you’d expect all 500 units to be sold!
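The whole exercise is a few lines of arithmetic once you have the fitted curve. Here’s a sketch of mine — the slope and intercept are made-up stand-ins for whatever logistic regression actually estimated:

```python
import math

# Hypothetical fitted parameters: P(buy | price) = sigmoid(intercept + slope * price).
intercept, slope = 4.0, -1.0

def purchase_prob(price):
    return 1 / (1 + math.exp(-(intercept + slope * price)))

def price_for_prob(p):
    """Invert the sigmoid: the price at which purchase probability is exactly p."""
    return (math.log(p / (1 - p)) - intercept) / slope

current_price = price_for_prob(0.40)  # today's price: 40% of customers buy
target_price = price_for_prob(0.50)   # price at which 50% would buy

print(round(current_price, 2))                    # 4.41
print(round(target_price, 2))                     # 4.0
print(round(1000 * purchase_prob(target_price)))  # 500 expected gallons sold
```

Invert the demand curve at the probability you need, and the “optimization” is done — no reinforcement learning in sight.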
Automation is powerful but overrated in terms of applicability to the average data scientist job. Using domain knowledge and parametric models is severely underrated — I hope this example convinced you!
Subscribe if you like my content 🙂