AcquireX Sports Methodology
By Michael Kralis, Chief Data Scientist
Sports betting has become a sensation and is on pace for record revenue. Its rapid growth in popularity puts one problem at the center of the business: measuring probability. This is where the field of data science and the business of sports betting collide, because the tools of data science are what allow us to measure these probabilities. That raises the question: what type of data can we use to measure them? The answer is far from easy, and it is just as hard to test whether the answer we come up with is the best solution possible.
In most sports, ratings are used to assess a team’s strength and can be a strong indicator of win probability, but win probability is not the only thing people bet on. Bettors can also bet on whether the sum of the scores (the total) or the difference between them (the spread) lands over or under a certain threshold. In these cases, more data is sometimes needed. Take football, for example. Most football games are played in the elements, which means that access to weather data can be crucial in predicting the sum of scores. Historically, games played in harsh weather tend to be lower scoring, which can also mean that the absolute difference between the two scores is smaller. However, just having the data does not ensure immediate results. There is still the technical matter of using this data to generate accurate and reliable probabilities. This is what we aim to do here at AcquireX.
Inputs to Models
Knowing which data to use is no small feat and, most of the time, is a process of trial and error. A good starting point is ratings. For most sports, there are mainstream ways to measure a team’s defensive and offensive strength separately. In football, there is the FPI (Football Power Index), and in basketball, there are Offensive and Defensive ratings. Usually, these are solid benchmarks to use for predictive analytics. However, these mainstream ratings can sometimes give a misleading assessment of a team’s current strength or scoring output.
Sometimes these ratings judge a team’s performance over the whole season and are not adjusted for current form. A key player may have just gone down with an injury, or a previously injured superstar may be returning after a lengthy absence. Would judging that team on this season’s performance be an accurate measure? Most likely not. Instead, ratings need to be constructed for each player. These allow us to assess not just the team as a whole, but the individuals actually participating in the game. This way, the assessment of the team’s offensive and defensive strength reflects the lineup that will take the field. But with all this data, how exactly can we use it to measure the probabilities of certain events occurring?
The Dimensionality Curse
To capitalize on the data accumulated, we must develop and test statistical models for measuring probabilities. However, this is where the curse of dimensionality can play a large role. The curse of dimensionality refers to the many problems that arise when analyzing high-dimensional spaces and that do not occur in low-dimensional settings such as our normal three-dimensional one. These issues typically show up in machine learning, data mining, combinatorics and other areas of numerical analysis.
In our case, the problem is having a plethora of intuitively useful data, which can lead to an overwhelming number of factors for our model. If too many factors are used as inputs to train a model, we risk overfitting to past data. This does not necessarily mean that we cannot use many inputs. With the help of dimension reduction techniques like PCA or t-SNE, or even clustering algorithms that group similar inputs into categories, we can reduce our model input size dramatically. Once we know what our input sets are, we need to select models for predicting win probabilities, over/under probabilities, spread probabilities, and so on. This also requires some trial and error, but candidate models can be tested until one reaches an acceptable level of significance.
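As a rough illustration of the dimension reduction step, here is a minimal PCA sketch built on NumPy's singular value decomposition. The helper name `pca_reduce` and the synthetic feature table are assumptions made for this example and are not AcquireX's actual inputs or pipeline.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components.

    Returns the reduced data and the share of total variance
    explained by each kept component.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)            # PCA requires centered columns
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]           # rows are principal directions
    variance_share = singular_values**2 / np.sum(singular_values**2)
    return centered @ components.T, variance_share[:n_components]

# Synthetic stand-in for a feature table (hypothetical data):
# five columns that are really driven by only two hidden factors.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 2.0, -1.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 1.0, 0.5]])
features = hidden @ mixing + 0.01 * rng.normal(size=(200, 5))

reduced, explained = pca_reduce(features, n_components=2)
print(reduced.shape, explained.sum())
```

In this toy case two components capture almost all of the variance, so the model could be trained on two inputs instead of five. With real ratings and weather data, the number of components to keep would itself be a matter of testing.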
There are many factors to consider when determining an acceptable model for any job. If we are looking for score projections, for example, then our model would have to include some form of regression, since the dependent variable is best modeled as a real number. Popular techniques include regularized linear regressions such as Lasso and Ridge, whose penalty terms help guard against overfitting. Regressions can certainly be useful, but a better strategy for modeling sports bets is to use classification algorithms. The reason is that in sports betting, the bettor does not care whether they crushed the bet or won it by a small margin. Either outcome is a success. This means that we can model a bet directly by the different outcomes it can have. In some cases there are only two outcomes (win or lose), while in other situations there could be three, such as when a bet can also tie.
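To make the classification framing concrete, here is a small sketch of how final scores could be turned into class labels for a model. The function names, the label strings, and the treatment of an exact tie against the line as a third "push" class are assumptions made for this example.

```python
def spread_outcome(home_score, away_score, home_spread):
    """Label a bet on the home team against the spread.

    home_spread is the handicap added to the home score,
    e.g. -3.5 when the home team is a 3.5-point favorite.
    """
    margin = home_score + home_spread - away_score
    if margin > 0:
        return "win"
    if margin < 0:
        return "loss"
    return "push"   # landed exactly on the line: the third outcome


def total_outcome(home_score, away_score, line):
    """Label the combined score relative to an over/under line."""
    total = home_score + away_score
    if total > line:
        return "over"
    if total < line:
        return "under"
    return "push"


print(spread_outcome(24, 21, -3.5))  # home won by 3 but did not cover 3.5
print(spread_outcome(24, 21, -3))    # margin lands exactly on the line
print(total_outcome(24, 21, 44.5))   # 45 combined points against a 44.5 line
```

A half-point line can never produce a push, which gives the two-outcome case; a whole-number line gives the three-outcome case. Note that the size of the margin is discarded, which is exactly the point made above.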
However, there are many different classification techniques to choose from. In our testing, the models that typically give the best results when ratings are used as inputs are logistic regressions. A logistic regression is monotonic in each independent variable: holding the other inputs fixed, increasing one input always moves the predicted probability in the same direction, determined by the sign of that input’s coefficient. This makes intuitive sense for our model. If a team increases its offensive rating, one would expect its probability of covering a spread or winning the game to increase. Logistic regression guarantees this by its mathematical construction, provided the fitted coefficient on that rating is positive.
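The monotonicity property can be seen in a few lines. The sketch below evaluates a logistic model by hand; the two inputs, the coefficient values, and the function name are hypothetical and chosen only for illustration, not fitted to any real data.

```python
import math

def win_probability(off_rating_diff, def_rating_diff,
                    w_off=0.08, w_def=0.05, bias=0.0):
    """Logistic model: sigmoid of a linear combination of the inputs.

    The weights are made-up illustrative values, not fitted coefficients.
    """
    z = bias + w_off * off_rating_diff + w_def * def_rating_diff
    return 1.0 / (1.0 + math.exp(-z))

# Hold the defensive input fixed and raise the offensive input step by step.
offensive_edges = [-10, -5, 0, 5, 10]
probabilities = [win_probability(edge, 0.0) for edge in offensive_edges]
for edge, p in zip(offensive_edges, probabilities):
    print(f"offensive edge {edge:+d}: win probability {p:.3f}")
```

Because the sigmoid is strictly increasing and `w_off` is positive, each step up in the offensive input yields a strictly higher probability. With a negative coefficient the direction would flip, but the relationship would still be monotonic.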
How Sports Books Produce Odds
So how do sports books produce odds, anyway? Well, they do the same thing, except that they also have to account for the public. These books stay open because they charge a premium on the odds they quote. For example, the Bucks could be quoted with a 58% chance to win tonight’s game, while the Suns are quoted with a 49% chance to win the same game. Obviously, these probabilities add up to 107%, not 100%. The more comical example is that at every Super Bowl, fans get to bet on how the coin toss will land. The quote is usually (-105) on either side, which implies a chance of roughly 51% (105/205, or about 51.2%) that it lands heads and the same chance that it lands tails. This quote allows the books to make their money if all goes according to plan. For the bet to be fair, you should double your money if you win. In this case, though, you need to bet $105 to win $100, which is no longer fair. If the books can ensure that 50% of the money is bet on heads and the rest on tails, they make riskless money.
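The arithmetic behind these quotes can be sketched as follows. The conversion from American odds to implied probability is the standard one; the function names are made up for this example, and dividing by the sum is just one simple way to strip out the premium, assumed here for illustration.

```python
def implied_probability(american_odds):
    """Break-even win probability implied by an American odds quote."""
    if american_odds < 0:
        # Favorite-style quote: risk |odds| to win 100.
        return -american_odds / (-american_odds + 100)
    # Underdog-style quote: risk 100 to win odds.
    return 100 / (american_odds + 100)


def overround(probabilities):
    """Amount by which the quoted probabilities exceed 100%."""
    return sum(probabilities) - 1.0


def remove_premium(probabilities):
    """Rescale quoted probabilities so they sum to 1 (proportional method)."""
    total = sum(probabilities)
    return [p / total for p in probabilities]


coin_side = implied_probability(-105)
print(f"-105 implies {coin_side:.4f} per side")
print(f"coin toss overround: {overround([coin_side, coin_side]):.4f}")
print(f"Bucks/Suns overround: {overround([0.58, 0.49]):.2f}")
print("Bucks/Suns without premium:",
      [round(p, 4) for p in remove_premium([0.58, 0.49])])
```

For the coin toss, each side is quoted at about 51.2%, so the two sides sum to about 102.4%. Rescaling recovers the 50/50 split one would expect from a fair coin.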
What happens if the betting isn’t evenly spread? This is where the books can lose money. If 70% of the money in this scenario is bet on heads and the coin lands heads, the books will lose a lot of money even with the premium they charged. In other words, on top of modeling the real probability for a certain bet, sports books need to model how the public will bet to ensure that the premiums they charge on top of their quotes actually make them money.
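A short worked example shows how quickly the premium is overwhelmed. The sketch below assumes a total of $210 staked on the coin toss at -105 on both sides; the function name and the dollar figures are hypothetical.

```python
def book_profit(total_staked, fraction_on_heads, heads_wins, odds=-105):
    """Book's profit on a two-sided bet quoted at the same odds per side.

    The book keeps the losing side's stakes and pays the winning side
    its winnings (winning stakes are simply returned).
    """
    # Winnings paid per dollar staked, from the American odds quote.
    win_per_dollar = 100 / -odds if odds < 0 else odds / 100
    on_heads = total_staked * fraction_on_heads
    on_tails = total_staked - on_heads
    if heads_wins:
        return on_tails - on_heads * win_per_dollar
    return on_heads - on_tails * win_per_dollar


for fraction in (0.5, 0.7):
    if_heads = book_profit(210, fraction, heads_wins=True)
    if_tails = book_profit(210, fraction, heads_wins=False)
    print(f"{fraction:.0%} on heads: heads {if_heads:+.2f}, tails {if_tails:+.2f}")
```

With an even split the book earns $5 whichever way the coin lands. With 70% on heads it loses $77 if heads comes up and earns $87 if tails does, so its result now depends on the toss. Under these assumptions the book is only riskless while the heads share stays between roughly 48.8% and 51.2%.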