5 Useful Tips For Calculating Average In Python
It cannot be both Accurate and Fast
Measuring the central tendency of a dataset is one of the most common techniques in Exploratory Data Analysis. In this article, I won’t repeat what the mean, median and mode are, because this is basic statistics knowledge that is easy to find by Googling. Instead, I will focus on the Python Statistics library.
- What are the cool features it provides?
- Why do we still need the Statistics library over Numpy?
- What are the other ways to measure the centre of a dataset based on different problems?
In this article, all the code provided assumes that the functions are imported from the Statistics library as follows.
from statistics import *
0. Basic Usage
We cannot skip the basic usage before introducing any library. To measure the centre of a dataset, the Statistics library provides all three methods.
1. Mean, which is basically the average of the data.
mean([1, 2, 3, 4])
2. Median, which is the middle number of the dataset once it is sorted.
median([1, 2, 3, 4])
3. Mode, which is the value that has the highest occurring frequency in the dataset.
mode([1, 2, 3, 4, 4])
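As a quick check, the three calls above can be run together with explicit imports. This is only a minimal sketch, and the variable names are made up for illustration.

```python
from statistics import mean, median, mode

# The same inputs as in the snippets above.
avg = mean([1, 2, 3, 4])      # 2.5
mid = median([1, 2, 3, 4])    # 2.5, the average of the two middle values 2 and 3
top = mode([1, 2, 3, 4, 4])   # 4, the only value that appears twice

print(avg, mid, top)
```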
There is not much to explain for the basics. So, let’s focus on the special features that the Statistics library provides.
1. Low/High Median
By the definition of the “median”, it is either the number in the middle (if the total number of elements is odd) or the average of the two numbers in the middle (if the total number of elements is even).
However, it is not uncommon to have the following requirement: we want to use the median, but it has to be a number that exists in the list.
Usually, it is not straightforward to get this number. For example, we could get the median first, and then try to find the number in the list that is just greater or less than it. Or, we could use the total count of the list to find the index in the middle, and then get the low or high median based on that index.
With the Statistics library, we don’t have to do this, because it provides these functions out of the box.
median_low([1, 2, 3, 4])
median_high([1, 2, 3, 4])
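To make the comparison concrete, here is a rough sketch of the index-based approach next to the library functions. The helpers manual_median_low and manual_median_high are hypothetical names made up for this illustration; they are not part of the Statistics library, and they assume a non-empty list.

```python
from statistics import median, median_low, median_high

def manual_median_low(data):
    """Hypothetical helper: the lower of the two middle values (or the middle one)."""
    ordered = sorted(data)
    return ordered[(len(ordered) - 1) // 2]

def manual_median_high(data):
    """Hypothetical helper: the higher of the two middle values (or the middle one)."""
    ordered = sorted(data)
    return ordered[len(ordered) // 2]

numbers = [1, 2, 3, 4]
print(median(numbers))        # 2.5, which does not exist in the list
print(median_low(numbers))    # 2
print(median_high(numbers))   # 3
```

For a list with an odd number of elements, all three functions return the same middle value.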
2. Multimode
In my job, I have come across a practical use case for this. We had a list of strings and needed to get the string with the highest frequency in that list. However, the string with the highest frequency is not guaranteed to be unique. Sometimes, there are two or more.
Before Python 3.8, a “StatisticsError” was raised when the mode was not unique. It is pleasing to see that this has been changed in Python 3.8, where mode() returns the first mode encountered.
mode(['a', 'b', 'c', 'c', 'd', 'd'])
However, what if we want to keep all the modes? In the above case, we should have two modes, which are “c” and “d”. In this case, we can use the multimode() function, which is available since Python 3.8.
multimode(['a', 'b', 'c', 'c', 'd', 'd'])
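The difference between the two functions can be seen side by side in the short sketch below. It assumes Python 3.8 or later, and the variable names are only for illustration.

```python
from statistics import mode, multimode

words = ['a', 'b', 'c', 'c', 'd', 'd']

first_mode = mode(words)       # 'c' on Python 3.8+, the first mode encountered
all_modes = multimode(words)   # ['c', 'd'], in the order they first appear

print(first_mode, all_modes)
```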
3. Fast Mean
Since Python 3.8, there is a new function called fmean(). The name should be interpreted as “fast mean”. Basically, it is faster than the normal mean() function. I’m sure you will have several questions, so let me answer them one by one and hopefully you won’t have any more 🙂
3.1 Why is the fast mean faster?
The fast mean converts all the numbers into the float type before calculation. Therefore, fmean() always gives us a float as the result, but mean() does not.
mean([1, 2, 3, 4])
fmean([1, 2, 3, 4])
However, please note that this does NOT mean that
- Float operations are faster
- mean() performance is the same as fmean() when all the numbers are floats
The difference is that fmean() uses simple but fast floating-point math, while mean() follows a much more complicated procedure to achieve the maximum possible precision, even at the cost of performance.
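The difference in return types is easier to see with inputs whose mean is not forced to be a float. The sketch below is only an illustration, and the sample values are my own assumptions.

```python
from decimal import Decimal
from statistics import mean, fmean

whole = [2, 4, 6]
print(type(mean(whole)).__name__, mean(whole))     # int 4
print(type(fmean(whole)).__name__, fmean(whole))   # float 4.0

prices = [Decimal("0.1"), Decimal("0.2"), Decimal("0.3")]
print(mean(prices))    # 0.2, still a Decimal and computed exactly
print(fmean(prices))   # a plain float close to 0.2
```

In other words, mean() tries to hand back the same kind of number it was given, while fmean() always hands back a float.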
3.2 Why do we still need the original mean function if it is slower?
In fact, just as my subtitle says, we cannot have both performance and accuracy. That means, because the fmean() function is much faster than mean(), it sacrifices accuracy to a very small extent.
We can say that fmean() is still very accurate, except when dealing with fractions.
from fractions import Fraction
mean([Fraction(1, 2), Fraction(1, 3)])
fmean([Fraction(1, 2), Fraction(1, 3)])
The example shows that the fractions are converted into float numbers in fmean(). The mean() function, in contrast, takes a lot of effort not just to convert all the data to exact values, but also to keep track of the “best” class and return a result of that class.
fmean() is still very accurate under normal circumstances, but mean() will give you a mathematically exact result if your data consists of Fractions.
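To see how small the loss is, we can compare the two results directly. This is a minimal sketch that uses the same two fractions as above; the variable names are only for illustration.

```python
from fractions import Fraction
from statistics import mean, fmean

data = [Fraction(1, 2), Fraction(1, 3)]

exact = mean(data)     # Fraction(5, 12), still an exact rational number
approx = fmean(data)   # a float, roughly 0.4167

print(repr(exact), approx)

# Converting the float back into a Fraction shows that it is not exactly 5/12,
# because 5/12 cannot be represented exactly as a binary float.
print(Fraction(approx) == exact)          # False
print(abs(Fraction(approx) - exact) < Fraction(1, 10**15))
```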
3.3 How fast is the fast mean, and how does it compare to Numpy?
We can do a simple experiment here. First, generate 100k random float numbers using Numpy.
import numpy as np
arr_float = np.random.random(100000)
Then, test the three different functions.
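One possible way to run such a comparison is with the standard timeit module. The sketch below is an assumption on my side rather than the exact benchmark: the helper time_means is a hypothetical name, the data is generated with the random module so that the script also runs without Numpy, and the Numpy timing is only included if Numpy is installed. The absolute numbers will differ from machine to machine.

```python
import random
import timeit
from statistics import mean, fmean

def time_means(n=100_000, repeats=3):
    """Hypothetical helper: best wall-clock time in seconds per mean function."""
    data = [random.random() for _ in range(n)]
    timings = {
        "mean": min(timeit.repeat(lambda: mean(data), number=1, repeat=repeats)),
        "fmean": min(timeit.repeat(lambda: fmean(data), number=1, repeat=repeats)),
    }
    try:
        import numpy as np   # optional: only timed if Numpy is installed
    except ImportError:
        return timings
    arr = np.asarray(data)
    timings["numpy.mean"] = min(
        timeit.repeat(lambda: np.mean(arr), number=1, repeat=repeats)
    )
    return timings

if __name__ == "__main__":
    for name, seconds in time_means().items():
        print(f"{name:>10}: {seconds * 1000:.3f} ms")
```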
In this experiment, fmean() is about 20x faster than mean(), while the Numpy mean function is 150x faster than fmean(). However, if we do care about accuracy, we might not choose the Numpy mean. Let me show you another example here.
Consider the following list of numbers.
[1e20, 1, 3, -1e20]
It is obvious that the sum of these 4 numbers should be 4, so the average should be 1. Let’s have a look at the behaviour of each mean function.
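The behaviour can be reproduced with the sketch below. The helper naive_mean is a hypothetical function that I added to show what plain left-to-right float accumulation does with this list; the Numpy call is optional and is skipped if Numpy is not installed.

```python
from statistics import mean, fmean

numbers = [1e20, 1, 3, -1e20]

def naive_mean(data):
    """Hypothetical helper: plain left-to-right float accumulation."""
    total = 0.0
    for x in data:
        total += x   # 1 and 3 are far below the precision of 1e20 and get lost
    return total / len(data)

print(mean(numbers))         # 1.0
print(fmean(numbers))        # 1.0
print(naive_mean(numbers))   # 0.0

try:
    import numpy as np       # optional
    print(np.mean(numbers))
except ImportError:
    pass
```

Both functions from the Statistics library recover the exact answer, while simple float accumulation drops the small terms entirely.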
In such extreme circumstances, the Numpy mean loses accuracy because it sums in plain floating-point arithmetic (using a pairwise summation scheme), which keeps the computational cost low but introduces some “acceptable” round-off error.
Whether your use case can “accept” such inaccuracy is your decision.
4. Geometric Mean
We use the term “average” in many cases. However, it does NOT always mean that we need to calculate the “arithmetic” mean. Consider the problem below.
A fund grows by 10% in the first year, declines by 20% in the second year, and then grows by 30% in the third year.
What is the average performance of this fund over the past three years?
Can we say mean([0.1, -0.2, 0.3]) = 0.067 = 6.7%? This is wrong. We should use the geometric mean in this case.
geometric_mean([(1+0.1), (1-0.2), (1+0.3)]) - 1
Therefore, the average performance of this fund is about 4.6%.
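A simple way to convince ourselves is to compound the average rate for three years and compare it with the actual growth of the fund. This is a minimal sketch; the variable names are only for illustration.

```python
from statistics import geometric_mean

yearly_growth = [1 + 0.1, 1 - 0.2, 1 + 0.3]

avg_rate = geometric_mean(yearly_growth) - 1
print(f"{avg_rate:.2%}")   # 4.59%

# Compounding the average rate for three years gives the same total growth
# as the three actual years (up to float rounding).
actual = 1.1 * 0.8 * 1.3           # about 1.144
compounded = (1 + avg_rate) ** 3
print(actual, compounded)

# The arithmetic mean of 6.7% would overstate the growth.
print((1 + 0.2 / 3) ** 3 > actual)   # True
```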
5. Harmonic Mean
Another type of mean is the harmonic mean. Consider the problem below.
We are driving a car. For the first 10 km our speed is 60km/h, and then we increase the speed to 80km/h for the next 10 km. Then, we arrive.
What is our average speed?
Of course, we can NOT say that the arithmetic mean of [60, 80] is 70 so the average speed is 70. That is wrong, because we spend more time driving at the lower speed.
In this case, we should use the harmonic mean.
With the harmonic mean, our average speed is about 68.57km/h over the 20 km. This also tells us that speeding doesn’t “help” much to arrive at the destination faster. So, let’s drive safely 🙂
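The harmonic mean is the right choice when each speed is held over the same distance. The sketch below cross-checks it against total distance divided by total time, assuming each leg is 10 km long (the leg length is an assumption for this illustration and cancels out in the result).

```python
from statistics import harmonic_mean

speeds = [60, 80]            # km/h on two legs of equal distance
avg_speed = harmonic_mean(speeds)
print(round(avg_speed, 2))   # 68.57

# Cross-check: total distance divided by total time.
leg_km = 10
total_hours = leg_km / 60 + leg_km / 80
check = 2 * leg_km / total_hours
print(round(check, 2))       # 68.57
```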
In this article, I have introduced several important functions of the Python built-in Statistics library. It provides many cool features, such as low/high median and multimode, that allow us to flexibly measure the centre of a dataset. Also, the fast mean function gives us a much faster way to calculate the arithmetic mean without losing accuracy in most cases. Moreover, it provides some types of mean other than the arithmetic mean out of the box.
If you are interested in more Python built-in libraries, please check out some related articles below.
11 Python Built-in Functions You Should Know
7 Useful Tricks for Python Regex to Learn
Do You Know Python Has Built-In Array?
“Find the Difference” in Python
6 Python Container Data Types You Should Know