In any machine learning problem, having good data is just as important as having a good model. Or, as the famous saying about bad data goes: garbage in, garbage out. In this article we explore some common, yet not well-recognized, sources of bad data.
Topcoding and Bottomcoding
Topcoding and bottomcoding occur when a dataset replaces very high or very low values with a single capped value. This is sometimes done to protect the identity of people in the dataset. For example, consider a publicly available dataset that includes income. There aren’t many people with incomes above one billion USD, so those entries, if left unchanged, could probably be matched to an exact person from the income number alone. To prevent this and preserve anonymity, we could topcode the data by setting every income above a large cap, say 100 million, to that cap. Of course, this means any model built on the income data in this dataset will be inaccurate. There is no way to recover the true values of topcoded or bottomcoded entries, so the best we can do is estimate them from the other, unmodified values. For example, you could train a linear model on the non-coded income values, and then use that model to extrapolate the true values of the coded entries.
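A minimal sketch of that extrapolation idea, using synthetic data. The cap, the income model, and the predictor (years of experience) are all illustrative assumptions, not part of any real dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: income loosely driven by years of experience.
experience = rng.uniform(0, 40, 500)
income = 30_000 + 4_000 * experience + rng.normal(0, 10_000, 500)

# Topcoding: every income above the cap is replaced by the cap itself.
CAP = 150_000
topcoded = np.minimum(income, CAP)
is_coded = income > CAP

# Fit a linear model on the non-coded rows only...
slope, intercept = np.polyfit(experience[~is_coded], topcoded[~is_coded], 1)

# ...then extrapolate to estimate the true values of the coded rows.
estimated = slope * experience[is_coded] + intercept
print(f"coded rows: {is_coded.sum()}, mean estimate: {estimated.mean():.0f}")
```

Note that fitting only on the non-coded rows slightly biases the slope downward, since the rows removed by the cap are disproportionately high-income; the extrapolated values are estimates, not recovered truth.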
Proxy Reporting
Proxy reporting occurs when one member of a survey answers questions for other members. A common example is family surveys, where the head of the household answers for everyone in the family. The problem is that the person answering might give blank or inaccurate answers for the people they are answering for, so if your dataset contains proxy-reported values, it is hard to say whether you can trust them. One sanity check is to compare the proxy-reported values against the non-proxy values.
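One simple version of that comparison is to split the data on a proxy flag and compare the two groups' summary statistics. The survey variable (weekly work hours) and the group parameters below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical survey: hours worked per week, flagged by whether the
# answer came from the person themselves or from a proxy.
hours = np.concatenate([rng.normal(38, 5, 300),   # self-reported
                        rng.normal(35, 8, 100)])  # proxy-reported
is_proxy = np.concatenate([np.zeros(300, bool), np.ones(100, bool)])

# A large gap in mean (or spread) between the two groups is a warning
# sign that the proxy answers may not be trustworthy.
self_mean = hours[~is_proxy].mean()
proxy_mean = hours[is_proxy].mean()
print(f"self: {self_mean:.1f}h, proxy: {proxy_mean:.1f}h")
```

In practice you would follow up a gap like this with a proper statistical test rather than eyeballing the means.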
Selection Bias
Generally speaking, selection bias occurs when the sample behind a dataset is not random, which introduces bias. This happens for all kinds of reasons. One common cause is that people who answer surveys tend to share certain characteristics, such as being more civically and politically active. If the survey in question asks how much political activity people do, it will therefore be biased towards more activity. Another example occurs in medicine, when doctors ask for volunteers to test the effectiveness of a new treatment. People who volunteer are probably not representative of the general population: they may be more medically literate, take better care of themselves, be younger, and so on.
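The political-activity example can be demonstrated with a small simulation. The response model below (more active people are more likely to answer) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical population: hours of political activity per month.
activity = rng.exponential(2.0, 100_000)

# Assumed response model: more active people are more likely to
# answer a survey about political activity.
response_prob = np.clip(0.05 + 0.1 * activity, 0, 1)
responded = rng.random(100_000) < response_prob

pop_mean = activity.mean()
resp_mean = activity[responded].mean()
print(f"population mean: {pop_mean:.2f}, respondent mean: {resp_mean:.2f}")
```

The respondent mean comes out well above the population mean, even though every individual answered truthfully: the bias is entirely in who responds.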
The problem with selection bias is that it sometimes isn’t avoidable, especially when a survey relies on volunteers. There also isn’t a good way to mitigate it after the fact: no statistical trick can fully unbias the dataset. The best we can do is design surveys to avoid selection bias in the first place, and take conclusions from biased surveys with a grain of salt.
Imputed Values
Many datasets have blank values. One way to deal with these blanks is to fill in, or impute, the values based on some formula. When the percentage of imputed values is significant, conclusions drawn from the data may be suspect. For example, suppose we want to measure the change in income over time for each person in a group, which requires taking two income surveys at two different times. Assume some values in one of the surveys were imputed with the sample average of the non-imputed values. Then, for each person with an imputed data point, we aren’t measuring that person’s change in income; we’re actually measuring the difference between one data point and the average of everyone else, which has no meaning.
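A quick simulation makes the problem concrete. The income numbers and the 30% missingness rate are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

n = 1_000
income_t1 = rng.normal(50_000, 10_000, n)
true_change = rng.normal(2_000, 1_000, n)  # everyone's income drifts up a bit
income_t2 = income_t1 + true_change

# Suppose 30% of wave-2 answers are missing and get mean-imputed.
missing = rng.random(n) < 0.3
imputed_t2 = income_t2.copy()
imputed_t2[missing] = income_t2[~missing].mean()

# For imputed people, the computed "change" is the gap between their own
# wave-1 income and everyone else's average, not their actual change.
naive_change = imputed_t2 - income_t1
print(f"true change spread (imputed rows):  {true_change[missing].std():.0f}")
print(f"naive change spread (imputed rows): {naive_change[missing].std():.0f}")
```

The naive per-person changes for the imputed rows have a spread an order of magnitude larger than the true changes, because they mostly reflect cross-sectional income variation rather than anyone's actual income trajectory.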
Another way of dealing with imputed values is to drop them from the dataset. Unfortunately, this can itself bias the data. The best option would be to find the actual data; this is sometimes possible, for example if you are a government employee with access to other datasets containing similar information. A third option is to reweight the data so that imputed values count for less than non-imputed values.
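The reweighting option can be as simple as a weighted average with a smaller weight on imputed entries. The values and the 0.5 weight below are illustrative choices, not a recommended setting:

```python
import numpy as np

values = np.array([40_000.0, 52_000.0, 61_000.0, 48_000.0, 50_250.0])
is_imputed = np.array([False, False, False, False, True])

# Downweight imputed entries (0.5 here is an arbitrary illustrative choice).
weights = np.where(is_imputed, 0.5, 1.0)
weighted_mean = np.average(values, weights=weights)
print(f"{weighted_mean:.0f}")
```

Choosing the weight is a judgment call: 1.0 trusts imputed values fully, 0.0 is equivalent to dropping them.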
There are some situations where imputed data isn’t a huge problem. For example, if all we want is the mean of a dataset, imputing with the sample average leaves the mean essentially unchanged. Likewise, for “well-behaved” datasets with few outliers and low variance, imputed data has less impact.
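The mean-preservation claim is easy to verify, along with the caveat that mean imputation still distorts the spread. The data below is synthetic:

```python
import numpy as np

rng = np.random.default_rng(4)
observed = rng.normal(100, 15, 900)

# Impute 100 missing entries with the sample average of the observed data.
imputed = np.full(100, observed.mean())
combined = np.concatenate([observed, imputed])

# The overall mean is unchanged...
print(np.isclose(observed.mean(), combined.mean()))  # True
# ...but the spread shrinks, since imputed points add no variation.
print(combined.std() < observed.std())               # True
```

So mean imputation is harmless for the mean itself, but any statistic that depends on spread (variance, quantiles, correlations) will be understated.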