Scientists and other theory-driven data analysts focus on eliminating bias and maximizing accuracy so they can find trends and patterns in their data. That’s necessary for any type of data analysis. For statisticians, though, the real enemy in the battle to discover knowledge isn’t so much inaccuracy as it is imprecision. Call it lack of precision, variability, uncertainty, dispersion, scatter, spread, noise, or error; it’s all the same adversary.
One caution before advancing further. There’s a subtle difference between data mining and data dredging. Data mining is the process of finding patterns in large data sets using computer algorithms, statistics, and just about anything else you can think of short of voodoo. Data dredging involves substantial voodoo, usually involving overfitting models to small data sets that aren’t really as representative of a population as you might think. Overfitting doesn’t occur in a blitzkrieg; it’s more of a siege process. It takes a lot of work. Even so, it’s easy to lose track of where you are and what you’re doing when you’re focused exclusively on a goal. The upshot of all this is that you may be creating false intelligence. Take the high ground. Don’t interpret statistical tests and probabilities too rigidly. Don’t trust diagnostic statistics with your professional life. Any statistician with a blood alcohol level under 0.2 will make you look silly.
Here are ten tactics you can use to try to control and reduce variability in a statistical analysis.
Know Your Enemy
There’s an old saying: six months collecting data will save you a week in the library. Be smart about your data. Figure out where the variability might be hiding before you launch an attack. Focus at first on three types of variability: sampling, measurement, and environmental.
Sampling variability consists of the differences between a sample and the population that are attributable to how uncharacteristic (non-representative) the sample is of the population. Measurement variability consists of the differences between a sample and the population that are attributable to how data were measured or otherwise generated. Environmental variability consists of the differences between a sample and the population that are attributable to extraneous factors. So there are three places to hunt for errors: how you select the samples, how you measure their attributes, and everything else you do. OK, I didn’t say it was going to be easy.
Start with Diplomacy
Start by figuring out what you can do to tame the error before things get messy. Consider how you can use the concepts of reference, replication, and randomization. The concept behind using a reference in data generation is that there is some ideal, background, baseline, norm, benchmark, or at least, generally accepted standard that can be compared to all similar data operations or results. If you can’t take advantage of a reference point to help control variability, try repeating some aspects of the study as a form of internal reference. When all else fails, randomize.
Five maneuvers you can try in order to control, minimize, or at least assess the effects of extraneous variability are:
- Procedural Controls—like standard instructions, training, and checklists.
- Quality Samples and Measurements—like replicate measurements, placebos, and blanks.
- Sampling Controls—like random, stratified, and systematic sampling patterns.
- Experimental Controls—randomly assigning individuals or objects to groups for testing, control groups, and blinding.
- Statistical Controls—Special statistics and procedures like partial correlations and covariates.
Even if none of these things work, at least everybody will know you tried.
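As a rough illustration of the sampling and experimental controls in the list above, here is a minimal Python sketch of stratified random sampling followed by random assignment to groups. The function names, the 10% sampling fraction, and the cat-and-dog population are all hypothetical, made up for this example; a real study would define its own strata and group sizes.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, seed=0):
    """Draw the same fraction, at random, from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

def random_assignment(units, groups=("treatment", "control"), seed=0):
    """Shuffle the units, then deal them out to the groups in turn."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    return {g: shuffled[i::len(groups)] for i, g in enumerate(groups)}

# Hypothetical population: 60 cats and 40 dogs, identified by (species, id).
population = [("cat", i) for i in range(60)] + [("dog", i) for i in range(40)]
sample = stratified_sample(population, lambda u: u[0], fraction=0.1)
assignment = random_assignment(sample)
print(len(sample), {g: len(members) for g, members in assignment.items()})
```

Stratifying keeps the species mix of the sample the same as the population, and shuffling before assignment keeps the analyst from choosing, even unconsciously, who ends up in which group.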
Prepare, Provision, and Deploy
Before entering the fray, you’ll want to know that your troops (that is, your data) are ready to go. You have to ask yourself two questions: do you have the right data and do you have the data right? Getting the right data involves deciding what to do about replicates, missing data, censored data, and outliers. Getting the data right involves making sure all the values were generated appropriately and the values in the dataset are identical to the values that were originally generated. Sorting, reformatting, writing test formulas, calculating descriptive statistics, and graphing are some of the data scrubbing maneuvers that will help to eliminate extraneous errors. Once you’ve done all that, the only thing left to do is lock and load.
While analyzing your data, be sure to look at the error in every way you can. Is it relatively small? Is it constant for all values of the dependent variable? Infiltrate the front line of diagnostic statistics. Look beyond R-squared values and test probabilities to the standard error of estimate, DFBETAs, deleted residuals, leverage, and other measures of data influence. What you learn from these diagnostics will lead you through the next actions.
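To make a few of those diagnostics concrete, here is a small sketch for the simplest case, a straight-line regression with one predictor. It computes the standard error of estimate, the leverage of each point, and the deleted residuals using the usual textbook formulas. The function name and the six data points are invented for illustration; statistical software reports the same quantities for larger models.

```python
import math

def regression_diagnostics(x, y):
    """Fit y = intercept + slope * x by least squares and return diagnostics."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Standard error of estimate: typical size of a residual (n - 2 df).
    see = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    # Leverage: how far each x sits from the middle of the x values.
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # Deleted residual: the miss for each point if the line were refit without it.
    deleted = [e / (1 - h) for e, h in zip(residuals, leverage)]
    return {"slope": slope, "intercept": intercept, "see": see,
            "residuals": residuals, "leverage": leverage, "deleted": deleted}

# Made-up data: five ordinary points and one far out on the x axis.
x = [1, 2, 3, 4, 5, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 25.0]
diag = regression_diagnostics(x, y)
for xi, h, e, d in zip(x, diag["leverage"], diag["residuals"], diag["deleted"]):
    print(f"x={xi:>2}  leverage={h:.3f}  residual={e:+.3f}  deleted={d:+.3f}")
```

The point at x = 20 has by far the highest leverage, so its ordinary residual understates how badly the line would miss it if that point were left out of the fit. The deleted residual shows the gap.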
Divide and Conquer
Perhaps the best, or at least the most common, way to isolate errors is to divide the data into more homogeneous groups. There are at least three ways you can do this. First, and easiest, is to use any natural grouping data you might have in your dataset, like species or sex. There may also be information you can use to group the data in the metadata. Second is the more problematical visual classification. You may be able to classify your data manually by sorting, filtering, and most of all, plotting. For example, by plotting histograms you may be able to identify thresholds for categorizing continuous-scale data into groups, like converting weight into weight classes. Then you can analyze each more homogeneous class separately. Sometimes it helps and sometimes it’s just a lot of work for little result. The other potential problems with visual classification are that it takes a bit of practice to know what to do and what to look for, and more importantly, you have to be careful that your grouping isn’t just coincidental.
The third method of classifying data is the best or the worst, depending on your perspective. Cluster analysis is arguably the best way to find the optimal groupings in data. The downside is that the technique requires even more skill and experience than visual classification, and furthermore, the right software.
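However the groups are found, the payoff looks the same: the spread within each group is smaller than the spread of the pooled data. Here is a minimal sketch using a natural grouping. The weights are invented numbers, chosen only to make the effect easy to see.

```python
from statistics import stdev

# Hypothetical body weights in kilograms, keyed by a natural grouping (species).
weights = {
    "cat": [3.9, 4.2, 4.5, 4.1, 4.4],
    "dog": [19.5, 21.0, 20.2, 22.1, 20.7],
}

# Spread when every animal is lumped into one sample.
pooled = [w for group in weights.values() for w in group]
pooled_sd = stdev(pooled)

# Spread within each more homogeneous group.
within_sd = {species: stdev(values) for species, values in weights.items()}

print(f"pooled sd: {pooled_sd:.2f}")
for species, sd in within_sd.items():
    print(f"{species} sd: {sd:.2f}")
```

Most of the pooled variability here is just the difference between species. Analyzing each group separately (or including the grouping as a factor in the model) keeps that difference from masquerading as error.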
Call in Reinforcements
If you find that you need more than just groupings to minimize extraneous error, bring in some transformations. You can use transformations to rescale, smooth, shift, standardize, combine, and linearize data, and in the process, minimize unaccounted-for errors. There’s no shame in asking for help: not physical, not mental, and not mathematical.
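Two of those transformations are easy to sketch. In the example below, a log transformation linearizes an exponential relationship, and standardizing rescales a variable to a mean of zero and a standard deviation of one. The data are generated from a made-up growth curve with no noise, so the result is cleaner than anything real data will give you.

```python
import math
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def standardize(values):
    """Rescale to z-scores: mean 0, standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical exponential growth: y = 2 * exp(0.5 * x), no noise added.
x = list(range(1, 9))
y = [2.0 * math.exp(0.5 * xi) for xi in x]

r_raw = pearson_r(x, y)                          # straight line fits poorly
r_log = pearson_r(x, [math.log(yi) for yi in y])  # log(y) is linear in x
z = standardize(y)

print(f"r with raw y: {r_raw:.3f}")
print(f"r with log y: {r_log:.3f}")
```

A straight line fit to the raw values leaves curvature in the residuals. The same line fit to the logged values does not, so less of the pattern gets written off as error.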
Shock and Awe
If all else fails, you can call in the big guns. In a sense, this tactic involves rewriting the rules of engagement. Rather than attacking the subjects, you aim at reducing the chaos in the variables. The main technique to try is factor analysis. Factor analysis involves rearranging the information in the variables so that you have a smaller number of new variables (called factors, components or dimensions, depending on the type of analysis) that represent about the same amount of information. These new factors may be able to account for errors more efficiently than the original variables. The downside is that the factors often represent latent, unmeasurable characteristics of the samples, making them hard to interpret. You also have to be sure you have appropriate weapons of math production (i.e., software) if you’re going to try this tactic.
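Full factor analysis needs real software, but the basic idea of repackaging variables can be sketched with its simplest relative, principal components, on just two variables. This is an illustration of the concept and not a substitute for a factor analysis. The measurements are invented, and the function names are made up for this example.

```python
import math
from statistics import mean, stdev, variance

def standardize(values):
    """Rescale to z-scores so both variables carry equal weight."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def first_principal_component(x, y):
    """Return direction, scores, and variance share of the first component."""
    n = len(x)
    mx, my = mean(x), mean(y)
    # Entries of the 2x2 covariance matrix [[a, b], [b, c]].
    a = sum((xi - mx) ** 2 for xi in x) / (n - 1)
    c = sum((yi - my) ** 2 for yi in y) / (n - 1)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    centre = (a + c) / 2
    half_gap = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = (centre + half_gap, centre - half_gap)
    # Angle of the axis along which the data vary the most.
    angle = 0.5 * math.atan2(2 * b, a - c)
    direction = (math.cos(angle), math.sin(angle))
    scores = [(xi - mx) * direction[0] + (yi - my) * direction[1]
              for xi, yi in zip(x, y)]
    share = eigenvalues[0] / (eigenvalues[0] + eigenvalues[1])
    return direction, scores, share, eigenvalues

# Hypothetical body length (cm) and weight (kg) for six cats.
length = standardize([30, 32, 35, 37, 40, 42])
weight = standardize([3.0, 3.3, 3.9, 4.1, 4.8, 5.0])

direction, scores, share, eigenvalues = first_principal_component(length, weight)
score_variance = variance(scores)
print(f"share of variance in the first component: {share:.3f}")
```

Because length and weight are so strongly correlated in this toy data set, one new variable (call it "size") carries nearly all of the information that the two original variables did. Notice that "size" was never measured directly, which is exactly the interpretation problem described above.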
Set Terms of Surrender
If you’ve been pretty aggressive in torturing your data, make sure the error enemy is subdued before declaring victory. Errors are like zombies. Just when you think you have everything under control they come back to bite you. Rule 2: Always double tap. In statistics, this means that you have to verify your results using a different data set. It’s called cross-validation and there are many approaches. You can split the data set before you do any analysis, analyze one part (the training data set), and then verify the results with the other part (the test data set). You can randomly extract observations from the original data set to create new datasets for analysis and testing. Finally, you can collect new samples. You just want to be sure no errors are hiding where you don’t suspect them.
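The first of those approaches, splitting the data before any analysis, might be sketched like this. The data are simulated from a made-up straight line plus random noise, and the 30% test fraction, seeds, and function names are arbitrary choices for the example.

```python
import math
import random

def split_train_test(rows, test_fraction=0.3, seed=0):
    """Shuffle once, then hold back a fraction of the rows for testing."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(rows):
    """Least-squares slope and intercept for (x, y) pairs."""
    n = len(rows)
    x_bar = sum(x for x, _ in rows) / n
    y_bar = sum(y for _, y in rows) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in rows)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def rmse(rows, slope, intercept):
    """Root mean squared prediction error of the line on the given rows."""
    return math.sqrt(sum((y - (intercept + slope * x)) ** 2 for x, y in rows)
                     / len(rows))

# Simulated data: y = 2 + 0.5 x plus noise with a standard deviation of 1.
noise = random.Random(42)
data = [(x, 2.0 + 0.5 * x + noise.gauss(0, 1.0)) for x in range(40)]

train, test = split_train_test(data)   # split BEFORE looking at anything
slope, intercept = fit_line(train)     # analyze the training set only
train_rmse = rmse(train, slope, intercept)
test_rmse = rmse(test, slope, intercept)
print(f"train RMSE: {train_rmse:.2f}   test RMSE: {test_rmse:.2f}")
```

If the error on the test set is much larger than the error on the training set, the model has probably been fit to noise, and the zombies are not dead yet.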
Have an Exit Strategy
In the heat of data analysis, sometimes it’s difficult to recognize when to disengage. Even analysts new to the data can fall into the same traps as their predecessors. There are two fail-safe methods for knowing when to concede. One is to decide if you have met the specific objective that you defined before you mobilized. If you did, you’re done. The other is to monitor the schedule and budget your client gave you to solve their problem. When you get close to the end, it’s time to withdraw. Be sure to save some time and money for the debriefing.
Live to Fight Another Day
You don’t have to surrender in the war against error. In fact, every engagement can bring you closer to victory. If an analysis becomes intractable, follow the example of Dunkirk. Withdraw your forces, call your previous efforts a pilot study, and plan your next error raid.