The Data Scrub

Garbage in, garbage out is a saying that dates back to the early days of computers but is still true today, perhaps even more so. If the numbers you use in a statistical analysis are incorrect (garbage), so too will be the results. That’s why so much effort has to go into getting the numbers right. The process of checking data values can be divided into two parts — verification and validation.

Verification addresses the issue of whether each value in the dataset is identical to the value that was originally generated. Twenty years ago, verification amounted to a check of the data entry process. Did the keypunch operator enter all the data as it appeared on the hard copy from the person who generated the data? Today, it includes making sure automated devices and instruments generate data as they were designed to. Sometimes, for example, electrical interference can cause instrumentation to record or output erroneous values. Verification is a simple concept embodied in simple yet time-consuming processes.

[Cat photo. Caption: "I’m the oversight contractor. Give me all your QA/QC records … and a can of tuna."]

Validation addresses the issue of whether the data were generated in accordance with quality assurance specifications. In other words, validation is a process to determine if the data are of a known level of quality that is appropriate for the analyses to be conducted and the decisions that will be made as a result. An example of a validation process is the assignment of data quality flags to each analyte concentration reported by an analytical chemistry laboratory. Validation is a complex concept embodied in a complex, time-consuming process.

Changes in data points create a dilemma for statisticians. Change one data point and you’ve changed the entire analysis. It probably won’t change your conclusions, but all those means, variances, and other statistics, not to mention the graphs and maps, will be incorrect, or at least inconsistent with the final database. This is the most important warning I issue to my clients when I start a job. Still, on the majority of projects, some data point changes after I am well into the data analysis. It’s inevitable.

Finding Data Problems

Lots of bad things can happen in a dataset, so it’s useful to have a few tricks, tips, and tools for finding those mistakes and anomalies.

Spreadsheet Tricks

This is the point when using a spreadsheet to assemble your dataset really pays off. Most spreadsheet programs have a variety of capabilities that are well suited to data scrubbing.

Is that a 0 or an O?

Here are a few tricks:

  • Marker Rows — Before you do any scrubbing, create marker rows throughout the dataset. You can do this by coloring the cell fill for the entire row with a unique color. You don’t need a lot; just spread them through the dataset. If you make any mistakes in sorting that corrupt the rows, you’ll be able to tell. You could do the same thing with columns but columns usually aren’t sorted.
  • Original Order — Insert a column with the original order of the rows. This will let you get back to the original order the data were in if you need to.
  • Sorting — One at a time, sort your dataset by each of your variables. Check the top and the bottom of the column for entries with leading blanks and nonnumeric text. Then check within each column for misspellings, ID variants, analyte aliases, non-numbers, bad classifications, and incorrect dates.
  • Reformatting — Change fonts to detect character errors, such as O and 0. Change the format of any date, time, currency, or percentage fields and incorrect entries may pop out. For example, a percentage entered as 50% instead of 0.50 would be a text field that could not be processed by a statistical package. This trick works especially well with incorrect dates. Conditional formatting can also be used to find data that fall outside a range of acceptable values. For example, identify percentages greater than 1 by conditionally formatting them with a red font.
  • Formulas — Write formulas to check proportions, sums, differences, and any other relationships between variables in your dataset. Use cell information functions (e.g., VALUE and ISNUMBER in Excel) to verify that all your values are numbers and not alphanumerics. Also, check to see if two columns are identical, in which case one can be deleted. This problem occurs often with datasets that have been merged. (A code sketch of a few of these checks follows this list.)
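
If you assemble or double-check your dataset outside a spreadsheet, a few of these checks can be scripted instead. Below is a minimal sketch in Python with pandas; the file name ("my_dataset.csv") and column names ("SampleID", "Analyte", "Result") are hypothetical stand-ins for whatever your own dataset uses.

    import pandas as pd

    # Read everything as text first so stray characters aren't silently coerced.
    df = pd.read_csv("my_dataset.csv", dtype=str)

    # Keep the original row order so you can always sort back to it.
    df["original_order"] = range(len(df))

    # Values that should be numbers but aren't (the 0-versus-O problem).
    bad = pd.to_numeric(df["Result"], errors="coerce").isna() & df["Result"].notna()
    print(df.loc[bad, ["SampleID", "Result"]])

    # Stray leading or trailing blanks and spelling variants in ID columns.
    ids = df["SampleID"].astype(str)
    print((ids != ids.str.strip()).sum(), "IDs with leading or trailing blanks")
    print(sorted(df["Analyte"].dropna().unique()))  # scan for aliases and misspellings

    # Columns that are exact duplicates of one another (common after merges).
    cols = list(df.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if df[a].equals(df[b]):
                print("Duplicate columns:", a, b)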

Descriptive Statistics

Even before getting involved in the data analysis, you can use descriptive statistics to find errors. Here are a few things to look for.

  • Counts — Make sure you have the same number of samples for all the variables. Otherwise, you have missing or censored data to contend with. Count the number of data values that are censored for each variable. If all the values are censored for a variable, the variable can be removed from the dataset. Also count the number of samples in all levels of grouping variables to see if you have any misclassifications.
  • Sums — If some measurements are supposed to sum to a constant, like 100%, you can usually find errors pretty easily. Fixing them can be another matter. If it looks like just an addition error, fix the entries by multiplying them by

{what the sum should be} divided by {what the incorrect sum is}

For example, if the sum should be 100% and the entries add up to 90%, multiply all the entries by 1.0/0.9 (about 1.11) and then they’ll all add up to 100%. (A code sketch of this rescaling follows the list.) There will be situations, though, especially in opinion surveys, when you’ll have to try to divine the intent of the respondent. If someone entered 1%, 30%, 49%, did he mean 1%, 50%, 49%, or 1%, 30%, 69%, or even 21%, 30%, 49%? It’s like being in Florida during November of 2000. You want to use as much of the data as possible, but you just have to be sure it’s the right data.

  • Min/Max — Look at the minimum and maximum for each variable to make sure there are no anomalously high or low values.
  • Dispersion — Calculate the variance or standard deviation for each variable. If any are zero, you can delete the variable because it will add nothing to your statistical analysis.
  • Correlations — Look at a correlation matrix for your variables. Look for correlations between independent variables that are near 1. These variables are statistical duplicates in that they convey the same information even if the numbers are different. You won’t need both so delete one.
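
For those working in Python, a rough pandas sketch of these descriptive-statistics checks, including the rescaling described under Sums, might look like the following. The file name and the percentage column names ("pct_a", "pct_b", "pct_c") are hypothetical.

    import pandas as pd

    df = pd.read_csv("my_dataset.csv")            # hypothetical file name
    num = df.select_dtypes("number")              # numeric variables only

    print(num.count())                            # unequal counts flag missing or censored data
    print(num.agg(["min", "max"]))                # scan for anomalously high or low values
    print(num.columns[num.std() == 0].tolist())   # zero-variance variables add nothing

    # Independent variables that are statistical duplicates (correlations near 1).
    corr = num.corr().abs()
    dupes = [(a, b) for a in corr.columns for b in corr.columns
             if a < b and corr.loc[a, b] > 0.99]
    print(dupes)

    # Rescale percentage columns so each row sums to 100% (use 1.0 instead of 100
    # if the percentages are stored as fractions).
    pct = ["pct_a", "pct_b", "pct_c"]             # hypothetical column names
    df[pct] = df[pct].mul(100 / df[pct].sum(axis=1), axis=0)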

There are also other calculations that you could do depending on your dataset, for example, recalculating data derived from other variables.

Plotting

Whatever plotting you do at this point is preliminary; you’re not looking to interpret the data, only find anomalies. These plots won’t make it into your final report, so don’t spend a lot of time on them. Here are a few key graphics to look at.

  • Bivariate plots — Plot the relationships between independent variables having high correlations to be sure they are not statistically redundant. Redundant variables can be eliminated. Check plots of the dependent variable versus the independent variables for outliers. (A quick plotting sketch follows this list.)
  • Time-series plots — If you have any data collected over time at the same sampling point, plot the time-series. Look for incorrect dates and possible outliers in the data series.
  • Maps — If you collected any spatially dependent samples or measurements, plot the location coordinates on a map. Have field personnel review the map to see if there are any obvious errors. If your surveyor made a mistake, this is where you’ll find it.
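
If you make these throwaway plots in Python, a quick matplotlib sketch might look like this; the column names ("x", "y", "date", "result", "easting", "northing") are hypothetical stand-ins for your own variables.

    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.read_csv("my_dataset.csv", parse_dates=["date"])   # hypothetical file

    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 4))

    ax1.scatter(df["x"], df["y"])                    # bivariate check for redundancy and outliers
    ax1.set_title("x vs. y")

    ts = df.sort_values("date")
    ax2.plot(ts["date"], ts["result"], marker="o")   # look for bad dates and outliers
    ax2.set_title("result over time")

    ax3.scatter(df["easting"], df["northing"])       # map the sampling locations
    ax3.set_title("sample locations")

    plt.tight_layout()
    plt.show()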

The Right Data for the Job

As it turns out, you’ll probably have to rebuild the dataset more than once, especially if your analysis is complex. You’ll add and discard variables more times than you can count (especially if you don’t document it). Plus, different analyses within a single project can require different data structures. Graphics especially are prone to requiring some special format to get the picture the way you want it. That’s just a part of the game of data analysis. It’s like mowing the lawn. You’re only finished temporarily.

When you do update a dataset, it usually involves creating new variables but not samples. Data for the existing samples might change; for example, you might average replicates, fill in values for missing and censored data, or accommodate outliers, but you normally won’t add data from new samples. This is because the original samples are your best guess at a representation of the population. Change the samples and you change what you think the population is like. That might not sound too bad, but from a practical view, there are consequences. If you change samples, you may have to redo a lot of data scrubbing. The new samples may affect outliers, linear relationships, and so on. It’s not just the new data you have to examine, it’s also how all the data, new and old, fit together. You have to revisit all those data scrubbing steps.
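
As one small illustration, collapsing replicate analyses to a single value per sample in pandas might look like the sketch below. The column names are hypothetical, and note that it changes values for existing samples without adding any new ones.

    import pandas as pd

    df = pd.read_csv("my_dataset.csv")   # hypothetical file name

    # Average replicate results for each sample/analyte pair; this minimal version
    # keeps only the grouping columns and the averaged result.
    df = df.groupby(["SampleID", "Analyte"], as_index=False)["Result"].mean()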

Scrubbing takes time but you have to do it.

Data scrubbing isn’t a standardized process. Some data analysts do a painstakingly thorough job, looking at each variable and sample to ensure they are appropriate for the analysis. Others just remove systematic problems with automated queries and filters, preferring to let the original data speak for themselves. That’s why it’s important to document everything you do. You may have to justify your data scrubbing decisions to a client or a reviewer. You may also need the documentation for contract modifications, presentations, and any similar projects you plan to pursue.

And that’s why it takes so long and costs so much to prepare a dataset for a statistical analysis. Be prepared: it will probably consume the majority of your project budget and schedule, but you have to do it so that your analysis isn’t just garbage out.

Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at Wheatmark, amazon.com, barnesandnoble.com, or other online booksellers.

About statswithcats

Charlie Kufs has been crunching numbers for over thirty years. He retired in 2019 and is currently working on Stats with Kittens, the prequel to Stats with Cats.