Even if you took a class in statistics or another form of data analysis, you probably didn’t hear about frankendata. Frankendata is created when data, collected by different people, at different times and locations, analyzed with different procedures and equipment, and reported in different ways, are conglomerated for use in a new analysis.
In statistics classes, students are provided the same data sets so they all have at least some chance of getting the same answer. Government agencies put great effort into producing consistent data, even while knowing that the data will be harassed, abused, and tortured before the next election. So where does frankendata come from? Your boss. Your client. Your dissertation adviser. And maybe even your own evil inner twin.
Unlike the saintly sages of statistics who teach in their university utopias, your analysis overlords expect you to be able to make any conglomeration of data into a profitable analysis. Often, this is because your boss sold the client on the idea of cobbling together all the data from the consultants who had the project before your company was hired. The client totally bought into you being able to make sense of the data mishmash.
It is not uncommon that data for a statistical analysis are generated without the prior input of a statistician. Sometimes, even the statistical analysis is an afterthought, coming shortly after the investigator realizes that the data defy interpretation by any means known to him or her. In these cases, you have two possible courses of action. You can try to dodge the bullet, perhaps by explaining the problems with the dataset, and then declining the assignment. This never works. Your Boss wants the Client’s money. The sick-relative gambit works better and is a lot easier to explain, only you can’t use it very often. Most consultants, though, are simply incapable of saying no. This is not just for the money. They became consultants because they like to solve problems. And believe me, doing a statistical analysis using data that were generated without the oversight of a statistician is a problem.
To non-statisticians, data are data. Concepts like populations and representativeness and randomization and variability aren’t relevant. But data generated without statistical oversight are like cookies made by unsupervised kindergartners. You can’t expect that they followed a recipe since they can’t read yet. You can’t even assume that they know the differences between sugar and salt, or flour and baking powder, or cooking oil and motor oil. You won’t know what you might have until you take a bite. Scary thought, huh!
So what do you do if faced with this situation? You can swallow hard and not take the assignment. Recognize, though, that someone else will. If it’s an issue that’s important to you, you’ll have more control over what gets done if you’re involved. You might start by following this recipe:
- What are the ingredients? — How were samples picked relative to the population of interest? Were any steps taken to minimize variability and bias? How many good samples do you have? Are the variables appropriate for solving the problem? Are outliers and missing data likely to be issues? Can other information be included to augment the analysis?
- Is it safe to eat? — What can you do with the data given the number of samples and variables? If a complete analysis isn’t feasible, can an exploratory/pilot study or partial analysis be done?
- Where’s the Maalox? — What are the limits/caveats/uncertainties of the analysis? Will the results satisfy the client and other reviewers?
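The “what are the ingredients” questions can be partly mechanized before any modeling begins. Here is a minimal sketch in Python with pandas, tallying sample counts, missing values, and outlier candidates for each variable. The two-column dataset, its column names, and the 1.5×IQR outlier rule are my own illustrative assumptions, not anything from an actual project:

```python
import numpy as np
import pandas as pd

def audit_frankendata(df: pd.DataFrame) -> pd.DataFrame:
    """First-pass audit of a conglomerated dataset: sample counts,
    percent missing, and crude outlier candidates (values outside
    the 1.5*IQR fences) for each numeric column."""
    rows = []
    for col in df.select_dtypes(include=np.number).columns:
        s = df[col]
        q1, q3 = s.quantile(0.25), s.quantile(0.75)   # NaNs are ignored
        fence = 1.5 * (q3 - q1)
        rows.append({
            "variable": col,
            "n": int(s.notna().sum()),
            "pct_missing": round(100 * s.isna().mean(), 1),
            "outlier_candidates": int(((s < q1 - fence) | (s > q3 + fence)).sum()),
        })
    return pd.DataFrame(rows)

# Hypothetical mishmash: two consultants, different variables, gaps
df = pd.DataFrame({
    "benzene_ppb": [1.2, 0.8, np.nan, 250.0, 1.1, 0.9, 1.3, 1.0],
    "depth_ft":    [10, 12, 11, 9, np.nan, np.nan, 13, 10],
})
print(audit_frankendata(df))
```

A table like this won’t answer the representativeness or bias questions, which take judgment, but it tells you quickly whether a complete analysis is even feasible or whether you should be pitching a pilot study instead.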
If you can think through an approach that will at least get the client to the next step, it’s probably a good idea to take the assignment. If you do, be sure the client has a clear idea of what you think you can do.
I once had a client who was considering buying some property. They were looking at several parcels in an industrialized area of several square miles. The client wanted to know if the groundwater of the area was contaminated because they did not want to get caught up in a regional problem not of their making. The traditional method for answering this type of question would have been to install and sample wells on each property and then develop contour maps for each pollutant of concern. Because of the size of the area and the large number of chemicals to be analyzed for, such an approach would have been prohibitively expensive.
There were, however, scores of industrial facilities in the area that did have groundwater monitoring data, which was publicly available under a State program. The problem was that each site was a different size, from an acre to hundreds of acres, and had different numbers of wells that were sampled on different schedules for different chemical analytes. Each facility used different chemicals and so had different monitoring requirements imposed by the State. No analyte was being tested for in even half of the several hundred wells. In a nutshell, nothing was comparable.
Resurrecting these data involved having groundwater specialists review the data from all of the wells in the area. For each of the wells, the specialists determined whether any of the analytes tested for exceeded the standards established by the State. Wells with groundwater that exceeded a standard were coded as 1; wells that did not exceed a standard were coded as 0. The 0s and 1s were then used to produce a contour map of the probability that the groundwater of the area was contaminated. So the client got the information they needed at a price they could afford, and never had to face a village of angry stakeholders with their torches and pitchforks.
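The 0-and-1 coding lends itself to a simple numerical sketch. The snippet below (Python with NumPy, entirely hypothetical well coordinates and indicators) estimates the probability of exceeding a standard at arbitrary grid points by inverse-distance weighting; that is my stand-in for illustration only, since a project like the one described would more likely use a formal geostatistical method such as indicator kriging:

```python
import numpy as np

def idw_probability(xy, indicator, grid_xy, power=2.0):
    """Inverse-distance-weighted estimate of the exceedance probability
    at each grid point, from 0/1 indicators observed at well locations.
    Each estimate is a weighted average of 0s and 1s, so it always
    falls between 0 and 1 and can be contoured as a probability."""
    est = np.empty(len(grid_xy))
    for i, g in enumerate(grid_xy):
        d = np.linalg.norm(xy - g, axis=1)   # distance from grid point to each well
        if d.min() < 1e-9:                   # grid point sits on a well
            est[i] = indicator[np.argmin(d)]
        else:
            w = 1.0 / d**power               # nearer wells get more weight
            est[i] = np.sum(w * indicator) / np.sum(w)
    return est

# Hypothetical wells at the corners of a unit square; one exceeds a standard
wells = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
exceeds = np.array([1, 0, 0, 0])
grid = np.array([[0.5, 0.5], [0.1, 0.1]])   # center, and near the "dirty" well
print(idw_probability(wells, exceeds, grid))
```

The estimate rises toward 1 near wells that exceeded a standard and falls toward 0 near clean wells, which is exactly the surface you would hand to a contouring package.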
Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at Wheatmark, amazon.com, barnesandnoble.com, or other online booksellers.