Here are ten ways your love affair with statistical models can end up on the rocks.
Modeling is more than just meeting a dataset on the internet and jumping into some R code together. You have to develop relationships with the data and everyone associated with them. For example:
- Miscommunications. There are often quite a few people who have some stake in the model. They usually have different experiences and levels of understanding of modeling and, of course, different agendas for how the model will be treated. They won’t necessarily trust you. You have to try to keep them all happy and on the same page.
- Interference. You may be doing all the heavy lifting with the data and the modeling but there are often individuals, like a boss, the client, or independent reviewers, who poke their fingers into your efforts.
- Delays. You may feel under the gun to complete a modeling project but that doesn’t mean everyone associated with the project will share your constraints. You may be asked to redo the model every time new data become available, attend meetings, make presentations, and wait for decisions from upper management.
- Skepticism. Not everyone is driven to make decisions after a careful analysis of relevant data. Some people prefer to rely on their gut feel. They may look at your model but then ignore those results and use their own intuition.
- Indifference. On occasion, you might create a model, even what you consider a groundbreaking model, but nobody pays attention to it. Your model may be passed over in favor of an inferior one, like an undrafted football player being benched in favor of a million-dollar bust. Or people just don’t appreciate the importance of the model the way you do. You’ll still need to win their acceptance.
You put your heart and soul into modeling the dataset but you get … NOTHING. No love in return. No matter how much you’ve planned, you can’t find a collection of independent variables that will adequately model your dependent variable. It happens to data analysts everywhere, all the time, for a variety of reasons. There may be non-linear relationships, outliers, or excessive uncontrolled variance. The variables may be inappropriate or inefficient.
What can you do?
First, you should reexamine the theory behind your model. Are your hypotheses and assumptions valid? Are your data suspect? Are the metrics you’re using as variables problematic? Are there latent concepts you could explore in a Factor Analysis? Do your samples need to be categorized in some way? Might conducting a Cluster Analysis provide insight?
Next, if you have appropriate software, consider looking into nonlinear statistical regression, neural networks, and data mining solutions. Finally, there may be ways to construct probabilistic models, or models based on optimization procedures, or relative solutions from experts using a Delphi Method.
In the end:
Some models were not meant to be. If you can’t fit the model to the data, you have to be prepared to call it quits. In a way, this is equivalent to a Do Not Resuscitate order in medicine, and likewise, it can be a sensitive subject. It’s usually easier to create new variables or try some other statistical manipulation than it is to give the bad news, and the bill, to the client.
Sometimes models go wrong right out of the box because they are improperly specified. You may not be pursuing the relationship for the right reasons or in the right ways. For example:
- The dependent or independent variables may be too expensive to collect. The model may even cost more to run than addressing the problem is worth.
- The dependent variable may not be actionable, at least not within the limits set by the client.
- An independent variable might incorporate part of the dependent variable, if one or the other is a ratio.
- The structure of the model may be wrong. For example, the model might work better in a multiplicative or other non-linear form than in a linear one.
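To make the last point concrete, here is a minimal sketch of what a structural mismatch can look like. The data are made up and follow a multiplicative (power-law) form exactly; the helper names `fit_line` and `r_squared` are hypothetical and written for this illustration only. Fitting a straight line to the raw values leaves systematic error, while fitting the same straight line to the logarithms recovers the multiplicative form.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(ys, preds):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Made-up data that follow y = 2 * x**1.5 with no noise (an assumption
# chosen purely to make the structural issue easy to see).
xs = [float(x) for x in range(1, 21)]
ys = [2.0 * x ** 1.5 for x in xs]

# Linear form: y = b0 + b1 * x
b0, b1 = fit_line(xs, ys)
r2_linear = r_squared(ys, [b0 + b1 * x for x in xs])

# Multiplicative form: y = a * x**b, fitted as log(y) = log(a) + b * log(x)
log_a, b = fit_line([math.log(x) for x in xs], [math.log(y) for y in ys])
a = math.exp(log_a)
r2_power = r_squared(ys, [a * x ** b for x in xs])

print(f"linear form:         r2 = {r2_linear:.4f}")
print(f"multiplicative form: r2 = {r2_power:.4f}, a = {a:.3f}, b = {b:.3f}")
```

The linear fit may still look respectable on paper, which is part of the trap. A plot of its residuals would show a curved pattern that the multiplicative form does not leave behind.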
There are many different types of models, like fish in the sea. Some people are always looking for something better, even if what they have is pretty good.
For example, you might have a good model but it’s not what the client expected. Perhaps the results are not what the client wanted to hear, or the model may look good for general trends but not be an adequate representation of the phenomenon for extreme or special cases. The client wants you to start over and bring back something better.
One concept that often confuses novice model builders is the difference between models aimed at prediction and models aimed at explanation. Explanatory models are based on theory. They need to incorporate independent variables that make theoretical or logical sense to be associated with the dependent variable. Prediction models don’t rely on theory. They need independent variables that produce large values of the Coefficient of Determination (r²) and low values of the Standard Error of Estimate (sxy or SEE). Explanatory models assume (or hope) that there are cause-effect relationships between the dependent variable and the independent variables; prediction models do not.
That’s where some clients balk if the model doesn’t have the variables they feel should be in a prediction model. It usually doesn’t matter if the model produces excellent predictions; they feel it would be better if their favorite variables were there … even though it wouldn’t.
It’s not just clients, though. There are times when model builders, especially young professionals, want to try out some new analytical breakthrough. The tried-and-true regression approach may produce results that are nearly as good, but the cutting edge model looks and sounds so much sexier. It’s seductive, and for some, hard to resist.
Don’t you just hate it when you see something that isn’t at all the way it was described? “Hey, you should try analyzing this dataset. It’s a perfect match for you.” But then when you meet up, it’s nothing like you expected.
And it’s not just what goes into a model that might be disappointing but also what comes out of modeling activities. The regression model itself might be improperly specified or misleading. Sometimes correctly specified models are poorly calibrated. Fortunately, there is a variety of statistical diagnostics and plots that can be used to identify the problems.
Every measurement of a phenomenon includes characteristics of the population and natural variability as well as unwanted sampling variability, measurement variability, and environmental variability. You can’t understand your data unless you control extraneous variance attributable to the way you select samples, the way you measure variable values, and any influences of the environment in which you are working. If you plan to conduct a statistical analysis, you need to understand the three fundamental Rs of variance control — Reference, Replication, and Randomization. Using the concepts of reference, replication and randomization, you can control, minimize, or at least be able to assess the effects of extraneous variability using: procedural controls; quality samples and measurements; sampling controls; experimental controls; and statistical controls.
Even after spending considerable effort trying to control extraneous variance in data collection, though, sometimes the models built from those data don’t share that precision. The models may have good accuracy, shown by large values of the Coefficient of Determination (r²), but low precision, shown by a large Standard Error of Estimate (sxy or SEE). You might have an accurate predictive model but it lacks enough precision to be useful. This is a surprisingly common occurrence. Some data analysts don’t seem to look past the r². The sxy is ignored.
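As a rough illustration of a high r² paired with a large SEE, the sketch below fits a simple regression to made-up data and reports both statistics. The data, the size of the scatter, and the helper name `fit_line` are assumptions for this example. The SEE is computed with n - 2 degrees of freedom, the usual choice for a one-predictor regression, and the "plus or minus two SEE" band is only a crude approximation of a prediction interval.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up data: a strong trend (about 10 units of y per unit of x) with a
# fixed scatter of +15 / -15 around it. The scatter is deterministic so the
# example gives the same numbers on every run.
xs = [float(x) for x in range(1, 21)]
ys = [100.0 + 10.0 * x + (15.0 if i % 2 == 0 else -15.0)
      for i, x in enumerate(xs)]

intercept, slope = fit_line(xs, ys)
preds = [intercept + slope * x for x in xs]

n = len(xs)
mean_y = sum(ys) / n
ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
ss_tot = sum((y - mean_y) ** 2 for y in ys)

r2 = 1 - ss_res / ss_tot             # Coefficient of Determination
see = math.sqrt(ss_res / (n - 2))    # Standard Error of Estimate

print(f"r2  = {r2:.3f}")
print(f"SEE = {see:.1f}")
print(f"rough prediction band: +/- {2 * see:.0f} units of y")
```

Whether a band that wide is acceptable depends entirely on what the prediction is for. The r² alone cannot answer that question, which is why the SEE deserves a look.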
Look at any studies you can that involve predictive modeling. Do they discuss the uncertainty in the predictions? What do you think?
Sometimes you spend months and even longer getting to know your data and building a relationship only to have the model taken away. Maybe it’s a boss or more senior co-worker. Maybe it’s the client. You can chase after your model, keep up to speed with what’s happening in the model’s life, but that’s about it. There’s not much else you can do. It’s somebody else’s responsibility now.
You and your model may reach a point where you might want to go to the next level in your relationship only to find there are differences you did not expect and can’t overcome. When you try to extend the relationship to new situations, everything fails. There are several possible reasons. Maybe you have a multi-level model. What worked for the samples you used doesn’t work when they are aggregated into higher level associations. Maybe you’re a victim of Simpson’s Paradox. What worked for the samples you used doesn’t work when they are separated into component groups. Then again, maybe it’s something you did. Maybe your model is overfit. Perhaps you capitalized on chance and found associations that weren’t pervasive and lasting. The only thing you can do is reexamine the relationship and either start over or move on.
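Simpson’s Paradox is easier to believe once you see it in numbers. The sketch below uses two made-up groups (the values and the helper name `ols_slope` are assumptions for illustration). Within each group the relationship between x and y is negative, yet pooling the groups produces a positive slope.

```python
def ols_slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Two made-up groups. Inside each group, y falls as x rises.
group_a_x = [1.0, 2.0, 3.0, 4.0, 5.0]
group_a_y = [5.0, 4.0, 3.0, 2.0, 1.0]
group_b_x = [6.0, 7.0, 8.0, 9.0, 10.0]
group_b_y = [10.0, 9.0, 8.0, 7.0, 6.0]

slope_a = ols_slope(group_a_x, group_a_y)
slope_b = ols_slope(group_b_x, group_b_y)

# Pool the groups and fit one line to everything. Group B sits higher on
# both x and y, so the between-group difference dominates the pooled fit.
slope_pooled = ols_slope(group_a_x + group_b_x, group_a_y + group_b_y)

print(f"slope within group A: {slope_a:+.2f}")
print(f"slope within group B: {slope_b:+.2f}")
print(f"slope when pooled:    {slope_pooled:+.2f}")
```

A model built on the pooled data would point in the wrong direction for every individual group, which is one way a model that looked fine can fail when you extend it.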
There comes a time when you have to decide whether to commit to the effort to build a relationship or back out of the commitment. Maybe you don’t have enough samples. Maybe your goals don’t fit what the model needs. Perhaps the model is being asked to do something it wasn’t designed for. What works for describing a population may not be suited to describing individuals in the population. Then there might also be ethical issues to consider. But statisticians rarely get to make these decisions. If they accepted the assignment, the product belongs to the client.
Deploying a model can sometimes change the behaviors of the population the model is based on. This is especially true when humans are involved; humans just love to game the rules. For example, if you develop a model for allocating resources, you can be assured that the potential recipients will do whatever it takes to increase their advantage. Once they do that, the model is no longer useful. That’s why models are often kept secret.
Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at amazon.com, barnesandnoble.com, or other online booksellers.