Is Your Regression Model Telling the Truth?
There are many technologies we use in our lives without really understanding how they work. Television. Computers. Cell phones. Microwave ovens. Cars. Even many things about the human body are not well understood. But I don’t mean how to use these mechanisms. Everyone knows how to use these things. I mean understanding them well enough to fix them when they break. Regression analysis is like that too. Only with regression analysis, sometimes you can’t even tell if there’s something wrong without consulting an expert.
Here are some tips for troubleshooting regression models.
Diagnosis
You may know how to use regression analysis, but unless you’re an expert, you may not know about some of the more subtle pitfalls you may encounter. The biggest red flag that something is amiss is the TGTBT, too good to be true. If you encounter an R-squared value above 0.9, especially unexpectedly, there’s probably something wrong. Another red flag is inconsistency. If estimates of the model’s parameters change between data sets, there’s probably something wrong. And if predictions from the model are less accurate or precise than you expected, there’s probably something wrong. Here are some guidelines for troubleshooting a model you developed.
Your Model | Identification | Correction |
Not Enough Samples | If you have fewer than 10 observations for each independent variable you want to put in a model, you don’t have enough samples. | Collect more samples. 100 observations per variable is a good target to shoot for although more is usually better. |
No Intercept | You’ll know it if you do it. | Put in an intercept and see if the model changes. |
Stepwise Regression | You’ll know it if you do it. | Don’t abdicate model building decisions to software alone. |
Outliers | Plot the dependent variable against each independent variable. If more than about 5% of the data pairs plot noticeably apart from the rest of the data points, you may have outliers. | Conduct a test on the aberrant data points to determine if they are statistical anomalies. Use diagnostic statistics like leverage to evaluate the effects of suspected outliers. Evaluate the metadata of the samples to determine if they are representative of the population being modeled. If so, retain the outlier as an influential observation (AKA leverage point). |
Non-linear relationships | Plot the dependent variable against each independent variable. Look for nonlinear patterns in the data. | Find an appropriate transformation of the independent variable. |
Overfitting | If you have a large number of independent variables, especially if they use a variety of transformations and don’t contribute much to the accuracy and precision of the model, you may have overfit the model. | Keep the model as simple as possible. Make sure the ratio of observations to independent variables is large. Use diagnostic statistics like AIC and BIC to help select an appropriate number of variables. |
Misspecification | Look for any variants of the dependent variable in the independent variables. Assess whether the model meets the objectives of the effort. | Remove any elements of the dependent variable from the independent variables. Remove at least one component of variables describing mixtures. Ensure the model meets the objectives of the effort with the desired accuracy and precision. |
Multicollinearity | Calculate correlation coefficients and plot the relationships between all the independent variables in the model. Look for high correlations. | Use diagnostic statistics like VIF to evaluate the effects of suspected multicollinearity. Remove intercorrelated independent variables from the model. |
Heteroscedasticity | Plot the variance at each level of an ordinal-scale dependent variable or appropriate ranges of a continuous-scale dependent variable. Look for variances that differ by more than a factor of about five. | Try to find an appropriate Box-Cox transformation or consider nonparametric regression or data mining methods. |
Autocorrelation | Plot the data over time, location or the order of sample collection. Calculate a Durbin–Watson statistic for serial correlation. | If the autocorrelation is related to time, develop a correlogram and a partial correlogram. If the autocorrelation is spatial, develop a variogram. If the autocorrelation is related to the order of sample collection, examine metadata to try to identify a cause. |
Weighting | You’ll know it if you do it. | Compare the weighted model with the corresponding unweighted model to assess the effects of weighting. Consider the validity of weighting; seek expert advice if needed. |
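Several of the identification checks in the table come down to simple arithmetic. The following is a minimal sketch in plain Python, not a replacement for a statistics package. The function names and the sample numbers are made up for illustration; the thresholds in the comments (10 observations per variable, a variance ratio of about five) are the rules of thumb from the table.

```python
from math import sqrt
from statistics import mean, variance


def observations_per_variable(n_obs, n_predictors):
    """Ratio of observations to independent variables (table: aim for at least 10)."""
    return n_obs / n_predictors


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals in collection order.

    Values near 2 suggest no first-order serial correlation; values toward
    0 or 4 suggest positive or negative serial correlation.
    """
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


def variance_ratio(groups):
    """Largest sample variance divided by the smallest across groups.

    Each group needs at least two values, and no group may have zero variance.
    """
    variances = [variance(g) for g in groups]
    return max(variances) / min(variances)


# Illustrative numbers only, not real data.
print(observations_per_variable(45, 5))           # 9.0, below the 10-per-variable guideline
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))      # 1.0, perfectly correlated predictors
print(durbin_watson([1, -1, 1, -1]))              # 3.0, alternating residuals
print(variance_ratio([[1, 2, 3], [2, 4, 6]]))     # 4.0, under the factor-of-five guideline
```

These numbers only point to where a problem might be. The plots and corrections in the table are still needed to confirm it and fix it.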
Sometimes the model you are skeptical about isn’t one you developed; it was developed by another data analyst. The major difference is that with other analysts’ models, you won’t have access to all their diagnostic statistics and plots, let alone their data. If you have been retained to review another analyst’s work, you can always ask for the information you need. If, however, you’re reading about a model in a journal article, book, or website, you’ve probably got all the information you’re ever going to get. You have to be a statistical detective. Here are some clues you might look for.
Another Analyst’s Model | Identification |
Not Enough Samples | If the analyst reported the number of samples used, look for at least 10 observations for each independent variable in the model. |
No Intercept | If the analyst reported the actual model (some don’t), look for a constant term. |
Stepwise Regression | Unless another approach is reported, assume the analyst used some form of stepwise regression. |
Outliers | Assuming the analyst did not provide plots of the dependent variable versus the independent variables, look for R-squared values that are much higher or lower than expected. |
Non-linear relationships | Assuming the analyst did not provide plots of the dependent variable versus the independent variables, look for a lower-than-expected R-squared value from a linear model. If there are non-linear terms in the model, this is probably not an issue. |
Overfitting | Look for a large number of independent variables in the model, especially if they use different types of transformations. |
Misspecification | Look for any variants of the dependent variable in the independent variables. Assess whether the model meets the objectives of the effort. |
Multicollinearity | Assuming relevant plots and diagnostic statistics are not available, there may not be any way to identify multicollinearity. |
Heteroscedasticity | Assuming relevant plots and diagnostic statistics are not available, there may not be any way to identify heteroscedasticity. |
Autocorrelation | Assuming relevant plots and diagnostic statistics are not available, there may not be any way to identify serial correlation. |
Weighting | Compare the reported number of samples to the degrees of freedom. Any differences may be attributable to weighting. |
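A few of these clues can be checked from the summary numbers alone. The sketch below is a hypothetical helper, not an established tool. It assumes the degrees of freedom the analyst reported are residual degrees of freedom, which for an ordinary least-squares model with an intercept equal the number of observations minus the number of independent variables minus one.

```python
def review_reported_model(n_obs, n_predictors, residual_df=None, has_intercept=True):
    """Return a list of warnings based only on reported summary numbers.

    residual_df is assumed to be the residual degrees of freedom; pass None
    if the analyst did not report it.
    """
    warnings = []

    # Rule of thumb from the table: at least 10 observations per variable.
    if n_obs / n_predictors < 10:
        warnings.append("fewer than 10 observations per independent variable")

    # A reported model with no constant term deserves a second look.
    if not has_intercept:
        warnings.append("no constant term in the reported model")

    # A mismatch between sample size and degrees of freedom may point to weighting.
    if residual_df is not None:
        expected_df = n_obs - n_predictors - (1 if has_intercept else 0)
        if residual_df != expected_df:
            warnings.append(
                f"residual df {residual_df} differs from expected {expected_df}; "
                "possible weighting"
            )
    return warnings


# Illustrative numbers only: 40 samples, 5 independent variables, 34 residual df.
for message in review_reported_model(40, 5, residual_df=34):
    print(message)
```

An empty list does not mean the model is sound, only that these particular clues turned up nothing.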
No Doubts
So there are some ways you can identify and evaluate eleven reasons for doubting a regression model. Remember when evaluating other analysts’ models that not everyone is an expert and that even experts make mistakes. Try to be helpful in your critiques, but at a minimum, be professional.