Common Reasons for Doubting a Regression Model
Finding a model that fits a set of data is one of the most common goals in data analysis. Least squares regression is the most commonly used tool for achieving this goal. It’s a relatively simple concept, it’s easy to do, and there’s a lot of readily available software to do the calculations. It’s even taught in many Statistics 101 courses. Everybody uses it … and therein lies the problem. Even if there is no intention to mislead anyone, it does happen.
Here are eleven of the most common reasons to doubt a regression model.
Not Enough Samples
Accuracy is a critical component for evaluating a model. The coefficient of determination, also known as R-squared or R², is the most often cited measure of accuracy. Now obviously, the more accurate a model is the better, so data analysts look for large values of R-squared.
R-squared is designed to estimate the maximum relationship between the dependent and independent variables based on a set of samples (cases, observations, records, or whatever). If there aren’t enough samples compared to the number of independent variables in the model, the estimate of R-squared will be especially unstable. The effect is greatest when the R-squared value is small, the number of samples is small, and the number of independent variables is large, as shown in this figure.
The inflation in the value of R-squared can be assessed by calculating the shrunken R-squared. The figure shows that for an R-squared value above 0.8 with 30 cases per variable, there isn’t much shrinkage. Lower estimates of R-squared, however, experience considerable shrinkage.
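A minimal sketch of the shrinkage calculation, assuming the shrunken R-squared is the familiar adjusted R-squared, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p is the number of independent variables. The source does not say which formula produced its figure, so the formula, the function name `shrunken_r_squared`, and the five-variable example are assumptions for illustration.

```python
def shrunken_r_squared(r_squared, n_samples, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    Assumption: this is the common adjusted R-squared formula; the source
    does not state which shrinkage formula it used.
    """
    degrees_of_freedom = n_samples - n_predictors - 1
    if degrees_of_freedom <= 0:
        raise ValueError("need more samples than predictors plus one")
    return 1.0 - (1.0 - r_squared) * (n_samples - 1) / degrees_of_freedom


# Hypothetical model with five independent variables.
n_predictors = 5
for r2 in (0.2, 0.5, 0.8):
    for cases_per_variable in (3, 10, 30):
        n_samples = cases_per_variable * n_predictors
        shrunken = shrunken_r_squared(r2, n_samples, n_predictors)
        print(f"R2={r2:.1f}  cases/variable={cases_per_variable:2d}  "
              f"shrunken R2={shrunken:.3f}")
```

Under these assumptions, an R-squared of 0.8 with 30 cases per variable shrinks only to about 0.79, while an R-squared of 0.2 with 3 cases per variable shrinks below zero.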
You can’t control the magnitude of the relationship between a dependent variable and a set of independent variables, and often, you won’t have total control over the number of samples and variables either. So, you have to be aware that R-squared will be overestimated and treat your regression models with some skepticism.
No Intercept
Almost all software that performs regression analysis provides an option to not include an intercept term in the model. This sounds convenient, especially for relationships that presume a one-to-one relationship between the dependent and independent variables. But when an intercept is excluded from the model, it’s not omitted from the analysis; it is set to zero. Look at any regression model with “no intercept” and you’ll see that the regression line goes through the origin of the axes.
With the regression line nailed down on one end at the origin, you might expect that the value of R-squared would be diminished because the line wouldn’t necessarily travel through the data in a way that minimizes the differences between the data points and the regression line, called the errors or residuals. Instead, R-squared is artificially inflated. When the correction provided by the intercept is removed, the total variation is measured around zero rather than around the mean of the dependent variable, so it increases. The ratio of the variability attributable to the model to the total variability also increases, hence the increase in R-squared.
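The inflation can be reproduced with a few lines of arithmetic. This is a sketch on hypothetical data, assuming the convention many packages follow for no-intercept models, in which total variation is the sum of squared values of the dependent variable rather than the sum of squared deviations from its mean. The function names are made up for this example.

```python
def r_squared_with_intercept(x, y):
    """R-squared for y = intercept + slope * x, total variation about the mean."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_total = sum((b - mean_y) ** 2 for b in y)  # centered on the mean
    return 1.0 - ss_residual / ss_total


def r_squared_no_intercept(x, y):
    """R-squared for y = slope * x, total variation measured about zero."""
    slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    ss_residual = sum((b - slope * a) ** 2 for a, b in zip(x, y))
    ss_total = sum(b * b for b in y)  # uncentered: measured from zero
    return 1.0 - ss_residual / ss_total


# Hypothetical data: a weak trend sitting well away from the origin.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [12, 9, 13, 10, 14, 11, 15, 12]

r2_with = r_squared_with_intercept(x, y)
r2_without = r_squared_no_intercept(x, y)
print(f"with intercept:    R-squared = {r2_with:.3f}")
print(f"without intercept: R-squared = {r2_without:.3f}")
```

For these made-up numbers, the model with an intercept has an R-squared of about 0.17, while the no-intercept model reports about 0.83 even though it fits the points worse.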
The solution is simple. Always have an intercept term in the model unless there is a compelling theoretical reason not to include it. In that case, don’t put all your trust in R-squared (or the F-tests).
Stepwise Regression
Stepwise regression is a data analyst’s dream. Throw all the variables into a hopper, grab a cup of coffee, and the silicon chips will tell you which variables yield the best model. That irritates hard-core statisticians who don’t like amateurs messing around with their numbers. You can bet, though, that at least some of them go home at night, throw all the food in their cupboard into a crock pot, and expect to get a meal out of it.
The cause of some statisticians’ consternation is that stepwise regression will select the variables that are best for the data set, but not necessarily for the population. Model test probabilities are optimistic because they don’t account for the stepwise procedure’s ability to capitalize on chance. Moreover, adding new variables will never decrease R-squared and will almost always increase it, so you need some good ways to decide how many variables are too many. There are ways to do this, so using stepwise regression alone isn’t a fatal flaw. Like with guns, drugs, and fast food, you have to be careful how you use it.
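Capitalizing on chance is easy to see with pure noise. The sketch below is not a full stepwise procedure. It performs a single forward-selection step on simulated data in which no candidate variable has any real relationship with the dependent variable, then checks the winner against held-out data. The sample sizes, seed, and variable names are arbitrary choices for illustration.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation; equals R-squared for a one-predictor fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


random.seed(0)  # fixed seed so the run is repeatable
n_train, n_holdout, n_candidates = 20, 20, 50
n_total = n_train + n_holdout

# The dependent variable and every candidate predictor are pure noise.
y = [random.gauss(0, 1) for _ in range(n_total)]
candidates = [[random.gauss(0, 1) for _ in range(n_total)]
              for _ in range(n_candidates)]

# One forward-selection step: keep the candidate that fits the training half best.
train_r2 = [r_squared(c[:n_train], y[:n_train]) for c in candidates]
best = max(range(n_candidates), key=lambda j: train_r2[j])

# Squared correlation of the chosen variable with y in the half it never saw.
holdout_r2 = r_squared(candidates[best][n_train:], y[n_train:])

print(f"best training R-squared: {train_r2[best]:.3f}")
print(f"holdout R-squared:       {holdout_r2:.3f}")
```

The winning variable tends to look respectable on the data used to select it and much weaker on the held-out half, which is why a selected model needs to be checked against data it has not seen.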
If you use stepwise regression, be sure to look at the diagnostic statistics for the model. Also, verify your results using a different data set by splitting the data set before you do any analysis, by randomly extracting observations from the original data set to create new data sets, or by collecting new samples.
Outliers
Outliers are a special irritant for data analysts. They’re not really that tough to identify but they cause a variety of problems that data analysts have to deal with. The first problem is convincing reviewers not familiar with the data that the outliers are in fact outliers. Second, the data analysts have to convince all reviewers that what they want to do with them, delete or include or whatever, is the appropriate thing to do. One way or another, though, outliers will wreak havoc with R-squared.
Consider this figure, which comes from an analysis of slug tests to estimate the hydraulic conductivity of an aquifer. The red circles show the relationship between rising-head and falling-head slug tests performed on groundwater monitoring wells. The model for this relationship has an R-squared of 0.90. The blue diamond is an outlier along the trend (same regression equation) about 60% greater than the next highest value. The R-squared of this equation is 0.95. The green square is an outlier perpendicular to the trend. The R-squared of this equation is 0.42. Those are fairly sizable differences to have been caused by a single data point.
How should you deal with outliers? I usually delete them because I’m usually looking to model trends and other patterns. But outliers are great thought provokers. Sometimes they tell you things the patterns don’t. If you’re not comfortable deciding what to do with an outlier, run the analysis both with and without outliers, a time consuming and expensive approach. The other approach would be to get the reviewer, an interested stakeholder, or an independent expert involved in the decision. That approach is time consuming and expensive too. Pick your poison.
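Running the analysis with and without a suspect point can be as simple as the sketch below. The numbers are hypothetical, not the slug-test data from the figure, and the single outlier is placed well off the trend on purpose.

```python
def r_squared(x, y):
    """Squared Pearson correlation; equals R-squared for a one-predictor fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Hypothetical data lying close to a straight line.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.1, 6.8, 8.0]

# One hypothetical outlier, far from the trend.
outlier_x, outlier_y = 2, 9

r2_without_outlier = r_squared(x, y)
r2_with_outlier = r_squared(x + [outlier_x], y + [outlier_y])

print(f"without outlier: R-squared = {r2_without_outlier:.3f}")
print(f"with outlier:    R-squared = {r2_with_outlier:.3f}")
```

For these made-up values, one added point drops R-squared from roughly 0.996 to roughly 0.35.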
Non-linear Relationships
Linear regression assumes that the relationship between a dependent variable and a set of independent variables is additive and linear. If the relationship is actually nonlinear, the R-squared for the linear model will be lower than it would be for a better fitting nonlinear model.
This figure shows the relationship between the number of employed individuals and the number of individuals not in the U.S. work force between 1980 and 2009. The linear model has a respectable R-squared value of 0.84, but the polynomial model fits the data much better with an R-squared value of 0.95.
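The same effect shows up on a small synthetic example. This sketch does not use the employment data from the figure, and it fits a straight line to a squared transformation of the independent variable rather than a full polynomial model. The data are constructed to follow an exact quadratic curve, so the transformed fit is perfect by design.

```python
def r_squared(x, y):
    """Squared Pearson correlation; equals R-squared for a one-predictor fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Hypothetical curved relationship: y = 2 + 0.5 * x^2 with no noise.
x = [0, 1, 2, 3, 4, 5, 6]
y = [2 + 0.5 * v ** 2 for v in x]

r2_linear = r_squared(x, y)                           # straight line in x
r2_squared_term = r_squared([v ** 2 for v in x], y)   # straight line in x^2

print(f"linear in x:   R-squared = {r2_linear:.3f}")
print(f"linear in x^2: R-squared = {r2_squared_term:.3f}")
```

The straight-line fit still earns an R-squared of about 0.92, which looks respectable until it is compared with the fit that matches the shape of the data.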
Overfitting
Overfitting involves building a statistical model solely by optimizing statistical parameters, and usually involves using a large number of variables and transformations of the variables. The resulting model may fit the data almost perfectly but will produce erroneous results when applied to another sample from the population.
The concern about overfitting may be somewhat overstated. Overfitting is like becoming too muscular from weight training. It doesn’t happen suddenly or simply. If you know what overfitting is, you’re not likely to become a victim. It’s not something that happens in a keystroke. It takes a lot of work fine tuning variables and what not. It’s also usually easy to identify overfitting in other people’s models. Simply look for a conglomeration of manual numerical adjustments, mathematical functions, and variable combinations.
Misspecification
Misspecification involves including terms in a model that make the model look great statistically even though the model is problematical. Often, misspecification involves placing the same or very similar variable on both sides of the equation.
Consider this example from economics. A model for the U.S. Gross Domestic Product (GDP) was developed using data on government spending and unemployment from 1947 to 1997. The model:
GDP = (121*Spending) – (3.5*Spending²) + (136*Time) – (61*Unemployment) – 566
had an R-squared value of 0.9994. Such a high R-squared value is a signal that something is amiss. R-squared values that high are usually only seen in models involving equipment calibration, and certainly not anything involving capricious human behavior. A closer look at the study indicated that the model terms involving spending were based on an index of the government’s outlays relative to the economy. Usually, indexing a variable to a baseline or standard is a good thing to do. In this case, though, the spending index was the ratio of government outlays to GDP. Thus, the model was:
GDP = (121*Outlays/GDP) – (3.5*(Outlays/GDP)²) + (136*Time) – (61*Unemployment) – 566
GDP appears on both sides of the equation, thus accounting for the near perfect correlation. This is a case in which an index, at least one involving the dependent variable, should not have been used.
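A toy illustration of why the index causes trouble, using made-up numbers rather than the 1947 to 1997 data. Here the hypothetical outlays are almost unrelated to the hypothetical GDP, yet dividing outlays by GDP produces a predictor that tracks GDP closely, simply because GDP is in the denominator.

```python
def r_squared(x, y):
    """Squared Pearson correlation; equals R-squared for a one-predictor fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Hypothetical values in arbitrary units.
gdp = [10, 20, 30, 40, 50, 60]
outlays = [5, 7, 5, 7, 5, 7]  # bounces around, barely related to gdp

# The index places the dependent variable inside the predictor.
spending_index = [o / g for o, g in zip(outlays, gdp)]

r2_raw = r_squared(outlays, gdp)
r2_index = r_squared(spending_index, gdp)

print(f"GDP vs outlays:       R-squared = {r2_raw:.3f}")
print(f"GDP vs outlays / GDP: R-squared = {r2_index:.3f}")
```

With these numbers, R-squared rises from about 0.09 for raw outlays to about 0.81 for the index, without any new information entering the model.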
Another misspecification involves creating a prediction model having independent variables that are more difficult, time consuming, or expensive to generate than the dependent variable. You might as well just measure the dependent variable when you need to know its value. Similarly with forecasting (prediction of the future) models, if you need to forecast something a year in advance, don’t use predictors that are measured less than a year in advance.
Multicollinearity
Multicollinearity occurs when a model has two or more independent variables that are highly correlated with each other. The consequences are that the model will look fine, but predictions from the model will be erratic. It’s like a football team. The players perform well together but you can’t necessarily tell how good individual players are. The team wins, yet in some situations, the cornerback or offensive tackle will get beat on most every play.
If you ever tried to use independent variables that add to a constant, you’ve seen multicollinearity in action. In the case of perfect correlations, such as these, statistical software will crash because it won’t be able to perform the matrix mathemagics of regression. Most instances of multicollinearity involve weaker correlations that allow statistical software to function, yet the predictions of the model will still be erratic.
Multicollinearity occurs often in the social sciences and other fields of study in which many variables are measured in the process of model building. Diagnosis of the problem is simple if you have access to the data. Look at correlations between the independent variables. You can also look at the variance inflation factors (VIFs). The VIF for an independent variable is the reciprocal of one minus the R-squared value obtained by regressing that variable on the other independent variables. VIFs are measures of how much the variances of the model’s coefficient estimates are inflated because of multicollinearity. The VIF for a variable should be less than 10 and ideally near 1.
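A dependency-free sketch of the VIF calculation, using the standard definition VIF = 1 / (1 - R²), where R² comes from regressing one independent variable on the others. To stay short it is hard-coded for exactly three independent variables, and the data and function names are hypothetical. Statistical packages provide general-purpose versions.

```python
def cross_product(a, b):
    """Sum of products of deviations from the means of a and b."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))


def r_squared_two_predictors(target, p1, p2):
    """R-squared from regressing target on p1 and p2 with an intercept."""
    s11, s22, s12 = cross_product(p1, p1), cross_product(p2, p2), cross_product(p1, p2)
    s1t, s2t = cross_product(p1, target), cross_product(p2, target)
    stt = cross_product(target, target)
    determinant = s11 * s22 - s12 * s12
    b1 = (s1t * s22 - s2t * s12) / determinant
    b2 = (s2t * s11 - s1t * s12) / determinant
    return (b1 * s1t + b2 * s2t) / stt


def variance_inflation_factors(x1, x2, x3):
    """VIF for each of three independent variables: 1 / (1 - R_j^2)."""
    predictors = [x1, x2, x3]
    vifs = []
    for j, target in enumerate(predictors):
        others = [p for k, p in enumerate(predictors) if k != j]
        r2 = r_squared_two_predictors(target, others[0], others[1])
        vifs.append(1.0 / (1.0 - r2))
    return vifs


# Hypothetical data: x1 and x2 move together, x3 is unrelated to both.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
x3 = [1, -1, -1, 1, 1, -1, -1, 1]

vifs = variance_inflation_factors(x1, x2, x3)
for name, vif in zip(("x1", "x2", "x3"), vifs):
    print(f"VIF({name}) = {vif:.4f}")
```

In this example the two correlated variables each have a VIF of about 5.5, while the unrelated variable has a VIF of 1.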
If you suspect multicollinearity, don’t worry about the model but don’t believe any of the predictions.
Heteroscedasticity
Regression, and practically all parametric statistics, requires that the variances in the model residuals be equal at every value of the dependent variable. This assumption is called equal variances, homogeneity of variances, or coolest of all, homoscedasticity. Violate the assumption and you have heteroscedasticity.
Heteroscedasticity is assessed much more commonly in analysis of variance models than in regression models. This is probably because the dependent variable in ANOVA is measured on a categorical scale while the dependent variable in regression is measured on a continuous scale. The solution to this is fairly simple. Break the dependent variable scale into intervals, like in a histogram, and calculate the variance for each interval. The variances don’t have to be precisely equal, but variances different by a factor of five are problematical. Unequal variances will wreak havoc on any tests or confidence limits calculated for model predictions.
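The interval check described above might look like the following sketch. It assumes the quantity whose variance matters is the model residual, uses equal-width intervals of the dependent variable, and flags a ratio of largest to smallest variance above five. The data, the number of intervals, and the function name `interval_variances` are hypothetical.

```python
from statistics import variance


def interval_variances(scale, residuals, n_intervals=3):
    """Variance of the residuals within equal-width intervals of `scale`."""
    lowest, highest = min(scale), max(scale)
    width = (highest - lowest) / n_intervals
    groups = [[] for _ in range(n_intervals)]
    for value, residual in zip(scale, residuals):
        # The largest value falls on the upper edge, so clamp it into the last interval.
        index = min(int((value - lowest) / width), n_intervals - 1)
        groups[index].append(residual)
    # A variance needs at least two points, so sparse intervals are skipped.
    return [variance(group) for group in groups if len(group) >= 2]


# Hypothetical dependent variable values and residuals that fan out as y grows.
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
residuals = [0.1, -0.1, 0.1, -0.1, 0.5, -0.5, 0.5, -0.5, 2, -2, 2, -2]

variances = interval_variances(y, residuals)
ratio = max(variances) / min(variances)

print("variance by interval:", [round(v, 4) for v in variances])
print(f"largest / smallest = {ratio:.0f}")
print("unequal variances suspected:", ratio > 5)
```

Here the variance in the top interval is about 400 times the variance in the bottom interval, far beyond the factor-of-five guideline.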
Autocorrelation
Autocorrelation involves a variable being correlated with itself. It is the correlation between data points and the data points listed a fixed number of positions earlier (the offset is termed a lag). Usually, autocorrelation involves time-series data or spatial data, but it can also involve the order in which data are collected. The terms autocorrelation and serial correlation are often used interchangeably. If the data points are collected at a constant time interval, the term autocorrelation is more typically used.
If the residuals of a model are autocorrelated, it’s a sure bet that the variances will also be unequal. That means, again, that tests or confidence limits calculated from variances should be suspect.
To check a variable or residuals from a model for autocorrelation, you can conduct a Durbin-Watson test. The Durbin-Watson test statistic ranges from 0 to 4. If the statistic is close to 2.0, then serial correlation is not a problem. Values well below 2 point to positive serial correlation and values well above 2 point to negative serial correlation. Most statistical software will allow you to conduct this test as part of a regression analysis.
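The statistic itself is simple to compute by hand: it is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below computes only the statistic, not the significance test, and the residual series are made up to show the two extremes.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in collection order."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Hypothetical residuals that drift steadily: positive serial correlation.
drifting = [1, 2, 3, 4]
# Hypothetical residuals that flip sign every step: negative serial correlation.
alternating = [1, -1, 1, -1]

print(f"drifting residuals:    {durbin_watson(drifting):.2f}")
print(f"alternating residuals: {durbin_watson(alternating):.2f}")
```

The drifting series gives a value near 0 and the alternating series gives a value well above 2, matching the interpretation of the 0 to 4 range.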
Weighting
Most software that calculates regression parameters also allows you to weight the data points. You might want to do this for several reasons. Weighting is used to make more reliable or relevant data points more important in model building. It’s also used when each data point represents more than one value. The issue with weighting is that it will change the degrees of freedom, and hence, the results of statistical tests. Usually this is OK, a necessary change to accommodate the realities of the model. However, if you ever come upon a weighted least squares regression model in which the weightings are arbitrary, perhaps done by an analyst who doesn’t understand the consequences, don’t believe the test results.
Is Your Regression Model Telling the Truth?
There are many technologies we use in our lives without really understanding how they work. Television. Computers. Cell phones. Microwave ovens. Cars. Even many things about the human body are not well understood. But I don’t mean how to use these mechanisms. Everyone knows how to use these things. I mean understanding them well enough to fix them when they break. Regression analysis is like that too. Only with regression analysis, sometimes you can’t even tell if there’s something wrong without consulting an expert.
Here are some tips for troubleshooting regression models.
You may know how to use regression analysis, but unless you’re an expert, you may not know about some of the more subtle pitfalls you may encounter. The biggest red flag that something is amiss is the TGTBT, too good to be true. If you encounter an R-squared value above 0.9, especially unexpectedly, there’s probably something wrong. Another red flag is inconsistency. If estimates of the model’s parameters change between data sets, there’s probably something wrong. And if predictions from the model are less accurate or precise than you expected, there’s probably something wrong. Here are some guidelines for troubleshooting a model you developed.
|Your Model||Identification||Correction|
|Not Enough Samples||If you have fewer than 10 observations for each independent variable you want to put in a model, you don’t have enough samples.||Collect more samples. 100 observations per variable is a good target to shoot for although more is usually better.|
|No Intercept||You’ll know it if you do it.||Put in an intercept and see if the model changes.|
|Stepwise Regression||You’ll know it if you do it.||Don’t abdicate model building decisions to software alone. What’s the fun in that?|
|Outliers||Plot the dependent variable against each independent variable. If more than about 5% of the data pairs plot noticeably apart from the rest of the data points, you may have outliers.||Conduct a test on the aberrant data points to determine if they are statistical anomalies. Use diagnostic statistics like leverage to evaluate the effects of suspected outliers. Evaluate the metadata of the samples to determine if they are representative of the population being modeled. If so, retain the outlier as an influential observation (AKA leverage point).|
|Non-linear relationships||Plot the dependent variable against each independent variable. Look for nonlinear patterns in the data.||Find an appropriate transformation of the independent variable.|
|Overfitting||If you have a large number of independent variables, especially if they use a variety of transformations and don’t contribute much to the accuracy and precision of the model, you may have overfit the model.||Keep the model as simple as possible. Make sure the ratio of observations to independent variables is large. Use diagnostic statistics like AIC and BIC to help select an appropriate number of variables.|
|Misspecification||Look for any variants of the dependent variable in the independent variables. Assess whether the model meets the objectives of the effort.||Remove any elements of the dependent variable from the independent variables. Remove at least one component of variables describing mixtures. Ensure the model meets the objectives of the effort with the desired accuracy and precision.|
|Multicollinearity||Calculate correlation coefficients and plot the relationships between all the independent variables in the model. Look for high correlations.||Use diagnostic statistics like VIF to evaluate the effects of suspected multicollinearity. Remove intercorrelated independent variables from the model.|
|Heteroscedasticity||Plot the variance at each level of an ordinal-scale dependent variable or appropriate ranges of a continuous-scale dependent variable. Look for any differences in the variances of more than about five times.||Try to find an appropriate Box-Cox transformation or consider nonparametric regression or data mining methods.|
|Autocorrelation||Plot the data over time, location or the order of sample collection. Calculate a Durbin–Watson statistic for serial correlation.||If the autocorrelation is related to time, develop a correlogram and a partial correlogram. If the autocorrelation is spatial, develop a variogram. If the autocorrelation is related to the order of sample collection, examine metadata to try to identify a cause.|
|Weighting||You’ll know it if you do it.||Compare the weighted model with the corresponding unweighted model to assess the effects of weighting. Consider the validity of weighting; seek expert advice if needed.|
Sometimes the model you are skeptical about isn’t one you developed; it is one developed by another data analyst. The major difference is that with other analysts’ models, you won’t have access to all their diagnostic statistics and plots, let alone their data. If you have been retained to review another analyst’s work, you can always ask for the information you need. If, however, you’re reading about a model in a journal article, book, or website, you’ve probably got all the information you’re ever going to get. You have to be a statistical detective. Here are some clues you might look for.
|Another Analyst’s Model||Identification|
|Not Enough Samples||If the analyst reported the number of samples used, look for at least 10 observations for each independent variable in the model. If not, you may be able to estimate the number from a scatterplot.|
|No Intercept||If the analyst reported the actual model (some don’t), look for a constant term.|
|Stepwise Regression||Unless another approach is reported, assume the analyst used some form of stepwise regression.|
|Outliers||Assuming the analyst did not provide plots of the dependent variable versus the independent variables, look for R-squared values that are much higher or lower than expected.|
|Non-linear relationships||Assuming the analyst did not provide plots of the dependent variable versus the independent variables, look for a lower-than-expected R-squared value from a linear model. If there are non-linear terms in the model, this is probably not an issue.|
|Overfitting||Look for a large number of independent variables in the model, especially if they use different types of transformations.|
|Misspecification||Look for any variants of the dependent variable in the independent variables. Assess whether the model meets the objectives of the effort.|
|Multicollinearity||Assuming relevant plots and diagnostic statistics are not available, there may not be any way to identify multicollinearity.|
|Heteroscedasticity||Assuming relevant plots and diagnostic statistics are not available, there may not be any way to identify heteroscedasticity.|
|Autocorrelation||Assuming relevant plots and diagnostic statistics are not available, there may not be any way to identify serial correlation.|
|Weighting||Compare the reported number of samples to the degrees of freedom. More DF than samples is usually attributable to weighting.|
So there are some ways you can identify and evaluate eleven reasons for doubting a regression model. Remember when evaluating other analysts’ models that not everyone is an expert and that even experts make mistakes. Try to be helpful in your critiques, but at a minimum, be professional.
Read more about using statistics at the Stats with Cats blog.