Five Common Reasons for Doubting a Regression Model
Finding a model that fits a set of data is one of the most common goals in data analysis. Least squares regression is the most commonly used tool for achieving this goal. It’s a relatively simple concept, it’s easy to do, and there’s a lot of readily available software to do the calculations. It’s even taught in many Statistics 101 courses. Everybody uses it … and therein lies the problem. Even if there is no intention to mislead anyone, it does happen.
Here are five of the most common reasons to doubt a regression model.
Not Enough Samples
Accuracy is a critical component for evaluating a model. The coefficient of determination, also known as R-squared or R², is the most often cited measure of accuracy. Now obviously, the more accurate a model is the better, so data analysts look for large values of R-squared.
R-squared is designed to estimate the maximum relationship between the dependent and independent variables based on a set of samples (cases, observations, records, or whatever). If there aren’t enough samples compared to the number of independent variables in the model, the estimate of R-squared will be especially unstable. The effect is greatest when the R-squared value is small, the number of samples is small, and the number of independent variables is large, as shown in this figure.
The inflation in the value of R-squared can be assessed by calculating the shrunken R-squared. The figure shows that for an R-squared value above 0.8 with 30 cases per variable, there isn’t much shrinkage. Lower estimates of R-squared, however, experience considerable shrinkage.
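To get a feel for the shrinkage, here is a minimal sketch in Python. It assumes the common adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - p - 1), which may not be the exact shrinkage formula behind the figure. The function name and the example numbers are made up for illustration.

```python
def shrunken_r_squared(r_squared, n_samples, n_predictors):
    """Adjusted (shrunken) R-squared for a model that includes an intercept.

    Assumes the common adjustment 1 - (1 - R^2)(n - 1)/(n - p - 1).
    """
    if n_samples - n_predictors - 1 <= 0:
        raise ValueError("need more samples than predictors plus one")
    unexplained = 1.0 - r_squared
    return 1.0 - unexplained * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical cases with 5 predictors: 30 cases per variable, then 3.
plenty = shrunken_r_squared(0.80, n_samples=150, n_predictors=5)
scarce = shrunken_r_squared(0.30, n_samples=15, n_predictors=5)
print(f"R-squared 0.80, n=150: shrunken = {plenty:.3f}")
print(f"R-squared 0.30, n=15:  shrunken = {scarce:.3f}")
```

With plenty of samples, the high R-squared barely moves (0.80 becomes about 0.79). With few samples, the low R-squared shrinks below zero, which is a fair hint that the model explains nothing you can count on.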
You can’t control the magnitude of the relationship between a dependent variable and a set of independent variables, and often, you won’t have total control over the number of samples and variables either. So, you have to be aware that R-squared will be overestimated and treat your regression models with some skepticism.
No Intercept
Almost all software that performs regression analysis provides an option to not include an intercept term in the model. This sounds convenient, especially when you presume a one-to-one relationship between the dependent and independent variables. But when an intercept is excluded from the model, it’s not omitted from the analysis; it is set to zero. Look at any regression model with “no intercept” and you’ll see that the regression line goes through the origin of the axes.
With the regression line nailed down on one end at the origin, you might expect that the value of R-squared would be diminished because the line wouldn’t necessarily travel through the data in a way that minimizes the differences between the data points and the regression line, called the errors or residuals. Instead, R-squared is artificially inflated. When the correction provided by the intercept is removed, most software measures the total variation around zero instead of around the mean of the dependent variable, so the total variation in the model increases. But the ratio of the variability attributable to the model compared to the total variability also increases, hence the increase in R-squared.
The solution is simple. Always have an intercept term in the model unless there is a compelling theoretical reason not to include it. In that case, don’t put all your trust in R-squared (or the F-tests).
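You can see the inflation for yourself with a small sketch in Python using NumPy. The data are synthetic and the variable names are invented for this example. It assumes the convention many packages follow for no-intercept models, which is to compute R-squared against the sum of squares around zero instead of around the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)
y = 5.0 + 0.5 * x + rng.normal(0, 1, size=n)  # true intercept is 5, not 0

# Model with an intercept: design matrix columns are [1, x].
A_with = np.column_stack([np.ones(n), x])
coef_with, *_ = np.linalg.lstsq(A_with, y, rcond=None)
sse_with = np.sum((y - A_with @ coef_with) ** 2)
# Usual R-squared: total variation is measured around the mean of y.
r2_with = 1.0 - sse_with / np.sum((y - y.mean()) ** 2)

# Model forced through the origin: the only column is x.
A_origin = x.reshape(-1, 1)
coef_origin, *_ = np.linalg.lstsq(A_origin, y, rcond=None)
sse_origin = np.sum((y - A_origin @ coef_origin) ** 2)
# No-intercept R-squared: total variation is measured around zero.
r2_origin = 1.0 - sse_origin / np.sum(y ** 2)

print(f"with intercept: SSE = {sse_with:.1f}, R-squared = {r2_with:.2f}")
print(f"no intercept:   SSE = {sse_origin:.1f}, R-squared = {r2_origin:.2f}")
```

The no-intercept line misses the data by more (a larger sum of squared errors), yet it reports the larger R-squared. The two R-squared values aren’t measured against the same yardstick, so they can’t be compared.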
Stepwise Regression
Stepwise regression is a data analyst’s dream. Throw all the variables into a hopper, grab a cup of coffee, and the silicon chips will tell you which variables yield the best model. That irritates hard-core statisticians who don’t like amateurs messing around with their numbers. You can bet, though, that at least some of them go home at night, throw all the food in their cupboard into a crock pot, and expect to get a meal out of it.
The cause of some statisticians’ consternation is that stepwise regression will select the variables that are best for the dataset, but not necessarily the population. Model test probabilities are optimistic because they don’t account for the stepwise procedure’s ability to capitalize on chance. Moreover, adding new variables will never decrease R-squared, and almost always increases it, so you have to have some good ways to decide how many variables are too many. There are ways to do this, so using stepwise regression alone isn’t a fatal flaw. Like with guns, drugs, and fast food, you have to be careful how you use it.
If you use stepwise regression, be sure to look at the diagnostic statistics for the model. Also, verify your results using a different data set by splitting the data set before you do any analysis, by randomly extracting observations from the original data set to create new data sets, or by collecting new samples.
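Here is a rough sketch in Python of why the split-sample check matters. It is not a full stepwise procedure: it uses a simplified forward selection that adds whichever variable raises R-squared the most, and every variable is pure random noise. The function names, sample sizes, and number of variables kept are all assumptions made for the illustration.

```python
import numpy as np


def fit_ols(X, y):
    """Least squares fit with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def r_squared(X, y, coef):
    """R-squared of the fitted coefficients on the data (X, y)."""
    A = np.column_stack([np.ones(len(y)), X])
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def forward_select(X, y, n_keep):
    """Greedily add the column that gives the largest R-squared."""
    chosen = []
    for _ in range(n_keep):
        candidates = [j for j in range(X.shape[1]) if j not in chosen]
        scores = {}
        for j in candidates:
            cols = chosen + [j]
            scores[j] = r_squared(X[:, cols], y, fit_ols(X[:, cols], y))
        chosen.append(max(scores, key=scores.get))
    return chosen


rng = np.random.default_rng(42)
X = rng.normal(size=(80, 30))  # 30 candidate variables, all noise
y = rng.normal(size=80)        # unrelated to every column of X

# Split before doing any analysis.
X_train, y_train = X[:40], y[:40]
X_test, y_test = X[40:], y[40:]

chosen = forward_select(X_train, y_train, n_keep=5)
coef = fit_ols(X_train[:, chosen], y_train)
r2_train = r_squared(X_train[:, chosen], y_train, coef)
r2_test = r_squared(X_test[:, chosen], y_test, coef)
print(f"selected columns: {chosen}")
print(f"R-squared on the data used for selection: {r2_train:.2f}")
print(f"R-squared on the held-out data:           {r2_test:.2f}")
```

There is nothing to find in these data, yet the selection step still turns up a model that looks like it explains something. The held-out half is where that illusion falls apart.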
Outliers
Outliers are a special irritant for data analysts. They’re not really that tough to identify but they cause a variety of problems that data analysts have to deal with. The first problem is convincing reviewers not familiar with the data that the outliers are in fact outliers. Second, the data analysts have to convince all reviewers that what they want to do with them, delete or include or whatever, is the appropriate thing to do. One way or another, though, outliers will wreak havoc with R-squared.
Consider this figure, which comes from an analysis of slug tests to estimate the hydraulic conductivity of an aquifer. The red circles show the relationship between rising-head and falling-head slug tests performed on groundwater monitoring wells. The model for this relationship has an R-squared of 0.90. The blue diamond is an outlier along the trend (same regression equation) about 60% greater than the next highest value. The R-squared of this equation is 0.95. The green square is an outlier perpendicular to the trend. The R-squared of this equation is 0.42. Those are fairly sizable differences to have been caused by a single data point.
How should you deal with outliers? I usually delete them because I’m usually looking to model trends and other patterns. But outliers are great thought provokers. Sometimes they tell you things the patterns don’t. If you’re not comfortable deciding what to do with an outlier, run the analysis both with and without outliers, a time-consuming and expensive approach. The other approach would be to get the reviewer, an interested stakeholder, or an independent expert involved in the decision. That approach is time-consuming and expensive too. Pick your poison.
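Running the analysis with and without an outlier can be as simple as this Python sketch. The data are synthetic, not the slug test results, and the two outlier positions are invented to mimic one point along the trend and one perpendicular to it.

```python
import numpy as np


def r_squared(x, y):
    """R-squared of a straight-line fit, which equals the squared correlation."""
    return np.corrcoef(x, y)[0, 1] ** 2


rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=20)
y = x + rng.normal(0, 1.5, size=20)  # one-to-one trend plus noise

r2_base = r_squared(x, y)
# One extra point on the trend, well beyond the largest x value.
r2_along = r_squared(np.append(x, 16.0), np.append(y, 16.0))
# One extra point far off the trend (small x, large y).
r2_across = r_squared(np.append(x, 2.0), np.append(y, 14.0))

print(f"no outlier:            R-squared = {r2_base:.2f}")
print(f"outlier along trend:   R-squared = {r2_along:.2f}")
print(f"outlier across trend:  R-squared = {r2_across:.2f}")
```

A single point moves R-squared in either direction depending on where it sits. Reporting the fit both ways lets reviewers see how much of the result hangs on that one observation.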
Nonlinear Relationships
Linear regression assumes that the relationship between a dependent variable and a set of independent variables is additive, or linear. If the relationship is actually nonlinear, the R-squared for the linear model will be lower than it would be for a better fitting nonlinear model.
This figure shows the relationship between the number of employed individuals and the number of individuals not in the U.S. work force between 1980 and 2009. The linear model has a respectable R-squared value of 0.84, but the polynomial model fits the data much better with an R-squared value of 0.95.
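A quick way to check for this is to fit both forms and compare, as in this Python sketch. The data are synthetic with a built-in curve, not the employment data from the figure, and the helper function name is made up for the example.

```python
import numpy as np


def poly_r_squared(x, y, degree):
    """R-squared of a least squares polynomial fit of the given degree."""
    coefs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coefs, x)
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)


rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.3 * x ** 2 + rng.normal(0, 1.5, size=x.size)  # curved trend

r2_linear = poly_r_squared(x, y, degree=1)
r2_quadratic = poly_r_squared(x, y, degree=2)
print(f"linear fit:    R-squared = {r2_linear:.2f}")
print(f"quadratic fit: R-squared = {r2_quadratic:.2f}")
```

Keep in mind that a higher-degree polynomial can never fit the same data worse than a lower-degree one, so a bigger R-squared alone doesn’t prove the curve is real. Plot the residuals from the linear fit; a bowed pattern is the better evidence.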
So there are five of the most common reasons for doubting a regression model. If you’re grasping for flaws in a regression model, these are the best places to start looking. They occur commonly and are simple to identify. There are plenty more reasons to question a regression model, such as multicollinearity, weighting, overfitting, and misspecification, but those are topics for another time.