## Five Common Reasons for Doubting a Regression Model

Finding a model that fits a set of data is one of the most common goals in data analysis. Least squares regression is the most commonly used tool for achieving this goal. It’s a relatively simple concept, it’s easy to do, and there’s a lot of readily available software to do the calculations. It’s even taught in many Statistics 101 courses. Everybody uses it … and therein lies the problem. Even when nobody intends to mislead anyone, misleading models do happen.

Here are five of the most common reasons to doubt a regression model.

## Not Enough Samples

Accuracy is a critical component for evaluating a model. The coefficient of determination, also known as R-squared or R², is the most often cited measure of accuracy. Now obviously, the more accurate a model is the better, so data analysts look for large values of R-squared.

R-squared is designed to estimate the maximum relationship between the dependent and independent variables based on a set of samples (cases, observations, records, or whatever). If there aren’t enough samples compared to the number of independent variables in the model, the estimate of R-squared will be especially unstable. The effect is greatest when the R-squared value is small, the number of samples is small, and the number of independent variables is large, as shown in this figure.

The inflation in the value of R-squared can be assessed by calculating the *shrunken R-squared*. The figure shows that for an R-squared value above 0.8 with 30 cases per variable, there isn’t much shrinkage. Lower estimates of R-squared, however, experience considerable shrinkage.
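To get a feel for the shrinkage, here is a minimal sketch that assumes the standard adjusted R-squared formula, 1 − (1 − R²)(n − 1)/(n − p − 1). The figure may be based on a different shrinkage formula, and the function name `shrunken_r_squared` and the choice of five predictors are only illustrative.

```python
def shrunken_r_squared(r2: float, n_samples: int, n_predictors: int) -> float:
    """Adjusted (shrunken) R-squared: 1 - (1 - R^2)(n - 1)/(n - p - 1)."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more samples than predictors plus one")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


# Illustrative grid: a model with 5 predictors (an assumption, not from the figure)
n_predictors = 5
for r2 in (0.2, 0.5, 0.8):
    for cases_per_variable in (3, 10, 30):
        n = cases_per_variable * n_predictors
        adj = shrunken_r_squared(r2, n, n_predictors)
        print(f"R2={r2:.1f}  cases/variable={cases_per_variable:>2}  shrunken R2={adj:.3f}")
```

Small R-squared values with few cases per variable can shrink all the way to zero or below, which is the pattern described above.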

You can’t control the magnitude of the relationship between a dependent variable and a set of independent variables, and often, you won’t have total control over the number of samples and variables either. So, you have to be aware that R-squared will be overestimated and treat your regression models with some skepticism.

## No Intercept

Almost all software that performs regression analysis provides an option to not include an intercept term in the model. This sounds convenient, especially for relationships that presume a one-to-one relationship between the dependent and independent variables. But when an intercept is excluded from the model, it’s not omitted from the analysis; it is set to zero. Look at any regression model with “no intercept” and you’ll see that the regression line goes through the origin of the axes.

With the regression line nailed down on one end at the origin, you might expect that the value of R-squared would be diminished because the line wouldn’t necessarily travel through the data in a way that minimizes the differences between the data points and the regression line, called the *errors* or *residuals*. Instead, R-squared is artificially inflated. When the correction for the mean provided by the intercept is removed, total variation is measured from zero rather than from the mean of the dependent variable, so it increases. The variability attributed to the model typically increases even more in proportion, so the ratio of model variability to total variability goes up, and R-squared goes up with it.
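The effect can be reproduced with a small simulation. This is a sketch on made-up data (the true intercept of 100 is an assumption chosen to make the point obvious), using the uncentered R-squared that many packages report when the intercept is dropped.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(10, 20, size=50)
y = 100 + 0.5 * x + rng.normal(0, 2, size=50)  # synthetic data, intercept far from zero

# Model with an intercept: R-squared uses variation around the mean of y
X_with = np.column_stack([np.ones_like(x), x])
beta_with, *_ = np.linalg.lstsq(X_with, y, rcond=None)
resid_with = y - X_with @ beta_with
r2_with = 1 - resid_with @ resid_with / np.sum((y - y.mean()) ** 2)

# Model forced through the origin
X_zero = x[:, None]
beta_zero, *_ = np.linalg.lstsq(X_zero, y, rcond=None)
resid_zero = y - X_zero @ beta_zero

# Uncentered R-squared: total variation measured from zero
r2_uncentered = 1 - resid_zero @ resid_zero / np.sum(y ** 2)
# Same residuals, but judged against variation around the mean
r2_centered = 1 - resid_zero @ resid_zero / np.sum((y - y.mean()) ** 2)

print(f"with intercept:             {r2_with:.3f}")
print(f"no intercept (uncentered):  {r2_uncentered:.3f}")
print(f"no intercept (centered):    {r2_centered:.3f}")
```

The through-the-origin line fits these data worse, yet its uncentered R-squared is higher because the denominator is so much larger.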

The solution is simple. Always have an intercept term in the model unless there is a compelling theoretical reason not to include it. In that case, don’t put all your trust in R-squared (or the F-tests).

## Stepwise Regression

Stepwise regression is a data analyst’s dream. Throw all the variables into a hopper, grab a cup of coffee, and the silicon chips will tell you which variables yield the best model. That irritates hard-core statisticians who don’t like amateurs messing around with their numbers. You can bet, though, that at least some of them go home at night, throw all the food in their cupboard into a crock pot, and expect to get a meal out of it.

The cause of some statisticians’ consternation is that stepwise regression will select the variables that are best for the dataset, but not necessarily for the population. Model test probabilities are optimistic because they don’t account for the stepwise procedure’s ability to capitalize on chance. Moreover, adding new variables will never decrease R-squared and almost always increases it, so you need some good ways to decide how many variables are too many. There are ways to do this, so using stepwise regression alone isn’t a fatal flaw. As with guns, drugs, and fast food, you have to be careful how you use it.

If you use stepwise regression, be sure to look at the diagnostic statistics for the model. Also, verify your results using a different data set by splitting the data set before you do any analysis, by randomly extracting observations from the original data set to create new data sets, or by collecting new samples.
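As a rough illustration of why the split matters, the sketch below runs a bare-bones forward selection on pure noise and then checks the chosen model on held-out rows. The helper names (`forward_select`, `design`, and so on) are hypothetical, and real stepwise procedures use entry and removal tests rather than this simple "pick the biggest R-squared" rule.

```python
import numpy as np

def design(X, cols):
    """Design matrix with an intercept plus the selected columns."""
    return np.column_stack([np.ones(len(X))] + [X[:, c] for c in cols])

def fit(D, y):
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    return beta

def r_squared(D, y, beta):
    resid = y - D @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

def forward_select(X, y, n_keep):
    """Greedy forward selection: add whichever column raises R-squared most."""
    chosen = []
    for _ in range(n_keep):
        remaining = [c for c in range(X.shape[1]) if c not in chosen]
        scores = {}
        for c in remaining:
            D = design(X, chosen + [c])
            scores[c] = r_squared(D, y, fit(D, y))
        chosen.append(max(scores, key=scores.get))
    return chosen

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 30))   # 30 candidate predictors, all noise
y = rng.normal(size=80)         # response unrelated to any predictor

train, test = slice(0, 40), slice(40, 80)  # split BEFORE any analysis
cols = forward_select(X[train], y[train], n_keep=5)

D_train = design(X[train], cols)
beta = fit(D_train, y[train])
r2_train = r_squared(D_train, y[train], beta)
r2_test = r_squared(design(X[test], cols), y[test], beta)

print(f"selected columns: {cols}")
print(f"R-squared on the data used for selection: {r2_train:.3f}")
print(f"R-squared on held-out data:               {r2_test:.3f}")
```

Because the response is unrelated to every predictor, any R-squared on the selection data comes from capitalizing on chance, and the held-out rows show it.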

## Outliers

Outliers are a special irritant for data analysts. They’re not really that tough to identify but they cause a variety of problems that data analysts have to deal with. The first problem is convincing reviewers not familiar with the data that the outliers are in fact outliers. Second, the data analysts have to convince all reviewers that what they want to do with them, delete or include or whatever, is the appropriate thing to do. One way or another, though, outliers will wreak havoc with R-squared.

Consider this figure, which comes from an analysis of slug tests to estimate the hydraulic conductivity of an aquifer. The red circles show the relationship between rising-head and falling-head slug tests performed on groundwater monitoring wells. The model for this relationship has an R-squared of 0.90. The blue diamond is an outlier along the trend (same regression equation), about 60% greater than the next highest value. The R-squared of this equation is 0.95. The green square is an outlier perpendicular to the trend. The R-squared of this equation is 0.42. Those are fairly sizable differences to have been caused by a single data point.
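The same behavior shows up in a toy example. The sketch below uses synthetic data, not the slug test results, and the two outlier coordinates are arbitrary choices for illustration.

```python
import numpy as np

def simple_r_squared(x, y):
    """R-squared of a one-predictor regression with intercept (squared correlation)."""
    return np.corrcoef(x, y)[0, 1] ** 2

rng = np.random.default_rng(2)
x = np.linspace(1, 10, 20)
y = x + rng.normal(0, 1.0, size=x.size)  # synthetic points scattered around y = x

r2_base = simple_r_squared(x, y)

# One extra point far out along the trend line
r2_along = simple_r_squared(np.append(x, 16.0), np.append(y, 16.0))

# One extra point far off to the side of the trend
r2_across = simple_r_squared(np.append(x, 3.0), np.append(y, 14.0))

print(f"no outlier:               {r2_base:.2f}")
print(f"outlier along the trend:  {r2_along:.2f}")
print(f"outlier across the trend: {r2_across:.2f}")
```

A single point along the trend stretches the range of the data and pushes R-squared up, while a single point off the trend drags it down.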

How should you deal with outliers? I usually delete them because I’m usually looking to model trends and other patterns. But outliers are great thought provokers. Sometimes they tell you things the patterns don’t. If you’re not comfortable deciding what to do with an outlier, run the analysis both with and without outliers, a time-consuming and expensive approach. The other approach would be to get the reviewer, an interested stakeholder, or an independent expert involved in the decision. That approach is time-consuming and expensive too. Pick your poison.

## Non-linear Relationships

Linear regression assumes that the relationship between a dependent variable and a set of independent variables is additive, or linear. If the relationship is actually nonlinear, the R-squared for the linear model will be lower than it would be for a better fitting nonlinear model.

This figure shows the relationship between the number of employed individuals and the number of individuals not in the U.S. work force between 1980 and 2009. The linear model has a respectable R-squared value of 0.84, but the polynomial model fits the data much better with an R-squared value of 0.95.
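A comparison like this is easy to run yourself. Here is a minimal sketch using NumPy's `polyfit` on synthetic curved data (not the employment figures), with a hypothetical helper `poly_r_squared`.

```python
import numpy as np

def poly_r_squared(x, y, degree):
    """Fit a polynomial of the given degree and return its in-sample R-squared."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 30)
y = 2 + 0.3 * x**2 + rng.normal(0, 2, size=x.size)  # synthetic curved relationship

r2_linear = poly_r_squared(x, y, 1)
r2_quadratic = poly_r_squared(x, y, 2)

print(f"linear fit:    R-squared = {r2_linear:.3f}")
print(f"quadratic fit: R-squared = {r2_quadratic:.3f}")
```

Keep in mind that a higher-degree polynomial never fits the same data worse, so a bigger in-sample R-squared alone doesn't prove the curve is real. A plot of the residuals is usually more telling.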

Non-linear relationships are a relatively simple problem to fix, or at least acknowledge, once you know what to look for. Graph your data and go from there.

## No Doubts

So there are five of the most common reasons for doubting a regression model. If you’re grasping for flaws in a regression model, these are the best places to start looking. They occur commonly and are simple to identify. There are plenty more reasons to question a regression model, such as multicollinearity, weighting, overfitting, and misspecification, but those are topics for another time.

Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order **Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis** at Wheatmark, amazon.com, barnesandnoble.com, or other online booksellers.

Thanks for your great post. As usual it is very educative. I have a question about the last problem, the non-linear relationships. Basically I’d like to know how you select a specific non-linear model out of possibly infinite choices? You suggested to “Graph your data and go from there”. But how? I feel this is similar to the problem step-wise regression has: In both cases, the same dataset is used both to select the model and to find the parameters. How do I know I am not over-fitting?

Start with intrinsically linear transformations (i.e., in which the coefficients are constants and the nonlinear variables are additive). Look at the graph in this blog about transformations (http://statswithcats.wordpress.com/2010/11/21/fifty-ways-to-fix-your-data/). Compare your data pattern to the general forms in the plot. Pick one and give it a try. Pick another and give it a try. While you are doing this, you will be learning more about your data and the statistical procedures you are using. Remember, the journey you make in creating a model is often more important than the model itself. In the end, simple is best. Use as few variables as needed to give you the accuracy and precision you want. If two variables give you about the same result, go with the simpler one.
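As one example of an intrinsically linear transformation, an exponential pattern y = a·exp(b·x) becomes a straight line after taking logs: ln(y) = ln(a) + b·x. The sketch below assumes synthetic data and made-up values for `a` and `b`; other general forms (power, reciprocal, and so on) follow the same fit-then-back-transform pattern.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(1, 10, 40)
# Synthetic exponential data with multiplicative noise (a = 3.0, b = 0.4 are assumptions)
y = 3.0 * np.exp(0.4 * x) * np.exp(rng.normal(0, 0.1, size=x.size))

# Ordinary straight-line fit on the transformed scale: ln(y) = ln(a) + b * x
slope, intercept = np.polyfit(x, np.log(y), 1)
a_hat, b_hat = np.exp(intercept), slope

print(f"estimated a = {a_hat:.2f}, estimated b = {b_hat:.3f}")
```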

IMO, the concern about overfitting is somewhat overblown. If you know what overfitting is, you’re not likely to become a victim. It’s not something that happens in a keystroke. It takes a lot of work fine-tuning variables and whatnot. It’s easy to see in other people’s models where there is a conglomeration of mathematical functions and variable combinations. And it’s easy to verify if you have data not involved in the original analysis. On the other hand, if you don’t understand what overfitting is, which many novice data analysts do not, you could be in trouble.

I am examining a study on real estate where the data is extremely heterogeneous with many probable attribute differences. The model used to measure an impact has very few predictor variables and no tests for interaction of the impact variable as a function of size (acres).

The study failed to reject the null hypothesis on the impact variable. I added an interaction variable (acres times impact), converted the dependent variable price to LN(Price), and found highly significant results that slowly disappear as acreage increases.

I’m looking for some reference where I can formally discuss this problem, where the dataset has a tremendous amount of variation and the model is underspecified.

Thanks