The process of developing a statistical model (http://statswithcats.wordpress.com/2010/12/04/many-paths-lead-to-models/) involves finding the mathematical equation of a line, curve, or other pattern that faithfully represents the data with the least amount of error (i.e., variability). Variability and pattern are the yin and yang of models. They are opposites yet they are intertwined. Improve the fit of the model’s pattern to the pattern of the data, and you’ll reduce the variability in the model and vice versa. It’s wizardry.
Follow the Modeling Code
Say you have a conceptual model (http://statswithcats.wordpress.com/2010/12/12/the-seeds-of-a-model/) with a dependent variable (y) and one or more independent variables (x1 through xn) in the fear-provoking-yet-oh-so-convenient mathematical shorthand:
y = a0 + a1x1 + a2x2 + a3x3 + … + anxn + e
Estimating values for the model’s parameters (i.e., a0 through an) and the model’s uncertainty (i.e., the e) so that the model is the best fit for the data with the least imprecision is a process called calibrating or fitting a model. Every statistical method has criteria that the procedure uses to calculate the parameters of the best model given the variables, data, and statistical options you specify. Your job is to specify those variables, data, and statistical options.
This is how it works:
- You collect data that represent the y and the xs for each of the samples.
- You make sure the data are correct and appropriate for the phenomenon and put the values in a dataset.
- Using the software for the statistical procedure you selected, you specify the dependent variable, the independent variables, and any statistical option you want to use. Every statistical procedure has a variety of options that can be specified. If you’re doing a factor analysis, for instance, you can try different extraction techniques, different communalities, different numbers of factors, and so on. If you’re a statistician, you know what I mean. If you’re not a statistician, don’t worry about this.
- Magic happens. This is what you learn about if you major in statistics.
- You evaluate the output from the software and, if all is well, you record the parameters and the error, and you have a calibrated statistical model. If the model fit isn’t what you would like, which is what usually happens, you make changes and try again.
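If you want a peek at the magic in the steps above, here is a minimal sketch of an ordinary least-squares fit in plain Python. It assumes least squares is the fitting criterion (other procedures use other criteria), the data values are made up for illustration, and the function names (solve_linear_system, fit_linear_model, predict) are hypothetical helpers, not part of any statistical package.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution from the last row up.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear_model(xs: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Return [a0, a1, ..., an] that minimize the sum of squared residuals."""
    # Design matrix: a leading 1 for the intercept a0, then x1 through xn.
    design = [[1.0] + list(row) for row in xs]
    k = len(design[0])
    # Normal equations: (X'X) a = X'y.
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs: Sequence[float], row: Sequence[float]) -> float:
    """Evaluate a0 + a1*x1 + ... + an*xn for one sample."""
    return coefs[0] + sum(a * x for a, x in zip(coefs[1:], row))


# Made-up data: roughly y = 1 + 2*x1 + 3*x2 plus a little noise.
xs = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
y = [9.1, 7.8, 19.2, 17.9, 26.0]

coefs = fit_linear_model(xs, y)
residuals = [yi - predict(coefs, row) for row, yi in zip(xs, y)]
print("coefficients a0, a1, a2:", [round(c, 3) for c in coefs])
print("residuals (the e):", [round(e, 3) for e in residuals])
```

Real statistical software does the same job with more numerically careful methods and far more output, so treat this as a way to see the idea rather than something to use on real data.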
What changes could you make? Here are a few hints. If you are well acquainted with statistics, you can try making adjustments to the variables and the statistical options, and perhaps even the data, to see how the different combinations affect the model. For example, you can try including or excluding influential observations, filling in missing data, changing the variables in the model, or breaking down the analysis by some grouping factor (http://statswithcats.wordpress.com/2010/11/21/fifty-ways-to-fix-your-data/). If you are well acquainted with the data but not statistics, you might rely more on your intuition than your computations. Look for differences between the different models as well as between the results and your own expectations based on established theory.
Models and Variables and Samples, Oh My
If you specify only one way that you want to combine the variables, data, and statistical options, the statistical method will give you the best model. However, if you specify more than one combination of independent variables, you have to have some criteria for selecting which of the models to use as your final model and then decide how good the model is. The three most commonly used criteria are the coefficient of determination, the standard error of estimate, and the F-test.
- Coefficient of Determination—also called R2 or R-square, is the square of the multiple correlation between the dependent variable and the independent variables (equivalently, the squared correlation between the observed and predicted values of the dependent variable). R-square ranges from 0 to 1. It is thought of as the proportion of the variation in the dependent variable that is accounted for by the independent variables. It is a measure of how well the pattern of the model fits the pattern of the data, and hence, is a measure of accuracy. Some statisticians believe that R-square is overused and flawed because it never decreases, and usually increases, as terms are added to a model, whether or not those terms belong there. Whine. Whine. Whine.
- Standard Error of Estimate—also called sxy or SEE, is the standard deviation of the residuals. The residuals are the differences (i.e., errors) between the observed values of the dependent variable and the values calculated by the model. The SEE takes into account the number of samples (more is better) and the number of variables (fewer is better) in the model, and is in the same units as the dependent variable. It is a measure of how much scatter there is between the model and the data, and hence, is a measure of precision. For a set of models you are considering, the largest coefficient of determination usually will correspond to the smallest standard error of estimate. Consequently, many people look only at the coefficient of determination because it is easier to understand that statistic given its bounded scale. It’s essential to look at the standard error of estimate as well, though, because it will allow you to evaluate the uncertainty in the model’s predictions. In other words, R-square might tell you which of several models may be best while SEE will tell you if that best is good enough for what you need to do with the model.
- F-test and probability—A test of whether the R-square value is different from zero. The F-value will vary with the numbers of samples and terms in the model. The probability is customarily required to be less than 0.05. Many statisticians start by looking at the results of the F-test, using the probability as a threshold, and then look at the R-square and SEE.
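The three criteria above can all be computed from the same two sums of squares. The sketch below assumes a model with an intercept fit by least squares, uses made-up data with a single independent variable, and uses hypothetical function names (fit_line, fit_statistics). The probability for the F-test needs the F distribution, which is not in Python's standard library, so the code stops at the F-value.

```python
import math
from typing import Dict, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares intercept and slope for y = a0 + a1*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_statistics(
    y: Sequence[float], predicted: Sequence[float], n_predictors: int
) -> Dict[str, float]:
    """R-square, standard error of estimate, and F-value for a fitted model."""
    n = len(y)
    mean_y = sum(y) / n
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    df_resid = n - n_predictors - 1  # samples minus terms, intercept included
    return {
        "r_square": 1.0 - ss_resid / ss_total,
        "see": math.sqrt(ss_resid / df_resid),  # same units as y
        "f_value": ((ss_total - ss_resid) / n_predictors) / (ss_resid / df_resid),
        "df_model": float(n_predictors),
        "df_resid": float(df_resid),
    }


# Made-up data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

a0, a1 = fit_line(x, y)
predicted = [a0 + a1 * xi for xi in x]
stats = fit_statistics(y, predicted, n_predictors=1)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")

# If SciPy is installed, the probability for the F-test is
# scipy.stats.f.sf(f_value, df_model, df_resid).
```

Notice that the SEE divides by the number of samples minus the number of terms, which is how it rewards more samples and fewer variables, while R-square pays no attention to either.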
Evaluating models doesn’t end with R-square, SEE, and F-test. There are many other diagnostic tools for evaluating the overall quality of statistical models, including:
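- AIC and BIC—Akaike’s Information Criterion and the Bayesian Information Criterion are statistics for comparing alternative models fit to the same data. Both reward goodness of fit and penalize the number of terms, with BIC penalizing extra terms more heavily for all but the smallest datasets. For any collection of models, the one with the lowest value of AIC or BIC is the preferred model, although the two criteria won’t always pick the same one.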
- Mallows’ Cp Criterion—A relative measure of inaccuracy in the model given the number of terms. Cp should be small and close to the number of terms in the model (counting the intercept). Values of Cp much larger than the number of terms may indicate that the model is biased because it is missing important terms.
- Plot of Observed vs. Predicted—On a graph with observed values on the y-axis and predicted values on the x-axis, data points should plot close to a straight 45-degree line passing through the origin of the axes. Systematic deviations from the line indicate a lack-of-fit of the model to the data. Individual data points that deviate substantially from the line may be considered outliers.
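- Plot of Residuals vs. Predicted—On a graph with residuals (observed values minus predicted values) on the y-axis and predicted values on the x-axis, data points should scatter randomly around the horizontal line at zero, with no trend, curve, or funnel shape. Plot the residuals against the predicted values rather than the observed values, because residuals are always correlated with the observed values even when the model is fine.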
- Histogram of Residuals—If the frequency distribution of the model’s residuals does not approximate a Normal distribution, the probabilities calculated for the F-test may be in error.
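To make the AIC and BIC comparison above concrete, here is a minimal sketch that compares two candidate models fit to the same made-up data: a mean-only model and a straight line. It assumes least-squares fits with Normally distributed errors and uses the common simplified forms AIC = n·ln(SSE/n) + 2k and BIC = n·ln(SSE/n) + k·ln(n), where k is the number of fitted coefficients. Software packages may add constants or count k differently, so only differences between models fit to the same data mean anything. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares intercept and slope for y = a0 + a1*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sum_squared_residuals(y: Sequence[float], predicted: Sequence[float]) -> float:
    return sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))


def aic_bic(ss_resid: float, n_samples: int, n_params: int) -> Tuple[float, float]:
    """Simplified least-squares AIC and BIC (additive constants dropped)."""
    fit_term = n_samples * math.log(ss_resid / n_samples)
    aic = fit_term + 2 * n_params
    bic = fit_term + n_params * math.log(n_samples)
    return aic, bic


# Made-up data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
n = len(y)

# Candidate 1: intercept only (predict the mean of y every time).
mean_y = sum(y) / n
aic_mean, bic_mean = aic_bic(sum_squared_residuals(y, [mean_y] * n), n, n_params=1)

# Candidate 2: straight line.
a0, a1 = fit_line(x, y)
line_pred = [a0 + a1 * xi for xi in x]
aic_line, bic_line = aic_bic(sum_squared_residuals(y, line_pred), n, n_params=2)

print(f"mean-only: AIC={aic_mean:.2f}  BIC={bic_mean:.2f}")
print(f"line:      AIC={aic_line:.2f}  BIC={bic_line:.2f}")
print("preferred by AIC:", "line" if aic_line < aic_mean else "mean-only")
```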
Usually, all of these statistics should be considered when building a model. Once a small number of alternative models has been selected, the next step is to evaluate the components of each model, the variables, using statistical diagnostics including:
- Regression Coefficients—If you use statistical software, you’ll see two types of regression coefficients. The unstandardized regression coefficients are the a0 through an terms in the model. They are also referred to as B or b. These are the values you use if you want to calculate a prediction of the y variable from the values of the x variables. The standardized regression coefficients are the coefficients you would get if all of the variables were first converted to z-scores. Each one equals the unstandardized coefficient multiplied by the standard deviation of its independent variable and divided by the standard deviation of the dependent variable. Standardized regression coefficients, also called Beta coefficients, are used to compare the relative importance of the independent variables. If you forget which is which, remember that there is no standardized coefficient for the constant intercept term in the model. The column with a number for the model intercept contains the unstandardized coefficients you use for calculating predictions.
- t-tests and probabilities—Tests of whether the regression coefficients are different from zero. The t-value for a coefficient is the unstandardized coefficient divided by its standard error. The t-values may change substantially depending on what other terms are in the model. The probabilities for the tests are commonly used to decide whether to include or discard independent variables.
- Variance Inflation Factor—VIFs are measures of how much the variances of the model’s coefficient estimates are inflated by correlations between the independent variables. The VIF for a variable should ideally be near 1. A VIF greater than 10 suggests that multicollinearity may be a concern. The reciprocal of the VIF is called the tolerance.
- Partial Regression Leverage Plots—Leverage plots are graphs of the dependent variable (y-axis) versus an independent variable (x-axis) after the effects of the other independent variables in the model have been removed from both. The slope of a line fit to the leverage plot is the regression coefficient for that independent variable. These plots are useful for identifying outliers and other concerns in the relationship between the independent variable and the dependent variable.
These statistics are calculated for each independent variable in a model.
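For a model with exactly two independent variables, the coefficient and VIF diagnostics above reduce to formulas built from three correlations, which makes them easy to check by hand. The sketch below relies on that two-variable special case (it does not generalize to more variables without matrix algebra), uses made-up data, and uses hypothetical function names.

```python
import math
from statistics import mean, stdev
from typing import Dict, Sequence


def pearson_r(u: Sequence[float], v: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    mu, mv = mean(u), mean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den


def two_predictor_summary(
    x1: Sequence[float], x2: Sequence[float], y: Sequence[float]
) -> Dict[str, float]:
    """Coefficients, tolerance, and VIF for y = a0 + a1*x1 + a2*x2."""
    r_y1, r_y2, r_12 = pearson_r(y, x1), pearson_r(y, x2), pearson_r(x1, x2)
    tolerance = 1.0 - r_12 ** 2  # shared by both variables in this special case
    # Standardized (Beta) coefficients from the correlations.
    beta1 = (r_y1 - r_y2 * r_12) / tolerance
    beta2 = (r_y2 - r_y1 * r_12) / tolerance
    # Unstandardized coefficients: rescale each Beta by sd(y) / sd(x).
    b1 = beta1 * stdev(y) / stdev(x1)
    b2 = beta2 * stdev(y) / stdev(x2)
    a0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    return {
        "a0": a0, "b1": b1, "b2": b2,
        "beta1": beta1, "beta2": beta2,
        "tolerance": tolerance, "vif": 1.0 / tolerance,
    }


# Made-up data: roughly y = 1 + 2*x1 + 3*x2 plus a little noise.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
y = [9.1, 7.8, 19.2, 17.9, 26.0]

summary = two_predictor_summary(x1, x2, y)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

There is no Beta for a0 in the output, which is the same trick for telling the two columns apart that the Regression Coefficients bullet describes.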
Finally, the observations used to create the statistical model are evaluated using diagnostic statistics, including:
- Residuals—Residuals are the differences between the observed values and the model’s predictions. The residuals should all be small and Normally distributed.
- DFBETAs—The changes in the regression coefficients that would result from deleting the observation. DFBETAs should all be small and relatively consistent for all the observations.
- Studentized Deleted Residual—A measure of whether an observation of the dependent variable might be an outlier, calculated from a model fit without that observation. The studentized deleted residual is like a t-statistic; its absolute value should be small, preferably less than 2, if the observation is not overly influential.
- Leverage— A measure of whether an observation for an independent variable might be overly influential. The leverage for an observation should be less than two times the number of terms in the model divided by the sample size.
- Cook’s Distance— A measure of the overall impact of an observation on the coefficients of the model. If the CD for an observation is less than 0.2, the observation has little impact on the model. A CD value over 0.5 indicates a very influential observation.
These statistics are calculated for each sample used to create the model.
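Several of the per-observation diagnostics above have closed-form shortcuts when the model has a single independent variable. The sketch below uses those textbook formulas for a straight-line fit, with made-up data in which the last sample sits far from the others, and flags observations using the rules of thumb from the list above. The function name and dictionary keys are hypothetical, and models with more variables need matrix algebra or statistical software.

```python
import math
from typing import Dict, List, Sequence


def influence_diagnostics(x: Sequence[float], y: Sequence[float]) -> Dict[str, List[float]]:
    """Residuals, leverage, studentized deleted residuals, and Cook's distance
    for the least-squares line y = a0 + a1*x."""
    n, p = len(x), 2  # p counts the terms in the model: intercept and slope
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x

    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in resid)
    mse = sse / (n - p)

    # Leverage depends only on how far each x is from the mean of x.
    leverage = [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    # Studentized deleted residual, computed without refitting n times.
    deleted = [
        e * math.sqrt((n - p - 1) / (sse * (1.0 - h) - e * e))
        for e, h in zip(resid, leverage)
    ]
    # Cook's distance combines residual size and leverage.
    cooks = [e * e * h / (p * mse * (1.0 - h) ** 2) for e, h in zip(resid, leverage)]
    return {"residual": resid, "leverage": leverage, "deleted": deleted, "cooks": cooks}


# Made-up data: seven well-behaved samples and one far-out sample at x = 20.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0]
y = [2.0, 4.1, 5.9, 8.2, 9.9, 12.1, 13.8, 25.0]

diag = influence_diagnostics(x, y)
leverage_cutoff = 2 * 2 / len(x)  # two times the number of terms over the sample size
for i in range(len(x)):
    flags = []
    if diag["leverage"][i] > leverage_cutoff:
        flags.append("high leverage")
    if abs(diag["deleted"][i]) > 2:
        flags.append("large deleted residual")
    if diag["cooks"][i] > 0.5:
        flags.append("large Cook's distance")
    print(
        f"sample {i + 1}: leverage={diag['leverage'][i]:.3f} "
        f"deleted={diag['deleted'][i]:.2f} cooks={diag['cooks'][i]:.3f} "
        f"{', '.join(flags)}"
    )
```

A flag is a reason to take a closer look at a sample, not a license to delete it.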
You won’t necessarily use all of these diagnostics every time you build a model. Then again, you may also have to use some of the many other diagnostic statistics. You have to have the brains to know what statistics to use, the heart to follow through all the calculations and plots, and the courage to decide what diagnostics to ignore and what parts of the model you should change.
No Place for a Tome
In the fifth step of the modeling process, “if all is well” means that all the statistical tests and graphics that your software provides indicate that the model will be satisfactory for your needs. This, of course, is the crux of statistical modeling that statisticians write all those books about. You’ll want to get at least one reference for the type of analysis you want to do and maybe another one for the software you plan to use. Then you actually have to read them. Good luck with that.
The best results you can hope for, in a way, are the mundane conclusions that confirm what you expect, especially if they add a bit of illumination to the dark places on the horizon of your current knowledge. Expect that there will be some minor differences between simulations. They’ll probably be inconsequential. But be cautious if the results are a big surprise. Be skeptical of anything that might make you want to call a press conference. It’s OK to get surprising results, just be sure you aren’t the one surprised later to find an error or misinterpretation.
After you’re done with model calibration, you’re ready to implement the model in a process called deployment or rollout. You’ll find a lot of information about deployment on the Internet, particularly with regard to software. Most data analyses give birth to reports, mostly shelf debris. Statistical models that perform a function, though, usually involve software. These models can be programmed into a standalone application or integrated into available software like Access or Excel. Consider your audience. Perhaps the best advice is to keep a deployed model as simple as possible. Most users won’t have to know the details of the model, only how to use it. Be sure you provide enough documentation, though, so that any number crunchers in the group can marvel at your accomplishment.
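As one illustration of keeping a deployed model simple, here is a minimal sketch of a calibrated model wrapped so that users only supply inputs and get a prediction back. The class name, variable names, and coefficient values are all hypothetical placeholders. In practice you would record the parameters and the standard error of estimate from your own calibration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DeployedModel:
    """A calibrated linear model: users call predict(), nothing else."""

    intercept: float                # a0
    coefficients: Dict[str, float]  # unstandardized coefficients by variable name
    see: float                      # standard error of estimate, same units as y

    def predict(self, inputs: Dict[str, float]) -> float:
        # Fail loudly if the user leaves out a variable the model needs.
        missing = set(self.coefficients) - set(inputs)
        if missing:
            raise ValueError(f"missing inputs: {sorted(missing)}")
        return self.intercept + sum(
            coef * inputs[name] for name, coef in self.coefficients.items()
        )


# Hypothetical parameters, standing in for the output of a real calibration.
model = DeployedModel(intercept=1.0, coefficients={"x1": 2.0, "x2": 3.0}, see=0.5)

estimate = model.predict({"x1": 4.0, "x2": 5.0})
print(f"prediction: {estimate} (standard error of estimate: {model.see})")
```

Reporting the SEE next to each prediction reminds users how much uncertainty comes with the number.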