One thing that makes sports so much fun to follow is the plethora of statistics associated with every player, every game, every team, and every season. Other than government agencies, you won’t find better sources of data to practice on. It’s a simple matter to go to the website of a professional sport and find some raw data that needs analyzing.
In football (the American kind) it is often said that good offense provides excitement but good defense wins games. Fans of the 2006 Indianapolis Colts probably wouldn’t agree. Ranked 3rd in offense but 21st of 32 teams in defense, the Colts had a regular season record of 12 wins and 4 losses and won the Super Bowl. Maybe they were an anomaly. So the question is: are teams that make the postseason playoffs better defensively than the rest of the league, as the conventional wisdom claims?
Data for this analysis consisted of 26 variables (i.e., team performance statistics, such as number of plays, penalties, fumbles, 3rd and 4th down conversions, and time of possession) for the 32 NFL teams (thank you nfl.com). Having that many performance variables with comparatively few teams is a flag that factor analysis might be a useful way to proceed (http://statswithcats.wordpress.com/2010/08/27/the-right-tool-for-the-job/). Factor analysis (FA) is based on the concept that the variation in a set of variables can be rearranged and attributed to new variables, called factors. The use of factors instead of raw variables is sometimes preferable because factors are more efficient (i.e., fewer factors are needed to account for almost the same proportion of variability as the original variables).
FA requires some intuition to interpret. FA produces equations that define each factor in terms of the original variables:
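$F_1 = a_{11}x_1 + a_{12}x_2 + a_{13}x_3 + \dots + a_{1n}x_n$
$F_2 = a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + \dots + a_{2n}x_n$
$\vdots$
$F_m = a_{m1}x_1 + a_{m2}x_2 + a_{m3}x_3 + \dots + a_{mn}x_n$
where:
$F_1$ through $F_m$ are the $m$ factors that replace the original $n$ variables
$x_1$ through $x_n$ are the original variables
$a_{11}$ through $a_{mn}$ are the factor analysis weights ($a_{ij}$ is the weight of variable $x_j$ in factor $F_i$).
$m$ is always less than or equal to $n$, but is a lot less if you’re lucky.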
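To make the equations above concrete, here is a minimal Python sketch of how factor scores could be computed once the weights are known. The weights and variable values are made up for illustration (they are not from the NFL analysis), and the function name `factor_scores` is hypothetical. It shows only the weighted-sum step, not how FA estimates the weights.

```python
def factor_scores(weights, x):
    """Return [F1, ..., Fm], where Fi = sum over j of weights[i][j] * x[j].

    weights: m rows of n weights each (one row per factor).
    x: the n original (typically standardized) variable values for one team.
    """
    for row in weights:
        if len(row) != len(x):
            raise ValueError("each weight row needs one weight per variable")
    return [sum(a * v for a, v in zip(row, x)) for row in weights]


# Hypothetical example: n = 3 variables reduced to m = 2 factors.
weights = [
    [0.5, 0.5, 0.0],    # weights a11, a12, a13 for F1
    [0.0, 0.25, 0.75],  # weights a21, a22, a23 for F2
]
x = [1.0, -1.0, 2.0]    # made-up standardized values for one team

print(factor_scores(weights, x))  # [0.0, 1.25]
```

Each row of `weights` plays the role of one equation: the first row gives F1 and the second gives F2.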
What you have to do is look at the correlations between the original variables and the factors and guess what each factor might mean. It’s like being given a big box of parts—gears, transistors, tires, fabric, motors, pipes, wires, and lumber—and trying to figure out what they’re supposed to make. Some parts will be integral and others will be left over.
FA derived two factors from the 26 NFL statistics: an Offense Factor and a Defense Factor. No big surprise there; in fact, that’s what we were hoping for. Each factor accounts for about 20% of the total variation in the original variables, so together they account for about 40%. In other words, we’ve lost about 60% of the information contained in the original 26 variables in exchange for the simplicity of having just two variables. That’s a good example of why FA is often referred to as a data reduction technique.
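The bookkeeping behind that trade-off can be sketched in a few lines of Python. The 20% shares and the count of 26 variables come from the text. The eigenvalue conversion is an assumption: it holds only if the analysis was run on standardized variables, so that the total variance equals the number of variables, which the post does not state outright. The function names are hypothetical.

```python
N_VARIABLES = 26              # original NFL statistics (from the text)
FACTOR_SHARES = [0.20, 0.20]  # approximate share of variance per factor (from the text)


def retained_and_lost(shares):
    """Return (share of variance kept by the factors, share given up)."""
    retained = sum(shares)
    return retained, 1.0 - retained


def implied_eigenvalue(share, n_variables):
    """Variance of one factor in units of 'one standardized variable'.

    Assumes standardized inputs, so total variance == n_variables.
    """
    return share * n_variables


retained, lost = retained_and_lost(FACTOR_SHARES)
print(f"retained {retained:.0%}, lost {lost:.0%}")  # retained 40%, lost 60%

# Under the standardization assumption, each factor carries roughly
# as much variation as five of the original variables.
print(round(implied_eigenvalue(FACTOR_SHARES[0], N_VARIABLES), 1))  # 5.2
```

Seen this way, two factors standing in for 26 variables is a steep reduction, and each factor still does the work of several raw statistics.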
FA and the associated data reduction techniques of correspondence analysis and multidimensional scaling are like photographs. A photograph conveys only two of three spatial dimensions and usually includes no information about time, odors, sounds, temperature, or other circumstances, yet it still presents enough information so that observers can discern what is happening. So data reduction shouldn’t be taken as a pejorative descriptor. Sometimes simplifying a problem is the best way to solve it; at least that’s what William of Ockham thought. And after all, isn’t that what modeling is about?
Once the number of variables has been reduced to a manageable few factors, you can analyze patterns of relationships much more efficiently. Consider the scatter plot of how the 32 teams scored on the two factors and how far they got in the postseason. The two gray lines represent the averages of the Offense and Defense Factors. The Seattle Seahawks could be considered the average team of the 2006 season because they are located closest to the intersection of these two lines. Draw an imaginary line through the plot origin and the intersection of the lines (i.e., a 45° angle), and you’ll identify the most balanced teams, the teams with about the same scores for their Offense and Defense Factors. The most balanced teams from best to worst would be the Pittsburgh Steelers, the New York Giants, the Seattle Seahawks, the Tennessee Titans, the Cleveland Browns, and the Houston Texans. Of these, only the Giants and the Seahawks made the playoffs. So much for the importance of balance.
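The two readings of the plot described above (the "average" team and the "balanced" teams) can be sketched in Python. Everything in the data is hypothetical: the team names and scores are invented and are not the 2006 factor scores. The `max_gap` cutoff for "about the same scores" and the ranking of balanced teams by the sum of their two scores are also assumptions, since the post identifies these teams by eye from the plot.

```python
import math

# Made-up (offense, defense) factor scores; NOT the 2006 NFL results.
scores = {
    "Team A": (1.5, 1.25),
    "Team B": (2.0, -1.0),
    "Team C": (0.25, 0.0),
    "Team D": (-1.0, 1.5),
    "Team E": (-1.25, -1.0),
    "Team F": (-1.5, -0.75),
}


def average_point(scores):
    """Intersection of the two average lines: (mean offense, mean defense)."""
    n = len(scores)
    mean_off = sum(o for o, _ in scores.values()) / n
    mean_def = sum(d for _, d in scores.values()) / n
    return mean_off, mean_def


def closest_to_average(scores):
    """Team whose point lies nearest the intersection of the average lines."""
    mean_off, mean_def = average_point(scores)
    return min(
        scores,
        key=lambda t: math.hypot(scores[t][0] - mean_off, scores[t][1] - mean_def),
    )


def balanced_teams(scores, max_gap=0.5):
    """Teams near the 45-degree line (|offense - defense| <= max_gap),
    ordered from best to worst by the sum of the two scores (an assumption)."""
    near_line = [t for t, (o, d) in scores.items() if abs(o - d) <= max_gap]
    return sorted(near_line, key=lambda t: sum(scores[t]), reverse=True)


print(closest_to_average(scores))  # Team C
print(balanced_teams(scores))      # ['Team A', 'Team C', 'Team E']
```

Because both axes share the same scale increments, the gap between a team's offense and defense scores is a fair measure of how far it sits from the 45° line.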
[Note: There’s a reason why there are no values on the axes. Some readers who saw this graph were totally baffled by the numbers, so I took them out (http://statswithcats.wordpress.com/2011/01/16/ockham%E2%80%99s-spatula/). The units of the analysis were normalized and are meaningful only in relative terms. Both axes do have the same scale increments, however. A difference of 1 on the offense scale is analogous to a difference of 1 on the defense scale.]
The 2006 Super Bowl champion Colts had the highest score on the Offense Factor but the lowest score on the Defense Factor of any of the playoff teams. In fact, 63% of teams with an above-average Offense Factor score made the playoffs, compared to 44% of teams with an above-average Defense Factor score. So, is the notion that good defense beats good offense wrong? Not necessarily, but it sure didn’t apply in 2006.
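Percentages like these come from a simple count: take the teams above the league average on a factor and see what fraction of them reached the playoffs. Here is a minimal Python sketch of that calculation. The teams, scores, and playoff list are invented for illustration and will not reproduce the 63% and 44% figures. The function name is hypothetical.

```python
def playoff_rate_above_average(factor_scores, playoff_teams):
    """Fraction of teams scoring above the league mean that made the playoffs."""
    mean = sum(factor_scores.values()) / len(factor_scores)
    above = [team for team, score in factor_scores.items() if score > mean]
    if not above:
        raise ValueError("no team scored above the mean")
    return sum(1 for team in above if team in playoff_teams) / len(above)


# Made-up factor scores and playoff results; NOT the 2006 NFL data.
offense = {"A": 1.5, "B": 2.0, "C": 0.25, "D": -1.0, "E": -1.25, "F": -1.5}
defense = {"A": 1.25, "B": -1.0, "C": 0.0, "D": 1.5, "E": -1.0, "F": -0.75}
playoff_teams = {"A", "B"}

off_rate = playoff_rate_above_average(offense, playoff_teams)
def_rate = playoff_rate_above_average(defense, playoff_teams)
print(f"offense: {off_rate:.0%}, defense: {def_rate:.0%}")  # offense: 67%, defense: 50%
```

A higher rate for one factor means only that above-average teams on that factor reached the playoffs more often in this one season. It says nothing about why.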
So remember, if there’s no NFL football in 2011 because of contractual problems, you can always fall back on statistics to fill the gap. Then again, there’s always sabermetrics …