Predict the Next President of the United States

The American Statistical Association is sponsoring a new statistics contest for high school and college students. The contest, known as Prediction 2016, challenges students to use statistics to predict the next president of the U.S. The purpose of the contest is to get more students interested in statistics by showing them how it can apply to the real world. It’s part of the larger student education campaign This is Statistics. Here’s more information:


ASA Announces Prediction 2016, a National Student Contest to Predict the Next President of the United States


Sponsored by the American Statistical Association, Prediction 2016 is a contest for high school and undergraduate college students to predict the winner of the U.S. presidential election using statistical methods. Winners will receive a variety of prizes and perks, including exposure to the nation’s leading statisticians and data scientists.


One winner will be chosen among high school contestants and one among college contestants. Those with the most accurate predictions developed with sound statistical methods will win the contest.


October 24, 2016 at 5pm — Deadline for submitting predictions.

October 27, 2016 — ASA announces which candidate wins in the student predictions.

November 9, 2016 — ASA announces contest winners.

ASA spokespersons are available for interviews about the contest, as well as trends in statistics education and careers that are shaping the economy and workforce.

Media Contact:

  • Sarah Litton
  • (202) 851-2479

Regression Fantasies

Common Reasons for Doubting a Regression Model

Finding a model that fits a set of data is one of the most common goals in data analysis. Least squares regression is the most commonly used tool for achieving this goal. It’s a relatively simple concept, it’s easy to do, and there’s a lot of readily available software to do the calculations. It’s even taught in many Statistics 101 courses. Everybody uses it … and therein lies the problem. Even if there is no intention to mislead anyone, it does happen.

Here are eleven of the most common reasons to doubt a regression model.

Not Enough Samples

Accuracy is a critical component for evaluating a model. The coefficient of determination, also known as R-squared or R², is the most often cited measure of accuracy. Now obviously, the more accurate a model is the better, so data analysts look for large values of R-squared.

R-squared is designed to estimate the maximum relationship between the dependent and independent variables based on a set of samples (cases, observations, records, or whatever). If there aren’t enough samples compared to the number of independent variables in the model, the estimate of R-squared will be especially unstable. The effect is greatest when the R-squared value is small, the number of samples is small, and the number of independent variables is large, as shown in this figure.

The inflation in the value of R-squared can be assessed by calculating the shrunken R-squared. The figure shows that for an R-squared value above 0.8 with 30 cases per variable, there isn’t much shrinkage. Lower estimates of R-squared, however, experience considerable shrinkage.
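To make the shrinkage idea concrete, here is a minimal sketch. It assumes that "shrunken R-squared" means the common adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - p - 1), for a model with an intercept, n samples, and p independent variables. The figure may have been built with a different shrinkage formula, and the function name is made up for illustration.

```python
def shrunken_r2(r2, n_samples, n_predictors):
    """Adjusted (shrunken) R-squared for a model that includes an intercept.

    Assumes the common adjustment 1 - (1 - R2) * (n - 1) / (n - p - 1).
    """
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more samples than predictors plus one")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


# Compare a weak and a strong fit with 5 predictors,
# at 3 cases per variable and at 30 cases per variable.
for r2 in (0.3, 0.8):
    for n in (15, 150):
        adj = shrunken_r2(r2, n, 5)
        print(f"R2={r2:.1f}  n={n:3d}  p=5  shrunken R2={adj:.3f}")
```

The weak fit with few samples loses the most, which is the pattern described above.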

You can’t control the magnitude of the relationship between a dependent variable and a set of independent variables, and often, you won’t have total control over the number of samples and variables either. So, you have to be aware that R-squared will be overestimated and treat your regression models with some skepticism.

No Intercept

Almost all software that performs regression analysis provides an option to not include an intercept term in the model. This sounds convenient, especially for relationships that presume a one-to-one relationship between the dependent and independent variables. But when an intercept is excluded from the model, it’s not omitted from the analysis; it is set to zero. Look at any regression model with “no intercept” and you’ll see that the regression line goes through the origin of the axes.

With the regression line nailed down on one end at the origin, you might expect that the value of R-squared would be diminished because the line wouldn’t necessarily travel through the data in a way that minimizes the differences between the data points and the regression line, called the errors or residuals. Instead, R-squared is artificially inflated because when the correction provided by the intercept is removed, the total variation in the model increases. But, the ratio of the variability attributable to the model compared to the total variability also increases, hence the increase in R-squared.

The solution is simple. Always have an intercept term in the model unless there is a compelling theoretical reason not to include it. In that case, don’t put all your trust in R-squared (or the F-tests).
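Here is a small sketch of the inflation, using made-up data in which y hovers around 50 and has essentially no relationship with x. It assumes the no-intercept R-squared is computed against variation around zero rather than around the mean, which is how many packages report it; check what your own software does.

```python
def r2_with_intercept(x, y):
    """R-squared of an ordinary least squares line with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - my) ** 2 for b in y)  # variation around the mean
    return 1.0 - sse / sst


def r2_through_origin(x, y):
    """R-squared of a least squares line forced through the origin."""
    slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    sse = sum((b - slope * a) ** 2 for a, b in zip(x, y))
    sst0 = sum(b * b for b in y)  # variation around zero, not the mean
    return 1.0 - sse / sst0


# Made-up data: y is noise around 50, unrelated to x
x = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
y = [52, 48, 51, 49, 53, 50, 47, 52, 49, 51]

print(f"with intercept:    R2 = {r2_with_intercept(x, y):.3f}")
print(f"without intercept: R2 = {r2_through_origin(x, y):.3f}")
```

The second model fits the data worse, yet its R-squared is far larger because the total variation in the denominator ballooned.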

Stepwise Regression

Stepwise regression is a data analyst’s dream. Throw all the variables into a hopper, grab a cup of coffee, and the silicon chips will tell you which variables yield the best model. That irritates hard-core statisticians who don’t like amateurs messing around with their numbers. You can bet, though, that at least some of them go home at night, throw all the food in their cupboard into a crock pot, and expect to get a meal out of it.

The cause of some statisticians’ consternation is that stepwise regression will select the variables that are best for the data set, but not necessarily the population. Model test probabilities are optimistic because they don’t account for the stepwise procedure’s ability to capitalize on chance. Moreover, adding new variables will always increase R-squared, so you have to have some good ways to decide how many variables are too many. There are ways to do this. So using stepwise regression alone isn’t a fatal flaw. Like with guns, drugs, and fast food, you have to be careful how you use it.

If you use stepwise regression, be sure to look at the diagnostic statistics for the model. Also, verify your results using a different data set by splitting the data set before you do any analysis, by randomly extracting observations from the original data set to create new data sets, or by collecting new samples.
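The split-sample check can be sketched as follows. This is a simplified illustration, not a stepwise procedure: it fits a one-variable line on a random half of made-up data and then scores the same line on the held-out half. The function names, seed values, and data are all assumptions for the example.

```python
import random


def fit_line(x, y):
    """Least squares intercept and slope for one independent variable."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def r_squared(x, y, intercept, slope):
    """R-squared of a given line evaluated on the data (x, y)."""
    my = sum(y) / len(y)
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - my) ** 2 for b in y)
    return 1.0 - sse / sst


def holdout_check(x, y, seed=0):
    """Fit on a random half; return R-squared on that half and on the rest."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    train, test = idx[:half], idx[half:]
    xt, yt = [x[i] for i in train], [y[i] for i in train]
    xv, yv = [x[i] for i in test], [y[i] for i in test]
    intercept, slope = fit_line(xt, yt)
    return r_squared(xt, yt, intercept, slope), r_squared(xv, yv, intercept, slope)


# Made-up noisy data for illustration
rng = random.Random(1)
x = [float(i) for i in range(40)]
y = [2.0 + 0.5 * a + rng.gauss(0, 3) for a in x]

fit_r2, holdout_r2 = holdout_check(x, y)
print(f"R-squared on fitting half: {fit_r2:.2f}, on held-out half: {holdout_r2:.2f}")
```

A held-out R-squared far below the fitting R-squared is a hint that the model capitalized on chance.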


Outliers

Outliers are a special irritant for data analysts. They’re not really that tough to identify but they cause a variety of problems that data analysts have to deal with. The first problem is convincing reviewers not familiar with the data that the outliers are in fact outliers. Second, the data analysts have to convince all reviewers that what they want to do with them, delete or include or whatever, is the appropriate thing to do. One way or another, though, outliers will wreak havoc with R-squared.

Consider this figure, which comes from an analysis of slug tests to estimate the hydraulic conductivity of an aquifer. The red circles show the relationship between rising-head and falling-head slug tests performed on groundwater monitoring wells. The model for this relationship has an R-squared of 0.90. The blue diamond is an outlier along the trend (same regression equation) about 60% greater than the next highest value. The R-squared of this equation is 0.95. The green square is an outlier perpendicular to the trend. The R-squared of this equation is 0.42. Those are fairly sizable differences to have been caused by a single data point.
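You can reproduce the same kind of effect with a toy data set. The sketch below uses made-up numbers, not the slug test data, and adds one point along the trend and then one point off the trend to see what each does to R-squared.

```python
def r_squared(x, y):
    """R-squared of a straight-line least squares fit (squared correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)


# Made-up data: y follows x with a little alternating scatter
x = [float(i) for i in range(1, 11)]
y = [a + 0.8 * (-1) ** i for i, a in enumerate(x, start=1)]

base = r_squared(x, y)
along_trend = r_squared(x + [20.0], y + [20.0])  # far out, but on the trend
off_trend = r_squared(x + [2.0], y + [18.0])     # far away from the trend

print(f"no outlier:          R2 = {base:.2f}")
print(f"outlier along trend: R2 = {along_trend:.2f}")
print(f"outlier off trend:   R2 = {off_trend:.2f}")
```

One extra point is enough to push R-squared up or to drag it down sharply, depending on where the point sits relative to the trend.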

How should you deal with outliers? I usually delete them because I’m usually looking to model trends and other patterns. But outliers are great thought provokers. Sometimes they tell you things the patterns don’t. If you’re not comfortable deciding what to do with an outlier, run the analysis both with and without outliers, a time-consuming and expensive approach. The other approach would be to get the reviewer, an interested stakeholder, or an independent expert involved in the decision. That approach is time-consuming and expensive too. Pick your poison.

Non-linear relationships

Linear regression assumes that the relationship between a dependent variable and a set of independent variables is additive, or linear. If the relationship is actually nonlinear, the R-squared for the linear model will be lower than it would be for a better fitting nonlinear model.

This figure shows the relationship between the number of employed individuals and the number of individuals not in the U.S. work force between 1980 and 2009. The linear model has a respectable R-squared value of 0.84, but the polynomial model fits the data much better with an R-squared value of 0.95.

Non-linear relationships are a relatively simple problem to fix, or at least acknowledge, once you know what to look for. Graph your data and go from there.
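As a rough illustration, the sketch below compares a straight-line fit on x with a straight-line fit on a transformed variable, x squared. The data are made up (not the employment data in the figure), and transforming the independent variable is a simpler stand-in for the polynomial model mentioned above.

```python
def r_squared(x, y):
    """R-squared of a straight-line least squares fit (squared correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)


# Made-up curved data: y grows with the square of x, plus a little scatter
x = [float(i) for i in range(11)]
y = [a * a + 3.0 * (-1) ** i for i, a in enumerate(x)]

linear_r2 = r_squared(x, y)
transformed_r2 = r_squared([a * a for a in x], y)  # regress y on x squared

print(f"y versus x:         R2 = {linear_r2:.3f}")
print(f"y versus x squared: R2 = {transformed_r2:.3f}")
```

The linear fit still looks respectable, which is exactly why a plot of the data is worth more than the R-squared alone.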


Overfitting

Overfitting involves building a statistical model solely by optimizing statistical parameters, and usually involves using a large number of variables and transformations of the variables. The resulting model may fit the data almost perfectly but will produce erroneous results when applied to another sample from the population.

The concern about overfitting may be somewhat overstated. Overfitting is like becoming too muscular from weight training. It doesn’t happen suddenly or simply. If you know what overfitting is, you’re not likely to become a victim. It’s not something that happens in a keystroke. It takes a lot of work fine tuning variables and what not. It’s also usually easy to identify overfitting in other people’s models. Simply look for a conglomeration of manual numerical adjustments, mathematical functions, and variable combinations.


Misspecification

Misspecification involves including terms in a model that make the model look great statistically even though the model is problematical. Often, misspecification involves placing the same or very similar variable on both sides of the equation.

Consider this example from economics. A model for the U.S. Gross Domestic Product (GDP) was developed using data on government spending and unemployment from 1947 to 1997. The model:

GDP = (121*Spending) – (3.5*Spending²) + (136*Time) – (61*Unemployment) – 566

had an R-squared value of 0.9994. Such a high R-squared value is a signal that something is amiss. R-squared values that high are usually only seen in models involving equipment calibration, and certainly not anything involving capricious human behavior. A closer look at the study indicated that the model terms involving spending were an index of the government’s outlays relative to the economy. Usually, indexing a variable to a baseline or standard is a good thing to do. In this case, though, the spending index was the proportion of government outlays per the GDP. Thus, the model was:

GDP = (121*Outlays/GDP) – (3.5*(Outlays/GDP)²) + (136*Time) – (61*Unemployment) – 566

GDP appears on both sides of the equation, thus accounting for the near perfect correlation. This is a case in which an index, at least one involving the dependent variable, should not have been used.

Another misspecification involves creating a prediction model having independent variables that are more difficult, time consuming, or expensive to generate than the dependent variable. You might as well just measure the dependent variable when you need to know its value. Similarly with forecasting (prediction of the future) models, if you need to forecast something a year in advance, don’t use predictors that are measured less than a year in advance.


Multicollinearity

Multicollinearity occurs when a model has two or more independent variables that are highly correlated with each other. The consequences are that the model will look fine, but predictions from the model will be erratic. It’s like a football team. The players perform well together but you can’t necessarily tell how good individual players are. The team wins, yet in some situations, the cornerback or offensive tackle will get beat on most every play.

If you ever tried to use independent variables that add to a constant, you’ve seen multicollinearity in action. In the case of perfect correlations, such as these, statistical software will crash because it won’t be able to perform the matrix mathemagics of regression. Most instances of multicollinearity involve weaker correlations that allow statistical software to function, yet the predictions of the model will still be erratic.

Multicollinearity occurs often in the social sciences and other fields of study in which many variables are measured in the process of model building. Diagnosis of the problem is simple if you have access to the data. Look at correlations between the independent variables. You can also look at the variance inflation factors (VIFs). The VIF for an independent variable is the reciprocal of one minus the R-squared value from regressing that variable on the other independent variables. VIFs are measures of how much the variances of the model’s coefficients are inflated because of multicollinearity. The VIF for a variable should be less than 10 and ideally near 1.
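Here is a minimal sketch of a VIF calculation on made-up data. It relies on a standard identity: the VIFs are the diagonal elements of the inverse of the correlation matrix of the independent variables, which gives the same values as 1/(1 - R²) from regressing each independent variable on the others. The helper names and data are hypothetical, and the hand-rolled matrix inversion is only meant for a handful of variables.

```python
def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [sum(v * v for v in c) ** 0.5 for c in centered]
    k = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j]))
             / (norms[i] * norms[j]) for j in range(k)] for i in range(k)]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (small matrices only)."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("perfectly collinear variables: VIF is undefined")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def vifs(columns):
    """Variance inflation factor for each independent variable."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Made-up independent variables: x2 is nearly 2 * x1, x3 is mostly unrelated
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
x3 = [5.0, 1.0, 4.0, 2.0, 8.0, 3.0, 7.0, 6.0]

for name, vif in zip(("x1", "x2", "x3"), vifs([x1, x2, x3])):
    flag = "  <-- above 10" if vif > 10 else ""
    print(f"VIF({name}) = {vif:.1f}{flag}")
```

With perfectly correlated variables the correlation matrix can't be inverted, which is the same wall statistical software runs into.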

If you suspect multicollinearity, don’t worry about the model but don’t believe any of the predictions.


Heteroscedasticity

Regression, and practically all parametric statistics, requires that the variances in the model residuals be equal at every value of the dependent variable. This assumption is called equal variances, homogeneity of variances, or coolest of all, homoscedasticity. Violate the assumption and you have heteroscedasticity.

Heteroscedasticity is assessed much more commonly in analysis of variance models than in regression models. This is probably because the independent variables in ANOVA are measured on a categorical scale, which supplies ready-made groups for comparing variances, while the variables in regression are measured on continuous scales. The solution to this is fairly simple. Break the dependent variable scale into intervals, like in a histogram, and calculate the variance for each interval. The variances don’t have to be precisely equal, but variances different by a factor of five are problematical. Unequal variances will wreak havoc on any tests or confidence limits calculated for model predictions.
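The interval check might be sketched like this. It assumes the variance of interest is the variance of the model residuals within equal-width intervals of the dependent variable; the function names, the number of intervals, and the data are made up for illustration.

```python
from statistics import variance


def variance_by_interval(y, residuals, n_intervals=3):
    """Variance of the residuals within equal-width intervals of y."""
    lo, hi = min(y), max(y)
    if hi == lo:
        raise ValueError("the dependent variable has no spread to divide up")
    width = (hi - lo) / n_intervals
    groups = [[] for _ in range(n_intervals)]
    for value, res in zip(y, residuals):
        idx = min(int((value - lo) / width), n_intervals - 1)
        groups[idx].append(res)
    # An interval needs at least two points to have a variance
    return [variance(g) if len(g) >= 2 else None for g in groups]


def variance_ratio(variances):
    """Largest interval variance divided by the smallest."""
    usable = [v for v in variances if v is not None and v > 0]
    return max(usable) / min(usable)


# Made-up example: residuals that fan out as the dependent variable grows
y = [float(v) for v in range(1, 13)]
residuals = [0.1 * v * (-1) ** i for i, v in enumerate(y)]

variances = variance_by_interval(y, residuals)
ratio = variance_ratio(variances)
print("interval variances:", [round(v, 3) for v in variances])
print(f"largest / smallest = {ratio:.1f}",
      "(more than five: problematical)" if ratio > 5 else "(within a factor of five)")
```

How many intervals to use is a judgment call; too many and each interval has too few points to give a stable variance.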


Autocorrelation

Autocorrelation involves a variable being correlated with itself. It is the correlation between data points and the previously listed data points (termed a lag). Usually, autocorrelation involves time-series data or spatial data, but it can also involve the order in which data are collected. The terms autocorrelation and serial correlation are often used interchangeably. If the data points are collected at a constant time interval, the term autocorrelation is more typically used.

If the residuals of a model are autocorrelated, it’s a sure bet that the variances will also be unequal. That means, again, that tests or confidence limits calculated from variances should be suspect.

To check a variable or residuals from a model for autocorrelation, you can conduct a Durbin-Watson test. The Durbin-Watson test statistic ranges from 0 to 4. If the statistic is close to 2.0, then serial correlation is not a problem. Most statistical software will allow you to conduct this test as part of a regression analysis.
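If your software doesn’t report it, the Durbin-Watson statistic is easy to compute by hand: the sum of squared differences between successive residuals divided by the sum of squared residuals. A minimal sketch, with made-up residuals listed in the order they were collected:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in collection order."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Made-up residuals: runs of the same sign suggest positive autocorrelation,
# constant sign flipping suggests negative autocorrelation.
runs = [1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0]
flips = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

print("long runs: DW =", durbin_watson(runs))    # well below 2
print("sign flips: DW =", durbin_watson(flips))  # well above 2
```

Values well below 2 point toward positive serial correlation and values well above 2 toward negative serial correlation. The statistic alone isn’t the test, though; judging significance requires the Durbin-Watson critical values.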


Weighting

Most software that calculates regression parameters also allows you to weight the data points. You might want to do this for several reasons. Weighting is used to make more reliable or relevant data points more important in model building. It’s also used when each data point represents more than one value. The issue with weighting is that it will change the degrees of freedom, and hence, the results of statistical tests. Usually this is OK, a necessary change to accommodate the realities of the model. However, if you ever come upon a weighted least squares regression model in which the weightings are arbitrary, perhaps done by an analyst who doesn’t understand the consequence, don’t believe the test results.

Is Your Regression Model Telling the Truth?

There are many technologies we use in our lives without really understanding how they work. Television. Computers. Cell phones. Microwave ovens. Cars. Even many things about the human body are not well understood. But I don’t mean how to use these mechanisms. Everyone knows how to use these things. I mean understanding them well enough to fix them when they break. Regression analysis is like that too. Only with regression analysis, sometimes you can’t even tell if there’s something wrong without consulting an expert.

Here are some tips for troubleshooting regression models.


You may know how to use regression analysis, but unless you’re an expert, you may not know about some of the more subtle pitfalls you may encounter. The biggest red flag that something is amiss is the TGTBT, too good to be true. If you encounter an R-squared value above 0.9, especially unexpectedly, there’s probably something wrong. Another red flag is inconsistency. If estimates of the model’s parameters change between data sets, there’s probably something wrong. And if predictions from the model are less accurate or precise than you expected, there’s probably something wrong. Here are some guidelines for troubleshooting a model you developed.

Your Model | Identification | Correction
Not Enough Samples | If you have fewer than 10 observations for each independent variable you want to put in a model, you don’t have enough samples. | Collect more samples. 100 observations per variable is a good target to shoot for although more is usually better.
No Intercept | You’ll know it if you do it. | Put in an intercept and see if the model changes.
Stepwise Regression | You’ll know it if you do it. | Don’t abdicate model building decisions to software alone. What’s the fun in that?
Outliers | Plot the dependent variable against each independent variable. If more than about 5% of the data pairs plot noticeably apart from the rest of the data points, you may have outliers. | Conduct a test on the aberrant data points to determine if they are statistical anomalies. Use diagnostic statistics like leverage to evaluate the effects of suspected outliers. Evaluate the metadata of the samples to determine if they are representative of the population being modeled. If so, retain the outlier as an influential observation (AKA leverage point).
Non-linear relationships | Plot the dependent variable against each independent variable. Look for nonlinear patterns in the data. | Find an appropriate transformation of the independent variable.
Overfitting | If you have a large number of independent variables, especially if they use a variety of transformations and don’t contribute much to the accuracy and precision of the model, you may have overfit the model. | Keep the model as simple as possible. Make sure the ratio of observations to independent variables is large. Use diagnostic statistics like AIC and BIC to help select an appropriate number of variables.
Misspecification | Look for any variants of the dependent variable in the independent variables. Assess whether the model meets the objectives of the effort. | Remove any elements of the dependent variable from the independent variables. Remove at least one component of variables describing mixtures. Ensure the model meets the objectives of the effort with the desired accuracy and precision.
Multicollinearity | Calculate correlation coefficients and plot the relationships between all the independent variables in the model. Look for high correlations. | Use diagnostic statistics like VIF to evaluate the effects of suspected multicollinearity. Remove intercorrelated independent variables from the model.
Heteroscedasticity | Plot the variance at each level of an ordinal-scale dependent variable or appropriate ranges of a continuous-scale dependent variable. Look for any differences in the variances of more than about five times. | Try to find an appropriate Box-Cox transformation or consider nonparametric regression or data mining methods.
Autocorrelation | Plot the data over time, location or the order of sample collection. Calculate a Durbin–Watson statistic for serial correlation. | If the autocorrelation is related to time, develop a correlogram and a partial correlogram. If the autocorrelation is spatial, develop a variogram. If the autocorrelation is related to the order of sample collection, examine metadata to try to identify a cause.
Weighting | You’ll know it if you do it. | Compare the weighted model with the corresponding unweighted model to assess the effects of weighting. Consider the validity of weighting; seek expert advice if needed.

Sometimes the model you are skeptical about isn’t one you developed; it’s one developed by another data analyst. The major difference is that with other analysts’ models, you won’t have access to all their diagnostic statistics and plots, let alone their data. If you have been retained to review another analyst’s work, you can always ask for the information you need. If, however, you’re reading about a model in a journal article, book, or website, you’ve probably got all the information you’re ever going to get. You have to be a statistical detective. Here are some clues you might look for.

Another Analyst’s Model | Identification
Not Enough Samples | If the analyst reported the number of samples used, look for at least 10 observations for each independent variable in the model. If not, you may be able to estimate the number from a scatterplot.
No Intercept | If the analyst reported the actual model (some don’t), look for a constant term.
Stepwise Regression | Unless another approach is reported, assume the analyst used some form of stepwise regression.
Outliers | Assuming the analyst did not provide plots of the dependent variable versus the independent variables, look for R-squared values that are much higher or lower than expected.
Non-linear relationships | Assuming the analyst did not provide plots of the dependent variable versus the independent variables, look for a lower-than-expected R-squared value from a linear model. If there are non-linear terms in the model, this is probably not an issue.
Overfitting | Look for a large number of independent variables in the model, especially if they use different types of transformation.
Misspecification | Look for any variants of the dependent variable in the independent variables. Assess whether the model meets the objectives of the effort.
Multicollinearity | Assuming relevant plots and diagnostic statistics are not available, there may not be any way to identify multicollinearity.
Heteroscedasticity | Assuming relevant plots and diagnostic statistics are not available, there may not be any way to identify heteroscedasticity.
Autocorrelation | Assuming relevant plots and diagnostic statistics are not available, there may not be any way to identify serial correlation.
Weighting | Compare the reported number of samples to the degrees of freedom. More DF than samples is usually attributable to weighting.

Follow-up Care

So there are some ways you can identify and evaluate eleven reasons for doubting a regression model. Remember when evaluating other analysts’ models that not everyone is an expert and that even experts make mistakes. Try to be helpful in your critiques, but at a minimum, be professional.


Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis from online booksellers.


How to Write Data Analysis Reports in Six Easy Lessons


In every data analysis, putting the analysis and the results into a comprehensible report is the final, and for some, the biggest hurdle. The goal of a technical report is to communicate information. However, the technical information is difficult to understand because it is complicated and not readily known. Add math anxiety and the all too prevalent notion that anything can be proven with statistics and you can understand why reporting on a data analysis is a challenge.

The ability to write effective reports on a data analysis shouldn’t be assumed. It’s not the same as writing a report for a class project that only the instructor will read. It’s not uncommon for data analysts to receive little or no training in this style of technical writing. Some data analysts have never done it, and they fear the process. Some haven’t done it much, and they think every report is pretty much the same. Some learned under different conditions, like writing company newsletters, and figure they know everything there is to know about it. And worst of all, some have done it without guidance and have developed bad habits, but don’t know it.

It’s a pretty safe bet that if you haven’t taken college classes or professional development courses, haven’t been mentored on the job, and haven’t done some independent reading, you have a bit to learn about writing technical reports. Report writing is like any other skill; you get better by learning more about the process and by practicing. Here are four things you can try to improve your skills.

  • Educate yourself. Learn what other people think about technical writing. Visit websites on “statistical analysis reports” and “technical writing”; there are millions of them. Take online or local classes. Read books and manuals. Join Internet groups, such as through Yahoo, Google, or LinkedIn. Immerse yourself in the topic as you did when you were in school.
  • Understand criticism. Over the course of your career, you’ll give and receive a lot of criticism on technical reports. Not all criticism is created equal. First, consider the source. Some critics have never written a report on a data analysis and some have never even analyzed data. Still, if the critic is the one paying the bills you have to deal with it. For your part, you should learn how to provide constructive criticism. Unless a report you are reviewing is a complete mess, respect the report writer’s discretion for structure and format. Focus on content. Be nice.
  • Download examples. Search the internet for examples of data analysis reports (Hint: adding pdf and download to the search might help). Critique them. Who’s the audience? What’s the message? What’s good and bad about each report? Which reports do you think are good examples? What do they do that you might want to do yourself in the future?
  • Find what’s right for you. When you search the Internet for advice on technical writing or take a few classes from knowledgeable instructors, you’ll hear some different opinions. Everyone will talk about audience and content but most will have more limited views of report organization, writing style, and how you work at writing. Ignore what the experts tell you to do if it doesn’t feel right. Just be sure that the path you eventually choose works for you and the audiences who will read your reports.

If you’ve done all that, it’s just a matter of practice. You’ll learn something from each report you write. If you are new to the process of reporting on a data analysis, consider these six easy lessons:

  • Lesson 1—Know your content
  • Lesson 2—Know your audience
  • Lesson 3—Know your route
  • Lesson 4—Get their attention
  • Lesson 5—Get it done
  • Lesson 6—Get acceptance.

Lesson 1—Know your Content

Start with what you know best. In writing a data analysis report, what you know best would be the statistics, graphing, and modeling you did.

You should be able to describe how you characterized the population, how you generated the data or the sources that provided them, what problems you found in the data during your exploratory analysis, how you scrubbed the data, what you did to treat outliers, what transformations you applied, what you did about dropouts and replicates, and what you did with violations of assumptions and non-significant results.

From that, you’ll need to determine what’s important, and then, what’s important to the reader. Unless you’re writing the report to your professor in college or your peers in a group of professional data analysts, you can be pretty sure that no one will want to hear about all the issues you had to deal with, the techniques you used, or how hard you worked on the analysis. No one will care if your results came from Excel or an R program you wrote. They’ll just want to hear your conclusions. So, what’s the message you want to deliver? That’s the most important thing you’ll have to keep in mind while writing.

Once you work out your message, write a summary of the report so you’ll know where you’re going. It will help you stay on track. Your summary might take one of three forms:

  • Executive Summary. Aimed at decision makers and people with not enough time or patience to read more than 400 words. Limit your summary to less than one page, do not use any jargon, and provide only the result the decision maker needs to know to take an appropriate action (i.e., the message you want to convey).
  • Overview. Aimed at most people, whether they would read the report or not. An overview is an abridged version of what is in the report, with a focus on the message you want to convey. The overview shouldn’t be more than a few pages.
  • Abstract. Aimed at peers and other people who understand data analysis. An abstract summarizes in a page or less everything of importance that you did, from defining the population through assessing effect sizes. Abstracts are most often used in academic articles.

Once you understand who your audience is, you can rewrite the summary to catch the attention of your readers.


Lesson 2—Know Your Audience

Every self-help article about technical writing starts by telling readers to consider their audience. Even so, probably few report writers do.

In a statistical analysis, you usually start by considering the characteristics of the population about which you want to make inferences. Similarly, when you begin to write a report on an analysis, you usually start by considering the characteristics of the audience with which you want to communicate. You have to think about the who, what, why, where, when, and how of the key people who will be reading your report. Here are some things to consider about your audience.


Who

Audience is often defined by the role a reader plays relative to the report. Some readers will use the report to make decisions. Some will learn new information from the report. Others will critique the report in terms of what they already know. Thus, the audience for a statistical report is often defined as decision makers, stakeholders, reviewers, or generally interested individuals.

Some reports are read by only a single individual but most are read by many. All kinds of people may read your report. As a consequence, there can be primary, secondary, and even more levels of audience participation. This is problematical; you can’t please everyone. So in defining your audience, focus first on the most important people to receive your message and second on the largest group of people in the audience.


What

Once you define who you are targeting with your report, you should try to understand their characteristics. Perhaps the most important audience characteristic for a technical report writer is the audience’s understanding of both the subject matter of the report and the statistical techniques being described. You may not be able to do much about their subject matter knowledge but you can adjust how you present statistical information. For example, audiences a data analyst might encounter include:

  • Mathphobes. Fear numbers but may listen to concepts. Don’t use any statistical jargon. Don’t show formulas. Use numbers sparingly. For example, substitute “about half” for any percentage around 50%. The extra precision won’t be important to a Mathphobe.
  • Bypassers. Understand some but have little interest. Don’t worry about Bypassers; they won’t read past the summary. Be sure to make the summary pithy and highlight the most important finding; otherwise, they might key on something relatively inconsequential.
  • Tourists. Understand some and are interested. Be gentle. Use only essential jargon that you define clearly. Using numbers is fine; just don’t use too many in a single table. Round off values so you’re not implying false precision. Stick with nothing more sophisticated than pie charts, bar graphs, and maybe an occasional scatter chart. Don’t use any formulas.
  • Hot Dogs. Know less than they think and want to show it. Using jargon is fine so long as you define what you mean. Even a Hot Dog may learn something. In the same vein, using numbers, statistical graphics, and formulas is fine so long as you clearly explain their meanings. Hot Dogs may come to erroneous conclusions if not guided.
  • Associates. Other analysts who understand the basic jargon. Anything is fine so long as you clearly explain what you mean.
  • Peers. Other data analysts who understand all the jargon. Anything goes.
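The advice about numbers for Mathphobes and Tourists can be sketched in a few lines of Python. This is only an illustration: the function name `describe_share`, the audience labels, and the cutoffs (such as treating 45% to 55% as “about half”) are assumptions, not rules from this post.

```python
def describe_share(percent, audience):
    """Phrase a percentage for a given audience (hypothetical helper).

    The cutoffs are illustrative assumptions; adjust them to your report.
    """
    if audience == "mathphobe":
        # Replace the number with a plain-language phrase.
        if percent < 10:
            return "very few"
        if percent < 45:
            return "fewer than half"
        if percent <= 55:
            return "about half"
        if percent < 90:
            return "more than half"
        return "nearly all"
    if audience == "tourist":
        # Round to a whole percent to avoid implying false precision.
        return f"{round(percent)}%"
    # Associates and Peers can take the extra decimal place.
    return f"{percent:.1f}%"


for audience in ("mathphobe", "tourist", "peer"):
    print(audience, "->", describe_share(52.3, audience))
```

The same value is reported three ways, so the level of detail follows the reader rather than the software output.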

The audience characteristics provide guidance for report length and writing tone and style.


Are readers likely to be very interested in your report or just curious about it (if they have no interest, they won’t be readers)? Be honest with yourself. Why would anyone be interested in reading your report? What is the objective of the people you defined as your audience? What will they do with your findings? Will they get informed? Will they make a decision or take an action? Is this a big thing for them or just something they have to tune in to?

Where

Is the report aimed at a finite, confined group, like the organization the analysis was conducted for, or will anyone be able to read it? Is the report aimed at the upper levels of the organization or the rank-and-file (i.e., bottom up or top down)? Are there any concerns for security or confidentiality, either on the individual or organizational levels?


When does the audience need to see your report? Who has to review the report and how long might they take before the report is released? How firm are the deadlines? How much time does this leave you to write the report? Will there be enough time to think through what you need to write? Will there be time to conduct additional analyses needed to fill in gaps in the report outline? Will you be outraged when the time taken to review your report is twice as long as the time you took to write it?

Here’s some advice you should take to heart. Never, never, never submit a draft report for review that isn’t your fully complete, edited, masterpiece. I tell myself to follow this rule with every report I write. Unfortunately, like most people, I don’t listen to what I say.


Finally, consider how the report should be presented so that the audience will get the most out of it. Here are five considerations:

  • Package. How will your writing be packaged (i.e., assembled into a product for distribution)? Will it be a short letter report, a comprehensive report, a blog or an Internet article, a professional journal article, a white paper, or will your writing be included as part of another document?
  • Format. Will your report be distributed as an electronic file or as a paper document? If it will be an electronic document, will it be available on the Internet? Will it be editable? Will it be restricted somehow, such as with a password?
  • Appearance. Will the report be limited to black-and-white or will color be included? What will be the ratio of graphics to text? Will the report be conventional or glitzy, like a marketing brochure? Will there be 11”x17” foldout pages or oversized inserts like maps?
  • Specialty items. Will you need to provide some items apart from the report, such as electronic data files, analysis scripts or program codes, and outputs? Will you have to create a presentation from the contents of the report? Will your graphics be used for courtroom or public presentations?
  • Accessibility. Do you need to follow the guidelines of Section 508 of the Rehabilitation Act of 1973, which may affect your use of headings, tables, graphic objects, and special characters? Should you account for common forms of color blindness in your color graphics?
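For the Accessibility item, one check that can be automated is color contrast. The sketch below computes the contrast ratio defined in the Web Content Accessibility Guidelines (WCAG) 2.x, where Level AA asks for at least 4.5:1 for normal-size text. Whether a particular threshold applies to your report is something to confirm against your own requirements. The function names are made up for illustration, and a contrast check does not by itself test for color blindness.

```python
def relative_luminance(hex_color):
    """WCAG 2.x relative luminance of an sRGB color such as '#1F77B4'."""
    hex_color = hex_color.lstrip("#")
    channels = [int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Undo the sRGB gamma encoding, channel by channel.
    linear = [
        c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
        for c in channels
    ]
    red, green, blue = linear
    return 0.2126 * red + 0.7152 * green + 0.0722 * blue


def contrast_ratio(color_a, color_b):
    """Contrast ratio between two colors, from 1 (none) to 21 (maximum)."""
    lum_a = relative_luminance(color_a)
    lum_b = relative_luminance(color_b)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


# Check a few label colors against a white page background.
for label_color in ("#000000", "#767676", "#777777"):
    ratio = contrast_ratio(label_color, "#FFFFFF")
    verdict = "passes" if ratio >= 4.5 else "fails"
    print(f"{label_color} on white: {ratio:.2f} ({verdict} the 4.5:1 check)")
```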

Take a Few Moments

You won’t have to address all of these details in evaluating your audience and many will only require a few moments of thought. But, if you think through these considerations, you’ll have a much better idea of who you are writing the report for and how you should write it.


Lesson 3—Know Your Route


You’ve been taught since high school to start with an outline. Nothing has changed with that. However, there are many possible outlines you can follow depending on your audience and what they expect. The first thing you have to decide is what the packaged report will look like.

Will your report be an executive brief (not to be confused with a legal brief), a letter report, a summary report, a comprehensive report, an Internet article or blog, a professional journal article, or a white paper, to name a few? Each has its own type of audience, content, and writing style. Here’s a summary of the differences.

[Report table]

Writing a report is like taking a trip. The message is the asset you want to deliver to the ultimate destination, the audience. The package is the vehicle that holds the message. Now you need a map for how to reach your destination. That’s the outline.

Just as there are several possible routes you could take with a map, there are several possible outline strategies you could use to write your report. Here are six.

  • The Whatever-Feels-Right Approach. This is what inexperienced report writers do when they have no guidelines. They do what they might have done in college or just make it up as they go along. This might work out just fine or be as confusing as The Maury Show on Father’s Day. Considering that the report involves statistics, you can guess which it would be.
  • The Historical Approach. This is another approach that inexperienced report writers use. They do what was done the last time a similar report was produced. This also might work out fine. Then again, the last report may have been a failure, ineffective in communicating its message.
  • The “Standard” Approach. Sometimes companies or organizations have standard guidelines for all their reports, even requiring the completion of a formal review process before the report is released. Many academic and professional journals use such a prescriptive approach. The results may or may not be good, but at least they look like all the other reports.
  • The Military Approach. You tell ‘em what you’re going to tell ‘em, you tell ‘em, and then you tell ‘em what you told ‘em. The military approach may be redundant and boring, but some professions live by it. It works well if you have a critical message that can get lost in details.
  • The Follow-the-Data Approach. If you have a very structured data analysis it can be advantageous to report on each piece of data in sequence. Surveys often fall into this category. This approach makes it easy to write the report because sections can be segregated and doled out to other people to write, before being reassembled in the original order. The disadvantage is that there usually is no overall synthesis of the results. Readers are left on their own to figure out what it all means.
  • The Tell-a-Story Approach. This approach assumes that reading a statistical report shouldn’t be as monotonous as mowing the lawn. Instead, you should pique the reader’s curiosity by exposing the findings like a murder mystery, piece by piece, so that everything fits together when you announce the conclusion. This is almost the opposite of the follow-the-data approach. In the tell-a-story approach, the report starts with the simplest data analyses and builds, section by section, to the great climax—the message of the analysis. Analyses that are not relevant to the message are omitted. There are usually arcs, in which a previously introduced analytical result is reiterated in subsequent sections to show how it supports the story line. Graphics are critical in this approach; outlines are more like storyboards. There may be the equivalent of one page of graphics for every page of text. Telling a story usually takes longer to write than the other approaches but the results are more memorable if your audience has the patience to read everything (i.e., don’t try to tell a story to a Bypasser.)

So, be sure that you have an appropriate outline but don’t let it constrain you. Having a map doesn’t mean you can’t change your route along the way; you just need to get to the destination. In building the outline, try to balance sections so the reader has periodic resting points. Within each section, though, make the lengths of subsections correspond to their importance.

Lesson 4—Get Their Attention

If you’re writing a report about statistics, you have to expect that many readers will lose interest after a while, if they even had it to begin with. So, in writing the report, think about how you might engage your audience. Here are five ideas.

  • Find Common Ground.  Every relationship begins with having something in common. Fighting a common foe or solving a common problem can form the strongest and longest lasting of bonds. So the first thing you should try to establish in your report is that common ground. This isn’t so difficult if you are working on an analysis at the behest of a client. The client is already immersed in the data and has invested in you to help solve the problem. Establishing common ground is not so easy if you are proffering an uninvited message. Some people, perhaps subconsciously, don’t really want the message you are offering, especially when you’re analyzing data in their area of expertise. Try to establish common ground in other areas. Perhaps your analysis touches on a similar or analogous issue the reader might have. Maybe the analysis procedure could be used on a different problem the reader might have.
  • Clear the Decks. Get rid of everything that doesn’t add to the progression of the report. That doesn’t necessarily mean you have to omit the content. You can relegate it to an appendix, which is pretty much the same thing. Unless required to be in the body of the report, things like the data, data collection surveys and forms, and scrubbing and analysis procedures should all be put in an appendix.
  • Set the Tone. Your writing style can either add to or detract from the readability of your report. A formal tone, with strict adherence to grammar rules, complex sentence structures, use of third-person point-of-view and passive voice, and plentiful jargon, is appropriate for most data analysis reports. Formal tones are good for describing details, specifications, and step-by-step instructions. However, formal tones can be more difficult to understand, especially for individuals not accustomed to reading technical reports. An informal tone, with simple grammar and vocabulary, colloquialisms, contractions, analogies, and humor, works well for blogs. Informal tones are good for discussing ideas and concepts, and for inspiring readers or communicating a vision. They are more engaging and tend to be easier for most individuals to understand. If you’re being paid to write the report, a formal tone is usually more appropriate. This is problematical, of course, because formal writing is usually harder to read and maintain an interest in.
  • Add Mind Candy. A Harry Potter novel consisting of page-after-page of text will keep readers, young and old, transfixed for hours. A data analysis report consisting of page-after-page of text will put readers into a coma faster than a handful of barbiturates taken with a glass of warm milk in a tub of hot water while meditating. The difference is that the novel engages readers with mental images. Data analysis reports need to use visual imagery, which for the most part means good graphics. Granted, most readers won’t understand anything more complicated than a pie chart or a bar chart, but don’t add to the confusion. Three-dimensional charts are a no-no. Avoid graphing data in more than a few categories to avoid making the slices and bars uninterpretable. And most importantly, make sure your graphics add to the analysis. You can do more, too. Break up the text with subheadings and bullets. Reiterate information nuggets in boxes instead of just letting them get lost in the text. Use tables for explaining differences in data groups and not just for number buckets. Add footnotes or hyperlinks to explain collateral concepts.

  • Make it Better. Just when you think you’re done writing, you’re not. That’s the time when you have to do even more to make the report better. First, take some time off if you can. Then, read it through again making improvements along the way. Read it aloud if you need to, even record it when you read it aloud and then play it back so you can engage both your vision and hearing. Consider getting a second opinion, especially if you can’t distance yourself from the report by setting it aside for a few days. A second opinion may come from a data analysis peer, but don’t ignore nontechnical editors. A good editor can help with spelling, grammar, punctuation, word choice, style and tone, formatting, references, and accessibility. It’s usually worth the effort. This is the time to go for purrfection.
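The Add Mind Candy advice to avoid graphing more than a few categories can be sketched in code. The example below is a minimal Python illustration that keeps the largest categories and merges the rest into a single bucket before charting. The function name `collapse_categories`, the limit of five slices, and the survey counts are all made up for the example.

```python
from collections import Counter


def collapse_categories(counts, max_slices=5, other_label="Other"):
    """Keep the largest categories and merge the rest into one bucket."""
    ranked = Counter(counts).most_common()  # largest counts first
    if len(ranked) <= max_slices:
        return dict(ranked)
    keep = ranked[:max_slices - 1]
    rest_total = sum(count for _, count in ranked[max_slices - 1:])
    return dict(keep + [(other_label, rest_total)])


# Hypothetical survey counts with too many categories for one pie chart.
survey = {"Cats": 40, "Dogs": 30, "Fish": 12, "Birds": 8,
          "Hamsters": 5, "Lizards": 3, "Ferrets": 2}

chart_data = collapse_categories(survey, max_slices=5)
print(chart_data)
```

The totals are preserved, so percentages computed from the collapsed data still add up to 100%.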

Lesson 5—Get It Done

Perhaps the hardest part of writing a data analysis report is just getting it completed. It takes discipline and persistence to stay on track. Even so, it’s easy to get distracted. Sometimes the problem is that the story of the analysis hasn’t been thought all the way through. Sometimes there are gaps in the analysis that necessitate stopping to complete more calculations. Sometimes there are too many interruptions and distractions to maintain focus. Sometimes, the process of writing becomes boring and requires a great effort to continue.

Writer’s block is an impediment experienced by all writers. Writer’s block might be attributable to not knowing what to write next, trying to write text that is perfect, or fear of failure. Any of these reasons may be applicable to the report writer. Here are ten ways to fight off writer’s block.

1. Stick with a routine. Keep writing even if you are dissatisfied with what you’ve written. You can, and should, edit your draft after you’re done. Try to identify your productivity tipping point. For some people, accomplishing a specific goal by a certain time in a day helps ensure the rest of their day is productive. For example, my productivity tipping point is beginning to write by 8 AM. If I do, I’ll be writing productively all day.

2. Visualize. If you’ve never used visualization techniques before, now is a good time to develop the skill. The idea is to close your eyes, get relaxed, and think about what you want to do or see. Start by visualizing what the next few sentences you have to write might look and sound like. Eventually, you’ll be able to visualize what paragraphs, sections, and even the entire final product will look like.

3. Eschew perfection. If it’s not perfect the first time you write it, leave it alone. Let it age while you write the rest of the report. You can reevaluate and rewrite it later when you know more about the rest of the report.

4. Write in parallel. Some parts of reports, like introductions and summaries, and descriptions of variables and other details, are almost formulaic. Write all the similar parts at the same time. Set up a second file in your word processing software to serve as a staging area for the repeated parts. Then, copy and paste the standardized parts to your report and edit the text as appropriate.

5. Grow the outline. Instead of trying to write the report section by section, try using the outline as a template rather than a map. Add key phrases, instructions, notes, sentences, and even paragraphs to the template-outline. You can skip around the template-outline as you come up with ideas for what to write. Eventually, you can consolidate these ideas into paragraphs and then sections. Continue to expand the template-outline until it ultimately becomes the complete report.

6. Tiptoe through the tables. Create all or most of your graphics (i.e., tables and figures) before starting to write. Lay the graphics out in your word processing software and write the text that would go with each graphic. Then, go back and fill in the gaps between graphics. Continue joining the pieces until the report is complete.

7. Chunk it up. Don’t try to write the entire report by yourself. Break it up into pieces and get help.

8. Set deadlines. Sometimes it helps to be able to work towards an interim goal. Set deadlines for sections or other tasks you have to accomplish. Make them challenging but achievable.

9. Give it a rest. Absence makes the mind grow sharper. Consider taking some time off from report writing, but make sure you use the time productively. Schedule that colonoscopy you’ve been putting off. Clean the garage and paint the house. Visit your in-laws. Don’t just play video games or watch Netflix.

10. Do something different. If your routine isn’t working, try doing something different. If you can’t get anywhere because you’re pressing, work on something else or take some time off. If you can’t get anywhere because you’re slacking, try researching. If you can’t get anywhere because you’re stuck on writing, pull together graphics or the appendices. If you can’t get anywhere because you’re procrastinating, ask yourself why.


Lesson 6—Get Acceptance

Data analysis reports have to go through one more hurdle after they are completely written. They have to be approved for acceptance by a gatekeeper. The approval for acceptance may involve allowing report distribution, starting the publishing process, issuing payment for your services, or just acknowledging that your work is done. The gatekeeper may be your client, your supervisor, your publisher, or for blog writers, you. To get that approval, formal reports usually have to be reviewed by reviewers. Reviewers are usually individuals the gatekeeper chooses based on their technical background or role in the gatekeeper’s organization. Sometimes, reviewers are individuals the gatekeeper is forced to listen to, like regulatory reviewers. In academic publishing, you may not even know who the peer reviewers are.

Logically, the acceptance review shouldn’t take too long compared to the time you took to analyze the data and write the report. After all, the reviewers only have to read it. In practice, though, reviews take far longer than report preparation. The report you wrote in a month may take six months to be reviewed. Don’t panic. It’s just the way things seem to happen.

The number of comments you get from the reviewers is inconsequential. Great reports can get dozens of highly critical comments. Again, don’t panic. The only review you should be concerned about is the one that provides no comments. That usually signals a lack of interest by the reviewers and the gatekeeper.

Six Tips for Responding to Reviewers from Editage Insights

When the review is complete, be sure to get the comments in writing. If you don’t, some comments may be forgotten or misunderstood. If there is more than one reviewer, compile all the comments together. This is essential because sometimes reviewers provide conflicting comments. The gatekeeper may compile the comments for you if he or she wants to control the process. The comments should be placed in the order they correspond to in the report. Be sure to identify the source of each comment. If a single comment has many parts, break the comment apart so you can respond to each part individually.

Then comes the challenging part—you must respond to each comment separately. Create a new document listing all the compiled comments. For each comment in this document, either describe what you’ll do in response or explain why you won’t make any changes. Start with the easy comments, such as those involving grammar and spelling. As you describe your response to a comment in the document, make the associated change in the report. Proceed through increasingly more difficult comments until you are done. For very complex comments, try to parse the ideas and respond to each separately. If a particular comment is very difficult to address, you may have to conduct additional analyses or information research. Cite information sources if appropriate.

When you’re done, reread both the response document and the changes in the report. Be sure all the changes were made in the report and that they are consistent with the rest of the report. Also, make sure the tone of your response is even; be stoic.

Nine Tips for Responding to Criticism by Alain Briot

If you’ve written an informal piece, like a blog, you don’t have to go through the grueling process of responding to formal comments from an acceptance review. Since you are the gatekeeper, you can release your blog whenever you feel it is done. But after you release the blog, you may well get comments. That’s good because it shows that people are reading your blog. Furthermore, there’s no pressure to compile these comments and document your responses. Unfortunately, at least some of the comments will come from spammers, trolls, 13-year-olds, head cases, angry arguers, and other individuals who won’t be providing constructive criticism. Therefore, first consider the source of each comment. In some cases, you won’t have to respond to any of them. Your blogging software will allow you to delete unwelcome comments. Beware of the overly gracious comments, too. Sometimes malicious commenters use addresses that link to spam or malware. If you don’t trust your instincts, just delete the comment.

Don’t get upset by reviewers pointing out flaws in your report. That’s what they’re supposed to do. Having been on both sides of the writer/reviewer divide, I can tell you that creating a report takes a hundred times more knowledge, creativity, effort, and time than reviewing a report. Providing constructive criticism on a report requires a hundred times more experience, situational awareness, and interpersonal sensitivity than creating a report. Good writing combined with constructive reviewing makes a data analysis report the best it can be.

Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at,, or other online booksellers.


Ten Ways Statistical Models Can Break Your Heart


Models are beautiful. The way their features are combined sets them apart from each other. Each has its own personality, sometimes pleasant, sometimes not, and often not what you would expect.

Here are ten ways your love affair with statistical models can end up on the rocks.

Relationship-Building Disasters

Modeling is more than just meeting a dataset on the internet and jumping into some R code together. You have to develop relationships with the data and everyone associated with them. For example:

  • Miscommunications. There are often quite a few people who have some stake in the model. They usually have different experiences and levels of understanding of modeling and, of course, different agendas for how the model will be treated. They won’t necessarily trust you. You have to try to keep them all happy and on the same page.
  • Interference. You may be doing all the heavy lifting with the data and the modeling but there are often individuals, like a boss, the client, or independent reviewers, who poke their fingers into your efforts.
  • Delays. You may feel under the gun to complete a modeling project but that doesn’t mean everyone associated with the project will share your constraints. You may be asked to redo the model every time new data become available, attend meetings, make presentations, and wait for decisions from upper management.
  • Skepticism. Not everyone is driven to make decisions after a careful analysis of relevant data. Some people prefer to rely on their gut feel. They may look at your model but then ignore those results and use their own intuition.
  • Indifference. On occasion, you might create a model, even what you consider a groundbreaking model, but nobody pays attention to it. Your model may be ignored for an inferior model, like an undrafted football player being benched in favor of a million-dollar bust. Or, people just don’t appreciate the importance of the model like you do. You’ll still need to get their acceptance.

Unrequited Models

You put your heart and soul into modeling the dataset but you get … NOTHING. No love in return. No matter how much you’ve planned, you can’t find a collection of independent variables that will adequately model your dependent variable. It happens to data analysts everywhere, all the time, for a variety of reasons. There may be non-linear relationships, outliers, or excessive uncontrolled variance. The variables may be inappropriate or inefficient.

What can you do?

First, you should reexamine the theory behind your model. Are your hypotheses and assumptions valid? Are your data suspect? Are the metrics you’re using as variables problematical? Are there latent concepts you could explore in a Factor Analysis? Do your samples need to be categorized in some way? Might conducting a Cluster Analysis provide insight?

Second, examine your correlations thoroughly. See if there are any transformations that might be helpful.
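As a rough illustration of how a transformation can help, the Python sketch below compares the correlation between two variables before and after taking logarithms. The data are made up (an exactly exponential relationship), and `pearson_r` is a small helper written for this example rather than a function from any statistics package.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: y grows exponentially with x, so the relationship is curved.
x = list(range(1, 11))
y = [math.exp(0.5 * value) for value in x]

r_raw = pearson_r(x, y)                         # straight line through a curve
r_log = pearson_r(x, [math.log(v) for v in y])  # log(y) is linear in x

print(f"r with raw y:  {r_raw:.3f}")
print(f"r with log(y): {r_log:.3f}")
```

Real data will not straighten out this cleanly, but a clear jump in correlation after a transformation suggests which form of the variable to model.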

Third, if you have appropriate software, consider looking into nonlinear statistical regression, neural networks, and data mining solutions. Finally, there may be ways to construct probabilistic models, or models based on optimization procedures, or relative solutions from experts using a Delphi Method.

In the end:

Some models were not meant to be. If you can’t fit the model to the data, you have to be prepared to call it quits. In a way, this is equivalent to a Do Not Resuscitate order in medicine, and likewise, it can be a sensitive subject. It’s usually easier to create new variables or try some other statistical manipulation than it is to give the bad news, and the bill, to the client.

Muddled Models

Sometimes models go wrong right out of the box because they are improperly specified. You may not be pursuing the relationship for the right reasons or in the right ways. For example:

  • The dependent or independent variables may be too expensive to collect. The model may even cost more to run than addressing the problem is worth.
  • The dependent variable may not be actionable, at least not within the limits set by the client.
  • An independent variable might incorporate part of the dependent variable, if one or the other is a ratio.
  • The structure of the model may be wrong, for example, the model might be better as a multiplicative or other non-linear form instead of linear.
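The ratio problem in the list above can be demonstrated with simulated data. In this Python sketch, two variables are generated independently, yet a strong correlation appears once one of them is divided by the other. The variable names (`units`, `total_cost`), the ranges, and the seed are hypothetical choices for the illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


random.seed(42)  # fixed seed so the simulation is repeatable
n = 5000

# Two simulated variables generated independently of each other.
units = [random.uniform(1, 3) for _ in range(n)]
total_cost = [random.uniform(8, 12) for _ in range(n)]

# The dependent variable is a ratio that contains the independent variable.
cost_per_unit = [c / u for c, u in zip(total_cost, units)]

r_raw = pearson_r(units, total_cost)       # near zero by construction
r_ratio = pearson_r(units, cost_per_unit)  # strongly negative

print(f"units vs. total cost:    r = {r_raw:.2f}")
print(f"units vs. cost per unit: r = {r_ratio:.2f}")
```

The second correlation says nothing about the phenomenon. It exists only because `units` sits in the denominator of the dependent variable.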

Wandering Eye Models

There are many different types of models, like fish in the sea. Some people are always looking for something better, even if what they have is pretty good.

For example, you might have a good model but it’s not what the client expected. Perhaps the results are not what the client wanted to hear or the model may look good for general trends but not be an adequate representation of the phenomenon for extreme or special cases. The client wants you to try again and bring back something better.

One concept that often confuses novice model builders is the difference between models aimed at prediction and models aimed at explanation. Explanatory models are based on theory. They need to incorporate independent variables that make theoretical or logical sense to be associated with the dependent variable. Prediction models don’t rely on theory. They need independent variables that produce large values of the Coefficient of Determination (r²) but low values of the Standard Error of Estimate (s_xy or SEE). Explanatory models assume (or hope) that there are cause-effect relationships between the dependent variable and the independent variables; prediction models do not.

That’s where some clients balk if the model doesn’t have the variables they feel should be in a prediction model. It usually doesn’t matter if the model produces excellent predictions; they feel it would be better if their favorite variables were there … even though it wouldn’t.

It’s not just clients, though. There are times when model builders, especially young professionals, want to try out some new analytical breakthrough. The tried-and-true regression approach may produce results that are nearly as good, but the cutting edge model looks and sounds so much sexier. It’s seductive, and for some, hard to resist.

Deceptive Profile Models

Don’t you just hate it when you see something that isn’t at all the way it was described? “Hey, you should try analyzing this dataset. It’s a perfect match for you.” But then when you meet up, it’s nothing like you expected.

Maybe the expected population from which the data are drawn doesn’t really exist. Maybe the quality of the data is questionable or needs a lot of cleanup. Maybe the samples are biased or misleading.

And it’s not just what goes into a model that might be disappointing but also what comes out of modeling activities. The regression model itself might be improperly specified or misleading. Sometimes correctly specified models are poorly calibrated. Fortunately, there are also a variety of statistical diagnostics and plots that can be used to identify the problems.

Mercurial Models

Every measurement of a phenomenon includes characteristics of the population and natural variability as well as unwanted sampling variability, measurement variability, and environmental variability. You can’t understand your data unless you control extraneous variance attributable to the way you select samples, the way you measure variable values, and any influences of the environment in which you are working. If you plan to conduct a statistical analysis, you need to understand the three fundamental Rs of variance control: Reference, Replication, and Randomization. Using the concepts of reference, replication and randomization, you can control, minimize, or at least be able to assess the effects of extraneous variability using: procedural controls; quality samples and measurements; sampling controls; experimental controls; and statistical controls.

Even after spending considerable effort trying to control extraneous variance in data collection, though, sometimes the models produced from the data don’t share the precision. The models may have good accuracy, shown by large values of the Coefficient of Determination (r²), but low precision, shown by a large Standard Error of Estimate (s_xy or SEE). You might have an accurate predictive model that lacks enough precision to be useful. This is a surprisingly common occurrence. Some data analysts don’t seem to look past the r². The s_xy is ignored.
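To see how a model can score well on r² and still be imprecise, here is a minimal Python sketch that fits a straight line by least squares and reports both statistics. The data are made up: a steep trend with an alternating offset of plus or minus 20 units standing in for noise. The helper names are written for this example only.

```python
import math


def fit_line(xs, ys):
    """Least squares slope and intercept for a simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r2_and_see(xs, ys):
    """Coefficient of Determination and Standard Error of Estimate."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1 - sse / sst
    # Two parameters were estimated, so divide by n - 2.
    see = math.sqrt(sse / (len(xs) - 2))
    return r2, see


# Made-up data: steep trend plus an alternating +20 / -20 offset as "noise".
x = list(range(50))
y = [100 + 10 * v + (20 if v % 2 == 0 else -20) for v in x]

r2, see = r2_and_see(x, y)
print(f"r2 = {r2:.3f}, SEE = {see:.1f}")
```

The r² comes out above 0.95 because the trend dominates the total variation, yet the SEE is roughly 20 units. If the predictions need to be good to within a few units, this model is not precise enough, no matter how impressive the r² looks.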

Look at any studies you can that involve predictive modeling. Do they discuss the uncertainty in the predictions? What do you think?

Run-away Models

Sometimes you spend months and even longer getting to know your data and building a relationship only to have the model taken away. Maybe it’s a boss or more senior co-worker. Maybe it’s the client. You can chase after your model, keep up to speed with what’s happening in the model’s life, but that’s about it. There’s not much else you can do. It’s somebody else’s responsibility now.

Irreconcilable Difference Models

You and your model may reach a point where you might want to go to the next level in your relationship only to find there are differences you did not expect and can’t overcome. When you try to extend the relationship to new situations, everything fails. There are several possible reasons. Maybe you have a multi-level model. What worked for the samples you used doesn’t work when they are aggregated into higher level associations. Maybe you’re a victim of Simpson’s Paradox. What worked for the samples you used doesn’t work when they are separated into component groups. Then again, maybe it’s something you did. Maybe your model is overfit. Perhaps you capitalized on chance and found associations that weren’t pervasive and lasting. The only thing you can do is reexamine the relationship and either start over or move on.
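Simpson’s Paradox is easier to believe once you’ve seen it happen. The sketch below uses two tiny, made-up groups (the numbers are assumptions chosen only to make the effect obvious). Within each group the trend runs downhill, but pool the groups and the fitted line runs uphill.

```python
def slope(x, y):
    """Least squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    spread_x = sum((xi - mean_x) ** 2 for xi in x)
    return cross / spread_x

# Two hypothetical groups; within each one, y falls as x rises.
group_a_x, group_a_y = [1, 2, 3], [3, 2, 1]
group_b_x, group_b_y = [6, 7, 8], [8, 7, 6]

slope_a = slope(group_a_x, group_a_y)
slope_b = slope(group_b_x, group_b_y)
slope_pooled = slope(group_a_x + group_b_x, group_a_y + group_b_y)

print(f"group A slope: {slope_a:+.2f}")
print(f"group B slope: {slope_b:+.2f}")
print(f"pooled slope:  {slope_pooled:+.2f}")
```

The pooled slope is positive only because group B sits higher on both axes than group A. A model built on the pooled data would tell you exactly the wrong thing about either group.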

Marry or Break Up

There comes a time when you have to decide whether to commit to the effort to build a relationship or back out of the commitment. Maybe you don’t have enough samples. Maybe your goals don’t fit what the model needs. Perhaps the model is being asked to do something it wasn’t designed for. What works for describing a population may not be suited to describing individuals in the population. Then there might also be ethical issues to consider. But statisticians rarely get to make these decisions. If they accepted the assignment, the product belongs to the client.

Happily Never After

Deploying a model can sometimes change the behaviors of the population the model is based on. This is especially true when humans are involved; humans just love to game the rules. For example, if you develop a model for allocating resources, you can be assured that the potential recipients will do whatever it takes to increase their advantage. Once they do that, the model is no longer useful. That’s why models are often kept secret.


Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis from online booksellers.


Looking for Insight through a Window

At a press briefing on February 12, 2002, then Secretary of Defense Donald Rumsfeld addressed the absence of evidence linking the government of Iraq with weapons of mass destruction:

There are known knowns. There are things we know that we know. There are known unknowns. That is to say, there are things that we now know we don’t know. But there are also unknown unknowns. There are things we do not know we don’t know.

Now, despite the statement being a transparently irresponsible attempt to cover up a monumental failure in the collection and analysis of information or just a REALLY BIG LIE, the statement actually makes some sense. Similar words have been attributed to Confucius and others. But whether he realized it or not, Mr. Rumsfeld was describing a type of data analysis window.

Analytical windows are a type of matrix plot. Matrix plots are just grids for organizing information. The cells of a matrix plot can contain data, tables, graphs, or text. Windows consist of two criteria, or dimensions, defined by rows and columns. Each dimension usually has two categories, or levels, resulting in four cells, or panes. Rumsfeld’s window would look like this:


                 Things that We
                 Know                                  Don’t Know
We Know          Things that we know we know.          Things that we don’t know we know.
We Don’t Know    Things that we know we don’t know.    Things that we don’t know we don’t know.

So, for example, a Rumsfeld Window could be used for planning a statistical study.

  • Things that we know we know would be things like background information on the study environment, the underlying theory on the phenomenon being explored, and the statistical characteristics of the population.
  • Things that we don’t know we know would be things like the statistical assumptions we make to perform the analysis — independence of observations, normality and homoscedasticity of errors.
  • Things that we know we don’t know would be things like the results of the research questions and test hypotheses we plan to focus on.
  • Things that we don’t know we don’t know would be things like the causes of outliers and other data and analysis anomalies.

The beauty of a window is the way it can organize sometimes complex information into simple binary categories. As a consequence, windows are used in many ways to analyze data.


Johari Windows

A Johari Window is a tool used by psychologists to help individuals and groups evaluate interpersonal communications. Its name comes from the first names of Joseph Luft and Harry Ingham, who created it in 1955. To use the window, subjects are told to pick five or six adjectives they feel describe their own personality from a standard list of 56 adjectives. Peers of the subject are then given the same standard list of 56 adjectives, and each picks five or six adjectives that describe the subject. These adjectives are then placed in the appropriate pane of the Johari Window.


                       Known to Self    Not Known to Self
Known to Others
Not Known to Others



Johari windows were featured on a 2010 episode of the television series Fringe, which was seen by six million viewers, most of whom probably had no idea what they are.

Variance Windows

Windows can also be applied to planning how to control extraneous variance in the process of collecting data. If you plan to conduct a statistical analysis, you’ll need to understand the three fundamental Rs of variance control — Reference, Replication, and Randomization. Every measurement of a phenomenon includes characteristics of the population and natural variability as well as unwanted sampling variability, measurement variability, and environmental variability. You can’t understand your data unless you control extraneous variance attributable to the way you select samples, the way you measure variable values, and any influences of the environment in which you are working. Using the concepts of reference, replication and randomization, you can control, minimize, or at least be able to assess the effects of extraneous variability using: procedural controls; quality samples and measurements; sampling controls; experimental controls; and statistical controls.


                 Sources of Variance that we
                 Understand                           Don’t Understand
Control          Sampling and measurement variance    Sampling and measurement variance, environmental variance
Don’t Control    Natural variance                     Sampling and measurement variance, environmental variance

To use a window to plan a variance control program, fill the panes of the window with all the sources of variability you can think of, categorized by how well you understand the source and think you can control it. Then identify a control measure for each source of variation.

Pick Charts

A Pick Chart is a Lean Six Sigma tool for comparing difficulty of implementation (in terms of costs, effort, complexity, or time) to possible results (paybacks, returns, impacts, or improvements) for actions being considered. These two concepts serve as the axes of a data analysis window having four quadrants:

  • Possible: “ideas that are considered ‘low hanging fruit’. The effort to implement is low, but the impact is also low. These should only be implemented after everything in the ‘Implement’ quadrant.”
  • Implement: “ideas that should be implemented as they will have a high impact and require low effort.”
  • Challenge: “ideas that should be considered for implementation after everything in the ‘Implement’ column. The impact is high, but the effort is also high.”
  • Kill: “ideas that should be ‘killed’ or not implemented. The effort to do so is high and the impact is low.”

Here’s an example involving the federal Employee Viewpoint Survey. In this pick chart, eighteen EVS question areas are compared according to:

  • Payoff from the actions being considered to improve EVS scores
  • Difficulty anticipated in successfully undertaking the actions.

Payoff was calculated (after scale adjustments) as the product of the score for a question and the decline in the scores from 2012 to 2014. Difficulty was based on: (1) who would have to be involved in implementing the change (i.e., many or few staff; in the main office or satellite offices; at staff, supervisor, or senior leader levels); (2) if existing programs or policies would be used or if they would have to be created; and (3) the funding required to implement the change. Payoff is based on actual EVS data so there is not much uncertainty. Difficulty is based on judgments concerning what generic actions might be taken to improve job satisfaction, so there is considerable uncertainty. Thus, the positions of the icons representing the EVS question areas are likely to shift horizontally, depending on the nature of specific projects being considered, but not vertically.
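If you’d rather sort ideas into quadrants with a few lines of code than by eyeballing a chart, something like the following sketch would do it. The area names, the 0 to 10 scales, the scores, and the cutoffs are all hypothetical; they are not the EVS values, and where to draw the line between “low” and “high” is a judgment call you would have to make yourself.

```python
def pick_quadrant(payoff, difficulty, payoff_cut, difficulty_cut):
    """Place one idea in a Pick Chart quadrant.

    Values at or above a cutoff count as "high" (an assumed tie rule).
    """
    high_payoff = payoff >= payoff_cut
    high_difficulty = difficulty >= difficulty_cut
    if high_payoff:
        return "Challenge" if high_difficulty else "Implement"
    return "Kill" if high_difficulty else "Possible"

# Hypothetical question areas on made-up 0-10 scales: (payoff, difficulty).
ideas = {
    "Area 1": (8.0, 2.5),
    "Area 2": (7.5, 8.0),
    "Area 3": (2.0, 3.0),
    "Area 4": (1.5, 9.0),
}

for name, (payoff, difficulty) in ideas.items():
    quadrant = pick_quadrant(payoff, difficulty, payoff_cut=5.0, difficulty_cut=5.0)
    print(f"{name}: {quadrant}")
```

Because difficulty is the shakier of the two scores, it is worth rerunning the sort with a few different difficulty values to see which ideas change quadrants.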

Performance Windows

A performance window is a way to convey the results of a statistical test or classification. It is a table with two rows and two columns that summarize the number of correct classifications (true positives and true negatives), and the number of misclassifications (false positives and false negatives). This type of window is also called a confusion matrix, an error matrix, or a matching matrix.

Here are performance windows for classifications and statistical tests.

                              Predicted Classification
                              A                         B
Actual Classification    A    Correct Classification    Misclassification
                         B    Misclassification         Correct Classification

                                                Statistical Test
                                                Null hypothesis is not rejected    Null hypothesis is rejected
Actual Condition    Null hypothesis is true     Correct Inference                  False Positive – Type I Error
                    Null hypothesis is false    False Negative – Type II Error     Correct Inference
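Filling in the panes of a performance window is just counting. Here is a minimal sketch, assuming you have a list of actual classes and a matching list of predicted classes; the labels and the function name are made up for illustration.

```python
from collections import Counter

def performance_window(actual, predicted, positive):
    """Count the four panes of a two-class performance window."""
    panes = Counter({"true positive": 0, "false negative": 0,
                     "false positive": 0, "true negative": 0})
    for a, p in zip(actual, predicted):
        if a == positive:
            panes["true positive" if p == positive else "false negative"] += 1
        else:
            panes["false positive" if p == positive else "true negative"] += 1
    return dict(panes)

# Hypothetical labels for eight cases.
actual = ["A", "A", "A", "B", "B", "B", "B", "A"]
predicted = ["A", "B", "A", "B", "A", "B", "B", "A"]

window = performance_window(actual, predicted, positive="A")
for pane, count in window.items():
    print(f"{pane}: {count}")
```

The two “true” panes are the correct classifications and the two “false” panes are the misclassifications. The four counts should always add up to the number of cases, which is a handy check.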

A contingency table is a type of matrix plot, frequently with more than two levels on each dimension or even more than two dimensions, that summarizes how often combinations of data values occur. Contingency tables are also called cross tabulation tables.

Windows on Scatter Plots

The concept of dividing areas of information into more understandable parts can be extended to scatter plots. Plots can be divided into quadrants, for example, using the means (or medians) of the data points for each axis. In essence, the window is overlain on the scatter plot. The window can be subdivided further by standard deviations (or quartiles).

[Figure: scatter plot of English grades versus Math grades, divided into quadrants]

The performance window for this scatter plot would be:

                                  Math Grade
                                  Below Average    Above Average    Total
English Grade    Above Average    9                18               27
                 Below Average    16               8                24
                 Total            25               26               51
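Counting the points in each quadrant can be sketched in a few lines. The grades below are made up, not the data behind the plot, and the rule that a grade exactly equal to the mean counts as above average is an assumption; use whatever tie rule suits your data.

```python
from collections import Counter

def window_counts(row_values, col_values):
    """Count points per quadrant, splitting each variable at its mean.

    Keys are (row side, column side). A value equal to the mean is
    treated as "above" (an assumed tie rule).
    """
    row_mean = sum(row_values) / len(row_values)
    col_mean = sum(col_values) / len(col_values)
    counts = Counter()
    for r, c in zip(row_values, col_values):
        row_side = "above" if r >= row_mean else "below"
        col_side = "above" if c >= col_mean else "below"
        counts[(row_side, col_side)] += 1
    return counts

# Hypothetical grades for six students.
english = [60, 81, 58, 88, 72, 91]
math_grades = [55, 62, 70, 78, 85, 94]

counts = window_counts(english, math_grades)
for english_side in ("above", "below"):
    for math_side in ("below", "above"):
        n = counts[(english_side, math_side)]
        print(f"English {english_side}, Math {math_side}: {n}")
```

Swapping the means for medians, or adding cuts at the standard deviations or quartiles, only changes how the sides are assigned; the counting stays the same.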




It’s Hard to be a Data-Driven Organization

Why is it so Hard?

Do you work for a data-driven organization, or one that claims to be a data-driven organization, or one that wants to be a data-driven organization? You probably do, whether you work for a big retailer or a small service provider. Every organization wants to believe that they use information to make decisions in an unbiased manner, although not every organization actually does that. It’s definitely not easy getting to be a real data-driven organization. At a minimum, an organization has to address five issues:

  • Funding. Being data-driven is a top-down decision because it must be supported by adequate funding. Without funding, all you can do is talk about how you’re data-driven. Talk is cheap; funding is commitment.
  • Data. Organizations should have standard processes that generate relevant business data of appropriate granularity and quality. There should be owners for each type of data who are responsible for the data quality, availability, and security. Small organizations can implement these concepts in less elaborate ways than large organizations. For example, one person may oversee all data operations in a small organization compared to a department of experts in a large organization. Even micro-sized organizations can have ready access to data. All it takes is an internet connection that allows searching for data and analyses others have posted.
  • IT Support. Generating, storing, accessing, analyzing, and reporting on data requires software and hardware resources, connectivity technologies, and communications capabilities. Again, one person can do everything or there can be a whole department of technicians supported by vendors and contractors. An organization just has to have enough consistently available support that it can rely on.
  • User Skillset. To be of any use, data has to be converted into information, and information into knowledge. One person can do everything but it’s better if there is a team of data scientists because no individual is likely to be familiar with all the different types of data analysis that might be appropriate. In an ideal situation, all employees would have some knowledge of data analysis techniques, even if it’s just a required statistics course they took in college. It’s easier to run a data-driven organization if everyone understands the roles data and business analytics have in their daily work and the organization’s objectives.
  • Decision-making Culture. The most important aspect of successful data-driven organizations is the attitudes of the individuals making decisions. If they would prefer to rely exclusively on their intuition to run their organizations, the organization won’t be data-driven no matter how much funding, data, support, and employee skills there are.

Why Do Some Individuals Avoid Data?

It may seem counterintuitive that some people avoid using data for their decision-making. They will guess, speculate, make assumptions, and argue for hours about matters that could be resolved quickly and convincingly by using data. They’ll follow hunches to decide what they want to do and then claim success based on little more than a few cherry-picked anecdotes. If you suggest looking at data, you might be asked “what do we need data for?” They’ll caution you against “information overload” and “paralysis by analysis.” They might tell you “that’s not what the big boss wants.” They’ll find all sorts of excuses. In the end, you can lead your boss to data but you can’t make him think.

Why do these people avoid collecting and analyzing data to address problems, especially in the current age of pervasive technological connectivity? There are a few possibilities.


Some people actually have a fear of information, possibly related to a fear of numbers (arithmophobia), technology (technophobia), computers (logizomechanophobia or cyberphobia), ideas (ideophobia), truth (alethephobia or veritaphobia), novelty (kainolophobia or kainophobia), or change (metathesiophobia). More likely, they might fear that they are incompetent to make a decision, perhaps associated with the Peter Principle. They might say “Let’s do it the way we did it before,” or “let’s not rock the boat.”


Some people just aren’t comfortable with numbers. Artists, for example, tend to be more comfortable with creative spatial and visual thinking compared to engineers who tend to be more comfortable with logical and quantitative thinking. Perhaps it’s a right-brain versus left-brain phenomenon, perhaps not. Think of how you make a major purchase. If you compare specifications and unit prices for each possible brand or model, going back and forth and back and forth, you’re what is called an analytical buyer. If you just buy the product in the red box because it has a picture of a cat on it that looks like one you own, you’re what is called an intuitive buyer. The same goes with decision-making. Some people trust their hunches more than they trust numbers.


Some people aren’t accustomed to solving problems with data. They don’t know how to collect and analyze data. They wouldn’t even know where to start. They might talk to a few co-workers for anecdotal information but wouldn’t know how to generate representative data. They don’t know that data may already exist. They don’t understand how readily available some information is on the Internet. Even then, they wouldn’t know how to use data to make decisions. They might defend themselves by saying available information is not actionable.


Some people just want to control everything they can. They might already have a preferred decision and don’t want any information that might call their hunch into question. Or, they may not know what they want to do but they don’t want any information that might limit their options or prevent them from controlling the debate. They may be control freaks. They may be subject to biases attributable to illusory superiority like the Dunning–Kruger effect.

How Can Reluctant Decision-Makers be Encouraged to be Data-Driven?

If you’re in an organization that is making the journey to being data-driven, changing the culture of decision-making will be your most formidable obstacle. The easiest problem to fix is ignorance. Training, encouragement, coaching and mentoring, and peer support combine to enlighten. The fears and inherent natures of some decision-makers are harder to address. Again, encouragement and personal support will encourage change. Control freaks are the most problematic. They are intransigent, as any of their exes will affirm. Don’t make them a focus of your efforts to change your decision-making culture. You’ll be disappointed.

Here are some actions you can take to support the adjustment.

If you work in upper management, the most important thing you can do is communicate your expectations and lead by example. Recognize that not every decision must be based on data. Sometimes data is just the starting point for a visionary leader’s intuition. Make funds available for actions that will support the initiative, like training in data analysis and decision-making. Require managers to at least bring data with them to the table when arguing their points. Challenge speculation. Help them through the process of incorporating information into their decision-making process by coaching and mentoring. Finally, recognize and reward staff members who take the lead in using data.

If you work in middle management, you’re probably the primary focus of the cultural change your company is trying to make. The most important thing you can do is accept the inevitability of the change and recognize you don’t have to do it all yourself. Communicate to your staff what things they can do to support the new decision-making strategy, like collecting and analyzing data. Approve funds for staff training and data collection/analysis activities. And again, recognize and reward staff members who take the lead in providing you with data.

If you work as a member of the staff, the most important thing you can do is collaborate with your co-workers in collecting and analyzing data. Help each other. Congratulate those who provide good examples of data collection, analysis, and reporting. And of course, take as much training as you can and use your initiative to interject data into activities you are working on.

Be Patient

Changing an organization’s culture from intuition-based decision-making to data-driven decision-making is a long evolutionary process. It won’t happen by the end of next quarter, or next fiscal year, or for that matter, maybe ever. You won’t necessarily even know when you’ve achieved the goal. But, if you start to see that decisions work out better and are more defensible than in the past, you’re probably there. That’ll make everyone in the organization happier.


There’s a reason analysis begins with anal. Always evaluate the validity of your assumptions, your data scrubbing, and your interpretations. If you don’t, someone else will.
