Share Your Career with Students

I got a special request from my daughter in Hawaii that I hope you will read.

Aloha. I teach 5th grade special education in a resource room setting. My students are currently researching careers they are interested in as part of our expository writing unit. I’d love to have guest speakers come in and talk about jobs, but that’s tough to arrange, especially since there is so much confidentiality involved with the setting I’m in. Instead, I’d love to share letters written to them from people in different careers. My students are researching careers such as veterinarian, robotic engineer, biologist, Navy, Air Force, musician, fashion designer, and teacher, but I’d love a variety of careers to share with them.

If you are willing to type up a message to them, please include the following information:

  1. Introduce yourself and your career.
  2. Explain the type of education/training you went through. You could mention what obstacles you encountered and how you overcame them (cost of school, a difficult class, etc.).
  3. Explain how the use of reading, writing, and math factors into your job and/or daily life.
  4. Close with what you enjoy about your career and some words of wisdom (optional).

Send your message to me at mirandameow87@gmail.com. Include a picture you don’t mind me showing to my students when I read it (optional). THANKS! You will receive my eternal gratitude!

If you think you might want to share your career but are looking for ideas to get started, here’s what I wrote:

Aloha, my name is Charlie Kufs and I work as a statistician for the United States government. My job is to take information, which we call data, and figure out how to use it to help the government run better. Statisticians also work for many other places like schools and companies. Most of the data statisticians work with are numbers that describe the things you buy in stores, the medicines you might take, the sports you play, and many more things. To be a statistician you have to love working with numbers.

To become a statistician, I had to complete elementary school, then four years of high school, and four years of college. I also studied two more years after college to learn more about math and statistics. As much as I loved learning about how to work with numbers, I also had to learn about reading and writing. Reading is very important to me because that’s how I learn new things. Even after going to school for almost twenty years, there are still many things to learn. I learn new things by reading books and articles on the Internet about statistics. Writing is just as important because I have to explain the work I’ve done to people who aren’t statisticians and don’t like numbers as much as I do. I’ve even written a book to help people work with statistics.

I really like working with numbers. Using math and statistics, I can solve very difficult problems at work and also have fun at home studying data about how I spend money, what foods I eat, the exercise I do, and my favorite sports teams. If you like math and working with numbers, you might like to be a statistician when you get older.


How to Tell if Correlation Implies Causation

You’ve probably heard the admonition:

Correlation Does Not Imply Causation.

Everyone agrees that correlation is not the same as causation. However, those two words — correlation and causation — have generated quite a bit of discussion.

Why Causality Matters

No one gets perturbed if you say two conditions or events are correlated, but even suggest that causation is possible and you’ll get the clichéd admonition, perhaps with even harsher criticism. It’s not easy to prove causality, though, so there must be a reason for putting in the effort. For example, if you can figure out what causes a condition or event, you can:

  • Promote the relationship to reap benefits, such as between agricultural methods and crop production or pharmaceuticals and recovery from illnesses.
  • Prevent the cause to avoid harmful consequences, such as airline crashes and manufacturing defects.
  • Prepare for unavoidable harmful consequences, such as natural disasters, like floods.
  • Prosecute the perpetrator of the cause, as in law, or lay blame, as in politics.
  • Pontificate about what might happen in the future if the same relationship occurs, such as in economics.
  • Probe for knowledge based on nothing more than curiosity, such as how cats purr.

So how can you tell if correlation does in fact imply causation?

http://xkcd.com/552/

Criteria for Causality

Sometimes it’s next to impossible to convince skeptics of a causal relationship. Sometimes it’s even tough to convince your supporters. Developing criteria for causality has been a topic of concern in medicine for centuries. Several sets of criteria have been proffered over those years, the most widely cited of which are the criteria described in 1965 by Austin Bradford Hill, a British medical statistician. Hill’s criteria for causation specify the minimal conditions necessary to accept the likelihood of a causal relationship between two measures as:

  1. Strength: A relationship is more likely to be causal if the correlation coefficient is large and statistically significant.
  2. Consistency: A relationship is more likely to be causal if it can be replicated.
  3. Specificity: A relationship is more likely to be causal if there is no other likely explanation.
  4. Temporality: A relationship is more likely to be causal if the effect always occurs after the cause.
  5. Gradient: A relationship is more likely to be causal if a greater exposure to the suspected cause leads to a greater effect.
  6. Plausibility: A relationship is more likely to be causal if there is a plausible mechanism between the cause and the effect.
  7. Coherence: A relationship is more likely to be causal if it is compatible with related facts and theories.
  8. Experiment: A relationship is more likely to be causal if it can be verified experimentally.
  9. Analogy: A relationship is more likely to be causal if there are proven relationships between similar causes and effects.

These criteria are sound principles for establishing whether some condition or event causes another condition or event. No individual criterion is foolproof, however. That’s why it’s important to meet as many of the criteria as possible. Still, sometimes causality is unprovable.

Three Steps to Decide if Correlation Implies Causation

Hill’s criteria can be thought of as aspects of critical thinking, as considerations in the scientific method, or as a model for deciding whether a relationship involves causation. Not all of the criteria have to be met to suggest causality, and some may not even be possible to meet in every case. The important point is to consider the criteria in a careful and unbiased process.

Step 1 — Check the Metrics

The admonition that correlation does not imply causation is used to remind everyone that a correlation coefficient may actually be characterizing a non-causal influence or association rather than a causal relationship. A large correlation coefficient does not necessarily indicate that a relationship is causal. On the other hand, saying that correlation is a necessary but not sufficient condition for causality (in other words, that causation cannot occur without correlation) is also not necessarily true. There are quite a few reasons for a lack of correlation.

So, before you get too excited about some causal relationship, make sure the correlation is statistically legitimate. You can’t assess the relationship’s gradient (i.e., the sign of the correlation coefficient) and strength (i.e., the value of the correlation coefficient) if the correlation is erroneous. A short sketch after the list shows a few of these checks in practice. Make sure to:

  • Use metrics (variables) that are appropriate for quantifying the relationship. For example, don’t use an index that is a ratio of the other metric in the relationship.
  • Use an appropriate correlation coefficient based on the scales of the relationship metrics.
  • Confirm that the samples are representative of the population being analyzed and that the relationship is linear (or you are using non-linear methods for analysis).
  • Make sure that there are no outliers or excessive uncontrolled variance.
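
To see what these checks look like in practice, here is a minimal sketch in Python using NumPy and SciPy. The data and variable names are hypothetical; comparing Pearson’s r with Spearman’s rho is one quick way to flag non-linearity or outliers before you trust a correlation.

```python
import numpy as np
from scipy import stats

# Hypothetical data for illustration only.
rng = np.random.default_rng(42)
x = rng.normal(50, 10, size=100)           # one metric of the relationship
y = 2.0 * x + rng.normal(0, 15, size=100)  # the other metric, linear plus noise

# Pearson's r assumes interval/ratio scales and a linear relationship.
r, p = stats.pearsonr(x, y)

# Spearman's rho assumes only monotonicity; a large gap between the two
# coefficients can flag non-linearity or outliers.
rho, p_s = stats.spearmanr(x, y)

print(f"Pearson r = {r:.2f} (p = {p:.4f})")
print(f"Spearman rho = {rho:.2f} (p = {p_s:.4f})")
```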

The gradient of most causal relationships is positive. Inverse relationships will have a negative gradient. The strength of causal relationships could be almost anything; it depends on what you expect. If you don’t know what to expect, look at the square of the correlation coefficient, called the coefficient of determination, R-square, or R². R² is an estimate of the proportion of variance shared by two variables. It is commonly used to interpret the strength of the relationship between variables. Be aware, though, that even causal relationships may show smaller than expected correlations.
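
The arithmetic is worth seeing once: correlation coefficients that sound strong share less variance than you might expect.

```python
# r = 0.8 sounds strong, but the shared variance (R-squared) is only 0.64,
# and r = 0.3 corresponds to less than 10% shared variance.
for r in (0.8, 0.5, 0.3):
    print(f"r = {r:.1f}  ->  R-squared = {r * r:.2f}")
```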

Step 2 — Explain the Relationship

If you are comfortable with the gradient and strength of the correlation coefficient, the next step is to define the pattern of the relationship. The correlation may not be of any help in exploring the pattern of the relationship because data plots for different patterns can look similar. Nonetheless, there’s no sense expending more effort if the correlation is in any manner suspect.

First, check for temporality in the data. If the cause doesn’t always precede the effect, then either the relationship is a feedback relationship or it is not causal. If cause and effect are not measured simultaneously, temporality may be obscured.
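
If you have time-ordered measurements, one simple probe of temporality is to correlate the candidate effect against lagged copies of the candidate cause. Here is a minimal sketch on synthetic data; a correlation that peaks at a positive lag is consistent with, though it does not prove, the cause preceding the effect.

```python
import numpy as np

rng = np.random.default_rng(0)
cause = rng.normal(size=200)
# Hypothetical effect: driven by the cause two time steps earlier, plus noise.
effect = np.roll(cause, 2) + rng.normal(scale=0.5, size=200)
effect[:2] = rng.normal(scale=0.5, size=2)  # overwrite the wrapped-around values

for lag in range(5):
    r = np.corrcoef(cause[:len(cause) - lag], effect[lag:])[0, 1]
    print(f"lag {lag}: r = {r:.2f}")  # expect the peak at lag = 2
```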

Next, try to determine what pattern of relationship is likely. This is not easy but it’s also not a permanent determination. If you are uncertain, start with either a direct or an inverse relationship, which can be determined from data plots. Then as you study the relationship further, you can assess whether the relationship may be based on feedback, common-source, mediation, stimulation, suppression, threshold, or multiple complexities.

Consider your relationship in terms of Hill’s criteria of Plausibility, Coherence, Analogy, and Specificity. Plausibility and Coherence are perhaps the easiest of the criteria to meet because it is all too easy to rationalize explanations for observed phenomena. They may also rely on related facts and theories that can change over time. Analogy is a bit more difficult to meet but not impossible for a fertile mind. However, analogous relationships may appear to be similar but in fact be attributable to very different underlying mechanisms. Narrow-minded people rely on Specificity in their arguments. Then again, relationships may have no other likely explanation because a phenomenon is not well understood.

Step 3 — Validate the Explanation

Perhaps the most important of Hill’s criteria are Experiment and Consistency. If you’re serious about proving there is a causal relationship between two conditions or events, you have to verify the relationship using an effective research design. Such an experiment usually requires a model of the relationship, a testable hypothesis based on the model, incorporation of variance control measures, collection of suitable metrics for the relationship, and an appropriate analysis. An appropriate analysis may be statistical (using multiple samples from a well-defined population and analyses like ANOVA to assess effects) or deterministic (using a representative example of a component of the relationship to demonstrate the effect). If the experiment verifies the relationship, especially if it can be consistently replicated by independent parties, there will be solid proof of causality and any spurious relationships will be disproved. The two problems are that this validation can involve considerable effort and that not every relationship can be verified experimentally.
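
For the statistical route, here is a minimal sketch of a one-way ANOVA on synthetic data. The treatment, control, and effect size are assumptions for illustration; a small p-value supports an effect, but only the experimental design lets you read that effect as causal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control   = rng.normal(loc=10.0, scale=2.0, size=30)  # hypothetical control group
treatment = rng.normal(loc=12.0, scale=2.0, size=30)  # hypothetical treated group

f_stat, p_value = stats.f_oneway(control, treatment)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```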

There are two types of research studies — experimental and observational. In an experimental study, researchers decide what conditions the subjects (the entities being experimented on) will be exposed to and then measure variables of interest. In an observational study, researchers observe subjects that already possess the conditions being assessed and then measure variables of interest. Both types of studies have their challenges. Researchers may not be able to manipulate the conditions under study in an experiment because of cost, logistical, or ethical issues. Observational studies may be subject to confounding, conditions that interfere with the interpretation of results. Consequently, verifying that a relationship is causal is often easier said than done.

Implying Causality

Hill’s criteria were developed for medicine. Medical research may start with anecdotal observations and progress to statistical observations of occurrence. Add demographics, and patterns of occurrence may become apparent. Then the patterns are assessed to look for coherent, plausible explanations and analogues. Some medical hypotheses can be tested and analyzed statistically; pharmaceutical effectiveness is an example. Psychological and agricultural relationships can often be tested. Other relationships can’t be manipulated, so they must be analyzed based on observations; epidemiological studies are examples. Without being able to rely on the Experiment and Consistency criteria, causality can only be argued using the weaker Plausibility, Coherence, Analogy, and Specificity criteria. This is also true of natural phenomena, like landslides and earthquakes. Some conditions are unique, or the underlying knowledge base is insufficient to explain the phenomenon convincingly, so even the Plausibility, Coherence, Analogy, and Specificity criteria aren’t useful. Economic and political relationships often fall into this category.

So, if you hear someone claim that a relationship is causal, consider how Hill’s criteria might apply before you believe the assertion.


Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at amazon.com,  barnesandnoble.com, or other online booksellers.


2014 in review

The WordPress.com stats helper monkeys prepared a 2014 annual report for this blog.

Here’s an excerpt:

The Louvre Museum has 8.5 million visitors per year. This blog was viewed about 83,000 times in 2014. If it were an exhibit at the Louvre Museum, it would take about 4 days for that many people to see it.

Click here to see the complete report.


Reading Stats with Cats


Todd P. Chang bonding with his cat while reading Stats with Cats.


Types and Patterns of Data Relationships

Types of Data Relationships


Why You Don’t Always Get the Correlation You Expect

If you’ve ever taken a statistics class on correlation, you’ve probably come to expect that a large value for a correlation coefficient, either positive or negative, means that there is a noteworthy relationship between two phenomena. This is not always the case. Furthermore, a small correlation may not always mean there is no relationship between the phenomena.

Correlation Size Doesn’t Always Matter


That’s not what I expected.

A small correlation coefficient does not necessarily imply a lack of a relationship any more than a large correlation coefficient implies a strong relationship. It depends on the type of relationship and the data used to characterize it. This is important because analysts often devote all their time investigating large correlations while disregarding relationships that have small correlation coefficients, especially if they don’t have an expectation for what a good value might actually be.

Statistical Reasons for Small Correlations

There are several statistical reasons for unexpected correlations; the sketch after this list illustrates two of them:

  • Non-linear relationships — Correlation coefficients assume that the relationship between two variables is linear. Non-linear relationships result in smaller than expected correlation coefficients. A scatterplot of the variables can usually confirm this problem, which can often be corrected with a data transformation.
  • Outliers — The strength of a correlation coefficient can be deflated or inflated by outliers. A scatterplot can usually confirm the presence of outliers, although deciding how to treat them may be more problematic.
  • Excessive uncontrolled variance — Sometimes, data points that appear to be outliers may just be instances of excess variance. Excess variance is probably the most common cause of smaller than expected correlations. Usually, excess variance is the result of a lack of adequate control in data generation.
  • Inappropriate sample — Data points that look like outliers or excess variance may be sham samples. Sham samples are not representative of the population being analyzed and so confound any calculated statistics. Samples may also represent trends hidden in subpopulations, perhaps even resulting in Simpson’s paradox.
  • Inefficient metrics — Variables used in the analysis may not be appropriate for investigating the phenomenon in question. As a consequence, the strength of a relationship will be smaller than expected.
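
Here is a minimal sketch, on synthetic data, of two of these effects: a strong but non-linear relationship that produces a near-zero correlation coefficient, and a single outlier that inflates the correlation between two otherwise unrelated variables.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Non-linear relationship: y is almost completely determined by x, yet both
# Pearson's r and Spearman's rho are near zero because the curve is not
# monotonic. A scatterplot reveals it immediately.
x = np.linspace(-3, 3, 100)
y = x ** 2 + rng.normal(scale=0.5, size=100)
print("quadratic: Pearson r =", round(stats.pearsonr(x, y)[0], 2))
print("quadratic: Spearman rho =", round(stats.spearmanr(x, y)[0], 2))

# Outlier: two unrelated variables, plus one extreme point.
a = rng.normal(size=30)
b = rng.normal(size=30)
print("no outlier:   Pearson r =", round(stats.pearsonr(a, b)[0], 2))
a2, b2 = np.append(a, 10.0), np.append(b, 10.0)
print("with outlier: Pearson r =", round(stats.pearsonr(a2, b2)[0], 2))  # inflated
```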

That’s why any evaluation of a correlation should include looking at the coefficient’s sign and size, a scatterplot of the relationship, and a statistical test of significance. There’s a lot more information in data relationships than can be expressed by a single statistic.

If there are no statistical issues with a dataset, it’s important to also consider what types of relationships between the variables are possible.

Relationship Reasons for Small Correlations

Types of Relationships

When analysts see a large correlation coefficient, they begin speculating about possible reasons. They’ll naturally gravitate toward their initial hypothesis (or preconceived notion), which led them to investigate the data relationship in the first place. Because hypotheses are commonly about causation, they often begin with this least likely type of relationship. Besides causation, relationships can also reflect influence or association:

  • Causes — A cause is a condition or event that directly triggers, initiates, makes happen, or brings into being another condition or event. A cause is a sine qua non; without a cause a consequent will not occur. Causes are directional. A cause must precede its consequent.
  • Influences — An influence is a condition or event that changes the manifestation of an existing condition or event. Influences can be direct or mediated by a separate condition or event. Influences may exist at any time before or after the influenced condition or event. Influences may be unidirectional or bidirectional.

  • Associations — Associations are two conditions or events that appear to change in a related way. Any two variables that change in a similar way will appear to be associated. Thus, associations can be spurious or real. Associations may exist at any time before or after the associated condition or event. Unlike causes and influences, associated variables have no effect on each other and may not exist in different populations or in the same population at different times or places.

Associations are commonplace. Most observed correlations are probably just associations. Influences and causes are less common but, unlike associations, they can be supported by the science or other principles on which the data are based. The strength of a correlation coefficient is not related to the type of relationship. Causes, influences, and associations can all have strong as well as weak correlations depending on the efficiency of the variables being correlated and the pattern of the relationship.

Patterns of Relationships

Most discussions of correlation and causation focus on the simple direct relationship that one event or condition, designated as A, is related to a second event or condition, designated as B. For example, gravitational forces from the Moon and Sun cause ocean tides on the Earth. A causes B but B does not cause A. Another direct relationship is that age influences height and weight. Age doesn’t cause height and weight but we tend to grow larger as we age so A influences B.

Direct relationships are easy to understand and, if there are no statistical obfuscations, should exhibit a high degree of correlation. In practice, though, not every relationship is direct or simple. Here are eight other patterns:

  • Feedback Relationship: A and B are linked in a loop; A causes or influences B, which then causes or influences A, and so on. For example, poor performance in school or at work (A) creates stress (B), which degrades performance further (A), leading to more stress (B), and so on.

  • Common Relationship: A third event or condition, C, causes or influences both A and B. For example, hot weather (C) causes people to wear shorts (A) and drink cool beverages (B). Wearing shorts doesn’t cause or influence beverage consumption, although the two are associated by their common cause (simulated in the sketch after this list). Another example is the influence obesity has on susceptibility to a variety of health maladies.
  • Mediated Relationship: A causes or influences C and C causes or influences B so that it appears that A causes B. For example, rainy weather (A) often induces people to go to their local shopping mall for something to do (C). While there, they shop, eat lunch, and go to the movies and other entertainment venues thus providing the mall with increased revenues (B). In contrast, snowstorms (A) often induce people to stay at home (C) thus decreasing mall revenues (B). Bad weather doesn’t cause or influence mall revenues directly but does influence whether people visit the mall.
  • Stimulated Relationship: A causes or influences B but only in the presence of C. There are many examples of this pattern, such as metabolic and chemical reactions involving enzymes or catalysts.
  • Suppressed Relationship: A causes or influences B but not in the presence of C. For example, pathogens (A) cause infections (B) but not in the presence of antibiotics (C). Some drugs (A) cause side effects (B) only in certain at-risk populations (C).
  • Inverse Relationship: The absence of A causes or influences B. For example, vitamin deficiencies (A) cause or influence a wide variety of symptoms (B).

  • Threshold Relationship: A causes or influences B only when A is above a certain level. For example, rain (A) causes flooding (B) only when the volume or intensity is very high.
  • Complex Relationship: Many factors or events (A) contribute to the cause or influence of B. Numerous environmental processes fit this pattern. For example, a variety of atmospheric and astronomical factors (A) contribute to influencing climate change (B).
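
Here is a minimal sketch of the Common Relationship pattern on synthetic data: a hidden driver C induces a correlation between A and B even though neither affects the other, and the correlation collapses once C is controlled for.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
c = rng.normal(size=500)                       # e.g., daily temperature
a = 0.8 * c + rng.normal(scale=0.6, size=500)  # e.g., shorts worn
b = 0.8 * c + rng.normal(scale=0.6, size=500)  # e.g., cool drinks sold

print("r(A, B) =", round(stats.pearsonr(a, b)[0], 2))  # sizable

# Partial correlation of A and B controlling for C: correlate the residuals
# after regressing each variable on C. It should collapse toward zero.
resid_a = a - np.polyval(np.polyfit(c, a, 1), c)
resid_b = b - np.polyval(np.polyfit(c, b, 1), c)
print("r(A, B | C) =", round(stats.pearsonr(resid_a, resid_b)[0], 2))
```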

Spurious Relationships

There are also a variety of spurious relationships in which A appears to cause or influence B but does not. Often the reason is that the relationship is based on anecdotal evidence that is not valid more generally. Sometimes spurious relationships may be some other kind of relationship that isn’t understood. Here are five other reasons why spurious relationships are so common.

  • Misunderstood relationships: The science behind a relationship may not be understood correctly. For example, doctors used to think that spicy foods and stress caused ulcers. Now, there is greater recognition of the role of bacterial infection. Likewise, hormones have been found to be the leading cause of acne rather than diet (i.e., consumption of chocolate and fried foods).
  • Misinterpreted statistics: There are many examples of statistical relationships being interpreted incorrectly. For example, the sizes of homeless populations appear to influence crime. Then again, so do the numbers of museums and the availability of public transportation. All of these factors are associated with urban areas.
  • Misinterpreted observations: Incorrect reasons are attached to real observations. Many old wives’ tales are based on credible observations. For example, the notion that hair and nails continue to grow after death is an incorrect explanation for a legitimate observation.

  • Urban legends: Some urban legends have a basis in truth and some are pure fabrications, but they all involve spurious relationships. For example, in South Korea, it is believed that sleeping with a fan running in a closed room will result in death.
  • Biased Assertions: Some spurious relationships are not based on any evidence but instead are claimed in an attempt to persuade others of their validity. For example, the claim that masturbation makes you have hairy palms is not only ludicrous but also easily refutable. Likewise, almost any advertisement in support of a candidate in an election contains some sort of bias, such as cherry picking.

Correlation and Causation — Not Always What You Expect

Calculated correlation coefficients are innocent bystanders in debates over causation, influence, and association. With all the statistical and relational nuances that can affect their interpretation, it’s a wonder that they are so often used alone as determinants of causality and influence. As with all statistics, correlation coefficients need to be interpreted in the context provided by other types of information. Certainly, correlation does not imply causation, but according to Edward Tufte and others, sometimes it’s a good hint.

Well, that’s not what I expected.

Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at amazon.com,  barnesandnoble.com, or other online booksellers.


O.U..T…L….I……E……..R………………..S

Occasionally, a dataset may contain a value that is far greater (or less) than the other values, or that doesn’t display the same characteristics as the rest. This anomalous value is termed an influential observation. If the influential observation is not representative of the population being sampled, it is called an outlier.

Causes

Influential observations and outliers occur for a variety of reasons. Some are straightforward data generation or reporting errors; lab results that are off by a factor of ten are often identified as outliers. Business data have their own sources of outliers: reporting deadlines may be missed, weather or construction may prevent customers from shopping, and there may be one-time corrections for past errors.

I’m being an outlier.

Sometimes, there are deterministic influences that skew some measurements. For example, aberrant measurements may be caused by instrument error or miscalibration. Some outliers aren’t errors but instead are the result of inherent variability or a natural cause. So, if you run into outliers, try to figure out why they exist. They may mean nothing, in which case you can delete them from the analysis, or they may be critical to your interpretation of a dataset. You’ll probably find that, most of the time, the causes of outliers will be unknown.

Identification

Influential observations and outliers are generally not difficult to detect. Sorting and listing the data will often reveal questionable values, though the best way to identify a potential outlier is by graphing the data. Histograms, box plots, probability plots, time-series plots, or scatter plots of the data will usually reveal any aberrant values.

Graphs are particularly effective in identifying three patterns of outliers; the sketch after the list simulates the effect of the first two:

  • Cross-trend Outliers. Cross-trend outliers lie a substantial distance away from the rest of the data in positions that do not fall on the trend of the data. As a consequence, they can substantially reduce R², inflate variances, and change regression model coefficients. They are usually easy to identify in graphs; however, their cause is usually difficult to ascertain.
  • In-Trend Outliers. In-trend outliers lie a substantial distance away from the rest of the data in positions that do fall on the trend of the data. Like cross-trend outliers, they are usually easy to identify in graphs. They can substantially inflate R² but not change regression equations, which leads some analysts to include the outlier despite evidence of questionable validity. Their cause is often easy to ascertain because of the unique conditions the outlier represents.
  • Fringe Outliers. Fringe outliers lie a relatively small distance away from the rest of the data in positions that parallel the trend of the data. They are not always easy to identify in graphs. They can deflate R² and change regression equations. Their cause is usually difficult to ascertain but may be the result of some bias in the data collection.

Three Patterns of Outliers.
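
Here is a minimal sketch, on synthetic data, of how the first two patterns affect R²: a cross-trend outlier deflates it, while an in-trend outlier inflates it without changing the fitted line much.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = np.linspace(0, 10, 30)
y = 3.0 * x + rng.normal(scale=4.0, size=30)  # hypothetical linear trend

def r_squared(x, y):
    return stats.linregress(x, y).rvalue ** 2

print("clean:       R2 =", round(r_squared(x, y), 2))
# Cross-trend outlier: far off the line (x = 5 should give y near 15).
print("cross-trend: R2 =", round(r_squared(np.append(x, 5.0), np.append(y, 90.0)), 2))
# In-trend outlier: far from the data but right on the extended trend.
print("in-trend:    R2 =", round(r_squared(np.append(x, 30.0), np.append(y, 90.0)), 2))
```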

There are many statistical tests for identifying outliers. Outlier tests follow one of several strategies. Deviation/spread tests are like simple t-tests; they are calculated as the difference between the outlier value and the mean (or other measure of central tendency), divided by the standard deviation (or other measure of data dispersion). Excess/spread tests, also called Dixon-type tests, are calculated as the difference between the outlier and the next closest value (or another observation in the dataset), divided by the dataset range (or other dispersion statistic). Some statisticians prefer this type of approach because it is not necessary to have good estimates of the mean and variance. Other outlier tests examine sums-of-squares, skewness, and location relative to the center of the dataset.
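
Here is a minimal sketch of the two test strategies on a hypothetical dataset. Real tests, such as Grubbs’ test or Dixon’s Q, compare these statistics against tabulated critical values; the cutoffs mentioned in the comments are illustrative only.

```python
import numpy as np

data = np.array([9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 14.7])  # hypothetical measurements
suspect = data.max()

# Deviation/spread: distance from the mean in standard-deviation units.
z = (suspect - data.mean()) / data.std(ddof=1)

# Excess/spread (Dixon-type): gap to the nearest value over the range.
s = np.sort(data)
q = (s[-1] - s[-2]) / (s[-1] - s[0])

print(f"deviation/spread statistic = {z:.2f}")  # flag if well above 2 or so
print(f"Dixon-type Q statistic     = {q:.2f}")  # compare against a Q table
```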

The truth of the matter is that outlier tests are often superfluous. If you can see it in a graph, the test will usually confirm what you see. Tests are often convenient for convincing reviewers that what you think is an outlier really is one. If you can’t see it in a graph but an outlier test is significant, it may be an outlier … or not. The real issue, in most cases, is what to do when you find a value you think is an outlier.

Treatment

There are five options for treating outliers; the sketch after this list compares a few of them:

  • Inclusion — Inclusion involves keeping the outlier in the dataset. This approach would make sense to use if you’re looking to assess the effects of the anomalies. Sometimes you’re forced to take this approach because an unenlightened reviewer thinks you are trying to “pull something.” In cases like this, it might be beneficial to run your analyses both with and without the outlier so that everyone can understand its effect.
  • Correction — Correction involves changing the outlier to the correct value. This doesn’t happen often. You might find an outlier to be an error but you can’t correct it because you don’t know what the true value should be. In that case, deletion is probably a better option. If you’re lucky, though, you might find an outlier to be an error and be able to correct it.
  • Replacement — Replacement involves changing the outlier to a contingency value. This approach is like the replacement options for missing data. Using the mean or median in place of an outlier will bias the dataset, but not nearly as much as the outlier. This is often the best approach to use for complex statistical calculations.
  • Accommodation — Accommodation involves keeping the outlier in the dataset but using “robust” statistical procedures that are less sensitive to outliers. Nonparametric statistics are often used for this purpose.
  • Deletion — Deletion is simply removing the outlier from the dataset. This approach would make sense if you’re looking to assess general trends. Once again, it might be beneficial to run your analyses both with and without the outlier.
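
Here is a minimal sketch, on synthetic data with one planted outlier, comparing inclusion, replacement with the median, and accommodation with a rank-based statistic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
x = rng.normal(size=40)
y = 0.5 * x + rng.normal(scale=0.8, size=40)
x[0], y[0] = 6.0, -6.0  # plant one cross-trend outlier

# Inclusion: the outlier drags Pearson's r around.
print("inclusion:     r =", round(stats.pearsonr(x, y)[0], 2))

# Replacement: swap the outlying values for the medians of the rest.
x_rep, y_rep = x.copy(), y.copy()
x_rep[0], y_rep[0] = np.median(x[1:]), np.median(y[1:])
print("replacement:   r =", round(stats.pearsonr(x_rep, y_rep)[0], 2))

# Accommodation: a rank-based coefficient is far less sensitive to the outlier.
print("accommodation: rho =", round(stats.spearmanr(x, y)[0], 2))
```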

The option you select should depend on whether you believe the aberrant observation is representative of the population you are investigating. Your objective and the type of analysis you plan to do will also be considerations in this decision.

What Should You Do?

Am I an Outlier?

If a statistical graphic or an outlier test suggests that a data value may be an influential observation or an outlier, follow these steps:

  1. Examine a variety of graphical depictions of all the data points including box plots, probability plots, bivariate plots, time-series plots, and contour maps to assess possible reasons for the aberrant observation.
  2. Review notes and metadata concerning the sample or measurement to determine if any irregularities in the sampling or data collection processes may be responsible for the discordant value.
  3. Review documentation related to data quality for the sample or measurement to determine if any irregularities in the collection, packaging, transport, analysis, measurement, or recording processes may be responsible for the discordant value.
  4. If any information indicates that the sample is probably not representative of the population being sampled, consider the sample or measurement to be an outlier and replace or delete it from further analysis. If possible, collect a new sample or measurement.
  5. If any information indicates that the sample should be representative of the population, review results for other measurements from the same source to determine if other results support the legitimacy of the suspected outlier. Also, review results for the same variable that may have been generated during previous sampling efforts.
  6. If prior results for the variable or results for other variables are consistent with results for the suspect sample or measurement, retain the value and evaluate it as an influential observation.
  7. If prior results for the parameter or results for other parameters are not consistent with results for the suspect sample or measurement, consider the value to be an outlier and replace or delete it from further analysis. If possible, collect a new sample or measurement.

This procedure works best if both data analysts and reviewers can somehow be involved in the examination process. Be sure to document all findings and decisions during this process.

If you decide to retain the outlier, consider using a nonparametric alternative to the procedure you planned to conduct. If for some reason this is not feasible, consider analyzing the dataset twice, once with the outlier and once without the outlier. Caveat your conclusions on the basis of the outlier and recommend collecting additional samples or measurements to assess its validity. Consultants always recommend additional work anyway, so this should come as no surprise to either clients or reviewers.

If you are assessing data trends, you will probably want to delete or replace any outliers. Even a single outlier can mask significant trends. Be aware, however, that this action could bias predicted values and the prediction error if the cause of the outlier is natural.

Given the choice to replace or delete an outlier, consider the number of samples you have and the importance of the variable the outlier is a measure of. Remember, if you delete the outlier you will end up having to delete either the sample or the variable to conduct your statistical analysis. If the variable is important and you don’t have many samples, consider replacing the outlier.

There is also a psychological component to consider when replacing or deleting outliers. Scientists and engineers are taught that it is unethical to delete or change data that might not fit with their expectations. Outliers challenge that notion. Statisticians and reviewers become highly suspicious of each other when the need to judge an outlier arises. Consequently, it is sensible to have a procedure for evaluating outliers in place that everyone agrees to before the need arises. Even so, somebody will criticize you no matter what you do. It’s the way things work.

The idea for this title came from: Beckman, R.J. and Cook, R.D. (1983). Outlier. . ..s, Technometrics, 25, 119-163. If you don’t get the joke, don’t worry about it.

Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at amazon.com,  barnesandnoble.com, or other online booksellers.
