Occasionally, a dataset may contain a value that is far greater (or less) than the other values, or that doesn’t display the same characteristics as they do. This anomalous value is termed an influential observation. If the influential observation is not representative of the population being sampled, it is called an outlier.
Influential observations and outliers occur for a variety of reasons. Some are straightforward data generation or reporting errors; lab results that are off by a factor of ten, for example, are often identified as outliers. In business data, reporting deadlines may be missed, weather or construction may prevent customers from shopping, and there may be one-time corrections for past errors.
Sometimes, there are deterministic influences that skew some measurements. For example, aberrant measurements may be caused by instrument error or miscalibration. Some outliers aren’t errors but instead are the result of inherent variability or a natural cause. So, if you run into outliers, try to figure out why they exist. They may mean nothing, in which case you can delete them from the analysis, or they may be critical to your interpretation of a dataset. You’ll probably find that, most of the time, the causes of outliers will be unknown.
Influential observations and outliers are generally not difficult to detect. Sorting and listing the data will often reveal questionable values, though the best way to identify a potential outlier is by graphing the data. Histograms, box plots, probability plots, time-series plots, or scatter plots of the data will usually reveal any aberrant values.
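If you want a quick screen before you draw anything, the whisker rule that most box plots use can be computed directly. The sketch below assumes the conventional 1.5 × IQR fences, which are not something this post prescribes, and the readings are hypothetical values invented for illustration. Treat whatever it flags as a candidate to look at, not a verdict.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_suspects(values, k=1.5):
    """List the values that fall outside the fences (candidates only)."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if v < lower or v > upper]

# Hypothetical lab readings; 102.0 looks like it is off by a factor of ten.
readings = [9.8, 10.1, 10.4, 9.9, 10.0, 102.0, 10.3, 9.7]

print(sorted(readings))         # sorting alone already exposes the odd value
print(flag_suspects(readings))  # values beyond the box plot fences
```

A flagged value still has to be examined in a graph and traced back to its source before you decide what it is.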
Graphs are particularly effective in identifying three patterns of outliers:
- Cross-Trend Outliers. Cross-trend outliers lie a substantial distance away from the rest of the data in positions that do not fall on the trend of the data. As a consequence, they can substantially reduce R², inflate variances, and change regression model coefficients. They are usually easy to identify in graphs; their cause, however, is usually difficult to ascertain.
- In-Trend Outliers. In-trend outliers lie a substantial distance away from the rest of the data in positions that do fall on the trend of the data. Like cross-trend outliers, they are usually easy to identify in graphs. They can substantially inflate R² but not change regression equations, which leads some analysts to include the outlier despite evidence of questionable validity. Their cause is often easy to ascertain because of the unique conditions the outlier represents.
- Fringe Outliers. Fringe outliers lie a relatively small distance away from the rest of the data in positions that parallel the trend of the data. They are not always easy to identify in graphs. They can deflate R² and change regression equations. Their cause is usually difficult to ascertain but may be the result of some bias in the data collection.
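To see how differently the first two patterns behave, here is a minimal sketch that fits a straight line to the same data three ways: as is, with one in-trend point added, and with one cross-trend point added. The data, the added points, and the fit_line helper are all hypothetical, made up for illustration, and the fit is ordinary least squares with one predictor. Fringe outliers are left out because their effect depends heavily on where they sit.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical data scattered around the line y = 2x.
base_x = [1, 2, 3, 4, 5, 6, 7, 8]
base_y = [2.5, 3.4, 6.6, 7.5, 10.4, 11.6, 14.5, 15.5]

# Each case adds one far-away point: on the trend, or well off it.
extra_points = {
    "none": None,
    "in-trend": (20, 40.0),
    "cross-trend": (20, 5.0),
}

results = {}
for name, point in extra_points.items():
    xs, ys = list(base_x), list(base_y)
    if point is not None:
        xs.append(point[0])
        ys.append(point[1])
    results[name] = fit_line(xs, ys)

for name, (slope, intercept, r_squared) in results.items():
    print(f"{name:12s} slope={slope:6.3f} intercept={intercept:7.3f} R2={r_squared:5.3f}")
```

With these made-up numbers, the in-trend point nudges R² upward and leaves the slope close to where it was, while the cross-trend point flattens the slope and drags R² down.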
There are many statistical tests for identifying outliers. Outlier tests follow one of several strategies. Deviation/spread tests are like simple t-tests. They are calculated as the difference between the outlier value and the mean (or other measure of central tendency), divided by the standard deviation (or other measure of data dispersion). Excess/spread tests, also called Dixon-type tests, are calculated as the difference between the outlier and the next closest value (or other observation in the dataset), divided by the dataset range (or other dispersion statistic). Some statisticians prefer this type of approach because it is not necessary to have good estimates of the mean and variance. Other outlier tests examine sums-of-squares, skewness, and location relative to the center of the dataset.
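The two ratios described above are simple enough to compute by hand. The sketch below is one reading of them: the deviation/spread statistic uses the mean and sample standard deviation of all the values, and the excess/spread statistic uses the gap to the nearest neighbor over the full range. The function names and the data are hypothetical. The code only produces the statistics; deciding whether one is significant means comparing it with critical values from published tables for your sample size, which are not reproduced here.

```python
import statistics

def deviation_spread(values, suspect):
    """(suspect - mean) / sample standard deviation, using all the values."""
    return (suspect - statistics.mean(values)) / statistics.stdev(values)

def excess_spread(values):
    """Dixon-type ratio: the larger end gap divided by the dataset range."""
    ordered = sorted(values)
    data_range = ordered[-1] - ordered[0]
    if data_range == 0:
        raise ValueError("all values are identical; the ratio is undefined")
    low_gap = ordered[1] - ordered[0]     # gap below the second-smallest value
    high_gap = ordered[-1] - ordered[-2]  # gap above the second-largest value
    return max(low_gap, high_gap) / data_range

# Hypothetical measurements with one suspiciously high value.
values = [9.7, 9.8, 9.9, 10.0, 10.1, 10.3, 10.4, 12.9]

z = deviation_spread(values, max(values))
q = excess_spread(values)
print(f"deviation/spread = {z:.2f}")
print(f"excess/spread    = {q:.2f}")
```

Note that the suspect value inflates both the mean and the standard deviation in the first statistic, which is part of why some statisticians prefer the second.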
The truth of the matter is that outlier tests are often superfluous. If you can see it in a graph, the test will usually confirm what you see. Tests are often convenient for convincing reviewers that what you think is an outlier really is one. If you can’t see it in a graph but an outlier test is significant, it may be an outlier … or not. The real issue, in most cases, is what you do if you find a value you think is an outlier.
There are five options for treating outliers:
- Inclusion — Inclusion involves keeping the outlier in the dataset. This approach would make sense to use if you’re looking to assess the effects of the anomalies. Sometimes you’re forced to take this approach because an unenlightened reviewer thinks you are trying to “pull something.” In cases like this, it might be beneficial to run your analyses both with and without the outlier so that everyone can understand its effect.
- Correction — Correction involves changing the outlier to the correct value. This doesn’t happen often. You might find an outlier to be an error but you can’t correct it because you don’t know what the true value should be. In that case, deletion is probably a better option. If you’re lucky, though, you might find an outlier to be an error and be able to correct it.
- Replacement — Replacement involves changing the outlier to a contingency value. This approach is like the replacement options for missing data. Using the mean or median in place of an outlier will bias the dataset, but not nearly as much as the outlier. This is often the best approach to use for complex statistical calculations.
- Accommodation — Accommodation involves keeping the outlier in the dataset but using “robust” statistical procedures that are less sensitive to outliers. Nonparametric statistics are often used for this purpose.
- Deletion — Deletion is simply removing the outlier from the dataset. This approach would make sense if you’re looking to assess general trends. Once again, it might be beneficial to run your analyses both with and without the outlier.
The option you select should depend on whether you believe the aberrant observation is representative of the population you are investigating. Your objective and the type of analysis you plan to do will also be considerations in this decision.
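Here is a rough sketch of how four of the five options play out on a small dataset. The data and the suspect value are hypothetical, the median of the remaining values stands in as the contingency value for replacement, and the median stands in as the "robust" statistic for accommodation. Those are assumptions made for illustration; other choices are just as defensible. Correction is not shown because it requires knowing the true value.

```python
import statistics

def summarize(values):
    """Mean, sample standard deviation, and median of a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "median": statistics.median(values),
    }

# Hypothetical readings; 102.0 is the value under suspicion.
data = [9.8, 10.1, 10.4, 9.9, 10.0, 102.0, 10.3, 9.7]
suspect = 102.0

# Inclusion: keep everything.
included = list(data)

# Deletion: drop the suspect value.
deleted = [v for v in data if v != suspect]

# Replacement: substitute a contingency value (here, the median of the rest).
contingency = statistics.median(deleted)
replaced = [contingency if v == suspect else v for v in data]

for label, values in [("included", included), ("deleted", deleted), ("replaced", replaced)]:
    s = summarize(values)
    print(f"{label:9s} n={s['n']} mean={s['mean']:.2f} stdev={s['stdev']:.2f} median={s['median']:.2f}")

# Accommodation: keep the value but report a statistic that resists it.
print("median with the suspect value kept:", statistics.median(included))
```

Running the summaries side by side is also the easiest way to show a reviewer how much one value is driving the results.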
What Should You Do
If a statistical graphic or an outlier test suggests that a data value may be an influential observation or an outlier, follow these steps:
- Examine a variety of graphical depictions of all the data points including box plots, probability plots, bivariate plots, time-series plots, and contour maps to assess possible reasons for the aberrant observation.
- Review notes and metadata concerning the sample or measurement to determine if any irregularities in the sampling or data collection processes may be responsible for the discordant value.
- Review documentation related to data quality for the sample or measurement to determine if any irregularities in the collection, packaging, transport, and analysis or measurement and recording processes may be responsible for the discordant value.
- If any information indicates that the sample is probably not representative of the population being sampled, consider the sample or measurement to be an outlier and replace or delete it from further analysis. If possible, collect a new sample or measurement.
- If any information indicates that the sample should be representative of the population, review results for other measurements from the same source to determine if other results support the legitimacy of the suspected outlier. Also, review results for the same variable that may have been generated during previous sampling efforts.
- If prior results for the variable or results for other variables are consistent with results for the suspect sample or measurement, retain the value and evaluate it as an influential observation.
- If prior results for the variable or results for other variables are not consistent with results for the suspect sample or measurement, consider the value to be an outlier and replace or delete it from further analysis. If possible, collect a new sample or measurement.
This procedure works best if both data analysts and reviewers can somehow be involved in the examination process. Be sure to document all findings and decisions during this process.
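If it helps to have the decision points written down before the arguments start, the steps above can be sketched as a small function. The name triage, its two arguments, and the wording of the returned strings are hypothetical. The "undetermined" branch is also an assumption: the steps don't say what to do when the evidence is missing, so the sketch simply sends you back to investigating.

```python
from typing import Optional

def triage(representative: Optional[bool],
           consistent_with_other_results: Optional[bool] = None) -> str:
    """Map the findings of the review steps to a suggested action.

    representative: does the evidence say the sample represents the
        population? Use None when the evidence is unclear.
    consistent_with_other_results: do prior results, or results for other
        variables from the same source, support the suspect value?
    """
    if representative is False:
        return "outlier: replace or delete; collect a new sample if possible"
    if representative is True:
        if consistent_with_other_results is True:
            return "influential observation: retain and evaluate"
        if consistent_with_other_results is False:
            return "outlier: replace or delete; collect a new sample if possible"
    # Assumption: not covered by the steps above.
    return "undetermined: document what you found and keep investigating"

print(triage(False))
print(triage(True, True))
print(triage(True, False))
print(triage(None))
```

The function is not a substitute for judgment; its value is that analysts and reviewers can agree on the branches ahead of time.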
If you decide to retain the outlier, consider using a nonparametric alternative to the procedure you planned to conduct. If for some reason this is not feasible, consider analyzing the dataset twice, once with the outlier and once without the outlier. Caveat your conclusions on the basis of the outlier and recommend collecting additional samples or measurements to assess its validity. Consultants always recommend additional work anyway, so this should come as no surprise to either clients or reviewers.
If you are assessing data trends, you will probably want to delete or replace any outliers. Even a single outlier can mask significant trends. Be aware, however, that this action could bias predicted values and the prediction error if the cause of the outlier is natural.
Given the choice to replace or delete an outlier, consider the number of samples you have and the importance of the variable the outlier is a measure of. Remember, if you delete the outlier you will end up having to delete either the sample or the variable to conduct your statistical analysis. If the variable is important and you don’t have many samples, consider replacing the outlier.
There is also a psychological component to consider when replacing or deleting outliers. Scientists and engineers are taught that it is unethical to delete or change data that might not fit with their expectations. Outliers challenge that notion. Statisticians and reviewers become highly suspicious of each other when the need to judge an outlier arises. Consequently, it is sensible to have a procedure for evaluating outliers in place that everyone agrees to before the need arises. Even so, somebody will criticize you no matter what you do. It’s the way things work.
The idea for this title came from: Beckman, R.J. and Cook, R.D. (1983). Outlier. . ..s, Technometrics, 25, 119-163. If you don’t get the joke, don’t worry about it.