Becoming Part of the Group

Imagine looking for patterns in a scatter plot of two variables. You see no linear trends, no curvilinear trends, and no cyclic or sinusoidal trends. Does that mean there are no associations between the variables? Maybe not.

No sooner had he gotten out of bed than two clusters of black fur formed on the blanket.

Most people think of statistics as hypothesis tests and regression lines, but of course, there’s much more (see http://statswithcats.wordpress.com/2010/08/22/the-five-pursuits-you-meet-in-statistics/). Classification is often an important goal of data analysis. You can classify data visually by sorting or filtering metadata, and by plotting histograms and setting thresholds. But that approach is inefficient, especially compared to cluster analysis.

Cluster Luck

Cluster analysis refers to a number of procedures for arranging ungrouped items into statistically similar collections. Either samples or variables can be clustered. Sample clusters can be used to better describe the data using descriptive statistics or coded as grouping variables for other types of statistical analysis. Variable clusters can be used to help evaluate what a set of variables actually measures. Cluster analysis can also be used to identify atypical groups, even individual outliers.

There are several types of cluster analysis, each with many options for directing the clustering process. The most commonly used type of cluster analysis is hierarchical cluster analysis. Results from a hierarchical clustering are usually expressed as a tree diagram, which looks a bit like a company’s organization chart. The challenge in hierarchical cluster analysis is to interpret a tree diagram and select the appropriate clusters.
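To make the tree-building idea concrete, here is a minimal sketch of hierarchical (agglomerative) clustering using only the Python standard library. The data points, the function names, and the choice of average linkage with Euclidean distance are assumptions made for illustration; they are not the settings used for the analysis in this post. In practice you would more likely use a library routine such as SciPy's scipy.cluster.hierarchy.linkage.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def average_linkage(a, b, points):
    """Mean pairwise distance between two clusters given as lists of indices."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))


def agglomerate(points):
    """Repeatedly merge the two closest clusters and return the merge history.

    Each history entry is (first_cluster, second_cluster, distance).
    Read in order, the history describes the tree diagram from the leaves up.
    """
    clusters = [[i] for i in range(len(points))]  # start with one item per cluster
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average-linkage distance.
        d, i, j = min(
            (average_linkage(clusters[i], clusters[j], points), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history


# Hypothetical two-variable data: two tight groups plus one distant item.
points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.3, 4.8), (9.0, 0.5)]
history = agglomerate(points)
for left, right, d in history:
    print(f"merge {left} + {right} at distance {d:.2f}")
```

In this made-up example the distant item (index 4) is the last to join the tree, which is how a tree diagram can flag an individual outlier.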

Cluster analysis has been used to classify animal and plant species, soil and rock types, astronomical bodies, and weather systems. It is used in education research to classify students, schools and districts. It is used to analyze customer preferences, market segments, target markets, and social networks. It is used to identify crime hot spots and anatomical features in forensic analysis.

Food for Thought

Consider this example. People follow special diets for a variety of reasons, such as controlling weight or blood glucose. But food is complex. Ignoring taste, food is characterized by the energy it provides (i.e., calories), the rate at which it is metabolized (i.e., the glycemic index for carbohydrates), its components (i.e., carbohydrates, proteins, and fats), and many other attributes. So, it is useful for nutritionists to classify foods to help consumers make healthy choices. Cluster analysis is one approach for such a characterization.

Data for this analysis consisted of values of five variables (i.e., calories, carbohydrates, proteins, fats, and glycemic index) for 213 sample foods. The figure shows the tree diagram produced by the cluster analysis (only 38 of the 213 foods are listed, to aid readability). From the tree diagram, an appropriate number of clusters is selected. Cluster selection requires a combination of information on the statistical differences between potential clusters, an understanding of the data to interpret why each member might belong to a certain cluster, and a sense of how many clusters might be reasonable for characterizing the data. The letters in the tree diagram of the figure show one of the many possible sets of clusters.
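Selecting clusters amounts to cutting the tree at some level. The sketch below shows one way that could look, assuming the variables are first converted to z-scores so that calories (in the hundreds) do not swamp grams of protein, and assuming average linkage. Neither assumption is confirmed by the analysis described here. The six food rows are invented for illustration and do not come from the 213-food data set.

```python
from math import dist
from statistics import mean, pstdev


def standardize(rows):
    """Convert each column to z-scores (assumes every variable actually varies)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [tuple((x - m) / s for x, m, s in zip(row, mus, sds)) for row in rows]


def cut_tree(points, k):
    """Average-linkage agglomeration, stopped when k clusters remain."""
    clusters = [[i] for i in range(len(points))]

    def gap(a, b):
        # Mean pairwise distance between the members of two clusters.
        return mean(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        _, i, j = min(
            (gap(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return sorted(sorted(c) for c in clusters)


# Hypothetical rows: (calories, carbohydrates g, protein g, fat g, glycemic index).
foods = [
    (150, 1, 12, 10, 0),   # invented protein-rich item
    (170, 0, 14, 12, 0),   # invented protein-rich item
    (100, 25, 1, 0, 60),   # invented fruit-like item
    (110, 27, 1, 0, 65),   # invented fruit-like item
    (400, 45, 8, 22, 75),  # invented fried, starchy item
    (420, 50, 7, 23, 70),  # invented fried, starchy item
]
groups = cut_tree(standardize(foods), 3)
print(groups)
```

Changing the value of k is the code equivalent of drawing the cut line higher or lower on the tree diagram; the judgment about which k is reasonable still rests with the analyst.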

Tree diagram for the Cluster Analysis of Food Types.

Once clusters are chosen, they are characterized based on the characteristics of their members. The table summarizes how the six food clusters could be interpreted. These interpretations might have been different if the original variables or the number of clusters were different.

Characteristics of Six Food Categories Identified with Cluster Analysis.

| Food Category and Description | Calories | Metabolism | Protein | Carbohydrates | Fats | Foods |
| --- | --- | --- | --- | --- | --- | --- |
| A. Muscle-maintenance foods | Low | Very Slow | High | Low | Moderate | Eggs, most fish, ham, salami, bacon, liverwurst, frankfurters |
| B. Quick-energy foods | Low | Fast | Low | Moderate to High | Low to Moderate | Milk, fruit juices, apples, bananas, cherries, grapes, pears, mangos, papayas, potatoes, crackers, pretzels |
| C. Low-calorie foods | Very Low | Moderate to Fast | Low | Moderate to High | Low to Moderate | Bread, peas, carrots, citrus fruits, peaches, plums, kiwis, watermelon, anchovies, caviar, gefilte fish, pepperoni |
| D. Sustained-energy foods | High | Fast to Very Fast | Moderate | High | Moderate | Yogurt, dates, prunes, pasta, rice, beans, French fries |
| E. Muscle-building foods | High | Very Slow | High | Low | Very High | Catfish, abalone, flounder, herring, mackerel, corned beef, liver, skinned chicken, turkey, venison, veal |
| F. Weight-gain foods | Very High | Slow to Moderate | High | Low | Very High | Raisins, soybeans, bass, kingfish, most beef, chicken and pork |
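A summary like the one in the table could start from per-cluster averages of each variable, which the analyst then translates into labels such as Low or High. The following is only a sketch of that first step. The variable names, the rows, and the cluster labels are hypothetical, and the post does not say how its own labels were derived.

```python
from collections import defaultdict
from statistics import mean

# Assumed variable order for each row; the names are illustrative only.
VARIABLES = ("calories", "carbohydrates", "protein", "fat", "glycemic_index")


def profile_clusters(rows, labels):
    """Return {cluster label: {variable: mean value}} for labelled rows."""
    members = defaultdict(list)
    for row, label in zip(rows, labels):
        members[label].append(row)
    return {
        label: {name: mean(col) for name, col in zip(VARIABLES, zip(*group))}
        for label, group in sorted(members.items())
    }


# Hypothetical rows and cluster assignments, not taken from the food data.
rows = [
    (150, 1, 12, 10, 0),
    (170, 0, 14, 12, 0),
    (100, 25, 1, 0, 60),
    (110, 27, 1, 0, 65),
]
labels = ["A", "A", "B", "B"]
profile = profile_clusters(rows, labels)
for label, stats in profile.items():
    print(label, stats)
```

Comparing each cluster's averages with the averages for the whole data set is one simple way to decide which clusters deserve a label like Very High.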

One thing you should do after every analysis is to ask yourself if the results make sense. This isn’t the same as trying to bias the results, or at least it shouldn’t be. If you really understand your data, you should be able to tell if a result fits with the conventional wisdom. In the table, for example, does it make sense that raisins and soybeans are weight-gain foods and pepperoni is a low-calorie food? Could there be errors in the data? Might the serving sizes be non-representative of what might be eaten at one time? Perhaps. In other cases, it might also be possible that a different clustering algorithm, a different measure of data distance, or a different number of clusters would allow a better interpretation of the data.
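The effect of changing the clustering algorithm can be seen even on tiny data. The sketch below clusters the same six hypothetical points twice, once with single linkage (distance between the closest members) and once with complete linkage (distance between the farthest members), and cuts both trees at three clusters. The points and function name are invented for illustration.

```python
from math import dist


def cluster(points, k, linkage):
    """Agglomerate until k clusters remain.

    `linkage` is min for single linkage or max for complete linkage.
    """
    clusters = [[i] for i in range(len(points))]

    def gap(a, b):
        # Closest (min) or farthest (max) pair of members across two clusters.
        return linkage(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        _, i, j = min(
            (gap(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return sorted(sorted(c) for c in clusters)


# Hypothetical points strung out along a line with uneven gaps.
points = [(0.0, 0.0), (1.0, 0.0), (2.1, 0.0), (3.3, 0.0), (5.0, 0.0), (5.9, 0.0)]
single = cluster(points, 3, min)
complete = cluster(points, 3, max)
print("single linkage:  ", single)
print("complete linkage:", complete)
```

Single linkage chains the first three points together and leaves the fourth on its own, while complete linkage pairs the points off. Neither answer is wrong; they reflect different definitions of what makes clusters similar, which is why the results need to be checked against what you know about the data.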

Cluster analysis is a powerful technique for exploring patterns of similarity and difference in samples or variables. It is considered to be an exploratory statistical technique. It requires considerable knowledge of the phenomenon the data represent to interpret the results. For applied statisticians, though, this is where data analysis really gets fun.

[The data for this analysis came from http://www.ast-ss.com/research/food/food_listing_all.asp. Values for calories, carbohydrates, proteins, and fats are contingent on serving size. Not all of the foods were included in the analysis because of missing data.]

Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at Wheatmark, amazon.com, barnesandnoble.com, or other online booksellers.

About statswithcats

Charlie Kufs has been crunching numbers for over thirty years. He currently works as a statistician.

6 Responses to Becoming Part of the Group

  1. I like clustering ;)
    Are there cluster techniques for nominal data ?

  3. doo says:

    Thanks for breaking it down!

    Please which software did you use in charting the diagram?
