Catalog of Models

Whether you know it or not, you deal with models every day. Your weather forecast comes from a meteorological model, usually several. Mannequins are used to display how fashions may look on you. Blueprints are drawn models of objects or structures to be built. Maps are models of the earth’s terrain. Examples are everywhere.

Models are representations of things, usually an ideal, a standard, or something desired. They can be true representations, approximate (or at least as good as practicable), or simplified, even cartoonish compared to what they represent. They can be about the same size, bigger, or most typically, smaller, whatever makes them easiest to manipulate. They can represent:

  • Physical objects that can be seen and touched
  • Processes that can be watched
  • Behaviors that can be observed
  • Conditions that can be monitored
  • Opinions that can be surveyed.

The models themselves do not have to be physical objects. They can be written, drawn, or consist of mathematical equations or computer programming. In fact, using equations and computer code can be much more flexible and less expensive than building a physical model.

[Figure: Catalog of Models, Stats with Cats, 10-23-2017]

Classification of Models

There are many ways that models are classified, so this catalog isn’t unique. The models may be described with different terms or broken out to greater levels of detail. Furthermore, you can also create hybrid models. Examples include mash-ups of analytical and stochastic components used to analyze phenomena such as climate change and subatomic particle physics. Nevertheless, the catalog should give you some ideas for where you might start to develop your own model.

Physical Models

Your first exposure to a model was probably a physical model like a baby pacifier or a plush animal, and later, a doll or a toy car. From then, you’ve seen many more – from ant farms to anatomical models in school. You probably even built your own models with Legos, plastic model kits, or even a Halloween costume. They are all representations of something else.


Physical models aren’t used often for advanced applications because they are difficult and expensive to build and calibrate to a realistic experience. Flight simulators, hydrographic models of river systems, and reef aquariums are well known examples.

Conceptual Models

Models can also be expressed in words and pictures. These are used in virtually all fields to convey mental images of some mechanism, process, or other phenomenon that was or will be created. Blueprints, flow diagrams, geologic fence diagrams, and anatomical diagrams are all conceptual models. So are the textual descriptions that go with them. In fact, you should always start with a simple text model before you embark on building a complex physical or mathematical model.

Mathematical and Computer Models

Theoretical Models

Theoretical models are based on scientific laws and mathematical derivations. Both theoretical models and deterministic empirical models provide solutions that presume that there is no uncertainty. These solutions are termed exact (which does not necessarily imply correct). There is a single solution for given inputs.

Analytical Models

Analytical models are mathematical equations derived from scientific laws that produce exact solutions that apply everywhere. For example, F (force) = M (mass) times A (acceleration) and E (energy) = m (mass) times c² (speed of light squared) are analytical models. Most concepts in classical physics can probably be modeled analytically.

Numerical Models

Numerical models are mathematical equations that have a time parameter. Numerical models are solved repeatedly, usually on a grid, to obtain solutions over time. This is sometimes called a Dynamic Model (as opposed to a Static Model) because it describes time-varying relationships.
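To make the difference between an analytical and a numerical model concrete, here is a minimal sketch. It assumes a simple decay equation, dy/dt = -k·y, which is an illustrative choice and not something from this catalog. The analytical model gives the answer in one calculation; the numerical model steps forward through time on a grid of small time increments. The function names and the values of y0, k, and dt are all made up for the example.

```python
import math

def decay_analytical(y0, k, t):
    """Exact solution of dy/dt = -k*y at time t (one calculation)."""
    return y0 * math.exp(-k * t)

def decay_numerical(y0, k, t_end, dt):
    """Approximate the same equation by stepping through time (explicit Euler)."""
    steps = round(t_end / dt)
    y = y0
    for _ in range(steps):
        y += dt * (-k * y)  # advance the solution by one time step
    return y

# Assumed illustration values: start at 100, decay rate 0.5, run to time 2
exact = decay_analytical(100.0, 0.5, 2.0)
approx = decay_numerical(100.0, 0.5, 2.0, dt=0.001)
print(exact, approx)
```

The two results agree closely here because the time step is small. A larger step makes the numerical model cheaper to run but less accurate, which is the usual trade-off with this kind of model.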

Empirical Models

Empirical models can be deterministic, probabilistic, stochastic, or sometimes, a hybrid of the three. They are developed for specific situations from measured data. Empirical models differ from theoretical models in that the model is not necessarily fixed for all instances of its use. There may be multiple reasonable empirical models that can apply to a given situation.

Deterministic Models

Deterministic empirical models presume that a mathematical relationship exists between two or more measurable phenomena (as do theoretical models) that will allow the phenomena to be modeled without uncertainty (or at least, not much uncertainty, so that it can be ignored) under a given set of conditions. The difference is that the relationship isn’t unique or proven. There are usually assumptions. Biological growth and groundwater flow models are examples of deterministic empirical models.

Probability Models

Probability models are based on a set of events or conditions all occurring at once. In probability, it is called an intersection of events. Probability models are multiplicative because that is how intersection probabilities are combined. The most famous example of a probability model is the Drake equation, a summary of the factors affecting the likelihood that we might detect radio communication from intelligent extraterrestrial life.
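As a rough sketch of what "multiplicative" means in practice, the Drake equation can be written as a product of its factors. The factor names below follow the commonly cited form of the equation, but every number is an illustrative placeholder chosen only to show the arithmetic. None of the values come from this article, and published estimates vary enormously.

```python
from math import prod

# Placeholder values for illustration only (assumptions, not estimates)
drake_factors = {
    "R_star": 1.0,  # average rate of star formation (stars per year)
    "f_p": 0.5,     # fraction of stars with planets
    "n_e": 2.0,     # planets per system that could support life
    "f_l": 0.5,     # fraction of those planets where life appears
    "f_i": 0.1,     # fraction of those where intelligence develops
    "f_c": 0.1,     # fraction that release detectable signals
    "L": 10000.0,   # years over which such signals are released
}

# The model is just the product of all the factors
n_civilizations = prod(drake_factors.values())
print(n_civilizations)
```

Because the factors are multiplied, halving any one of them halves the result, which is why this kind of model works well for thought experiments about which factors matter most.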

Stochastic Models

Stochastic empirical models presume that changes in a phenomenon have a random component. The random component allows stochastic empirical models to provide solutions that incorporate uncertainty into the analysis. Stochastic models include lottery picks, weather, and many problems in the behavioral, economic, and business disciplines that are analyzed with statistical models.

Comparison Models

In statistical comparison models, groups defined by a grouping-scale variable (one measured on a nominal scale) are compared to each other. In the usual terminology, the grouping variable is the independent variable and the measured outcome is the dependent variable, which can be either grouping-scale or continuous. Simple hypothesis tests include:

  • χ² (chi-square) tests that analyze cell frequencies on one or more grouping variables, and
  • t-tests and z-tests that analyze dependent variable means in one or two groups of a grouping variable.

Analysis of Variance (ANOVA) models compare dependent variable means for two or more groups of an independent grouping variable. Analysis of Covariance (ANCOVA) models compare dependent variable means for two or more groups of an independent grouping variable while controlling for one or more continuous variables. Multivariate ANOVA and ANCOVA compare two or more dependent variables using multiple independent variables. There are many more types of ANOVA model designs.

Classification Models

Classification and identification models also analyze groups.

Clustering models identify groups of similar cases based on continuous-scale variables. There need be no prior knowledge or expectation about the nature of the groups. There are several types of cluster analysis, including hierarchical clustering, K-Means clustering, two-step clustering, and block clustering. Often, the clusters or segments are used as inputs to subsequent analyses. Clustering models are also known as segmentation models.

Clustering models do not have a nominal-scale dependent variable, but most classification models do. Discriminant analysis models have a nominal-scale dependent variable and one or more continuous-scale independent variables. They are usually used to explain why the groups are different, based on the independent variables, so they often follow a cluster analysis. Logistic regression is analogous to linear regression but is based on a non-linear model and a binary or ordinal dependent variable instead of a continuous-scale variable. Often, models for calculating probabilities use a binary (0 or 1) dependent variable with logistic regression.

There are many analyses that produce decision trees, which look a bit like organization charts. C&R Trees (Classification and Regression Trees) split a categorical dependent variable into its groups based on continuous- or categorical-scale independent variables. All splits are binary. CHAID (Chi-square Automatic Interaction Detector) generates decision trees that can have more than two branches at a split. A Random Forest consists of a collection of simple tree predictors.

Explanation Models

Explanation models aim to explain associations within or between sets of variables. With explanation models, you select enough variables to address all the theoretical aspects of the phenomenon, even to the point of having some redundancy. As you build the model, you discover which variables are extraneous and can be eliminated.

Factor Analysis (FA) and Principal Components Analysis (PCA) are used to explore associations in a set of variables where there is no distinction between dependent and independent variables. The two types of statistical analysis:

  • Create new metrics, called factors or components, which explain almost the same amount of variation as the original variables.
  • Create fewer factors/components than the original variables so further analysis is simplified.
  • Require that the new factors/components be interpreted in terms of the original variables, but they often make more conceptual sense so subsequent analyses are more intuitive.
  • Produce factors/components that are statistically independent (uncorrelated) so they can be used in regression models to determine how important each is in explaining a dependent variable.

Canonical Correlation Analysis (CCA) is like PCA only there are two sets of variables. Pairs of components, one from each group, are created that explain independent aspects of the dataset.

Regression analysis is also used to build explanation models. In particular, regression using principal components as independent variables is popular because the components are uncorrelated and not subject to multicollinearity.

Prediction Models

Some models are created to predict new values of a dependent variable or forecast future values of a time-dependent variable. To be useful, a prediction model must use prediction variables that cost less to generate than the prediction is worth. So the predictor variables and their scales must be relatively inexpensive and easy to create or obtain. In prediction models, accuracy tends to come easy while precision is elusive. Prediction models usually keep only the variables that work best in making a prediction, and they may not necessarily make a lot of conceptual sense.

Regression is the most commonly used technique for creating prediction models. Transformations are used frequently. If a model includes one or more lagged values of the dependent variable among its predictors, it is called an autoregressive model.
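Here is a minimal sketch of the simplest autoregressive model, in which each value is predicted from the single value before it (one lag). It uses ordinary least squares written out by hand so that nothing beyond basic Python is needed. The function name fit_ar1 and the data series are made up for the illustration; the series is built to follow the relationship exactly so the fit is easy to check.

```python
def fit_ar1(series):
    """Least-squares fit of y[t] = a + b * y[t-1]."""
    x = series[:-1]  # lagged values (the predictor)
    y = series[1:]   # current values (the dependent variable)
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx            # slope on the lagged value
    a = mean_y - b * mean_x  # intercept
    return a, b

# Made-up series that follows y[t] = 2 + 0.5 * y[t-1] exactly
series = [10.0]
for _ in range(9):
    series.append(2 + 0.5 * series[-1])

a, b = fit_ar1(series)
forecast = a + b * series[-1]  # one-step-ahead forecast
print(a, b, forecast)
```

Real data would include a random component, so the fitted coefficients would only approximate the underlying relationship, and you would want statistical software to report the uncertainty.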

Neural Networks is a predictive modeling technique inspired by the way biological nervous systems process information. The technique involves interconnected nodes or layers that apply predictor variables in different ways, linear and nonlinear, to all or some of the dependent variable values. Unlike most modeling techniques, neural networks can’t be articulated so they are not useful for explanation purposes.

Picking the Right Model

There are many ways to model a phenomenon. Experience helps you to judge which model might be most appropriate for the situation. If you need some guidance, follow these steps.

  • Step 1 – Start at the top of the Catalog of Models figure. Decide whether you want to create a physical, mathematical, or conceptual model. Whichever you choose, start by creating a brief conceptual model so you have a mental picture of what your ultimate goal is and can plan for how to get there.

If your goal is a physical or full-blown conceptual model, do the research you’ll need to identify appropriate materials and formats. But this blog is about mathematical models, so let’s start there.

  • Step 2 – If you want to select a type of mathematical model, start on the second line of the Catalog of Models figure and decide whether your phenomenon fits best with a theoretical or an empirical approach.

If there are scientific or mathematical laws that apply to your phenomenon, you’ll probably want to start with some type of theoretical model. If there is a component of time, particularly changes over time periods, you’ll probably want to try developing a numerical model. Otherwise, if a single solution is appropriate, try an analytical model.

  • Step 3 – If your phenomenon is more likely to require data collection and analysis to model, you’ll need an empirical model. An empirical model can be probabilistic, deterministic, or stochastic. Probability models are great tools for thought experiments. There are no wrong answers, only incomplete ones. Deterministic models are more of a challenge. There needs to be some foundation of science (natural, physical, environmental, behavioral, or other discipline), engineering, business rules, or other guidelines for what should go into the model. More often than not, deterministic models are overly complicated because there is no way to distinguish between components that are major factors versus those that are relatively inconsequential to the overall results. Both Probability and Deterministic models are often developed through panels of experts using some form of Delphi process.
  • Step 4 – If you need to develop a stochastic (statistical) model, go here to pick the right tool for the job.
  • Step 5 – Consider adding hybrid elements. Don’t feel constrained to only one type of component in building your model. For instance, maybe your statistical model would benefit from having deterministic, probability, or other types of terms in it. Calibrate your deterministic model using regression or another statistical method. Be creative.



Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data analysis at,, or other online booksellers.


How to Describe Numbers

Say you wanted to describe someone you see on the street. You might characterize their sex, age, height, weight, build, complexion, face shape, hair, mouth and lips, eyes, nose, tattoos, scars, moles, and birthmarks. Then there’s clothing, behavior, and if you’re close enough, speech, odors, and personality. Your description might be different if you’re talking to a friend or a stranger, of the same or different sex and age. Those are a lot of characteristics and they’re sometimes hard to assess. Individual characteristics aren’t always relevant and can change over time. And yet, without even thinking about it, we describe people we see every day using these characteristics. We do it mentally to remember someone or overtly to describe a person to someone else. It becomes second nature because we do it all the time.

Most people don’t describe sets of numbers very often, though, so they don’t know how easy it actually is. You have to consider only a few characteristics, all of which are fairly easy to assess and will never change for the dataset. Once you learn how, it’s hardly a challenge to get it right, unlike describing the hot young guy who just robbed a bank wearing a clown costume.

What’s involved in describing a dataset? First, before considering any descriptive statistics, you have to assess two qualities.

  • Phenomenon and population or sample
  • Measurement scale

From this information, you’ll be able to determine what descriptive statistics to calculate.

Phenomenon and Population or Sample

This is a thinking exercise; there are no calculations.

First, determine what the numbers represent. What is the phenomenon they are related to? If there’s no context for the numbers, like it’s just a dataset for a homework problem, that’s fine too. But if you know something about the data, you might be able to judge whether your answer makes sense later when the calculations are done.


Next, think about the population from which the data were obtained. How is the population defined? Do you have all the possible measurements or entities? If not, you have a sample of the population, hopefully a sample that is a good representation of the population. This knowledge will help you judge whether your answer makes sense and will be consistent with other samples taken from the same population. Again, if there’s no context for the numbers, that’s fine. Now, all you have to decide is whether you want to describe the population or just the sample of the population for which you have measurements. If you’re not sure, assume you want to describe the population. All the fun stuff in statistics involves populations.

Measurement Scale


Scales of measurement express the phenomenon represented by the population. Simply put, scales are the ways that a set of numbers are related to each other. For example, the increments between scale values may all be identical, such as with heights and weights, or vary in size, such as with earthquake magnitudes and hurricane categories. The actual values of scales are called levels.

You have to understand the scale of measurement to describe data. There are a variety of types of measurement scales, but for describing a dataset you only need to pick from three categories:

  • Grouping Scales – Scales that define collections having no mathematical relationship to each other. The groups can represent categories, names, and other sets of associated attributes. These scales are also called nominal scales. They are described by counts and statistics based on counts, like percentages.
  • Ordered Scales – Scales that define measurement levels having some mathematical progression or order, commonly called ordinal scales. Data measured on an ordinal scale are represented by integers, usually positive. Counts and statistics based on medians and percentiles can be calculated for ordinal scales.
  • Continuous Scales – Scales that define a mathematical progression involving fractional levels, represented by numbers having decimal points after the integer. These scales may be called interval scales or ratio scales depending on their other properties. Any statistic can be calculated for data measured on continuous scales.

There are other scales of measurement but that’s all you’ll need at this point.

Descriptive Statistics

Now you can get on to describing a set of numbers. You’ll only need to consider four attributes – frequency, central tendency, dispersion, and shape.

Frequency refers to the number of times the level of a scale appears in a set of numbers. It is used mostly for nominal (grouping) scales and sometimes with ordinal scales. The level with the highest frequency is called the mode. Frequency is used most effectively to show how scale levels compare to each other, such as with percentages or in a histogram.


Central Tendency refers to where the middle of a set of numbers is. It is used mostly for continuous (interval or ratio) scales and often with ordinal scales. There are many statistics that may be used to describe where the center of a dataset is, the most popular of which are the median and the mean. The median is the exact center of a progression-scale dataset. There are exactly the same number of data values less than and greater than the median. You determine the median by sorting the values in the dataset and counting the values from the extremes until you find the center. The mean, or average, is the center of a progression-scale dataset that is determined by a calculation. There may not be an equal number of data values less than and greater than the mean. You determine the mean by adding all the values in the dataset and dividing that sum by the number of values. The mean or the median is used in most statistical testing to find differences in data populations.
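A small sketch shows how the median and the mean can disagree. It uses Python's built-in statistics module and a made-up dataset with one extreme value.

```python
import statistics

values = [2, 3, 3, 4, 5, 6, 40]  # made-up data with one extreme value

median = statistics.median(values)  # middle of the sorted values
mean = statistics.mean(values)      # sum of the values divided by their count

# Count how many values fall on each side of the two centers
below_median = sum(v < median for v in values)
above_median = sum(v > median for v in values)
below_mean = sum(v < mean for v in values)
above_mean = sum(v > mean for v in values)

print(median, below_median, above_median)
print(mean, below_mean, above_mean)
```

The median splits the values evenly, while the single large value pulls the mean upward so that most of the data fall below it.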

Dispersion refers to how spread out the data values are. It is used for continuous (interval or ratio) scales but only rarely with ordinal scales. There are many ways to describe data dispersion but the most popular is the standard deviation. You calculate the standard deviation by:

  1. Subtracting the mean of a dataset from each value in the dataset
  2. Squaring each subtracted value
  3. Adding all the squared values
  4. Dividing the sum of the squared values by the number of values in the dataset (if you’re describing only the values you have) or by the number of values in the dataset minus 1 (if you’re using a sample to describe the population it came from)
  5. Taking the square root of the result of step 4, which is the variance.

The standard deviation is used in statistical testing to find differences in data populations.
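The calculation can be sketched in a few lines of Python. The function name and the data are made up for the illustration. Note that dividing the sum of squares gives the variance, so the sketch finishes by taking the square root to get the standard deviation. The results are compared with Python's built-in statistics.pstdev (divides by the number of values) and statistics.stdev (divides by the number of values minus 1).

```python
import math
import statistics

def standard_deviation(values, describe_population=True):
    """Standard deviation computed step by step.

    describe_population=True divides by n - 1 (a sample used to describe
    the population); False divides by n (describing only these values).
    """
    mean = sum(values) / len(values)
    deviations = [v - mean for v in values]  # subtract the mean
    squared = [d ** 2 for d in deviations]   # square each difference
    total = sum(squared)                     # add the squares
    divisor = len(values) - 1 if describe_population else len(values)
    variance = total / divisor               # divide
    return math.sqrt(variance)               # square root gives the SD

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data

print(standard_deviation(values, describe_population=False), statistics.pstdev(values))
print(standard_deviation(values, describe_population=True), statistics.stdev(values))
```

The difference between the two divisors matters most for small datasets and shrinks as the number of values grows.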

Shape refers to the frequency of the values in a dataset at selected levels of the scale, most often depicted as a graph. For ordinal scales, the graph is usually a histogram. For continuous scales, the graph is usually a probability plot, although sometimes histograms are used. Shapes of continuous scale data can be compared to mathematical models (equations) of frequency distributions. It’s like comparing a person to some well-known celebrity; they’re not identical but are similar enough to provide a good comparison. There are dozens of such distribution models, but the most commonly used is the normal distribution. The normal distribution model has two parameters – the mean and the standard deviation.

There are many other statistics that can be used to describe datasets, but most of the time, this is all you need.


For example, a nominal-scale dataset would be described by providing counts or percentages of observations in each group. An ordinal-scale dataset would be described by providing counts or percentages for each level, the median and percentiles, and ideally, a histogram. A continuous-scale dataset would be described by providing the closest distribution model and estimates of its parameters, such as “normally distributed with a mean of 10 and a standard deviation of 2.” Continuous-scale datasets can be described so succinctly because the distribution-shape specification contains so much of the telling information.
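As a rough sketch, the three kinds of description could look like this in Python. The function names and datasets are hypothetical, and the sketch stops short of choosing a distribution model or drawing a graph; it only produces the numbers you would report.

```python
from collections import Counter
import statistics

def describe_nominal(values):
    """Counts and percentages for each group."""
    counts = Counter(values)
    n = len(values)
    return {level: (count, 100 * count / n) for level, count in counts.items()}

def describe_ordinal(values):
    """Median and quartiles (25th, 50th, and 75th percentiles)."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {"median": statistics.median(values), "quartiles": (q1, q2, q3)}

def describe_continuous(values):
    """Mean and standard deviation, the two parameters of a normal model."""
    return {"mean": statistics.mean(values), "sd": statistics.stdev(values)}

# Made-up datasets, one per scale
print(describe_nominal(["tabby", "calico", "tabby", "tabby"]))
print(describe_ordinal([1, 2, 2, 3, 3, 3, 4]))
print(describe_continuous([8.1, 9.4, 10.0, 10.6, 11.9]))
```

Whether the mean and standard deviation are a fair summary still depends on the shape of the data, so a histogram or probability plot is worth checking before you quote them.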

Now isn’t that a lot easier than describing that hot bank robber wearing a clown costume?



Visualizations versus Infographics

Visualizations and infographics are both visual representations of data that are often confused. In fact, there is not a clear line of demarcation between the two. Both are informative. Both can be static or animated. Both require a knowledgeable person to create them.


Visualizations Explore

Data visualizations are created to make sense of data visually and to explore data interactively. Visualization is mostly automatic, generated through the use of data analysis software, to create graphs, plots, and charts. The visualizations can use the default settings of the software or involve Data Artistry and labeling (i.e., these Enhanced Visualizations fall in the intersection of the two circles in the figure). The processes used to create visualizations can be applied efficiently to almost any dataset. Visualizations tend to be more objective than infographics and better for allowing audiences to draw their own conclusions, although the audience needs to have some skills in data analysis. Data visualizations do not contain infographics.

Infographics Explain

Infographics are artistic displays intended to make a point using information. They are specific, elaborate, explanatory, and self-contained. Every infographic is unique and must be designed from scratch for visual appeal and overall reader comprehension. There is no software for automatically producing infographics the way there is for visualizations. Infographics are combinations of illustrations, images, text, and even visualizations designed for general audiences. Infographics are better than visualizations for guiding the conclusions of an audience but can be more subjective than visualizations.

| | Visualization | Infographic |
|---|---|---|
| Objective | Analyze | Communicate |
| Audience | Some data analysis skills | General audience |
| Components | Points, lines, bars, and other data representations | Graphic design elements, text, visualizations |
| Source of Information | Raw data | Analyzed data and findings |
| Creation Tool | Data analysis software | Desktop publishing software |
| Replication | Easily reproducible with new data | Unique |
| Interactive or Static | Either | Static |
| Aesthetic Treatment | Not necessary | Essential |
| Interpretation | Left to the audience | Provided to the audience |



How to Analyze Text

Statisticians love to analyze numbers, but what do they do when what they want to explore is unformatted text? It happens all the time. The text may come from open-ended responses on surveys, social networking sites, email, online reviews, public comments, notations (e.g., medical, customer relations), documents and text files, or even recorded and transcribed interactions. But before anything can happen, you have to accomplish three tasks:

  • Get the text into a spreadsheet or other software that you can use to manipulate it.
  • Break the text into analyzable fragments – letters, words, phrases, sentences, paragraphs, or whatever.
  • Assign properties to the text fragments.

How you might complete these tasks depends on what you want to do and the software you have. Nonetheless, you’ll be surprised by how much you can do with just a spreadsheet and an internet connection if you have the time and focus. This article will show you how.


There are several ways that you can analyze text. You can:

  • Count the occurrence of specific letters, words, or phrases, often summarized as Word Clouds. There are quite a few free web sites that will help you construct word clouds.
  • Categorize text by key themes, topics, or commonalities, called Text Mining.
  • Classify attitudes, emotions, and opinions of a source toward some topic, called Sentiment Analysis or opinion mining. There are many applications of sentiment analysis in business, marketing, customer management, political science, law, sociology, psychology, and communications.
  • Explore relationships between words using a Word Net. The relationships can reflect definitions or other commonalities.

Some of these analyses can be performed using free web apps; others require special software.

Specialized Software

Some text analytics can be performed manually, but it is a time-consuming process so having software can be crucial. Unfortunately, the biggest and best software is proprietary, like SAS and SPSS, and costs a lot. There are also free and low-cost alternatives, as well as free web sites that perform less sophisticated analyses. There are a lot of software options so there are probably a lot of people analyzing text. Let Google be your guide.

Manual Analyses

Even if you don’t have access to specialized software for text analyses, you can still perform two types of analyses with nothing more than a spreadsheet program and an internet connection. You can count the number of times that a letter, word, or phrase appears in a text passage. Word frequency turns out to be relatively easy to produce but once you have the counts, the analysis and interpretation may be a bit more challenging. You can also do simple topic analyses or sentiment analyses. Parsing the sentences or sentence fragments and analyzing them is straightforward but time-consuming, though the interpretation is usually easier.

Word Counts

If you are just looking for keywords or counting words for some diagnostic purpose, you’ll find that it’s not that difficult. Here’s how to do word counts.

Step 1 – Find the text you want to analyze.

This is usually easy except for there being so many choices. You have to start with an electronic file. If you have hard copy, you’ll have to scan it and correct the errors. If you have text from separate sources, you’ll want to aggregate them to make things easier. If you have text on a website, you can usually highlight it and copy it using <ctrl-C>. If the passage is long, you can use <ctrl-A> to select everything before copying it, but you’ll have to edit out the extraneous material. You can do these operations in most word processors.

Step 2 – Scrub the data

You should scrub the text to be sure you’ll be counting the correct things. Take out entries that aren’t part of the flow of the text, like footnotes and section numbers. Correct misspellings. Take out punctuation that might become associated with words, like em dashes.

Step 3 – Count the words.

The quickest way to count words is to go to an Internet site for that purpose. Just copy your scrubbed text, paste it into the box on the site, and press submit. You’ll get a column of words and their frequencies. Parse the numbers from the text and you’re ready to analyze the data. It’s a good idea to review the results of the counting to be sure no errors have crept into the process.

Another way to do this solely in a spreadsheet is to replace all the punctuation with blanks and then replace the blanks with paragraph marks. This will give you a column of words. Copy it and remove the duplicates then you can use a formula to count each word.
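If you are comfortable with a little programming, the same counting can be sketched in a few lines of Python instead of a spreadsheet. This is only one possible approach and not the method described above. The function name and sample text are made up, and the pattern used to pick out words (letters and apostrophes only) is a simplifying assumption that drops numbers and other punctuation.

```python
import re
from collections import Counter

def count_words(text):
    """Lowercase the text, pull out the words, and count each one."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

sample = "The cat sat on the mat. The mat was flat."
counts = count_words(sample)

# Show the most frequent words first
for word, n in counts.most_common(3):
    print(word, n)
```

As with the web sites, it is worth scanning the results for oddities, such as different forms of the same word being counted separately.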

Once you have the counts, the analysis is up to you. You can compare word statistics from different sources or analyze word frequencies within a single source. The possibilities are endless. Interpretation is another matter. Here are some examples.


One thing you can do with word counts is to produce a word cloud. There are many web sites that will generate these graphics. My favorite is Wordle, but be advised, you have to use Internet Explorer for it to work. Here’s an example of a word cloud produced with Wordle.


Text Mining

Topic or Sentiment Analyses are straightforward but more time consuming than word counts. Unless you are analyzing text for work or school, relax and turn on Netflix. This isn’t very sophisticated, but it’ll take a while and you’ll need frequent breaks to maintain your focus.

There are six steps.

Step 1 – Get the Data into a Spreadsheet

As with word counts, you have to get the text file into a text manager, preferably a spreadsheet. Highlight your text or use <ctrl A> and then <ctrl C> and <ctrl V>. You’ll need to parse any block text into sentences or whatever length fragment you want to analyze. You can usually do this by replacing periods with paragraph marks. Start with a small dataset, perhaps fewer than fifty fragments, until you get used to the process.
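For longer passages, the parsing step can also be sketched in Python. The function below is hypothetical and splits on the space that follows a period, question mark, or exclamation point. That simple rule is an assumption, and it will wrongly split at abbreviations such as "Dr." or "e.g.", so review the output.

```python
import re

def split_into_fragments(text):
    """Split block text into sentence-like fragments, one per list item."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in pieces if p]

block = "I remember rotary phones. We had party lines! Did you?"
for fragment in split_into_fragments(block):
    print(fragment)
```

Each printed line can then be pasted into its own spreadsheet row.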

Step 2 – Scrub the Responses

Format the fragments into a single column with one fragment per row. Delete extraneous fragments. Don’t worry about misspellings and punctuation. If you make a mistake, <ctrl Z> will undo it.

Step 3 – Assign Descriptors

In a column next to the column with the fragments, enter your first descriptor. It can be a keyword, theme, sentiment, length, or whatever you want to analyze. Unless you have predetermined descriptors you are looking for, don’t worry too much about the descriptors you use. You’ll review and edit them in the next step.

Step 4 – Count the Fragments Assigned to Each Descriptor

When you count the fragments assigned to each descriptor, you’ll probably find a few descriptors with only a few fragments. Consider combining them with other descriptors. When you’re satisfied with the assignments, you might want to subdivide the descriptor groups with another set of descriptors.

Step 5 – Repeat Steps 3 and 4

You can repeat the last two steps as many times as you feel is necessary. You can use these hierarchical descriptor groups to characterize subsets of the text, so avoid having too many or too few fragments in each descriptor group. When you’re done, your data set would look something like this.


If you have a predetermined set of descriptors, you can assign each one to a column of the spreadsheet and code them as 0 or 1 for presence or absence.
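Here is a rough sketch of what 0/1 coding looks like outside a spreadsheet. Note that it swaps the by-hand assignment described above for simple keyword matching, which is cruder than reading each fragment yourself. The descriptors, keywords, and fragments are all invented for the example.

```python
# Hypothetical descriptors and fragments, invented for this sketch.
descriptors = {
    "technology": ["phone", "tv", "television", "radio"],
    "money": ["penny", "nickel", "cents", "dollar"],
}

fragments = [
    "Rotary phone on the kitchen wall",
    "Penny candy at the corner store",
    "Getting up to change the TV channel",
]

def code_fragment(fragment, descriptors):
    """Return a dict of 0/1 flags, one per descriptor (1 = present, 0 = absent)."""
    words = fragment.lower().split()
    return {
        name: int(any(keyword in words for keyword in keywords))
        for name, keywords in descriptors.items()
    }

# One row of 0/1 codes per fragment, one column per descriptor.
coded = [code_fragment(f, descriptors) for f in fragments]

# Column totals give the number of fragments assigned to each descriptor.
totals = {name: sum(row[name] for row in coded) for name in descriptors}
print(totals)
```

A fragment can carry a 1 in more than one column, which is the main advantage of this layout over a single descriptor column.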

Step 6 – Analyze

Once you have built your data set, you can analyze it statistically by counts and percentages, or graphically using word clouds. Consider this example. On December 29, 2016, Tanya Lynn Dee asked the question on her Facebook page, “Without revealing your actual age, what [is] something you remember that if you told a younger person they wouldn’t understand?” There were over 1,000 responses (at the time I saw the post), which I copied and classified into common themes. The results are here.

To learn more about analyzing text for its sentiment, read Sentiment Analysis: nearly everything you need to know by MonkeyLearn.

So, try analyzing some text (and other things) at home. You won’t need parental supervision.


Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data analysis at, or other online booksellers.


Top 50 Statistics Blogs And Websites on the Web

Number 28

Reading Stats with Cats


Hellbent on Measurement

Any variable that you record in a dataset will have some scale of measurement. Scales of measurement are, simply put, the ways that associated numbers relate to each other. Scales are properties of numbers, not the objects being measured. You could measure the same attribute of an object using more than one scale. For example, say you were doing a study involving cats and wanted to have a measure of each cat’s age. If you knew their actual birth dates, you could calculate their real ages in years, months, and days. If you didn’t know their birth dates, you could have a veterinarian or other knowledgeable individual estimate their ages in years. If you didn’t need even that level of precision, you could simply classify the cats as kittens, adult cats, or mature cats.

Understanding scales of measurement is important for a couple of reasons. Use a scale that has too many divisions and you’ll be fooled by the illusion of precision. Use a scale that has too few divisions and you’ll be dumbing down the data. Most importantly, though, scales of measurement determine, in part, what statistical methods might be applied to a set of measurements. If you want to do a certain type of statistical analysis on a variable, you have to use an appropriate scale for the variable. There are a few intricacies involved with measurement scales, so for now, just know that you have to understand a variable’s scale of measurement in order to analyze those data and interpret what it all means.

In Statistics 101, you’ll learn that there are four types of measurement scales – nominal, ordinal, interval, and ratio. This isn’t entirely true. The four-scale classification, described by Stevens (1946)[1], is just one way that scales are categorized, though it’s mentioned in almost every college-level introduction to statistics. There are actually a variety of other measurement scales, some differing in only obscure details.

The most basic classification of measurement scales involves whether or not the scale defines (1) groups having no mathematical relationship to each other, called grouping scales, or (2) a progression of measurement levels within a group, called progression scales.

Grouping (Nominal) Scales

Grouping scales define groups, which are finite, usually independent, and non-overlapping (discrete). Nominal scales are grouping scales. They represent categories, names, and other sets of associated attributes. None of the levels within a grouping scale have any sequential relationship to any of the other levels. One level isn’t greater than or less than another level.

Examples of properties that would be measured on a grouping scale include:

  • Names—Kyle, Stan, Eric, Kenny
  • Sex—female, male
  • Identification—PINs, product serial numbers
  • Locations—Wolf Creek, Area 51, undisclosed secure location
  • Car styles—sedan, pickup, SUV, limo, station wagon
  • Organization—company, office, department, team

Grouping scales are sometimes subdivided by the number of measurement levels. Discrete scales have a finite number of levels. For example, sex has two levels, male and female. Discrete scales with two levels are also called binary or dichotomous scales. Discrete scales with more than two levels are called categorical scales.

Variables measured on grouping scales can be used for counts and statistics based on counts, like percentages. They are also used to subdivide variables measured on progression scales.


Progression Scales

Progression or continuous scales define some mathematical progression. The number of possible levels may be finite or infinite. They can be limited to integers or use an integer and any number of decimal places after the integer. Ordinal, interval, and ratio scales are all progression scales.

Ordinal Scales

Ordinal scales have levels that are ordered. The levels denote a ranking or some sequence. One measurement may be greater than or less than another. However, the intervals between the measurements might not be constant.

Examples of properties that would be measured on an ordinal scale include:

  • Time—business quarter, geologic period, football quarters
  • Rankings—first place, second place, third place …
  • Thickness—geologic strata, atmospheric layers
  • Survey responses—very good, somewhat good, average, somewhat bad, very bad

Sometimes the intervals between levels of an ordinal scale are so different they can be treated as if they were grouping scales. Consider geologic time. It’s divided into eons, eras, periods, epochs, and ages, but the divisions aren’t the same lengths. Some periods are four times longer than others and the lengths can change as more is learned about the history of Earth. The units of the scale are also different in different parts of the world. Then there’s the Mohs scale of mineral hardness. It consists of ten levels. However, the interval between levels 1 and 8 is about the same as the interval between levels 8 and 9. The interval between levels 9 and 10 is four times greater than the interval between levels 8 and 9. Geologists must be a bunch of really creative people who aren’t bound by convention.

More frequently, the intervals between levels of an ordinal scale are the same, in theory or reality. Rankings, game segments like innings and periods, business quarters and fiscal years, are all examples.


Counts and statistics based on medians and percentiles can be calculated for ordinal scales. This includes most types of nonparametric statistics. However, there are situations in which averages and standard deviations are used. Surveys present one of those situations because the responses can be considered to be either grouping or progression scales depending on how the levels are defined. Say you have a survey question that has five possible responses:

  • Very good
  • Good
  • No opinion
  • Poor
  • Very poor

This is a grouping scale because the No Opinion response is not part of a progression. But if the responses were:

  • Very good
  • Good
  • Fair
  • Poor
  • Very poor

The scale could be recoded as Very Good=5, Good=4, Fair=3, Poor=2, and Very Poor=1 allowing statistical analyses to be conducted. If it were believed that the intervals between levels were not constant, analyses should be limited to counts and statistics based on medians and percentiles. If the intervals between levels were believed to be fairly constant, calculating averages and standard deviations might be legitimate. This is one of the points of contention with Stevens’s categories of scales. A given measurement’s scale might be perceived differently by different users.
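The recoding described above can be sketched as follows. The numeric codes come straight from the text; the list of responses is invented for the example. The point is only to show which statistics are safe for any ordinal scale and which one depends on believing the intervals are roughly constant.

```python
from statistics import mean, median

# Recoding from the text: Very Good=5, Good=4, Fair=3, Poor=2, Very Poor=1.
codes = {"Very good": 5, "Good": 4, "Fair": 3, "Poor": 2, "Very poor": 1}

# Hypothetical survey responses, invented for this sketch.
responses = ["Good", "Very good", "Fair", "Good", "Poor", "Very good", "Good"]

scores = [codes[r] for r in responses]

# Safe for any ordinal scale: counts and the median.
counts = {label: responses.count(label) for label in codes}
middle = median(scores)

# Only defensible if the intervals between levels are roughly constant.
average = mean(scores)

print(counts)
print("median:", middle)
print("mean:", round(average, 2))
```

If the median and the mean tell noticeably different stories, that is a hint to look harder at whether the intervals really are constant.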

Ratio Scales

Ratio scales are the top end of progression scales. Their levels consist of integers followed by any number of decimal places. Ratios and arithmetic operations are meaningful. Zero is a constant and a reference to an absence of the attribute the scale measures.

Measurements made by most kinds of meters or other types of measuring devices are probably on ratio scales. Examples of variables measured on ratio scales include:

  • Concentrations, densities, masses, and weights
  • Durations in seconds, minutes, hours, or days
  • Lengths, areas, and volumes

Any type of statistic can be calculated for variables measured on a ratio scale.

Other Scales of Measurement

Understanding different types of measurement scales can help you select appropriate techniques for an analysis, especially if you’re a statistical novice. Stevens’s classification of scales works for many applications but it should be viewed as guidance rather than gospel. Interval scales in particular are an exception to the progression of scales from ordinal to ratio scales, and there are other exception scales as well. The following sections describe interval scales and a few scales that don’t quite fit into Stevens’s taxonomy.

Interval Scales

Interval measurements are ordered like ordinal measurements and the intervals between the measurements are equal. However, there is no natural zero point and ratios have no physical meaning. The classical example of an interval scale is temperature in degrees Fahrenheit or Centigrade. The intervals between Fahrenheit degrees are equal, but the zero point is arbitrary (water freezes at 32 degrees Fahrenheit and at 0 degrees Centigrade). Elevation is sometimes considered to be an interval scale because the choice of sea level as the zero elevation is arbitrary. Time can also be thought of as an interval scale.
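A quick way to convince yourself that ratios have no physical meaning on an interval scale is to convert the same two temperatures to different units and watch the ratio change. The sketch below uses the standard Fahrenheit-to-Celsius conversion; the two temperatures are arbitrary values chosen for the example.

```python
def f_to_c(deg_f):
    """Convert degrees Fahrenheit to degrees Celsius (Centigrade)."""
    return (deg_f - 32.0) * 5.0 / 9.0

def f_to_k(deg_f):
    """Convert degrees Fahrenheit to kelvins, which have a true zero."""
    return f_to_c(deg_f) + 273.15

# Arbitrary example temperatures in degrees Fahrenheit.
warm_f, cool_f = 80.0, 40.0

# Differences survive a change of units (up to the constant factor 5/9) ...
diff_f = warm_f - cool_f
diff_c = f_to_c(warm_f) - f_to_c(cool_f)

# ... but ratios do not, because each scale puts zero somewhere different.
ratio_f = warm_f / cool_f
ratio_c = f_to_c(warm_f) / f_to_c(cool_f)
ratio_k = f_to_k(warm_f) / f_to_k(cool_f)

print(diff_f, round(diff_c, 2))
print(ratio_f, round(ratio_c, 2), round(ratio_k, 3))
```

Eighty degrees looks "twice as hot" as forty in Fahrenheit and "six times as hot" in Centigrade. Only the kelvin ratio, measured from a true zero, is physically meaningful.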

Some statisticians consider log-interval scales of measurement, in which the intervals between levels are constant in terms of logarithms, to be a subset of interval scales. Earthquake intensity (Richter and Mercalli scales) and pH are examples of log-interval scales.

Statistics for ordinal scales and statistics based on means, variances, and correlations can be calculated for interval scales.



Counts

Counts are like ratio scales in that they have a zero point, constant intervals, and meaningful ratios, but there are no fractional units. Any statistic that produces a fractional count is meaningless. The classic example of a meaningless count statistic is that the average family includes 2.3 children. Counts are usually treated as ratio scales, but the result of any calculation is rounded off to the nearest whole unit.

Restricted-Range Scales

A constrained or restricted-range scale is a type of scale that is continuous only within a finite range. Probabilities are examples of constrained scales because any number is valid between the fixed endpoints of 0 and 1. Numbers outside this range are not possible. Percentages can be considered constrained or unconstrained depending on how the ratio is defined. For example, percentages for opinion polls are restricted to the range 0 to 100 percent. Percentages that describe corporate profits can be negative (i.e., losses) or virtually infinite (as in windfall profits). Restricted-range scales must be handled with special statistical techniques, such as logistic regression, that account for the fixed scale endpoints.
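To give a flavor of how such techniques cope with fixed endpoints, here is a sketch of the logit transformation that sits underneath logistic regression. It is not a regression fit, just the transformation and its inverse, and the function names are chosen for this example.

```python
import math

def logit(p):
    """Map a proportion strictly between 0 and 1 onto an unbounded scale."""
    return math.log(p / (1.0 - p))

def inverse_logit(x):
    """Map a real number back into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))

# Proportions near the endpoints get stretched far apart on the logit scale.
for p in (0.01, 0.25, 0.50, 0.75, 0.99):
    x = logit(p)
    print(p, round(x, 3), round(inverse_logit(x), 2))
```

Working on the transformed scale and converting back means a prediction can never land below 0 or above 1. The endpoints themselves (exactly 0 or 1) cannot be transformed, which is one reason these scales need special handling.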

Cyclic Scales

Cyclic scales are scales in which sets of units repeat.

Repeating Units

Some cyclic scales consist of repeating levels for measuring open-ended quantities. Day of the week, month of the year, and season are examples. Time isn’t the only dimension with repeating scales, either. Musical scales, for instance, repeat yet have very different properties compared to time scales.

Repeating scales can be analyzed by (1) treating them as an ordinal scale, (2) ignoring the repeating nature of the measure and transforming them into non-repeating linear units, such as day 1, day 2, and so on, or (3) using a specialized statistical technique. The objective of the statistical analysis dictates which approach should be used. The first approach might be used to identify seasonality or determine if some measurement is different on one day or month rather than another. For example, this approach would be used to determine if work done on Fridays had higher numbers of defects than work done on other days. The second approach might be used to examine temporal trends. The third approach is used by statisticians who want to show off.

Orientation Scales

Orientation scales are a special type of cyclic scale. Degrees on a compass, for example, are a cyclic scale in which 0 degrees and 360 degrees are the same. Special formulas are required to calculate measures of central tendency and dispersion on circles and spheres.
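As one example of such a special formula, the usual circular mean treats each bearing as a point on a unit circle, averages the points, and converts the result back to an angle. The sketch below is a minimal version; the function name and the two bearings are made up for the example.

```python
import math

def circular_mean_degrees(angles):
    """Mean direction of compass bearings, in degrees from 0 up to 360."""
    # Treat each bearing as a point on the unit circle and add up the points.
    sin_sum = sum(math.sin(math.radians(a)) for a in angles)
    cos_sum = sum(math.cos(math.radians(a)) for a in angles)
    # atan2 recovers the direction of the summed (equivalently, averaged) point.
    return math.degrees(math.atan2(sin_sum, cos_sum)) % 360.0

# Hypothetical readings on either side of north.
bearings = [350.0, 10.0]
naive = sum(bearings) / len(bearings)
circular = circular_mean_degrees(bearings)

print(naive)      # the arithmetic mean points due south
print(circular)   # the circular mean points (very nearly) due north
```

The mean direction is undefined when the points cancel out, as with bearings of 0 and 180 degrees, so check for that case before trusting the result.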

Concatenated Numbers and Text

Concatenated numbers and text are not scales in the true sense of variable measurement, but they are part of every data analysis in one way or another. Concatenated numbers contain multiple pieces of information, which must be treated as a nominal scale unless the information can be extracted into separate variables. Examples of concatenated numbers include social security numbers, telephone numbers, sample IDs, date ranges, latitude/longitude, and depth or elevation intervals. Likewise, labels can sometimes be parsed into useful data elements. Names and addresses are good examples.

Time Scales

Time scales have some very quirky properties. You might think that time is measured on a ratio scale given its ever finer divisions (i.e., hours, minutes, seconds), yet it doesn’t make sense to refer to a ratio of two times any more than the ratio of two location coordinates. The starting point is also arbitrary. This sounds like an interval scale.


Time is like a one-dimensional location coordinate but it can also be linear or cyclic. Year is linear, so it’s at least an ordinal scale. For example, 1953 happened once and will never recur. Some time scales, though, repeat. Day 8 is the same as day 1. Month 13 is the same as month 1. So, time can also be treated as being measured on a nominal scale.

Time units are also used for durations, which are measured on a ratio scale. Durations can be used in ratios, they have a starting point of zero, and they don’t repeat (eight days aren’t the same as one day).

Time formats can be difficult to deal with. Most data analysis software offers a dozen or more different formats for what you see. Behind the spreadsheet format, though, the database has a number, which is the distance the time value is from an arbitrary starting point. Convert a date-time format to a number format, and you’ll see the number that corresponds with the date. The software formatting allows you to recognize values as times while the numbers allow the software to calculate statistics. This quirk of time formatting also presents a potential for disaster if you use more than one piece of software because different programs use different starting dates for their time calculations. Always check that the formatted dates are the same between applications.
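The sketch below shows how the same calendar date turns into two different numbers under two different starting points. The starting dates used here are assumptions for illustration: December 30, 1899 is the base conventionally used to convert spreadsheet serial numbers in the 1900 date system, and January 1, 1970 is the base for Unix-style timestamps. Check the documentation for your own software before relying on either.

```python
from datetime import date

# Two commonly used starting points (assumptions for this sketch).
SPREADSHEET_DAY_ZERO = date(1899, 12, 30)  # conventional base for 1900-system serials
UNIX_DAY_ZERO = date(1970, 1, 1)           # base used by Unix-style timestamps

def days_since(start, day):
    """Number of days between an arbitrary starting point and a date."""
    return (day - start).days

d = date(2016, 6, 30)
spreadsheet_number = days_since(SPREADSHEET_DAY_ZERO, d)
unix_number = days_since(UNIX_DAY_ZERO, d)

print(spreadsheet_number, unix_number)
# Same calendar date, different numbers: the gap is the distance
# between the two starting points.
print(spreadsheet_number - unix_number)
```

If you move the raw numbers between programs without converting them, every date shifts by that gap, which is exactly the disaster to check for.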

Location Scales

Just as there is time and duration, there is location and distance (or length), but there are a few twists. Time is one-dimensional, at least as we now know it. Distance can be one-, two-, or three-dimensional. Distance can be in a straight line (“as the crow flies”) or along a path (such as driving distance). Distances are usually measured in English inches, feet, yards, and miles or metric centimeters, meters, and kilometers. Locations, though, are another matter. Defining the location of a unique point on a two-dimensional surface (i.e., a plane) requires at least two variables. The variables can represent coordinates (northing/easting, latitude/longitude) or distance and direction from a fixed starting point. Of the coordinate systems, only the northing/easting scheme is a simple, non-concatenated scale that can be used for classical statistical analysis. However, this type of scale is usually not used for published maps, which can be a problem because virtually all environmental data are inherently location-dependent and multidimensional. Thus, coordinate systems usually have to be converted from one to the other. Geostatistical applications, for example, are based on distance and direction measurements but these measurements are calculated from spatial coordinates.
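Calculating distance and direction from northing/easting coordinates is simple enough to sketch. The example below assumes both points are in the same planar grid and the same units; the function name and coordinates are invented for illustration. Latitude/longitude would need a different formula.

```python
import math

def distance_and_bearing(easting1, northing1, easting2, northing2):
    """Straight-line distance and compass bearing from point 1 to point 2."""
    d_east = easting2 - easting1
    d_north = northing2 - northing1
    distance = math.hypot(d_east, d_north)
    # Compass bearings run clockwise from north, so the east offset is the
    # first argument to atan2 and the north offset is the second.
    bearing = math.degrees(math.atan2(d_east, d_north)) % 360.0
    return distance, bearing

# Hypothetical sample locations in the same planar grid (units: meters).
dist, bearing = distance_and_bearing(1000.0, 2000.0, 1300.0, 2400.0)
print(dist, round(bearing, 1))
```

Note that the distance comes out on a ratio scale while the bearing is an orientation scale, so the two halves of the answer need different statistics.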

At least three variables are needed to define a unique point location in a three-dimensional volume, so a variable for depth (or height) must be added to the location coordinates. Often, however, a property of an object occurs over a range of depths (or heights or elevations) rather than a finite point. Unfortunately, depth range is a concatenated number (e.g., 2-4 feet). It’s always better to use two variables to represent starting depth and ending depth. Thus, it may take four variables to define an environmental space, such as the sampled interval of a well or soil boring.
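Splitting a concatenated depth range into separate starting and ending depths can be sketched with a regular expression. This assumes labels shaped like the "2-4 feet" example, with a plain hyphen between the numbers; the function name and the sample labels are hypothetical.

```python
import re

def split_depth_range(label):
    """Split a label such as '2-4 feet' into (start, end, units)."""
    pattern = r"\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*(\w*)\s*"
    match = re.fullmatch(pattern, label)
    if match is None:
        raise ValueError(f"not a depth range: {label!r}")
    start, end, units = match.groups()
    return float(start), float(end), units

# Hypothetical boring log entries.
for label in ["2-4 feet", "10 - 12.5 feet"]:
    start, end, units = split_depth_range(label)
    print(start, end, units, "thickness:", end - start)
```

Once the two depths are separate variables, derived quantities such as interval thickness or midpoint depth are a single subtraction or average away.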

Selecting Scales

In the simplest taxonomy, almost all scales act either to group data or represent the progression in a variable’s attribute, whether simple, ordinal-scale levels or more expansive ratio-scale levels. One way to view these differences is this: nominal (grouping) scales are like stone outcrops, randomly scattered around a garden area. Ordinal scales are like garden steps. You can only be on a step not between steps, and the steps lead progressively upward or downward. There may be many steps or just a few. Ratio scales are like a garden path or ramp. You can be anywhere along the path, at high levels or low. You can move forward or back, in small or large intervals.

Somewhere between those simple, discrete ordinal scales and the finely-divided ratio scales, however, are quite a few types of scales that don’t meet either definition. Just ask yourself these questions to understand the scale you will be dealing with:

  • Does the scale represent a progression of values? If not, the scale is a grouping scale.
  • Are the scale intervals approximately equal? If not, the scale may be treated as a grouping scale.
  • Is there a constant zero (or other reference point) representing the absence of the attribute being measured? If not, the scale may be treated like an interval scale.
  • Is the range of the scale limited in any way? Is there a scale minimum or maximum? Are negative numbers prohibited? If so, you may have to use special statistical approaches to analyze data measured on the scale.
  • Are the scale values cyclic or repeating? If so, you may have to use special statistical approaches to analyze data measured on the scale.
  • Are ratios and other mathematical operations that produce fractional scale levels permissible? If so, you have a ratio scale.

Some people think that an attribute can be measured in only one way. This is untrue more often than it is not. Consider the example of color. To an auto manufacturer, color is measured on a nominal scale. You can buy one of their cars painted red or blue or silver or black. To a gemologist, the color of a diamond is graded on an ordinal scale from D (colorless) to Z (light yellow). To an artist, color is measured on an interval scale because their color wheel contains the sequence: red, red-orange, orange, orange-yellow, yellow, yellow-green, green, green-blue, blue, blue-violet, violet, and violet-red. To a physicist, colors are measured by a continuous spectrum of light frequencies, which employ a ratio scale.

Using a different scale than what might be the convention can provide advantages. Consider this example. Soil texture is usually measured on a nominal scale that defines groups such as loam, sandy loam, clay loam, and silty clay. The information can be made quantitative by recording the percentages of sand, silt, and clay (which define the texture) instead of just the classification. The nominal-scale measure is much easier to collect in the field and is one variable to manage rather than three. On the other hand, the progression-scale measures can be analyzed in more ways. Correlating the clay content of a soil to crop growth, soil moisture, or a pollutant concentration can be done only if soil texture is measured on a progression scale.


If a choice can be made on which type of scale to use, use a ratio scale. Ratio scales are usually best because they provide the most information and can be rescaled easily as ordinal scales. For example, many sports organize contestants using weight class, measured on an ordinal scale, instead of weight, which is measured on a ratio scale. Weight is still measured at weigh-in using a ratio scale but is converted to the ordinal-scale weight classes for simplicity. In contrast, it’s usually not possible to upgrade an ordinal scale to a ratio scale unless the ordinal scale has equal intervals and calculation of percentages or z-scores makes sense. You couldn’t just estimate a contestant’s weight of 178.2 pounds from a weight class of 170-185 pounds.
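Rescaling a ratio-scale weight to an ordinal-scale weight class is a one-line lookup, which is part of why it pays to collect the ratio-scale measurement in the first place. In the sketch below, the class names and upper limits are made up for illustration; they are not any particular sport’s rules.

```python
import bisect

# Hypothetical weight classes in pounds; names and limits are invented for this sketch.
upper_limits = [155, 170, 185, 205]
class_names = ["lightweight", "welterweight", "middleweight",
               "light heavyweight", "heavyweight"]

def weight_class(pounds):
    """Convert a ratio-scale weight to an ordinal-scale weight class."""
    # bisect_left finds the first class whose upper limit is >= the weight;
    # anything above the last limit falls into the final, open-ended class.
    return class_names[bisect.bisect_left(upper_limits, pounds)]

print(weight_class(178.2))
print(weight_class(184.9))
```

Both contestants land in the same class, and nothing in the class label lets you work backward to 178.2 or 184.9 pounds. The conversion only runs one way.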

If you can’t measure an attribute on a ratio or interval scale, think about how an ordinal scale could be applied. You can almost always devise an ordinal scale to characterize an attribute; you just have to be creative. Think of opinion surveys. If you can measure opinions, you can measure anything.

[1] Stevens, S. S. 1946. On the theory of scales of measurement. Science v. 103, No. 2684, p. 677–680.




Searching for Answers

Need to find something out? Just Google it. Now that Google is a verb as well as a noun, it’s easy. But …

It Hasn’t Always Been Easy

Adults under 30, Millennials, grew up with smartphones, laptop and tablet computers, and the Internet. As a group, they’ve never known a time when technology wasn’t integral to their existence. For those of us who finished school before the 1980s, personal computers were a rarity and the Internet was only then being developed for the military-industrial complex. Browsers didn’t appear until the early 1990s. You couldn’t buy a book from Amazon until 1995.

So, it hasn’t always been easy to find information. For most students, searching for information before 1980 usually involved a trip to the library. There, you would thumb through the 3×5” cards in the drawers of the card catalog looking for information by keywords. You would write down the title of the book referenced on a card along with its location classification (Dewey, Library of Congress). Then you would go to the location in the book stacks and retrieve the book, unless it was already in use, checked out, misplaced, or stolen. Finding enough information to fulfill a need might take hours or days or longer. Then you had to lug the books to a place where you could read them, extract the information you needed, and write it all down on paper. Needless to say, things have changed for the better. Now you can enter your keywords into an Internet search engine, and in a fraction of a second have references to hundreds, if not hundreds of thousands, of websites, articles, blogs, books, images, and presentations. You can bookmark sites to read later or just save the relevant information to the cloud. That process might take minutes and will return more relevant information than you could ever access a generation earlier.


What People Looked For

Not only can people search more information sources faster than ever before, but now Big Business and Big Government collect data on all those searches. For example, the blog’s site statistics keep track of the number of visitors to the Stats with Cats blog site, what country they accessed the blog from, the search terms they used to find the site, and the blogs they visited. This is useful because it reveals what people are looking for, at least those people who ended up at the Stats with Cats blog.

Here are the frequencies for pertinent search terms from May 2010 through June 2016 and the associated word cloud (produced at; works best in IE).

Perhaps not surprisingly, the most common terms are associated with topics students would search if they were confronted with taking their first statistics class – statistics or stats, school or class, graph or chart, data, variable, and correlation. This may reflect the overpowering anticipation of learning about some of the fascinating aspects of statistical thinking or, more likely, the fear of number crunching.

People searching for “report” are probably trying to figure out how to convert their statistical results into some meaningful story. How to Write Data Analysis Reports is probably much more than they might have expected.

People searching for the number 30 are looking for the reason they were told that their statistical analysis must have at least 30 samples. They might not like the answer at 30 Samples. Standard, Suggestion, or Superstition? but at least they’ll understand where it started, why they keep hearing it, and why the real answer is so unsatisfying.

What They Found

There were over 76,000 referrals from 255 sites, of which 97% came from Google. Bing and Facebook each contributed about 1%. Five Things You Should Know Before Taking Statistics 101 was viewed over 100,000 times in five and a half years. Secrets of Good Correlations had nearly 70,000 views in six years.


The following table summarizes the views and the views per year for 56 Stats with Cats blogs.



Blog | Total Views | Years Available | Views per Year
Five Things You Should Know Before Taking Statistics 101 | 109,329 | 5.5 | 19,878
Secrets of Good Correlations | 69,212 | 6.1 | 11,377
How to Write Data Analysis Reports | 32,253 | 3.5 | 9,774
How to Tell if Correlation Implies Causation | 10,552 | 1.5 | 7,035
30 Samples. Standard, Suggestion, or Superstition? | 18,151 | 6.1 | 2,984
Why Do I Have To Take Statistics? | 13,645 | 6.1 | 2,243
Ten Fatal Flaws in Data Analysis | 13,618 | 6.1 | 2,239
Fifty Ways to Fix your Data | 11,067 | 6.1 | 1,819
Six Misconceptions about Statistics You May Get From Stats 101 | 8,011 | 5.5 | 1,457
Regression Fantasies | 7,117 | 5.5 | 1,294
The Right Tool for the Job | 5,586 | 6.1 | 918
The Best Super Power of All | 3,511 | 4.5 | 780
Why You Don’t Always Get the Correlation You Expect | 1,450 | 2.5 | 580
Looking for Insight through a Window | 224 | 0.5 | 448
A Picture Worth 140,000 Words | 2,292 | 5.5 | 417
The Heart and Soul of Variance Control | 2,248 | 6.1 | 370
O.U..T…L….I……E……..R………………..S | 907 | 2.5 | 363
The Five Pursuits You Meet in Statistics | 2,005 | 6.1 | 330
Ten Ways Statistical Models Can Break Your Heart | 144 | 0.5 | 288
The Zen of Modeling | 1,731 | 6.1 | 285
The Foundation of Professional Graphs | 1,226 | 4.5 | 272
Assuming the Worst | 1,550 | 6.1 | 255
It’s All Relative | 1,303 | 5.5 | 237
There’s Something About Variance | 1,424 | 6.1 | 234
The Measure of a Measure | 1,180 | 6.1 | 194
Purrfect Resolution | 1,167 | 6.1 | 192
The Data Scrub | 1,145 | 6.1 | 188
Limits of Confusion | 1,030 | 5.5 | 187
Try This At Home | 1,133 | 6.1 | 186
Grasping at Flaws | 1,009 | 5.5 | 183
Consumer Guide to Statistics 101 | 984 | 5.5 | 179
It’s All Greek | 1,058 | 6.1 | 174
It was Professor Plot in the Diagram with a Graph | 1,028 | 6.1 | 169
Weapons of Math Production | 934 | 6.1 | 154
Polls Apart | 819 | 5.5 | 149
You’re Off to Be a Wizard | 881 | 6.1 | 145
Samples and Potato Chips | 866 | 6.1 | 142
Time Is On My Side | 865 | 6.1 | 142
You Can Lead a Boss to Data but You Can’t Make Him Think | 833 | 6.1 | 137
Types and Patterns of Data Relationships | 323 | 2.5 | 129
The Santa Claus Strategy | 741 | 6.1 | 122
It’s All in the Technique | 693 | 6.1 | 114
The Data Dozen | 603 | 5.5 | 110
Becoming Part of the Group | 589 | 5.5 | 107
Reality Statistics | 618 | 6.1 | 102
Aphorisms for Data Analysts | 524 | 5.5 | 95
Ten Tactics used in the War on Error | 520 | 5.5 | 95
The Seeds of a Model | 478 | 6.1 | 79
Ockham’s Spatula | 389 | 5.5 | 71
Statistics: a Remedy for Football Withdrawal | 384 | 5.5 | 70
Many Paths Lead to Models | 370 | 6.1 | 61
Dealing with Dilemmas | 283 | 5.5 | 51
Perspectives on Objectives | 251 | 6.1 | 41
Tales of the Unprojected | 241 | 6.1 | 40
Getting the Right Answer | 197 | 5.5 | 36
Resurrecting the Unplanned | 202 | 6.1 | 33

The message these statistics are sending appears to be that the Stats with Cats blog attracts introductory students who don’t know what to expect from their statistics class or need help in understanding challenging statistical concepts. In contrast, experienced students are acquainted with more statistics professors and students. They own more statistics textbooks and have visited more educational web sites. And as a consequence, they search for more specific statistical terms, like tolerance limits and autocorrelation, that beginners wouldn’t know. It’s ironic, then, that Stats with Cats was written for students who had completed Statistics 101 and were looking for some help in applying what they had learned. Interesting … sometimes statistical analyses reveal things you don’t expect.


