If you’ve never created a statistical model before, you might be surprised to find that the process involves a lot more than statistics. It’s like traveling. You don’t start by thinking about your transport, the plane, train, or bus you might take. You start by knowing where you are and where you want to go. Only then do you create your itinerary, select a carrier, buy a ticket, pack your belongings, and make the trip. Likewise, modeling starts with the phenomenon you’re trying to model and ends with the model. Between those two points, though, there are many possible routes.
For example, after studying a phenomenon, you might decide how you would use a model, and from that, decide what the model should focus on, what data you’ll need, what statistical method you’ll use, and how you’ll calibrate the model. Or, you may be given a dataset by a client, and from the samples and variables, you determine what models could be created, and what statistical methods would be required. Sometimes, you decide what you want the model to do, but the variables would be too difficult or costly to collect, so you have to revise the model specifications and reconsider the samples and variables. Similarly, you might find that the statistical method you want to use will require different data or model boundaries, so you have to reconsider your plans. It’s not uncommon to iterate through these considerations several times before you’re ready to advance to the actual modeling process.
As you model more and more phenomena, you’re likely to take these paths and many more. Each excursion through the maze of modeling elements will be a new and different adventure for you to learn from. If you get lost, find someone to give you directions. Here are a few to get you started.
The first thing you’ll have to do is to think about the phenomenon you want to model. This may sound trivial, but it’s not. Even if you were assigned the work by your boss or academic advisor, you’re going to have to make a lot of decisions on your own. If they were going to make all the decisions, they wouldn’t have given the project to you.
The nature of the phenomenon has to do first with how tangible the phenomenon is. Is the phenomenon an object that can be seen and touched? Is it a process that can be watched and interacted with or a behavior that can be observed but not necessarily manipulated? Is it a condition that can be monitored, or if not visible, at least measurable (like radioactivity)? Or, is it an opinion that can’t be seen or touched, and may not even be measurable directly?
The nature of the phenomenon also has to do with how changeable the phenomenon is. Is it something that is fixed and unchangeable? If it changes, what is the rate of change? Is it too slow or too fast to be observable? Does the phenomenon exist in states of equilibrium and disequilibrium? Can changes be manipulated by an experimenter?
Thinking about the nature of the phenomenon will help you narrow your options for what form might be appropriate for the model. For example, would it be possible to build a physical model or will the model have to be a less tangible written model, blueprint, computer application, or mathematical equation? It’s not uncommon for several types of models to be developed to display, manipulate, or substitute for a single phenomenon. Automakers, for example, make many types of models of the automobiles they sell, from the styling, to the performance, to the marketing.
Model Use and Specifications
After the phenomenon, you’ll need to think about what you want to do with the model and how it will be designed. You can use a model to:
- Display—use the model to describe or characterize the sample or the population.
- Substitute—use the model in place of the phenomenon, such as for prediction.
- Manipulate—use the model to explain aspects of the phenomenon.
As a point of reference, most models involve the simple display of descriptive information. If you plan to use them for substitution or manipulation, you’ll have to know more about the phenomenon, more about modeling, and more about statistics.
Whatever your planned use, you’ll have to think about how you want to approach the modeling. Three factors you ought to consider are the viewpoint you’ll take to develop the model, the level of detail of the model, and the boundaries of the model relative to the phenomenon.
Your viewpoint in modeling is how you plan to approach the effort, that is, either from the top down or the bottom up. A top-down viewpoint will require you to understand the big picture, things like what the phenomenon is associated with. This viewpoint is more correlative and is commonly used in statistical models, especially predictive models. A bottom-up viewpoint will require you to understand the details, the conditions that cause or affect the phenomenon. This viewpoint is more deterministic and is commonly used in theoretical models and in statistical models for explanation.
Top-down models usually don’t require as many variables as bottom-up models so long as they are the right variables. The problem with top-down models is that sometimes relationships appear to be oversimplified or obscure. Why should skirt length predict stock prices, for example? It makes no sense, but a high correlation has been found between the two measures.
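To see how two unrelated measures can end up highly correlated, here is a minimal sketch in Python. The numbers are invented for illustration; they are not real hemline or stock data. The variable names and the `pearson_r` helper are also just assumptions for this example. Both series simply drift upward over time, and that shared drift is enough to produce a large correlation coefficient.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: each series is a time trend plus its own unrelated wiggle.
years = range(20)
hemline = [40 + 0.5 * t + 0.5 * math.sin(t) for t in years]
stock_index = [100 + 3.0 * t + 3.0 * math.cos(2.3 * t) for t in years]

r = pearson_r(hemline, stock_index)
print(f"r = {r:.2f}")  # high, because both series share a time trend
```

Nothing in the code links one series to the other. A top-down model built on a relationship like this might predict well for a while, but it gives you no reason to expect the relationship to hold.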
Bottom-up models tend to require more variables to characterize all the facets of a phenomenon. Larger numbers of variables, in turn, require greater levels of effort than for top-down models. Furthermore, many of the details included in a bottom-up model are often found not to have a significant impact on the overall model. Hence, bottom-up modeling tends to be labor intensive and inefficient, but in the end, at least you know how everything fits together.
Some modelers take their viewpoint as an extension of their own personalities. Big picture people think of a phenomenon in terms of general concepts, mechanisms, trends, and patterns and tend to model from the top down. They don’t care if their favorite team has weaknesses as long as the team’s winning percentage is high. Details people think of discrete parts or elements that make up a phenomenon, and tend to model from the bottom up. They believe the whole is equal to the sum of the parts. Their team could be in first place, but they’re concerned about one player who is in a slump.
Often both viewpoints work equally well for modeling a phenomenon. Sometimes, though, one or the other viewpoint will work better, be easier, or even be the only feasible approach. For example, say you want to model the performance of an automobile. Using a top-down viewpoint, you might focus on acceleration, gas mileage, top speed, and so on. You might be able to model how the automobile will perform under certain driving conditions, but you won’t learn anything about how the components of the automobile work together. Using a bottom-up viewpoint, you might focus on number of cylinders, gear ratios, timing, and so on. You might be able to model how changing a component could boost or diminish its function, but you won’t know if the change would have the same effect on the automobile’s overall performance. You have to be sure that your viewpoint is appropriate for how you plan to use the model or else the model won’t be useful. At every step in your modeling effort, ask yourself, “Will I be able to do what I need to do with the results of the model?”
Every phenomenon complex enough to have to be modeled assuredly has many levels of detail. You have to decide how much detail to put in your model, especially if your viewpoint is bottom-up. Still, there are practical limits imposed by restrictive budgets and schedules or by what is known about the phenomenon. For example, if you want to model the performance of an automobile, do you concentrate on the engine or also consider aerodynamics, steering, braking, and other components? If you concentrate on the engine, do you focus on the internal combustion components or also consider the pollution control devices, the electrical system, and other components? If you concentrate on the internal combustion components, do you focus on the pistons or also consider the spark plugs and the fuel?
Where will your model end? This is easy to visualize with location and time; you can draw a line on a map or block out dates on a calendar. Many phenomena aren’t so easy to isolate, though. Processes, in particular, often use inputs from other processes or contain subprocesses that can’t be isolated. In modeling the performance of an automobile, for example, do you include different makes (e.g., Ford, Honda), different models (e.g., sedans, SUVs), different options (e.g., engines, transmissions), different drivers, different types of road conditions, and so on?
These determinations will affect everything else you do.
Other Model Specifications
There are many other things about your model that might have a bearing on the variables and samples you select, the statistical methods you use, and how you go about optimizing the model. Here are a few specifications that may be relevant to your model:
- Users—Who will be using the model? If it’s just you, the model may not need to have a polished appearance and extensive documentation. If others will be using the model, though, consider that audience. You may not have to build a comprehensive user interface, but you’ll at least need to try to make it understandable and sufficiently documented. Don’t try to make it idiot-proof; it’s not worth the effort. God is just too good at making idiots.
- Frequency of Use—If the model will be used on a recurring basis, make sure there will be some provision for you or some other qualified individual to review the model periodically to ensure it is being used correctly and is still appropriate for representing the phenomenon.
- Accuracy and Precision—As a general rule, statistical models tend to be fairly accurate but never as precise as you need them to be. Have some notion of the accuracy and precision you want. That way you’ll know when either you’re done or it’s time to quit. A good way to specify the precision you want is to start from a gut feeling and specify the precision as a percentage, for example, ±5 percent or ±10 percent. Then you’ll have to control variance and manipulate the number of samples and the confidence level so that a confidence interval is close to your target precision.
- Limit of Complexity—Some models were not meant to be. If you can’t fit the model to the data, you have to be prepared to call it quits. In a way, this is equivalent to a Do Not Resuscitate order in medicine, and likewise, it can be a sensitive subject. It’s usually easier to create new variables or try some other statistical manipulation than it is to give the bad news, and the bill, to the client.
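The trade-off described under Accuracy and Precision, between variance, number of samples, confidence level, and target precision, can be roughed out before you collect anything. The sketch below is one possible way to do it, not a prescribed method. It assumes the quantity of interest is a mean, that a normal approximation is adequate, and that you have a prior guess at the coefficient of variation (standard deviation divided by the mean). The function names are made up for this example.

```python
import math
from statistics import NormalDist

def z_value(confidence):
    """Two-sided standard normal critical value for a confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def samples_needed(cv, target_precision, confidence=0.95):
    """Approximate number of samples so that the confidence interval
    half-width is about target_precision, expressed as a fraction of the mean.

    cv is an assumed coefficient of variation (std / mean).
    Uses the normal approximation: n = (z * cv / precision) ** 2.
    """
    return math.ceil((z_value(confidence) * cv / target_precision) ** 2)

def achieved_precision(cv, n, confidence=0.95):
    """Approximate relative half-width of the interval for n samples."""
    return z_value(confidence) * cv / math.sqrt(n)

# Assumed cv of 0.3: compare a +/-5 percent target with a +/-10 percent target.
print(samples_needed(0.3, 0.05))   # 139
print(samples_needed(0.3, 0.10))   # 35
print(samples_needed(0.3, 0.05, confidence=0.90))  # fewer samples at lower confidence
```

With these assumed inputs, halving the target from ±10 percent to ±5 percent roughly quadruples the number of samples. For small sample counts a t-based interval would be wider than this approximation suggests, so treat the result as a planning figure only.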
So, those are some of the issues you’ll want to consider when you build a statistical model. There’s much more to think about, of course, especially when you start collecting the provisions for your model—the samples, variables, and data. Now to be candid, you’ll give some of these topics only a few nanoseconds of thought before you jump into the maelstrom of model building. Some you’ll think about constantly throughout your modeling effort. Some will just be what they turn out to be. Model building is an adventure. Every journey is unique so savor the experiences.