In theory, if you have the free time, you can calculate any statistic you might need using nothing more than a pencil and paper. After all, it’s just matrix mathematics. With a lot of data or a complicated procedure, though, you might need a lot of free time. A generation ago, that’s how most statistics were calculated. Most people didn’t have computers, or calculators for that matter. Slide rules … maybe. Now, there is an abundance of hardware and software to ease the tedium. Having a statistician’s version of Norm Abram’s workshop to use actually makes analyzing data a lot of fun.
Whether you’re planning a career in statistics or just looking to analyze your current dataset, you’re going to need software to do the calculations. Yes, there are some people who still calculate descriptive statistics manually, but this practice is so prone to errors that it’s only applied to very small datasets. And yes, there are some people who develop their own statistical routines, usually with R, a programming language for statistics available for free under the GNU General Public License, or with mathematical software like MATLAB, Maple, and Mathematica. Unless you’re a mathematical statistician developing a new statistical technique, though, you won’t need to take this approach if you don’t want to. There’s plenty of software available. All you need to know is the kind of statistical analyses you’re likely to use and your price range.
Software for General Statistics
With a few exceptions, almost all of the statistical software you’ll find is geared to the most common types of statistical analysis, including descriptive statistics, hypothesis testing, correlation and regression, and analysis of variance. Software used for statistical analysis can be grouped into five categories:
- Web-based Calculators—Web sites that perform simple statistical calculations can be found at statpages.org/. This is the low end of cost, but also usability. You usually have to enter your data and edit it manually, so it’s not really suitable for production work.
- Spreadsheets—You probably already have a copy of Microsoft Excel or some other spreadsheet software on your computer. If you are a beginner at data analysis, you’ll find that you can accomplish most of what you want to do using spreadsheet software. Advanced data analysis may be more of an issue, though. Some statisticians advise against using spreadsheet software, particularly Excel, citing three reasons. First, Excel doesn’t do some calculations and graphs that statistical packages do. Well, of course it doesn’t. It’s a spreadsheet program that sells for less than $200 (by itself, not part of Office) compared to statistical packages that cost ten times as much. Big deal. Second, Excel’s calculated probabilities are incorrect, reportedly in the third decimal place. OK, but if you would base a decision solely on whether a probability is 0.051 instead of 0.049, you really don’t understand the nature of statistical testing (more on this in another blog). And third, Excel’s random number generators are not of research quality. Yup, so if you’re planning to do Monte Carlo simulations with Excel … well, don’t (not necessarily because your answer will be wrong as much as because some people will think it is wrong).
- Basic Statistical Software—This category includes software that is used mainly for less sophisticated types of statistical analysis. Most can be purchased for less than about $500. Key examples include StatsDirect, InStat, Analyse-it, and Assistat.
- Intermediate Statistical Software—This category includes software that can be used for many types of statistical analysis except some of the more sophisticated techniques like multivariate analysis. Most but not all are a single module and cost less than about $1,000. Examples include NCSS, Statistix, Costat, Origin, Prostat, Soritec, MVSP, and Simstat.
- Major Statistical Packages—This category includes software that can be used for a variety of purposes. Most have a base module and a variety of optional add-on modules. They are usually purchased through annual licenses specifying a number of users, and cost more than about $1,000 (in some cases, way over). Some of the major packages like SAS and SPSS have been around since the mainframe days of the 1960s. Others like Statistica are products of the 1980s development of personal computers. Other examples include S-Plus, Stata, Systat, Minitab, and Statgraphics.
Data analysis programs typically have spreadsheet screens for data because statistical calculations use matrices, and after all, a spreadsheet is really just a matrix. They also have utilities for both data management and graphing, which are essential for any type of data analysis. Almost all statistical software has a graphical user interface (GUI), and many packages also allow you to write your own code for specialized applications. Almost all have downloadable demos, usually fully functional (at least for basic statistics) for 30 days.
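To make the “it’s just a matrix” point concrete, here is a minimal Python sketch of what a regression procedure does behind the menus: it treats the data columns as a matrix X and solves the normal equations (X'X)b = X'y. The function name `fit_line` and the data values are made up for illustration; real packages use more numerically careful methods than this direct 2x2 inversion.

```python
# Simple linear regression written as matrix algebra: solve (X'X) b = X'y.
# The function name and the data values are invented for illustration.

def fit_line(x, y):
    """Return (intercept, slope) from the 2x2 normal equations."""
    n = len(x)
    # Entries of X'X, where X has a column of ones and a column of x values.
    sx = sum(x)
    sxx = sum(v * v for v in x)
    # Entries of X'y.
    sy = sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    # Invert the 2x2 matrix [[n, sx], [sx, sxx]] using its determinant.
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("x values must not all be identical")
    intercept = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]  # constructed so that y = 1 + 2x exactly
intercept, slope = fit_line(x, y)
print(intercept, slope)
```

Because the sample data lie exactly on a line, the fitted intercept and slope come back as 1 and 2. With more predictors the matrices simply get bigger, which is where the pencil-and-paper approach stops being fun.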
To conduct an analysis with statistical software, you enter or upload your data, scrub it (a whole other discussion), then pick from the program’s menus the graphing or analysis procedure you want to run. Submenus will pop up with all the specifications and options for the procedure. So it’s quite easy to do a lot of statistical analyses with just a few mouse clicks, but you really have to understand what all those specifications and options are about.
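If you prefer code to menus, the same three steps (load, scrub, analyze) can be sketched in a few lines of Python using only the standard library. This is a rough illustration, not how any particular package works: the column names, the data values, and the helper names `load_rows`, `scrub`, and `describe` are all hypothetical, and real scrubbing involves far more than dropping unparseable values.

```python
import csv
import io
import statistics

# Hypothetical data: the column names and values are invented for this sketch.
RAW = """site,weight_kg
A,4.1
B,5.3
C,
D,n/a
E,4.6
"""

def load_rows(text):
    """Step 1: read the data (here from a string instead of a file)."""
    return list(csv.DictReader(io.StringIO(text)))

def scrub(rows, column):
    """Step 2: keep only the values that parse as numbers."""
    clean = []
    for row in rows:
        try:
            clean.append(float(row[column]))
        except (TypeError, ValueError):
            continue  # skip blanks and non-numeric entries
    return clean

def describe(values):
    """Step 3: run the chosen procedure, here basic descriptive statistics."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

rows = load_rows(RAW)
weights = scrub(rows, "weight_kg")
summary = describe(weights)
print(summary)
```

Even in this toy version, the scrubbing step makes a decision for you (silently dropping two of five records), which is exactly the kind of option you need to understand before trusting the output.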
All of the software packages have their fans, especially the major packages. SPSS was created in the 1960s by graduates of Stanford who continued development at the University of Chicago. It used to be called Statistical Package for the Social Sciences, which is why it’s still very popular in the social sciences. SPSS was bought by IBM in 2009. SAS, formerly called the Statistical Analysis System, was developed in the early 1970s by professors at North Carolina State University. S-Plus started out as S, a programming language developed at Bell Laboratories in the 1970s. Minitab was created by professors at Pennsylvania State University in the 1970s from statistical spreadsheet software developed at the National Institute of Standards and Technology (NIST). Minitab now focuses on Six Sigma statistics procedures for managing quality.
There is no real best statistical software. They’re all pretty good, dollar-for-dollar. A lot of what determines a user’s preference is what software is (was) available at their college or the place they work. For example, if you go (went) to Penn State, you probably think Minitab is the best. If you work at a pharmaceutical company, you probably use SAS because that’s what the entire pharmaceutical industry uses. Social scientists like to use SPSS. If you like programming your own procedures, you’re probably a proponent of the R programming language for statistics.
Assuming you don’t have access to software through your school or work, you can evaluate your software needs by answering three questions:
- How sophisticated are the statistical techniques you need to use?
- How often would you likely need to use the software?
- How much do you have to spend for the software?
If you are planning on doing only one analysis, see if you can use what you have. You may be able to do all your calculations in a spreadsheet program or use free software or web-based software. If you are going to do full-time statistical consulting and you can’t afford a license for a major package, bite the bullet and learn R. Another option would be to buy a basic or an intermediate package and move up as you can afford to. If you’re only going to be an occasional user, any of the statistical packages will be better than using a spreadsheet (except perhaps for dataset scrubbing), so purchase whatever you can afford.
If you aren’t acquainted with statistical software, conduct a web search or start at en.wikipedia.org/wiki/List_of_statistical_packages. Explore the web sites you find to be sure that the software has the statistical procedures you think you will be using. Almost all of the sites have free downloads, such as brochures, white papers, and demonstration software. Don’t download the demo software until you’re ready to make a decision. Most demos are good for only 30 days, after which the software won’t work even if you download a new copy.
Software for Specialized Applications
There are a few kinds of analysis you might run into that will require specialized software. For example, have you ever seen an icon plot using sparklines or Chernoff faces? How about a ternary diagram or a piper plot? Some day you may have to produce one of these specialized graphics. Software you could look into would include: Sigmaplot, Origin, AquaChem, GraphPad, EasyPlot, Delta Graph, and Grapher.
If you ever have to do time-series analysis, you could start with some of the high-end statistical packages. Or, you could look into specialized software including Autobox, Eviews, ForecastX, and RATS. If you have to produce maps, find a GIS expert to help you. If you’re committed to doing it yourself, try Surfer. If you’re not into meteorology or geology, you probably don’t run into orientation data very often, but if you ever do, get Oriana. For critical-path scheduling, try Microsoft Project or P5, an update to Primavera Project Planner, now a product of Oracle. There’s also software for resampling statistics, control charts, ANOVA, neural networks, nonparametric statistics, power analysis, Bayesian statistics, data mining and many other specialties.
The software market changes rapidly. The big packages keep getting bigger, spawning optional modules from procedures that used to be part of the basic package. At the same time, new statistical software appears, usually for specialized applications. Spreadsheet software is also becoming more sophisticated. Introductory statistics classes are now taught with spreadsheet software; even calculators are a thing of the past. So do some research and get the software that’s best for your situation.