DSUS5 has arrived!

The fifth edition of Discovering Statistics Using IBM SPSS Statistics has just landed (or so I am told). For those that use the book I thought it might be helpful to run through what’s changed.

General changes

It might sound odd if you’ve never done a new edition of a textbook, but it can be quite hard to quantify (or remember) what you have changed. I know I spent a ridiculous number of hours working on it, so I must have changed a lot, but when I list the tangibles it seems uninspiring. Here’s an exercise for you. Take something you wrote 5 years ago and re-write it. The chances are the content won’t change but you’ll express yourself better and it’ll take you a bit of time to do the re-writing. The piece will have improved (hopefully), but the content is probably quite similar. The improvement lies in some crack of intangibility. Anyway, assuming you did the exercise (which of course no-one in their right mind would), multiply that effort by 1000/(number of pages you just re-wrote) and that’s what I spent early 2017 doing.

So, the first major change is that I did a lot of re-structuring and re-writing that doesn’t change the content, as such, but I believe does improve the experience of reading my drivel. It’s a bit less drivel-y, you might say. With respect to the tangibles (I’ve plagiarised myself from the preface here …):

  • IBM SPSS compliance: This edition was written using version 25 of IBM SPSS Statistics. IBM releases new editions of SPSS Statistics more often than I bring out new editions of this book, so, depending on when you buy the book, it may not reflect the latest version.
  • New! Chapter: In the past four years the open science movement has gained a lot of momentum. Chapter 3 is new and discusses issues relevant to this movement such as p-hacking, HARKing, researcher degrees of freedom, and pre-registration of research. It also has an introduction to Bayesian statistics.
  • New! Bayes: Statistical times are a-changing, and it’s more common than it was four years ago to encounter Bayesian methods in social science research. IBM SPSS Statistics doesn’t really do Bayesian estimation, but you can implement Bayes factors. Several chapters now include sections that show how to obtain and interpret Bayes factors. Chapter 3 also explains what a Bayes factor is.
  • New! Robust methods: Statistical times are a-changing … oh, hang on, I just said that. Although IBM SPSS Statistics does bootstrap (if you have the premium version), there are a bunch of statistics based on trimmed data that are available in R. I have included several sections on robust tests and syntax to do them (using the R plugin).
  • New! Pointless fiction: Having got quite into writing a statistics textbook in the form of a fictional narrative (An Adventure in Statistics) I staved off boredom by fleshing out Brian and Jane’s story (which goes with the diagrammatic summaries at the end of each chapter). Of course, it is utterly pointless, but maybe someone will enjoy the break from the stats.
  • New! Misconceptions: Since the last edition my cat of 20 years died, so I needed to give him a more spiritual role. He has become the Correcting Cat, and he needed a foil, so I created the Misconception Mutt, who has a lot of common misconceptions about statistics. So, the mutt (based on my cocker spaniel, who since I wrote the update has unexpectedly died leaving a gaping emotional vacuum in my life) gets stuff wrong and the cat appears from the spiritual ether to correct him. All of which is an overly elaborate way to point out some common misconceptions.
  • New-ish! The linear model theme: In the past couple of editions of this book I’ve been keen to scaffold the content on the linear model to focus on the commonalities between models traditionally labelled as regression, ANOVA, ANCOVA, t-tests, etc. I’ve always been mindful of trying not to alienate teachers who are used to the historical labels, but I have again cranked the general linear model theme up a level.
  • New-ish! Characters: I loved working with James Iles on An Adventure in Statistics so much that I worked with him to create new versions of the characters in the book (and other design features like their boxes). They look awesome. Given that I was overhauling the characters, I decided Smart Alex should be a woman this time around.
  • Obvious stuff: I’ve re-created all of the figures, and obviously updated the SPSS Statistics screenshots and output.
  • Feedback-related changes: I always collate feedback from readers and instructors and feed that into new editions. Lots of little things will have changed as a result of user feedback. One obvious area is the examples in the book: I tweaked quite a few this time around (in Smart Alex and within the main text). It’s hard to remember everything, but most tweaks were aimed at avoiding lazy stereotypes: for example, I changed a lot of examples based on sex differences, I changed a suicide example, etc. The style of the book hasn’t changed (the people who like it will still like it, and the people who don’t still won’t), but sometimes an example that seemed like a good idea in 2005 doesn’t seem so great in 2017.

Chapter-by-chapter changes

Every chapter got a thorough re-write, but here are the tangible changes:

  • Chapter 1 (Doing research): I re-wrote and expanded the discussion of hypotheses. I changed my Beachy Head example to be about memes and how they follow normal distributions. I used some Google Analytics data to illustrate this.
  • Chapter 2 (Statistical theory): I restructured this chapter around the acronym SPINE (thanks to a colleague, Jennifer Mankin, for distracting me from the acronym that more immediately sprang to my childish mind), so you’ll notice that the subheadings and structure have changed. The content is all there, just rewritten and reorganized into a better narrative. I expanded my description of null hypothesis significance testing (NHST).
  • Chapter 3 (Current thinking in statistics): This chapter is completely new. It co-opts some of the critique of NHST that used to be in Chapter 2 but moves this into a discussion of open science, p-hacking, HARKing, researcher degrees of freedom, pre-registration, and ultimately Bayesian statistics (primarily Bayes factors).
  • Chapter 4 (IBM SPSS Statistics): Obviously reflects changes to SPSS Statistics since the previous edition. There’s a new section on ‘extending’ SPSS Statistics that covers installing the PROCESS tool, the Essentials for R plugin and installing the WRS2 package (for robust tests).
  • Chapter 5 (Graphs): No substantial changes other than reflecting the new layout and output from the chart editor. I tweaked a few examples.
  • Chapter 6 (Assumptions): The content is more or less as it was. I have a much stronger steer away from tests of normality and homogeneity (I still cover them but mainly as a way of telling people not to use them) because I now offer some robust alternatives to common tests.
  • Chapter 7 (Nonparametric models): No substantial changes to content.
  • Chapter 8 (Correlation): I completely rewrote the section on partial correlations.
  • Chapter 9 (The linear model): I restructured this chapter a bit and wrote new sections on robust regression and Bayesian regression.
  • Chapter 10 (t-tests): I did an overhaul of the theory section to tie it in more with the linear model theme. I wrote new sections on robust and Bayesian tests of two means.
  • Chapter 11 (Mediation and moderation): No substantial changes to content, just better written.
  • Chapters 12–13 (GLM 1–2): I changed the main example to be about puppy therapy. I thought that the Viagra example was a bit dated, and I needed an excuse to get some photos of my spaniel into the book. (I might have avoided doing this had I known the crappy hand that fate would subsequently deal my beloved hound, but he’s in there just to make it super hard for me to look at those chapters and teach from them.) I wrote new sections on robust and Bayesian (Chapter 12 only) variants of these models.
  • Chapter 14 (GLM 3): I tweaked the example – it’s still about the beer-goggles effect, but I linked it to some real research so that the findings now reflect some actual science that’s been done (and it’s not about sex differences any more). I added sections on robust and Bayesian variants of models for factorial designs.
  • Chapters 15–16 (GLM 4–5): I added some theory to Chapter 15 to link it more closely to the linear model (and to the content of Chapter 21). I give a clearer steer now towards ignoring Mauchly’s test and routinely applying a correction to F (although, if you happen to like Mauchly’s test, I doubt that the change is dramatic enough to upset you). I added sections on robust variants of models for repeated-measures designs. I added some stuff on pivoting trays in tables. I tweaked the example in Chapter 16 a bit so that it doesn’t compare males and females but instead links to some real research on dating strategies.
  • Chapter 17 (MANOVA), Chapter 18 (Factor analysis), Chapter 19 (Categorical data), Chapter 20 (Logistic regression), Chapter 21 (Multilevel models): Just rewritten, structural tweaks and so on but no major content changes.

International editions

Nothing to do with me, but this time around the book you get depends on where you live: there’s a North American edition and a standard edition for the rest of the world.

The basic difference is in the page size and formatting: the North American edition has wider pages and a three-column layout; the standard edition doesn’t. The content is exactly the same (I say this confidently despite the fact that I haven’t actually seen the proofs for the North American edition, so I have no idea whether the publishers changed my UK spellings to US spellings or edited out anything they secretly wished I hadn’t put in the book).

So there you have it. Needless to say I hope that those using the book think that things have got better …

FAQ #1: K-S Tests in SPSS

I decided to start a series of blogs on questions that I get asked a lot. When I say a series, I’m probably raising expectations unfairly: anyone who follows this blog will realise that I’m completely crap at writing blogs. Life gets busy. Sometimes I need to sleep. But only sometimes.

Anyway, I do get asked a lot about why there are two ways to do the Kolmogorov-Smirnov (K-S) test in SPSS. In fact, I got an email only this morning. I knew I’d answered this question many times before, but I couldn’t remember where I might have saved a response, so I figured that if I just blog about it then I’d have a better idea of where I’d written one. So, here it is. Notwithstanding my reservations about using the K-S test (you’ll have to wait until edition 4 of the SPSS book), there are three ways to get one from SPSS:

  1. Analyze > Explore > Plots > Normality plots with tests
  2. Nonparametric Tests > One Sample … (or Legacy Dialogs > 1-Sample K-S)
  3. Tickle SPSS under the chin and whisper sweet nothings into its ear
These methods give different results. Why is that? Essentially (I think), if you use method 1 then SPSS applies Lilliefors’ correction, but if you use method 2 it doesn’t. If you use method 3 then you just look like a weirdo.
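If you want to see the distinction for yourself outside SPSS, you can mimic it in R: base ks.test() is the uncorrected test (like method 2), and lillie.test() from the nortest package applies Lilliefors’ correction (like method 1). A sketch (nortest is my suggestion here, not anything SPSS uses):

```r
# 50 scores sampled from an actual normal distribution
set.seed(1)
x <- rnorm(50)

# Uncorrected K-S test against a normal whose mean and SD are estimated
# from the data -- estimating the parameters is what makes this conservative
uncorrected <- ks.test(x, "pnorm", mean(x), sd(x))
uncorrected$p.value

# With Lilliefors' corrected critical values (needs install.packages("nortest")):
# nortest::lillie.test(x)
```

With normal data and estimated parameters the uncorrected p-value tends to come out very large, which is exactly the conservatism that Lilliefors was correcting for.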
So, is it better to use Lilliefors’ correction or not? In the additional website material for my SPSS book, which no-one ever reads (the web material, not the book …), I wrote (self-plagiarism alert):
“If you want to test whether a model is a good fit of your data you can use a goodness-of-fit test (you can read about these in the chapter on categorical data analysis in the book), which has a chi-square test statistic (with the associated distribution). One problem with this test is that it needs a certain sample size to be accurate. The K–S test was developed as a test of whether a distribution of scores matches a hypothesized distribution (Massey, 1951). One good thing about the test is that the distribution of the K–S test statistic does not depend on the hypothesized distribution (in other words, the hypothesized distribution doesn’t have to be a particular distribution). It is also what is known as an exact test, which means that it can be used on small samples. It also appears to have more power to detect deviations from the hypothesized distribution than the chi-square test (Lilliefors, 1967). However, one major limitation of the K–S test is that if location (i.e. the mean) and shape parameters (i.e. the standard deviation) are estimated from the data then the K–S test is very conservative, which means it fails to detect deviations from the distribution of interest (i.e. normal). What Lilliefors did was to adjust the critical values for significance for the K–S test to make it less conservative (Lilliefors, 1967) using Monte Carlo simulations (these new values were about two thirds the size of the standard values). He also reported that this test was more powerful than a standard chi-square test (and obviously the standard K–S test).
Another test you’ll use to test normality is the Shapiro-Wilk test (Shapiro & Wilk, 1965) which was developed specifically to test whether a distribution is normal (whereas the K–S test can be used to test against other distributions than normal). They concluded that their test was ‘comparatively quite sensitive to a wide range of non-normality, even with samples as small as n = 20. It seems to be especially sensitive to asymmetry, long-tailedness and to some degree to short-tailedness.’ (p. 608). To test the power of these tests they applied them to several samples (n = 20) from various non-normal distributions. In each case they took 500 samples which allowed them to see how many times (in 500) the test correctly identified a deviation from normality (this is the power of the test). They show in these simulations (see table 7 in their paper) that the S-W test is considerably more powerful to detect deviations from normality than the K–S test. They verified this general conclusion in a much more extensive set of simulations as well (Shapiro, Wilk, & Chen, 1968).” 
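To get a feel for the power difference Shapiro and Wilk found, here’s a quick simulation of my own in base R (a rough sketch, not their exact procedure): draw 500 samples of n = 20 from a clearly skewed distribution and count how often each test detects the non-normality at the .05 level.

```r
set.seed(42)
reps <- 500
sw_detect <- ks_detect <- logical(reps)
for (i in seq_len(reps)) {
  x <- rexp(20)                        # exponential: skewed and long-tailed
  sw_detect[i] <- shapiro.test(x)$p.value < .05
  z <- (x - mean(x)) / sd(x)           # standardize, then test against N(0,1)
  ks_detect[i] <- ks.test(z, "pnorm")$p.value < .05
}
mean(sw_detect)   # empirical power of Shapiro-Wilk
mean(ks_detect)   # empirical power of K-S (conservative: parameters estimated)
```

In my runs the Shapiro-Wilk detection rate comes out much higher than the K-S rate, consistent with the conclusion quoted above.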
So there you go. More people have probably read that now than when it was in the additional materials for the book. It looks like Lilliefors’ correction is a good thing (power-wise), but you probably don’t want to be using K-S tests anyway; if you do, interpret them within the context of the size of your sample and look at graphical displays of your scores too.

SPSS is not dead

A blog post was published recently showing that the use of R continues to grow in academia. One of the graphs (Figure 1) showed citations (using Google Scholar) of different statistical packages in academic papers (to which I have added annotations).
At face value, this graph implies a very rapid decline in SPSS use since 2005. I sent a tongue-in-cheek tweet about this graph, and this perhaps got interpreted as meaning that I thought SPSS use was on the decline. So, I thought I’d write this blog. The thing about this graph is that it deals with citations in academic papers. The majority of people do not cite the package they use to analyse their data, so this might just reflect a decline in people stating that they used SPSS in papers. Also, it might be that users of software such as R are becoming more inclined to cite the package to encourage others to use it (stats package preference does for some people mimic the kind of religious fervour that causes untold war and misery; most packages have their pros and cons, and some people should get a grip). Also, looking at my annotations on Figure 1 you can see that the decline in SPSS is in no way matched by an upsurge in the use of R/Stata/Systat. This gap implies some mysterious ghost package that everyone is suddenly using but that is not included on this graph. Or perhaps people are just ditching SPSS for qualitative analysis or doing it by hand ☺
If you really want to look at the decline/increase of package use then there are other metrics you could use. This article details lots of them. For example, you could look at how much people talk about packages online (Figure 2).
Figure 2: online talk of stats packages (Image from http://r4stats.com/popularity)
Based on this, R seems very popular and SPSS less so. However, the trend for SPSS is completely stable between 2005 and 2010 (the period of decline in Figure 1). Discussion of R is on the increase, though. Again, though, you can’t really compare R and SPSS here, because R is more difficult to use than SPSS (I doubt that this is simply my opinion; I reckon you could demonstrate empirically that the average user prefers the SPSS GUI to R’s command interface, if you could be bothered). People are, therefore, more likely to seek help on discussion groups for R than they are for SPSS. It’s perhaps not an index of popularity so much as usability.
There are various other interesting metrics discussed in the aforementioned article. Perhaps the closest we can get to an answer to package popularity (but not decline in use) is survey data on what tools people use for data mining. Figure 3 shows that people most frequently report R, SPSS and SAS. Of course this is a snapshot and doesn’t tell us about usage change. However, it shows that SPSS is still up there. I’m not sure what types of people were surveyed for this figure, but I suspect it was professional statisticians/business analysts rather than academics (who would probably not describe their main purpose as data mining). This would also explain the popularity of R, which is very popular amongst people who crunch numbers for a living.
Figure 3: Data mining/analytic tools reported in use on Rexer Analytics survey during 2009 (from http://r4stats.com/popularity).
To look at the decline (or not) of SPSS in academia what we really need is data about campus licences over the past few years. There were mumblings about universities switching from SPSS after IBM took over and botched the campus agreement, but I’m not sure how real those rumours were. In any case, the teething problems from the IBM takeover seem to be over (at least most people have stopped moaning about them). Of course, we can’t get data on campus licences because it’s sensitive information that IBM would be silly to put in the public domain. I strongly suspect campus agreements have not declined, though. If they have, IBM will be doing all that they can (and they are an enormously successful company) to restore them, because campus agreements are a huge part of SPSS’s business.
Also, I doubt campus agreements have declined because they will only stop for two main reasons: (1) SPSS isn’t used by anyone any more; (2) the cost becomes prohibitive. These two reasons are obviously related – the point at which an institution stops the agreement will be a function of cost and campus usage. In terms of campus usage, if you grew up using SPSS as an undergraduate or postgraduate, you’re unlikely to switch software later in your academic career (unless you’re a geek like me who ‘enjoys’ learning R). So, I suspect the demand is still there. In terms of cost, as I said, I doubt IBM are daft enough to price themselves out of the market.
So, despite my tongue-in-cheek tweet, I very much doubt that there is a mass exodus from SPSS. Why would there be? Although some people tend to be a bit snooty about SPSS, it’s a very good bit of software: a lot of what it does, it does very well. There are things I don’t like about it (graphs, lack of robust methods, the insistence on moving towards automated analysis), but there are things I don’t like about R too. Nothing is perfect, but SPSS’s user-friendly interface allows thousands of people who are terrified of stats to get into it and analyse data and, in my book, that’s a very good thing.

Factor Analysis for Likert/Ordinal/Non-normal Data

My friend Jeremy Miles sent me this article by Basto and Pereira (2012) this morning with the subject line ‘this is kind of cool’. Last time I saw Jeremy, my wife and I gatecrashed his house in LA for 10 days to discuss writing the R book that’s about to come out. During that stay we talked about lots of things, none of which had anything to do with statistics, or R for that matter. It’s strange then that, with the comforting blanket of the Atlantic Ocean between us, we only ever talk about statistics, or rant at each other about statistics, or R, or SPSS, or each other.
Nevertheless, I’m always excited to see a message from Jeremy because it’s usually interesting, frequently funny, and only occasionally insulting about me. Anyway, J was right, this article was actually kind of cool (in a geeky stats sort of way). The reason is that it describes an interface for doing various cool factor analysis (FA) and principal components analysis (PCA) things in SPSS, such as analysing correlation matrices other than those containing Pearson’s r, and parallel analysis/MAP. It pretty much addresses two questions that I get asked a lot:
  1. My data are Likert/not normal, can I do a PCA/FA on them?
  2. I’ve heard about Velicer’s minimum average partial (MAP) criterion and parallel analysis; can you do them in SPSS?
PCA/FA is not something I use, and the sum total of my knowledge is in my SPSS/SAS/R book. Some of that isn’t even my knowledge, it’s Jeremy’s, because he likes to read my PCA chapter and get annoyed about how I’ve written it. The two questions are briefly answered in the book, sort of.
The answer to question 1 is to apply the PCA to the correlation matrix of polychoric correlations (for Likert/ordinal/skewed data) or tetrachoric correlations (for dichotomous data) rather than the matrix of Pearson’s r. This is mentioned so briefly that you might miss it: p. 650 of the SPSS book (3rd ed.) and p. 772 (in the proofs at least) of the R book.
The answer to question 2 is in Jane Superbrain 17.2 in the books, in which I very briefly explain parallel analysis and point to some syntax to do it that someone else wrote, and I don’t talk about MAP at all.
I cleverly don’t elaborate on how you would compute polychoric correlations, or indeed tetrachoric ones, and certainly don’t show anyone anything about MAP. In part this is because the books are already very large, but in the case of the SPSS book it’s because SPSS won’t let you do PCA on any correlation matrix other than one containing Pearson’s r, and MAP/parallel analysis have, let’s just say, been overlooked in the software. Until now, that is.
Basto and Pereira (2012) have written an interface for doing PCA on correlation matrices containing things other than Pearson’s r, and you can do MAP, parallel analysis and a host of other things. I recommend the article highly if PCA is your kind of thing.
However, the interesting thing is that underneath Basto and Pereira’s interface SPSS isn’t doing anything – all of the stats are computed using R. In the third edition of the SPSS book I excitedly mentioned the R plugin for SPSS a few times. I was mainly excited because at the time I’d never used it, and I stupidly thought it was some highly intuitive interface that enabled you to access R from within SPSS without knowing anything about R. My excitement dwindled when I actually used it. It basically involves installing the plugin, which may or may not work. Even if you get it working you simply type:
BEGIN PROGRAM R.
END PROGRAM.
and stick a bunch of R code in between. It seemed to me that I might as well just use R and save myself the pain of trying to locate the plugin and actually get it working (it may be better now – I haven’t tried it recently). Basto and Pereira’s interface puts a user-friendly dialog box around a bunch of R commands.
I’m absolutely not knocking Basto and Pereira’s interface – I think it will be incredibly useful to a huge number of people who don’t want to use R commands, and it very neatly provides a considerable amount of extra functionality to SPSS that would otherwise be unavailable. I’m merely making the point that it’s a shame that, having installed the interface, SPSS will get the credit for R’s work.
Admittedly it will be handy to have your data in SPSS, but you can do this in R with a couple of lines of code:
library(foreign)
data <- read.spss(file.choose(), to.data.frame = TRUE)
which opens a dialog box for you to select an SPSS file and then puts it in a data frame that I have unimaginatively called ‘data’. Let’s imagine we opened Example1.sav from Basto and Pereira’s paper. These data are now stored in an object called data.
Installing the SPSS interface involves installing R, installing the R plugin for SPSS, and then installing the widget itself through the utilities menu. Not in itself hard, but by the time you have done all of this I reckon you could type and execute this command in R:
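Something like the following, presumably (a sketch assuming the psych package’s polychoric() function; the exact command is my reconstruction):

```r
library(psych)   # assumed: provides polychoric()

# polychoric() returns a list; $rho is the matrix of polychoric correlations
# between the ordinal variables in the data frame called 'data'
rMatrix <- polychoric(data)$rho
```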
This creates a matrix (called rMatrix) containing polychoric correlations of the variables in the data frame (which, remember, was called data).
You probably also have time to run these commands:
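For instance (a hypothetical reconstruction using the nFactors package; the function calls and the form of the arguments are my assumptions):

```r
library(nFactors)   # assumed: provides parallel() and nScree()

# first command: scree/parallel analysis of the eigenvalues of rMatrix,
# for 590 cases and as many variables as the matrix has
parallelResults <- nScree(x = eigen(rMatrix)$values,
                          aparallel = parallel(subject = 590,
                                               var = ncol(rMatrix))$eigen$qevpea)
# second command: display the number-of-components criteria on the screen
parallelResults
```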
That’s your parallel analysis done on the polychoric correlation matrix (first command) and displayed on the screen (second command). The results will mimic the values in Figure 4 of Basto and Pereira. If you want to generate Figure 3 from their paper as well then execute:
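A likely candidate is plotnScree() from the nFactors package, applied to the parallel-analysis object created above (again, the exact function and object names are my assumptions):

```r
library(nFactors)             # assumed
plotnScree(parallelResults)   # scree plot with the parallel analysis overlaid
```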
It’s not entirely implausible that the R plugin for SPSS will still be downloading or installing at this point, so to relieve the tedium you could execute this command:
mapResults <- VSS(rMatrix, n.obs = 590, fm = "pc")
That’s the MAP analysis done on the polychoric correlation matrix using the VSS() function in R: n.obs is just the number of observations in the data frame, and fm = "pc" tells it to do PCA rather than FA. The results will mimic the values in Figures 5 and 6 of Basto and Pereira.
The R plugin isn’t working so you’re frantically googling for a solution. While you do that, a small gerbil called Horace marches through the cat flap that you didn’t realise you had, jumps on the table and types this into the R console:
PCA <- principal(rMatrix, nfactors = 4, rotate = "varimax")
print.psych(PCA, cut = 0.3, sort = TRUE)
This will create an object called PCA containing the results of a PCA on your polychoric correlation matrix, extracting 4 factors and rotating using varimax (as Basto and Pereira do for Example 1). You’ll get basically the same results as their Figure 13.
You probably also have time to make some tea.
Like I said, my intention isn’t to diss Basto and Pereira’s interface. I think it’s great, incredibly useful, and it opens the door to lots of useful things that I haven’t mentioned. My intention is instead to show you that you can do some really complex things in R with little effort (apart from the 6-month learning curve of how to use it, obviously). Even a gerbil called Horace can do it. Parallel analysis was done in a line of code. PCA on polychoric correlations sounds like the access code to a secure unit in a mental institution, but it’s three short lines of code. Now that is pretty cool.


PS The R code here needs the following packages installed: psych, nFactors and foreign.