I was reading a paper this morning. It included a perversion of a common statistical analysis that is fundamentally wrong, utterly unnecessary, and has an easy solution. This perversion, unfortunately, is also distressingly common. Inspired by O’Hara and Kotze’s 2010 paper Do Not Log-Transform Count Data, I now offer you this blog post/rant, entitled “Do not flip-flop variables to make them work in your #@%*^& ANOVA.”
What set me off was a statement about the presence or absence of a particular fish in alpine lakes (details have been blurred to protect the guilty):
Lakes containing [fish] were lower in elevation…than lakes without [fish].
This statement was followed by the results of a non-parametric ANOVA confirming that lakes with fish were at significantly lower elevations than lakes without them. Can you spot the problem here? This model implies–wait for it–that you can flatten mountain ranges by adding fish to their lakes. Who knew?
This question (“If we know a lake contains fish, what is its elevation?”) was not what the authors were actually interested in. They wanted to know, given a lake’s elevation, how likely it was to contain fish. The latter question, unlike the former, makes an iota of biological sense. There are several good reasons why lakes at different elevations might be more or less likely to contain fish. There are zero good reasons why the presence of fish would change the elevation of a lake.
This mistake—which, again, answers a backwards question—arises from blind over-reliance on the statistical analysis of variance, or ANOVA. Technically, an analysis of variance is something you can do with many different models, but in common use, “ANOVA” refers to a particular one: a model with a categorical predictor variable and a continuous response.
This kind of ANOVA is taught in every introductory statistics class, and many people view it as a machine for the production of p-values (which, as we all know, are the sole arbiter of Good and Correct science). But not all data come with a categorical cause and a continuous effect. So what do you do? Use a more appropriate, if slightly more complicated model? Why do that when you can simply reverse cause and effect? Voila! Categorical predictor and continuous response! ANOVA on!
When ANOVA is the only model you have, this is the shit you are forced to do. But the categorical ANOVA is not the only model we have. In this case, we have binomial GLMs, aka logistic regression. They’ve been around since the 1940s! They predict whether or not something occurs (say, fish present in a lake), based on a continuous predictor (say, elevation)! In R, they require just 18 more keystrokes!
> right.model <- glm(fish.present ~ elevation, family=binomial)
> wrong.model <- aov(elevation ~ fish.present)
Why are you answering an idiotic question when you could be answering the one you are actually interested in???
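If you want to try this without a lake survey handy, here is a minimal sketch on simulated data. Everything in it is an assumption made up for illustration: the elevation range, the "true" relationship between elevation and fish, the sample size, and the names `lakes`, `new.lakes`, and `p.hat`. It is not the data or the analysis from the paper that set me off.

```r
# Simulated lakes: every number here is invented for illustration.
set.seed(42)
n <- 200
elevation <- runif(n, min = 2000, max = 3500)   # hypothetical elevations, in metres

# Assumed "truth": fish become less likely with elevation (50% at 2500 m).
p.fish <- plogis(10 - 0.004 * elevation)
fish.present <- rbinom(n, size = 1, prob = p.fish)
lakes <- data.frame(elevation, fish.present)

# The model that answers the question we actually care about:
# given a lake's elevation, how likely is it to contain fish?
right.model <- glm(fish.present ~ elevation, family = binomial, data = lakes)
print(summary(right.model)$coefficients)

# Predicted probability of fish at a few elevations of interest.
new.lakes <- data.frame(elevation = c(2200, 2500, 2800, 3100))
p.hat <- predict(right.model, newdata = new.lakes, type = "response")
print(round(p.hat, 2))
```

The coefficient table still hands you a p-value for elevation, and `predict()` with `type = "response"` turns the fit into probabilities you can actually report.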
To make matters worse, to use the traditional ANOVA, we have to assume Elevation is normally distributed, which it probably isn't. So we fall back to a nonparametric ANOVA, losing statistical power and interpretability. (Though in the larger picture, this may be a net positive, since the model was scientific nonsense to begin with--the fewer people who interpret it, the better!)
A possible objection to the flawlessly argued, legally-binding internet rant above is that this test was not intended to predict the presence of fish, just to show that there was "a difference" due to elevation. To which I say...okay, in some sense, yes, you are making a not-incorrect model of one conditional of a joint distribution. But knowing P(Elevation | Fish) is not useful, and, taken literally, leads to a nonsensical interpretation.
Think about how much more useful the other approach is: From the GLM, you will get not only your precious p-value, but an actual predictive model. You can say that lower-elevation lakes are more likely to have fish, but you can also report where the cutoff elevation lies, and you can say how sharp or gradual the cutoff is. Other researchers can use your model, and compare their results quantitatively with it.
From talking to people in fields where GLMs are not common, I gather that they are sometimes seen as untrustworthy, if not outright statistical machismo. They are not. Did I mention that this technique was invented in the 1940s? There is no reason not to use it if you have a binary response and a continuous covariate. So pretty please. With sugar on top. Do not flip-flop variables to make them work in your #@%*^& ANOVA.
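To make "where the cutoff lies" and "how sharp it is" concrete, here is a rough sketch, again on simulated data (all numbers and the names `fit`, `cutoff`, and `width` are my own inventions for illustration). It takes the cutoff to be the elevation where the fitted probability of fish is 50%, which works out to minus the intercept divided by the slope, and it measures sharpness as the elevation span over which the fitted probability drops from 90% to 10%. Other definitions of both are perfectly reasonable.

```r
# Simulated data again; the numbers are assumptions, not real lakes.
set.seed(1)
elevation <- runif(300, min = 2000, max = 3500)
fish.present <- rbinom(300, size = 1, prob = plogis(10 - 0.004 * elevation))

fit <- glm(fish.present ~ elevation, family = binomial)
b <- coef(fit)

# Cutoff: elevation where the fitted log-odds are zero, i.e. P(fish) = 0.5.
cutoff <- unname(-b["(Intercept)"] / b["elevation"])

# Sharpness: elevation span over which P(fish) goes from 0.9 to 0.1.
# logit(0.9) - logit(0.1) = 2 * log(9), so the span is 2 * log(9) / |slope|.
width <- unname(2 * log(9) / abs(b["elevation"]))

cat("Estimated cutoff elevation (m):", round(cutoff), "\n")
cat("Elevation span from 90% to 10% (m):", round(width), "\n")
```

A small `width` means an abrupt switch from fish to no fish; a large one means a gradual decline. Either way, these are numbers someone else can compare against their own lakes.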