On reliability
In statistics, reliability is one way to talk about measurement error. In the real world, measurement is rarely perfect, and imperfect measurement reduces the power of inferential statistics. Measurement error also has more complex implications for traditional approaches to correlation. Reliability is one way of quantifying measurement error that lets us compare error across measurement devices and even properties.
Say you’re interested in measuring your weight. Scale A has a known average measurement error of plus-or-minus 100 grams. Scale B has a known average measurement error of plus-or-minus 500 grams. It’s pretty intuitive to conclude that Scale A is more reliable than Scale B because they both measure the same property (weight) and Scale A’s measurement error is lower. However, let’s say you’re also interested in measuring your percent-body-fat and you use a tool like a DXA machine. You look at the manufacturer’s handbook and find that they’ve tested the machine and found it has a known average measurement error of plus-or-minus 1%. How would you go about comparing the measurement error of Scale A and the DXA machine? If you try to compare 100 grams to 1%, you’re in an apples-and-oranges situation.
The solution provided by traditional calculations of reliability is to transform each observed measurement error to a common scale. This is achieved by changing the question from “How variable should we expect measurement of an individual to be?” to “How variable should we expect measurement of an individual to be relative to the variability we observe in the measurement of individuals across the entire population?” Thus, we become interested not in measurement error itself, but in the ratio of measurement error to the variability of the property of interest in the population generally.
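As a rough illustration of putting the two tools on a common scale, here is a minimal sketch. The error figures (100 grams, 1%) come from the example above, but the two population standard deviations are hypothetical values invented for illustration, and the sketch treats each "average measurement error" as if it were a standard deviation, which is an assumption.

```python
# Hypothetical numbers: the error figures come from the example in the text,
# but both population SDs are invented purely for illustration.
def relative_error(measurement_error, population_sd):
    """Measurement error as a fraction of population variability.

    Both arguments must be in the same units, so the result is unitless.
    """
    return measurement_error / population_sd


# Scale A: 100 g error, assumed population SD of 15 kg (15,000 g).
scale_a = relative_error(measurement_error=100.0, population_sd=15_000.0)
# DXA: 1 percentage point error, assumed population SD of 8 percentage points.
dxa = relative_error(measurement_error=1.0, population_sd=8.0)

print(f"Scale A relative error: {scale_a:.4f}")
print(f"DXA relative error:     {dxa:.4f}")
```

Under these made-up population values, Scale A's error is a much smaller share of the population variability than the DXA machine's, and the two numbers can now be compared directly because neither carries units.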
Traditionally, the reliability of a measurement tool is computed by obtaining multiple measurements of multiple items; if we’re talking scales, then we’d weigh many people and each person would get weighed multiple times. We’d then split each person’s measurements into two halves, obtain the mean per half per person, then correlate half 1 ($X$) with half 2 ($Y$) across people. Check out the formula for correlation:
$$r_{XY} = \frac{\mathrm{cov}(X, Y)}{s_X s_Y}$$
In the context of reliability, $s_X$ and $s_Y$ are really two attempts at measuring the same thing, the variability of scores in the population, so their product is really just an estimate of the population variance. $r_{XY}$ is thus expressed as a ratio of the covariance of $X$ and $Y$ over the population variance; sound familiar? Hopefully you can see that $\mathrm{cov}(X, Y)$ must be an index of measurement error, and indeed it can be shown that when $X$ and $Y$ are identical (i.e. when there’s no measurement error), $\mathrm{cov}(X, Y)$ is equal to $s_X s_Y$, thus yielding an estimate of $r_{XY}$ of 1 (perfect correlation/reliability). As you add measurement error, $X$ and $Y$ will start to differ, yielding values for $\mathrm{cov}(X, Y)$ that will be slightly less than $s_X s_Y$, in turn leading to estimates of $r_{XY}$ that are less than 1.
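The split-half computation described above can be sketched with simulated data. Everything about the simulation is an assumption made for illustration: the function names, the normally distributed true weights and errors, and the particular standard deviations. The correlation is written out as covariance over the product of standard deviations so that it mirrors the formula.

```python
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation written as cov(x, y) / (sd_x * sd_y)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))


def simulate_half_means(n_people, n_per_half, true_sd, error_sd, rng):
    """Return per-person means for two halves of simulated weighings."""
    half1, half2 = [], []
    for _ in range(n_people):
        true_weight = rng.gauss(70.0, true_sd)  # hypothetical population, in kg
        for half in (half1, half2):
            readings = [true_weight + rng.gauss(0.0, error_sd)
                        for _ in range(n_per_half)]
            half.append(statistics.fmean(readings))
    return half1, half2


rng = random.Random(1)
# Same simulated design twice: once with no measurement error, once with some.
clean = simulate_half_means(500, 5, true_sd=10.0, error_sd=0.0, rng=rng)
noisy = simulate_half_means(500, 5, true_sd=10.0, error_sd=10.0, rng=rng)

r_clean = pearson_r(*clean)
r_noisy = pearson_r(*noisy)
print(f"no measurement error:   r = {r_clean:.3f}")
print(f"with measurement error: r = {r_noisy:.3f}")
```

With no measurement error the two halves are identical and the correlation is 1; once error is added, the covariance falls below the product of the standard deviations and the correlation drops below 1.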
After obtaining an estimate of reliability by splitting the data and computing the correlation of halves, it is often necessary to apply the Spearman-Brown prediction formula to extrapolate the reliability estimate to the un-split data.
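For reference, a minimal sketch of the Spearman-Brown prediction formula follows. The function name and the `length_factor` parameter are my own labels; a factor of 2 corresponds to extrapolating from one half to the full, un-split data.

```python
def spearman_brown(r_half, length_factor=2.0):
    """Predict reliability when the number of measurements is multiplied
    by length_factor (2.0 extrapolates from a half to the full data)."""
    return length_factor * r_half / (1.0 + (length_factor - 1.0) * r_half)


for r_half in (0.5, 0.8, 0.95):
    r_full = spearman_brown(r_half)
    print(f"half-data r = {r_half:.2f} -> full-data estimate = {r_full:.3f}")
```

The correction always moves a positive split-half estimate upward, which makes sense: means over twice as many measurements should be less noisy than means over a half.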
Unfortunately, the correlation approach to computing reliability has troubles. I noted that to compute the correlation you have to split the data in half; how do you decide which observations to put in which half? It turns out that the computed reliability can vary dramatically depending on how you split the data. It’s also the case that the correlation approach can sometimes yield estimates of reliability that are negative, a clearly nonsensical result! One solution is to compute many reliability estimates using a large subset of all possible ways to split the data (I just submitted a paper that employs this approach). Cronbach’s alpha attempts something similar, though it obtains an analytic solution that represents the average of all possible splits; however, it applies only to measures that take a sum across measurements, a truly silly practice. I’ve attempted to derive the analytic solution for the more reasonable case of obtaining a mean across measurements, and had some success that was subsequently validated by simulation, but I’m still working out some kinks in this solution (like how to account for N). Comparing the reliability estimates from this approach with those from a more standard correlation approach on a lot of real data, I find that my analytic solution already appears to address the reliability of the full data set (the estimates are roughly equivalent to the correlation estimates after the Spearman-Brown prediction formula has been applied to the latter). I must admit I do have some suspicions that I might have simply re-invented intraclass correlation, but I’m still working out the math on whether this is the case.
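To make the split-dependence concrete, here is a rough sketch of the many-random-splits idea on simulated data. It is not the procedure from the submitted paper, nor the analytic solution mentioned above; the function names, the choice to shuffle each person's measurements independently, and all simulation parameters are assumptions for illustration.

```python
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def random_split_reliabilities(scores, n_splits, rng):
    """Split-half correlation for n_splits random splits.

    scores holds one list of repeated measurements per person. Each
    person's measurements are shuffled independently before halving.
    """
    estimates = []
    for _ in range(n_splits):
        half1, half2 = [], []
        for person in scores:
            shuffled = person[:]
            rng.shuffle(shuffled)
            mid = len(shuffled) // 2
            half1.append(statistics.fmean(shuffled[:mid]))
            half2.append(statistics.fmean(shuffled[mid:]))
        estimates.append(pearson_r(half1, half2))
    return estimates


rng = random.Random(7)
# Simulated data: 100 people, 10 noisy measurements each (hypothetical values).
scores = []
for _ in range(100):
    true_score = rng.gauss(70.0, 10.0)
    scores.append([true_score + rng.gauss(0.0, 10.0) for _ in range(10)])

estimates = random_split_reliabilities(scores, n_splits=200, rng=rng)
mean_estimate = statistics.fmean(estimates)
print(f"min  = {min(estimates):.3f}")
print(f"mean = {mean_estimate:.3f}")
print(f"max  = {max(estimates):.3f}")
```

The gap between the minimum and maximum shows how much a single arbitrary split can mislead, while the mean across splits is a more stable summary. These are still half-data estimates, so the Spearman-Brown correction would apply before interpreting them as the reliability of the full data set.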
But ignoring nuances of computation, I wonder about the true usefulness of reliability as a statistical concept. Sure, it permits comparing apples to oranges, but to what end? Williams, Zimmerman and Zumbo (1995) demonstrate that the relationship between reliability and statistical power is not a simple one. It turns out that increased measurement error within-Ss decreases both reliability and power, whereas increased between-Ss variability also decreases power (for between-Ss designs) but actually increases reliability. So if you can’t use reliability to compare the power of two tools, what other use does it have?
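The divergence between reliability and power can be illustrated with a toy calculation. This is a simplified sketch under classical assumptions (observed variance is between-Ss variance plus within-Ss error variance, one measurement per person), and it uses the standardized effect size as a stand-in for power in a between-Ss design with fixed sample size. It is not the analysis from Williams, Zimmerman and Zumbo, and all the variance values are hypothetical.

```python
def reliability(var_between, var_within):
    """Share of observed variance that is true between-Ss variance."""
    return var_between / (var_between + var_within)


def standardized_effect(raw_difference, var_between, var_within):
    """Raw group difference in units of the observed standard deviation.

    For a fixed sample size, power in a between-Ss comparison rises and
    falls with this quantity, so it serves here as a proxy for power.
    """
    return raw_difference / (var_between + var_within) ** 0.5


# Hypothetical scenarios as (between-Ss variance, within-Ss error variance).
scenarios = {
    "baseline": (4.0, 1.0),
    "more within-Ss error": (4.0, 4.0),
    "more between-Ss variability": (16.0, 1.0),
}

results = {}
for name, (var_b, var_w) in scenarios.items():
    rel = reliability(var_b, var_w)
    effect = standardized_effect(1.0, var_b, var_w)
    results[name] = (rel, effect)
    print(f"{name:28s} reliability = {rel:.3f}  standardized effect = {effect:.3f}")
```

Relative to the baseline, adding within-Ss error lowers both numbers, whereas adding between-Ss variability raises reliability while lowering the standardized effect, which is the pattern described above.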