On reliability

In statistics, reliability is one way to talk about measurement error. In the real world, measurement is rarely perfect, and imperfect measurement reduces the power of inferential statistics. Measurement error also has more complex implications for traditional approaches to correlation. Reliability is one way of quantifying measurement error that lets us compare error across measurement devices and even properties.

Say you’re interested in measuring your weight. Scale A has a known average measurement error of plus-or-minus 100 grams. Scale B has a known average measurement error of plus-or-minus 500 grams. It’s pretty intuitive to conclude that Scale A is more reliable than Scale B because they both measure the same property (weight) and Scale A’s measurement error is lower. However, let’s say you’re also interested in measuring your percent-body-fat and you use a tool like a DXA machine. You look at the manufacturer’s handbook and find that they’ve tested the machine and found it has a known average measurement error of plus-or-minus 1%. How would you go about comparing the measurement error of Scale A and the DXA machine? If you try to compare 100 grams to 1%, you’re in an apples-and-oranges situation.

The solution provided by traditional calculations of reliability is to transform each observed measurement error to a common scale. This is achieved by changing the question from “How variable should we expect measurement of an individual to be?” to “How variable should we expect measurement of an individual to be relative to the variability we observe in the measurement of individuals across the entire population?” Thus, we become interested not in measurement error itself, but in the ratio of measurement error to the variability of the property of interest in the population generally.

Traditionally, the reliability of a measurement tool is computed by obtaining multiple measurements of multiple items; if we’re talking scales, then we’d weigh many people and each person would get weighed multiple times. We’d then split each person’s measurements into two halves, obtain the mean per half per person, then correlate half 1 (x) with half 2 (y) across people. Check out the formula for correlation:

r=\frac{cov(x,y)}{sd(x) \times sd(y)}

In the context of reliability, sd(x) and sd(y) are really two attempts at measuring the same thing, the variability of scores in the population, so their product is really just an estimate of the population variance. r is thus expressed as a ratio of the covariance of x and y over the population variance; sound familiar? Hopefully you can see that cov(x,y) must be what carries the information about measurement error: it captures only the variability that the two halves share, and independent measurement error is not shared. Indeed, it can be shown that when x and y are identical (i.e. when there’s no measurement error), cov(x,y) is equal to sd(x) \times sd(y), thus yielding an estimate of r of 1 (perfect correlation/reliability). As you add measurement error, x and y start to differ, yielding values for cov(x,y) that are less than sd(x) \times sd(y), in turn leading to estimates of r that are less than 1.
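To make the split-half procedure concrete, here is a minimal sketch in Python on simulated weighings. Everything about the data is an assumption for illustration: the function names, the normally distributed true weights and errors, the odd/even split, and the specific standard deviations are all made up rather than taken from any real scale.

```python
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation: cov(x, y) / (sd(x) * sd(y))."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))


def simulate_weighings(n_people, n_measures, sd_between, sd_error, rng):
    """Hypothetical data: each person has a true weight (kg), and every
    weighing adds independent, normally distributed measurement error."""
    data = []
    for _ in range(n_people):
        true_weight = rng.gauss(70.0, sd_between)
        data.append([true_weight + rng.gauss(0.0, sd_error)
                     for _ in range(n_measures)])
    return data


def split_half_r(data):
    """Correlate each person's mean over even-indexed weighings (x)
    with their mean over odd-indexed weighings (y)."""
    x = [statistics.mean(row[0::2]) for row in data]
    y = [statistics.mean(row[1::2]) for row in data]
    return pearson_r(x, y)


rng = random.Random(1)  # fixed seed so the sketch is repeatable
precise = simulate_weighings(200, 10, sd_between=10.0, sd_error=0.1, rng=rng)
noisy = simulate_weighings(200, 10, sd_between=10.0, sd_error=10.0, rng=rng)

r_precise = split_half_r(precise)
r_noisy = split_half_r(noisy)
print(f"low-error scale:  r = {r_precise:.3f}")
print(f"high-error scale: r = {r_noisy:.3f}")
```

With almost no measurement error the two halves nearly coincide and r sits close to 1; with error as large as the spread between people, r drops noticeably below 1.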

After obtaining an estimate of reliability by splitting the data and computing the correlation of halves, it is often necessary to apply the Spearman-Brown prediction formula to extrapolate the reliability estimate to the un-split data.
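For reference, the Spearman-Brown prediction formula is short enough to write out. This is a small sketch; the function and parameter names are my own, and `length_factor=2` corresponds to going from one half back to the full, un-split data.

```python
def spearman_brown(r_half, length_factor=2.0):
    """Predicted reliability when the amount of data is multiplied by
    length_factor, given the reliability r_half of the shorter measure."""
    return length_factor * r_half / (1.0 + (length_factor - 1.0) * r_half)


# A split-half correlation of 0.6 extrapolates to a full-data reliability of 0.75.
print(f"{spearman_brown(0.6):.3f}")
```

Because each half contains only half the measurements, the half-versus-half correlation understates the reliability of the mean over all measurements, which is why the correction always moves a positive estimate upward.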

Unfortunately, the correlation approach to computing reliability has troubles. I noted that to compute the correlation you have to split the data in half; how do you decide which observations to put in which half? It turns out that the computed reliability can vary dramatically depending on how you split the data. It’s also the case that the correlation approach can sometimes yield estimates of reliability that are negative, a clearly nonsensical result! One solution is to compute many reliability estimates using a large subset of all possible ways to split the data (I just submitted a paper that employs this approach). This is what Cronbach’s alpha attempts to do, though that approach actually obtains an analytic solution that represents the average of all possible splits. However, it applies only to the case of measures that take a sum across measurements, a truly silly practice.

I’ve attempted to derive the analytic solution for the more reasonable case of obtaining a mean across measurements, and had some success that was subsequently validated by simulation, but I’m still working out some kinks in this solution (like how to account for N). By comparing the reliability estimates obtained by applying this approach and a more standard correlation approach to a lot of real data, I find that my analytic solution already appears to address the reliability of the full data set (the estimates are roughly equivalent to the correlation estimates after the Spearman-Brown prediction formula has been applied to the latter). I must admit I do have some suspicions that I might have simply re-invented intraclass correlation, but I’m still working out the math on whether this is the case.
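The many-splits idea can be sketched as follows. This is only an illustration on simulated data, not the procedure from the submitted paper and not the analytic solution described above; the data-generating assumptions (150 people, 10 measurements each, normal true scores and errors) and all names are hypothetical.

```python
import random
import statistics


def mean(values):
    return sum(values) / len(values)


def pearson_r(x, y):
    """Pearson correlation: cov(x, y) / (sd(x) * sd(y))."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))


def random_split_r(data, rng):
    """Split-half correlation for one random assignment of measurements
    to halves (the same assignment is used for every person)."""
    order = list(range(len(data[0])))
    rng.shuffle(order)
    half = len(order) // 2
    first, second = order[:half], order[half:2 * half]
    x = [mean([row[i] for i in first]) for row in data]
    y = [mean([row[i] for i in second]) for row in data]
    return pearson_r(x, y)


rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical data: a true score per person plus independent error.
data = []
for _ in range(150):
    true_score = rng.gauss(0.0, 5.0)
    data.append([true_score + rng.gauss(0.0, 10.0) for _ in range(10)])

estimates = [random_split_r(data, rng) for _ in range(300)]
mean_r = mean(estimates)
full_r = 2.0 * mean_r / (1.0 + mean_r)  # Spearman-Brown step for the un-split data

print(f"lowest split estimate:  {min(estimates):.3f}")
print(f"highest split estimate: {max(estimates):.3f}")
print(f"mean over splits:       {mean_r:.3f}")
print(f"after Spearman-Brown:   {full_r:.3f}")
```

The gap between the lowest and highest estimates shows how much a single arbitrary split can mislead, while the mean over splits is a steadier summary of the same data.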

But ignoring nuances of computation, I wonder about the true usefulness of reliability as a statistical concept. Sure, it permits comparing apples to oranges, but to what end? Williams, Zimmerman and Zumbo (1995) demonstrate that the relationship between reliability and statistical power is not a simple one. It turns out that increased within-subjects measurement error decreases both reliability and power, whereas increased between-subjects variability also decreases power (for between-subjects designs) but actually increases reliability. So if you can’t use reliability to compare the power of two tools, what other use does it have?
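A rough numerical illustration of that dissociation, under assumptions of my own rather than anything taken from Williams et al.: reliability is treated as between-subjects variance over total observed variance, and power is approximated with a normal (z-test) formula for a two-group comparison with a fixed raw difference between groups. The function names and all numbers are hypothetical.

```python
from statistics import NormalDist


def reliability(sd_between, sd_error):
    """Assumed definition: share of observed variance that is
    between-subjects variance rather than measurement error."""
    return sd_between ** 2 / (sd_between ** 2 + sd_error ** 2)


def approx_power(raw_difference, sd_between, sd_error, n_per_group):
    """Normal approximation to the power of a two-sided, two-group test
    at alpha = .05 (ignores the negligible opposite tail)."""
    sd_observed = (sd_between ** 2 + sd_error ** 2) ** 0.5
    d = raw_difference / sd_observed          # standardized effect size
    z_crit = NormalDist().inv_cdf(0.975)
    return NormalDist().cdf(d * (n_per_group / 2) ** 0.5 - z_crit)


scenarios = [
    ("baseline", 5.0, 5.0),
    ("more measurement error", 5.0, 10.0),
    ("more between-Ss variability", 10.0, 5.0),
]
for label, sd_b, sd_e in scenarios:
    rel = reliability(sd_b, sd_e)
    power = approx_power(3.0, sd_b, sd_e, n_per_group=50)
    print(f"{label:28s} reliability = {rel:.2f}  power = {power:.2f}")
```

In this sketch, both changes inflate the observed spread equally and so lower power by the same amount, yet one lowers reliability and the other raises it.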
