The impact of judges' consensus on the accuracy of anchor-based judgmental estimates of multiple-choice test item difficulty: The case of the NATABOC Examination
Multiple factors have influenced testing agencies to more carefully consider the manner and frequency in which pretest item data are collected and analyzed. One potentially promising development is judges’ estimates of item difficulty. Accurate estimates of item difficulty may be used to reduce pretest samples sizes, supplement insufficient pretest sample sizes, aid in test form construction, assist in test form equating, calibrate test item writers who may be asked to produce items to meet statistical specifications, inform the process of standard setting, aid in preparing randomly equivalent blocks of pretest items, and/or aid in helping to set item response theory prior distributions.
Two groups of 11 and eight judges, respectively, provided estimates of difficulty for the same set of 33 multiple-choice items from the National Athletic Trainers’ Association Board of Certification (NATABOC) Examination. Judges were faculty in Commission on Accreditation of Athletic Training Education-approved athletic training education programs and were NATABOC-approved examiners of the former hands-on practical portion of the Examination.
For each item, judges provided two rounds of independent estimates of item difficulty and a third round group-level consensus estimate. Prior to providing estimates of item difficulty in rounds two and three, group discussion of the estimates provided in the preceding round was conducted.
In general, the judges’ estimates of test item difficulty did not improve across rounds as predicted. Two-way repeated measures analyses of variance comparing item set mean difficulty estimates by round and the item set mean empirical item difficulty revealed no statistically significant differences across rounds, groups, or the interaction of these two factors. Moreover, item set mean difficulty estimates by round gradually drifted away from the item set mean empirical item difficulty and, therefore, mean estimation bias and effect size analyses gradually increased in correspondence with the item set mean item difficulty estimates provided across rounds. Therefore, the results revealed that no item difficulty estimation round yielded statistically significantly better recovery of the empirical item difficulty values compared to the other rounds.