A comparison of item response theory true score equating and item response theory-based local equating
The need to compare students across different test administrations, or across different test forms within the same administration, plays a key role in most large-scale testing programs. To make such comparisons, the test forms must be placed on the same scale. Doing so not only allows results from different test forms to be compared to each other, but also facilitates placing those results onto a common reporting scale. The statistical method used to place test scores onto a common metric is called equating.
Estimated true equating, one of the conditional equating methods described by van der Linden (2000), has been shown to be a dramatic improvement over classical equipercentile equating under some conditions (van der Linden, 2006).
The purpose of the study is to investigate, through a simulation study, the performance of estimated true equating relative to IRT true score equating under a variety of conditions that are known to impact equating accuracy, namely: anchor test length, data misfit, scaling method, and examinee ability distribution. The results are evaluated based on root mean squared error (RMSE) and bias of the equating functions, as well as decision accuracy when placing examinees into performance categories. A secondary research question, the relative performance of the scaling methods, is also investigated.
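As an illustration of the two error criteria named above, the following is a minimal sketch of how bias and RMSE of an equating function might be computed at each raw score point across simulation replications. It assumes that each replication yields one equated score per raw score point and that a criterion (true) equated score is available for comparison. The function name `bias_and_rmse`, its arguments, and the toy numbers are hypothetical and are not taken from the study.

```python
import math


def bias_and_rmse(estimates, criterion):
    """Return (bias, rmse) lists, one value per raw score point.

    estimates: list of replications; each replication is a list of
               estimated equated scores, one per raw score point.
    criterion: list of criterion equated scores, one per raw score point.
    Both inputs are hypothetical stand-ins for simulation output.
    """
    n_reps = len(estimates)
    bias, rmse = [], []
    for j, target in enumerate(criterion):
        # Signed error of each replication at raw score point j.
        errors = [rep[j] - target for rep in estimates]
        # Bias: mean signed error across replications.
        bias.append(sum(errors) / n_reps)
        # RMSE: square root of the mean squared error across replications.
        rmse.append(math.sqrt(sum(e * e for e in errors) / n_reps))
    return bias, rmse


# Toy data: three raw score points, two replications.
criterion = [0.0, 1.0, 2.0]
estimates = [[0.5, 1.0, 2.5], [-0.5, 1.0, 3.5]]
bias, rmse = bias_and_rmse(estimates, criterion)
```

In the toy data, the errors at the first score point cancel, so its bias is zero while its RMSE is not. This is why the two criteria are usually reported together.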
The results indicate that estimated true equating shows tremendous promise, with dramatically lower bias and RMSE values than IRT true score equating. However, this promise does not bear out when looking at examinee classification. Despite the lack of significant gains in decision accuracy, this new equating method shows promise in its reduction of error attributable to the equating functions themselves, and therefore deserves further scrutiny.
The results fail to indicate a clear choice of scaling method for use with either equating method. When choosing a scaling method, practitioners must still rely on the growing body of evidence and consider the nature of their own testing programs and the abilities of their examinee populations.