Evaluating the consistency and accuracy of proficiency classifications using item response theory
As required by the No Child Left Behind (NCLB) legislation, state-mandated testing has increased dramatically, and almost all of these tests report examinees' performance in terms of several ordered proficiency categories. Like licensure exams, these assessments often carry high-stakes consequences, such as graduation requirements and school accountability. Such tests therefore need to be of high quality, and that quality can be assessed, in part, through the decision accuracy (DA) and decision consistency (DC) indices.
As item response theory (IRT) has grown in popularity, an increasing number of testing programs have adopted it for test development, test score equating, and other data analyses, which naturally calls for approaches to evaluating DA and DC within the IRT framework. However, it is still common for all data analyses to be carried out in IRT while the reported DA and DC indices are derived within the framework of classical test theory (CTT). This inconsistency underscores the need to explore ways of quantifying DA and DC under IRT.
The current project addressed several possible methods for estimating DA and DC in the framework of IRT, with a specific focus on tests involving both dichotomous and polytomous items. It consisted of several simulation studies in which all of the IRT methods introduced were evaluated with simulated data; the methods were also applied to real data to demonstrate their use in practice. Overall, the results supported the use of the three IRT methods introduced in this project for estimating DA and DC indices in most of the simulated situations. In most cases the three IRT methods produced results that were close to the "true" DA and DC values and that were consistent with (and sometimes better than) those from the commonly used L&L method. The IRT methods appeared to be more robust to the shape of the distribution than to test length. Implications for educational measurement and some directions for future studies in this area were also discussed.