Approaches for addressing the fit of item response theory models to educational test data
The study was carried out to accomplish three goals : (1) Propose graphical displays of IRT model fit at the item level and suggest fit procedures at the test level that are not impacted by large sample size, (2) examine the impact of IRT model misfit on proficiency classifications, and (3) investigate consequences of model misfit in assessing academic growth.
The main focus of the first goal was on the use of more and better graphical procedures for investigating model fit and misfit through the use of residuals and standardized residuals at the item level. In addition, some new graphical procedures and a non-parametric test statistic for investigating fit at the test score level were introduced, and some examples were provided. Based on a realistic dataset from a high school assessment, statistical and graphical methods were applied and results were reported. More important than the results about the actual fit, were the procedures that were developed and evaluated.
In addressing the second goal, practical consequences of IRT model misfit on performance classifications and test score precision were examined. It was found that with several of the data sets under investigation, test scores were noticeably less well recovered with the misfitting model; and there were practically significant differences in the accuracy of classifications with the model that fit the data less well.
In addressing the third goal, the consequences of model misfit in assessing academic growth in terms of test score precision, decision accuracy and passing rate were examined. The three-parameter logistic/graded response (3PL/GR) models produced more accurate estimates than the one-parameter logistic/partial credit (1PL/PC) models, and the fixed common item parameter method produced closer results to “truth” than linear equating using the mean and sigma transformation.
IRT model fit studies have not received the attention they deserve among testing agencies and practitioners. On the other hand, IRT models can almost never provide a perfect fit to the test data, but the evidence is substantial that these models can provide an excellent framework for solving practical measurement problems. The importance of this study is that it provides ideas and methods for addressing model fit, and most importantly, highlights studies for addressing the consequences of model misfit for use in making determinations about the suitability of particular IRT models.