Assessing fit of item response theory models
Item response theory (IRT) modeling is a statistical technique that is widely applied in educational and psychological testing. The usefulness of IRT models, however, depends on how well they reflect the data. Model-data fit must therefore be evaluated before a model is applied, by accumulating a wide variety of evidence that supports the proposed uses of the model with a particular set of data.
This thesis addressed issues in the collection of two major sources of fit evidence to support IRT model application: evidence based on model-data congruence, and evidence based on the intended uses of the model and their practical consequences. Specifically, the study (a) proposed a new goodness-of-fit procedure, examined its performance using fitting and misfitting data, and compared its behavior with that of commonly used goodness-of-fit procedures, and (b) investigated through simulations the consequences of model misfit for two major IRT applications: equating and computerized adaptive testing (CAT).
In all simulation studies, the three-parameter logistic model (3PLM) was assumed to be the true IRT model, while the one- and two-parameter logistic models (1PLM and 2PLM) were treated as misfitting models. The study found that the newly proposed goodness-of-fit statistic correlated consistently more strongly with the true size of misfit than the commonly used fit statistics did, making it a useful index for estimating the degree of misfit, which is often of interest but unknown in practice. A major issue with the new statistic is that its null distribution and critical values are not appropriately defined; as a result, the new statistical test appeared to be less powerful, but also less prone to type I error.
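As an illustration of the simulation setup described above, the following is a minimal sketch of how dichotomous responses might be generated under a three-parameter logistic model, and of how the two- and one-parameter models arise as restricted cases. It is not the code used in the study. The function names, sample sizes, parameter ranges, and the scaling constant D = 1.7 are all assumptions made for the example.

```python
import math
import random


def irf_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the 3PLM.

    theta: examinee ability; a: discrimination; b: difficulty;
    c: lower asymptote (pseudo-guessing). D = 1.7 is an assumed
    scaling constant, not a value taken from the study.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def irf_2pl(theta, a, b, D=1.7):
    """2PLM as the 3PLM with the lower asymptote fixed at zero."""
    return irf_3pl(theta, a, b, 0.0, D)


def irf_1pl(theta, b, D=1.7):
    """1PLM as the 2PLM with a common discrimination (fixed at 1 here)."""
    return irf_3pl(theta, 1.0, b, 0.0, D)


def simulate_responses(thetas, items, rng):
    """Draw a 0/1 response matrix (examinees by items) from the 3PLM."""
    return [
        [1 if rng.random() < irf_3pl(t, a, b, c) else 0 for (a, b, c) in items]
        for t in thetas
    ]


# Hypothetical generating values, chosen only for illustration.
rng = random.Random(0)
thetas = [rng.gauss(0.0, 1.0) for _ in range(500)]
items = [
    (rng.uniform(0.8, 2.0), rng.gauss(0.0, 1.0), rng.uniform(0.10, 0.25))
    for _ in range(20)
]
responses = simulate_responses(thetas, items, rng)
```

Because the 2PLM and 1PLM fix the lower asymptote at zero, they cannot reproduce the nonzero probability of success that low-ability examinees have under the 3PLM, which is one way to see why these models are treated as misfitting when the data come from the 3PLM.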
In examining the consequences of model-data misfit, the study showed that although the 2PLM cannot, in theory, provide a perfect fit to 3PLM data, the consequences were minimal when the 2PLM was used to equate 3PLM data and number-correct scores were reported. This was not the case in CAT, however, given the significant bias that the 2PLM produced. The study further emphasized the importance of evaluating fit both through goodness-of-fit statistical tests and by examining the practical consequences of misfit.