Psychometric properties of several computer -based test designs with ideal and constrained item pools
The purpose of this study was to compare linear fixed length test (LFT), multi stage test (MST), and computer adaptive test (CAT) designs under three levels of item pool quality, two levels of match between test and item pool content specifications, two levels of test length, and several levels of exposure control expected to be practical for a number of testing programs. This design resulted in 132 conditions that were evaluated using a simulation study with 9000 examinees on several measures of overall measurement precision including reliability, the mean error and root mean squared error between true and estimated ability levels, classification precision including decision accuracy, false positive and false negative rates, and Kappa for cut scores corresponding to 30%, 50%, and 85% failure rates, and conditional measurement precision with the conditional root mean squared error between true and estimated ability levels conditioned on 25 true ability levels.
Test reliability, overall and conditional measurement precision, and classification precision increased with item pool quality and test length, and decreased with less adequate match between item pool and test specification match. In addition, as the maximum exposure rate decreased and the type of exposure control implemented became more restrictive, test reliability, overall and conditional measurement precision, and classification precision decreased. Within item pool quality, match between test and item pool content specifications, test length, and exposure control, CAT designs showed superior psychometric properties as compared to MST designs which in turn were superior to LFT designs. However, some caution is warranted in interpreting these results since the ability of the automated test assembly software to construct test that met specifications was limited in conditions where pool usage was high. The practical importance of the differences between test designs on the evaluation criteria studied is discussed with respect to the inferences test users seek to make from test scores and nonpsychometric factors that may be important in some testing programs.