Detecting exposed items in computer-based testing
More and more testing programs are moving from traditional paper-and-pencil to computer-based administration. Common practice in computer-based testing is to reuse test items repeatedly within a short period to accommodate large volumes of examinees, which makes disclosed items a threat to the validity and fairness of test scores. Most current research focuses on controlling item exposure rates, which minimizes the probability that some items are overused, but there is no common understanding of issues such as how long an item pool should be used, what the pool size should be, and what exposure rates are acceptable.
A different approach to addressing overexposure of test items is to focus on generating and investigating item statistics that reveal whether test items are known to examinees before they take the test. This study proposed a method to detect disclosed items by monitoring the moving averages of some common item statistics.
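The monitoring idea can be illustrated with a minimal Python sketch for the simplest statistic, the classical item difficulty (proportion correct). It is not the study's implementation: the function names, the one-sided z-test against a known baseline difficulty, and the critical value `z_crit` are all assumptions made for illustration.

```python
from collections import deque


def moving_p_values(responses, window=200):
    """Moving proportion correct over the most recent `window` responses.

    `responses` is a sequence of 0/1 item scores in administration order.
    """
    buf = deque(maxlen=window)
    total = 0
    out = []
    for r in responses:
        if len(buf) == window:
            total -= buf[0]      # score that is about to leave the window
        buf.append(r)            # deque drops the oldest score automatically
        total += r
        if len(buf) == window:
            out.append(total / window)
    return out


def flag_exposure(responses, baseline_p, window=200, z_crit=2.33):
    """Index of the first examinee whose window looks too easy, else None.

    ASSUMPTION: the item's pre-exposure difficulty `baseline_p` is known, and
    a one-sided normal approximation is used as the flagging rule.
    """
    se = (baseline_p * (1 - baseline_p) / window) ** 0.5
    for i, p in enumerate(moving_p_values(responses, window)):
        if (p - baseline_p) / se > z_crit:
            return i + window - 1   # examinee who closed the flagged window
    return None
```

Calling `flag_exposure(scores, baseline_p=0.5, window=200)` returns the position in the response stream at which the moving average first rises significantly above the baseline, or `None` if it never does.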
Three simulation studies were conducted to investigate and evaluate the usefulness of the method. The statistics investigated included classical item difficulty, IRT-based item raw residuals, and three kinds of IRT-based standardized item residuals. The detection statistic used in Study 1 was the classical item difficulty statistic. Study 2 investigated classical item difficulty, IRT-based item residuals, and the best-known of the IRT-based standardized residuals. Study 3 investigated three different types of standardizations of residuals. Other variables in the simulations included window sizes, item characteristics, ability distributions, and the extent of item disclosure. Empirical Type I error rates and power of the method were computed for different situations. The results showed that, with reasonable window sizes (about 200 examinees), the IRT-based statistics produced the most promising results under a wide variety of conditions and seem ready for immediate implementation. Difficult and discriminating items were the easiest to spot when they had been exposed, and it is the most discriminating items that contribute most to proficiency estimation with multi-parameter IRT models. Therefore, early detection of these items is especially important. The applicability of the approach to large-scale testing programs was also addressed.
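As a rough illustration of an IRT-based residual computed over one window of examinees, the following Python sketch may help. The abstract does not define the residuals, so everything here is an assumption: the three-parameter logistic model without a scaling constant, the function names, and the particular standardization (summed residual divided by its model-implied standard deviation), which is only one possible choice and not necessarily any of the three the study compared.

```python
import math


def p_3pl(theta, a, b, c=0.0):
    """Probability of a correct response under a 3PL model (assumed form)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def window_residuals(thetas, responses, a, b, c=0.0):
    """Raw and standardized residuals for one window of examinees.

    thetas    -- ability estimates of the examinees in the window
    responses -- their 0/1 scores on the monitored item
    Returns (mean raw residual, standardized residual).
    """
    probs = [p_3pl(t, a, b, c) for t in thetas]
    resid_sum = sum(u - p for u, p in zip(responses, probs))
    raw = resid_sum / len(probs)
    # Model-implied variance of the summed 0/1 scores (hypothetical choice).
    var = sum(p * (1.0 - p) for p in probs)
    z = resid_sum / math.sqrt(var)
    return raw, z
```

A large positive standardized residual means examinees in the window are answering the item correctly more often than their abilities predict, which is the pattern expected once an item has been disclosed.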