When to worry more: An empirical investigation of the effects of non-randomly missing data on regression analysis
Statistical remedies exist for most configurations of missing data, but these remedies require specific models and/or measures of nonresponse that are usually unavailable to the researcher. The relevant question therefore becomes: under what conditions does the threat that non-randomly missing data pose to regression analysis increase? This simulation study addresses that question empirically by assessing the effect of various configurations of non-randomly missing data on OLS regression analysis, using different techniques for coping with the missing observations and samples drawn from varying populations. The configurations of missing data vary by which variable has missing observations, which variable drives the response mechanism, and the strength of the response mechanism. Five techniques for coping with the missing observations are compared: listwise deletion, pairwise deletion, regression estimation without the addition of a residual, regression estimation with the addition of a residual, and EM estimation. The regression analysis is completed on samples of different sizes drawn from populations that vary in the degree of correlation between the independent variables and in the effect sizes. The effects of the non-randomly missing observations are assessed in terms of the deviation of the estimated coefficient from its true value and the increase or decrease in the associated standard error relative to its value based on known population parameters. The effect of the missing observations on inference is examined as well. In general, when the strength of the missing data mechanism is low, all techniques except pairwise deletion perform quite well. When the strength of the missing data mechanism is high, regression estimation with the addition of a residual and EM estimation perform better than the other techniques. Pairwise deletion and regression estimation without the addition of a residual consistently produce the worst results. Finally, the most troublesome situation occurs when the chances of observing a variable depend upon the value of the variable itself.
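The kind of design described above can be illustrated with a minimal sketch; it is not the study's actual code. It assumes two correlated normal predictors, a logistic response mechanism, and regression estimation of the missing predictor from the other predictor and the outcome. All function names, parameter values, and the choice of imputation predictors are hypothetical. Here x2 both has the missing observations and drives the response mechanism, and three of the five techniques are compared (pairwise deletion and EM estimation are omitted for brevity).

```python
import numpy as np

def ols(X, y):
    """OLS coefficients (intercept first) via least squares."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def observed_mask(driver, strength, rng):
    """Assumed logistic response mechanism: larger driver values are less
    likely to be observed. strength = 0 gives 50% missing at random."""
    z = (driver - driver.mean()) / driver.std()
    p_obs = 1.0 / (1.0 + np.exp(strength * z))
    return rng.random(len(driver)) < p_obs

def impute_x2(x1, x2, y, obs, add_residual, rng):
    """Regression estimation of missing x2 from x1 and y
    (the choice of predictors is an assumption of this sketch)."""
    preds = np.column_stack([x1, y])
    coef = ols(preds[obs], x2[obs])          # fit on complete cases only
    fitted = np.column_stack([np.ones(len(y)), preds]) @ coef
    if add_residual:
        # Add a normal draw scaled to the complete-case residual spread.
        resid_sd = np.std(x2[obs] - fitted[obs], ddof=3)
        fitted = fitted + rng.normal(0.0, resid_sd, size=len(y))
    return np.where(obs, x2, fitted)         # keep observed values as-is

def run(strength, n=20000, rho=0.3, beta=(1.0, 0.5, 0.5), seed=0):
    """Simulate one sample and return coefficient estimates per technique."""
    rng = np.random.default_rng(seed)
    x = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
    x1, x2 = x[:, 0], x[:, 1]
    y = beta[0] + beta[1] * x1 + beta[2] * x2 + rng.normal(size=n)
    obs = observed_mask(x2, strength, rng)   # x2 drives its own missingness
    results = {"listwise": ols(np.column_stack([x1[obs], x2[obs]]), y[obs])}
    for label, add in (("regression", False), ("regression+residual", True)):
        filled = impute_x2(x1, x2, y, obs, add, rng)
        results[label] = ols(np.column_stack([x1, filled]), y)
        results[label + " var(x2)"] = filled.var()
    return results

if __name__ == "__main__":
    for strength in (0.0, 2.0):
        print("strength =", strength)
        for name, value in run(strength).items():
            print("  ", name, np.round(value, 3))
```

Changing the first argument of `observed_mask` inside `run` to `x1` or `y` would switch which variable drives the response mechanism. Repeating `run` over many seeds would approximate the sampling distribution of each estimate, which is closer to what a full simulation study would report than a single draw.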
0344: Social research