Objectives. To develop an imputation method to produce estimates for suppressed values within a shared government administrative data set to facilitate accurate data sharing and statistical and spatial analyses.
Methods. We developed an imputation approach that incorporated known features of suppressed Massachusetts surveillance data from 2011 to 2017 to predict missing values more precisely. Our methods for 35 de-identified opioid prescription data sets combined modified previous or next substitution followed by mean imputation and a count adjustment to estimate suppressed values before sharing. We modeled 4 methods and compared the results to baseline mean imputation.
Results. We assessed performance by comparing root mean squared error (RMSE), mean absolute error (MAE), and proportional variance between imputed and suppressed values. Our method outperformed mean imputation: we retained 46% of the suppressed values' proportional variance with better precision (22% lower RMSE and 26% lower MAE) than simple mean imputation.
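To make the two error metrics concrete, the following is a minimal Python sketch of how RMSE and MAE can be computed between true (suppressed) counts and their imputed estimates. The function names and the toy counts are hypothetical illustrations, not the study's code or data; the formulas are the standard definitions of the two metrics.

```python
import math


def rmse(true_values, imputed_values):
    """Root mean squared error between true and imputed counts."""
    if len(true_values) != len(imputed_values) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(true_values, imputed_values)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(true_values, imputed_values):
    """Mean absolute error between true and imputed counts."""
    if len(true_values) != len(imputed_values) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    absolute_errors = [abs(t - p) for t, p in zip(true_values, imputed_values)]
    return sum(absolute_errors) / len(absolute_errors)


# Hypothetical toy data (NOT from the study): four suppressed counts and a
# baseline that imputes the same mean value for every suppressed cell.
true_counts = [1, 3, 5, 2]
mean_imputed = [3, 3, 3, 3]

print(rmse(true_counts, mean_imputed))  # 1.5
print(mae(true_counts, mean_imputed))   # 1.25
```

Lower values of either metric indicate imputed values closer to the true counts. RMSE penalizes large individual errors more heavily than MAE, which is one reason to report both.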
Conclusions. Our easy-to-implement imputation technique largely overcomes the adverse effects of low-count value suppression and outperforms simple mean imputation. This novel method is generalizable to researchers sharing protected public health surveillance data. (Am J Public Health. 2021;111(10):1830-1838. https://doi.org/10.2105/AJPH.2021.306432)
In this information age, increasing availability of public health surveillance data is catalyzing groundbreaking research while presenting new challenges related to data privacy and completeness. For example, protected government surveillance data cannot be shared without suppressing small values to protect the confidentiality of individuals,1 which may adversely affect the subsequent analyses. Inference from the analytical results using suppressed data may be subject to bias because of the removal of small count values, yielding potential loss of statistical power because of the reduced sample size. Analyses using data with suppressed values may not produce reliable results for areas with low population counts, for minority population groups, or for rare outcomes.2 Suppression is particularly troublesome for geomapping and spatial analytic methods that rely upon joined data across multiple data sets. Suppressed small cell data disproportionately affect rural and small population areas, may discourage research comparing smaller subsets of the population, and leave large spatial areas with unknown or unreportable risk.2 We describe a novel and practical method that can provide imputed values for protected government data that would otherwise have limited analytic...