Addressing missing data in public health: An empirical comparison of strategies
Description
Missing data occur frequently in public health databases. In the public health setting, it is common practice to remove from analysis any cases with missing values when summarizing the data. If cases with missing values differ in any systematic way from complete cases, however, then analyses that ignore the fact that some data are missing can produce biased results. Although different strategies have been recommended to address the occurrence of missing data, these strategies have been evaluated with a research, rather than public health, setting in mind. Specifically, the performance of different strategies has been compared in the context of hypothesis testing, which is important in the research setting but which rarely occurs when public health data are used for planning purposes on an ongoing basis. Furthermore, comparisons have predominantly involved continuous data, while analysts in a public health setting almost exclusively summarize categorical data.

The current study was an empirical comparison of several imputation methods used to impute a binary outcome from categorical predictors. Monte Carlo simulations were used to compare the accuracy of these methods under varying sample sizes, amounts of missing data, and departures from ignorable missingness. The study also compared three different strategies for building an imputation model, in order to determine whether the strategies differed in their ability to produce less-biased estimates or in their robustness to violations of the assumption that the data were ignorably missing. Finally, the study applied imputation to the problem of missing HIV transmission risk data in two local public health databases.

The 'score step down' hot deck and logistic regression imputation methods performed similarly across all conditions investigated, and produced estimates that were less biased and generally more precise than complete-case estimates.
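To make the mean-versus-stochastic distinction concrete, the following is a minimal, hypothetical sketch (not the study's actual code) of logistic regression imputation of a binary outcome from dummy-coded categorical predictors. All variable names, coefficient values, and missingness rates are illustrative assumptions; the model is fit to complete cases only, and missing outcomes are then filled in either deterministically ("mean" imputation) or by a Bernoulli draw (stochastic imputation).

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logistic(X, y, iters=500, lr=0.1):
    """Fit a logistic regression by plain gradient ascent (no external deps)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w += lr * X.T @ (y - p) / len(y)
    return w

# --- simulate a binary outcome from two categorical (dummy-coded) predictors ---
n = 2000
x1 = rng.integers(0, 2, n)
x2 = rng.integers(0, 2, n)
X = np.column_stack([np.ones(n), x1, x2])
true_w = np.array([-0.5, 1.0, 0.8])            # illustrative coefficients
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_w)))

# --- delete ~30% of outcomes, missing at random given x1 (ignorable) ---
miss = rng.random(n) < np.where(x1 == 1, 0.4, 0.2)
y_obs = y.astype(float)
y_obs[miss] = np.nan

# --- impute from a model fit to complete cases only ---
obs = ~miss
w_hat = fit_logistic(X[obs], y_obs[obs])
p_miss = 1.0 / (1.0 + np.exp(-X[miss] @ w_hat))

y_mean = y_obs.copy()
y_mean[miss] = (p_miss >= 0.5).astype(float)   # deterministic ("mean") imputation

y_stoch = y_obs.copy()
y_stoch[miss] = rng.binomial(1, p_miss)        # stochastic imputation

print("complete-case prevalence:", round(y_obs[obs].mean(), 3))
print("mean-imputed prevalence: ", round(y_mean.mean(), 3))
print("stochastic prevalence:   ", round(y_stoch.mean(), 3))
```

Because the missingness here depends on an observed predictor, the complete-case prevalence is biased while the model-based imputations are not; repeating the simulation over many replicates and conditions is the Monte Carlo design the abstract describes.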
There were no meaningful differences in accuracy or bias between estimates derived from mean imputations and those derived from stochastic imputations produced by these methods. Logistic regression multiple imputation [25] did not perform as well as either the hot deck or logistic regression imputation methods, and was found to be sensitive to the complexity of the prediction model used for imputation and to the frequency of the outcome being predicted. The three strategies for building prediction models showed no differences in either their ability to produce unbiased estimates or their robustness to the assumption that the data were ignorably missing. When applied to the imputation of missing HIV transmission risk data, the three strategies did yield models that consistently differed in their predictive power; however, these differences were not associated with the accuracy of the imputations, as indicated by the proxy measure of agreement between separate imputations for the same individual.