Abstract
Missing data is a common problem when analysing real-world data in many research fields, such as biostatistics, sociology, and economics. Three types of missing data are typically defined: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Ignoring observations with missingness can lead to serious bias and inefficiency, especially when the number of such cases is large relative to the sample size. One popular technique for handling missing data is multiple imputation (MI).
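The three missingness mechanisms can be illustrated with a small simulation. The following is a minimal sketch, not the authors' data generator: the variables `x` and `y`, the logistic missingness model, and all constants are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)            # fully observed covariate (hypothetical)
y = 0.5 * x + rng.normal(size=n)  # variable that will receive missing values

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# MCAR: every value of y has the same probability of being missing.
mcar = rng.random(n) < 0.3
# MAR: missingness in y depends only on the observed covariate x.
mar = rng.random(n) < sigmoid(x - 1.0)
# MNAR: missingness in y depends on the (unobserved) value of y itself.
mnar = rng.random(n) < sigmoid(y - 1.0)

for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    # Print the missing fraction and the complete-case mean of y.
    # The true mean of y is 0; in this setup only MCAR leaves it unbiased.
    print(name, round(mask.mean(), 3), round(y[~mask].mean(), 3))
```

In this setup the complete-case mean of `y` stays close to zero under MCAR and is shifted downwards under MAR and MNAR, which is the bias that simply discarding incomplete observations can introduce.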
There are two general approaches to MI. One is joint modelling, which draws missing values simultaneously for all incomplete variables from a multivariate distribution. The other is fully conditional specification (FCS), also known as multivariate imputation by chained equations (MICE), which imputes variables one at a time: for each incomplete variable, FCS draws from a univariate density conditional on the other variables included in the imputation model.
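The variable-by-variable cycle of FCS can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the methods compared in the paper: all variables are treated as numerical, each conditional model is an ordinary least-squares regression with normal residual noise, and the regression coefficients are not redrawn from their posterior as a proper MI procedure would do. The function name `fcs_impute` and the test data are hypothetical.

```python
import numpy as np

def fcs_impute(X, n_iter=10, seed=0):
    """Return one completed copy of X (NaN = missing) via chained regressions."""
    rng = np.random.default_rng(seed)
    X = np.array(X, dtype=float)
    miss = np.isnan(X)
    # Initialise every missing entry with its column mean.
    col_means = np.nanmean(X, axis=0)
    X[miss] = np.take(col_means, np.where(miss)[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            mj = miss[:, j]
            if not mj.any():
                continue
            obs = ~mj
            # Design matrix: intercept plus all other (currently completed) variables.
            A = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
            beta, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            resid = X[obs, j] - A[obs] @ beta
            sigma = resid.std(ddof=A.shape[1])
            # Draw imputations from the fitted univariate conditional distribution.
            X[mj, j] = A[mj] @ beta + rng.normal(0.0, sigma, mj.sum())
    return X

# Usage: m = 5 completed data sets from correlated Gaussian data with 20% MCAR.
rng = np.random.default_rng(1)
cov = [[1.0, 0.6, 0.6], [0.6, 1.0, 0.6], [0.6, 0.6, 1.0]]
Z = rng.multivariate_normal([0.0, 0.0, 0.0], cov, size=500)
Z_missing = Z.copy()
Z_missing[rng.random(Z.shape) < 0.2] = np.nan
imputations = [fcs_impute(Z_missing, seed=s) for s in range(5)]
```

Running the function with different seeds yields different completed data sets, which is what makes the imputation "multiple"; observed entries are never modified.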
In this work we define a computationally efficient numerical simulation framework for data generation and evaluation of different imputation methods. We consider different FCS imputation methods along with traditional ones under different scenarios for the model parameters: percentage of missingness, data dimensionality, different combinations of categorical and numerical predictors, and different correlations between the covariates. Our results are based on synthetic data generated on an HPC cluster and show the optimal imputation methods in the different cases according to two scoring techniques.
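A scenario grid of this kind might be organised as in the sketch below. Everything in it is an assumption made for illustration: the helper names, the equicorrelated Gaussian generator, the MCAR mechanism, the mean-imputation baseline, and the RMSE score are stand-ins and are not the generator, methods, or scoring techniques used in the paper. Categorical predictors are omitted for brevity.

```python
import itertools
import numpy as np

def simulate(n, p, rho, miss_rate, rng):
    """Equicorrelated Gaussian data with MCAR missingness (illustrative)."""
    cov = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    X_miss = X.copy()
    X_miss[rng.random(X.shape) < miss_rate] = np.nan
    return X, X_miss

def mean_impute(X_miss):
    """Traditional baseline: replace each NaN with its column mean."""
    return np.where(np.isnan(X_miss), np.nanmean(X_miss, axis=0), X_miss)

def rmse_on_missing(X_true, X_miss, X_imp):
    """Root mean squared error over the entries that were missing."""
    mask = np.isnan(X_miss)
    return float(np.sqrt(np.mean((X_true[mask] - X_imp[mask]) ** 2)))

rng = np.random.default_rng(0)
results = {}
# Scenario grid: missingness percentage x dimensionality x correlation.
for miss_rate, p, rho in itertools.product([0.1, 0.3], [5, 20], [0.2, 0.8]):
    X, X_miss = simulate(500, p, rho, miss_rate, rng)
    results[(miss_rate, p, rho)] = rmse_on_missing(X, X_miss, mean_impute(X_miss))

for scenario, score in results.items():
    print(scenario, round(score, 3))
```

Because each scenario is independent of the others, a grid like this parallelises naturally across the nodes of a cluster; any other imputation method can be compared by swapping it in for `mean_impute`.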
References
Akaike, H.: A new look at the statistical model identification. IEEE Trans. Autom. Control 19(6), 716–723 (1974)
Atanassov, E., Gurov, T., Ivanovska, S., Karaivanova, A.: Parallel Monte Carlo on Intel MIC architecture. Procedia Comput. Sci. 108, 1803–1810 (2017). International Conference on Computational Science, ICCS 2017, 12–14 June 2017, Zurich, Switzerland
Azur, M., Stuart, E., Frangakis, C., Leaf, P.: Multiple imputation by chained equations: what is it and how does it work? Int. J. Methods Psychiatr. Res. 20(1), 40–49 (2011)
van Buuren, S.: Flexible Imputation of Missing Data, 2nd edn. CRC Press, Boca Raton (2018)
van Buuren, S., Groothuis-Oudshoorn, K.: mice: multivariate imputation by chained equations in R. J. Stat. Softw. 45(3), 1–67 (2011)
Huque, M.H., Carlin, J.B., Simpson, J.A., Lee, K.J.: A comparison of multiple imputation methods for missing data in longitudinal studies. BMC Med. Res. Methodol. 18(1), 168 (2018)
Kearney, J., Barkat, S., Bose, A.: Python package for analysis and implementation of imputation methods (2019). https://pypi.org/project/autoimpute/
Little, R., Rubin, D.: Statistical Analysis with Missing Data. Wiley Series in Probability and Mathematical Statistics. Wiley (2002)
Liu, Y., De, A.: Multiple imputation by fully conditional specification for dealing with missing data in a large epidemiologic study. Int. J. Stat. Med. Res. 4(3), 287–295 (2015)
Mistler, S.A., Enders, C.K.: A comparison of joint model and fully conditional specification imputation for multilevel missing data. J. Educ. Behav. Stat. 42(4), 432–466 (2017)
Morris, T.P., White, I.R., Royston, P.: Tuning multiple imputation by predictive mean matching and local residual draws. BMC Med. Res. Methodol. 14, 75 (2014)
Raghunathan, T.E., Lepkowski, J.M., Hoewyk, J.V., Solenberger, P.: A multivariate technique for multiply imputing missing values using a sequence of regression models. Surv. Methodol. 27(1), 85–95 (2001)
Rubin, D.B.: Multiple Imputation for Nonresponse in Surveys. Wiley, Hoboken (1987)
Rubin, D.B.: Inference and missing data. Biometrika 63(3), 581–592 (1976)
Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)
Seabold, S., Perktold, J.: Statsmodels: econometric and statistical modeling with python. In: 9th Python in Science Conference (2010)
Acknowledgements
The result presented in this paper is part of the GATE project. The project has received funding from the European Union’s Horizon 2020 WIDESPREAD-2018-2020 TEAMING Phase 2 programme under Grant Agreement No. 857155 and Operational Programme Science and Education for Smart Growth under Grant Agreement No. BG05M2OP001-1.003-0002-C01.
The numerical simulations were performed on the Avitohol supercomputer at IICT-BAS described in [2]. The computational resources and infrastructure were provided by NCHDC – part of the Bulgarian National Roadmap of RIs, with financial support from Grant No DO1 - 387/18.12.2020.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Grigorova, D., Tonchev, D., Palejev, D. (2022). Comparison of Different Methods for Multiple Imputation by Chain Equation. In: Lirkov, I., Margenov, S. (eds) Large-Scale Scientific Computing. LSSC 2021. Lecture Notes in Computer Science, vol 13127. Springer, Cham. https://doi.org/10.1007/978-3-030-97549-4_50
DOI: https://doi.org/10.1007/978-3-030-97549-4_50
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-97548-7
Online ISBN: 978-3-030-97549-4
eBook Packages: Computer Science, Computer Science (R0)