When a dataset has missing values, it is important to think about why they are missing and how they affect
the analysis. Ignoring missing data can reduce power but, more importantly, it can bias estimates and
lead to incorrect conclusions. So we need to consider the missing data mechanism in order to decide
what to do about it. Rubin (1976) differentiated between three types of
missingness mechanisms:
1. Missing completely at random (MCAR): when cases with missing values can be thought of as a random
sample of all the cases; MCAR occurs rarely in practice.
2. Missing at random (MAR): when, conditional on all the data we have observed, any remaining missingness is
completely random; that is, it does not depend on the unobserved values. Missingness can therefore be
modelled using the observed data, and we can use specialised missing data analysis methods on the
available data to correct for the effects of missingness.
3. Missing not at random (MNAR): when the data are neither MCAR nor MAR. This is difficult to handle because
it requires strong assumptions about the patterns of missingness.
One common way to deal with missing data is to delete all cases for which a value is missing. This
method is called complete case analysis (CC). However, CC is in general valid only if the data are MCAR. Another method is
multiple imputation (MI), a Monte Carlo method that simulates multiple values to impute (fill in) each
missing value, then analyses each imputed dataset separately and finally pools the results together. We
impute missing data multiple times to account for the uncertainty we have about the true (and unknown) values of
the missing data. We will get more comfortable with MI as we work with the example dataset. In theory, MI
can handle all three types of missingness. However, packages that do MI are usually not designed for
the MNAR case. Missing data analysis of MNAR data is more complicated and is not the focus of this lab. Here
we work with a dataset where we assume the data are MAR.
# Source: Missing Data Analysis with mice, http://web.maths.unsw.edu.au/~dwarton/missingDataLab.html (2017-5-7)
# required libraries
library(mice)
library(VIM)
library(lattice)
Dataset: The example data are simulated based on Phase 1 of the Third National Health and Nutrition
Examination Survey (NHANES) by the US National Center for Health Statistics. The study was designed to
assess the health and nutritional status of the population by subgroups such as age group. Age was obtained
through a household screening interview from the individuals themselves, or from neighbours if they were not
present. Thus, age is fully observed. Health status data were collected through physical examinations of the
sampled persons in mobile examination centers. Some of the individuals did not show up, and their absence is
assumed to be mostly due to inability to participate in certain examination procedures. Missingness
is therefore assumed to depend only on the fully observed variable age (a MAR case). Missing proportions are between
30% and 40%.
# load data
data(nhanes)
str(nhanes)
# contains 25 obs & four variables: age (age groups: 20-39, 40-59, 60+), bmi (body mass index),
# hyp (hypertension status) and chl (cholesterol level).
nhanes$age = factor(nhanes$age)
# 5 patterns are observed out of 2^3 possible patterns; we see, for example, that there are 3
# cases where chl is missing whereas all the other variables are observed.
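The pattern summary described in the comment above is the kind of table that md.pattern() in mice returns. A minimal sketch, assuming the nhanes data shipped with mice; the name pat is illustrative and does not come from the original lab:

```r
library(mice)

# nhanes ships with mice: 25 rows, columns age, bmi, hyp, chl
data(nhanes, package = "mice")

# One row per missingness pattern (1 = observed, 0 = missing), plus a final
# row with the number of missing values per variable.
# Row names give the number of cases showing each pattern.
pat <- md.pattern(nhanes)
pat
```

Recent versions of mice also draw the pattern as a plot when this function is called.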
##
## Variables sorted by number of missings:
## Variable Count
## chl 0.40
## bmi 0.36
## hyp 0.32
## age 0.00
# This plot gives the frequencies of the different combinations of missing variables. For example,
# the combination in which all three variables chl, bmi & hyp are missing is the most frequent
# missingness pattern, with about 28% frequency (7 observations).
# Note that blue refers to observed data and red to missing data.
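A plot of this kind, together with the "Variables sorted by number of missings" table, can be produced with aggr() from VIM. The call below is a sketch under assumptions: the argument values and the name agg are chosen for illustration and are not taken from the original lab.

```r
library(mice)  # nhanes data and mdc() colours
library(VIM)   # aggr()

data(nhanes, package = "mice")

# Left panel: proportion of missing values per variable.
# Right panel: combinations of missing (red) and observed (blue) variables,
# annotated with their frequencies because numbers = TRUE.
agg <- aggr(nhanes, col = mdc(1:2), numbers = TRUE, sortVars = TRUE)
```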
# The margin plot of the pairs can be plotted using VIM package as
marginplot(nhanes[, c("chl", "bmi")], col = mdc(1:2), cex.numbers = 1.2, pch = 19)
# Blue box plots summarise the distribution of observed data given the other variable is observed,
# and red box plots summarise the distribution of observed data given the other variable is missing.
# For example, the red box plot in the bottom margin shows that bmi tends to be missing
# for lower cholesterol levels.
# Note that if data are MCAR, we expect the blue and red box plots to be identical.
# We are interested in the linear regression of chl on age and bmi
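The summary that follows is what a complete case fit prints. A minimal sketch of that fit, assuming age has been converted to a factor as in the data-loading step; the name fit.cc is illustrative:

```r
library(mice)

data(nhanes, package = "mice")
nhanes$age <- factor(nhanes$age)

# lm() silently drops every row with a missing value in chl, age or bmi,
# so this is a complete case (CC) analysis.
fit.cc <- lm(chl ~ age + bmi, data = nhanes)
summary(fit.cc)

# Number of cases actually used in the fit
nobs(fit.cc)
```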
##
## Call:
## lm(formula = chl ~ age + bmi, data = nhanes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.504 -19.967 0.755 7.619 58.871
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -28.948 56.221 -0.515 0.61903
## age2 55.810 18.418 3.030 0.01424 *
## age3 104.724 24.843 4.215 0.00225 **
## bmi 6.921 1.951 3.548 0.00624 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.1 on 9 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.7329, Adjusted R-squared: 0.6439
## F-statistic: 8.232 on 3 and 9 DF, p-value: 0.006013
# bmi and age are both significant; however, the fit is based on only 13 observations
# (12 observations are deleted).
mice() imputes each missing value with a plausible value (simulates a value to fill in the missing one)
until all missing values are imputed and the dataset is completed. It repeats the process multiple times,
say m times, and stores all m completed (imputed) datasets.
with() analyses each of the m completed datasets separately, using the analysis model you specify.
pool() combines (pools) all the results together based on Rubin's rules (Rubin, 1987).
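To make the pooling step concrete, the sketch below applies Rubin's rules by hand to a single coefficient (bmi). It is an illustration under assumptions, not the internals of pool(): the seed, the default number of iterations and all object names are chosen here, and the degrees-of-freedom calculation that pool() also performs is left out.

```r
library(mice)

data(nhanes, package = "mice")
nhanes$age <- factor(nhanes$age)

imp <- mice(nhanes, m = 5, printFlag = FALSE, seed = 2525)
fit <- with(imp, lm(chl ~ age + bmi))

# Estimate and squared standard error of the bmi coefficient in each
# completed dataset
est <- sapply(fit$analyses, function(f) unname(coef(f)["bmi"]))
u   <- sapply(fit$analyses, function(f) vcov(f)["bmi", "bmi"])
m   <- length(est)

q_bar <- mean(est)                 # pooled estimate
u_bar <- mean(u)                   # within-imputation variance
b     <- var(est)                  # between-imputation variance
t_var <- u_bar + (1 + 1 / m) * b   # total variance

c(estimate = q_bar, std.error = sqrt(t_var))
```

The between-imputation term is what carries the extra uncertainty due to the missing values; with no variation across imputations the total variance reduces to the ordinary within-dataset variance.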
Run MI by mice:
# Function mice() in the mice package is a Markov Chain Monte Carlo (MCMC) method that uses the
# correlation structure of the data and imputes missing values for each incomplete
# variable m times by iteratively regressing incomplete variables on the other variables.
imp = mice(nhanes, m=5, printFlag=FALSE, maxit = 40, seed=2525)
# The output imp contains m=5 completed datasets. Each dataset can be analysed
# using function with(), which takes an expression for the statistical analysis
# we want to apply to each imputed dataset, as follows
fit.mi = with(data=imp, expr = lm(chl ~ age + bmi))
# Next, we combine all the results of the 5 imputed datasets using the pool() function
combFit = pool(fit.mi)
# Note that the function pool() works for any object having BOTH coef() and vcov() methods,
# such as lm, glm and Arima, and also for lme in the nlme package.
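# print with 2 significant digits (round() fails when summary() returns a data frame
# with a non-numeric term column, as it does in recent mice versions)
print(summary(combFit), digits = 2)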
# This result shows that bmi and age are significant. But are m=5 imputed datasets sufficient?
# Maybe, but if possible, it would be better to increase m to check, e.g. m=20:
# The results are not much changed. MI works for as low as m=5 for this example.
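The m=20 run is not shown in the text. A sketch of how it could look, assuming the same settings as the m=5 call; the name imp20 is the one used later in the convergence plot, while fit.mi20 and combFit20 are illustrative:

```r
library(mice)

data(nhanes, package = "mice")
nhanes$age <- factor(nhanes$age)

# Same call as for m = 5, with only the number of imputations changed
imp20 <- mice(nhanes, m = 20, printFlag = FALSE, maxit = 40, seed = 2525)

# Fit the analysis model to each of the 20 completed datasets and pool
fit.mi20  <- with(data = imp20, expr = lm(chl ~ age + bmi))
combFit20 <- pool(fit.mi20)
summary(combFit20)
```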
## est lo 95 hi 95 fmi
## R^2 0.4803841 0.1091925 0.7698717 0.3639035
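A pooled R^2 table like the one above is what pool.r.squared() in mice prints. The source does not show which fit produced it, so the call below (on an m=5 fit with default iterations) is an assumption and need not reproduce those exact numbers:

```r
library(mice)

data(nhanes, package = "mice")
nhanes$age <- factor(nhanes$age)

imp <- mice(nhanes, m = 5, printFlag = FALSE, seed = 2525)
fit <- with(imp, lm(chl ~ age + bmi))

# Pooled R^2 with a 95% interval and the fraction of missing information
r2 <- pool.r.squared(fit)
r2
```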
# Check for implausible imputations (values that are clearly impossible, e.g. negative values for bmi)
# The imputations, for example for bmi, are stored as
imp$imp$bmi
## 1 2 3 4 5
## 1 29.6 28.7 20.4 27.2 28.7
## 3 29.6 28.7 30.1 30.1 28.7
## 4 27.5 22.7 25.5 27.5 27.4
## 6 27.4 22.7 22.7 27.4 22.7
## 10 22.7 27.5 22.0 22.5 27.5
## 11 25.5 29.6 27.2 28.7 28.7
## 12 25.5 28.7 22.5 27.4 22.0
## 16 27.5 28.7 28.7 35.3 30.1
## 21 20.4 22.7 22.5 26.3 30.1
# Completed datasets (observed and imputed), for example the second one, can be extracted by
imp_2 = complete(imp, 2)
# We can inspect the distributions of the original and the imputed data:
# Blue represents the observed data and red shows the imputed data. These colours are
# used consistently from now on.
# Here, we expect the red points (imputed data) to have almost the same shape as the blue points
# (observed data). Blue points are constant across imputed datasets, but red points differ
# from each other, which represents our uncertainty about the true values of the missing data.
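A plot of this kind can be drawn with the stripplot() method that mice provides for imputation objects. A minimal sketch, with the plotting arguments and the name p chosen for illustration:

```r
library(mice)
library(lattice)

data(nhanes, package = "mice")
nhanes$age <- factor(nhanes$age)

imp <- mice(nhanes, m = 5, printFlag = FALSE, seed = 2525)

# One panel per variable; column 0 holds the observed data (blue only),
# columns 1..m hold the completed datasets (imputed points in red).
p <- stripplot(imp, pch = 20, cex = 1.2)
print(p)
```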
# This plot compares the density of the observed data with the densities of the imputed data.
# We expect them to be similar (though not identical) under the MAR assumption.
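The density comparison can be sketched with the densityplot() method for imputation objects in mice. The settings and the name d are illustrative:

```r
library(mice)
library(lattice)

data(nhanes, package = "mice")
nhanes$age <- factor(nhanes$age)

imp <- mice(nhanes, m = 5, printFlag = FALSE, seed = 2525)

# Blue curve: density of the observed values.
# Red curves: density of the imputed values, one curve per imputed dataset.
d <- densityplot(imp)
print(d)
```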
Convergence monitoring:
# MICE runs m parallel chains, each with a certain number of iterations, and imputes
# values from the final iteration. How many iterations does mice() use, and how can we
# make sure that this number is enough?
# To monitor convergence we can use trace plots, which plot estimates against the iteration number.
plot(imp20)
# The plot shows the mean and standard deviation of the imputed values of each variable across
# the iterations, for each of the m imputed datasets.
# See the univariate imputation model for each incomplete variable that mice() used
# for your data as default
imp$method
pmm stands for predictive mean matching, the default method of mice() for imputing continuous incomplete
variables. For each missing value, pmm finds a set of observed cases whose predicted means are closest to the
predicted mean of the missing case, and imputes the missing value by a random draw from the observed values in that set.
Therefore, pmm is restricted to the observed values, and might do fine even for categorical data (though this is not recommended).
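The matching idea can be written out in a few lines of base R. This is a simplified sketch and not the algorithm mice uses internally: it fits a single ordinary regression and skips the random draw of regression parameters that a proper multiple imputation method needs. The function pmm_sketch, its arguments and the choice of 5 donors are all made up for this illustration.

```r
# Simplified predictive mean matching for one incomplete variable y
# with one fully observed predictor x (illustration only).
pmm_sketch <- function(y, x, donors = 5) {
  obs  <- !is.na(y)
  fit  <- lm(y ~ x, subset = obs)                     # fit on observed cases
  pred <- predict(fit, newdata = data.frame(x = x))   # predicted mean, all cases
  obs_idx <- which(obs)
  k <- min(donors, length(obs_idx))
  y_imp <- y
  for (i in which(!obs)) {
    # Observed cases whose predicted mean is closest to that of case i
    dist <- abs(pred[obs_idx] - pred[i])
    pool <- obs_idx[order(dist)[seq_len(k)]]
    # Impute with the observed value of one randomly chosen donor
    y_imp[i] <- y[pool[sample.int(length(pool), 1)]]
  }
  y_imp
}

set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20)
miss <- c(3, 8, 15)
y[miss] <- NA
out <- pmm_sketch(y, x)
out[miss]
```

Every imputed value is copied from an observed case, which is why pmm can never produce a value outside the observed range.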
# For example, logreg stands for Bayesian logistic regression for a binary incomplete variable.
# You can run ?mice.impute.logreg to get more information on whether the imputation model is
# suitable for the incomplete variable you want to impute.
# Since hyp is binary we want to change the default; also we might want to change the imputation
# model for bmi to a (Bayesian) normal linear model:
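# start from the defaults and change only bmi and hyp;
# the result is c("", "norm", "logreg", "pmm") for age, bmi, hyp, chl
meth = imp$method
meth[c("bmi", "hyp")] = c("norm", "logreg")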
# You might get an error when running mice if the method you specify is not consistent with the
# type of the variable in the dataset. So, for this specified method, you also need to
# change the type of hyp to a 2-level factor or a yes/no binary variable.
nhanes$hyp = factor(nhanes$hyp)
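With hyp converted to a factor, the modified methods can be passed to mice() through its method argument. A sketch, assuming the same seed as before; the name imp.meth is illustrative:

```r
library(mice)

data(nhanes, package = "mice")
nhanes$age <- factor(nhanes$age)
nhanes$hyp <- factor(nhanes$hyp)   # logreg needs a 2-level factor

# One entry per column (age, bmi, hyp, chl); "" means nothing to impute
meth <- c("", "norm", "logreg", "pmm")

imp.meth <- mice(nhanes, m = 5, method = meth, printFlag = FALSE, seed = 2525)

# Confirm which method was used for each variable
imp.meth$method
```

Because norm draws from a normal linear model, imputed bmi values are no longer restricted to observed values, so the check for implausible imputations becomes more relevant.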
Predictor matrix: Which variables does mice() use as predictors for imputation of each incomplete variable?
imp$predictorMatrix
# By default mice() includes all the variables in the data; it is suggested that we use as many
# variables as possible. Do not exclude a variable as a predictor from the imputation model if
# that variable is going to be included in your final analysis model (either as response
# or predictor) in function with().
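If you do need to change the predictors, the matrix can be edited and passed back through the predictorMatrix argument. The sketch below removes hyp from the imputation model for bmi purely to show the mechanics; hyp is not part of the analysis model chl ~ age + bmi, so this particular choice does not conflict with the advice above. The names ini, pred and imp.pred are illustrative.

```r
library(mice)

data(nhanes, package = "mice")
nhanes$age <- factor(nhanes$age)

# A dry run (maxit = 0) returns the default settings without imputing
ini  <- mice(nhanes, maxit = 0, printFlag = FALSE)
pred <- ini$predictorMatrix

# Rows are the variables being imputed, columns are their predictors:
# a 0 in row bmi, column hyp means hyp is not used to impute bmi.
pred["bmi", "hyp"] <- 0

imp.pred <- mice(nhanes, m = 5, predictorMatrix = pred,
                 printFlag = FALSE, seed = 2525)
imp.pred$predictorMatrix
```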
vis = imp$visitSequence; vis
# mice() imputes values according to the column sequence of variables in the dataset, from left
# to right. Here, for a specified iteration of the mice sampler, first bmi is imputed, then
# hyp is imputed using the currently imputed bmi and the chl imputed in the previous
# iteration, and so on.
# You can set the visiting sequence manually, by increasing order of missing data, as
vis = c(3,2,4)
# or alternatively
vis="monotone"
# However, if there is no specific pattern in the missingness (we say the missingness pattern
# is general or arbitrary), we do not expect the visiting sequence to affect the results much.
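The chosen sequence takes effect only when it is passed to mice() through the visitSequence argument. A sketch using the "monotone" keyword, with the seed and the name imp.vis chosen for illustration:

```r
library(mice)

data(nhanes, package = "mice")
nhanes$age <- factor(nhanes$age)

# "monotone" visits variables in increasing order of their number of
# missing values
imp.vis <- mice(nhanes, m = 5, visitSequence = "monotone",
                printFlag = FALSE, seed = 2525)

# The sequence that was actually used
imp.vis$visitSequence
```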
Model testing:
# mice provides a Wald test for testing models. It is also possible to do a likelihood ratio test if
# the final analysis model is logistic regression. You can choose the test in the pool.compare()
# function with the "method" option.
## [,1]
## [1,] 0.002859227
Bibliography:
Rubin, D.B., Inference and missing data. Biometrika, 1976. 63(3): p.581-592.
Rubin, D.B., Multiple Imputation for Nonresponse in Surveys. 1987, New York: John Wiley.
Templ, M., et al., VIM: visualization and imputation of missing values. R package version 2.3, 2011.
Van Buuren, S., Flexible Imputation of Missing Data. 2012, Chapman & Hall/CRC.