A Thesis
Presented to
The Faculty of the Mathematics Department
College of Science
De La Salle University - Manila
In Partial Fulfillment
of the Requirements for the Degree
Bachelor of Science in Statistics Major in Actuarial Science
by
Diana Camille B. Cortes
James Edison T. Pangan
August 2007
Approval Sheet
PANEL OF EXAMINERS
Acknowledgments
The researchers would like to extend their warmest gratitude to the following
people, who have undoubtedly contributed to the success of this study:
• To Dr. Jun Pacificador Jr., for his supervision, suggestions and guidance
throughout the duration of this thesis.
• To our panelists, Dr. Rechel Arcilla, Prof. Imelda de Mesa and Ms. Michele
Tan for helping us improve our thesis.
• To Dr. Ederlina Nocon, for providing us the software LaTeX during THSMTH1.
• To our parents, especially Jed's mother, Prof. Erlinda Pangan, for constantly
reminding the researchers about the thesis (i.e. "Is your thesis done yet?").
• To Mark Nanquil and Norman Rodrigo, for helping us in using LaTeX and
for their unwavering support to our thesis.
• To our classmates, friends from COSCA, La Salle Debate Society and Math
Circle for their continuous encouragement and support.
• Lastly, to The Lord Almighty, for providing us the strength, patience, wis-
dom and determination to finish this thesis.
Table of Contents
Title Page
Approval Sheet
Acknowledgments
Table of Contents
Abstract
1.1 Introduction
3 Conceptual Framework
Deterministic Regression
Stochastic Regression
4 Methodology
6 Conclusion
List of Tables
• Table 1: Imputed Values of GPA Using HDI
• Table 15: Ranking of the Different Imputation Methods: 10% NRR
• Table 16: Ranking of the Different Imputation Methods: 20% NRR
• Table 17: Ranking of the Different Imputation Methods: 30% NRR
List of Figures

Abstract
Several imputation methods have been developed for imputing missing responses,
and it is often unclear which method is best suited to a particular situation.
In choosing an imputation method, several factors should be considered, such as
the types of estimates that will be generated, the type and pattern of nonresponse,
and the availability of auxiliary data that are highly correlated with the
characteristic of interest or with the response propensity.
This study compared the effectiveness of four imputation methods, namely
Overall Mean, Hot Deck, Deterministic Regression and Stochastic Regression
Imputation, using the first-visit variable as the auxiliary variable. Values of the
second-visit variables Total Income and Total Expenditures (TOTIN2 and
TOTEX2) were set to nonresponse to satisfy the assumption of partial
nonresponse. The results of the study provide some support for the following
conclusions: (a) for the 1997 FIES data, the Hot Deck Imputation and Overall
Mean Imputation methods are not appropriate for handling partial nonresponse
data; (b) Stochastic Regression Imputation was selected as the best imputation
method; and (c) the imputation classes must be homogeneous to produce less
biased estimates.
Chapter 1
1.1 Introduction
Missing data in sample surveys is inevitable. The problem of missing data occurs
for various reasons, such as when the respondent has moved to another location,
refuses to participate in the survey, or is unable to answer specific items. This
failure to obtain responses from the units selected in the sample is called
nonresponse. There are several types of nonresponse: (a) unit nonresponse
refers to the failure to collect any data from a sample unit; (b) item
nonresponse refers to the failure to collect valid responses to one or more items
from a responding sample unit (e.g. in surveys with only one phase, or when a
single phase is considered and the others ignored); and (c) partial nonresponse
occurs when there is a failure to collect responses for a large set or block of items
from a responding unit (e.g. in a two-phase survey, when the respondent cannot
answer in the second phase, so the second-phase items are missing).
The effect of nonresponse must not be ignored, since it leads to biased estimates.
In practice, there are three ways of handling missing data: discarding the missing
values, applying weighting adjustments, or using imputation methods. Discarding
the missing values, otherwise known as the Available Case Method, excludes
the nonresponse records when analyzing the variable of interest. The problem
with this method is that it does not account for differences in characteristics
between the responding and nonresponding units. Hence, methods for
compensating for missing data are applied. Another method is weighting
adjustments, which matches nonrespondents to respondents in terms of the
data available on nonrespondents and increases the weights of the matched
respondents to account for the missing values: the respondents' weights are
multiplied by the inverse of the response rate. This is often applied to unit
nonresponse.
On the other hand, imputation is also used by statisticians to account for non-
response, usually in the case of item and partial nonresponse. In imputation, a
missing value is replaced by a reasonable substitute for the missing information.
Once nonresponse has been dealt with, whether by weighting adjustments or im-
putation, then researchers can proceed with their data analysis.
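A toy sketch of the three approaches just described, using made-up numbers in
which None marks a nonresponding unit:

```python
# Hypothetical responses: None marks a nonresponding unit.
responses = [120, None, 95, 130, None, 110, 105, None]

# (1) Available Case Method: drop the missing records and analyze the rest.
observed = [y for y in responses if y is not None]
available_case_mean = sum(observed) / len(observed)

# (2) Weighting adjustment: inflate each respondent's weight by the inverse
# of the response rate so respondents stand in for the nonrespondents.
response_rate = len(observed) / len(responses)
weights = [1 / response_rate] * len(observed)
weighted_total = sum(w * y for w, y in zip(weights, observed))

# (3) Imputation: replace each missing value with a reasonable substitute
# (here, the respondent mean) and keep the full sample size.
imputed = [y if y is not None else available_case_mean for y in responses]
```

The available case estimate uses only five of the eight units, while the weighted
and imputed versions compensate for the three nonrespondents in different ways.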
With the 1997 FIES as the data set for this study, this paper will focus on dealing
with partial nonresponse through the use of imputation methods. It aims to exam-
ine the effects of imputed values in coming up with estimates for the missing data
at various nonresponse rates. Furthermore, the study aims to determine which
imputation method is appropriate for the FIES data by applying some of the
methods discussed in the study of the 1978 Research Panel Survey for the
Income Survey Development Program (ISDP) entitled Compensating for Missing
Data by Kalton (1983).
1. Which imputation method is the most appropriate for the FIES data?
2. How do varying nonresponse rates affect the results for each imputation
method?
Since most statistical packages require the use of complete data before conducting
any procedure for data analysis, the use of imputation methods can ensure con-
sistency of results across analyses, something that an incomplete data set cannot
fully provide.
More importantly, given the great impact of this survey on the country,
employing imputation methods helps statisticians handle nonresponse, which
could lead to more meaningful generalizations about our country's income
distribution, spending patterns and poverty incidence. Hence, estimates with
less bias and more consistent results can help our policymakers and economists
provide better solutions for improving the lives of Filipinos.
Throughout this paper, only the 1997 Family Income and Expenditure Survey
(FIES) will be used to tackle the problem of nonresponse and to examine the
impact of the different imputation methods applied to the dataset. With regard
to the extent to which these imputation methods will be applied and evaluated,
this paper covers only the partial nonresponse occurring in the National Capital
Region (NCR), since NCR is noted as the region with the highest nonresponse
rate. Also, the variables to be imputed in this study are Total Income (TOTIN2)
and Total Expenditures (TOTEX2) in the second visit of the FIES data.
The researchers will only focus on using the 1997 FIES data on the first visit
to impute the partial nonresponse present in the second visit. This paper
also assumes that the first-visit data are complete and that the pattern of
nonresponse follows the Missing Completely at Random (MCAR) case. The
MCAR case holds if the probability of response to Y is unrelated to the value of
Y or to any other variables, making the missing data randomly distributed
across all cases (Musil et al., 2002). If the pattern of nonresponse does not
satisfy the MCAR assumption, imputation methods may not achieve their
purpose.
As for the imputation methods, only four will be applied for this paper namely:
Overall Mean Imputation (OMI), Hot Deck Imputation (HDI), Deterministic Re-
gression Imputation (DRI) and Stochastic Regression Imputation (SRI).
In evaluating the efficacy and appropriateness of the four imputation methods,
this study is limited to the following: (a) the bias of the mean of the imputed
data, (b) an assessment of the distributions of the imputed versus the actual
data, and (c) the criteria mentioned in the report entitled Compensating for
Missing Data (Kalton, 1983), namely the Mean Deviation, Mean Absolute
Deviation and the Root Mean Square Deviation.
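The three evaluation criteria can be sketched as follows; the formulas assume
the usual definitions (signed mean difference, mean absolute difference, and root
mean squared difference between imputed and actual values):

```python
import math

def mean_deviation(actual, imputed):
    # Mean Deviation: average signed difference between imputed and actual
    # values; a negative value indicates systematic underestimation.
    return sum(yi - ya for ya, yi in zip(actual, imputed)) / len(actual)

def mean_absolute_deviation(actual, imputed):
    # Mean Absolute Deviation: average magnitude of the differences,
    # measuring the ability to reconstruct the deleted values.
    return sum(abs(yi - ya) for ya, yi in zip(actual, imputed)) / len(actual)

def root_mean_square_deviation(actual, imputed):
    # Root Mean Square Deviation: like MAD but penalizing large misses more.
    return math.sqrt(
        sum((yi - ya) ** 2 for ya, yi in zip(actual, imputed)) / len(actual)
    )
```

The Mean Deviation measures bias (errors can cancel), while the other two
measure accuracy of reconstruction (errors cannot cancel).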
Chapter 2
Much research effort has been devoted to the efficacy of various imputation
methods. In the report entitled Compensating for Missing Survey Data, two
simulation studies were carried out using data from the 1978 Income Survey
Development Program (ISDP) Research Panel to compare some imputation
methods. The first
study compared imputation methods for the variable Hourly Rate of Pay while
the second dealt with the imputation of the variable Quarterly Earnings. For both
studies, the author stratified the data into its imputation classes, constructed data
sets with missing values by randomly deleting some of the recorded values in the
original dataset and then applied the various imputation methods to fill in the
missing values. This process was replicated ten times to ensure consistency of the
results. Once the imputation methods have been applied, the three measures for
evaluating the effectiveness of imputation methods namely the Mean Deviation,
Mean Absolute Deviation and the Root Mean Square Deviation were obtained
and averaged across the ten trials (Kalton, 1983).
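Kalton's simulation design described above (delete recorded values at random,
impute, repeat, and average the criteria across trials) can be sketched as
follows; the 10% nonresponse rate and the mean-imputation rule in the example
are illustrative choices, not the study's exact settings:

```python
import random

def simulate(original, nonresponse_rate, impute_rule, n_trials=10, seed=7):
    # Kalton-style simulation sketch: delete a random subset of recorded
    # values, impute them with impute_rule(observed), and average the
    # Mean Deviation of the imputations across the trials.
    rng = random.Random(seed)
    n_missing = int(round(nonresponse_rate * len(original)))
    trial_deviations = []
    for _ in range(n_trials):
        missing_idx = rng.sample(range(len(original)), n_missing)
        observed = [y for i, y in enumerate(original) if i not in missing_idx]
        md = sum(impute_rule(observed) - original[i]
                 for i in missing_idx) / n_missing
        trial_deviations.append(md)
    return sum(trial_deviations) / n_trials

# Overall-mean imputation as the rule being evaluated.
def mean_rule(observed):
    return sum(observed) / len(observed)
```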
For the first study of imputing the variable Hourly Rate of Pay, eight methods
were used, namely the Grand Mean Imputation (GM), the Class Mean Imputation
using eight imputation classes (CM8), the Class Mean Imputation using ten
imputation classes (CM10), Random Imputation with eight imputation classes
(RM8), Random Imputation with ten imputation classes (RM10), Multiple Re-
gression Imputation (MI), Multiple Regression Imputation plus a random residual
chosen from a normal distribution (MN) and Multiple Regression Imputation plus
a randomly chosen respondent residual (MR). Using the Mean Deviation criteria,
the results showed that all mean deviations were negative, indicating that the im-
puted values underestimated the actual values. Moreover, the results show that
the Grand Mean Imputation (GM) has the greatest underestimation among the
eight procedures. Meanwhile for the Mean Absolute Deviation and Root Mean
Square Deviation, which measures the ability to reconstruct the deleted value, the
results showed that the Grand Mean Imputation fared the worst for both criteria.
In addition, it also showed that the Multiple Regression Imputation (MI) ob-
tained the best measures for the two criteria and that the procedures with greater
number of imputation classes (i.e.CM8 VS. CM10, RC8 VS. RC10) yield slightly
better results for the two criteria (Kalton, 1983).
For the second study, which is the imputation of Quarterly Earnings, ten impu-
tation procedures were used. These are the Grand Mean Imputation (GM), the
Class Mean Imputation using eight imputation classes (CM8), the Class Mean Im-
putation using twelve imputation classes (CM12), Random Imputation with eight
imputation classes (RM8), Random Imputation with twelve imputation classes
(RM12), Multiple Regression Imputation (MI), Multiple Regression Imputation
plus a random residual chosen from a normal distribution (MN), Multiple
Regression Imputation plus a randomly chosen respondent residual (MR), Mixed De-
ductive and Random Imputation using eight imputation classes (DI8) and Mixed
Deductive and Random Imputation using twelve imputation classes (DI12). Using
the first criterion, the Mean Deviation, the results showed that the Grand Mean
(GM) obtained a positive bias, implying that grand mean imputation is not an
effective method for this study. The results also showed that the regression
imputation procedures gave similar, almost unbiased estimates. In addition, the
Class Mean Imputation methods (CM8 and CM12) had measures similar to
those of the Random Imputation methods. Nevertheless, all methods produced
relatively small mean deviations except for the last two. Comparing the Mean
Absolute Deviations and Root Mean Square Deviations, the results show that
the Grand Mean Imputation obtained values similar to the regression procedures
with residuals (i.e. MN, Multiple Regression Imputation plus a random residual
chosen from a normal distribution, and MR, Multiple Regression Imputation plus
a randomly chosen respondent residual). The results also show that the measures
for the RM8, RM12, MN and MR procedures are over one third larger than those
of deterministic procedures such as CM8, CM12 and MI (Kalton, 1983).
To further investigate the relatively larger biases of the DI8 and DI12 procedures,
the author further divided the data into deductive and non-deductive cases.
This shed further light on the Mean Deviations and Mean Absolute Deviations
of the various imputation methods. The mean deviations were found to be
positive in the deductive cases and negative in the non-deductive cases for all of
the procedures. This explains the relatively small deviations in the previous
results, since the measures for the two cases tend to cancel out. It also showed
that the DI8 and DI12 results are similar to those of RM8, RM12, CM8 and
CM12 in the non-deductive cases but differ greatly in the deductive cases,
which explains the larger values of DI8 and DI12 in the previous results
(Kalton, 1983).
Taken together, the two studies showed that the imputation procedures tend
to underestimate the Hourly Rate of Pay and overestimate the Quarterly
Earnings. Moreover, mean imputation appears to be the weakest imputation
method in the studies, since it distorts the distribution of the original data.
Lastly, Kalton's study shows that increasing the number of imputation classes
yields better values for the three criteria.
Using the Center for Epidemiological Studies data on stress and health ratings of
older adults, Musil et al. (2002) imputed a single variable, the functional health
rating. Of the 492 cases, 20% were deleted in an effort to maximize the effects
of each imputation method. Except for Listwise Deletion and Mean Imputation,
the researchers used the SPSS Missing Value Analysis function for Deterministic
Regression, Stochastic Regression and the EM Method. For the correlations, they
obtained the correlation values of the original data and of the five imputed
versions of the variable with age, gender and self-assessed health rating (Musil et
al., 2002). The results show that, compared with the mean of the original data,
all five methods underestimated the mean. The closest to the original data was
Stochastic Regression, followed very closely by the EM Method, Deterministic
Regression, Listwise Deletion and Mean Imputation. The same results also hold
for the standard deviations. For the correlations, however, the EM Method
produced the correlation values closest to the original data, followed closely by
Stochastic Regression, Deterministic Regression, Listwise Deletion and Mean
Imputation. Hence, the findings suggest that Stochastic Regression and the EM
Method performed better, while Mean Imputation is the least effective (Musil et
al., 2002).
In a report describing simulation experiments and practical examples of the Hot
Deck Method, Nordholt (1998) carried out two simulation experiments. The
first study examined whether the Hot Deck Method performs better than leaving
the records with nonresponse out of the data set when analyzing the variable,
which is known as the Available Case Method. This was done by constructing
a fictitious data set with four variables, two of which were used for the
imputation. Nonresponse rates of 5%, 10% and 20% were then applied, and the
simulation process was replicated 50 times. The data set containing the missing
values was first analyzed using the Available Case Method, then using Hot Deck
Imputation. As in the methodology of Musil et al. (2002), descriptive statistics
such as the mean, variance and correlation were computed. Moreover, the
absolute differences between the original data and the Available Case Method,
and between the original data and the Hot Deck Method, were computed. Based
on these criteria, the results show that Hot Deck performs better than the
Available Case Method. Also, while the Hot Deck results were closer to the
original data, the method tended to underestimate the values. In terms of the
absolute differences, these values were observed to increase as the percentage of
missing values increases.
In the second study, on real-life data, the original values in categories 13 and
22 (the latter worth 300,000) were changed into missing values. The rationale
for this choice was to ensure that the original values from these categories would
not be used as replacements for the variable being imputed, since they are no
longer in the file. Imputation classes were then created once the missing values
had been identified. A table showing the number of respondents before and after
imputation showed that in every category except 13 and 22, which were set as
missing, the number of respondents increased after the imputation. This showed
that the remaining records had an equal probability of becoming a donor record
for an imputation and that not all imputations give values near categories 13
or 22. Nordholt also explored the Available Case Method and Hot Deck Method
for this real-life data. As in the first study, the Hot Deck fared better than the
Available Case Method (Nordholt, 1998).
There were two undergraduate theses that conducted a similar study on imputa-
tion. The first undergraduate thesis was by Salvino and Yu (1996). They assessed
the efficiency of the Mean Imputation versus Hot Deck Imputation Technique by
applying these techniques on the 1991 Census on Agriculture and Fisheries (CAF)
data. In their research, they generated an incomplete data set using the GAUSS
software for the imputed variables, which were the counts of cattle, hogs and
chickens. To determine which of the two methods was better, the variances were
compared; on this basis, the Hot Deck Imputation Technique was judged better.
The design effect was also considered by taking the ratio of the variance under
Hot Deck Imputation to that under Mean Imputation; since the ratio was less
than one, they again concluded that the Hot Deck Imputation Technique is the
better option.
The second thesis created a tool for implementing the imputations. When the
results were obtained, the efficacy of the imputation techniques was assessed
by looking at the accuracy and precision of the estimates: accuracy was
measured by the percentage error, and the variance of these percentage errors
was the basis for precision. The results show that Linear Regression was the
best method, followed closely by Multiple Regression, then Hot Deck and finally
Mean Imputation. It can be noted, however, that the criteria for determining the
best imputation model were not extensive, as only the percentage error and its
variance were used. Also, the study did not explore the use of imputation classes
to improve the accuracy and precision of the imputation methods.
Chapter 3
Conceptual Framework
Bias is defined as the difference between the expected value of an estimator and
the true value of the parameter being estimated. The bias is expressed by:
Bias(θ̂) = E[θ̂] − θ
where θ̂ is the estimator of the parameter and θ is the true value of the
parameter. The bias of an estimator can be positive, negative or zero; an
estimator with nonzero bias is said to be a biased estimator, while one with zero
bias is unbiased.
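The bias definition above can be checked numerically. The sketch below is a
made-up Monte Carlo setup (not part of this study) that estimates the bias of
the "divide by n" variance estimator, a textbook example of a biased estimator:

```python
import random

def estimate_bias(estimator, sampler, theta, n_reps=5000, seed=1):
    # Monte Carlo approximation of Bias(theta_hat) = E[theta_hat] - theta.
    rng = random.Random(seed)
    mean_estimate = sum(estimator(sampler(rng)) for _ in range(n_reps)) / n_reps
    return mean_estimate - theta

def var_n(xs):
    # Sample variance with a divisor of n: a classic biased estimator.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def draw(rng):
    # Five draws from N(0, 1), whose true variance is 1.
    return [rng.gauss(0, 1) for _ in range(5)]
```

For samples of size n = 5, theory gives a bias of −1/n = −0.2, and the Monte
Carlo estimate comes out close to that value.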
Accuracy is the extent to which estimates are close to the value of the parameter.
Precision is the extent to which estimates are close to one another.
Efficiency refers to how well a method performs, as assessed through a set of
criteria.
Nonresponse is the failure to collect a valid response for a particular unit.
The types of nonresponse are distinguished by the way in which observations
come to be missing. Kalton (1983) stressed the importance of differentiating the
types of nonresponse: unit (or total) nonresponse, item nonresponse, and partial
nonresponse. Unit (or total) nonresponse takes place when no information
was collected from a sampling unit. There are many causes of this nonresponse,
namely the failure to contact the respondent (not at home, moved, or the unit not
being found), refusal to give information, inability of the unit to cooperate
(perhaps due to illness or a language barrier), or lost questionnaires.
Item nonresponse, on the other hand, happens when the information collected
from a unit is incomplete because some questions were not answered. Its many
causes include: refusal to answer a question because the informant lacks the
necessary information; failure to make the effort required to retrieve the
information from memory or from records; refusal to answer because the question
is sensitive; the interviewer failing to record an answer; or the response being
subsequently rejected at an edit check on the grounds that it is inconsistent with
other responses (including inconsistencies arising from coding or punching errors
in transferring the response to the computer data file).
Lastly, partial nonresponse is the failure to collect large sets of items for a
responding unit. A sampled unit may fail to provide responses in one or more
waves of a panel survey, in later phases of a multi-phase data collection procedure
(e.g. the second visit of the FIES), or for later items in a questionnaire after
breaking off a telephone interview. Other causes include data that remain
unavailable after all possible checking and follow-up, responses that fail to satisfy
natural or reasonable constraints known as edits (so that one or more items are
designated unacceptable and are therefore artificially missing), and causes similar
to those given for unit (total) nonresponse. In this study, the researchers dealt
with partial nonresponse occurring in the second visit of the 1997 FIES.
An example of the MCAR pattern is when a sample unit in the survey fails to
provide an answer to the total monthly expenditure because the unit cannot be
reached.
Using the example in the MAR pattern, however, suppose the sampling unit
also did not provide an answer because he was a high income earner. This is
considered NIN, since the nonresponse also depends on the income group even
when the gender of the unit is controlled (Musil et al., 2002).
Suppose the population is divided into two groups or strata: the first group
consists of all units in the population for which measurements will be obtained
(respondents), and the second group consists of those units for which no
measurements will be obtained (nonrespondents).
To arrive at the proper estimation of the nonresponse bias, the following
quantities are defined. Let R be the number of respondents and M (M stands for
missing) be the number of nonrespondents in the population, with R + M = N.
Assume that a Simple Random Sample (SRS) with replacement is drawn from
each group. The corresponding sample quantities are r and m, with r + m = n.
Let R̄ = R/N and M̄ = M/N be the proportions of respondents and
nonrespondents in the population, and let r̄ = r/n and m̄ = m/n be the
response and nonresponse rates in the sample.
The population total and mean are given by Y = Yr + Ym = RȲr + M Ȳm and
Ȳ = R̄Ȳr + M̄ Ȳm, where Yr and Ȳr are the total and mean for respondents and
Ym and Ȳm are the same quantities for the nonrespondents. The corresponding
sample quantities are y = yr + ym = rȳr + mȳm and ȳ = r̄ȳr + m̄ȳm
(Kalton, 1983). If the respondent mean ȳr is used to estimate Ȳ, its bias is

Bias(ȳr) = Ȳr − Ȳ = M̄ (Ȳr − Ȳm).
The equation above shows that ȳr is approximately unbiased for Ȳ if either the
proportion of nonrespondents M̄ is small or the mean for nonrespondents, Ȳm ,
is close to the mean for respondents, Ȳr. Since the survey analyst usually has no
direct empirical evidence on the magnitude of (Ȳr − Ȳm), the only situation in
which he can have confidence that the bias is small is when the nonresponse rate
is low.
However, in practice, even with moderate M̄ many survey results escape sizable
biases because (Ȳr − Ȳm ) is fortunately often not large (Kalton, 1983).
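The decomposition Ȳ = R̄Ȳr + M̄ Ȳm and the resulting bias M̄ (Ȳr − Ȳm)
can be verified on made-up numbers; the two strata and their values below are
entirely hypothetical:

```python
# Hypothetical strata: respondents and nonrespondents with known values.
respondents = [50, 60, 55, 65, 70]        # R = 5 units
nonrespondents = [90, 100, 95]            # M = 3 units

N = len(respondents) + len(nonrespondents)
Rbar, Mbar = len(respondents) / N, len(nonrespondents) / N
Ybar_r = sum(respondents) / len(respondents)
Ybar_m = sum(nonrespondents) / len(nonrespondents)

# The population mean decomposes exactly into the two stratum means.
Ybar = (sum(respondents) + sum(nonrespondents)) / N
assert abs(Ybar - (Rbar * Ybar_r + Mbar * Ybar_m)) < 1e-9

# Using the respondent mean alone carries a bias of Mbar * (Ybar_r - Ybar_m).
bias = Ybar_r - Ybar
assert abs(bias - Mbar * (Ybar_r - Ybar_m)) < 1e-9
```

Here the respondent mean (60) understates the population mean (73.125) by
13.125, exactly M̄ (Ȳr − Ȳm) = (3/8)(60 − 95).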
In reducing nonresponse bias caused by missing data, there are many procedures
that can be applied and one of these procedures is imputation. In this study,
imputation methods are applied to eliminate nonresponse and reduce the bias of
the estimates. Imputation is briefly defined as the substitution of values for the
nonresponse observations.
Imputation is listed as one of the many procedures that can be used to deal with
nonresponse in order to generate more unbiased results. Imputation is the process
of replacing a missing value through available statistical and mathematical tech-
niques, with a value that is considered to be a reasonable substitute for the missing
information (Kalton,1983).
Imputation has certain advantages. First, utilizing imputation methods helps
reduce biases in survey estimates. Second, imputation makes analysis easier and
the results simpler to present. Imputation does not make use of complex
algorithms to estimate the population parameters in the presence of missing data;
hence, much processing time is saved. Lastly, using imputation methods can en-
sure consistency of results across analyses, a feature that an incomplete data set
cannot fully provide.
On the other hand, imputation also has several disadvantages. There is no
guarantee that the results obtained after applying imputation methods will be
less biased than those based on the incomplete data set; hence, the usefulness of
imputation depends on the suitability of the assumptions built into the
imputation procedures used. Even if the biases of univariate statistics are
reduced, there is no assurance that the distribution of the data and the
relationships between variables will be preserved. More importantly, imputation
is a fabrication of data, and naive researchers may falsely treat the imputed data
set as if it were a straightforward complete sample of size n.
There are four Imputation Methods (IMs) applied in this study, namely, the
Overall (Grand) Mean Imputation (OMI), Hot Deck Imputation (HDI),
Deterministic Regression Imputation (DRI) and Stochastic Regression
Imputation (SRI). For most imputation methods, imputation classes need
to be defined before the imputation is performed.
Imputation classes are stratification classes that divide the data into groups
before imputation takes place. The formation of imputation classes is most useful
when the classes are homogeneous groups, that is, groups of units with similar
characteristics that have some propensity to provide the same response. The
variables used to define imputation classes are called matching variables. The
values substituted for the nonresponse observations are taken from a group of
responding observations on the variable; these records are called donors, while
the records with missing observations to be filled in are called recipients.
Problems might arise if imputation classes are not formed with caution. One
issue is choosing the number of imputation classes for each method. The larger
the number of imputation classes, the greater the possibility of having fewer
observations in a class, which can cause the variance of the estimates under that
class to increase. On the other hand, the smaller the number of imputation
classes, the greater the possibility of having more observations in a class, making
the estimates burdened with aggregation bias.
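The mechanics of donors, recipients and imputation classes can be sketched in
code; the records, the "region" matching variable and the values below are all
hypothetical, and the imputation rule shown (class means) is just one simple
choice:

```python
# Hypothetical records; "region" is the matching variable defining classes.
records = [
    {"region": "NCR", "income": 120}, {"region": "NCR", "income": None},
    {"region": "NCR", "income": 100}, {"region": "CAR", "income": 80},
    {"region": "CAR", "income": None}, {"region": "CAR", "income": 90},
]

# Donors: responding records, grouped by imputation class.
donors = {}
for rec in records:
    if rec["income"] is not None:
        donors.setdefault(rec["region"], []).append(rec["income"])
class_means = {cls: sum(v) / len(v) for cls, v in donors.items()}

# Recipients: records with a missing value receive their class mean.
for rec in records:
    if rec["income"] is None:
        rec["income"] = class_means[rec["region"]]
```

With homogeneous classes, each recipient receives a value typical of units like
itself rather than of the whole sample.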
The mean imputation method is the process by which missing data are imputed
by the mean of the available units in the same imputation class to which the
record belongs (Cheng and Sy, 1999). One type of this method is the Overall
Mean Imputation (OMI) method, one of the most widely used methods for
imputing missing data. The OMI method simply replaces each missing value
with the overall mean of the available (responding) units in the same population.
The overall mean is given by
ȳomi = (Σᵢ₌₁ʳ yri) / r = ȳr

where ȳomi, the value imputed for each missing observation, is the mean of the
r responding observations yri on the y variable, i.e. ȳr.
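A minimal sketch of the OMI rule above, assuming missing values are coded as
None:

```python
def overall_mean_imputation(values):
    # OMI: every missing value (None) gets the overall mean of the
    # responding units; no imputation classes are used.
    responding = [y for y in values if y is not None]
    y_bar = sum(responding) / len(responding)
    return [y if y is not None else y_bar for y in values]
```

Note that every recipient receives the same value, which is the source of the
distributional distortion discussed later in this section.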
This method has both advantages and disadvantages. Its main advantage is
universality: it can be applied to any data set. Moreover, it does not require
imputation classes, which makes it easier to use and faster to run.
However, this method has serious disadvantages. Since all missing values are
imputed with a single value, the distribution of the data becomes distorted
(Figure 1): it becomes too peaked, making it unsuitable for many
post-imputation analyses. Second, it produces large biases and variances because
it does not allow variability in the imputed values. Much of the related literature
identifies this as the least effective method; thus, it is rarely recommended.
One of the most popular and widely known methods used is the Hot Deck
Imputation (HDI) method. The HDI method is the process by which the
missing observations are imputed by choosing a value from the set of available
units. This value is either selected at random (traditional hot deck), or in
some deterministic way with or without replacement (deterministic hot deck),
or based on a measure of distance (nearest-neighbor hot deck). To perform
this method, let Y be the variable that contains missing data and X a set of
variables with no missing data. To impute the missing data:
1. Find a set of categorical X variables that are highly associated with Y . The
selected X variables will be the matching variables in this imputation.
2. Cross-classify the matching variables to form a table of imputation cells,
and sort both respondents and nonrespondents into these cells.
3. If there are cases missing within a particular cell of the table, select a
case from the set of available units on the Y variable and impute the chosen Y
value to the missing value. The donor chosen must have similar or exactly the
same characteristics as the recipient.
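The steps above can be sketched as a traditional (random) hot deck; the record
layout and field names here are hypothetical:

```python
import random

def hot_deck_impute(records, match_var, target, seed=None):
    # Traditional (random) hot deck: within each imputation class defined
    # by match_var, replace missing target values with a value drawn at
    # random from that class's donors.
    rng = random.Random(seed)
    donors = {}
    for rec in records:
        if rec[target] is not None:
            donors.setdefault(rec[match_var], []).append(rec[target])
    for rec in records:
        if rec[target] is None:
            rec[target] = rng.choice(donors[rec[match_var]])
    return records
```

Because every imputed value is drawn from an actual donor in the same class,
the imputed values are always real observed values.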
Cheng and Sy (1999) stated that the HDI method gives estimates that reflect
the actual data more accurately when imputation classes are used. If the
matching variables are closely associated with the variable being imputed, the
nonresponse bias should be reduced.
Like OMI, this method has certain advantages. One major attraction cited by
Kazemi (2005) is that the imputed values are all actual values. More
importantly, the shape of the distribution is preserved: since imputation classes
are introduced, the chance of distorting the distribution decreases.
On the other hand, it also has a set of disadvantages. First, in order to form
imputation classes, all X variables must be categorical. Second, the possibility
of generating a distorted data set increases when values are imputed with
replacement and the nonresponse rate rises: observations from the donor record
might be used repeatedly for the missing values, distorting the shape of the
distribution. Third, the number of imputation classes must be limited to ensure
that every missing value has a donor in its class.
Like the OMI and HDI methods, this procedure is one of the most widely used im-
putation methods. The method of imputing missing values via least-squares
regression is known as the Regression Imputation (RI) Method. There
are many ways of building a regression model for imputing the missing
observations. The y-variable for which imputations are needed is regressed on
the auxiliary variables (x1, x2, x3, ..., xp) for the units providing a response on y.
These auxiliary variables may be quantitative or qualitative, the latter being in-
corporated into the regression model by means of dummy variables. There are
two basic types of the RI method: (a) Deterministic Regression Imputation and
(b) Stochastic Regression Imputation.
When comparing the accuracy and efficiency of RI methods, it is helpful if the
methods being compared use the same imputation classes.
Deterministic Regression
Deterministic Regression Imputation (DRI) uses, for a record with a missing
response in the variable y, the predicted value from the model given the values of
the auxiliary variables, which contain no missing data. This method can be seen
as a generalization of the mean imputation method. The model for DRI is given by:

ŷk = β̂0 + Σi β̂i Xik

where ŷk is the predicted value for the k-th nonresponding unit to be imputed, β̂0
and β̂i are the parameter estimates, and Xik is the i-th auxiliary variable, which
can be either a quantitative variable or a dummy variable, for the k-th
nonresponding unit.
There are advantages and disadvantages to using DRI. DRI has the potential
to produce imputed values close to the missing observations; for the method to
be effective, that is, to impute predicted values near the actual values, a high R2
is needed. However, the method is time-consuming, and it is often unrealistic to
consider its application to all items with missing values in a survey. Using DRI
can also underestimate the variance of the estimates, and one major disadvantage
of the method is that it can distort the distribution of the data.
Stochastic Regression
The predicted value from the deterministic regression model has undesirable
distributional properties similar to those of the mean imputation method. To
compensate, an estimated residual is added to the predicted value. The use
of this predicted value plus some type of randomly chosen estimated residual is
called the Stochastic Regression Imputation (SRI) method. The model for
SRI is given by:

ŷk = β̂0 + Σi β̂i Xik + êk
where ŷk is the predicted value for the k-th nonresponding unit to be imputed,
β̂0 and β̂i are the parameter estimates, Xik is the i-th auxiliary variable, which
can be either a quantitative variable or a dummy variable, for the k-th
nonresponding unit, and êk is the randomly chosen residual for the k-th
nonresponding unit.
There are various ways in which this could be done, depending on the assump-
tions made about the residuals. The following are some possibilities:
1. Assume that the errors are homoscedastic and normally distributed, N(0, σe²).
Then σe² can be estimated by the residual variance from the regression, se²,
and the residual for a recipient can be chosen at random from N(0, se²).
2. Assume that the errors are heteroscedastic and normally distributed, with
σej² being the residual variance in some group j. Estimate σej² by sej², and
choose the residual for a recipient in group j at random from N(0, sej²).
3. Assume that the residuals all come from the same, unspecified, distribution.
Then estimate yk by ŷk + êk, where êk is the estimated residual from a
randomly chosen donor.
4. The assumption in (3) accepts the linearity and additivity of the model.
If there are doubts about these assumptions, it may be better to take not
a randomly chosen donor but one close to the recipient in terms of
its x-values (Kalton, 1983). In the limit, if a donor with the same set of
x-values is found, this procedure reduces to assigning that donor's y-value
to the recipient.
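Possibility (1) can be sketched as follows; the data are hypothetical, and the least-squares line is computed by hand for transparency:

```python
import random
import statistics

random.seed(2)

# Hypothetical responding pairs (x, y) used to fit an ordinary least-squares line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(xs)

xbar, ybar = statistics.mean(xs), statistics.mean(ys)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum(
    (x - xbar) ** 2 for x in xs
)
b0 = ybar - b1 * xbar

# Possibility (1): homoscedastic, normally distributed errors. Estimate the
# residual variance s_e^2 from the fitted regression (n - 2 degrees of freedom).
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
s2e = sum(e ** 2 for e in residuals) / (n - 2)

# Stochastic regression imputation for a nonrespondent with x = 6: predicted
# value plus a residual drawn at random from N(0, s_e^2).
x_missing = 6.0
y_imputed = b0 + b1 * x_missing + random.gauss(0.0, s2e ** 0.5)
```

Setting the drawn residual to zero instead would give the deterministic (DRI) imputation for the same unit.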
There are advantages and disadvantages to using SRI. Like DRI, this method
can produce imputed values close to the missing observations if the model has a
high R2. It is likewise time-consuming, and often unrealistic to apply to all items
with missing values in a survey. Unlike the bare predicted value, this method can
also produce out-of-range values: after adding a residual to a deterministic
imputation that is feasible, an unfeasible value could result.
Chapter 4
Methodology
The purpose of this section is to give an overview of the data used in this study,
the 1997 Family Income and Expenditures Survey (FIES). The 1997 FIES is a
nationwide survey, with two visits per survey period to the same households,
conducted by the National Statistics Office (NSO) every three years. The
objectives of the survey are as follows:
1. to gather data on family income and family living expenditures and related
information affecting income and expenditure levels and patterns in the
Philippines;
The sampling design for the 1997 FIES is a stratified multi-stage design consisting
of 3,416 Primary Sampling Units (PSUs) for the provincial estimates; the PSUs
in the 1997 FIES are barangays. A subsample of 2,247 PSUs comprises the master
sample for the regional-level estimates (NSO, 1997-2005).
This multi-stage sampling design involved three stages. First is the selection
of sample barangays. Second is the selection of sample enumeration areas, which
are subdivisions of barangays. This was followed by the selection of sample
households. The sampling frame and stratification of the three stages were based
on the 1995 Census of Population (POPCEN) and the 1990 Census of Population
and Housing (CPH). From this design, a sample of 41,000 households participated
in the survey (NSO, 1997-2005).
The 1997 FIES questionnaire contains about 800 data items, with questions asked
by the interviewer of the respondent of the selected sample household. A respondent
is defined as the household head, the person who manages the finances of the
family, or any member of the family who can give reliable information for the
questionnaire (NSO, 1997-2005).
The items or variables gathered in the 1997 FIES are listed in Appendix A.
Two types of nonresponse occurred in the 1997 FIES. The first type, which
resulted from factors such as being unaware of the question, being unwilling to
provide an answer, or omission of the question during the interview, is called
item nonresponse. This type of nonresponse totaled only 2.1% of the total number
of respondents (NSO, 1997-2005).
The NSO devised only deductive imputation to address the problem of item
nonresponse, while no specific method was mentioned to compensate for partial
nonresponse (NSO, 1997-2005).
Only a subset of the 1997 FIES data was used to apply the imputation methods.
The National Capital Region (NCR) was chosen because it was noted as the
region with the highest nonresponse rate. The data consist of 4,130 households;
39 variables are categorical and the rest are continuous variables pertaining to
the income and expenditures of the respondents. As for the variables to be
imputed, the researchers chose two, namely the second-visit Total Income
(TOTIN2) and Total Expenditure (TOTEX2). These variables were chosen
because of their importance to the FIES and the frequency of missing values
among their observations.
1. A matrix of random numbers was generated.
2. Each observation from the matrix of random numbers was assigned to the
observations of the 1997 FIES second-visit variables TOTIN2 and TOTEX2.
This was done to satisfy the assumptions that the data have partial
nonresponse and that the missing observations follow the Missing Com-
pletely At Random (MCAR) nonresponse pattern.
3. The second-visit observations of both variables were sorted in ascending
order of their corresponding random numbers.
4. The first 10% of the sorted second-visit data for both variables were selected
and set as missing observations. The same procedure was applied to the
data sets containing 20% and 30% nonresponse rates, respectively.
5. The missing observations were flagged. This was done to distinguish the
imputed from the actual values during the data analysis.
This simulation was implemented with the Decimal Basic program
SIMULATION.BAS (Appendix B), where the files Simulated Values for
Income (SIMI) and Simulated Values for Expenditure (SIME), matrices containing
the missing observations for income and expenditure, were stored for use in the
application of the imputation methods.
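The simulation steps can be sketched as follows (not the SIMULATION.BAS program itself), using hypothetical data in place of the FIES records:

```python
import random

random.seed(3)

# Hypothetical second-visit observations (e.g. TOTIN2) for 20 households.
data = [float(v) for v in range(100, 120)]
n = len(data)

# Steps 1-2: assign a uniform random number to every observation, so that
# the induced missingness is MCAR.
tagged = [(random.random(), i) for i in range(n)]

# Step 3: sort the observations by their random number.
tagged.sort()

# Step 4: set the first 10% of the sorted cases to missing (None); the 20%
# and 30% nonresponse data sets would be built the same way.
n_missing = int(0.10 * n)
missing_idx = {i for _, i in tagged[:n_missing]}
simulated = [None if i in missing_idx else v for i, v in enumerate(data)]

# Step 5: flag the deleted cases so imputed values can be told apart from
# actual values during the analysis.
flags = [i in missing_idx for i in range(n)]
```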
Imputation classes are stratification classes that divide the data in order to pro-
duce groups that have similar characteristics. Assuming that the units that have
the same characteristics have the propensity to give the same response, the for-
mation of imputation classes would help reduce the bias of the estimates.
The steps undertaken in the formation of the imputation classes are as follows:
1. The researchers identified the potential matching variables, which are the
candidate variables that could have an association with the variables of
interest (i.e. TOTEX2 and TOTIN2).
2. The categorical variables from the first-visit data had to meet three criteria
to be selected as candidate variables. First, the variable must be known.
Second, the candidate variable must be easy to measure. Lastly, the
probability of missing observations for the candidate variable must be small.
A first-visit variable that fits all three criteria can be used as a candidate
variable.
3. For variables with many categories, the researchers reduced the number of
categories. The rationale is that having too many categories can increase
heterogeneity and the bias of the estimates. This was done with the Recode
function of the software Statistica.
All these tests were performed using the statistical packages Statistica and SPSS.
The results of these tests are presented in the next chapter.
The Overall Mean Imputation (OMI) is an imputation procedure in which the
missing observations are replaced with the mean of the available units of the
variable. As noted in the Conceptual Framework, this imputation method does
not require the formation of imputation classes, which makes it the simplest of
the four methods in this study.
The procedures in applying the Overall Mean Imputation (OMI) are as follows:
1. The overall mean of the first-visit variables of interest, TOTIN1 and
TOTEX1, was computed. The formula used for the overall mean is:

ȳomi = ( Σ yri ) / r ,  i = 1, 2, ..., r

where ȳomi is the overall mean of the first-visit variable of interest, TOTEX1
or TOTIN1, yri is the i-th responding observation of that variable, and r is
the total number of responding units for the first-visit variable TOTEX1 or
TOTIN1.
2. Using the nonresponse data sets generated, the missing observations for the
second visit variables TOTEX2 and TOTIN2 were replaced with the overall
means of the first visit TOTEX1 and TOTIN1.
The implementation of the Overall Mean Imputation (OMI) was made through
the Decimal Basic program OMI.BAS (Appendix B).
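As a rough illustration of these two steps (not the OMI.BAS program itself), using hypothetical first- and second-visit values:

```python
# Hypothetical first- and second-visit values for one variable of interest;
# None marks a nonresponding unit in the second visit.
first_visit = [50.0, 60.0, 70.0, 80.0, 90.0]
second_visit = [52.0, None, 71.0, None, 93.0]

# Step 1: overall mean of the first-visit responding units.
respondents = [y for y in first_visit if y is not None]
y_omi = sum(respondents) / len(respondents)

# Step 2: replace every missing second-visit observation with that mean.
imputed = [y if y is not None else y_omi for y in second_visit]
```

Every missing case receives the same value, which is exactly what distorts the distribution as described in the Conceptual Framework.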
The Hot Deck Imputation (HDI) is an imputation procedure in which the missing
observations are replaced by values chosen from the set of available units.
The steps undertaken in applying the Hot Deck Imputation (HDI) are as follows:
1. The donor and recipient record for each imputation class and variable were
first identified.
2. The missing observations of the second-visit TOTIN2 and TOTEX2 were
assigned to their respective recipient records for each imputation class, while
the first-visit TOTIN1 and TOTEX1 observations were placed in their re-
spective donor records for each imputation class.
3. The values that were substituted for the missing observations were randomly
chosen from the donor record for each imputation class.
The implementation of the Hot Deck Imputation (HDI) was made through the
Decimal Basic program HOT DECK.BAS (Appendix B).
1. A logarithmic transformation was applied to the first-visit variables of in-
terest, TOTEX1 and TOTIN1, as well as to the second-visit variables of
interest, TOTEX2 and TOTIN2. The rationale for this transformation is
that the income and expenditure variables are not normally distributed.
Moreover, logarithmic transformations help correct the non-linearity of the
regression equation.
2. The regression equations were formed after the transformation. For this
study, only one predictor variable was used, and the general form of the
regression equation is:

ŷ = β̂0 + β̂1 x + êi

where ŷ is the predicted observation for the second-visit variable TOTIN2 or
TOTEX2, β̂0 and β̂1 are the parameter estimates, x is the first-visit variable,
and êi is the random residual term. Note that for DRI, êi = 0.
3. For the stochastic regression, which involves the computation of the error
term, the following steps were made:
(a) The residuals were grouped into class intervals, and the frequency of
each interval was obtained.
(b) The class means of the frequency distributions were used to obtain the
error terms for the regression equation.
4. The diagnostic checking requires the fitted model to satisfy the following
assumptions:
(a) Linearity
(b) Normality of the error terms
(c) Independence of the error terms
(d) Constancy of variance
The results of the diagnostic checking of each regression equation used in
this study are presented in Appendix C.
5. The missing observations were replaced by the predicted value using the
corresponding regression equation.
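The imputation steps can be sketched as follows; the data are hypothetical, and for brevity the error term is drawn directly from the observed residuals rather than from the class means of their frequency distribution as in the study:

```python
import math
import random
import statistics

random.seed(5)

# Hypothetical first-visit (x) and second-visit (y) values for the units that
# responded in both visits; x_missing holds first-visit values of nonrespondents.
x_obs = [40.0, 55.0, 70.0, 90.0, 120.0, 150.0]
y_obs = [44.0, 58.0, 76.0, 95.0, 130.0, 155.0]
x_missing = [65.0, 100.0]

# Step 1: logarithmic transformation of both variables.
lx = [math.log(x) for x in x_obs]
ly = [math.log(y) for y in y_obs]

# Step 2: least-squares fit of ln(y) on ln(x).
xbar, ybar = statistics.mean(lx), statistics.mean(ly)
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(lx, ly)) / sum(
    (a - xbar) ** 2 for a in lx
)
b0 = ybar - b1 * xbar

# Step 3 (stochastic variant): the error term is drawn from the observed
# residuals; for the deterministic variant the error term is simply 0.
residuals = [b - (b0 + b1 * a) for a, b in zip(lx, ly)]

def impute(x, stochastic=True):
    e = random.choice(residuals) if stochastic else 0.0
    # Step 5: back-transform the predicted log value to the original scale.
    return math.exp(b0 + b1 * math.log(x) + e)

dri_values = [impute(x, stochastic=False) for x in x_missing]
sri_values = [impute(x, stochastic=True) for x in x_missing]
```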
To compute the bias of the mean of the imputed data, the following procedures
were implemented:
1. The mean of the imputed data, ȳ′, was computed. For the Hot Deck and
Stochastic Regression Imputations, the means of the 1,000 simulated data
sets were averaged.
2. The mean of the actual data was computed.
3. The bias of the mean of the imputed data was obtained as the difference
between (1) and (2).
Actual Data
In order to determine which imputation method was able to maintain the
distribution of the actual data, a goodness-of-fit test was utilized. For this
study, the researchers chose the Kolmogorov-Smirnov (K-S) test, a goodness-of-fit
test concerned with the degree of agreement between the distribution of a set of
sampled (observed) values and some specified theoretical distribution (Siegel,
1988). In this study, the researchers were concerned with how the imputation
methods affected the distribution of the 1997 FIES data.
The following steps were made for the Kolmogorov-Smirnov test:
1. Income and Expenditure deciles were created. The creation of these deciles
was based on the second visit actual 1997 FIES data.
2. The obtained deciles were used as upper bounds of the frequency classes.
3. A Frequency Distribution Table (FDT) for each trial was created. For this
part, the researchers used the SPSS aggregate function to generate the FDT.
4. The FDT includes the Relative Cumulative Frequency (RCF) for both the
imputed and actual distribution. RCFs are computed by dividing the cu-
mulative frequency by the total number of observations.
5. The absolute value of the difference between the actual-data RCF and the
imputed RCF was computed using Microsoft Excel.
6. The test statistic for the Kolmogorov-Smirnov test, the maximum deviation
D, was determined using this formula:
D = max |RCFimputed − RCFactual|
7. Since this is a large-sample case, and assuming a 0.05 level of significance,
the critical value is computed as 1.36/√N, with N = 4,130.
8. If D is less than the critical value, it is concluded that the imputed data
maintain the same distribution as the actual data.
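The eight steps can be sketched as follows with small hypothetical samples (the study itself used SPSS and Excel for these computations):

```python
# Hypothetical actual and imputed samples for one trial; the actual sample is
# already sorted, which the decile computation below relies on.
actual = [float(v) for v in range(1, 101)]   # 100 observations: 1..100
imputed = actual[:90] + [50.0] * 10          # 10 values collapsed to one value
n = len(actual)

# Steps 1-2: deciles of the actual data serve as upper class bounds.
bounds = [actual[int(n * k / 10) - 1] for k in range(1, 11)]

# Steps 3-5: relative cumulative frequencies (RCF) per class for both data sets.
def rcf(values):
    return [sum(v <= b for v in values) / len(values) for b in bounds]

# Step 6: the K-S statistic is the maximum absolute RCF difference.
D = max(abs(a - b) for a, b in zip(rcf(actual), rcf(imputed)))

# Step 7: large-sample critical value at the 0.05 level.
critical = 1.36 / n ** 0.5

# Step 8: the imputed data keep the actual distribution if D < critical.
same_distribution = D < critical
```

Here D = 0.1 against a critical value of 0.136, so this hypothetical imputation would (narrowly) pass the test.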
1. Income and Expenditure deciles were created. The deciles that were used
in the previous test were the same deciles used here.
2. The obtained deciles were used as upper bounds of the frequency classes.
3. A Frequency Distribution Table (FDT) for both the imputed and actual
values was generated.
4. For Hot Deck and Stochastic Regression, which had 1,000 simulated data
sets, the Relative Frequencies (RF) for each frequency class were averaged
over the 1,000 RFs.
Imputation Methods
Lastly, the researchers adopted measures used by Kalton (1983) in his report enti-
tled Compensating for Missing Data for evaluating the effectiveness of imputation
methods. These measures are: (a) Mean Deviation (MD), (b) Mean Ab-
solute Deviation (MAD) and (c) Root Mean Square Deviation (RMSD).
The Mean Deviation (MD) measures the bias of the imputed values. This is
represented by the formula:

MD = ( Σ (ŷmi − ymi) ) / m ,  i = 1, 2, ..., m

where ŷmi is the imputed value of the variable TOTEX2 or TOTIN2 and ymi is
the actual value of the variable TOTEX2 or TOTIN2 for case i = 1, 2, ..., m.
The Mean Absolute Deviation (MAD) measures the closeness with which the
deleted values are reconstructed. This is represented by the formula:

MAD = ( Σ |ŷmi − ymi| ) / m ,  i = 1, 2, ..., m

where ŷmi is the imputed value of the variable TOTEX2 or TOTIN2, and ymi is
the actual value of the variable TOTEX2 or TOTIN2 for case i = 1, 2, ..., m.
The Root Mean Square Deviation (RMSD) is the square root of the mean of the
squared deviations between the imputed and actual observations. Like the MAD,
it measures the closeness with which the deleted values are reconstructed. This is
expressed as:

RMSD = √( Σ (ŷmi − ymi)² / m ) ,  i = 1, 2, ..., m

where ŷmi is the imputed value of the variable TOTEX2 or TOTIN2, and ymi is
the actual value of the variable TOTEX2 or TOTIN2 for case i = 1, 2, ..., m.
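The three criteria can be computed as follows on hypothetical imputed and actual values for the m deleted cases:

```python
# Hypothetical imputed and actual values for the m deleted cases.
imputed = [10.0, 12.0, 9.0, 15.0]
actual = [11.0, 11.0, 10.0, 13.0]
m = len(actual)

deviations = [yhat - y for yhat, y in zip(imputed, actual)]

# Mean Deviation: the bias of the imputed values (signed errors can cancel).
MD = sum(deviations) / m

# Mean Absolute Deviation: how closely the deleted values are reconstructed.
MAD = sum(abs(d) for d in deviations) / m

# Root Mean Square Deviation: like MAD, but penalizing large misses more.
RMSD = (sum(d ** 2 for d in deviations) / m) ** 0.5
```

Note how MD stays near zero when positive and negative errors cancel, while MAD and RMSD do not: this is why the study uses all three.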
These three criteria for measuring the performance of the imputation methods
were implemented in the Decimal Basic program. After each imputation
method is performed, the program computes the Mean Deviation, Mean Absolute
Deviation and Root Mean Square Deviation and saves them in the corresponding
Criteria for Expenditure (CRITEX) and Criteria for Income (CRITIN) files.
To answer the primary objective of this study, determining the best or most
appropriate imputation technique for the 1997 FIES, the researchers ranked the
four imputation methods based on the criteria discussed in the previous sections.
The selection of the best method is made independently for each variable of inter-
est and nonresponse rate. The ranking of the imputation methods covered the
following: Bias of the Mean of the Imputed Data; Estimated Percentage
of Correct Distribution of the Imputed Data (PCD), which refers to the
proportion, out of the total number of simulated data sets, of imputed data sets
that reconstructed the actual data set; Mean Deviation (MD); Mean
Absolute Deviation (MAD); and Root Mean Square Deviation (RMSD).
1. For each criterion mentioned above, the imputation methods were ranked
on a scale of 1 to 4, with 1 indicating the best imputation method and 4
the worst.
2. For each variable of interest (i.e. TOTEX2, TOTIN2), the rankings obtained
by a particular imputation method across the criteria were added.
3. The imputation method with the lowest total is considered the best
imputation method for the respective variable of interest and nonresponse
rate.
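The ranking procedure, including the averaging of tied ranks, can be sketched on hypothetical criterion scores (assuming, for simplicity, that a lower score is better for every criterion, which in practice would not hold for PCD):

```python
# Hypothetical criterion scores for the four imputation methods on one
# variable of interest and NRR; lower is assumed better for every criterion.
scores = {
    "OMI": [5.0, 0.10, 4.0, 6.0, 7.0],
    "HDI": [4.0, 0.20, 3.0, 5.0, 6.0],
    "DRI": [2.0, 0.05, 2.0, 3.0, 4.0],
    "SRI": [1.0, 0.04, 1.0, 2.0, 3.0],
}
methods = list(scores)
n_criteria = len(next(iter(scores.values())))

# Step 1: rank the methods 1 (best) to 4 (worst) within each criterion,
# averaging the ranks in case of ties.
totals = {m: 0.0 for m in methods}
for c in range(n_criteria):
    vals = sorted(scores[m][c] for m in methods)
    for m in methods:
        ranks = [i + 1 for i, v in enumerate(vals) if v == scores[m][c]]
        totals[m] += sum(ranks) / len(ranks)  # step 2: add the per-criterion ranks

# Step 3: the method with the smallest rank total is declared the best.
best = min(totals, key=totals.get)
```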
The results of the ranking procedure are presented in the next chapter.
Chapter 5
Variables
Table 2 shows the descriptive statistics of the second-visit variables of interest
(VI), TOTEX2 and TOTIN2. These were computed to give a brief idea of how
much a household spends and earns in a period of time, to measure the differences
in the statistics between the two variables, and to compare the results with other
tests later on.
The average total spending of a household in the National Capital Region (NCR) is
about Php 102,389.80, while the average total earnings amounted to Php 134,119.40,
a difference of more than thirty thousand pesos. It can be noted that the observa-
tions of TOTIN2 have a larger mean and standard deviation than those of
TOTEX2. The dispersion can also be seen by looking at the minimum and
maximum of the two variables.
Tables 3, 4 and 5 show the candidate matching variables along with their respective
categories and scope. The candidate MVs that were tested are the provincial area
codes (PROV), recoded education status (CODES1) and recoded total employed
household members (CODEP1). The candidate PROV has four categories and is
the only matching variable that was not recoded. The other candidates, CODEP1
(recoded total employed household members) and CODES1 (recoded education
status), were reduced to a smaller number of groups, since the original numbers
of categories for these two candidate MVs were 7 and 99, respectively. As
mentioned in the previous chapters, the number of categories was reduced into
smaller groups to minimize the heterogeneity and the bias of the estimates.
The Chi-squared test of association for the candidates and the variables of inter-
est showed that PROV, CODES1 and CODEP1 are associated with CODIN1 and
CODEX1. The p-values for all the candidates were less than 0.0001, indicating
that the associations are highly significant. The results of the succeeding measures
of association determine which of the three candidates is chosen as the MV
of the study.
The measures of association showed small degrees of association with the vari-
ables CODIN1 and CODEX1. This kind of result is expected in real complex
data, given the larger variability among the observations. Table 7 clearly shows
that CODES1 is the MV exhibiting the largest association with the variables
and, therefore, the MV that can best ensure that the ICs are homogeneous. Thus,
CODES1 was chosen as the MV for these data.
The table above indicates that IC1 is the imputation class with the smallest
standard deviation of the three ICs. IC2 and IC3 produced large standard
deviations, but these are offset by the low value of IC1, which holds the largest
proportion of the data. A possible reason why the standard deviation and the
mean of IC3 are large is that the majority of the extreme values fall in that class.
Results in Table 9 show the means of both second-visit VIs, TOTEX2 and
TOTIN2, under all NRRs. These were generated for use as inputs in the eval-
uation of the means of the imputed data for each IM.
The means of the observations set to nonresponse and of the observations retained
showed contrasting results. For both variables, TOTEX2 and TOTIN2, as the
nonresponse rate increases, the mean of the observations set to missing (deleted)
also increases. Conversely, the mean of the observations retained decreases as the
nonresponse rate increases. Perhaps the large values that were set to nonresponse
increased the means of the data sets containing nonresponse for the varying rates
of nonresponse. Hence, as the number of missing values increases, the deviation
between the means of the actual and retained data slowly increases.
Table 10 shows the different regression models for all VIs and nonresponse rates
(NRRs) that were checked for adequacy. The columns are represented as follows:
(a) VI, (b) the nonresponse rate (NRR), (c) IC, (d) the prediction model, (e)
the coefficient of determination (R2 ) and (f) the F-statistic and its corresponding
p-value in parenthesis.
For the notation used in Table 10, the codes IC1, IC2 and IC3 represent the first,
second and third imputation classes, respectively. Meanwhile, for the regression
equations used in the regression imputation, ŷi represents the dependent variable,
the predicted value of the second-visit variable TOTIN2 or TOTEX2.
Logarithmic transformations were utilized in order to correct the non-linearity of
the regression equations. The code LNFVE1i is the logarithmic transformation
of the observation of the first-visit variable Total Expenditure (TOTEX1) under
IC1. Similarly, LNFVI1i is the logarithmic transformation of the first-visit
observation of the variable Total Income (TOTIN1) under IC1. The same
notation applies to LNFVE2i and LNFVE3i under IC2 and IC3 for the variable
TOTEX1, and to LNFVI2i and LNFVI3i under IC2 and IC3 for the variable
TOTIN1, respectively.
Table 10 shows the regression models used for the regression imputations
under their respective VIs and ICs. Before using these equations to impute
missing values, diagnostic checking of the models was performed, covering
Linearity, Normality of the Error Terms, Independence of the Error Terms and
Constancy of Variance.
Second, the models were checked for the assumption of linearity. This
was done using the ANOVA tables presented in Appendix C. The results
of the diagnostic checking showed that all models satisfied the assumption of
linearity: the p-values for all the models were less than 0.0001, indicating that
the linearity of the models is highly significant.
Third, the models were checked for the assumption of normality. For this study,
the researchers examined the Normal Probability Plots (NPP) of the regression
models, which can be found in Appendix C. The normal probability plots of all
models moderately follow an S-shaped pattern, which indicates that the residuals
are not normal but rather lognormal. However, the shape of the NPPs improved
after the ln transformation was applied, even though the models were not linear
previously. Since the data come from a complex survey, the models were used
even though the assumption of normality of the residuals is not perfectly
satisfied.
Fourth, to test the assumption of independence of the error terms, the Durbin-
Watson test was implemented. Results in Appendix C show that all of the models
satisfy the assumption of independence.
Hence, given this discussion, the results show that the assumptions checked in
the diagnostics of the regression equations used for the regression imputations
were adequately satisfied.
To determine the effect of the nonresponse rates on the results of each imputation
method (IM), an evaluation of the different IMs was performed, with the results
of each IM discussed independently. For each IM, the discussion of results
proceeds as follows: (1) bias of the mean of the imputed data, (2) distribution
of the imputed data using the Kolmogorov-Smirnov goodness-of-fit test, and (3)
other measures of variability, namely the mean deviation (MD), mean absolute
deviation (MAD), and root mean square deviation (RMSD).
The tables of results contain the following columns: (a) variable of interest
(VI), (b) nonresponse rate (NRR), (c) the bias of the mean of the imputed data,
Bias(ȳ′), (d) the percentage of correct distribution of the imputed data relative
to the actual data set out of 1,000 trials (PCD), (e) MD, (f) MAD, and (g) RMSD.
Table 11 shows the results of the different criteria in evaluating the imputed data
using the overall mean imputation (OMI) method.
On the other hand, the results for TOTIN2 were the opposite of those for
TOTEX2 as the NRR increases.
For TOTIN2, the data with twenty percent imputed observations have the
highest values in all three measures of variability, while, unlike for TOTEX2,
the values of the three measures under the highest NRR are surprisingly the
lowest.
Table 12 shows the results of the different criteria in evaluating the imputed data
using the Hot Deck Imputation (HDI) method with three imputation classes.
For the TOTEX2 variable, the data with twenty percent NRR provided the
least bias. On the other hand, the data with the lowest NRR yielded the
smallest bias for TOTIN2.
For TOTEX2 and TOTIN2, the data with the highest number of imputed
observations failed to maintain the distribution of the actual data. Worse, none
of the simulated data sets for TOTEX2 registered the same distribution as the
actual data, while only a lone TOTIN2 data set maintained it. The researchers
attribute this to the possibility that more than one recipient shared the same
donor.
For the variable TOTIN2, the following results were obtained: (i) all three
criteria increase as the NRR increases, (ii) the results of the three criteria are
larger than those for TOTEX2, and (iii) the data with the largest number of
imputations generated the highest values in the three criteria.
Table 13 shows the results of the different criteria in evaluating the imputed data
using the Deterministic Regression Imputation method with three imputation
classes (DRI).
Table 14 shows the results of the different criteria in evaluating the imputed data
using the Stochastic Regression Imputation method with three imputation classes
(SRI).
For the OMI method, the figures clearly illustrate the distortion of the distribu-
tion. Since the OMI method assigns the mean of the first-visit VI to all the
missing cases, the imputed values concentrate in one particular frequency class.
The three other methods, which implemented imputation classes, gave a better
outcome than OMI by spreading the distribution of the imputed data.
For the HDI method, all the figures clearly illustrate the over-representation in
the first frequency class, that is, less than 37,859.5 for TOTEX2 and less than
40,570.0 for TOTIN2. Over-representation can also be seen in Figure 5 under the
second frequency class, and in Figures 6 and 7 under the last frequency class.
Alongside this over-representation under HDI, under-representation was also
observed: in Figures 2, 3 and 4, the seventh frequency class (128,000.0 - 161,669.0)
for TOTIN2 is under-represented.
The two regression imputation methods, unlike HDI and OMI which produced
major clusters, gave more spread-out distributions, although some areas are
under-represented. The failure to consider a random residual term in the
deterministic regression resulted in a severe under-representation of the data,
particularly in the first frequency class in all the figures. In Figures 2, 3 and 4,
under-representation can also be seen in the last frequency class for TOTIN2.
SRI, which adds a random residual, provided better results than DRI. However,
in some areas the added random residual produced a significant excess, mostly
in the last frequency class, as can be seen in Figures 5, 6 and 7 for TOTEX2 and
Figures 2 and 4 for TOTIN2.
In this section, the rankings across all the tests are the basis for determining
which of the IMs will be chosen as the best IM for this particular study and data.
The selection of the best method is done independently for each VI and NRR. The
rankings are based on a four-point system wherein a rank value of 4 denotes the
worst IM for a specific criterion and 1 denotes the best IM for that criterion. In
case of ties, average ranks are substituted. The IM with the smallest rank
total is declared the best IM for the particular VI and NRR. The ranking
of the IMs covers the following criteria: (a) bias of the mean of the imputed
data, (b) percentage of correct distributions (PCD), and (c) other measures of
variability, namely MD, MAD and RMSD. All in all, there are five criteria on
which each IM is ranked.
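The tie-aware ranking scheme described above can be sketched in Python (the thesis programs themselves are written in True BASIC; the criterion scores below are illustrative only, not results from the study):

```python
def average_ranks(scores):
    """Rank competing methods by a criterion score (smaller = better,
    rank 1 = best); tied scores receive the average of the tied ranks."""
    order = sorted(scores)
    ranks = []
    for s in scores:
        first = order.index(s) + 1      # first 1-based position of s
        count = order.count(s)          # number of methods tied at s
        ranks.append(first + (count - 1) / 2.0)
    return ranks

# Hypothetical criterion scores for (OMI, HDI, DRI, SRI); smaller is better.
per_criterion = [[3.0, 4.0, 1.0, 1.0],   # e.g. |bias of the mean|
                 [4.0, 3.0, 1.0, 2.0]]   # e.g. MD
rank_totals = [sum(r) for r in zip(*(average_ranks(c) for c in per_criterion))]
# The method with the smallest rank total is declared the best IM.
```

With these hypothetical scores the rank totals are [7.0, 7.0, 2.5, 3.5], so the third method would be selected.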
Tables 12, 13 and 14 show the ranking of the different imputation methods for the
10%, 20% and 30% NRR, respectively. For each NRR, the table containing the
rankings of the IMs is organized as follows: (a) VIs, (b) criteria, (c) OMI,
(d) HDI, (e) DRI, and (f) SRI.
The rankings show that the two regression IMs provided better results than their
model-free counterparts. For all the nonresponse rates under the TOTIN2 vari-
able, the two regression imputation methods tied as the best IM, and, surprisingly,
the HDI finished as the worst IM, behind OMI. Under the TOTEX2 variable, mixed
rankings were seen across the nonresponse rates, though the regression methods
still provided good results. The SRI method finished first in the 10% and 30%
NRR and third in the 20% NRR, while the DRI method finished third, first and
second in the 10%, 20% and 30% NRR, respectively. While the HDI was the worst
IM for TOTIN2, the OMI was judged the worst IM for TOTEX2, ranking last for
both the 10% and 20% NRR and third for the 30% NRR.
In conclusion, the best imputation method for this study is the Stochastic Re-
gression Imputation using the 1997 FIES data. It is very closely followed by
the Deterministic Regression Imputation. In none of the results did the SRI
method rank last across the criteria, NRRs and VIs, unlike DRI, which was the
worst IM for the bias of the mean of the imputed data and the MD criteria. The
researchers selected the HDI as the worst IM in this study: the HDI method rated
poorly in the majority of the results for the different criteria under each NRR
and VI.
Chapter 6
Conclusion
Anyone faced with having to make decisions about imputation procedures will
usually have to choose some compromise between what is technically effective and
what is operationally expedient. If resources are limited, this is a hard choice.
This study aims to help future researchers in choosing the most appropriate
imputation method.
For our particular implementation, all of the methods were programmed from
scratch due to the unavailability of software that can generate imputations for
all the methods needed for this study. Of all the methods, the overall mean
imputation was the easiest to use and to program. The other three methods
required the formation of imputation classes. The two regression imputations
were the hardest to program and the most time-consuming imputation methods.
The results show that the choice of imputation method significantly affected the
estimates of the actual data. The similarities between the two best methods,
namely the Deterministic and Stochastic Regression imputation methods, were due
in part to the adequacy and predictive power of the models.
The bias and variance estimates of the imputed data varied considerably across
imputation methods, and it was unexpected that the Hot Deck Imputation method
rendered the highest estimates in the majority of the nonresponse rates and
variables. Stochastic Regression, on the other hand, was the best method for
that criterion, since the majority of the test results produced relatively small
biases and variances.
The distributions of the imputed data of each method were checked for the preser-
vation of the distribution using the Kolmogorov-Smirnov goodness-of-fit test.
Among the methods used in this study, both regression imputation methods retained
the distribution of the data, especially the Deterministic Regression Imputation,
which generated exactly the same distribution as the actual data.
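The two-sample Kolmogorov-Smirnov statistic behind this check can be sketched in Python (a simplified illustration, not the procedure used in the thesis programs; the comparison against a critical value is omitted):

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute gap
    between the two empirical CDFs, evaluated over the pooled sample."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of observations less than or equal to x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)
```

A statistic large relative to the critical value indicates that the imputed data no longer follow the distribution of the actual data.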
In the other tests of accuracy and precision, namely the mean deviation, mean
absolute deviation and root mean square deviation, the different methods provided
mixed results across all nonresponse rates. Some methods did not consistently
yield clearly good results. Only half of the methods provided strong results in
one particular criterion, the preservation of the distribution of the data. In
the other results, inconsistency was evident in the alternating rankings of the
methods.
Given the criteria and procedures for judging the best imputation procedure among
the four methods, the selection of the best method was difficult. Consequently,
in order to determine the best method of imputing nonresponse observations for
each variable in the study, the methods were ranked according to several criteria.
Methods ranked 1 indicate the best imputation method, while methods ranked 4 are
the worst for that particular criterion.
After comparing the methods, the two regression methods, namely the Determinis-
tic and Stochastic Regression Imputation, gave the outstanding results. Therefore,
it can be concluded that the Stochastic Regression Imputation procedure is the
best imputation method for this study, since it did not rank poorly in any
criterion under all NRRs and VIs.
The efficiency of the imputation method was supported by the R2 of the model
and by the random residual added to the deterministic imputed value. The random
residuals added to the deterministic imputation made the estimates less biased
than their deterministic counterparts.
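The idea of adding a randomly drawn observed residual to the deterministic prediction can be sketched in Python (a simplified, single-class version with simple linear regression; the thesis programs work on log-transformed values in True BASIC, and the function and variable names here are illustrative):

```python
import random

def stochastic_regression_impute(x_obs, y_obs, x_miss, seed=None):
    """Fit y = b0 + b1*x on the observed pairs, then impute each missing y
    as the deterministic prediction plus a residual drawn at random from
    the observed residuals (deterministic regression + random residual)."""
    n = len(x_obs)
    xbar = sum(x_obs) / n
    ybar = sum(y_obs) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(x_obs, y_obs))
    sxx = sum((x - xbar) ** 2 for x in x_obs)
    b1 = sxy / sxx                      # least-squares slope
    b0 = ybar - b1 * xbar               # least-squares intercept
    residuals = [y - (b0 + b1 * x) for x, y in zip(x_obs, y_obs)]
    rng = random.Random(seed)
    return [b0 + b1 * x + rng.choice(residuals) for x in x_miss]
```

Dropping the `rng.choice(residuals)` term gives the deterministic (DRI) imputed value.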
In this study, we compared four imputation methods commonly used in dealing
with partial nonresponse, under the assumption of MCAR. However, other methods
are currently being developed and improved. For example, the multiple imputation
method involves independently imputing more than one value for each nonresponse
value. Multiple imputation is an important and powerful form of imputation and
has the advantage that variance estimation under imputation can be carried out
comparatively easily (Kalton, 1983).
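The comparatively easy variance estimation referred to above is usually done with Rubin's combining rules. A minimal sketch, assuming each of the m completed data sets yields a point estimate and a within-imputation variance:

```python
def rubin_pool(point_estimates, within_variances):
    """Rubin's rules for multiple imputation: pool m point estimates into
    one estimate and a total variance T = W + (1 + 1/m) * B, where W is
    the average within-imputation variance and B is the between-imputation
    variance of the point estimates."""
    m = len(point_estimates)
    q_bar = sum(point_estimates) / m
    w = sum(within_variances) / m
    b = sum((q - q_bar) ** 2 for q in point_estimates) / (m - 1)
    return q_bar, w + (1 + 1 / m) * b
```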
Regarding variance estimation, further studies should implement proper variance
estimators such as the jackknife variance estimator, which is often used in
comparing the variance estimates of imputation methods. Rao and Shao (1992)
proposed an adjusted jackknife variance estimator for use with imputation methods
related to the Hot Deck procedure. This variance estimator is said to be asymptot-
ically unbiased.
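For reference, the basic delete-one jackknife (without the Rao-Shao adjustment, which additionally re-adjusts the imputed values within each replicate) can be sketched as:

```python
def jackknife_variance(data, statistic):
    """Delete-one jackknife variance estimate of `statistic` on `data`:
    recompute the statistic with each observation left out in turn, then
    scale the spread of the replicates by (n - 1) / n."""
    n = len(data)
    replicates = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    mean_rep = sum(replicates) / n
    return (n - 1) / n * sum((t - mean_rep) ** 2 for t in replicates)
```

For the sample mean, this reproduces the familiar estimate s^2 / n of the variance of the mean.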
Future researchers may test other methods on the same data set and compare
the results with those presented in this paper. They could also compare the re-
sults of this study with those of multiple imputation and the Rao-Shao jackknife
variance estimator. There is a need, however, for more advanced knowledge of
statistics, particularly Bayesian statistics, in using the above procedures. The
complexity of the methods, especially both regression imputations, could hinder
future researchers in the use of modern variance estimators.
It is also suggested that the matching variable be selected through advanced
statistical methods such as CHAID analysis. The acronym CHAID stands for
Chi-squared Automatic Interaction Detector. It is one of the oldest tree
classification methods, originally proposed by Kass (1980); according to Ripley
(1996), the CHAID algorithm is a descendant of THAID, developed by Morgan and
Messenger (1973). CHAID builds non-binary trees (i.e., trees where more than two
branches can attach to a single root or node) based on a relatively simple
algorithm that is particularly well suited to the analysis of larger datasets.
Also, because the CHAID algorithm often effectively yields many multi-way
frequency tables (e.g., when classifying a categorical response variable with
many categories based on categorical predictors with many classes), it has been
particularly popular in marketing research, in the context of market segmen-
tation studies (StatSoft, 2003).
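The splitting criterion at the heart of CHAID — scoring a candidate predictor by the chi-squared statistic of its cross-tabulation with the response — can be illustrated as follows (the bare statistic only; full CHAID also merges categories and applies Bonferroni-adjusted p-values):

```python
def chi2_statistic(table):
    """Pearson chi-squared statistic of independence for a contingency
    table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat
```

The candidate predictor whose cross-tabulation gives the most significant statistic becomes the next split in the tree.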
the model. These dummy variables are the categories of the matching variables.
It would definitely save time and money since only one model is created and tested.
The researchers strongly recommend using a statistical package that can gen-
erate imputations faster and more easily, with less biased estimates, than custom
programming. It would save the time otherwise spent writing a computer program,
where a majority of the research time goes to debugging and to preventing crashes
due to computer memory overload.
Bibliography
[4] Musil, C., Warner, C., Yobas, P. K. and Jones, S. A Comparison of Imputation
Techniques for Handling Missing Data. Western Journal of Nursing Research,
Vol. 24, No. 7, 815-829 (2002).
[5] National Statistics Office (NSO) (1997 - 2005). Technical Notes on the 1997
Family Income and Expenditure Survey (FIES). Retrieved 18 June 2007,
from http://www.census.gov.ph/data/technotes/notefies.html
[6] Neter, J., Wasserman, W. and Kutner, M.H. Applied Linear Statistical Models,
2nd ed. Homewood, Illinois: Richard D. Irwin, Inc.
[8] Obanil, R. (2006, October 3). Topmost Floor of NSO Building Gutted by
Fire. The Manila Bulletin Online. Retrieved 28 August 2007, from
http://www.mb.com.ph/issues/2006/10/03/MTN2061037203.html
[12] StatSoft, Inc. (2005). STATISTICA (data analysis software system), version
7.1. www.statsoft.com
Appendix
Appendix A
Items and Information Gathered in the FIES
1997
Part I – Identification and Other Information
A. Identification of the Household
B. Other Information:
1. Particulars about the Head of the Family
a) Sex
b) Age as of Last Birthday
c) Marital Status
d) Highest Grade Completed
e) Employment Status
f) Occupation
g) Kind of Industry / Business
h) Class of Worker
2. Other information about the Household
a) Type of Household
b) Number of Family Members Enumerated
c) Number of boarders, helpers and other non-relatives
d) Number of Family Members who are Employed for Pay or Profit
Part II – Expenditures and Other Disbursements
A. Food, Alcoholic Beverages and Tobacco
1. Food Consumed at Home
a) Cereals and Cereal Preparations
b) Roots and Tubers
c) Fruits and Vegetables
d) Meat and Meat Preparations
e) Dairy Products and Eggs
f) Fish and Marine Products
g) Coffee, Cocoa and Tea
h) Non-Alcoholic Beverages
i) Food Not Elsewhere Classified
2. Food Regularly Consumed Outside the Home
3. Alcoholic Beverages
4. Tobacco
5. Food Items, Alcoholic Beverages and
Tobacco Received as Gifts
B. Fuel, Light and Water, Transportation and
Communication and Household Operation
C. Personal Care and Effects, Clothing, Footwear and Other
Wear
D. Education, Recreation and Medical Care
E. Furnishings and Equipment
F. Taxes
G. Housing, House Maintenance and Minor Repairs
H. Miscellaneous Expenditures
I. Other Disbursements
Part III – Income and Other Receipts
A. Salaries and Wages from Employment
B. Net Share of Crops, Fruits and Vegetables Produced or Livestock and Poultry
Raised by Other Households
C. Other Sources of Income
1. Cash Receipts, Gifts, Support, Relief and Other Forms
of Assistance From Abroad
2. Cash Receipts, Support, Assistance and Relief from
Domestic Source
3. Rentals Received From Non-Agricultural Lands,
Buildings, Spaces and Other Properties
4. Interest
5. Pension and Retirement, Workmen's Compensation and
Social Security Benefits
6. Net Winnings from Gambling, Sweepstakes and Raffle
7. Dividends From Investment
8. Profits from Sale of Stocks, Bonds and Real and
Personal Property
9. Back pay and Proceeds from Insurance
10. Inheritance
D. Other Receipts
Appendix B
Source Codes of the Imputation Programs
SIMULATION.BAS
RANDOMIZE
LET TOT = 4130 ! TOT = TOTAL NUMBER OF OBSERVATIONS !
! Creating a new file to save the original data with an additional column with
nonresponse observations !
FOR I = 1 TO TOT
LET RN(I,1) = RND
FOR COL = 1 TO 3
LET IMISS(I,COL) = IFIES(I,COL)
LET EMISS(I,COL) = EFIES(I,COL)
NEXT COL
LET IMISS(I,4) = IFIES(I,3)
LET EMISS(I,4) = EFIES(I,3)
NEXT I
FOR B = 1 TO NON
LET IMISS(B,4) = -1 ! Setting the observation to nonresponse !
LET EMISS(B,4) = -1
NEXT B
CLOSE #3
CLOSE #4
END
OMI.BAS
! Opening and creation of files to be used to upload the data and write the results in the
program !
OPEN #1: NAME "E:\MISSI30%.CSV"
MAT READ #1: SIMI
CLOSE #1
OPEN #2: NAME "E:\MISSE30%.CSV"
MAT READ #2: SIME
CLOSE #2
OPEN #3: NAME "E:\IDATA30%.TXT"
ERASE #3
OPEN #4: NAME "E:\EDATA30%.TXT"
ERASE #4
REM Computation of the overall mean of the first visit nonresponse variables to be
imputed later on the program
FOR I = 1 TO TOT
LET OMEX = OMEX + SIME(I,2)
LET OMIN = OMIN + SIMI(I,2)
NEXT I
LET OMEX = OMEX/TOT
LET OMIN = OMIN/TOT
FOR J = 1 TO NON
LET SIMI(j,4) = OMIN
LET SIME(J,4) = OMEX
NEXT J
REM Computation of the mean deviation, mean absolute deviation, root mean square
deviation
! INCDOMI = deviation of the imputed and the actual observation for the income
variable under OMI !
! EXPDOMI = deviation of the imputed and the actual observation for the expenditure
variable under OMI !
LET INCDOMI = 0
LET EXPDOMI = 0
! INCMDOMI = mean deviation of the imputed and the actual observation for the income
variable under OMI !
! EXPMDOMI = mean deviation of the imputed and the actual observation for the
expenditure variable under OMI !
LET INCMDOMI = 0
LET EXPMDOMI = 0
! INCMADOMI = mean absolute deviation of the imputed and the actual observation for
the income variable under OMI !
! EXPMADOMI = mean absolute deviation of the imputed and the actual observation for
the expenditure variable under OMI !
LET INCMADOMI = 0
LET EXPMADOMI = 0
! INCRMSDOMI = root mean square deviation of the imputed and the actual observation
for the income variable under OMI !
! EXPRMSDOMI = root mean square deviation of the imputed and the actual observation
for the expenditure variable under OMI !
LET INCRMSDOMI = 0
LET EXPRMSDOMI = 0
FOR I = 1 TO NON
! Deviations of the imputed values (column 4) from the actual values (column 3) !
LET INCDOMI = SIMI(I,4) - SIMI(I,3)
LET EXPDOMI = SIME(I,4) - SIME(I,3)
LET INCMDOMI = INCMDOMI + INCDOMI
LET EXPMDOMI = EXPMDOMI + EXPDOMI
LET INCMADOMI = INCMADOMI + ABS(INCDOMI)
LET EXPMADOMI = EXPMADOMI + ABS(EXPDOMI)
LET INCRMSDOMI = INCRMSDOMI + INCDOMI^2
LET EXPRMSDOMI = EXPRMSDOMI + EXPDOMI^2
NEXT I
LET INCMDOMI = INCMDOMI/NON
LET EXPMDOMI = EXPMDOMI/NON
LET INCMADOMI = INCMADOMI/NON
LET EXPMADOMI = EXPMADOMI/NON
LET INCRMSDOMI = SQR(INCRMSDOMI/NON)
LET EXPRMSDOMI = SQR(EXPRMSDOMI/NON)
CLOSE #3
CLOSE #4
CLOSE #5
CLOSE #6
END
HOT DECK.BAS
RANDOMIZE
LET TOT = 4130 ! TOT = Total number of observations !
LET TRIALS = 100 ! Number of trials !
LET N1 = 2635 ! N1 = Number of total observations under employment status 1 !
LET N2 = 1434 ! N2 = Number of total observations under employment status 2 !
LET N3 = 61 ! N3 = Number of total observations under employment status 3 !
LET NRR = 0.3 ! NRR = Nonresponse rate !
LET NON = NRR * TOT ! NON = Number of total nonresponse observations !
DIM SIMI1(N1,5), SIMI2(N2,5), SIMI3(N3,5) ! Matrices that will contain the income observations for each imputation class !
! Matrices that contain the criteria that will be computed later in the program !
DIM CRITIN(TRIALS,3), CRITEX(TRIALS,3)
! Opening and creation of files to be used to upload the data and write the results in the
program !
DO
LET TRIAL = TRIAL + 1 ! Trial count !
! PECK# = Observation number that was chosen randomly for the expenditure
variable under education status # !
! PICK# = Observation number that was chosen randomly for the income variable
under education status # !
FOR I= 1 TO N1
IF SIME1(I,5) = 0 THEN
LET PECK1 = INT(RND*N1) + 1
LET SIME1(I,4) = SIME1(PECK1,2)
END IF
IF SIMI1(I,5) = 0 THEN
LET PICK1 = INT(RND*N1) + 1
LET SIMI1(I,4) = SIMI1(PICK1,2)
END IF
NEXT I
FOR J = 1 TO N2
IF SIME2(J,5) = 0 THEN
LET PECK2 = INT(RND*N2) + 1
LET SIME2(J,4) = SIME2(PECK2,2)
END IF
IF SIMI2(J,5) = 0 THEN
LET PICK2 = INT(RND*N2) + 1
LET SIMI2(J,4) = SIMI2(PICK2,2)
END IF
NEXT J
FOR K = 1 TO N3
IF SIME3(K,5) = 0 THEN
LET PECK3 = INT(RND*N3) + 1
LET SIME3(K,4) = SIME3(PECK3,2)
END IF
IF SIMI3(K,5) = 0 THEN
LET PICK3 = INT(RND*N3) + 1
LET SIMI3(K,4) = SIMI3(PICK3,2)
END IF
NEXT K
LET MDSIMI1 = 0
LET MDSIMI2 = 0
LET MDSIMI3 = 0
LET MADSIMI1 = 0
LET MADSIMI2 = 0
LET MADSIMI3 = 0
LET RMSDSIMI1 = 0
LET RMSDSIMI2 = 0
LET RMSDSIMI3 = 0
LET MDSIME1 = 0
LET MDSIME2 = 0
LET MDSIME3 = 0
LET MADSIME1 = 0
LET MADSIME2 = 0
LET MADSIME3 = 0
LET RMSDSIME1 = 0
LET RMSDSIME2 = 0
LET RMSDSIME3 = 0
FOR A = 1 TO N1
LET NUM = NUM + 1
IF SIMI1(A,5) = 0 THEN
LET DIFFI1 = SIMI1(A,4) - SIMI1(A,3)
LET MDSIMI1 = MDSIMI1 + DIFFI1
LET MADSIMI1 = MADSIMI1 + (ABS(DIFFI1))
LET RMSDSIMI1 = RMSDSIMI1 + (DIFFI1)^2
END IF
IF SIME1(A,5) = 0 THEN
LET DIFFE1 = SIME1(A,4) - SIME1(A,3)
LET MDSIME1 = MDSIME1 + DIFFE1
LET MADSIME1 = MADSIME1 + (ABS(DIFFE1))
LET RMSDSIME1 = RMSDSIME1 + (DIFFE1)^2
END IF
FOR COL = 1 TO 5
LET SIMI(NUM,COL) = SIMI1(A,COL)
LET SIME(NUM,COL) = SIME1(A,COL)
NEXT COL
CLOSE #7
CLOSE #8
CLOSE #9
CLOSE #10
END
DRI.BAS
! Opening and creation of files to be used to upload the data and write the results in the
program !
FOR J = 1 TO N2
LET LNFVE2(J,1) = LOG(SIME2(J,2))
IF SIME2(J,5) = 1 THEN
LET NOR2 = NOR2 + 1
FOR K = 1 TO N3
LET LNFVE3(K,1) = LOG(SIME3(K,2))
IF SIME3(K,5) = 1 THEN
LET NOR3 = NOR3 + 1
LET LNSVE3(K,1) = LOG(SIME3(K,3))
LET EYBAR3 = EYBAR3 + LNSVE3(K,1)
LET EXBAR3 = EXBAR3 + LNFVE3(K,1)
END IF
FOR A = 1 TO N1
IF SIME1(A,5) = 1 THEN
LET EDXY1 = EDXY1 + ((LNFVE1(A,1) - EXBAR1)*(LNSVE1(A,1) - EYBAR1))
LET EDXSQR1 = EDXSQR1 + (LNFVE1(A,1) - EXBAR1)^2
END IF
IF SIMI1(A,5) = 1 THEN
LET IDXY1 = IDXY1 + ((LNFVI1(A,1) - IXBAR1)*(LNSVI1(A,1) - IYBAR1))
LET IDXSQR1 = IDXSQR1 + (LNFVI1(A,1) - IXBAR1)^2
END IF
NEXT A
FOR B = 1 TO N2
IF SIME2(B,5) = 1 THEN
LET EDXY2 = EDXY2 + ((LNFVE2(B,1) - EXBAR2)*(LNSVE2(B,1) - EYBAR2))
LET EDXSQR2 = EDXSQR2 + (LNFVE2(B,1) - EXBAR2)^2
END IF
IF SIMI2(B,5) = 1 THEN
LET IDXY2 = IDXY2 + ((LNFVI2(B,1) - IXBAR2)*(LNSVI2(B,1) - IYBAR2))
LET IDXSQR2 = IDXSQR2 + (LNFVI2(B,1) - IXBAR2)^2
END IF
NEXT B
FOR C = 1 TO N3
IF SIME3(C,5) = 1 THEN
LET EDXY3 = EDXY3 + ((LNFVE3(C,1) - EXBAR3)*(LNSVE3(C,1) - EYBAR3))
LET EDXSQR3 = EDXSQR3 + (LNFVE3(C,1) - EXBAR3)^2
END IF
IF SIMI3(C,5) = 1 THEN
LET IDXY3 = IDXY3 + ((LNFVI3(C,1) - IXBAR3)*(LNSVI3(C,1) - IYBAR3))
LET IDXSQR3 = IDXSQR3 + (LNFVI3(C,1) - IXBAR3)^2
END IF
NEXT C
FOR Y1 = 1 TO N1
LET DREGE1(Y1,1) = EBZERO1 + EBONE1*LNFVE1(Y1,1)
LET DREGI1(Y1,1) = IBZERO1 + IBONE1*LNFVI1(Y1,1)
NEXT Y1
FOR Y2 = 1 TO N2
LET DREGE2(Y2,1) = EBZERO2 + EBONE2*LNFVE2(Y2,1)
LET DREGI2(Y2,1) = IBZERO2 + IBONE2*LNFVI2(Y2,1)
NEXT Y2
FOR Y3 = 1 TO N3
LET DREGE3(Y3,1) = EBZERO3 + EBONE3*LNFVE3(Y3,1)
LET DREGI3(Y3,1) = IBZERO3 + IBONE3*LNFVI3(Y3,1)
NEXT Y3
! Computation of residuals !
FOR A = 1 TO N1
IF SIME1(A,5) = 1 THEN
LET EH1 = EH1 + 1
LET ECRES1(EH1,1) = DREGE1(A,1) - LNSVE1(A,1)
END IF
IF SIMI1(A,5) = 1 THEN
LET IH1 = IH1 + 1
LET ICRES1(IH1,1) = DREGI1(A,1) - LNSVI1(A,1)
END IF
NEXT A
FOR B = 1 TO N2
IF SIME2(B,5) = 1 THEN
LET EH2 = EH2 + 1
FOR C = 1 TO N3
IF SIME3(C,5) = 1 THEN
LET EH3 = EH3 + 1
LET ECRES3(EH3,1) = DREGE3(C,1) - LNSVE3(C,1)
END IF
IF SIMI3(C,5) = 1 THEN
LET IH3 = IH3 + 1
LET ICRES3(IH3,1) = DREGI3(C,1) - LNSVI3(C,1)
END IF
NEXT C
FOR Y1 = 1 TO N1
LET DREGE1(Y1,1) = EXP(DREGE1(Y1,1))
LET DREGI1(Y1,1) = EXP(DREGI1(Y1,1))
NEXT Y1
FOR Y2 = 1 TO N2
LET DREGE2(Y2,1) = EXP(DREGE2(Y2,1))
LET DREGI2(Y2,1) = EXP(DREGI2(Y2,1))
NEXT Y2
FOR Y3 = 1 TO N3
LET DREGE3(Y3,1) = EXP(DREGE3(Y3,1))
LET DREGI3(Y3,1) = EXP(DREGI3(Y3,1))
NEXT Y3
FOR A = 1 TO N1
LET NUM = NUM + 1
IF SIME1(A,5) = 0 THEN
LET MDDREGE1 = MDDREGE1 + (DREGE1(A,1) - SIME1(A,3))
LET MADDREGE1 = MADDREGE1 + (ABS(DREGE1(A,1) - SIME1(A,3)))
LET RMSDDREGE1 = RMSDDREGE1 + ((DREGE1(A,1) - SIME1(A,3))^2)
LET SIME1(A,4) = DREGE1(A,1)
END IF
IF SIMI1(A,5) = 0 THEN
LET MDDREGI1 = MDDREGI1 + (DREGI1(A,1) - SIMI1(A,3))
LET MADDREGI1 = MADDREGI1 + (ABS(DREGI1(A,1) - SIMI1(A,3)))
LET RMSDDREGI1 = RMSDDREGI1 + ((DREGI1(A,1) - SIMI1(A,3))^2)
LET SIMI1(A,4) = DREGI1(A,1)
END IF
FOR COL = 1 TO 5
LET SIMI(NUM,COL) = SIMI1(A,COL)
LET SIME(NUM,COL) = SIME1(A,COL)
NEXT COL
NEXT A
FOR B = 1 TO N2
LET NUM = NUM + 1
IF SIME2(B,5) = 0 THEN
LET MDDREGE2 = MDDREGE2 + (DREGE2(B,1) - SIME2(B,3))
LET MADDREGE2 = MADDREGE2 + (ABS(DREGE2(B,1) - SIME2(B,3)))
LET RMSDDREGE2 = RMSDDREGE2 + ((DREGE2(B,1) - SIME2(B,3))^2)
LET SIME2(B,4) = DREGE2(B,1)
END IF
IF SIMI2(B,5) = 0 THEN
LET MDDREGI2 = MDDREGI2 + (DREGI2(B,1) - SIMI2(B,3))
LET MADDREGI2 = MADDREGI2 + (ABS(DREGI2(B,1) - SIMI2(B,3)))
LET RMSDDREGI2 = RMSDDREGI2 + ((DREGI2(B,1) - SIMI2(B,3))^2)
LET SIMI2(B,4) = DREGI2(B,1)
END IF
FOR COL = 1 TO 5
LET SIMI(NUM,COL) = SIMI2(B,COL)
LET SIME(NUM,COL) = SIME2(B,COL)
NEXT COL
NEXT B
FOR C = 1 TO N3
LET NUM = NUM + 1
IF SIME3(C,5) = 0 THEN
LET MDDREGE3 = MDDREGE3 + (DREGE3(C,1) - SIME3(C,3))
FOR COL = 1 TO 5
LET SIMI(NUM,COL) = SIMI3(C,COL)
LET SIME(NUM,COL) = SIME3(C,COL)
NEXT COL
NEXT C
PRINT " MDDREGE ", " MADDREGE ", " RMSDDREGE "
PRINT MDDREGE, MADDREGE, RMSDDREGE
PRINT " MDDREGI ", " MADDREGI ", " RMSDDREGI "
PRINT MDDREGI, MADDREGI, RMSDDREGI
CLOSE #7
CLOSE #8
END
SRI.BAS
RANDOMIZE
LET TOT = 4130 ! TOT = Total number of observations !
LET TRIALS = 100 ! Number of trials !
LET N1 = 2635 ! N1 = Number of total observations under employment status 1 !
LET N2 = 1434 ! N2 = Number of total observations under employment status 2 !
LET N3 = 61 ! N3 = Number of total observations under employment status 3 !
LET NRR = 0.3 ! NRR = Nonresponse rate !
LET NON = NRR * TOT ! NON = Number of total nonresponse observations !
DIM SIMI1(N1,5), SIMI2(N2,5), SIMI3(N3,5) ! Matrices that will contain the income observations for each imputation class !
! Matrices that contain the criteria that will be computed later in the program !
DIM CRITIN(TRIALS,3), CRITEX(TRIALS,3)
! Opening and creation of files to be used to upload the data and write the results in the
program !
FOR J = 1 TO N2
LET LNFVE2(J,1) = LOG(SIME2(J,2))
IF SIME2(J,5) = 1 THEN
FOR K = 1 TO N3
LET LNFVE3(K,1) = LOG(SIME3(K,2))
IF SIME3(K,5) = 1 THEN
LET NOR3 = NOR3 + 1
LET LNSVE3(K,1) = LOG(SIME3(K,3))
LET EYBAR3 = EYBAR3 + LNSVE3(K,1)
LET EXBAR3 = EXBAR3 + LNFVE3(K,1)
END IF
FOR A = 1 TO N1
IF SIME1(A,5) = 1 THEN
LET EDXY1 = EDXY1 + ((LNFVE1(A,1) - EXBAR1)*(LNSVE1(A,1) - EYBAR1))
LET EDXSQR1 = EDXSQR1 + (LNFVE1(A,1) - EXBAR1)^2
END IF
IF SIMI1(A,5) = 1 THEN
LET IDXY1 = IDXY1 + ((LNFVI1(A,1) - IXBAR1)*(LNSVI1(A,1) - IYBAR1))
LET IDXSQR1 = IDXSQR1 + (LNFVI1(A,1) - IXBAR1)^2
END IF
NEXT A
FOR B = 1 TO N2
IF SIME2(B,5) = 1 THEN
LET EDXY2 = EDXY2 + ((LNFVE2(B,1) - EXBAR2)*(LNSVE2(B,1) - EYBAR2))
LET EDXSQR2 = EDXSQR2 + (LNFVE2(B,1) - EXBAR2)^2
END IF
IF SIMI2(B,5) = 1 THEN
LET IDXY2 = IDXY2 + ((LNFVI2(B,1) - IXBAR2)*(LNSVI2(B,1) - IYBAR2))
LET IDXSQR2 = IDXSQR2 + (LNFVI2(B,1) - IXBAR2)^2
END IF
NEXT B
FOR C = 1 TO N3
IF SIME3(C,5) = 1 THEN
LET EDXY3 = EDXY3 + ((LNFVE3(C,1) - EXBAR3)*(LNSVE3(C,1) - EYBAR3))
LET EDXSQR3 = EDXSQR3 + (LNFVE3(C,1) - EXBAR3)^2
END IF
IF SIMI3(C,5) = 1 THEN
LET IDXY3 = IDXY3 + ((LNFVI3(C,1) - IXBAR3)*(LNSVI3(C,1) - IYBAR3))
LET IDXSQR3 = IDXSQR3 + (LNFVI3(C,1) - IXBAR3)^2
END IF
NEXT C
FOR Y1 = 1 TO N1
LET DREGE1(Y1,1) = EBZERO1 + EBONE1*LNFVE1(Y1,1)
LET DREGI1(Y1,1) = IBZERO1 + IBONE1*LNFVI1(Y1,1)
NEXT Y1
FOR Y2 = 1 TO N2
LET DREGE2(Y2,1) = EBZERO2 + EBONE2*LNFVE2(Y2,1)
LET DREGI2(Y2,1) = IBZERO2 + IBONE2*LNFVI2(Y2,1)
NEXT Y2
FOR Y3 = 1 TO N3
LET DREGE3(Y3,1) = EBZERO3 + EBONE3*LNFVE3(Y3,1)
LET DREGI3(Y3,1) = IBZERO3 + IBONE3*LNFVI3(Y3,1)
NEXT Y3
! Computation of residuals !
FOR A = 1 TO N1
IF SIME1(A,5) = 1 THEN
LET EH1 = EH1 + 1
LET ECRES1(EH1,1) = DREGE1(A,1) - LNSVE1(A,1)
END IF
IF SIMI1(A,5) = 1 THEN
LET IH1 = IH1 + 1
LET ICRES1(IH1,1) = DREGI1(A,1) - LNSVI1(A,1)
END IF
NEXT A
FOR B = 1 TO N2
IF SIME2(B,5) = 1 THEN
FOR C = 1 TO N3
IF SIME3(C,5) = 1 THEN
LET EH3 = EH3 + 1
LET ECRES3(EH3,1) = DREGE3(C,1) - LNSVE3(C,1)
END IF
IF SIMI3(C,5) = 1 THEN
LET IH3 = IH3 + 1
LET ICRES3(IH3,1) = DREGI3(C,1) - LNSVI3(C,1)
END IF
NEXT C
FOR Y1 = 1 TO N1
LET DREGE1(Y1,1) = EXP(DREGE1(Y1,1))
LET DREGI1(Y1,1) = EXP(DREGI1(Y1,1))
NEXT Y1
FOR Y2 = 1 TO N2
LET DREGE2(Y2,1) = EXP(DREGE2(Y2,1))
LET DREGI2(Y2,1) = EXP(DREGI2(Y2,1))
NEXT Y2
FOR Y3 = 1 TO N3
LET DREGE3(Y3,1) = EXP(DREGE3(Y3,1))
LET DREGI3(Y3,1) = EXP(DREGI3(Y3,1))
NEXT Y3
NEXT ROW1
END IF
NEXT ROW2
NEXT ROW3
LET CLRESID = 4
! The use of the class means of the frequency classes of the frequency distribution of the
residuals !
FOR I1 = 1 TO NOR1
FOR CL = 1 TO 4
NEXT I1
FOR I2 = 1 TO NOR2
FOR CL = 1 TO 4
LET LBI2 = MINIRES2 + (CWI2*(CL-1))
LET UBI2 = MINIRES2 + (CWI2*(CL))
LET LBE2 = MINERES2 + (CWE2*(CL-1))
LET UBE2 = MINERES2 + (CWE2*(CL))
NEXT I2
FOR I3 = 1 TO NOR3
FOR CL = 1 TO 4
LET LBI3 = MINIRES3 + (CWI3*(CL-1))
LET UBI3 = MINIRES3 + (CWI3*(CL))
LET LBE3 = MINERES3 + (CWE3*(CL-1))
LET UBE3 = MINERES3 + (CWE3*(CL))
NEXT CL
DO
LET TRIAL = TRIAL + 1
FOR B1 = 1 TO N1
IF SIMI1(B1,5) = 0 THEN
LET PICK1 = INT(RND*NOR1) + 1
LET SREGI1(B1,1) = DREGI1(B1,1) + STORIS1(PICK1,1)
END IF
IF SIME1(B1,5) = 0 THEN
LET PECK1 = INT(RND*NOR1) + 1
LET SREGE1(B1,1) = DREGE1(B1,1) + STORES1(PECK1,1)
END IF
NEXT B1
FOR B2 = 1 TO N2
IF SIMI2(B2,5) = 0 THEN
LET PICK2 = INT(RND*NOR2) + 1
LET SREGI2(B2,1) = DREGI2(B2,1) + STORIS2(PICK2,1)
END IF
IF SIME2(B2,5) = 0 THEN
LET PECK2 = INT(RND*NOR2) + 1
LET SREGE2(B2,1) = DREGE2(B2,1) + STORES2(PECK2,1)
END IF
NEXT B2
FOR B3 = 1 TO N3
IF SIMI3(B3,5) = 0 THEN
LET PICK3 = INT(RND*NOR3) + 1
LET SREGI3(B3,1) = DREGI3(B3,1) + STORIS3(PICK3,1)
END IF
IF SIME3(B3,5) = 0 THEN
LET PECK3 = INT(RND*NOR3) + 1
LET SREGE3(B3,1) = DREGE3(B3,1) + STORES3(PECK3,1)
END IF
NEXT B3
FOR Y1 = 1 TO N1
LET SREGE1(Y1,1) = EXP(SREGE1(Y1,1))
LET SREGI1(Y1,1) = EXP(SREGI1(Y1,1))
NEXT Y1
FOR Y2 = 1 TO N2
LET SREGE2(Y2,1) = EXP(SREGE2(Y2,1))
LET SREGI2(Y2,1) = EXP(SREGI2(Y2,1))
NEXT Y2
FOR Y3 = 1 TO N3
LET SREGE3(Y3,1) = EXP(SREGE3(Y3,1))
LET SREGI3(Y3,1) = EXP(SREGI3(Y3,1))
NEXT Y3
LET MDSREGI1 = 0
LET MDSREGI2 = 0
LET MDSREGI3 = 0
LET MADSREGI1 = 0
LET MADSREGI2 = 0
LET MADSREGI3 = 0
LET RMSDSREGI1 = 0
LET RMSDSREGI2 = 0
LET RMSDSREGI3 = 0
LET MDSREGE1 = 0
LET MDSREGE2 = 0
LET MDSREGE3 = 0
LET MADSREGE1 = 0
LET MADSREGE2 = 0
LET MADSREGE3 = 0
LET RMSDSREGE1 = 0
LET RMSDSREGE2 = 0
LET RMSDSREGE3 = 0
FOR A = 1 TO N1
LET NUM = NUM + 1
IF SIME1(A,5) = 0 THEN
LET MDSREGE1 = MDSREGE1 + (SREGE1(A,1) - SIME1(A,3))
LET MADSREGE1 = MADSREGE1 + (ABS(SREGE1(A,1) - SIME1(A,3)))
LET RMSDSREGE1 = RMSDSREGE1 + ((SREGE1(A,1) - SIME1(A,3))^2)
LET SIME1(A,4) = SREGE1(A,1)
END IF
IF SIMI1(A,5) = 0 THEN
LET MDSREGI1 = MDSREGI1 + (SREGI1(A,1) - SIMI1(A,3))
LET MADSREGI1 = MADSREGI1 + (ABS(SREGI1(A,1) - SIMI1(A,3)))
LET RMSDSREGI1 = RMSDSREGI1 + ((SREGI1(A,1) - SIMI1(A,3))^2)
LET SIMI1(A,4) = SREGI1(A,1)
END IF
FOR COL = 1 TO 5
LET SIMI(NUM,COL) = SIMI1(A,COL)
LET SIME(NUM,COL) = SIME1(A,COL)
NEXT COL
NEXT A
FOR B = 1 TO N2
LET NUM = NUM + 1
IF SIME2(B,5) = 0 THEN
LET MDSREGE2 = MDSREGE2 + (SREGE2(B,1) - SIME2(B,3))
LET MADSREGE2 = MADSREGE2 + (ABS(SREGE2(B,1) - SIME2(B,3)))
LET RMSDSREGE2 = RMSDSREGE2 + ((SREGE2(B,1) - SIME2(B,3))^2)
LET SIME2(B,4) = SREGE2(B,1)
END IF
IF SIMI2(B,5) = 0 THEN
LET MDSREGI2 = MDSREGI2 + (SREGI2(B,1) - SIMI2(B,3))
LET MADSREGI2 = MADSREGI2 + (ABS(SREGI2(B,1) - SIMI2(B,3)))
LET RMSDSREGI2 = RMSDSREGI2 + ((SREGI2(B,1) - SIMI2(B,3))^2)
LET SIMI2(B,4) = SREGI2(B,1)
END IF
FOR COL = 1 TO 5
LET SIMI(NUM,COL) = SIMI2(B,COL)
LET SIME(NUM,COL) = SIME2(B,COL)
NEXT COL
NEXT B
FOR C = 1 TO N3
LET NUM = NUM + 1
IF SIME3(C,5) = 0 THEN
LET MDSREGE3 = MDSREGE3 + (SREGE3(C,1) - SIME3(C,3))
LET MADSREGE3 = MADSREGE3 + (ABS(SREGE3(C,1) - SIME3(C,3)))
LET RMSDSREGE3 = RMSDSREGE3 + ((SREGE3(C,1) - SIME3(C,3))^2)
LET SIME3(C,4) = SREGE3(C,1)
END IF
IF SIMI3(C,5) = 0 THEN
LET MDSREGI3 = MDSREGI3 + (SREGI3(C,1) - SIMI3(C,3))
LET MADSREGI3 = MADSREGI3 + (ABS(SREGI3(C,1) - SIMI3(C,3)))
LET RMSDSREGI3 = RMSDSREGI3 + ((SREGI3(C,1) - SIMI3(C,3))^2)
LET SIMI3(C,4) = SREGI3(C,1)
END IF
FOR COL = 1 TO 5
LET SIMI(NUM,COL) = SIMI3(C,COL)
NEXT C
CLOSE #7
CLOSE #8
CLOSE #9
CLOSE #10
END
Appendix C
Model Validation of the Regression Equations
used in the Regression Imputation Procedures
TOTEX, 10% Nonresponse Rate, First Imputation Class
MULTIPLE R             0.853
MULTIPLE R2            0.728
ADJUSTED R2            0.728
F-STAT                 6363.590
P-VALUE                0.000
STD. ERR. OF ESTIMATE  0.277

Analysis of Variance
SV        SS        df    MS        F         p-level
Model     486.5398  1     486.5398  6363.590  0.00
Residual  181.7378  2377  0.0765
Total     668.2776  2378

[Predicted values (95% confidence) vs. residuals scatter plot omitted]

Durbin-Watson Test
          DW STAT    Serial Correlation
Estimate  2.075187   -0.038196

[Normal probability plot of residuals omitted]
MULTIPLE R             0.884
MULTIPLE R2            0.782
ADJUSTED R2            0.781
F-STAT                 4574.954
P-VALUE                0.000
STD. ERR. OF ESTIMATE  0.309

Analysis of Variance
SV        SS        df    MS        F         p-level
Model     436.3063  1     436.3063  4574.954  0.00
Residual  121.8809  1278  0.0954
Total     558.1871  1279

Durbin-Watson Test
          DW STAT    Serial Correlation
Estimate  2.043782   -0.022334

[Normal probability plot of residuals omitted]

[Predicted values (95% confidence) vs. residuals scatter plot omitted]
MULTIPLE R             0.9499
MULTIPLE R2            0.9023
ADJUSTED R2            0.9005
F-STAT                 516.8993
P-VALUE                0.0000
STD. ERR. OF ESTIMATE  0.3314

Analysis of Variance
SV        SS        df   MS        F         p-level
Model     56.78424  1    56.78424  516.8993  0.000000
Residual  6.15191   56   0.10986
Total     62.93615  57

Durbin-Watson Test
          DW STAT    Serial Correlation
Estimate  2.167374   -0.094747

[Normal probability plot of residuals omitted]

[Predicted values (95% confidence) vs. residuals scatter plot omitted]
MULTIPLE R             0.857
MULTIPLE R2            0.734
ADJUSTED R2            0.734
F-STAT                 5786.271
P-VALUE                0.000
STD. ERR. OF ESTIMATE  0.273

Analysis of Variance
SV        SS        df    MS        F         p-level
Model     429.7392  1     429.7392  5786.271  0.00
Residual  155.8159  2098  0.0743
Total     585.5551  2099

Durbin-Watson Test
          DW STAT    Serial Correlation
Estimate  2.013534   -0.006859

[Normal probability plot of residuals omitted]

[Predicted values (95% confidence) vs. residuals scatter plot omitted]
MULTIPLE R             0.887
MULTIPLE R2            0.787
ADJUSTED R2            0.787
F-STAT                 4268.380
P-VALUE                0.000
STD. ERR. OF ESTIMATE  0.308

Analysis of Variance
SV        SS        df    MS        F         p-level
Model     405.0530  1     405.0530  4268.380  0.00
Residual  109.3204  1152  0.0949
Total     514.3734  1153

Durbin-Watson Test
          DW STAT    Serial Correlation
Estimate  2.014977   -0.007528

[Normal probability plot of residuals omitted]

[Predicted values (95% confidence) vs. residuals scatter plot omitted]
Expenditure, 20% Nonresponse Rate, Third Imputation Class
MULTIPLE R             0.9490
MULTIPLE R2            0.9006
ADJUSTED R2            0.8985
F-STAT                 434.6591
P-VALUE                0.0000
STD. ERR. OF ESTIMATE  0.3336

Analysis of Variance
SV        SS        df   MS        F         p-level
Model     48.36771  1    48.36771  434.6591  0.000000
Residual  5.34131   48   0.11128
Total     53.70902  49

Durbin-Watson Test
          DW STAT    Serial Correlation
Estimate  2.268400   -0.167040

[Normal probability plot of residuals omitted]

[Predicted values (95% confidence) vs. residuals scatter plot omitted]
MULTIPLE R             0.840
MULTIPLE R2            0.705
ADJUSTED R2            0.705
F-STAT                 4382.102
P-VALUE                0.000
STD. ERR. OF ESTIMATE  0.290

Analysis of Variance
SV        SS        df    MS        F         p-level
Model     367.4335  1     367.4335  4382.102  0.00
Residual  153.5270  1831  0.0838
Total     520.9605  1832

Durbin-Watson Test
          DW STAT    Serial Correlation
Estimate  2.072061   -0.036173

[Normal probability plot of residuals omitted]

[Predicted values (95% confidence) vs. residuals scatter plot omitted]
MULTIPLE R             0.890
MULTIPLE R2            0.791
ADJUSTED R2            0.791
F-STAT                 3841.345
P-VALUE                0.000
STD. ERR. OF ESTIMATE  0.300

Analysis of Variance
SV        SS        df    MS        F         p-level
Model     346.1192  1     346.1192  3841.345  0.00
Residual  91.1849   1012  0.0901
Total     437.3041  1013

Durbin-Watson Test
          DW STAT    Serial Correlation
Estimate  2.023625   -0.012021

[Normal probability plot of residuals omitted]

[Predicted values (95% confidence) vs. residuals scatter plot omitted]
MULTIPLE R              0.9425
MULTIPLE R2             0.8882
ADJUSTED R2             0.8856
F-STAT                333.7148
P-VALUE                 0.0000
STD. ERR. OF ESTIMATE   0.3237

Analysis of Variance
SV        SS        df  MS        F         p-level
Model     34.97366  1   34.97366  333.7148  0.000000
Residual  4.40164   42  0.10480
Total     39.37531  43

Durbin-Watson Test
          DW Stat   Serial Correlation
Estimate  2.589756  -0.326722
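In the small imputation classes the gap between R² and adjusted R² becomes visible (0.8882 versus 0.8856 in the panel above), since adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1) charges for the degrees of freedom used. An illustrative recomputation from the panel's ANOVA entries (one regressor, so p = 1 and n − p − 1 = 42):

```python
# Recompute adjusted R-squared from the ANOVA entries above:
# SS_Model = 34.97366, SS_Residual = 4.40164, residual df = 42.
ss_model, ss_resid = 34.97366, 4.40164
df_resid = 42
n = df_resid + 2              # one regressor plus the intercept

r2 = ss_model / (ss_model + ss_resid)
adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid
print(round(r2, 4), round(adj_r2, 4))
```

For the large samples elsewhere in this appendix (n in the thousands), the penalty factor (n − 1)/(n − p − 1) is essentially 1, which is why R² and adjusted R² coincide to three decimals there.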
[Figure: Normal probability plot of residuals]
[Figure: Predicted vs. residuals scatterplot with 95% confidence bands]
MULTIPLE R              0.840
MULTIPLE R2             0.706
ADJUSTED R2             0.706
F-STAT               5703.605
P-VALUE                 0.000
STD. ERR. OF ESTIMATE   0.331

Analysis of Variance
SV        SS        df    MS        F         p-level
Model     625.0487  1     625.0487  5703.605  0.00
Residual  260.4915  2377  0.1096
Total     885.5402  2378

Durbin-Watson Test
          DW Stat   Serial Correlation
Estimate  2.047121  -0.023913
[Figure: Normal probability plot of residuals]
[Figure: Predicted vs. residuals scatterplot with 95% confidence bands]
Income Variable, 10% Nonresponse Rate, Second Imputation Class
MULTIPLE R              0.897
MULTIPLE R2             0.805
ADJUSTED R2             0.804
F-STAT               5261.480
P-VALUE                 0.000
STD. ERR. OF ESTIMATE   0.331

Analysis of Variance
SV        SS        df    MS        F         p-level
Model     576.7623  1     576.7623  5261.480  0.00
Residual  140.0941  1278  0.1096
Total     716.8564  1279

Durbin-Watson Test
          DW Stat   Serial Correlation
Estimate  1.934528  0.031428
[Figure: Normal probability plot of residuals]
[Figure: Predicted vs. residuals scatterplot with 95% confidence bands]
MULTIPLE R              0.9591
MULTIPLE R2             0.9199
ADJUSTED R2             0.9185
F-STAT                642.9754
P-VALUE                 0.0000
STD. ERR. OF ESTIMATE   0.3171

Analysis of Variance
SV        SS        df  MS        F         p-level
Model     64.67059  1   64.67059  642.9753  0.000000
Residual  5.63249   56  0.10058
Total     70.30308  57

Durbin-Watson Test for Independence of Error Terms
          DW Stat   Serial Correlation
Estimate  2.157647  -0.079104
[Figure: Normal probability plot of residuals]
[Figure: Predicted vs. residuals scatterplot with 95% confidence bands]
MULTIPLE R              0.838
MULTIPLE R2             0.703
ADJUSTED R2             0.702
F-STAT               4954.234
P-VALUE                 0.000
STD. ERR. OF ESTIMATE   0.331

Analysis of Variance
SV        SS        df    MS        F         p-level
Model     542.9226  1     542.9226  4954.233  0.00
Residual  229.9148  2098  0.1096
Total     772.8375  2099

Durbin-Watson Test
          DW Stat   Serial Correlation
Estimate  1.985916  0.007035
[Figure: Normal probability plot of residuals]
[Figure: Predicted vs. residuals scatterplot with 95% confidence bands]
Income, 20% Nonresponse Rate, Second Imputation Class
MULTIPLE R              0.906
MULTIPLE R2             0.821
ADJUSTED R2             0.821
F-STAT               5275.064
P-VALUE                 0.000
STD. ERR. OF ESTIMATE   0.318

Analysis of Variance
SV        SS        df    MS        F         p-level
Model     532.7126  1     532.7126  5275.063  0.00
Residual  116.3370  1152  0.1010
Total     649.0496  1153

Durbin-Watson Test
          DW Stat   Serial Correlation
Estimate  1.980753  0.008036
[Figure: Normal probability plot of residuals]
[Figure: Predicted vs. residuals scatterplot with 95% confidence bands]
MULTIPLE R              0.9566
MULTIPLE R2             0.9151
ADJUSTED R2             0.9133
F-STAT                517.0385
P-VALUE                 0.0000
STD. ERR. OF ESTIMATE   0.3277

Analysis of Variance
SV        SS        df  MS        F         p-level
Model     55.51737  1   55.51737  517.0385  0.000000
Residual  5.15403   48  0.10738
Total     60.67140  49

Durbin-Watson Test
          DW Stat   Serial Correlation
Estimate  2.280743  -0.142227
[Figure: Normal probability plot of residuals]
[Figure: Predicted vs. residuals scatterplot with 95% confidence bands]
MULTIPLE R              0.845
MULTIPLE R2             0.713
ADJUSTED R2             0.713
F-STAT               4557.328
P-VALUE                 0.000
STD. ERR. OF ESTIMATE   0.330

Analysis of Variance
SV        SS        df    MS        F         p-level
Model     496.0775  1     496.0775  4557.328  0.00
Residual  199.3093  1831  0.1089
Total     695.3868  1832

Durbin-Watson Test
          DW Stat   Serial Correlation
Estimate  2.094357  -0.047223
[Figure: Normal probability plot of residuals]
[Figure: Predicted vs. residuals scatterplot with 95% confidence bands]
Income, 30% Nonresponse Rate, Second Imputation Class
MULTIPLE R              0.909
MULTIPLE R2             0.826
ADJUSTED R2             0.826
F-STAT               4793.392
P-VALUE                 0.000
STD. ERR. OF ESTIMATE   0.310

Analysis of Variance
SV        SS        df    MS        F         p-level
Model     460.5995  1     460.5995  4793.392  0.00
Residual  97.2436   1012  0.0961
Total     557.8431  1013

Durbin-Watson Test
          DW Stat   Serial Correlation
Estimate  2.072614  -0.038549
[Figure: Normal probability plot of residuals]
[Figure: Predicted vs. residuals scatterplot (dependent variable: LN SV) with 95% confidence bands]
MULTIPLE R              0.9654
MULTIPLE R2             0.9319
ADJUSTED R2             0.9303
F-STAT                574.8240
P-VALUE                 0.0000
STD. ERR. OF ESTIMATE   0.2753

Analysis of Variance
SV        SS        df  MS        F         p-level
Model     43.55594  1   43.55594  574.8240  0.000000
Residual  3.18245   42  0.07577
Total     46.73840  43

Durbin-Watson Test
          DW Stat   Serial Correlation
Estimate  2.249448  -0.190671
[Figure: Normal probability plot of residuals]
[Figure: Predicted vs. residuals scatterplot with 95% confidence bands]