
SCHOOL OF STATISTICS

UNIVERSITY OF THE PHILIPPINES DILIMAN

WORKING PAPER SERIES

Improving the Predictive Ability of 1SEC 2012:
The New Philippine Socioeconomic Classification

By:

Michael Daniel C. Lucagbo

UPSS Working Paper No. 2014-06


December 2014

School of Statistics
Ramon Magsaysay Avenue
U.P. Diliman, Quezon City
Telefax: 928-08-81
Email: updstat@yahoo.com

Abstract

Socioeconomic classification (SEC) is an important construct for capturing and understanding changes in the structure of a society. The new scheme for classifying Philippine households according to their SEC was introduced in a paper entitled 1SEC 2012: The New Philippine Socioeconomic Classification. The classification scheme consists of nine clusters (or segments). To predict the SEC of a household, certain household characteristics are used as determinants. The 1SEC Instrument, whose scoring system is based on the ordinal logistic regression model, is then used to predict the household's SEC. This study aims to improve the predictive ability of the 1SEC methodology using state-of-the-art statistical learning techniques, and thereby to suggest a new scheme for predicting SEC. In particular, the study compares the predictive ability of the MORES 1SEC Instrument with the predictive abilities of the following statistical techniques: discriminant analysis, bootstrap aggregation (or bagging), random forests, boosting, and support vector machines (SVMs). The same set of determinants is used in each of these methodologies. Results suggest some potential for the statistical learning methods to predict a household's SEC. Alternative methods for predicting SEC may thus be considered for use.

Keywords: socioeconomic classification, ordinal logistic regression, discriminant analysis, bagging, random forests, boosting, support vector machines

1. Introduction

A new socioeconomic classification (SEC) scheme for Filipino households was introduced by
Bersales et al. (2013) at the 12th National Convention on Statistics (NCS) in a paper entitled 1SEC
2012: The New Philippine Socioeconomic Classification. The classification consists of nine
segments (or clusters), non-mnemonically labeled as clusters one to nine (instead of the usual A,
B, C, D, E system). These nine clusters were formed based on Total Family Expenditure obtained
from the Family Income and Expenditure Survey (FIES) 2009.

To predict the SEC of a household, significant determinants were identified. These determinants
can be broadly categorized into nine groups: (1) quality of consumers in the household, (2) number
of selected energy-using facilities owned, (3) urban and regional membership, (4) transport type
ownership, (5) water source type, (6) connectivity, (7) living space assets, (8) living shell, (9)
tenure of home. Based on these household characteristics, an instrument (referred to as the
MORES 1SEC Instrument) is used to classify a household. The basis of the MORES 1SEC scoring
system is the sizes of the coefficients of an ordinal logistic regression in which the true SEC is the dependent variable and the determinants are the independent variables.

The present study aims to introduce the new classification scheme and evaluate its predictive
ability. Moreover, the study aims to introduce new methods of predicting SEC using more recent,
state-of-the-art techniques which have been known to have superior predictive abilities. The
prediction performances of these methods will be compared with one another and with the MORES
1SEC methodology, in the hope of suggesting a method of predicting the SEC of a household
using the same set of determinants. The nine socioeconomic classes were formed with total family
expenditure as the clustering variable. Table 1.1 below shows the median total expenditure for
each of the nine clusters.
Table 1.1. Median Total Annual Family Expenditure by Cluster

Cluster   Median Total Family Expenditure (in Php)
1          34,744.50
2          59,653.00
3          89,469.00
4         123,997.50
5         161,068.00
6         208,665.00
7         274,245.00
8         400,852.00
9         738,592.00

There are 36 variables used in predicting the SEC of the households. All of these variables are
measured in the FIES 2009. The variables are listed in Table 1.2 below:

Table 1.2. List of Variables Needed in Predicting the SEC of Households, Based on the MORES 1SEC Methodology
1. Type of Place of Residence (Urban or Rural)
2. Region where household is located
3. Does the household spend on laundry services?
4. Does the household pay tuition fees in cash?
5. Does the household spend for maid/boy services?
6. Does the household spend on LPG?
7. Does the household spend on firewood?
8. Does the household spend on charcoal?
9. Does the household spend on school service (land and water)?
10. Does the household spend on air fare transport?
11. Does the household receive cash receipts, assistance from abroad?
12. Number of TVs owned by household
13. Number of airconditioners owned by household
14. Number of refrigerators owned by household

15. Number of microcomputers owned by household
16. Number of washing machines owned by household
17. Number of stereos owned by household
18. Number of VTRs owned by household
19. Number of cars owned by household
20. Number of motorcycles owned by household
21. Number of phones owned by household
22. Number of sala sets owned by household
23. Highest grade completed by the household head
24. Household house type of roofing
25. Household house type of wall
26. Household house building type
27. Household house toilet facility
28. Household head occupation
29. Employment of spouse of household head
30. Household tenure status
31. Household head kind of business
32. Household main source of water
33. Marital status of household head
34. Household type
35. Number of household members 60 years old and over
36. Number of employed members in the household

Using information from all of these variables, the SEC of a household is predicted using the scoring
system adopted by 1SEC, a scoring system based on ordinal logistic regression.

In the present study, the same set of determinants is used, but other procedures for arriving at a predicted cluster are explored. These procedures are mostly computer-intensive but can be performed using a variety of recent software packages, and they have the potential to yield improved predictions.

2. Review of Related Literature

Oakes (2014) defines socioeconomic status (SES) as a construct that reflects one's access to collectively desired resources, be they material goods, money, power, friendship networks, healthcare, leisure time, or educational opportunities. Moreover, he notes that SES is a latent variable, that is, it cannot be measured directly. Instead, Oakes argues, SES is a complicated construct that summarizes a person's or group's access to culturally relevant resources useful for succeeding in, if not moving up, the social hierarchy. The importance of knowing what SECs really measure is stressed by Rose and Pevalin (2001). They note that researchers ought to know what the classifications are supposed to be measuring so that they can (a) use them correctly; (b) improve their explanation of results; and (c) investigate whether the classifications are valid. Although Princeton University (2012) defines a socioeconomic class as "people having the same social, economic, or educational status," the terms socioeconomic class and socioeconomic status are treated synonymously for this paper's purposes.

Oakes (2014) suggests five reasons why SES matters:

Measures of SES, and statistics based on them such as variances, are necessary to quantify, if not understand, the level of stratification or inequality within or between societies.

Without sound measures of SES, it is impossible to capture and understand changes to the structure of a society.

Without sound measurement of SES, it is impossible to understand the intergenerational change of social status over time.

Without an understanding and sound measurement of SES, the relationship between other important social variables, such as race or sex, can be masked by the evident and often dominant relationship between outcomes and SES.

SES matters because it has been related to health and other outcomes for as long as social groups have existed.

It is important for any socioeconomic classification to adapt to the changing times. Oakes (2014)
stresses the importance for SES measures to be aligned to cultures, eras and geographic places,
stating that it is difficult to have a universal measure of SES that would be useful in all research.
Thus, it is important to measure the SES of persons and households, and track the changes over
time.

The importance of collecting as much socioeconomic data as possible is also stressed by Oakes
(2014). The socioeconomic data collected will depend on the research question under
consideration. Moreover, researchers should use a measure of SES that is best suited to answer the
research question at hand. There is no agreed-upon measure of SES, and the best one depends on
the research question.

The measures of SES can either be composite measures or proxy measures. Oakes (2014) explains
the distinction between the two. Composite measures incorporate several domains of information
into a single (i.e., scalar) quantity, thus treating SES as a multidimensional concept. Common examples include efforts to integrate information about educational attainment, annual earned income, and occupational prestige into a single number for each person or group. In computing these composite measures, the weights attached to each of the components of the measure matter
a great deal and are the most controversial part of any measure. On the other hand, proxy measures
make use of a single variable, such as annual income or annual expenditure. Oakes (2014) also
remarks that as far as policy interventions go, composite measures are less helpful than simple,
clear and potentially actionable manifest measures such as educational attainment or annual
income.

In the next few paragraphs, some of the known measures of socioeconomic classification are reviewed. Tabunda and de Jesus (1995) use a sequential socioeconomic classification rule which classifies a household into one of four categories: AB, C, D, and E. This classification rule for Metro Manila is summarized below (a code sketch of its automatic steps follows the list):

1. A household is automatically classified as E if it satisfies any of the following conditions:

The walls of its house are makeshift or made from salvaged or improvised materials, or from bamboo, sawali, cogon or nipa.
The roof of its house is makeshift or made from salvaged or improvised materials, or from cogon, nipa or anahaw.
It has no appliances.
It has only a radio/radio cassette for an appliance.
It has only a television for an appliance.

2. A household is automatically classified as D if it has only a radio or television for appliances.

3. A household is automatically classified as AB if it satisfies all of the following conditions:

The walls of its house are made of brick, glass, stone or asbestos.
The roof of its house is made of tile, concrete, galvanized iron, aluminum, asbestos or half-galvanized iron and half-concrete.
The household head has some college units.
It owns a radio, television, refrigerator and a vehicle.
The floor area of its house is at least 120 square meters.

4. Households not assigned on the basis of the above procedure are then assigned points that depend on the educational attainment of the household head, construction material for walls, construction material for roof, appliances, and floor area per household member.
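To make the mechanics of such a rule concrete, the automatic steps (1 to 3) can be written as a short classification function. The sketch below is illustrative only: it is written in R, the field names (walls, roof, appliances, assets, floor_area, head_has_college_units) and their category codes are hypothetical, and the point-scoring step (4) is not specified here, so households it would cover are returned as NA.

    # Sketch of the automatic steps of the sequential classification rule.
    # All field names and category codes are hypothetical placeholders.
    classify_sequential <- function(hh) {
      poor_walls <- c("makeshift/salvaged/improvised", "bamboo/sawali/cogon/nipa")
      poor_roof  <- c("makeshift/salvaged/improvised", "cogon/nipa/anahaw")

      # Step 1: automatic E
      if (hh$walls %in% poor_walls || hh$roof %in% poor_roof ||
          length(hh$appliances) == 0 ||
          identical(hh$appliances, "radio") ||
          identical(hh$appliances, "television")) {
        return("E")
      }

      # Step 2: automatic D (only a radio and/or a television)
      if (all(hh$appliances %in% c("radio", "television"))) {
        return("D")
      }

      # Step 3: automatic AB (all conditions must hold)
      if (hh$walls == "brick/glass/stone/asbestos" &&
          hh$roof %in% c("tile/concrete/GI/aluminum/asbestos",
                         "half-GI/half-concrete") &&
          hh$head_has_college_units &&
          all(c("radio", "television", "refrigerator", "vehicle") %in% hh$assets) &&
          hh$floor_area >= 120) {
        return("AB")
      }

      # Step 4 (point scoring) is not reproduced here.
      NA_character_
    }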

The same sequential classification rule was modified by Tabunda and de Jesus (1996) to yield a fifth category, the F class, in an attempt to identify the poorest in Metro Manila. The following municipal census data variables are used to investigate differences between E and F households and household heads:

Age of HH head
Sex of HH head
Marital status of HH head
Mother tongue of HH head
Educational attainment of HH head
Employment status of HH head
Size of the HH
Fuel for lighting
Fuel for cooking
Main source of drinking water
Toilet facility
Manner of garbage disposal.

In the United Kingdom, two social classification systems used in the past are the Registrar
General's Social Class (RGSC) and Socio-economic Groups (SEG). Both of these systems were based on occupation, indicating that occupational status is an important variable for socioeconomic classification. In particular, for RGSC, each occupation group is assigned as a whole to
one social class and no account is taken of differences between individuals in the same social
group. On the other hand, classification using SEG takes into account the employment status, size
of employing organization, and occupation.

Rose and Pevalin (2001) mention that the best known and most widely-used sociological class
schema is the Goldthorpe Class schema, on which the National Statistics Socioeconomic
Classification (NS-SEC) is based. The Goldthorpe class schema makes distinctions between (1)
employers who buy the labor of others and assume some degree of authority and control over them;
(2) self-employed workers who neither buy labor nor sell their own to an employer; and (3)
employees who sell their labor to employers and thus place themselves under the authority of their
employer. The Goldthorpe schema has been shown to be a good predictor of health and education
outcomes (Rose and Pevalin, 2001).

The National Statistics Socioeconomic Classification (NS-SEC) of the UK brings together combinations of occupational groups and employment statuses that share similar employment
relations, but are different in these terms from occupational group/employment status
combinations in each of the other classes (Rose and Pevalin, 2001). An employment relations
approach does not assume a fixed number of classes. Rather, the number of classes to be
recognized empirically depends upon the analytic purposes at hand.

In the United States, a dominant measure of SES that is associated with census data is the Duncan
Socioeconomic Index (SEI). The measure is a scalar quantity on a continuous scale and is ultimately based on data from subjective assessments of occupational prestige (Oakes, 2014).
Another SES measure is the Nam-Powers Occupational Status Score (OSS) which differs from the
SEI in that subjective ratings are not used. Instead, it measures income and educational status to
create a single composite quantity.

Oakes (2014) opines that occupational prestige (on which research on SES was focused for a long
time) may no longer represent a sound measure of SES for the following reasons: (1) although
survey respondents readily provide their occupations, occupational prestige information is difficult to come by, (2) there is no consensus on how ranks should be determined, especially when there
are hundreds of occupations in even the simplest of taxonomies, (3) occupations are increasingly
changing over time.

The Household Prestige (HHP) Score was introduced by Rossi et al. (1974) and is described in Oakes (2014). Rossi et al. (1974) asked a sample of 146 white adults to rate the social standing of households described in terms of spouses' occupations, incomes, and ethnicities. The resulting ratings were then regressed on the characteristics of the vignette examples, so that the regression coefficients represent the relative influence of the social characteristics of families. The resultant equation permits investigators to assign unbiased status scores to households based on the occupations, educational levels and ethnicities of spouses. Nock and Rossi (1978, 1979) later applied this method to national samples and calculated weights that apply more generally.

Oakes and Rossi (2003) attempted to develop CAPSES, a measure of SES which defines it as one's access to resources. Resources were defined as (1) material capital (income, wealth, trust funds, etc.), (2) human capital (skills, abilities, credentials, etc.), and (3) social capital (instrumental relationships, such as being friends with lawyers and doctors). The construct is named CAPSES since all of its domains tap some form of capital.

There are several well-known univariate or proxy measures of SES. Common proxy measures of
SES are income, expenditure, and educational attainment.

In the field of market research, predictive analytics has proved to be a major tool for overcoming many of the pitfalls of traditional customer segmentation efforts. Watchman (2012) explains that by defining the characteristics of segments or groups, and then predicting the value of those customers, marketers have a tremendous opportunity to focus limited marketing resources on the customers with the largest strategic and return-on-investment benefits.

Customer segmentation is the process of dividing a customer base into groups of individuals who
are similar in specific ways relevant to marketing. SAS (2012) mentions that it enables companies
to target groups effectively and allocate marketing resources appropriately. Analytics enables one to go beyond foundation segmentation to targeting segmentation, allowing one to execute more
effective, sophisticated campaigns with messages and offers that are highly relevant to recipients.

Watchman (2012) suggests five fundamental steps to building a forward-looking, value-based segmentation model:
1. Identify the key drivers of customer behavior.
2. Extract past data.
3. Build the model.
4. Reassess, rescore, and track.
5. Create overlays for each segment.

The traditional clustering techniques are distance-based. Zheng and Tuzhilin (2008), on the other hand, introduce prediction-based segmentation, a clustering technique which differs from the usual distance-based segmentation in that it iteratively computes different partitionings of the customer base to achieve better collective predictive performance across the segments; it becomes a combinatorial optimization problem of finding the best partitioning of the customers. Zheng and Tuzhilin (2008) give the following reasons for prediction-based segmentation:

1. It is optimized to generate the best performance of predictive models on customer segments.
2. It can be customized towards performing various target marketing tasks on-the-fly.
3. This type of segmentation is flexible in terms of predictive model selection, since any predictive modeling technique can be integrated with this segmentation approach.
4. The innate limitations of distance-based segmentation, such as the sensitivity of similarity functions to the choice of a particular attribute, are circumvented.

In addition to customer segmentation, predictive modeling is another important marketing tool. Verhoef et al. (2002) report that predictive modeling has been used to a lesser extent, although the majority of companies perform segmentation. Moreover, few managers use the models and insights generated by scientific research, and the statistical techniques employed by most are still relatively simple. Many companies report that they base their target selections on intuition and gut feeling, thereby using simple heuristics, such as "mail all customers who recently purchased product X or have an income above Y."

Predictive modeling is essential to the success of marketing strategies and plans in today's environment (SAS, 2012). Predictive modeling techniques are useful for identifying the subset of customers that is likely to respond positively to a specific campaign or other marketing activity, as well as for understanding the behavior of targeted groups. Once a company becomes proficient with a basic predictive modeling strategy, it can use more advanced models to realize even greater performance improvements. Advanced models answer the same questions, but with more precision and sophistication.

This study tries to predict the SEC of a household. A variety of classification techniques are used and tested for their predictive abilities. One of the techniques used is ordinal logistic regression, which is the basis for the MORES 1SEC Scoring System. Linear discriminant analysis, a widely used classification tool, is compared with ordinal logistic regression. Moreover, the performance of more recent and computer-intensive methods known to have good predictive abilities is also compared with the 1SEC methodology; these methods include bagging, random forests, boosting, and support vector machines.

In order to further disseminate the developed models, the scientific community may wish to take specific steps, such as presenting research at practitioner conferences. Verhoef et al. (2002) find that companies applying more sophisticated techniques appear to perform better than companies not applying these techniques. Hence, the use of sophisticated models indeed seems to pay off from a practitioner's perspective. However, despite the improved performance, these companies still seek techniques that will further improve results.

3. Methodology

3.1 Cumulative Logits Model

The response variable, socioeconomic class, has nine levels (clusters one to nine). Since the response is measured at the ordinal level, the model should take the ordering of the response categories into account. The study uses the cumulative logit model for ordinal responses. Following the discussion of Agresti (2007), which we outline, let $Y$ denote the response variable, let $j = 1, 2, \ldots, J$ (with $J = 9$) index the nine SECs, and let $\pi_j$ be the probability that a household belongs to cluster $j$. For cluster $j$, the cumulative probability for $Y$ is

$$P(Y \le j) = \pi_1 + \cdots + \pi_j, \qquad j = 1, \ldots, J. \tag{1}$$

The cumulative logits are defined as

$$\mathrm{logit}[P(Y \le j)] = \log \frac{P(Y \le j)}{1 - P(Y \le j)} = \log \frac{\pi_1 + \cdots + \pi_j}{\pi_{j+1} + \cdots + \pi_J}, \qquad j = 1, \ldots, J - 1. \tag{2}$$

As Agresti (2007) describes it, a model for cumulative logit $j$ is similar to a binary logistic regression model in which categories $1$ to $j$ combine to form a single category, while categories $j + 1$ to $J$ form a second category. For an explanatory variable $x$, the model

$$\mathrm{logit}[P(Y \le j)] = \alpha_j + \beta x, \qquad j = 1, \ldots, J - 1, \tag{3}$$

has parameter $\beta$ describing the effect of $x$ on the logarithm of the odds of response in category $j$ or below. Since $\beta$ does not have a $j$ subscript, the effect is identical for all $J - 1$ cumulative logits, so a single parameter, instead of $J - 1$ of them, describes the effect of $x$.

For two values $a$ and $b$ of $x$, an odds ratio comparing the cumulative probabilities is

$$\frac{P(Y \le j \mid X = b) \,/\, P(Y > j \mid X = b)}{P(Y \le j \mid X = a) \,/\, P(Y > j \mid X = a)}. \tag{4}$$

Taking the logarithm of the expression in (4) gives the difference between the cumulative logits at the two values $a$ and $b$, which, using equation (3), can be shown to equal $\beta(b - a)$ and is thus proportional to the distance between $a$ and $b$. Since this study deals with dummy variables as predictors, the difference is $b - a = 1$, so that the odds ratio in (4) equals $e^{\beta}$: for every unit increase in $x$, the odds of response at or below any given category multiply by $e^{\beta}$.
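For concreteness, the model above is the proportional odds model fit by the polr function of the R package MASS; the minimal sketch below assumes data frames train and test holding the 36 determinants and a column cluster with the true SEC (placeholder names, not the FIES field names).

    library(MASS)  # polr() fits the proportional odds (cumulative logit) model

    # The response must be an ordered factor so that the ordering of the
    # nine clusters is taken into account.
    train$cluster <- factor(train$cluster, levels = 1:9, ordered = TRUE)

    fit_polr  <- polr(cluster ~ ., data = train, method = "logistic")
    pred_polr <- predict(fit_polr, newdata = test, type = "class")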

3.2 Linear Discriminant Analysis

We assume that $X = (X_1, X_2, \ldots, X_p)^T$ is drawn from a multivariate normal distribution with a class-specific mean vector and a common covariance matrix. James et al. (2013) explain that the linear discriminant analysis (LDA) classifier assumes that the observations in the $k$th class are drawn from a multivariate normal distribution $N(\mu_k, \Sigma)$, where $\mu_k$ is a class-specific mean vector and $\Sigma$ is a covariance matrix common to all $K$ classes. The Bayes classifier assigns an observation $X = x$ to the class for which

$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$$

is largest. In this expression, $\pi_k$ represents the overall or prior probability that a randomly chosen observation comes from the $k$th class. The unknown parameters $\mu_1, \ldots, \mu_K$, $\pi_1, \ldots, \pi_K$, and $\Sigma$ are then estimated from the training data.
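A minimal sketch of the corresponding fit, using lda from the MASS package (by default the priors $\pi_k$ are estimated by the training class proportions), under the same train/test assumptions as the previous sketch:

    library(MASS)  # lda() estimates the mu_k, the pi_k, and the common Sigma

    fit_lda  <- lda(cluster ~ ., data = train)
    pred_lda <- predict(fit_lda, newdata = test)$class  # class with the largest discriminant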

3.3 Bagging

James et al. (2013) explain that bootstrap aggregation, or bagging, is a general-purpose procedure in which the bootstrap is used in a new context: to reduce the variance of a statistical learning method. Since averaging a set of observations reduces variance, a natural approach is to take many training sets from the population, build a separate prediction model on each, and average the resulting predictions.

Since we generally do not have multiple training sets, we instead bootstrap by taking repeated samples from the same training data set. Thus, we generate $B$ different bootstrapped training data sets, train the method on the $b$th bootstrapped training set to obtain the predicted response $\hat{f}^{*b}(x)$, and finally average the predictions across the bootstrapped training sets to obtain the bagged prediction:

$$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x).$$

Bagging can be applied to predict a qualitative outcome variable $Y$, as is needed in this study. The simplest approach is to take the majority vote: the overall prediction is the class occurring most commonly among the $B$ predictions.
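In R, bagging of classification trees can be obtained as the special case of the randomForest package in which every predictor is a split candidate (m = p); a minimal sketch, with B = 500 trees chosen purely for illustration:

    library(randomForest)

    p <- ncol(train) - 1  # number of determinants (all columns except 'cluster')

    # Bagging = random forest with mtry = p: the trees differ only through
    # their bootstrap samples, and classification is by majority vote.
    fit_bag  <- randomForest(cluster ~ ., data = train, mtry = p, ntree = 500)
    pred_bag <- predict(fit_bag, newdata = test)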

3.4 Random Forests

Random forests can provide improvements over bagging by decorrelating the trees. James et al. (2013) explain that when building the decision trees, each time a split in a tree is considered, a random sample of $m$ predictors is chosen as split candidates from the full set of $p$ predictors. The split is allowed to use only one of those $m$ predictors. A fresh sample of $m$ predictors is taken at each split, where typically $m \approx \sqrt{p}$.

The rationale behind this is that if there is a strong predictor, most of the trees will use that strong predictor in the top split, making the bagged trees highly correlated with one another. Averaging highly correlated quantities does not lead to as large a reduction in variance as averaging uncorrelated quantities. By considering only a subset of the predictors at each split, $(p - m)/p$ of the splits will not even consider the strong predictor. This decorrelates the trees, giving the average prediction a smaller variance.
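Continuing the bagging sketch above, the random forest differs only in mtry, the number of predictors sampled as split candidates; importance = TRUE also stores the mean decrease in Gini reported in Section 4:

    fit_rf  <- randomForest(cluster ~ ., data = train,
                            mtry = floor(sqrt(p)),  # m is approximately sqrt(p)
                            ntree = 500, importance = TRUE)
    pred_rf <- predict(fit_rf, newdata = test)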

3.5 Boosting

James et al. (2013) describe boosting as working in a way similar to bagging, except that the trees are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead, each tree is fit on a modified version of the original dataset. The boosting approach is said to learn slowly, instead of fitting the data hard with a single large decision tree.
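A minimal boosting sketch with the gbm package, under the same train/test assumptions; the multinomial distribution handles the nine-class response, and the number of trees, tree depth, and shrinkage below are illustrative placeholders rather than the tuning used in the study:

    library(gbm)

    # gbm expects an unordered factor response for 'multinomial'
    train_gbm <- transform(train, cluster = factor(as.character(cluster)))

    fit_gbm <- gbm(cluster ~ ., data = train_gbm, distribution = "multinomial",
                   n.trees = 1000, interaction.depth = 2, shrinkage = 0.01)

    # predict() returns class probabilities; take the most probable cluster.
    prob     <- drop(predict(fit_gbm, newdata = test, n.trees = 1000,
                             type = "response"))
    pred_gbm <- colnames(prob)[max.col(prob)]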

3.6 Support Vector Machines

Suppose the classification problem involves only two classes. Suppose further that there are $n$ observations $x_1, \ldots, x_n \in \mathbb{R}^p$, where each $x_i = (x_{i1}, \ldots, x_{ip})^T$ is a vector of $p$ features. If each $x_i$ has an associated class label $y_i \in \{-1, 1\}$, where $y_i = 1$ represents one class and $y_i = -1$ represents the other, James et al. (2013) explain that the maximal margin classifier solves the following optimization problem:

$$\max_{\beta_0, \beta_1, \ldots, \beta_p} M \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1, \qquad y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \ge M, \quad i = 1, \ldots, n.$$

The quantity $M$ is called the margin: the minimal distance from the observations to the separating hyperplane $\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p = 0$. A test observation is classified depending on which side of the separating hyperplane it lies.

It is possible, however, that no separating hyperplane exists. James et al. (2013) explain that the support vector classifier extends the maximal margin classifier by allowing some observations to be misclassified. It is the solution to the optimization problem

$$\max_{\beta_0, \ldots, \beta_p, \epsilon_1, \ldots, \epsilon_n} M \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1, \qquad y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \ge M(1 - \epsilon_i),$$

$$\epsilon_i \ge 0, \qquad \sum_{i=1}^{n} \epsilon_i \le C,$$

where $C$ is a nonnegative tuning parameter, which we can think of as a budget for the amount by which the margin can be violated.

James et al. (2013) further explain that the solution to the support vector classifier involves only the inner products of the observations. It can be shown that the linear support vector classifier can be represented as

$$f(x) = \beta_0 + \sum_{i=1}^{n} \alpha_i \langle x, x_i \rangle,$$

where $\langle x_i, x_{i'} \rangle = \sum_{j=1}^{p} x_{ij} x_{i'j}$. A test observation $x^*$ is then classified based on the sign of $f(x^*) = \beta_0 + \beta_1 x_1^* + \cdots + \beta_p x_p^*$: a positive value indicates that the test observation should be assigned to one class, while a negative value indicates assignment to the other. We can replace $\langle x_i, x_{i'} \rangle$ with a generalization of the inner product of the form $K(x_i, x_{i'})$, where $K$ is some function referred to as a kernel; the quantity $\langle x_i, x_{i'} \rangle$ itself is known as a linear kernel. When a support vector classifier is combined with a possibly nonlinear kernel, the result is known as a support vector machine (SVM). Other popular choices of kernel are the polynomial kernel,

$$K(x_i, x_{i'}) = \left(1 + \sum_{j=1}^{p} x_{ij} x_{i'j}\right)^d,$$

and the radial kernel,

$$K(x_i, x_{i'}) = \exp\left(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\right).$$

To extend SVMs to the case of more than two classes, James et al. (2013) introduce two approaches. The one-versus-one approach constructs $\binom{K}{2}$ SVMs, each comparing a pair of classes; the final classification assigns an observation to the class to which it is most frequently assigned in the pairwise classifications. The one-versus-all approach fits $K$ SVMs, each time comparing one of the $K$ classes to the remaining $K - 1$ classes taken as one group, and then assigns an observation $x^*$ to the class $k$ for which $\beta_{0k} + \beta_{1k} x_1^* + \cdots + \beta_{pk} x_p^*$ is largest.
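A minimal sketch with the svm function of the e1071 package, which applies the one-versus-one approach internally for the nine-class problem; the radial kernel and the cost and gamma values are illustrative choices, not necessarily those used in the study:

    library(e1071)

    fit_svm  <- svm(cluster ~ ., data = train, kernel = "radial",
                    cost = 1, gamma = 0.01)
    pred_svm <- predict(fit_svm, newdata = test)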

4. Results

The entire FIES 2009 data set was divided into a training data set and a test data set, with the training dataset comprising 90% of the households. There are 36,812 households in the entire FIES 2009 dataset (excluding households from the Autonomous Region in Muslim Mindanao), with 33,138 households in the training sample and 3,674 households in the test sample.
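A sketch of such a split in R, assuming a prepared data frame fies (a placeholder name) containing the 36 determinants and the true cluster of each of the 36,812 households:

    set.seed(2014)  # for a reproducible split
    n     <- nrow(fies)
    idx   <- sample(n, size = round(0.90 * n))
    train <- fies[idx, ]   # 90% of the households
    test  <- fies[-idx, ]  # the remaining 10%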

The results of the analysis are first shown as confusion matrices for each of the classification methods. A confusion matrix is a cross classification of a household's actual cluster against its predicted cluster. Confusion matrices for both the training and test samples are shown, and a comparison of the predictive abilities of the classifiers is made afterwards.
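Each confusion matrix below can be produced by tabulating actual against predicted cluster; a sketch, with pred standing for the test-set predictions of any one of the fitted classifiers:

    cm <- table(Actual = test$cluster, Predicted = pred)

    # Row-wise percent correctly classified, as in the last column of
    # Tables 4.1.1 to 4.6.2
    round(100 * diag(cm) / rowSums(cm), 1)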

Table 4.1.1. Cross Classification based on Ordinal Logistic Regression of
Actual and Predicted 1SEC Cluster for the Training Dataset
                 Predicted Cluster
Actual Cluster   1    2    3    4    5    6    7    8    9    Total   Percent Correctly Classified
1 518 1359 169 0 1 0 0 0 0 2047 25.3%
2 165 3070 2087 110 35 0 1 0 0 5468 56.1%
3 26 1604 4533 768 508 46 10 2 0 7497 60.5%
4 3 252 2023 819 1087 225 62 8 0 4479 18.3%
5 0 50 947 697 1531 581 317 44 0 4167 36.7%
6 0 11 246 259 1011 684 695 193 4 3103 22.0%
7 0 2 68 109 511 550 1060 588 13 2901 36.5%
8 0 0 10 19 122 206 661 1341 202 2561 52.4%
9 0 0 2 2 4 13 51 347 496 915 54.2%
Total 712 6348 10085 2783 4810 2305 2857 2523 715 33,138 42.4%

Table 4.1.2. Cross Classification based on Ordinal Logistic Regression of
Actual and Predicted 1SEC Cluster for the Test Dataset
                 Predicted Cluster
Actual Cluster   1    2    3    4    5    6    7    8    9    Total   Percent Correctly Classified
1 56 146 25 0 0 0 0 0 0 227 24.7%
2 20 342 235 9 1 0 0 0 0 607 56.3%
3 4 186 496 91 49 5 1 0 0 832 59.6%
4 0 38 204 81 134 29 11 0 0 497 16.3%
5 0 8 96 92 170 48 45 3 0 462 36.8%
6 0 1 34 24 116 81 74 14 0 344 23.5%
7 0 1 6 12 70 52 112 68 0 321 34.9%
8 0 0 5 2 16 26 66 145 24 284 51.1%
9 0 0 0 0 0 3 6 37 54 100 54.0%
Total 80 722 1101 311 556 244 315 267 78 3674 41.8%

Table 4.2.1. Cross Classification based on Linear Discriminant Analysis of
Actual and Predicted 1SEC Cluster for the Training Dataset
                 Predicted Cluster
Actual Cluster   1    2    3    4    5    6    7    8    9    Total   Percent Correctly Classified
1 855 981 209 0 2 0 0 0 0 2047 41.8%
2 477 2740 2066 102 79 1 3 0 0 5468 50.1%
3 188 1560 4355 627 652 80 26 9 0 7497 58.1%
4 32 268 2012 715 994 316 122 18 2 4479 16.0%
5 13 66 986 551 1429 657 382 79 4 4167 34.3%
6 2 17 268 257 844 795 676 229 15 3103 25.6%
7 0 6 69 99 452 565 1034 608 68 2901 35.6%
8 0 1 14 14 94 204 652 1232 350 2561 48.1%
9 1 0 1 1 4 10 48 274 576 915 63.0%
Total 1568 5639 9980 2366 4550 2628 2943 2449 1015 33138 41.4%

Table 4.2.2. Cross Classification based on Linear Discriminant Analysis of
Actual and Predicted 1SEC Cluster for the Test Dataset
                 Predicted Cluster
Actual Cluster   1    2    3    4    5    6    7    8    9    Total   Percent Correctly Classified
1 96 104 26 1 0 0 0 0 0 227 42.3%
2 55 299 237 8 8 0 0 0 0 607 49.3%
3 16 176 491 69 72 7 1 0 0 832 59.0%
4 5 45 198 72 118 38 19 2 0 497 14.5%
5 1 7 108 84 149 58 42 13 0 462 32.3%
6 0 1 29 26 87 87 90 23 1 344 25.3%
7 1 0 7 10 59 65 105 70 4 321 32.7%
8 0 0 3 6 12 26 77 127 33 284 44.7%
9 0 0 0 0 0 1 5 27 67 100 67.0%
Total 174 632 1099 276 505 282 339 262 105 3674 40.6%

Table 4.3.1. Cross Classification based on Bagging of
Actual and Predicted 1SEC Cluster for the Training Dataset
                 Predicted Cluster
Actual Cluster   1    2    3    4    5    6    7    8    9    Total   Percent Correctly Classified
1 914 933 192 6 2 0 0 0 0 2047 44.7%
2 376 2818 2126 88 56 2 1 1 0 5468 51.5%
3 95 1627 4584 518 542 93 32 6 0 7497 61.1%
4 9 300 2243 579 937 275 115 21 0 4479 12.9%
5 2 86 1209 548 1315 538 381 87 1 4167 31.6%
6 1 29 355 256 942 596 668 252 4 3103 19.2%
7 0 5 130 110 546 506 910 679 15 2901 31.4%
8 0 2 30 18 159 198 596 1413 145 2561 55.2%
9 0 0 2 3 9 9 38 425 429 915 46.9%
Total 1397 5800 10871 2126 4508 2217 2741 2884 594 33138 40.9%

Table 4.3.2. Cross Classification based on Bagging of
Actual and Predicted 1SEC Cluster for the Test Dataset
                 Predicted Cluster
Actual Cluster   1    2    3    4    5    6    7    8    9    Total   Percent Correctly Classified
1 89 101 37 0 0 0 0 0 0 227 39.2%
2 46 302 249 6 4 0 0 0 0 607 49.8%
3 10 193 513 57 50 5 4 0 0 832 61.7%
4 2 36 240 65 96 37 20 1 0 497 13.1%
5 0 11 145 59 129 62 46 10 0 462 27.9%
6 0 6 40 30 100 68 74 26 0 344 19.8%
7 0 0 12 9 62 70 92 75 1 321 28.7%
8 0 0 7 4 17 30 72 136 18 284 47.9%
9 0 0 0 0 3 1 2 40 54 100 54.0%
Total 147 649 1243 230 461 273 310 288 73 3674 39.4%

Table 4.4.1. Cross Classification based on Random Forests of
Actual and Predicted 1SEC Cluster for the Training Dataset
                 Predicted Cluster
Actual Cluster   1    2    3    4    5    6    7    8    9    Total   Percent Correctly Classified
1 809 1039 193 1 5 0 0 0 0 2047 39.5%
2 274 2900 2195 31 64 1 2 1 0 5468 53.0%
3 50 1626 4951 212 571 43 35 9 0 7497 66.0%
4 6 275 2566 275 1077 137 121 22 0 4479 6.1%
5 0 71 1453 283 1558 311 394 97 0 4167 37.4%
6 1 15 482 114 1147 365 688 290 1 3103 11.8%
7 0 4 165 53 654 306 917 791 11 2901 31.6%
8 0 0 30 13 185 117 573 1545 98 2561 60.3%
9 0 0 2 0 10 7 36 472 388 915 42.4%
Total 1140 5930 12037 982 5271 1287 2766 3227 498 33138 41.4%

Table 4.4.2. Cross Classification based on Random Forests of
Actual and Predicted 1SEC Cluster for the Test Dataset
                 Predicted Cluster
Actual Cluster   1    2    3    4    5    6    7    8    9    Total   Percent Correctly Classified
1 77 122 28 0 0 0 0 0 0 227 33.9%
2 29 320 252 3 3 0 0 0 0 607 52.7%
3 4 185 548 22 68 4 1 0 0 832 65.9%
4 1 37 272 25 127 13 20 2 0 497 5.0%
5 0 8 174 28 161 32 46 13 0 462 34.8%
6 0 4 55 11 123 43 79 29 0 344 12.5%
7 0 0 18 6 77 46 92 81 1 321 28.7%
8 0 0 7 3 21 19 68 157 9 284 55.3%
9 0 0 0 0 0 1 5 48 46 100 46.0%
Total 111 676 1354 98 580 158 311 330 56 3674 40.0%

Table 4.5.1. Cross Classification based on Boosting of
Actual and Predicted 1SEC Cluster for the Training Dataset
                 Predicted Cluster
Actual Cluster   1    2    3    4    5    6    7    8    9    Total   Percent Correctly Classified
1 759 1009 275 0 3 1 0 0 0 2047 37.1%
2 331 2868 2185 0 41 34 4 5 0 5468 52.5%
3 107 2106 4603 10 292 288 35 56 0 7497 61.4%
4 24 675 2559 11 435 528 134 108 5 4479 0.2%
5 11 362 1775 11 650 705 321 328 4 4167 15.6%
6 7 160 724 5 502 649 464 582 10 3103 20.9%
7 3 77 357 1 397 442 552 1045 27 2901 19.0%
8 0 15 102 0 151 166 345 1667 115 2561 65.1%
9 0 4 6 0 12 9 34 504 346 915 37.8%
Total 1242 7276 12586 38 2483 2822 1889 4295 507 33138 36.5%

Table 4.5.2. Cross Classification based on Boosting of
Actual and Predicted 1SEC Cluster for the Test Dataset
                 Predicted Cluster
Actual Cluster   1    2    3    4    5    6    7    8    9    Total   Percent Correctly Classified
1 77 115 35 0 0 0 0 0 0 227 33.9%
2 37 316 247 0 3 3 0 1 0 607 52.1%
3 14 241 518 1 24 29 3 2 0 832 62.3%
4 5 83 263 3 52 59 15 17 0 497 0.6%
5 2 49 197 0 64 85 34 31 0 462 13.9%
6 1 18 84 1 61 57 61 61 0 344 16.6%
7 0 9 33 0 48 59 56 115 1 321 17.4%
8 0 1 15 0 25 14 38 177 14 284 62.3%
9 0 1 2 0 1 2 4 50 40 100 40.0%
Total 136 833 1394 5 278 308 211 454 55 3674 35.6%

Table 4.6.1. Cross Classification based on Support Vector Machines of
Actual and Predicted 1SEC Cluster for the Training Dataset
                 Predicted Cluster
Actual Cluster   1    2    3    4    5    6    7    8    9    Total   Percent Correctly Classified
1 949 977 120 0 1 0 0 0 0 2047 46.4%
2 369 3081 1925 45 46 1 1 0 0 5468 56.3%
3 88 1701 4728 368 553 39 17 3 0 7497 63.1%
4 9 302 2300 468 1136 153 101 10 0 4479 10.4%
5 2 68 1184 357 1772 371 359 54 0 4167 42.5%
6 1 15 331 171 1101 583 693 207 1 3103 18.8%
7 0 3 95 64 620 405 1103 601 10 2901 38.0%
8 0 0 20 7 131 169 628 1485 121 2561 58.0%
9 0 0 2 1 6 8 45 400 453 915 49.5%
Total 1418 6147 10705 1481 5366 1729 2947 2760 585 33138 44.1%

Table 4.6.2. Cross Classification based on Support Vector Machines of
Actual and Predicted 1SEC Cluster for the Test Dataset
                 Predicted Cluster
Actual Cluster   1    2    3    4    5    6    7    8    9    Total   Percent Correctly Classified
1 88 119 20 0 0 0 0 0 0 227 38.8%
2 43 331 222 5 6 0 0 0 0 607 54.5%
3 10 187 530 47 52 4 2 0 0 832 63.7%
4 1 45 236 44 140 19 10 2 0 497 8.9%
5 1 7 132 57 174 41 47 3 0 462 37.7%
6 0 3 40 15 129 55 87 15 0 344 16.0%
7 0 1 8 11 79 42 110 68 2 321 34.3%
8 0 0 5 3 19 19 69 160 9 284 56.3%
9 0 0 0 0 0 0 7 44 49 100 49.0%
Total 143 693 1193 182 599 180 332 292 60 3674 41.9%

The bull's eye hitrates give the percentage of households correctly classified, while the neighboring cluster hitrates also tolerate misclassification into an adjacent cluster (e.g., a misclassification to cluster 3 or 5 of a household belonging to cluster 4 is tolerated). Table 4.7 below shows the bull's eye and neighboring cluster hitrates on the training and test datasets for all six classifiers. As expected, the hitrates based on the test datasets are lower than those based on the training datasets. Prediction based on support vector machines gives the highest bull's eye hitrates for both the training and test datasets, while prediction based on logistic regression gives the highest neighboring cluster hitrates for both; logistic regression comes second to SVM on bull's eye hitrates, and the two methods switch places on neighboring cluster hitrates. Moreover, the performances of bagging and random forests are only slightly inferior to those of SVM, logistic regression, and discriminant analysis. Lastly, boosting gives the poorest prediction rates.
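Both hitrates can be computed directly from a confusion matrix cm such as the one sketched at the start of this section; the neighboring cluster hitrate simply also counts the cells one step off the diagonal:

    hitrates <- function(cm) {
      bulls_eye   <- sum(diag(cm)) / sum(cm)      # exact matches only
      near        <- abs(row(cm) - col(cm)) <= 1  # within one cluster
      neighboring <- sum(cm[near]) / sum(cm)
      c(bulls_eye = bulls_eye, neighboring = neighboring)
    }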

Table 4.7. Training and Test Dataset Hitrates for Bulls Eye and
Neighboring Cluster Predictions by Classifier
                        Bull's Eye            Neighboring Cluster
                        Training   Test       Training   Test
Logistic Regression 42.4% 41.8% 85.9% 85.2%
Discriminant Analysis 41.4% 40.6% 83.4% 82.8%
Bagging 40.9% 39.4% 82.6% 81.7%
Random Forests 41.4% 40.0% 82.6% 81.4%
Boosting 36.5% 35.6% 75.1% 74.8%
SVM 44.1% 41.9% 84.7% 83.9%

Table 4.8 below shows the training and test dataset bull's eye hitrates by cluster for the different classifiers. Clearly, some classifiers are better than others at identifying particular clusters. For example, the tree-based and SVM methods do a relatively poor job of predicting households that belong to cluster 4, but do a better job of predicting households belonging to cluster 8.

Table 4.8. Training and Test Dataset Hitrates for Bull's Eye Predictions by Classifier and by Cluster

Cluster   Logistic Regression   Discriminant Analysis   Bagging   Random Forests   Boosting   SVM

Training Dataset
1   25.3%   41.8%   44.7%   39.5%   37.1%   46.4%
2   56.1%   50.1%   51.5%   53.0%   52.5%   56.3%
3   60.5%   58.1%   61.1%   66.0%   61.4%   63.1%
4   18.3%   16.0%   12.9%    6.1%    0.2%   10.4%
5   36.7%   34.3%   31.6%   37.4%   15.6%   42.5%
6 22.0% 25.6% 19.2% 11.8% 20.9% 18.8%
7 36.5% 35.6% 31.4% 31.6% 19.0% 38.0%
8 52.4% 48.1% 55.2% 60.3% 65.1% 58.0%
9 54.2% 63.0% 46.9% 42.4% 37.8% 49.5%
Test Dataset
1   24.7%   42.3%   39.2%   33.9%   33.9%   38.8%
2   56.3%   49.3%   49.8%   52.7%   52.1%   54.5%
3   59.6%   59.0%   61.7%   65.9%   62.3%   63.7%
4   16.3%   14.5%   13.1%    5.0%    0.6%    8.9%
5   36.8%   32.3%   27.9%   34.8%   13.9%   37.7%
6   23.5%   25.3%   19.8%   12.5%   16.6%   16.0%
7 34.9% 32.7% 28.7% 28.7% 17.4% 34.3%
8 51.1% 44.7% 47.9% 55.3% 62.3% 56.3%
9 54.0% 67.0% 54.0% 46.0% 40.0% 49.0%

Table 4.9 below shows the training and test dataset neighboring cluster hitrates by cluster for the different classifiers. All six classifiers have difficulty predicting the middle clusters (clusters 5, 6, and 7), but achieve better hitrates for the other clusters.

Table 4.9. Training and Test Dataset Hitrates for Neighboring Clusters by Classifier and by Cluster
Cluster   Logistic Regression   Discriminant Analysis   Bagging   Random Forests   Boosting   SVM

Training Dataset
1   91.7%   89.7%   90.2%   90.3%   86.4%   94.1%
2   97.3%   96.6%   97.3%   98.2%   98.5%   98.3%
3   92.1%   87.3%   89.8%   90.6%   89.6%   90.7%
4   87.7%   83.1%   83.9%   87.5%   67.1%   87.2%
5   67.4%   63.3%   57.6%   51.6%   32.8%   60.0%
6 77.0% 74.6% 71.1% 70.9% 52.0% 76.6%
7 75.8% 76.1% 72.2% 69.4% 70.3% 72.7%
8 86.1% 87.2% 84.1% 86.5% 83.1% 87.2%
9 92.1% 92.9% 93.3% 94.0% 92.9% 93.2%
Test Dataset
1   89.0%   88.1%   83.7%   87.7%   84.6%   91.2%
2   98.4%   97.4%   98.4%   99.0%   98.8%   98.2%
3   92.9%   88.5%   91.7%   90.7%   91.3%   91.8%
4   84.3%   78.1%   80.7%   85.3%   64.0%   84.5%
5   67.1%   63.0%   54.1%   47.8%   32.3%   58.9%
6 78.8% 76.7% 70.3% 71.2% 52.0% 78.8%
7 72.3% 74.8% 73.8% 68.2% 71.7% 68.5%
8 82.7% 83.5% 79.6% 82.4% 80.6% 83.8%
9 91.0% 94.0% 94.0% 94.0% 90.0% 93.0%

Listed below, in Table 4.10, are the top 10 most important variables based on Mean Decrease in
Gini (a measure of variable importance) for the bagging method. Spending on LPG and number
of phones in a household stand out as the variables which give the greatest decrease in the Gini
index.

Table 4.10. Top 10 Most Important Variables based on Bagging
Description                                            Mean Decrease in Gini
Spending on LPG 1509.09
Household Number of Phones 1282.44
Household Number of Microcomputers 525.84
Number of employed household members 514.96
Household Number of TVs 355.33
Spending on Tuition Fees in Cash 337.88
Household Number of Refrigerators 213.61
Employment of Spouse of Household Head 205.29
Number of household members 60 years old and over 205.14
Household Number of Airconditioners 196.84

The result for the top 10 most important variables, also based on mean decrease in Gini index, for
the random forest method is shown in Table 4.11. The top 7 variables for the bagging method all
appear in the top 10 for random forests. Once again, household number of phones and spending
on LPG are the variables which cause the greatest decrease in the Gini index.

Table 4.11. Top 10 Most Important Variables based on Random Forests

Description                                            Mean Decrease in Gini
Household Number of Phones 894.47
Spending on LPG 491.76
Household Number of TVs 432.35
Household Number of Refrigerators 411.40
Number of employed household members 372.95
Household Number of Washing Machines 328.36
Household Number of Microcomputers 290.91
Household Number of VCRs 281.20
Spending on Tuition Fees in Cash 233.62
Spending on Firewood 230.82

Lastly, Table 4.12 gives the most important variables based on boosting; the measure of variable importance used here is relative influence. Nine of the ten most important variables for bagging also appear as important variables for boosting (the only variable not included is the number of household members 60 years old and over). Eight of the ten most important variables for random forests also appear as important variables for boosting (only the number of VCRs and spending on firewood are not included).

These results indicate that the three tree-based methods largely agree on which variables are most important: the number of phones in the household, spending on LPG, the number of microcomputers, the number of employed household members, the number of TVs, spending on tuition fees in cash, the number of refrigerators, the employment status of the spouse of the household head, and the number of airconditioners.

Table 4.12. Top 10 Most Important Variables based on Boosting

Description                                            Relative Influence
Household Number of Phones 32.38
Spending on LPG 20.75
Household Number of Microcomputers 14.80
Household Number of TVs 13.89
Household Number of Airconditioners 4.60
Spending on Tuition Fees in Cash 3.61
Number of employed household members 2.34
Household Number of Washing Machines 1.92
Household Number of Refrigerators 1.68
Employment of Spouse of Household Head 1.36
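The importance measures in Tables 4.10 to 4.12 can be read directly off the fitted ensembles sketched in Section 3; assuming those fits, a minimal illustration:

    # Mean decrease in Gini (bagging and random forests)
    imp_bag <- sort(importance(fit_bag)[, "MeanDecreaseGini"], decreasing = TRUE)
    head(imp_bag, 10)

    # Relative influence (boosting)
    summary(fit_gbm, n.trees = 1000, plotit = FALSE)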

5. Conclusions

The study has shown the potential of support vector machines for predicting the true cluster of a household. The performance of ordinal logistic regression, however, is excellent: superior to the tree-based methods and slightly superior to discriminant analysis. The study thus supports the predictive ability of the MORES 1SEC questionnaire. The best bull's eye hitrates were achieved by SVM, while the best neighboring cluster hitrates were achieved by logistic regression. SVM may thus be considered for predicting cluster membership, especially when high bull's eye hitrates are needed. Furthermore, the performances of bagging and random forests are not much worse than those of SVM and logistic regression, while boosting shows the weakest predictive ability among the classifiers considered.

The study has also shown that some cluster memberships are identified better by some classifiers than by others. For example, households belonging to cluster 4 are predicted better by logistic regression and discriminant analysis than by the other methods. One thing that all classifiers have in common is a lack of power in predicting households belonging to the middle clusters (clusters 5, 6 and 7).

The different tree-based methods also show roughly the same set of most important predictors.
These predictors are number of phones in household, spending on LPG, number of
microcomputers, number of employed household members, number of TVs, spending on tuition
fees in cash, number of refrigerators, employment status of spouse of household head, and number
of airconditioners.

REFERENCES

AGRESTI, A., 2007, An Introduction to Categorical Data Analysis, 2nd Ed., Hoboken, New Jersey:
John Wiley & Sons, Inc.

BERSALES, L.G.S., DE JESUS, N., BARRA, L., MERCADO, J.R., GOBENCION, M.B., and LUCAGBO, M.D., 2013, 1SEC 2012: The New Philippine Socioeconomic Classification, 12th National Convention on Statistics, Mandaluyong City, Philippines.

DICKSON, P.R., and GINTER, J.L., 1987, Market Segmentation, Product Differentiation, and
Marketing Strategy, Journal of Marketing, 51(2) : 1-10.

JAMES, G., WITTEN, D., HASTIE, T., and TIBSHIRANI, R., 2013, An Introduction to Statistical Learning with Applications in R, New York: Springer.

NOCK, S.L. and ROSSI, P.H., 1979, Household Types and Social Standing, Social Forces
57(4):1325-45.

NOCK, S.L. and ROSSI, P.H., 1978, Ascription versus Achievement in the Attribution of Family
Social Status, American Journal of Sociology 84:565-90.

OAKES, M., 2014, Measuring Socioeconomic Status, e-Source, Office of Behavioral & Social
Science Research, National Institutes of Health. Available at:
http://www.esourceresearch.org/Portals/0/Uploads/Documents/Public/Oakes_FullChapter.pdf.

OAKES, J. M. and ROSSI, P.H., 2003, The measurement of SES in health research: current
practice and steps toward a new approach, Soc Sci Med 56(4):769-84.

PRINCETON UNIVERSITY, 2012, Social Class, WordNet Search 3.1. Available at:
http://wordnetweb.princeton.edu/perl/webwn?s=social+class&sub=Search+WordNet&o2=&o0=
1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=0.

ROSE, D. and PEVALIN, D.J., 2001, The National Statistics Socio-economic Classification:
Unifying Official and Sociological Approaches to the Conceptualisation and Measurement of
Social Class, ISER Working Paper Number 2001-4, Institute of Social and Economic Research,
University of Essex.

ROSSI, P. H., SAMPSON, W. A., BOSE, C. E., JASSO, G., and PASSELL, J., 1974, Measuring
household standing, Social Science Research, 3:169-190.

STATISTICAL ANALYSIS SYSTEM (SAS), 2012, A Marketer's Guide to Analytics: Using Analytics to Make Smarter Marketing Decisions and Maximize Results, SAS White Paper, SAS. Available at: http://www.sas.com/resources/whitepaper/wp_21118.pdf.

TABUNDA, A.M.L. and DE JESUS, G.V., 1995, A Sequential Socioeconomic Classification Rule for Metro Manila Barangays, ASSIST Monograph Series 16, Quezon City: The Philippine Statistical Association.

TABUNDA, A.M.L. and DE JESUS, G.V., 1996, Identifying the Poorest in Metro Manila, Manila:
National Statistics Office.

VERHOEF, P.C., SPRING, P.N., HOEKSTRA, J.C., and LEEFLANG, P.S.H., 2002, The commercial use of segmentation and predictive modeling techniques for database marketing in the Netherlands, Decision Support Systems, 34: 471-481.

WATCHMAN, R., 2012, Building Best Practice Customer Segmentation Using Predictive
Analytics, Article for February 20, 2012, Catalysis. Available at:
http://www.catalysis.com/Article.aspx?id=68.

ZHENG, R. and TUZHILIN, A., 2008, Partitioning Customers Using Overlapping Segmentation Methods, WITS 2008: 18th Annual Workshop on Information Technologies & Systems, Paris, France.
