
Enhancing Analytical Modeling for Large Data Sets with Variable Reduction

Doesn’t it become tricky to resolve accurate relationships between variables when there are thousands of them? More so when each can be used to create segmentation models…

More often than not, some variables are highly correlated with one another. Including these highly correlated variables in the modeling process increases the amount of time the statistician spends finding a segmentation model that meets business needs. To speed up the modeling process, the predictor variables should be grouped into similar clusters. A few variables can then be selected from each cluster; this way the analyst can quickly reduce the number of variables and speed up the modeling process.
Today, technology helps store huge data at no or token additional cost compared to earlier days. In today’s business we keep information in different tables in a suitable structure. A financial business, for instance, can have account data, transaction data, customer demographic data, payment data, inbound-outbound call data, campaign data, account history data, etc. For analytical purposes we collate all this information to create a single customer view that may contain a huge number of variables. The challenge is to identify which few of them we will use for modeling. In high-dimensional data sets, identifying irrelevant variables is more difficult than identifying redundant variables, so it is suggested that the redundant variables be removed first and the irrelevant ones looked for afterwards. There are several ways of identifying less important variables, and the right technique depends on what specific analysis needs to be done with the final data. We will discuss the steps that should be followed to reduce the variables, and three techniques for identifying correlated variables: factor analysis, clustering of variables, and the method of correlation (for predictive analysis).
Step One. Reduce variables on the basis of missing percentage

A variable with a high proportion of missing information contributes very little predictive power to a statistical model. Sometimes the missing value needs to be imputed on a judgmental basis: a missing value for the amount purchased in the last month, for example, can indicate that the customer made no purchase in that month, so the missing field needs to be replaced by zero. But there are many cases where missing really is missing. In that situation it is not advisable to impute the value when the percentage of missing observations is very high; variables with a high proportion of missing observations should simply be removed.
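As a minimal sketch with pandas (the illustrative column name and the 80% cut-off below are assumptions, not figures from the text):

```python
import pandas as pd

def drop_high_missing(df: pd.DataFrame, max_missing: float = 0.8) -> pd.DataFrame:
    """Drop variables whose proportion of missing observations exceeds
    max_missing. The 0.8 cut-off is illustrative, not prescribed."""
    missing_pct = df.isna().mean()                 # share of missing values per column
    keep = missing_pct[missing_pct <= max_missing].index
    return df[keep]

# Judgmental imputation first, where missing has a business meaning,
# e.g. a missing last-month purchase amount means "no purchase"
# (hypothetical column name):
# df["purchase_amt_last_month"] = df["purchase_amt_last_month"].fillna(0)
```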

Step Two. Variable reduction on the basis of percentage of equal value

There might be fields with an equal value for all observations; for these the standard deviation is zero, and we can remove them since they cannot contribute anything to the model. There may also be variables for which almost all (say more than 98%) of the records carry the same value. We should not use these variables either, as they cannot contribute much to the model. Calculating percentiles along with the minimum and maximum value of each variable helps identify such variables.
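A sketch of this check in pandas, reusing the 98% figure above:

```python
import pandas as pd

def drop_near_constant(df: pd.DataFrame, max_share: float = 0.98) -> pd.DataFrame:
    """Drop variables where a single value accounts for more than
    max_share of the records (a zero standard deviation is the special
    case where one value accounts for 100%)."""
    share_top = df.apply(
        lambda s: s.value_counts(normalize=True, dropna=False).iloc[0]
    )                                             # share of the most frequent value
    return df.loc[:, share_top <= max_share]
```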

Step Three. Variable reduction among the correlated variables

It is not desirable to use a set of correlated predictor variables in cluster analysis, in any type of regression analysis, or in forecasting. Whether we are doing subjective segmentation with a clustering technique or building a predictive model, we can identify correlated variables with the techniques below.

Factor Analysis Technique

Let us consider that we have N predictors. Do a factor analysis for M factors, where M is significantly less than N. For a specific factor we get a loading value for each variable, and the loading is high for those variables that have a high influence on that factor; a set of variables that are highly correlated with each other will get high absolute loadings on the same factor. Select for the model one variable with a high loading on the first factor, then select the second variable by looking at the loadings on the second factor in the same way. Continue up to the k-th factor, identifying one variable per factor. The number k (< M) should be selected on the basis of the cumulative percentage of variation explained by the first k factors out of the total variation explained by all M factors; a cut-off within 90% to 95% can be used.
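One way to sketch this selection rule is with scikit-learn’s FactorAnalysis; here the rows of components_ stand in for the factor loadings, and measuring each factor’s share of variation by its sum of squared loadings is one common convention, assumed rather than prescribed by the text:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

def select_by_factors(df: pd.DataFrame, m: int, var_cutoff: float = 0.9) -> list:
    """Fit M factors and pick one variable per factor (the variable with
    the highest absolute loading), stopping at the k-th factor once the
    first k factors explain var_cutoff of the variation explained by all
    M factors. Assumes df holds standardized numeric predictors."""
    fa = FactorAnalysis(n_components=m, random_state=0).fit(df)
    loadings = fa.components_                     # shape (m, n_variables)
    var_per_factor = (loadings ** 2).sum(axis=1)  # variation carried by each factor
    cum_share = np.cumsum(var_per_factor) / var_per_factor.sum()
    k = min(m, int(np.searchsorted(cum_share, var_cutoff)) + 1)
    selected = []
    for j in range(k):                            # one variable per factor
        order = np.argsort(-np.abs(loadings[j]))
        pick = next(df.columns[i] for i in order if df.columns[i] not in selected)
        selected.append(pick)
    return selected
```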
Cluster of Variable Technique

This technique splits the variables into two groups, on the basis of their correlation with each other, at each step of variable splitting. The variables within the same group have higher correlation with each other than the between-group correlation. We can impose a condition on whether a specific subgroup needs to be split further, based on a cut-off for the second-highest eigenvalue of the subgroup; the cut-off is typically chosen between 0.5 and 0.8. Once the final convergence happens, you can select one or two variables from each child group for the purpose of modeling.
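The splitting rule described here resembles the divisive approach used by tools such as SAS PROC VARCLUS; the following is only a simplified sketch of that idea, assuming numeric data in a pandas DataFrame and using the eigenvalue cut-off quoted above:

```python
import numpy as np
import pandas as pd

def cluster_variables(df: pd.DataFrame, eig_cutoff: float = 0.8) -> list:
    """Recursively split variables into groups, splitting any group whose
    correlation matrix has a second-highest eigenvalue above eig_cutoff.
    A simplified sketch of divisive variable clustering."""
    def recurse(cols):
        if len(cols) < 2:
            return [cols]
        corr = df[cols].corr().to_numpy()
        eigvals, eigvecs = np.linalg.eigh(corr)    # eigenvalues in ascending order
        if eigvals[-2] < eig_cutoff:               # group is homogeneous enough
            return [cols]
        # assign each variable to whichever of the two leading
        # components it loads on more strongly (in absolute value)
        pc1, pc2 = eigvecs[:, -1], eigvecs[:, -2]
        left = [c for c, a, b in zip(cols, pc1, pc2) if abs(a) >= abs(b)]
        right = [c for c in cols if c not in left]
        if not left or not right:                  # avoid a degenerate split
            return [cols]
        return recurse(left) + recurse(right)
    return recurse(list(df.columns))
```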

Method of Correlation
This technique is very useful in prediction settings where we have one response variable and a set of predictors, although we can first use either of the two methods above to reduce the number of predictors in a first stage and then use this method for further reduction. Let us consider that we have a response variable Y and predictors X1, X2, …, Xn. Calculate the correlation matrix for all the predictors including Y. We can impose a condition on the correlation value so that only one of any two predictors is kept when their correlation is higher than some specific value, say r. Now if r_ij, the correlation between Xi and Xj, is greater than r, we keep Xi when r_yi > r_yj, where r_yi is the correlation between Y and Xi. In practice r generally ranges from 0.75 to 0.9.
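A sketch of this rule with pandas, taking r = 0.8 from the quoted range and assuming the response and predictors sit together in one numeric DataFrame:

```python
import pandas as pd

def drop_correlated(df: pd.DataFrame, y: str, r: float = 0.8) -> list:
    """For every pair of predictors with |correlation| > r, keep the one
    that is more correlated with the response y."""
    corr = df.corr().abs()
    r_y = corr[y].drop(y)                # |correlation| of each predictor with Y
    ranked = sorted(r_y.index, key=lambda c: -r_y[c])  # strongest with Y first
    kept = []
    for col in ranked:                   # keep col unless it clashes with a kept one
        if all(corr.loc[col, k] <= r for k in kept):
            kept.append(col)
    return kept
```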
Note: If you feel that you still have too many variables and need to reduce them before the actual modeling, you can do this on the basis of the VIF value of each predictor, obtained by performing a regression of Y on the predictors. Remove the variables whose VIF is higher than 2.5, one at a time.
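A sketch of that pruning loop using statsmodels’ variance_inflation_factor; the one-at-a-time removal follows the note above, while everything else is an assumption of the sketch:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def prune_by_vif(X: pd.DataFrame, max_vif: float = 2.5) -> pd.DataFrame:
    """Iteratively drop the predictor with the highest VIF until every
    remaining predictor has VIF <= max_vif. X holds the predictors only;
    add a constant column first if an intercept is wanted in the
    underlying regressions."""
    X = X.copy()
    while X.shape[1] > 1:
        vifs = pd.Series(
            [variance_inflation_factor(X.to_numpy(), i) for i in range(X.shape[1])],
            index=X.columns,
        )
        if vifs.max() <= max_vif:
            break
        X = X.drop(columns=vifs.idxmax())  # drop the worst offender, then recompute
    return X
```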
● ● ●

Data mining methods simplify the extraction of key insights from a huge database. They offer the possibility of starting the analysis from any given point in it; however, without proper methods and techniques we may never be able to do so. Variable reduction greatly helps both in handling huge data and in reducing model development time, and all of this is accomplished without sacrificing the quality of the model. Identifying the right technique becomes all the easier with a better understanding of the data.

With techniques like these we, at Cequity, are able to combine data and technology, and build actionable analytical marketing services to accelerate ROI-driven, real-time, customer-engaged marketing. Touch base with us to learn more…

Cequity Solutions Pvt. Ltd.

Reach us at 105-106, 1st Floor, Anand Estate, 189-A, Sane Guruji Marg, Mahalaxmi, Mumbai-400 011, India
Phone: +91 22-43453800 Fax: +91 22-43453840

For more case studies, white papers and presentations log on to www.cequitysolutions.com
Or Write to info@cequitysolutions.com
For the latest thinking in Analytical Marketing, check out our blog at blog.cequitysolutions.com
