
Wisdom and Statistical Techniques

by Prof. Amitava Banerjee

on 3rd Aug ’07 at QDI (Seepz, SDF-V)

12th July ’07
Statistical Methods for Data Analysis
Types of techniques
• Descriptive analysis
• Exploratory techniques
• Visualization tools
• Basic inferential analysis
• Dependency techniques / Relationship analyses
• Analyses of association / contingency tables
• Time series / trend analysis
• Survival analysis
• Exploring structure and reduction of dimension
Descriptive Techniques
• Used to understand the level of variation of the variables
involved
• Typical tools are
– Frequency distribution
– Histogram
– Dot plots (for small data sets – usually fewer than 50 points)
– Computation of average, percentiles, standard deviation, range
• On the basis of the descriptive analysis you should be
able to talk confidently about the pattern of variation of
the variables involved
• Always carry out descriptive analysis at the beginning. Do not
eliminate outliers at this stage; try to understand the pattern of
variation, and always link the learning to the stated problem
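The descriptive tools listed above can be sketched in a few lines of Python; the effort figures below are invented purely for illustration:

```python
import numpy as np

# Hypothetical effort data (person-hours) for 12 tasks; illustrative only.
effort = np.array([12, 15, 14, 90, 16, 13, 18, 17, 15, 14, 16, 19])

mean = effort.mean()
std = effort.std(ddof=1)                        # sample standard deviation
p25, p50, p75 = np.percentile(effort, [25, 50, 75])
value_range = effort.max() - effort.min()

# The outlier (90) pulls the mean well above the median, which is why
# outliers should be understood at this stage rather than discarded.
```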
Exploratory Techniques
• The aim of these techniques is to discover ‘interesting
patterns’. Think about ‘interesting questions’ and check
whether the data support them
• Tools of exploratory techniques are
– Stratification
– Two and three way tables
– Usage of concepts of experimental design
– Scatter diagrams
– Stratified scatter diagrams
– Matrix of scatter diagrams
– Stratified dot plots
• The exploratory techniques should give strong
indications regarding the important variables
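Stratification, the first tool in the list, can be sketched as follows; the team names and defect counts are invented for illustration:

```python
# Minimal stratification sketch: hypothetical defect counts per module,
# stratified by team. All names and numbers are invented.
data = [
    ("team_a", 3), ("team_a", 5), ("team_a", 4),
    ("team_b", 9), ("team_b", 11), ("team_b", 10),
]

strata = {}
for team, defects in data:
    strata.setdefault(team, []).append(defects)

means = {team: sum(v) / len(v) for team, v in strata.items()}
# A clear gap between stratum means flags 'team' as an important
# variable worth carrying forward into the inferential stage.
```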
Data Visualization Tools
• Part of exploratory analyses. Primarily used to validate assumptions
/ develop new theories
• Typical tools are
– Run charts (plots in chronological order)
– Plot of mean with or without limits (essentially control charts)
– Plot of mean function with or without limits
– Plot of proportions for binary data (e.g. percentage of defective items)
– Plot of different proportions for multinomial data
• Data visualization tools should help you to detect
– Time based dependencies
– Whether a linear model to explain some variable in terms of other
variables is acceptable or not
– Whether it will be possible to develop models to understand behaviour
of binary data or not
Basic Inferential Analysis
• Should be looked at from three perspectives
– Whether the differences observed during the
exploratory analyses are likely to be important or not
(to be looked at primarily from elementary probability)
– Carrying out appropriate statistical hypothesis testing
(parametric as well as non-parametric) to judge the
significance of the differences observed
– Carrying out hypothesis testing from simulation
perspective
• The basic inferential analyses are
complementary to the exploratory analyses and
increase confidence in the findings.
However, this requires a lot of maturity
Distribution Fitting
• These analyses aim at fitting probability models
to data so that risks (probabilities of specific
events) may be estimated objectively
• Typical distributions are as follows:
– Effort and turn around time data often fit lognormal
distribution
– Inter-arrival time data often fit exponential, gamma or
lognormal
– Effort and schedule variance data often fit normal
distribution
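As a sketch of the lognormal case above (assuming scipy is available; the turnaround-time data are simulated, not real project data):

```python
import numpy as np
from scipy import stats

# Simulated turnaround-time data (days); parameters are illustrative.
rng = np.random.default_rng(42)
tat = rng.lognormal(mean=2.0, sigma=0.5, size=500)

# Fit a lognormal distribution with the location fixed at zero.
shape, loc, scale = stats.lognorm.fit(tat, floc=0)

# Objective risk estimate: probability that turnaround exceeds 20 days.
risk = stats.lognorm.sf(20, shape, loc=loc, scale=scale)
```

Once the model fits, tail probabilities like `risk` replace gut feel when quoting delivery commitments.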
Dependency / Relationship
Analyses
• These analyses aim at finding the
relationship between a response (dependent)
variable and a set of independent (predictor)
variables – the so-called y = f(x) kind of analyses
• Typical tools are
– Regression analysis
– Discriminant analysis
– Analysis of variance / analysis of covariance
– Multivariate Analysis of Variance (MANOVA)
– Path analysis
Regression Analysis
• Use regression analysis in the following situations:
– The response is a continuous variable
– The predictor variables are continuous
– The plots of mean functions show reasonable patterns
– From the underlying technology / experience you know
that a ‘good’ estimate of the response variable is possible
• Always carry out the descriptive and exploratory
analyses before fitting a regression model. Try to use the
minimum number of predictor variables. Remember that
in order to be useful a model must be parsimonious
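A parsimonious one-predictor fit can be sketched with ordinary least squares (numpy only); the size/effort numbers are simulated for illustration:

```python
import numpy as np

# Hypothetical data: effort (person-days) explained by size (KLOC),
# generated from a known line plus noise so the fit can be checked.
size = np.array([1.2, 2.5, 3.1, 4.0, 5.2, 6.3, 7.1, 8.4])
effort = 4.0 + 2.5 * size + np.random.default_rng(0).normal(0, 0.3, size.size)

# Design matrix with an intercept column; a single predictor keeps
# the model parsimonious, as the slide recommends.
X = np.column_stack([np.ones_like(size), size])
coef, residuals, rank, _ = np.linalg.lstsq(X, effort, rcond=None)
intercept, slope = coef
```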
Discriminant Analysis
• Often there are cases when accurate estimation of the response
variable is either not necessary or not possible. This is usually the
case when the response variable has a large variation and hence no
model can explain the variability adequately
• In such cases the response variable can be grouped into classes,
and while it may not be possible to estimate the value of the
response accurately, it may be possible to identify whether it
belongs to a particular class or not with higher accuracy
• In order to apply discriminant analysis, the predictor variables should
be continuous and should follow normal distribution
• The effectiveness of the discriminant model is judged from the
perspective of probability of misclassification
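A minimal two-class Fisher discriminant, judged by misclassification rate as described above, can be sketched with numpy alone; the data are simulated normal samples (matching the normality assumption), not real measurements:

```python
import numpy as np

# Two simulated classes of continuous, normally distributed predictors.
rng = np.random.default_rng(1)
low = rng.normal([0.0, 0.0], 1.0, size=(100, 2))    # class "low risk"
high = rng.normal([4.0, 4.0], 1.0, size=(100, 2))   # class "high risk"

# Pooled within-class covariance and the Fisher discriminant direction.
cov = 0.5 * (np.cov(low.T) + np.cov(high.T))
w = np.linalg.solve(cov, high.mean(0) - low.mean(0))
threshold = w @ (low.mean(0) + high.mean(0)) / 2

# Classify every point and measure the probability of misclassification.
X = np.vstack([low, high])
y = np.array([0] * 100 + [1] * 100)
pred = (X @ w > threshold).astype(int)
misclassification = (pred != y).mean()
```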
Analysis of Variance / Covariance
• When the predictor variables are discrete, taking
only a few values, we are actually looking at a set
of segments defined by these predictor variables
• In this case the correct model to be used to
explain the variation in the response is analysis
of variance (ANOVA)
• An interesting application of ANOVA is the
comparison of value of sales across various
business / customer segments
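The sales-across-segments application above can be sketched with a one-way ANOVA (assuming scipy is available; the segment names and sales figures are invented):

```python
from scipy import stats

# Hypothetical sales values for three customer segments.
retail = [102, 98, 110, 105, 99, 104]
banking = [130, 128, 135, 131, 129, 133]
telecom = [101, 103, 97, 100, 106, 102]

# One-way ANOVA: does mean sales differ across segments?
f_stat, p_value = stats.f_oneway(retail, banking, telecom)
# A small p-value indicates the segment means are not all equal.
```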
Analysis of Association
• There are cases when the response variable is ordinal
• In these cases logistic regression is to be used to
explore the relationship between the response and the
predictor variables
• Before fitting the logistic regression model, two-way
contingency tables are to be constructed to measure the
association between each predictor and the response
variable, one at a time
• In logistic regression the response is looked at from the
perspective of odds ratio rather than the individual value
• Logistic regression can be applied to assess the odds of
winning contracts or growth of an account beyond
certain points on the basis of available data
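A minimal logistic-regression sketch (numpy only, fitted by Newton-Raphson) for the contract-winning example; the discount variable and win outcomes are simulated, not real bid data:

```python
import numpy as np

# Simulated data: probability of winning a contract vs. discount offered,
# generated from a known logistic curve so the fit can be checked.
rng = np.random.default_rng(7)
discount = rng.uniform(0, 10, 200)
p_true = 1 / (1 + np.exp(-(-3.0 + 0.8 * discount)))
won = (rng.uniform(size=200) < p_true).astype(float)

# Fit by Newton-Raphson on the log-likelihood.
X = np.column_stack([np.ones_like(discount), discount])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)
    grad = X.T @ (won - p)
    hess = (X * W[:, None]).T @ X
    beta += np.linalg.solve(hess, grad)

# The slide's odds-ratio view: multiplicative change in the odds of
# winning per additional unit of discount.
odds_ratio = np.exp(beta[1])
```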
Time series / Trend Analyses
• These analyses aim at discovering patterns of arrivals and at forecasting
• Typical methods are
– Moving average
– Exponential smoothing
– Double exponential smoothing
– ARIMA models
• These models are used for forecasting call arrivals,
volume of business and economic growth that may have a
bearing on business
• A very interesting application is the ‘Sales Reserve
Analysis’ where the value of billed revenue and the value
of opportunity won are compared. Another interesting
application is the study of the pattern of invoicing of
opportunities won
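The first two smoothers in the list can be sketched in plain Python; the monthly volume figures are invented for illustration:

```python
# Invented monthly business-volume series.
series = [120, 132, 101, 134, 190, 170, 140, 160, 180, 175]

def moving_average(xs, window):
    """Trailing moving average; output starts once a full window exists."""
    return [sum(xs[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(xs))]

def exponential_smoothing(xs, alpha):
    """Simple exponential smoothing: new level = alpha*x + (1-alpha)*old."""
    level = xs[0]
    out = [level]
    for x in xs[1:]:
        level = alpha * x + (1 - alpha) * level
        out.append(level)
    return out

ma3 = moving_average(series, 3)
smoothed = exponential_smoothing(series, alpha=0.3)
```

The final smoothed level serves as the one-step-ahead forecast; double exponential smoothing and ARIMA extend this to trending and autocorrelated series.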
Survival Analysis
• Used to figure out the chance of survival / death in a
given interval of time
• In business context this is used for the following
– Retention of customers
– Retention of employees
• When a customer carries out transactions, the customer
is said to be surviving; when the level of transactions
goes below a certain level, it is assumed that the customer is
no longer surviving
• Survival analysis looks at the distribution of the length of
the survival time and how it is impacted by different
parameters
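The survival-time distribution can be estimated with the Kaplan-Meier method, sketched here in plain Python; the customer lifetimes are invented, and event = 1 means the customer stopped transacting while 0 means the observation was censored:

```python
# Hypothetical customer lifetimes in months with censoring indicators.
observations = [(3, 1), (5, 1), (5, 0), (8, 1), (12, 0), (12, 1), (15, 0)]

def kaplan_meier(obs):
    """Return (time, survival probability) steps at each event time."""
    obs = sorted(obs)
    n_at_risk = len(obs)
    surv, steps, i = 1.0, [], 0
    while i < len(obs):
        t = obs[i][0]
        deaths = sum(1 for tt, e in obs if tt == t and e == 1)
        at_t = sum(1 for tt, _ in obs if tt == t)
        if deaths:                       # survival drops only at event times
            surv *= 1 - deaths / n_at_risk
            steps.append((t, surv))
        n_at_risk -= at_t                # remove events and censored cases
        i += at_t
    return steps

curve = kaplan_meier(observations)
```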
Exploring Structure / Dimension
Reduction
• The aim of these analyses is to explore
structure within the data and use the same to
reduce dimension.
• Sometimes these techniques are used to identify
groups within the data
• Tools are
– Factor Analysis
– Cluster Analysis
• Factor analysis is used to explore structure and
cluster analysis is used to discover groups
Factor Analysis - Examples
• The complexity of software depends on a large number of parameters.
It is difficult to define limits on many of them together, and even more
difficult to exercise control when many parameters are being looked at.
Factor analysis can combine several parameters into one, so
the control scheme can be simplified to a very large extent
• The number of defects / changes in a piece of software depends on many
technical parameters. However, these parameters are correlated,
and hence developing regression / discriminant models to
understand the impact of individual parameters on the occurrence of
defects / change-proneness is not feasible. Factor analysis can
combine variables and reduce dimension while at the same time
ensuring that the combined variables are orthogonal. Thus a
regression or discriminant model can be applied after carrying out
factor analysis
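Principal component analysis, a close relative of factor analysis, can stand in as a sketch of the combine-and-orthogonalize step (numpy only); the four "complexity metrics" are simulated to share one latent factor:

```python
import numpy as np

# Simulated complexity metrics that all load on one underlying factor.
rng = np.random.default_rng(3)
latent = rng.normal(size=300)
metrics = np.column_stack([latent + rng.normal(0, 0.3, 300) for _ in range(4)])

# Eigen-decomposition of the correlation matrix (the extraction step).
corr = np.corrcoef(metrics, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)          # ascending order
order = np.argsort(eigvals)[::-1]                # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Component scores on standardized data are mutually orthogonal,
# so a regression / discriminant model can safely use them.
z = (metrics - metrics.mean(0)) / metrics.std(0)
scores = z @ eigvecs
explained = eigvals[0] / eigvals.sum()           # variance share of factor 1
```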
Cluster Analysis - Examples
• The customers and projects need to be grouped so that
different actions may be taken for different groups
(These actions may be regarding sales, attrition, control
from the perspective of quality / productivity etc.)
• The grouping may be carried out empirically using
cluster analysis. An interesting example is the
identification of, say, ‘Anchor’ or ‘Harvest’ accounts
requiring different focus depending on their average
revenue, growth, margin and possibly share of wallet.
• Individual customers may be grouped on the basis of
their loyalty etc. and one may aim at examining the
constitution of these groups. This can throw light on
accounts where the probability of defection is rather high
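A minimal k-means sketch (numpy only, Lloyd's algorithm) for the account-grouping example; the revenue and growth figures, and the two-cluster structure, are invented to echo the 'Anchor' vs 'Harvest' idea above:

```python
import numpy as np

# Simulated accounts described by (average revenue, growth); two groups.
rng = np.random.default_rng(5)
anchor = rng.normal([10.0, 2.0], 0.5, size=(20, 2))   # high revenue, low growth
harvest = rng.normal([2.0, 9.0], 0.5, size=(20, 2))   # low revenue, high growth
accounts = np.vstack([anchor, harvest])

# Deterministic initialization: one seed point from each end of the data.
k = 2
centers = np.stack([accounts[0], accounts[-1]])
for _ in range(20):                                   # Lloyd's iterations
    dists = np.linalg.norm(accounts[:, None] - centers[None], axis=2)
    labels = dists.argmin(axis=1)                     # assign to nearest center
    centers = np.array([accounts[labels == j].mean(0) for j in range(k)])

# Each account now carries a cluster label usable for differentiated action.
```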
