
CHAID

CHAID stands for CHi-squared Automatic Interaction Detector. It is a type of decision tree technique that can be used for prediction as well as classification, and for the detection of interaction between variables.

Like other decision trees, CHAID's advantages are that its output is highly visual and easy to interpret.
Because it uses multiway splits by default, it needs rather large sample sizes to work effectively, since
with small sample sizes the respondent groups can quickly become too small for reliable analysis.

CHAID detects interaction between variables in the data set. Using this technique it is possible to
establish relationships between a ‘dependent variable’ and other explanatory variables. CHAID does this
by identifying discrete groups of respondents and then using their responses to the explanatory
variables to predict the impact on the dependent variable.

CHAID is often used as an exploratory technique and is an alternative to multiple linear regression and
logistic regression, especially when the data set is not well-suited to regression analysis.
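As a rough illustration of the chi-squared idea behind CHAID's splits, the following Python sketch scores candidate categorical predictors against a binary outcome and picks the strongest one. This is only the split-selection step under simplifying assumptions; the full CHAID algorithm also merges similar categories and applies significance tests with multiplicity adjustments. The toy data and variable names are hypothetical.

```python
from collections import Counter

def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

def contingency(xs, ys):
    """Build a contingency table from two categorical sequences."""
    x_levels = sorted(set(xs))
    y_levels = sorted(set(ys))
    counts = Counter(zip(xs, ys))
    return [[counts[(x, y)] for y in y_levels] for x in x_levels]

# Hypothetical respondents: which variable better separates purchasers?
gender = ["M", "M", "F", "F", "M", "F", "M", "F"]
region = ["N", "S", "N", "S", "N", "S", "S", "N"]
bought = [1, 1, 0, 0, 1, 0, 1, 0]

scores = {name: chi_squared(contingency(var, bought))
          for name, var in [("gender", gender), ("region", region)]}
# CHAID would split the sample on the highest-scoring variable.
best = max(scores, key=scores.get)
```

In this toy data gender perfectly separates buyers from non-buyers while region carries no information, so the chi-squared score singles out gender as the splitting variable.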

Logistic regression
In statistics, logistic regression (sometimes called the logistic model or logit model) is used to predict
the probability of occurrence of an event by fitting data to a logistic curve. Like many forms of
regression analysis, it makes use of several predictor variables that may be either numerical or categorical. For
example, the probability that a person has a heart attack within a specified time period might be predicted from
knowledge of the person's age, sex and body mass index. Logistic regression is used extensively in the
medical and social sciences, as well as in marketing applications such as predicting a customer's
propensity to purchase a product or cancel a subscription.

Figure 1. The logistic function, with z on the horizontal axis and ƒ(z) on the vertical axis

An explanation of logistic regression begins with an explanation of the logistic function:

ƒ(z) = 1 / (1 + e^(−z))
A graph of the function is shown in figure 1. The input is z and the output is ƒ(z). The logistic function is
useful because it can take as an input any value from negative infinity to positive infinity, whereas the
output is confined to values between 0 and 1. The variable z represents the exposure to some set of
independent variables, while ƒ(z) represents the probability of a particular outcome, given that set of
explanatory variables. The variable z is a measure of the total contribution of all the independent
variables used in the model and is known as the logit.
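The bounded behaviour described above is easy to check directly; a minimal Python sketch of the logistic function:

```python
import math

def logistic(z):
    """The logistic function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Inputs far below zero give outputs near 0; inputs far above zero give
# outputs near 1; z = 0 gives exactly 0.5.
low, mid, high = logistic(-10.0), logistic(0.0), logistic(10.0)
```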

The variable z is usually defined as

z = β0 + β1x1 + β2x2 + β3x3 + … + βkxk,
where β0 is called the "intercept" and β1, β2, β3, and so on, are called the "regression coefficients"
of x1, x2, x3 respectively. The intercept is the value of z when the value of all independent variables is
zero (e.g. the value of z in someone with no risk factors). Each of the regression coefficients
describes the size of the contribution of that risk factor. A positive regression coefficient means that
the explanatory variable increases the probability of the outcome, while a negative regression
coefficient means that the variable decreases the probability of that outcome; a large regression
coefficient means that the risk factor strongly influences the probability of that outcome; while a
near-zero regression coefficient means that that risk factor has little influence on the probability of
that outcome.
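To make the roles of the intercept and the regression coefficients concrete, here is a small sketch with made-up coefficient values (the intercept −2.0 and the two coefficients are purely illustrative, not estimates from any real data):

```python
import math

def predicted_probability(intercept, coefficients, values):
    """Sum the contributions of all risk factors into the logit z,
    then map z to a probability with the logistic function."""
    z = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical model: coefficients for age (positive, small) and a
# smoking indicator (positive, larger).
no_risk = predicted_probability(-2.0, [0.05, 0.7], [0, 0])   # z = intercept only
smoker_40 = predicted_probability(-2.0, [0.05, 0.7], [40, 1])
```

With all independent variables at zero, the probability is just ƒ(intercept); the positive coefficients then push the probability upward as the risk factors increase.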

Logistic regression is a useful way of describing the relationship between one or more independent
variables (e.g., age, sex, etc.) and a binary response variable that has only two possible values,
such as "dead" or "not dead"; the model expresses this relationship as a probability.

The logits (natural logs of the odds) of the unknown binomial probabilities are modeled as a linear function
of the Xi:

logit(pi) = ln(pi / (1 − pi)) = β1x1,i + β2x2,i + … + βkxk,i

Note that a particular element of Xi can be set to 1 for all i to yield an intercept in the model. The
unknown parameters βj are usually estimated by maximum likelihood using a method common to
all generalized linear models. The maximum likelihood estimates can be computed numerically by
using iteratively reweighted least squares.
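The estimation procedure can be sketched for the simplest case of one predictor plus an intercept. For logistic regression, Newton's method on the log-likelihood coincides with iteratively reweighted least squares: each iteration weights observations by p(1 − p). The toy data below is hypothetical and chosen so the maximum likelihood estimate exists (the classes are not perfectly separable).

```python
import math

def fit_logistic(xs, ys, iterations=25):
    """Fit p = 1/(1 + exp(-(b0 + b1*x))) by Newton's method, which for
    logistic regression is the same as iteratively reweighted least squares."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Gradient of the log-likelihood and the 2x2 Hessian (negated).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)          # the IRLS weight
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system by hand.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical, non-separable toy data.
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(xs, ys)
probs = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]
```

At the maximum likelihood solution the score equations hold, so the fitted probabilities sum to the observed number of successes.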
The βj parameter estimates are interpreted as the additive effect on the log of the odds for a
unit change in the jth explanatory variable. In the case of a dichotomous explanatory variable, for
instance gender, e^β is the estimate of the odds ratio of having the outcome for, say, males compared
with females.
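A quick numeric check of this interpretation, using a made-up coefficient b = 0.7 on a hypothetical gender indicator (the intercept and coefficient values are illustrative only):

```python
import math

def probability(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    return p / (1.0 - p)

b0, b = -1.0, 0.7                     # hypothetical intercept and coefficient
p_female = probability(b0 + b * 0)    # indicator x = 0
p_male = probability(b0 + b * 1)      # indicator x = 1

# Since odds = exp(z), a one-unit change in x multiplies the odds by
# exactly exp(b): this ratio equals math.exp(0.7).
ratio = odds(p_male) / odds(p_female)
expected = math.exp(b)
```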

The model has an equivalent formulation:

pi = 1 / (1 + e^(−(β1x1,i + β2x2,i + … + βkxk,i)))

This functional form is commonly called a single-layer perceptron or single-layer artificial
neural network. A single-layer neural network computes a continuous output instead of a step
function. The derivative of pi with respect to X = (x1, ..., xk) is computed from the general form:

y = 1 / (1 + e^(−f(X)))

where f(X) is an analytic function in X. With this choice, the single-layer neural network is
identical to the logistic regression model. This function has a continuous derivative, which
allows it to be used in backpropagation. This function is also preferred because its
derivative is easily calculated:

dy/dX = y(1 − y) · df/dX