
Introduction

The ability to predict an event based upon historical data is an important task of machine
learning and data mining algorithms. Such predictive models are often referred to as forecasts
and are among the most valuable analytics tools used in organizations.
After all, we forecast sales, demand, and weather. We attempt to predict the outcomes of events,
games, and races, and of course the price of nearly everything that has value, from commodities
to stocks, currency, and real estate.
Predicting the future is big business. Just as there are many areas where we attempt to beat the
odds and guess at what the future holds, there are many ways to predict the future.
One could engage the services of a fortune teller, bet on hunches or intuition, or rely upon
statistical methods such as regression.
Regression is a statistical technique that builds a predictive model. A predictive model is one
that can predict an output value based upon an input value. Simply put, a predictive model
attempts to predict the future based upon what is known now and what has been observed in the
past. There are many examples of such predictive models.

For example, a researcher might use factors such as diet, exercise, or sugar intake to predict an
individual's risk of developing certain diseases.

A retailer may use factors such as increases or decreases in income levels to predict demand and
establish stocking levels of certain products within its stores.
A commodities broker may use a wide variety of input data such as the weather forecast,
precipitation, reports of disease, historical data of all these factors, and of course prices to
determine where the price of orange juice will be in the future.
Essentially such predictive models are used where a relationship exists between two or more
variables and where historical data can be used to establish trends which may be predictive of
future events.
Regression
Regression is a statistical method used to build such predictive models based upon historical
data. The input data for regression must include both the predictor variables (the attributes that
are used as input from which to generate a predicted outcome) and a predicted variable (the
output that is expected to occur based upon the input predictor variables).
As an example of this process, imagine that you wanted to predict the water levels of lakes in
your area. To do so, you collect data on rainfall and lake water levels for all of the lakes in
your area. Imagine further that you were able to collect this same data over the course of
several years. Using this data, you could develop a model that predicts what the water level in
a lake will be based upon the amount of rainfall that is received.
Regression is a process used to develop a model that fits the data and identifies predictive
trends within it. There are a number of different types of regression models that can be used to
capture different types of relationships. Some of these relationships are linear when plotted;
others have a curved shape. We will walk through an example of the simplest form, linear
regression, which is used when a scatterplot of the data has a roughly linear shape. Going back
to our lake levels example, let us assume that we have all of the data as follows.

If we take the two key variables, the rainfall in inches and the change in lake level, and plot
these values on a graph, they would look like the following:
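(The original chart is not reproduced here. As a stand-in sketch, the short matplotlib script below recreates a scatterplot of this kind from the (x, y) pairs listed later in this article; the variable names are our own.)

```python
import matplotlib.pyplot as plt

# (x, y) pairs: rainfall in inches vs. change in lake level in inches,
# taken from the list of points given later in this article.
points = [(13, -4), (17.2, -1), (13, -2), (18, 0.5), (15, -3),
          (22, 2), (15, -2.5), (22, 0.2), (16, -3.2), (22, 1.5),
          (17, -0.75), (22, 2.1), (17.1, 0), (22, 3)]

rainfall = [x for x, _ in points]
level_change = [y for _, y in points]

plt.scatter(rainfall, level_change)
plt.xlabel("Rainfall (inches)")
plt.ylabel("Change in lake level (inches)")
plt.show()
```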

If we look at this graph, one of the things that we notice is that the plotted values begin to form
a pattern. In fact, we could draw a straight line that follows the shape of the data points that
have been plotted. In figure (x), we have added such a line to the graph to indicate the general
shape of the points. This line is called a regression line. Essentially this line provides us with a
way to use the data to create a prediction model. For example, if we look at the graph we can see
that, in general, when we have 15 inches of rainfall, the lake levels will tend to be about 2 inches
lower. We can also see that when we have 20-25 inches of rainfall, the levels of water in the
lakes will tend to be about 3 inches higher.
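To make that reading concrete, suppose the fitted line turned out to be y = 0.5x - 9.5 (a hypothetical slope and intercept, chosen only to be consistent with the readings above). Then at x = 15 inches of rainfall, y = 0.5(15) - 9.5 = -2 inches, and at x = 25 inches, y = 0.5(25) - 9.5 = 3 inches.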
In the figure, we simply drew a line over the data by eye, but a regression algorithm employs a
systematic process to fit the line to the data. The first thing that we need to understand about
fitting a regression line is the concept of error. When we look at Figure (x) we can see that many
of the data points do not touch the line. The line represents the predicted value of y for any input
value of x. The dots that do not touch the line indicate that the actual data does not match the
prediction.

We refer to the difference between the predicted value and the actual value as error. The
objective of regression is to minimize the total amount of error. The most popular regression
technique is called least squares because it measures the distance of each actual point from the
regression line and then squares that distance.

The reason that we square the value is that the points above the line have a positive distance and
the points below the line have a negative one. When we square them, all of the distances become
positive. The algorithm must then minimize the total obtained when all of these squared errors
are added together.
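As a minimal sketch of this idea, the following Python snippet computes the sum of squared errors for a candidate line over a handful of points (the data values here are illustrative, not the lake measurements):

```python
# Sum of squared errors for a candidate line y = m*x + b.
# The points below are illustrative values, not the lake data.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]

def sum_squared_error(m, b, points):
    # Square each residual so that errors above and below the line
    # both contribute positively, then add them all up.
    return sum((y - (m * x + b)) ** 2 for x, y in points)

# The fitting algorithm's job is to find the m and b that minimize this.
print(sum_squared_error(2.0, 0.0, points))
```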
Supervised learning for prediction
The lake water level model is an example of linear regression. We call it linear regression
because it fits a straight regression line to the data, and this line is expressed in the form:
y = mx + b
If you recall your basic algebra, this is essentially a function where for any input value of x
there is an output value of y. The m in the equation is called the slope. The slope defines the rise
over run, or the change in the value of y for a unit change in the value of x. The b in the equation
specifies the y intercept, or the point where the line crosses the y axis. The y intercept is the
value of y when x is zero.
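As a small illustration of fitting such a line (with invented rainfall figures, since the original data table is not reproduced here), NumPy's polyfit can estimate m and b directly:

```python
import numpy as np

# Invented rainfall (inches) and change in lake level (inches),
# used only to illustrate the fitting step.
rainfall = np.array([13.0, 15.0, 17.0, 20.0, 22.0, 25.0])
level_change = np.array([-3.0, -2.0, -0.5, 1.0, 2.0, 3.0])

# polyfit with degree 1 fits y = mx + b and returns m, then b.
m, b = np.polyfit(rainfall, level_change, 1)
print(f"slope m = {m:.3f}, intercept b = {b:.3f}")
print(f"predicted change at 18 inches of rain: {m * 18 + b:.2f} inches")
```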
Curvilinear regression
There are other forms of regression that are often needed. For example, consider the following
data set (Figure x). The data in this graph are clearly not linear. We see the values of Y rise with
the values of X up to a point around 4 on the X axis, and then they begin to decline.

Linear regression is clearly not going to produce accurate predictions of these values. What is
needed is a curved line that models the behavior of the data as it increases up to a point and then
decreases. Fortunately, we have such a solution in curvilinear regression, which uses splines to
model data that is not linear and yet has a strong relationship between the data attributes. In
curvilinear regression we no longer have an algorithm that fits a line; rather, the algorithm fits a
spline, and the process is accordingly called spline fitting.
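One way to sketch spline fitting in Python is with SciPy's smoothing spline; the data below is invented to mimic the rise-then-fall shape just described:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Invented data that rises to a peak near x = 4 and then declines,
# mimicking the shape described for Figure x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0.5, 2.0, 3.5, 4.6, 5.0, 4.4, 3.2, 1.9, 0.6])

# A smoothing spline fits a piecewise-polynomial curve to the points;
# the s parameter controls how closely the curve follows the data.
spline = UnivariateSpline(x, y, s=0.5)
print(spline(4.0))  # predicted value near the peak
print(spline(7.5))  # predicted value on the declining side
```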

Logistic Regression
In both linear regression and in curvilinear (or spline) regression the objective is to predict an
output value based upon a set of input values. In these forms of regression analysis the variables,
typically both input and output variables, are continuous, meaning that they can take on a whole
range of values depending upon their type.
Logistic regression is a bit different. First of all, the objective of logistic regression incorporates
not only the idea of prediction but also categorization. Logistic regression predicts by placing an
instance into a binary category (something that has two options). This technique is often used
when the desired outcome is a decision between two choices. If you recall our breast cancer
example, the objective was to determine whether the particular cell instance, as identified by its
9 attributes, was cancerous or benign. That particular problem might be suited to the logistic
regression technique (depending upon the type of attributes available).
In both linear and curvilinear regression, the output is a continuous variable. For any input
variable the regression model is designed to produce an output, and the output is plotted on a
continuous line. In logistic regression, the output for any input is produced by a sigmoid
function, which creates a threshold at which each point is assigned to one of two categories,
essentially 0 and 1, to which we affix a meaning such as "yes, extend credit" or "no, do not." In
the following figure we can see the difference between a continuous linear outcome variable and
the sigmoid function. You will notice that in the sigmoid curve there is only a short distance
between the region where the line is near the bottom of the axis ("off") and where it is near the
top ("on"). It is this sharp demarcation that makes logistic regression valuable as a classification
technique.
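A minimal sketch of the sigmoid and its thresholding behavior in Python (the model coefficients here are invented, not fitted to any real data):

```python
import math

def sigmoid(z):
    # Maps any real number into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical logistic model: z = m*x + b with invented coefficients.
m, b = 1.5, -6.0

for x in [2.0, 4.0, 6.0]:
    p = sigmoid(m * x + b)
    # Threshold at 0.5 to turn the probability into a 0/1 category.
    label = 1 if p >= 0.5 else 0
    print(f"x = {x}: probability = {p:.3f}, category = {label}")
```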

Fitting the Regression Line


The process of fitting a regression line involves determining the unknown coefficients of the
equation
y = mx + b
An equation of this form defines a line that can be plotted on a two dimensional axis. This
equation represents a single variable regression. In single variable regression there is one input
variable, expressed as some value of x, and an output, which is represented in the equation as y.
A similar process can be followed when there are more than two variables in the regression.
When the regression model has more than two variables there are multiple input variables and
one output variable. A multiple regression model is expressed by an equation in the form:
y = b + m1x1 + m2x2 + ... + mkxk
The unknown values in each of these cases are the value of b, which represents the y intercept
(the point where the regression line crosses the y axis on the graph), and the value of m, which is
known as the slope of the line. In multiple regression there is a slope associated with each
dimension (each variable) in the model, represented by the letter m with a subscript for each
variable.
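As a brief sketch (with invented data for two input variables), NumPy's least-squares solver can estimate b and the per-variable slopes of such a multiple regression:

```python
import numpy as np

# Invented observations: two input variables (x1, x2) and one output y.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([7.1, 6.9, 13.2, 12.8, 17.0])

# Prepend a column of ones so the solver also estimates the intercept b.
A = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
b, m1, m2 = coeffs
print(f"y = {b:.2f} + {m1:.2f}*x1 + {m2:.2f}*x2")
```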
To determine the slope of the regression line we use the following formula:

m = (n Σxy - Σx Σy) / (n Σx² - (Σx)²)

We then use the slope to determine the value of the y intercept:

b = (Σy - m Σx) / n
In each of these equations we see that we need the sum of the x values, the sum of the y values,
the sum of the products xy, and the sum of the squares of the x values. We determine these
values by taking all of the x, y points that we see plotted on the graph (figure xx).

In figure xx, we can see a number of points drawn on the x and y axes. Each of these points can
be represented as an (x, y) pair, as shown below.

(13, -4), (17.2, -1), (13, -2), (18, 0.5), (15, -3), (22, 2), (15, -2.5),
(22, 0.2), (16, -3.2), (22, 1.5), (17, -0.75), (22, 2.1), (17.1, 0), (22, 3)
To calculate the y intercept we would first need to determine the sum of the y values, which we
find by adding the y values from each of the points. In this case we would add -4, -1, -2, 0.5, -3,
2, -2.5, 0.2, -3.2, 1.5, -0.75, 2.1, 0, and 3 together for a total of -7.15.

After calculating each of the values you can plug them into the respective formulas to find the
slope of the line and the y intercept. The following table details all of the values calculated for
this simple example.
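Those sums are mechanical to compute; the short script below reproduces them, and the resulting slope and intercept, from the fourteen points listed above:

```python
# The fourteen (x, y) points listed above.
points = [(13, -4), (17.2, -1), (13, -2), (18, 0.5), (15, -3),
          (22, 2), (15, -2.5), (22, 0.2), (16, -3.2), (22, 1.5),
          (17, -0.75), (22, 2.1), (17.1, 0), (22, 3)]

n = len(points)
sum_x = sum(x for x, _ in points)
sum_y = sum(y for _, y in points)          # -7.15 for these points
sum_xy = sum(x * y for x, y in points)
sum_x2 = sum(x * x for x, _ in points)

# Least-squares slope and intercept from the closed-form formulas above.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / n
print(f"sum x = {sum_x:.2f}, sum y = {sum_y:.2f}")
print(f"sum xy = {sum_xy:.2f}, sum x^2 = {sum_x2:.2f}")
print(f"slope m = {m:.3f}, intercept b = {b:.3f}")
```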

It should be easy to see why it is so important to have a computer algorithm to perform
regression, given the number of calculations that are required.
