
A statistical data set may consist of a list of numbers related to a piece of research. Among those numbers, some may be repeated twice or even more than twice. The number of times a value is repeated in a data set is termed the frequency of that value. The frequencies of the values in a data set can be listed in a table. This table is known as a frequency distribution table, and the list is referred to as a frequency distribution.

There are many types of frequency distributions:

1. Grouped frequency distribution
2. Ungrouped frequency distribution
3. Cumulative frequency distribution
4. Relative frequency distribution
5. Relative cumulative frequency distribution

When the data are grouped into classes of appropriate size, indicating the number of observations in each class, we get a frequency distribution. By forming a frequency distribution, we can summarize the data effectively. It is a method of presenting the data in a summarized form. A frequency distribution is also known as a frequency table.
Classes: A large number of observations varying over a wide range are usually classified into several groups according to the size of their values. Each of these groups is defined by an interval called a class interval. The class interval between 10 and 20 is written as 10-20.

Class limits: The smallest and largest possible values in each class of a
frequency distribution table are known as class limits. For the class 10-20, the
class limits are 10 and 20. 10 is called the lower class limit and 20 is called the
upper class limit.
Magnitude of a class interval: The difference between the upper and lower
limit of a class is called the magnitude of a class interval.

Class frequency: The number of observations falling within a class interval is called the class frequency of that class interval.
Construct a Frequency Distribution

A frequency distribution table is one way to organize data so that it makes more sense. The data so organized are called a frequency distribution, and the tabular form is called a frequency distribution table. Let us see, with the help of an example, how to construct such a table.
The frequency distribution table lists all the marks and also shows how many times (the frequency) each occurred.
The number that tells us how many times a particular value appears is called the frequency. For example, 2 marks have been scored by five students, which means the mark 2 occurs five times; therefore, the frequency of the score 2 is five. Similarly, the frequency of the mark 5 is three, because three students scored five marks.
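
The raw list of marks behind this example is not reproduced above, so the short Python sketch below uses a hypothetical list of 20 marks (chosen so that the mark 2 occurs five times and the mark 5 occurs three times) to show how such frequencies can be tallied:

from collections import Counter

# Hypothetical marks scored by 20 students (illustration only; the original
# raw data for this example is not shown above).
marks = [2, 5, 3, 2, 4, 5, 2, 1, 3, 2, 4, 5, 3, 2, 1, 4, 3, 4, 3, 4]

# Count how many times each mark occurs; this count is its frequency.
frequency = Counter(marks)

print("Mark  Frequency")
for mark in sorted(frequency):
    print(f"{mark:>4}  {frequency[mark]:>9}")

Each line of output pairs a mark with the number of times it appears, which is exactly the content of an ungrouped frequency distribution table.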

Grouped Data

Now consider the situation where we want to collect data on the test scores of five such classes, i.e. of 100 students. It becomes difficult to tally each and every score of all 100 students. Besides, the table we would obtain would be very long and not understandable at a glance. In this case, we use what is called a grouped frequency distribution table.

Such tables take into consideration groups of data in the form of class
intervals to tally the frequency for the data that belongs to that particular class
interval.

Take a look at the table below to understand the concept better:

Marks obtained in the test (Class Interval)    No. of students (Frequency)

0-5        3
5-10      11
10-15     38
15-20     34
20-25      9
25-30      5
Total    100

The first column here represents the marks obtained in class interval form. The
lowest number in a class interval is called the lower limit and the highest number
is called the upper limit. This example is a case of continuous class intervals as
the upper limit of one class is the lower limit of the following class.

Note that in the continuous case, an observation equal to the shared boundary of two classes is always included in the class where that value is the lower limit. For example, if a student scored 5 marks in the test, his marks would be included in the class interval 5-10 and not in 0-5.

Analogous to continuous class intervals are disjoint class intervals. An example of such a case would be 0-4, 5-9, 10-14, and so on. A frequency distribution can be constructed for disjoint classes as well, similar to how it is done above.
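
As a rough illustration of the grouping step, the Python sketch below tallies a small hypothetical list of test scores into the continuous class intervals 0-5, 5-10, ..., 25-30, placing a boundary score in the class where it is the lower limit, as described above:

# Hypothetical test scores (illustration only).
scores = [12, 5, 17, 23, 8, 14, 19, 3, 27, 11, 16, 21, 14, 9, 18]

edges = list(range(0, 31, 5))               # 0, 5, 10, ..., 30
counts = {(lo, lo + 5): 0 for lo in edges[:-1]}

for s in scores:
    for lo in edges[:-1]:
        if lo <= s < lo + 5:                # lower limit included, upper excluded
            counts[(lo, lo + 5)] += 1
            break

for (lo, hi), f in counts.items():
    print(f"{lo}-{hi}: {f}")

A score of exactly 5 is counted in the 5-10 class, matching the convention described above.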

Solved Example for You

Question: The following is the distribution for the age of the students in a school:

Age                0-5   5-10   10-15   15-20
No. of Students     35     45      50      30

Calculate:

 The lower limit of the first class interval.


 The class limits of the third class.
 The classmark for the interval 5-10.
 The class size.
Answer:

 The lower limit of the first class interval i.e. 0-5 is ‘0’.
 The class limits of the third class, i.e. 10-15 are 10 (lower limit) and 15 (upper
limit).
 The classmark is defined as the average of the upper and lower limits of a
class. For 5-10, the classmark is (5+10)/2 = 7.5
 The class size is the difference between the lower and upper class-limits. Here,
we have a uniform class size, which is equal to 5 (5 – 0, 10 – 5, 15 – 10, 20 – 15
are all equal to 5).
Relative Frequency Distribution

A relative frequency distribution is a distribution in which relative frequencies are recorded against each class interval. The relative frequency of a class is obtained by dividing the class frequency by the total frequency. In other words, the relative frequency is the proportion of the total frequency that falls in a given class interval of the frequency distribution.

Relative Frequency Distribution Table

If the frequencies in a frequency distribution table are converted into relative frequencies, the table is called a relative frequency distribution table. For a data set consisting of n values, if f is the frequency of a particular value, then the ratio f/n is called its relative frequency.
Solved Example
Question: Find the relative frequency from the data given below:

Class interval Frequency


20-25 10
25-30 12
30-35 8
35-40 20
40-45 11
45-50 4
50-55 5

Solution:

Relative frequency distribution table for the given data.

Here n = 70

Class interval   Frequency (f)   Relative frequency (f/n)

20-25    10    10 / 70 = 0.143
25-30    12    12 / 70 = 0.171
30-35     8     8 / 70 = 0.114
35-40    20    20 / 70 = 0.286
40-45    11    11 / 70 = 0.157
45-50     4     4 / 70 = 0.057
50-55     5     5 / 70 = 0.071
Total    n = 70
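
The relative frequencies above can be checked with a short Python sketch that simply divides each class frequency by the total frequency n = 70:

# Relative frequencies for the worked example above (f / n, with n = 70).
frequencies = {
    "20-25": 10, "25-30": 12, "30-35": 8, "35-40": 20,
    "40-45": 11, "45-50": 4, "50-55": 5,
}

n = sum(frequencies.values())               # total frequency, 70
for interval, f in frequencies.items():
    print(f"{interval}: {f}/{n} = {f / n:.3f}")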
Cumulative Frequency Distribution

One of the important types of frequency distribution is the cumulative frequency distribution. In a cumulative frequency distribution, the frequencies are shown in a cumulative manner: the cumulative frequency for each class interval is the frequency for that class interval added to the preceding cumulative total. Cumulative frequency can also be defined as the sum of all previous frequencies up to the current point.
Cumulative Relative Frequency Distribution

Cumulative relative frequency distribution is one type of frequency distribution.


The relative cumulative frequency is the cumulative frequency divided by the
total frequency.

Grouped Frequency Distribution



A grouped frequency distribution is an ordered listing of a variable X, grouped into classes in one column, with a listing of the frequencies in a second column (the frequency column). In other words, a grouped frequency distribution is an arrangement of class intervals and their corresponding frequencies in a table.

There are certain rules to be remembered while constructing a grouped frequency distribution:

1. The number of classes should be between 5 and 20.
2. If possible, the magnitude of the classes should be 5 or a multiple of 5.
3. The lower limit of the first class should be a multiple of 5.
4. Classes are shown in the first column and frequencies in the second column.
Grouped Frequency Distribution Table

An inclusive type of frequency distribution can be converted into an exclusive type, as in Table (b).

Cumulative Frequency Distributions

The cumulative frequency for each class interval is the frequency of that class interval added to the preceding cumulative total; in other words, it is the sum of the frequencies of all classes up to and including that class interval.
1. Less than cumulative frequency distribution:
It is obtained by adding successively the frequencies of all the previous classes, including the class against which it is written. The cumulation runs from the lowest class to the highest.
2. More than cumulative frequency distribution:
It is obtained by finding the cumulative total of frequencies starting from the highest class and moving down to the lowest class.
Now let’s see how to calculate the less than cumulative frequency distribution and
more than cumulative frequency distribution by solving an example problem:
Example:
The following frequency distribution table gives the marks obtained by 40
students:
Table (a)
Note: The frequencies can be added, as indicated by the arrows, to obtain the
cumulative frequency.
In the table(a), it is observed that 4 students got marks ‘less than 10’, 9 students
got marks ‘less than 20’ and so on.
Therefore, the above distribution is called ‘less than’ cumulative frequency
distribution.
Table (a) can be re-written as table (b).

Class Cumulative Frequency


Less than 10 4
Less than 20 9
Less than 30 21
Less than 40 32
Less than 50 40
Table (b)
In the same way, the ‘more than’ cumulative frequency distribution can be obtained by adding the frequencies in the reverse order.

Table (c)
Note: The frequencies can be added, as indicated by the arrows, to obtain the
cumulative frequency.
Table (c) can be re-written as table (d)
Class Cumulative Frequency (c.f.)
More than 0 40
More than 10 36
More than 20 31
More than 30 19
More than 40 8
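
Both kinds of cumulative totals can be produced with a short Python sketch. The class frequencies used below (4, 5, 12, 11, 8) are recovered from the ‘less than’ table above:

# Class frequencies for the 40-student marks example, recovered from the
# "less than" cumulative totals 4, 9, 21, 32, 40.
classes = ["0-10", "10-20", "20-30", "30-40", "40-50"]
freqs = [4, 5, 12, 11, 8]

# "Less than" cumulative: running total from the lowest class upwards.
less_than, running = [], 0
for f in freqs:
    running += f
    less_than.append(running)

# "More than" cumulative: running total from the highest class downwards.
more_than, running = [], 0
for f in reversed(freqs):
    running += f
    more_than.insert(0, running)

for interval, lt, mt in zip(classes, less_than, more_than):
    lower, upper = interval.split("-")
    print(f"Less than {upper}: {lt}    More than {lower}: {mt}")

The output reproduces tables (b) and (d) above.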

Frequency Polygon
To draw a frequency polygon, we begin by drawing a histogram and then follow these steps:

 Step 1 - Choose the class intervals and mark the values on the horizontal axis.
 Step 2 - Mark the mid value of each interval on the horizontal axis.
 Step 3 - Mark the frequency of each class on the vertical axis.
 Step 4 - Corresponding to the frequency of each class interval, mark a point at that height above the midpoint of the class interval.
 Step 5 - Connect these points using line segments.
 Step 6 - The figure so obtained is a frequency polygon.
Let us consider an example to understand this in a better way.
Example 1: In a batch of 400 students, the height of students is given
in the following table. Represent it through a frequency polygon.

Solution: The following steps are used to construct a histogram from the given data:
 The heights are represented on the horizontal axis on a suitable scale, as shown.
 The number of students is represented on the vertical axis on a suitable scale, as shown.
 Now rectangular bars with widths equal to the class size and heights corresponding to the frequencies of the class intervals are drawn.
ABCDEF represents the given data graphically in the form of a frequency polygon as:

Frequency polygons can also be drawn independently, without drawing histograms. For this, the midpoints of the class intervals, known as class marks, are used to plot the points.
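
Since the height table for Example 1 is not reproduced above, the Python sketch below uses hypothetical class marks and frequencies (totalling 400 students), assuming matplotlib is available, just to show how the points are plotted and joined:

# Frequency polygon sketch (hypothetical class marks and frequencies).
import matplotlib.pyplot as plt

class_marks = [145, 155, 165, 175, 185]     # midpoints of the height intervals
frequencies = [40, 120, 150, 70, 20]        # number of students in each class

plt.plot(class_marks, frequencies, marker="o")   # points joined by line segments
plt.xlabel("Height (class marks)")
plt.ylabel("Number of students")
plt.title("Frequency polygon")
plt.show()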

Weighted Arithmetic Mean


When calculating the arithmetic mean, all the items are treated as equally important. However, there may be situations in which the items under consideration are not of equal importance. For example, when we want to find the average number of marks per subject scored by a student in subjects like mathematics, statistics, physics and biology, these subjects may not carry equal importance. The arithmetic mean computed by taking into account the relative importance of each item is called the weighted arithmetic mean. To give due importance to each item under consideration, we assign a number called a weight to each item in proportion to its relative importance.
The weighted arithmetic mean is computed by using the following formula:

X̄w = Σwx / Σw

Here:
X̄w stands for the weighted arithmetic mean,
x stands for the values of the items, and
w stands for the weight of each item.
Example:
A student obtained the marks 40, 50, 60, 80, and 45 in math, statistics, physics,
chemistry and biology respectively. Assuming weights 5, 2, 4, 3, and 1
respectively for the above mentioned subjects, find the weighted arithmetic mean
per subject.

Solution:

Subject       Marks obtained (x)   Weight (w)   wx

Math                  40               5        200
Statistics            50               2        100
Physics               60               4        240
Chemistry             80               3        240
Biology               45               1         45
Total                            Σw = 15   Σwx = 825

Now we will find the weighted arithmetic mean as:

X̄w = Σwx / Σw = 825 / 15 = 55 marks per subject.
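
The same calculation can be written as a short Python sketch, multiplying each mark by its weight, summing the products, and dividing by the sum of the weights:

# Weighted arithmetic mean for the marks example above: sum(w * x) / sum(w).
marks = [40, 50, 60, 80, 45]     # math, statistics, physics, chemistry, biology
weights = [5, 2, 4, 3, 1]

weighted_mean = sum(w * x for w, x in zip(weights, marks)) / sum(weights)
print(weighted_mean)             # 55.0 marks per subject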
As an example, suppose that a marketing firm conducts a survey of 1,000
households to determine the average number of TVs each household owns. The
data show a large number of households with two or three TVs and a smaller
number with one or four. Every household in the sample has at least one TV and
no household has more than four. Here’s the sample data for the survey:

Number of TVs per Household Number of Households

1 73

2 378

3 459

4 90

Because many of the values in this data set are repeated multiple times, you can
easily compute the sample mean as a weighted mean. Doing so is quicker than
summing each value in the data set and dividing by the sample size.

Follow these steps to calculate the weighted arithmetic mean:


1. Assign a weight to each value in the data set:
X1 = 1, w1 = 73
X2 = 2, w2 = 378
X3 = 3, w3 = 459
X4 = 4, w4 = 90
2. Compute the numerator of the weighted mean formula.
Multiply each sample value by its weight and then add the products together:
(1)(73) + (2)(378) + (3)(459) + (4)(90) = 73 + 756 + 1,377 + 360 = 2,566
3. Compute the denominator of the weighted mean formula by adding the weights together:
73 + 378 + 459 + 90 = 1,000
4. Divide the numerator by the denominator:
2,566 / 1,000 = 2.566
The mean number of TVs per household in this sample is 2.566.

MEAN OF GROUPED DATA


MEDIAN OF GROUPED DATA
RANGE:
HIGHEST VALUE − LOWEST VALUE

The Range is the difference between the lowest and highest values.
Example: In {4, 6, 9, 3, 7} the lowest value is 3, and the highest is 9.

So the range is 9 − 3 = 6.

The Range Can Be Misleading

The range can sometimes be misleading when there are extremely high or low
values.

Example: In {8, 11, 5, 9, 7, 6, 3616}:

 the lowest value is 5,


 and the highest is 3616,

So the range is 3616 − 5 = 3611.

The single value of 3616 makes the range large, but most values are around 10.

So we may be better off using the Interquartile Range or Standard Deviation.

Problem: Cheryl took 7 math tests in one marking period. What is the range of
her test scores?

89, 73, 84, 91, 87, 77, 94

Solution: Ordering the test scores from least to greatest, we get:

73, 77, 84, 87, 89, 91, 94

highest - lowest = 94 - 73 = 21

Answer: The range of these test scores is 21 points.

Definition: The range of a set of data is the difference between the highest and
lowest values in the set.
Mean Deviation

How far, on average, all values are from the middle.

Calculating It

Find the mean of all values ... use it to work out distances ... then find the mean of
those distances!

In three steps:

 1. Find the mean of all values


 2. Find the distance of each value from that mean (subtract the mean from each
value, ignore minus signs)
 3. Then find the mean of those distances
It tells us how far, on average, all values are from the middle.

In that example the values are, on average, 3.75 away from the middle.

For deviation just think distance


Deviation just means how far from the normal
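
The worked example referred to above is not reproduced here, so the Python sketch below uses a hypothetical set of eight values, chosen so that the mean deviation comes out to 3.75:

# Mean (absolute) deviation: the average distance of the values from their mean.
# Hypothetical data set, chosen so the mean deviation works out to 3.75.
values = [3, 6, 6, 7, 8, 11, 15, 16]

mean = sum(values) / len(values)                 # 9.0
distances = [abs(v - mean) for v in values]      # distances, ignoring minus signs
mean_deviation = sum(distances) / len(distances)
print(mean_deviation)                            # 3.75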

Standard Deviation

The Standard Deviation is a measure of how spread out numbers are.

Its symbol is σ (the Greek letter sigma)

The formula is easy: it is the square root of the Variance. So now you ask,
"What is the Variance?"
Example

You and your friends have just measured the heights of your dogs (in
millimetres):

The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and
300mm.

Find out the Mean, the Variance, and the Standard Deviation.

Your first step is to find the Mean:
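
The worked arithmetic is not shown above; a minimal Python sketch of the calculation, using the population form of the variance (dividing by N), is:

# Mean, variance and standard deviation for the dog heights above.
import math

heights = [600, 470, 170, 430, 300]              # millimetres

mean = sum(heights) / len(heights)               # 394.0 mm
squared_diffs = [(h - mean) ** 2 for h in heights]
variance = sum(squared_diffs) / len(heights)     # 21704.0 (population form)
std_dev = math.sqrt(variance)                    # about 147.3 mm

print(mean, variance, round(std_dev, 1))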


Properties of the Standard Deviation
In terms of measuring the variability of spread of data, we've seen that
the standard deviation is the preferred and most used measure.

Some additional things to think about the standard deviation:

1. The standard deviation is the typical or average distance of a value from the mean.
2. If all the values are the same, then the standard deviation is 0.

Coefficient of Variation (CV)


If you know nothing about the data other than the mean, one way to
interpret the relative magnitude of the standard deviation is to divide it by
the mean. This is called the coefficient of variation. For example, if the mean
is 80 and standard deviation is 12, the cv = 12/80 = .15 or 15%.

The CV is a measure of relative variation (dispersion).

SKEWNESS

If dispersion measures the amount of variation, then the direction of variation is measured by skewness. The most commonly used measure of skewness is Karl Pearson's measure, denoted by Skp. It is a relative measure of skewness, usually given as Skp = (Mean − Mode) / Standard deviation, or, using the median, 3(Mean − Median) / Standard deviation. When the distribution is symmetrical, the value of the coefficient of skewness is zero because the mean, median and mode coincide. If the coefficient of skewness is positive, the distribution is positively skewed; when it is negative, the distribution is negatively skewed. In terms of moments, skewness is represented by the third standardized moment, μ3 / σ³.
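
As a small illustration, the Python sketch below computes the median-based form of Karl Pearson's coefficient, 3(Mean − Median) / Standard deviation, for a hypothetical right-skewed data set:

# Karl Pearson's skewness coefficient, median-based form:
# Skp = 3 * (mean - median) / standard deviation.
import statistics

data = [2, 3, 3, 4, 5, 6, 6, 7, 9, 15]           # hypothetical, right-skewed data

mean = statistics.mean(data)                     # 6.0
median = statistics.median(data)                 # 5.5
stdev = statistics.pstdev(data)                  # population standard deviation

skp = 3 * (mean - median) / stdev
print(round(skp, 2))                             # positive, so positively skewed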
Scatter Plots

A Scatter (XY) Plot has points that show the relationship between two sets of
data.

In this example, each dot shows one person's weight versus their height.
(The data is plotted on the graph as " Cartesian (x,y) Coordinates ")

Example:

The local ice cream shop keeps track of how much ice cream they sell versus the
noon temperature on that day. Here are their figures for the last 12 days:

Ice Cream Sales vs Temperature

Temperature °C Ice Cream Sales

14.2° $215

16.4° $325

11.9° $185

15.2° $332

18.5° $406

22.1° $522

19.4° $412

25.1° $614

23.4° $544

18.1° $421

22.6° $445

17.2° $408
And here is the same data as a Scatter Plot:

It is now easy to see that warmer weather leads to more sales, but the
relationship is not perfect.

Line of Best Fit

We can also draw a "Line of Best Fit" (also called a "Trend Line") on our scatter
plot:
Try to have the line as close as possible to all points, and as many points
above the line as below.
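
One way to reproduce the scatter plot and draw a least-squares trend line for the ice cream data, assuming numpy and matplotlib are available, is sketched below:

# Scatter plot of the ice cream data with a least-squares line of best fit.
import numpy as np
import matplotlib.pyplot as plt

temperature = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1,
               19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales = [215, 325, 185, 332, 406, 522,
         412, 614, 544, 421, 445, 408]

slope, intercept = np.polyfit(temperature, sales, 1)    # degree-1 (straight line) fit

plt.scatter(temperature, sales)
xs = np.linspace(min(temperature), max(temperature), 100)
plt.plot(xs, slope * xs + intercept)                    # line of best fit
plt.xlabel("Temperature (°C)")
plt.ylabel("Ice cream sales ($)")
plt.show()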

CORRELATION :
Correlation Coefficients
While examining scatterplots gives us some idea about the relationship between
two variables, we use a statistic called the correlation coefficient to give us a
more precise measurement of the relationship between the two variables. The
correlation coefficient is an index that describes the relationship and can take on
values between −1.0 and +1.0, with a positive correlation coefficient indicating a
positive correlation and a negative correlation coefficient indicating a negative
correlation.
The absolute value of the coefficient indicates the magnitude, or the strength, of
the relationship. The closer the absolute value of the coefficient is to 1, the
stronger the relationship. For example, a correlation coefficient of 0.20 indicates
that there is a weak linear relationship between the variables, while a
coefficient of −0.90 indicates that there is a strong linear relationship.
The value of a perfect positive correlation is 1.0, while the value of a perfect
negative correlation is −1.0.

When there is no linear relationship between two variables, the correlation coefficient is 0. It is important to remember that a correlation coefficient of 0 indicates that there is no linear relationship, but there may still be a strong relationship between the two variables. For example, there could be a quadratic relationship between them.

Karl Pearson’s Coefficient of Correlation


Definition: Karl Pearson’s Coefficient of Correlation is a widely used mathematical method in which a numerical expression is used to calculate the degree and direction of the relationship between linearly related variables.

Pearson’s method, popularly known as the Pearsonian Coefficient of Correlation, is the most extensively used quantitative method in practice. The coefficient of correlation is denoted by “r”.

If the relationship between two variables X and Y is to be ascertained, then the following formula is used:

r = Σ(X − X̄)(Y − Ȳ) / √[ Σ(X − X̄)² × Σ(Y − Ȳ)² ]

where X̄ and Ȳ are the means of X and Y.
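
A minimal Python sketch of this formula, applied to a small hypothetical pair of data sets, is:

# Karl Pearson's coefficient of correlation for two small lists,
# following the deviation-from-the-mean formula above.
import math

x = [1, 2, 3, 4, 5]              # hypothetical paired observations
y = [2, 4, 5, 4, 5]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
var_x = sum((xi - mean_x) ** 2 for xi in x)
var_y = sum((yi - mean_y) ** 2 for yi in y)

r = cov / math.sqrt(var_x * var_y)
print(round(r, 3))               # a value between -1 and +1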
Properties of Coefficient of Correlation

 The value of the coefficient of correlation (r) always lies between −1 and +1. In particular:
r = +1, perfect positive correlation
r = -1, perfect negative correlation
r = 0, no correlation
 The coefficient of correlation is zero when the variables X and Y are independent. However, the converse is not true.

Assumptions of Karl Pearson’s Coefficient of Correlation

1. The relationship between the variables is linear, which means that when the two variables are plotted, the points tend to lie along a straight line.
2. There are a large number of independent causes affecting the variables under study, so that they form a normal distribution. Variables such as price, demand and supply, for example, are affected by many such factors.
3. The variables are independent of each other.
4. A cause-and-effect relationship exists between the two variables.

Note: The coefficient of correlation measures not only the magnitude of correlation but also its direction. For example, r = -0.67 shows that the correlation is negative, because the sign is “-”, and that its magnitude is 0.67.
Merits:
1. This method indicates the presence or absence of correlation between two
variables and gives the exact degree of their correlation.

2. In this method, we can also ascertain the direction of the correlation: positive or negative.
Demerits:
1. It is more difficult to calculate than other methods.

2. It is much affected by the values of extreme items.

3. It is based on many assumptions, such as a linear relationship and a cause-and-effect relationship, which may not always hold good.

4. It is very likely to be misinterpreted in the case of homogeneous data.

Spearman Rank Correlation


The Spearman rank correlation coefficient, rs, is the nonparametric version of
the Pearson correlation coefficient.

+1 = a perfect positive correlation between ranks


-1 = a perfect negative correlation between ranks
0 = no correlation between ranks.
Method - calculating the coefficient
 Create a table from your data.
 Rank the two data sets. Ranking is achieved by giving the ranking '1' to the
biggest number in a column, '2' to the second biggest value and so on. The
smallest value in the column will get the lowest ranking. This should be done
for both sets of measurements.
 Tied scores are given the mean (average) rank. For example, the three tied
scores of 1 euro in the example below are ranked fifth in order of price, but
occupy three positions (fifth, sixth and seventh) in a ranking hierarchy of
ten. The mean rank in this case is calculated as (5+6+7) ÷ 3 = 6.
 Find the difference in the ranks (d): This is the difference between the
ranks of the two values on each row of the table. The rank of the second
value (price) is subtracted from the rank of the first (distance from the
museum).
 Square the differences (d²) to remove negative values, and then sum them (Σd²).

 Find Σd² by adding up all the values in the Difference² column. In our example this is 285.5. Multiplying this by 6 gives 1713.
 Now for the bottom line of the equation. The value n is the number of sites at which you took measurements; in our example this is 10. Substituting into n³ − n we get 1000 − 10 = 990.
 We now have the formula Rs = 1 − 6Σd² / (n³ − n) = 1 − (1713 / 990), which gives a value for Rs of 1 − 1.73 = -0.73.

What does this Rs value of -0.73 mean?


The closer Rs is to +1 or -1, the stronger the likely correlation. A perfect positive
correlation is +1 and a perfect negative correlation is -1. The Rs value of -0.73
suggests a fairly strong negative relationship.
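
The distance and price figures for this worked example are not reproduced above, so the Python sketch below applies the same formula, Rs = 1 − 6Σd² / (n³ − n), to a hypothetical set of ten paired measurements (with three prices tied at 1.00 euro, sharing the mean rank 6 as described above):

# Spearman's rank correlation: Rs = 1 - 6 * sum(d^2) / (n^3 - n).
def average_ranks(values):
    # Rank 1 goes to the biggest value, as in the method above; tied values
    # share the mean of the ranks they occupy.
    ordered = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1             # first position occupied by v
        count = ordered.count(v)
        ranks.append(first + (count - 1) / 2)    # mean of the tied positions
    return ranks

# Hypothetical data: distance from the museum (metres) and price (euros).
distance = [50, 200, 350, 500, 650, 800, 950, 1100, 1250, 1400]
price = [1.80, 1.20, 2.00, 1.00, 1.00, 1.20, 0.80, 0.60, 1.00, 0.50]

rank_d = average_ranks(distance)
rank_p = average_ranks(price)

n = len(distance)
d_squared = sum((a - b) ** 2 for a, b in zip(rank_d, rank_p))
rs = 1 - 6 * d_squared / (n ** 3 - n)
print(round(rs, 2))    # about -0.79: price tends to fall as distance increases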

Coefficient of Determination (R Squared)


The coefficient of determination, R2, is similar to the correlation
coefficient, R. The correlation coefficient formula will tell you how strong
of a linear relationship there is between two variables. R Squared is the
square of the correlation coefficient, r (hence the term r squared).

WHAT DOES THE COEFFICIENT OF DETERMINATION MEAN:

Correlation measures linear relationship between two variables, while coefficient


of determination (R-squared) measures explained variation.
For example, height and weight of individuals are correlated. If the correlation coefficient is r = 0.8, there is a high positive correlation. What does that mean?
It means that height and weight of individuals increase or decrease together (positive) and that their linear relationship is strong. The scatter plot is something like:

But the height of individuals may also be affected by other factors, like age, genetics, food intake, amount and type of exercise, location, etc.
So, if we try to predict height by using weight as a single predictor, the coefficient of determination is 0.64 (equal to the square of the correlation coefficient here). How do we interpret it?
It shows that 0.64 (or 64%) of the variation in height can be explained by weight, and the remaining 36% of the variation in height may be due to other factors that affect the height of individuals, like age, genetics, food intake, amount and type of exercise, location, etc.

Coefficient of Determination

The coefficient of determination (denoted by R2) is a key output


of regression analysis. It is interpreted as the proportion of the variance in the
dependent variable that is predictable from the independent variable.

 The coefficient of determination is the square of the correlation (r) between


predicted y scores and actual y scores; thus, it ranges from 0 to 1.
 With linear regression, the coefficient of determination is also equal to the
square of the correlation between x and y scores.
 An R2 of 0 means that the dependent variable cannot be predicted from the
independent variable.
 An R2 of 1 means the dependent variable can be predicted without error
from the independent variable.
 An R2 between 0 and 1 indicates the extent to which the dependent variable
is predictable. An R2 of 0.10 means that 10 percent of the variance in Y is
predictable from X; an R2 of 0.20 means that 20 percent is predictable; and
so on.

REGRESSION ANALYSIS

Regression analysis is used in stats to find trends in data. For example, you might
guess that there’s a connection between how much you eat and how much you
weigh; regression analysis can help you quantify that. Regression analysis will
provide you with an equation for a graph so that you can make predictions about
your data. For example, if you’ve been putting on weight over the last few years, it
can predict how much you’ll weigh in ten years time if you continue to put on
weight at the same rate.

In regression analysis, those factors are called variables. You have your
dependent variable — the main factor that you’re trying to understand or
predict.

And then you have your independent variables — the factors you suspect have
an impact on your dependent variable.

What is regression analysis?

Redman offers this example scenario: Suppose you’re a sales manager trying to
predict next month’s numbers. You know that dozens, perhaps even hundreds of
factors from the weather to a competitor’s promotion to the rumor of a new and
improved model can impact the number.

Regression analysis is a way of mathematically sorting out which of those


variables does indeed have an impact. It answers the questions: Which factors
matter most? Which can we ignore? How do those factors interact with each
other? And, perhaps most importantly, how certain are we about all of these
factors?
In simple linear regression, we predict scores on one variable from the scores on a second variable. The variable we are predicting is called the criterion variable and is referred to as Y. The variable we are basing our predictions on is called the predictor variable and is referred to as X. When there is only one predictor variable, the prediction method is called simple regression. In simple linear regression, the topic of this section, the predictions of Y, when plotted as a function of X, form a straight line.

The example data in Table 1 are plotted in Figure 1. You can see that
there is a positive relationship between X and Y. If you were going to
predict Y from X, the higher the value of X, the higher your prediction of
Y.
Table 1. Example data.

X Y
1.00 1.00

2.00 2.00

3.00 1.30

4.00 3.75

5.00 2.25

Linear regression consists of finding the best-fitting straight line


through the points. The best-fitting line is called a regression line. The
black diagonal line in Figure 2 is the regression line and consists of the
predicted score on Y for each possible value of X. The vertical lines from
the points to the regression line represent the errors of prediction. As you
can see, the red point is very near the regression line; its error of
prediction is small. By contrast, the yellow point is much higher than the
regression line and therefore its error of prediction is large.

Figure 2. A scatter plot of the example data. The black line consists of the
predictions, the points are the actual data, and the vertical lines between
the points and the black line represent errors of prediction.
The error of prediction for a point is the value of the point minus the
predicted value (the value on the line). Table 2 shows the predicted
values (Y') and the errors of prediction (Y-Y'). For example, the first point
has a Y of 1.00 and a predicted Y (called Y') of 1.21. Therefore, its error
of prediction is -0.21.
Table 2. Example data.

X      Y      Y'      Y-Y'      (Y-Y')²


1.00 1.00 1.210 -0.210 0.044

2.00 2.00 1.635 0.365 0.133


3.00 1.30 2.060 -0.760 0.578

4.00 3.75 2.485 1.265 1.600

5.00 2.25 2.910 -0.660 0.436

You may have noticed that we did not specify what is meant by "best-
fitting line." By far, the most commonly-used criterion for the best-fitting
line is the line that minimizes the sum of the squared errors of prediction.
That is the criterion that was used to find the line in Figure 2. The last
column in Table 2 shows the squared errors of prediction. The sum of the
squared errors of prediction shown in Table 2 is lower than it would be
for any other regression line.
The formula for a regression line is

Y' = bX + A

where Y' is the predicted score, b is the slope of the line, and A is the Y
intercept.
COMPUTING THE REGRESSION LINE
In the age of computers, the regression line is typically computed with
statistical software. However, the calculations are relatively easy, and are
given here for anyone who is interested. The calculations are based on
the statistics shown in Table 3. MX is the mean of X, MY is the mean of Y,
sX is the standard deviation of X, sY is the standard deviation of Y, and r
is the correlation between X and Y.

Formula for standard deviation: sX = √[ Σ(X − MX)² / (N − 1) ]

Formula for correlation: r = Σ(X − MX)(Y − MY) / [ (N − 1) sX sY ]

Table 3. Statistics for computing the regression line.


MX MY sX sY r

3 2.06 1.581 1.072 0.627

The slope (b) can be calculated as follows:

b = r sY/sX
and the intercept (A) can be calculated as

A = MY - bMX.

For these data,

b = (0.627)(1.072)/1.581 = 0.425

A = 2.06 - (0.425)(3) = 0.785
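
These calculations can be checked with a short Python sketch using the Table 1 data; it also reproduces the predicted Y' values shown in Table 2:

# Regression line for the Table 1 data: b = r * sY / sX, A = MY - b * MX.
import statistics

X = [1.00, 2.00, 3.00, 4.00, 5.00]
Y = [1.00, 2.00, 1.30, 3.75, 2.25]

MX, MY = statistics.mean(X), statistics.mean(Y)
sX, sY = statistics.stdev(X), statistics.stdev(Y)    # sample standard deviations

# Pearson correlation between X and Y.
r = sum((x - MX) * (y - MY) for x, y in zip(X, Y)) / ((len(X) - 1) * sX * sY)

b = r * sY / sX                  # slope, about 0.425
A = MY - b * MX                  # intercept, about 0.785

print(round(r, 3), round(b, 3), round(A, 3))
print([round(b * x + A, 3) for x in X])              # predicted Y' values (Table 2)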
