Among those numbers, a few may be repeated twice or even more than twice. The
number of times a value is repeated in a data set is termed the frequency of that
value. The frequencies of the values in a data set are listed in a table. This
table is known as a frequency distribution table, and the list itself is referred
to as a frequency distribution.
When the data are grouped into classes of appropriate size, and the number of
observations in each class is recorded, we get a frequency distribution. Forming a
frequency distribution summarizes the data effectively: it is a method of
presenting the data in a summarized form. A frequency distribution is also known
as a frequency table.
Classes: A large number of observations varying in a wide range are usually
classified in several groups according to the size of their values. Each of these
groups is defined by an interval called class interval. The class interval between
10 and 20 is defined as 10-20.
Class limits: The smallest and largest possible values in each class of a
frequency distribution table are known as class limits. For the class 10-20, the
class limits are 10 and 20. 10 is called the lower class limit and 20 is called the
upper class limit.
Magnitude of a class interval: The difference between the upper and lower
limit of a class is called the magnitude of a class interval.
A frequency distribution table is one way to organize data so that it makes more
sense. Data arranged this way is called a frequency distribution, and its tabular
form is called a frequency distribution table. Let us see, with the help of an
example, how to construct a frequency distribution table.
The frequency distribution table lists all the marks and also shows how many
times each of them occurred (its frequency).
The number that tells us how many times a particular value appears is called its
frequency. For example, 2 marks were scored by five students, which means the
score 2 occurs five times. Therefore, the frequency of the score 2 is five. Similarly,
the frequency of the score 5 is three because three students scored 5 marks.
Grouped Data
Now consider the situation where we want to collect data on the test scores of five
such classes, i.e. of 100 students. It becomes difficult to tally each and every
score for all 100 students. Besides, the resulting table would be very long and
hard to read at a glance. In this case, we use what is called
a grouped frequency distribution table.
Such tables divide the data into class intervals and tally the frequency of the
observations that fall in each class interval.
Marks (class interval)   Number of students (frequency)
0-5                      3
5-10                     11
10-15                    38
15-20                    34
20-25                    9
25-30                    5
Total                    100
The first column here represents the marks obtained in class interval form. The
lowest number in a class interval is called the lower limit and the highest number
is called the upper limit. This example is a case of continuous class intervals as
the upper limit of one class is the lower limit of the following class.
Note that with continuous class intervals, an observation equal to a boundary value
is always included in the class where that value is the lower limit. For example,
a student who scored 5 marks in the test would be counted in the class interval
5-10 and not in 0-5.
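The boundary rule can be sketched in a few lines of Python. The scores below are hypothetical (they are not the 100 scores behind the table) and are chosen only so that some of them sit exactly on a class boundary; the helper name `class_interval` is likewise just an illustration.

```python
from collections import Counter

# Hypothetical test scores (not from the text), including boundary values.
scores = [0, 3, 5, 5, 9, 10, 14, 15, 19, 20, 24, 25, 29]
class_width = 5  # uniform class size, as in the table above


def class_interval(score, width=class_width):
    """Return the (lower limit, upper limit) of the class containing score.

    A score equal to a boundary goes to the class where it is the
    lower limit, so 5 falls in 5-10 and not in 0-5.
    """
    lower = (score // width) * width
    return lower, lower + width


# Tally how many scores fall in each class interval.
frequency = Counter(class_interval(s) for s in scores)

for (lower, upper), count in sorted(frequency.items()):
    print(f"{lower}-{upper}\t{count}")
```

Integer division by the class width is one simple way to apply the rule when all classes have the same size; unequal class sizes would need an explicit list of limits instead.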
Question: The following is the distribution for the age of the students in a school:
Calculate:
The lower limit of the first class interval i.e. 0-5 is ‘0’.
The class limits of the third class, i.e. 10-15 are 10 (lower limit) and 15 (upper
limit).
The classmark is defined as the average of the upper and lower limits of a
class. For 5-10, the classmark is (5+10)/2 = 7.5
The class size is the difference between the lower and upper class-limits. Here,
we have a uniform class size, which is equal to 5 (5 – 0, 10 – 5, 15 – 10, 20 – 15
are all equal to 5).
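As a quick check of these definitions, the sketch below computes the class mark and class size for the class intervals 0-5 through 25-30 used above. The variable names are illustrative only.

```python
# Class intervals from the example above, as (lower limit, upper limit).
classes = [(0, 5), (5, 10), (10, 15), (15, 20), (20, 25), (25, 30)]

# Class mark: average of the lower and upper limits.
class_marks = [(lower + upper) / 2 for lower, upper in classes]

# Class size: upper limit minus lower limit.
class_sizes = [upper - lower for lower, upper in classes]

for (lower, upper), mark, size in zip(classes, class_marks, class_sizes):
    print(f"{lower}-{upper}: class mark = {mark}, class size = {size}")
```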
Relative Frequency Distribution
Solution:
Here n = 70
The cumulative frequency of a class interval is the frequency of that class
interval added to the cumulative total of the preceding intervals. In other words,
it is the sum of the frequencies of all class intervals up to and including the
current one.
1. Less than cumulative frequency distribution:
It is obtained by successively adding the frequencies of all the previous classes,
including the class against which it is written. The accumulation runs from the
lowest class to the highest.
2. More than cumulative frequency distribution:
It is obtained by finding the cumulative total of the frequencies, starting from the
highest class and moving down to the lowest.
Now let’s see how to calculate the less than cumulative frequency distribution and
more than cumulative frequency distribution by solving an example problem:
Example:
The following frequency distribution table gives the marks obtained by 40
students:
Table (a)
Note: The frequencies can be added, as indicated by the arrows, to obtain the
cumulative frequency.
In Table (a), it is observed that 4 students got marks ‘less than 10’, 9 students
got marks ‘less than 20’, and so on.
Therefore, the above distribution is called ‘less than’ cumulative frequency
distribution.
Table (a) can be re-written as table (b).
Table (c)
Note: The frequencies can be added, as indicated by the arrows, to obtain the
cumulative frequency.
Table (c) can be re-written as table (d)
Class Cumulative Frequency (c.f.)
More than 0 40
More than 10 36
More than 20 31
More than 30 19
More than 40 8
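Both kinds of cumulative frequency can be reproduced with a short Python sketch. The class frequencies are not listed in the text; the values below are inferred by differencing the "more than" totals in Table (d), and the upper limit of the last class (50) is an assumption based on a uniform class size of 10.

```python
from itertools import accumulate

# Frequencies inferred from Table (d): 40-36, 36-31, 31-19, 19-8, 8.
lower_limits = [0, 10, 20, 30, 40]
upper_limits = [10, 20, 30, 40, 50]  # last upper limit is an assumption
frequencies = [4, 5, 12, 11, 8]

# "Less than" type: running total from the lowest class upward.
less_than = list(accumulate(frequencies))

# "More than" type: running total from the highest class downward.
more_than = list(accumulate(reversed(frequencies)))[::-1]

for upper, cf in zip(upper_limits, less_than):
    print(f"Less than {upper}: {cf}")
for lower, cf in zip(lower_limits, more_than):
    print(f"More than {lower}: {cf}")
```

The "more than" column matches Table (d), and the first two "less than" totals (4 and 9) match the values quoted for Table (a).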
Frequency Polygon
To draw a frequency polygon, we can begin by drawing a histogram and then follow
these steps:
Step 1- Choose the class intervals and mark their values on the horizontal
axis.
Step 2- Mark the mid value of each interval on the horizontal axis.
Step 3- Mark the frequencies on the vertical axis.
Step 4- For each class interval, mark a point above the middle of the interval
at a height equal to its frequency.
Step 5- Connect these points with line segments.
Step 6- The resulting figure is a frequency polygon.
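The steps above amount to computing one point per class and joining the points in order. The sketch below does this for the grouped marks table given earlier (0-5 through 25-30); it only computes the coordinates and leaves the drawing to whatever plotting tool is at hand. Closing the polygon on the horizontal axis with a zero-frequency class on each side is a common convention, not something the steps state, so treat it as an assumption.

```python
# Grouped marks table from earlier in the text: (lower, upper, frequency).
table = [(0, 5, 3), (5, 10, 11), (10, 15, 38),
         (15, 20, 34), (20, 25, 9), (25, 30, 5)]

# Steps 2-4: one point per class at (mid value, frequency).
points = [((lower + upper) / 2, freq) for lower, upper, freq in table]

# Assumption: add a zero-frequency class on each side so the polygon
# starts and ends on the horizontal axis.
width = table[0][1] - table[0][0]
first_mid, last_mid = points[0][0], points[-1][0]
polygon = [(first_mid - width, 0)] + points + [(last_mid + width, 0)]

# Step 5: consecutive points are joined by line segments.
segments = list(zip(polygon, polygon[1:]))

for start, end in segments:
    print(f"{start} -> {end}")
```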
Let us consider an example to understand this in a better way.
Example 1: In a batch of 400 students, the height of students is given
in the following table. Represent it through a frequency polygon.
$$\bar{X}_w = \frac{\sum wx}{\sum w}$$
Here:
$\bar{X}_w$ stands for the weighted arithmetic mean,
$x$ stands for the values of the items, and
$w$ stands for the weight of each item.
Example:
A student obtained the marks 40, 50, 60, 80, and 45 in math, statistics, physics,
chemistry and biology respectively. Assuming weights 5, 2, 4, 3, and 1
respectively for the above mentioned subjects, find the weighted arithmetic mean
per subject.
Solution:
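Subject      Marks obtained (x)   Weight (w)   wx
Math         40                   5            200
Statistics   50                   2            100
Physics      60                   4            240
Chemistry    80                   3            240
Biology      45                   1            45
Total                             15           825
Weighted arithmetic mean = 825/15 = 55 marks per subject.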
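The weighted arithmetic mean for the marks example can be checked with a few lines of Python. This is only a sketch of the formula given above; the variable names are illustrative.

```python
# Marks and weights from the example above, in subject order.
subjects = ["math", "statistics", "physics", "chemistry", "biology"]
marks = [40, 50, 60, 80, 45]   # x
weights = [5, 2, 4, 3, 1]      # w

# One wx product per subject.
weighted_products = [w * x for w, x in zip(weights, marks)]

sum_wx = sum(weighted_products)
sum_w = sum(weights)
weighted_mean = sum_wx / sum_w  # sum of wx divided by sum of w

for subject, x, w, wx in zip(subjects, marks, weights, weighted_products):
    print(f"{subject}: x = {x}, w = {w}, wx = {wx}")
print(f"sum wx = {sum_wx}, sum w = {sum_w}, weighted mean = {weighted_mean}")
```

For comparison, the unweighted mean of the same five marks is 275/5 = 55 as well, which is a coincidence of this data set rather than a general property.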
1 73
2 378
3 459
4 90
Because many of the values in this data set are repeated multiple times, you can
easily compute the sample mean as a weighted mean. Doing so is quicker than
summing each value in the data set and dividing by the sample size.
The Range is the difference between the lowest and highest values.
Example: In {4, 6, 9, 3, 7} the lowest value is 3, and the highest is 9.
So the range is 9 − 3 = 6.
The range can sometimes be misleading when there are extremely high or low
values.
The single value of 3616 makes the range large, but most values are around 10.
Problem: Cheryl took 7 math tests in one marking period. What is the range of
her test scores?
highest - lowest = 94 - 73 = 21
Definition: The range of a set of data is the difference between the highest and
lowest values in the set.
Mean Deviation
Calculating It
Find the mean of all values ... use it to work out distances ... then find the mean of
those distances!
In three steps:
In that example the values are, on average, 3.75 away from the middle.
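The calculation can be sketched in Python as three short steps. The data set behind the 3.75 figure is not reproduced in the text, so this sketch borrows the values {4, 6, 9, 3, 7} from the earlier range example; the split into three steps follows the one-line description above.

```python
# Values borrowed from the range example; NOT the data behind the 3.75 figure.
values = [4, 6, 9, 3, 7]

# Step 1: find the mean of all values.
mean = sum(values) / len(values)

# Step 2: find the distance of each value from that mean.
distances = [abs(v - mean) for v in values]

# Step 3: find the mean of those distances.
mean_deviation = sum(distances) / len(distances)

print(f"mean = {mean}, mean deviation = {mean_deviation:.2f}")
```

For these five values the mean is 5.8 and the mean deviation is 1.84.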
Standard Deviation
The formula is easy: it is the square root of the Variance. So now you ask,
"What is the Variance?"
Example
You and your friends have just measured the heights of your dogs (in
millimetres):
The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and
300mm.
Find out the Mean, the Variance, and the Standard Deviation.
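A minimal Python sketch for this example follows. It assumes the five dogs are the whole population, so the variance divides by N; if they were treated as a sample, the usual convention would be to divide by N - 1 instead.

```python
import math

# Dog heights at the shoulders, in millimetres (from the example above).
heights_mm = [600, 470, 170, 430, 300]
n = len(heights_mm)

# Mean: sum of the heights divided by how many there are.
mean = sum(heights_mm) / n

# Variance: average of the squared differences from the mean
# (population form, dividing by N; this is an assumption).
squared_diffs = [(h - mean) ** 2 for h in heights_mm]
variance = sum(squared_diffs) / n

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(f"mean = {mean} mm")
print(f"variance = {variance} mm^2")
print(f"standard deviation = {std_dev:.2f} mm")
```

Under that assumption the mean is 394 mm, the variance is 21704 mm², and the standard deviation is about 147 mm.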
SKEWNESS
A Scatter (XY) Plot has points that show the relationship between two sets of
data.
In this example, each dot shows one person's weight versus their height.
(The data is plotted on the graph as "Cartesian (x,y) Coordinates")
Example:
The local ice cream shop keeps track of how much ice cream they sell versus the
noon temperature on that day. Here are their figures for the last 12 days:
Noon temperature   Ice cream sales
14.2°              $215
16.4°              $325
11.9°              $185
15.2°              $332
18.5°              $406
22.1°              $522
19.4°              $412
25.1°              $614
23.4°              $544
18.1°              $421
22.6°              $445
17.2°              $408
And here is the same data as a Scatter Plot:
It is now easy to see that warmer weather leads to more sales, but the
relationship is not perfect.
We can also draw a "Line of Best Fit" (also called a "Trend Line") on our scatter
plot:
Try to have the line as close as possible to all points, and as many points
above the line as below.
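The text describes fitting the line by eye. One common way to make "as close as possible to all points" precise is the least-squares line; the sketch below swaps in that technique for the ice cream data, so it is an illustration under that assumption rather than the method the text itself uses. The function name `predict_sales` is hypothetical.

```python
# Noon temperature and ice cream sales for the 12 days listed above.
temps = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]

n = len(temps)
mean_t = sum(temps) / n
mean_s = sum(sales) / n

# Least-squares slope: sum of cross-deviations over sum of squared
# temperature deviations.
s_ts = sum((t - mean_t) * (s - mean_s) for t, s in zip(temps, sales))
s_tt = sum((t - mean_t) ** 2 for t in temps)
slope = s_ts / s_tt

# The least-squares line passes through the point of means.
intercept = mean_s - slope * mean_t


def predict_sales(temp):
    """Predicted sales (in dollars) on the fitted line for a noon temperature."""
    return intercept + slope * temp


print(f"sales = {slope:.2f} * temperature + ({intercept:.2f})")
```

With these figures the fitted slope comes out at roughly 30 dollars of extra sales per degree.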
CORRELATION :
Correlation Coefficients
While examining scatterplots gives us some idea about the relationship between
two variables, we use a statistic called the correlation coefficient to give us a
more precise measurement of the relationship between the two variables. The
correlation coefficient is an index that describes the relationship and can take on
values between −1.0 and +1.0, with a positive correlation coefficient indicating a
positive correlation and a negative correlation coefficient indicating a negative
correlation.
The absolute value of the coefficient indicates the magnitude, or the strength, of
the relationship. The closer the absolute value of the coefficient is to 1, the
stronger the relationship. For example, a correlation coefficient of 0.20 indicates
that there is a weak linear relationship between the variables, while a
coefficient of −0.90 indicates that there is a strong linear relationship.
The value of a perfect positive correlation is 1.0, while the value of a perfect
negative correlation is −1.0.
The value of the coefficient of correlation (r) always lies between −1 and +1. For example:
r=+1, perfect positive correlation
r=-1, perfect negative correlation
r=0, no correlation
The coefficient of correlation is zero when the variables X and Y are
independent. However, the converse is not true: a coefficient of zero does not
imply that the variables are independent.
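A short Python sketch can make these properties concrete. It assumes the usual Pearson product-moment formula for r (the text does not spell the formula out) and applies it to the ice cream data from the scatter plot example, to perfectly linear data, and to a small made-up case where Y depends entirely on X yet r is zero. The function name `pearson_r` is illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Ice cream data from the scatter plot example.
temps = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]
r_ice_cream = pearson_r(temps, sales)

# Perfect positive and perfect negative linear relationships.
r_perfect_pos = pearson_r([1, 2, 3], [2, 4, 6])
r_perfect_neg = pearson_r([1, 2, 3], [6, 4, 2])

# Hypothetical data with y = x**2: y is fully determined by x, yet r is 0.
xs = [-2, -1, 0, 1, 2]
r_dependent = pearson_r(xs, [x ** 2 for x in xs])

print(f"ice cream: r = {r_ice_cream:.3f}")
print(f"perfect positive: {r_perfect_pos}, perfect negative: {r_perfect_neg}")
print(f"y = x^2: r = {r_dependent}")
```

The last case illustrates why the converse fails: r measures only the linear part of a relationship.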
1. The relationship between the variables is “Linear”, which means when the two
variables are plotted, a straight line is formed by the points plotted.
2. A large number of independent causes affect the variables under study, so
that they follow a Normal Distribution. For example, variables like price,
demand, supply, etc. are affected by so many factors that a normal distribution is
formed.
3. The variables are independent of each other.
4. The cause and effect relationship exists between two variables.
2. In this method, we can also ascertain the direction of the correlation; positive,
or negative.
Demerits:
1. It is more difficult to calculate than other methods.
But the height of individuals may also be affected by other factors like age, genetics,
food intake, amount and type of exercise, location etc.
So, if we try to predict height using weight as the single predictor, the coefficient of
determination is 0.64 (equal to the square of the correlation coefficient here). How do
we interpret it?
It shows that 0.64 (or 64%) of the variation in height can be explained by weight, and
the remaining 36% of the variation in height may be due to other factors that affect
the height of individuals, such as age, genetics, food intake, amount and type of
exercise, and location.
Coefficient of Determination
REGRESSION ANALYSIS
Regression analysis is used in stats to find trends in data. For example, you might
guess that there’s a connection between how much you eat and how much you
weigh; regression analysis can help you quantify that. Regression analysis will
provide you with an equation for a graph so that you can make predictions about
your data. For example, if you’ve been putting on weight over the last few years, it
can predict how much you’ll weigh in ten years’ time if you continue to put on
weight at the same rate.
In regression analysis, those factors are called variables. You have your
dependent variable — the main factor that you’re trying to understand or
predict.
And then you have your independent variables — the factors you suspect have
an impact on your dependent variable.
Redman offers this example scenario: Suppose you’re a sales manager trying to
predict next month’s numbers. You know that dozens, perhaps even hundreds of
factors from the weather to a competitor’s promotion to the rumor of a new and
improved model can impact the number.
The example data in Table 1 are plotted in Figure 1. You can see that
there is a positive relationship between X and Y. If you were going to
predict Y from X, the higher the value of X, the higher your prediction of
Y.
Table 1. Example data.
X Y
1.00 1.00
2.00 2.00
3.00 1.30
4.00 3.75
5.00 2.25
Figure 2. A scatter plot of the example data. The black line consists of the
predictions, the points are the actual data, and the vertical lines between
the points and the black line represent errors of prediction.
The error of prediction for a point is the value of the point minus the
predicted value (the value on the line). Table 2 shows the predicted
values (Y') and the errors of prediction (Y-Y'). For example, the first point
has a Y of 1.00 and a predicted Y (called Y') of 1.21. Therefore, its error
of prediction is -0.21.
Table 2. Example data.
You may have noticed that we did not specify what is meant by "best-fitting
line." By far, the most commonly used criterion for the best-fitting
line is the line that minimizes the sum of the squared errors of prediction.
That is the criterion that was used to find the line in Figure 2. The last
column in Table 2 shows the squared errors of prediction. The sum of the
squared errors of prediction shown in Table 2 is lower than it would be
for any other regression line.
The formula for a regression line is
Y' = bX + A
where Y' is the predicted score, b is the slope of the line, and A is the Y
intercept.
COMPUTING THE REGRESSION LINE
In the age of computers, the regression line is typically computed with
statistical software. However, the calculations are relatively easy, and are
given here for anyone who is interested. The calculations are based on
the statistics shown in Table 3. MX is the mean of X, MY is the mean of Y,
sX is the standard deviation of X, sY is the standard deviation of Y, and r
is the correlation between X and Y.
The slope (b) can be calculated as
b = r sY/sX
and the intercept (A) can be calculated as
A = MY - bMX.
For these data,
b = (0.627)(1.072)/1.581 = 0.425
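The whole calculation can be reproduced from Table 1 with a short Python sketch. It assumes the standard deviations are sample standard deviations (dividing by n - 1), which is the choice that reproduces the 1.581 and 1.072 used above; the variable names are illustrative.

```python
import math

# Example data from Table 1.
x = [1.00, 2.00, 3.00, 4.00, 5.00]
y = [1.00, 2.00, 1.30, 3.75, 2.25]
n = len(x)

# Means of X and Y (MX and MY).
mx = sum(x) / n
my = sum(y) / n

# Sample standard deviations (sX and sY), dividing by n - 1 (assumption).
sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))
sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))

# Correlation between X and Y.
r = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / ((n - 1) * sx * sy)

# Slope and intercept of the regression line Y' = bX + A.
b = r * sy / sx
a = my - b * mx

# Predicted values, errors of prediction, and squared errors.
predicted = [b * xi + a for xi in x]
errors = [yi - pi for yi, pi in zip(y, predicted)]
sum_squared_errors = sum(e ** 2 for e in errors)

print(f"MX = {mx}, MY = {my:.2f}, sX = {sx:.3f}, sY = {sy:.3f}, r = {r:.3f}")
print(f"b = {b:.3f}, A = {a:.3f}")
for xi, yi, pi, ei in zip(x, y, predicted, errors):
    print(f"X = {xi:.2f}  Y = {yi:.2f}  Y' = {pi:.2f}  Y - Y' = {ei:.2f}")
```

This gives b = 0.425 and A = 0.785, and the first point's predicted value of 1.21 and error of -0.21 agree with the figures quoted in the text.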