
DATA ANALYSIS

By G. Raja Sekhar

INTRODUCTION
Data are collected as the sample is drawn and the interviews are carried out. The
collected data then go for processing. After the data
have been processed, they must be
analysed.
Analysis refers to the computation of certain indices or
measures, along with a search for patterns of relationship
that exist among the data groups.

Statistical methods can be used to summarize or
describe a collection of data; this is called descriptive
statistics. It is useful in research when communicating
the results of experiments.
In addition, patterns in the data may be modelled in a
way that accounts for randomness and uncertainty in the
observations, and then used to draw inferences about
the process or population being studied; this is called
inferential statistics.
If the inference holds true, then the descriptive statistics of
new data increase the soundness of the hypothesis.

PROCESSING STAGE
The processing stage covers the editing, coding, classification and
tabulation of the collected data so that they are ready for analysis.
The collected data must first be arranged: some of the received data are
useful and some are not, so in this step the data must be

Edited,
Coded,
Classified and
Tabulated.

EDITING
The purpose of editing is the careful scrutiny of all collected data to
ensure that they are complete, error-free and readable.

CODING
The purpose of coding is to assign a
code (usually a number) to each category of
answers.

CLASSIFICATION

The purpose of classification is to
arrange the received data into groups (classes)
that share common characteristics.

TABULATION
Tabulation is the process of
summarizing the data and displaying them in
appropriate tables so that further analysis is
made easier.
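
The coding and tabulation steps can be illustrated with a small Python sketch. The survey answers below are hypothetical and are not taken from the slides, and assigning codes in alphabetical order is only one possible convention.

```python
from collections import Counter

# Hypothetical survey answers (illustration only, not data from the slides).
answers = ["yes", "no", "yes", "undecided", "yes", "no"]

# Coding: assign a number to each category of answer (alphabetical order here).
codes = {category: code
         for code, category in enumerate(sorted(set(answers)), start=1)}

# Replace every answer by its code.
coded = [codes[a] for a in answers]

# Tabulation: count how often each code occurs.
table = Counter(coded)

for category, code in codes.items():
    print(category, code, table[code])
```

The resulting frequency table is the starting point for the averages and measures of spread discussed later.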

ANALYSIS OF DATA
Whenever a mass of data is collected, statistics
comes into play: it provides the procedures
that support both the processing and the analysis of
the data.

STATISTICS IN RESEARCH
The role of statistics in research is to function as a tool in designing research,
analysing its data and drawing conclusions from them.
To achieve the objective of the research, we have to go a step further and
develop certain indices or measures to summarise the collected and classified
data.
Only after this can we adopt the process of generalisation from small groups
(i.e., samples) to the population. In fact, there are two major areas of statistics,
viz., descriptive statistics and inferential statistics.

DESCRIPTIVE STATISTICS AND INFERENTIAL STATISTICS
Descriptive statistics concern the development of certain indices from the
raw data, whereas inferential statistics are concerned with the process of
generalisation.
Inferential statistics are also known as sampling statistics and are mainly
concerned with two major types of problem:
The estimation of population parameters, and
The testing of statistical hypotheses.

DESCRIPTIVE STATISTICS
Descriptive statistics is the term given to the analysis of data that helps
describe, show or summarize data in a meaningful way.
Descriptive statistics do not, however, allow us to make conclusions beyond the
data we have analysed or reach conclusions regarding any hypotheses we might
have made; that is the task of inferential statistics.
Descriptive statistics are very important because if we simply presented
our raw data it would be hard to visualize what the data were showing,
especially if there were a lot of them.

Describing data properly through statistics and graphs is therefore an
important topic.
Typically, two general types of statistic are used to describe
data:
Measures of central tendency
Measures of spread

MEASURES OF CENTRAL TENDENCY


Measures of central tendency (or statistical averages) tell us the point about
which items have a tendency to cluster. Such a measure is considered
the most representative figure for the entire mass of data.
The mean, median and mode are all valid measures of central tendency,
but under different conditions some measures of central tendency become
more appropriate to use than others.

ARITHMETIC MEAN
The arithmetic mean is defined as the sum of the items divided by the number of
items in a series.
It is the most widely used and practical method for the
measurement of central tendency. It is further divided into
Simple arithmetic mean
Weighted arithmetic mean

SIMPLE ARITHMETIC MEAN


The simple arithmetic mean is the total of all the
items divided by the number of items.
Individual series:
Direct method: for an individual series, the following formula is used

$$\bar{X} = \frac{\sum X}{n}$$

Where
$\bar{X}$ = arithmetic mean
$\sum X$ = total of the items in the series
n = number of items

Example: Find the mean of X: 10, 15, 12, 9, 6, 8

$$\bar{X} = \frac{10 + 15 + 12 + 9 + 6 + 8}{6} = \frac{60}{6} = 10$$
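
As a quick check of the direct method, the calculation can be sketched in Python. The function name `simple_mean` is chosen here for illustration only.

```python
def simple_mean(items):
    """Direct method: total of the items divided by the number of items."""
    return sum(items) / len(items)

x = [10, 15, 12, 9, 6, 8]
print(simple_mean(x))  # 10.0
```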

Indirect method: the indirect method is used when the number of items is
very large. To simplify the data, deviations may be taken from an assumed
mean; with the data arranged as a frequency distribution, the mean is given by

$$\bar{X} = \frac{\sum fX}{N}$$

Where
$\bar{X}$ = arithmetic mean
f = frequencies
X = variable, or mid-points of the class intervals
N = total number of frequencies in the series ($N = \sum f$)

EXAMPLE

X        No. of Students (f)    fX
20        8                     160
30       12                     360
40       20                     800
50       10                     500
60        6                     360
70        4                     280
Total    N = 60                 2460

$$\bar{X} = \frac{\sum fX}{N} = \frac{2460}{60} = 41$$
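
The same frequency-table calculation can be sketched in Python. The frequencies 8, 6 and 4 used below are an assumption: they are inferred from the fX column of the example (fX divided by X), and the function name `frequency_mean` is illustrative.

```python
def frequency_mean(values, freqs):
    """Mean of a frequency distribution: sum of f*X divided by N = sum of f."""
    total_fx = sum(f * x for x, f in zip(values, freqs))
    n = sum(freqs)
    return total_fx / n

marks = [20, 30, 40, 50, 60, 70]
students = [8, 12, 20, 10, 6, 4]  # 8, 6 and 4 inferred from the fX column

print(sum(students))                    # 60
print(frequency_mean(marks, students))  # 41.0
```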

INCLUSIVE SERIES
The difference between the upper limit of one interval and the lower limit of the
next interval is noted; half of that difference is then deducted from the lower
limit of every interval and added to the upper limit of every interval.
Example:

Class interval    4-6    7-9    10-12    13-15    16-18    19-21    22-24
Frequency         1      3      7        15       11       3        2

Class Interval    Frequency (f)    Mid point (m)    fm
3.5-6.5            1                5                 5
6.5-9.5            3                8                24
9.5-12.5           7               11                77
12.5-15.5         15               14               210
15.5-18.5         11               17               187
18.5-21.5          3               20                60
21.5-24.5          2               23                46
Total             42                                609

$$\bar{X} = \frac{\sum fm}{\sum f} = \frac{609}{42} = 14.5$$
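
The conversion of an inclusive series and the resulting mean can be sketched as follows. This is a minimal illustration that assumes the gap between consecutive classes is the same throughout; the function names are chosen here and do not come from the slides.

```python
def to_class_boundaries(classes):
    """Widen inclusive class limits by half the gap between adjacent classes."""
    gap = classes[1][0] - classes[0][1]  # lower limit of next - upper limit of first
    half = gap / 2
    return [(lo - half, hi + half) for lo, hi in classes]

def grouped_mean(boundaries, freqs):
    """Mean of a grouped series using class mid-points: sum(f*m) / sum(f)."""
    mids = [(lo + hi) / 2 for lo, hi in boundaries]
    return sum(f * m for f, m in zip(freqs, mids)) / sum(freqs)

classes = [(4, 6), (7, 9), (10, 12), (13, 15), (16, 18), (19, 21), (22, 24)]
freqs = [1, 3, 7, 15, 11, 3, 2]

bounds = to_class_boundaries(classes)
print(bounds[0])                    # (3.5, 6.5)
print(grouped_mean(bounds, freqs))  # 14.5
```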

Open end intervals: open end intervals are those in which the lower limit of
the first class and the upper limit of the last class are not known. In such a case,
we cannot find the arithmetic mean unless we make an assumption about
the unknown limits. The assumption would naturally depend upon the class
interval.
Unequal intervals: if the class intervals are not equal, first regroup the data
into equal class intervals, then solve the problem.
Example:
Original (unequal) class intervals of X: 0-2, 2-5, 5-6, 6-8, 8-10, 10-20, 20-21, 21-25
Regrouped into equal class intervals of width 5:

Class interval    Frequency (f)    Mid point (m)    fm
0-5                3                2.5               7.5
5-10              11                7.5              82.5
10-15              3               12.5              37.5
15-20              3               17.5              52.5
20-25              5               22.5             112.5
Total             25                                292.5

$$\bar{X} = \frac{\sum fm}{\sum f} = \frac{292.5}{25} = 11.7$$

WEIGHTED ARITHMETIC MEAN


The weighted arithmetic mean (WAM) is the arithmetic mean calculated after
assigning weights to the items in a series according to their relative
importance. It can be calculated from the following formula.

$$\bar{X}_w = \frac{\sum WX}{\sum W}$$

Where
$\bar{X}_w$ = weighted arithmetic mean
W = weights assigned to the different items
$\sum WX$ = sum of the products of the items with their respective weights
$\sum W$ = sum of the weights

First, find the product of each item with its respective weight, that is,
WX.
Take the total of WX as $\sum WX$.
Divide the value of $\sum WX$ by $\sum W$ to get the weighted arithmetic mean.

EXAMPLE
A train runs 25 km at a speed of 30 kmph and another 50 km at a speed of 40
kmph. Due to repairs of the track it travels at a speed of 10 kmph for 6 minutes,
and finally covers the remaining distance of 24 km at a speed of 60 kmph.
What is the average speed in kmph?
Solution: the time taken to cover 25 km at a speed of 30 kmph is 50 minutes,
and so on. The times are therefore taken as the weights.

Speed in kmph (X)    Time taken in minutes (W)    WX
30                    50                          1500
40                    75                          3000
10                     6                            60
60                    24                          1440
Total                155                          6000

$$\bar{X}_w = \frac{\sum WX}{\sum W} = \frac{6000}{155} \approx 38.71 \text{ kmph}$$
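
The train example can be checked with a short Python sketch that uses the travel times in minutes as weights. The function name `weighted_mean` is illustrative.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum of W*X divided by sum of W."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

speeds = [30, 40, 10, 60]   # kmph (X)
minutes = [50, 75, 6, 24]   # time spent at each speed (W)

print(sum(w * x for x, w in zip(speeds, minutes)))  # 6000
print(sum(minutes))                                 # 155
print(round(weighted_mean(speeds, minutes), 2))     # 38.71
```

The result agrees with total distance over total time: 100 km in 155 minutes is about 38.71 kmph.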

MERITS OF ARITHMETIC MEAN


It is simple to understand and easy to calculate.
It is rigidly defined.
It includes all the items in its calculation.
It lends itself to further mathematical treatment.
It is used universally.

DEMERITS OF ARITHMETIC MEAN


It cannot be determined graphically.
It is not suitable for open-ended classes.
It is a representative average only when the distribution is roughly normal.
It is strongly affected by extreme values.

MEDIAN
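When the observations are arranged in ascending or descending order of
magnitude, the middle value is known as the median of these observations.
Let x1, x2, ..., xn be n observations arranged in ascending order of
magnitude. The median is defined as the middle-most term, that is, the value of x
at the position (n + 1)/2. We can write

$$Me = \text{value of the } \left(\frac{n+1}{2}\right)\text{th item} \quad \text{(when } n \text{ is odd)}$$

$$Me = \frac{\text{value of the } \left(\frac{n}{2}\right)\text{th item} + \text{value of the } \left(\frac{n}{2}+1\right)\text{th item}}{2} \quad \text{(when } n \text{ is even)}$$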

STEPS TO CALCULATE MEDIAN


First of all, arrange the data in ascending or descending
order.
Put the value of n into the formula to find the position of Me.
Example:
Find the median of the following data: 391, 384, 591, 407, 672, 522, 777, 753, 1490

Solution:
X = 384, 391, 407, 522, 591, 672, 753, 777, 1490

Median = value of the (n + 1)/2 th item
= value of the (9 + 1)/2 th item
= 5th item = 591

Find the median of the following data: 391, 384, 591, 407, 672, 522, 777, 753, 1490, 222

Solution:
X = 222, 384, 391, 407, 522, 591, 672, 753, 777, 1490
Median = value of the (10 + 1)/2 th item
= 5.5th item, that is, the mean of the 5th and 6th items
= (522 + 591)/2 = 556.5
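
Both cases can be sketched in one small Python function: the middle item when n is odd, and the mean of the two middle items when n is even. The function name is illustrative.

```python
def median(items):
    """Middle item of the ordered data, or mean of the two middle items."""
    ordered = sorted(items)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the (n + 1)/2 th item
    return (ordered[mid - 1] + ordered[mid]) / 2  # mean of n/2 th and (n/2 + 1)th

odd_data = [391, 384, 591, 407, 672, 522, 777, 753, 1490]
even_data = odd_data + [222]

print(median(odd_data))   # 591
print(median(even_data))  # 556.5
```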

IN CASE OF DISCRETE SERIES


In a discrete series, the following formula is used

$$Me = \text{value of the } \left(\frac{N+1}{2}\right)\text{th item}$$

Steps to calculate
Arrange the data in ascending or descending order.
Then compute the cumulative frequencies.
Put the value of N into the formula.
Find the cumulative frequency that is just greater than the value found in
step 3; the corresponding value of the variable is the median.

EXAMPLE
Find the median from the following data

Income            100    150    80    200    250    180
No. of Persons     12     13     8     10      3     15

Income    No. of Persons (f)    Cf
80          8                     8
100        12                    20
150        13                    33
180        15                    48
200        10                    58
250         3                    61

Median = value of the (N + 1)/2 th item = (61 + 1)/2 = 31st item

The cumulative frequency just greater than 31 is 33, and the corresponding variable is 150.
Therefore, median = 150
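
The discrete-series steps can be sketched in Python. The data follow the cumulative-frequency table of the example; the frequencies 8 and 3 are an assumption inferred from the cumulative column, and the function name is illustrative.

```python
def discrete_median(values, freqs):
    """Median of a discrete series via cumulative frequencies."""
    pairs = sorted(zip(values, freqs))  # step 1: arrange in ascending order
    n = sum(freqs)
    position = (n + 1) / 2              # step 3: the (N + 1)/2 th item
    cumulative = 0
    for value, f in pairs:
        cumulative += f                 # step 2: running cumulative frequency
        if cumulative >= position:      # step 4: first cf reaching the position
            return value

income = [80, 100, 150, 180, 200, 250]
persons = [8, 12, 13, 15, 10, 3]  # 8 and 3 inferred from the cf column

print(discrete_median(income, persons))  # 150
```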

MERITS OF MEDIAN
It is very simple to understand.
Its calculation is easy.
It is not affected by extreme items.
It can be located graphically.
It is a suitable average for open-ended class intervals.
It deals with quality more than quantity, so it suits data that can only be ranked.

DEMERITS OF MEDIAN
It needs extra labour, compared with other averages, to arrange the data in
ascending or descending order.
It does not involve all the observations in its calculation, which
limits how well it represents the data.
It cannot be located exactly in a series with an even number of items; the mean
of the two middle items is taken instead.
It is difficult to calculate when the series contains a very small or a very
large number of items.
It has no further mathematical applicability, unlike other averages.

MODE
The mode is defined as the size of the variable which occurs most frequently,
or the point of maximum frequency, or the point of greatest density. It is also
an important measure of central tendency.
According to Kenney and Keeping, "The value of the variable which occurs
most frequently in a distribution is called the mode."
For a grouped series, the mode (Z) is calculated from the modal class as

$$Z = L_1 + \frac{F_1 - F_0}{2F_1 - F_0 - F_2} \times I$$

Where
L1 = lower limit of the modal class
F1 = highest frequency (the frequency of the modal class)
F0 = frequency of the class preceding the modal class
F2 = frequency of the class succeeding the modal class
I = class interval (difference between the upper and lower limits of the modal class)

EXAMPLE
X: 19, 21, 20, 19, 19, 19, 25, 3, 1, 9, 2, 8, 5, 8
Solution
19 occurs most frequently (four times), so it is the mode.
Therefore Z = 19

Having two modes is called "bimodal".
Having more than two modes is called "multimodal".
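
Finding the mode of an individual series, including the bimodal case, can be sketched with Python's `collections.Counter`. The function name `modes` and the second data set are illustrative assumptions.

```python
from collections import Counter

def modes(items):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(items)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)

x = [19, 21, 20, 19, 19, 19, 25, 3, 1, 9, 2, 8, 5, 8]
print(modes(x))                # [19]
print(modes([1, 1, 2, 2, 3]))  # [1, 2], a bimodal series (hypothetical data)
```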

GROUPING MODE
When all values appear the same number of times the idea of a mode is not
useful. But you could group them to see if one group has more than the
others.
Example: {4, 7, 11, 16, 20, 22, 25, 26, 33}

Each value occurs once, so let us try to group them.


We can try groups of 10:
0-9: 2 values (4 and 7)
10-19: 2 values (11 and 16)
20-29: 4 values (20, 22, 25 and 26)
30-39: 1 value (33)

In groups of 10, the "20s" appear most often, so we could choose 25 as the
mode.
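
The grouping idea can be sketched in Python by counting how many values fall into each group of 10. The group width and the function name `modal_group` are choices made for this illustration.

```python
from collections import Counter

def modal_group(items, width=10):
    """Return (group start, group end, count) for the fullest group."""
    counts = Counter((x // width) * width for x in items)
    start, count = max(counts.items(), key=lambda pair: pair[1])
    return start, start + width - 1, count

data = [4, 7, 11, 16, 20, 22, 25, 26, 33]
print(modal_group(data))  # (20, 29, 4)
```

A different group width can give a different modal group, so the grouped mode depends on how the groups are drawn.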

EXAMPLE
Calculate the mode for the following distribution

Gross Profit as % of sales    0-7    7-14    14-21    21-28    28-35    35-42    42-49
No. of companies              19     25      36       72       51       43       28

Solution:
Here, the largest frequency is 72. It lies in the class 21-28, so the modal class is
21-28 and the lower limit of the modal class is 21. Thus

Class     Frequency
0-7       19
7-14      25
14-21     36    F0
21-28     72    F1
28-35     51    F2
35-42     43
42-49     28

$$Z = L_1 + \frac{F_1 - F_0}{2F_1 - F_0 - F_2} \times I = 21 + \frac{72 - 36}{2(72) - 36 - 51} \times 7 = 21 + \frac{36}{57} \times 7 \approx 25.42$$
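
The grouped-mode formula, Z = L1 + (F1 - F0) / (2 F1 - F0 - F2) x I, can be sketched in Python for this distribution. The function name is illustrative, and treating a missing neighbouring class as frequency 0 is an assumption made here for the edge cases.

```python
def grouped_mode(classes, freqs):
    """Mode of a grouped series: L1 + (F1 - F0) / (2*F1 - F0 - F2) * I."""
    i = max(range(len(freqs)), key=freqs.__getitem__)  # index of the modal class
    lower, upper = classes[i]
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0               # preceding class (0 if none)
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0  # succeeding class (0 if none)
    return lower + (f1 - f0) / (2 * f1 - f0 - f2) * (upper - lower)

classes = [(0, 7), (7, 14), (14, 21), (21, 28), (28, 35), (35, 42), (42, 49)]
companies = [19, 25, 36, 72, 51, 43, 28]

print(round(grouped_mode(classes, companies), 2))  # 25.42
```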

MERITS OF MODE
It is very simple to understand and easy to calculate because it is a
positional average.
It is based on quality rather than quantity.
It is least affected by extreme values.
Where there is a large concentration of items around a value, that value is
a good representation of the items.
The modal value can be shown graphically.

DEMERITS OF MODE
It is not a suitable measure of central tendency where the number of items is
very small.
It has no further mathematical applicability.
Given the modes of two or more series, it is not possible
to calculate the modal value of the combined series.
Unlike the mean, multiplying the modal value by the number of items does not
give the sum of the items.

SUMMARY OF WHEN TO USE THE MEAN, MEDIAN AND MODE

Type of Variable               Best measure of central tendency
Nominal                        Mode
Ordinal                        Median
Interval/Ratio (not skewed)    Mean
Interval/Ratio (skewed)        Median

MEASURES OF SPREAD
A measure of spread, sometimes also called a measure of dispersion, is used to
describe the variability in a sample or population. It is usually used in conjunction
with a measure of central tendency, such as the mean or median, to provide an
overall description of a set of data.
Measures of spread are ways of summarizing a group of data by describing
how spread out the scores are. For example, the mean score of our 100 students
may be 65 out of 100. However, not all students will have scored 65 marks. Rather,
their scores will be spread out: some will be lower and others higher. Measures of
spread help us to summarize how spread out these scores are. To describe this
spread, a number of statistics are available to us, including the range, quartiles,
absolute deviation, variance and standard deviation.

The measures of dispersion may be expressed as
Absolute Measure of Dispersion
Relative Measure of Dispersion

Absolute Measure of Dispersion: it is obtained when the deviations of the actual
values from a measure of central tendency are taken. These measures
are expressed in the same statistical units in which the original values are
stated, that is, kilograms, metres, rupees, years, months, etc. For that reason,
absolute measures of dispersion cannot be used to compare the variation
of two series expressed in different units.

Relative Measure of Dispersion: a measure of relative dispersion is defined as
the ratio of a measure of absolute dispersion to an appropriate average.
These are pure numbers, independent of the units of measurement, so they can
easily be used to compare the variation of two series. It is also called the
coefficient of dispersion.

RANGE
The simplest possible measure of dispersion is the range, which is the
difference between the greatest and least values of the variable.
The range may be expressed in these forms
Simple range
Interquartile range
Percentile range and
Decile range

SIMPLE RANGE
It is the difference between the value of the largest item and the value of
the smallest item included in a distribution.
Example: in the series 8, 9, 14, 10, 12, 7; range = 14 - 7 = 7
Coefficient of dispersion: the relative measure of the range is called the
coefficient of dispersion and is obtained by dividing the range by the sum of
the extreme values

$$\text{Coefficient of dispersion} = \frac{R_1 - R_2}{R_1 + R_2}$$

Where R1 is the maximum and R2 is the minimum value of the variate
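
The range and its coefficient, (R1 - R2) / (R1 + R2), can be sketched in Python for the series used above. The function names are illustrative.

```python
def simple_range(items):
    """Difference between the largest and the smallest item."""
    return max(items) - min(items)

def coefficient_of_range(items):
    """Relative measure: (R1 - R2) / (R1 + R2), with R1 = max and R2 = min."""
    r1, r2 = max(items), min(items)
    return (r1 - r2) / (r1 + r2)

series = [8, 9, 14, 10, 12, 7]
print(simple_range(series))                    # 7
print(round(coefficient_of_range(series), 3))  # 0.333
```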

INTERQUARTILE RANGE
The interquartile range is another range used as a measure of spread.
The difference between the upper and lower quartiles (Q3 - Q1), which is called
the interquartile range, also indicates the dispersion of a data set. The
interquartile range spans 50% of a data set, and eliminates the influence of
outliers because, in effect, the highest and lowest quarters are removed.
Interquartile range = difference between upper quartile (Q3) and lower quartile
(Q1)

EXAMPLE
A year ago, Angela began working at a computer store. Her supervisor
asked her to keep a record of the number of sales she made each month.
The following data set is a list of her sales for the last 12 months:
34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37.

Find:
The range
The upper and lower quartiles
The interquartile range

Range = difference between the highest and lowest values
= 57 - 1 = 56
Lower quartile (Q1) = value of the middle of the first half of the data
= the median of 1, 11, 15, 19, 20, 24 = (third + fourth observations) / 2
= (15 + 19) / 2 = 17
Upper quartile (Q3) = value of the middle of the second half of the data
= the median of 28, 34, 37, 47, 50, 57 = (third + fourth observations) / 2
= (37 + 47) / 2 = 42
Interquartile range = Q3 - Q1
= 42 - 17 = 25
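
The worked example can be reproduced with a short Python sketch that takes the quartiles as the medians of the lower and upper halves of the ordered data, as done above. The function names are illustrative; library routines such as `statistics.quantiles` use interpolation conventions that may give slightly different quartiles.

```python
def median_of_sorted(ordered):
    """Median of data that is already in ascending order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(items):
    """Q1 and Q3 as the medians of the lower and upper halves of the data."""
    ordered = sorted(items)
    half = len(ordered) // 2  # for odd n the middle item is left out of both halves
    return median_of_sorted(ordered[:half]), median_of_sorted(ordered[-half:])

sales = [34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37]
q1, q3 = quartiles(sales)

print(max(sales) - min(sales))  # 56
print(q1, q3)                   # 17.0 42.0
print(q3 - q1)                  # 25.0
```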

MERITS OF RANGE
Range is a very easy and simple measure to understand and calculate.
Therefore, even a layman can understand it without any difficulty.
It is rigidly defined to some extent.

The disadvantage of using the range is that it does not measure the spread of
the majority of values in a data set; it only measures the spread between the
highest and lowest values. As a result, other measures are required in order
to give a better picture of the data spread. The range is an informative tool
used as a supplement to other measures such as the standard deviation or
semi-interquartile range, but it should rarely be used as the only measure of
spread.
