Statistics Support Materials Guide

STATISTICS
STAT 1010

Centre for Professional Development and Lifelong Learning
UNIVERSITY OF MAURITIUS

SUPPORT MATERIALS

AUTHORS

STATISTICS STAT 1010 was prepared for the Centre for Professional Development and
Lifelong Learning, University of Mauritius. The Pro-Vice Chancellor Teaching and
Learning acknowledges the contribution of the following course team members:

Dr V Jowaheer - Faculty of Science
Mr S Kalasopatan - Faculty of Social Studies and Humanities
Dr F Khodabacus - Faculty of Engineering
Assoc. Prof M J Pochun
Dr A Ruggoo - Faculty of Agriculture
Assoc. Prof P Veerapen - Faculty of Social Studies and Humanities

August 2008

All rights reserved. No part of this work may be reproduced in any form without
written permission from the University of Mauritius, Réduit, Mauritius.

TABLE OF CONTENTS

STUDY GUIDE:-

Support Materials
How to Proceed
How to Use the Support Materials
How to Use the Textbook
Suggested Coursework
Suggested Course Map
Final Examination
Suggested Grading Scheme

Unit 1 Introduction
Unit 2 Data Collection I, OJ, Chapter 16
Unit 3 Organisation and Presentation of Data I, OJ, Chapter 1
Unit 4 Organisation and Presentation of Data II, OJ, Chapter 2
Unit 5 Organisation and Presentation of Data III, OJ, Chapter 3
Unit 6 Measures of Central Tendency, OJ, Chapter 5
Unit 7 Measures of Dispersion, OJ, Chapter 9
Unit 8 Time Series Analysis, OJ, Chapters 6 and 7
Unit 9 Index Numbers, OJ, Chapter 8
Unit 10 Probability, OJ, Chapter 11
Unit 11 Data Collection II, OJ, Chapters 15 and 16
Unit 12 Linear Relationship Between Variables I: Correlation, OJ, Chapter 23
Unit 13 Linear Relationship Between Variables II: Regression, OJ, Chapter 23

STUDY GUIDE:

Welcome to STATISTICS. This is a one-semester course designed to cover first-year
syllabuses of programmes of studies in the various faculties. The course provides an
introduction to Statistics and the manual is designed to guide you through the course.

The Study Guide contains important information on materials and procedures. We suggest
that you spend some time to read it, and to familiarise yourself with what you will have to do
to complete STATISTICS successfully. The suggested course map, given later in this Study
Guide, indicates what you should be working on each week.

If you have any questions arising from the instructions in the support materials, do not
hesitate to contact your tutor.

SUPPORT MATERIALS AND TEXTBOOK

This document can be used as SUPPORT MATERIALS.

The module also includes the following TEXTBOOK:

Owen, Frank and Jones, Ron. Statistics (4th Edition). Pitman. The textbook will be referred to
as (OJ) in the Support Materials.

HOW TO PROCEED?

You should begin by taking a look at the TABLE OF CONTENTS in both the SUPPORT
MATERIALS and the TEXTBOOK. These tables provide you with a framework for the
entire course because they outline the organisation and structure of the material you will be
covering. You will notice that the Units in the support materials do not follow the same
sequence as the Chapters in the textbook. However, in the Support Materials, you will be
referred to the relevant parts of the various Chapters.

The guidelines that follow are designed to help you most effectively work your way through
the materials in this course. So, before you begin Unit 1 of the course, read the guidelines
below carefully.

How to Use the Support Materials?

The Support Materials provide you with study plans and commentaries on the textbook
presentation. They introduce additional concepts and information, advise you to do particular
practice activities, offer clarification, examples and solutions.

Take a few minutes now to glance through the entire manual to get an idea of its structure.
Notice that the format to deal with each unit is fairly consistent throughout the support
material. For example, each unit begins with a UNIT STRUCTURE, an OVERVIEW and a
list of LEARNING OBJECTIVES.

The UNIT STRUCTURE and OVERVIEW identify the main topics in the Unit. You should
begin your study of each unit by reading this brief introduction. You should then read the
LEARNING OBJECTIVES. The importance of these objectives cannot be overstated. They
identify the knowledge and skills you will have acquired once you have successfully
completed the study of a particular unit. Keep the objectives in mind as you read the
corresponding content in your textbook. The learning objectives also provide a useful guide
for review.

How to Use the Textbook?

Studying requires that you take an active role. Therefore, use your textbook actively,
recognising it for the useful learning tool that it is. You should be studying pencil in hand,
circling an important concept, and making summary notes to crystallise your understanding.
You may like to highlight or underline the key ideas. If so, remember that a rule of thumb is
to mark no more than one quarter to one third of the material. If you highlight more than
that, you are probably marking more than the key ideas.

Suggested Coursework

STATISTICS is designed to be completed in one semester, with weeks one to thirteen for
instruction, weeks fourteen and fifteen for review and with the final examination as scheduled
by Faculty. Although you are free to work at your own pace, you should try to distribute
your workload according to the suggested course map below.

In order to complete STATISTICS you must read the instructional units. Generally, each of
these will direct you to study specific Chapters in (OJ) though some of the units will be
almost self-contained.

The objectives are tied to particular sections of the textbook. Review these objectives when
you have completed a section to confirm that you have achieved the learning goals for it. If
you realise that you are not clear about some aspects of the section, go back and redo relevant
readings and exercises. It is important to build your understanding of Statistics patiently and
thoroughly.

The units contain directions to do various practice activities. You will find answers to these
exercises either in the textbook or in the unit itself. The practice exercises are designed to
reinforce the learning objectives for each part of the course. Thinking through these activities
will train you in the skills you need for the examination and for later applications of Statistics.

SUGGESTED COURSE MAP

Week    Unit    Topic                                     Tutorial
1       1       Introduction                              1
2       2       Data Collection I                         2
3       3       Organisation & Presentation of Data I     3
        4       Organisation & Presentation of Data II
4       5       Organisation & Presentation of Data III   4
5       6       Measures of Central Tendency              5
6       7       Measures of Dispersion                    6
7       8       Time Series Analysis                      7
8               CLASS TEST*
9       9       Index Numbers                             8
10      10      Probability                               9
11      11      Data Collection II                        10
12      12      Relationship between Variables I          11
13      13      Relationship between Variables II         12
14-15           REVISION

*Week/date for Class Test to be confirmed during the semester.

FINAL EXAMINATION

Scheduled and administered by the Registrar's Office.
A two-hour paper at the end of the semester.

SUGGESTED GRADING SCHEME

Invigilated class test : 30%
Final Examination : 70%

Now, it is time to get to work. Good luck and we hope you enjoy the course.

UNIT 1 INTRODUCTION

Unit Structure

1.1 What Is Statistics?
1.2 Definition and Measurement
1.3 Nature of Statistical Data
1.4 A Last Word

1.1 WHAT IS STATISTICS?

In various aspects of life, we come across many questions whose answers are not immediately
and accurately available. Very often, information is insufficient or entirely lacking, or our
knowledge is limited, so that varying degrees of uncertainty surround the possible answers
to these questions.

For example, we may ask ourselves many questions: Shall we have enough rainfall this
summer? How many cyclones shall we have during this summer? What is the pass rate for
the B.Sc. Management or B.Sc. Economics course? What is the level of unemployment in
Mauritius? Are University students satisfied with the canteen facilities available on campus?
Are our industries able to compete on the world market?

Statistics is that branch of knowledge which provides us with tools/techniques to answer, at
least to some extent, the above questions and many more such questions. To do so, we need,
on the one hand, a minimum level of knowledge (i.e. understanding) and, on the other,
information/data already available. If information/data are not available, then the first step is
to collect them. Statistics deals thoroughly with the collection of data/information with a
prime objective in mind: the quality of data collected should be of high standard. The data
collected constitute the raw materials of any statistical analysis.

Thus, if we would like to know whether university students are satisfied with available
canteen facilities, we may choose to collect the views of all students or of a small percentage
of students, provided that this small percentage of students is selected in an unbiased manner
and is representative of the whole student body. Whatever may be the approach, much can be
learnt from the data, provided we are sufficiently careful about what is being collected and
about how data are being collected.

Once data are collected, there is the need to organise and present them in a manner calculated
to reveal their salient features and any underlying pattern. Thus, the organisation and
presentation of data are most important for the interpretation of data. This interpretation may
be very basic or, at times, rather advanced.

We then have to analyse the data to uncover, with precision, patterns which exist in the data set
and relationships not previously suspected. Uncertainties can then be handled with some
precision and can even be assessed, using probability theory and related ideas. More sophisticated
analysis of the data can be carried out if necessary, for instance by developing statistical models.

Finally, comprehensive reports together with conclusions and recommendations are produced
so that, in turn, ultimately the relevant authorities may take appropriate policy decisions.
Statistical data and their statistical analysis are essential ingredients for decision making in
almost any sphere of life: for government, business, community and individuals.

The above definition of Statistics can be summarised by the following diagrammatic
representation. Our starting point is always the need to study a specific
issue/problem/phenomenon concerning people/society at large (e.g. students' problems) or
nature (e.g. the weather) or any interaction between people and nature (e.g. agriculture).

Figure 1.1 : Diagrammatic Definition of Statistics

[Figure 1.1 shows a cycle: People / Nature, then Collection of Data, then Organisation and
Presentation of Data, then Analysis of Data, then Conclusion and Recommendation, and back
to People / Nature.]

This process continues indefinitely since implementation of the recommendations will
ultimately create a new situation and most probably a better understanding of the
problem/issue/phenomenon under consideration. Then, at a later stage, the need for new
information/data will be felt, if only to assess the impact of these same recommendations over
time.

In a similar manner, scientific experiments/observations are carried out to help us to study and
understand the world around us and to develop science and technology in general. The
scientific data collected are then analysed accordingly. Statistics indeed plays a key role not
only in the collection of scientific data but in the very development of scientific knowledge.
So much so that Professor A.F.M. Smith of Imperial College of Science, Technology and
Medicine, UK, defines Statistics as "The Science of doing Science" (1996).

1.2 DEFINITION AND MEASUREMENT

Let us give some thought to one of the questions raised in the previous section: What is the
level of unemployment in Mauritius?

Statisticians, scientists and many other people take much time to measure a particular variable
or set of variables. It is relatively easy to measure the length of a table; but it is an entirely
different matter to measure the level of unemployment in Mauritius. To be able to do so, we
must know with precision what the term "unemployment" means, not only in broad general
terms but in precise operational terms. In other words, unemployment must be precisely
defined before it can be measured.

Thus, how do we define an unemployed person? The Central Statistical Office would define
someone as unemployed if that person was not employed and was available for work and
looking for work. But then, this raises other questions. For how long was the person not
employed: for a day, for a week, or longer? Is a full-time student who holds no job unemployed? Is
a worker on strike unemployed?

In this introduction, we are not going to provide answers to all these questions. But they drive
home the point that, in Statistics, precise definition of a variable is most important not only in
broad general terms but in operational terms as mentioned above.
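
To make the idea of an operational definition concrete, here is a minimal sketch of the rule quoted above (not employed, available for work, looking for work) written as an explicit test. The class name, the field names and the two example persons are hypothetical, and the way the full-time student is recorded is an assumption made purely for illustration; it is not an official classification.

```python
from dataclasses import dataclass


@dataclass
class Person:
    # Hypothetical fields; a real survey would define each one precisely
    # (reference period, minimum hours worked, and so on).
    employed: bool
    available_for_work: bool
    looking_for_work: bool


def is_unemployed(p: Person) -> bool:
    """Rule from the text: not employed AND available AND looking for work."""
    return (not p.employed) and p.available_for_work and p.looking_for_work


# Assumption: this full-time student is recorded as neither available nor looking.
full_time_student = Person(employed=False, available_for_work=False, looking_for_work=False)
job_seeker = Person(employed=False, available_for_work=True, looking_for_work=True)

print(is_unemployed(full_time_student))  # False under this rule
print(is_unemployed(job_seeker))         # True under this rule
```

Changing how a borderline case is recorded changes the answer, which is exactly why the definition has to be fixed before any measurement is made.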

The definition must be such that measurement is then possible. Sometimes a good theoretical
definition of a variable does not lend itself easily to measurement; it then has to be adapted
from a practical point of view so that measurement is possible.

Furthermore, definitions of a given variable may vary over time and methods of measurement
may vary too! Hence particular care must be given to the problems of definition and
measurement of a given variable so that these measurements are comparable over time and
space as well.

1.3 NATURE OF STATISTICAL DATA

The discussion in the two preceding sections will most probably help us to become aware of
the fact that available data must be used with some caution.

For this reason, data are categorised in two ways: primary data and secondary data. Primary
data are data which have been collected for a specific purpose and are being used for that
purpose. That would include, amongst others, data collected by someone by means of a
sample survey or an experiment with some clearly defined objectives in mind.

Secondary data are data available in many statistical publications produced by the Central
Statistical Office and by other institutions whether governmental or from the private sector.
They include data which have been collected for a specific purpose but which are being used
for various other studies. Thus, government departments may collect data for administrative
reasons; such data were not gathered specifically for the particular study being carried out.

It is obvious that secondary data must be used with much caution. To start with, the sources
of secondary data must be known. This helps to ascertain that the data are genuine and that they
have been produced by competent institutions having the required expertise. Various valuable
pieces of information would then be available:

(i) the definitions of variables used and problems of measurement involved;

(ii) the method of data collection, for this will give us an idea of the degree of accuracy of
the available data;

(iii) the date of collection which would be relevant with respect to possible change in
definition used over time and with respect to up-to-date character of the data collected
as well;

(iv) the units of measurement used; for example the average monthly salary of a Mauritian
in rupees is not comparable to the average monthly salary of an English person in
pounds sterling. Similarly, the month as a measure of time is not constant since each
month does not have the same number of working days.

Finally, it may be appropriate to note that data may be collected on, for example, the whole
student body or on a fraction of the student body, as mentioned in section 1.1. Sometimes a
statistical investigation is carried out on the entire group of units/individuals about which
information is wanted; such an entire group is known as the statistical population. We have
thus the population of students, population of cattle, population of buildings etc. A sample,
however, is a part of the population used to gain information, which, after proper statistical
analysis, can be generalised to the whole population. More will be said on samples and
different types of statistical investigations in other units.

1.4 A LAST WORD

Statistics is a fast developing subject, having a wide range of applications: biometrics,
econometrics, psychometrics, statistical quality control, etc. Over the last sixty years or so,
there has been a constant flow of new ideas in Applied Statistics as well as in Theoretical
Statistics and probability. So much so that different schools of thought have emerged in
Statistics. This is a healthy sign in a developing subject.

For our purposes, we may say, in simple terms, that the objective of Statistics is the
understanding of information contained in data characterised mainly by uncertainty. That
understanding demands one essential ingredient on your part: common sense! Everything
else would be straightforward. In fact, the psychologist S. S. Stevens referred to Statistics as

"... a straightforward discipline designed to amplify the power of common sense
in the discernment of order amid complexity."

UNIT 2 DATA COLLECTION I

Unit Structure

2.0 Overview
2.1 Learning Objectives
2.2 The Collection of Data I
2.2.1 Introduction
2.2.2 Quantitative v/s Qualitative Approach
2.3 Routine Data Collection (as By-product of Administrative Procedures) v/s Special
Investigations
2.4 Censuses v/s Sample Surveys
2.4.1 Introduction
2.4.2 Comparative Advantages of Sample Surveys over Censuses
2.4.3 Sources of Errors in Censuses and Sample Surveys
2.4.3.1 Sampling Errors
2.4.3.2 Non-Sampling Errors
2.5 Mode of Administration of a Questionnaire
2.5.1 Face to Face Interviewing
2.5.2 The Postal Method
2.5.3 The Telephone Method
2.6 Stages in a Sample Survey
2.7 Summary

2.0 OVERVIEW

This unit introduces you to the various approaches to data collection, the basic principles and
various ways of collecting quantitative data. Comparisons of the relative strengths and
weaknesses of alternative methods are included. Data collection is covered in OJ in Chapters
15 and 16. However, note that the material in OJ on sampling (Chapter 15) is not considered
appropriate for this course. You may find Chapter 16 of OJ useful supplementary reading to
the material in this manual.

2.1 LEARNING OBJECTIVES

When you have successfully completed this Unit, you should be able to do the following:

1. Identify the various methods of collecting quantitative data
2. Differentiate between censuses and sample surveys as means of collecting quantitative
data
3. Explain the various ways of administering a survey questionnaire and analyse their
relative strengths and weaknesses
4. Identify the various stages involved in a sample survey

2.2 THE COLLECTION OF DATA I

2.2.1 Introduction

In Unit 1, the importance of statistical data for informed decision making and planning was
mentioned. However, data do not just exist. They have to be collected. And data collection
can be a complex and technical task. It can also be very costly and time consuming. The
coverage of data collection in this course therefore is not intended to equip you to embark on
a complex and large scale data collection exercise on your own (much further study will be
required for this!) but rather to provide you with a basic appreciation of the general principles
of data collection, the various stages involved, the dangers to avoid and the precautions to
take. Additionally, this unit should encourage you to examine published data with a more
critical mind, to appreciate their limitations as well as their strengths and to exercise caution
in their use.

2.2.2 Quantitative v/s Qualitative Approach

There are two broad approaches to collecting data: the qualitative approach and the
quantitative one. Each of these approaches has its merits and limitations. The distinguishing
feature of the quantitative approach is that it uses standard instruments and procedures (e.g. a
standard questionnaire, with fixed sequence and phrasing of questions and uniform field
procedures). This makes responses comparable and allows them to be aggregated so as to
produce percentages, rates, averages etc. Hence it is possible, for example, to estimate the
proportion of a given population that possesses a certain characteristic or to quantify the
extent to which specific views or attitudes are held. Sample surveys using standard
questionnaires and uniform field procedures represent a major example of the quantitative
approach. By uniform field procedures we mean, for example, that questionnaires are
administered to all respondents in the same way, say by face-to-face interview, and that
interviewers are trained to ask the questions and to deal with any problems arising in the field
in exactly the same way. The great advantage of the quantitative approach is that the results
are quantifiable and generalisable.
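
As a small illustration of how standardised answers can be aggregated into percentages, the sketch below counts the replies to a single fixed-wording question. The eight responses are made up for the example and do not come from any real survey.

```python
from collections import Counter

# Hypothetical answers to one standard question, recorded with fixed categories.
responses = [
    "satisfied", "dissatisfied", "satisfied", "neutral",
    "satisfied", "dissatisfied", "satisfied", "neutral",
]

counts = Counter(responses)   # number of respondents per answer
total = len(responses)

# Percentage of respondents giving each answer.
percentages = {answer: 100 * n / total for answer, n in counts.items()}

for answer, pct in percentages.items():
    print(f"{answer}: {pct:.1f}%")
```

This kind of tabulation only makes sense because every respondent was asked the same question in the same way; free-form answers from a qualitative interview could not be counted so directly.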

In the qualitative approach, instruments and procedures are more flexible and informal. There
is usually no standard questionnaire: the ordering of questions may vary and the phrasing of
questions is not rigid. Examples of the qualitative approach are the key informant approach
(where persons having specialised knowledge of the subject of interest, by virtue of their
occupations, are interviewed) and the focus group approach (where people are interviewed in
groups, in a rather informal way). Further examples (by no means an exhaustive list) of the
qualitative approach are participant observation and case studies. Certain qualitative
approaches have the advantages of low cost, rapidity and depth, but the emphasis with qualitative
approaches is not on quantitative information. Thus, for example, interviews of trade union
representatives and focus groups of a small number of workers may indicate that the majority
of workers are against a proposed measure and that men are more strongly opposed than
women. However, it would not be possible, with any confidence, to generalise these
conclusions to all workers, and still less to quote percentages of those for and against the
measure.

In this course we shall deal only with the quantitative approach.

2.3 ROUTINE DATA COLLECTION (AS BY-PRODUCT OF ADMINISTRATIVE
PROCEDURES) V/S SPECIAL INVESTIGATIONS

Often there exist opportunities for collecting quantitative data in the course of administrative
control procedures. For example, every person entering Mauritius has to go through the
immigration authorities, as is the practice in all other countries. This provides an opportunity
for collecting information on tourist arrivals, which is in fact done through the well known
disembarkation card. Similarly, anyone importing goods into the country has to go through
customs for control and taxation purposes, but this also provides an opportunity for collecting
data on imports such as the type of product, the origin, etc.

Collection of data as a by-product of administrative control is generally inexpensive. Often
the same forms or schedules are used for both administrative and statistical purposes.
However, much care must be taken in designing these forms or schedules, as what is suitable
for administrative purposes may not always be relevant for statistical purposes. In particular,
attention must be given to the definitions of terms used. Also, care must be taken not to
burden the administration too much by making the forms too long or complicated. Sometimes
separate forms for statistical purposes are necessary.

It is not always possible to obtain the data one needs as a by-product of administrative
procedures. It then becomes necessary to conduct special, dedicated investigations, with the
specific purpose of collecting the required data. This process can be quite costly, but the
importance and potential use of the data may well justify the expenditure. Two alternative
approaches are possible. The investigation may involve collecting data in respect of every
member of the population of interest (i.e. a Census). Alternatively, it may involve collecting
data in respect of a sample of the population. We discuss these two approaches next.

2.4 CENSUSES V/S SAMPLE SURVEYS

2.4.1 Introduction

A census involves the collection of data in respect of every member of a population of
interest. Familiar examples of censuses are the Housing and Population Censuses carried out
in Mauritius by the Central Statistical Office every ten years.

A sample survey involves the collection of data in respect of only some of the members of the
target population but with the purpose of learning about the whole of that population.
Examples of important national sample surveys carried out by the Central Statistical Office in
Mauritius are the Family Budget Sample Survey and the Labour Force Sample Survey.

This idea of examining a part to learn about the whole, which is what a sample survey is all
about, is familiar and intuitively appealing, and we apply it in our everyday lives, often
unwittingly. For example, when buying grain for the household, we usually examine a
handful to check the quality before making our purchase. Of course, in order for observations
on the part to provide a valid basis for conclusions about the whole, certain precautions must
be taken in the selection of that part. We simply cannot use any part. We should ensure that
every member of the population has a fair chance of selection and this is achieved by a
method of selection which we call random selection. We should also aim at drawing a sample
that is likely to be representative of the whole population. We shall not pursue the matter
further here, but in Unit 11, we shall discuss the basic principles involved in selecting valid
samples.
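
A minimal sketch of random selection is given below, using Python's standard `random.sample`, which draws without replacement so that every member of the list has the same chance of being chosen. The list of 5000 student identifiers, the sample size of 200 and the seed are all hypothetical values chosen for illustration.

```python
import random

# Hypothetical list of every member of the population (a "sampling frame").
population = [f"S{i:04d}" for i in range(1, 5001)]

# A fixed seed is used only so that the example is reproducible.
rng = random.Random(2008)

# Draw 200 distinct students, each with an equal chance of selection.
sample = rng.sample(population, k=200)

print(len(sample))   # 200
print(sample[:5])    # first few selected identifiers
```

The essential point is that the choice is left to chance rather than to the judgement or convenience of the person collecting the data.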

2.4.2 Comparative Advantages of Sample Surveys over Censuses

As a means of collecting quantitative data, the sample survey has a number of advantages
over the census approach.

(i) Sample surveys require fewer resources and are far less costly than censuses.

When the population of interest is large, a Census becomes a very costly exercise. For
a small population, a Census could be considered as the cost may be moderate. For
large populations, it is avoided. Nevertheless, for certain purposes, although the
population may be very large, a Census is absolutely necessary and a sample survey
would not be appropriate. In such cases, Censuses are carried out at infrequent
intervals. The Population and Housing Censuses are carried out in Mauritius at 10
year intervals.

(ii) Sample surveys are less time consuming and hence, results are more timely.

For a Census, because of the sheer scale of the data collection, the processing of data
takes a lot of time. Not so long ago, data from Population Censuses used to take years
to process, even in developed countries, at times dragging on almost to the next
census. Under these circumstances, the results from the census were largely obsolete
by the time that they were out. With the advent of electronic processing, things have
improved a lot but it still takes a number of months to process data from a population
or housing census.

(iii) In a sample survey, because only a small portion of the population is involved, that
portion can be studied intensively.

In investigations of human populations, one important consideration is the need to
limit the burden on the respondent, i.e. on the individual contacted to provide the data.
In a census of a large population therefore, the questions must be simple and factual
and their number must be kept small because many people would have the burden of
answering the questions. In a sample survey, since relatively few people are
involved, we can ask more questions and the questions can be more complex if
necessary.

Furthermore a census is inappropriate for asking in-depth questions and questions on
opinions and attitudes. Highly skilled interviewers are required in these cases. A
census of a large population requires a very large number of interviewers and it is
usually not easy to find such a large number of highly skilled interviewers.

As an illustration of the above, it may be noted that typically, the Population Census
carried out in Mauritius involves 25 to 30 simple factual questions. However, there are
sample surveys that have been carried out by the University of Mauritius involving at
times over 150 questions, many of them complex ones, often dealing with attitudes,
opinions and perceptions.

(iv) In certain contexts, data collection may involve destruction of the individual from
whom the data are collected, in which case a census is out of the question.

For example, studying the life of electric bulbs would involve lighting them until
they burn out. Hence, if a bulb manufacturer used a census to study the life of his
bulbs, he would soon be left with no bulbs to sell!

In spite of the above, censuses are sometimes necessary because of the level of detail
required. For example, for local planning purposes, detailed information about all
towns and villages of the country is required. A national sample survey will not
contain enough members of each town or village for accurate information in respect of
each of them to be obtained. Indeed, certain towns or villages may not even appear in
the sample at all.

Censuses may also provide the sampling frame for future sample surveys.
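
To see why a small village may be missing from a national sample altogether, the sketch below computes the chance that a simple random sample drawn without replacement contains no member of a given locality. The population of 1,200,000, the sample of 2,000, the village of 300 people and the town of 30,000 people are hypothetical figures, and the function name is our own.

```python
def prob_not_sampled(population_size: int, group_size: int, sample_size: int) -> float:
    """Probability that none of the sampled individuals belongs to the group,
    when the sample is drawn at random without replacement."""
    p = 1.0
    for i in range(sample_size):
        # Chance that the (i+1)-th draw also misses the group,
        # given that all earlier draws missed it.
        p *= (population_size - group_size - i) / (population_size - i)
    return p


# Hypothetical figures for illustration only.
p_village = prob_not_sampled(1_200_000, 300, 2_000)
p_town = prob_not_sampled(1_200_000, 30_000, 2_000)

print(f"village missed entirely: {p_village:.2f}")
print(f"town missed entirely:    {p_town:.2e}")
```

Under these assumed figures the village is left out of the sample more often than not (a probability of about 0.6), while the town is practically certain to be represented, although still by only a handful of people.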

2.4.3 Sources of Errors in Censuses and Sample Surveys

2.4.3.1 Sampling Errors

Suppose that we want to find out the average weight of all students of the University of
Mauritius, and we do this by selecting a sample of say 200 students in accordance with the
principles of scientific sampling (to be discussed in Unit 11). We then find the average weight
of our sample of students. What we get is an estimate. We may expect this estimate to be
close to the true average weight of all students but we cannot expect it to be exactly equal,
except by coincidence. Differences between the estimates based on samples and the true
population values are called sampling errors.

If we were to start anew and repeat the process, i.e. draw a sample again (after putting back the 200
students), we would most likely obtain a different sample, although there may well be some
students who appeared in the first sample as well. If we now compute the average weight of
the sampled students again, we expect the result to differ from the first one, except by
coincidence. Such differences are called sampling variation.
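
The two ideas can be illustrated with a short simulation. The sketch below builds an artificial population of student weights (8,000 values generated around 62 kg, which is an assumption and not real data), draws two random samples of 200, and compares each sample average with the true population average.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the example is reproducible

# Artificial population of weights in kg (made-up, for illustration only).
population = [rng.gauss(62, 10) for _ in range(8000)]
true_mean = statistics.fmean(population)


def sample_mean(n: int = 200) -> float:
    """Average weight in a random sample of n students drawn without replacement."""
    return statistics.fmean(rng.sample(population, n))


first = sample_mean()
second = sample_mean()

print(f"true population mean : {true_mean:.2f}")
print(f"first sample estimate : {first:.2f}  (sampling error {first - true_mean:+.2f})")
print(f"second sample estimate: {second:.2f}  (sampling error {second - true_mean:+.2f})")
print(f"difference between the two samples: {second - first:+.2f}")
```

Each estimate misses the true mean by a small amount (the sampling error), and the two estimates differ from one another (sampling variation), yet both remain close to the true value.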

Thus the estimates based on samples are not in general exactly equal to the true population
values and they also vary from one sample to another. However, if the size of the sample is
sufficiently large, we can be reasonably certain that the estimate will be close to the true
population value. The theory of sampling (which is beyond the scope of this course) gives us
this guarantee. This guarantee gives sampling its power and makes it a viable alternative to
complete enumeration (census). Thus national surveys using samples of between 1000 and
3000 individuals are carried out in many countries (with populations of several millions).
Actually, sampling theory enables one to estimate the required sample size for a given degree
of precision. We shall not go into this, but please note that the common belief that the larger
the population, the larger the required sample, is not quite true. In fact, once the population is
large, the required sample size hardly depends on the population size.

Of course, because censuses involve complete enumeration, they are not subject to sampling
errors.
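
Although sampling theory is beyond the scope of this course, a short calculation may make the remark on population size more believable. The sketch below uses the standard textbook formula for the standard error of a sample mean under simple random sampling without replacement, SE = (sigma / sqrt(n)) * sqrt((N - n) / (N - 1)). The formula is not derived in this manual, and the values of sigma, n and N are hypothetical.

```python
import math


def standard_error(sigma: float, n: int, N: int) -> float:
    """Standard error of the mean of a simple random sample of size n drawn
    without replacement from a population of size N with standard deviation sigma."""
    return sigma / math.sqrt(n) * math.sqrt((N - n) / (N - 1))


sigma = 10.0  # assumed population standard deviation

# Same sample size, very different population sizes.
se_small_pop = standard_error(sigma, n=1000, N=100_000)
se_large_pop = standard_error(sigma, n=1000, N=10_000_000)

# Same population, sample four times as large.
se_bigger_sample = standard_error(sigma, n=4000, N=100_000)

print(f"n=1000, N=100,000    : {se_small_pop:.4f}")
print(f"n=1000, N=10,000,000 : {se_large_pop:.4f}")
print(f"n=4000, N=100,000    : {se_bigger_sample:.4f}")
```

Multiplying the population size by one hundred leaves the precision almost unchanged, whereas multiplying the sample size by four roughly halves the standard error. Precision is governed mainly by the size of the sample, not by the size of the population.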

2.4.3.2 Non-Sampling Errors

It is commonly believed that because a census is an exhaustive exercise and is therefore not
subject to sampling errors, it must be more reliable than a sample survey. This is not
necessarily the case. Both the census and the sample survey are subject to other errors called
non-sampling errors. Important types of non-sampling errors are the following:

(i) errors of omission: omission occurs when individuals who belong to the target
population are forgotten or somehow not reached. For example, homeless people or
people who have no fixed abode may easily be missed.

(ii) non-response: non-response occurs when people contacted are not at home or refuse
to participate in the survey. Non-response is a serious problem because people who
refuse to participate may have different opinions on aspects pertinent to the subject of
the survey from those who cooperate. For example, suppose we carry out a survey on
leisure and we have a lot of non-response. It is quite possible that those who did not
respond are very busy and have little leisure time. Therefore conclusions based on
those who responded would be misleading.

(iii) interviewer bias: interviewer bias occurs when the responses obtained are influenced
by the interviewers. This may happen in a number of ways: an unskilled interviewer
may influence the respondent to answer in a particular way by his/her intonation or
facial expression during interviewing, or by the way he or she clarifies a question
which has not been understood or probes for more information when an answer is
ambiguous or incomplete. It may also occur through misinterpretation and misrecording
of answers, caused by the interviewer's preconceptions.

(iv) coder bias: coder bias may occur when answers to questions which have been
recorded verbatim by the interviewer are coded in the office for the purposes of
analysis. Interpretation given to answers and the codes assigned as a result may be
influenced by the coder's preconceptions.

All of these errors can occur with a census, as well as with a sample survey. However,
because a sample survey involves only a small number of respondents, the efforts made to
minimise non-sampling errors can be more intensive than they could be for a census.

2.5 MODE OF ADMINISTRATION OF A QUESTIONNAIRE

The process of implementing a questionnaire designed for data collection, i.e. of getting the
questionnaire completed, is called administering the questionnaire. There are four basic ways
of doing this:

(i) by observation
(ii) by face to face interviewing using interviewers
(iii) by mailing the questionnaire to all individuals from whom the data are to be collected
and asking them to complete and return it to the investigator
(iv) by interviewing individuals by telephone.

The scope for collecting information by observation is rather limited as the method requires
that the phenomenon being studied be observable. Some interesting possibilities do
nevertheless exist. It is thus possible to study the intensity of traffic flow by standing at a
particular spot and observing the number of vehicles that go by. However, in the discussion
which follows, we shall restrict ourselves to the other three modes. Choosing among these
alternative modes requires a thorough knowledge of their relative strengths and limitations.

2.5.1 Face to Face Interviewing

The face to face method of administering a questionnaire has a number of advantages:

(i) The response rate tends to be high, as possibly people find it hard to refuse when the
interviewer is standing right in front of them. Several sample surveys carried out by
the University of Mauritius using face to face interviewing have easily reached 95%
response.

(ii) The face to face approach, because it uses trained interviewers, makes it possible to
administer a complex questionnaire (e.g. a questionnaire which contains attitude and
opinion questions and a lot of skip instructions). When the questionnaire is self-
administered (i.e. filled by the respondent) as in a postal survey, the questionnaire
must be kept simple.

(iii) The face to face method provides an opportunity for the interviewer to find out the
reasons for any reticence on the part of the person contacted and to persuade the
person to cooperate.

(iv) The face to face method has practically no restrictions on the type of population that
can be investigated. With the face to face method, the interviewer reads out the
questions and records the answers. It is therefore not necessary for the respondent to
be literate as is the case with the postal method. The telephone interview method,
however, requires that the respondent be reachable by phone.

(v) With the face to face method, there is more control over the identity of the respondent.
In the postal survey, the person to whom the questionnaire is addressed may decide to
pass over the questionnaire to someone else to fill in his/her place.

(vi) The face to face method can be used for practically any topic of enquiry. Some people
believe that for sensitive or embarrassing topics, postal surveys are better because of
the relative anonymity. However, experience shows that, given trained interviewers
and the appropriate precautions, the face to face method works very well even for
sensitive topics. Moreover, it is difficult to see why people would bother to answer an
embarrassing questionnaire sent by mail.

(vii) The face to face method provides an opportunity for clarifying questions which the
respondent finds to be unclear.

(viii) The face to face method provides an opportunity for probing (i.e. asking for
additional information) if the answer given by the respondent is incomplete or
ambiguous.

(ix) With the face to face method, the interviewer can ensure that the sequence of the
questions as it appears on the questionnaire is respected. This is usually very
important. With the postal method, respondents have the opportunity to see all the
questions before answering any of them.

The great disadvantage of the face to face method is that it is costly, much more costly than
either the postal method or the telephone interview method. It also requires trained
interviewers.

2.5.2 The Postal Method

The main advantage of the postal method (also called the mail method) of administering a
questionnaire is its relatively low cost. The cost, it must be noted however, is not limited to
the initial cost of mailing out the questionnaires: usually reminders have to be sent out and
sometimes there are follow-up phone calls and even personal visits, which raise the cost.

Also, the postal method does not require trained interviewers.

The postal method has a number of disadvantages relative to the face to face method:-

(i) The response rate is usually low, often of the order of 30 to 40%, if not less. This is a
very serious disadvantage.

(ii) The questionnaire must be kept simple.

(iii) There is less opportunity to persuade people who are reticent to answer the questions.
Follow-up by phone is a possibility but it is not as effective as the face to face
presence of an interviewer.

(iv) Respondents can see all the questions before they answer any. This is usually not
desirable.

(v) The method is of course restricted to a target population that is literate.

(vi) It is important that the information obtained relates to the person selected and not
someone else. However, with the postal method, control over who actually answers
the questions is difficult. The person to whom the questionnaire is addressed may pass
it on to another family member or a friend for completion.

It is not certain, in that case, that the family member or friend would provide the same
information had the questionnaire been filled by the selected person. Especially in the
case of opinion or attitude questions, there is a high risk that the friend or family
member would substitute his or her own views in the questionnaire.

(vii) If a respondent finds a question unclear, he or she may ignore it or give an irrelevant
answer. There is no opportunity to detect that a respondent has misunderstood a
question as with the face to face method.

(viii) If the answer to a question is incomplete or ambiguous, there is no opportunity for
probing as in the case of the face to face method.

2.5.3 The Telephone Method

In terms of advantages and disadvantages, the telephone method is intermediate between the
face to face method and the postal method in many respects:

(i) The telephone method is less costly than the face to face method. However, it is
generally more costly than the postal method.

(ii) The telephone method does require trained interviewers. However travel costs and
travel time are eliminated. Interviewers spend all their time in an office doing
interviews by phone. Each interviewer can thus do more interviews.

(iii) The questionnaire can be more complex than with the postal method but it is not
advisable to attempt to administer a very long questionnaire by phone.

(iv) There is more control over the identity of the respondent than with the postal method,
although less than with the face to face method.

(v) There is opportunity for persuading reticent respondents, although the face to face
method is probably more effective in doing that.

(vi) There is opportunity for clarification if questions are not clear to respondents,
although, here again, this is more difficult to do over the phone than face to face.

(vii) There is opportunity for probing if respondents' answers are incomplete or ambiguous
but the same qualification as for (vi) applies.

(viii) The sequence of questions on the questionnaire can be respected. Respondents do not
have the opportunity of knowing all questions appearing on the questionnaire before
they start answering as in the case of the postal method.

The great disadvantage of the telephone method is that it can only be used when all members
of the target population are reachable by phone.

2.6 STAGES IN A SAMPLE SURVEY

From earlier discussion, it is clear that sample surveys are an important means of collecting
data.

We conclude this unit with a list of the main stages involved in a sample survey:

(i) Clear definition of the objective of the survey

A clear definition of the objective is fundamental for a survey. This will help make key
decisions in the subsequent stages. It is not sufficient to just define a broad objective although
one must start by that. It is necessary to break down this broad objective into finer objectives
for subsequent operationalisation.

(ii) Clear definition of the target population

It is necessary to be clear about what constitutes the target population and the unit of
investigation. For example, if we are doing a survey among the students of the University, do
we wish to cover part time students or only full time ones? If we are doing a survey on
consumer expenditure, is our unit of enquiry the household or the individual?

(iii) Sample design and determination of sample size

This will be dealt with in detail in Unit 11.

(iv) Questionnaire design

This will be dealt with in detail in Unit 11.

(v) Recruitment and training of field staff

The quality of data collected depends critically on the competence and dedication of the field
staff involved. Therefore great care should be applied in the recruitment and training of such
staff.

(vi) Pilot survey and pre-testing the questionnaire

A pilot survey consists of a rehearsal of all the survey procedures on a small number of
respondents. This process is very important as it permits the identification of any flaws or
weaknesses in the questionnaire, which can thus be remedied. It also provides a lot of
information about field procedures e.g. whether the method of approaching the respondent is
satisfactory, how long it takes to administer the questionnaire, how easy it is to locate the
respondents, how many call backs are required on average, etc. This information helps to
organise the full scale survey.

(vii) Conduct of interviews

This stage applies when face to face interviewing is used. We have mentioned the danger of
interviewer bias before. Interviewers need to possess a variety of skills, ranging from
approaching the respondent, establishing rapport, persuading respondents to cooperate, asking
questions in a neutral manner and recording the answers correctly. Training and experience
are important but supervision and control are also necessary.

(viii) Editing of completed questionnaires

Completed questionnaires may contain a number of problems, such as blanks (i.e. questions
which have not been answered), ambiguous or irrelevant or inconsistent answers. Therefore,
before the data are processed and analysed, it is necessary to screen the questionnaires for
such problems and remedy them. It is advisable to have a first edit carried out in the field by
the interviewer immediately after the interview, as any problems can then be remedied
immediately. A second edit needs to be done by field supervisors to detect any mistakes that
may have gone unnoticed by the interviewer. Further edits can be done in the office,
including a computer edit stage.

(ix) Coding of answers where required for data entry

Where a questionnaire contains open ended questions i.e. questions where no pre-coded
answers are proposed, the answers must be coded before processing. Such coding must be
done carefully, ensuring that there is consistency both across and within coders i.e. the same
code is used for similar answers by different coders or by the same coder on different
occasions.

(x) Data Entry

The data collected must in general be captured on computer for eventual processing and
analysis. At this stage, it must be ensured that no errors are made during the transfer.

(xi) Data Processing and analysis

The data processing and analysis are usually done with the help of appropriate statistical
software. The objectives as defined in the very first stage will guide the analysis.

(xii) Interpretation and report writing

Care must be taken to ensure correct interpretation and the report writing needs to take into
consideration the readers targeted.

The success of a survey depends on strict observance of precautions and meticulous attention
to quality control at every stage.

2.7 SUMMARY

In this unit you have studied the various methods of collecting quantitative data, the
differences between censuses and sample surveys, the various ways of administering a survey
questionnaire (including their strengths and weaknesses) and the various stages of a sample
survey.

UNIT 3 ORGANISATION AND PRESENTATION OF DATA I

Unit Structure

3.0 The Aim and Forms of Presenting Data
3.1 Overview
3.3 Organisation and Presentation of Data I
3.3.1 Data types
3.3.2 Tabulations
3.3.3 The Stem and Leaf Diagram
3.3.4 The Time Series
3.4 Secondary Statistics
3.5 Interpretation of Tables
3.6 Summary

3.0 THE AIM AND FORMS OF PRESENTING DATA

The aim of presenting figures is to communicate information. Therefore the type of
presentation depends on the requirement and interests of the people receiving the information.
There are, in effect, several forms of presentation:

Tabulation is covered in Unit 3.
Charts and diagrams are covered in Unit 4.
Graphs are covered in Unit 5.

3.1 OVERVIEW

Chapter 1 of your textbook (OJ) introduces the methods of arranging data in tabular form.
The textbook, as well as Unit 3 of this course manual, covers the key aspects of tabular
presentation, the different types of tables and secondary statistics.

When you have successfully completed this Unit, you will be able to:

1. Explain the importance of tabular presentation.
2. Identify the general principles of general tabulation.
3. Use the different types of tabulation.
4. Explain the importance of secondary statistics.
5. Use correctly the different secondary statistics to shed light on data.
6. Interpret information contained in tables and other forms of presentation.

3.3 ORGANISATION AND PRESENTATION OF DATA I

3.3.1 Data Types

Read pp 1- 4 of textbook (OJ).

Activity 1 Attempt Questions 1.2(a), 1.3 from textbook (OJ).

3.3.2 Tabulations

Read pp 4 - 13 of textbook (OJ).

3.3.2.1 Construction of tables

In the construction of tables, there are important guidelines to consider:

- Be sure what you want the table to show.
- All tables should have an explanatory title.
- The source of the data must be included (usually below the table) so that the original
  sources can be checked.
- Tables should be neat and tidy, and written in good handwriting.
- To improve the quality of the table, make judicious use of different types of print.
- Column and row headings should be brief but self-explanatory.
- Units of measurement should be shown clearly.
- Approximations and omissions can be explained in footnotes. However, footnotes
  should be kept to a strict minimum.
- Double lines or thick lines can be used to break up a large table and make it easier to
  read.
- Two or three simple tables are often better than one very large table.
- Sets of data which are to be compared should be close together.
- Secondary statistics, such as percentages and averages, should be beside the figures to
  which they relate.
- In the particular case of frequency tables, the construction of classes should be done
  judiciously, with particular attention to the class boundaries and class widths.

3.3.2.2 Class boundaries, class limits, class widths and class midpoints

Two important principles that must be observed when classifying data into categories are that
the categories should be (i) mutually exclusive -- i.e. there must be no overlap among
categories -- and (ii) jointly exhaustive -- i.e. together the various
categories should cover the whole range of the data. These principles apply to the
construction of frequency tables.

Conversely, when studying frequency tables prepared by others, it is important to be clear
about the boundaries of each class. The correct determination of class boundaries, and hence
of class widths and class midpoints, is, as you will discover later on, pertinent to the
computation of the mean, median etc. These boundaries often are not what they seem to be at
first sight.

At this stage, it is useful to draw a distinction between class boundaries and class limits:-

Class limits identify the inclusive values in a class of a frequency table.

Class boundaries are the specific points along a measurement scale that separate
adjoining classes. These can be different from the class limits.

We cannot give general rules for determining class boundaries. These have to be determined
on a case by case basis, applying some common sense. The key is to try and figure out what
are the smallest and largest values that would have been placed in each of the classes when
the table was compiled. A consideration of whether the variable involved is discrete or
continuous is also useful. Once the class boundaries have been correctly determined, the
class width is obtained simply from the difference between the upper and the lower
boundaries, whereas the class mid point is obtained by averaging the same boundaries.

Example 1

Table 3.1

Length of rod      Number of rods
(nearest cm.)
11 - 15                   5
16 - 20                  12
21 - 25                  23
etc.                   etc.

Since the lengths are given to the nearest centimetre, the boundaries of the first class extend
from 10.5 to 15.4999....., which for practical purposes you can take as 10.5 to 15.5 so that the
class width is 5 and the class midpoint is 13.0.

Example 2

Table 3.2

Hours of sunshine Number of days

0 and under 2 3
2 and under 4 15
4 and under 6 59
6 and under 8 92
etc. etc.

In this case, the class boundaries coincide with the class limits.

Example 3

Table 3.3

Number of calls made Number of subscribers

1-10 9
11-15 12
16-20 24
21-25 16
26-40 14
__
Total 75
__

Here the variable is intrinsically discrete (i.e. calls can take only whole numbers) and only
class limits are visible; for example, the lower class limit (l.c.l) and upper class limit (u.c.l) of
the first class are respectively 1 and 10. Those for the second class are respectively 11 & 15
and so on for the other classes.

Often, for analytical purposes, discrete variables are converted into continuous variables, i.e.,
rather than the variable taking a countable number of values in a particular interval, we assume
that in this interval it takes all the possible values. Later, you will see why we do that,
especially when we construct the histogram and compute the median.

So, for this case, the lower class boundary (l.c.b) and upper class boundary (u.c.b) of the
second class, for example, are taken respectively as 10.5 and 15.5.

Similarly those of the third class are 15.5 and 20.5. Note that the class boundaries are
obtained by subtracting 0.5 from the lower class limit and adding 0.5 to the upper class limit.

Note that u.c.b of a class coincides with l.c.b of the next class.

Class width = u.c.b - l.c.b

Class mid point = (u.c.b + l.c.b) / 2 = (u.c.l + l.c.l) / 2

For example, the class width of the second class is 5 and the class midpoint is 13.
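The calculations for Table 3.3 can be checked with a short Python sketch. It is an illustration only: the function name class_summary and its unit parameter are our own, and the half-unit adjustment assumes a variable recorded in whole units, as in Example 3.

```python
def class_summary(lower_limit, upper_limit, unit=1):
    """Return (l.c.b, u.c.b, class width, class mid point) for one class.

    Assumes the variable is recorded in steps of `unit`, so that each class
    boundary lies half a unit beyond the corresponding class limit.
    """
    half = unit / 2
    lcb = lower_limit - half
    ucb = upper_limit + half
    return lcb, ucb, ucb - lcb, (lcb + ucb) / 2


# Class limits of Table 3.3 (number of calls made)
table_3_3 = [(1, 10), (11, 15), (16, 20), (21, 25), (26, 40)]
summaries = [class_summary(lo, hi) for lo, hi in table_3_3]

for (lo, hi), (lcb, ucb, width, mid) in zip(table_3_3, summaries):
    print(f"{lo:>2}-{hi:<2}  boundaries {lcb:>4} to {ucb:<4}  "
          f"width {width:>4}  mid point {mid}")
```

For the second class the sketch returns boundaries 10.5 and 15.5, a width of 5 and a mid point of 13, in agreement with the text. The half-unit rule is not universal: boundaries still have to be judged case by case, as the other examples in this section show.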

Example 4

Table 3.4

Age Number of club members

10-19 185
20-29 263
30-39 325
40-49 442
etc. etc.

The boundaries of the first class are 10 and 20 respectively. Try to figure out why. Note that
age is usually quoted as age at last birthday.

3.3.2.3 Types of tables

There are many types of tables, as you may have noticed in publications, journals and
magazines and in company reports.

Tables can be divided into

Frequency tables
Two-way tables or contingency tables or cross tabulation
General tables

Examples of frequency tables are clearly illustrated in your textbook (OJ). An
example of a two-way table is provided in this unit.

Two way tables

Example 5
Table 3.5
Student Marks in English and Maths

Student English Maths Student English Maths
1 35 40 11 47 49
2 32 41 12 61 54
3 41 50 13 63 61
4 31 27 14 58 73
5 65 66 15 72 82
6 42 66 16 69 76
7 58 72 17 58 69
8 71 80 18 55 54
9 82 58 19 48 58
10 64 59 20 50 44

Source: University X, 1971

The table above gives the marks in English and Mathematics gained by twenty students.
Arrange these results into a two-way grouped frequency distribution.

Answer to Example 5
Table 3.6
Student Marks in English and Maths

Eng\Maths      0-20  21-30  31-40  41-50  51-60  61-70  71-80  81-90 91-100  Total
0-20              A                                                       D      0
21-30                                                                            0
31-40                    1      1      1                                         3
41-50                                  3      1      1                           5
51-60                                         1      1      2                    4
61-70                                         2      2      1                    5
71-80                                                       1      1             2
81-90                                         1                                  1
91-100            C                                                       B      0
Total             0      1      1      4      5      4      4      1      0     20

Source: University X, 1971

We observe a direct relationship between the scores in English and Maths, as the entries lie
along the diagonal from A to B, i.e. students doing well in Maths tend to do well in English.
Had the trend been from C to D, then we would have said that an inverse relationship exists, i.e.
students scoring high marks in English tend to score low marks in Maths.
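The tallying that leads from Table 3.5 to Table 3.6 can also be done by computer. The Python sketch below is one possible way of doing it and is offered as an illustration only; the names class_of and label are our own, and the class limits are those used in Table 3.6.

```python
from collections import Counter

# (English, Maths) marks of the twenty students in Table 3.5
marks = [(35, 40), (32, 41), (41, 50), (31, 27), (65, 66), (42, 66), (58, 72),
         (71, 80), (82, 58), (64, 59), (47, 49), (61, 54), (63, 61), (58, 73),
         (72, 82), (69, 76), (58, 69), (55, 54), (48, 58), (50, 44)]

# Class limits as in Table 3.6: 0-20, then 21-30, 31-40, ..., 91-100
classes = [(0, 20)] + [(lo, lo + 9) for lo in range(21, 92, 10)]


def class_of(mark):
    """Return the (lower, upper) class limits of the class containing `mark`."""
    for lo, hi in classes:
        if lo <= mark <= hi:
            return (lo, hi)
    raise ValueError(f"mark {mark} lies outside the range 0-100")


def label(cls):
    return f"{cls[0]}-{cls[1]}"


# One count per (English class, Maths class) cell of the two-way table
counts = Counter((class_of(eng), class_of(maths)) for eng, maths in marks)
row_totals = [sum(counts[(row, col)] for col in classes) for row in classes]
col_totals = [sum(counts[(row, col)] for row in classes) for col in classes]

print("Eng\\Maths".ljust(10) + "".join(label(c).rjust(7) for c in classes)
      + "Total".rjust(7))
for row, total in zip(classes, row_totals):
    cells = "".join(str(counts[(row, col)] or "").rjust(7) for col in classes)
    print(label(row).ljust(10) + cells + str(total).rjust(7))
print("Total".ljust(10) + "".join(str(t).rjust(7) for t in col_totals)
      + str(sum(col_totals)).rjust(7))
```

The row and column totals produced in this way agree with those of Table 3.6.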

General Tabulation

Example 6

(a) According to the 1972 Census data published by the Central Statistical Office, out of a
total of 246,000 males aged 15 and over, 169,000 were employed and 35,000 were
unemployed. The remainder were inactive (i.e. were either retired, rentiers, homemakers,
students, disabled or voluntarily idle). According to the same data, out of a total of 249,000
females aged 15 and over, 44,000 were in employment, 7,000 were unemployed and the rest
inactive.

The Central Statistical Office estimated that in 1986, there were 238,000 employed males and
106,000 employed females. The numbers of unemployed males and females were 37,000 and
18,000 respectively. The total number of males aged 15 and over was estimated at 339,000.
The corresponding number of females was estimated at 343,000.

(Note : The data have been rounded to the nearest thousand).

- Tabulate the above information, including in your table any secondary statistics
you consider useful for the interpretation of the data.

- Comment on the data, especially in relation to what they reflect on the role of
women. What are the main social and economic implications?

Answer to Example 6

Table 3.7
Population aged 15 and over by activity status and sex,
Mauritius, 1972 - 1986

Year and Sex               1972                          1986
                   Male          Female          Male          Female
Activity        Number      %  Number      %  Number      %  Number      %
Status          ('000)         ('000)         ('000)         ('000)
Employed           169   68.7      44   17.7     238   70.2     106   30.9
Unemployed          35   14.2       7    2.8      37   10.9      18    5.2
Total Active       204   82.9      51   20.5     275   81.1     124   36.2
Inactive            42   17.1     198   79.5      64   18.9     219   63.8
Total              246  100.0     249  100.0     339  100.0     343  100.0

Source : Central Statistical Office. Census figures (74) Estimates (86)

The table reflects the considerable changes that have taken place between 1972 and
1986, in particular the large number of jobs created and the increased demand for female
employment. The reduction in male unemployment probably implies a reduction in the
social evils associated with unemployment : crime, violence, drug abuse, alcoholism, suicides
etc. The greater participation of women in economic activity implies a changing role for
women, showing a movement away from the traditional idea of home as the proper place for
women. The greater employment of women also probably means increased prosperity for
households but may be accompanied by difficulty in reconciling domestic and occupational
responsibilities with the attendant consequences: strained relationships between spouses,
neglect of children, etc. (The increased female unemployment is due not to low job creation
but rather to the increased demand for jobs among women).
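The secondary statistics in Table 3.7 (the sub-total of active persons, the number inactive and the percentages) all follow from the figures quoted in Example 6. The Python sketch below shows one way of deriving them; it is an illustration only, and the function name activity_table and the layout of the data are our own choices.

```python
# Figures in thousands, as quoted in Example 6
data = {
    (1972, "Male"): {"Employed": 169, "Unemployed": 35, "Total": 246},
    (1972, "Female"): {"Employed": 44, "Unemployed": 7, "Total": 249},
    (1986, "Male"): {"Employed": 238, "Unemployed": 37, "Total": 339},
    (1986, "Female"): {"Employed": 106, "Unemployed": 18, "Total": 343},
}


def activity_table(group):
    """Return {status: (number, percentage of total)} for one year/sex group."""
    active = group["Employed"] + group["Unemployed"]
    numbers = {
        "Employed": group["Employed"],
        "Unemployed": group["Unemployed"],
        "Total Active": active,
        "Inactive": group["Total"] - active,  # the remainder
        "Total": group["Total"],
    }
    return {status: (n, round(100 * n / group["Total"], 1))
            for status, n in numbers.items()}


tables = {key: activity_table(group) for key, group in data.items()}

for (year, sex), table in tables.items():
    print(year, sex)
    for status, (number, pct) in table.items():
        print(f"  {status:<13}{number:>5}{pct:>7.1f}")
```

Expressing each figure as a percentage of the total for its own year and sex is what makes the 1972 and 1986 columns directly comparable, since the totals themselves differ.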

Activity 2

(a) In a recent survey, 7381 children were studied, of whom 219 attended private schools.
78% were the children of manual workers but only 40 of these children attended
private schools.

1 out of every 9 children was an only child (enfant unique); among
private school attenders, the proportion of children from families with only one child was
20.1%, of whom 7 were the children of manual workers. Of the families with only
one child, 567 came from the manual class.

Arrange these figures in a table, calculating any secondary statistics you consider
necessary and comment on the results.

(b) Attempt Questions 1.5, 1.6, 1.17 from textbook (OJ)

3.3.3 The Stem and Leaf Diagram

Read p 14 of textbook (OJ).

Activity 3 Attempt Questions 1.15 and 1.16 from textbook (OJ).

3.3.4 The Time Series

Read pp 14-15 of textbook (OJ).

3.4 SECONDARY STATISTICS

Secondary statistics are those simple calculations which are performed using given data, to
help us in our interpretation. Some examples of secondary statistics are sub-totals, totals,
rates, ratio and percentage.

Ratio

A ratio is a relationship between two quantities expressed in a number of units to enable
comparison.

Example 7

Three-quarters of the annual output of a factory consists of product A and one-quarter of
product B. The ratio of the output is then 3:1. For every 3 units of A produced in a year, 1
unit of B is produced.

Percentage

"Percentage" (or percent) means per hundred. Therefore 50 per cent is 50 out of a hundred,
that is, one half. The symbol for percentage is %. To convert a fraction to a
percentage, multiply by 100: for example, 1/4 equals 25% (25 = 1/4 x 100).

3.5 INTERPRETATION OF TABLES

When data are presented, it is important that tables provide information clearly and at the
same time make an impact. Interpretation is a matter of judgement based on knowledge of the
terms used in the table. It is not enough that a figure or the result of a calculation is accurate;
the result also has to be understood. There is little point in arriving at a correct answer to a
calculation if it is not known what it means.

3.6 SUMMARY

In this unit, you have learnt about presentation of data using the different types of tables
namely frequency tables, two way tables and general tabulation.

UNIT 4 ORGANISATION AND PRESENTATION OF DATA II

Unit Structure

4.0 Overview
4.2 Organisation and Presentation of Data II
4.2.1 Introduction
4.2.2 The Bar Chart
4.2.3 The Pie Chart
4.2.4 The Histogram
4.3 Summary

4.0 OVERVIEW

This unit introduces you to the methods of organising and presenting data, using various
charts and diagrams. Part of Chapter 2 of your textbook pp. 28-39 (OJ) covers the relevant
topics.


When you have successfully completed this Unit, you will be able to construct, interpret and
use the following:

1. the Bar chart.
2. the Pie chart.
3. the Histogram and Frequency Polygon.

4.2 ORGANISATION AND PRESENTATION OF DATA II

4.2.1 Introduction

Study pp 28-29 of your textbook (OJ).

There are some guidelines which are important for the construction of various charts,
diagrams and graphs, in the same way as we discussed for the construction of tables in
Section 3.3.2.1 of Unit 3.

Some of these guidelines are common:

Be sure what you want your chart or diagram or graph to show.
All charts, diagrams or graphs must have a title which is, as far as possible, self-
explanatory.
The source of the data must always be included (usually below the
chart/diagram/graph).
Units of measurement must be shown clearly.
Axes should be labelled clearly and scales must be made convenient, explicit and
clear.
Where appropriate, a key must be given so as to explain clearly what each shading etc.
represents.
Charts, diagrams or graphs must be neat and tidy.

4.2.2 The Bar Chart


Your textbook covers adequately the discussion on the bar chart; however, certain points need
to be added with regard to various charts developed from the idea of a bar chart.

It is desirable that the compound or component bar chart does not contain too many
components, or else the impact on the reader may be blurred. Whenever there is a need to
compare two data sets using component bar charts, it is advisable to use percentages rather
than actual numbers: percentages make comparison easier, especially when charts or
diagrams are used. Think why!

The example given in Fig. 2.5 on p 32 of your textbook is an example of what is commonly
known as a multiple bar chart. Multiple bar charts are very useful when different
characteristics (e.g. % of labour force employed in agriculture, agrarian output as % of GNP)
of various units of interest (e.g. countries) need to be presented simultaneously. It is
however desirable that not too many characteristics are included in the diagram; the chart
might otherwise contain too much information and become rather confusing.

Sometimes, bar charts or component bar charts are drawn with the bars horizontal; in some
cases, the variable on the horizontal axis is time. Such an adaptation of the bar chart is known as
the Gantt chart. It is used especially when planning a project over time and
monitoring the implementation of the project with regard to the assigned time schedule.

Activity 1 Attempt Questions 2.8 and 2.9 of your textbook (OJ).

4.2.3 The Pie Chart

Study pp 33-34 of your textbook (OJ).

Your textbook tends to be too sceptical about the pie chart. In fact, the main objective of the
pie chart is to show the relative importance of the component parts of a total. And the pie
chart does this extremely well, provided there are not too many components.
The pie chart is used widely to present statistical data to the general public as well as to
highlight any shift in the relative importance of the component parts of a total over time. In
the latter case, two pie charts can be drawn for data available at two different points in time.

Activity 2

The urban population, as enumerated at the 1972 and 1983 censuses, was as follows:

Table 4.1
URBAN POPULATION FOR ISLAND OF MAURITIUS

Municipal Council Area 1972 1983

Port-Louis 133,996 133,702
Beau-Bassin - Rose-Hill 80,318 90,577
Quatre-Bornes 50,770 63,682
Vacoas-Phoenix 47,638 53,090
Curepipe 51,956 62,200

TOTAL 364,678 403,251

Source: Annual Digest of Statistics, C.S.O., 1988

Represent the above information by means of pie-charts.

4.2.4 The Histogram


Note that a histogram can only be constructed for continuous variables; thus a given discrete
variable needs to be transformed into the appropriate continuous form before the histogram is
constructed.

Consider the following example of a simple frequency distribution in the discrete form:

Table 4.2

Number of faults Number of cars (frequency)

1 18
2 25
3 19
4 8
5 3
6 or more 0
Total 73

The variable number of faults is discrete and is first transformed into the continuous form as
follows:

Table 4.3

Number of faults

0.5 and under 1.5
1.5 and under 2.5
2.5 and under 3.5
3.5 and under 4.5
4.5 and under 5.5
5.5 and above

The histogram is then constructed with the first rectangle having its base between 0.5 and 1.5
inclusive. The second rectangle will have its base between 1.5 and 2.5 inclusive, etc. Thus
there is no gap between the rectangles. The rectangles must be contiguous i.e. touching each
other.

Similarly, a discrete grouped frequency distribution should first be transformed into its
continuous form. Thus the following discrete grouped frequency distribution from p 38 of
your textbook (OJ) can be transformed into the continuous form as follows:

Table 4.4

Discrete form Continuous form
Number of calls Number of calls

10 - 19 9.5 and under 19.5
20 - 29 19.5 and under 29.5
30 - 39 29.5 and under 39.5, etc.

Another important point needs to be highlighted in the construction of a histogram.
Occasionally, we come across frequency distributions whose class intervals differ widely
in width. Consider the following data relating to infant deaths in Table 4.5.

Table 4.5
Infant Deaths (Deaths of Children Under 1 Year of Age) by Age and Sex
Island of Mauritius, 1986-1988

                                  1986                 1987                 1988
Age                         Both   Male Female   Both   Male Female   Both   Male Female
                           Sexes                Sexes                Sexes

Under 1 day                   91     48     43     60     31     29     87     53     34
1 - 6 days                   191    118     73    183    111     72    117    117     57
7 - 27 days                   75     40     35     91     63     28     30     30     21
28 days - under 2 months      25     13     12     38     21     17     12     12     16
2 - 3 months                  35     28      7     35     19     16     17     17     20
4 - 5 months                  22     12     10     21     16      5     14     14     11
6 - 7 months                  11      4      7     13      6      7      8      8      5
8 - 9 months                  16      8      8     12      8      4      9      9      7
10 - 11 months                14      8      6     10      8      2      6      6      4

Under 1 year                 480    279    201    463    283    180    266    266    175
Under 7 days                 282    166    116    243    142    101    170    170     91
Under 28 days                357    206    151    334    205    129    200    200    112
28 days - under 1 year       123     73     50    129     78     51    129     66     63

Source: Central Statistical Office, Annual Digest of Statistics 1989.

In such cases, we first compute the frequency density which is defined as follows:

frequency density = frequency ÷ class width.

Then the frequency density is used on the vertical axis, and the variable of interest is used on
the horizontal axis as usual.

Table 4.6

Age              Number of deaths for both     Frequency density
                 sexes, 1986 (frequency)

Under 1 day               91                   91
1 - 6 days               191                   191 ÷ 6 = 31.8
7 - 27 days               75                   75 ÷ 21 = 3.6
etc.                     etc.                  etc.

Thus the frequency density gives the number of deaths per unit time (i.e. per day) and renders
all frequencies comparable. The fundamental principle underlying the histogram is that what
matters is the area of each rectangle, not its height. The examples considered in
your textbook are merely specific applications of this fundamental principle. In these
examples all class intervals have the same width except two or three. Can you see
the link?
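
The frequency density calculation of Table 4.6 can be sketched in a few lines. This is an illustration only; the function name `frequency_density` is ours, and the class widths (1, 6 and 21 days) are those used in Table 4.6.

```python
# (class label, class width in days, number of deaths) as in Table 4.6.
classes = [
    ("Under 1 day", 1, 91),
    ("1 - 6 days", 6, 191),
    ("7 - 27 days", 21, 75),
]


def frequency_density(frequency, class_width):
    """Frequency per unit of class width (here, deaths per day)."""
    return frequency / class_width


for label, width, deaths in classes:
    density = frequency_density(deaths, width)
    # The rectangle's area (width x height) recovers the class frequency.
    print(f"{label:<12} width={width:>2}  density={density:5.1f}  area={width * density:.0f}")
```

The last column shows why the area, not the height, carries the information: multiplying each density by its class width gives back the original frequency.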

The histogram and the frequency polygon give us a view of the shape of a given frequency
distribution. In particular, they help to

(i) identify to what extent a particular distribution is asymmetrical, and

(ii) compare two distributions.

For the latter case, we may use, for example, two histograms (using the same scales) to
highlight the change in the age structure of the population of Mauritius which occurred
between 1972 and 1990 (at which times a population census was carried out).

Activity 3

(i) Attempt questions 2.7 and 2.14 of your textbook (OJ).

(ii) The age distributions of the population as enumerated at the censuses of 1972 and
1990 for Island of Mauritius are as follows:

Table 4.7
AGE DISTRIBUTION OF POPULATION FOR ISLAND OF MAURITIUS

Age Group 1972 1990
(years) (000s) (000s)
________________________________________________

9 and less        220.0    191.2
10 - 19           211.9    201.6
20 - 29           133.0    202.5
30 - 39            83.9    171.0
40 - 49            74.5    102.5
50 - 59            52.8     68.1
60 - 69            31.8  )
70 - 79            13.4  )  85.4
80 and above        3.8  )
________________________________________________
TOTAL 825.1 1,022.3
________________________________________________

Source: (a) Annual Digest of Statistics
(b) 1990 Census report, Volume II

(a) Illustrate, by means of histograms, the age distributions of the Island of Mauritius for
1972 and 1990.
Comment on your findings.

(b) Draw the respective frequency polygons.

4.3 SUMMARY

In this unit, you have learnt about the presentation of data by using some
charts/diagrams, namely the bar chart and its various adaptations, the pie chart and the
histogram.

UNIT 5 ORGANISATION AND PRESENTATION OF
DATA III

Unit Structure

5.0 Overview
5.2 Organisation and Presentation of Data III
5.2.1 The Ogive Curve
5.2.2 Plotting the Time Series
5.2.3 Logarithmic Graphs
5.2.4 The Lorenz Curve
5.2.5 The Z-Chart
5.2.6 The Scatter Diagram
5.2.7 Some Examples of Bad Practice
5.3 Summary

5.0 OVERVIEW

This unit further introduces you to some graphical methods of presenting data and to their
interpretation.


When you have successfully completed this unit, you should be able to do the following:

1. Construct, interpret and use
a. ogive curves.
b. time series graphs.
c. logarithmic graphs.
d. Lorenz curves.
e. Z-charts.
f. scatter diagrams.
2. Identify some examples of bad practice in displaying data.

5.2 ORGANISATION AND PRESENTATION OF DATA III

5.2.1 The Ogive Curve

Study pp 39-41 of your textbook (OJ)

A minor point that has been overlooked in the textbook (example on the distribution of 100
metal pipes on p 39 of OJ) is what has actually been plotted on the x-axis of the ogive curve.
Though it may be clear to some of you, for the sake of completeness, we mention that we
usually plot the cumulative frequencies against the corresponding upper class boundaries (that
is why it is sometimes called the "less than" ogive curve). There is also a
"more than" ogive curve, but for our discussion we shall refer to the one in your textbook.
Ogive curves in general can be used to compute various secondary statistics of importance,
e.g. the median, quartiles, etc. (you will learn about these later).

You should also note that the point (10,0) has been plotted; well, this is so, as no pipes are
under 10 cm and consequently the c.f. is also zero.

It might be more useful to plot the cumulative % frequency rather than the cumulative
frequency. The cumulative % frequency is obtained by converting the c.f. into percentages.
Doing so enables comparison of various frequency distributions (try to think why!), and
secondary statistics such as the median, quartiles and percentiles can be
read off directly from the graph.

An additional point to bear in mind concerning the ogive curve arises when we deal
with a discrete grouped frequency distribution: it should first be transformed into its
continuous form before proceeding to construct the curve, e.g. (refer to Table 4.4, page 32 of
manual):

Discrete form Continuous form
Number of calls Number of calls

10 - 19 9.5 and under 19.5
20 - 29 19.5 and under 29.5
30 - 39 29.5 and under 39.5, etc...

Solved Example

On Time Ltd has a unit of 130 workers all performing exactly the same task: the
assembly of watches. The output of the workers for the first week of October 1997
was recorded and is reproduced below:

Watches Assembled     Number of workers

up to 449                     3
450 - 469                     7
470 - 489                    18
490 - 509                    25
510 - 529                    36
530 - 549                    27
550 - 569                    10
570 - 589                     4

(a) Construct a proper presentation table from the data including percentages, cumulative
frequencies and cumulative percentages.

(b) Construct an ogive from the data.

(c) Using your ogive curve, estimate
(i) the number of workers producing less than 500 watches
(ii) the value of x, if 40% of the workers produced x watches or more.

SOLUTION :-

(a) Output of workers at On Time Ltd, First week of October 1997

Watches        Number of    Percentage    Cumulative    Cumulative
Assembled      workers                    frequency     Percentage

up to 449          3            2.3            3            2.3
450 - 469          7            5.4           10            7.7
470 - 489         18           13.8           28           21.5
490 - 509         25           19.2           53           40.8
510 - 529         36           27.7           89           68.5
530 - 549         27           20.8          116           89.2
550 - 569         10            7.7          126           96.9
570 - 589          4            3.1          130          100.0

Total            130          100.0

Source : Company Records, On Time Ltd

(b) To construct the ogive curve, we first convert the discrete grouped frequency
distribution into its continuous form (see section 3.3.2.2, Ex 3).

Watches Assembled     c.f.

up to 449.5             3
449.5 - 469.5          10
469.5 - 489.5          28
489.5 - 509.5          53
509.5 - 529.5          89
529.5 - 549.5         116
549.5 - 569.5         126
569.5 - 589.5         130

The ogive curve is shown below:

[Figure: Ogive Curve for On Time Ltd. Horizontal axis: Watches Assembled (440 to 600); vertical axis: No. of Workers (0 to 140).]

(c) (i) Around 42 (using the ogive curve as above)
(ii) Please try it yourself.
(Ans: x ≈ 524)
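
Readings taken from an ogive can be cross-checked by linear interpolation between the plotted points. The sketch below is an approximation under stated assumptions: it joins the points by straight lines rather than a smooth freehand curve, and it assumes a starting point of 429.5 with c.f. 0 (the open-ended first class is given the same width as the others). The helper name `interpolate` is ours.

```python
def interpolate(x, xs, ys):
    """Piecewise-linear interpolation of y at x, for increasing xs."""
    points = list(zip(xs, ys))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the plotted range")


# Upper class boundaries and cumulative frequencies for On Time Ltd.
# The first point (429.5, 0) is an assumed closure of the open-ended class.
upper_bounds = [429.5, 449.5, 469.5, 489.5, 509.5, 529.5, 549.5, 569.5, 589.5]
cum_freq = [0, 3, 10, 28, 53, 89, 116, 126, 130]

# (c)(i): number of workers producing fewer than 500 watches.
workers_below_500 = interpolate(500, upper_bounds, cum_freq)

# (c)(ii): 40% produce x or more, so 60% of the 130 workers produce fewer than x.
target_cf = 0.60 * cum_freq[-1]
x_value = interpolate(target_cf, cum_freq, upper_bounds)

print(f"workers below 500: about {workers_below_500:.1f}")
print(f"x: about {x_value:.1f}")
```

Straight-line interpolation gives values close to, but not identical with, those read from a hand-drawn smooth curve, which is why the answers quoted above are given as approximate.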

Activity 1

The following is a record of marks scored by candidates in an examination

Table 5.1

77 59 84 73 51 43 50 81 61 53 69
37 58 63 67 61 90 61 50 60 84 56
77 57 42 43 41 49 37 21 24 35 34
50 11 52 30 16 33 67 87 64 47 59
37 92 88 30 38 22 22 49 46 50 64
23 73 73 48 26 36 51 85 71 57 45

(a) Tabulate the marks in the form of a frequency distribution, grouping by suitable class
intervals.

(b) Construct an ogive curve for the data.

5.2.2 Plotting the Time Series

Read pp 41 - 45 of your textbook (OJ)

Activity 2

A large food store is open six days a week. Its sales, in thousands of kilograms, during a five
week period are as follows:

Table 5.2

Week Monday Tuesday Wednesday Thursday Friday Saturday
1 45.8 47.4 49.8 49.9 53.5 53.6
2 45.4 48.5 49.9 49.4 50.9 52.1
3 44.2 45.4 48.5 46.2 49.3 49.7
4 41.4 45.9 46.7 46.0 51.3 48.4
5 43.7 45.5 46.0 45.1 50.1 48.4

(a) Plot the values as a time series.
(b) Comment on your graph.

5.2.3 Logarithmic Graphs

Read pp 56 - 63 of your textbook (OJ).

5.2.4 The Lorenz Curve


Read pp 63- 65 of your textbook (OJ)

You have learnt how the Lorenz curve is useful to illustrate the inequality prevailing in the
distribution of income. In that case, equality is perceived as follows: 10% of households earn
10% of the income, 50% of households earn 50% of total income or, more generally, x % of
households earn x % of total income.

In a similar manner, the Lorenz curve is used to illustrate the disparity which exists in the
distribution of a certain variable in relation to the distribution of another variable. Intuitively,
the further the Lorenz curve is from the line of equality, the greater the disparity or
inequality.

We shall now introduce Gini's coefficient, which is an index to measure the disparity of
income.

Consider the following diagram, which illustrates the Lorenz Curve for income distribution:

[Figure 5.2: Lorenz Curve for Income Distribution. Labels: Line of Equal Distribution; points O, A, B, C, D; areas X and Y.]

Gini's Coefficient is defined as

\[ \text{Gini's coefficient} = \frac{\text{Area BDOB}}{\text{Area BOC}} = \frac{X}{X + Y} \qquad \text{(Equation 1)} \]

As the curve tends towards the line of equal distribution, $X \to 0$, so that in turn
Gini's coefficient $\to 0$ from (Equation 1) above.

Thus the greater the value of Gini's coefficient, the greater the disparity of income distribution.

Hence the Gini Coefficient can be perceived as an Index of Inequality as it measures the
degree of departure from the line of equality.
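
To make Equation 1 concrete, the areas can be approximated numerically from points on a Lorenz curve. The sketch below uses the trapezium rule and purely hypothetical data (the cumulative shares are invented for illustration, not taken from the textbook). With both axes scaled from 0 to 1, the area under the line of equal distribution is X + Y = 0.5.

```python
def gini_coefficient(cum_households, cum_income):
    """Approximate X / (X + Y) from Lorenz-curve points using the trapezium rule.

    Both lists are cumulative proportions running from 0 to 1.
    """
    area_under_curve = 0.0  # this is area Y
    points = list(zip(cum_households, cum_income))
    for (h0, i0), (h1, i1) in zip(points, points[1:]):
        area_under_curve += (h1 - h0) * (i0 + i1) / 2
    area_triangle = 0.5  # X + Y, the area under the line of equal distribution
    return (area_triangle - area_under_curve) / area_triangle


# Hypothetical example: cumulative share of households and of income.
households = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
income = [0.0, 0.05, 0.15, 0.30, 0.55, 1.0]

print(f"Gini's coefficient: {gini_coefficient(households, income):.2f}")

# Perfect equality: the curve coincides with the line of equal distribution.
print(f"Equal distribution: {gini_coefficient(households, households):.2f}")
```

When the curve coincides with the line of equal distribution the computed value is zero, in agreement with the limiting argument above.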

Activity 3

Attempt questions 3.8 and 3.9 from (OJ).

5.2.5 The Z - chart

Read pp 66-67 of your textbook (OJ).

As you have noted on p 66 in your textbook, the Z-chart consists of three curves on the same
axes as shown in Figure 3.8 on p 67 . Usually, the chart covers a period of one year, by
months.

One curve shows the monthly figures, another shows the cumulative figures from the
beginning of the year, while the third shows the total for the twelve months ending with each
month. This last curve is generally called the moving annual total curve; more specifically, it
is a 12-month moving total for a period of twelve months ending with each designated month.

The concept of a moving annual total is important : it tends to smooth out fluctuations to
some extent. Note that previous year figures are used exclusively for computing the moving
annual total.

To illustrate the computation of the Moving Annual Total, we shall refer to the solved
example on p 66 in your textbook which represents the output of ABC limited. Let us see
how the January moving annual total is obtained:

- As stated in the textbook, the January figure is the total sales achieved during the period
1st February last to 31st January this year, i.e.,

\[ \text{January figure} = \underbrace{8 + 9 + 13 + \cdots + 11}_{\text{previous yr. figures from Feb. to Dec.}} + \underbrace{11}_{\text{current yr. figure for Jan.}} \]

Similarly,

\[ \text{February figure} = \underbrace{9 + 13 + \cdots + 11}_{\text{previous yr. figures from Mar. to Dec.}} + \underbrace{11 + 14}_{\text{current yr. figures for Jan. \& Feb.}} \]

You can try for yourself and get the figures for the other months. One last point before we
leave this topic, is that it is imperative that you understand and be able to interpret the three
arms of the Z -chart.

Note: There is a slight modification in Fig. 3.8 (p. 67, OJ). The monthly totals have been
wrongly plotted at the mid points of the respective months. In fact, they should be plotted at
the end of the respective months. Thus, both the cumulative total and monthly total for
January should coincide and subsequently, figures for the remaining months should be
plotted accordingly.


Activity 4

1. Attempt Question 3.11 from (OJ).

2. The table that follows refers to the monthly production of electricity (in
gigawatt-hours) by the Central Electricity Board for the years 1993 and 1994.

Table 5.3
Monthly production of electricity by the CEB for the years 1993 & 1994

1993 1994
January 69 80
February 68 58
March 74 83
April 73 81
May 73 79
June 70 77
July 69 76
August 70 79
September 69 77
October 74 83
November 78 83
December 82 90
TOTAL 869 946

Source: Digest of Industrial Statistics, CSO

Attempt the questions that follow:

(a) From the information given in Table 5.3, construct a Z-chart for the year 1994.

(b) Explain briefly the three components of the chart.
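
The three series of a Z-chart can be computed as sketched below from the figures of Table 5.3. This is only an illustrative outline of the arithmetic (the function name `z_chart_series` is ours); plotting and interpreting the chart remain part of the activity.

```python
# Monthly electricity production from Table 5.3, January to December.
production_1993 = [69, 68, 74, 73, 73, 70, 69, 70, 69, 74, 78, 82]
production_1994 = [80, 58, 83, 81, 79, 77, 76, 79, 77, 83, 83, 90]


def z_chart_series(previous_year, current_year):
    """Return (monthly, cumulative, moving annual total) for the current year."""
    cumulative, moving_total = [], []
    running = 0
    for month, value in enumerate(current_year):
        running += value
        cumulative.append(running)
        # Later months come from the previous year; the rest from this year.
        moving_total.append(sum(previous_year[month + 1:]) + running)
    return list(current_year), cumulative, moving_total


monthly, cumulative, moving_total = z_chart_series(production_1993, production_1994)

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
for name, m, c, t in zip(months, monthly, cumulative, moving_total):
    print(f"{name}  monthly={m:3d}  cumulative={c:4d}  moving annual total={t:4d}")
```

Note that in December the cumulative total and the moving annual total coincide, which is what closes the top of the "Z".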
5.2.6 The Scatter Diagram

Read pp 67-68 of your textbook (OJ).

Example 1

You are advised to attempt the example that follows. The example enables you to further
understand the usefulness of the scatter diagram for exploratory data analysis, i.e. as a
prelude to further statistical analysis later.

Efforts are being made to cultivate a variety of upland cotton in Bangladesh. Cotton yield is
known to be directly related to the time of planting. Previous work suggests that the optimum
time of planting is May, in the wet season and September in the dry season. However, the dry
season crop is more economical as it takes six months to mature as against one year for the
wet-season crop. A study was therefore undertaken to find out if late planting can be
economical, especially because heavy rains occasionally interfere with cotton planting in
September.

The variety D5 was planted at fortnightly intervals between September 1973 and January
1974. The yields of cotton obtained in terms of kilograms/hectare are given below: (1st
September taken as Day 1).

Table 5.4
Yield of cotton planted at fortnightly intervals
between Sep 1973 and Jan 1974

Date of Seeding Day Number Cotton yield

1st Sep 73 1 17.39
16 Sep 73 16 17.74
1 Oct 73 31 16.02
31 Oct 73 61 13.88
15 Nov 73 76 9.78
30 Nov 73 91 7.38
15 Dec 73 106 6.09
30 Dec 73 121 4.29
14 Jan 74 136 3.92

Source: X

(a) Plot a scatter diagram to illustrate the above figures.

(b) Comment on what the diagram reveals.

You can clearly note that a plot of yield against day number shows evidence of an inverse
relationship i.e. as the day number increases, yield decreases (the later the planting, the
smaller the yield).

So, you can see that though the scatter diagram is simple, it can reveal some interesting
results.
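
If no graph paper is at hand, a rough text-mode scatter diagram of Table 5.4 can be produced as sketched below. The function `text_scatter` is our own illustrative helper (its grid size is arbitrary), and the final lines are only a crude check of the direction of the relationship, not a formal analysis.

```python
days = [1, 16, 31, 61, 76, 91, 106, 121, 136]
yields = [17.39, 17.74, 16.02, 13.88, 9.78, 7.38, 6.09, 4.29, 3.92]


def text_scatter(xs, ys, width=46, height=10):
    """Return text rows with '*' marking each (x, y) point; y increases upwards."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y_max - y) / (y_max - y_min) * (height - 1))
        grid[row][col] = "*"
    return ["".join(line) for line in grid]


for line in text_scatter(days, yields):
    print("|" + line)          # vertical axis: cotton yield
print("+" + "-" * 46)          # horizontal axis: day number

# Crude check of direction: count rises and falls between successive plantings.
changes = [later - earlier for earlier, later in zip(yields, yields[1:])]
print("falls:", sum(c < 0 for c in changes), " rises:", sum(c > 0 for c in changes))
```

The points run from the top left to the bottom right of the grid, which is the visual signature of the inverse relationship described above.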

5.2.7 Some Examples of Bad Practice

Read pp 69-71 of your textbook (OJ).

5.3 SUMMARY

In this unit, you have learnt about the construction, interpretation and usage of various
graphical methods of presenting data.

UNIT 6 MEASURES OF CENTRAL TENDENCY

Unit Structure

6.0 Overview
6.2 The Σ-Notation
6.3 Measures of Central Tendency
6.3.1 The Arithmetic Mean
6.3.2 The Median
6.3.3 The Mode
6.3.4 The Harmonic Mean
6.3.5 The Geometric Mean
6.4 Summary

6.0 OVERVIEW

Often in real life, we are confronted with a mass of data. We are then interested to find a
single representative value that captures the order of magnitude of the whole. It seems
reasonable in some sense, that the tendency should be around some central value. This unit
will introduce you to different measures that will enable you to do so.


When you have successfully completed this Unit, you should be able to compute, interpret
and use the following:

1. the arithmetic mean.
2. the median.
3. the mode.
4. the harmonic mean.
5. the geometric mean.

6.2 THE Σ-NOTATION

As addition is a commonly used mathematical operation in Statistics, a special notation is
found useful to represent it, especially when we need to add many numbers.

Let x be a variable which takes values $x_1, x_2, x_3, \ldots, x_n$.
Then, the sum of the $x_i$'s starting with $x_1$, ending with $x_n$, and including all values in between
these limits is obviously given by $x_1 + x_2 + \cdots + x_n$.

The special notation, Σ (the Greek letter read as sigma), is used instead. Thus, we have

\[ \sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \cdots + x_n , \]

where the limits of the summation are incorporated in the notation itself.

Note: Sometimes the Σ sign appears without the limits of summation: the latter should be
taken as extending over all the values, from the first to the last one; i.e. in this case

\[ \sum x = \sum_{i=1}^{n} x_i \]

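SOME PROPERTIES OF THE Σ-NOTATION:

\[ \sum_{i=1}^{n} x_i = \sum_{i=1}^{m} x_i + \sum_{i=m+1}^{n} x_i \]

For example,

\[ \begin{aligned}
\sum_{i=1}^{6} x_i &= x_1 + x_2 + x_3 + x_4 + x_5 + x_6 \\
&= (x_1 + x_2) + (x_3 + \cdots + x_6) \\
&= \sum_{i=1}^{2} x_i + \sum_{i=3}^{6} x_i
\end{aligned} \]

i.e. in this case, m = 2.

\[ \sum_{i=1}^{n} k = n \cdot k \qquad \text{(k is a constant)} \]

\[ \sum_{i=1}^{n} (x_i \pm y_i) = \sum_{i=1}^{n} x_i \pm \sum_{i=1}^{n} y_i \]

\[ \sum_{i=1}^{n} k x_i = k \sum_{i=1}^{n} x_i \qquad \text{(k is a constant)} \]

CAUTION

(i) \[ \sum_{i=1}^{n} x_i y_i \neq \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right) \]

(ii) \[ \sum_{i=1}^{n} x_i^2 \neq \left( \sum_{i=1}^{n} x_i \right)^2 \]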
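
The properties and the two cautions above can be checked numerically. The sketch below uses small hypothetical values (not those of Activity 1, which you should work out by hand); all variable names are ours.

```python
# Hypothetical values chosen only to illustrate the properties.
x = [2, 4, 6, 8]
y = [1, 3, 5, 7]
k = 10
n = len(x)
m = 2

# Splitting the range of summation at m.
assert sum(x) == sum(x[:m]) + sum(x[m:])

# Summing a constant k over n terms gives n.k.
assert sum(k for _ in range(n)) == n * k

# The sum of sums (or differences) is the sum (or difference) of the sums.
assert sum(a + b for a, b in zip(x, y)) == sum(x) + sum(y)
assert sum(a - b for a, b in zip(x, y)) == sum(x) - sum(y)

# A constant factor can be taken outside the summation.
assert sum(k * a for a in x) == k * sum(x)

# CAUTION (i): the sum of products is not the product of the sums.
sum_of_products = sum(a * b for a, b in zip(x, y))
product_of_sums = sum(x) * sum(y)

# CAUTION (ii): the sum of squares is not the square of the sum.
sum_of_squares = sum(a * a for a in x)
square_of_sum = sum(x) ** 2

print("sum of products:", sum_of_products, " product of sums:", product_of_sums)
print("sum of squares:", sum_of_squares, "  square of sum:", square_of_sum)
```

A single numerical counter-example is enough to show that the two expressions in each caution are, in general, different quantities.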

Activity 1

Let x, y and f be variables taking the following values:

$x_i$: -5, -1, 0, 4, 7

$y_i$: 0, -2, -7, 8, 5

$f_i$: 0, 1, 3, 4, 7

Calculate

(i) $\sum_{i=1}^{5} x_i$, $\sum_{i=1}^{5} f_i$, $\sum_{i=1}^{5} y_i$

(ii) $\sum_{i=1}^{4} x_i$, $\sum_{i=2}^{4} x_i$, $\sum_{i=1}^{8} x_i$

(iii) $\sum_{i=1}^{5} (x_i + y_i)$, $\sum_{i=1}^{5} (x_i - y_i)$, $\sum_{i=1}^{5} x_i y_i$

(iv) $8 \sum_{i=1}^{5} x_i$, $4 \sum_{i=1}^{5} y_i$, $\sum_{i=1}^{5} 8 x_i$, $\sum_{i=1}^{5} 4 y_i$

(v) $\sum_{i=1}^{5} x_i^2$, $\sum_{i=1}^{5} y_i^2$, $\sum_{i=1}^{5} f_i x_i$, $\sum_{i=1}^{5} f_i y_i$

(vi) $\sum_{i=1}^{5} (x_i - y_i)^2$, $\sum_{i=1}^{5} (x_i + y_i)^2$

(vii) $\sum_{i=1}^{5} k$, where k is a constant

(viii) $\dfrac{\sum_{i=1}^{5} f_i x_i}{\sum_{i=1}^{5} f_i}$, $\dfrac{\sum_{i=1}^{5} f_i y_i}{\sum_{i=1}^{5} f_i}$

(ix) $\sum_{i=1}^{5} f_i x_i^2$, $\sum_{i=1}^{5} f_i y_i^2$, $\dfrac{\sum_{i=1}^{5} f_i x_i^2}{\sum_{i=1}^{5} f_i}$

(x) $\sum_{i=1}^{5} (x_i - k)^2$, where k is a constant

6.3 MEASURES OF CENTRAL TENDENCY


To have further insight why averages are called indicators of central tendency, consider the
following example: we rarely find an adult who is as tall as 7 feet or as short as 4 feet, the
height of most people ranging around a point located centrally between these extremes.
Because so many measurements cluster near the middle of the distribution, we say they have
a central tendency.

Since so large a portion of the group clusters near this central level, we think of that as
representing the typical characteristic for the group - the most typical point is what we
compute when we find an average.

6.3.1 The Arithmetic Mean

You are strongly advised to go to section 3.3.2.2 on class widths and class midpoints for
better understanding of the topic.

Read pp 94 - 103 of your textbook (OJ).

An important issue that needs to be highlighted is the case when we are dealing with open
ended class intervals. We take this up in the next section.

6.3.1.1 Open Ended Class Intervals

As clearly stated on p101 of your textbook (OJ), there are no hard and fast rules for dealing with
open-ended classes. There is obviously a degree of arbitrariness in the choice of the boundaries of the
open-ended classes. You should thus reflect on the specific situation you are dealing with and
decide on the limits using some common sense.

You may feel a bit perplexed by what was just said, so let us consider the following
example, which gives the age distribution of the management department of a large
private company.

Table 6.1

Age Frequency
(years)

Under 20 2
20-29 12
30-39 31
40-49 39
50-59 26
60 and above 10

Suppose you have to compute the mean of the distribution. Obviously, you have to make
some assumptions before carrying out the calculation, as we have open-ended classes, and
also at the same time justify the choice you make.

It would be ridiculous to have the first class interval as 10-19, as realistically we would not
expect managers of the company to be aged around 10! Thus it would appear reasonable to
take the first class interval, viz. "under 20", as, say, 18-19.

For the class interval "60 and above", 60 - 69 seems appropriate, taking into account that
the age of retirement is around 69 in the private sector.

You can now compute the mean by the methods you have just learnt (Direct method or using
Assumed Mean). Since you will probably be using a calculator to compute the mean, you
may not find it necessary to use the Assumed Mean, but the idea behind this method should
be kept in mind.
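
The effect of the choice of limits can be seen in the short sketch below, which computes the mean of Table 6.1 by the direct method. It rests on two assumptions that you should adjust to your own conventions: the open-ended classes are closed as 18-19 and 60-69 as suggested above, and each class midpoint is taken simply as the average of the stated limits. The function name `grouped_mean` is ours.

```python
def grouped_mean(classes):
    """Direct-method mean: sum of f.x divided by sum of f, x being the class midpoint."""
    total_f = sum(f for _, _, f in classes)
    total_fx = sum(f * (lower + upper) / 2 for lower, upper, f in classes)
    return total_fx / total_f


# (lower limit, upper limit, frequency) from Table 6.1.
# Assumption: "under 20" closed as 18-19 and "60 and above" as 60-69.
classes = [
    (18, 19, 2),
    (20, 29, 12),
    (30, 39, 31),
    (40, 49, 39),
    (50, 59, 26),
    (60, 69, 10),
]
print(f"mean age with 60-69 as last class: {grouped_mean(classes):.1f}")

# A different (equally arbitrary) closure of the last class, for comparison.
alternative = classes[:-1] + [(60, 79, 10)]
print(f"mean age with 60-79 as last class: {grouped_mean(alternative):.1f}")
```

Comparing the two results gives a feel for how much, or how little, the arbitrary closure of an open-ended class affects the mean.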

To conclude, p103 of your textbook mentions that "Any alternative to the Arithmetic Mean
cannot be used for advanced analysis, so their uses are descriptive rather than analytical."
This statement may be valid for distributions that are symmetrical, but may not be so for
skewed (non-symmetrical) distributions, where the median (which you shall learn about very soon)
is widely used for advanced statistical analysis.

Activity 2

(i) Show that

\[ \sum_{i=1}^{n} (x_i - \bar{x}) = 0 \]

where $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$ is the mean.

(ii) Show that

\[ \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n} \]

HINT FOR (ii) - Expand the term in the bracket on the L.H.S. of the equality and simply use
the various properties of the Σ-notation you learnt previously to simplify it.

Activity 3

Attempt Questions 5.1, 5.6, 5.9 in textbook (OJ).

6.3.2 The Median

Read pp 103 - 107 of your textbook (OJ)

Pay particular attention to pp 106-107 of OJ, where the limitations of the mean are addressed
and also the situations where the median would be a more appropriate measure of central
tendency than the mean.

We further illustrate the use of the ogive curve to estimate the median. Let us use the solved
example provided in subsection 5.2.1, p 37 of your course manual.

We reproduce the data and the ogive curve below:

Watches Assembled     c.f.

up to 449.5             3
449.5 - 469.5          10
469.5 - 489.5          28
489.5 - 509.5          53
509.5 - 529.5          89
529.5 - 549.5         116
549.5 - 569.5         126
569.5 - 589.5         130

The ogive curve is shown below:

[Figure: Ogive Curve for On Time Ltd. Horizontal axis: Watches Assembled (440 to 600); vertical axis: No. of Workers (0 to 140).]

We have n = 130

Rank of median = 130 ÷ 2 = 65

Using the ogive curve above, we have
Estimate of median = 516

Example 1

We illustrate below, using a further example, the computation of the median.

The data below gives the time taken by 200 female students to solve a problem.

Table 6.2

Time Frequency
(nearest second)

118 - 126 15
127 - 135 25
136 - 144 45
145 - 153 60
154 - 162 25
163 - 171 20
172 - 180 10

You are required to calculate (a) the mean and (b) the median of the time taken to solve the problem.

(a) You can calculate the Mean (Ans: 147.0).
(b) Calculation of Median

We first construct the cumulative frequency column:


Table 6.3

Time
(nearest second)     f     c.f.

118 - 126           15      15
127 - 135           25      40
136 - 144           45      85
145 - 153           60     145
154 - 162           25     170
163 - 171           20     190
172 - 180           10     200

Total              200

For the calculation of the median, the variable under study is assumed to be continuously
distributed; thus in this case, the time variable is redefined as follows (as per section 3.3.2.2,
Example 1):

117.5 and under 126.5
126.5 and under 135.5
135.5 and under 144.5
144.5 and under 153.5
Etc., etc.,

The next step is to identify the median class; i.e. the class which contains the median value.
The rank of the median value is given by 200 ÷ 2 = 100 (i.e. total frequency divided by 2). From
the above table and the cumulative frequency column in particular, we notice that the fourth
class interval, 144.5 and under 153.5, contains the 100th value, i.e. the median value. Hence, it
is the median class.

The problem now is to locate the median value, i.e. the 100th value for the whole data set,
within the median class. Make sure this is clear to you; what follows will then be simple and
straightforward!


Thus, the median value = 144.5 + something. What is that something?

We note that

a) In the median class, 144.5 and under 153.5, there are 60 values and the class width = 9
units.
b) The cumulative frequency for the first three class intervals is 85.
c) The rank of the median value for the whole data set is 100, so that the rank of the
median value within the median class is 100 - 85 = 15. Thus we need to locate the
15th value within the median class out of the 60 values it contains.

To do that, we make an assumption: the 60 values within the median class are evenly
spread out, this assumption being sensible when the data set is large. Then, using simple direct
proportion, we can locate the 15th value within the median class, as follows:

If 60 values are spread over 9 units

Therefore 1 value is spread over 9/60 unit

And 15 values are spread over (9/60) x 15 units = 2.25 units

Hence the median value = 144.5 + 2.25
= 146.75 = 146.8 (correct to one decimal place)

It should be clear to you that this answer corresponds to the working out of the formula given
in your textbook (p 105 of OJ) viz.

\[ \text{Median} = \text{LCB} + \text{class interval} \times \left( \frac{\dfrac{n+1}{2} - \text{c.f. to median group}}{\text{frequency of median class}} \right) \]

Note that since n is large,

\[ \frac{n+1}{2} \approx \frac{n}{2} \]

so that we have

\[ \text{median} = 144.5 + \frac{9}{60} \, [100 - 85] = 146.8 \quad \text{(same as above)} \]

You may also try estimating the median by first constructing the ogive curve. As you notice,
the mean and the median nearly coincide. What does this indicate?
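
The interpolation steps above can be sketched as a short routine and applied to Table 6.2. This is an outline under the same assumptions as the worked example (values evenly spread within the median class, and n/2 used as the rank because n is large); the function name `grouped_median` is ours.

```python
def grouped_median(lower_boundaries, class_width, frequencies):
    """Median of a grouped distribution by linear interpolation in the median class."""
    n = sum(frequencies)
    rank = n / 2  # n/2 is used in place of (n + 1)/2, as n is large
    cumulative = 0
    for lower, f in zip(lower_boundaries, frequencies):
        if f > 0 and cumulative + f >= rank:
            # Move into the class in proportion to the rank within it.
            return lower + class_width * (rank - cumulative) / f
        cumulative += f
    raise ValueError("the distribution has no observations")


# Table 6.2 in continuous form: lower class boundaries, common width of 9.
lower_boundaries = [117.5, 126.5, 135.5, 144.5, 153.5, 162.5, 171.5]
frequencies = [15, 25, 45, 60, 25, 20, 10]
midpoints = [boundary + 4.5 for boundary in lower_boundaries]

median = grouped_median(lower_boundaries, 9, frequencies)
mean = sum(f * x for f, x in zip(frequencies, midpoints)) / sum(frequencies)

print(f"median = {median:.2f}")
print(f"mean   = {mean:.3f}")
```

The same routine applied to the On Time Ltd data offers a check on the median read from the ogive curve.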

Activity 4

Attempt Questions 5.14, 5.22 (a) and 5.23 in textbook (OJ).

6.3.3 The Mode


An important condition whilst computing the mode using the histogram, is that the modal
class and the two classes adjacent to it should necessarily be of equal width.

Activity 5

Attempt question 5.24 in textbook (OJ).

6.3.4 The Harmonic Mean


6.3.5 The Geometric Mean


Activity 6

Attempt Question 5.27 in textbook (OJ).

6.4 SUMMARY

In this unit, you have learnt about the different measures of central tendency, viz., the
arithmetic mean, the median, the mode and the harmonic and geometric means. These
numerical descriptive measures enable us to create a mental image and summarise data sets
we usually encounter in practice.

UNIT 7 MEASURES OF DISPERSION

Unit Structure

7.0 Overview
7.2 Measures of Dispersion
7.2.1 Introduction
7.2.2 Measures of Range
7.2.3 Measures of Average Deviation
7.2.3.1 Mean Deviation
7.2.3.2 Standard Deviation
7.2.4 Coefficient of Variation/Measure of Relative Dispersion
7.3 Measures of Skewness
7.3.1 General Considerations
7.3.2 Coefficients of Skewness
7.4 Summary

7.0 OVERVIEW

Variation is an important consideration in life. In this unit, you study its importance as well as
various ways of measuring its magnitude and understanding its nature.


When you have successfully completed this unit, you should be able to do the following:

1. Explain the importance of studying variation: both its magnitude and its nature.

2. Compute, interpret, and use

a. the range.
b. the interquartile range.
c. the quartile deviation.
d. the mean deviation.
e. the standard deviation.
f. the coefficient of variation.
g. the quartile coefficient of variation.
h. Pearson's coefficient of skewness.
i. Bowley's coefficient of skewness.

7.2 MEASURES OF DISPERSION

7.2.1 Introduction

Study the top of p 181 of (OJ).

This section draws attention to the inadequacy of measures of central tendency as summary
measures for data, and highlights the need for measures of dispersion as well. Measures of
central tendency try to capture a sense of the order of magnitude of the data, whereas
measures of dispersion attempt to capture a sense of the variability in the data. Homogeneity
(small variation) and heterogeneity (wide variation) are important considerations in many
situations encountered in real life as they have implications for decisions and action.

The various measures of the magnitude of dispersion are presented in your book in a rational
progression, from the crudest to the most refined. Therefore, in studying the various
measures, pay attention to how each successive measure presented improves on the previous
one.

Activity 1

Ponder over the following:

(i) What would be the relevance of examining the variation in the marks scored by
students from a given class at an examination? What would be the relevance of
comparing such variation with that of marks for students from another class taking the
same exam? What would be the potential implications of your findings for decision
and action?

(ii) What would be the relevance of examining the variation in the output (measured by
number of items produced) of workers of a factory, given that all workers are
manufacturing the same item? What would be the relevance of comparing such
variation with that of output of workers of another branch producing the same item?
What would be the potential implications of your findings for decision and action?

7.2.2 Measures of Range

Study pp 180-183 of textbook (OJ).

Ensure that you can define the range, the upper and lower quartiles, the interquartile range,
the quartile deviation.

Note that the quartiles can be obtained either by a graphical method from the cumulative
frequency curve (as illustrated on p53) or by calculation using the same kind of reasoning as
for determining the median value. The latter happens to be the second quartile, $Q_2$.

Ensure that you can compute the range and the quartile deviation (a fuller name for the latter
is semi-interquartile deviation, for obvious reasons)
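
The calculation route for the quartiles can be sketched by generalising the median interpolation to any proportion p. The example below uses the data of Table 6.2 (Unit 6) and assumes, as for the median, that values are evenly spread within each class; the function name `grouped_quantile` is ours.

```python
def grouped_quantile(p, lower_boundaries, class_width, frequencies):
    """Value below which a proportion p of the observations lies."""
    rank = p * sum(frequencies)
    cumulative = 0
    for lower, f in zip(lower_boundaries, frequencies):
        if f > 0 and cumulative + f >= rank:
            # Same direct-proportion argument as for the median.
            return lower + class_width * (rank - cumulative) / f
        cumulative += f
    raise ValueError("p must lie between 0 and 1")


# Table 6.2 in continuous form: lower class boundaries, common width of 9.
lower_boundaries = [117.5, 126.5, 135.5, 144.5, 153.5, 162.5, 171.5]
frequencies = [15, 25, 45, 60, 25, 20, 10]

q1 = grouped_quantile(0.25, lower_boundaries, 9, frequencies)
q3 = grouped_quantile(0.75, lower_boundaries, 9, frequencies)
interquartile_range = q3 - q1
quartile_deviation = interquartile_range / 2

print(f"Q1 = {q1:.1f}, Q3 = {q3:.1f}")
print(f"interquartile range = {interquartile_range:.1f}")
print(f"quartile deviation  = {quartile_deviation:.1f}")
```

Calling the same function with p = 0.5 returns the median, which shows that the median is indeed the second quartile.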

Pay attention to the strengths and weaknesses of the range, the interquartile range and the
quartile deviation as measures of dispersion. One of the weaknesses of all three is that they
are all based on only two values. We sometimes say, because of this, that they are not
comprehensive measures as not all observations have been taken into consideration in their
computation. From this point of view, their representativeness is questionable.

Activity 2

1. Answer the following : In what way does the quartile deviation improve on the range
as a measure of dispersion?


2. Attempt parts (a), (d), (e) and (f) of Question 9.2 in (OJ).

3. Attempt Question 9.3 in (OJ).

7.2.3 Measures of Average Deviation

7.2.3.1 Mean Deviation

Study pp 184-185 of OJ. Note that where the Σ sign appears without the limits of summation,
the latter should be taken as extending over all the values, from the first to the last one.

Activity 3

Attempt part (g) of Question 9.2 in textbook (OJ).

7.2.3.2 Standard Deviation

Study pp 186-189 of textbook (OJ).

The underlying idea for this measure of dispersion is the same as for the mean deviation, viz.
the averaging of deviations from the mean. However, the problem that arises from the fact
that the sum of such deviations is zero is overcome in a different
way: not by ignoring the signs of the deviations but by squaring them before averaging.
However, this process inflates the order of magnitude and, to offset this, we then take the
square root.

The term "standard deviation" has become the established appellation because of its
simplicity, but a more explicit name for the standard deviation would be the "root mean squared
deviation". The logic underlying the calculation of the standard deviation is well brought out
in the form of the formula given at the bottom of p186 of OJ. However, as pointed out in OJ,
this form of the formula is computationally more difficult than the alternative, but equivalent,
formula given at the top of p 187. Note that both these formulae apply to ungrouped data (i.e.
data where the individual values of the variable are given).

For grouped data, the appropriate formula is

$$\sqrt{\frac{\sum f (x - \bar{x})^2}{\sum f}}$$

where $x$ is the centre point of the class, $\bar{x}$ the overall mean and $f$ is the frequency. Try to
figure out the underlying logic of this formula.

Again there is an alternative formula which is computationally easier. This formula is given in
your book on p 187. Note that just before the formula, there is a slight misprint: you should
read "... simply replace $\sum x$ with $\sum fx$, $\sum x^2$ with $\sum f(x^2)$ and $n$ with $\sum f$".

Note that the square of the standard deviation i.e. the average of the squared deviations from
the mean (without taking the square root) is also used as a measure of variation and is called
the variance.
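
As a numerical check on the formulae above (a check, not a proof of their equivalence), the following minimal Python sketch computes the standard deviation of a small data set in three ways: from the definition, from the computationally easier form, and from the grouped-data form. The data values and the function names are illustrative assumptions and are not taken from OJ.

```python
from math import sqrt

def sd_definition(values):
    """Square root of the mean of the squared deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((x - mean) ** 2 for x in values) / n)

def sd_shortcut(values):
    """Computationally easier form: mean of the squares minus square of the mean."""
    n = len(values)
    return sqrt(sum(x * x for x in values) / n - (sum(values) / n) ** 2)

def sd_grouped(midpoints, frequencies):
    """Grouped-data form: each class centre point x is weighted by its frequency f."""
    total_f = sum(frequencies)
    mean = sum(f * x for x, f in zip(midpoints, frequencies)) / total_f
    squared = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, frequencies))
    return sqrt(squared / total_f)

# Made-up illustrative data (ungrouped), and the same data as a frequency table.
data = [2, 4, 4, 4, 5, 5, 7, 9]
midpoints = [2, 4, 5, 7, 9]
frequencies = [1, 3, 2, 1, 1]

print(sd_definition(data), sd_shortcut(data), sd_grouped(midpoints, frequencies))
print(sd_definition(data) ** 2)  # the variance is the square of the standard deviation
```

All three functions divide by the number of observations, in line with the formulae quoted in this unit.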

Activity 4

1. Prove the equivalence of the alternative formulae for the standard deviation:

(a) in the case of ungrouped data

$$\sqrt{\frac{\sum (x - \bar{x})^2}{n}} \quad \text{and} \quad \sqrt{\frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2}$$

(b) in the case of grouped data

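$$\sqrt{\frac{\sum f (x - \bar{x})^2}{\sum f}} \quad \text{and} \quad \sqrt{\frac{\sum f x^2}{\sum f} - \left(\frac{\sum f x}{\sum f}\right)^2}$$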

[Hint: Refer to Unit 6 Activity 2(ii)].

2. On Time Ltd has a unit of 130 workers, all performing exactly the same task: the
assembly of watches. The output of the workers for the first week of October 1997 was
recorded and is reproduced below:

Watches Assembled     Number of Workers

up to 449                     3
450 - 469                     7
470 - 489                    18
490 - 509                    25
510 - 529                    36
530 - 549                    27
550 - 569                    10
570 - 589                     4

In Unit 5, you were asked to construct a table and an ogive for the above data.

(a) Using the ogive, estimate

(i) the interquartile range
(ii) the semi-interquartile deviation

(b) Also, obtain by calculation, the above measures as well as the standard deviation for the
same data.

3. Attempt Questions 9.13 and 9.16 in textbook (OJ).

7.2.4 Coefficient of Variation/Measure of Relative Dispersion

Study p 189 (second half) and p 190 (except last five lines).

The standard deviation has a major weakness: it may have the same value for very different
data sets. Thus, from the example given in your textbook (OJ) on p 189, the two data sets

A : 8, 9, 10, 11, 12, 13, 14
B : 1008, 1009, 1010, 1011, 1012, 1013, 1014

have the same standard deviation, 2. Common sense would suggest that there is something
wrong: intuitively, we feel that they do not have the same degree of spread. For data set A,
the increase from the smallest to the largest value is

$$\frac{14 - 8}{8} \times 100 = 75\%$$

but the corresponding increase for data set B is only

$$\frac{1014 - 1008}{1008} \times 100 = 0.595\%$$

This is because the standard deviation takes no account of the order of magnitude of the
data: adding the same constant to every value leaves it unchanged.

To correct this weakness, the standard deviation is related to a measure of the order of
magnitude of the data set. This is the idea underlying the Coefficient of Variation.

We thus have

$$\text{Coefficient of variation, C.V.} = \frac{\text{S.D.}}{\text{Arithmetic Mean}} \times 100$$

(provided that the Arithmetic Mean is not 0).

Can you compute the coefficient of variation for the two data sets A and B given above?
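
Once you have attempted the calculation by hand, a short Python sketch such as the one below can be used to check your answers. It applies the coefficient of variation formula, with the standard deviation computed by dividing by the number of observations, to the data sets A and B. The function names are illustrative assumptions.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def std_dev(values):
    """Standard deviation, dividing by the number of observations."""
    m = mean(values)
    return sqrt(sum((x - m) ** 2 for x in values) / len(values))

def coefficient_of_variation(values):
    """C.V. = (S.D. / arithmetic mean) x 100; undefined when the mean is 0."""
    m = mean(values)
    if m == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return std_dev(values) / m * 100

A = [8, 9, 10, 11, 12, 13, 14]
B = [1008, 1009, 1010, 1011, 1012, 1013, 1014]

for name, data in (("A", A), ("B", B)):
    print(name, round(std_dev(data), 3), round(coefficient_of_variation(data), 3))
```

The two standard deviations printed are equal, while the two coefficients of variation differ by roughly two orders of magnitude.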

Sometimes, the coefficient of variation is referred to as a measure of relative dispersion.
Additionally, the standard deviation is not dimensionless: it is expressed in appropriate
units such as rupees, centimetres, grams etc. Thus, if we were measuring variation in
temperatures, the value of the standard deviation would differ depending on whether
temperatures were measured in degrees Celsius or in degrees Fahrenheit. We call this scale
dependence. The dependence on units of measurement also makes it impossible to use the
standard deviation to compare the variation of variables measured in different and mutually
inconvertible units. It is interesting to note that division by the mean also removes the
dependence on units. Thus in contrast to the standard deviation, the coefficient of variation is
dimensionless.

Moreover, from the practical point of view, we need to be cautious when using the standard
deviation or the coefficient of variation as measures of dispersion or spread. Common sense
must prevail!

Three illustrative examples are provided below:

Example 1

Suppose that cylindrical pins (and their corresponding sockets) are being manufactured to
different diameters, say 5 and 15 millimetres and that it is desired to achieve the same fit
irrespective of the diameters. Then the tolerable absolute variation in the diameters is the
same and the standard deviation is appropriate in comparing the variability in diameter of a
batch of 5 mm pins to that of a batch of 15 mm ones.

Example 2

Suppose we are comparing two groups of households: a low income group that spends around
Rs 1,000 a month and a high income group that spends around Rs 10,000 a month. A
variation of Rs 200 in the first group would be considered substantial but a similar variation
among the second group would be considered minor. When this kind of consideration applies,
it is obvious that we are interested in the dispersion relative to the order of magnitude of the
values and not in the absolute dispersion. The standard deviation, which is a measure of
absolute dispersion, is therefore inappropriate in such situations.

Example 3

Suppose that we are interested in comparing the variability in the weights of a group of
students with the variability in their heights. It is impossible to use the standard deviation, as
the latter depends on the units of measurement (kilograms for weights, centimetres for
heights) and there is no way to convert kilograms, say, to centimetres.
However, the coefficient of variation is dimensionless, i.e. independent of units, and is
therefore appropriate here.

Activity 5

1. Answer the following:

What would happen to

(a) the mean
(b) the standard deviation
(c) the coefficient of variation of wages,

if the wages of all workers of a factory were to be increased (i) by a uniform amount
of Rs. 400 (ii) by 10%? Justify your answers.

2. Attempt Questions 9.19 and 9.21 in (OJ).

7.3 MEASURES OF SKEWNESS

Study from last but one paragraph of p190 to top of p 191 in (OJ).

7.3.1 General Considerations

You were introduced to the notion of skewness in Unit 6. Skewness is another aspect of the
variation in the values of a variable. Whereas measures of dispersion focus on the extent of
the variation, measures of skewness focus on the nature of that variation: as the values vary,
do they tend to be symmetrically distributed around the centre, or do they tend to cluster more
at the lower end or more at the upper end of the range of values? Such considerations are of
interest as they have important implications. For example, income distributions are
notoriously positively skewed. Excessive skewness in an income distribution is often
criticised as reflecting an unfair distribution of income (Figure out why!) although some
skewness in such distributions usually exists.

Activity 6

Governments usually have a means of reducing the skewness of income distributions if this is
perceived as excessive. What is the instrument for doing that and how does it operate?

7.3.2 Coefficients of Skewness

You know from Unit 6 that:

(i) for a perfectly symmetrical distribution, the median and the mean coincide.
(ii) for a positively skewed distribution, the mean exceeds the median.
(iii) for a negatively skewed distribution, the mean is less than the median.

A measure of skewness can therefore be based on the deviation of the mean from the median.

However, it is considered desirable that the value of the measure of skewness should not
change merely because of a change in location (e.g. if the wages of all workers in a factory
were increased by the same amount) or a change in scale (e.g. if the wages of all workers in
the factory were increased by a constant percentage). It is also considered desirable that the
measure of skewness should be independent of units.

These objectives are achieved by dividing the difference between the mean and the median by
the standard deviation, hence Pearson's coefficient of skewness, defined in OJ at the
bottom of p 190:

$$\text{Pearson's Coefficient of Skewness} = \frac{3(\text{mean} - \text{median})}{\text{Standard Deviation}}$$

An alternative measure of skewness (known as Bowley's coefficient of skewness) is based on
the fact that in a symmetrical distribution, the lower and upper quartiles are equidistant from
the median. However, in a skewed distribution, the deviations of the upper and lower
quartiles from the median will be unequal.

Thus the difference between such deviations can be used as a measure of skewness. Again, it
is desirable that the value of the measure of skewness be independent of a change in location
or scale and be free of units. These objectives are achieved by dividing the difference by the
interquartile range.

Bowley's Coefficient of Skewness:

$$sk = \frac{(Q_3 - Me) - (Me - Q_1)}{Q_3 - Q_1} = \frac{Q_3 + Q_1 - 2Me}{Q_3 - Q_1}$$
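
The two coefficients of skewness can be sketched in Python as follows. The summary values used (mean, median, standard deviation and quartiles) are made up purely for illustration, and the function names are assumptions.

```python
def pearson_skewness(mean, median, std_dev):
    """Pearson's coefficient: 3(mean - median) / standard deviation."""
    return 3 * (mean - median) / std_dev

def bowley_skewness(q1, median, q3):
    """Bowley's coefficient: (Q3 + Q1 - 2Me) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * median) / (q3 - q1)

# Hypothetical summary measures of a positively skewed distribution.
mean, median, std_dev = 52.0, 50.0, 6.0
q1, q3 = 40.0, 66.0

print(pearson_skewness(mean, median, std_dev))
print(bowley_skewness(q1, median, q3))

# The first form of Bowley's formula gives the same value as the second.
first_form = ((q3 - median) - (median - q1)) / (q3 - q1)
print(abs(first_form - bowley_skewness(q1, median, q3)) < 1e-12)
```

Both coefficients are positive here because the mean exceeds the median and the upper quartile lies further from the median than the lower quartile does.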

Activity 7

Refer to the data in Question 2 of Activity 4.

Calculate (i) Pearson's coefficient of skewness and the coefficient of variation for the data.
(ii) Bowley's coefficient of skewness.

7.4 SUMMARY

In this unit, you have appreciated the universality of variation and learnt about measures of
dispersion and skewness: their importance, computation, application and interpretation. The
following measures have been covered: range, interquartile range, quartile deviation, mean
deviation, standard deviation, coefficient of variation, quartile coefficient of variation,
Pearson's coefficient of skewness, Bowley's coefficient of skewness. In Unit 6, you saw how
measures of central tendency provide a partial summary of certain data sets. In this unit, you
have seen how measures of dispersion complement that summary.

UNIT 8 TIME SERIES ANALYSIS

Unit Structure

8.0 Overview
8.2 Components of a Time Series and Time Series Models
8.2.1 Components of a Time Series
8.2.2 Time Series Models
8.3 Calculation of Trend
8.3.1 Using Moving Average Method
8.3.2 Using Exponential Smoothing
8.4 Seasonal Variation
8.5 The Residual Component
8.6 Forecasting from the Time Series
8.6.1 Projecting the Trend
8.6.2 Forecasting the Trend Using the Additive Model
8.6.3 Using the Average Rate of Change
8.7 The Multiplicative Model
8.8 Summary

8.0 OVERVIEW

A time series is a type of data set which tells us how a given variable varies over time; such
data sets exhibit very particular patterns and fluctuations. They exist in almost any sphere of
human activity: in economics, business, engineering, medicine, meteorology, agriculture, etc.
In this unit, you will study the analysis of time series and you will be introduced to
elementary forecasting.


1. Explain
   a. the components of a time series.
   b. a time series model.
   c. hence, the underlying structure of a time series.

2. Calculate, interpret and use
   a. the trend.
   b. the seasonal component.
   c. the residual component.

3. Carry out elementary forecasting from time series analysis.

8.2 COMPONENTS OF A TIME SERIES AND TIME SERIES MODELS

8.2.1 Components of a Time Series


8.2.2 Time Series Models

From subsection 8.2.1, you have learnt that there are four components which interact among
themselves in some way to generate the data set under consideration.

Let X denote the time series variable (e.g. imports),
    T the trend,
    S the seasonal component,
    C the cyclical component,
    R the residual component.

For our purposes, as explained in subsection 8.2.1, the time series under consideration will
not cover a sufficiently long period for us to be able to capture the contribution of the
cyclical component. Thus we shall ignore this component in our discussion; the point should
be made that this unavoidable omission, under the present circumstances, will influence our
calculations. Can you think why?

The three remaining components are assumed to interact in mainly two ways to produce the
time series, giving rise to two time series models for short term data:

(i) The Additive Model

X = T + S + R

(ii) The Multiplicative Model

X = T × S × R

The terms are self explanatory. Moreover, in this unit, we shall consider mainly the additive
model; the multiplicative model will be dealt with briefly in subsection 8.7.

8.3 CALCULATION OF TREND

8.3.1 Using Moving Average Method


Your attention is drawn to two main points. Firstly, the idea of a moving total, used in the
construction of a Z-chart (subsection 5.2.5), is used here again to smooth out the fluctuations
before a moving average is computed.

Secondly, it may sometimes be desirable to draw a line of best fit by eye through the
values obtained for the trend. For example, from the graph given on p 129 of your textbook
(OJ), a straight line which best fits the points (hence the line of best fit) can be drawn. A line
of best fit is drawn such that, overall, the points are evenly distributed around the line.
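
The moving-total idea mentioned in the first point can be sketched in Python as follows. This is a minimal illustration, assuming quarterly data and a centred four-quarter moving average (two consecutive four-quarter totals are added and divided by 8); the sales figures and the function name are made up for the purpose.

```python
def centred_moving_average(series, period=4):
    """Trend values from moving totals.

    For an odd period, each moving total is divided by the period.
    For an even period, two consecutive moving totals are added and divided
    by twice the period, so that each trend value is centred on an actual
    time point of the series.
    """
    totals = [sum(series[i:i + period]) for i in range(len(series) - period + 1)]
    if period % 2 == 1:
        return [total / period for total in totals]
    return [(totals[i] + totals[i + 1]) / (2 * period) for i in range(len(totals) - 1)]

# Hypothetical quarterly sales over two years.
sales = [20, 30, 39, 60, 40, 51, 62, 81]
trend = centred_moving_average(sales, period=4)

# With a period of 4, the first trend value corresponds to the third quarter.
for quarter_index, value in enumerate(trend, start=3):
    print(quarter_index, value)
```

Note that half a period's worth of trend values is lost at each end of the series, which is one reason why forecasting requires the trend to be extended, for instance with a line of best fit.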

Activity 1

1. Draw the graphs shown on p 129 of your textbook (OJ). Then draw the line of best fit for
the trend values. Can you forecast the trend values for years 16, 17 and 25? Comment on
your results.

2. Attempt Questions 6.4 and 6.7 in your textbook (OJ).

8.3.2 Using Exponential Smoothing

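Study pp 131-135 of your textbook (OJ). It is appropriate to note that if $X_i$ and $T_i$ denote the
corresponding values of the time series and the trend respectively at time i, and $\alpha$ denotes the
smoothing constant, then the trend values are given as follows:

$$T_1 = X_1$$

$$T_2 = \alpha X_2 + (1-\alpha) T_1 = \alpha X_2 + (1-\alpha) X_1$$

Note that $\alpha + (1-\alpha) = 1$.

$$T_3 = \alpha X_3 + (1-\alpha) T_2 = \alpha X_3 + (1-\alpha)\{\alpha X_2 + (1-\alpha) T_1\} = \alpha X_3 + \alpha(1-\alpha) X_2 + (1-\alpha)^2 X_1$$

Note that $\alpha + \alpha(1-\alpha) + (1-\alpha)^2 = 1$.

$$T_t = \alpha X_t + \alpha(1-\alpha) X_{t-1} + \alpha(1-\alpha)^2 X_{t-2} + \dots + \alpha(1-\alpha)^r X_{t-r} + \dots + (1-\alpha)^{t-1} X_1$$

Note that the sum of the weights $\alpha, \alpha(1-\alpha), \alpha(1-\alpha)^2, \dots, (1-\alpha)^{t-1}$ assigned to the
values of X is equal to 1, or tends to 1 as t becomes very large.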

Thus, when using exponential smoothing, we use all relevant values of the time series to
compute the trend, with the most recent value having more weight and the most distant value
having the least weight. Furthermore, the weights add up to 1 (or tend to add up to 1), so
that the trend value is a weighted average of all relevant values of the time series.
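
The recurrence and the weights described above can be sketched in Python as follows. This is a minimal illustration, assuming that the trend is started at the first observation and that the smoothing constant, called alpha here, lies between 0 and 1; the series values and the function names are made up.

```python
def exponential_smoothing(series, alpha):
    """Trend values: the first equals the first observation, and each later
    one is alpha * (current observation) + (1 - alpha) * (previous trend)."""
    trend = [series[0]]
    for x in series[1:]:
        trend.append(alpha * x + (1 - alpha) * trend[-1])
    return trend

def smoothing_weights(t, alpha):
    """Weights attached to X_t, X_(t-1), ..., X_1 in the expanded form of T_t."""
    weights = [alpha * (1 - alpha) ** r for r in range(t - 1)]
    weights.append((1 - alpha) ** (t - 1))  # weight on the first observation
    return weights

# Hypothetical series.
series = [10.0, 12.0, 11.0, 13.0]
alpha = 0.2

trend = exponential_smoothing(series, alpha)
weights = smoothing_weights(len(series), alpha)

# The last trend value equals the weighted sum of the observations,
# taken from the most recent back to the first.
weighted_sum = sum(w * x for w, x in zip(weights, reversed(series)))

print(trend)
print(sum(weights))
print(weighted_sum)
```

The recurrence and the expanded weighted sum give the same value for the last trend point, which is the sense in which the trend is a weighted average of the observations.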

Activity 2

1. Attempt Questions 6.4 and 6.7 in your textbook (OJ).
2. For the data in Question 6.4, obtain the exponentially smoothed trend line with
$\alpha = 0.2$.

8.4 SEASONAL VARIATION


Recall that our additive model is given by

X = T + S + R .

Thus, we can have X - T = S + R and the values for X-T give the deviation from trend as
mentioned on p 139 of your textbook (OJ); alternatively, these values are also known as
detrended values.

Similarly, we can have X - S = T + R and the values for X - S give the series with seasonal
variation eliminated as mentioned on p 142 of your textbook; alternatively, these values are
also known as the seasonally adjusted series or deseasonalised series.

Further, in addition to the assumption made regarding the residuals (p 140 of your textbook),
which underlies the calculation of the seasonal component, two more assumptions should be
made explicit:

(i) With reference to the example discussed on p 140 of your textbook, the sum of the
four seasonal components for a given year is equal to zero. This seems to be a
sensible assumption, in line with the very notion of seasonal variation.

(ii) Each seasonal component (i.e. each one of the four considered in the example) is
assumed to be constant over time. This may not always be true, the more so when the
time series is spread over a long period of time. Moreover, it is somewhat difficult for
us to handle such cases; and these cases are not within the scope of this unit (and
course). It is to be noted that the multiplicative model referred to in 8.7 can cope to a
certain extent with this situation.

It is to be noted that, as mentioned on p. 141 (OJ), we expect the sum of the averages of S+R
for each quarter to be equal to 0. This is due to the fact that

(i) in a given year, the sum of the four seasonal components is expected to be equal to 0;

(ii) the residual fluctuations are assumed to be random so that, in the long run, we expect
them to cancel each other out.

Thus, if the sum is not zero, then an adjustment is necessary as explained on p. 141 (OJ).
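
The calculation of the seasonal components can be sketched in Python as follows. The detrended values are made up, and the adjustment used here (spreading the non-zero total equally over the four quarters) is an assumption made for illustration; check it against the procedure on p. 141 (OJ).

```python
def seasonal_components(detrended, quarters, n_seasons=4):
    """Average the detrended values (X - T) quarter by quarter, then adjust
    the averages so that they add up to zero."""
    sums = [0.0] * n_seasons
    counts = [0] * n_seasons
    for value, quarter in zip(detrended, quarters):
        sums[quarter - 1] += value
        counts[quarter - 1] += 1
    averages = [s / c for s, c in zip(sums, counts)]
    # Assumed adjustment: share the excess equally among the quarters.
    correction = sum(averages) / n_seasons
    return [average - correction for average in averages]

# Hypothetical detrended values for two years, starting in quarter 3.
quarters = [3, 4, 1, 2, 3, 4, 1, 2]
detrended = [-5.0, 8.0, 2.0, -4.0, -6.0, 9.0, 3.0, -3.0]

components = seasonal_components(detrended, quarters)
print(components)
print(sum(components))
```

Averaging over several years is what allows the random residuals to cancel out, leaving an estimate of the seasonal component for each quarter.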

Activity 3

(i) Obtain seasonal components for the data set from Question 6.4 in your textbook (OJ).
(ii) Attempt Question 7.6 in your textbook (OJ).

8.5 THE RESIDUAL COMPONENT


Recall that our assumed time series model is defined by

X = T + S + R .

We must bear in mind that

(i) the exclusion of the cyclical fluctuations from our model and

(ii) the possible failure, at times, of assumptions underlying the computation of the
seasonal component

have a bearing on the accuracy of the calculation of the various components in one way or
another. In particular, in addition to the various points made in your textbook regarding the
nature and content of the residual component, the residual component captures the resulting
errors to some extent, and this may lead to further inaccuracy. For our purposes, however,
the model and the method are satisfactory.

8.6 FORECASTING FROM THE TIME SERIES

8.6.1 Projecting the Trend

Study pp 144-147 of your textbook (OJ).

Some comments on the method used to project the trend in your textbook are pertinent. As
already noted in your textbook on p 147, projecting the trend by eye is neither accurate
nor consistent. Different people may give different forecasts.

The alternative is to draw a line of best fit as explained in subsection 8.3.1 and then to carry
out the needed forecasts. This method is obviously approximate and holds, provided the trend
is linear or at least approximately linear.

Moreover, it sometimes happens that there is a marked change in the slope of the trend at a
given point in time, so that the trend may be approximated by two linear parts, as shown in
Figure 8.1:

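[Figure 8.1: trend plotted against time, with the plotted points falling along two approximately linear parts, AB and BC]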

In such cases, the trend is estimated by extrapolating only the second linear part (BC) of the
graph.

Activity 4

Reproduce the graph found on p 145 of your textbook (OJ). Carry out the forecast of the
trend for 19-6 Quarters 1 and 2 over again, using a line of best fit by eye.

8.6.2 Forecasting the Trend Using the Additive Model

The trend values obtained from the moving average method are plotted against time on a graph.
We shall assume, for our purposes, that we have a linear trend. Thus a line of best fit is
drawn by eye. Thereafter, that line is extrapolated and the corresponding forecasts for the
trend can be read off.

In your textbook, the method used to obtain forecasts of imports from the forecast trend
values (pp 146-147) has the merit that we do not have to disentangle the seasonal component
from the residual component. This is particularly valid when it is believed that the residual
fluctuations are pronounced.

However, if the residual fluctuations are of minor importance (as is normally expected), then
we can simply forecast the imports by using our model

X = T + S.

Substituting the estimated trend value and the corresponding seasonal component in the
model gives the forecast value for the time series.
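
A forecast from the model X = T + S can be sketched in Python as follows. Note that the sketch fits the trend line by least squares, which is a substitute for the line of best fit drawn by eye described in this unit; the trend values, the seasonal components and the function names are all made up for illustration.

```python
def fit_line(times, values):
    """Least-squares straight line through the points (time, value).
    Used here in place of a line of best fit drawn by eye."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    denominator = sum((t - mean_t) ** 2 for t in times)
    slope = numerator / denominator
    return slope, mean_v - slope * mean_t

def forecast(time, slope, intercept, seasonal_component):
    """Additive forecast: extrapolated trend plus the seasonal component."""
    return intercept + slope * time + seasonal_component

# Hypothetical trend values for time points 1 to 4 (time point 1 is quarter 1).
trend_times = [1, 2, 3, 4]
trend_values = [100.0, 102.0, 104.0, 106.0]

# Hypothetical seasonal components for quarters 1 to 4 (they add up to zero).
seasonal = {1: 3.0, 2: -4.0, 3: -6.0, 4: 7.0}

slope, intercept = fit_line(trend_times, trend_values)
# Time point 6 falls in quarter 2 of the following year.
print(forecast(6, slope, intercept, seasonal[2]))
```

The further ahead the time point, the more the forecast relies on the assumption that the linear trend continues unchanged.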

8.6.3 Using the Average Rate of Change


8.7 THE MULTIPLICATIVE MODEL


In a multiplicative model, as mentioned on p. 150 (OJ), the total of the average ratios is
expected to be four in the example under consideration. It is appropriate to understand why
this is so.

As in the additive model, the seasonal components at times produce an increase relative to
the trend and at times a decrease. In a given year (as per the example), we expect these
increasing and decreasing effects to cancel each other out. But in the multiplicative model,
the seasonal components do not add up to 0 as in the additive model. Instead,
they add up to 4. Why?

The ratios are calculated as Actual data/Trend. If there were no seasonal fluctuations, this
ratio would be equal to 1 (assuming the residual fluctuations to be insignificant). Some
seasonal components produce an increasing effect, so that the ratio is more than
1 (e.g. 1.0633 or 1.1071 as per p. 150 OJ). Other seasonal components produce a
decreasing effect, so that the ratio is less than 1 (e.g. 0.8471 or 0.9873). As per the
definition of seasonal fluctuation, we expect these effects to cancel each other out in one
year (as per the example), so that the four ratios average 1 and their total is expected to
be 4. If not, we then make an adjustment as explained on p. 150 (OJ).
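
The adjustment can be sketched in Python using the four average ratios quoted above. The rescaling used here (multiplying each ratio by 4 divided by their total) is an assumption made for illustration and may differ from the procedure on p. 150 (OJ); the order of the ratios and the variable names are also assumptions.

```python
# Average ratios (Actual data / Trend) quoted above from p. 150 of OJ.
average_ratios = [1.0633, 1.1071, 0.8471, 0.9873]

total = sum(average_ratios)
print(round(total, 4))  # slightly different from the expected total of 4

# Assumed adjustment: rescale the ratios so that they add up to exactly 4.
n_seasons = len(average_ratios)
adjusted = [ratio * n_seasons / total for ratio in average_ratios]

print([round(ratio, 4) for ratio in adjusted])
print(round(sum(adjusted), 4))
```

Rescaling keeps the ratios in the same proportion to one another, in the same way that the additive adjustment keeps the differences between the seasonal components unchanged.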

Activity 5

Attempt Question 7.10 of your textbook (OJ).

8.8 SUMMARY

In this unit, you have learnt about time series analysis. You should now have a good grasp of
the nature of the various components of a time series and of their calculation and
interpretation. You have also learnt how to carry out elementary forecasting from time series
analysis.

UNIT 9 INDEX NUMBERS

Unit Structure

9.0 Overview
9.2 Index Numbers
9.3 Methods of Construction of Index Numbers
9.3.1 Index Numbers of Prices
9.3.2 Two Approaches
9.3.3 Worked Examples
9.3.4 Fisher's Index of Prices
9.3.5 General Formulae for Index Numbers
9.3.6 Index Numbers of Quantities (or Volume)
9.4 Further Concepts
9.4.1 Splicing Two Series of Index Numbers
9.4.2 Chain-Based Index Numbers
9.4.3 Using an Index to Deflate a Time Series
9.5 General Problems of Index Number Construction
9.6 Uses and Limitations of Index Numbers
9.7 Summary

9.0 OVERVIEW

This Unit introduces you to statistical tools called index numbers, which attempt to measure
the magnitude of changes in any variable over time. Here, we shall be more concerned with
changes in economic variables over time. The unit will cover the different types of price and
quantity index numbers, the general problems of index number construction, interpretation,
uses and limitations of index numbers. Chapter 8 of your textbook covers some of these
topics. However, the different types of index numbers are not introduced in a proper order
and also the different types of index number construction are not well defined (pp 158-164).
This unit, therefore, consists of a complete write-up of the topics covered in these pages in
your textbook as well as of other topics which are omitted in the chapter. We shall make
special reference to these pages where necessary. Treatment of topics on pp 165-170 is quite
satisfactory and, therefore, these topics are not re-written in the unit.



1. Compute, interpret and compare the different types of price and quantity index
numbers
2. Explain the importance of weights in an index number
3. Identify the main practical issues to be considered when constructing an index number
4. Change from fixed base to chain base and vice versa, splice and deflate an index
number series
5. Identify the uses and limitations of index numbers.

9.2 INDEX NUMBERS

As mentioned in the overview, index numbers are devices for measuring the magnitude of
changes in a variable over time. Such changes could be in the price of commodities, in the
quantity of goods produced, marketed, or consumed, or in such concepts as productivity,
efficiency, etc. The comparisons may be between different time periods, between places, or
between like categories. In many of these situations, the volume of data that has to be
analysed is huge and also has other characteristics that you did not come across in data for
averages. Index numbers are special types of averages which can make such masses of
complex data more manageable and better understood, and thus enable us to compare
different sets of data. Thus, we may have index numbers comparing the consumer prices in
different years or in different countries, the volume of production in different years, the
productivity of different sectors of the economy, or, the efficiency of different school
systems. Read pp. 157-158 of your textbook for further details.

This unit is mainly concerned with index numbers of prices and of quantities comparing
changes over time. At first, methods of construction of index numbers of prices and of
quantities are considered. We then introduce further concepts such as chain-based index
numbers, splicing of index numbers and use of an index number to deflate a time series. You
can read about splicing and deflating in your textbook (OJ). Finally, we deal with general
problems of index number construction, and, uses and limitations of index numbers.

9.3 METHODS OF CONSTRUCTION OF INDEX NUMBERS

We shall consider methods of construction of price indices at first and later on show how
quantity indices can be obtained similarly.

9.3.1 Index Numbers of Prices

To illustrate the construction and interpretation of index numbers of prices, we use the
data of the example on p. 158 (OJ), which involves a list of three commodities.

Al Coholic throws a (rather unusual) party each Christmas for his friends. Details of prices
and quantities of the three food and drink items purchased by him in 1992 and 1993 are as
follows:

Table 9.1
                             1992                  1993
Commodity               Price    Quantity     Price    Quantity
                         p_0       q_0         p_n       q_n
Lager (per bottle)       1.00       40         1.15       50
Crisps (per packet)      0.20      100         0.27       90
Cake                     2.00        1         2.20        1

The objective of Al Coholic is to know the changes in the prices of these commodities taken
as a whole in 1993 as compared with those in 1992. The time period that serves as the basis
for comparison is called the base period, whereas the time period that is compared with the
base period is called the current period. Thus, here 1992 is the base year and 1993 is the
current year.

Note: The base period is sometimes indicated by setting that period equal to 100. For
example, here, 1992 = 100 will show that 1992 is the base year.

9.3.2 Two Approaches

There are two main approaches to handle the problem of determining the changes in the
prices of a group of commodities taken as a whole in a given year as compared with those in
another year:

One approach is to consider the change in the price of each commodity initially and then
to try, in some way, to bring together these changes, by, for example, using an average.

The second approach is to consider the prices of all the commodities at one point in time
and then relate them, in some way, to those at the other point in time under consideration.

We shall now consider these two methods for the rest of this section.

Method I: Price Relatives Method

A very simple way of finding the change in the price of each commodity would be by
calculating what is known as a price relative for each commodity.

The price relative of a commodity is defined as its price in the current period expressed as a
percentage of its price in the base period.

If $p_0$ and $p_n$ denote the prices of a commodity during the base period and the given period
respectively, then, symbolically,

$$\text{Price relative} = \frac{p_n}{p_0} \times 100 \qquad \text{Equation 9.1}$$

In the above example, the price relatives for the three commodities in 1993 with 1992 as the
base year are then given by:

$$\text{Price relative for Lager} = \frac{1.15}{1.00} \times 100 = 115$$

$$\text{Price relative for Crisps} = \frac{0.27}{0.20} \times 100 = 135$$

$$\text{Price relative for Cake} = \frac{2.20}{2.00} \times 100 = 110$$

Thus it is observed from all the above price relatives that the prices of Lager, Crisps and Cake
have risen by 15, 35 and 10 percent respectively.
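
The price relatives can be reproduced with a short Python sketch, using the prices from Table 9.1. The function and variable names are illustrative assumptions.

```python
def price_relative(p_0, p_n):
    """Equation 9.1: current price as a percentage of the base price."""
    return p_n / p_0 * 100

# Prices from Table 9.1.
prices_1992 = {"Lager": 1.00, "Crisps": 0.20, "Cake": 2.00}  # base year, p_0
prices_1993 = {"Lager": 1.15, "Crisps": 0.27, "Cake": 2.20}  # current year, p_n

relatives = {
    item: price_relative(prices_1992[item], prices_1993[item])
    for item in prices_1992
}

for item, relative in relatives.items():
    # Percentage change = price relative - 100
    print(f"{item}: relative {relative:.1f}, change {relative - 100:.1f}%")
```

Subtracting 100 from a price relative gives the percentage change in the price of that commodity.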

However, as mentioned earlier, Al Coholic is interested in a single measure which would
compare the change in prices, of all the three commodities taken as a whole, in the two years.

We are tempted to believe that one way of combining the changes in the price of a group of
commodities is to find an average of price relatives of these commodities.

An average of price relatives is calculated, most frequently using the arithmetic mean, as
follows:

$$\text{Arithmetic Mean of Price Relatives} = \frac{\sum \left( \frac{p_n}{p_0} \times 100 \right)}{N} \qquad \text{Equation 9.2}$$

where N = the number of commodities included in the index.

Table 9.2
                              Price            Price Relative
Commodity                1992      1993
                          p_0       p_n        (p_n/p_0) × 100
Lager (per bottle)       1.00      1.15             115.0
Crisps (per packet)      0.20      0.27             135.0
Cake                     2.00      2.20             110.0
Total                                               360.0

Substituting the above calculated value in (9.2), the arithmetic mean of price relatives of 1993
with 1992 as the base year is

$$\frac{\sum \left( \frac{p_n}{p_0} \times 100 \right)}{N} = \frac{360.0}{3} = 120.0$$

Thus, in 1993, the average percentage increase in the prices for this group of commodities as
compared with 1992 is 20%.
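
Equation 9.2 can be checked with a minimal Python sketch, again using the prices from Table 9.1. The function name is an illustrative assumption.

```python
def mean_of_price_relatives(base_prices, current_prices):
    """Equation 9.2: unweighted arithmetic mean of the price relatives."""
    relatives = [p_n / p_0 * 100 for p_0, p_n in zip(base_prices, current_prices)]
    return sum(relatives) / len(relatives)

# Prices of Lager, Crisps and Cake from Table 9.1.
p_0 = [1.00, 0.20, 2.00]  # 1992
p_n = [1.15, 0.27, 2.20]  # 1993

index = mean_of_price_relatives(p_0, p_n)
print(f"{index:.1f}")
```

Each commodity contributes equally to this mean, whatever the quantity purchased.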

However, this method does not recognise the relative importance of different commodities in
the consumption pattern: the weight, or importance, given to Lager, Crisps and Cake is equal.
Before we tackle this weakness, let us consider the second approach.

Method II: Aggregative Method

The second approach is to relate the sum (aggregate) of the unit prices of a group of
commodities in the current year to the sum (aggregate) of the unit prices of these commodities
in the base year in some way. We are tempted to consider the following expression:-

$$\frac{\sum p_n}{\sum p_0} \times 100 \qquad \text{Equation 9.3}$$

Substituting the totals from Table 9.3 in (9.3), the index for the three commodities in the
example given above is

$$\frac{\sum p_n}{\sum p_0} \times 100 = \frac{3.62}{3.20} \times 100 = 113.1$$

Table 9.3
                              Price
Commodity                1992      1993
                          p_0       p_n
Lager (per bottle)       1.00      1.15
Crisps (per packet)      0.20      0.27
Cake                     2.00      2.20
Total                    3.20      3.62

Thus, in 1993 Al Coholic's total cost of one bottle of Lager, one packet of crisps and one
cake was 113.1% of the total cost of these commodities in 1992. In terms of percentage
change, it cost 13.1% more than in 1992 to purchase this basket of goods.
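
Equation 9.3 can be checked in the same way. The sketch below is a minimal illustration using the unit prices from Table 9.3; the function name is an assumption.

```python
def simple_aggregative_index(base_prices, current_prices):
    """Equation 9.3: total of current unit prices as a percentage of the
    total of base unit prices."""
    return sum(current_prices) / sum(base_prices) * 100

# Unit prices of Lager, Crisps and Cake from Table 9.3.
p_0 = [1.00, 0.20, 2.00]  # 1992
p_n = [1.15, 0.27, 2.20]  # 1993

index = simple_aggregative_index(p_0, p_n)
print(f"{index:.1f}")
```

Because unit prices are simply added, the commodity with the highest unit price (here the cake) dominates this index, whatever the quantity purchased.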

Yet again, this method does not recognise the relative importance of
different commodities in the consumption pattern.

WEIGHTS

As mentioned above, the various items in a group do not generally have equal relative
importance, and both the methods considered above suffer from the drawback that they do not
take into account the relative importance of the different items.

Thus, although we were tempted to use these simple methods to determine the average
change in a variable over time, we should not use them; we should include, in each method,
measures of the relative importance of the various items known as weights, denoted by w, as
done below.

Method I (Relative Method):

$$\frac{\sum \left( \frac{p_n}{p_0} \times w \right)}{\sum w} \times 100$$

Method II (Aggregative Method):

$$\frac{\sum w p_n}{\sum w p_0} \times 100$$

Which Weights to Use?

In the consumer price example used here, the weights considered are usually:

Method I (Relative Method): expenditure, i.e. the product of price and quantity, pq. Thus w = pq.

Method II (Aggregative Method): quantities, i.e. q. Thus w = q.

Obviously, there exist two sets of values of prices, quantities and expenditures: those of the base
year (i.e. $p_0$, $q_0$, $p_0 q_0$) and those of the current year (i.e. $p_n$, $q_n$, $p_n q_n$). Effectively, each set of
values of expenditures and quantities can be used as weights within the corresponding
methods, giving the following formulae for the index numbers:

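(a) Base Weighting

Method I (Relative Method):

$$\frac{\sum \left( \frac{p_n}{p_0} \times p_0 q_0 \right)}{\sum p_0 q_0} \times 100$$

Method II (Aggregative Method):

$$\frac{\sum p_n q_0}{\sum p_0 q_0} \times 100$$

(b) Current Weighting

Method I (Relative Method):

$$\frac{\sum \left( \frac{p_n}{p_0} \times p_n q_n \right)}{\sum p_n q_n} \times 100$$

Method II (Aggregative Method):

$$\frac{\sum p_n q_n}{\sum p_0 q_n} \times 100$$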

We note that when weights of the base year are used, we refer to base weighting. Laspeyres is the
person who introduced this technique and hence the corresponding formulae are referred to
as Laspeyres' formulae.

Similarly, we have current weighting; Paasche is the person who introduced this technique
and hence the corresponding formulae are referred to as Paasche's formulae.
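
The base-weighted and current-weighted formulae can be sketched in Python as follows, using the data of Table 9.1. The function names are illustrative assumptions; the formulae are those of the relative method (with expenditures as weights) and of the aggregative method (with quantities as weights).

```python
def weighted_relatives_index(p_0, p_n, weights):
    """Method I: weighted mean of the price relatives."""
    weighted = sum(pn / p0 * w for p0, pn, w in zip(p_0, p_n, weights))
    return weighted / sum(weights) * 100

def weighted_aggregative_index(p_0, p_n, quantities):
    """Method II: cost of a fixed basket at current prices as a percentage
    of its cost at base prices."""
    current_cost = sum(pn * q for pn, q in zip(p_n, quantities))
    base_cost = sum(p0 * q for p0, q in zip(p_0, quantities))
    return current_cost / base_cost * 100

# Lager, Crisps and Cake from Table 9.1.
p_0, q_0 = [1.00, 0.20, 2.00], [40, 100, 1]  # 1992 (base year)
p_n, q_n = [1.15, 0.27, 2.20], [50, 90, 1]   # 1993 (current year)

base_expenditure = [p * q for p, q in zip(p_0, q_0)]     # p_0 q_0
current_expenditure = [p * q for p, q in zip(p_n, q_n)]  # p_n q_n

laspeyres_relative = weighted_relatives_index(p_0, p_n, base_expenditure)
paasche_relative = weighted_relatives_index(p_0, p_n, current_expenditure)
laspeyres_aggregative = weighted_aggregative_index(p_0, p_n, q_0)
paasche_aggregative = weighted_aggregative_index(p_0, p_n, q_n)

print(f"{laspeyres_relative:.2f} {paasche_relative:.2f}")
print(f"{laspeyres_aggregative:.2f} {paasche_aggregative:.2f}")
```

The only difference between base weighting and current weighting is which year's quantities (or expenditures) are passed in as weights.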

9.3.3 Worked Examples

The corresponding calculations for the different formulae are given below:

Method I (Relative Method)

(a) Laspeyres' Index Number (Using Base Weighting)

As per page 83, the Price Index is

$$\frac{\sum \left( \frac{p_n}{p_0} \times p_0 q_0 \right)}{\sum p_0 q_0} \times 100$$

The table below shows the calculations for using the above formula.

Table 9.4
                             Price          Quantity   Price Relative   Base Expenditure   PR × Weight
Commodity                1992     1993        1992          (PR)            (weight)
                          p_0      p_n         q_0         p_n/p_0           p_0 q_0
Lager (per bottle)       1.00     1.15         40           1.15               40             46.00
Crisps (per packet)      0.20     0.27        100           1.35               20             27.00
Cake                     2.00     2.20          1           1.10                2              2.20
Total                                                                          62             75.20

Thus,

$$\frac{\sum \frac{p_n}{p_0}\, p_0 q_0}{\sum p_0 q_0} \times 100 = \frac{75.20}{62} \times 100 = 121.3$$

Read relevant parts of p 161 of your textbook (OJ).

(b) Paasche's Index Number (Using Current Weighting)

We compute the corresponding current year expenditures $p_n q_n$ and then we obtain the
Price Index (as per page 83) from

$$\frac{\sum \frac{p_n}{p_0}\, p_n q_n}{\sum p_n q_n} \times 100$$

Table 9.5

                      Price Relative   Current Expenditure   PR x Weight
                      (PR)             (Weight)
Commodity             pn/p0            pn qn
Lager (per bottle)    1.15             57.50                 66.125
Crisps (per packet)   1.35             24.30                 32.805
Cake                  1.10             2.20                  2.420
Total                                  84.00                 101.350

The Price Index (current weighting) = $\frac{101.35}{84} \times 100$ = 120.65

Method II (Aggregative Method)

(a) Laspeyres Index Number (Base Weighting)

The index number is given (as per page 83) by

$$\frac{\sum p_n q_0}{\sum p_0 q_0} \times 100$$

and is calculated below:

Table 9.6

                      Price          Quantity   Price x Base Quantity
Commodity             1992    1993   1992
                      p0      pn     q0         p0 q0     pn q0
Lager (per bottle)    1.00    1.15   40         40.00     46.00
Crisps (per packet)   0.20    0.27   100        20.00     27.00
Cake                  2.00    2.20   1          2.00      2.20
Total                                           62.00     75.20

Laspeyres Price Index = $\frac{75.20}{62.00} \times 100$ = 121.3

i.e. prices in 1993 have increased by 21.3% as compared with those in 1992.

(b) Paasche's Index Number (Current Weighting)

Paasche's Index Number is here given (as per page 83) by

$$\frac{\sum p_n q_n}{\sum p_0 q_n} \times 100$$

Table 9.7

                      Price          Quantity   Price x Current Quantity
Commodity             1992    1993   1993
                      p0      pn     qn         p0 qn     pn qn
Lager (per bottle)    1.00    1.15   50         50.0      57.5
Crisps (per packet)   0.20    0.27   90         18.0      24.3
Cake                  2.00    2.20   1          2.0       2.2
Total                                           70.0      84.0

Paasche's Price Index = $\frac{84}{70} \times 100$ = 120.0
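
The four worked calculations can be checked with a short Python sketch. This is an illustration only, not part of the textbook method: the function names `relative_index` and `aggregative_index` are hypothetical, and the data are the prices and quantities shown in Tables 9.4 to 9.7.

```python
# Prices and quantities taken from Tables 9.4 to 9.7.
p0 = [1.00, 0.20, 2.00]   # base-year prices (1992)
pn = [1.15, 0.27, 2.20]   # current-year prices (1993)
q0 = [40, 100, 1]         # base-year quantities (1992)
qn = [50, 90, 1]          # current-year quantities (1993)


def relative_index(x0, xn, weights):
    """Method I: weighted mean of the relatives xn/x0, multiplied by 100."""
    numerator = sum((b / a) * w for a, b, w in zip(x0, xn, weights))
    return 100 * numerator / sum(weights)


def aggregative_index(x0, xn, weights):
    """Method II: ratio of the weighted aggregates, multiplied by 100."""
    numerator = sum(b * w for b, w in zip(xn, weights))
    denominator = sum(a * w for a, w in zip(x0, weights))
    return 100 * numerator / denominator


base_expenditure = [p * q for p, q in zip(p0, q0)]      # weights p0*q0
current_expenditure = [p * q for p, q in zip(pn, qn)]   # weights pn*qn

laspeyres_rel = relative_index(p0, pn, base_expenditure)
paasche_rel = relative_index(p0, pn, current_expenditure)
laspeyres_agg = aggregative_index(p0, pn, q0)
paasche_agg = aggregative_index(p0, pn, qn)

print(round(laspeyres_rel, 1), round(paasche_rel, 2))
print(round(laspeyres_agg, 1), round(paasche_agg, 1))
```

Rounded as in the text, the sketch gives 121.3 and 120.65 for Method I, and 121.3 and 120.0 for Method II.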

Comments:

1. We note that the Laspeyres Price indices, whether using expenditures or
quantities as weights, are both exactly equal to 121.3. This is not a coincidence. In fact, if
we study the corresponding formulae carefully, we find that the formula for base
weighting in the case of Method I (Relative Method) simplifies to the formula
used for base weighting in the case of Method II (Aggregative Method), because
$p_0$ cancels in each term of the numerator.

Thus

$$\frac{\sum \frac{p_n}{p_0}\, p_0 q_0}{\sum p_0 q_0} \times 100 = \frac{\sum p_n q_0}{\sum p_0 q_0} \times 100$$

2. The choice of weights is of utmost importance in the construction of index numbers.
Here, when the variable under study is price, the weights chosen are expenditures and
quantities. For other variables, different weights would be used. They should be
appropriate to the purpose for which they are meant: they should measure the relative
importance of the items under consideration.

3. There is a tendency to prefer base weighting to current weighting because it helps in
the comparability of the indices over time and also because, with current weighting,
there may be problems of interpretation of the index over time. Also, it is easier to use
and understand base-weighted indices.

However, when prices are changing sharply over a short period of time, causing
changes in the pattern of consumption, current weighting will obviously be more
appropriate.

4. There is a tendency to use aggregative indices more often because they are rather easy
to compute, use and understand. Bearing in mind the point made in (3) above,
aggregative indices with base weighting tend to be used quite often.

5. Finally, it is to be noted that indices obtained by the relative method are independent
of the units of measurement whilst those obtained by the aggregative method are not.

Interpretation/Discussion:

1. Generally speaking, the Laspeyres Index tends to overstate and Paasche's Index tends to
understate changes in prices or quantities.
Read pp 161-162 (OJ), Section on Paasche's Index.

2. If we were to consider the formulae for the aggregative prices indices, it would be of
interest to give some thought to the possible interpretation of these indices.

Thus consider the aggregative Laspeyres Price Index

$$\frac{\sum p_n q_0}{\sum p_0 q_0}$$

The denominator, $\sum p_0 q_0$, is in fact the actual expenditure incurred in the base
period, whilst the numerator, $\sum p_n q_0$, represents the expenditure that would have been
incurred in the current period if the pattern of consumption in year n were the same as
that of year 0 (as measured by the quantities $q_0$).

We can therefore interpret the index number as the ratio of the "would have been"
expenditure in the current year, keeping the pattern of consumption constant, to the actual
expenditure in the base year.

3. Consider now the aggregative Paasche's Price Index

$$\frac{\sum p_n q_n}{\sum p_0 q_n}$$

Can you interpret this index number?

4. One last word: given the different approaches involved in the construction of index
numbers and given the different prevailing systems of weights, it is inevitable that
there is more than one possible index number to measure the change in a variable
over time. The preceding discussion and your common sense should guide you in
choosing and interpreting the appropriate index number in a given context.

Activity 1

1. Attempt Questions 8.3 and 8.10 from your textbook (OJ).

2. Using data of Question 8.13 in OJ and, taking 19-4 = 100, calculate Laspeyres and
Paasches index numbers for 19-9 prices, using both approaches.

3. A basic Food Price Index (F.P.I) comprises the undermentioned items, weighted for
the average family taking a normal diet as follows:

            Price           Weighting
Bread       Rs 1/loaf       7 loaves
Potatoes    Rs 4/lb         20 lbs
Milk        Rs 5/pint       15 pints
Eggs        Rs 18/dozen     2 dozen
Meat        Rs 40/lb        10 lbs

It is expected that during the next year, the cost of bread will rise by 10%, potatoes
will rise by 25%, milk will fall by 10%, eggs will fall by 5% and meat will increase by
30%.
Calculate the F.P.I expected in one year's time, if the present F.P.I is 112.

Calculate the F.P.I expected in three years' time if prices continue to change at the
same average rate.

Suppose that it is predicted that people will spend rather more on milk and eggs and
somewhat less on meat during the coming year. In what way would you expect your
answer to part (1) to be affected, if a current weighted index were used?

Why could a current weighted F.P.I be unsatisfactory?

4. A factory produces togs, clogs and pegs, each of these three products having a
different work content. The proportions of these products vary from month to month
and the factory requires an index for assessing productivity changes. Each tog, clog
and peg produced is to be weighted according to its work content, these weights being
6, 8 and 5 respectively. Also, because some months contain more working days than
others, the index should offset the effect of this.

Data for the months of May, June and July are as follows:

                      May    June    July
Working days          23     22      16 (due to factory closure for 2 weeks)
Output (thousands)
   togs               19     16      10
   clogs              12     20      15
   pegs               22     15      10

It is intended that May should be the base month for comparison, with a productivity index of
100.

Design a simple productivity index, calculate its value for June and July, and comment
briefly on the results.

Now, due to a change in the type of peg produced, a new weight is required.

Production data are shown below for two days when productivity was judged to be
about equal.

            Day 1    Day 2
Output
Togs        921      811
Clogs       800      747
Pegs        1042     1206

Use these data to estimate a suitable weight for the new pegs, to 1 decimal place, assuming
that the weightings of 6 for togs and 8 for clogs are as before.

9.3.4 Fisher's Index of Prices

Read pp 165-166 of your textbook (OJ)

9.3.5 General Formulae for Index Numbers

The examples considered above in measuring changes in the prices of a group of items can
now be generalized to the measuring of changes in any other variable (quantities,
productivity, efficiency, examination marks etc.) for a group of items. The two methods used
are always applicable, together with the possibility of base and current weighting.

Thus suppose we need an index number to measure the changes in a variable X for a group
of items over time. Then using subscripts n and o for current and base periods respectively,
and denoting weights by w, in a manner similar to that used on page 83, we have


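Method I (Relative Method):

$$\frac{\sum \frac{x_n}{x_o}\, w}{\sum w} \times 100$$

Method II (Aggregative Method):

$$\frac{\sum x_n w}{\sum x_o w} \times 100$$

(a) Base Weighting

Method I (Relative Method):

$$\frac{\sum \frac{x_n}{x_o}\, w_o}{\sum w_o} \times 100$$

Method II (Aggregative Method):

$$\frac{\sum x_n w_o}{\sum x_o w_o} \times 100$$

(b) Current Weighting

Method I (Relative Method):

$$\frac{\sum \frac{x_n}{x_o}\, w_n}{\sum w_n} \times 100$$

Method II (Aggregative Method):

$$\frac{\sum x_n w_n}{\sum x_o w_n} \times 100$$
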
The effective choice of weights will depend on the variable under consideration.
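
The general formulae lend themselves to a single small function. The sketch below is only an illustration: the name `weighted_index`, its `method` parameter and the data are all hypothetical (made-up figures for output per worker in three departments, weighted by a made-up number of workers), and are not taken from the textbook.

```python
def weighted_index(x0, xn, w, method="relative"):
    """Weighted index of a variable X (base values x0, current values xn).

    method="relative":    100 * sum((xn/x0) * w) / sum(w)
    method="aggregative": 100 * sum(xn * w) / sum(x0 * w)
    """
    if not (len(x0) == len(xn) == len(w)):
        raise ValueError("x0, xn and w must have the same length")
    if method == "relative":
        numerator = sum((b / a) * wt for a, b, wt in zip(x0, xn, w))
        denominator = sum(w)
    elif method == "aggregative":
        numerator = sum(b * wt for b, wt in zip(xn, w))
        denominator = sum(a * wt for a, wt in zip(x0, w))
    else:
        raise ValueError("method must be 'relative' or 'aggregative'")
    return 100 * numerator / denominator


# Hypothetical data, for illustration only.
x0 = [10, 20, 40]   # base period values of X
xn = [12, 22, 50]   # current period values of X
w = [5, 3, 2]       # weights (base or current, depending on the index wanted)

relative = weighted_index(x0, xn, w, method="relative")
aggregative = weighted_index(x0, xn, w, method="aggregative")
print(round(relative, 2), round(aggregative, 2))
```

Passing base period weights gives a base-weighted index and passing current period weights gives a current-weighted one; the function itself does not need to know which is used.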

Activity 2

Attempt Question 8.14 from your textbook (OJ).

9.3.6 Index Numbers of Quantities (or Volume)

Index numbers of quantities (or volume) measure changes in physical quantities such as the
quantities of goods and services consumed, volume of industrial production, volume of
imports and exports, etc.

As per the discussion of Section 9.3.5, these indices are calculated using methods similar to those
you have learnt so far in the case of price indices. You will note that now quantity is the
variable for which the magnitude of changes over time is to be measured. Consequently, the
weights are values of commodities (or expenditures), or prices, depending upon the index
used.
Thus the quantity relative for a commodity = $\frac{q_n}{q_o} \times 100$

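The formulae for the various quantity indices are as follows:

Method I (Relative Method):

$$\frac{\sum \frac{q_n}{q_o}\, w}{\sum w} \times 100$$

Method II (Aggregative Method):

$$\frac{\sum q_n w}{\sum q_o w} \times 100$$

(a) Base Weighting

Method I (Relative Method):

$$\frac{\sum \frac{q_n}{q_o}\, p_o q_o}{\sum p_o q_o} \times 100$$

Method II (Aggregative Method):

$$\frac{\sum q_n p_o}{\sum q_o p_o} \times 100$$

(b) Current Weighting

Method I (Relative Method):

$$\frac{\sum \frac{q_n}{q_o}\, p_n q_n}{\sum p_n q_n} \times 100$$

Method II (Aggregative Method):

$$\frac{\sum q_n p_n}{\sum q_o p_n} \times 100$$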
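
As an illustration of the quantity formulae, the sketch below reuses the prices and quantities of the lager, crisps and cake example from Section 9.3.3. The quantity indices themselves are not worked out in this unit, so the variable names and the helper `dot` are assumptions made for the purpose of the example.

```python
# Prices and quantities from the worked price-index example (1992 and 1993).
p0 = [1.00, 0.20, 2.00]   # base-year prices
pn = [1.15, 0.27, 2.20]   # current-year prices
q0 = [40, 100, 1]         # base-year quantities
qn = [50, 90, 1]          # current-year quantities


def dot(a, b):
    """Sum of products of corresponding items."""
    return sum(x * y for x, y in zip(a, b))


# Method II (aggregative), base weighting: weights are base-year prices.
laspeyres_quantity = 100 * dot(qn, p0) / dot(q0, p0)

# Method II (aggregative), current weighting: weights are current-year prices.
paasche_quantity = 100 * dot(qn, pn) / dot(q0, pn)

# Method I (relative), base weighting: weights are base-year expenditures.
base_expenditure = [p * q for p, q in zip(p0, q0)]
laspeyres_relative = (
    100
    * sum((b / a) * w for a, b, w in zip(q0, qn, base_expenditure))
    / sum(base_expenditure)
)

print(round(laspeyres_quantity, 1), round(paasche_quantity, 1))
print(round(laspeyres_relative, 1))
```

As with prices, the base-weighted relative and aggregative quantity indices coincide, since $q_o$ cancels in each term of the numerator.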

Activity 3


9.4 FURTHER CONCEPTS

9.4.1 Splicing Two Series of Index Numbers

It is general practice to change the base year after a certain time period to take into account
any changes in the consumption pattern, or, in the weights, i.e. the relative importance of
different items included, or both. Thus, in Mauritius, for the Consumer Price Index (CPI)
calculated over the last 20 years, the base period has been regularly changed at an interval of
five years. For the purpose of historical comparison, however, it may be desirable to have a
single series of index numbers with either the old base period or the new base period. The
process by which the two series of index numbers with different base years/periods are
combined is called splicing.

For splicing two such time series of index numbers to form one continuous series, it is
necessary that the two series have one common year so that both types of index numbers have
been calculated for that year (or period). The index numbers revised in the process of splicing
are generally the index numbers of the old series, with the overlapping year being used as the
base for the combined series.

The first step in the process of splicing is to determine the quotient obtained by dividing the
new index number for the overlap year by the old index for this year. The overlap year is
generally the new base year, so that the new index = 100, and the quotient is determined by

$$Q = \frac{I(\text{new})_{\text{overlap}}}{I(\text{old})_{\text{overlap}}} = \frac{100}{I(\text{old})_{\text{overlap}}}$$

The next step in the process of splicing is to multiply each of the index numbers in the old
time series by the quotient given by the formula above to calculate the new index numbers as
follows:

$$I(\text{new}) = I(\text{old}) \times Q$$
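
The two steps of splicing can be sketched in Python as follows. The function name `splice` and both index series are hypothetical and serve only to illustrate the procedure; they are not CPI figures.

```python
def splice(old_series, new_series, overlap):
    """Rescale the old series so that it joins the new series at the overlap year.

    Both series are dictionaries mapping a year to an index number.
    Returns the quotient Q and the combined series.
    """
    # Step 1: quotient of the new index by the old index for the overlap year.
    q = new_series[overlap] / old_series[overlap]
    # Step 2: multiply every index number of the old series by Q.
    combined = {year: value * q for year, value in old_series.items()}
    # Index numbers of the new series are kept unchanged.
    combined.update(new_series)
    return q, dict(sorted(combined.items()))


# Hypothetical series, for illustration only.
old = {1990: 100.0, 1991: 110.0, 1992: 125.0}   # base 1990 = 100
new = {1992: 100.0, 1993: 108.0}                # base 1992 = 100

q, combined = splice(old, new, overlap=1992)
print(q)
print(combined)
```

Here Q = 100/125 = 0.8, so the old index numbers 100, 110 and 125 become 80, 88 and 100 on the new base, and the combined series runs continuously from 1990 to 1993.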


Activity 4


9.4.2 Using an Index to Deflate a Time Series


Activity 5

9.4.3 Chain-Based Index Numbers

The index numbers calculated so far are called fixed-base indices.

Read the relevant parts of pp 163, 164 and 165 of your textbook (OJ) to know about the
chain-based index numbers.

This method is suitable when the relative importance of items and the consumption
pattern are changing rapidly, because, with this method, new items can be introduced and
old ones removed with ease.

Its disadvantage is that comparisons over long periods of time are not possible.
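
A minimal sketch of the conversion between the two systems is given below, assuming the usual definition in which each chain-based index expresses a period as a percentage of the immediately preceding period. The function names and the fixed-base series are hypothetical; refer to the textbook pages cited above for the full treatment.

```python
def fixed_to_chain(fixed):
    """Chain-based indices: each period as a percentage of the previous period."""
    return [100 * current / previous for previous, current in zip(fixed, fixed[1:])]


def chain_to_fixed(chain, start=100.0):
    """Rebuild a fixed-base series by linking the chain indices together."""
    fixed = [start]
    for link in chain:
        fixed.append(fixed[-1] * link / 100)
    return fixed


# Hypothetical fixed-base series (first period = 100), for illustration only.
fixed = [100.0, 105.0, 112.0, 120.0]

chain = fixed_to_chain(fixed)
recovered = chain_to_fixed(chain)

print([round(value, 2) for value in chain])
print([round(value, 2) for value in recovered])
```

The chain series has one fewer entry than the fixed-base series, because the first period has no preceding period to serve as its base.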

Activity 6

Convert the fixed-base system of the Index Number of Prices given in Question 8.19 of your
textbook (OJ) to a chain-base system for the years 1985 to 1991.

9.5 GENERAL PROBLEMS OF INDEX NUMBER CONSTRUCTION

The first essential point to be considered is the purpose for which the index number is to be
constructed. What is the index number intended to measure? For example, the Consumer
Price Index (CPI) attempts to answer a question concerning the average movement of certain
prices over time. The index of industrial production, among other things, is constructed to
show the trend of economic activity.

In general, there are four main problems that can arise in the construction of a new index
number. They are:

1. Selection of items to be included.
2. Selection of a suitable base period.
3. Choice of appropriate weights measuring the relative importance of various items
included in the index.
4. Choice of a suitable average or index number formula.

These major problems need to be tackled very cautiously in the light of the object of the index
number. Here, we shall consider the construction of a price index and mostly refer to the CPI
as an illustration. However, you should know that similar problems can occur in the
construction of quantity or volume indices.

Selection of Items to be Included

Since it is not practicable, from the point of view of either cost or time,
to measure changes in the prices of all the relevant commodities, a selection must be made
of items to be included in the index. The movements in the prices of the selected items
should be representative of the movements of prices of all the relevant commodities. The
items selected for a CPI should also be representative of tastes, habits, customs and
necessities of the people to whom the index relates.

The number of items should be fairly large, consistent with ease of handling.
The CPI of Mauritius includes 230 item classes after ascertaining that these items
accurately reflect the average change in the cost of the entire market basket. By means of
the periodic Household Expenditure Surveys, the Central Statistical Office (C.S.O)
determines the representative basket of goods and services purchased by households on
average and also how the total expenditure is spread over these items.

The items selected are classified in nine major commodity groups (for example, Food,
Fuel and Lighting, Housing, Medical care, etc.) so that separate indices can be calculated
for these major groups, in addition to an overall CPI.

Selection of a Suitable Base Period

A second problem in the construction of a price index is the choice of a base or reference
period, that is, a period relative to which the prices in the current year are compared.
For a general-purpose price index such as a wholesale price index or a consumer price
index, it is desirable to have a base period of relative economic stability that is not too
distant in the past. Thus the time period selected as base should be one with normal
price levels, since the use of a base with unusually high or low price levels could distort
comparisons of price changes for subsequent years.

On the other hand, the problem with a distant base is that the economic conditions
prevailing at that time could be quite different and the comparisons with such remote
periods are not of any interest.

Further, the base period, once selected, should be regularly shifted to a more recent period.
Thus the base year for the CPI of Mauritius has been shifted every five years
since 1976. Finally, a relatively recent base facilitates the inclusion of new
commodities, as well as dropping of obsolete ones.

You will note that the above discussion relates to the fixed-base method. As seen earlier,
in specific cases index numbers are constructed using the chain-base method. If this
method is used, then the year preceding the current year is automatically taken as the base
year. Thus in the chain-base method, the problem of selection of base does not arise.

The Choice of Appropriate Weights

As mentioned earlier, it is necessary to give weights to individual items included in an
index number so as to show their relative importance in the comparison of price changes.
Surely, in the construction of the CPI a 25% increase in the price of bread will have more
significance than a 25% increase in the price of jam. The weights to be used depend on
the purpose of the index to be constructed.

It is necessary to adopt some system of rational weighting, that is, according to some
logical basis. Thus, for the CPI, proportionate expenditure upon different items found
from a Household Expenditure Survey would constitute appropriate weights if price
relatives of different commodities are to be averaged; the weights in this case are known
as value weights. If, on the other hand, prices rather than price relatives are used,
reasonable weights would be given by the quantities of individual items purchased, and
are known as quantity weights.

The types of quantities to be used in a price index would depend on the nature of the
index computed. Thus, an index of export prices would use quantities of commodities
and services exported, whereas an index of import prices would use quantities imported.

To conclude, weighting of an index number is essential; weights should be rational and
should be renewed after a few years. In the case of the CPI, the Household Expenditure
Surveys carried out regularly at intervals of about five years would provide changes in
weights of the commodities, if any.

The Choice of a Suitable Average

As seen earlier in the index number calculations, there are different types of averages
from which a suitable average can be selected. The form of the average
selected generally depends more on practical considerations than on mathematical
properties. Thus the Weighted Price Index with base period quantity weights ($q_o$), i.e. the
Laspeyres Price Index, and the Weighted Arithmetic Mean of Price Relatives with
base period value weights ($p_o q_o$) are the averages which are mostly used.

You will recall that both these methods are actually equivalent. The CPI is constructed by
using the Weighted Arithmetic Mean of Price Relatives with base period value weights,
since such value weights are easily available from the Household Expenditure Survey and
can be used for a few years till the weights are changed following the next Household
Expenditure Survey.

Current year weights, although theoretically better in rapidly changing economic
situations, pose the problem of data collection every time the index is to be calculated.
Thus the base-weighted index is preferred to the current-weighted index. Similarly, the
Arithmetic Mean is preferred to the Geometric Mean due to its simplicity in calculations.

9.6 THE USES AND LIMITATIONS OF INDEX NUMBERS

Uses of the Index Numbers

Index number series are very useful for analysis of economic activity and for decision
making. Thus the CPI is used to calculate the rate of inflation in a country, and as a
basis for wage negotiations in the collective bargaining processes. Again, it can also
be used to deflate current incomes with a view to ascertaining the real incomes and for
adjusting National Income Accounts.

Index numbers are useful for showing trends in economic activity of a country. Thus
comparisons can be made between movements in the levels of prices of different
groups of commodities, or between the price levels and wages, or between the levels
of production and wages, between the import prices and the consumer prices, etc.

Index numbers are also used for international comparisons of socio-economic
development. Thus index numbers are extremely useful tools for governments,
businessmen, economists, as well as in other fields of human endeavour.

Limitations of Index Numbers

Index numbers, however, have their own limitations. An index is only an approximate
indicator of the change it is attempting to measure. Errors can be committed at the
various selection processes mentioned earlier concerning problems of construction of
an index number. The index number is usually based on a sample, so that sampling
errors are bound to occur. It is not possible to include all changes in quality or
product. Unless the base period is a fairly recent one, comparisons are not reliable.
Different methods of computation give different results, some overstating the upward
movement in prices and others understating it. However, unless an index is
deliberately distorted, it will show correctly at least the trend of the phenomenon
which it is measuring, except when there are rapid changes in conditions.

9.7 SUMMARY

In this unit, you have learnt about statistical tools called index numbers. You should now
have a clear understanding of the different types of price and quantity index numbers, of their
calculation and interpretation, and of the main practical issues involved in the construction of
an index number. You have also learnt how to change from fixed base to chain base and vice
versa, splice, and deflate an index number series. Lastly, you should know that any index
number has its merits and limitations.

Recommended Readings

1. Household Budget Survey, July 1991-June 1992.
Vol. I, Methodological Report July 1993

2. Household Budget Survey, July 1991-June 1992.
Vol. II, Analytical Report July 1994.

UNIT 10 PROBABILITY

Unit Structure

10.0 Overview
10.2 Introduction
10.3 Mathematical Preliminary:
Elementary Theory of Sets and Venn Diagrams
10.4 The Sample Space and Events
10.5 The Probability of an Event
10.6 General Law of Addition
10.7 Conditional Probability
10.8 Law of Multiplication of Probability
10.9 Independent Events
10.10 Tree Diagrams
10.11 Joint Probability Tables
10.12 Summary

10.0 OVERVIEW

This unit introduces you to the concept of uncertainty involved in almost every real-world
problem. Probability is a measure of uncertainty associated with the occurrence or
non-occurrence of an event. In everyday language, it is synonymous with chance. This unit
covers the mathematical concepts of Sets, Venn Diagrams, basic concepts in probability,
different rules of probability and methods of analysis useful in statistical applications.
Chapter 11 of your textbook (OJ) covers some of these topics. However, the presentation of
different topics and the explanation provided are not found to be appropriate for the present
course. This unit, therefore, consists of a complete write-up of the topics covered in Chapter
11 (except Bayes' Theorem, which is not included in the present syllabus) as well as of
those topics which are omitted in the chapter. Reference will therefore be made only to the
solved examples and to the exercises in the textbook.



1. Compute and interpret probabilities of different types of events
2. Use Venn Diagrams for illustrations, where appropriate
3. Compute and interpret conditional probability
4. Draw Tree Diagrams.
5. Use Joint Probability Tables.

10.2 INTRODUCTION

Probability is a measure of uncertainty associated with the occurrence or non-occurrence of
an event. It is a concept used in our everyday life, for example, the chance of obtaining a head
when a coin is tossed, or, the chance that it rains today. Probability plays an important role in
all advanced statistical methods which deal with decision-making in situations involving an
element of risk and uncertainty.

10.3 MATHEMATICAL PRELIMINARY: ELEMENTARY THEORY OF SETS
AND VENN DIAGRAMS

Set Theory and Venn Diagrams are very useful mathematical tools, both in describing the
basic concepts in probability as well as in understanding the different rules of probability.
This section therefore, provides a brief introduction to the theory of sets and Venn Diagrams.

10.3.1 Elementary Theory of Sets

Sets A set is a collection of distinct objects, normally referred to as elements or
members. A set is usually denoted by a capital letter, and the elements by small
letters.

Example 1
A = {x : x is an even number}, i.e., Set A consists of elements x
where x is an even number.

B = {2, 4, 6, 8}, i.e., Set B consists of the four elements 2, 4, 6, 8.

Subsets A subset of a set A is a set which consists of some or all of the
elements of A. If B is a subset of A, then $B \subseteq A$.

Example 2

In Ex 1, B is a subset of A.
This is denoted by $B \subseteq A$.

The Number of a Set The number of a set A, written as n(A), is defined as the
number of elements that A contains.

Example 3

In Ex 1, n (B) = 4

The Universal Set The set of all objects relevant to a particular application is
called the Universal Set. A Universal Set is usually denoted by the capital letter U.

Any other set defined with respect to the particular application will necessarily be a
subset of the Universal Set.

Example 4

If U = {a, b, c, d, e, f}

then A = {a, d, e} is a subset of U.
Disjoint Sets

Two sets are said to be disjoint if they have no elements in common.

Example 5

Sets {2, 3, 4, 5, 6} and {1, 7, 8, 9} are disjoint, as they have no elements in
common.

The Complement of a Set If A is any subset of the Universal Set U, then all those
elements that belong to U, but are not contained in A, form the complement of A, denoted by
$\bar{A}$ or by $A^c$.

Example 6

In Example 4, $\bar{A} = \{b, c, f\}$.

The Empty Set or Null Set A set which has no elements is called an empty set or
null set and is denoted by $\phi$ (phi).

Set Operations

(i) Set Union

The union of two sets A and B is written as $A \cup B$ and defined as that set which
contains all the elements of A or B or both. Thus

$A \cup B = \{x : x \in A \text{ or } x \in B \text{ or } x \in (\text{both } A \text{ and } B)\}$,
where $\in$ means "belongs to".

Example 7

If A = {1, 2, 3, 4}, and
B = {4, 5, 6, 7}, then
$A \cup B = \{1, 2, 3, 4, 5, 6, 7\}$

(ii) Set Intersection

The intersection of two sets A and B is written as $A \cap B$ and is the set of elements
belonging to both A and B.

Thus $A \cap B = \{x : x \in A \text{ and } x \in B\}$.

Example 8

In Example 7, $A \cap B = \{4\}$.
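
The set operations of this section map directly onto Python's built-in `set` type. The short sketch below, offered only as an illustration, reproduces Examples 4 to 8; the variable names (`A4`, `union` and so on) are chosen here for convenience and do not appear in the unit.

```python
# Examples 7 and 8: union and intersection.
A = {1, 2, 3, 4}
B = {4, 5, 6, 7}
union = A | B           # elements in A or B or both
intersection = A & B    # elements in both A and B

# Examples 4 and 6: a subset of the universal set and its complement.
U = {"a", "b", "c", "d", "e", "f"}
A4 = {"a", "d", "e"}
is_subset = A4 <= U     # True when every element of A4 belongs to U
complement = U - A4     # elements of U that are not in A4

# Example 5: disjoint sets have no elements in common.
disjoint = {2, 3, 4, 5, 6}.isdisjoint({1, 7, 8, 9})

print(sorted(union), sorted(intersection))
print(is_subset, sorted(complement), disjoint)
```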

10.3.2 Venn Diagrams

A Venn Diagram is a diagram associated with set theory. Venn Diagrams provide pictorial
descriptions of sets, subsets, intersections and unions. In a Venn diagram, the universal set U
is represented by a rectangle and its subsets are usually represented by circles or ovals.
Figure 10.1 shows a universal set containing the union of A and B, and the intersection of A
and B, respectively. The union and intersection are shown by the shaded areas in the figure
that follows:

[Figure 10.1. Venn diagrams showing (a) the union of A and B, and (b) the intersection of A and B]

Note: (i) $n(A) + n(\bar{A}) = n(U)$,
where $\bar{A}$ is the complement of the set A in U.

(ii) $n(A) + n(B) = n(A \cup B)$, when A and B are disjoint, and,

(iii) $n(A) + n(B) - n(A \cap B) = n(A \cup B)$ whether A and B are disjoint or not.
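
These three counting rules can be verified on small sets. The sketch below is an illustration only: it uses the sets of Examples 4, 5 and 7, and `len` plays the role of n( ).

```python
# Rule (i): n(A) + n(complement of A) = n(U), with the sets of Example 4.
U = {"a", "b", "c", "d", "e", "f"}
A4 = {"a", "d", "e"}
rule_i = len(A4) + len(U - A4) == len(U)

# Rule (ii): n(A) + n(B) = n(A union B) for the disjoint sets of Example 5.
C = {2, 3, 4, 5, 6}
D = {1, 7, 8, 9}
rule_ii = C.isdisjoint(D) and len(C) + len(D) == len(C | D)

# Rule (iii): n(A) + n(B) - n(A intersection B) = n(A union B), Example 7.
A = {1, 2, 3, 4}
B = {4, 5, 6, 7}
rule_iii = len(A) + len(B) - len(A & B) == len(A | B)

print(rule_i, rule_ii, rule_iii)
```

In rule (iii) the element 4 belongs to both A and B, so it is counted twice in n(A) + n(B); subtracting n(A ∩ B) removes the double count.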

10.4 THE SAMPLE SPACE AND EVENTS

Before defining probability, it is necessary to define several terms that are used in the process
of determining the probability.

An experiment is a process or phenomenon which is being studied or observed.

Here we are concerned with an experiment whose outcome depends on chance, i.e. the
outcome is not predictable at the outset as there are many possible outcomes. For
example, tossing a coin is an experiment whose outcome depends on chance as there
are two possible outcomes, namely, a head or a tail.

The set of all possible distinct outcomes of an experiment is called the sample space or
possibility space of the experiment, which is usually denoted by S. For example, the
number of defectives in a batch of 5 components is given by the set S = {0, 1, 2, 3, 4,
5}

An event is a subset of the sample space S. In the preceding example, if A is the
event of getting at most one defective, then A = {0, 1}, which is a subset of S.

When each outcome is as likely to occur as any other, the outcomes are called equally
likely. For example, when tossing a fair dice, S = {1,2,3,4,5, 6}, where all outcomes
are equally likely.
When the occurrence of one event say, A, precludes the occurrence of another
event B, events A and B are said to be mutually exclusive events.

Note: The subsets representing two mutually exclusive events A and B, are disjoint.
For example, if A is the event of getting an even number, and B is the event of getting
an odd number, when a fair dice is tossed, then

S = {1,2,3,4,5,6}

A = {2,4,6}

and B = {1,3,5}

Since getting an even number precludes the occurrence of an odd number on the same trial, or
observation, events A and B are mutually exclusive, and are represented by the two disjoint
sets A and B given above. That is, they have no common points.

10.5 THE PROBABILITY OF AN EVENT

10.5.1 Definition of Probability

Historically, different approaches have been developed for defining probability. These
approaches determine how probability is defined and computed. One of the definitions of
probability is given below.

If all n(S) outcomes of a sample space S are equally likely and mutually exclusive and n(A)
of these are favourable to the occurrence of an event A, i.e., an event A consists of a subset of
n(A) of these n(S) outcomes, then the probability of occurrence of A, denoted by P(A), is
given by

$$P(A) = \frac{n(A)}{n(S)}$$

Thus, the probability that event A will occur is the ratio of the number of outcomes in the
subset A to the number of outcomes in the sample space S.

It is to be noted that with this definition, the probability can be determined without actually
carrying out the experiment, and observing the sample events.

For example, the probability of drawing a king from a well-shuffled pack of cards is given by
$$P(K) = \frac{4}{52} = \frac{1}{13}$$

where K represents the event that the card drawn is a king, n(K) = 4 and n(S) = 52.
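
The definition P(A) = n(A)/n(S) can be sketched in Python by listing the sample space explicitly. The representation of a card as a (rank, suit) pair and the helper name `probability` are assumptions made for this illustration.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]

# Sample space S: the 52 equally likely cards of a well-shuffled pack.
S = set(product(ranks, suits))

# Event K: the card drawn is a king.
K = {card for card in S if card[0] == "K"}


def probability(event, sample_space):
    """P(A) = n(A) / n(S) for equally likely outcomes."""
    return Fraction(len(event), len(sample_space))


p_king = probability(K, S)
print(len(K), len(S), p_king)
```

Using `Fraction` keeps the answer as an exact ratio, so 4/52 is reduced automatically to 1/13.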

Example 9

The data collected by a supermarket showed that 161 of the 253 women who entered the
supermarket on a Saturday morning made at least one purchase. Estimate the probability that
a woman entering this supermarket on a Saturday morning will make at least one purchase.

Let A be the event that a woman entering the supermarket makes at least one purchase.

The estimate of P(A) is then given by

$$P(A) = \frac{161}{253},$$

since n = 253 and A occurred 161 times.

Activity 1

1. A fair dice is tossed once. What is the probability of obtaining an even number?

2. A retailer has 12 TV sets out of which 4 sets are known to be defective. If one set is
selected at random, what is the probability that it turns out to be defective?

3. An inspector randomly samples 50 components manufactured during one day and
finds that 2 components are defective. What is the probability that an electronic
device containing one component will be inoperative because the component is
defective?

4. Attempt Question 11.2 in textbook (OJ).

10.5.2 Axioms of Probability

From the definition of probability given above, the probability of an event is a proportion.
Probability should thus possess the essential properties of a proportion. Therefore,
probability should be a number between 0 and 1. Furthermore, the probability of the event S,
where S is the sample space, should be 1 because one of the possible outcomes is certain to
occur when the experiment is carried out.

These ideas are contained in the axioms, or postulates, of probability which follow:

AXIOMS OF PROBABILITY:

Probability is a function, defined on a sample space S that satisfies

1. $P(A) \geq 0$, for any event A, i.e., probability is non-negative.

2. P(S) = 1,
i.e., the probability of a certain event is equal to 1.

3. If $A_1, A_2, \ldots$ is a sequence of mutually exclusive events, the probability that $A_1$ or
$A_2$ or ... occurs equals the sum of their separate probabilities.

Symbolically,

$$P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots$$

Note: If events $A_1, A_2, \ldots$ are mutually exclusive, then

$$n(A_1) + n(A_2) + \cdots = n(A_1 \cup A_2 \cup \cdots),$$

since $n(A_i \cap A_j) = 0$ for $i \neq j$ and $i, j = 1, 2, \ldots$

This shows that the definition of probability given above satisfies Axiom 3.

10.5.3 Further Properties of Probability derived from Axioms

By using the three axioms of probability, we can derive more rules which the probability
measure must satisfy.

1. If $A_1, A_2, \ldots, A_n$ are n mutually exclusive events, then the probability that $A_1$ or $A_2$ or
... or $A_n$ occurs equals the sum of their separate probabilities. Symbolically,

$$P(A_1 \cup A_2 \cup \cdots \cup A_n) = P(A_1) + P(A_2) + \cdots + P(A_n)$$

This follows from the third Axiom of Probability. Applying this result to individual outcomes
of an experiment, the probability of any event A is given by the sum of the probabilities of the
individual outcomes $A_i$ comprising A. Symbolically,

$$P(A) = \sum_i P(A_i)$$

131
Further, if n = 2, i.e., if two events A
1
and A
2
are mutually exclusive, then the probability that
A
1
or A
2
occurs equals the sum of their separate probabilities.

Symbolically,

P (A
1
A
2
) = P (A
1
) + P (A
2
)

This is known as the Law of Addition of two mutually exclusive events.

It also follows from the axioms that an event which cannot occur, i.e. an impossible event,
has probability zero, and that the respective probabilities that an event will occur and that
it will not occur add up to 1. Symbolically,

2. As $\emptyset \cup S = S$,
$P(\emptyset \cup S) = P(S)$

Also, $P(\emptyset \cup S) = P(\emptyset) + P(S)$, from Axiom 3, since $\emptyset$ and S are
mutually exclusive events.

Thus $P(\emptyset) + P(S) = P(S)$

Since P(S) = 1 from Axiom 2, it follows that

$P(\emptyset) = 0$

3. As $A \cup \bar{A} = S$,
$P(A \cup \bar{A}) = P(S) = 1$

Also, because A and $\bar{A}$ are mutually exclusive, from the third Axiom of Probability
we have,

$P(A \cup \bar{A}) = P(A) + P(\bar{A})$

Thus, $P(A) + P(\bar{A}) = 1$
or $P(\bar{A}) = 1 - P(A)$ for any event A.

Example 10

A card is drawn from a pack of well-shuffled cards. Find the probability of drawing either an
ace (A) or a king (B).

The events A and B are mutually exclusive. Therefore the probability of drawing either an
ace or a king in a single draw is

$P(A \cup B) = P(A) + P(B) = \frac{4}{52} + \frac{4}{52} = \frac{8}{52} = \frac{2}{13}$

Example 11

A pair of dice is rolled once. Determine the probability of obtaining a total of 7.

The total number of possible outcomes is 36, as any one of the 6 outcomes on the first dice
can be combined with any of the 6 outcomes on the second dice. Assuming each one of these 36
possible outcomes has equal probability, the probability of any individual outcome is
$\frac{1}{36}$. The probability of any event is therefore given by $\frac{1}{36}$ times the
number of individual outcomes comprising the event.

The sum of 7 points is obtained for the 6 individual outcomes:

(1,6), (2,5), (3,4), (4,3), (5,2) and (6,1)

Thus if A is the event of obtaining a sum of 7 points then,

$P(A) = \frac{1}{36} \times 6 = \frac{1}{6}$
This can be clearly seen from Figure 10.2 as the sum of probabilities of points inside the
dotted line.

The following figure can also be used to determine the probabilities of any of the possible
totals when two dice are rolled.

[Figure 10.2: The 36 outcomes of rolling two dice, plotted as points with First Dice (1 to 6)
on the horizontal axis and Second Dice (1 to 6) on the vertical axis]

Thus if B is the event of obtaining a total of 2 or 3, then the probability of B is the sum of
the probabilities of the three circled points:

$P(B) = \frac{1}{36} \times 3 = \frac{1}{12}$
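The counting argument of Example 11 can be checked by listing all 36 outcomes. The following
is a minimal Python sketch, assuming equally likely outcomes as in the text. The name
prob_total is our own and is used for illustration only.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes, written as (first dice, second dice).
outcomes = list(product(range(1, 7), repeat=2))


def prob_total(*totals):
    """Probability that the two dice add up to one of the given totals."""
    favourable = [o for o in outcomes if sum(o) in totals]
    return Fraction(len(favourable), len(outcomes))


p_a = prob_total(7)      # event A of Example 11: a total of 7
p_b = prob_total(2, 3)   # event B: a total of 2 or 3

print(p_a)  # 1/6
print(p_b)  # 1/12
```

The same function can be used to check your answers for any other total.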

Activity 2

1. Using your answer to Question 1 of Activity 1, find the probability of obtaining an
odd number.

2. Using your answer to Question 2 of Activity 1, find the probability of obtaining a
good TV set.

3. In a given week, the probabilities that the price of a product will increase (A), remain
unchanged (B), or decline (C) are estimated to be 0.30, 0.20 and 0.50, respectively. What is
the probability that in a given week the price of the product will

(a) increase or remain unchanged?
(b) change during the week?

4. The delivery of an item of raw material from a supplier may take up to five weeks
from the time the order is placed. The probabilities of various delivery times are as
follows:

Delivery Time Probability

< 1 week 0.12
> 1, < 2 weeks 0.27
> 2, < 3 weeks 0.22
> 3, < 4 weeks 0.22
> 4, < 5 weeks 0.17
---------
1.00
=====

What is the probability that a delivery will take the following times?

(a) Two weeks or less
(b) Three or four weeks
(c) More than four weeks
(d) More than two weeks
(e) More than three weeks

5. A pair of dice is rolled once. Determine the probability of obtaining:

(a) a total of 8
(b) a total of 9

10.6 GENERAL LAW OF ADDITION

General Law of Addition for any two events

For any two events A and B

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
= probability that at least one of A and B occurs.

Recall from Section 10.3.2 (iii) that $n(A \cup B) = n(A) + n(B) - n(A \cap B)$ for any two
sets A and B. Thus, for any two events A and B,

$P(A \cup B) = \frac{n(A \cup B)}{n(S)}$

$= \frac{n(A) + n(B) - n(A \cap B)}{n(S)}$

$= \frac{n(A)}{n(S)} + \frac{n(B)}{n(S)} - \frac{n(A \cap B)}{n(S)}$

$= P(A) + P(B) - P(A \cap B)$

Note: This law is even applicable to mutually exclusive events because then
$A \cap B = \emptyset$, so that $P(A \cap B) = 0$.

Example 12

From Example 10 above, what is the probability of drawing an ace (A) or a spade (C)?

The events ace and spade are not mutually exclusive. Therefore, the probability of
drawing an ace (A) or a spade (C), or both, in a single draw is

$P(A \cup C) = P(A) + P(C) - P(A \cap C)$

$= \frac{n(A)}{n(S)} + \frac{n(C)}{n(S)} - \frac{n(A \cap C)}{n(S)}$

$= \frac{4}{52} + \frac{13}{52} - \frac{1}{52} = \frac{16}{52} = \frac{4}{13}$

Note: The Venn diagram can be used to show the union of two events A and B, denoted by
$A \cup B$, the intersection of two events A and B, denoted by $A \cap B$, the complement of
event A, denoted by $\bar{A}$, etc., as shown earlier.
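The general law of addition can be checked on the pack of cards of Example 12 by treating
events as sets. This is a minimal Python sketch, assuming a standard pack of 52 cards; the
variable and function names are our own.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = set(product(ranks, suits))  # sample space S: 52 cards

aces = {card for card in deck if card[0] == "A"}         # event A
spades = {card for card in deck if card[1] == "spades"}  # event C


def prob(event):
    """Probability of an event as n(event) / n(S)."""
    return Fraction(len(event), len(deck))


direct = prob(aces | spades)                              # count the union directly
by_law = prob(aces) + prob(spades) - prob(aces & spades)  # general law of addition

print(direct, by_law)  # 4/13 4/13
```

Subtracting the intersection removes the ace of spades, which would otherwise be counted twice.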

Activity 3

1. Out of 300 business students, 100 are enrolled in Marketing and 50 are enrolled in
Finance. In fact, 30 of these students are enrolled in both Marketing and Finance.

(a) Draw a Venn diagram of the data.
(b) What is the probability that a randomly chosen student will be enrolled in
either Marketing (M) or Finance (F) or both?
(c) What is the probability that a randomly chosen student will be enrolled in
either Marketing (M) or Finance (F), but not both?

2. A hotel has a total of 75 rooms, of which 65 contain a radio. Of the 65 rooms with a
radio, 10 have a refrigerator and 43 have a bath. In the entire hotel, 12
rooms have a refrigerator, but only one of these has neither a radio nor a bath. Of all the
rooms with baths, 8 have both a radio and a refrigerator, 2 have neither a radio nor a
refrigerator, and one has a refrigerator only.

Represent the above information by means of a Venn diagram.

Calculate the probability that

(a) a room contains a bath,
(b) a room does not contain either a refrigerator or a bath
(c) a room contains a bath and a radio but no refrigerator.

10.7 CONDITIONAL PROBABILITY

Suppose A and B are two non-mutually exclusive events in a sample space S.

[Figure 10.3: Venn diagram of two overlapping events A and B in the sample space S]

As shown earlier,

$P(A) = \frac{n(A)}{n(S)}$ and $P(B) = \frac{n(B)}{n(S)}$

If, however, we know that B has occurred, then the sample space is reduced to B, instead of
the original S. Thus, if we are interested in knowing whether A will occur, given that B has
occurred, then the sample space to be considered is the reduced sample space B.

The probability that A will occur, given that B has occurred, denoted by $P(A \mid B)$, is
then given by

$P(A \mid B) = \frac{n(A \cap B)}{n(B)}$

since $n(A \cap B)$, i.e. the number of outcomes common to both A and B, gives the number of
outcomes in the sample space B that are favourable to the event A.

Consider the following example to illustrate the probability $P(A \mid B)$.

Example 13

There are 200 applicants for a secretarial position in a large company. It is known that among
the 200 applicants, some have had previous experience in secretarial work and some have had
formal training in such work as shown in Table 10.1:

Table 10.1

                                    Training
Experience                Formal      No Formal
                          Training    Training     Total
Previous Experience          34          48          82
No Previous Experience       41          77         118
Total                        75         125         200

Let E denote the event of selection of an applicant with previous experience, and T denote the
event of selection of an applicant with formal training.

As can be seen from the table, E and T are non-mutually exclusive events since there are
some applicants with both previous experience and formal training.

If an applicant is randomly selected from these 200 applicants, the probability that the
selected applicant has some previous experience is given by
$P(E) = \frac{n(E)}{n(S)} = \frac{82}{200} = 0.41$

and $P(T) = \frac{n(T)}{n(S)} = \frac{75}{200} = 0.375$

It is assumed that each of the 200 applicants has the same chance of being selected.

Suppose now that the management decides to limit the selection to only those applicants who
have had some formal training. As a result of this decision, the number of applicants to be
considered, i.e. the new sample space, is now reduced to 75 i.e., T. Assuming that each of
these 75 has an equal chance of being selected,

$P(E \mid T) = \frac{34}{75} \approx 0.45$

i.e. $P(E \mid T) = \frac{n(E \cap T)}{n(T)}$

This is called the Conditional Probability of selecting an applicant with previous experience
given that the applicant has had some formal training. Note that this conditional probability
can also be written as

$P(E \mid T) = \frac{34/200}{75/200} = \frac{P(E \cap T)}{P(T)}$

Thus, $P(E \mid T)$ is the ratio of the probability of selecting an applicant with previous
experience and formal training to the probability of selecting an applicant with formal training.

Generalising from the above example, it can be seen that for any two events A and B
belonging to a given sample space S, the revised probability of A when it is known that B has
occurred, called the conditional probability of A given B and denoted by $P(A \mid B)$, is
defined by the formula

$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$, provided $P(B) > 0$

A and B are said to be dependent events if the probability of occurrence of one must be
modified in the light of information as to whether or not the other event has taken place.

Note that in Example 13,

$P(E) \ne P(E \mid T)$

since P(E) = 0.41 and $P(E \mid T) \approx 0.45$.
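The two routes to a conditional probability, counting within the reduced sample space and
applying the formula, can be compared on the data of Table 10.1. This is a minimal Python
sketch; the dictionary layout and the variable names are our own choices.

```python
from fractions import Fraction

# Table 10.1: (experience, training) -> number of applicants
counts = {
    ("experience", "training"): 34,
    ("experience", "no training"): 48,
    ("no experience", "training"): 41,
    ("no experience", "no training"): 77,
}

n_s = sum(counts.values())                                              # n(S) = 200
n_e = sum(n for (exp, _), n in counts.items() if exp == "experience")   # n(E) = 82
n_t = sum(n for (_, trn), n in counts.items() if trn == "training")     # n(T) = 75
n_e_and_t = counts[("experience", "training")]                          # n(E and T) = 34

p_e = Fraction(n_e, n_s)
p_t = Fraction(n_t, n_s)
p_e_and_t = Fraction(n_e_and_t, n_s)

# Route 1: count inside the reduced sample space T.
p_e_given_t_counts = Fraction(n_e_and_t, n_t)
# Route 2: apply the formula P(E | T) = P(E and T) / P(T).
p_e_given_t_formula = p_e_and_t / p_t

print(float(p_e), float(p_t))                    # 0.41 0.375
print(p_e_given_t_counts, p_e_given_t_formula)   # 34/75 34/75
```

Both routes give 34/75, which differs from P(E) = 0.41, so knowing that T has occurred
changes the probability of E.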

Activity 4

1. For Example 13, find $P(T \mid E)$, using both the table and the formula for the conditional
probability. Explain what it means in terms of the data.
What do you observe? Is $P(T \mid E)$ the same as P(T)?

2. An electrical component consisting of two elements that operate in sequence will
work only if both elements are good elements. From previous records it is found that
80 percent of the components produced work properly. Occasional tests on the first
element indicate that 10 percent of these elements are likely to be defective. The
second element cannot be tested separately. What is the probability that the second
element of a component will be a good one if the first element is a good one?

3. Attempt Question 11.23 (b) in textbook (OJ).

10.8 LAW OF MULTIPLICATION OF PROBABILITY

If we multiply the formula for conditional probability by P(B) on both sides, we have,

$P(A \cap B) = P(B) \cdot P(A \mid B)$

This is called the Law of Multiplication of probability. It enables us to calculate the
probability that two events, A and B, will both occur. The formula thus states that the
probability that two events will both occur is the product of the probability that one of the
events will occur and the conditional probability that the other event will occur given that the
first event has occurred (occurs, or will occur).

Note: This formula can also be written as

$P(A \cap B) = P(A) \cdot P(B \mid A)$

The law of multiplication can be generalised as follows:

$P(A \cap B \cap C) = P(A \cap B) \cdot P(C \mid A \cap B)$
$= P(A) \cdot P(B \mid A) \cdot P(C \mid A \cap B)$

Example 14

A set of 10 spare parts is known to contain seven good parts (G) and three defective
parts (D). Two parts are selected at random without replacement.

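Find the probability that both parts drawn are good.

Let $G_1$ and $G_2$ denote the events that the first and the second part drawn, respectively,
are good. Then

$P(G_1 \cap G_2) = P(G_1) \cdot P(G_2 \mid G_1) = \frac{7}{10} \cdot \frac{6}{9} = \frac{42}{90} = \frac{7}{15}$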
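The result of Example 14 can be confirmed by listing every possible ordered pair of parts.
This is a minimal Python sketch, assuming that each ordered pair of distinct parts is equally
likely; the variable names are our own.

```python
from fractions import Fraction
from itertools import permutations

parts = ["G"] * 7 + ["D"] * 3   # seven good parts and three defective parts

# Law of multiplication: P(G1 and G2) = P(G1) * P(G2 | G1)
p_law = Fraction(7, 10) * Fraction(6, 9)

# Check by counting: every ordered pair of distinct parts (no replacement).
draws = list(permutations(range(len(parts)), 2))
both_good = [d for d in draws if parts[d[0]] == "G" and parts[d[1]] == "G"]
p_count = Fraction(len(both_good), len(draws))

print(len(draws), len(both_good))  # 90 42
print(p_law, p_count)              # 7/15 7/15
```

The 90 ordered pairs and the 42 favourable pairs correspond to the fraction 42/90 in the
worked solution.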

Activity 5

1. For Question 1 of Activity 3, determine the conditional probability that a randomly
chosen business student is enrolled in Finance given that the student is enrolled in
Marketing.

2. Of 12 letters kept in a file, 4 contain typing errors.

(a) If a clerk randomly selects two of these letters (without replacement), what is
the probability that neither letter will contain typing errors?

(b) If the clerk samples three letters, what is the probability that none of the letters
have typing errors?

10.9 INDEPENDENT EVENTS

When information about the occurrence of B has no effect on the probability of occurrence or
non-occurrence of A (or vice versa), then A and B are said to be independent events.

In this case, the conditional probability $P(A \mid B)$ is the same as the unconditional
probability, P(A).

Thus, when $P(A \mid B) = P(A)$
or $P(B \mid A) = P(B)$
or $P(A \cap B) = P(A) \cdot P(B)$
A and B are independent events.

The probability of the occurrence of both A and B, when they are independent, is therefore
the product of their separate probabilities.

Note: When two events with non-zero probabilities are mutually exclusive, they cannot be
independent, because then $P(A \cap B) = 0$ while $P(A) \cdot P(B) > 0$.

Example 15

A card is drawn at random from a pack of well-shuffled cards. Let A denote an ace and B
denote a spade. Show that events A and B are independent.

$P(A) = \frac{4}{52}$

$P(B) = \frac{13}{52}$

so that $P(A) \cdot P(B) = \frac{4}{52} \cdot \frac{13}{52} = \frac{1}{52}$

Also, $P(A \cap B) = \frac{1}{52}$, since there is only one card which is both an ace (A) and
a spade (B).

Thus $P(A \cap B) = P(A) \cdot P(B)$
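The product test for independence can be applied mechanically to any pair of events. This is a
minimal Python sketch on a standard pack of 52 cards; the function name independent is our
own and is not taken from the textbook.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = set(product(ranks, suits))  # sample space S: 52 cards


def prob(event):
    """Probability of an event as n(event) / n(S)."""
    return Fraction(len(event), len(deck))


def independent(a, b):
    """True when P(A and B) equals P(A) * P(B)."""
    return prob(a & b) == prob(a) * prob(b)


aces = {card for card in deck if card[0] == "A"}
kings = {card for card in deck if card[0] == "K"}
spades = {card for card in deck if card[1] == "spades"}

print(independent(aces, spades))  # True, as in Example 15
print(independent(aces, kings))   # False: mutually exclusive, each with non-zero probability
```

The second check illustrates why mutually exclusive events with non-zero probabilities
cannot be independent.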

Activity 6

1. For the problem of Activity 3, apply an appropriate test to determine if Marketing and
Finance are independent events.

2. Ex. 11.3 (b) in your textbook (O.J.)

3. In general, the probability that a client will make a purchase when he is contacted by a
salesman is P = 0.40. If a salesman selects two clients randomly from a file and
contacts them, what is the probability that both the clients will make a purchase?

4. Attempt Questions 11.9 and 11.26 in your textbook.

5. In a certain hospital, 43% of the patients examined are found to suffer from a heart
problem and 17% from a respiratory problem. Furthermore, it is observed that 52% of
patients examined suffer from either one or both of these problems. Find:

(a) The probability that a patient has neither a heart problem nor a respiratory
problem.
(b) the probability that a patient has a respiratory problem but not a heart problem.
(c) the probability that a patient has a respiratory problem given that he/she has no
heart problem.
(d) Determine whether there is any association between the event of suffering
from a heart problem and that of suffering from a respiratory problem.
Explain briefly the implication of your answer.

10.10 TREE DIAGRAMS

A tree diagram is very useful as a method of displaying the possible events associated with
sequential observations, or sequential trials.

E.g. A tree diagram for the events associated with tossing a coin twice.

A tree diagram shows the outcomes of successive trials, joint events and the probabilities of
the joint events. Since the joint events thus obtained are exhaustive and mutually exclusive,
the sum of the probabilities of all joint events is 1.

Example 16

Construct a tree diagram to represent the sequential sampling process in the problem of
Example 14

Figure 10.4: Tree Diagram

First part      Second part      Joint Event        Probability
$G_1$ (7/10)    $G_2$ (6/9)      $G_1 \cap G_2$     42/90
                $D_2$ (3/9)      $G_1 \cap D_2$     21/90
$D_1$ (3/10)    $G_2$ (7/9)      $D_1 \cap G_2$     21/90
                $D_2$ (2/9)      $D_1 \cap D_2$      6/90
                                 Total              90/90 = 1
Note: The Tree Diagram of Example 16 can be used to calculate conditional
probabilities. Thus the probability that the second spare part is good given that
the first spare part is good, is given by

$P(G_2 \mid G_1) = \frac{6}{9}$

Similarly $P(G_2 \mid D_1) = \frac{7}{9}$
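The branches of such a tree can be generated by multiplying the probability on the first
branch by the conditional probability on the second. This is a minimal Python sketch for two
draws without replacement; the function name two_draw_tree is our own.

```python
from fractions import Fraction


def two_draw_tree(good, defective):
    """Joint probabilities for two draws without replacement, keyed by (first, second)."""
    total = good + defective
    first_branch = {"G": Fraction(good, total), "D": Fraction(defective, total)}
    joint = {}
    for first, p_first in first_branch.items():
        # One part of the type just drawn is no longer available for the second draw.
        good_left = good - 1 if first == "G" else good
        defective_left = defective - 1 if first == "D" else defective
        second_branch = {
            "G": Fraction(good_left, total - 1),
            "D": Fraction(defective_left, total - 1),
        }
        for second, p_second in second_branch.items():
            joint[(first, second)] = p_first * p_second
    return joint


joint = two_draw_tree(good=7, defective=3)
for event, p in joint.items():
    print(event, p)
print(sum(joint.values()))  # 1
```

Note that Python prints each fraction in its lowest terms, so 42/90 appears as 7/15.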

Activity 7

1. Construct a tree diagram to represent the possible events associated with three tosses
of a fair coin.

2. For Example 16, find the conditional probabilities $P(D_2 \mid G_1)$ and $P(D_2 \mid D_1)$.

10.11 JOINT PROBABILITY TABLES

A joint probability table is a table in which all possible events for one variable are listed as
column headings, all possible events for a second variable are listed as row headings, and the
value entered in each resulting cell is the probability of each joint occurrence.

A table of joint-occurrence frequencies which can be used as the basis for constructing a joint
probability table is called a contingency table.

Example 17

The Table below is a contingency table.

Table 10.2
Number of Boxes of 100 units containing Defective Electron Tubes

                      No. of Defective Tubes              Marginal
Firm               0       1       2     3 or more        Total
Supplier A        500     200     200       100            1000
Supplier B        320     160      80        40             600
Supplier C        600     100      50        50             800
Marginal Total   1420     460     330       190            2400

The joint probability table is as shown below.

Table 10.3
Joint Probability Table for Boxes of 100 units containing Defective Electron Tubes

                          No. of Defective Tubes                     Marginal
Firm              0            1           2          3 or more      Probability
Supplier A      500/2400     200/2400    200/2400     100/2400       1000/2400
Supplier B      320/2400     160/2400     80/2400      40/2400        600/2400
Supplier C      600/2400     100/2400     50/2400      50/2400        800/2400
Marginal
Probability    1420/2400     460/2400    330/2400     190/2400       2400/2400 = 1

Note: A marginal probability is so named because it is a marginal total of a column or a
row. The marginal probabilities are unconditional probabilities of particular events.
For example,

$P(A) = \frac{1000}{2400}$, $P(1) = \frac{460}{2400}$, etc.

Conditional probabilities can now be calculated from the above table. For example,

$P(2 \mid B) = \frac{P(2 \cap B)}{P(B)} = \frac{80/2400}{600/2400} = \frac{80}{600}$
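The steps from a contingency table to joint, marginal and conditional probabilities can be
sketched as follows, using the counts of Table 10.2. This is a minimal Python sketch; the
dictionary layout and the variable names are our own.

```python
from fractions import Fraction

columns = ["0", "1", "2", "3 or more"]   # number of defective tubes
counts = {                               # contingency table (Table 10.2)
    "A": [500, 200, 200, 100],
    "B": [320, 160, 80, 40],
    "C": [600, 100, 50, 50],
}

grand_total = sum(sum(row) for row in counts.values())   # 2400

# Joint probability of each (supplier, number of defective tubes) cell.
joint = {
    (supplier, col): Fraction(n, grand_total)
    for supplier, row in counts.items()
    for col, n in zip(columns, row)
}

# Marginal probabilities are the row and column totals of the joint table.
p_supplier = {s: sum(joint[(s, c)] for c in columns) for s in counts}
p_defects = {c: sum(joint[(s, c)] for s in counts) for c in columns}

# Conditional probability P(2 | B) = P(2 and B) / P(B).
p_2_given_b = joint[("B", "2")] / p_supplier["B"]

print(p_2_given_b)  # 2/15, i.e. 80/600 in lowest terms
```

Dividing a cell of the joint table by the marginal probability of its row gives the
conditional probability given that row.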

Activity 8

1. For Example 17 above, calculate the following probabilities.

(a) If one box had been selected at random what is the probability that

(i) it came from supplier B?
(ii) it would contain two defective tubes?
(iii) it would have no defective tubes and would have come from
supplier A?

(b) Given that a box selected at random came from supplier B, what is the
probability that it contained one or two defective tubes?

(c) If a box came from supplier A, what is the probability that the box would have
two or less defective tubes?

2. Attempt Question 11.8 in your textbook(OJ).

3. The contingency table below describes a sample of 350 people who made a purchase
in a large store selling sports shoes according to age and gender

Customers in a Sports Shoes Store, by Age and Gender

Age Gender Total
Male Female
Under 30 125 100 225
30 and over 75 50 125
Total 200 150 350

Source: Survey Report of the Store

(a) Construct the joint probability table for the above data.
(b) If one customer had been selected at random, what is the probability that the customer
selected was

(i) Under 30?
(ii) a female?
(iii) a male 30 and over?

(c) Given that the customer selected was under 30, what is the probability that the customer
was a female?
(d) Given that the customer selected was a male, what is the probability that the customer
was 30 and over?

10.12 SUMMARY

In this unit, you have learnt about the probability of occurrence of one or more uncertain
events. You should now have a clear understanding of the different types of events and the
calculation and interpretation of their probabilities. You have also learnt about the use of
Sets, the different laws of probability, Venn Diagrams, Tree Diagrams and Joint Probability
Tables.

UNIT 11 DATA COLLECTION II

Unit Structure

11.0 Overview
11.2 Sample Design
11.2.1 Introduction
11.2.2 Requirements of a Good Sample
11.2.3 The Importance of Random Selection
11.2.4 Representativeness
11.2.5 Table of Random Numbers
11.2.6 Sampling Frames
11.2.7 Methods of Random Sampling
11.2.7.1 Simple Random Sampling
11.2.7.2 Systematic Sampling
11.2.7.3 Stratified Random Sampling
11.2.7.4 Cluster Sampling
11.2.8 Sample Size
11.2.9 Quota Sampling
11.3 The Questionnaire
11.3.1 Question Construction
11.3.2 Concluding Remarks
11.4 Summary

11.0 OVERVIEW

In Unit 2, you were introduced to data collection. In this unit we take up two very important
aspects of data collection for further discussion. We repeat the cautionary note which
appeared at the beginning of Unit 2. The material in OJ on sampling (Chapter 15) is not
considered appropriate for this course. However, you may find Chapter 16 of OJ useful
supplementary reading.

1. Explain the importance of random selection
2. Use a table of random numbers to draw a random sample
3. Explain the strengths and weaknesses of the following sample designs and be able to
apply them in simple situations:

(i) simple random
(ii) systematic
(iii) stratified random
(iv) cluster sample

4. Explain the strengths and weaknesses of quota sampling
5. Explain the general principles of questionnaire design and the precautions to be
applied in the wording of questions and be able to construct a simple questionnaire.

11.2 SAMPLE DESIGN

11.2.1 Introduction

In Unit 2, we noted that the idea of studying or examining a part in order to learn about the
whole is familiar and often applied in everyday life. We also noted that sampling has a
number of advantages over exhaustive studies. However, we pointed out that the findings
relating to a sample are only generalisable to the whole population provided the sample has
been selected according to certain principles. Here we discuss the basic principles of valid
sample design and the considerations that govern the choice among alternative designs.

11.2.2 Requirements of a Good Sample

The dual characteristics of a good sample are:

(i) randomness
(ii) representativeness

These two notions will be elaborated respectively in the following two subsections.

11.2.3 The Importance of Random Selection

Selection on grounds of convenience is not appropriate

Suppose that a number of people are assembled in a large theatre, say, the University
auditorium, for a lecture and it is desired to draw a sample from the audience for the purpose
of eliciting their views about the presentation. One way that might suggest itself to us would
be to simply select any convenient group, say, people in the front row. Now, it is not hard to
think of reasons why people in the front row might not be representative of the whole
audience. Can you think of one or two? This method of selection is therefore not appropriate.

Haphazard selection is also not appropriate

An alternative method that might suggest itself to us would be to stand on the rostrum and
haphazardly point out persons to form part of the sample. It is perhaps not as easy to find
fault with this approach, but it is nevertheless flawed. Can you think why? Well, the
reason is that the individual doing the selection may, for example, have a preference,
conscious or unconscious, for young persons, so that once again the sample would not be
representative of the whole population. It has been found, time and again, that when the task
of picking a sample is left to a human being, biases tend to creep in, one way or another.
Even if a particular researcher felt that he or she was free from biases or prejudices of
any sort and could therefore safely select the required sample personally,
it would still be impossible for him or her to prove to the rest of the world that the sample
was free from any bias whatsoever. Findings based on such a sample and extended to the whole
population would be subject to challenge on grounds of subjectivity.

A Random Selection Procedure

Suppose that, instead of the above, all members of the audience are asked to write their names
on bits of paper which are then crumpled up and placed in a basket. The bits of paper are mixed
thoroughly by shaking the basket and a sample is then drawn by picking out some bits of
paper from the basket. Now this is a truly random procedure which is not subject to biases
arising in the way described above. This method of carrying out random selection is not very
practical, especially if the size of the population is large. However, we shall see that there
exist other, more practical ones.

Why is random selection required?

In spite of it being random, the procedure just described could nevertheless occasionally
yield samples similar to the ones obtainable under either the first or second procedures
described above. So what has been gained? Well there are three very important advantages:

(i) With the first two procedures, any bias present would persist, always in the same
direction, even if the selection process were repeated many times. We refer to such
biases as systematic biases. For example, if an assistant was asked to select animals for
an experiment and, for one reason or another, he tended to over-represent larger
animals in his selection, this bias would persist in repetitions of the selection with the
same assistant. With the third procedure described, there is no systematic bias. In a
single sample, larger animals may well be over-represented but in repetitions of the
selection, the biases will not always be in the same direction but rather tend to cancel
out.

(ii) Even more interesting is the fact that, with the third procedure described, as the sample
size increases, the risk of having an unrepresentative sample decreases. On the other
hand, the potential biases in the first two procedures do not diminish with increases in
the sample size. Can you think why?

(iii) When sampling has been carried out by a random procedure, it is possible to assess how
far the result based on the sample reflects the corresponding true population
characteristic. In other words, it is possible to indicate the likely margins of error in the
sample result. (This is done through the notion of confidence intervals, by application
of sampling theory, and is outside the scope of this course.) No such margins can be
quoted when sampling is non-random.

Definition of random selection

Definition: Random (or probability) sampling is a method of sampling where every
member of the target population has a known, non-zero probability of selection.

Note that equal chance of selection is not a requirement of random selection. In fact, as we
shall see later, there are instances where there are valid reasons for wanting certain sections
of a population to be over-represented and others to be under-represented. So long as
everyone has a chance of being selected and that chance is known, such over- or
under-representation is not a problem and can be compensated for at the analysis stage by a
technique known as re-weighting.

11.2.4 Representativeness

As we have seen, although randomisation eliminates systematic and persistent biases, it does
not guarantee a representative sample, particularly when the sample is small. For example,
consider a large population made up of men and women in equal proportions. If we draw a
sample of ten persons from this population by simple random sampling, we may very well
end up with eight men and two women. This is clearly a sample that is unrepresentative in
terms of gender. If we had drawn a sample of 100, the risk of having a similarly
unrepresentative sample, i.e. 80 men and 20 women would be considerably less. However,
whatever the size of our sample, we could easily have forced it to be representative on the
gender criterion by selecting half of our sample from men and the other half from women.
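The effect of sample size on this risk can be quantified. The following is a minimal Python
sketch, assuming that the population is large enough for each selection to be treated as an
independent draw with probability 1/2 of picking a man (a binomial model, which is our
simplifying assumption and is not stated in the text). The function name prob_men is our own.

```python
from fractions import Fraction
from math import comb


def prob_men(n, k):
    """P(exactly k men in a sample of n), with men and women in equal proportions."""
    return Fraction(comb(n, k), 2 ** n)


p_small = prob_men(10, 8)     # eight men and two women in a sample of 10
p_large = prob_men(100, 80)   # 80 men and 20 women in a sample of 100

print(float(p_small))  # 0.0439453125
print(float(p_large))  # far smaller, well below one in a billion
```

Under this model, a sample that is 80% men occurs about 4 times in 100 when n = 10, but is
practically never seen when n = 100.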

The objective of sampling is to try to capture in the sample the variation in the target
population in respect of the characteristic or characteristics under study. To be certain of
doing this requires prior knowledge of the distributions of such characteristics in the
population which is, of course, not available. What we can do, however, is to ensure
representativeness in terms of other known characteristics of the population, such as age,
gender, occupation, etc., which we suspect may be correlated with the characteristic or
characteristics under study. This is the idea underlying stratified sampling, which is
discussed in 11.2.7.3 below.

Randomisation alone does ensure a valid sample and produces valid estimates with margins
of error that can be calculated by the application of statistical theory. Stratified sampling
using relevant stratification factors gives an additional guarantee of representativeness,
resulting in smaller margins of error.

11.2.5 Table of Random Numbers

Drawing a random sample by writing names on bits of paper, crumpling them up, dropping
them in a basket, mixing them thoroughly and then picking out the required number of names
is, as we pointed out, not very practical. It may be okay for a small sample from a
small population but it would be tedious, for instance, for a sample of, say, 200 from the
student population of the University of Mauritius.

A more practical way of drawing random samples is to use a table of random numbers. The
numbers tabulated in a random number table have been generated by a truly random process.
No pattern can be discerned in such a table when examining the succession of numbers, in
whatever direction we proceed with the examination, whether horizontally, vertically or
diagonally. However, if we analyse a sufficiently large block of the table, it will be found that
the latter has certain properties. Thus, for example, the frequencies of all digits would be
found to be similar. In other words, there is no bias towards any digit or sequence of digits.

An extract from a random number table is presented below:

92294 46614 50948 64886 20002 97365
35774 16249 75019 21145 05217 47286
83091 91530 36466 39981 62481 49177
85966 62800 70326 84740 62660 77379
41180 10089 41757 78258 96488 88629

Note that the numbers are presented in groups of five separated by spaces for greater
readability. The spaces have no significance whatsoever.

Suppose we need to draw a simple random sample of 200 from a population of 2000. First we
number the individual members of the population from 0001 to 2000. We would need a
bigger table than the extract presented above, but for the sake of illustration suppose we
start at its beginning and read the numbers horizontally. (We can start anywhere we like in a
random number table and it is good practice not to always start at the same spot.) The first
two four-digit numbers (9229 and 4466) exceed 2000, so they are irrelevant and are ignored.
The third one is relevant and selects the 1450th individual on our list. The next two
four-digit numbers (9486 and 4886) are again irrelevant but the following one selects the
2000th individual.
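The reading of the table described above can be sketched in Python as follows. We assume
that the digits are read continuously from one row to the next and that a number already
selected is skipped if it comes up again; neither point is spelled out in the text, and the
variable names are our own.

```python
rows = [
    "92294 46614 50948 64886 20002 97365",
    "35774 16249 75019 21145 05217 47286",
    "83091 91530 36466 39981 62481 49177",
    "85966 62800 70326 84740 62660 77379",
    "41180 10089 41757 78258 96488 88629",
]
population_size = 2000

# The spaces have no significance, so join everything into one string of digits.
digits = "".join(rows).replace(" ", "")

# Consecutive four-digit numbers, read horizontally from the start of the extract.
candidates = [int(digits[i:i + 4]) for i in range(0, len(digits) - 3, 4)]

selected = []
for number in candidates:
    # Keep only numbers that correspond to a member of the population (0001 to 2000).
    if 1 <= number <= population_size and number not in selected:
        selected.append(number)

print(candidates[:6])  # [9229, 4466, 1450, 9486, 4886, 2000]
print(selected)        # [1450, 2000, 1915, 1491, 100]
```

The extract yields only five usable numbers, which illustrates why a much bigger table would
be needed for a sample of 200.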

11.2.6 Sampling Frames

A list from which we select a sample is called a sampling frame. Sampling frames are often
not available. And when they are, they are not always perfect. They may be subject to
problems of inaccuracies, omission, duplication, inclusion of irrelevant individuals etc. When
no sampling frame exists or available ones carry imperfections which cannot be remedied, it
may be necessary to compile one. This can be both costly and time consuming. Fortunately
there are methods of random sampling which require no list or only a partial list. This will be
elaborated upon below.

11.2.7 Methods of Random Sampling

It is possible to devise various alternative random sample designs. These alternative designs
will vary in terms of precision achievable, convenience and cost. In the following
subsections we discuss several basic designs. Sample designs for large surveys, e.g. national
surveys may be more complex than any of these basic designs but will incorporate the same
ideas: they may be made up of several stages and involve combinations of the strategies of the
basic designs discussed here. The choice of sample design in a practical situation is a question
of choosing the design that gives maximum precision subject to the constraints of cost and
resources (including the availability of a sampling frame).

In studying the alternative basic designs described below, you should ensure that you
understand

(i) how to apply each of them
(ii) their relative advantages and disadvantages

11.2.7.1 Simple Random Sampling

Definition: Simple random sampling (s.r.s.) is a method of random selection where every
subset of the chosen sample size has the same chance of selection.

It can be proved that s.r.s. has the property that every member of the sampled population has
the same chance of selection but this property is not unique to s.r.s. and therefore should not
be used as a definition thereof.

The application of simple random sampling is very simple. We need a list of all members of
the target population. This list is numbered serially (e.g. if the population consists of 10,000
individuals, we number them 00001, 00002, 00003 etc., up to 10,000). Suppose we need a
sample of size 400. We select 400 random numbers (lying between 00001 and 10,000) from a
table of random numbers using the method described in 11.2.5. above.
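
The selection just described can be sketched in a few lines of Python. This is only an illustration: it assumes that a pseudo-random number generator is an acceptable substitute for the table of random numbers, and the function name `simple_random_sample` and the `seed` argument are hypothetical names introduced here.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Return sample_size distinct serial numbers from 1..population_size."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so every subset of the
    # requested size has the same chance of being selected.
    return sorted(rng.sample(range(1, population_size + 1), sample_size))


# The example from the text: 400 individuals out of a list of 10,000.
sample = simple_random_sample(10_000, 400, seed=1)
print(len(sample), sample[:5])
```

The serial numbers returned are then looked up in the numbered list (the sampling frame) to identify the individuals selected.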

Simple random sampling is important for two main reasons: (i) it serves as a yardstick against
which to compare other sample designs in terms of precision, convenience and cost; (ii) it is a
component of other sample designs. This latter statement will become clear as we discuss
these other designs.

Note that the implementation of simple random sampling requires a sampling frame.

Another important point about simple random sampling is that if the population of interest is
geographically scattered, the sample selected by this method will also be physically dispersed.
This is not only inconvenient from the point of view of organisation of the field work; it is
also costly as it inflates field costs, e.g. the travel costs of field staff if face to face
interviewing is used.

For a sample of the same size, s.r.s. gives more precision than cluster sampling but less than
stratified sampling (provided the stratification criteria are appropriate).

11.2.7.2 Systematic Sampling

Drawing numbers from a random number table, although more practical than picking out slips
of paper from a basket, can nevertheless prove tedious if the sample to be drawn is large.
Given a sampling frame, it is possible to select a sample in a very practical way as follows:

Suppose we have a list of N people and it is desired to select a sample of n from it. Regard
the N units as arranged in a circle. Let k be the nearest integer to N/n. We first select a
random number between 1 and N. This identifies our first selection. We then select every kth
individual after that first selection, going round the circle until n units have been selected.

For example suppose we have a population of 2000 and we need a sample of 175. Then k=11.
We select a random number (from a random number table) between 0001 and 2000. Suppose
this number is 0379. Then the first selection into our sample is the 379th individual from the
start of our (original) list. We then, referring to our imaginary circle, select every kth
individual after that first selection going round the circle until 175 units have been selected.
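
The circular procedure of this example can be sketched as follows. The function name and the optional `start` argument are hypothetical; `start` is included only so that the worked example (first selection 0379) can be reproduced. The sketch assumes that n is small relative to N, so that going round the circle never reaches the same unit twice.

```python
import random


def circular_systematic_sample(N, n, start=None, seed=None):
    """Select n positions (numbered 1..N) by circular systematic sampling."""
    k = round(N / n)  # nearest integer to N/n (exact halves round to even in Python)
    if start is None:
        start = random.Random(seed).randint(1, N)  # random start between 1 and N
    # Step k places at a time round the circle; after position N comes position 1.
    return [(start - 1 + i * k) % N + 1 for i in range(n)]


# The example from the text: N = 2000, n = 175, first selection 379.
sample = circular_systematic_sample(2000, 175, start=379)
print(sample[:3], sample[-1])
```

Here k = 11, so the selections run 379, 390, 401, and so on, passing the end of the list and continuing from its beginning until 175 units have been chosen.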

If our list of individuals is not arranged in any particular order, then systematic sampling is
almost (but not quite) equivalent to simple random sampling. The difference is that not all
subsets are possible, as they are in s.r.s. (Think why.) However, this is not a very serious drawback and
the convenience of systematic sampling constitutes a great practical advantage.

If the list is arranged, say by age starting from the youngest, then the systematic sampling
procedure spreads out the sample across age more evenly than simple random sampling
would do normally. Systematic sampling in these conditions becomes almost equivalent to
stratified sampling (discussed in the next subsection) with stratification according to age. The
procedure, however, should be avoided if there are cyclical patterns in the list, as may happen
if there are several subgroups in the list with each subgroup ordered by age. The selections
may then coincide with a particular age range thus biasing the sample.

11.2.7.3 Stratified Random Sampling

Stratified sampling consists of dividing the target population into subgroups called strata and
taking a separate sample within each stratum. The sample within each stratum is usually
drawn by simple random sampling.

Stratification, i.e. the division of the target population into groups, must be on a criterion or on
criteria relevant to the survey topic. For example, it is pointless to ensure representativeness in
terms of religion if respondents' answers are unlikely to be influenced by religion. But
provided the stratification criterion or criteria are appropriate, the representativeness (on these
criteria) achieved by stratified sampling produces greater precision than with simple random
sampling. In practice this is reflected by smaller margins of error.

The data prerequisites for implementation of stratified sampling are however more demanding
than for simple random sampling. We need not only a list of every member of the target
population but also in respect of each such member we need information on the stratification
criterion. For example, if we are stratifying by educational level, we need to know every
individual's educational level.

There are a number of options for allocating the total sample to the various strata, i.e. deciding how
many to sample from each stratum. One simple option is to share the total sample among the
various strata in proportion to their sizes (i.e. a stratum which has 40% of the population
gets 40% of the sample, another which has 25% of the population gets 25% of the sample and
so on). We refer to this as stratified sampling with proportionate allocation. This allocation
has the advantage that it gives every member of the population the same chance of selection
and hence eliminates the need for reweighting at the analysis stage.
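
As a minimal sketch of proportionate allocation, the following Python lines share out a total sample among strata. The stratum names and sizes are hypothetical figures chosen to match the 40% and 25% shares mentioned above, and the function name is introduced here for illustration only.

```python
def proportionate_allocation(stratum_sizes, total_sample):
    """Share total_sample among the strata in proportion to their sizes."""
    population = sum(stratum_sizes.values())
    # Rounding can make the parts add up to slightly more or less than
    # total_sample when the shares are not whole numbers.
    return {name: round(total_sample * size / population)
            for name, size in stratum_sizes.items()}


# Hypothetical population of 10,000 split into three strata (40%, 25%, 35%).
strata = {"stratum A": 4000, "stratum B": 2500, "stratum C": 3500}
allocation = proportionate_allocation(strata, 400)
print(allocation)
```

With a total sample of 400, the three strata receive 160, 100 and 140 units respectively; a simple random sample of that size would then be drawn within each stratum.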

However, there are sometimes good reasons for not using proportionate allocation. For
instance, if it is desired to make comparisons among strata, and certain strata are small, it is
possible that proportionate allocation will produce samples from such strata that are too small for
meaningful comparisons.

11.2.7.4 Cluster Sampling

Sometimes populations occur in, or can be conveniently divided into, groups or clusters. For
example, school children are located in schools. Households in a country can be grouped into
geographical clusters made of blocks of houses bounded by streets or natural boundaries. This
fact provides an alternative random sampling strategy. This consists of drawing up a list of
clusters that together comprise the whole population and then selecting a sample of clusters.
This can be done by simple random sampling. We can then include in our sample all
individuals in each selected cluster. This is referred to as sampling of whole clusters. It gives
equal chance of selection to every member of the population.

The method of sampling just described has the great advantage that it does not require a list of
all members of the target population. It only requires a list of the clusters and this is usually
not hard to obtain. Furthermore, it concentrates the field work (think why and contrast with
srs!) and this reduces field costs. However, if the clusters are of unequal size, the method
provides no control over the overall sample size. Note that sample size has a direct bearing
on costs. In addition, for the same size of sample, the precision with cluster sampling is less
than with either srs or stratified sampling.

Instead of including every individual in the selected clusters into the sample it is possible to
take only a sample from each selected cluster. This method is referred to as cluster sampling
with subsampling. Note that it requires lists of individuals but only for the selected clusters.
Various strategies are available for both the selection of clusters and the selection of
individuals within clusters.
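
The two variants of cluster sampling described above can be sketched as follows. The schools and pupils are hypothetical data, and the function and parameter names are introduced here only for illustration; clusters are selected by simple random sampling, which is one of several possible strategies.

```python
import random


def cluster_sample(clusters, n_clusters, per_cluster=None, seed=None):
    """Select n_clusters clusters by s.r.s.; take whole clusters or subsample."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # only a list of clusters is needed
    sample = []
    for name in chosen:
        members = clusters[name]  # a list of individuals is needed only for chosen clusters
        if per_cluster is None:
            sample.extend(members)  # sampling of whole clusters
        else:
            size = min(per_cluster, len(members))
            sample.extend(rng.sample(members, size))  # cluster sampling with subsampling
    return chosen, sample


# Hypothetical population: six schools of unequal size.
schools = {f"school_{i}": [f"pupil_{i}_{j}" for j in range(20 + 5 * i)] for i in range(6)}

chosen, whole = cluster_sample(schools, 2, seed=0)
chosen_sub, sub = cluster_sample(schools, 2, per_cluster=10, seed=0)
print(chosen, len(whole), len(sub))
```

Note that the size of `whole` depends on which schools happen to be chosen, which illustrates the lack of control over the overall sample size when clusters are of unequal size.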

Activity 1

Explain how you would select a random sample of students from the University of Mauritius
for the purpose of eliciting their views on the University Library services:

(a) by simple random sampling
(b) by stratified random sampling (choose appropriate stratification criteria and justify
your choice)
(c) by cluster sampling
(d) by systematic sampling

11.2.8 Sample Size

What size of sample do I need for my survey? This is an often asked question. There is no
magic answer. Certain items of information are needed before an answer can be attempted.
We need an indication of the acceptable margins of error, of the maximum acceptable risk of
exceeding these margins and some advance information about the population to be sampled.
The latter requirement may seem difficult to satisfy but it is often possible to make certain
estimates or assumptions. Given these items of information, there exists theory that enables
one to determine the required sample size but this is beyond the scope of this course.

Suppose that you are doing a survey to find out what percentage of your target population
would be interested in purchasing a certain product, and you would like to be quite confident
that your survey finding is not off the mark by more than 5 percentage points on either side (i.e. if
your survey finds that 50% are interested, for example, you want to be reasonably sure that
the true percentage lies between 45% and 55%). Your required sample size would be of the
order of 400. If you can relax your error margins to 10 percentage points on either side, then
your required sample size would be of the order of 100.
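
Although the underlying theory is beyond the scope of this course, the orders of magnitude quoted above are consistent with the standard formula for estimating a proportion, n = z² p(1 - p) / e², where e is the margin of error. The sketch below rests on assumptions that are not stated in the text: simple random sampling from a large population, 95% confidence (z = 1.96) and the most cautious guess p = 0.5. The function name is hypothetical.

```python
import math


def sample_size_for_proportion(margin, z=1.96, p=0.5):
    """Smallest n for which z * sqrt(p * (1 - p) / n) does not exceed margin."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


n_5_points = sample_size_for_proportion(0.05)   # margin of 5 percentage points
n_10_points = sample_size_for_proportion(0.10)  # margin of 10 percentage points
print(n_5_points, n_10_points)
```

The results (385 and 97) are of the order of 400 and 100. If z is rounded to 2, the formula reduces to 1/e², which gives 400 and 100. Doubling the margin of error divides the required sample size by about four.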

11.2.9 Quota Sampling

Quota sampling is a non-random method of selection. It attempts to ensure
representativeness on criteria that are considered important in the same way that stratified
random sampling does i.e. by ensuring that the proportions in the various strata in the sample
are the same as in the population. However, no sampling frame is used and the selection of
respondents is left to the interviewers. Thus, for example if it is desired to ensure
representativeness by age-group and the target population consists of 45% under 25, 35%
between 25 and 44 and 20% aged 45 and over, then interviewers would be sent out with
quotas relating to the number of persons in the various age groups that each one should
interview. There is then the danger that although the sample would be representative in terms
of age group, it could be biased in terms of other characteristics. Can you think why? Certain
precautions can be taken but the risk can neither be eliminated nor controlled statistically (by
specifying margins of error as in the case of random sampling).

Opinion poll organisations usually use quota sampling because of its low cost and
convenience. Over the years these organisations have considerably refined the procedure by
incorporating more stratification criteria and more controls so that, nowadays, they are
usually able to produce reliable results. Surveys using quota sampling have become a
common feature of modern society as the general public is very fond of information on a
variety of subjects. The results of such surveys often appear in newspapers or are presented
on television. Watch for the next one!

11.3 THE QUESTIONNAIRE

As was noted in Unit 2, the observational methods are less effective in providing information
about personal beliefs, feelings, motivations, expectations or future plans.

The signal advantages of the mail questionnaire and personal interviews as principal ways of
collecting survey data have been discussed. We now turn to the instrument on which both
approaches depend, the questionnaire or recording schedule.

Both of them contain a set of questions logically related to a problem under study, but whilst
a schedule is used as a tool for interviewing, the questionnaire is used for mailing. The
process of construction of a schedule and a questionnaire is almost the same, except for some
minor technical differences. Whilst the questionnaire itself is simpler, shorter and carefully
and clearly laid out, the requirements for the recording schedule are in some respects different
as it is handled by interviewers.

Having made the distinction between questionnaires and recording schedules, we shall now
concentrate on some basic points to be kept in mind whilst designing them. We shall however
use the term questionnaire for discussion of both types of documents.

Read pp 317-321 of your textbook (OJ)

As you see, several considerations must be borne in mind while designing a questionnaire.
Careful planning, the physical design of the questions, careful selection and phrasing of the
questions affect the number of returns as well as the quality and accuracy of the findings.

The entire process of questionnaire construction can be divided into the following steps:

(i) Information to be sought
(ii) Type of questionnaire to be used
(iii) Writing a first draft
(iv) Re-examining the questions
(v) Pre-testing and editing the questionnaire
(vi) Specifying procedure for its use.

So, the first step of questionnaire design is to define the problem to be tackled and hence
decide on what questions to ask. The temptation is to cover too much, but this has to be
resisted as lengthy questionnaires can prove to be demoralising for both the interviewer and
the respondent.

11.3.1 Question Construction

Once the information needs and size of the questionnaire have been agreed on, we can begin
question construction. This involves the following:

(a) Question Relevance and Content

In considering any question, it is wise to ponder upon whether respondents are likely to
possess the required knowledge, or have access to the appropriate information, necessary for
giving a correct answer. Further, it must be made clear whether questioning would secure the
required information or not. If we find that our objectives are not met by questioning, then we
should think of alternative procedures.

(b) Question wording

Obviously, great care is required in formulating the questions. Reliable and meaningful
returns depend to a large extent on this. Naturally, if questions are beyond the
understanding of the respondent, he/she may choose one of the alternative responses without
any idea as to the meaning of his/her response.

Some suggestions for wording questions are given below:

(i) Simple words which can be expected to be familiar to all potential informants should be used.
Avoid multiple meaning questions, as they tend to give rise to confusion on both
sides; they should be formulated as two or more questions. Avoid ambiguity and
vague words as they encourage ambiguous and vague answers.

(ii) Caution must be exercised in the use of phrases which reflect upon the prestige of the
respondent. Embarrassing questions, leading questions, those involving memory,
catch-words or words with emotional connotations should be avoided. Further, the
question must allow for all possible responses - thus provision for such indefinite
answers as "don't know", "no choice", "other (specify)" should be made. But, at the
same time, to avoid abuse of these indefinite answers, the range of answers should
be exhaustive and well established as far as possible.

(iii) Questions should not, generally speaking, presume anything about the respondent. For
example, a question such as "How many cigarettes a day do you smoke?" is best asked only
after a filter question, e.g.

Do you smoke?

Yes

No

The filter reveals whether the respondent smokes cigarettes or not. Once filters have been
formulated, skip instructions are necessary. For instance, in the above case, a respondent who
does not smoke may be directed to skip the questions intended for those who
smoke.

Question wording remains a matter of experience and common sense and what we have
discussed above is in no way complete.

(c) Response form or types of questions

The third major area in question construction is the type of questions to be included in the
instrument. They may be classified into open questions and closed questions.
The closed (sometimes called Pre-coded, Fixed Alternative) questions are structured ones
with two or more alternative responses from which the respondent can choose. They are
efficient where the possible alternative replies are known, limited and clear-cut as in the case
of factual information. They have the advantage of being standardisable, simple to
administer, quick and relatively inexpensive to analyse. On the other hand, they may tend to
force a statement of opinion on an issue, or the respondent may be led to choose a response
even when he/she has no knowledge of it, or the limited alternatives may not cover his/her
viewpoints.

Open-ended questions are unstructured ones, providing free scope to the respondents to reply
with their own choice of words, e.g.

What do you propose to do after leaving the University?

While they present a major strength in the sense that the informant is given the chance of
answering in his/her own terms and frame of reference, their analysis is often complex,
difficult and expensive.

Open-ended questions are desirable when the issue is complex or when the interest of the
researcher is the exploration of a process, but in other cases, closed questions are preferable.

(d) Question order/sequence

The order in which questions are arranged is as important as question wording, as it may
affect the refusal rate and there is evidence that it may even influence the answers
obtained.

As mentioned in the book by Goode and Hatt (1952), Methods in Social Research:
McGraw-Hill, NY, there should be a logical progression in the sequence so that the
respondent is

(i) drawn into the questioning process by awakening his/her interest
(ii) not confronted by an early and sudden request for personal information

(iii) brought easily from items which are simple to answer to those which are complex

(iv) never asked to give an answer which could be embarrassing without being given an
opportunity to explain

(v) brought smoothly from one frame of reference to another rather than made to jump
back and forth.

The overall sequence in a questionnaire is of paramount importance, as usually the
interviewer is a stranger to the respondent and the latter is under no obligation to comply. So,
the interviewer should try to awaken the respondent's interest in the study and motivate
participation.

There was a tendency in the past to begin the questionnaire with easy-to-answer demographic
profiles of the respondent such as age, marital status, religion etc., but there is a school of
thought that sees this practice as not desirable because people do not like to furnish such
information so abruptly to strangers. It may thus be more desirable that these questions be put
at the end, as by that time, the interviewer has evoked the interest of the respondent in the
study and the latter is more willing to give such information.

(e) Pilot studies/Pre-testing

A pilot study is a full-fledged miniature study of a problem, while a pre-test is a trial test of a
specific aspect of the study, such as method of data collection, data collection instrument,
interview schedule etc.

The draft questionnaire must be pre-tested in order to find out how it works before launching
a full-scale survey. This often uncovers unforeseen problems in field work and indicates
any necessary changes in the questions and other problems with the questionnaire. After the
editing is done, other pre-tests might be necessary before administering the questionnaire,
depending on the complexity of the study. Finally, a pilot study, which is a dress rehearsal of
the main study, is vital for the proper running of the survey later.

11.3.2 Concluding Remarks

For the purpose of this course, the coverage of questionnaire design has been brief. The
interested reader is directed to the specialised books listed in the Recommended Readings for
a comprehensive discussion of the topic.

To conclude, questionnaire design remains a matter of common sense, experience and avoiding
known pitfalls. Detailed pre-tests and pilot studies, more than anything else, are the
essence of a good questionnaire.

Activity 2

(i) You have been requested by University management to conduct a small study on the
adequacy of accessibility by students to computer facilities on the University campus.
Describe and justify what sampling procedure you would adopt and also, design a
short questionnaire of about 10-15 questions for the purpose.

(ii) Concern has been expressed in various quarters about the difficulty experienced by
working women in reconciling their domestic responsibilities with their work. Issues
of interest are the extent of the domestic responsibilities, whether any help in coping
with them is obtained, the amount and type of leisure enjoyed and stresses generated
by the dual responsibilities. Differences in level of difficulty experienced and ways of
coping with them across different categories of women are also of interest. Design a
suitable short questionnaire (of about 15 to 20 questions) for carrying out a national
sample survey to address the issue described. The survey will use face to face
interviewing.

Recommended Readings:

1. Payne, S.L.B., The Art of Asking Questions, Princeton: Princeton University Press, 1951.

2. Moser, C. A. and Kalton, G., Survey Methods in Social Investigation, ELBS and
Heinemann Educational Books Ltd, 1971.

11.4 SUMMARY

In this unit, you have studied the importance of randomness and representativeness in sample
selection, the use of random number tables, the application, strengths and weaknesses of
simple random sampling, systematic sampling, stratified sampling, cluster sampling and quota
sampling. You have also studied the principles of questionnaire design.

UNIT 12 LINEAR RELATIONSHIP BETWEEN VARIABLES - 1:
CORRELATION

Unit Structure

12.0 Overview
12.1 Learning Objectives
12.2 Bivariate Data
12.2.1 Scatter Diagrams
12.3 Measures of Correlation
12.3.1 Product Moment Correlation Coefficient
12.3.2 Rank Correlation Coefficient
12.4 Interpretation of the Coefficient of Correlation and Problems Related
Thereof
12.5 Coefficient of Determination
12.6 Summary

12.0 OVERVIEW

In many situations it is of interest to find out whether two or more variables are related, and
if so, to investigate the nature and strength of these relationships. For instance, one might be
interested in studying the relationship between Yield, Temperature, Humidity, Rainfall, etc.
Such relationships are studied using the techniques of Correlation and Regression. In this
unit, you shall study the concept of Correlation, different ways of measuring it, its importance
and limitations.

12.1 LEARNING OBJECTIVES

By the end of this unit, you should be able to do the following:

1. Identify bivariate relationships.
2. Construct and interpret Scatter Diagrams.
3. Compute, interpret and use the following:
(i) Product moment correlation coefficient.
(ii) Rank correlation coefficient.
(iii) Coefficient of Determination.

12.2 BIVARIATE DATA

So far, we have confined ourselves to univariate data i.e. the data concerning only one
variable. We may, however, come across data involving two or more variables, for example,
the marks of students in various subjects. The data involving two variables is known as
Bivariate Data.

Table 12.1 is an example of bivariate data:

Table 12.1

Student    % Marks in Maths    % Marks in Statistics

A          40                  65
B          68                  75
C          35                  35
D          52                  48
E          70                  50

Note: Bivariate data must always be in pairs and the two sets of data should correspond
to the same units of observation. For instance, in Table 12.1, the marks in Maths
and Statistics should correspond to the same set of students.

We are often interested in finding out the nature and strength of the relationship between two
variables under study. In the above example, we might be interested in knowing about the
type of relationship that exists between Marks in Maths (X) and Marks in Statistics (Y) and
whether high values of X tend to be associated with high or low values of Y, or vice versa.

The coming sections of this unit are devoted to the analysis of bivariate data using correlation
techniques.

12.2.1 Scatter Diagrams

You have been introduced to scatter diagrams in Section 5.2.6 of the manual and you have
seen their usefulness. We further develop scatter diagrams in this section, especially in the
context of discerning the relationship between the variables.

As mentioned earlier, if the paired values of variables X and Y are plotted along the x-axis and
y-axis respectively in the xy-plane, the diagram of points so obtained is known as a Scatter
Diagram.

From the scatter diagram, we can form a fairly good idea about the relationship between X
and Y.


Study the following scatter diagrams carefully and try to identify the nature of relationship
represented by each of them.

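[Figures 12.1 to 12.6: six scatter diagrams, each with Y on the vertical axis and X on the horizontal axis]

Fig. 12.1: Height of sons (Y) against Height of father (X)
Fig. 12.2: Interest (Y) against Savings (X)
Fig. 12.3: Price (Y) against Demand (X)
Fig. 12.4: Number of errors made (Y) against Number of weeks experience (X)
Fig. 12.5: Yield (Y) against Rainfall (X)
Fig. 12.6: Consumption of Cigarettes (Y) against Height (X)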

You will note that in Fig. 12.1 and Fig. 12.2, high values of variable X are associated with
high values of variable Y, indicating a positive relationship. Since the points in Fig. 12.2 are
less scattered than those in Fig. 12.1, the positive relationship exhibited in Fig.
12.2 is stronger than that in Fig. 12.1.

Activity 1

A machine will run at different speeds, but the higher the speed, the sooner a certain part has to
be replaced. Trial observations give the following data:

Table 12.2

Speed                       Life of
(revolutions per minute)    drill-head

18                          162
20                          154
20                          171
21                          165
23                          128
26                          138
26                          140
28                          129
31                          125
32                          106
32                          97
40                          95
41                          103
42                          109
43                          69

Plot the figures on a scatter diagram and comment.

12.3 MEASURES OF CORRELATION

You have seen that the scatter diagram provides a useful aid in discerning the nature of the
relationship between two variables, but it cannot supply a quantitative measure of the extent
of the relationship between the two variables. Thus, in addition to examining the scatter
diagram, it is necessary to compute a descriptive measure that reflects the strength
of the existing relationship.

Correlation, in fact, does so and gives us a measure of the strength of the linear relationship
that exists between two or more variables. In this unit, we consider the linear relationship
between two variables i.e. simple correlation. In this section, you study two measures of
correlation:

1. Product Moment Correlation Coefficient (r).
2. Rank Correlation Coefficient (P).

12.3.1 Product Moment Correlation Coefficient

Consider the following table representing the volume of Sales and Total expenses for ten
firms.

Table 12.3

Volume of Sales            Total Expenses
(in thousands of units)    (000)
Y                          X

20                         60
2                          25
4                          26
23                         66
18                         49
14                         48
10                         41
8                          18
13                         40
18                         33

The scatter diagram is produced below:

[Fig. 12.7: Graph of Volume of Sales v/s Total Expenses. Vertical axis: Volume of sales (0 to 25);
horizontal axis: Total Expenses (0 to 80).]

The scatter diagram indicates a positive relationship between the two variables but it is
insufficient to give us a measure of the strength of the relationship between the two.

Let us consider the problem:

Compute $\bar{X}$ and $\bar{Y}$ and form the columns $(X - \bar{X})$ and $(Y - \bar{Y})$.
Calculate $(X - \bar{X})(Y - \bar{Y})$. Plot the points $Y - \bar{Y}$ against $X - \bar{X}$.

Table 12.4

Note: $n = 10$, $\sum X = 406$, $\sum Y = 130$, $\bar{X} = 40.6$, $\bar{Y} = 13$

$X - \bar{X}$    $Y - \bar{Y}$    $(X - \bar{X})(Y - \bar{Y})$

 19.4              7              135.8
-15.6            -11              171.6
-14.6             -9              131.4
 25.4             10              254
  8.4              5               42
  7.4              1                7.4
  0.4             -3               -1.2
-22.6             -5              113
 -0.6              0                0
 -7.6              5              -38
                                  ___
                                  816
                                  ===

[Figure 12.8: scatter diagram of $Y - \bar{Y}$ (vertical axis, -15 to 15) against $X - \bar{X}$
(horizontal axis, -30 to 30), with the Ist, IInd, IIIrd and IVth quadrants marked.]

The scatter diagram (Fig. 12.7) indicates that high values of X are associated with high values
of Y, and Fig. 12.8 shows that most of the points lie in the Ist and IIIrd quadrants, where the
product $(X - \bar{X})(Y - \bar{Y})$ is positive. Hence, we expect $\sum (X - \bar{X})(Y - \bar{Y})$
to be positive if most of the points lie in the first and third quadrants of the
$(X - \bar{X},\ Y - \bar{Y})$ plane. It implies that there is a direct or positive correlation
between X and Y.

Similarly, if most of the points lie in the second and fourth quadrants,
$\sum (X - \bar{X})(Y - \bar{Y})$ will be negative, thereby implying negative or inverse correlation.

If the points are rather evenly distributed in all the four quadrants, the sum of the positive
products $(X - \bar{X})(Y - \bar{Y})$ would roughly equal, in size, the sum of the negative products.

Thus, $\sum (X - \bar{X})(Y - \bar{Y})$ is expected to be close to zero, indicating a very weak linear
relationship between X and Y.
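
The quadrant argument can be checked numerically for the data of Table 12.3. The following Python lines are only an illustrative sketch; the variable names are introduced here and are not part of the course text.

```python
X = [60, 25, 26, 66, 49, 48, 41, 18, 40, 33]  # Total Expenses (Table 12.3)
Y = [20, 2, 4, 23, 18, 14, 10, 8, 13, 18]     # Volume of Sales (Table 12.3)

x_bar = sum(X) / len(X)
y_bar = sum(Y) / len(Y)

# Count the points falling in each quadrant of the (X - x_bar, Y - y_bar) plane.
counts = {"I": 0, "II": 0, "III": 0, "IV": 0, "on an axis": 0}
for x, y in zip(X, Y):
    dx, dy = x - x_bar, y - y_bar
    if dx == 0 or dy == 0:
        counts["on an axis"] += 1
    elif dx > 0 and dy > 0:
        counts["I"] += 1
    elif dx < 0 and dy > 0:
        counts["II"] += 1
    elif dx < 0 and dy < 0:
        counts["III"] += 1
    else:
        counts["IV"] += 1

sum_of_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
print(counts, round(sum_of_products, 1))
```

Seven of the ten points fall in the Ist and IIIrd quadrants, only two fall in the IInd and IVth, and one lies on an axis, so the sum of the products is positive (816, as in Table 12.4).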
Alternatively, by shifting the origin from (0, 0) to $(\bar{X}, \bar{Y})$, i.e. (40.6, 13), in Fig. 12.7, we can
draw similar conclusions as above. This is shown in Fig. 12.9.

[Figure 12.9: the scatter diagram of Fig. 12.7 (Volume of sales against Total Expenses) with the
origin shifted to $(\bar{X}, \bar{Y})$.]

You may be wondering whether we could use $\sum (X - \bar{X})(Y - \bar{Y})$ as a measure of the degree of
association. Well, it could be used, but this term has two deficiencies:

(1) it is influenced by the variability of X and Y, and

(2) the magnitude of this term depends upon the size of the sample.

To overcome the first deficiency, we divide $\sum (X - \bar{X})(Y - \bar{Y})$ by the measures of dispersion
$\sigma_x$, $\sigma_y$. The measure thus derived is also a dimensionless ratio.

A remedy for the second deficiency is to divide the ratio by n, the sample size.

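Hence the Product Moment Correlation Coefficient is given by

$$ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n \, \sigma_x \sigma_y} $$ ............................... (Formula 12.1)

From Table 12.4, we have

$$ \sum (X - \bar{X})(Y - \bar{Y}) = 816 $$

Also, you recall

$$ \sigma_x^2 = \frac{1}{n} \sum (X - \bar{X})^2 \quad \text{and} \quad \sigma_y^2 = \frac{1}{n} \sum (Y - \bar{Y})^2 $$

Using Table 12.4, we obtain

$$ \sigma_x^2 = 217.24 \quad \text{and} \quad \sigma_y^2 = 43.6 $$

Using Formula 12.1

$$ r = \frac{816}{10 \times \sqrt{217.24} \times \sqrt{43.6}} = 0.8384 $$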

r measures the strength of the linear relationship between X and Y. This formula was
developed by Karl Pearson; hence this coefficient of correlation is commonly known as Karl
Pearson's product moment correlation coefficient or simply Pearson's coefficient of
correlation.
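
The calculation by Formula 12.1 can be reproduced with a short Python sketch. The variable names are illustrative; the standard deviations are computed with divisor n, as in the definitions recalled above.

```python
import math

X = [60, 25, 26, 66, 49, 48, 41, 18, 40, 33]  # Total Expenses (Table 12.3)
Y = [20, 2, 4, 23, 18, 14, 10, 8, 13, 18]     # Volume of Sales (Table 12.3)
n = len(X)

x_bar = sum(X) / n
y_bar = sum(Y) / n

# Variances with divisor n, then standard deviations.
var_x = sum((x - x_bar) ** 2 for x in X) / n
var_y = sum((y - y_bar) ** 2 for y in Y) / n
sigma_x = math.sqrt(var_x)
sigma_y = math.sqrt(var_y)

# Formula 12.1: sum of products of deviations divided by n * sigma_x * sigma_y.
sum_of_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
r = sum_of_products / (n * sigma_x * sigma_y)
print(round(var_x, 2), round(var_y, 2), round(r, 4))
```

The variances come out as 217.24 and 43.6, and r as 0.8384, in agreement with the hand calculation.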

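Note: For calculation purposes, Formula 12.1 is often expressed as

$$ r = \frac{n \sum XY - \sum X \sum Y}{\sqrt{\left[ n \sum X^2 - \left( \sum X \right)^2 \right] \left[ n \sum Y^2 - \left( \sum Y \right)^2 \right]}} $$ ..................(Formula 12.2)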

Activity 2 (Optional)

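(1) Show that

$$ \sum (X - \bar{X})(Y - \bar{Y}) = \sum XY - \frac{\sum X \sum Y}{n} $$

(2) Recall from Unit 6 Section 6.3.1.1, Activity 2(ii) that

$$ \sum (X - \bar{X})^2 = \sum X^2 - \frac{\left( \sum X \right)^2}{n} $$

$$ \sum (Y - \bar{Y})^2 = \sum Y^2 - \frac{\left( \sum Y \right)^2}{n} $$

(3) Hence show that

$$ \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n \, \sigma_x \sigma_y} = \frac{n \sum XY - \sum X \sum Y}{\sqrt{\left[ n \sum X^2 - \left( \sum X \right)^2 \right] \left[ n \sum Y^2 - \left( \sum Y \right)^2 \right]}} $$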

To illustrate the calculation of r using the computational formula (12.2), we reconsider the
data from Table 12.3, and compute the columns XY, $X^2$ and $Y^2$.

Table 12.5

X      Y      XY       $X^2$     $Y^2$

60     20     1 200    3 600     400
25     2      50       625       4
26     4      104      676       16
66     23     1 518    4 356     529
49     18     882      2 401     324
48     14     672      2 304     196
41     10     410      1 681     100
18     8      144      324       64
40     13     520      1 600     169
33     18     594      1 089     324
___    ___    _____    ______    _____
406    130    6 094    18 656    2 126

$$ r = \frac{10 \times 6\,094 - 406 \times 130}{\sqrt{\left[ 10 \times 18\,656 - 406^2 \right] \left[ 10 \times 2\,126 - 130^2 \right]}} $$

$$ = \frac{60\,940 - 52\,780}{\sqrt{(21\,724)(4\,360)}} = 0.8384 $$

Note that we get the same answer as before, but computations are simpler.
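
The computational formula lends itself to a very short program, since only five column totals are needed. The sketch below uses illustrative variable names and the data of Table 12.3.

```python
import math

X = [60, 25, 26, 66, 49, 48, 41, 18, 40, 33]  # Total Expenses (Table 12.3)
Y = [20, 2, 4, 23, 18, 14, 10, 8, 13, 18]     # Volume of Sales (Table 12.3)
n = len(X)

# The five totals of Table 12.5.
sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)
sum_y2 = sum(y * y for y in Y)

# Formula 12.2.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator
print(sum_xy, sum_x2, sum_y2, round(r, 4))
```

Because all the totals are whole numbers, no rounding takes place until the final division, which is one reason why Formula 12.2 is preferred for calculation.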

Activity 3

Attempt Questions 23.6 (b) and 23.18 in (OJ).

12.3.2 Rank Correlation Coefficient


Note that the equation at the bottom of p. 472 (OJ) should read as:

$$ P = 1 - \frac{6 (34.5)}{9 (81 - 1)} = 1 - \frac{207}{720} = 0.7125 $$
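
The arithmetic of the corrected equation can be checked with the sketch below. It assumes that the equation is an instance of the usual rank correlation formula P = 1 - 6Σd² / (n(n² - 1)), with Σd² = 34.5 (the sum of squared differences between ranks) and n = 9 pairs; this reading is inferred from the form of the equation, and the function name is hypothetical.

```python
def rank_correlation(sum_d_squared, n):
    """Rank correlation from the sum of squared rank differences and n pairs."""
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


P = rank_correlation(34.5, 9)
print(round(P, 4))
```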

Activity 4

Attempt Question 23.24 (b) from (OJ).

12.4 INTERPRETATION OF THE COEFFICIENT OF CORRELATION AND
PROBLEMS RELATED THEREOF

Correlation coefficient r lies between -1 and +1, i.e. $-1 \le r \le +1$.

r = -1 implies a perfect inverse linear relationship between X and Y, that is, all the sample
points will fall on a straight line with negative slope.

r = 0 implies no linear relationship between X and Y

r = +1 implies a perfect direct linear relationship between X and Y, that is, all the points (X,
Y) will fall on a straight line which has a positive slope.

It follows that a value of r near to -1 indicates a high degree of negative association, i.e., high
values of one variable are associated with low values of the other. The negative sign shows
that the relationship is inverse. On the other hand, a value of r close to +1 implies a high
degree of positive association, i.e., high values of one variable are associated with high values
of the other. The positive sign shows that the relationship is direct.

Note:

Rank Correlation Coefficient (P) is nothing but product moment correlation
coefficient between the ranks. Hence, it can be interpreted in the same way as r.
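
This statement can be verified on the marks of Table 12.1. The sketch below ranks each set of marks (rank 1 for the lowest mark), then computes the rank correlation coefficient in two ways: from the usual formula 1 - 6Σd² / (n(n² - 1)), which is assumed here, and by applying Formula 12.2 to the ranks. The helper functions are hypothetical, and the simple ranking used assumes that there are no tied values.

```python
import math


def ranks(values):
    """Rank 1 for the smallest value; assumes there are no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def pearson(xs, ys):
    """Product moment correlation coefficient by Formula 12.2."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2 = sum(x * x for x in xs)
    sy2 = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))


maths = [40, 68, 35, 52, 70]       # Table 12.1, students A to E
statistics = [65, 75, 35, 48, 50]  # Table 12.1, students A to E

rank_m, rank_s = ranks(maths), ranks(statistics)
n = len(maths)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_m, rank_s))

P_formula = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
P_from_ranks = pearson(rank_m, rank_s)
print(rank_m, rank_s, P_formula, P_from_ranks)
```

Both routes give 0.5 for these five students. With tied ranks the two calculations need not agree exactly.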

Correlation might exist between two variables and it could be strong, yet there is no
logical or causal relationship.

A Causal or Cause and effect relationship is said to exist between two variables if
change in one variable causes change in the other. For example: Age of Machine
(Cause) v/s. Maintenance Cost (effect); Rainfall (Cause) v/s Yield (effect).

Some relationships are even purely accidental. This is known as a spurious or nonsense
correlation. For example, average working hours per week and percentage of
fibre in diet: automation is responsible for cutting down the hours required to work
while medical awareness is causing diets to become healthier. The correlation in this
case will obviously be high but spurious. One would surely not wish to conclude that
healthy eating causes laziness!

Two series may vary together, being under the influence of other variable/s. You
might find a close relationship between jewellery sales and sales of colour TV sets.
Here, changes in both sales are probably a result of changes in consumer income.

Zero correlation does not always mean that there is no relationship between the
variables. All it says is that there is no linear relationship between the variables;
there may be a strong relationship, but of a non-linear kind.
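
The last point can be illustrated with a small Python sketch. The data are hypothetical:
eight points chosen to lie exactly on the circle x squared + y squared = 25, so Y is
strongly related to X, yet not linearly. The helper pearson_r is an illustrative
implementation of the product moment formula, not code from the textbook.

```python
import math

def pearson_r(xs, ys):
    """Product moment correlation coefficient from raw data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical points on the circle x^2 + y^2 = 25 (a non-linear relationship)
xs = [5, 3, 0, -3, -5, -3, 0, 3]
ys = [0, 4, 5, 4, 0, -4, -5, -4]
r_circle = pearson_r(xs, ys)

# For contrast: the same x values with an exact linear relationship y = 2x + 1
r_line = pearson_r(xs, [2 * x + 1 for x in xs])

print(r_circle, r_line)
```

Under these assumptions the circle gives r = 0 even though every point satisfies an exact
equation, while the straight line gives r = 1.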

Activity 5

Consider the following bivariate data

X -4 -3 -2 -1 1 2 3 4
Y 16 9 4 1 1 4 9 16

For the above data, compute the following:

$\sum X$, $\sum Y$, $\sum X^2$, $\sum Y^2$, $\sum XY$

Hence, calculate Karl Pearson's correlation coefficient, r.

What do you notice?
Based on the value of r, what conclusions can you make?
Plot a scatter diagram of Y against X.
Is there any relationship between Y and X?
Is there anything special about the form of the relationship between Y and X?
(HINT!! Compare Y with the computed values of $X^2$).
According to you, what important issue on the interpretation of r does this
activity bring out?

12.5 COEFFICIENT OF DETERMINATION

The Coefficient of Determination is equal to the square of the Coefficient of Correlation and
is denoted by $r^2$ ($0 \le r^2 \le 1$). It gives the percentage of variation in one variable
explained by variation in the other variable. For example, if r = 0.5, then
$r^2 = 0.25$, which implies that only 25 percent of the variation in Y (or X) is explained by the
variation in X (or Y), thereby indicating a very weak linear relationship between the two
variables.

Similarly, r = 0.7 seems to indicate a high positive correlation, yet in fact only
$(0.7)^2 = 0.49$, i.e. 49%, of the variation in one variable is explained by the variation in
the other. Hence, there is only a moderate relationship between the two variables.

Thus, we should take care in interpreting the values of the correlation coefficient.
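
The arithmetic in the two examples above can be reproduced with a short Python sketch. The
function name percent_explained is hypothetical; it simply squares r and expresses the
result as a percentage.

```python
def percent_explained(r):
    """Coefficient of determination r^2, expressed as a percentage."""
    return 100 * r ** 2

# The two values of r discussed in this section
for r in (0.5, 0.7):
    print(f"r = {r}: about {percent_explained(r):.0f}% of the variation is explained")
```

Because r is squared, the sign of the correlation does not affect the percentage: r = -0.7
would give the same 49%.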

Note: The concept of the coefficient of determination, its applications and uses are
further developed in Unit 13, in connection with regression analysis.

12.6 SUMMARY

In this unit, you have learnt about the linear relationships that exist between two variables
using scatter diagrams and the measures of correlation viz. Product Moment Correlation
Coefficient and the Coefficient of Rank Correlation. You should now be able to interpret the
results properly.

UNIT 13 LINEAR RELATIONSHIP BETWEEN VARIABLES - II:
REGRESSION

Unit Structure

13.0 Overview
13.2 Modelling the Relationship Between X and Y
13.3 The Simple Linear Regression
13.4 Functional Forms
13.5 Linear Regression and the Time Series
13.6 Coefficient of Determination
13.7 Summary

13.0 OVERVIEW

Regression analysis is used to study the nature of the relationship that exists between two or
more variables, as well as to serve as a basis for prediction. This unit focuses on computing
and analysing the linear regression that describes the relationship between two variables.


When you have successfully completed the Unit, you should be able to do the following:

1. Explain the purpose of regression analysis.
2. Compute, interpret and use the simple linear regression equation for elementary
forecasting purposes.
3. Interpret the coefficient of determination.

13.2 MODELLING THE RELATIONSHIP BETWEEN X AND Y

Consider the following example. A home repair service charges Rs 40 for the service call
and Rs 10 per hour spent on location. This situation can be modelled exactly as

Y = 40 + 10 X ............................ (Formula 13.1)

where Y (the labour charges) is the dependent variable and X (the number of hours) is the
independent variable.

Figure 13.1: Graph of Labour Charges v/s Number of Hours (horizontal axis: Number of Hours)

When a graph is drawn as shown in Fig. 13.1, you find that all the points lie on a straight line,
so that for a given value of X you can calculate an exact value for Y. For example, if a worker
spends 2 hours on a particular job, then the labour charges would be 40 + 10(2) = Rs 60. Such
a model is generally represented by $Y = \alpha + \beta X$ and referred to as an Exact Model.
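
A minimal Python sketch of this exact model is given below. The function name
labour_charge and its parameter names are hypothetical; the figures Rs 40 and Rs 10 per
hour are those of the example.

```python
def labour_charge(hours, call_fee=40, hourly_rate=10):
    """Exact model Y = 40 + 10X: labour charge in Rs for a given number of hours."""
    return call_fee + hourly_rate * hours

# Every point (X, Y) produced this way lies on the same straight line
for hours in (0, 1, 2, 3):
    print(hours, labour_charge(hours))
```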

But in real life, more often, we come across situations which cannot be modelled exactly as
discussed above. For example, consider the scatter diagram of advertising expenditure and
sales on page 453 of textbook (OJ) and reproduced below.

Figure 13.2: Graph of Sales v/s Advertising Expenditure, Brown's Department Store
(horizontal axis: Advertising (000); vertical axis: Sales (m))

It can be noted that although the relationship between advertising expenditure (X) and sales
(Y) may be strong, we cannot calculate sales exactly for a given advertising expenditure,
for example an expenditure of 71,000 or 102,000.

You note that in Figure 13.2, a straight line cannot be placed through all the points in the
scatter diagram. However it is plausible to suppose that there exists a linear relationship
between the two variables except that there are also some unpredictable deviations. So there is
a need to include an error term in our exact model.

Hence, in such situations, we can fit a statistical model of the form
$Y = \alpha + \beta X + \varepsilon$, where $\varepsilon$ is the error term.

When you read your textbook later, you shall learn more about the error term and how, by
minimising the sum of squares of the errors, we can obtain estimates of $\alpha$ and $\beta$.

13.3 THE SIMPLE LINEAR REGRESSION


The simple linear regression model is $Y = \alpha + \beta X + \varepsilon$

where Y is the dependent variable,
X is the independent variable,
$\alpha$ and $\beta$ are the population parameters,
and $\varepsilon$ is the random error term.

The fitted regression line of y on x is thus $\hat{y} = a + bx$, where a and b are estimates of
$\alpha$ and $\beta$ respectively.

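Pages 458-459 provide formulae for calculating a and b directly, rather than having to solve
the normal equations simultaneously. The normal equations (p. 456, OJ) were

$$\sum y = na + b\sum x$$
$$\sum xy = a\sum x + b\sum x^2$$

giving

$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

which can be further simplified to

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} \qquad (1)$$

and

$$a = \bar{y} - b\bar{x} \qquad (2)$$

where $\bar{x} = \frac{1}{n}\sum x$ and $\bar{y} = \frac{1}{n}\sum y$.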

(The formulae (1) and (2) are provided in exams).

Assumptions

There are a number of assumptions underlying the simple linear regression model but we
mention only some of them:

(i) The independent variable X is measured without errors.
(ii) The model has been correctly specified.
(iii) Errors are independently distributed, their standard deviation is
constant and the average of all errors is zero.

SOLVED EXAMPLE

We take up the data used in 12.3.1 to fit a regression line by the method of least squares.

The fitted regression line is $\hat{y} = a + bx$, where a and b are as above.

To compute a and b, we need $\sum x$, $\sum y$, $\sum x^2$ and $\sum xy$.

We refer to Table 12.5, where these quantities have already been computed:

We have n = 10,

$\sum x = 406$
$\sum y = 130$
$\sum x^2 = 18656$
$\sum xy = 6094$

Thus

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} = \frac{10(6094) - (406)(130)}{10(18656) - (406)^2} = 0.3756$$

and

$$a = \bar{y} - b\bar{x} = \frac{130}{10} - 0.3756\left(\frac{406}{10}\right) = -2.25$$

Therefore the fitted line is:

$$\hat{y} = -2.25 + 0.3756x$$
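
The solved example can be verified with a minimal Python sketch that applies formulae (1)
and (2) to the totals from Table 12.5. The function name least_squares_from_sums is
hypothetical.

```python
def least_squares_from_sums(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (a, b) for the fitted line y-hat = a + b*x, using formulae (1) and (2)."""
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # formula (1)
    a = sum_y / n - b * sum_x / n                                 # formula (2)
    return a, b

# Totals taken from Table 12.5
a, b = least_squares_from_sums(n=10, sum_x=406, sum_y=130, sum_x2=18656, sum_xy=6094)
print(f"a = {a:.2f}, b = {b:.4f}")
```

Note that the intercept a is negative here, since b times the mean of x (about 15.25) exceeds
the mean of y (13).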

Activity 1

Attempt Questions 23.3 and 23.5 of textbook (OJ).

Note: The equation $\hat{Y} = a + bX$ gives the line of regression of Y on X. It is
used to estimate or predict the value of Y for any given value of X,
i.e., when Y is the dependent variable and X is the independent variable.
Moreover, the estimate obtained will be the best one, since this regression
equation minimises the sum of squares of the errors (the method of
least squares).

Caution!!! This regression equation cannot be used to estimate or predict the value of X
for any given value of Y.
13.4 FUNCTIONAL FORMS


It may be possible to use the techniques you have learnt so far to fit other functional forms
which differ from the usual form, $Y = a + bX$ ...... (Formula 13.2), but which can easily be
converted to the form of Formula 13.2 by some appropriate transformation.

Example 1

Consider $Y = a + \frac{b}{X}$

If we let $Z = \frac{1}{X}$, i.e., treat $\frac{1}{X}$ as the independent variable, we get

$Y = a + bZ$

which is of the form of Formula 13.2 and can be fitted by the method of least squares.

Similarly, consider

$y = ab^x$

Taking ln ($\log_e$) on both sides, we get

$\ln y = \ln(ab^x)$

$= \ln a + \ln b^x$

$= \ln a + x \ln b$

i.e. $Y = A + xB$

where $Y = \ln y$ is the dependent variable, $A = \ln a$ and $B = \ln b$.

Try $y = ax^b$ and $y = ae^x$.

SOLVED EXAMPLE

The number y of bacteria per unit volume present in a culture after x hours is given in the
table below.

Number of Hours (x)                        0    1    2    3    4     5     6
Number of Bacteria per Unit Volume (y)    32   47   65   92   132   190   275

(a) It is suggested that for this type of situation, the curve

$y = A \cdot B^x$

where A and B are constants, gives a good fit.

Show that the curve above can be transformed into the usual standard linear regression
equation.

Hence obtain estimates of A and B (correct to 2 d.p).

(b) Compare the fitted values of y obtained from the above equation with the actual values
and comment on your findings.

(c) Estimate the value of y for x = 3.5 and x = 8.

Which one of the two estimates would you expect to be more reliable and why?

SOLUTION

(a) $y = A \cdot B^x$

Taking ln ($\log_e$) on both sides, we have

$\ln y = \ln(A \cdot B^x)$
$\ln y = \ln A + \ln B^x$

$\ln y = \ln A + x \ln B$ ........ (1)

Let

$Y = \ln y$, $\alpha = \ln A$ & $\beta = \ln B$ ........ (2)

Thus (1) is reduced to

$Y = \alpha + \beta x$ ........ (3)

which is the usual standard linear regression equation, and thus estimates of $\alpha$ and
$\beta$, say a and b respectively, can be obtained by the method of least squares.

 x      y    Y = ln y    x^2      x.Y
 0     32     3.466       0      0
 1     47     3.850       1      3.850
 2     65     4.174       4      8.349
 3     92     4.522       9     13.565
 4    132     4.883      16     19.531
 5    190     5.247      25     26.235
 6    275     5.617      36     33.701
21           31.759      91    105.231   (totals)

We have

$n = 7$, $\sum x = 21$, $\sum Y = 31.759$, $\sum x^2 = 91$ & $\sum xY = 105.231$.

Thus we have

$$b = \frac{n\sum xY - \sum x \sum Y}{n\sum x^2 - \left(\sum x\right)^2} = \frac{7(105.231) - (21)(31.759)}{7(91) - (21)^2} = 0.355$$

$$a = \bar{Y} - b\bar{x} = \frac{1}{7}\left(\sum Y - b\sum x\right) = \frac{1}{7}\left[31.759 - 0.355(21)\right] = 3.472$$

Hence, from (2),

$$\hat{A} = e^{a} = e^{3.472} = 32.20 \text{ (to 2 d.p.)}$$

$$\hat{B} = e^{b} = e^{0.355} = 1.43 \text{ (to 2 d.p.)}$$

so that

$$\hat{y} = 32.20\,(1.43)^x$$

(b)
x      y     $\hat{y} = 32.20\,(1.43)^x$
0     32      32.2
1     47      46.0
2     65      65.8
3     92      94.1
4    132     134.6
5    190     192.5
6    275     275.3

The curve $y = A \cdot B^x$ gives a very good fit for this type of situation.

(c) Try this part yourself.

Activity 2

The table below gives experimental values of the pressure P of a given mass of gas
corresponding to various values of the volume, V. According to thermodynamic principles, a
relationship of the type

$PV^a = c$

where a & c are constants should exist between the variables.

Volume, V      54.3   61.8   72.4   88.7   118.6   194.0
Pressure, P    61.2   49.5   37.6   28.4    19.2    10.1

(a) Using the method of least squares, estimate the values of a and c.
(b) Write down the equation connecting P and V.
(c) Estimate P when V = 100.0

13.5 LINEAR REGRESSION AND THE TIME SERIES


Activity 3

Attempt Question 23.8 of textbook (OJ).

13.6 COEFFICIENT OF DETERMINATION


Activity 4

For the data set used in 12.3.1,
(a) compute the coefficient of determination and interpret its value.
(b) compare with the value of the correlation coefficient that you previously calculated for
the data and comment.

13.7 SUMMARY

In this unit, you have learnt how to use sample data to fit the simple linear regression, relating
a dependent variable y to a single independent variable x, and also how it can be used for
prediction. Further, you have understood the importance of the coefficient of determination in
regression.
