Assignment No.1
Statistics in Psychology
(a) Statistics
Organize Data
When dealing with an enormous amount of information, it is all too easy to become
overwhelmed. Statistics allow psychologists to present data in ways that are easier to
comprehend. Visual displays such as graphs, pie charts, frequency distributions, and scatterplots
make it possible for researchers to get a better overview of the data and to look for patterns that
they might otherwise miss.
Describe Data
Think about what happens when researchers collect a great deal of information about a
group of people. The census is a great example. Using statistics, we can accurately describe the
information that has been gathered in a way that is easy to understand. Descriptive statistics
provide a way to summarize what already exists in a given population, such as how many men
and women there are, how many children there are, or how many people are currently employed.
By using what's known as inferential statistics, researchers can infer things about a given
sample or population. Psychologists use the data they have collected to test a hypothesis or a
guess about what they predict will happen. Using this type of statistical analysis, researchers can
determine the likelihood that a hypothesis should be either accepted or rejected.
There are two main branches of statistics: descriptive statistics and inferential statistics.
Both are employed in the scientific analysis of data, and both are equally important for the
student of statistics.
Descriptive Statistics
Descriptive statistics deals with the collection and presentation of data. This is usually the first
part of a statistical analysis. It is usually not as simple as it sounds: the statistician needs to
design the experiment carefully, choose the right focus group, and avoid the biases that can so
easily creep into an experiment.
Example
Different areas of study require different kinds of analysis using descriptive statistics. For
example, a physicist studying turbulence in the laboratory needs average quantities that vary over
small intervals of time. The nature of this problem requires that physical quantities be averaged
from a host of data collected through the experiment.
Inferential Statistics
Inferential statistics, as the name suggests, involves drawing the right conclusions from
the statistical analysis that has been performed using descriptive statistics. In the end, it is the
inferences that make studies important and this aspect is dealt with in inferential statistics.
While drawing conclusions, one needs to be very careful so as not to draw the wrong or
biased conclusions. Even though this appears like a science, there are ways in which one can
manipulate studies and results through various means.
Example
Data dredging is an increasingly common problem: computers hold vast amounts of information,
and it is easy, either intentionally or unintentionally, to apply the wrong inferential methods.
Both descriptive and inferential statistics go hand in hand and one cannot exist without
the other. Good scientific methodology needs to be followed in both these steps of statistical
analysis and both these branches of statistics are equally important for a researcher.
(d) Limitations of Statistics in Psychology
Statistics is indispensable to almost all sciences - social, physical and natural. It is very
often used in most of the spheres of human activity. In spite of the wide scope of the subject it
has certain limitations. Some important limitations of statistics are the following:
Statistics deals with facts and figures. So the quality aspect of a variable or the subjective
phenomenon falls out of the scope of statistics. For example, qualities like beauty, honesty,
intelligence etc. cannot be numerically expressed. So these characteristics cannot be examined
statistically. This limits the scope of the subject.
Statistical laws are not exact like the laws of the natural sciences. These laws are true only on
average. They hold well only under certain conditions and cannot be universally applied, which
limits the practical utility of statistics.
Statistics deals with aggregate of facts. Single or isolated figures are not statistics. This is
considered to be a major handicap of statistics.
Statistics is mostly a tool of analysis. Statistical techniques are used to analyse and interpret the
collected information in an enquiry. As it is, statistics does not prove or disprove anything. It is
just a means to an end. Statements supported by statistics are more appealing and are commonly
believed. For this, statistics is often misused. Statistical methods rightly used are beneficial but if
misused these become harmful. Statistical methods used by less expert hands will lead to
inaccurate results. Here the fault does not lie with the subject of statistics but with the person
who makes wrong use of it.
6. If sufficient care is not exercised in collecting, analyzing and interpreting the data,
statistical results might be misleading.
7. Only a person who has an expert knowledge of statistics can handle statistical data
efficiently.
8. Some errors are possible in statistical decisions. In particular, inferential statistics
involves certain errors, and we do not know whether an error has been committed or not.
(e) Data
Data, information and statistics are often misunderstood. They are actually different things, as
Figure shows.
Data can take various forms, but are often numerical. As such, data can relate to an enormous
variety of aspects.
Types of Data
Quantitative data deals with numbers and things you can measure objectively:
dimensions such as height, width, and length. Temperature and humidity. Prices. Area and
volume.
Qualitative data deals with characteristics and descriptors that can't be easily measured,
but can be observed subjectively such as smells, tastes, textures, attractiveness, and color.
Numerical data. These data have meaning as a measurement, such as a person's height,
weight, IQ, or blood pressure; or they're a count, such as the number of stock shares a
person owns, how many teeth a dog has, or how many pages you can read of your
favorite book before you fall asleep. (Statisticians also call numerical data quantitative
data.)
Numerical data can be further broken into two types: discrete and continuous.
o Discrete data represent items that can be counted; they take on possible values
that can be listed out. The list of possible values may be fixed (also called finite),
or it may go from 0, 1, 2, on to infinity (making it countably infinite). For
example, the number of heads in 100 coin flips takes on values from 0 through
100 (the finite case), but the number of flips needed to get 100 heads takes on values
from 100 (the fastest scenario) on up to infinity (if you never get to that 100th
head). Its possible values are listed as 100, 101, 102, 103 . . . (representing the
countably infinite case).
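The coin-flip cases above can be sketched in a few lines of Python (a minimal illustration; the variable names and the fixed random seed are ours):

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Discrete, finite case: the number of heads in 100 coin flips
# can only take one of the listed values 0, 1, 2, ..., 100.
heads_in_100 = sum(random.randint(0, 1) for _ in range(100))

# Discrete, countably infinite case: the number of flips needed to
# get the first head takes values 1, 2, 3, ... with no upper bound.
flips_needed = 1
while random.randint(0, 1) == 0:
    flips_needed += 1

print(heads_in_100, flips_needed)
```

Both quantities are counts, so both are discrete; only the list of possible values differs.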
Primary Data
Primary data means raw data (data without fabrication; not tailored data) which has just
been collected from the source and has not undergone any kind of statistical treatment such as
sorting and tabulation. The term primary data may sometimes be used to refer to firsthand
information. The sources of primary data are primary units such as basic experimental units,
individuals, and households. The following methods are usually used to collect data from primary
units; the choice of method depends on the nature of the primary unit. (Published data and data
collected in the past are called secondary data.)
Personal Investigation
Through Questionnaire
Through Telephone.
Through Internet
Secondary Data
Secondary data are data which have already been collected by someone else and may have been
sorted, tabulated, and subjected to statistical treatment; in other words, fabricated or tailored
data. Common sources include:
Government Organizations
Federal and Provincial Bureau of Statistics, Crop Reporting Service-Agriculture
Department, Census and Registration Organization etc.
Semi-Government Organization
Municipal committees, District Councils, Commercial and Financial Institutions like
banks etc.
Internet
(f) Constant
Constant, as its name suggests, is something that does not vary or change (or that may not
be susceptible to variation or change).
Properties
The search for constants is the ultimate goal in science. A scientific law (e.g. light in a
vacuum travels at 300,000 km/s) is a constant. As such, it allows us to predict
properties based on that law, to use it as a premise in the deductive sciences, and to use it as
an assumption in the inductive sciences.
A main task in experimental research, however, is to control the variability of all but the
research variables. This control is done by way of keeping all other variables as
constants. Once this is achieved, it is possible to ascertain the relationship between the
research variables, as these would be the only variables actually varying.
Any variable can be made into a constant by reducing its expression to only one of its
values.
For example, we could keep the temperature of a room constant at 35 °C during an
experiment. In this case, temperature stops being a variable.
A constant has no use in statistics; anything that is or remains constant cannot be
subjected to statistical analysis. Many researchers, however, may take this property to
mean that a given constant is not relevant for predicting something, which is not
always the case.
For example, 'income' is a variable that can vary between data units in a population (i.e. the
people or businesses being studied may not have the same incomes) and can also vary over time
for each data unit (i.e. income can go up or down).
Numeric Variables
Numeric variables have values that describe a measurable quantity as a number, like 'how many'
or 'how much'. Therefore numeric variables are quantitative variables.
Numeric variables may be further described as either continuous or discrete:
A continuous variable is a numeric variable. Observations can take any value between
a certain set of real numbers. The value given to an observation for a continuous variable
can include values as small as the instrument of measurement allows. Examples of
continuous variables include height, time, age, and temperature.
A discrete variable is a numeric variable. Observations can take a value based on a
count from a set of distinct whole values; a discrete variable cannot take the value of a
fraction between one value and the next closest value. Examples of discrete variables
include the number of registered cars and the number of children in a family.
Categorical Variables
Categorical variables have values that describe a quality or characteristic of a data unit. They
may be further described as either ordinal or nominal:
An ordinal variable is a categorical variable. Observations can take a value that can be
logically ordered or ranked. The categories associated with ordinal variables can be
ranked higher or lower than another, but do not necessarily establish a numeric difference
between each category. Examples of ordinal categorical variables include academic
grades (i.e. A, B, C), clothing size (i.e. small, medium, large, extra-large) and attitudes
(i.e. strongly agree, agree, disagree, strongly disagree).
A nominal variable is a categorical variable. Observations can take a value that is not
able to be organized in a logical sequence. Examples of nominal categorical variables
include sex, business type, eye color, religion and brand.
Dependent variable
The presumed effect in an experimental study. The values of the dependent variable
depend upon another variable, the independent variable. Strictly speaking, the term
dependent variable should not be used when writing about nonexperimental designs.
Confounding variable
A variable that obscures the effects of another variable. If one elementary reading
teacher used a phonics textbook in her class and another instructor used a whole language
textbook in his class, and students in the two classes were given achievement tests to see
how well they read, the independent variables (teacher effectiveness and textbooks)
would be confounded. There is no way to determine if differences in reading between the
two classes were caused by either or both of the independent variables.
Population
A population is a group of phenomena that have something in common. The term often
refers to a group of people, as in the following examples:
All persons who played golf at least once in the past year
Sample
A sample is a smaller group of members of a population selected to represent the population.
PART-1
Scales of Measurements
Nominal
Ordinal
Interval
Ratio
Nominal Scales
Here the numbers are used merely as names and have no quantitative value. Typically, a tackle
on the football team wears a number in the 70s. This number merely gives him a name. It does
not tell how many tackles he made, how fast he can run or if his team wins. Nominal scales are
the lowest levels of measurement. It is a naming scale and is used with categorical data.
Examples:
place of birth
political orientation
gender
types of sports
Nominal scales can use numbers to represent labels within a category, but the number does not
have the qualities of a true number; it is just a category label.
Example:
Ordinal Scales
This scale has the characteristic of the nominal scale in that different numbers mean different
things, but also has the characteristic of "greater or lesser". It measures a variable in terms of
magnitude, or rank.
Example:
socioeconomic class
grades
preferences
* Ordinal scales tell us relative order, but give us no information regarding differences between
the categories. For example, runners in the 100 meter dash finish 1st, 2nd, 3rd etc. Is the number of
seconds between 1st and 2nd place the same as those between 2nd and 3rd place? Certainly not
necessarily.
Interval Scales
This scale has the properties of the nominal and ordinal scales, but here the magnitude between
consecutive intervals is equal. Temperature is the example that is usually given to illustrate
an interval scale.
Example:
90 degrees is hotter than 45 degrees, and the difference between 60 and 70 degrees is the same
as the difference between 30 and 40 degrees.
* Interval scales do not have a true zero. 0 degrees does not mean the absence of heat (although
it might feel like it).
Example:
Attitude scales are sometimes considered to be interval scales.
Ratio Scales
Ratio scales have all of the characteristics of the nominal, ordinal and interval scales. In
addition, however, ratio scales have a true zero. This is the kind of scale that you used when you
learned arithmetic in grade school. You assumed that the numbers had meaning, that they had
rank order (3 is larger than 2), that the intervals between the consecutive numbers were equal and
that there was a zero. Four was twice two; eight was half of sixteen etc. There are true ratios.
One can use all mathematical operations on this scale.
Example:
weight
height
time
distance
1- Response time
QUANTITATIVE, CONTINUOUS
3- Favorite color
QUALITATIVE
4- Occupation aspired to
QUALITATIVE
7- Temperatures recorded
QUANTITATIVE, CONTINUOUS
Assignment No. 3
Frequency distribution
The frequency (f) of a particular observation is the number of times the observation
occurs in the data. The distribution of a variable is the pattern of frequencies of the observation.
Frequency distributions are portrayed as frequency tables, histograms, or polygons.
Frequency distributions can show either the actual number of observations falling in each
range or the percentage of observations. In the latter instance, the distribution is called a relative
frequency distribution.
OR
Frequency distribution tables can be used for both categorical and numeric variables;
continuous variables, however, should only be presented using class intervals. A frequency
distribution records the number of times a given quantity (or group of quantities) occurs in a set
of data. For example, the frequency distribution of income in a population would show how
many individuals (or households) have an income of a certain level (say, 5,000 a month). It is
plotted either as a step-column chart (histogram) or as a line chart (frequency polygon).
Steps of Construction
Example:
A survey was taken on Sector A-II. In each of 20 homes, people were asked how many
cars were registered to their households. The results were recorded as follows:
1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0
Use the following steps to present this data in a frequency distribution table.
1. Divide the results (x) into intervals, and then count the number of results in each interval.
In this case, the intervals would be the number of households with no car (0), one car (1),
two cars (2) and so forth.
2. Make a table with separate columns for the interval numbers (the number of cars per
household), the tallied results, and the frequency of results in each interval. Label these
columns Number of cars, Tally and Frequency.
3. Read the list of data from left to right and place a tally mark in the appropriate row. For
example, the first result is a 1, so place a tally mark in the row beside where 1 appears in
the interval column (Number of cars). The next result is a 2, so place a tally mark in the
row beside the 2, and so on. When you reach your fifth tally mark, draw a tally line
through the preceding four marks to make your final frequency calculations easier to
read.
4. Add up the number of tally marks in each row and record them in the final column
entitled Frequency.
Number of cars   Frequency
0                4
1                6
2                5
3                3
4                2
By looking at this frequency distribution table quickly, we can see that out of 20 households
surveyed, 4 households had no cars, 6 households had 1 car, etc.
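The tallying steps above can be sketched in Python using a standard counter (a minimal illustration of the same survey data; the variable names are ours):

```python
from collections import Counter

# Cars registered per household in the 20 surveyed homes (data above)
cars = [1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0]

freq = Counter(cars)  # tallies each distinct value automatically
for value in sorted(freq):
    print(value, freq[value])
# 0 4
# 1 6
# 2 5
# 3 3
# 4 2
```

The output matches the frequency column of the table above.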
Example
At a recent chess tournament, all 10 of the participants had to fill out a form that gave
their names, address and age. The ages of the participants were recorded as follows:
Use the following steps to present these data in a cumulative frequency distribution table.
1. Divide the results into intervals, and then count the number of results in each interval. In
this case, intervals of 10 are appropriate. Since 36 is the lowest age and 92 is the highest
age, start the intervals at 35 to 44 and end the intervals with 85 to 94.
2. Create a table similar to the frequency distribution table but with three extra columns.
In the first column or the Lower value column, list the lower value of the result
intervals. For example, in the first row, you would put the number 35.
The next column is the Upper value column. Place the upper value of the result
intervals. For example, you would put the number 44 in the first row.
The third column is the Frequency column. Record the number of times a result
appears between the lower and upper values. In the first row, place the number 1.
The fourth column is the Cumulative frequency column. Here we add the
cumulative frequency of the previous row to the frequency of the current row.
Since this is the first row, the cumulative frequency is the same as the frequency.
However, in the second row, the frequency for the 35-44 interval (i.e., 1) is
added to the frequency for the 45-54 interval (i.e., 2). Thus, the cumulative
frequency is 1 + 2 = 3, meaning we have 3 participants in the 35 to 54 age group.
The next column is the Percentage column. In this column, list the percentage of
the frequency. To do this, divide the frequency by the total number of results and
multiply by 100. In this case, the frequency of the first row is 1 and the total
number of results is 10. The percentage would then be (1 / 10) x 100 = 10.0.
The final column is Cumulative percentage. In this column, divide the cumulative
frequency by the total number of results and then to make a percentage, multiply
by 100. Note that the last number in this column should always equal 100.0. In
this example, the cumulative frequency is 1 and the total number of results is 10,
therefore the cumulative percentage of the first row is 10.0.
Lower value   Upper value   Frequency   Cumulative frequency   Percentage   Cumulative percentage
35            44            1           1                      10.0         10.0
45            54            2           3                      20.0         30.0
55            64            2           5                      20.0         50.0
65            74            2           7                      20.0         70.0
75            84            2           9                      20.0         90.0
85            94            1           10                     10.0         100.0
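The column-by-column procedure above can be sketched as a short loop (the interval frequencies are taken from the table above; the variable names are ours):

```python
# Frequencies per age interval from the chess-tournament example above
intervals = [(35, 44, 1), (45, 54, 2), (55, 64, 2),
             (65, 74, 2), (75, 84, 2), (85, 94, 1)]
total = sum(f for _, _, f in intervals)  # 10 participants

cumulative = 0
rows = []
for lower, upper, f in intervals:
    cumulative += f  # running total of frequencies so far
    rows.append((lower, upper, f, cumulative,
                 round(100 * f / total, 1),        # percentage
                 round(100 * cumulative / total, 1)))  # cumulative percentage

for row in rows:
    print(row)
```

Note that the last cumulative percentage always works out to 100.0, as the text says.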
Example
Class interval   Frequency
20-25            10
25-30            12
30-35            8
35-40            20
40-45            11
45-50            4
50-55            5
Solution:
Relative frequency distribution table for the given data.
Here n = 70
Class interval   Frequency   Relative frequency
20-25            10          10 / 70 = 0.143
25-30            12          12 / 70 = 0.171
30-35            8           8 / 70 = 0.114
35-40            20          20 / 70 = 0.286
40-45            11          11 / 70 = 0.157
45-50            4           4 / 70 = 0.057
50-55            5           5 / 70 = 0.071
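The relative-frequency calculation (each class frequency divided by n = 70) can be sketched as a dictionary comprehension; the variable names are ours:

```python
# Class frequencies from the example above
freq = {"20-25": 10, "25-30": 12, "30-35": 8, "35-40": 20,
        "40-45": 11, "45-50": 4, "50-55": 5}

n = sum(freq.values())  # total number of observations
relative = {interval: round(count / n, 3) for interval, count in freq.items()}

print(n)                  # 70
print(relative["35-40"])  # 0.286
```

The relative frequencies sum to 1 (up to rounding), which is a quick sanity check on the table.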
Total n = 70
A grouped frequency distribution is an ordered listing of a variable X, into groups in one
column, with a listing in a second column, the frequency column. A grouped frequency
distribution is an arrangement of class intervals and corresponding frequencies in a table.
There are certain rules to be remembered while constructing a grouped frequency distribution.
For Example: A teacher gave a test to a class of 26 students. The maximum mark is 5. The
marks obtained by the pupils are:
3 2 3 3 4 3 1 2 5
1 5 4 2 1 1 3 3 4
1 2 1 4 5 4 2 2
Arranged in ascending order, the marks are:
1 1 1 1 1 1 2 2 2 2 2 2
3 3 3 3 3 3 4 4 4 4 4 5 5 5
The difference between the greatest and the smallest number is called range of the data. Thus for
the above data, the range is 5 - 1 which equals 4 marks.
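The frequency counts and the range for the marks above can be checked with a short Python sketch (variable names are ours):

```python
from collections import Counter

# The 26 test marks from the example above
marks = [3, 2, 3, 3, 4, 3, 1, 2, 5,
         1, 5, 4, 2, 1, 1, 3, 3, 4,
         1, 2, 1, 4, 5, 4, 2, 2]

freq = Counter(marks)
print(dict(sorted(freq.items())))  # {1: 6, 2: 6, 3: 6, 4: 5, 5: 3}

# Range = greatest value minus smallest value
print(max(marks) - min(marks))  # 4
```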
Assignment No. 4
Graphs
Graphs are used to show a relationship between the independent variable and
the dependent variable. The independent variable is typically on the x-axis (horizontal line
or abscissa) of a graph and the dependent variable is typically on the y-axis (vertical line
or ordinate) of a graph. Caution should be taken when drawing a graph due to the horizontal-
vertical illusion. This illusion makes vertical lines appear longer than horizontal lines. Graphs
can be used to display information that has been summarized in a frequency distribution. Graphs
should make the data easier to understand. In this case, the dependent variable is placed on the x-
axis and the frequency is on the y-axis. Graphs are used to illustrate trends and to help predict
the future.
Graphs should always contain a descriptive Title that informs us what kind of information is
being conveyed. Labels on both axes tell us what is being measured. The numbers along the
vertical axis tell us in what increments the measurements are being reported. A graph Key is
used to identify two or more different dependent or independent variables. The most common graphs
are histograms and polygons. Both are discussed below.
II. Polygon
Polygons consist of points on a graph with lines connecting them. A Polygon uses a
single point rather than a bar to represent an interval on a graph. They use the midpoint of the
interval as the single point plotted. The polygon should begin and end at the abscissa. Polygons
can plot Frequency or Relative Frequency on the y-axis.
A. Frequency Polygon.
The steps to generating a Frequency Polygon and an example of a Frequency Polygon (without a
title) are listed below:
Step 2: add two extra intervals: one below the lowest interval and one above the highest
interval
Step 4: plot the frequency for each of the midpoints on the graph
B. Relative Frequency Polygon.
Relative frequency is used to compare two distributions that have different numbers of
subjects. Relative frequency can be graphed as a Relative Frequency Polygon. Relative
frequency polygons are created in the same manner as the frequency polygon. The only
difference being that you use relative frequency instead of frequency values. The graph below is
an example of a Relative Frequency Polygon:
III. Cumulative Polygon
Cumulative polygons graph the data in a distribution that fall below a particular score. Cumulative
Polygons are created in the same manner as the polygon. The only difference being that you use
cumulative values and the upper real limits of the intervals are used instead of the
midpoints. The s-shaped curve of the graph below is called an ogive, pronounced oh-jive.
A. Cumulative Frequency Polygon.
Cumulative frequency polygons graph the number of subjects in a distribution that fall
below a particular score. Cumulative Frequency Polygons are created in the same manner as the
frequency polygon. The only difference being that you use cumulative frequency values and
the upper real limits of the intervals are used instead of the midpoints. The graph below is an
example of a Cumulative Frequency Polygon:
B. Cumulative Relative Frequency Polygon.
Cumulative Relative Frequency Polygons are created in the same manner as the
cumulative frequency polygon with the only difference being that you use cumulative relative
frequency values instead of cumulative frequency on the y-axis. The graph below is an example
of a Cumulative Relative Frequency Polygon:
C. Cumulative Percent Polygons:
Cumulative Percent Polygons are created in the same manner as the cumulative
frequency polygon and the cumulative relative frequency polygon with the only difference being
that you use cumulative percent values on the y-axis. Remember that percent is simply relative
frequency multiplied by 100. The graph below is an example of a Cumulative Percent Polygon:
Stem and Leaf Diagrams allow you to display raw data visually. Each raw score is divided into
a stem and a leaf. The leaf is typically the last digit of the raw value. The stem is the remaining
digits of the raw value. To generate a stem and leaf diagram you must first create a vertical
column that contains all of the stems. Then list each leaf next to the corresponding stem. In
these diagrams, all of the scores are represented in the diagram without the loss of any
information. The graph below is an example of a Stem and Leaf Diagram:
Stem Leaf
2 2 3 5 5 5 7
3 1 1 4 6
4 0 0 4 5 6 7 7 8 9
5 0 1 1
6 3 3 5 6 7
7 7 8 9
8 1 1 2 6 9 9 9
9 5
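The stem-and-leaf construction can be sketched in Python. The raw scores below are reconstructed from the first two rows of the diagram above (stem = tens digit, leaf = units digit):

```python
from collections import defaultdict

# Raw scores recovered from the first two stems of the diagram above
scores = [22, 23, 25, 25, 25, 27, 31, 31, 34, 36]

stems = defaultdict(list)
for score in scores:
    stems[score // 10].append(score % 10)  # split into stem and leaf

for stem in sorted(stems):
    print(stem, *sorted(stems[stem]))
# 2 2 3 5 5 5 7
# 3 1 1 4 6
```

As the text notes, every raw score can be read back off the diagram, so no information is lost.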
Bar Graphs
Bar Graphs use a free standing bar to represent the data associated with each level of the
independent variable. Bar graphs are preferable when the levels or groups of the independent
variable are true categories or nominal or ordinal classifications. The example below shows a bar
graph:
Line Graphs
Line Graphs connect the representative data point associated with each level of the independent
variable. Line graphs are preferable when the levels or groups of the independent variable are
quantitative or interval or ratio classifications. The example below shows a line graph:
Pictogram
A pictogram or pictograph represents the frequency of data as pictures or symbols. Each picture
or symbol may represent one or more units of the data.
Example:
The following table shows the number of computers sold by a company for the months January
to March. Construct a pictograph for the table.
Month                 January   February   March
Number of computers   25        35         20
Solution:
January
February
March
Assignment No. 5
Definition:
Measures of central tendency are numbers that describe what is average or typical within
a distribution of data. There are three main measures of central tendency: mean, median, and
mode. While they are all measures of central tendency, each is calculated differently and
measures something different from the others.
MEAN
The mean is the most common measure of central tendency used by researchers and people in all
kinds of professions.
It is the measure of central tendency that is also referred to as the average. A researcher
can use the mean to describe the data distribution of variables measured as intervals or ratios.
These are variables that include numerically corresponding categories or ranges (like race, class,
gender, or level of education), as well as variables measured numerically from a scale that begins
with zero (like household income or the number of children within a family).
A mean is very easy to calculate. One simply has to add all the data values or "scores" and then
divide this sum by the total number of scores in the distribution of data. For example, if five
families have 0, 2, 2, 3, and 5 children respectively, the mean number of children is (0 + 2 + 2 +
3 + 5)/5 = 12/5 = 2.4. This means that the five households have an average of 2.4 children.
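The arithmetic above can be sketched in a couple of lines of Python (variable names are ours):

```python
# Mean number of children across the five families in the example above
children = [0, 2, 2, 3, 5]
mean = sum(children) / len(children)  # sum of scores / number of scores
print(mean)  # 2.4
```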
The mean can be used for both continuous and discrete numeric data.
Limitations of the mean:
The mean cannot be calculated for categorical data, as the values cannot be summed.
As the mean includes every value in the distribution the mean is influenced by outliers
and skewed distributions.
The population mean is indicated by the Greek symbol μ (pronounced "mu"). When the mean is
calculated on a distribution from a sample it is indicated by the symbol x̄ (pronounced "X-bar").
MEDIAN
The median is the value at the middle of a distribution of data when those data are organized
from the lowest to the highest value.
This measure of central tendency can be calculated for variables that are measured with
ordinal, interval or ratio scales.
Calculating the median is also rather simple. Let's suppose we have the following list of
numbers: 5, 7, 10, 43, 2, 69, 31, 6, 22. First, we must arrange the numbers in order from lowest
to highest.
The result is this: 2, 5, 6, 7, 10, 22, 31, 43, 69. The median is 10 because it is the exact middle
number. There are four numbers below 10 and four numbers above 10.
If a data distribution has an even number of cases, which means that there is no exact
middle, you simply adjust the procedure slightly in order to calculate the median. For example,
if we add the number 87 to the end of our list of numbers above, we have 10 total numbers in our
distribution, so there is no single middle number. In this case, one takes the average of the scores
for the two middle numbers. In our new list, the two middle numbers are 10 and 22. So, we take
the average of those two numbers: (10 + 22) /2 = 16. Our median is now 16.
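The odd and even cases above can be sketched as a small helper function (the function name is ours, not part of the text):

```python
def median(values):
    """Middle value of a sorted list; average of the two middle values if even-length."""
    s = sorted(values)          # arrange from lowest to highest
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]           # odd count: exact middle number
    return (s[mid - 1] + s[mid]) / 2  # even count: mean of the two middle numbers

print(median([5, 7, 10, 43, 2, 69, 31, 6, 22]))      # 10
print(median([5, 7, 10, 43, 2, 69, 31, 6, 22, 87]))  # 16.0
```

The two calls reproduce the worked examples: 10 for the nine-number list, and 16 once 87 is appended.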
The median is less affected by outliers and skewed data than the mean, and is usually the
preferred measure of central tendency when the distribution is not symmetrical.
Limitation of the median:
The median cannot be identified for categorical nominal data, as it cannot be logically
ordered.
MODE
The mode is the measure of central tendency that identifies the category or score that occurs the
most frequently within the distribution of data.
In other words, it is the most common score or the score that appears the highest number
of times in a distribution. The mode can be calculated for any type of data, including those
measured as nominal variables, or by name.
For example, let's say we are looking at pets owned by 100 families and the distribution looks
like this:
Pet       Number of families
Dog       60
Cat       35
Fish      17
Hamster   13
Snake     3
The mode here is "dog" since more families own a dog than any other animal. Note that the
mode is always expressed as the category or score, not the frequency of that score. For instance,
in the above example, the mode is "dog," not 60, which is the number of times dog appears.
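Finding the mode of categorical data like this is a one-liner with a counter (a sketch of the pet example above; variable names are ours):

```python
from collections import Counter

# Pet ownership counts from the example above
pets = Counter({"Dog": 60, "Cat": 35, "Fish": 17, "Hamster": 13, "Snake": 3})

mode_category, mode_count = pets.most_common(1)[0]
print(mode_category)  # Dog  (the mode is the category, not the count 60)
```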
Some distributions do not have a mode at all. This happens when each category has the same
frequency. Other distributions might have more than one mode. For example, when a distribution
has two scores or categories with the same highest frequency, it is often referred to as "bimodal."
1. The mode has an advantage over the median and the mean as it can be found for
both numerical and categorical (non-numerical) data.
For example, the mode of the following dataset is 54, the most frequently occurring value:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
2. It is also possible for there to be more than one mode for the same distribution of data, (bi-
modal, or multi-modal). The presence of more than one mode can limit the ability of the mode in
describing the center or typical value of the distribution because a single value to describe the
center cannot be identified.
3. In some cases, particularly where the data are continuous, the distribution may have no mode
at all (i.e. if all values are different).
4. In cases such as these, it may be better to consider using the median or mean, or group the data
in to appropriate intervals, and find the modal class.
Symmetrical distributions:
When a distribution is symmetrical, the mode, median and mean are all in the middle of the
distribution. The following graph shows a larger retirement age dataset with a distribution which
is symmetrical. The mode, median and mean all equal 58 years.
Skewed distributions:
When a distribution is skewed the mode remains the most commonly occurring value, the
median remains the middle value in the distribution, but the mean is generally pulled in the
direction of the tails. In a skewed distribution, the median is often a preferred measure of central
tendency, as the mean is not usually in the middle of the distribution.
A distribution is said to be positively or right skewed when the tail on the right side of the
distribution is longer than the left side. In a positively skewed distribution it is common for the
mean to be pulled toward the right tail of the distribution. Although there are exceptions to this
rule, generally, most of the values, including the median value, tend to be less than the mean
value.
The following graph shows a larger retirement age data set with a distribution which is right
skewed. The data has been grouped into classes, as the variable being measured (retirement age)
is continuous. The mode is 54 years, the modal class is 54-56 years, the median is 56 years and
the mean is 57.2 years.
A distribution is said to be negatively or left skewed when the tail on the left side of the
distribution is longer than the right side. In a negatively skewed distribution, it is common for the
mean to be pulled toward the left tail of the distribution. Although there are exceptions to this
rule, generally, most of the values, including the median value, tend to be greater than the mean
value.
The following graph shows a larger retirement age dataset with a distribution which is left skewed.
The mode is 65 years, the modal class is 63-65 years, the median is 63 years and the mean is
61.8 years.
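The pattern described above can be sketched in Python with a small hypothetical right-skewed dataset (the values are illustrative, not the grouped data behind the graphs): the mode sits below the median, which in turn sits below the mean.

```python
from statistics import mean, median, multimode

# Hypothetical right-skewed retirement ages (illustrative values only):
# a few late retirees form a long right tail that pulls the mean upward.
right_skewed = [54, 54, 54, 55, 56, 57, 58, 60, 65, 70]

print(multimode(right_skewed))  # [54]
print(median(right_skewed))     # 56.5
print(mean(right_skewed))       # 58.3  -> mode < median < mean
```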
Importance
Measures of central tendency are very useful in statistics. They are important for the following reasons:
Measures of central tendency or averages give us one value for the distribution and this value
represents the entire distribution. In this way averages convert a group of figures into one value.
Collected and classified figures are vast. To condense these figures we use an average. An average converts the whole set of figures into just one figure and thus helps in condensation.
To compare two or more distributions, we have to find representative values for them. These representative values are found with the help of measures of central tendency.
Measures of Variability
Definition
In addition to figuring out the measures of central tendency, we may need to summarize
the amount of variability we have in our distribution. In other words, we need to determine if the
observations tend to cluster together or if they tend to be spread out. Consider, for example, two samples of five observations each. Both samples have identical means (5) and an identical number of observations (n = 5), but the amount of variation between the two samples differs considerably: Sample 2 has no variability (all scores are exactly the same), whereas Sample 1 has relatively more (one case varies substantially from the other four). In this course, we will be going over four measures of variability: the range, the inter-quartile range (IQR), the variance and the standard deviation.
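A minimal Python sketch of this comparison, with assumed values for the two samples (the original table is not reproduced here; Sample 1 is taken as [4, 4, 4, 4, 9] and Sample 2 as [5, 5, 5, 5, 5], both with a mean of 5):

```python
from statistics import mean, pstdev

# Illustrative samples: identical means and n, very different spreads.
sample_1 = [4, 4, 4, 4, 9]   # one case varies substantially from the rest
sample_2 = [5, 5, 5, 5, 5]   # no variability at all

print(mean(sample_1), mean(sample_2))      # 5 5
print(pstdev(sample_1), pstdev(sample_2))  # 2.0 0.0
```

A spread of 2.0 versus 0.0 is exactly the difference the mean alone cannot reveal.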
The Range
The range is the most obvious measure of dispersion and is the difference between the lowest
and highest values in a dataset. In figure 1, the size of the largest semester 1 tutorial group is 6
students and the size of the smallest group is 4 students, resulting in a range of 2 (6-4). In
semester 2, the largest tutorial group size is 7 students and the smallest tutorial group contains 3
students, therefore the range is 4 (7-3).
The range is simple to compute and is useful when you wish to evaluate the whole of a
dataset.
The range is useful for showing the spread within a dataset and for comparing the spread
between similar datasets.
An example of the use of the range to compare spread within datasets is provided in table 1. The
scores of individual students in the examination and coursework component of a module are
shown.
To find the range in marks the highest and lowest values need to be found from the table.
The highest coursework mark was 48 and the lowest was 27, giving a range of 21. In the examination, the highest mark was 45 and the lowest 12, producing a range of 33. This indicates that there was wider variation in the students' performance in the examination than in the coursework for this module.
Since the range is based solely on the two most extreme values within the dataset, if one of these is either exceptionally high or low (sometimes referred to as an outlier) it will result in a range that is not typical of the variability within the dataset. For example, imagine in the above example that one student failed to hand in any coursework and was awarded a mark of zero, but sat the exam and scored 40. The range for the coursework marks would now become 48 (48 - 0) rather than 21; this new range is not typical of the dataset as a whole and is distorted by the outlier in the coursework marks. In order to reduce the problems caused by outliers in a dataset, the inter-quartile range is often calculated instead of the range.
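The effect of the outlier can be sketched in Python using only the extreme marks quoted above (the full lists of marks are not given in the text):

```python
def value_range(values):
    """Range = difference between the largest and smallest values."""
    return max(values) - min(values)

coursework = [27, 48]   # only the extremes are quoted in the text
exam = [12, 45]

print(value_range(coursework))  # 21
print(value_range(exam))        # 33

# One student hands in nothing: a single zero distorts the range.
coursework_with_outlier = coursework + [0]
print(value_range(coursework_with_outlier))  # 48
```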
The Inter-Quartile Range
The inter-quartile range is a measure that indicates the extent to which the central 50% of
values within the dataset are dispersed. It is based upon, and related to, the median.
In the same way that the median divides a dataset into two halves, it can be further
divided into quarters by identifying the upper and lower quartiles. The lower quartile is found
one quarter of the way along a dataset when the values have been arranged in order of
magnitude; the upper quartile is found three quarters along the dataset. Therefore, the upper
quartile lies half way between the median and the highest value in the dataset whilst the lower
quartile lies halfway between the median and the lowest value in the dataset. The inter-quartile
range is found by subtracting the lower quartile from the upper quartile.
For example, the examination marks for 20 students following a particular module are arranged
in order of magnitude.
The median lies at the mid-point between the two central values (10th and 11th)
The lower quartile lies at the mid-point between the 5th and 6th values
The upper quartile lies at the mid-point between the 15th and 16th values
The inter-quartile range for this dataset is therefore 70.5 - 52.5 = 18 whereas the range is: 80 - 43
= 37.
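The positions described above can be sketched in Python. The 20 individual marks are not reproduced in the text, so the dataset below is hypothetical, chosen to match the quoted quartile values (52.5 and 70.5) and range (43 to 80):

```python
# Hypothetical sorted examination marks for 20 students.
marks = sorted([43, 45, 48, 50, 52, 53, 55, 56, 58, 60,
                61, 63, 65, 68, 70, 71, 73, 75, 78, 80])

# With n = 20: lower quartile = midpoint of the 5th and 6th values,
# upper quartile = midpoint of the 15th and 16th (0-indexed below).
lower_q = (marks[4] + marks[5]) / 2
upper_q = (marks[14] + marks[15]) / 2

print(lower_q, upper_q)             # 52.5 70.5
print(upper_q - lower_q)            # 18.0 (inter-quartile range)
print(max(marks) - min(marks))      # 37  (range)
```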
The inter-quartile range provides a clearer picture of the overall dataset by removing/ignoring the
outlying values.
Like the range, however, the inter-quartile range is a measure of dispersion that is based upon only two values from the dataset. Statistically, the standard deviation is a more powerful measure of dispersion because it takes into account every value in the dataset. The standard deviation is explored later in this guide.
The Variance
The variance is a measure of variability that represents on how far each observation falls from
the mean of the distribution. For this example, we'll be using the following five numbers, which
represent my total monthly comic book purchases over the last five months:
2, 3, 5, 6, 9
The formula for calculating a sample variance is usually written out like this:

s²ₓ = Σ(x - x̄)² / (N - 1)

This equation looks intimidating, but it's not that bad once you break it down into its component parts. s²ₓ is the notation used to denote the variance of a sample. That giant sigma (Σ) is a summation sign; it just means we're going to be adding things together. The x represents each of our observations, and the x with a line over it (x̄, often called "x-bar") represents the mean of our distribution. The capital N in the denominator is the total number of observations. Basically, this formula is telling us to subtract the mean from each of our observations, square the differences, add them all together and divide by N - 1. Let's do an example using the above numbers.
1. The first step in calculating the variance is finding the mean of the distribution. In this case,
the mean is 5 (2+3+5+6+9 = 25; 25/5 = 5).
2. The second step is to subtract the mean (5) from each of the observations:
2-5 = -3
3-5 = -2
5-5 = 0
6-5 = 1
9-5 = 4
Please note: we can check our work after this step by adding all of our values together. If they
sum to zero, we know we're on the right track. If they add up to something besides zero, we
should probably check our math again (-3+-2+0+1+4 = 0, we're golden).
3. Third, we square each of those answers to get rid of the negative numbers:
(-3)² = 9
(-2)² = 4
(0)² = 0
(1)² = 1
(4)² = 16
4. Fourth, we add the squared differences together:
9 + 4 + 0 + 1 + 16 = 30
5. Finally, we divide by N - 1 (here 5 - 1 = 4) to get the variance:
30 / 4 = 7.5
After all those rather tedious calculations, we're left with a single number that quickly and
succinctly summarizes the amount of variability in our distribution. The bigger the number, the
more variability we have in our distribution. Please note: a variance can never be negative. If you
come up with a variance that's less than zero, you've done something wrong.
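The worked example above can be reproduced in Python with the standard-library statistics module:

```python
from statistics import mean, variance

# The five monthly comic book purchases from the worked example.
purchases = [2, 3, 5, 6, 9]

# Step by step: subtract the mean, square, sum, divide by N - 1.
m = mean(purchases)                       # 5
deviations = [x - m for x in purchases]   # [-3, -2, 0, 1, 4] (sums to 0)
squared = [d ** 2 for d in deviations]    # [9, 4, 0, 1, 16]
s2 = sum(squared) / (len(purchases) - 1)  # 30 / 4 = 7.5

print(s2)                   # 7.5
print(variance(purchases))  # 7.5 (same result from the library function)
```

The intermediate check from step 2 holds here too: the deviations sum to zero before squaring.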
The Standard Deviation
The standard deviation is a measure that summarizes the amount by which every value within a
dataset varies from the mean. Effectively it indicates how tightly the values in the dataset are
bunched around the mean value. It is the most robust and widely used measure of dispersion
since, unlike the range and inter-quartile range, it takes into account every variable in the dataset.
When the values in a dataset are pretty tightly bunched together the standard deviation is small.
When the values are spread apart the standard deviation will be relatively large. The standard
deviation is usually presented in conjunction with the mean and is measured in the same units.
In many datasets the values deviate from the mean value due to chance and such datasets are said
to display a normal distribution. In a dataset with a normal distribution most of the values are
clustered around the mean while relatively few values tend to be extremely high or extremely
low. Many natural phenomena display a normal distribution.
For datasets that have a normal distribution, the standard deviation can be used to determine the proportion of values that lie within a particular range of the mean value. For such distributions, approximately 68% of values lie within one standard deviation (1SD) of the mean, approximately 95% lie within two standard deviations (2SD), and approximately 99.7% lie within three standard deviations (3SD). Figure 3 shows this concept in diagrammatical form.
If the mean of a dataset is 25 and its standard deviation is 1.6, then about 68% of the values in the dataset will lie between MEAN - 1SD (25 - 1.6 = 23.4) and MEAN + 1SD (25 + 1.6 = 26.6).
If the dataset had the same mean of 25 but a larger standard deviation (for example, 2.3) it
would indicate that the values were more dispersed. The frequency distribution for a dispersed
dataset would still show a normal distribution but when plotted on a graph the shape of the curve
will be flatter as in figure 4.
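Using the standard library's NormalDist, the following sketch computes the proportion of a normal distribution with mean 25 and standard deviation 1.6 (the values from the example above) that falls within one, two and three standard deviations of the mean, recovering the approximate 68%, 95% and 99.7% figures:

```python
from statistics import NormalDist

# Normal distribution with the mean and SD from the example above.
nd = NormalDist(mu=25, sigma=1.6)

for k in (1, 2, 3):
    lower, upper = 25 - k * 1.6, 25 + k * 1.6
    proportion = nd.cdf(upper) - nd.cdf(lower)
    print(f"within {k} SD ({lower:.1f} to {upper:.1f}): {proportion:.4f}")
```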
There are two different calculations for the Standard Deviation. Which formula you use
depends upon whether the values in your dataset represent an entire population or whether they
form a sample of a larger population. For example, if all student users of the library were asked
how many books they had borrowed in the past month then the entire population has been
studied since all the students have been asked. In such cases the population standard deviation
should be used. Sometimes it is not possible to find information about an entire population and it
might be more realistic to ask a sample of 150 students about their library borrowing and use
these results to estimate library borrowing habits for the entire population of students. In such
cases the sample standard deviation should be used.
While it is not necessary to learn the formula for calculating the standard deviation, there
may be times when you wish to include it in a report or dissertation.
The standard deviation of an entire population is known as σ (sigma) and is calculated using:

σ = √( Σ(x - μ)² / N )

where x represents each value in the population, μ is the mean value of the population, Σ is the summation (or total), and N is the number of values in the population.
The standard deviation of a sample is known as s and is calculated using:

s = √( Σ(x - x̄)² / (n - 1) )

where x represents each value in the sample, x̄ is the mean value of the sample, Σ is the summation (or total), and n is the number of values in the sample (so n - 1 is the number of values in the sample minus 1).
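The two formulas can be sketched in Python, whose statistics module provides both versions; the borrowing counts below are assumed for illustration:

```python
from statistics import pstdev, stdev

# Assumed library borrowing counts for a small group of students.
books = [2, 4, 4, 4, 5, 5, 7, 9]

# If these eight students are the entire population, divide by N:
print(pstdev(books))  # 2.0 (population standard deviation)

# If they are a sample of a larger population, divide by n - 1:
print(stdev(books))   # ~2.14 (sample standard deviation)
```

The sample version is always slightly larger, since dividing by n - 1 compensates for estimating the mean from the same sample.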
Main Points
Measures of central tendency tell us what is common or typical about our variable.
Three measures of central tendency are the mode, the median and the mean.
The mode is used almost exclusively with nominal-level data, as it is the only measure of
central tendency available for such variables. The median is used with ordinal-level data
or when an interval/ratio-level variable is skewed (think of the Bill Gates example). The
mean can only be used with interval/ratio level data.
Measures of variability are numbers that describe how much variation or diversity there
is in a distribution.
Four measures of variability are the range (the difference between the largest and smallest observations), the interquartile range (the difference between the 75th and 25th percentiles), the variance and the standard deviation.
The variance and standard deviation are two closely related measures of variability for
interval/ratio-level variables that increase or decrease depending on how closely the
observations are clustered around the mean.
Importance
An important use of statistics is to measure variability or the spread of data. For example,
two measures of variability are the standard deviation and the range. The standard deviation
measures the spread of data from the mean or the average score. Once the standard deviation is
known other linear transformations may be conducted on the raw data to obtain other scores that
provide more information about variability such as the z score and the T score. The range,
another measure of spread, is simply the difference between the largest and smallest data values.
The range is the simplest measure of variability to compute.
The standard deviation can be an effective tool for teachers, particularly in analyzing classroom test results. A large standard deviation might tell a teacher that the class grades were spread a great distance from the mean; a small standard deviation might reflect the opposite. In analyzing test results, a teacher can assume that with a small standard deviation the students understood the material uniformly, while with a large standard deviation there is a large amount of variation in the students' understanding of the material tested.
One of the limits of statistics can be found in calculating the range. Since outliers are used to determine the range, they are very influential in the resulting statistic. For example, if one class had test grades of 0%, 50%, and 100%, and another class had grades of 50%, 90%, and 100%, the range differs greatly between the two classes due to the outliers. The ranges would be 100 and 50 respectively.
When analyzing test results it is helpful to group scores by the number of standard deviations from the mean. For the data set [3, 5, 7, 7, 7, 38] the mean is 11.16 and the standard deviation is 13.24. One standard deviation from the mean is 11.16 plus or minus 13.24, i.e. from -2.08 to 24.40, and two standard deviations from the mean is 11.16 plus or minus twice 13.24, i.e. from -15.32 to 37.64. Imagine that the above data are points scored in a baseball game. Thirty-eight is between two and three standard deviations above the mean.
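The calculation above can be checked in Python:

```python
from statistics import mean, stdev

# The points data from the example above.
points = [3, 5, 7, 7, 7, 38]

m = mean(points)   # ~11.17
s = stdev(points)  # ~13.24 (sample standard deviation)

# How many standard deviations above the mean is the score of 38?
z = (38 - m) / s
print(round(m, 2), round(s, 2), round(z, 2))  # 11.17 13.24 2.03
```

A z-score of about 2.03 confirms that 38 lies between two and three standard deviations above the mean.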