
Applied Statistics

Assignment No.1

Statistics in Psychology

(a) Statistics

Statistics is defined as:

The branch of mathematics concerned with the collection, classification, analysis, and interpretation of numerical facts, for drawing inferences on the basis of their quantifiable likelihood (probability). Statistics can interpret aggregates of data too large to be intelligible by ordinary observation, because such data (unlike individual quantities) tend to behave in a regular, predictable manner.

Statistics is a branch of mathematics dealing with the collection, analysis, interpretation, presentation, and organization of data. In applying statistics to, e.g., a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model of the process to be studied. Populations can be diverse topics such as "all people living in a country" or "every atom composing a crystal." Statistics deals with all aspects of data, including the planning of data collection in terms of the design of surveys and experiments.

(b) Importance of Psychology in Statistics

In psychology, researchers are also confronted by enormous amounts of data.
Psychological research is a cornerstone of the field, and experimental research helps psychologists provide better treatment for their patients. Statistics are essential for determining whether certain treatments are effective. One of the most challenging aspects of psychology is deciding how diseases, disorders and other problems should be categorized. By using advanced statistical analyses, experts can determine which symptoms seem to cluster together.
Statistics allow psychologists to:

Organize Data

When dealing with an enormous amount of information, it is all too easy to become
overwhelmed. Statistics allow psychologists to present data in ways that are easier to
comprehend. Visual displays such as graphs, pie charts, frequency distributions, and scatterplots
make it possible for researchers to get a better overview of the data and to look for patterns that
they might otherwise miss.

Describe Data

Think about what happens when researchers collect a great deal of information about a
group of people. The census is a great example. Using statistics, we can accurately describe the
information that has been gathered in a way that is easy to understand. Descriptive statistics
provide a way to summarize what already exists in a given population, such as how many men
and women there are, how many children there are, or how many people are currently employed.
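As a small illustration, the core descriptive summaries can be computed with Python's standard statistics module (the ages below are invented for the example):

```python
import statistics

# Hypothetical ages collected in a small survey (invented data, for illustration only)
ages = [23, 35, 41, 29, 35, 52, 35, 41]

print(statistics.mean(ages))    # 36.375 (arithmetic mean)
print(statistics.median(ages))  # 35.0   (middle value of the sorted data)
print(statistics.mode(ages))    # 35     (most frequent value)
```

Each of these is a descriptive statistic: it summarizes what already exists in the data without going beyond it.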

Make Inferences Based Upon Data

By using what's known as inferential statistics, researchers can infer things about a given
sample or population. Psychologists use the data they have collected to test a hypothesis or a
guess about what they predict will happen. Using this type of statistical analysis, researchers can
determine the likelihood that a hypothesis should be either accepted or rejected.
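One common inferential procedure that makes this concrete is a permutation test; the sketch below uses invented scores for a hypothetical treatment and control group, and asks how often a group difference at least as large would arise by chance alone:

```python
import random
import statistics

# Hypothetical scores from a treated and an untreated group (invented data)
treatment = [12, 15, 14, 16, 13, 17]
control = [10, 11, 9, 12, 10, 11]

observed = statistics.mean(treatment) - statistics.mean(control)

# Permutation test: repeatedly shuffle the pooled scores into two arbitrary
# groups and count how often the chance difference reaches the observed one.
random.seed(0)
pooled = treatment + control
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[:6]) - statistics.mean(pooled[6:])
    if diff >= observed:
        count += 1

p_value = count / trials
print(p_value)  # a small p-value means the difference is unlikely under chance
```

A small p-value is the "likelihood" the text refers to: the researcher uses it to decide whether the hypothesis of no difference should be rejected.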

(c) Branches of Statistics

There are two main branches of statistics that are descriptive statistics and inferential statistics.
Both of these are employed in scientific analysis of data and both are equally important for the
student of statistics.

Descriptive Statistics

Descriptive statistics deals with the presentation and collection of data. This is usually the first part of a statistical analysis. It is usually not as simple as it sounds, and the statistician needs to be aware of designing experiments, choosing the right focus group, and avoiding biases that can so easily creep into the experiment.

Example

Different areas of study require different kinds of analysis using descriptive statistics. For example, a physicist studying turbulence in the laboratory needs the average quantities that vary over small intervals of time. The nature of this problem requires that physical quantities be averaged from a host of data collected through the experiment.

Inferential Statistics

Inferential statistics, as the name suggests, involves drawing the right conclusions from
the statistical analysis that has been performed using descriptive statistics. In the end, it is the
inferences that make studies important and this aspect is dealt with in inferential statistics.

Most predictions of the future and generalizations about a population by studying a smaller sample come under the purview of inferential statistics. Most social science experiments deal with studying a small sample population that helps determine how the population in general behaves. By designing the right experiment, the researcher is able to draw conclusions relevant to the study.

While drawing conclusions, one needs to be very careful not to draw wrong or biased conclusions. Even though this appears to be an exact science, there are ways in which studies and results can be manipulated through various means.

Example

Data dredging is becoming an increasing problem, as computers hold loads of information and it is easy, either intentionally or unintentionally, to use the wrong inferential methods.

Both descriptive and inferential statistics go hand in hand and one cannot exist without
the other. Good scientific methodology needs to be followed in both these steps of statistical
analysis and both these branches of statistics are equally important for a researcher.
(d) Limitations of Statistics in Psychology

Statistics is indispensable to almost all sciences - social, physical and natural. It is very
often used in most of the spheres of human activity. In spite of the wide scope of the subject it
has certain limitations. Some important limitations of statistics are the following:

1. Statistics does not study qualitative phenomena:

Statistics deals with facts and figures. So the quality aspect of a variable or the subjective
phenomenon falls out of the scope of statistics. For example, qualities like beauty, honesty,
intelligence etc. cannot be numerically expressed. So these characteristics cannot be examined
statistically. This limits the scope of the subject.

2. Statistical laws are not exact:

Statistical laws are not exact in the way the laws of the natural sciences are. These laws are true only on average. They hold well only under certain conditions and cannot be universally applied, so statistics has less practical utility.

3. Statistics does not study individuals:

Statistics deals with aggregates of facts. Single or isolated figures are not statistics. This is considered to be a major handicap of statistics.

4. Statistics can be misused:

Statistics is mostly a tool of analysis. Statistical techniques are used to analyse and interpret the
collected information in an enquiry. As it is, statistics does not prove or disprove anything. It is
just a means to an end. Statements supported by statistics are more appealing and are commonly
believed. For this, statistics is often misused. Statistical methods rightly used are beneficial but if
misused these become harmful. Statistical methods used by less expert hands will lead to
inaccurate results. Here the fault does not lie with the subject of statistics but with the person
who makes wrong use of it.

Other common problems that occur are:

1. Statistics does not deal with isolated measurement

2. Statistics deals with only quantitative characteristics


3. Statistical laws are true on average. Statistics are aggregates of facts, so a single
observation is not a statistic; statistics deals with groups and aggregates only.

4. Statistical methods are best applicable on quantitative data.

5. Statistics cannot be applied to heterogeneous data.

6. If sufficient care is not exercised in collecting, analyzing and interpreting the data,
statistical results might be misleading.

7. Only a person who has an expert knowledge of statistics can handle statistical data
efficiently.

8. Some errors are possible in statistical decisions. In particular, inferential statistics
involves certain errors, and we do not know whether an error has been committed or not.

(e) Data

Data is "facts or figures from which conclusions can be drawn".

Data, information and statistics are often misunderstood. They are actually different things, as
Figure shows.

Figure. Data collected on the weight of 20 individuals in your classroom

Data                 Information                               Statistics

20 kg, 25 kg         5 individuals in the 20-to-25-kg range    Mean weight = 22.5 kg

28 kg, 30 kg, etc.   15 individuals in the 26-to-30-kg range   Median weight = 28 kg

Data can take various forms, but are often numerical. As such, data can relate to an enormous
variety of aspects.
For example:

The daily weight measurements of each individual in a classroom.

The number of movie rentals per month for each household in your neighborhood.

The city's temperature (measured every hour) for a one-week period.

Types of Data

Quantitative data deals with numbers and things you can measure objectively:
dimensions such as height, width, and length. Temperature and humidity. Prices. Area and
volume.

Qualitative data deals with characteristics and descriptors that can't be easily measured,
but can be observed subjectively such as smells, tastes, textures, attractiveness, and color.

Numerical data. These data have meaning as a measurement, such as a person's height,
weight, IQ, or blood pressure; or they're a count, such as the number of stock shares a
person owns, how many teeth a dog has, or how many pages you can read of your
favorite book before you fall asleep. (Statisticians also call numerical data quantitative
data.)

Numerical data can be further broken into two types: discrete and continuous.

o Discrete data represent items that can be counted; they take on possible values
that can be listed out. The list of possible values may be fixed (also called finite),
or it may go from 0, 1, 2, on to infinity (making it countably infinite). For
example, the number of heads in 100 coin flips takes on values from 0 through
100 (the finite case), but the number of flips needed to get 100 heads takes on values
from 100 (the fastest scenario) on up to infinity (if you never get to that 100th
head). Its possible values are listed as 100, 101, 102, 103, . . . (representing the
countably infinite case).

o Continuous data represent measurements; their possible values cannot be
counted and can only be described using intervals on the real number line. For
example, the exact amount of gas purchased at the pump for cars with 20-gallon
tanks would be continuous data from 0 gallons to 20 gallons, represented by the
interval [0, 20], inclusive. You might pump 8.40 gallons, or 8.41, or 8.414863
gallons, or any possible number from 0 to 20. In this way, continuous data can be
thought of as being uncountably infinite. For ease of recordkeeping, statisticians
usually pick some point at which to round off. Another example would be
that the lifetime of a C battery can technically be anywhere from 0 hours to an infinite
number of hours (if it lasts forever), with all possible values in
between. Granted, you don't expect a battery to last more than a few hundred
hours, but no one can put a cap on how long it can go (remember the Energizer
Bunny?).

Categorical data: Categorical data represent characteristics such as a person's gender,
marital status, hometown, or the types of movies they like. Categorical data can take on
numerical values (such as 1 indicating male and 2 indicating female), but those
numbers don't have mathematical meaning. You couldn't add them together, for
example. (Other names for categorical data are qualitative data, or Yes/No data.)
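The distinction between these data types can be mirrored in a few lines of code; the values and category codes below are invented purely to illustrate which operations are meaningful:

```python
# Numerical (continuous) data: measurements, so arithmetic is meaningful
heights_cm = [162.5, 171.0, 158.4]          # invented example measurements
average_height = sum(heights_cm) / len(heights_cm)

# Numerical (discrete) data: counts, whole numbers only
teeth_counts = [42, 42, 40]

# Categorical data: labels; even when coded as numbers, arithmetic is meaningless
sex_codes = {1: "male", 2: "female"}        # 1 and 2 are labels, not quantities

print(round(average_height, 1))  # 164.0
```

Averaging the heights makes sense; "averaging" the sex codes would produce a number with no interpretation, which is exactly the point of the categorical/numerical distinction.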

Primary Data

Primary data means raw data (data without fabrication; not tailored data) which has just
been collected from the source and has not undergone any kind of statistical treatment, such as sorting and
tabulation. The term primary data may sometimes be used to refer to firsthand information.

Sources of Primary Data

The sources of primary data are primary units such as basic experimental units, individuals, and
households. The following methods are usually used to collect data from primary units, and the
choice of method depends on the nature of the primary unit.

Personal Investigation
Through Questionnaire
Through Telephone
Through Internet

Secondary Data
Secondary data are data which have already been collected by someone else, and may have been
sorted, tabulated and subjected to statistical treatment; that is, fabricated or tailored data.

Sources of Secondary Data

The secondary data may be available from the following sources:

Government Organizations
Federal and Provincial Bureau of Statistics, Crop Reporting Service-Agriculture
Department, Census and Registration Organization etc.

Semi-Government Organization
Municipal committees, District Councils, Commercial and Financial Institutions like
banks etc.

Teaching and Research Organizations

Research Journals and Newspapers

Internet

(f) Constant

Constant, as its name suggests, is something that does not vary or change (or that may not
be susceptible to variation or change).

Properties

A constant has only one attribute or value.

The search for constants is the ultimate goal in science. A scientific law (e.g. light in a
vacuum travels at 300,000 km/s) is a constant. As such, it allows us to predict
properties based on such a law, to use it as a premise in deductive sciences, and to use it as
an assumption in inductive sciences.

A main task in experimental research, however, is to control the variability of all but the
research variables. This control is done by way of keeping all other variables as
constants. Once this is achieved, it is possible to ascertain the relationship between the
research variables, as these would be the only variables actually varying.
Any variable can be made into a constant by reducing its expression to only one of its
values.

For example, we could keep the temperature of a room constant at 35 °C during an
experiment. In this case, temperature stops being a variable.

A constant has no use in statistics. That is, anything that is or remains constant cannot be
subjected to statistical analysis. Many researchers, however, may take this property as
meaning that a given constant is not relevant for predicting something, which is not
always the case.

A variable is any characteristic, number, or quantity that can be measured or counted. A variable may also be called a data item. Age, sex, business income and expenses, country of birth, capital expenditure, class grades, eye color and vehicle type are examples of variables. It is called a variable because the value may vary between data units in a population, and may change in value over time.

For example, 'income' is a variable that can vary between data units in a population (i.e. the
people or businesses being studied may not have the same incomes) and can also vary over time
for each data unit (i.e. income can go up or down).

Numeric Variables
Numeric variables have values that describe a measurable quantity as a number, like 'how many'
or 'how much'. Therefore numeric variables are quantitative variables.
Numeric variables may be further described as either continuous or discrete:

A continuous variable is a numeric variable. Observations can take any value between
a certain set of real numbers. The value given to an observation for a continuous variable
can include values as small as the instrument of measurement allows. Examples of
continuous variables include height, time, age, and temperature.

A discrete variable is a numeric variable. Observations can take a value based on a count from a set of distinct whole values. A discrete variable cannot take the value of a fraction between one value and the next closest value. Examples of discrete variables include the number of registered cars, number of business locations, and number of children in a family, all of which are measured in whole units (i.e. 1, 2, 3 cars).

The data collected for a numeric variable are quantitative data.

Categorical variables have values that describe a 'quality' or 'characteristic' of a data unit, like 'what type' or 'which category'. Categorical variables fall into mutually exclusive (in one category or in another) and exhaustive (include all possible options) categories. Therefore, categorical variables are qualitative variables and tend to be represented by a non-numeric value.

Categorical variables may be further described as ordinal or nominal:

An ordinal variable is a categorical variable. Observations can take a value that can be
logically ordered or ranked. The categories associated with ordinal variables can be
ranked higher or lower than another, but do not necessarily establish a numeric difference
between each category. Examples of ordinal categorical variables include academic
grades (i.e. A, B, C), clothing size (i.e. small, medium, large, extra-large) and attitudes
(i.e. strongly agree, agree, disagree, strongly disagree).

A nominal variable is a categorical variable. Observations can take a value that is not
able to be organized in a logical sequence. Examples of nominal categorical variables
include sex, business type, eye color, religion and brand.

Dependent variable

The presumed effect in an experimental study. The values of the dependent variable
depend upon another variable, the independent variable. Strictly speaking, "dependent
variable" should not be used when writing about non-experimental designs.
Confounding variable

A variable that obscures the effects of another variable. If one elementary reading
teacher used a phonics textbook in her class and another instructor used a whole language
textbook in his class, and students in the two classes were given achievement tests to see
how well they read, the independent variables (teacher effectiveness and textbooks)
would be confounded. There is no way to determine if differences in reading between the
two classes were caused by either or both of the independent variables.

(g) Population & Parameter

A population is a group of phenomena that have something in common. The term often
refers to a group of people, as in the following examples:

All registered voters in a sector.

All members of the International Machinists Union.

All persons who played golf at least once in the past year

But populations can refer to things as well as people:

All daily maximum temperatures in July for major Pakistani cities.

All basal ganglia cells from a particular rhesus monkey.

Sample

A sample is a smaller group of members of a population selected to represent the population. In order to use statistics to learn things about the population, the sample must be random. A random sample is one in which every member of a population has an equal chance of being selected. The most commonly used sample is a simple random sample. It requires that every possible sample of the selected size has an equal chance of being used.

A parameter is a characteristic of a population. A statistic is a characteristic of a sample. Inferential statistics enables you to make an educated guess about a population parameter based on a statistic computed from a sample randomly drawn from that population.
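The parameter/statistic distinction can be sketched in a few lines; the population of incomes below is simulated (all numbers invented), and with a true random sample the statistic lands close to the parameter it estimates:

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population of 10,000 incomes (invented numbers)
population = [random.gauss(50_000, 8_000) for _ in range(10_000)]
parameter = statistics.mean(population)   # a parameter describes the population

# Simple random sample: every member has an equal chance of selection
sample = random.sample(population, 200)
statistic = statistics.mean(sample)       # a statistic describes the sample

# The statistic is the "educated guess" at the unknown parameter
print(round(parameter), round(statistic))
```

In practice the parameter is unknown and only the statistic is observed; inferential statistics quantifies how far apart the two are likely to be.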
Assignment No.2

PART-1

Scales of Measurements

Four scales of measurement are recognized in the behavioral sciences:

Nominal

Ordinal

Interval

Ratio

Nominal Scales

Here the numbers are used merely as names and have no quantitative value. Typically, a tackle
on a football team wears a number in the 70s. This number merely gives him a name; it does
not tell how many tackles he made, how fast he can run, or whether his team wins. Nominal scales are
the lowest level of measurement. A nominal scale is a naming scale and is used with categorical data.

Examples:

place of birth

political orientation

gender

types of sports

Nominal scales can use numbers to represent labels within a category, but the number does not have
the qualities of a true number; it is just a category label.

Example:

Gender is an example of a variable that is measured on a nominal scale. Individuals may be classified as "male" or "female", but neither value represents more or less "gender" than the other. Religion and political affiliation are other examples of variables that are normally measured on a nominal scale.

Ordinal Scales

This scale has the characteristic of the nominal scale in that different numbers mean different
things, but also has the characteristic of "greater or lesser". It measures a variable in terms of
magnitude, or rank.

Example:

socioeconomic class

grades

preferences

* Ordinal scales tell us relative order, but give us no information regarding the differences between
the categories. For example, runners in the 100 meter dash finish 1st, 2nd, 3rd, etc. Is the number of
seconds between 1st and 2nd place the same as that between 2nd and 3rd place? Not
necessarily.
Interval Scales

This scale has the properties of the nominal and ordinal scales, but here the magnitudes between
consecutive intervals are equal. Temperature is the example usually given to illustrate
an interval scale.

Example:

Temperature on a Fahrenheit/Celsius thermometer.

90° is hotter than 45°, and the difference between 60° and 70° is the same as the difference
between 30° and 40°.

* Interval scales do not have a true zero. 0 degrees does not mean the absence of heat (although it
might feel like it).

Example:
Attitude scales are sometimes considered to be interval scales.

Calendar years are based on an interval scale.

Ratio Scales

Ratio scales have all of the characteristics of the nominal, ordinal and interval scales. In
addition, however, ratio scales have a true zero. This is the kind of scale that you used when you
learned arithmetic in grade school. You assumed that the numbers had meaning, that they had
rank order (3 is larger than 2), that the intervals between the consecutive numbers were equal and
that there was a zero. Four was twice two; eight was half of sixteen, etc. These are true ratios.
One can use all mathematical operations on this scale.

Example:

weight

height

time

distance

10 miles is twice as long as 5 miles. 0 miles is no distance.


PART-2

1- Response time

CONTINUOUS VARIABLE

2- Rating of job satisfaction

QUANTITATIVE

DISCRETE

3- Favorite color

QUALITATIVE

4- Occupation aspired to

QUALITATIVE

5- Amount of change in weight

CONTINUOUS

6- Number of newspaper sold in different areas

QUANTITATIVE

DISCRETE

7- Temperatures recorded

QUANTITATIVE
Assignment No. 3

Frequency distribution

The frequency (f) of a particular observation is the number of times the observation
occurs in the data. The distribution of a variable is the pattern of frequencies of the observation.
Frequency distributions are portrayed as frequency tables, histograms, or polygons.

A frequency distribution is an orderly arrangement of data classified according to the magnitude of the observations. When the data are grouped into classes of appropriate size indicating the number of observations in each class, we get a frequency distribution. By forming a frequency distribution, we can summarize the data effectively. It is a method of presenting the data in a summarized form. A frequency distribution is also known as a frequency table.

Uses: Frequency distribution helps us

1. To analyze the data.
2. To estimate the frequencies of the population on the basis of the sample.
3. To facilitate the computation of various statistical measures.

Frequency distributions can show either the actual number of observations falling in each
range or the percentage of observations. In the latter instance, the distribution is called a relative
frequency distribution.

OR

Frequency distribution tables can be used for both categorical and numeric variables.
Continuous variables should only be used with class intervals. A frequency distribution records the
number of times a given quantity (or group of quantities) occurs in a set of data. For example, the
frequency distribution of income in a population would show how many individuals (or
households) have an income of a certain level (say, 5,000 a month). It is plotted either as a
step-column chart (histogram) or as a line chart (ogive).
Steps of Construction

Step 1: Determine the class width (CW).

Step 2: Determine Lowest Lower Limit.

Step 3: Determine Lowest and subsequent Upper Limits.

Step 4: Determine the class frequencies using a tally.

If required

Step 5: Compute relative and cumulative relative frequencies.

Example:

A survey was taken on Sector A-II. In each of 20 homes, people were asked how many
cars were registered to their households. The results were recorded as follows:

1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0

Use the following steps to present this data in a frequency distribution table.

1. Divide the results (x) into intervals, and then count the number of results in each interval.
In this case, the intervals would be the number of households with no car (0), one car (1),
two cars (2) and so forth.

2. Make a table with separate columns for the interval numbers (the number of cars per
household), the tallied results, and the frequency of results in each interval. Label these
columns Number of cars, Tally and Frequency.

3. Read the list of data from left to right and place a tally mark in the appropriate row. For
example, the first result is a 1, so place a tally mark in the row beside where 1 appears in
the interval column (Number of cars). The next result is a 2, so place a tally mark in the
row beside the 2, and so on. When you reach your fifth tally mark, draw a tally line
through the preceding four marks to make your final frequency calculations easier to
read.
4. Add up the number of tally marks in each row and record them in the final column
entitled Frequency.

Frequency distribution table for this exercise is:

Table 1. Frequency table for the number of cars registered in each household

Number of cars (x)   Tally       Frequency (f)

0                    ||||        4

1                    ||||\ |     6

2                    ||||\       5

3                    |||         3

4                    ||          2

By looking at this frequency distribution table quickly, we can see that out of 20 households
surveyed, 4 households had no cars, 6 households had 1 car, etc.
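The tally procedure above maps directly onto Python's `collections.Counter`; using the same 20 survey responses:

```python
from collections import Counter

# The same 20 survey responses: number of cars per household
cars = [1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0]

freq = Counter(cars)  # Counter does the tallying for us
for x in sorted(freq):
    print(x, freq[x])
# Matches Table 1: 0 -> 4, 1 -> 6, 2 -> 5, 3 -> 3, 4 -> 2
```

Each key is an interval (here a single value of x) and each count is that interval's frequency, exactly as in the hand-tallied table.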

Cumulative Frequency Distribution

One of the important types of frequency distribution is the cumulative frequency distribution. In a cumulative frequency distribution, the frequencies are shown in a cumulative manner: the cumulative frequency for each class interval is the frequency for that class interval added to the preceding cumulative total. Cumulative frequency can also be defined as the sum of all previous frequencies up to the current point.

The cumulative relative frequency distribution is a related type of frequency distribution. The relative cumulative frequency is the cumulative frequency divided by the total frequency.

Example
At a recent chess tournament, all 10 of the participants had to fill out a form that gave
their names, address and age. The ages of the participants were recorded as follows:

36, 48, 54, 92, 57, 63, 66, 76, 66, 80

Use the following steps to present these data in a cumulative frequency distribution table.

1. Divide the results into intervals, and then count the number of results in each interval. In
this case, intervals of 10 are appropriate. Since 36 is the lowest age and 92 is the highest
age, start the intervals at 35 to 44 and end the intervals with 85 to 94.

2. Create a table similar to the frequency distribution table but with three extra columns.

In the first column or the Lower value column, list the lower value of the result
intervals. For example, in the first row, you would put the number 35.

The next column is the Upper value column. Place the upper value of the result
intervals. For example, you would put the number 44 in the first row.

The third column is the Frequency column. Record the number of times a result
appears between the lower and upper values. In the first row, place the number 1.

The fourth column is the Cumulative frequency column. Here we add the
cumulative frequency of the previous row to the frequency of the current row.
Since this is the first row, the cumulative frequency is the same as the frequency.
However, in the second row, the frequency for the 35-44 interval (i.e., 1) is
added to the frequency for the 45-54 interval (i.e., 2). Thus, the cumulative
frequency is 3 (1 + 2 = 3), meaning we have 3 participants in the 35 to 54 age group.

The next column is the Percentage column. In this column, list the percentage of
the frequency. To do this, divide the frequency by the total number of results and
multiply by 100. In this case, the frequency of the first row is 1 and the total
number of results is 10, so the percentage is (1 ÷ 10) × 100 = 10.0.

The final column is Cumulative percentage. In this column, divide the cumulative
frequency by the total number of results and then, to make a percentage, multiply
by 100. Note that the last number in this column should always equal 100.0. In
this example, the cumulative frequency is 1 and the total number of results is 10,
therefore the cumulative percentage of the first row is (1 ÷ 10) × 100 = 10.0.

3. The cumulative frequency distribution table should look like this:

Table 2. Ages of participants at a chess tournament

Lower Value   Upper Value   Frequency (f)   Cumulative frequency   Percentage   Cumulative percentage

35            44            1               1                      10.0         10.0

45            54            2               3                      20.0         30.0

55            64            2               5                      20.0         50.0

65            74            2               7                      20.0         70.0

75            84            2               9                      20.0         90.0

85            94            1               10                     10.0         100.0
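The steps above can be reproduced programmatically; a short sketch using the same ten ages and the same 10-year intervals:

```python
# The ten recorded ages from the chess tournament example
ages = [36, 48, 54, 92, 57, 63, 66, 76, 66, 80]
n = len(ages)

rows = []
cumulative = 0
for lower in range(35, 95, 10):                   # intervals 35-44, 45-54, ..., 85-94
    upper = lower + 9
    f = sum(lower <= a <= upper for a in ages)    # frequency in this interval
    cumulative += f                               # running (cumulative) total
    rows.append((lower, upper, f, cumulative,
                 round(100 * f / n, 1), round(100 * cumulative / n, 1)))

for row in rows:
    print(*row)  # matches Table 2 row by row
```

Note the last row's cumulative percentage comes out to 100.0, the check the text recommends.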

Relative Frequency Distribution

A relative frequency distribution is a distribution in which relative frequencies are recorded against each class interval. The relative frequency of a class is obtained by dividing the class frequency by the total frequency. Relative frequency is the proportion of the total frequency that is in any given class interval in the frequency distribution.

Relative Frequency Distribution Table


If the frequencies in a frequency distribution table are changed into relative frequencies, the
table is called a relative frequency distribution table. For a data set
consisting of n values, if f is the frequency of a particular value then the ratio f/n is called its
relative frequency.

Example

Class interval Frequency

20-25 10

25-30 12

30-35 8

35-40 20

40-45 11

45-50 4

50-55 5

Solution:
Relative frequency distribution table for the given data.

Here n = 70

Class interval   Frequency (f)   Relative frequency (f/n)

20-25            10              10 / 70 = 0.143

25-30            12              12 / 70 = 0.171

30-35            8               8 / 70 = 0.114

35-40            20              20 / 70 = 0.286

40-45            11              11 / 70 = 0.157

45-50            4               4 / 70 = 0.057

50-55            5               5 / 70 = 0.071

Total            n = 70
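The relative frequencies in the solution can be checked in a few lines, using the class frequencies from the example:

```python
# Class frequencies taken from the example above
freqs = [10, 12, 8, 20, 11, 4, 5]
n = sum(freqs)                          # total frequency, 70

rel = [round(f / n, 3) for f in freqs]  # relative frequency f/n for each class
print(rel)   # [0.143, 0.171, 0.114, 0.286, 0.157, 0.057, 0.071]

total = sum(f / n for f in freqs)
print(total) # relative frequencies sum to 1 (up to floating-point rounding)
```

The sum-to-one check is a quick way to catch arithmetic slips when building such a table by hand.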

Grouped Frequency Distribution

A grouped frequency distribution is an ordered listing of a variable X, into groups in one
column, with a listing in a second column, the frequency column. A grouped frequency
distribution is an arrangement of class intervals and corresponding frequencies in a table.

There are certain rules to be remembered while constructing a grouped frequency distribution:

1. The number of classes should be between 5 and 20.
2. If possible, the magnitude of the classes should be 5 or a multiple of 5.
3. The lower limit of the first class should be a multiple of 5.
4. Classes are shown in the first column and frequencies in the second column.

Grouped Frequency Distribution Table


Inclusive type of frequency distribution can be converted into exclusive type as in Table (b)

Ungrouped Frequency Distribution

A frequency distribution with an interval width of 1 is referred to as an ungrouped frequency distribution. An ungrouped frequency distribution is an arrangement of the observed values in ascending order. Ungrouped frequency distributions deal with data which are not arranged in groups; they are known as individual series. When ungrouped data are grouped, we get the grouped frequency distribution.

For Example: A teacher gave a test to a class of 26 students. The maximum mark is 5. The
marks obtained by the pupils are:

3 2 3 3 4 3 1 2 5

1 5 4 2 1 1 3 3 4

1 2 1 4 5 4 2 2

Such data as above is called ungrouped (or raw) data.


We may arrange the marks in ascending or descending order. The data so represented is called an
array.

1 1 1 1 1 1 2 2 2 2 2 2

3 3 3 3 3 3 4 4 4 4 4 5 5 5

The difference between the greatest and the smallest number is called range of the data. Thus for
the above data, the range is 5 - 1 which equals 4 marks.
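The whole worked example, i.e. sorting the raw marks into an array, counting frequencies, and finding the range, can be sketched in Python:

```python
from collections import Counter

# Raw test marks of the 26 pupils from the example above.
marks = [3, 2, 3, 3, 4, 3, 1, 2, 5,
         1, 5, 4, 2, 1, 1, 3, 3, 4,
         1, 2, 1, 4, 5, 4, 2, 2]

array = sorted(marks)          # the "array": marks in ascending order
freq = Counter(array)          # ungrouped frequency distribution
rng = max(marks) - min(marks)  # range = greatest - smallest

print(freq[1], freq[2], freq[3])  # 6 6 6
print(rng)                        # 4
```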
Assignment No. 4

Graphs

The Purpose of Graphs

Graphs are used to show a relationship between the independent variable and
the dependent variable. The independent variable is typically on the x-axis (horizontal line
or abscissa) of a graph and the dependent variable is typically on the y-axis (vertical line
or ordinate) of a graph. Caution should be taken when drawing a graph due to the horizontal-
vertical illusion. This illusion makes vertical lines appear longer than horizontal lines. Graphs
can be used to display information that has been summarized in a frequency distribution. Graphs
should make the data easier to understand. In this case, the dependent variable is placed on the x-
axis and the frequency is on the y-axis. Graphs are used to illustrate trends and to help predict
the future.

Graphs should always contain a descriptive title that informs us what kind of information is
being conveyed. Labels on both axes tell us what is being measured. The numbers along the
vertical axis tell us in what increments the measurements are being reported. A graph key is
used to distinguish two different dependent or independent variables. The most common graphs
are histograms and polygons. Both are discussed below.

I. Histogram

A Histogram consists of a number of bars placed side by side. The y-axis


uses Frequency as its label so Histograms are generally referred to as Frequency Histograms.
The width of each bar on the x-axis indicates the interval size. The height of each bar indicates
the frequency of the interval. To create a Frequency Histogram you must first determine the
frequency of the intervals of the axes. Then draw the bars of the histogram. The bars are drawn
from the lower limits to the upper limits along the x-axis. Since the intervals are connected the
bars of the histogram should also be connected. The graph below is an example of a Frequency
Histogram:
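Before the bars can be drawn, each raw score has to be assigned to its class interval. A minimal sketch of that binning step, using hypothetical raw scores and width-5 intervals:

```python
# Count how many hypothetical scores fall in each width-5 class interval;
# these frequencies become the bar heights of the histogram.
data = [21, 23, 24, 27, 28, 29, 31, 34, 36, 38, 42, 44]
low, width = 20, 5

bins = {}
for x in data:
    lower = low + ((x - low) // width) * width  # lower limit of x's interval
    label = f"{lower}-{lower + width}"
    bins[label] = bins.get(label, 0) + 1

print(bins)  # {'20-25': 3, '25-30': 3, '30-35': 2, '35-40': 2, '40-45': 2}
```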
II. Polygon

Polygons consist of points on a graph with lines connecting them. A Polygon uses a
single point rather than a bar to represent an interval on a graph. They use the midpoint of the
interval as the single point plotted. The polygon should begin and end at the abscissa. Polygons
can plot Frequency or Relative Frequency on the y-axis.

A. Frequency Polygon.

The steps to generating a Frequency Polygon and an example of a Frequency Polygon (without a
title) are listed below:

Step 1: draw and label the axes

Step 2: add two extra intervals: one below the lowest interval and one above the highest
interval

Step 3: determine the midpoint for each interval

Step 4: plot the frequency for each of the midpoints on the graph

Step 5: connect the dots with a straight line
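Steps 2 and 3 can be sketched in Python: pad the intervals with one empty class on each side, then take the midpoint of every interval as the x-coordinate to plot (the interval limits here are illustrative):

```python
# Add an empty interval below and above, then compute each midpoint.
intervals = [(20, 25), (25, 30), (30, 35), (35, 40)]
width = intervals[0][1] - intervals[0][0]

extended = ([(intervals[0][0] - width, intervals[0][0])]
            + intervals
            + [(intervals[-1][1], intervals[-1][1] + width)])
midpoints = [(lo + hi) / 2 for lo, hi in extended]

print(midpoints)  # [17.5, 22.5, 27.5, 32.5, 37.5, 42.5]
```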


B. Relative Frequency Polygon

Relative frequency is used to compare two distributions that have different numbers of
subjects. Relative frequency can be graphed as a Relative Frequency Polygon. Relative
frequency polygons are created in the same manner as the frequency polygon. The only
difference being that you use relative frequency instead of frequency values. The graph below is
an example of a Relative Frequency Polygon:
III. Cumulative Polygon

Cumulative polygons graph the data in a distribution that fall below a particular score. Cumulative
Polygons are created in the same manner as the polygon. The only difference being that you use
cumulative values and the upper real limits of the intervals are used instead of the
midpoints. The s-shaped curve of the graph below is called an ogive, pronounced oh-jive.

A. Cumulative Frequency Polygons:

Cumulative frequency polygons graph the number of subjects in a distribution that fall
below a particular score. Cumulative Frequency Polygons are created in the same manner as the
frequency polygon. The only difference being that you use cumulative frequency values and
the upper real limits of the intervals are used instead of the midpoints. The graph below is an
example of a Cumulative Frequency Polygon:
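The cumulative values themselves are just running totals of the interval frequencies, paired with the upper limits of the intervals; a sketch with illustrative frequencies:

```python
from itertools import accumulate

# Frequencies per class interval (illustrative) and the upper limits
# against which the cumulative points are plotted.
upper_limits = [25, 30, 35, 40]
freqs = [10, 12, 8, 20]

cum = list(accumulate(freqs))  # running total of frequencies

print(cum)      # [10, 22, 30, 50]
print(cum[-1])  # 50 -- the last value equals the total number of scores
```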

B. Cumulative Relative Frequency Polygons:

Cumulative Relative Frequency Polygons are created in the same manner as the
cumulative frequency polygon with the only difference being that you use cumulative relative
frequency values instead of cumulative frequency on the y-axis. The graph below is an example
of a Cumulative Relative Frequency Polygon:
C. Cumulative Percent Polygons:

Cumulative Percent Polygons are created in the same manner as the cumulative
frequency polygon and the cumulative relative frequency polygon with the only difference being
that you use cumulative percent values on the y-axis. Remember that percent is simply relative
frequency multiplied by 100. The graph below is an example of a Cumulative Percent Polygon:

IV. Stem and Leaf Diagrams

Stem and Leaf Diagrams allow you to display raw data visually. Each raw score is divided into
a stem and a leaf. The leaf is typically the last digit of the raw value. The stem is the remaining
digits of the raw value. To generate a stem and leaf diagram you must first create a vertical
column that contains all of the stems. Then list each leaf next to the corresponding stem. In
these diagrams, all of the scores are represented in the diagram without the loss of any
information. The graph below is an example of a Stem and Leaf Diagram:

Stem Leaf

2 2 3 5 5 5 7

3 1 1 4 6

4 0 0 4 5 6 7 7 8 9

5 0 1 1

6 3 3 5 6 7

7 7 8 9

8 1 1 2 6 9 9 9

9 5
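The construction described above (stem = leading digits, leaf = last digit) can be sketched in Python; the raw scores here are chosen to reproduce the first two rows of the diagram:

```python
from collections import defaultdict

# Split each score into a stem (tens digit) and a leaf (units digit).
scores = [22, 23, 25, 25, 25, 27, 31, 31, 34, 36]

stems = defaultdict(list)
for s in sorted(scores):
    stems[s // 10].append(s % 10)

for stem, leaves in sorted(stems.items()):
    print(stem, " ".join(str(leaf) for leaf in leaves))
# 2 2 3 5 5 5 7
# 3 1 1 4 6
```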

Bar Graphs

Bar graphs use a free-standing bar to represent the data associated with each level of the
independent variable. Bar graphs are preferable when the levels or groups of the independent
variable are true categories, i.e. nominal or ordinal classifications. The example below shows a bar
graph:
Line Graphs

A line graph connects the representative data points associated with each level of the independent
variable. Line graphs are preferable when the levels or groups of the independent variable are
quantitative, i.e. interval or ratio classifications. The example below shows a line graph:

Pictogram

A pictogram or pictograph represents the frequency of data as pictures or symbols. Each picture
or symbol may represent one or more units of the data.
Example:

The following table shows the number of computers sold by a company for the months January
to March. Construct a pictograph for the table.

Month January February March

Number of computers 25 35 20

Solution:

Letting one symbol stand for 5 computers:

January   # # # # #        (5 symbols = 25)

February  # # # # # # #    (7 symbols = 35)

March     # # # #          (4 symbols = 20)
Assignment No. 5

Measures of Central Tendency

Definition:

Measures of central tendency are numbers that describe what is average or typical within
a distribution of data. There are three main measures of central tendency: mean, median, and
mode. While they are all measures of central tendency, each is calculated differently and
measures something different from the others.

MEAN

The mean is the most common measure of central tendency used by researchers and people in all
kinds of professions.

It is the measure of central tendency that is also referred to as the average. A researcher
can use the mean to describe the data distribution of variables measured at the interval or ratio
level. These are variables measured numerically on a scale, such as household income, years of
education, or the number of children within a family.

A mean is very easy to calculate. One simply has to add all the data values or "scores" and then
divide this sum by the total number of scores in the distribution of data. For example, if five
families have 0, 2, 2, 3, and 5 children respectively, the mean number of children is (0 + 2 + 2 +
3 + 5)/5 = 12/5 = 2.4. This means that the five households have an average of 2.4 children.
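The same calculation in Python, both by hand and with the standard library's statistics module:

```python
from statistics import mean

# Number of children in each of the five families from the example.
children = [0, 2, 2, 3, 5]

print(sum(children) / len(children))  # 2.4 -- sum of scores / number of scores
print(mean(children))                 # 2.4 -- same result via statistics.mean
```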

Advantage of the mean:

The mean can be used for both continuous and discrete numeric data.
Limitations of the mean:

The mean cannot be calculated for categorical data, as the values cannot be summed.
As the mean includes every value in the distribution the mean is influenced by outliers
and skewed distributions.
The population mean is indicated by the Greek letter μ (pronounced "mu"). When the mean is
calculated on a distribution from a sample it is indicated by the symbol x̄ (pronounced "x-bar").

MEDIAN

The median is the value at the middle of a distribution of data when those data are organized
from the lowest to the highest value.

This measure of central tendency can be calculated for variables that are measured with
ordinal, interval or ratio scales.

Calculating the median is also rather simple. Let's suppose we have the following list of
numbers: 5, 7, 10, 43, 2, 69, 31, 6, 22. First, we must arrange the numbers in order from lowest
to highest.

The result is this: 2, 5, 6, 7, 10, 22, 31, 43, 69. The median is 10 because it is the exact middle
number. There are four numbers below 10 and four numbers above 10.

If a data distribution has an even number of cases, there is no single middle value; instead,
the median is the average of the two middle scores. For example, if we add the number 87 to the
end of our list of numbers above, we have 10 total numbers in our distribution, so there is no
single middle number. The two middle numbers in our new list are 10 and 22, so we take their
average: (10 + 22) / 2 = 16. Our median is now 16.
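Both cases, the odd-sized list and the even-sized list, can be checked with statistics.median, which applies exactly the middle-value / average-of-two-middle-values rule described above:

```python
from statistics import median

odd = [5, 7, 10, 43, 2, 69, 31, 6, 22]  # 9 values -> exact middle value
even = odd + [87]                        # 10 values -> average of middle two

print(median(odd))   # 10
print(median(even))  # 16.0
```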

Advantage of the median:

The median is less affected by outliers and skewed data than the mean, and is usually the
preferred measure of central tendency when the distribution is not symmetrical.
Limitation of the median:

The median cannot be identified for categorical nominal data, as it cannot be logically
ordered.

MODE

The mode is the measure of central tendency that identifies the category or score that occurs the
most frequently within the distribution of data.

In other words, it is the most common score or the score that appears the highest number
of times in a distribution. The mode can be calculated for any type of data, including those
measured as nominal variables, or by name.

For example, let's say we are looking at pets owned by 100 families and the distribution looks
like this:

Animal Number of families that own it

Dog 60

Cat 35

Fish 17

Hamster 13

Snake 3

The mode here is "dog" since more families own a dog than any other animal. Note that the
mode is always expressed as the category or score, not the frequency of that score. For instance,
in the above example, the mode is "dog," not 60, which is the number of times dog appears.
Some distributions do not have a mode at all. This happens when each category has the same
frequency. Other distributions might have more than one mode. For example, when a distribution
has two scores or categories with the same highest frequency, it is often referred to as "bimodal."
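Finding the mode of the pet data amounts to taking the most common category; a sketch with collections.Counter (note that the answer is the category, not its count):

```python
from collections import Counter

# Pets owned by 100 families, from the table above.
counts = Counter({"Dog": 60, "Cat": 35, "Fish": 17,
                  "Hamster": 13, "Snake": 3})

mode, freq = counts.most_common(1)[0]
print(mode)  # Dog -- the mode is the category itself, not the frequency 60
```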

Advantage of the mode:

1. The mode has an advantage over the median and the mean as it can be found for
both numerical and categorical (non-numerical) data.

Limitations of the mode:


1. There are some limitations to using the mode. In some distributions, the mode may not reflect
the center of the distribution very well. When the distribution of retirement ages below is ordered
from lowest to highest value, it is easy to see that the center of the distribution is 57 years, but
the mode is lower, at 54 years.

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

2. It is also possible for there to be more than one mode for the same distribution of data, (bi-
modal, or multi-modal). The presence of more than one mode can limit the ability of the mode in
describing the center or typical value of the distribution because a single value to describe the
center cannot be identified.

3. In some cases, particularly where the data are continuous, the distribution may have no mode
at all (i.e. if all values are different).

4. In cases such as these, it may be better to consider using the median or mean, or group the data
in to appropriate intervals, and find the modal class.
Symmetrical distributions:

When a distribution is symmetrical, the mode, median and mean are all in the middle of the
distribution. The following graph shows a larger retirement age dataset with a distribution which
is symmetrical. The mode, median and mean all equal 58 years.

Skewed distributions:

When a distribution is skewed the mode remains the most commonly occurring value, the
median remains the middle value in the distribution, but the mean is generally pulled in the
direction of the tails. In a skewed distribution, the median is often a preferred measure of central
tendency, as the mean is not usually in the middle of the distribution.

A distribution is said to be positively or right skewed when the tail on the right side of the
distribution is longer than the left side. In a positively skewed distribution it is common for the
mean to be pulled toward the right tail of the distribution. Although there are exceptions to this
rule, generally, most of the values, including the median value, tend to be less than the mean
value.
The following graph shows a larger retirement age data set with a distribution which is right
skewed. The data has been grouped into classes, as the variable being measured (retirement age)
is continuous. The mode is 54 years, the modal class is 54-56 years, the median is 56 years and
the mean is 57.2 years.

A distribution is said to be negatively or left skewed when the tail on the left side of the
distribution is longer than the right side. In a negatively skewed distribution, it is common for the
mean to be pulled toward the left tail of the distribution. Although there are exceptions to this
rule, generally, most of the values, including the median value, tend to be greater than the mean
value.

The following graph shows a larger retirement age dataset with a distribution which is left skewed.
The mode is 65 years, the modal class is 63-65 years, the median is 63 years and the mean is
61.8 years.
Importance

Measures of central tendency are very useful in statistics. They are important for the
following reasons:

(i) To find representative value:

Measures of central tendency or averages give us one value for the distribution and this value
represents the entire distribution. In this way averages convert a group of figures into one value.

(ii) To condense data:

Collected and classified figures are vast. To condense these figures we use average. Average
converts the whole set of figures into just one figure and thus helps in condensation.

(iii) To make comparisons:

To make comparisons of two or more than two distributions, we have to find the representative
values of these distributions. These representative values are found with the help of measures of
the central tendency.

(iv) Helpful in further statistical analysis:

Many techniques of statistical analysis like Measures of Dispersion, Measures of Skewness,


Measures of Correlation, and Index Numbers are based on measures of central tendency. That is
why measures of central tendency are also called measures of the first order.
Assignment No. 6

Measures of Variability

Definition

In addition to figuring out the measures of central tendency, we may need to summarize
the amount of variability we have in our distribution. In other words, we need to determine if the
observations tend to cluster together or if they tend to be spread out. Consider the following
example:

Sample 1: {0, 0, 0, 0, 25}


Sample 2: {5, 5, 5, 5, 5}

Both of these samples have identical means (5) and an identical number of observations
(n = 5), but the amount of variation between the two samples differs considerably. Sample 2 has
no variability (all scores are exactly the same), whereas Sample 1 has relatively more (one case
varies substantially from the other four). In this course, we will be going over four measures of
variability: the range, the inter-quartile range (IQR), the variance and the standard deviation.

The Range

The range is the most obvious measure of dispersion and is the difference between the lowest
and highest values in a dataset. In figure 1, the size of the largest semester 1 tutorial group is 6
students and the size of the smallest group is 4 students, resulting in a range of 2 (6-4). In
semester 2, the largest tutorial group size is 7 students and the smallest tutorial group contains 3
students, therefore the range is 4 (7-3).

The range is simple to compute and is useful when you wish to evaluate the whole of a
dataset.

The range is useful for showing the spread within a dataset and for comparing the spread
between similar datasets.
An example of the use of the range to compare spread within datasets is provided in table 1. The
scores of individual students in the examination and coursework component of a module are
shown.

To find the range in marks, the highest and lowest values need to be found from the table.
The highest coursework mark was 48 and the lowest was 27, giving a range of 21. In the
examination, the highest mark was 45 and the lowest 12, producing a range of 33. This indicates
that there was wider variation in the students' performance in the examination than in the
coursework for this module.

Since the range is based solely on the two most extreme values within the dataset, if one
of these is either exceptionally high or low (sometimes referred to as an outlier) it will result in a
range that is not typical of the variability within the dataset. For example, imagine in the above
example that one student failed to hand in any coursework and was awarded a mark of zero,
but sat the exam and scored 40. The range for the coursework marks would now
become 48 (48 - 0), rather than 21; this new range is not typical of the dataset as a whole
and is distorted by the outlier in the coursework marks. In order to reduce the problems caused
by outliers in a dataset, the inter-quartile range is often calculated instead of the range.

The Inter-quartile Range

The inter-quartile range is a measure that indicates the extent to which the central 50% of
values within the dataset are dispersed. It is based upon, and related to, the median.

In the same way that the median divides a dataset into two halves, it can be further
divided into quarters by identifying the upper and lower quartiles. The lower quartile is found
one quarter of the way along a dataset when the values have been arranged in order of
magnitude; the upper quartile is found three quarters along the dataset. Therefore, the upper
quartile lies half way between the median and the highest value in the dataset whilst the lower
quartile lies halfway between the median and the lowest value in the dataset. The inter-quartile
range is found by subtracting the lower quartile from the upper quartile.

For example, the examination marks for 20 students following a particular module are arranged
in order of magnitude.

The median lies at the mid-point between the two central values (10th and 11th)

= half-way between 60 and 62 = 61

The lower quartile lies at the mid-point between the 5th and 6th values

= half-way between 52 and 53 = 52.5

The upper quartile lies at the mid-point between the 15th and 16th values

= half-way between 70 and 71 = 70.5

The inter-quartile range for this dataset is therefore 70.5 - 52.5 = 18 whereas the range is: 80 - 43
= 37.
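The mid-point rule used above can be sketched in Python. Only some of the 20 marks are quoted in the text, so the list below is hypothetical, filled in to match the quoted figures (median 61, quartiles 52.5 and 70.5, extremes 43 and 80):

```python
def mid(sorted_data, i, j):
    """Half-way between the i-th and j-th values (1-based positions)."""
    return (sorted_data[i - 1] + sorted_data[j - 1]) / 2

# Hypothetical sorted marks consistent with the worked figures above.
marks = [43, 45, 48, 50, 52, 53, 55, 57, 58, 60,
         62, 64, 66, 68, 70, 71, 73, 75, 78, 80]

median = mid(marks, 10, 11)  # between 10th and 11th values
q1 = mid(marks, 5, 6)        # between 5th and 6th values
q3 = mid(marks, 15, 16)      # between 15th and 16th values

print(median, q1, q3)        # 61.0 52.5 70.5
print(q3 - q1)               # 18.0 -- inter-quartile range
print(marks[-1] - marks[0])  # 37   -- range
```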

The inter-quartile range provides a clearer picture of the overall dataset by removing/ignoring the
outlying values.
Like the range however, the inter-quartile range is a measure of dispersion that is based upon
only two values from the dataset. Statistically, the standard deviation is a more powerful measure
of dispersion because it takes into account every value in the dataset. The standard deviation is
explored in the next section of this guide.

The Variance

The variance is a measure of variability that represents how far each observation falls from
the mean of the distribution. For this example, we'll be using the following five numbers, which
represent my total monthly comic book purchases over the last five months:

2, 3, 5, 6, 9

The formula for calculating a sample variance is usually written out like this:

s²x = Σ(x − x̄)² / (n − 1)

This equation looks intimidating, but it's not that bad once you break it down into its component
parts. s²x is the notation used to denote the variance of a sample. That giant sigma (Σ) is a
summation sign; it just means we're going to be adding things together. The x represents each of
our observations, and x̄ (often called "x-bar") represents the mean of our
distribution. The n on the bottom is the total number of observations. Basically, this
formula is telling us to subtract the mean from each of our observations, square the differences,
add them all together and divide by n − 1. Let's do an example using the above numbers.

1. The first step in calculating the variance is finding the mean of the distribution. In this case,
the mean is 5 (2+3+5+6+9 = 25; 25/5 = 5).

2. The second step is to subtract the mean (5) from each of the observations:

2-5 = -3
3-5 = -2
5-5 = 0
6-5 = 1
9-5 = 4

Please note: we can check our work after this step by adding all of our values together. If they
sum to zero, we know we're on the right track. If they add up to something besides zero, we
should probably check our math again (-3+-2+0+1+4 = 0, we're golden).

3. Third, we square each of those answers to get rid of the negative numbers:

(-3)² = 9
(-2)² = 4
(0)² = 0
(1)² = 1
(4)² = 16

4. Fourth, we add them all together:

9+4+0+1+16=30

5. Finally, we divide by N-1 (the total number of observations is 5, so 5-1=4)

30/4 = 7.5

After all those rather tedious calculations, we're left with a single number that quickly and
succinctly summarizes the amount of variability in our distribution. The bigger the number, the
more variability we have in our distribution. Please note: a variance can never be negative. If you
come up with a variance that's less than zero, you've done something wrong.
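The five steps above can be followed line by line in Python, and the result checked against the statistics module:

```python
from statistics import mean, variance

purchases = [2, 3, 5, 6, 9]               # monthly comic book purchases

m = mean(purchases)                       # step 1: mean = 5
devs = [x - m for x in purchases]         # step 2: deviations from the mean
squares = [d ** 2 for d in devs]          # step 3: squared deviations
s2 = sum(squares) / (len(purchases) - 1)  # steps 4-5: add up, divide by n-1

print(sum(devs))            # 0   -- the step-2 check: deviations sum to zero
print(s2)                   # 7.5
print(variance(purchases))  # 7.5 -- same result via statistics.variance
```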

The Standard Deviation

The standard deviation is a measure that summarizes the amount by which every value within a
dataset varies from the mean. Effectively it indicates how tightly the values in the dataset are
bunched around the mean value. It is the most robust and widely used measure of dispersion
since, unlike the range and inter-quartile range, it takes into account every variable in the dataset.
When the values in a dataset are pretty tightly bunched together the standard deviation is small.
When the values are spread apart the standard deviation will be relatively large. The standard
deviation is usually presented in conjunction with the mean and is measured in the same units.

In many datasets the values deviate from the mean value due to chance and such datasets are said
to display a normal distribution. In a dataset with a normal distribution most of the values are
clustered around the mean while relatively few values tend to be extremely high or extremely
low. Many natural phenomena display a normal distribution.

For datasets that have a normal distribution, the standard deviation can be used to determine the
proportion of values that lie within a particular range of the mean value. For such distributions,
approximately 68% of values lie within one standard deviation (1SD) of the mean, 95% of values
lie within two standard deviations (2SD) of the mean, and 99% of values lie within three standard
deviations (3SD) of the mean. Figure 3 shows this concept in diagrammatic form.

If the mean of a data set is 25 and its standard deviation is 1.6, then
68% of the values in the dataset will lie between MEAN-1SD (25-1.6=23.4)
and MEAN+1SD (25+1.6=26.6)

99% of the values will lie between MEAN-3SD (25-4.8=20.2)


and MEAN+3SD(25+4.8=29.8).
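These ranges are simple arithmetic on the mean and standard deviation:

```python
# Intervals of the form mean +/- k * SD for mean = 25, SD = 1.6.
m, sd = 25, 1.6

one_sd = (round(m - sd, 1), round(m + sd, 1))            # holds ~68% of values
three_sd = (round(m - 3 * sd, 1), round(m + 3 * sd, 1))  # holds ~99% of values

print(one_sd)    # (23.4, 26.6)
print(three_sd)  # (20.2, 29.8)
```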

If the dataset had the same mean of 25 but a larger standard deviation (for example, 2.3) it
would indicate that the values were more dispersed. The frequency distribution for a dispersed
dataset would still show a normal distribution but when plotted on a graph the shape of the curve
will be flatter as in figure 4.

Population and sample standard deviations

There are two different calculations for the Standard Deviation. Which formula you use
depends upon whether the values in your dataset represent an entire population or whether they
form a sample of a larger population. For example, if all student users of the library were asked
how many books they had borrowed in the past month then the entire population has been
studied since all the students have been asked. In such cases the population standard deviation
should be used. Sometimes it is not possible to find information about an entire population and it
might be more realistic to ask a sample of 150 students about their library borrowing and use
these results to estimate library borrowing habits for the entire population of students. In such
cases the sample standard deviation should be used.

Formulae for the standard deviation

While it is not necessary to learn the formula for calculating the standard deviation, there
may be times when you wish to include it in a report or dissertation.

The standard deviation of an entire population is known as σ (sigma) and is calculated using:

σ = √( Σ(x − μ)² / N )

where x represents each value in the population, μ is the mean value of the
population, Σ is the summation (or total), and N is the number of values in the population.

The standard deviation of a sample is known as s and is calculated using:

s = √( Σ(x − x̄)² / (n − 1) )

where x represents each value in the sample, x̄ is the mean value of the sample, Σ is
the summation (or total), and n − 1 is the number of values in the sample minus 1.
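The statistics module implements both calculations: pstdev divides by N (population) and stdev divides by n − 1 (sample). A sketch with hypothetical borrowing counts:

```python
from statistics import pstdev, stdev

# Hypothetical numbers of books borrowed by eight students.
books = [2, 4, 4, 4, 5, 5, 7, 9]

print(pstdev(books))           # 2.0 -- population SD (divide by N)
print(round(stdev(books), 3))  # 2.138 -- sample SD (divide by n-1), larger
```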

Main Points

Measures of central tendency tell us what is common or typical about our variable.

Three measures of central tendency are the mode, the median and the mean.

The mode is used almost exclusively with nominal-level data, as it is the only measure of
central tendency available for such variables. The median is used with ordinal-level data
or when an interval/ratio-level variable is skewed (think of the Bill Gates example). The
mean can only be used with interval/ratio level data.

Measures of variability are numbers that describe how much variation or diversity there
is in a distribution.
Four measures of variability are the range (the difference between the largest and smallest
observations), the interquartile range (the difference between the 75th and 25th
percentiles), the variance, and the standard deviation.

The variance and standard deviation are two closely related measures of variability for
interval/ratio-level variables that increase or decrease depending on how closely the
observations are clustered around the mean.

Importance

An important use of statistics is to measure variability or the spread of data. For example,
two measures of variability are the standard deviation and the range. The standard deviation
measures the spread of data from the mean or the average score. Once the standard deviation is
known other linear transformations may be conducted on the raw data to obtain other scores that
provide more information about variability such as the z score and the T score. The range,
another measure of spread, is simply the difference between the largest and smallest data values.
The range is the simplest measure of variability to compute.

The standard deviation can be an effective tool for teachers. It can be useful in analyzing
classroom test results. A large standard deviation might tell a teacher that the class grades were
spread a great distance from the mean; a small standard deviation might reflect the opposite.
In analyzing test results, a teacher can assume that with a small standard deviation the students
understood the material uniformly, while with a large standard deviation there is a large amount
of variation in how well students understood the material tested.

One of the limits of statistics can be found in calculating the range. Since outliers are
used to determine the range, they are very influential in the resulting statistic. For example, if
one class had test grades of 0%, 50%, and 100%, and another class had grades of 50%, 90%, and
100%, the range differs greatly between the two classes due to the outliers. The ranges would
be 100 and 50 respectively.
When analyzing test results it is helpful to group scores by the number of standard
deviations from the mean. For the data set [3, 5, 7, 7, 7, 38] the mean is 11.16 and the standard
deviation is 13.24. One standard deviation from the mean is 11.16 plus or minus 13.24, i.e.
24.40 and -2.08, and two standard deviations from the mean is 11.16 plus or minus twice
13.24, i.e. 37.64 and -15.32. Imagine that the above data are points scored in a baseball game.
Thirty-eight is between two and three standard deviations above the mean.
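Using the rounded mean and standard deviation quoted above, the bands and the position of the score 38 can be checked:

```python
# Mean and SD (rounded, as quoted above) for the data [3, 5, 7, 7, 7, 38].
m, s = 11.16, 13.24

band1 = (round(m - s, 2), round(m + s, 2))          # one SD from the mean
band2 = (round(m - 2 * s, 2), round(m + 2 * s, 2))  # two SDs from the mean
z = round((38 - m) / s, 2)  # how many SDs the score 38 lies above the mean

print(band1)  # (-2.08, 24.4)
print(band2)  # (-15.32, 37.64)
print(z)      # 2.03 -- between two and three SDs above the mean
```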
