Definitions and Uses of Statistical Data

-the notes were made from SMP series, CUP.

Why find and quote an 'average' value?


1. To SUMMARISE a set of data by giving a 'REPRESENTATIVE' VALUE, or
2. To MAKE COMPARISONS.

DISCRETE data: the data values are separated by gaps, like 1, 2, 3 or 5, 5.5, 6, 6.5.
CONTINUOUS data: comes from measurement, such as the SPEED OF A CAR, 42.34568 mph.

LINEAR INTERPOLATION is a method of estimating unknown values that lie between known values.
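
A minimal sketch of linear interpolation, assuming two known points and a straight line between them. The function name and the sample points are illustrative only and are not taken from the SMP notes.

```python
def interpolate(x0, y0, x1, y1, x):
    """Estimate y at x, assuming a straight line between (x0, y0) and (x1, y1)."""
    # Fraction of the way from x0 to x1 at which x lies.
    fraction = (x - x0) / (x1 - x0)
    return y0 + fraction * (y1 - y0)


# Hypothetical known values: y = 20 at x = 10 and y = 40 at x = 20.
estimate = interpolate(10, 20, 20, 40, 15)
print(estimate)
```

Here x = 15 lies halfway between the known x values, so the estimate is halfway between the known y values.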

2. SUMMARISING DATA
The purpose of summarising a set of data is to OBTAIN A FEW MEASURES used in comparing one set of
data with another.

MODE:
PROS. With data such as clothes sizes, it is useful as it gives the MOST POPULAR size.
CONS. Of no use with SMALL data sets, and in some cases does NOT give a central value.
With grouped data, the modal group depends on how the data was initially grouped.

The mode is the value that occurs most frequently. Unless it occurs often enough, it is a POOR
MEASURE of a 'CENTRAL OR TYPICAL' value, as in 8,8,8,12,14,16,25,74,15,42. It is only a reasonable
measure when there is a DEFINITE CONCENTRATION OF EQUAL values near the MIDDLE of a distribution, as in
4,7,11,15,17,18,18,18,18,19,20,25,29
Generally, it is MUCH LESS USEFUL than other measures of average.
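
The two data sets quoted above can be checked with Python's standard statistics module. This is only an illustrative sketch; the variable names are made up for the example.

```python
from statistics import median, mode

# First data set from the notes: the mode sits at the bottom of the data.
scattered = [8, 8, 8, 12, 14, 16, 25, 74, 15, 42]
# Second data set from the notes: equal values concentrated near the middle.
concentrated = [4, 7, 11, 15, 17, 18, 18, 18, 18, 19, 20, 25, 29]

for name, data in (("scattered", scattered), ("concentrated", concentrated)):
    print(name, "mode:", mode(data), "median:", median(data))
```

In the first set the mode is 8 while the median is 14.5, so the mode is not a typical value. In the second set the mode and the median are both 18.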

Average or 'central' value: mean and median, but on their own these give an INCOMPLETE PICTURE.
Measures of spread: standard deviation and IQR

A measure of spread is needed to identify better the DISTRIBUTION OF DATA.


RANGE: spread of data
PROS. Shows full extent of spread of data and easily understood.
CONS. It is DEPENDENT on the most extreme values. Individual extreme values at the lower or upper end can
DISTORT the overall picture, so QUARTILES are preferred.

MEDIAN AND IQR are unaffected by or insensitive to extreme values at either end of distribution of data.

MEDIAN is a good measure of average of data


Example: the median being greater for the second set than for the first indicates that the road sign caused a REDUCTION in speed.

PROS. Unaffected by extreme outliers.


CONS. It is INSENSITIVE to changes in DATA VALUES and does NOT use the whole data set; it is GOVERNED BY
ONLY A SMALL SECTION OF DATA IN THE MIDDLE.
Calculators are difficult to program to find the median.
The 'median' wage is the wage of the 'middle earning' worker.

Two commonly used MEASURES OF SPREAD are Standard Deviation and Interquartile Range
INTERQUARTILE RANGE: tells the spread of 'middle half' of the data. It excludes extreme values so is a
better measure of spread than the simple range

PROS. Unaffected by extreme outliers; allows useful statements such as '50 percent of the data lie within the IQR'.
CONS. It is INSENSITIVE to some ways of changing the data: EXTREME VALUES can be INCREASED OR DECREASED
without making any difference to the IQR. It does not use the whole data set and is governed by
two small sections of data.
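
A short sketch of this insensitivity, reusing one of the data sets quoted earlier in the notes and replacing its largest value with a hypothetical outlier of 290. Note that several quartile conventions exist; this uses the default method of Python's statistics.quantiles, which may give slightly different quartiles from the method taught in the SMP books.

```python
from statistics import quantiles


def iqr(data):
    """Interquartile range: upper quartile minus lower quartile."""
    q1, _median, q3 = quantiles(data, n=4)
    return q3 - q1


def data_range(data):
    """Simple range: largest value minus smallest value."""
    return max(data) - min(data)


original = [4, 7, 11, 15, 17, 18, 18, 18, 18, 19, 20, 25, 29]
# Same data with the largest value replaced by an extreme (hypothetical) value.
with_outlier = original[:-1] + [290]

print("range:", data_range(original), "->", data_range(with_outlier))
print("IQR:  ", iqr(original), "->", iqr(with_outlier))
```

The range jumps from 25 to 286, while the IQR is the same for both lists.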

Median and IQR are unaffected by extreme values, so if EXTREME VALUES are relatively unimportant, the IQR
may be a BETTER MEASUREMENT OF SPREAD.

MEAN AND S.D. are affected by or sensitive to extreme values at either end of distribution of data.

A measure of spread that accounts for all the data values is the STANDARD DEVIATION, which is based on
the deviations of each data value from the MEAN. It is used when working with a
THEORETICAL MODEL; otherwise there may be reasons for preferring the IQR.

MEAN:
PROS. Like the standard deviation, the MEAN TAKES INTO ACCOUNT ALL THE ACTUAL DATA VALUES. It is
mathematically defined by a formula, so calculators can easily be programmed to find it. It
is SENSITIVE to extreme values but gives a FULLER PICTURE OF THE WHOLE DATA SET.
CONS. It is SENSITIVE to extreme values. It can be greatly affected by a SMALL NUMBER OF OUTLIERS.

If an operation is UNIVERSALLY APPLIED, i.e. the same multiplication or addition is applied to EVERY data value, the SAME HAPPENS
TO THE MEAN. Add 5 to every value and 5 is added to the mean; multiply every value by 5 and the mean is multiplied by 5.
If the data was coded, the MEAN TOO has to be decoded.
The MEAN is calculated using every individual data value so is SENSITIVE TO CHANGES IN DATA VALUES.
The mean wage is found by adding up all the workers' wages and sharing out the total equally between the
workers.

STANDARD DEVIATION:
PROS. Uses the WHOLE DATA SET, so it REPRESENTS EVERY ITEM OF DATA. It is
DEFINED BY A MATHEMATICAL FORMULA, calculators are easily programmed to find it, and it is useful in comparing
TWO SETS OF DATA.
CONS. It can be greatly affected by OUTLIERS; for a SINGLE SET OF DATA it gives little useful information.
If every data value is increased or decreased by the same amount (+ or -), the STANDARD DEVIATION is
UNALTERED, but when every value is multiplied by the same factor (×), the STANDARD DEVIATION IS ALSO MULTIPLIED.
When DE-CODING, ignore any addition or subtraction but MULTIPLY OR DIVIDE ACCORDINGLY.
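
The effect of coding on the mean and the standard deviation can be checked numerically. The sketch below uses a made-up data set and assumes the standard deviation is the divide-by-n form (statistics.pstdev); the same conclusions hold for the divide-by-(n-1) form.

```python
from statistics import mean, pstdev

# Hypothetical data set with mean 5 and standard deviation 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]
shifted = [x + 5 for x in data]  # add 5 to every value
scaled = [x * 5 for x in data]   # multiply every value by 5

for name, values in (("original", data), ("add 5", shifted), ("times 5", scaled)):
    print(name, "mean:", mean(values), "sd:", pstdev(values))
```

Adding 5 moves the mean from 5 to 10 and leaves the standard deviation at 2. Multiplying by 5 takes the mean to 25 and the standard deviation to 10.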

The Mean and Standard Deviation calculated from GROUPED FREQUENCY TABLES are ONLY
ESTIMATES; the ASSUMPTION is that the data is EVENLY SPREAD WITHIN EACH GROUP.
Using MORE GROUPS of SMALLER SIZE will improve the ACCURACY OF THESE ESTIMATES.
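
As an illustration, here is one common way of forming these estimates: each group is represented by its midpoint. The table below is hypothetical, and the divide-by-n form of the standard deviation is an assumption.

```python
from math import sqrt

# Hypothetical grouped frequency table: (lower bound, upper bound, frequency).
table = [(0, 10, 3), (10, 20, 5), (20, 30, 2)]


def grouped_estimates(table):
    """Estimate the mean and standard deviation using group midpoints."""
    n = sum(freq for _lo, _hi, freq in table)
    mids = [((lo + hi) / 2, freq) for lo, hi, freq in table]
    mean = sum(mid * freq for mid, freq in mids) / n
    # Variance as (mean of the squares) minus (square of the mean).
    variance = sum(freq * mid ** 2 for mid, freq in mids) / n - mean ** 2
    return mean, sqrt(variance)


est_mean, est_sd = grouped_estimates(table)
print(est_mean, est_sd)
```

For this table the estimated mean is 14 and the estimated standard deviation is 7. The true values would depend on where the data actually lie within each group.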

Users of statistics sometimes opt for the measure that displays the most favourable result. In practice it is best to use
one or more sets of measures and to emphasise any limitations of the measures in light of the data.

LINEAR REGRESSION

Explanatory or controlled variable, always plotted as x: whichever variable affects the outcome, not the
other way around; e.g. temperature affects ice cream sales, not vice versa.
Response variable, plotted as y: whichever variable is being predicted.
In y = a + bx, a is the y-intercept, the value of y when x = 0, e.g. the PRESSURE at a temperature of 0°C.
b is the gradient, e.g. the INCREASE IN PRESSURE resulting from each 1 degree RISE IN TEMPERATURE.

Least squares regression line is a line which MINIMISES the sum of squared deviations of the actual values
from the line of best fit.
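
A minimal sketch of a least squares fit for y = a + bx, using the standard formulas b = Sxy / Sxx and a = (mean of y) - b × (mean of x). The data values are hypothetical and the function names are chosen for this example only.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


def sum_squared_deviations(a, b, xs, ys):
    """Sum of squared vertical deviations of the data from y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares(xs, ys)
print(a, b, sum_squared_deviations(a, b, xs, ys))
```

For this data the line is approximately y = 2.2 + 0.6x. Nudging a or b away from these values makes the sum of squared deviations larger.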

Predicting a value within the range of given data is INTERPOLATION.


If a STRAIGHT LINE is an appropriate model, then PREDICTIONS BY INTERPOLATION are usually
reliable.
Predicting outside the range of the original data is EXTRAPOLATION, which must be treated with
CAUTION.
It assumes that the LINEAR RELATIONSHIP will continue to hold outside the observed
range, and this must be checked.
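
The distinction can be sketched as a small helper that labels each prediction. The line y = 2.2 + 0.6x and the observed x range of 1 to 5 are hypothetical values chosen for illustration.

```python
def predict(a, b, x, x_min, x_max):
    """Predict y = a + b*x and say whether x lies inside the observed range."""
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return a + b * x, kind


# Hypothetical regression line and observed range of x values.
a, b = 2.2, 0.6
x_min, x_max = 1, 5

for x in (3, 8):
    value, kind = predict(a, b, x, x_min, x_max)
    print(x, round(value, 2), kind)
```

The prediction at x = 3 is an interpolation. The prediction at x = 8 is an extrapolation and relies on the linear relationship still holding there.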

Fitting a straight line to a set of data is FITTING A LINEAR REGRESSION MODEL. Before doing this, check whether
a linear model is appropriate, i.e. whether the data follows a linear relationship, BY A SCATTER
DIAGRAM or by other means (e.g. the PMCC, see below).

CORRELATION
CODING does NOT alter the value of the correlation coefficient: the correlation between x and y does not
change if, e.g., the y values are doubled and 5 is added to the x values.
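
This can be checked with a short sketch that computes the product moment correlation coefficient as Sxy / sqrt(Sxx × Syy). The data is hypothetical; the coding applied (add 5 to x, double y) is the example from the notes.

```python
from math import sqrt


def pmcc(xs, ys):
    """Product moment correlation coefficient of paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
coded_xs = [x + 5 for x in xs]  # 5 added to every x value
coded_ys = [2 * y for y in ys]  # every y value doubled

print(pmcc(xs, ys), pmcc(coded_xs, coded_ys))
```

Both calls give the same value, up to rounding. One caveat not covered in the notes: multiplying one variable by a negative number reverses the sign of the coefficient.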

Statistics NEVER AIMS TO PROVE anything but to provide EVIDENCE TO SUPPORT A PARTICULAR
THEORY; it is often used in inappropriate ways to 'support' claims.
A common MISUSE OF CORRELATION is to take a strong correlation to mean that one variable is the 'CAUSE' of the
other. A high positive correlation simply says that high values of one variable occur with high values
of the other variable, and low with low.
There are cases of SPURIOUS CORRELATION where the apparent link is the EFFECT OF A THIRD
VARIABLE, such as height and intelligence both being affected by AGE.

Outliers can make a correlation appear stronger than it is, or make an otherwise strong correlation appear weak.
In some cases, correlation may exist in a WHOLE POPULATION but not within a SUBGROUP OF IT, or vice
versa.
The PMCC (product moment correlation coefficient) is only an APPROPRIATE MEASURE if the relationship between two random variables is LINEAR.
A PMCC close to 1 or -1 can be evidence of a LINEAR RELATIONSHIP, making the regression line worth finding.
The CLOSER the correlation is to 1 or -1, the MORE ACCURATE ARE VALUES PREDICTED FROM THE
REGRESSION LINE: the STRONGER THE CORRELATION, THE BETTER THE FIT OF A LINEAR MODEL.

MATHEMATICAL MODELLING
The main reason for MODELLING REAL LIFE SITUATIONS is to ANSWER QUESTIONS MATHEMATICALLY,
SOLVE PROBLEMS and MAKE CONFIDENT PREDICTIONS ABOUT THEIR BEHAVIOUR.

Modelling saves TIME, EFFORT and EXPENSE and is more efficient. An experiment can be performed to verify
the findings and, if there is a DISCREPANCY, the model can be improved.

STATISTICAL MODELS are different from other mathematical models in that they MODEL SITUATIONS THAT
INVOLVE UNCERTAINTY, such as:
PROBABILITY MODELS
A probability model can be used to model the reliability of an appliance made of components, or the
spread of a disease. AUTOMATIC INFECTION is not the case when an infected person comes into contact with an
uninfected person; the nature of the disease and the general health of the person have to be considered.
PARAMETERS are the values needed to arrive at definite predictions, such as the PROBABILITY OF
TRANSMISSION, the LENGTH OF TIME BEFORE SOMEONE WHO CAUGHT THE DISEASE BECOMES
INFECTIOUS THEMSELVES, AND the LENGTH OF TIME BEFORE A PERSON IS NO LONGER INFECTIOUS.
These need to be collected from EXPERIMENTS AND SURVEYS or estimated.
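
As a rough illustration of how parameters lead to a definite prediction, here is a sketch of the appliance reliability example. It rests on two assumptions that are not in the notes: the appliance works only if every component works, and the components fail independently. The component probabilities are hypothetical parameters.

```python
from math import prod


def appliance_reliability(component_probs):
    """Probability that every component works, ASSUMING independent components."""
    return prod(component_probs)


# Hypothetical parameters: each of three components works with probability 0.9.
reliability = appliance_reliability([0.9, 0.9, 0.9])
print(round(reliability, 3))
```

With these parameters the model predicts a reliability of about 0.729. If the components do not fail independently, this model would not be valid.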
REGRESSION MODELS
The relationship between two variables can be expressed as a mathematical equation.
Real world variables do not follow these mathematical relationships exactly.
A linear regression line is the best attempt to fit a line to a set of data, so predictions made from it are uncertain.
Their reliability depends on how closely the line fits the data.
The parameters are the intercept and gradient, which define the position of the regression line.
One example is the HEIGHT OF A PERSON predicted in terms of the LENGTH OF A BONE IN THEIR BODY.
RANDOM VARIABLES: DISTRIBUTION MODELS
Real world quantities can often be modelled by a random variable whose distribution is defined by a
MATHEMATICAL FUNCTION.
The most important mathematically defined continuous distribution is the NORMAL DISTRIBUTION, which models
closely many real quantities like the HEIGHT OF ADULT MALES or ERRORS MADE BY MEASURING
DEVICES.
The parameters which specify the normal distribution model are the MEAN and STANDARD DEVIATION, which
allow facts about the distribution to be deduced.
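
A brief sketch of deducing facts from the two parameters, using NormalDist from Python's standard statistics module. The mean of 175 cm and standard deviation of 7 cm are hypothetical values, not figures from the notes.

```python
from statistics import NormalDist

# Hypothetical model for adult male height in cm.
heights = NormalDist(mu=175, sigma=7)

# Proportion below 182 cm, i.e. below one standard deviation above the mean.
p_below_182 = heights.cdf(182)
# Proportion within one standard deviation of the mean (168 cm to 182 cm).
p_within_one_sd = heights.cdf(182) - heights.cdf(168)

print(round(p_below_182, 3), round(p_within_one_sd, 3))
```

Under this model about 84% of heights fall below 182 cm and about 68% fall within one standard deviation of the mean.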
CHECKING VALIDITY OF A MODEL
Before using a type of model, check that assumptions it makes correspond with the situation being modelled.
For example, there is no point in applying a probability model that assumes outcomes to be independent to
situations that are not.
Many errors are made by MISUSE OF MODELS.
