
COLLEGE OF MEDICINE AND HEALTH SCIENCES

SCHOOL OF MEDICINES AND PHARMACY


DEPARTMENT OF PHARMACY YEAR 2 2016-2017

BIOSTATISTICS
SESSION 1, LEARNING NOTES

BY STUDENT MT-EINSTEIN, YEAR 2016-2017


WHAT IS BIOSTATISTICS
Etymologically,
biostatistics refers to the application of statistics to a wide
range of topics in biology, including medical sciences.

Specifically, biostatistics is the science which deals with the


development and application of the most appropriate methods for
the:
Collection of data;
Presentation/organization of the collected data in quantitative
form;
Analysis of the data;
Interpretation of the results and decision-making on the basis of such analysis.
WHY BIOSTATISTICS
Role of biostatisticians
GENERAL
To guide the design of an experiment or survey prior to data collection
To analyze data using proper statistical procedures and techniques
To present and interpret the results to researchers and other decision
makers
SPECIFIC
Identify and develop treatments for disease and estimate their
effects.
Identify risk factors for diseases.
Develop statistical methodologies to address questions arising from
medical/public health data
Design, monitor, analyze, interpret, and report results of clinical
studies.
AREA OF BIOSTATISTICS APPLICATION
Areas of application of biostatistics

Biostatistics concepts are applied to biological problems, including in:

Public health
Medicine
Ecology and environmental science

Biostatisticians must have knowledge of the above areas.


WHAT IS DATA
Data – definition and source
- “Data” are the starting point of all biostatistical operations,
from the design of an experiment up to the interpretation of the
results of clinical studies through an accurate data analysis process.

We therefore first need to know what data are, and how
and where they are obtained.
DATA VS VARIABLE
Data: measurements or observations of a variable
made from a sample of a given population.

Variable: a characteristic or phenomenon that
can be measured or classified.
TYPE EXAMPLE
Example: class survey
Students in an introductory statistics course were asked the following
questions as part of a class survey:
1 What is your gender, male or female?
2 Are you introverted or extraverted?
3 On average, how many hours of sleep do you get per night?
4 What is your bedtime: 8pm-10pm, 10pm-12am, 12am-2am, later
than 2am?
5 How many countries have you visited?
6 On a scale of 1 (very little) - 5 (a lot), how much do you dread
this semester?
SCHEMATIC
TYPE OF VARIABLE
gender, personality: categorical, nominal (dichotomous here, since each has
only two categories); country of birth would likewise be nominal
there is no inherent order between male and female, therefore gender is
not ordinal

sleep: numerical, continuous


even though data is reported as whole numbers, sleep is measured
on a continuous scale, people just tend to round their responses in
surveys

bedtime: categorical, ordinal


there is an inherent ordering in these time intervals
TYPE OF VARIABLE
countries: numerical, discrete
data are counted, and can only take on whole numbers

dread: categorical, ordinal, could also be used as numerical


categories have an inherent ordering

Demographic characteristics such as sex or race are recorded as nominal variables.


Categorical variables can be nominal or ordinal.

A nominal variable is assigned (not measured) and could be a demographic


characteristic such as sex or race.

An ordinal variable is a ranking, such as mild, moderate, or severe.


POPULATION VS SAMPLE
Population: we would like to learn something about the characteristics
of the population (parameters).
Sample: we collect data from a sample of the population, since studying
the whole population is time consuming.
A summary value of a population is a parameter; a summary value of a
sample is a statistic.
POPULATION VS SAMPLE
Statistic: Summary data from a sample.

Examples:
The observed proportion of the sample that responds to treatment;
The observed association between a risk factor and a disease in this
sample.

Parameter: Summary data from a population.


Examples:
The proportion of the population that would respond to a certain
drug
The association between a risk factor and a disease in a population
POPULATION VS SAMPLE

Population
A group of individuals
that we would like to
know something about

Sample
A subset of a population
(hopefully
representative)
RANDOM VS NON RANDOM SAMPLING

Random samples

Subjects are selected from a population so that


each individual has an equal chance of being selected.

Random samples tend to be representative of the source population.

Non-random samples may not be representative.


They may be biased regarding age, severity of the condition,
socioeconomic status etc. …
RANDOM VS NON RANDOM
Random samples are rarely utilized in health care
research.

 Instead, patients are randomly assigned to treatment and


control groups.

Each person has an equal chance of being assigned to either


of the groups.

Random assignment is also known as randomization.


LEVELS OF MEASUREMENT IN BIOSTATISTICS
Variables in a study are measured on a certain scale of measurement.

Scales or levels of measurement refer to how the properties of numbers


can change with different uses.

There are 4 levels or scales of measurement, which define
different kinds of variables, hence different kinds of data:

Nominal
Ordinal
Interval
Ratio
NOMINAL DATA/VARIABLE

Nominal = categorical

*Data that are classified into
categories and
cannot be arranged in any particular order.
A nominal variable with only two categories is called dichotomous.
*E.g. Gender (Male and Female); country of birth (Rwanda, USA,
India...), personality type, yes or no, demographic characteristics of a population.
ORDINAL VARIABLE
Ordinal = ranked
Data that are ranked or ordered: 1st, 2nd, 3rd, etc.
Used to rank and order the levels of the data or variable being
studied. No particular value is placed on the distance between the numbers in the
rating scale.
E.g. Adverse events: severity of an ocular problem in a patient
Mild, moderate, severe, life-threatening, death

Income: the levels of income are different and ordered
Low, medium, high
INTERVAL DATA/VARIABLE

Interval

The difference between the numbers on the scale is meaningful and intervals
are equal in size. NO absolute zero.

Temperatures on a thermometer: the difference between 60 and 70
is the same as the difference between 90 and 100.
(The length of a person or an object has a true zero, so it is better treated as a ratio variable.)
Intervals allow for comparisons between the things being measured
RATIO VARIABLE/ DATA

Ratio: has an absolute zero point

Scales that do have an absolute zero point
that indicates the absence of the variable
being studied.

E.g. Body weight, height, family size, age, ...


MEASURE OF CENTRAL TENDENCY

Scale of measurement    Best measure

Nominal                 Mode
Ordinal                 Median
Interval                Symmetrical: Mean; Skewed: Median
Ratio                   Symmetrical: Mean; Skewed: Median
ASPECTS AND CATEGORY OF DATA
COMMENT

Quantitative variables are measured values.

A discrete quantitative variable has a finite number of possible


measurements.

A continuous quantitative variable has an infinite number of possible


measurements within a range, as would be typical for a serum
chemistry test such as glucose.
CLINICAL CASE
A 33-year-old woman comes to you complaining of lower
abdominal pain which she has had for the past day. She
left her job as a nurse's aide (her second day on the job)
because the pain was so bad. She says the pain began
after she had fallen off a stepstool while getting a
bedpan off a top shelf. No one saw her fall, but she
convinced her supervisor that she had an industrial
accident and needed medical attention because of blood
in her urine. To prove it, she brings in a urine specimen
QUESTIONS 1
How do you correlate the macroscopic and microscopic findings?

The macroscopic appearance is red, but the test for blood is


negative and there are no RBC's microscopically. It is unlikely
to be rhabdomyolysis. This specimen could be factitious.

It would be a simple matter to have the patient produce


another sample (though she might still be carrying the same
bottle of red food coloring with her). Remember that various
drugs can also produce colored urine. Eating fresh beets can
color the urine red temporarily
QUESTIONS 2

What do you think is happening?


Although care and concern should be the
immediate response of health care workers to
a patient, and historical findings should be duly
noted, remember that patients may not always
be telling you everything, or telling you
correctly, particularly when compensation is
being sought.
QUESTIONS 3
What kind of variables are pH and protein?

These measurements represent a quantitative (measured) variable that is


discrete, with a finite number of possible measurements in the range of 5 to 8 for pH
and from 0 to 4+ for protein.

The other form of quantitative variable is continuous with an infinite


number of possible measurements within a range, as would be typical for a serum
chemistry test such as urea nitrogen or creatinine.

Categorical variables could be nominal or ordinal. A nominal variable is


assigned (not measured) and could be a demographic characteristic such as sex
or race. An ordinal variable is a ranking, such as mild, moderate, or severe
CONTINUOUS VS DISCRETE DATA

Continuous
Definition: A set of data is said to be continuous if the values belonging to the
set can take on ANY value within a finite or infinite interval.
Examples: • A person's height could be any value (within the range of human
heights), not just certain fixed heights

Discrete
Definition: A set of data is said to be discrete if the values belonging to the
set are distinct and separate (unconnected values).
Examples: • The number of students in a class. • The number of questions on a
pharmacology test.
CONTINUOUS DATA VS DISCRETE

Continuous (further examples)
• A person's body weight, age, ...
• The outdoor temperature (To) at noon (any value within the possible To range)
NOTE: Continuous data (CD) is measured
Function: In the graph of CD, the points
are connected with a continuous line,
since every point has meaning to the
original problem

Discrete
NOTE: Discrete data (DD) is counted
Function: In the graph of DD, only
separate, distinct points are plotted, and
only these points have meaning to the
original problem
MORE TO KEEP IN MIND
Continuous numeric data are of interest in investigations such as:
Average age of patients compared to average age of non-patients
Respiratory rate of those exposed to a chemical vs. respiratory rate
of those who were not exposed

If there are many different discrete values, then discrete data is


often treated as continuous.

Examples: CD4 count, HIV viral load

If there are very few discrete values, then discrete data is often
treated as ordinal.
TYPES OF VARIABLE (NOT KINDS)
Two types
Variables can be classified as
independent or
dependent

1. Independent variable (IV)

is the variable that is manipulated (or selected) by the researcher in an experiment and that
is not changed by the other variables being observed (hence “independent”).

IV is believed to influence the outcome measure (dependent variable) and is
the “presumed cause.”
e.g. time, age, ...
TYPE OF VARIABLE
A dependent variable (DV)

is the variable that is dependent on the independent variable(s)


i.e. a DV is the variable that is believed to change in the presence of the
independent variable.

It is the “presumed effect.”


The measured variable in an experiment (e.g. plasma concentration) is
referred to as DV.

DV vs IV: plasma concentration and time: Let’s take example of a patient who
has taken a drug in the morning. The plasma concentration of this drug is a DV
since it changes over time during the day after drug intake.
TYPE OF VARIABLE
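An intervening variable
is the variable that links the independent and dependent variables.

A confounding variable is described in these notes as a variable that has many other variables, or
dimensions, built into it, so that it is not clear what it contains or measures.
(More generally, a confounder is a variable related to both the independent and the dependent
variable, which can distort the apparent relationship between them.)
For example: Socio-Economic Status (SES). How can we measure SES? Income,
employment status, etc.
EXAMPLE OF CONFOUNDING VARIABLE

We need to be careful when using confounding variables.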


Example
A researcher wants to study the effect of Vitamin C on cancer.
Vitamin C would be the independent variable because it is hypothesized
that it will have an effect on cancer, and cancer would be the dependent
variable because it is the variable that may be influenced by Vitamin C.
DATA PRESENTATION METHOD

Numerical presentation
Graphical presentation
Mathematical presentation
NUMERICAL PRESENTATION
For example, frequency tables and other tabular summaries
GRAPHICAL PRESENTATION
1.Graphs drawn using Cartesian coordinates

In graphs, the data can be concisely summarized into:

• Bar graph (or bar chart), Histogram, Box plot, Line graph, Frequency polygon,
Frequency curve, Scatter plot

Bar Graphs when presenting Nominal data (No order to horizontal axis)
Histograms when presenting Continuous or ordinal data (these should be on
horizontal axis)
Box Plots when presenting Continuous data
2.pie chart
3.statistical maps
FREQUENCY POLYGON
WHY IT IS ALWAYS BETTER TO SUMMARIZE YOUR DATA

It is ALWAYS a good idea to summarize your data (at least for important
variables)

You become familiar with the data and the characteristics of the sample
that you are studying

You can also identify problems with data collection or errors in the data
(data management issues)

Dataset structure: presenting data first requires building a dataset


Think of data as a rectangular matrix of rows and columns.
Simplest structure.
Rows represent the “experimental unit” NB: Each row is an independent observation.
Columns represent “variables” measured on the experimental unit
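
As a minimal sketch of the rows-and-columns idea above, the class survey from earlier could be stored as follows. The variable names and all values are invented purely for illustration.

```python
# Each row is one experimental unit (here, one student);
# each key is a variable (column). Values are hypothetical.
survey = [
    {"id": 1, "gender": "F", "sleep_hours": 7.5, "countries_visited": 3},
    {"id": 2, "gender": "M", "sleep_hours": 6.0, "countries_visited": 0},
    {"id": 3, "gender": "F", "sleep_hours": 8.0, "countries_visited": 5},
]

n_rows = len(survey)               # number of independent observations
columns = list(survey[0].keys())   # names of the variables

# Reading one column = collecting the same variable from every row.
sleep = [row["sleep_hours"] for row in survey]

print(n_rows, columns, sleep)
```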
SCATTER PLOTS ARE USED TO SHOW CORRELATION
MATHEMATICAL PRESENTATION

Data presentation is usually performed through Descriptive statistics.


Some measures that are commonly used to describe a data set are the
following

Measures of Central Tendency    Measures of Variability (Dispersion)
- mean                          - range
- median                        - variance
- mode                          - standard deviation
MEASURE OF CENTRAL TENDENCY
Mode: the most frequently occurring score

Median: the middle score, which divides the ordered scores into 2 halves (with an odd number of scores it is the middle one; with an even number it is the average of the two middle ones)

Mean: the sum of all the scores divided by the total number of scores (= average)

When the distribution of the data is normal, the mean lies in the middle of the distribution of the scores (= median), and the
mean is a good measure of central tendency

It is preferred whenever possible and is the only measure of central tendency that is
used in advanced statistical calculations:
o More reliable and accurate
o Better suited to arithmetic calculations
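
The three measures can be illustrated with a short sketch using Python's standard `statistics` module. The scores are hypothetical.

```python
import statistics

scores = [2, 3, 3, 4, 6, 8, 11]        # hypothetical scores (odd count)

mode = statistics.mode(scores)         # most frequent score
median = statistics.median(scores)     # middle score of the ordered data
mean = statistics.mean(scores)         # sum of scores / number of scores

# With an even number of scores, the median is the average
# of the two middle scores.
even_scores = [2, 3, 4, 6]
median_even = statistics.median(even_scores)

print(mode, median, round(mean, 2), median_even)
```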
C.T
The mean can be misleading because it can be greatly influenced by extreme
scores, called outliers
For example, the average length of stay at a hospital could be greatly
influenced by one patient who stays for 5 years
C.T

Sometimes the median may yield more information when your


distribution contains outliers, or is skewed (not normally distributed).
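
A small sketch of the length-of-stay example shows how one outlier moves the mean but barely moves the median. The stays are invented, and 5 years is approximated as 1825 days.

```python
import statistics

# Hypothetical lengths of stay, in days.
stays = [2, 3, 3, 4, 5, 6, 7]
stays_with_outlier = stays + [1825]    # one patient staying about 5 years

mean_before = statistics.mean(stays)
mean_after = statistics.mean(stays_with_outlier)

median_before = statistics.median(stays)
median_after = statistics.median(stays_with_outlier)

print(mean_before, mean_after)       # the mean jumps
print(median_before, median_after)   # the median hardly changes
```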
What is a median?
MEASURE OF THE VARIABILITY
Range = MAX - MIN
Used only for Ordinal, Interval, and Ratio scales as the data must be ordered
Example: 2 3 4 6 8 11 24 (Range is 22)
Variance (s²)
- The variance is the extent to which individual scores in a distribution of scores differ from one
another. The larger the variance, the further spread out the data. It is a measure of how
spread out a distribution is.
- The variance is the average squared deviation of the observations from their mean (how the
observations ‘vary’ from the mean).
Standard Deviation (SD)
SD=The square root of the variance
SD is a measure of the variability of a set of data in a distribution (most widely used measure of the
dispersion)
SD reflects how the data/observations/scores vary from the mean
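
The range, variance and SD can be sketched on the example data above (2 3 4 6 8 11 24). The notes do not say whether the sum of squared deviations is divided by n or by n - 1, so both are shown. Treating the sample version (n - 1) as the one behind the SD is an assumption.

```python
import math
import statistics

values = [2, 3, 4, 6, 8, 11, 24]       # example data from the notes

data_range = max(values) - min(values)  # MAX - MIN

# Average squared deviation from the mean, dividing by n (population form).
pop_var = statistics.pvariance(values)

# Sample form, dividing by n - 1 (assumed here to be the s² of the notes).
sample_var = statistics.variance(values)

# SD = square root of the variance.
sd = math.sqrt(sample_var)

print(data_range, round(pop_var, 2), round(sample_var, 2), round(sd, 2))
```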
SD AND S
STANDARD DEVIATION AND VARIANCE
QUARTILES
Quartiles are the three values
that split the sorted data into
four equal parts.
-Second Quartile (Q2) =
median.
-Lower quartile (Q1) = median
of lower half of the data
-Upper quartile (Q3) = median
of upper half of the data
-Need to order the individuals
first (from 1 to “N” individuals)
-One quarter of the individuals
lie in each of the four parts
delimited by the quartiles
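
The "median of each half" rule can be sketched as below. The helper name `quartiles_by_halves` is hypothetical, and leaving the median out of both halves when N is odd is one convention among several. Statistical packages may use other interpolation rules and give slightly different values.

```python
import statistics

def quartiles_by_halves(values):
    """Return (lower quartile, median, upper quartile) of the data."""
    data = sorted(values)                 # order the individuals first
    n = len(data)
    half = n // 2
    lower = data[:half]                   # lower half of the data
    # For odd n, the middle value is left out of both halves.
    upper = data[half + 1:] if n % 2 else data[half:]
    return (statistics.median(lower),
            statistics.median(data),
            statistics.median(upper))

odd_example = quartiles_by_halves([2, 3, 4, 6, 8, 11, 24])
even_example = quartiles_by_halves([1, 2, 3, 4, 5, 6, 7, 8])
print(odd_example, even_example)
```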
STANDARD ERROR OF MEAN

A measure of variability among means of samples selected from certain population.
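
As a sketch, the standard error of the mean is usually estimated from a single sample as s / √n. The values n = 30 and s = 8 are borrowed from the confidence-interval example later in these notes.

```python
import math

n = 30     # sample size
s = 8      # sample standard deviation

# Standard error of the mean: SD divided by the square root of n.
sem = s / math.sqrt(n)

print(round(sem, 2))
```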


PROBABILITY DISTRIBUTION

A probability distribution
is a device for indicating the values that a random variable may have.
There are two categories of random variables:
a. discrete random variables, and
b. continuous random variables.

1. The probability distribution of a discrete random variable
specifies all possible values of a discrete random variable along with their respective
probabilities. Examples can be:

Frequency distribution
Probability distribution (relative frequency distribution)
Cumulative frequency
PROBABILITY DISTRIBUTION

A continuous random variable can assume any value


within a specified interval of values assumed by the
variable.

In a general case, with a large number of class


intervals, the frequency polygon begins to resemble a
smooth curve.
NORMAL DISTRIBUTION=GAUSSIAN DISTRIBUTION
The shape of data
Histograms of frequency distributions
demonstrate the shape of the
data better.

Distributions are often symmetrical
with most scores falling in the middle
and fewer toward the extremes.

Many biological data are symmetrically
distributed and form a normal curve
(also called “bell-shaped curve”). Such
data are said to be normally
distributed.
PROPERTIES OF A NORMAL DISTRIBUTION
A variable whose distribution follows a normal curve
has a normal distribution
Properties of a normal distribution
are:
It is symmetric about its mean
The highest point is at its mean
The mean, median and mode are
all equal.
The total area under the curve
above the x-axis is 1 square unit.
Therefore 50% is to the right of
median and 50% is to the left of
median
PROPERTIES OF A NORMAL DISTRIBUTION
Perpendiculars erected at:

±1 s from the mean contain about
68%;
±2 s contain about
95%;
±3 s contain about
99.7% of the area
under the curve
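
These percentages can be checked with a short sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The area within ±k standard deviations is the difference of two cumulative probabilities.

```python
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, standard deviation 1

# Area under the curve between -k and +k standard deviations.
areas = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in areas.items():
    print(f"within ±{k} s: {area:.1%}")
```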
WHY SPREAD IS IMPORTANT
Spread is important
when comparing 2 or
more group means.
For instance, it is
more difficult to see
a clear distinction
between groups in
the upper example
because the spread is
wider, even though
the means are the
same.
STANDARD NORMAL DISTRIBUTION
A normal distribution is determined by
its mean m and standard deviation s. This creates a family of distributions
depending on whatever the values of
m and s are.
- The standard normal distribution has
mean = 0 and standard deviation = 1.
Standard Z-Scores: the standard z
score is obtained by creating a
variable z whose value is:
z = (x - m) / s
STANDARD NORMAL DISTRIBUTION
Given the values of m and s we can convert a value of x to a value of z.
A Z-score
is the number of standard deviations above or below the mean.

A Z-score of 1.5 means


that the score is 1.5 standard deviations above the mean;

a Z score of -1.5 means


that the score is 1.5 standard deviations below the mean.

It always has the same meaning in all distributions.
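
A minimal sketch of the conversion, assuming the usual definition z = (x - m) / s. The values m = 40 and s = 8 are illustrative only.

```python
def z_score(x, m, s):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - m) / s

m, s = 40, 8                   # illustrative mean and standard deviation

z_above = z_score(52, m, s)    # 52 is 12 units above the mean
z_below = z_score(28, m, s)    # 28 is 12 units below the mean

print(z_above, z_below)
```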


DISTORTION OF NORMAL CURVE
Data may not be normally
distributed:
[Figure: normal distribution graph and box plot]
- There may be data that are
outliers that distort the mean.
This is called a skewed distribution
(SKEWNESS).
- Data may be bunched about
the mean in a non-normal
fashion. This is called a kurtotic
distribution (KURTOSIS).
POSITIVE AND NEGATIVE SKEWNESS
Skewness: the data are not distributed symmetrically around the
mean. Consequently:
The mean, median, and mode are not equal and
are in different positions;
Scores (data) are clustered at one end of the
distribution (right or left)
A small number of outliers are located at the
opposite end
A variable that is positively skewed has
outliers to the right of the mean, that is, greater than
the mean. In that case, a positively skewed
distribution ‘points’ towards the right.
A negatively skewed variable has outliers to
the left of the mean, that is, smaller than the mean; a negatively
skewed distribution ‘points’ towards the left.
POSITIVE (LEPTO) AND NEGATIVE (PLATY) KURTOSIS
It examines how peaked or flat
a distribution is compared with a perfect normal
‘bell shape’.
A variable that is positively kurtotic
(has a positive kurtosis) is lepto-kurtic:
it is too ‘pointed’ and has a low standard
deviation value. In this case, the data
are bunched together and give a tall,
thin distribution which is not normal.
A variable that is negatively kurtotic
is platy-kurtic and is too ‘flat’. In this
case, the data are spread out and give
a low, flat distribution which is not
normal.
HOW TO EXAMINE THE NORMAL DISTRIBUTION OF THE DATA
There are both
graphical and
statistical methods for evaluating normality.

Graphical methods mainly include Histogram, Box-Whisker plot, Dot plot, the
normality plots (=Q-Q and P-P plots), etc… Normality plots are much used.

Statistical methods include:


o diagnostic tests for skewness and kurtosis (values within the interval -0.5 to +0.5 are taken as normal)
o Normality Statistical tests
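
As a sketch of what the skewness and kurtosis diagnostics measure, the simple moment-based coefficients can be computed as below. The function names and data are hypothetical. Statistical packages such as SPSS apply small-sample corrections, so their reported values will differ somewhat from these.

```python
def skewness(values):
    """Moment-based skewness: 0 for a perfectly symmetric data set."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so that a normal curve scores about 0."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n
    m4 = sum((x - m) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3

symmetric = [1, 2, 3, 4, 5]                   # hypothetical data
right_skewed = [1, 2, 2, 3, 3, 4, 5, 8, 20, 60]

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed))
```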
WHAT SHOULD BE DONE FOR THE ABNORMAL DISTRIBUTION OF THE DATA
When testing shows that the data are not normally distributed, a transformation is required
in order to analyze them parametrically.
If this is not done, we conduct a non-parametric analysis of the data.
Three common transformations are:
the logarithmic transformation (the commonest),
the square root transformation, and
the inverse transformation. They correct for skewness and unequal
variances.

Notice
Transformation should be justified: it is recommended when including a non-
normally distributed variable in the analysis will reduce the effectiveness at
identifying statistical relationships, i.e. when this leads to losing power, due
to lack of normal distribution of the variable to be analyzed.
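
The three transformations can be sketched on a small right-skewed data set. The data and the simple moment-based `skewness` helper are assumptions made for illustration. All values must be positive for the logarithm and the inverse to be defined.

```python
import math

def skewness(values):
    """Moment-based skewness (no small-sample correction)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 60]     # hypothetical, right-skewed

log_t = [math.log(x) for x in raw]         # logarithmic transformation
sqrt_t = [math.sqrt(x) for x in raw]       # square root transformation
inv_t = [1 / x for x in raw]               # inverse transformation

for name, data in [("raw", raw), ("log", log_t),
                   ("sqrt", sqrt_t), ("inverse", inv_t)]:
    print(name, round(skewness(data), 2))
```

In this example the logarithm reduces the skewness the most of the three. Normality would still need to be re-checked on the transformed variable.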
TYPE OF THE STATISTICS
There are two types of statistics:
Descriptive Statistics
 Inferential Statistics

1. Descriptive statistics
are used to summarize, organize, and make sense of a set of data (scores or observations).
They are typically presented graphically, in tabular form (in tables), or as summary statistics (single
values).

- e.g.: mean, median, mode, frequencies, range, variance, standard deviation, quartiles, standard error of
the mean
They also help to describe the relationship between variables.

NB: descriptive statistics have been largely discussed in the previous paragraphs.
INFERENTIAL STATISTICS

Inferential Statistics are used to draw inferences about a population


from a sample.
Specifically it allows researchers to infer (make inferences) or
generalize observations made with samples to the larger population from
which they were selected.
Population and samples (reminder!):
Population: Group that the researcher wishes to study.
 Sample: A group of individuals selected from the population.
 Census: Gathering data from all units of a population, no sampling.

Inferential statistics generally require that data come from a random


sample (i.e. Probability sampling/equal chance of being chosen).
STATISTICAL SIGNIFICANCE
Significance level
Statistical analyses:
Allow to quantify the degree of relationship between variables
Allow generalization about populations using data from samples (inferential)
Specifically, the goal of statistical analysis is to answer the questions whether there is a significant
effect/association/difference between the variables of interest, and how big it is (if there is any).

The significance level is a pre-determined value used to decide whether to reject or retain the hypothesis.
A value of 0.05 is commonly used; the “p-value” obtained from the analysis is compared with it.
Statistically significant findings mean that the probability of obtaining such findings by chance only is less than 5%
(i.e. findings would occur no more than 5 out of 100 times by chance alone).

Therefore, findings would be deemed
statistically significant if the p-value was found to be less than 0.05 (p<0.05)
not statistically significant (insignificant) if the p-value was found to be greater than 0.05
(p>0.05)
MEASURE OF ASSOCIATION
What if there is an effect?
You need to measure how big the effect is by using a measure of
association like odds ratio, relative risk, absolute risk, attributable risk
etc..
Absolute Risk is the chance that a person will develop a certain disease
over a period of time (comparable to hazard in toxicology).
E.g.: out of 20,000 people, 1600 developed lung cancer over 10 years,
therefore the absolute risk of developing lung cancer is 8%.

Relative Risk (RR) is a measure of association between the presence or


absence of an exposure and the occurrence of an event.
o RR is when we compare one group of people to another to see if there
is an increased risk from being exposed.
MEASURE OF ASSOCIATION

o Used in randomized control trials and cohort studies.


o Can't use RR unless looking forward in time.
o RR is the measure of risk for those exposed compared to those
unexposed.

E.g. :
The 20 year risk of lung cancer for smokers is 15%
The 20 year risk of lung cancer among non-smokers is 1%
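
Using the figures quoted in these notes, the absolute risk and the relative risk can be worked out with a short sketch. The variable names are illustrative.

```python
# Absolute risk: 1600 cases among 20,000 people over 10 years.
cases = 1600
population = 20_000
absolute_risk = cases / population

# Relative risk: risk in the exposed divided by risk in the unexposed.
risk_smokers = 0.15        # 20-year risk of lung cancer for smokers
risk_non_smokers = 0.01    # 20-year risk among non-smokers
relative_risk = risk_smokers / risk_non_smokers

print(f"absolute risk = {absolute_risk:.0%}")
print(f"relative risk = {relative_risk:.0f}")
```

With these figures the relative risk is about 15, so smokers carry roughly 15 times the 20-year risk of non-smokers.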
MEASURE OF ASSOCIATION

Odds Ratio (OR) is a way of comparing whether the odds of a certain event
are the same for two groups.
Used for cross-sectional studies, case-control studies, and retrospective studies (a retrospective
study is one that refers to past events).
o In case control studies you can't estimate the rate of disease among study
subjects because subjects selected according to disease/no disease. So, you can't
take the rate of disease in both populations (in order to calculate RR).
o OR is the comparison between the odds of exposure among cases to the odds of
exposure among controls.
o Odds are same as betting odds. Example: if you have a 1 in 3 chance of winning a
draw, your odds are 1:2.
o To calculate OR, take the odds of exposure (cases)/odds of exposure (controls).
E.g. Smokers are 2.3 times more likely to develop lung cancer than non-smokers.
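
A minimal sketch of the odds and odds-ratio calculations described above. The case-control counts are invented for illustration and are not taken from any study.

```python
def odds(p):
    """Convert a probability into odds (chance of event : chance of no event)."""
    return p / (1 - p)

# A 1 in 3 chance of winning corresponds to odds of 1:2, i.e. 0.5.
draw_odds = odds(1 / 3)

# Hypothetical case-control counts.
cases_exposed, cases_unexposed = 60, 40
controls_exposed, controls_unexposed = 40, 60

odds_exposure_cases = cases_exposed / cases_unexposed
odds_exposure_controls = controls_exposed / controls_unexposed

# OR = odds of exposure among cases / odds of exposure among controls.
odds_ratio = odds_exposure_cases / odds_exposure_controls

print(round(draw_odds, 2), round(odds_ratio, 2))
```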
CONFIDENCE INTERVALS
When we measure the size of the effect we use confidence intervals
(CI). A CI is the range* in risk we would expect to see in the population.

CI provide an expected upper and lower limit (=range*) for a statistic
at a specified probability level (usually 95%, and sometimes 99%)
The odds ratio we found from our sample (E.g. Smokers are 2.3 times
more likely to develop cancer than non-smokers) is only true for the
sample we are using.

This exact number is only true for the sample we have examined; it
might be slightly different if we used another sample.

For this reason we calculate a confidence interval-which is the range in


risk we would expect to see in this population.
C.I

E.g: “a study of the effect of smoking on developing cancer”:


o A 95% confidence interval of 2.1 to 3.4 tells us that while smokers in
our study were 2.3 times more likely to develop cancer, in the general
population, smokers are between 2.1 and 3.4 times more likely to develop
cancer. We are 95% confident that this is true.

Calculating a CI:
For example, a sample mean is an estimate of the population mean.
A CI provides a band within which the population mean is likely to fall:
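CI = mean ± (Sm × confidence level), where Sm is the standard error of the mean and “confidence level” stands for the tabulated critical value for the chosen level of confidence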
CI

Example: n = 30, M = 40, s = 8


CI = 40 ± (1.46 × 2.045)
CI = 40 ± 2.99 = 37.01 to 42.99
The value “1.46” came from the following formula:
Sm = s / √n = 8 / √30 = 1.46
The value “2.045” (confidence level) came from appropriate tables (t distribution, 95%, n - 1 = 29 degrees of freedom).
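
The worked example can be reproduced with a short sketch. The critical value 2.045 is simply typed in, as read from tables in the notes, rather than computed.

```python
import math

n, mean, s = 30, 40, 8
critical_value = 2.045           # from tables, as in the notes

sem = s / math.sqrt(n)           # standard error of the mean
margin = sem * critical_value    # half-width of the interval

lower = mean - margin
upper = mean + margin

print(f"Sm = {sem:.2f}, margin = {margin:.2f}")
print(f"95% CI = {lower:.2f} to {upper:.2f}")
```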


POWER
If findings are statistically significant, then conclusions can be
easily drawn, but what if findings are insignificant? Power
is the probability that a test or study will detect a
statistically significant result.
Did the independent variables or treatment have zero effect? If
an effect really occurs, what is the chance that the experiment
will find a "statistically significant" result?
Determining power depends on several factors:
1) Sample size: how big was your sample?
2) Effect size: what size of an effect are you looking for? E.g.
How large of a difference (association, correlation) are you looking
for? What would be the most scientifically interesting?
3) Standard deviation: how scattered was your data?
POWER
For example:

a large sample, with a large effect, and a small standard


deviation

would be very likely to have found a statistically significant


finding, if it existed.
A power of 80%-95% is desirable.

 One of the best ways to increase the power of your study is to increase
your sample size.
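
The way the three factors drive power can be sketched with a normal approximation for comparing two group means. The function `approx_power`, the two-sided z-test with equal group sizes and a known common standard deviation, and all numbers below are assumptions made for illustration. They are not a method taken from these notes.

```python
import math
from statistics import NormalDist

def approx_power(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)        # e.g. 1.96 for alpha = 0.05
    se = sd * math.sqrt(2 / n_per_group)     # SE of the difference in means
    shift = effect / se                      # true effect in SE units
    # Probability that the test statistic falls in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

print(round(approx_power(effect=5, sd=10, n_per_group=20), 2))
print(round(approx_power(effect=5, sd=10, n_per_group=64), 2))   # larger sample
print(round(approx_power(effect=10, sd=10, n_per_group=20), 2))  # larger effect
print(round(approx_power(effect=5, sd=20, n_per_group=20), 2))   # more scatter
```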
STATISTICAL ANALYSES
Statistical analyses are either
parametric or
non-parametric.
Therefore, statistical analyses are performed using
parametric tests = assume that the variable in question is from a normal
distribution;
non-parametric tests =do not require any assumption of normal
distribution, are not sensitive

Most non-parametric tests do not require an interval or ratio


level of measurement; can be used with nominal/ordinal level data.
EXAMPLES OF PARAMETRIC AND NON-PARAMETRIC TESTS
INTRODUCTION TO SPSS FOR DATA HANDLING

Data entry in SPSS


Drawing graphs in SPSS
Computing descriptive statistics
Testing for normality assumptions

SPSS (Statistical Package for the Social Sciences) was designed to offer a more
user-friendly data analysis presentation than other statistical software.
It has gone through different versions and names over the past years (SPSS, IBM-SPSS, PASW -
Predictive Analytics Software).
TYPE OF THE DATA
Types of data (reminder):
Nominal , Ranked , Scales (measures :Interval Ratio) , Mixed types
Text answers (open ended questions)

Nominal (categorical)
− Order is arbitrary when entering data in SPSS
− e.g. Gender, country of birth, personality type, yes or no.
− Use numeric in SPSS and give value labels.

(e.g. 1=Female, 2=Male, 99=Missing)


(e.g. 1=Yes, 2=No, 99=Missing)
(e.g. 1=UK, 2=Ireland, 3=Pakistan, 4=India, 5=other, 99=Missing)
Ranks or Ordinal
Data must be ordered, 1st, 2nd, 3rd etc. e.g. status, social class
Use numeric in SPSS with value labels
E.g. 1=Working class, 2=Middle class, 3=Upper class
• E.g. Class of degree, 1=First, 2=Upper second, 3=Lower second, 4=Third,
5=Ordinary, 99=Missing

Measures, scales
− Interval - equal units
− Ratio - equal units, zero on scale
• e.g. Family size, Salary
• Makes sense to say one value is twice another
− Use numeric (or comma, dot or scientific) in SPSS
• NB: numeric if you can manage to use numbers
• E.g. Family size, 1, 2, 3, 4 etc.
• E.g. Salary per year, 25000, 14500, 18650 etc.
Mixed type
− Categorised data
− Actually ranked, but used to identify categories or groups
e.g. age groups
= ratio data put into groups
− Use numeric in SPSS and use value labels.
E.g. Age group, 1=Under 15
2=15-34
3=35-54
4=55 or greater
Text answers
− E.g. answers to open-ended questions
− Either enter text as given (Use String in SPSS) or
− Code or classify answers into one of a small number of types (Use numeric/nominal in
SPSS)
COMPUTING DESCRIPTIVE STATISTICS
Steps for statistical data analysis
Statistical data analysis is conducted in two steps:

1st step = Descriptive Statistics (to describe the sample) including Testing for
NORMALITY ASSUMPTIONS

2nd step = making inference (Inferential Statistics) (making inferences about the
population using what is observed in the sample).

Association statistics
Comparative statistics
Notice: As an introduction to SPSS for data analysis, we will focus on the first step (Descriptive
statistics); the second step is better covered after or combined with “Research Methodology”
courses/lectures
SPSS

For more on SPSS,
refer back to the notes.
TESTING FOR NORMALITY ASSUMPTIONS
Evaluating normality
There are both graphical and statistical methods for evaluating normality.

Graphical methods mainly include Histogram, Box-Whisker plot, Dot plot, the
normality plots (=Q-Q and P-P plots), etc… Normality plots are much used (Q-Q
plot is more common).

The assumption of univariate normality can be investigated using Statistical


methods including:
o diagnostic tests for skewness and kurtosis
o Normality Statistical tests

(=Kolmogorov-Smirnov Test and Shapiro-Wilk Test)


GRAPHICAL METHOD VS STATISTICAL METHOD
Statistical tests
Make an objective judgment of normality
but are sometimes not sensitive enough at low sample sizes or overly sensitive at
large sample sizes.

As such, some statisticians prefer to use their experience to make a subjective
judgment about the data from plots/graphs.

Graphical interpretation
allows good judgment to assess normality in situations when statistical tests
might be over or under sensitive
but graphical methods do lack objectivity.
Conclusion: in some cases, both methods complement each other (sometimes
you need to rely on statistical methods when graphical methods do not help you
to decide whether your data is normally distributed or not)
ASSESSING NORMALITY GRAPHICALLY
Q-Q plots and P-P plots are called probability plots.
A probability plot compares two data sets in terms of distribution:
one is the data to be analyzed (data you collected yourself) and
the other is reference, theoretical normally distributed data (usually shown as a
straight solid line).
If the data are normally distributed, the points fall along a straight line with positive
slope, as in the accompanying figure, indicating a good match between the two
distributions.
WHY DO WE EVEN NEED Q-Q PLOT OR P-P PLOT?

If we plot the quantiles of the two data sets against each other,
it is called a Q-Q (quantile-quantile) plot.
If we plot the cumulative probabilities of the two data sets against each other, it is
called a P-P (probability-probability) plot. The Q-Q plot is more common.
A histogram can be difficult to interpret, which is why Q-Q or P-P plots are often preferred.
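To make the difference between the two probability plots concrete, the sketch below computes the coordinates that each plot would display, without drawing anything. It assumes the reference normal distribution uses the sample's own mean and standard deviation and uses the plotting position (i − 0.5)/n; both are common choices but not the only ones, and the function names are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def qq_points(sample):
    """(theoretical quantile, observed value) pairs for a normal Q-Q plot."""
    n = len(sample)
    reference = NormalDist(mean(sample), stdev(sample))
    return [
        (reference.inv_cdf((i - 0.5) / n), x)
        for i, x in enumerate(sorted(sample), start=1)
    ]


def pp_points(sample):
    """(theoretical cumulative probability, observed cumulative proportion) pairs."""
    n = len(sample)
    reference = NormalDist(mean(sample), stdev(sample))
    return [
        (reference.cdf(x), (i - 0.5) / n)
        for i, x in enumerate(sorted(sample), start=1)
    ]


data = [1, 2, 3, 4, 5]   # illustrative, symmetric data
qq = qq_points(data)
pp = pp_points(data)
```

If the data are close to normal, both sets of points lie near the diagonal; the Q-Q plot works on the scale of the data, while the P-P plot works on the 0 to 1 probability scale.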
BOX-WHISKER PLOT
Usually used as a measure of variability (dispersion). The box-whisker
plot shows the data divided into four equal parts by three quartiles:
• Lower quartile (Q1) = median of lower half of the data
• Second quartile (Q2) = median
• Upper quartile (Q3) = median of upper half of the data
• Need to order the individuals first (from 1 to “N” individuals)
• One quarter of the individuals fall in each of the four parts; the inter-quartile range (Q3 − Q1) covers the middle half
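The "median of each half" definition of the quartiles can be sketched as below. This is one of several quartile conventions (statistical packages may interpolate differently), and for an odd number of observations this sketch leaves the overall median out of both halves. The function name and the data are illustrative only.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)          # individuals must be ordered first
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n the middle value is excluded from both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


data = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]   # hypothetical measurements
q1, q2, q3 = quartiles(data)
iqr = q3 - q1   # inter-quartile range: spread of the middle half of the data
```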
ASSESSING NORMALITY STATISTICALLY

Statistical methods include:

a) diagnostic tests for skewness and kurtosis
b) normality statistical tests:
Kolmogorov-Smirnov Test
Shapiro-Wilk Test

Diagnostic tests for normality follow a rule of thumb:

“a distribution is approximately normal if its skewness and kurtosis have values
between –1.0 and +1.0”.
A perfectly normal distribution will have a skewness statistic of zero.
ASSESSING NORMALITY STATISTICALLY

Positive values of the skewness score describe a positively skewed
distribution (tail pointing towards large positive scores), and
negative skewness scores describe a negatively skewed distribution.

A perfectly normal distribution will also have a kurtosis statistic of
zero (kurtosis is reported here as excess kurtosis, i.e. relative to the normal distribution).
Values above zero (positive kurtosis score) describe “pointed”
distributions (leptokurtosis) and
values below zero describe flat distributions (platykurtosis, negative kurtosis).
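The skewness and kurtosis scores discussed above can be illustrated with the simple moment-based formulas below. This is a minimal sketch: statistical packages often report bias-adjusted versions, so the numbers may differ slightly from software output for small samples. Kurtosis is returned as excess kurtosis, so that a normal distribution scores zero; the function name is hypothetical.

```python
from statistics import mean


def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (no small-sample adjustment)."""
    n = len(values)
    m = mean(values)
    m2 = sum((x - m) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - m) ** 3 for x in values) / n   # third central moment
    m4 = sum((x - m) ** 4 for x in values) / n   # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0         # zero for a normal distribution
    return skew, excess_kurtosis


flat = [1, 2, 3, 4, 5]            # symmetric and flat: skewness 0, negative kurtosis
right_tailed = [1, 1, 1, 2, 10]   # long tail to the right: positive skewness

skew_flat, kurt_flat = skewness_and_excess_kurtosis(flat)
skew_right, kurt_right = skewness_and_excess_kurtosis(right_tailed)
```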
NORMALITY STATISTICAL TESTS

Normality Statistical Tests include


Shapiro-Wilk Test (SW)
Kolmogorov-Smirnov Test (KS).

The KS is for a completely specified distribution (so if you are testing
normality, you must specify the mean and variance; they can't be
estimated from the data).

The SW is for normality, with unspecified mean and variance.

So the SW test is better for testing normality.
The KS test is, however, a good method for comparing the shapes of two
sample distributions.
As such, the SW is more appropriate for small sample sizes (< 50
samples), but can also handle sample sizes as large as 2000, which
is why it is often regarded as the best test for normality.
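To show what "completely specified" means for the KS test, the sketch below computes the one-sample KS statistic D against a normal distribution whose mean and standard deviation are given in advance, not estimated from the sample. The decision uses the large-sample approximation 1.36/√n for the 5% critical value, which is only an approximation and is less reliable for small samples. The function names and data are illustrative assumptions, not SPSS output.

```python
from math import sqrt
from statistics import NormalDist


def ks_statistic(sample, mu, sigma):
    """Largest gap between the empirical CDF and the specified normal CDF."""
    reference = NormalDist(mu, sigma)
    ordered = sorted(sample)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered, start=1):
        f = reference.cdf(x)
        # Check the gap just after and just before each step of the empirical CDF.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def ks_rejects_normality(sample, mu, sigma):
    """True if D exceeds the approximate 5% critical value (large-sample rule)."""
    return ks_statistic(sample, mu, sigma) > 1.36 / sqrt(len(sample))


# Illustrative sample built from the quantiles of a standard normal distribution.
sample = [NormalDist(0, 1).inv_cdf((i - 0.5) / 50) for i in range(1, 51)]

d_matching = ks_statistic(sample, mu=0, sigma=1)      # reference matches the data
d_shifted = ks_statistic(sample, mu=3, sigma=1)       # wrong mean specified
```

The second call shows why the specification matters: the same data are rejected when the hypothesised mean is wrong, even though their shape is normal.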
How do you ascertain statistically normal distribution of the data?
If the p-value (shown as “Sig.” in the output table) of the Shapiro-
Wilk Test is greater than 0.05 (> 0.05), the data do not deviate
significantly from a normal distribution, so normality can be assumed.
If it is below 0.05 (< 0.05), the data significantly deviate
from a normal distribution.
QUESTIONS
Study and analyze the SPSS result about the normality below
TRANSFORMATION REMINDER
When a variable is not normally distributed, we can create a transformed variable
to achieve normality. After transformation, normality should be tested again.
Then the transformed variable (if normally distributed) is analyzed by parametric
methods.
Three common transformations are: the logarithmic transformation (the commonest),
the square root transformation, and the inverse transformation. They correct for
skewness and unequal variances.
Transformation should be justified: it is recommended when including a non-normally
distributed variable in the analysis would reduce the effectiveness at identifying
statistical relationships, i.e. when the lack of normality leads to a loss of power.
When transformations do not work, we do have the option of changing the way the
information in the variable is represented, e.g. substituting several dichotomous variables
for a single metric variable. It is best to seek guidance from a statistician.
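The effect of a transformation on skewness can be checked numerically, as in the sketch below. It applies the logarithmic transformation to a small, invented right-skewed data set and compares moment-based skewness before and after. The data and function name are hypothetical, and the logarithm requires strictly positive values.

```python
from math import log
from statistics import mean


def skewness(values):
    """Moment-based skewness (no small-sample adjustment)."""
    n = len(values)
    m = mean(values)
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


raw = [1, 2, 2, 3, 4, 5, 8, 13, 40, 100]   # hypothetical right-skewed values
logged = [log(x) for x in raw]             # logarithmic transformation

skew_before = skewness(raw)
skew_after = skewness(logged)

print(f"skewness before: {skew_before:.2f}, after log transform: {skew_after:.2f}")
```

A smaller skewness after transformation does not by itself establish normality; as the notes state, normality should be tested again on the transformed variable.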
BIOSTATISTICS

SESSION 2 EXERCISING YOURSELF REMINDING YOUR WORK DONE


I. A Fahrenheit thermometer is an example of what:
A. Nominal
B. Ordinal
C. Interval
D. Ratio

II. Within 3 standard deviations, the mean picks up how much of the scores?
A. 68
B. 78
C. 99
D. 99.7
E. 99.9

III. Classification of dental disease is an example of what:
A. Nominal
B. Ordinal
C. Interval
D. Ratio

IV. Has categorical variables and bars are separate, but equal distances apart:
A. Bar Graph
B. Histogram
C. Frequency Polygon

V. Has continuous variables, bars touch and you can always find a third value:
A. Bar Graph
B. Histogram
C. Frequency Polygon

VI. Within 1 standard deviation, the mean picks up over how many of the values?
A. 60
B. 62
C. 65
D. 66
E. 68

VII. The degree to which the independent variable alone brings about the change in the
dependent variable is what:
A. Internal Validity
B. External Validity

VIII. The Student's t-test measures what:
A. Test the difference between 2 means
B. Test for differences between 3 or more means
C. Differences between two frequency distributions
D. Whether two distributions are independent or dependent

IX. The Scientific Method is:
A. Qualitative Research
B. Quantitative Research

X. As income level declines, tooth decay increases. This is an example of what:
A. Positive correlation
B. Negative correlation

XI. Randomly selecting a proportionate amount from subgroups is an example of what:
A. Random Sampling
B. Stratified Sampling
C. Systematic Sampling
D. Convenience Sampling

XII. Retrospective and Prospective are what types of Epidemiological Studies?
A. Analytical
B. Descriptive

XIII. Descriptive statistics make no attempt to generalize the research findings
beyond the immediate sample.
A. True
B. False

XIV. Randomly selecting a proportionate amount from subgroups is an example of what:
A. Random Sampling
B. Stratified Sampling
C. Systematic Sampling
D. Convenience Sampling

XV. In systematic sampling, every person has an equal or random chance of being selected.
A. True
B. False

XVI. A zero correlation coefficient shows:
A. A strong relationship
B. No relationship

What is the difference between positive correlation and negative correlation?
