
INFORMATION MANAGEMENT:

Data Analysis
Prepared by: Dr. Essam Morkos

Many people who readily acknowledge variation in their private lives fail to acknowledge its presence in the workplace. Why does a 2.3% drop in hospital admissions from April to May send shock waves through the hospital? Why do hospitals get excited when the volume of procedures drops for three consecutive months? The answer to these questions is actually quite simple: people really do not understand variation!

Wheeler (1993) offers a potential reason for this situation:

Managers and workers, educators and students, doctors and nurses, all have one thing in common. They come out of their educational experience knowing how to add, subtract, multiply, and divide; yet they have no understanding of how to digest numbers to extract knowledge that may be locked up inside the data. In fact, this shortcoming is also seen, to a lesser extent, among engineers and scientists.

This deficiency has been called numerical illiteracy.

Data Management
DATA

Variable
A characteristic or condition that changes or has different values for different individuals, e.g., weight, height, gender, net income.

Data
Ungrouped: Data Sets
Grouped: Frequency Table

Ungrouped Data
Data set (mg/dL): 30.0, 32.0, 33.5, 32.0, 33.0, 29.0, 31.0, 32.5, 34.5, 33.5, 30.5, 30.0, 34.0, 32.0, 35.0, 32.5, 31.5, 29.5, 31.5, 32.0

Grouped Data
Suppose that in thirty shots at a target, a marksman makes the following scores:
52234 13155 43203 24004 03215 54455

The frequencies of the different scores can be summarized as:


Score          0    1    2    3    4    5
Frequency      4    3    5    5    6    7
Frequency (%)  13%  10%  17%  17%  20%  23%
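The same tally can be reproduced in a few lines of Python. This is a minimal sketch using the standard-library Counter, with the 30 scores transcribed from the example above:

```python
from collections import Counter

# The 30 target scores from the example above.
scores = [5, 2, 2, 3, 4, 1, 3, 1, 5, 5,
          4, 3, 2, 0, 3, 2, 4, 0, 0, 4,
          0, 3, 2, 1, 5, 5, 4, 4, 5, 5]

counts = Counter(scores)
n = len(scores)

print("Score  Frequency  Frequency (%)")
for score in sorted(counts):
    freq = counts[score]
    print(f"{score:>5}  {freq:>9}  {freq / n:>12.0%}")
```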

Symmetry
Symmetry is implied when data values are distributed in the same way above and below the middle of the sample.
[Figure: frequency histogram of blood urea values (mg/dL), distributed symmetrically about the middle]

Symmetrical data sets:

Are easily interpreted;
Allow a balanced attitude to outliers, that is, those above and below the middle value (median) can be considered by the same criteria;
Allow comparisons of spread or dispersion with similar data sets.

Many standard statistical techniques are appropriate only for a symmetric distributional form.
For this reason, attempts are often made to transform skewed data so that they become roughly symmetric.

Skewness
Skewness is defined as asymmetry in the distribution of the sample data values.
Values on one side of the distribution tend to be further from the 'middle' than values on the other side.

For skewed data, the usual measures of location will give different values;
for example, mode < median < mean would indicate positive (or right) skewness.

Positive (or right) skewness is more common than negative (or left) skewness. If there is evidence of skewness in the data, we can apply transformations;
for example, taking logarithms of positively skewed data, as in the sketch below.
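As a rough illustration, this sketch generates hypothetical positively skewed data (a log-normal sample standing in for a lab value such as ALT) and shows that taking logarithms brings the skewness close to zero. The parameters are arbitrary choices for the example, not values from any real data set:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated positively skewed data (hypothetical enzyme levels).
raw = rng.lognormal(mean=3.0, sigma=0.5, size=1000)

print(f"skewness (raw): {stats.skew(raw):.2f}")          # clearly positive
print(f"skewness (log): {stats.skew(np.log(raw)):.2f}")  # close to 0
```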

Transformations of ALT

[Figure: histograms of ALT under power transformations: raw data (lambda = 1), logarithmic (lambda = 0), and over-log (lambda = -0.5). NHANES III: ALT, male, age 20 to 80, n = 6423]

Population
A population is any entire collection of people, animals, plants or things from which we may collect data. It is the entire group we are interested in, which we wish to describe or draw conclusions about.

Sample
A sample is a group of units selected from a larger group (the population). By studying the sample it is hoped to draw valid conclusions about the larger group.

Parameter
A parameter is a value, usually unknown (and which therefore has to be estimated), used to represent a certain population characteristic.
For example, the population mean is a parameter that is often used to indicate the average value of a quantity.

Estimate
An estimate is an indication of the value of an unknown quantity based on observed data. More formally, an estimate is the particular value of an estimator that is obtained from a particular sample of data and used to indicate the value of a parameter.

Statistic
A statistic is a quantity that is calculated from a sample of data. It is used to give information about unknown values in the corresponding population.
For example, the average of the data in a sample is used to give information about the overall average in the population from which that sample was drawn.

Sampling Distribution
The sampling distribution describes probabilities associated with a statistic when a random sample is drawn from a population.
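A sampling distribution can be made concrete by simulation. The sketch below draws repeated random samples from an arbitrary skewed "population" (an exponential distribution and a sample size of 30, both chosen purely for illustration) and summarizes the distribution of the sample mean:

```python
import numpy as np

rng = np.random.default_rng(1)

# A skewed "population" of 100,000 values (hypothetical).
population = rng.exponential(scale=10.0, size=100_000)

# Draw 5,000 random samples of size 30 and record each sample mean.
means = [rng.choice(population, size=30).mean() for _ in range(5000)]

print(f"population mean:      {population.mean():.2f}")
print(f"mean of sample means: {np.mean(means):.2f}")  # ~ population mean
print(f"SD of sample means:   {np.std(means):.2f}")   # ~ sigma / sqrt(30)
```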

Continuous Random Variable

A continuous random variable is one which takes an infinite number of possible values. Continuous random variables are usually measurements. Examples include:
height, weight, the amount of sugar in an orange, the time required to run a mile.

Continuous Probability Distributions


For Numerical Data:
The Normal distribution
The t-distribution
The F-distribution

For Categorical Data:
The Chi-squared (χ²) distribution

Normal Distribution
All values are symmetrically distributed around the mean.
Characteristic bell-shaped curve.
Assumed for all quality control statistics.

[Figure: frequency histogram of blood urea values (mg/dL) showing the characteristic bell shape]

The importance of numbers


[Figure: two histograms of the same Gaussian variable, one from a sample of N = 50 and one from N = 2000]

Which distribution is Gaussian? Both! With only 50 observations the histogram looks irregular; with 2,000 observations the bell shape is unmistakable.

Student's t-distribution
A continuous distribution whose shape is similar to the Normal distribution and that is characterized by its degrees of freedom. It is particularly useful for inferences about the mean.

F test
(ANOVA: Analysis of Variance)
A right-skewed continuous distribution characterized by the degrees of freedom of the numerator and denominator of the ratio that defines it. It is useful for comparing two variances, and more than two means using the analysis of variance.

Categorical Data
Categorical data are very common in medical research, arising when individuals are categorized into one of two or more mutually exclusive groups. In a sample of individuals, the number falling into a particular group is called a frequency, so the analysis of categorical data is the analysis of frequencies. When two or more groups are compared, the data are often shown in the form of a frequency table.

The Chi-Square distribution

A right-skewed continuous distribution characterized by its degrees of freedom; useful for analyzing categorical data.
Chi-squared test: Used on frequency data. It tests the null hypothesis that there is no association between the factors that define a contingency table. Also used to test differences in proportions; a sketch follows below.
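As a small illustration, the sketch below runs a chi-squared test on a hypothetical 2x2 frequency table with scipy; the counts are invented for the example:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 frequency table: treatment group vs outcome.
#                 improved  not improved
table = np.array([[30, 10],    # new drug
                  [20, 20]])   # current drug

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-squared = {chi2:.2f}, df = {dof}, p = {p:.3f}")
```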

Discrete Random Variable


A discrete random variable is one which may take on only a countable number of distinct values such as 0, 1, 2, 3, 4, ...
Discrete random variables are usually (but not necessarily) counts.

If a random variable can take only a finite number of distinct values, then it must be discrete. Examples of discrete random variables include:
The number of children in a family,
The number of patients in a doctor's surgery,
The number of defective light bulbs in a box of ten.

Discrete Sampling Distributions

The Binomial distribution
The Poisson distribution
The Hypergeometric distribution

Binomial Distribution
A binomial random variable is a discrete random variable that is defined when the conditions of a binomial experiment are satisfied. The conditions of a binomial experiment are given below:
1. The total number of trials is fixed in advance;
2. There are just two outcomes of each trial: success and failure;
3. The outcomes of all the trials are statistically independent;
4. All the trials have the same probability of success.
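A minimal sketch of the resulting binomial probabilities, using scipy with hypothetical values n = 10 trials and success probability p = 0.3:

```python
from scipy.stats import binom

# 10 independent trials, each with probability of success 0.3
# (hypothetical numbers, chosen only for illustration).
n, p = 10, 0.3

print(binom.pmf(3, n, p))   # P(exactly 3 successes)
print(binom.cdf(3, n, p))   # P(at most 3 successes)
print(binom.mean(n, p))     # expected number of successes: n * p = 3.0
```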

Two Types of Statistics


Descriptive statistics of a POPULATION. Relevant notation (Greek):
μ = mean, N = population size, Σ = sum

Inferential statistics of SAMPLES from a population. Assumptions are made that the sample reflects the population in an unbiased form. Roman notation:
x̄ = mean, n = sample size, Σ = sum

Two types of statistics: DESCRIPTIVE and INFERENTIAL

Descriptive statistics are a way of summarizing the complexity of the data with a single number.

Descriptive Statistics
Tables
Graphic representation
Summary Statistics: Central Tendency, Dispersion

Two types of statistics: DESCRIPTIVE and INFERENTIAL

Inferential statistics answer the question, "To what extent can these findings be GENERALIZED?"
Can we infer that they are probably true for the whole population, not just the sample?

Variability in Medical Research


Error is a false or mistaken result obtained in a study, experiment or data collection.
Random variation (not precise nor reliable) is the portion of variation in a measurement that results from chance variation in the entity measured.
Systematic error (not valid) occurs when the measurements consistently stray from the true value.

Common-cause variation


Common-cause variation is an inherent part of every process. It is random and is due to regular, natural, or ordinary causes. Common-cause variation affects all outcomes of a process and results from the regular rhythm of a process. It produces processes that are stable, or "in control". One can make predictions, within limits, about a process that has only common-cause variation.

Special-cause variation


Special-cause variation, on the other hand, is due to irregular or unnatural causes that are not inherent in a process. Special-cause variation affects some, but not necessarily all, outcomes of a process. When special causes are present, a process will be out of control, or unstable. A process also will be unpredictable if special causes are present.

Depicting Variation
There are basically two options for depicting variation: 1. Static displays, and 2. Dynamic displays. Historically, the dominant approach to depicting variation has been to use static displays.

Static Display
A static display of variation occurs when data are presented in:
Tabular fashion, or
Graphical fashion; e.g., bar charts, histograms, etc.

Measures of central tendency:
the mean, median, and mode; and

Measures of dispersion:
the range, standard deviation.

Dynamic Display
A dynamic display occurs when the data are plotted on a run or control chart. A dynamic display allows you to see how the data vary over time.

Basic Points regarding Variation


The basic points that both Shewhart and Deming were trying to make were:
Variation exists in all that we do;
Processes that exhibit common or chance causes of variation are predictable within statistical limits;
Special causes of variation should be identified and eliminated;
Attempting to improve processes that contain special causes will increase variation and waste resources; and
Once special causes have been eliminated, it is appropriate to consider changing the process.

Consequences of Not Understanding Variation


The first thing that happens when people do not understand variation is that they see trends where there are no trends.
The second behavior that people exhibit when they do not understand variation is to blame and give credit to others for things over which they have little or no control.
Third, failure to understand variation will build barriers, decrease morale, and create an atmosphere of fear.
Finally, if you do not understand variation, you will never be able to fully understand past performance, make predictions about the future, or make significant improvements in your processes.

Data Quality : Accuracy and Precision


Precision or Reliability is the closeness of repeated measurements to each other.
Validity or Accuracy is the closeness of measurements to the true value.

Quality Control monitors both the precision and the accuracy of the data in order to provide reliable results.

Variability in Medical Research

Major Types of Variability


A distinction is drawn between:

Random variation, which is inversely related to precision in measurement, and
Nonrandom or systematic error, which is related to a distortion in measurement, and inversely related to validity in measurement.
Precision (Reliability) vs Validity

Descriptive Statistics
Tables
Graphic representation
Measures of central tendency
Measures of dispersion

Measures of Central Tendency


Measures of central tendency will usually indicate a value near the center or highest point of a frequency distribution curve. The most useful measures of central tendency include:
Mean, Median, and Mode.

Each of these measures describes a group of values in a simple and concise manner, providing a mechanism to quickly see and evaluate the data.

Measures of Central Tendency


Mode:
The value which occurs with the greatest frequency in a given data set.

Median:
The central value of a data set arranged in order.

Mean:
The calculated average of all the values in a given data set.

What is the Mode?


The mode is the value that occurs with the greatest frequency in a set of observations. If no value is repeated within the set of observations, then there is no mode.

If two or more values are repeated at the same frequency, then each of those observations is a mode. In a normal or symmetrical distribution of data, the mean, median, and mode have the same values (or very close).

How is the Mode Determined?


The mode of a set of observations is the value that occurs with the greatest frequency. To determine the mode:
Rank the observations from the smallest to the largest;
Evaluate the ranked data set by counting the number of times each individual value occurs; and
Determine which value(s) occur with the greatest frequency, as in the sketch below.
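Those steps translate directly into code. This is a minimal sketch (the helper name `modes` is ours) that also follows the slide's rules: an unrepeated data set has no mode, and ties produce several modes:

```python
from collections import Counter

def modes(observations):
    """Return every value that occurs with the greatest frequency."""
    counts = Counter(observations)
    top = max(counts.values())
    if top == 1:
        return []  # no value is repeated, so there is no mode
    return [value for value, freq in counts.items() if freq == top]

print(modes([1, 2, 2, 3, 3, 4]))  # [2, 3] -- two modes
print(modes([1, 2, 3]))           # []     -- no mode
```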

When is the Mode Useful?


The mode is useful if you are trying to focus on the most frequent value for a certain population. Although the mode is seldom used in public health statistics, it could be used to focus attention on the modal (most common) age group of a population in the outbreak of a disease, or establish some other modal characteristic for a population experiencing a disease.

What is the Median?


The median is a measure of central tendency that is useful in representing data that is skewed. "Skewed" simply means that there are significantly more data points with values below the mean than there are above the mean (or vice-versa). With skewed data, the normally centered hump on the frequency distribution curve is offset to the left or right of center. The median is the value that divides the distribution of values into two equal parts.

How is the Median Determined?


The median is determined rather than calculated; that is, the median is based on its relationship to other data in the population rather than calculated algebraically. The median of a set of observations is the value that falls in the middle position when the observations are ranked in order from the smallest to the largest. The rules for determining the median are:
Rank the observations from the smallest to the largest;
If the number of observations is odd, the median is the middle number;
If the number of observations is even, the median is the average of the two middle numbers.
A minimal sketch of these rules follows below.
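A minimal sketch of those rules (the helper name `median` is ours):

```python
def median(observations):
    """Median following the ranking rules above."""
    ranked = sorted(observations)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:              # odd count: take the middle number
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2  # even: average the two middle

print(median([7, 1, 3]))      # 3
print(median([7, 1, 3, 5]))   # 4.0
```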

When is the Median Useful?


The median is useful when you have an abnormal (or skewed) distribution of data. A skewed distribution of data shows up clearly if you present the information in a graph.

What is the Mean?


The mean is the most common measure of central tendency. The arithmetic mean (or average) is defined as the sum of all the observed values, divided by the number of observations. The mean is a good way to describe the center of a group of data if the values have a more or less normal distribution. It may not describe a group of data well if a few values are far from the rest (the data are "skewed" or there are many "outliers").

When is the Mean Useful?


The mean is useful when you have a normal distribution of data. The mean is not very useful when you have an abnormal distribution of data.

Measures of Dispersion or Variability


There are several terms that describe the dispersion or variability of the data around the mean:
Range
Variance
Standard Deviation
Percentiles and Quartiles

When are measures of Dispersion useful?


If you are evaluating the norm for a particular characteristic, like weight or height, you need to establish the extremes (lowest and highest values) in order to assess what might be outside the norm. For example, there are standards for weight in proportion to height. Some people are very heavy for their height, whereas others are much lighter, even if they are of the same height. The extremes of this range can describe how far from the norm a person's weight is when assessed together with their height.

Range
Range is the difference or spread between the highest and lowest observations. It is the simplest measure of dispersion. It makes no assumption about the central tendency of the data.

Range
What is the Range? The range is calculated as the difference between the smallest and the largest values in a set of data. It is heavily influenced by the two most extreme values and ignores the rest of the distribution.

Range
When is the Range Useful? The range is an adequate measure of variation for a small set of data, like class scores for a test. Think of other settings where the range might be useful: salaries for a particular job category, or indoor versus outdoor temperatures?

Percentiles and Quartiles


Definition of Percentiles:
Given a set of n observations x₁, x₂, ..., xₙ, the pth percentile P is the value of X such that p% or less of the observations are less than P and (100 − p)% or less are greater than P. P10 denotes the 10th percentile, etc.

Definition of Quartiles:
First quartile is P25
Second quartile is the median, or P50
Third quartile is P75

Inter-quartile Range

A better description of the distribution than the range: it is the range of the middle 50 percent of the distribution.
Definition of the Inter-quartile Range: IQR = Q3 − Q1.
A sketch using the earlier blood urea data follows below.
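A short sketch computing the quartiles and IQR, reusing the blood urea data set (mg/dL) from the ungrouped-data example. Note that numpy's default percentile interpolation is one of several conventions, so hand calculations may differ slightly:

```python
import numpy as np

# The blood urea data set (mg/dL) from the ungrouped-data example.
data = [30.0, 32.0, 33.5, 32.0, 33.0, 29.0, 31.0, 32.5, 34.5, 33.5,
        30.5, 30.0, 34.0, 32.0, 35.0, 32.5, 31.5, 29.5, 31.5, 32.0]

q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}")
print(f"IQR = {q3 - q1}")
```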

Calculation of Variance
Variance is the measure of variability about the mean. It is calculated as the average squared deviation from the mean:

s² = Σ(xᵢ − x̄)² / (n − 1)

that is, the sum of the squared deviations from the mean, divided by the number of observations (corrected for degrees of freedom).

What is the Standard Deviation?


The standard deviation of a data set is based on how much each data value deviates from the mean; it is equal to the square root of the variance: s = √s². The greater the dispersion of values, the larger the standard deviation. Much of statistical theory is based on the standard deviation and the 'normal' distribution.
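The sketch below computes both quantities for the blood urea data set, using numpy with ddof=1 to apply the (n − 1) degrees-of-freedom correction from the variance formula above:

```python
import numpy as np

data = [30.0, 32.0, 33.5, 32.0, 33.0, 29.0, 31.0, 32.5, 34.5, 33.5,
        30.5, 30.0, 34.0, 32.0, 35.0, 32.5, 31.5, 29.5, 31.5, 32.0]

# ddof=1 applies the (n - 1) correction for degrees of freedom.
variance = np.var(data, ddof=1)
sd = np.std(data, ddof=1)

print(f"s^2 = {variance:.3f}")
print(f"s   = {sd:.3f} (equals sqrt(s^2) = {np.sqrt(variance):.3f})")
```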

When is the Standard Deviation Useful?


It is a useful measure when your data distribution is very close to a normal curve. In this situation, the mean is the best measure of central tendency, and the standard deviation is the best measure of dispersion.
In a normal distribution, if you measure 1 standard deviation to either side of the mean, you will find that 68.3% of the observations fall into this area; 95.5% of the observations fall within 2 standard deviations to either side of the mean; and 99.7% of observations fall within 3 standard deviations of the mean.
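This 68-95-99.7 pattern can be checked by simulation; a minimal sketch with a large simulated standard normal sample:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

for k in (1, 2, 3):
    within = np.mean(np.abs(x) <= k)  # fraction within k SDs of the mean
    print(f"within {k} SD: {within:.1%}")
# Expected: roughly 68.3%, 95.5%, 99.7%
```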

Summary
In practice, descriptive statistics play a major role:
They are always the first 1-2 tables/figures in a paper;
The statistician needs to know about each variable before deciding how to analyze it to answer the research questions.

In any analysis, 90% of the effort goes into setting up the data; descriptive statistics are part of that 90%.

TYPES OF DATA
Independent: each observation comes from a different subject.
Paired: two observations (e.g., before and after some intervention, or left and right eyes) in the same subject, or in closely related subjects (e.g., siblings for genetics studies).
Clustered: multiple observations on each subject.
When designing a study and conducting analyses, you need to use methods appropriate to the data type.

Most Common Procedures of Statistical Inference


1. Confidence Intervals
When we calculate a statistic, such as a mean, we have calculated a point estimate. A confidence interval gives us a range around that estimate that indicates how precisely the estimate has been measured. It also tells us how confident we can be that the population parameter lies within that range.

2. Statistical Significance Tests (Hypothesis Testing)
These allow us to test a claim about a population parameter. More specifically, we test whether the observed sample statistic differs from some specified value.

Confidence Interval
Confidence intervals contain two parts:
1. An interval within which the population parameter is estimated to fall (estimate ± margin of error);
2. A confidence level, which states the probability that the method used to calculate the interval will contain the population parameter.
If we use a 95% Confidence Interval, we have used a method that would give the correct answer 95% of the time when using random sampling. In other words, 95% of samples from the sampling distribution would give a confidence interval that contains the population parameter. This does not mean that the estimate is 95% correct! Typically confidence levels of 95% or 99% are used, but sometimes 90% is used as well. A sketch follows below.
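A minimal sketch of a 95% confidence interval for a mean, computed from the earlier blood urea data with the t-distribution (one standard approach among several):

```python
import numpy as np
from scipy import stats

data = [30.0, 32.0, 33.5, 32.0, 33.0, 29.0, 31.0, 32.5, 34.5, 33.5,
        30.5, 30.0, 34.0, 32.0, 35.0, 32.5, 31.5, 29.5, 31.5, 32.0]

n = len(data)
mean = np.mean(data)
sem = stats.sem(data)  # standard error of the mean

# 95% CI for the mean, using the t-distribution with n - 1 df.
low, high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
print(f"mean = {mean:.2f}, 95% CI: ({low:.2f}, {high:.2f})")
```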

Confidence Limits
Confidence limits are the lower and upper boundaries / values of a confidence interval, i.e., the values which define the range of a confidence interval. The upper and lower bounds of a 95% confidence interval are the 95% confidence limits.
These limits may be taken for other confidence levels, for example, 90%, 99%, 99.9%.

Confidence Level
The confidence level is the probability value associated with a confidence interval. It is often expressed as a percentage. For example, say α = 0.05; then the confidence level is equal to (1 − 0.05) = 0.95, i.e., a 95% confidence level.

Example: Suppose a football poll predicted that, if the match were held today, the X team would win 60% of the vote. The pollster might attach a 95% confidence level to the interval 60% plus or minus 3%; i.e., he thinks it very likely that the X team would get between 57% and 63% of the total vote, with 95% confidence.

Confidence Interval
A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data. The width of the confidence interval gives us some idea about how uncertain we are about the unknown parameter (see precision).
A very wide interval may indicate that more data should be collected before anything very definite can be said about the parameter.

Hypothesis Test
Setting up and testing hypotheses is an essential part of statistical inference. In order to formulate such a test, usually some theory has been put forward, either because it is believed to be true or because it is to be used as a basis for argument, but has not been proved, for example, claiming that a new drug is better than the current drug for treatment of the same symptoms. In each problem considered, the question of interest is simplified into two competing claims / hypotheses between which we have a choice; the null hypothesis, denoted H0, against the alternative hypothesis, denoted HA.
These two competing claims / hypotheses are not however treated on an equal basis: special consideration is given to the null hypothesis.

Null Hypothesis
The null hypothesis, H0, represents a theory that has been put forward, either because it is believed to be true or because it is to be used as a basis for argument, but has not been proved. For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on average, than the current drug.
We would write H0: there is no difference between the two drugs on average.

Alternative Hypothesis
The alternative hypothesis, HA, is a statement of what a statistical hypothesis test is set up to establish. For example, in a clinical trial of a new drug, the alternative hypothesis might be that the new drug has a different effect, on average, compared to that of the current drug.
We would write
HA: the two drugs have different effects, on average.

The alternative hypothesis might also be that the new drug is better, on average, than the current drug.
In this case we would write
HA: the new drug is better than the current drug, on average.

The final conclusion once the test has been carried out is always given in terms of the null hypothesis.
We either "Reject H0 in favor of HA" or "Do not reject H0". We never conclude "Reject HA", or even "Accept HA".

Type I Error
In a hypothesis test, a type I error occurs when the null hypothesis is rejected when it is in fact true; that is, H0 is wrongly rejected.
For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on average, than the current drug; i.e.
H0: there is no difference between the two drugs on average.

A type I error would occur if we concluded that the two drugs produced different effects when in fact there was no difference between them. The following table gives a summary of the possible results of any hypothesis test:

                     H0 true           H0 false
Reject H0            Type I error      Correct decision
Do not reject H0     Correct decision  Type II error

Type II Error
In a hypothesis test, a type II error occurs when the null hypothesis H0 is not rejected when it is in fact false.
For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on average, than the current drug; i.e.
H0: there is no difference between the two drugs on average.

A type II error would occur if it was concluded that the two drugs produced the same effect, i.e., there is no difference between the two drugs on average, when in fact they produced different ones. A type II error is frequently due to sample sizes being too small.

Significance Level
The significance level of a statistical hypothesis test is a fixed probability of wrongly rejecting the null hypothesis H0 when it is in fact true. It is the probability of a type I error and is set by the investigator in relation to the consequences of such an error. That is, we want to make the significance level as small as possible in order to protect the null hypothesis and to prevent, as far as possible, the investigator from inadvertently making false claims. The significance level is usually denoted by α:

Significance Level = P(type I error) = α

Usually, the significance level is chosen to be 0.05 (or equivalently, 5%).

P-Value
The probability of having observed your data, or more extreme data, by chance alone when the null hypothesis is true. Equivalently, the P value for a test may be defined as the smallest value of α for which the null hypothesis can be rejected; i.e., if the P value is less than or equal to α, we reject the null hypothesis; if the P value is greater than α, we do not reject the null hypothesis. A sketch follows below.
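As a small illustration, the sketch below runs a one-sample t-test of the hypothetical null hypothesis that the mean blood urea is 31 mg/dL (using the earlier data) and compares the P value with α = 0.05:

```python
from scipy import stats

# Hypothetical null hypothesis: mean blood urea is 31 mg/dL.
data = [30.0, 32.0, 33.5, 32.0, 33.0, 29.0, 31.0, 32.5, 34.5, 33.5,
        30.5, 30.0, 34.0, 32.0, 35.0, 32.5, 31.5, 29.5, 31.5, 32.0]

t_stat, p_value = stats.ttest_1samp(data, popmean=31.0)
alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("reject H0" if p_value <= alpha else "do not reject H0")
```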

P-Value Interpretation

Power
The power of a statistical hypothesis test measures the test's ability to reject the null hypothesis when it is actually false, i.e., to make a correct decision. In other words, the power of a hypothesis test is the probability of not committing a type II error. It is calculated by subtracting the probability of a type II error (β) from 1, usually expressed as:

Power = 1 − P(type II error) = 1 − β

The maximum power a test can have is 1, the minimum is 0. Ideally we want a test to have high power, close to 1.
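Power is easiest to compute for the textbook case of a two-sided one-sample z-test with known standard deviation; the sketch below uses that standard approximation with invented numbers (α = 0.05, σ = 10, n = 50, true shift of 4):

```python
from scipy.stats import norm

# Power of a two-sided one-sample z-test (sigma known); a standard
# textbook approximation. All numbers here are hypothetical.
alpha, sigma, n = 0.05, 10.0, 50
effect = 4.0                      # true difference from the null mean

z_crit = norm.ppf(1 - alpha / 2)  # critical value for a two-sided test
z_shift = effect / (sigma / n ** 0.5)

# Power = P(reject H0 | H0 false) = 1 - beta
power = (1 - norm.cdf(z_crit - z_shift)) + norm.cdf(-z_crit - z_shift)
print(f"power = {power:.2f}")     # about 0.81 with these numbers
```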

ESTIMATION vs TESTING
Sometimes the primary goal is to describe data; then we are interested in estimation. We estimate parameters such as:
Means
Variances
Correlations
When the primary goal is to draw a conclusion about a state of nature or the result of an experiment, we are interested in statistical testing.

Sample Size Determination

Regression Equation
A regression equation allows us to express the relationship between two (or more) variables algebraically. It indicates the nature of the relationship between two (or more) variables. In particular, it indicates the extent to which you can predict some variables by knowing others, or the extent to which some are associated with others. A linear regression equation is usually written

Y = a + bX + e

where:
Y is the dependent variable,
a is the intercept,
b is the slope or regression coefficient,
X is the independent variable (or covariate), and
e is the error term.

The equation will specify the average magnitude of the expected change in Y given a change in X. The regression equation is often represented on a scatterplot by a regression line.
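A minimal sketch fitting such an equation by least squares with scipy; the x and y values are invented purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements (X = dose, Y = response).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])

result = stats.linregress(x, y)
print(f"Y = {result.intercept:.2f} + {result.slope:.2f} * X")
print(f"r = {result.rvalue:.3f}, p = {result.pvalue:.4f}")
```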

Correlation Coefficient
A correlation coefficient is a number between -1 and 1 which measures the degree to which two variables are linearly related. If there is a perfect linear relationship with positive slope between the two variables, we have a correlation coefficient of 1; if there is positive correlation, whenever one variable has a high (low) value, so does the other. If there is a perfect linear relationship with negative slope between the two variables, we have a correlation coefficient of -1; if there is negative correlation, whenever one variable has a high (low) value, the other has a low (high) value. A correlation coefficient of 0 means that there is no linear relationship between the variables. There are a number of different correlation coefficients that might be appropriate depending on the kinds of variables being studied.
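A short sketch computing a Pearson correlation coefficient with numpy, on invented height/weight pairs:

```python
import numpy as np

height = np.array([160, 165, 170, 175, 180, 185])  # cm (hypothetical)
weight = np.array([55, 62, 66, 74, 79, 88])        # kg (hypothetical)

r = np.corrcoef(height, weight)[0, 1]  # Pearson correlation coefficient
print(f"r = {r:.3f}")  # close to +1: strong positive linear relationship
```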

Parametric Statistics
Originally, statisticians introduced parameterised distributions to make calculations easier. Statistics based on parameterised distributions, such as the Normal distribution, are termed parametric statistics.

Non-parametric statistics were subsequently introduced to deal with situations where the data do not follow any easy equations; these statistics are based on the data itself.

Parametric statistics
Can be used on parametric distributions. Parametric distributions are those which can be described by parameters. The Gaussian distribution is defined by 2 parameters:
Mean (average): an indication of the center
Standard deviation: an indication of scatter
It is a symmetrical distribution (not skewed):
68.3% of values fall within +/- 1 SD
95.4% within +/- 2 SD
99.7% within +/- 3 SD

Non-parametric statistics

No assumptions are made about the distribution.
Percentiles are determined by ranking.
The measure of centre is the median (50th percentile).
The measure of scatter is percentiles (e.g., the 2.5th and 97.5th).

Nonparametric Tests
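One widely used nonparametric test is the Mann-Whitney U test, which compares two independent groups by ranks rather than means. A minimal sketch with invented skewed data:

```python
from scipy.stats import mannwhitneyu

# Hypothetical skewed measurements from two independent groups.
group_a = [12, 15, 14, 10, 39, 11, 13]
group_b = [18, 22, 25, 17, 19, 45, 21]

u_stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.3f}")
```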

FINAL COMMENTS
Statistics are only helpful if the approach taken is appropriate to the problem at hand. Most statistical procedures are based on some assumptions about the characteristics of the data; these need to be checked. Remember GIGO: garbage in, garbage out.

All of the following statements concerning SPC methodology are true, except:

1. SPC has great potential to help hospitals improve their performance.
2. It was developed by Deming.
3. It is mainly deployed through using certain charts, graphs and diagrams.
4. Data collection and measurement are fundamental to SPC.

Which of the following is not correct about (Common Cause Variation) ?

1. It is inherent in the process.
2. The process which exhibits (Common Cause Variation) is always functioning at an acceptable level.
3. It indicates that the process is stable and predictable within certain limits.
4. In some processes, (Special Cause Variation) may be better than (Common Cause Variation).

Which of the following tools represents a "Dynamic display" of variation?

a) Histogram.
b) Bar graph.
c) Pareto diagram.
d) Control chart.

Which of the following may be considered examples of discrete variables ?

1. Height and weight
2. Community-acquired and nosocomial infection rates
3. Surgical or emergency department response time
4. Patient visits in the months of May and June

Questions?
essamwmw77@gmail.com
