Experimental Design

Experiments
Pre and Post condition
Classic experimental design

Random assignment to control and treatment conditions
Why random assignment and control groups?

Random assignment helps with internal validity.
Some threats to internal validity:

Experimenter/subject expectation

Mortality bias
Is there an attrition bias such that subjects later in the research process are no longer representative of the larger initial group?

Selection bias
Without random assignment, our treatment effects might be due to age, gender, etc. instead of the treatment.

Evaluation apprehension
Does the process of experimentation alter results that would occur naturally?

Classic experimental design, when done properly, can help guard against many threats to internal validity.

Posttest only control group design:
Experimental group:  R  X  O1
Control group:       R     O2
(R = random assignment, X = treatment, O = observation)
With random assignment, groups should be largely equivalent, such that we can assume the differences seen may be largely due to the treatment.

Special problems involving control groups:
Control awareness
Is the control group aware it is a control group and is not receiving the experimental treatment?
Compensatory equalization of treatments
The experimenter compensates for the control group's lack of the benefits of treatment by providing some other benefit to the control group.
Unintended treatments
The Hawthorne effect (as it is commonly understood, though not actually shown by the original study) might be an example.
Mixed design: prepost experiments

Back to our basic control/treatment setup.
A common use of mixed design is a pretest/posttest situation in which the between-groups factor includes a control and a treatment condition.
Including a pretest allows:
A check on randomness
Added statistical control
Examination of within-subject change
There are 2 ways to determine treatment effectiveness: the overall treatment effect, and in terms of change.
Pre-test/Post-test
1. Random assignment
2. Observation of the two groups at time 1
3. Introduction of the treatment for the experimental group
4. Observation of the two groups at time 2
5. Note the change for the two groups
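The random assignment step can be sketched in a few lines of Python. This is only an illustration: the subject IDs, the fixed seed, and the function name randomly_assign are assumptions made for the example, not part of the original design description.

```python
import random

def randomly_assign(subject_ids, seed=None):
    """Shuffle subjects and split them into treatment and control groups."""
    rng = random.Random(seed)        # a seed makes the assignment reproducible
    shuffled = list(subject_ids)     # copy so the caller's sequence is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

# Hypothetical subject IDs 1..10
groups = randomly_assign(range(1, 11), seed=42)
print(groups)
```

Every subject ends up in exactly one condition, and which condition is determined by chance alone rather than by any subject characteristic.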
Mixed design
2 x 2
Between-subjects factor of treatment
Within-subjects factor of pre/post

Example
Group       Pre   Post
treatment    20     70
treatment    10     50
treatment    60     90
treatment    20     60
treatment    10     50
control      50     20
control      10     10
control      40     30
control      20     50
control      10     10
SPSS output
Tests of Within-Subjects Effects (Measure: MEASURE_1; sphericity assumed)
Source            Type III SS   df   Mean Square   F        Sig.   Partial Eta Squared
prepost           1805.000      1    1805.000      13.885   .006   .634
prepost * treat   2205.000      1    2205.000      16.962   .003   .680
Error(prepost)    1040.000      8    130.000

Tests of Between-Subjects Effects (Measure: MEASURE_1; Transformed Variable: Average)
Source   Type III SS   df   Mean Square   F       Sig.   Partial Eta Squared
treat    1805.000      1    1805.000      3.406   .102   .299
Error    4240.000      8    530.000
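Why are we not worried about sphericity here? (With only two levels of the within-subjects factor there is a single set of difference scores, so sphericity is not at issue.)
No significant main effect for treatment (p = .102), though it is close and the effect is noticeable (partial eta squared = .299)
Main effect for prepost (often not surprising)
Significant interaction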
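The within-subjects rows of the SPSS output can be reproduced by hand from the ten pre/post scores. The following is a sketch in plain Python that assumes equal group sizes, as in this example; the variable names are ours, not SPSS's.

```python
# Pre/post scores from the example (five subjects per group)
pre  = {"treatment": [20, 10, 60, 20, 10], "control": [50, 10, 40, 20, 10]}
post = {"treatment": [70, 50, 90, 60, 50], "control": [20, 10, 30, 50, 10]}

def mean(xs):
    return sum(xs) / len(xs)

# Gain (post - pre) for every subject, by group
gain = {g: [b - a for a, b in zip(pre[g], post[g])] for g in pre}
all_gains = [d for g in gain for d in gain[g]]
grand_gain = mean(all_gains)

# With two levels of the within factor, each within-subjects SS is
# the corresponding SS of the gain scores divided by 2.
ss_prepost = len(all_gains) * grand_gain ** 2 / 2
ss_interaction = sum(len(gain[g]) * (mean(gain[g]) - grand_gain) ** 2 for g in gain) / 2
ss_error = sum((d - mean(gain[g])) ** 2 for g in gain for d in gain[g]) / 2

df_error = len(all_gains) - len(gain)          # 10 subjects - 2 groups = 8
ms_error = ss_error / df_error
f_prepost = ss_prepost / ms_error              # df = (1, 8)
f_interaction = ss_interaction / ms_error      # df = (1, 8)

print(ss_prepost, ss_interaction, ss_error)          # 1805.0 2205.0 1040.0
print(round(f_prepost, 3), round(f_interaction, 3))  # 13.885 16.962
```

Working from gain scores makes it visible that the prepost effect asks whether the average gain differs from zero, while the interaction asks whether the average gain differs between groups.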
Interaction
The interaction suggests that those in the treatment condition are benefiting from it, while those in the control condition are not improving due to the lack of the treatment.
[Figure: Estimated Marginal Means of MEASURE_1. Separate lines for treat (control, treatment) across factor1 (Pre, Post); y-axis: Estimated Marginal Means.]
Another approach: t-test

Note that if the interaction is the only thing of interest, in this situation we could have obtained those results with a simpler analysis.
Essentially, the question concerns the difference between treatment groups in the change from time 1 to time 2.
t-test on the gain (difference) scores from pre to post
T-test vs. Mixed output

Tests of Within-Subjects Effects (Measure: MEASURE_1; sphericity assumed)
Source            Type III SS   df   Mean Square   F        Sig.   Partial Eta Squared
prepost           1805.000      1    1805.000      13.885   .006   .634
prepost * treat   2205.000      1    2205.000      16.962   .003   .680
Error(prepost)    1040.000      8    130.000

t^2 = F: (-4.118)^2 is approximately 16.96, the F for the prepost * treat interaction

Independent Samples Test (gain, equal variances assumed)
Levene's Test for Equality of Variances: F = 2.246, Sig. = .172
t-test for Equality of Means:
t        df   Sig. (2-tailed)   Mean Difference   Std. Error Difference   95% CI of the Difference (Lower, Upper)
-4.118   8    .003              -42.00000         10.19804                -65.51672, -18.48328
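The gain-score t-test can be checked directly against the mixed-design output. This sketch computes a pooled-variance independent-samples t on the post minus pre differences using only the standard library; the function name pooled_t is an assumption for illustration. The sign of t depends only on the order of subtraction (the mean difference of -42 in the output corresponds to control minus treatment).

```python
from math import sqrt
from statistics import mean, variance

# Gain scores (post - pre) from the example data
gain_treatment = [70 - 20, 50 - 10, 90 - 60, 60 - 20, 50 - 10]
gain_control   = [20 - 50, 10 - 10, 30 - 40, 50 - 20, 10 - 10]

def pooled_t(x, y):
    """Independent-samples t statistic assuming equal variances."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    se = sqrt(pooled_var * (1 / nx + 1 / ny))
    return (mean(x) - mean(y)) / se, df, se

t, df, se = pooled_t(gain_control, gain_treatment)
print(round(t, 3), df, round(se, 5))   # -4.118 8 10.19804
print(round(t ** 2, 3))                # 16.962, the interaction F
```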
Another approach: ANCOVA

We could analyze this situation in yet another way.
Analysis of covariance would provide a description of differences among treatment groups at post while controlling for individual differences at pre.
Note how our research question now shifts to one in which our emphasis is on differences at time 2, rather than on describing differences in the change from time 1 to time 2.
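As a rough sketch of what the ANCOVA does with the example data, the code below estimates the pooled within-group slope of post on pre, adjusts the group difference at post for the (small) group difference at pre, and forms the F test for treatment by comparing residual sums of squares with and without the group factor. It assumes a common slope in both groups (the usual homogeneity-of-slopes assumption); the helper names are ours.

```python
pre  = {"treatment": [20, 10, 60, 20, 10], "control": [50, 10, 40, 20, 10]}
post = {"treatment": [70, 50, 90, 60, 50], "control": [20, 10, 30, 50, 10]}

def mean(xs):
    return sum(xs) / len(xs)

def sums(x, y):
    """Return (SSxx, SSyy, SPxy) of deviations from the means of x and y."""
    mx, my = mean(x), mean(y)
    ssxx = sum((a - mx) ** 2 for a in x)
    ssyy = sum((b - my) ** 2 for b in y)
    spxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return ssxx, ssyy, spxy

# Pooled within-group sums (full model: group + covariate)
within = [sums(pre[g], post[g]) for g in pre]
ssxx_w, ssyy_w, spxy_w = (sum(col) for col in zip(*within))
slope = spxy_w / ssxx_w                       # common slope of post on pre
ss_res_full = ssyy_w - spxy_w ** 2 / ssxx_w

# Reduced model: covariate only, groups ignored
all_pre = pre["treatment"] + pre["control"]
all_post = post["treatment"] + post["control"]
ssxx_t, ssyy_t, spxy_t = sums(all_pre, all_post)
ss_res_reduced = ssyy_t - spxy_t ** 2 / ssxx_t

# Treatment effect at post, adjusted to a common pre score
raw_diff = mean(post["treatment"]) - mean(post["control"])
pre_diff = mean(pre["treatment"]) - mean(pre["control"])
adjusted_diff = raw_diff - slope * pre_diff

df_error = len(all_pre) - 2 - 1               # N - groups - one covariate
f_treatment = (ss_res_reduced - ss_res_full) / (ss_res_full / df_error)

print(round(slope, 3), round(adjusted_diff, 2), round(f_treatment, 2))
```

Because random assignment left the two groups nearly equal at pre (means of 24 and 26), the adjusted difference at post stays close to the raw difference of 40.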
Special problems of before-after studies
Instrumentation change
Variables are not measured in the same way in the before and after studies. A common way for this to occur is when the observers/raters, through experience, become more adept at measurement.
History (intervening events)
Events not part of the study intervene between the before and after studies and have an effect.
Maturation
Invalid inferences may be made when the maturation of the subjects between the before and after studies has an effect (e.g., the effect of experience), but maturation has not been included as an explicit variable in the study.
Regression toward the mean
If subjects are chosen because they are above or below the mean, one would expect they will be closer to the mean on re-measurement, regardless of the intervention. For instance, if subjects are sorted by skill and then administered a skill test, the high and low skill groups will probably be closer to the mean than expected.
Test experience
The before study impacts the after study in its own right, or multiple
Pre-test sensitization
So what if exposure to the pretest itself influences posttest results, in terms of how well the treatment has its effect?
Example: attitudes about human rights violations after exposure to a documentary on the plight of Tibet
Pretest (questions about attitudes toward human rights violations) -> initial awareness state -> more empathic response to the film -> scores on posttest that might reflect a greater treatment effect
Solomon 4-group design

A different design can allow us to look at the effects of a pretest.

Including a pretest can sensitize participants and create a threat to construct validity.
Combining the two basic designs creates the Solomon 4-group design, which can determine if pretest sensitization is a problem:
Group 1:  R  O1  X  O2
Group 2:  R  O3     O4
Group 3:  R      X  O5
Group 4:  R         O6
If groups 1 and 3 differ at posttest (O2 vs. O5), pretest sensitization is an issue (Pre x Treatment interaction).
If groups 2 and 4 differ at posttest (O4 vs. O6), there is a testing effect in general.

Why not used so much?
Requires more groups
However, it has been shown that this does not necessarily mean more subjects.
Even if overall N is maintained with a switch to the Solomon 4-group design, it may have more power than a posttest-only situation.
Not too many are interested in pretest sensitization
Regardless, one should control for it when possible, just like we'd control for other unwanted effects.
Complexity of design and interpretation
Although understandable, as usual this is not a good reason for not doing a particular type of analysis.

Lack of understanding of how to analyze
How do we analyze it?

We could analyze the data in different ways.
For example: one-way ANOVA on the four posttest results
Treat all four groups as part of a 4-level factor
Contrast treatment groups vs. non-treatment groups
This would not, however, allow us to get a sense of change/gain.
Alternative approach (Braver & Braver)

2 x 2 factorial design with control/treat and pretest/no pretest as two between-subjects factors

Test A: Is there an interaction?
A significant interaction would suggest a pretest effect:
the effect of treatment changes depending on whether there is pretest exposure or not.
Simple effects
Tests B & C: simple effects
B: Treatment vs. control with pretest present
C: Treatment vs. control with pretest absent

In other words, do we find that the treatment works, but only if there is a pretest?

O2 > O4, O5 = O6
If so, terminate the analysis.

The treatment effects are due to the pretest.
Simple effects
However, could there be a treatment effect in spite of the pretest effect?

In other words, could the pretest merely provide an enhancement of the treatment?
Ex.: a Kaplan/Princeton Review class helps in addition to the effect of having taken the GRE before.
If the other simple effect, test C, is also significant (still assuming a significant interaction), we could conclude that was the case.
Non-significant interaction
If there is no interaction to begin with, check the main effect of treatment (test D).

If significant, then there is a treatment effect without a pretest effect.
However, this is not the most powerful course of action, and a non-significant result may not indicate the absence of a treatment effect, because we would be disregarding the pre data (less power).
Non-significant interaction: alternatives to testing treatment main effect

Better would be to use:
Analysis of covariance that takes into account differences among individuals at pretest (Test E)
t-test on gain/difference scores (Test F)
Or mixed design (Test G)
Between-groups factor of Treatment
Within-groups factor of Pre-Post
As mentioned, F and the interaction in G are identical to one another.
However, test E will more likely have additional power.
ANCOVA
We can interpret the ANCOVA as allowing for a test of the treatment after posttest scores have been adjusted for the pretest scores.
Basically boils down to:
What difference at post would we see if the participants had scored the same at pre?

We are partialling out the effects of pre to determine the effect of the treatment on posttest scores.
In SPSS
The ANCOVA (or other tests) will only concern groups one and two, as they are the only ones with pretests to serve as a covariate or to produce difference scores for the mixed design/t-test approach.
If the ANCOVA results (or test F or G) show the treatment to still have an effect, we can conclude that the treatment has some utility beyond whatever effects the pretest has on the posttest.
If that test is not significant, however, we may perform yet another test.
Test H
t-test comparing groups 3 and 4 (O5 vs. O6)
Less power compared to the others (only half the data and no pre info), but if it is significant despite the lack of power, we can assume some treatment effect.
Meta-analysis
Even if this test is not significant, Braver & Braver (1988) suggest a meta-analytic technique that combines the results of the previous two tests (test E, F or G, and test H).
Note how each is done with only a portion of the data.
More power from a consideration of all the data
Take the observed p-value from each test, convert it to a one-tailed z-score, add the two z-scores, and divide by the square root of 2 (i.e. the square root of the number of z-scores involved) to give z_meta.
If that shows significance, then we can conclude a treatment effect.
Nowadays one might want to use an effect size r or d for the meta-analysis (see Hunter and Schmidt), as there are obvious issues in using p-values.
One might also just examine Cohen's d for each (without analysis) and draw a conclusion from that.
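The combination step can be sketched with Stouffer's method for pooling z-scores, in which the summed z-scores are divided by the square root of the number of tests. The version below uses only the standard library; the function name stouffer_z and the p-values in the usage line are hypothetical, chosen purely for illustration.

```python
from math import sqrt
from statistics import NormalDist

def stouffer_z(one_tailed_ps):
    """Combine one-tailed p-values: z_meta = sum(z) / sqrt(k)."""
    std_normal = NormalDist()
    zs = [std_normal.inv_cdf(1 - p) for p in one_tailed_ps]
    z_meta = sum(zs) / sqrt(len(zs))
    p_meta = 1 - std_normal.cdf(z_meta)   # one-tailed p for the combined test
    return z_meta, p_meta

# Hypothetical one-tailed p-values from test E/F/G and from test H
z_meta, p_meta = stouffer_z([0.08, 0.12])
print(round(z_meta, 3), round(p_meta, 4))
```

Two results that each fall short of significance can combine into a significant z_meta, which is where the extra power of using all four groups comes from.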
Problems with the meta-analytic technique for Solomon 4 group design

Note that the meta-analytic approach may not always be the more powerful test, depending on the data situation.

Sawilowsky and Markman (1990) show a case where the other tests are significant but the meta-analysis is not.

Also, by only doing the meta-analysis in the face of non-significance, we are forcing an inclusion criterion for the meta-analysis (selection bias).
Problems
Braver and Braver acknowledge that the meta-analytic technique should be conducted regardless of the outcomes of the previous tests.
If tests A & D are non-significant, do all steps on the right side.
However, they note that the example Sawilowsky used had a slightly negative correlation between pre and post for one setup and an almost negligible positive correlation in the other, and only one mean was significantly different from the others.
Probably not a likely scenario
Since their discussion, the Braver and Braver approach has been shown to be useful in the applied setting, but there still may be concerns regarding the type I error rate.
Gist: be cautious in interpretation, but feel free to use it if you suspect pretest effects.
MC's summary/take
1. Do all the tests on the right side if tests A and D are non-significant.
If there is a treatment effect but not a pretest effect, the meta-analysis is more powerful for moderate and large sample sizes.
With small sample sizes, the classical ANCOVA is slightly more powerful.
As the ANCOVA makes use of pretest scores, it is noticeably more powerful than the meta-analysis, whereas the t test is only slightly more powerful than the meta-analysis.
When a pretest either augments or diminishes the effectiveness of the treatment, the ANCOVA or t test is typically more powerful than the meta-analysis.
2. Perhaps apply an FDR correction to the analyses conducted on the right side to control the type I error rate.
3. Focus on effect size to aid your
More things to think about in experimental design

The relationship of reliability and power
Treatment effect not the same for everyone

Some benefit more than others
Sounds like no big deal (or even obvious), but all of the designs discussed assume an equal effect of treatment for individuals.
Reliability
What is reliability?
Often thought of as consistency, but this is more of a by-product of reliability
Not to mention that you could have perfectly consistent scores lacking variability (i.e. constants), for which one could not obtain measures of reliability
Reliability really refers to a measure's ability to capture an individual's true score, to distinguish accurately one person from another on some measure.
It is the squared correlation of scores on some measure with their true scores regarding that construct.
Classical True Score Theory

Each subject's score is true score + error of measurement
Obsvar = Truevar + Errorvar

Reliability = Truevar / Obsvar = 1 - Errorvar / Obsvar
Reliability and power

Reliability = Truevar / Obsvar = 1 - Errorvar / Obsvar
If observed variance goes up, power will decrease.
However, if observed variance goes up, we don't automatically know what happens to reliability.
Obsvar = Truevar + Errorvar

If it is error variance that is causing the increase in observed variance, reliability will decrease.

Reliability goes down, power goes down
If it is true variance that is causing the increase in observed variance, reliability will increase.
Reliability goes up, power goes down
The point is that the psychometric properties of the variables play an important, and not altogether obvious, role in how we will interpret results, and not having a reliable measure is a recipe for disaster.
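A small numeric illustration of the two cases, using the definition Reliability = Truevar / Obsvar. The variance values are made up for the example.

```python
def reliability(true_var, error_var):
    """Reliability = Truevar / Obsvar, with Obsvar = Truevar + Errorvar."""
    return true_var / (true_var + error_var)

baseline   = reliability(true_var=80, error_var=20)    # Obsvar = 100
more_error = reliability(true_var=80, error_var=70)    # Obsvar = 150, extra error variance
more_true  = reliability(true_var=130, error_var=20)   # Obsvar = 150, extra true variance

print(baseline, round(more_error, 3), round(more_true, 3))  # 0.8 0.533 0.867
```

Observed variance rises from 100 to 150 in both scenarios, so power suffers either way, yet reliability moves in opposite directions depending on the source of the increase.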
Error in Anova
Typical breakdown in a between-groups design
SStot = SSb/t + SSe
Variation due to treatment and random variation (error)
The F statistic is a ratio of these variances
F = MSb/MSe
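To make the breakdown concrete, the sketch below applies it to the post scores from the earlier pre/post example, treating them as a simple between-groups design purely for illustration. The variable names are ours.

```python
# Post scores from the earlier pre/post example
groups = {"treatment": [70, 50, 90, 60, 50], "control": [20, 10, 30, 50, 10]}

def mean(xs):
    return sum(xs) / len(xs)

scores = [x for g in groups.values() for x in g]
grand = mean(scores)

ss_total = sum((x - grand) ** 2 for x in scores)
ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
ss_error = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)

ms_between = ss_between / (len(groups) - 1)
ms_error = ss_error / (len(scores) - len(groups))
f_stat = ms_between / ms_error

print(ss_total, ss_between, ss_error)   # 6240.0 4000.0 2240.0
print(round(f_stat, 3))                 # 14.286
```

The two components add up to the total (4000 + 2240 = 6240), which is the partition SStot = SSb/t + SSe.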
Error in Anova
Classical True Score Theory
Each subject's score = true score + error of measurement
MSe can thus be further partitioned
Variation due to true differences on scores between subjects and error of measurement (unreliability)
MSe = MSer + MSes
MSer regards measurement error
MSes regards systematic differences between individuals
MSes has two sources
Individual differences
Treatment differences
Subject by treatment interaction
Error in Anova
The reliability of the measure will determine the extent to which the two sources of variability (MSer or MSes) contribute to the overall MSe.
If Reliability = 1.00, MSer = 0
The error term is a reflection only of systematic individual differences.
If Reliability = 0.00, MSes = 0
The error term is a reflection of measurement error only.
MSer = (1 - Rel)MSe
MSes = (Rel)MSe
We can test to see if systematic variation is significantly larger than variation due to error of measurement:
F = MSes / MSer = (Rel)(MSe) / ((1 - Rel)(MSe)) = Rel / (1 - Rel)
df = (n - 1, n - 1)
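The partition and the resulting F ratio are simple enough to compute directly. In this sketch the function name partition_error and the input values (MSe = 130, reliability = .80) are hypothetical, used only to show the arithmetic.

```python
def partition_error(ms_error, rel):
    """Split MSe into measurement error (MSer) and systematic (MSes) parts."""
    if not 0 <= rel < 1:
        raise ValueError("rel must be in [0, 1) for the F ratio to be defined")
    ms_er = (1 - rel) * ms_error
    ms_es = rel * ms_error
    f_ratio = ms_es / ms_er            # equals rel / (1 - rel); MSe cancels
    return ms_er, ms_es, f_ratio

# Hypothetical inputs: MSe = 130 and a reliability of .80
ms_er, ms_es, f_ratio = partition_error(130.0, 0.80)
print(round(ms_er, 2), round(ms_es, 2), round(f_ratio, 2))   # 26.0 104.0 4.0
```

Because MSe cancels, the F ratio depends on the reliability alone; the resulting F would then be referred to an F distribution with (n - 1, n - 1) degrees of freedom.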
With a reliable measure, the bulk of MSe will be attributable to systematic individual differences.
However, with strong main effects/interactions, we might see a significant F for this test even though the contribution to the model is not very much.
Calculate an effect size (eta-squared):
SSes / SStotal
Lyons and Howard suggest (based on Cohen's rules of thumb) that < .33 would suggest that further investigation may not be necessary.
How much of the variability seen in our data is due to systematic variation outside of the main effects?

Subjects responding differently to the treatment
Summary
Gist: discerning the true nature of treatment effects, e.g. for clinical outcomes, is not easy, and is not accomplished just because one has done an experiment and seen a statistically significant effect.
Small though significant effects with not-so-reliable measures would not be reason to go with any particular treatment, as most of the variance is due to poor measures and subjects that do not respond similarly to that treatment.
One reason to perhaps suspect individual differences due to the treatment would be heterogeneity of variance.
For example, lots of variability in the treatment group, not so much in the control
Even with larger effects and reliable measures, a noticeable amount of the unaccounted-for variance may be due to subjects responding differently to the treatment.
Methods for dealing with the problem are outlined in Bryk and Raudenbush (hierarchical linear modeling), but one strategy may be to single out suspected
Resources
Zimmerman & Williams (1986)
Bryk & Raudenbush (1988)

Lyons & Howard (1991)
