

Publisher: Routledge

Oxford Review of Education


Publication details, including instructions for authors and
subscription information:
http://www.tandfonline.com/loi/core20

Single group, pre- and post-test research designs: Some methodological concerns

Emma Marsden (a) and Carole J. Torgerson (b)
(a) University of York, UK; (b) University of Durham, UK

Version of record first published: 02 Oct 2012.

To cite this article: Emma Marsden & Carole J. Torgerson (2012): Single group, pre- and post-test
research designs: Some methodological concerns, Oxford Review of Education, 38:5, 583-616

To link to this article: http://dx.doi.org/10.1080/03054985.2012.731208

Oxford Review of Education
Vol. 38, No. 5, October 2012, pp. 583–616

Single group, pre- and post-test research designs: Some methodological concerns

Emma Marsden (a)* and Carole J. Torgerson (b)
(a) University of York, UK; (b) University of Durham, UK

This article provides two illustrations of some of the factors that can influence findings from
pre- and post-test research designs in evaluation studies, including regression to the mean
(RTM), maturation, history and test effects. The first illustration involves a re-analysis of data
from a study by Marsden (2004), in which pre-test scores are plotted against gain scores to
demonstrate RTM effects. The second illustration is a methodological review of single group,
pre- and post-test research designs (pre-experiments) that evaluate causal relationships between
intervention and outcome. Re-analysis of Marsden’s prior data shows that learners with higher
baseline scores consistently made smaller gains than those with lower baseline scores, demon-
strating that RTM is clearly observable in single group, pre-post test designs. Our review found
that 13% of the sample of 490 articles were evaluation studies. Of these evaluation studies,
about half used an experimental design. However, a quarter used a single group, pre-post test
design, and researchers using these designs did not mention possible RTM effects in their expla-
nations, although other explanatory factors were mentioned. We conclude by describing how
using experimental or quasi-experimental designs would have enabled researchers to explain
their findings more accurately, and to draw more useful implications for pedagogy.

Keywords: education research; pre-experiment; regression to the mean; control group; research design;
research methods

Introduction

Pre-experimental designs in evaluation studies


*Corresponding author: Department of Education, University of York, Heslington, York YO10 5DD, UK. Email: emma.marsden@york.ac.uk

ISSN 0305-4985 (print)/ISSN 1465-3915 (online)/12/050583–34
© 2012 Taylor & Francis
http://dx.doi.org/10.1080/03054985.2012.731208

Evaluations of educational policy and practice interventions that rely on the single
group ‘pre-experimental’ research design (also known as the ‘before and after’ or
‘pre- and post-test’ design) may be threatened by a number of biases. Typically, in
this design, participants are selected (sometimes on the basis of performance below
a pre-specified threshold in a test), pre-tested, exposed to an educational interven-
tion and then post-tested. Observed improvements in the outcome measures may
be ascribed to the intervention in a causal relationship. However, any evaluative
approach that uses this design provides weak information about the counterfactual
inference (Cook & Campbell, 1979; Shadish et al., 2002; Torgerson & Torgerson,
2008) and may be subject to a number of confounding variables, such as history,
maturation, test effects and the statistical phenomenon known as the regression to
the mean (RTM) effect (Campbell & Stanley, 1963; Cook & Campbell, 1979;
Shadish et al., 2002). Shadish et al.’s nine threats to internal validity include the
four confounding variables with which we are concerned here: history; maturation;
regression and testing (Shadish et al., 2002, p. 55).


In this paper, we argue why it could be inappropriate to ascribe improvements
in outcome measures to an intervention being evaluated due to ignoring other pos-
sible explanations, such as maturation or the RTM phenomenon. We first discuss
how a range of factors can affect the causal validity of pre-experimental designs in
evaluation research. Second, we demonstrate that one of these, RTM, is a clearly
observable phenomenon in pre-experimental designs, by analysing pre-post test
gains as a function of baseline achievement across a battery of outcome measures.
Finally, we present a methodological review of studies published in educational
research journals in 2009. We illustrate the issue of deriving causal inference from
an observed change in outcome without consideration of alternative explanations
for the change observed. The selected papers describe empirical studies that evalu-
ated education interventions, adopting a pre-experimental design with no control
or comparison group. We argue that many of these papers do not directly address
the possibility that regression, maturation, test effects or other possible confounders
could account for their findings, and we discuss the implications for interpreting
their findings.

Threats to the causal validity of the single group pre- and post-test design
A number of threats to the single group design weaken a causal interpretation.
Some of these, such as attrition or un-blinded assessment, are common to experi-
mental or multiple group designs and we will not discuss them further (Cook &
Campbell, 1979). Others, however, such as maturation, history, test effects and
regression effects cannot be controlled for using a single group pre- and post-test
design, and we discuss these below.

History
The pre-experimental design cannot control for the contemporaneous effects of
‘normal’ educational experience or innovations in practice and policy that may
account for some or all of the observed changes. A design using a control or
comparison group is usually necessary to account for the possible effects of these
on post-test scores (Cook & Campbell, 1979; Shadish et al., 2002; Torgerson &
Torgerson, 2008).

Maturation
Learners tend to improve in their educational outcomes over time simply due to
increasing maturity. In the absence of a control group we cannot control for matu-
ration effects because these will tend to affect post-test scores regardless of any
new intervention being evaluated. The greater the time difference between pre-
and post-test, the greater the potential effects due to maturation (Cook & Camp-
bell, 1979; Shadish et al., 2002).

Test effects
Evaluations usually involve some form of measurement, before and after the inter-
vention. It is possible that improvements can result from the test itself, attributable
to factors such as participants remembering questions or the questions raising
awareness and triggering learning after the pre-test, independent of the subsequent
intervention. Ideally, two or more equivalent versions of the ‘same’ test should be
used, counter-balanced amongst participants at pre- and post-test. However, to
fully ascertain whether any learning occurred as a result of simply having done the
test, it is necessary to test participants who did not receive the pre-test. Thus,
there can be four groups of participants: pre- and post-test no intervention; pre-
and post-test with intervention; post-test only no intervention; post-test only with
intervention. This is the Solomon four group design (Shadish et al., 2002).
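The logic of the Solomon four group design can be sketched numerically. The following simulation is purely illustrative (every effect size, score, and name here is invented, not taken from the article): comparing the pre-tested cells against the non-pre-tested cells isolates the test effect, while comparing the intervention cells against the no-intervention cells isolates the intervention effect.

```python
import random

random.seed(1)

# Hypothetical effect sizes (invented for illustration only).
TEST_EFFECT = 2.0          # gain caused merely by sitting the pre-test
INTERVENTION_EFFECT = 5.0  # gain caused by the intervention itself
N = 5000                   # participants per cell

def post_test_scores(pretested, intervention):
    """Simulate post-test scores for one cell of the Solomon design."""
    return [50.0
            + (TEST_EFFECT if pretested else 0.0)
            + (INTERVENTION_EFFECT if intervention else 0.0)
            + random.gauss(0, 10)
            for _ in range(N)]

def mean(xs):
    return sum(xs) / len(xs)

# The four cells: pre-tested or not, crossed with intervention or not.
cells = {
    ("pre", "int"): post_test_scores(True, True),
    ("pre", "no_int"): post_test_scores(True, False),
    ("no_pre", "int"): post_test_scores(False, True),
    ("no_pre", "no_int"): post_test_scores(False, False),
}

# Averaging across the other factor separates the two effects.
est_test = (mean(cells[("pre", "int")]) + mean(cells[("pre", "no_int")])) / 2 \
         - (mean(cells[("no_pre", "int")]) + mean(cells[("no_pre", "no_int")])) / 2
est_int = (mean(cells[("pre", "int")]) + mean(cells[("no_pre", "int")])) / 2 \
        - (mean(cells[("pre", "no_int")]) + mean(cells[("no_pre", "no_int")])) / 2

print(round(est_test, 1), round(est_int, 1))  # both close to the true values
```

With only a single pre-tested, single-group design, these two quantities are confounded; the four-cell layout is what makes them separately estimable.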

The RTM phenomenon and an illustration of its effects


RTM is a statistical phenomenon that affects all pre-experimental designs that
include, or analyse data from, participants selected on the basis of an extreme,
usually low but sometimes high, pre-test score (Cook & Campbell, 1979).
Galton (1886) ‘discovered’ the statistical phenomenon of RTM in his work on
the heritability of height, describing it as ‘regression to mediocrity’. Thorndike
(1942) reminded us how RTM, or ‘the regression fallacy’, can affect educational
research. The phenomenon can affect measurement known to contain an error
component (as well as a ‘true’ measurement), for example a reading test. Many
test results have a normal distribution with most values clustered around the mean
and a smaller number of markedly lower or higher results. In almost all educa-
tional tests a proportion of the results will have a random error component (the
error term). Scores towards the ends of a distribution will, on average, be more
likely to have a higher error term than those nearer the mean. When students are
re-tested the results towards the ends of the distribution will tend to move closer
to the mean (to their ‘true’ value) than the results in the middle range. A minority
586 E. Marsden and C.J. Torgerson

will not ‘regress to the mean’ but the majority will, moving the mean of the sub-
groups towards the whole sample mean on post-test. The regression effect is,
therefore, most evident for students with the lowest and highest pre-test scores.
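This mechanism is easy to reproduce in a simple simulation (the numbers below are invented for illustration; this is not the authors' data). Each observed score is a stable 'true' score plus random error, nothing is learned between the two sittings, and yet the lowest pre-test quartile appears to 'gain' while the highest appears to 'lose':

```python
import random

random.seed(2)

N = 4000
true_scores = [random.gauss(50, 8) for _ in range(N)]

def observe(true_score):
    # Observed score = true score + random error component.
    return true_score + random.gauss(0, 6)

pre = [observe(t) for t in true_scores]
post = [observe(t) for t in true_scores]  # no intervention, no learning

# Select the extremes on the basis of the pre-test alone.
ranked = sorted(range(N), key=lambda i: pre[i])
low_q, high_q = ranked[:N // 4], ranked[-N // 4:]

def mean(xs):
    return sum(xs) / len(xs)

gain_low = mean([post[i] - pre[i] for i in low_q])    # positive: regresses up
gain_high = mean([post[i] - pre[i] for i in high_q])  # negative: regresses down

print(round(gain_low, 1), round(gain_high, 1))
```

The low quartile's apparent improvement comes entirely from the error term, which is why selecting participants below a threshold score makes a single-group design especially vulnerable.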
In order to illustrate the RTM phenomenon we re-analysed data from a study
undertaken by the first author (Marsden, 2004, 2006). To ascertain whether
RTM affected the data, participants’ change scores (i.e., post-test minus pre-test
scores) were plotted against pre-test scores. If RTM were present a negative corre-
lation would be observed, because participants with high pre-test scores would, on
average, tend to have smaller gains than participants with low pre-test scores. As
expected, Figure 1 shows a strong negative correlation (−0.65, p < 0.001) between
the pre-test and change scores.1
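The diagnostic used here — plotting change scores against pre-test scores and checking for a negative correlation — can be sketched as follows. The data below are simulated, not the Marsden (2004) data, and the helper function name is our own; note that RTM alone produces the negative correlation even when every learner's 'true' gain is identical.

```python
import math
import random

random.seed(3)

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Simulated cohort: every learner improves by the same 5 'true' points,
# but both test sittings carry measurement error.
true_scores = [random.gauss(40, 7) for _ in range(2000)]
pre = [t + random.gauss(0, 5) for t in true_scores]
post = [t + 5 + random.gauss(0, 5) for t in true_scores]
change = [b - a for a, b in zip(pre, post)]

r = pearson_r(pre, change)
print(round(r, 2))  # clearly negative, despite uniform true gains
```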
The lower and upper quartiles of the pre-test scores were extracted for each of
the four measures (listening, speaking, reading and writing). This created two sub-
groups, ‘lower’ and ‘upper’, coming from the same contextual group with the same
intervention. The pre- to post-test gains made by these lower and upper groups
were compared. This simulated a pre-experiment that compared the effectiveness
of the intervention for those with the lowest and highest scores at the outset.
(Once the inter-quartile range of the test scores was eliminated, the remaining
samples were very small, and it is emphasised that our aim was solely to illustrate
the existence of RTM effects.) Of the 16 comparisons (i.e., pre-to-post and
pre-to-delayed post tests, in two groups, in four outcome measures), six showed
statistically significantly larger gains by the lower groups than the upper groups. A

30.00

20.00

10.00
Pre-post

0.00

-10.00

-20.00

20.0030.00 40.00 50.00 60.00


Pretest score

Figure 1. Pre-test scores on a test plotted against gains between pre- and post- tests (data from
Marsden, 2004)
Pre- and post-test research designs 587

further four comparisons suggested a borderline statistically significant difference


in the same direction. For the remaining six gain scores, the lower groups’ gains
were larger than the upper groups’ gains in all but one case, though the differences
were not statistically significant. Critically, the lower group never improved less
than the upper group.2

Design issues
We can address the issues illustrated above by introducing a control or comparison
group formed by random allocation. Then, maturation, history, test effects and
RTM effects will affect both groups similarly and ‘cancel out’ when comparing
groups. Consequently, group differences in changes from pre- to post-tests can be
appropriately ascribed to the intervention.
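This 'cancelling out' is the logic of a difference-in-differences comparison, which can be sketched as follows (all effect sizes invented for illustration). Both arms receive the same combined history, maturation, test, and RTM component, so subtracting the control arm's gain recovers the intervention effect that the single-group estimate overstates:

```python
import random

random.seed(4)

N = 3000
SHARED_TEMPORAL_EFFECT = 6.0  # combined history/maturation/test/RTM gain (invented)
TRUE_INTERVENTION = 3.0       # the effect we want to recover (invented)

def gains(treated):
    # Every participant gains the shared temporal/RTM component;
    # only the intervention arm gains the intervention effect on top.
    return [SHARED_TEMPORAL_EFFECT
            + (TRUE_INTERVENTION if treated else 0.0)
            + random.gauss(0, 8)
            for _ in range(N)]

def mean(xs):
    return sum(xs) / len(xs)

gain_int, gain_ctrl = gains(True), gains(False)

naive_estimate = mean(gain_int)                  # single-group pre-post: biased upward
did_estimate = mean(gain_int) - mean(gain_ctrl)  # control subtracts shared effects

print(round(naive_estimate, 1), round(did_estimate, 1))
```

The single-group estimate attributes the whole 9-point gain to the intervention; the controlled comparison isolates the 3-point effect.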


Although including a randomised control group will deal with temporal and RTM
effects, using a selected control group, not formed by random allocation, may not.
There are several ways a selected control group can introduce bias, the most obvious
of which is selection bias. The members of a control group may be systematically dif-
ferent in some variable, often unobserved, which influences outcome; consequently,
any difference observed between the groups at outcome may be due to selection
rather than to treatment. Secondly, even well matched control groups may introduce
bias through difference in history. For example, groups may have been exposed to
different interventions that may accelerate maturation. Also, RTM can differentially
affect different contextual groups even though they may appear to be matched at pre-
test. This is because it is an individual’s position in relation to their own contextual
group’s mean that determines whether their score is likely to regress up or down to
their group’s mean, and by how much. Most of the extreme values will, regardless of
the presence or effectiveness of an intervention, regress towards the mean, though a
minority do not. Identifying those values that will regress on re-testing, given no
intervention, and those that will not, is difficult, if not impossible.
The effects of history, maturation, test effects and RTM often do not operate
alone. Usually, differences in pre- and post-test gains are a combination of all four
factors. A methodological review of studies of psychological, educational and
behavioural treatments (Lipsey & Wilson, 1993) showed that the ‘pre- post-test’
design consistently overestimates effectiveness by an average of 61% compared with
studies with a control group. This greater improvement seen in before and after
studies compared with quasi-experiments (i.e., experiments with a control group) is
entirely predictable given what we know about history, maturation, test effects and
RTM effects. Indeed, gains in control or comparison groups can be observed,
demonstrating that not all of the gains in the experimental group are attributable to
the intervention itself. For example, Norris and Ortega’s (2000) meta-analysis of
78 second language education studies found that the average effect size of true
control and comparison treatment groups was d = 0.30 (SD = 0.39), that 15
groups with no experimental intervention made small but important gains, and that
change over time in control groups was a consistent phenomenon.

Methodological review of empirical studies


We undertook a small-scale methodological review of a systematically assembled
dataset of published pre-experimental studies.

Review methods
We searched 13 educational research journals: British Educational Research Jour-
nal, Cambridge Journal of Education, Educational Studies, International Journal of
Science Education, Journal of Educational Research, Journal of Research in Reading,
Journal of Teacher Education, Language Learning and Technology, Oxford Review of
Education, Reading and Writing: An Interdisciplinary Journal, Research in the Teach-
ing of English, Reading Research Quarterly and Science Education, for 2009, using
the database Educational Resources Information Centre. The year 2009 provided the
most recent full cycle of journal issues before the start of the review process. This
is not a representative sample of education journals or of educational research
itself, and we note that it is likely that the number of articles that fit our criteria
from any one journal is potentially positively correlated with the number of arti-
cles published by that journal in that year, and/or with the amount of detail pro-
vided in the report.
In order to be selected for the review, papers had to: be unique empirical stud-
ies; compare a construct (e.g., attitude, knowledge, behaviour) before and after an
intervention; have at least one quantified measure; employ a study design that did
not include a control or comparison group or any other mechanism that could
have potentially addressed the known biases of using a single group design. Inde-
pendent data extraction of the studies (with double data extraction of over 80% of
the studies) retrieved information about: the topic; the nature of the intervention;
the outcome measures; and the results. We also recorded whether: the author/s
derived a causal inference between intervention and outcome; the author/s men-
tioned RTM as a possible explanation for the results; the author/s mentioned other
potential explanatory factors for the results.

Results
In total, 490 articles were published in 2009 in the 13 journals. We found that 64
(13%) evaluated innovative interventions and used experimental, quasi-experimen-
tal or pre-experimental designs (with quantitative and/or qualitative outcome mea-
surements).3 Of these 64, 19 were included at the first screening stage. At the
second screening stage we excluded three studies (Graebner et al., 2009; MacAr-
thur & Lembo, 2009; Tsaparlis & Papaphotis, 2009) because they did not fit our
criteria. This left 16 (25%) evaluation studies that met our criteria (i.e., pre-post
designs without a control or comparison group). (Note, 48 studies evaluated inter-
ventions using designs with control or comparison groups.)
Detailed information about each pre-experiment is presented in the Appendix.

The nature of the arguments about causal relationships in the studies


All 16 studies argued that there was a causal relationship between the intervention
and observed changes on outcome measurements. Fifteen of the 16 studies docu-
mented improvement between pre- and post-tests. No author mentioned the possi-
bility that RTM effects could have partly or wholly explained the results observed.
Six authors did not mention any factors, other than the intervention, that could
potentially partly explain the observed changes over time. Ten of the authors
acknowledged, indirectly or directly, that the experimental intervention may not
have been the (only) cause of the observed gains (see Table 1 and Appendix).
‘Maturation’ or ‘time’ was cited as a possible cause for observed learning gains
in two studies (Sherrod & Wilhelm, 2009; Wilhelm, 2009). The ‘test effect’ itself
was not mentioned, though Sherin & van Es (2009) described this issue indirectly
as they speculated that the outcome measurements themselves (instructional


behaviour in the classroom) may have encouraged change over time (p. 33);
O’Byrne (2009) intentionally used part of the pre-test as a learning tool in the
subsequent intervention; and Guisasola et al. (2009) acknowledged that the tests
were part of the intervention and may have caused some of the improvements.
If the studies reviewed here had used a control or comparison group, the results
may have supported the authors’ claims about a causal relationship between the
intervention and the results. The absence of a control/comparison group does not
eliminate the possibility of a causal relationship, but the use of a control/compari-
son group is necessary to warrant claims about the existence and strength of such
a relationship.

Table 1. Consideration of potential factors explaining the observed changes, other than, or in addition to, the experimental intervention

No mention of any potential explanatory factor (except the intervention): Annetta et al.; Ducate & Lomicka; Newton & Newton; Sherrod & Wilhelm; Spalding et al.; Taylor & Jones.

Acknowledgement that other (unspecified) factors may be involved: Grace; Park et al.

Specific alternative explanatory factors mentioned: Brady et al. (characteristics of intervention and other extraneous variables pertaining to quality of intervention: support, time, resources; the measure was not standardised); Evagorou et al. (self-selection bias); Guisasola et al. (tests influence learning); Jones et al. (self-selection bias); Miedijensky & Tal (influence of regular school, time, maturation); O’Byrne (indirectly: tests influence learning); Sherin & van Es (relationship between the intervention and the outcome measure may be cyclical); Wilhelm (differential maturation, differential exposure to the intervention, differential motivation).

The role of RTM in cases of ceiling performance at pre-test


One study did not document improvement between pre- and post-test: Ducate &
Lomicka’s (2009) findings could have been attributable to the RTM phenomenon,
as the authors note that the pre-test scores were at ceiling. RTM would predict
that these students would ‘naturally’ regress to the population mean (move from
ceiling downwards) given no intervention. In fact, it is possible that the instruction
did actually enhance the participants’ learning as there was no downward trend in
the post-test results, but stable scores or non-significant gains were observed. Fur-
thermore, other factors such as maturation, history and test effects may also have
contributed to the observed lack of gains, as these factors have less impact when
pre-test scores are already high or at ceiling. Critically, however, none of these
interpretations can be supported by a study design that did not include a comparison group.

The use of different pre- and post-test measures


Twelve studies used pre- and post-test measures that were either identical or two
slightly different versions of the same test. One study (Park et al., 2009) used a
pre-intervention measure that was different from the follow-up test: the pre-test
was the grade point average from the previous year’s science tests, and the post-
test was a specially designed astronomy test. Because two different tests were used,
this allowed regression effects to be more pronounced than if the same test had
been used. All tests have an error value, but two different but correlated tests
(indicated by the assumption that both measure a certain body of knowledge) will
have an even greater error component, allowing greater scope for regression
effects. When large and randomised samples are used in experiments with two or
more comparison groups, pre-tests are not, in fact, necessary in order to infer cau-
sality, but in pre-experimental designs the nature of the pre- and post-tests has
important consequences for the claims made.
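The point that two different but only correlated tests leave more room for regression effects can also be checked by simulation (the correlation and error parameters below are invented for illustration). When the pre- and post-test measure only partially overlapping constructs, the 'true' components themselves are imperfectly correlated, so low pre-test scorers regress upwards even more than on a re-sat identical test:

```python
import random

random.seed(5)

N = 6000
RHO = 0.6  # assumed true-score correlation between the two different tests

# Standardised true abilities for the pre-tested construct and a related one.
t = [random.gauss(0, 1) for _ in range(N)]
u = [RHO * ti + (1 - RHO ** 2) ** 0.5 * random.gauss(0, 1) for ti in t]

def noisy(x):
    # Every sitting of every test carries a measurement error component.
    return x + random.gauss(0, 0.7)

pre = [noisy(ti) for ti in t]
post_same = [noisy(ti) for ti in t]  # identical test re-sat
post_diff = [noisy(ui) for ui in u]  # a different, correlated test

def mean(xs):
    return sum(xs) / len(xs)

# Lowest pre-test quartile, selected on the pre-test alone.
low = sorted(range(N), key=lambda i: pre[i])[:N // 4]

regress_same = mean([post_same[i] - pre[i] for i in low])
regress_diff = mean([post_diff[i] - pre[i] for i in low])

# The apparent 'gain' of low scorers is larger when the tests differ.
print(round(regress_same, 2), round(regress_diff, 2))
```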

The notion of differential effectiveness as a function of pre-test score


In three studies (Brady et al., 2009; Park et al., 2009; Wilhelm, 2009) the authors
argued that the intervention had the greatest effect amongst those who achieved
low scores on the pre-test (see Appendix). Brady et al. (2009) stated that ‘… those
teachers who had higher scores at the outset generally had smaller gains; likewise,
teachers with initially low scores had the potential for larger improvement’ (p.
436). Wilhelm (2009) argued that ‘although the females scored lower than the
males on every [...] domain, females made gains that brought each of their post
domain items up to or beyond those of the males’ pre-scores. Also ... the post-[...]
tests displayed no significant difference between genders [despite] a significant dif-
ference between groups favouring males [at pretest]’ (p. 2118). Park et al. (2009)
divided their pre-intervention scores into five bands and found that ‘the improvement was different in accordance with the students’ pre-achievement, F(4,229)=7.853, p=.000 and gender. […] the lowest pre-achievement group’s improvement
was greater than other groups […] low achieving students made the most signifi-
cant gains after Computer Assisted Instruction (CAI) (p < .05) compared with
students who had high pre and post achievement levels’ (pp. 1006–1007). They
found that highest scorers at pre-test deteriorated at post-test (table 2, p. 1004).
They also argued that girls achieved higher at pre-test and did not improve at
post-test, and boys achieved worse at pre-test and made gains at post-test (table 4,
p. 1005). The authors suggested that the intervention had a differential impact
based on both gender and pre-intervention achievement. However, these two fac-
tors appear to be related, as the girls’ baseline scores were generally higher than
the boys’. It is therefore possible that RTM effects were, in part, responsible for
the gains seen. The conflation of pre-achievement level and gender, in the absence
of considering RTM effects, is also observed in the analysis of attitude data: ‘…
attitudes of lower achievers at pre-test were enhanced, while attitudes of higher
achievers at pre-test did not improve or deteriorated’ (p. 1003) and ‘… the boys
significantly enhanced their attitude to science through CAI, but CAI did not
appear to have the same significance for girls’ attitude to science’ (p. 1006).
Another study (Evagorou et al., 2009) may also illustrate that pre-achievement
level can partly explain the gains observed. Although the authors made no claims
about differential effectiveness as a function of pre-test scores, they did note that
‘… four of the 7 skills tested were initially quite undeveloped’ (Evagorou et al.,
2009, p. 670) (i.e., as these initial scores could have been skewed to the lower end
of the distribution, gains due to RTM were more likely).
These examples demonstrate that pre-post gains are sometimes used to argue
that interventions are most effective for those who have low scores at baseline.
Other examples can be found from outside our review, such as an evaluation by
Moore and Wade (1998) which concluded that ‘… five or six years after the inter-
vention the Reading Recovery teaching, the weakest group [had] overtaken initially
more able readers and performed better in both reading accuracy and comprehen-
sion’ (p. 201). Similarly, Benati et al. (2010) argued that ‘… learners who scored
lower on the pre-test improved more than the high scorers such that the two
groups were equal on the post-test’ (p. 127) and Bell et al. (2009) argued that
‘students of lower attainment at Key Stage 3 appear to perform better [in Science
GCSE] than would have been predicted from their Key Stage 3 attainment, but
that higher attaining pupils perform less well’ (p. 119).
The conclusion that an intervention is more beneficial for low achievers at base-
line is only warranted when an equivalent low-scoring sub-group from the control
or comparison group does not make gains equivalent to the low achievers from the
experimental group. This is demonstrated by Ben-David and Zohar (2009). They
randomly assigned equal numbers of low and high achievement participants to a
control and an intervention group, and were therefore able to conclude that their
intervention resulted in more learning gains for low achieving students than for
high achieving students. McCutchen et al. (2009) also reported a differential
impact of an intervention on low and high achievers (though assignment to condi-


tions used matched randomisation at the school level rather than at the level of
individual participants, and so RTM may have affected the different groups
unevenly, as described above).

Discussion
Control (or comparison) groups are important for avoiding unwarranted interpre-
tations of data from pre-post measurements. It should be noted that 14 of the 64
evaluation studies did use a comparison group, without pre-intervention measures;
and 34 of the 64 studies used both a pre-post design and a control/comparison
group (with or without random allocation to groups).
The use of control and comparison groups principally avoids unwarranted interpretations (internal validity). It can also improve ecological validity. For example,
using ‘test only’ groups can inform decisions when the intervention would be
added to the normal programme, and using comparison groups can help practitio-
ners determine the relative merits of different interventions. As discussed above,
random allocation is the best way of addressing history, maturation, test effects
and RTM effects. If a control group cannot be formed by random assignment then
a contemporaneous control group is preferable to no control group.
Another way of partly controlling for RTM effects is to undertake repeated multi-
ple baseline measurements, in an interrupted time series design, until a stable score
is achieved so as to reduce the margin of error of the test. This improves the validity
of associating any future gains with the intervention rather than RTM. This is often
done in cognitive psychology research in order to find an asymptote that is more
likely to reflect the ‘true’ value of the construct being measured. MacArthur and Lembo (2009) evaluated cognitive strategy instruction for writing skills. The three partici-
pants did between three and five pre-test essays to obtain a stable baseline (p. 1029).
The post-test consisted of three more essays. For two students, post-test scores were
all higher than stable baseline scores. For the other student, a slight increase was
observed at post-test over baseline. The authors note that the percentage of non-
overlapping data between stable baseline and post-test was 100% (p. 1029).
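The 'percentage of non-overlapping data' statistic reported by MacArthur and Lembo can be computed directly as the share of post-intervention points that exceed the highest baseline point. A minimal sketch (the essay scores below are invented, and the function name is our own):

```python
def percentage_non_overlapping(baseline, post):
    """Share of post-intervention data points above the best baseline point."""
    ceiling = max(baseline)
    above = sum(1 for score in post if score > ceiling)
    return 100.0 * above / len(post)

# Invented essay-quality scores for one participant.
baseline_essays = [3.0, 4.0, 3.5, 4.0, 3.5]  # stable multiple baseline
post_essays = [6.0, 5.5, 6.5]                # three post-intervention essays

print(percentage_non_overlapping(baseline_essays, post_essays))  # 100.0
```

A value of 100% means every post-intervention score cleared the entire baseline range, which is why a stable multi-point baseline strengthens the case against RTM as an explanation.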
Whilst such a research design is statistically more satisfactory, for the partici-
pant, teacher and policy-maker, it is time consuming and difficult to justify peda-
gogically. Randomisation is therefore probably a preferable method of addressing
the RTM problem, particularly as it also eliminates selection bias.
Pre-experimental designs do, however, have a role to play in educational
research. For example, before and after data can indicate the promise of an
intervention during its development phase. In this case researchers investigate the
potential for an intervention to improve scores in an iterative cycle of testing and
developing, though they should guard against over-interpretation beyond the
observation that the intervention has promise. Many of the studies we reviewed
also made useful contributions by demonstrating feasibility of implementation.
However, pre-experimental research in which the observed magnitude of gains
over time is ascribed uniquely to a causal relationship between the intervention
and the outcome measures is a concern. Furthermore, caution must be exercised
when using pre-experimental research to inform sample size calculations for
RCTs, because such studies tend to over-estimate intervention effects and so lead
to an under-estimation of the required sample size (Torgerson & Torgerson, 2008).
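The knock-on effect on trial planning can be made concrete with a standard normal-approximation sample size formula for a two-arm comparison of means (a generic illustration, not a procedure taken from Torgerson & Torgerson, 2008; the effect sizes of 0.6 and 0.3 are hypothetical):

```python
import math

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for a two-group comparison
    of means (two-sided test), given a standardised effect size d."""
    def z(p):
        # Inverse standard normal CDF via Newton iteration on math.erf.
        x = 0.0
        for _ in range(60):
            cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
            pdf = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
            x -= (cdf - p) / pdf
        return x
    z_alpha = z(1.0 - alpha / 2.0)
    z_beta = z(power)
    return math.ceil(2.0 * ((z_alpha + z_beta) / effect_size) ** 2)

# An inflated estimate from a pre-experiment versus a more realistic one:
print(n_per_group(0.6))  # 44 participants per arm
print(n_per_group(0.3))  # 175 per arm; a trial planned on d = 0.6 is underpowered
```

Halving the assumed effect size roughly quadruples the required sample, which is why an inflated pre-experimental estimate can leave a subsequent trial badly underpowered.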
We do not know the extent to which the effects outlined earlier influenced the
findings reported in the studies we reviewed. Thirteen of the 16 studies included all
the participants in all analyses and did not split the pre-test data into high and low
scorers. In such studies one might argue that movement up towards the mean by
the lower scorers and movement down towards the mean by the higher scorers
may have 'cancelled out' the effects of RTM. However, this is by no means certain,
as the movement of the lower and upper outliers due to RTM may not have been
equivalent. Indeed, equal upwards and downwards movement is unlikely given the
combined effects of history, maturation, test effects and the intervention (experimental
or comparison). These factors may reduce any regression down towards the mean
among the higher scorers but increase the regression up towards the mean among
the lower scorers. Clearly, some of the difference might be due to the intervention
actually being effective at improving the outcomes measured, but how much, if
any, is impossible to know given the limitations of the design.
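The artefact created by splitting a sample on pre-test scores can be demonstrated with a simulation in which, by construction, nothing changes between pre- and post-test except measurement error (a hypothetical sketch; the sample size and variances are arbitrary assumptions). Splitting at the median pre-test score manufactures an apparent gain for the low scorers and an apparent decline for the high scorers, while the whole-sample change stays near zero:

```python
import random
import statistics

random.seed(1)

N = 2000
NOISE_SD = 8.0

# Each participant has a stable 'true' score; the pre- and post-test each add
# independent measurement error. There is NO intervention effect here.
true_scores = [random.gauss(50.0, 10.0) for _ in range(N)]
pre = [t + random.gauss(0.0, NOISE_SD) for t in true_scores]
post = [t + random.gauss(0.0, NOISE_SD) for t in true_scores]

median_pre = statistics.median(pre)
low_gain = statistics.mean(po - pr for pr, po in zip(pre, post) if pr <= median_pre)
high_gain = statistics.mean(po - pr for pr, po in zip(pre, post) if pr > median_pre)
overall = statistics.mean(po - pr for pr, po in zip(pre, post))

# Low pre-test scorers appear to 'gain' and high scorers to 'lose', purely
# through regression to the mean; the overall change is near zero.
print(round(low_gain, 2))   # clearly positive, despite no real change
print(round(high_gain, 2))  # clearly negative
print(round(overall, 2))    # near zero
```

This is exactly the pattern that can be misread as 'the intervention worked best for the weakest students' when groups are formed from pre-test scores.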

Conclusions
In our small-scale methodological review of pre-experimental studies we have
illustrated that a number of authors using such research designs did not take into
account the potential biasing effects of history, maturation, test effects and RTM
in the discussion of their results. We found several studies that divided the
participants on the basis of their pre-test scores into low and high achievers and
argued that an intervention was more beneficial for those with low scores at
baseline, but did not discuss RTM as a possible factor influencing this finding.
In pre-experiments, history, maturation, test effects and RTM may not explain
all of the pre-post differences observed in these studies, and the experimental
interventions may be responsible for some of the effects observed. However,
because random allocation to experimental and comparison groups was not used,
we cannot tell the extent to which the differences were due to history, maturation,
test effects or the regression artefact. We do know, however, that some of the
observed difference is likely to be artefactual.
Randomised controlled trials are widely used to control for selection bias, that
is, bias arising where participants are selected on characteristics that may distort
the results. This paper has highlighted how randomised control groups are also
important for controlling history, maturation, test effects and the RTM phenomenon.
Our review found that about one fifth of the evaluation studies did use a comparison
group, and about half used pre-intervention measures in addition to a comparison
group, some with random allocation. This illustrates that such designs are feasible
in instructional settings.
Acknowledgements
We thank David Torgerson for his useful comments and suggestions on an earlier
draft of the paper.

Notes
1. The data were, in fact, from a trial that used matched randomisation to an experimental
and a comparison group, thereby controlling for RTM effects.
2. For reasons of space, statistics are not provided, but the data can be found in Marsden
(2004).
3. The majority of studies that were not evaluation studies aimed to explore potential
relationships, define constructs, or document processes.

Notes on contributors

Emma Marsden is Senior Lecturer in Language Education in the Department of
Education at the University of York. Her interests are in research methods
and design, both for general educational and applied linguistics research,
language learning theories, and foreign language education. She has worked
on projects funded by the ESRC, DfES, British Academy, HEA LLAS
Subject Centre and the British Council.

Carole Torgerson is Professor of Education in the School of Education at Durham
University. Her main methodological research interests are in experimental
methods (randomised controlled trials and quasi-experiments) and research
synthesis. She has received awards from the DfES, DCSF, Home Office,
ESRC, HEA, CfBT, HEFCE, NIHR and a range of other organisations.

References
Annetta, L., Mangrum, J., Holmes, S., Collazo, K. & Cheng, M. (2009) Bridging realty to virtual reality: investigating gender effect and student engagement on learning through video game play in an elementary school, International Journal of Science Education, 31(8), 1091–1113.
Bell, J., Donnelly, J., Homer, M. & Pell, G. (2009) A value-added study of the impact of science curriculum reform using the national pupil database, British Educational Research Journal, 35(1), 119–135.
Benati, A., Lee, J. & McNulty, E. (2010) Exploring the effects of Processing Instruction on a discourse-level guided composition, in: A. Benati & J. Lee (Eds) Processing instruction and discourse (London, Continuum), 97–147.
Ben-David, A. & Zohar, A. (2009) Contribution of meta-strategic knowledge to scientific inquiry learning, International Journal of Science Education, 31(12), 1657–1682.
Brady, S., Gillis, M., Smith, T., Lavalette, M., Liss-Bronstein, L., Lowe, E., North, W., Russo, E. & Wilder, T.D. (2009) First grade teachers' knowledge of phonological awareness and code concepts: examining gains from an intensive form of professional development and corresponding teacher attitudes, Reading and Writing: An Interdisciplinary Journal, 22(4), 425–429.
Campbell, D.T. & Stanley, J.C. (1963) Experimental and quasi-experimental designs for research (Chicago, IL, Rand McNally).
Cook, T.D. & Campbell, D.T. (1979) Quasi-experimentation: design and analysis issues for field settings (Boston, MA, Houghton Mifflin).
Ducate, L. & Lomicka, L. (2009) Podcasting: an effective tool for honing language students' pronunciation? Language Learning and Technology, 13(3), 66–86.
Evagorou, M., Korfiatis, K., Nicolaou, C. & Constantinou, C. (2009) An investigation of the potential of interactive simulations for developing thinking skills in elementary school: a case study with fifth-graders and sixth-graders, International Journal of Science Education, 31(5), 655–674.
Galton, F. (1886) Regression towards mediocrity in hereditary stature, The Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246–263.
Grace, M. (2009) Developing high quality decision-making discussions about biological conservation in a normal classroom setting, International Journal of Science Education, 31(4), 551–570.
Graebner, I.T., de Souza, E.M.T. & Saito, C.H. (2009) Action-research and food and nutrition security: a school experience mediated by conceptual graphic representation tool, International Journal of Science Education, 31(6), 809–827.
Guisasola, J., Solbes, J., Barragues, J.-I., Morentin, M. & Moreno, A. (2009) Students' understanding of the special theory of relativity and design for a guided visit to a science museum, International Journal of Science Education, 31(15), 2085–2104.
Jones, G., Taylor, A. & Broadwell, B. (2009) Estimating linear size and scale: body rulers, International Journal of Science Education, 31(11), 1495–1509.
Lipsey, M.W. & Wilson, D.B. (1993) The efficacy of psychological, educational and behavioral treatment: confirmation from meta-analysis, American Psychologist, 48(12), 1181–1209.
MacArthur, C.A. & Lembo, L. (2009) Strategy instruction in writing for adult literacy learners, Reading and Writing, 22(9), 1021–1039.
Marsden, E. (2004) Teaching and learning of French verb inflections: a classroom experiment using processing instruction. Unpublished Ph.D. dissertation, University of Southampton.
Marsden, E. (2006) Exploring input processing in the classroom: an experimental comparison of processing instruction and enriched input, Language Learning, 56, 507–566.
McCutchen, D., Green, L., Abbott, R. & Sanders, E. (2009) Further evidence for teacher knowledge: supporting struggling readers in grades three through five, Reading and Writing: An Interdisciplinary Journal, 22(4), 401–423.
Miedijensky, S. & Tal, T. (2009) Embedded assessment in project-based science courses for the gifted: insights to inform teaching all students, International Journal of Science Education, 31(18), 2411–2435.
Moore, M. & Wade, B. (1998) Reading and comprehension: a longitudinal study of ex-Reading Recovery students, Educational Studies, 24, 195–203.
Newton, D.P. & Newton, L.D. (2009) Knowledge development at the time of use: a problem-based approach to lesson planning in primary teacher training in a low knowledge, low skill context, Educational Studies, 35(3), 311–321.
Norris, J. & Ortega, L. (2000) Effectiveness of L2 instruction: a research synthesis and quantitative meta-analysis, Language Learning, 50, 417–528.
O'Byrne, B. (2009) Knowing more than words can say: using multimodal assessment tools to excavate and construct knowledge about wolves, International Journal of Science Education, 31(4), 523–539.
Park, H., Khan, S. & Petrina, S. (2009) ICT in science education: a quasi-experimental study of achievement, attitudes toward science, and career aspirations of Korean middle school students, International Journal of Science Education, 31(8), 993–1012.
Shadish, W.R., Cook, T.D. & Campbell, D.T. (2002) Experimental and quasi-experimental designs for generalized causal inference (Boston, MA, Houghton Mifflin).
Sherin, M.G. & van Es, E.A. (2009) Effects of video club participation on teachers' professional vision, Journal of Teacher Education, 60(1), 20–37.
Sherrod, S.E. & Wilhelm, J. (2009) A study of how classroom dialogue facilitates the development of geometric spatial concepts related to understanding the cause of moon phases, International Journal of Science Education, 31(7), 873–894.
Spalding, E., Wang, J., Lin, E. & Hu, G. (2009) Analyzing voice in the writing of Chinese teachers of English, Research in the Teaching of English, 44(1), 23–51.
Taylor, A. & Jones, G. (2009) Proportional reasoning ability and concepts of scale: surface area to volume relationships in science, International Journal of Science Education, 31(9), 1231–1247.
Thorndike, R.L. (1942) Regression fallacies in the matched groups experiment, Psychometrika, 7, 85–102.
Torgerson, C. & Torgerson, D. (2008) Designing and running randomised trials in health, education and the social sciences: an introduction (Basingstoke, Palgrave Macmillan).
Tsaparlis, G. & Papaphotis, G. (2009) High-school students' conceptual difficulties and attempts and conceptual change: the case of basic quantum chemical concepts, International Journal of Science Education, 31(7), 895–930.
Wilhelm, J. (2009) Gender differences in lunar-related scientific and mathematical understandings, International Journal of Science Education, 31(15), 2105–2122.
Downloaded by [University of York] at 09:22 19 December 2012

Appendix

Data extracted from included studies with a pre-experiment (single group pre-post test) design

Results, as
reported by
authors, including RTM mentioned?
Objective statistics and (Points that could
Topic; Participants; of study; references to Did authors ascribe relate to RTM, as
Setting; relevant tables a causal interpreted by
Study Country Sample size Measures where appropriate relationship? reviewers)

Annetta, Science 5th grade Examine students’ ‘The gain from pre- Yes. ‘… the MEGA No.
Mangrum, Education; students ‘of learning of simple test to post-test integrated into an
Holmes, varying machines; overall was elementary school
Collazo & academic levels’ significant for the science class did result
Cheng. 10–11 years; sample exposed to in the learning of key
MEGA ...’ (p. science concepts for
1100). ‘The overall fifth-grade boys’ and
gain from pre-test to girls’ learning of
post-test was simple machines’
significant (0.000), f (p. 1104).
= 67.02’ (p. 1100)
Tables 2 & 3.
Elementary Measures: pre and
school; post test measuring
basic knowledge of
the six simple
machines and the
purpose of each.
U.S.A. n=74 students
Brady, Gillis, First grade Examine efficacy of Scores on knowledge Yes. ‘Encouragingly, No.
Pre- and post-test research designs

Smith, teachers; an intensive form of survey: ‘indicated this model of PD

(Continued)
597
Downloaded by [University of York] at 09:22 19 December 2012

Appendix (Continued)
598

Results, as
reported by
authors, including RTM mentioned?
Objective statistics and (Points that could
Topic; Participants; of study; references to Did authors ascribe relate to RTM, as
Setting; relevant tables a causal interpreted by
Study Country Sample size Measures where appropriate relationship? reviewers)

Lavalette, Professional professional weak knowledge of generated substantial


Liss- development of development for phonological overall gains on the
Bronstein, teachers; building the awareness and survey of teachers’
Lowe, North, knowledge of first- phonics concepts knowledge’ (p. 443)
Russo & grade teachers in the prior to PD [average ‘Thus, there are effects
Wilder areas of phonological 42.6% correct] and for time with final
awareness and large, significant teacher knowledge
E. Marsden and C.J. Torgerson

phonics; gains in each year by scores ... being


year-end [average significantly higher
74.1% correct] … on than their respective
all [three] subtests beginning scores’ (p.
and on the total 437) ‘assessment of
score’ (pp. 436, teacher attitudes
437). Repeated indicates that positive
Measures ANOVAs feelings about the PD
showed statistically increased, as did
significant differences personal commitment
for phonological to participate’ (p. 439)
awareness, Code, ‘The present study
Fluency and Oral demonstrated the value
language, (p. 437). of an intensive form of
‘With large effect PD provided by skilled
sizes for [PA] and mentors for building
[C] (.73 and .80) (p. teacher knowledge’ (p.
443). 447).
Elementary Attitudes: ‘Repeated But authors state: ‘At (But authors state
school; measure ANOVAs this point one cannot ‘we only have been
(Continued)
Downloaded by [University of York] at 09:22 19 December 2012

Appendix (Continued)

Results, as
reported by
authors, including RTM mentioned?
Objective statistics and (Points that could
Topic; Participants; of study; references to Did authors ascribe relate to RTM, as
Setting; relevant tables a causal interpreted by
Study Country Sample size Measures where appropriate relationship? reviewers)

indicated significant conclude whether the able to identify a


effects of time with gains stem from the modest portion of
higher scores at the extent of classroom the variance
end of the year for support provided for accounting for
… self-efficacy … individual teachers or teachers’ responses
and … positive from other attributes’ … and … scores)’
attitudes toward PD (p. 444). Other factors (p. 446).
… lower scores at mentioned such as
the end of the year variable administrative
on negative attitudes’ support and sufficient
(p. 438). Tables 3 & time and resources for
4. PD meetings.
U.S.A. n=57 (n=65 Measures: Survey to And authors state ‘The
for analysis ofassess teacher instrument … has not
Teacher knowledge needed to been normed or
Knowledge) teach basic reading standardized’ p. 446).
skills and a Teacher
Attitude Survey
(TAS) to measure
attitudes to the
professional
development.
Ducate & Foreign Undergraduates Using podcasts to No statistically Yes, indirectly. No.
Lomicka language (18–22 years improve foreign significant Negative result
education; old); language improvement in reported, causal
Pre- and post-test research designs

pronunciation; most of the measures relationship implied

(Continued)
599
Downloaded by [University of York] at 09:22 19 December 2012

Appendix (Continued)
600

Results, as
reported by
authors, including RTM mentioned?
Objective statistics and (Points that could
Topic; Participants; of study; references to Did authors ascribe relate to RTM, as
Setting; relevant tables a causal interpreted by
Study Country Sample size Measures where appropriate relationship? reviewers)

(comprehensibility, (p. 67). Reasons given


accentedness, and for lack of gains: ‘16
attitudes towards weeks is not a
pronunciation) (p. sufficient amount of
73). One sig time to make gains in
difference in French pronunciation …’; ‘…
comprehensibility podcasting and
E. Marsden and C.J. Torgerson

ratings (p. 73). repeated readings


alone are not enough
to improve
pronunciation over an
academic semester’
(pp.76, 77).
University; n=12 German Measures: Pre and (But authors argue
students; n=10 post assessments of gains were not
French speech samples, from observed because
students. identical scripted pre-test scores were
podcast, for high and future
’comprehensibility’ research should be
and ’accentedness’. done with lower
achievers as
significant
improvement may
then be detected (p.
77).)
U.S.A. Attitudes towards
pronunciation using
Pronunciation
Attitude Inventory.
(Continued)
Downloaded by [University of York] at 09:22 19 December 2012

Appendix (Continued)

Results, as
reported by
authors, including RTM mentioned?
Objective statistics and (Points that could
Topic; Participants; of study; references to Did authors ascribe relate to RTM, as
Setting; relevant tables a causal interpreted by
Study Country Sample size Measures where appropriate relationship? reviewers)

Evagorou, Thinking skills; 11–12 year olds; To investigate the Considerable Yes (and indirectly, No.
Korfiatis, impact of a improvements in the no). ‘The proposed
Nicolaou & simulation-based participants’ system learning environment
Constantinou. learning environment thinking skills, on six provoked considerable
on students’ measures, but not on improvements in some
development of ‘feedback thinking’ system thinking skills
system thinking (pp. 664–669 and during a relatively brief
skills; Tables 2, 3 & 4). learning process.’
(p. 656) ‘… after the
instruction the total
number of referred
elements increased’
(p. 664). ‘We have to
admit the failure of
our intervention in
promoting feedback
thinking’ (p. 671).
Elementary n=13 Measures: 2 tests, But authors state that
school; both used as pre and results: ‘… could be
post tests. Each test positively affected by
with tasks the fact that [the
corresponding to students] voluntarily
seven thinking skills. participated in the
project’ (p. 669).
Cyprus.
Pre- and post-test research designs

Grace Decision-making 15–16 year olds; Can peer group About three quarters Yes and No. No.
in science decision-making of the students ‘Discussions … had a
classrooms; discussions help modified their marked impact on
601

develop students’ proposed solutions students’ proposed


(Continued)
Downloaded by [University of York] at 09:22 19 December 2012

Appendix (Continued)
602

Results, as
reported by
authors, including RTM mentioned?
Objective statistics and (Points that could
Topic; Participants; of study; references to Did authors ascribe relate to RTM, as
Setting; relevant tables a causal interpreted by
Study Country Sample size Measures where appropriate relationship? reviewers)

personal reasoning in (p. 557). ‘A solutions’ (p. 567).


relation to comparison of pre- And ‘the time span
conservation issues? and post-test between the pre and
Measures: Pre and comments revealed a post-tests was
post questionnaire general shift to considered short
higher-level enough to minimise
responses following the possible impact of
E. Marsden and C.J. Torgerson

the discussions …’ other external


(p. 559). 54% of influences …’ (p. 557).
student exhibited an And ‘… most students’
increased quality of knowledge and
response; 40% awareness of values …
remained at the same increased after peer
level, and 6% discussion’ (p. 567).
dropped down a
level Almost 20% of
students moved from
level 3 to level 4 (p.
559).
Secondary n=131 (4 intact But ‘it is not possible
school; classes) to establish with
certainty that the
differences between an
individual’s pre-test
and post-test
statements were the
direct result of the

(Continued)
Downloaded by [University of York] at 09:22 19 December 2012

Appendix (Continued)

Results, as
reported by
authors, including RTM mentioned?
Objective statistics and (Points that could
Topic; Participants; of study; references to Did authors ascribe relate to RTM, as
Setting; relevant tables a causal interpreted by
Study Country Sample size Measures where appropriate relationship? reviewers)

discussions …’ (p.
556).
U.K.
Guisasola, Physics education 1st year How does a museum Increases between Yes. ‘The results show No.
Solbes, in Engineering undergraduates; visit influence pre and post that the teaching
Barragues, course; students measures in sequence and
Morentin & understanding of the understanding. exhibition visit have
Moreno Special theory of Figures 1, 2 and 3 increased the students’
Relativity (STR) and showing differences interest, knowledge,
its applications? Do between pre and post and understanding of
students use more measures for: correct the STR and its
scientific arguments explanations of applications’
when discussing aspects of STR; (p. 2100).
topics related to the scientific arguments
Special theory of applied; proportions
Relativity after of three or more
visiting the mentions of
exhibition? applications.
University n=35 Measures: to Acknowledge test-
measure effect: ’change in the
understanding, a students’
questionnaire as the understanding, can be
pre-test; a written
report structured
around similar
Pre- and post-test research designs

questions to pre-test
for the post test.
(Continued)
603
Downloaded by [University of York] at 09:22 19 December 2012

Appendix (Continued)
604

Results, as
reported by
authors, including RTM mentioned?
Objective statistics and (Points that could
Topic; Participants; of study; references to Did authors ascribe relate to RTM, as
Setting; relevant tables a causal interpreted by
Study Country Sample size Measures where appropriate relationship? reviewers)

Spain attributed not only to


the visit to the
museum, but rather to
the overall pre-visit,
visit, and post-visit
teaching process’
(p. 2091).
E. Marsden and C.J. Torgerson

Jones, Taylor, & Maths 6th – 9th grade To examine the ‘The results of this Yes. ‘Results of a No.
Broadwell Education; impact of teaching study revealed that paired-sample t-test
students to use their teaching students to suggested that the
bodies as rough use their body as a significant changes in
measurement tools rough measurement pre-test and post-test
on their ability to tool … increased scores for the LMA
estimate linear their ability to were not due to
measurements. accurately estimate random chance but
linear sizes (see instead are probably
Table 1). The mean due to the intervention
score on the pre-test the students received
for the LMA was as a result of
26.21 (SD = 4.57) completing the metric
whereas the mean measurement tasks’
score for the post- (p. 1504) .
test increased to
30.68 (SD = 3.43)’.
(p. 1504, and see p.
1495); ‘Paired t-tests
… found significant

(Continued)
Downloaded by [University of York] at 09:22 19 December 2012

Appendix (Continued)

Results, as
reported by
authors, including RTM mentioned?
Objective statistics and (Points that could
Topic; Participants; of study; references to Did authors ascribe relate to RTM, as
Setting; relevant tables a causal interpreted by
Study Country Sample size Measures where appropriate relationship? reviewers)

differences for …
object estimation …,
kinaesthetic estimation
..., and body ruler’
(p. 1504)
Summer Camp; 11–13 years; Measure: 20 item But authors state: ‘All
test to assess of the participants
understanding of were volunteers
metric scale. enrolled in science
summer camp. As
such, they are most
probably not
representative of the
variation that would
exist for all students of
this age’ (p. 1503).
U.S.A. n=19
Miedijensky Impacts of 12–15 year olds; To document ‘Significant Yes. Causal links No.
&Tal assessment for student views on and differences between between experiencing
learning reactions to the pre/post AFL and positive
approach assessment for questionnaires were views about AFL: ‘Our
amongst gifted learning, amongst found with regard to findings indicate
and talented; students taking one- the three main positive impacts of
pull-out year project-based categories and most AFL on the students’
programme for science courses for of the subcategories.’ views of assessment …’
gifted and the gifted. (p. 2430). Also,
Pre- and post-test research designs

talented; relationship between


(Continued)
605
Downloaded by [University of York] at 09:22 19 December 2012

Appendix (Continued)
606

Results, as
reported by
authors, including RTM mentioned?
Objective statistics and (Points that could
Topic; Participants; of study; references to Did authors ascribe relate to RTM, as
Setting; relevant tables a causal interpreted by
Study Country Sample size Measures where appropriate relationship? reviewers)

AFL and learning:


Figure 3 and ‘By
following the students
through ... projects,
developing the
assessments with them,
... we showed how
E. Marsden and C.J. Torgerson

assessment supported
learning … the
findings of this study
have strengthened our
belief that the
students’ voice is
important to further
improve the
assessment and its
impact on learning
(p. 2432).
Israel n=86 Measures: pre-post ‘… significant shift But authors state:
questionnaire: toward a more ‘Since the courses took
general view of complex view of the time, and the meetings
assessment, ideas different dimensions occurred only once a
about assessment of assessment …’ p. week, one could claim
modes, and 2421; Table 4 & 5; that other factors such
relationships Figure 2 as the regular school
or even time and
maturation should be
(Continued)
Downloaded by [University of York] at 09:22 19 December 2012

Appendix (Continued)

Results, as
reported by
authors, including RTM mentioned?
Objective statistics and (Points that could
Topic; Participants; of study; references to Did authors ascribe relate to RTM, as
Setting; relevant tables a causal interpreted by
Study Country Sample size Measures where appropriate relationship? reviewers)

considered to
contribute to this shift.
We cannot entirely
denounce this concern;
however, our data
indicate nothing of this
sort of assessment was
employed in the
regular schools; and
indicate the students
strongly associated
their views to the
assessment
components, and
provided relevant
examples that support
our claim’ (p. 2430).
between assessment
and learning;
12 post-treatment
interviews
Newton & Teacher Primary teacher Impact of a problem- Statistically Yes. ‘… there was a No.
Newton education; trainees, and solving approach to significant increase, very large increase in
PGCE tutors; lesson planning in an (very large effect student confidence in
area where trainees size), in students’ planning science
Pre- and post-test research designs

reported confidence lessons which they


(Continued)
607
Downloaded by [University of York] at 09:22 19 December 2012

Appendix (Continued)
608

Results, as
reported by
authors, including RTM mentioned?
Objective statistics and (Points that could
Topic; Participants; of study; references to Did authors ascribe relate to RTM, as
Setting; relevant tables a causal interpreted by
Study Country Sample size Measures where appropriate relationship? reviewers)

have little subject in planning to teach, ascribed to the PBL


knowledge. mean score element (effect size,
increasing from ‘3.24 2.17)’ (pp. 318–319).
at the outset to 6.49
… effect size, 2.17’
(pp. 317–319). ‘The
grades awarded for
E. Marsden and C.J. Torgerson

solutions to
Problems 1 and 6
suggested an increase
in the students’
lesson planning skills
(effect size, 0.95) (p.
319); and ‘the mean
score for the
relatively easy
Problem 1 was 4.40.
For the relatively
difficult Problem 6,
it was 6.82, an
increase that was
statistically
significant’ (p. 318).
PGCE teacher n=75 PGCE Measures: Before But authors state that
training course; students; and and after ‘these judgements [by
comparisons of a) the tutor about quality

(Continued)
Downloaded by [University of York] at 09:22 19 December 2012

Appendix (Continued)

Results, as
reported by
authors, including RTM mentioned?
Objective statistics and (Points that could
Topic; Participants; of study; references to Did authors ascribe relate to RTM, as
Setting; relevant tables a causal interpreted by
Study Country Sample size Measures where appropriate relationship? reviewers)

n=3 PGCE reported self- of lesson plans] were


tutors confidence and b) subjective and not
quality of solutions based on objective
to problems. criteria’ (p. 318).
U.K.
O’Byrne Understanding of Grade 2 pupils Impact of multi- Descriptive statistics Yes. ‘Their mastery of No.
wolves; from one modal activities and of individual test the
school; small group items (pp. 529–535).
discussion on Tables 1 & 2.
developing Measurement-
understanding of dependent gains
wolves; between pre and post
test, e.g. ’Every pair
of pre–post-unit
drawings showed
refinement of
concepts of wolves in
Primary school; n (aggregated Measures: Pre-post nature as distinct concept of form
over 3 True-false test; Pre- from fictional or emerged once they
consecutive post drawings of imaginary wolves, were provided carefully
years)=43; wolves. more commonly constructed assessment
featured in pre-unit and learning tasks that
drawings’ (p. 538). drew out this
knowledge’ (p. 533);
‘Some of these shifts
Pre- and post-test research designs

were no doubt the


result of small-group
investigations’
609

(p. 534). ‘Multimodal


(Continued)
Downloaded by [University of York] at 09:22 19 December 2012

Appendix (Continued)
610

Results, as
reported by
authors, including RTM mentioned?
Objective statistics and (Points that could
Topic; Participants; of study; references to Did authors ascribe relate to RTM, as
Setting; relevant tables a causal interpreted by
Study Country Sample size Measures where appropriate relationship? reviewers)

Assessment Tools
Support Gradual
Concept Change’ (p.
537).
U.S.A. Implicit in design is
that ‘test-effect’ used,
indirectly, as a learning
E. Marsden and C.J. Torgerson

tool.
After one year
aggregated n=28
Park, Khan & Petrina
Topic; Setting; Country: ICT in science education; Middle school; Korea.
Participants; Sample size: Grade 8 students from one school; n=234.
Objective of study; Measures: Impact of Computer Assisted Instruction (CAI) on achievement and attitudes. Measures: comparisons of post-achievement test to students' Grade Point Average in the previous year; impact on attitudes to science and future courses and career aspirations measured by pre- and post-questionnaires.
Results, as reported by authors: '... After CAI classes students' achievement in science improved significantly ... , [and] The mean differences of students' Attitude to Science before and after CAI were significant ... , with students having more positive attitudes towards science after CAI' (Table 1, p. 1003).
Did authors ascribe a causal relationship?: Yes. 'CAI was significantly correlated with improvement in most of the achievement groups' (p. 1003). 'Collectively, student achievement in the post-achievement test improved significantly compared to their achievement in science prior to CAI' (p. 1006). But 'Although there are a number of factors that likely contributed to the outcomes measured in this study, the potential contributions of CAI on low achieving students and girls in science are intriguing' (p. 1008).
RTM mentioned?: No.
Sherin & van Es
Topic; Setting; Country: Maths teacher education; Elementary and middle schools; U.S.A.
Participants; Sample size: Teachers; n=4 + 7.
Objective of study; Measures: Document the effect of teachers discussing video-recorded lessons ('video clubs') on teacher learning. Measures: quantity and nature of discussion during video clubs (pre-post); classroom observations (early-late); 'noticing interviews' (pre and post).
Results, as reported by authors: Quantitative improvement on all indicators of teachers' attention to students' mathematical thinking, amongst all teachers on all measures (Tables 3–11, pp. 25–32). '... not only did the teachers, over time, come to use more sophisticated strategies for reasoning about student thinking ... they also came to notice more complex issues of student thinking' (p. 27). And 'in Meeting 1, only 25% of the comments about the student concerned mathematical thinking ... in Meeting 10, 92% ... were ... to do with mathematical thinking' (p. 28).
Did authors ascribe a causal relationship?: Yes. '… there is a strong alignment between the reasoning strategies developed in the video club and those displayed in the later classroom observations.' But the authors state that the causal relationship may be bi-directional, with teachers' instruction and the video clubs mutually influencing one another (p. 33).
RTM mentioned?: No.
Sherrod & Wilhelm
Topic; Setting; Country: Understanding geometric spatial concepts related to the cause of lunar phases; Middle school; U.S.A.
Participants; Sample size: 7th grade; n=92 (5 classes taught by the same teacher).
Objective of study; Measures: Will classroom dialogue facilitate students' understanding of lunar concepts related to geometric spatial visualisation? Measures: 2D drawings, before and after dialogue; and the Lunar Phases Concept Inventory as pre- and post-measure.
Results, as reported by authors: Learners demonstrated new understanding of three concepts tested: '… significant gains in scientific and geometric understanding …, but also … accelerated dedication to inquiry teaching' (p. 877); increased proportions of students demonstrating understanding (p. 881).
Did authors ascribe a causal relationship?: Yes (implicitly). 'Following classroom discourse, 8.7%, 6.5% and 7.6% demonstrated new understanding of the spatial geometric configuration …' (Tables 3, 4 & 5, p. 882).
RTM mentioned?: No.
Spalding, Wang, Lin & Hu
Topic; Setting; Country: Development of writing in English as a foreign language; China.
Participants; Sample size: Chinese teachers of English as a foreign language (grades 3–12); n=57.
Objective of study; Measures: Effect of a 3-week summer writing workshop on teachers' English writing. Measures: pre- and post-workshop writing samples, assessed using the 6 + 1 Trait[R] analytical model.
Results, as reported by authors: 'Scoring showed that the teachers' writing improved significantly in the course of the institute' (p. 23). Greatest gains in 'voice' (Tables 1, 2 & 3).
Did authors ascribe a causal relationship?: Yes. 'The writing course led to significant gains in writing scores' (p. 48).
RTM mentioned?: No.
Taylor & Jones
Topic; Setting; Country: Science education; science summer camp; Middle school; U.S.A.
Participants; Sample size: 11–13 year olds enrolled on a science summer camp; n=19.
Objective of study; Measures: Impact of a series of science investigations on improving understanding of surface area to volume relationships. Measures: understanding tested by pre-post achievement tests; relationship of understanding to proportional reasoning ability tested in a one-off achievement test.
Results, as reported by authors: 'A significant correlation between proportional reasoning ability and students' understanding of surface area to volume relationships. Mean score on the pre-test was 54.42 (SD 20.41) whereas the mean score on the post-test increased to 75.89 (SD 19.71)' (pp. 1235–1236, 1236–1237).
Did authors ascribe a causal relationship?: Yes. 'Results of a paired-sample t-test suggested that the significant changes in pre-test and post-test for the ASAVA were not due to random chance but instead are probably due to the intervention the students received as a result of completing the surface area to volume application tasks' (p. 1236).
RTM mentioned?: No.
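Taylor & Jones's causal claim rests on a paired-sample t-test of pre- against post-test scores. As a reminder of what that test actually computes, here is a minimal sketch in Python; the scores are invented for illustration and are not the study's data:

```python
import math
from statistics import mean, stdev

def paired_t(pre, post):
    """Paired-sample t statistic: mean of the per-student differences
    divided by the standard error of those differences."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    se = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / se, n - 1  # t statistic and degrees of freedom

# Synthetic pre/post scores (illustrative only, NOT Taylor & Jones's data)
pre  = [40, 55, 62, 48, 70, 35, 58, 66, 51, 44]
post = [58, 72, 75, 60, 85, 50, 70, 80, 65, 59]

t, df = paired_t(pre, post)
print(f"t({df}) = {t:.2f}")  # compare against the t distribution with n-1 df
```

A large t makes chance an unlikely explanation for the gain; it cannot, by itself, separate the intervention from maturation, practice effects, or regression to the mean, which is precisely the design concern at issue in this review.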
Wilhelm
Topic; Setting; Country: Science education; University; U.S.A.
Participants; Sample size: 'Middle level' students; n=123.
Objective of study; Measures: Examine gender differences in lunar phases understanding. Measures: a Lunar Phases Concept Inventory (20 items) and a Geometric Spatial Assessment (GSA) (16 items).
Results, as reported by authors: 'The mean pre-test score was 31.2% … and the mean post-test score was 52.9% …. A repeated-measures ANOVA revealed a significant increase … from pre-test to post-test on overall test scores (see Table 3)' (p. 2112). GSA: 'The mean pre-test score was 49.4% … and the mean post-test score was 56.2% … A repeated-measures ANOVA revealed a significant increase … (see Table 6)' (p. 2116).
Did authors ascribe a causal relationship?: Yes. 'The partial η2 value of 0.703 indicates that approximately 70.3% of the gain in lunar-related understanding can be directly attributed to the inquiry Moon unit' (p. 2113). 'The partial η2 value of 0.151 indicates that approximately 15.1% of the gain in lunar-related understanding can be directly attributed to the inquiry Moon unit' (p. 2116). 'Findings suggest that both scientific and mathematical understandings can be significantly improved for both sexes through the use of spatially focused, inquiry-oriented curriculum such as REAL' (pp. 2105, 2120). But the authors state: 'The other 30% [of gain in scores] could be attributed to differential maturation, differential exposure to the intervention, differential motivation, and so forth' (p. 2113).
RTM mentioned?: No.
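None of the studies above mentions regression to the mean (RTM), yet the phenomenon is easy to demonstrate by simulation. The sketch below (all numbers are illustrative assumptions, drawn from no reviewed study) retests a group selected for low pre-test scores and shows an apparent 'gain' arising from measurement error alone, with no intervention at all:

```python
import random
from statistics import mean

random.seed(42)  # reproducible illustration

TRUE_ABILITY_MEAN, ABILITY_SD, NOISE_SD = 50, 10, 10  # assumed values
N = 1000

abilities = [random.gauss(TRUE_ABILITY_MEAN, ABILITY_SD) for _ in range(N)]
# Two test occasions: true ability plus independent measurement error; no teaching occurs.
pre  = [a + random.gauss(0, NOISE_SD) for a in abilities]
post = [a + random.gauss(0, NOISE_SD) for a in abilities]

# Select the 'low achievers' on the pre-test, as a remedial study might.
low = [i for i, score in enumerate(pre) if score < 40]

print(f"selected group pre-test mean:  {mean(pre[i] for i in low):.1f}")
print(f"selected group post-test mean: {mean(post[i] for i in low):.1f}")
# The post-test mean rises even though nothing was taught: regression to the mean.
```

Students scoring unusually low at pre-test did so partly through bad luck; on retest their errors are drawn afresh, so the selected group's mean moves back toward the population mean. A single-group pre-post design cannot distinguish this artefact from a genuine intervention effect.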
