To cite this article: Emma Marsden & Carole J. Torgerson (2012): Single group, pre- and post-test
research designs: Some methodological concerns, Oxford Review of Education, 38:5, 583-616
Oxford Review of Education
Vol. 38, No. 5, October 2012, pp. 583–616
This article provides two illustrations of some of the factors that can influence findings from pre- and post-test research designs in evaluation studies, including regression to the mean (RTM), maturation, history and test effects. The first illustration involves a re-analysis of data from a study by Marsden (2004), in which pre-test scores are plotted against gain scores to demonstrate RTM effects. The second illustration is a methodological review of single group, pre- and post-test research designs (pre-experiments) that evaluate causal relationships between intervention and outcome. Re-analysis of Marsden's prior data shows that learners with higher baseline scores consistently made smaller gains than those with lower baseline scores, demonstrating that RTM is clearly observable in single group, pre-post test designs. Our review found that 13% of the sample of 490 articles were evaluation studies. Of these evaluation studies, about half used an experimental design. However, a quarter used a single group, pre-post test design, and researchers using these designs did not mention possible RTM effects in their explanations, although other explanatory factors were mentioned. We conclude by describing how using experimental or quasi-experimental designs would have enabled researchers to explain their findings more accurately, and to draw more useful implications for pedagogy.
Keywords: education research; pre-experiment; regression to the mean; control group; research design;
research methods
Introduction
Threats to the causal validity of the single group pre- and post-test design
A number of threats to the single group design weaken a causal interpretation.
Some of these, such as attrition or un-blinded assessment, are common to experi-
mental or multiple group designs and we will not discuss them further (Cook &
Campbell, 1979). Others, however, such as maturation, history, test effects and
regression effects cannot be controlled for using a single group pre- and post-test
design, and we discuss these below.
History
The pre-experimental design cannot control for the contemporaneous effects of
‘normal’ educational experience or innovations in practice and policy that may
account for some or all of the observed changes. A design using a control or
comparison group is usually necessary to account for the possible effects of these
on post-test scores (Cook & Campbell, 1979; Shadish et al., 2002; Torgerson &
Torgerson, 2008).
Maturation
Learners tend to improve in their educational outcomes over time simply due to
increasing maturity. In the absence of a control group we cannot control for matu-
ration effects because these will tend to affect post-test scores regardless of any
new intervention being evaluated. The greater the time difference between pre-
and post-test, the greater the potential effects due to maturation (Cook & Camp-
bell, 1979; Shadish et al., 2002).
Test effects
Evaluations usually involve some form of measurement, before and after the inter-
vention. It is possible that improvements can result from the test itself, attributable
to factors such as participants remembering questions or the questions raising
awareness and triggering learning after the pre-test, independent of the subsequent
intervention. Ideally, two or more equivalent versions of the ‘same’ test should be
used, counter-balanced amongst participants at pre- and post-test. However, to
fully ascertain whether any learning occurred as a result of simply having done the
test, it is necessary to test participants who did not receive the pre-test. Thus,
there can be four groups of participants: pre- and post-test no intervention; pre-
and post-test with intervention; post-test only no intervention; post-test only with
intervention. This is the Solomon four group design (Shadish et al., 2002).
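The logic of the Solomon four group design can be sketched numerically. The following is an illustrative simulation with invented effect sizes (none of the numbers come from the reviewed studies): contrasting pretested with unpretested groups estimates the test effect, while contrasting treated with untreated groups estimates the intervention effect.

```python
# Illustrative simulation of the Solomon four group design.
# All effect sizes here are invented for demonstration.
import numpy as np

rng = np.random.default_rng(0)
n = 50                                   # participants per group
base, intervention_effect, test_effect = 50.0, 5.0, 3.0

def post_scores(pretested, treated):
    # post-test = baseline + intervention effect (if treated)
    #           + practice effect of having sat the pre-test + noise
    return (base + treated * intervention_effect
            + pretested * test_effect + rng.normal(0, 4, n))

g1 = post_scores(pretested=True,  treated=True)   # pre- and post-test, intervention
g2 = post_scores(pretested=True,  treated=False)  # pre- and post-test, no intervention
g3 = post_scores(pretested=False, treated=True)   # post-test only, intervention
g4 = post_scores(pretested=False, treated=False)  # post-test only, no intervention

# Averaging over intervention conditions isolates the test effect;
# averaging over pre-test conditions isolates the intervention effect.
est_test = ((g1.mean() + g2.mean()) - (g3.mean() + g4.mean())) / 2
est_intervention = ((g1.mean() + g3.mean()) - (g2.mean() + g4.mean())) / 2
print(f"estimated test effect: {est_test:.1f}")
print(f"estimated intervention effect: {est_intervention:.1f}")
```

Because the design crosses pre-testing with intervention, neither estimate is confounded by the other factor.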
will not ‘regress to the mean’ but the majority will, moving the mean of the sub-
groups towards the whole sample mean on post-test. The regression effect is,
therefore, most evident for students with the lowest and highest pre-test scores.
In order to illustrate the RTM phenomenon we re-analysed data from a study
undertaken by the first author (Marsden, 2004, 2006). To ascertain whether
RTM affected the data, participants’ change scores (i.e., post-test minus pre-test
scores) were plotted against pre-test scores. If RTM were present a negative corre-
lation would be observed, because participants with high pre-test scores would, on
average, tend to have smaller gains than participants with low pre-test scores. As
expected, Figure 1 shows a strong negative correlation (−0.65, p < 0.001) between
the pre-test and change scores.1
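The same pattern can be reproduced with simulated data. The sketch below (invented numbers, not the Marsden data) generates pre- and post-test scores from a stable underlying ability plus independent measurement error, with no intervention effect at all; the correlation between pre-test scores and gains is nonetheless clearly negative.

```python
# RTM with no true change: pre-test and gain scores correlate negatively
# purely because measurement error at pre-test partly reverses at post-test.
import numpy as np

rng = np.random.default_rng(42)
n = 100
ability = rng.normal(50, 10, n)            # stable underlying construct
pre = ability + rng.normal(0, 8, n)        # pre-test = ability + error
post = ability + rng.normal(0, 8, n)       # post-test: same ability, fresh error
gain = post - pre

r = np.corrcoef(pre, gain)[0, 1]
print(f"correlation between pre-test scores and gains: {r:.2f}")
```

High pre-test scorers tend to have been lucky at pre-test, so their fresh post-test error pulls them back down, and vice versa for low scorers.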
The lower and upper quartiles of the pre-test scores were extracted for each of
the four measures (listening, speaking, reading and writing). This created two sub-
groups, ‘lower’ and ‘upper’, coming from the same contextual group with the same
intervention. The pre- to post-test gains made by these lower and upper groups
were compared. This simulated a pre-experiment that compared the effectiveness
of the intervention for those with the lowest and highest scores at the outset.
(Once the inter-quartile range of the test scores was eliminated, the remaining
samples were very small, and it is emphasised that our aim was solely to illustrate
the existence of RTM effects.) Of the 16 comparisons (i.e., pre-to-post and pre-to-delayed post-tests, in two groups, in four outcome measures), six showed statistically significantly larger gains by the lower groups than the upper groups.
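A simulation of this sub-grouping (again with invented data and a deliberately null intervention) shows why such comparisons mislead: selecting the lower and upper quartiles on a noisy pre-test builds regression to the mean into the comparison.

```python
# Quartile sub-groups under a null intervention: the 'lower' group shows
# apparent gains and the 'upper' group apparent losses, from RTM alone.
import numpy as np

rng = np.random.default_rng(7)
n = 200
ability = rng.normal(50, 10, n)
pre = ability + rng.normal(0, 8, n)
post = ability + rng.normal(0, 8, n)       # no intervention effect at all
gain = post - pre

q1, q3 = np.percentile(pre, [25, 75])
lower_gain = gain[pre <= q1].mean()        # lowest pre-test quartile
upper_gain = gain[pre >= q3].mean()        # highest pre-test quartile
print(f"mean gain, lower quartile: {lower_gain:+.1f}")
print(f"mean gain, upper quartile: {upper_gain:+.1f}")
```

The asymmetry arises entirely from selection on pre-test error, not from any differential effectiveness of an intervention.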
Figure 1. Pre-test scores on a test plotted against gains between pre- and post-tests (data from Marsden, 2004)
Design issues
We can address the issues illustrated above by introducing a control or comparison
group formed by random allocation. Then, maturation, history, test effects and
RTM effects will affect both groups similarly and ‘cancel out’ when comparing
groups. Consequently, group differences in changes from pre- to post-tests can be attributed to the intervention itself with far greater confidence.
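A small simulation (invented numbers) illustrates the point: a shared 'drift' standing in for maturation, history and test effects biases the single group pre-post estimate, while the between-group difference in gains under random allocation recovers the true effect.

```python
# Random allocation cancels shared influences: maturation/history ('drift')
# inflates the single-group estimate but not the between-group contrast.
import numpy as np

rng = np.random.default_rng(3)
n = 2000
ability = rng.normal(50, 10, n)
pre = ability + rng.normal(0, 8, n)
treated = rng.permutation(n) < n // 2      # random allocation to two groups
effect, drift = 4.0, 3.0                   # true effect; shared maturation/history/test effects
post = ability + drift + effect * treated + rng.normal(0, 8, n)
gain = post - pre

single_group = gain[treated].mean()                          # pre-experiment estimate
randomised = gain[treated].mean() - gain[~treated].mean()    # difference in gains
print(f"single group pre-post estimate: {single_group:.1f}")
print(f"randomised controlled estimate: {randomised:.1f}")
```

The single group estimate absorbs the drift wholesale; the randomised contrast subtracts it out because both groups experience it equally.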
Review methods
We searched 13 educational research journals: British Educational Research Jour-
nal, Cambridge Journal of Education, Educational Studies, International Journal of
Science Education, Journal of Educational Research, Journal of Research in Reading,
Journal of Teacher Education, Language Learning and Technology, Oxford Review of
Education, Reading and Writing: An Interdisciplinary Journal, Research in the Teach-
ing of English, Reading Research Quarterly and Science Education, for 2009, using
the database Educational Resources Information Centre. The year 2009 provided the
most recent full cycle of journal issues before the start of the review process. This
is not a representative sample of education journals or of educational research
itself, and we note that it is likely that the number of articles that fit our criteria
from any one journal is potentially positively correlated with the number of arti-
cles published by that journal in that year, and/or with the amount of detail pro-
vided in the report.
In order to be selected for the review, papers had to: be unique empirical stud-
ies; compare a construct (e.g., attitude, knowledge, behaviour) before and after an
intervention; have at least one quantified measure; employ a study design that did
not include a control or comparison group or any other mechanism that could
have potentially addressed the known biases of using a single group design. Inde-
pendent data extraction of the studies (with double data extraction of over 80% of
the studies) retrieved information about: the topic; the nature of the intervention;
the outcome measures; and the results. We also recorded whether: the author/s
derived a causal inference between intervention and outcome; the author/s men-
tioned RTM as a possible explanation for the results; the author/s mentioned other
potential explanatory factors for the results.
Results
In total, 490 articles were published in 2009 in the 13 journals. We found that 64 (13%) evaluated innovative interventions and used experimental, quasi-experimental or pre-experimental designs (with quantitative and/or qualitative outcome measurements).3 Of these 64, 19 were included at the first screening stage. At the
second screening stage we excluded three studies (Graebner et al., 2009; MacAr-
thur & Lembo, 2009; Tsaparlis & Papaphotis, 2009) because they did not fit our
criteria. This left 16 (25%) evaluation studies that met our criteria (i.e., pre-post
designs without a control or comparison group). (Note, 48 studies evaluated inter-
ventions using designs with control or comparison groups.)
Detailed information about each pre-experiment is presented in the Appendix.
Table 1. Consideration of potential factors explaining the observed changes, other than, or in addition to, the experimental intervention

No mention of any potential explanatory factor (except the intervention): Annetta et al.; Ducate & Lomicka; Newton & Newton; Sherrod & Wilhelm; Spalding et al.; Taylor & Jones.

Acknowledgement that other (unspecified) factors may be involved: Grace; Park et al.

Specific alternative explanatory factors mentioned: Brady et al. (characteristics of the intervention and other extraneous variables pertaining to quality of intervention: support, time, resources; the measure was not standardised). Evagorou et al. (self-selection bias). Guisasola et al. (tests influence learning). Jones et al. (self-selection bias). Miedijensky & Tal (influence of regular school, time, maturation). O'Byrne (indirectly: tests influence learning). Sherin & van Es (relationship between the intervention and the outcome measure may be cyclical). Wilhelm (differential maturation, differential exposure to the intervention, differential motivation).
the gains seen. The conflation of pre-achievement level and gender, in the absence
of considering RTM effects, is also observed in the analysis of attitude data: ‘…
attitudes of lower achievers at pre-test were enhanced, while attitudes of higher
achievers at pre-test did not improve or deteriorated’ (p. 1003) and ‘… the boys
significantly enhanced their attitude to science through CAI, but CAI did not
appear to have the same significance for girls’ attitude to science’ (p. 1006).
Another study (Evagorou et al., 2009) may also illustrate that pre-achievement
level can partly explain the gains observed. Although the authors made no claims
about differential effectiveness as a function of pre-test scores, they did note that
‘… four of the 7 skills tested were initially quite undeveloped’ (Evagorou et al.,
2009, p. 670) (i.e., as these initial scores could have been skewed to the lower end
of the distribution, gains due to RTM were more likely).
These examples demonstrate that pre-post gains are sometimes used to argue
that interventions are most effective for those who have low scores at baseline.
Other examples can be found from outside our review, such as an evaluation by
Moore and Wade (1998) which concluded that ‘… five or six years after the inter-
vention the Reading Recovery teaching, the weakest group [had] overtaken initially
more able readers and performed better in both reading accuracy and comprehen-
sion’ (p. 201). Similarly, Benati et al. (2010) argued that ‘… learners who scored
lower on the pre-test improved more than the high scorers such that the two
groups were equal on the post-test’ (p. 127) and Bell et al. (2009) argued that
‘students of lower attainment at Key Stage 3 appear to perform better [in Science
GCSE] than would have been predicted from their Key Stage 3 attainment, but
that higher attaining pupils perform less well’ (p. 119).
The conclusion that an intervention is more beneficial for low achievers at base-
line is only warranted when an equivalent low-scoring sub-group from the control
or comparison group does not make gains equivalent to the low achievers from the
experimental group. This is demonstrated by Ben-David and Zohar (2009). They
randomly assigned equal numbers of low and high achievement participants to a
control and an intervention group, and were therefore able to conclude that their
intervention resulted in more learning gains for low achieving students than for
high achieving students. McCutchen et al. (2009) also reported a differential
592 E. Marsden and C.J. Torgerson
Discussion
Control (or comparison) groups are important for avoiding unwarranted interpre-
tations of data from pre-post measurements. It should be noted that 14 of the 64
evaluation studies did use a comparison group, without pre-intervention measures;
and 34 of the 64 studies used both a pre-post design and a control/comparison
group (with or without random allocation to groups).
The use of control and comparison groups principally avoids unwarranted inter-
Downloaded by [University of York] at 09:22 19 December 2012
pretations (internal validity). It can also improve ecological validity. For example,
using ‘test only’ groups can inform decisions when the intervention would be
added to the normal programme, and using comparison groups can help practitio-
ners determine the relative merits of different interventions. As discussed above,
random allocation is the best way of addressing history, maturation, test effects
and RTM effects. If a control group cannot be formed by random assignment then
a contemporaneous control group is preferable to no control group.
Another way of partly controlling for RTM effects is to undertake repeated multi-
ple baseline measurements, in an interrupted time series design, until a stable score
is achieved so as to reduce the margin of error of the test. This improves the validity
of associating any future gains with the intervention rather than RTM. This is often
done in cognitive psychology research in order to find an asymptote that is more
likely to reflect the 'true' value of the construct being measured. MacArthur and Lembo (2009) evaluated cognitive strategy instruction for writing skills. The three participants did between three and five pre-test essays to obtain a stable baseline (p. 1029).
The post-test consisted of three more essays. For two students, post-test scores were
all higher than stable baseline scores. For the other student, a slight increase was
observed at post-test over baseline. The authors note that the percentage of non-
overlapping data between stable baseline and post-test was 100% (p. 1029).
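The non-overlap summary used there can be computed directly. The sketch below uses invented scores, not MacArthur and Lembo's data: repeated baselines bound the measurement error, and the percentage of non-overlapping data (PND) is the share of post-test points exceeding the highest baseline point.

```python
# Percentage of non-overlapping data (PND) for a multiple-baseline design.
# Scores are invented for illustration.
baseline = [12, 14, 13, 13, 14]   # repeated pre-tests, continued until stable
post = [19, 21, 20]               # post-intervention measurements

pnd = 100 * sum(score > max(baseline) for score in post) / len(post)
print(f"PND: {pnd:.0f}%")         # prints "PND: 100%": every post-test point clears the baseline
```

Because the baseline is sampled repeatedly, a post-test score exceeding its entire range is hard to attribute to RTM alone.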
Whilst such a research design is statistically more satisfactory, for the participant, teacher and policy-maker it is time consuming and difficult to justify pedagogically. Randomisation is therefore probably a preferable method of addressing the RTM problem, particularly as it also eliminates selection bias.
Pre-experimental designs do, however, have a role to play in educational
research. For example, before and after data can determine the promise of an inter-
vention during its development phase. In this case researchers will investigate the
potential for an intervention to improve scores in an iterative cycle of testing and
developing, though the researcher should guard against over-interpretation beyond
the observation that the intervention has promise. Many of the studies we
reviewed also made useful contributions by demonstrating feasibility of implemen-
tation. However, pre-experimental research in which the observed magnitude of
gains over time is ascribed uniquely to a causal relationship between the interven-
tion and the outcome measures is a concern. Furthermore, caution must be exer-
cised when using pre-experimental research to inform sample size calculations for
RCTs because such studies over-estimate the intervention effects and lead to an
underestimation of the sample size (Torgerson & Torgerson, 2008).
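The arithmetic behind this caution is direct, because the required sample size scales with the inverse square of the assumed effect size. The sketch below uses the standard two-sample formula (two-sided alpha = 0.05, power = 0.80) with invented effect sizes: if a pre-experiment inflates a true standardised effect of 0.3 to 0.6, the planned trial ends up roughly a quarter of the size it needs to be.

```python
# Required sample size per group for a two-sample comparison (normal
# approximation): n = 2 * (z_alpha + z_beta)^2 / d^2,
# with two-sided alpha = .05 (z = 1.96) and power = .80 (z = 0.84).
from math import ceil

def n_per_group(d, z_alpha=1.96, z_beta=0.84):
    return ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

planned = n_per_group(0.6)   # effect size taken from an (inflated) pre-post study
needed = n_per_group(0.3)    # plausible true effect under a controlled design
print(f"planned n per group: {planned}, actually needed: {needed}")
```

Halving the assumed effect size quadruples the required sample, so an effect size overestimated by RTM and other uncontrolled influences can leave a trial severely underpowered.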
We do not know the extent to which the effects outlined earlier influenced the
findings reported in the studies we reviewed. Thirteen of the 16 studies included all
the participants in all analyses, and did not split the pre-test data into high and low
scorers. In such studies one might argue that the movement up to the mean from
the lower scorers and the movement down to the mean from the higher scorers
may have ‘cancelled out’ the effects of RTM. However, this is by no means certain,
as the movement of the lower and upper outliers due to RTM may not have been
equivalent. Indeed, equal upwards and downwards movement is unlikely given the
combined effects of history, maturation, test effects and the intervention (experi-
mental or comparison). The combined effects of these factors may reduce any
regression down to the mean of the higher scorers but increase the regression up to
the mean of the lower scorers. Clearly, some of the difference might be due to the
intervention actually being effective at improving the outcomes measured, but how
much, if any, is impossible to know due to the limitations within the design.
Conclusions
In our small-scale methodological review of pre-experimental studies we have illustrated that a number of authors using such research designs did not take into account the potential biasing effects of history, maturation, test effects and RTM in the
discussion of their results. We found several studies that divided the participants
on the basis of their pre-test scores into low and high achievers and argued that an
intervention was more beneficial for those with low scores at baseline, but did not
discuss RTM as a possible factor influencing this finding.
In pre-experiments, history, maturation, test effects or RTM effects may not
explain all of the pre-post differences observed in these studies, and the experi-
mental interventions may be responsible for some of the effects observed. How-
ever, because random allocation to experimental and comparison groups was not
used, we cannot tell the extent to which the differences were due to history, matu-
ration, test or the regression artefact. We know, however, that some of the
observed difference is likely to be artefactual.
Randomised controlled trials are widely used to control for selection bias, that
is, where participants are selected on characteristics that may bias the results. This
paper has highlighted how randomised control groups are also important to
control for history, maturation, test and the RTM phenomenon. Our review found
about one fifth of the evaluation studies did use a comparison group, and about
half used pre-intervention measures in addition to a comparison group, some with
random allocation. This illustrates that such designs are feasible in instructional
settings.
Acknowledgements
We thank David Torgerson for his useful comments and suggestions on an earlier
draft of the paper.
Notes
1. The data were, in fact, from a trial that used matched randomisation to an experimental
and a comparison group, thereby controlling for RTM effects.
2. For reasons of space statistics are not provided, but the data can be found in Marsden,
2004.
3. The majority of studies that were NOT evaluation studies aimed to explore potential rela-
tionships, define constructs, or document processes.
References
Annetta, L., Mangrum, J., Holmes, S., Collazo, K. & Cheng, M. (2009) Bridging realty to vir-
tual reality: investigating gender effect and student engagement on learning through video
game play in an elementary school, International Journal of Science Education, 31(8), 1091–
1113.
Bell, J., Donnelly, J., Homer, M. & Pell, G. (2009) A value-added study of the impact of sci-
ence curriculum reform using the national pupil database, British Educational Research Jour-
nal, 35(1), 119–135.
Benati, A., Lee, J. & McNulty, E. (2010) Exploring the effects of Processing Instruction on a
discourse-level guided composition, in: A. Benati & J. Lee (Eds) Processing instruction and
discourse (London, Continuum), 97–147.
Ben-David, A. & Zohar, A. (2009) Contribution of meta-strategic knowledge to scientific
inquiry learning, International Journal of Science Education, 31(12), 1657–1682.
Brady, S., Gillis, M., Smith, T., Lavalette, M., Liss-Bronstein, L., Lowe, E., North, W., Russo,
E. & Wilder, T.D. (2009) First grade teachers’ knowledge of phonological awareness and
code concepts: examining gains from an intensive form of professional development and
corresponding teacher attitudes, Reading and Writing: An Interdisciplinary Journal, 22(4),
425–429.
Campbell, D.T. & Stanley, J.C. (1963) Experimental and quasi-experimental designs for research
(Chicago, IL, Rand McNally).
Cook, T.D. & Campbell, D.T. (1979) Quasi-experimentation: design and analysis issues for field set-
tings (Boston, MA, Houghton Mifflin).
Ducate, L. & Lomicka, L. (2009) Podcasting: an effective tool for honing language students’
pronunciation? Language Learning and Technology, 13(3), 66–86.
Evagorou, M., Korfiatis, K., Nicolaou, C. & Constantinou, C. (2009) An investigation of the
potential of interactive simulations for developing thinking skills in elementary school: a case
study with fifth-graders and sixth-graders, International Journal of Science Education, 31(5),
655–674.
Galton, F. (1886) Regression towards mediocrity in hereditary stature, The Journal of the Anthro-
pological Institute of Great Britain and Ireland, 15, 246–263.
Grace, M. (2009) Developing high quality decision-making discussions about biological conser-
vation in a normal classroom setting, International Journal of Science Education, 31(4), 551–
570.
Graebner, I.T., de Souza, E.M.T. & Saito, C.H. (2009) Action-research and food and nutrition
security: a school experience mediated by conceptual graphic representation tool, Interna-
tional Journal of Science Education, 31(6), 809–827.
Guisasola, J., Solbes, J., Barragues, J.-I., Morentin, M. & Moreno, A. (2009) Students’ under-
standing of the special theory of relativity and design for a guided visit to a science museum,
International Journal of Science Education, 31(15), 2085–2104.
Jones, G., Taylor, A. & Broadwell, B. (2009) Estimating linear size and scale: body rulers, Inter-
national Journal of Science Education, 31(11), 1495–1509.
Lipsey, M.W. & Wilson, D.B. (1993) The efficacy of psychological, educational and behavioral
treatment: confirmation from meta-analysis, American Psychologist, 48(12), 1181–1209.
MacArthur, C.A. & Lembo, L. (2009) Strategy instruction in writing for adult literacy learners,
Reading and Writing, 22(9), 1021–1039.
Marsden, E. (2004) Teaching and learning of French verb inflections: a classroom experiment
using processing instruction. Unpublished Ph.D. dissertation, University of Southampton.
Marsden, E. (2006) Exploring input processing in the classroom: an experimental comparison
of processing instruction and enriched input, Language Learning, 56, 507–566.
McCutchen, D., Green, L., Abbott, R. & Sanders, E. (2009) Further evidence for teacher
knowledge: supporting struggling readers in grades three through five, Reading and Writing:
An Interdisciplinary Journal, 22(4), 401–423.
Miedijensky, S. & Tal, T. (2009) Embedded assessment in project-based science courses for the
gifted: insights to inform teaching all students, International Journal of Science Education, 31
(18), 2411–2435.
Moore, M. & Wade, B. (1998) Reading and comprehension: a longitudinal study of ex-Reading
Recovery students, Educational Studies, 24, 195–203.
Newton, D.P. & Newton, L.D. (2009) Knowledge development at the time of use: a problem-
based approach to lesson planning in primary teacher training in a low knowledge, low skill
context, Educational Studies, 35(3), 311–321.
Norris, J. & Ortega, L. (2000) Effectiveness of L2 instruction: a research synthesis and quantita-
tive meta-analysis, Language Learning, 50, 417–528.
O’Byrne, B. (2009) Knowing more than words can say: using multimodal assessment tools to
excavate and construct knowledge about wolves, International Journal of Science Education,
31(4), 523–539.
Park, H., Khan, S. & Petrina, S. (2009) ICT in science education: a quasi-experimental study
of achievement, attitudes toward science, and career aspirations of Korean middle school
students, International Journal of Science Education, 31(8), 993–1012.
Shadish, W.R., Cook, T.D. & Campbell, D.T. (2002) Experimental and quasi-experimental designs
for generalized causal inference (Boston, Houghton Mifflin).
Sherin, M.G. & van Es, E.A. (2009) Effects of video club participation on teachers’ professional
vision, Journal of Teacher Education, 60(1), 20–37.
Sherrod, S.E. & Wilhelm, J. (2009) A study of how classroom dialogue facilitates the develop-
ment of geometric spatial concepts related to understanding the cause of moon phases,
International Journal of Science Education, 31(7), 873–894.
Spalding, E., Wang, J., Lin, E. & Hu, G. (2009) Analyzing voice in the writing of Chinese
teachers of English, Research in the Teaching of English, 44(1), 23–51.
Taylor, A. & Jones, G. (2009) Proportional reasoning ability and concepts of scale: Surface area
to volume relationships in science, International Journal of Science Education, 31(9), 1231–
1247.
Thorndike, R.L. (1942) Regression fallacies in the matched groups experiment, Psychometrika,
7, 85–102.
Torgerson, C. & Torgerson, D. (2008) Designing and running randomised trials in health, education and the social sciences (Basingstoke, Palgrave Macmillan).
Appendix
Data extracted from included studies with a pre-experiment (single group pre-post test) design
Results, as
reported by
authors, including RTM mentioned?
Objective statistics and (Points that could
Topic; Participants; of study; references to Did authors ascribe relate to RTM, as
Setting; relevant tables a causal interpreted by
Study Country Sample size Measures where appropriate relationship? reviewers)
Annetta, Science 5th grade Examine students’ ‘The gain from pre- Yes. ‘… the MEGA No.
Mangrum, Education; students ‘of learning of simple test to post-test integrated into an
Holmes, varying machines; overall was elementary school
Collazo & academic levels’ significant for the science class did result
Cheng. 10–11 years; sample exposed to in the learning of key
MEGA ...’ (p. science concepts for
1100). ‘The overall fifth-grade boys’ and
gain from pre-test to girls’ learning of
post-test was simple machines’
significant (0.000), f (p. 1104).
= 67.02’ (p. 1100)
Tables 2 & 3.
Elementary Measures: pre and
school; post test measuring
basic knowledge of
the six simple
machines and the
purpose of each.
U.S.A. n=74 students
Brady, Gillis, First grade Examine efficacy of Scores on knowledge Yes. ‘Encouragingly, No.
Pre- and post-test research designs
(Continued)
597
Downloaded by [University of York] at 09:22 19 December 2012
Appendix (Continued)
598
Results, as
reported by
authors, including RTM mentioned?
Objective statistics and (Points that could
Topic; Participants; of study; references to Did authors ascribe relate to RTM, as
Setting; relevant tables a causal interpreted by
Study Country Sample size Measures where appropriate relationship? reviewers)
Appendix (Continued)
Results, as
reported by
authors, including RTM mentioned?
Objective statistics and (Points that could
Topic; Participants; of study; references to Did authors ascribe relate to RTM, as
Setting; relevant tables a causal interpreted by
Study Country Sample size Measures where appropriate relationship? reviewers)
(Continued)
599
Downloaded by [University of York] at 09:22 19 December 2012
Appendix (Continued)
600
Results, as
reported by
authors, including RTM mentioned?
Objective statistics and (Points that could
Topic; Participants; of study; references to Did authors ascribe relate to RTM, as
Setting; relevant tables a causal interpreted by
Study Country Sample size Measures where appropriate relationship? reviewers)
Appendix (Continued)
Results, as
reported by
authors, including RTM mentioned?
Objective statistics and (Points that could
Topic; Participants; of study; references to Did authors ascribe relate to RTM, as
Setting; relevant tables a causal interpreted by
Study Country Sample size Measures where appropriate relationship? reviewers)
Evagorou, Thinking skills; 11–12 year olds; To investigate the Considerable Yes (and indirectly, No.
Korfiatis, impact of a improvements in the no). ‘The proposed
Nicolaou & simulation-based participants’ system learning environment
Constantinou. learning environment thinking skills, on six provoked considerable
on students’ measures, but not on improvements in some
development of ‘feedback thinking’ system thinking skills
system thinking (pp. 664–669 and during a relatively brief
skills; Tables 2, 3 & 4). learning process.’
(p. 656) ‘… after the
instruction the total
number of referred
elements increased’
(p. 664). ‘We have to
admit the failure of
our intervention in
promoting feedback
thinking’ (p. 671).
Elementary n=13 Measures: 2 tests, But authors state that
school; both used as pre and results: ‘… could be
post tests. Each test positively affected by
with tasks the fact that [the
corresponding to students] voluntarily
seven thinking skills. participated in the
project’ (p. 669).
Cyprus.
Pre- and post-test research designs
Study: Grace
Topic; Setting; Country: Decision-making in science classrooms; U.K.
Participants; Sample size: 15–16 year olds
Objective; Measures: Can peer group decision-making discussions help …
Results: About three quarters of the students modified their …
Causal relationship ascribed? Yes and no. 'Discussions … had a marked impact on … discussions …' (p. 556).
RTM mentioned? No.

Downloaded by [University of York] at 09:22 19 December 2012
Study: Guisasola, Solbes, Barragues, Morentin & Moreno
Topic; Setting; Country: Physics education in an Engineering course; university
Participants; Sample size: 1st year undergraduates; n=35
Objective; Measures: How does a museum visit influence students' understanding of the Special Theory of Relativity (STR) and its applications? Do students use more scientific arguments when discussing topics related to the STR after visiting the exhibition? Measures: to measure understanding, a questionnaire as the pre-test and, for the post-test, a written report structured around questions similar to those in the pre-test.
Results: Increases between pre and post measures in understanding. Figures 1, 2 and 3 show differences between pre and post measures for: correct explanations of aspects of STR; scientific arguments applied; and proportions of three or more mentions of applications.
Causal relationship ascribed? Yes. 'The results show that the teaching sequence and exhibition visit have increased the students' interest, knowledge, and understanding of the STR and its applications' (p. 2100).
RTM mentioned? No. But the authors acknowledge a test effect: 'change in the students' understanding, can be …'
Study: Jones, Taylor, & Broadwell
Topic; Setting; Country: Maths education; summer camp; U.S.A.
Participants; Sample size: 6th–9th grade students, 11–13 years; n=19
Objective; Measures: To examine the impact of teaching students to use their bodies as rough measurement tools on their ability to estimate linear measurements. Measure: a 20-item test to assess understanding of metric scale.
Results: 'The results of this study revealed that teaching students to use their body as a rough measurement tool … increased their ability to accurately estimate linear sizes (see Table 1). The mean score on the pre-test for the LMA was 26.21 (SD = 4.57) whereas the mean score for the post-test increased to 30.68 (SD = 3.43)' (p. 1504, and see p. 1495); 'Paired t-tests … found significant differences for … object estimation …, kinaesthetic estimation …, and body ruler' (p. 1504).
Causal relationship ascribed? Yes. 'Results of a paired-sample t-test suggested that the significant changes in pre-test and post-test scores for the LMA were not due to random chance but instead are probably due to the intervention the students received as a result of completing the metric measurement tasks' (p. 1504). But the authors state: 'All of the participants were volunteers enrolled in science summer camp. As such, they are most probably not representative of the variation that would exist for all students of this age' (p. 1503).
RTM mentioned? No.
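The means and standard deviations quoted for the LMA allow a standardised mean gain to be recovered, though not the paired t-test itself, which would also require the pre-post correlation. A brief sketch (Cohen's d with a pooled SD is an illustrative choice of ours, not a statistic the authors report):

```python
import math

# Summary statistics quoted above for the LMA (p. 1504)
pre_mean, pre_sd = 26.21, 4.57
post_mean, post_sd = 30.68, 3.43

# Standardised mean gain: descriptive only. In a single-group design it
# cannot separate the intervention from test effects, maturation or RTM.
pooled_sd = math.sqrt((pre_sd**2 + post_sd**2) / 2)
d = (post_mean - pre_mean) / pooled_sd
print(f"d = {d:.2f}")  # prints d = 1.11
```

A gain of this size is large on its own terms, but the design leaves its cause open.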
Study: Miedijensky & Tal
Topic; Setting; Country: Impacts of an assessment for learning (AFL) approach amongst the gifted and talented; pull-out programme for the gifted and talented; Israel
Participants; Sample size: 12–15 year olds; n=86
Objective; Measures: To document student views on, and reactions to, assessment for learning amongst students taking one-year project-based science courses for the gifted. Measures: pre-post questionnaire covering general view of assessment, ideas about assessment modes, and relationships between assessment and learning; 12 post-treatment interviews.
Results: 'Significant differences between pre/post questionnaires were found with regard to the three main categories and most of the subcategories' (p. 2430). Also, '… significant shift toward a more complex view of the different dimensions of assessment …' (p. 2421; Tables 4 & 5; Figure 2).
Causal relationship ascribed? Yes. Causal links between experiencing AFL and positive views about AFL: 'Our findings indicate positive impacts of AFL on the students' views of assessment … assessment supported learning … the findings of this study have strengthened our belief that the students' voice is important to further improve the assessment and its impact on learning' (p. 2432). But the authors state: 'Since the courses took time, and the meetings occurred only once a week, one could claim that other factors such as the regular school or even time and maturation should be considered to contribute to this shift. We cannot entirely denounce this concern; however, our data indicate nothing of this sort of assessment was employed in the regular schools; and indicate the students strongly associated their views to the assessment components, and provided relevant examples that support our claim' (p. 2430).
RTM mentioned? No.
Study: Newton & Newton
Topic; Setting; Country: Teacher education; PGCE teacher training course
Participants; Sample size: Primary teacher trainees, and PGCE tutors; n=75 PGCE students, and …
Objective; Measures: Impact of a problem-solving approach to lesson planning in an area where trainees … Measures: before and after comparisons of a) …
Results: Statistically significant increase (very large effect size) in students' …; solutions to Problems 1 and 6 suggested an increase in the students' lesson planning skills (effect size, 0.95) (p. 319); and 'the mean score for the relatively easy Problem 1 was 4.40. For the relatively difficult Problem 6, it was 6.82, an increase that was statistically significant' (p. 318).
Causal relationship ascribed? Yes. '… there was a very large increase in student confidence in planning science …'. But the authors state that 'these judgements [by the tutor about quality] …'.
RTM mentioned? No.
Study: …
Topic; Setting; Country: U.S.A.
Participants; Sample size: n=28 (aggregated, after one year)
Results: '… Assessment Tools Support Gradual Concept Change' (p. 537).
RTM mentioned? Implicit in the design is that the 'test effect' is used, indirectly, as a learning tool.

E. Marsden and C.J. Torgerson
Study: Park, Khan & Petrina
Topic; Setting; Country: ICT in science education; middle school
Participants; Sample size: Grade 8 students from one school
Objective; Measures: Impact of Computer Assisted Instruction (CAI) on achievement and attitudes. Measures: comparisons of …
Results: '… After CAI classes students' achievement in science improved significantly …, [and] The mean differences of students' Attitude to Science before and after CAI were significant …, with students having more positive attitudes towards science after CAI' (Table 1, p. 1003).
Causal relationship ascribed? Yes. 'CAI was significantly correlated with improvement in most of the achievement groups' (p. 1003). 'Collectively, student achievement in the post-achievement test improved significantly compared to their achievement in science prior to CAI' (p. 1006). But: 'Although there are a number of …'
RTM mentioned? No.
Study: …
Results: '… strategies for reasoning about student thinking … they also came to notice more complex issues of student thinking' (p. 27).

Study: …
Measures: writing samples, assessed using the 6 + 1 Trait® analytical model.
Study: Taylor & Jones
Topic; Setting; Country: Science education; middle school; U.S.A.
Participants; Sample size: 11–13 year olds enrolled on a science summer camp; n=19
Objective; Measures: Impact of a series of science investigations on improving understanding of surface area to volume relationships. Measures: understanding tested by pre-post achievement tests; relationship of understanding to proportional reasoning ability tested in a one-off achievement test.
Results: 'A significant correlation between proportional reasoning ability and students' understanding of surface area to volume relationships. Mean score on the pre-test was 54.42 (SD 20.41) whereas the mean score on the post-test increased to 75.89 (SD 19.71)' (pp. 1235–1236, 1236–1237).
Causal relationship ascribed? Yes. 'Results of a paired-sample t-test suggested that the significant changes in pre-test and post-test for the ASAVA were not due to random chance but instead are probably due to the intervention the students received as a result of completing the surface area to volume application tasks' (p. 1236).
RTM mentioned? No.
Study: Wilhelm
Topic; Setting; Country: Science education; university; U.S.A.
Participants; Sample size: 'Middle level' students; n=123
Objective; Measures: To examine gender differences in lunar phases understanding. Measures: a Lunar Phases Concept Inventory (20 items) and a Geometric Spatial Assessment (GSA) (16 items).
Results: 'The mean pre-test score was 31.2% … and the mean post-test score was 52.9% …. A repeated-measures ANOVA revealed a significant increase … from pre-test to post-test on overall test scores (see Table 3)' (p. 2112). GSA: 'The mean pre-test score was 49.4% … and the mean post-test score was 56.2% … A repeated-measures ANOVA revealed a significant increase … (see Table 6)' (p. 2116).
Causal relationship ascribed? Yes. 'The partial η² value of 0.703 indicates that approximately 70.3% of the gain in lunar-related understanding can be directly attributed to the inquiry Moon unit' (p. 2113). 'The partial η² value of 0.151 indicates that approximately 15.1% … related understanding can be directly attributed to the inquiry Moon unit' (p. 2116). 'Findings suggest that both scientific and mathematical understandings can be significantly improved for both sexes through the use of spatially focused, inquiry-oriented curriculum such as REAL' (pp. 2105, 2120). But the authors state: 'The other 30% [of gain in scores] could be attributed to differential maturation, differential exposure to the intervention, differential motivation, and so forth' (p. 2113).
RTM mentioned? No.
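Wilhelm's causal reading treats partial η² as the percentage of the gain attributable to the unit. Strictly, partial η² is a variance-explained ratio: SS_effect / (SS_effect + SS_error), which for a one-degree-of-freedom repeated-measures effect reduces to F / (F + df_error). A short sketch (the F values are back-calculated by us to match the reported η² values; they are not figures quoted in the study):

```python
# Partial eta-squared for a one-df repeated-measures effect (pre vs post):
# the share of VARIANCE associated with the time factor, not the share of
# the gain caused by the intervention.
def partial_eta_squared(F: float, df_error: int) -> float:
    return F / (F + df_error)

# Back-calculated F values (assumed) reproducing Wilhelm's reported values
# with df_error = n - 1 = 122.
print(round(partial_eta_squared(288.8, 122), 3))  # prints 0.703 (lunar test)
print(round(partial_eta_squared(21.7, 122), 3))   # prints 0.151 (GSA)
```

Reading 0.703 as '70.3% of the gain is attributable to the unit' therefore conflates variance explained with causal attribution, which a single-group design cannot deliver.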