


Publisher: Routledge

The Journal of Educational Research


Publication details, including instructions for authors and subscription information:
http://www.tandfonline.com/loi/vjer20

Role of Rubric-Referenced Self-Assessment in Learning to Write

Heidi Goodrich Andrade (University at Albany) & Beth A. Boulay (Abt Associates, Inc., Cambridge, Massachusetts)

Published online: 02 Apr 2010.

To cite this article: Heidi Goodrich Andrade & Beth A. Boulay (2003) Role of Rubric-Referenced Self-Assessment in Learning to Write,
The Journal of Educational Research, 97:1, 21-30, DOI: 10.1080/00220670309596625

To link to this article: http://dx.doi.org/10.1080/00220670309596625

Role of Rubric-Referenced
Self-Assessment in Learning to Write
HEIDI GOODRICH ANDRADE, University at Albany
BETH A. BOULAY, Abt Associates, Inc., Cambridge, Massachusetts

ABSTRACT. The authors examined the impact of self-assessment on 7th- and 8th-grade students' written essays. Students wrote 2 essays: a historical fiction essay and a response to literature essay. All students received instructional rubrics that articulated the criteria and gradations of quality for the given essay. Students in the treatment group participated in 2 formal self-assessment lessons, during which they used the rubric to assess the quality of their drafts. The authors used multiple linear regression to examine the relationship between essay scores, treatment, and a set of control predictors. The results from the historical fiction essay suggested a positive relationship between the treatment and girls' scores, but no statistically significant relationship between the treatment and boys' scores. The results from the response to literature essay showed no effect of treatment for either boys or girls. The results are explained in terms of the insufficiency of the intervention, as well as the possible effects of rubrics, school conditions, and gender differences in response to self-generated feedback.

Key words: assessment of writing, rubric-referenced, self-regulated learning

    Most students do not revise because they have not learned how to evaluate what they write; they have not internalized any consistent set of criteria or standards to which they can hold themselves. (White, 1994, p. 10)

Student self-assessment is the Cinderella of the authentic or alternative assessment movement. Whereas teachers and researchers alike celebrate the marriage of instruction and evaluation and the family of assessments that result, student self-assessment stands apart as the poor stepchild: poorly understood and assumed to be difficult, impractical, or unnecessary. Yet, an examination of assumptions yields rich potentials. Rather than thinking of self-assessment as simply a matter of asking students to give themselves a grade, we hypothesized that self-assessment could support learning and skill development via a process of careful reflection on the quality of students' own work. In the Student Self-Assessment Study, we tested that hypothesis by examining the effects of rubric-referenced self-assessment on seventh- and eighth-grade students' writing.

Address correspondence to Heidi Andrade, University at Albany, 1400 Washington Avenue, ED233A, Albany, NY 12222. (E-mail: handrade@uamail.albany.edu)

Theoretical Framework

This study draws on several areas of educational and cognitive research, including authentic assessment, self-regulated learning, and the assessment of writing. Current perspectives on authentic or alternative assessment, as well as on the assessment of writing, distinguish between evaluation and assessment and promote a conception of assessment as ongoing feedback that supports learning (Gardner, 1991; Goodrich, 1997; White, 1994; Wiggins, 1989a, 1989b; Wolf & Pistone, 1991). Cooper and Odell (1999), for example, stressed the need to provide students with time to reflect on their writing so that "some sort of assessment of strengths and weaknesses in a piece of writing occurs before a final draft is written" (p. x). The literature on authentic assessment also provides guidance on the characteristics of effective assessment (see Goodrich, 1996, for a review). Those characteristics influenced the design of this study, which

- articulated clear criteria for assessing writing,
- supported students in assessing their own works in progress,
- identified techniques and methods for improving writing,
- provided opportunities for improvement through revision, and
- honored students' developmental stages by referring to appropriate grade-level standards.

The design of this study also reflects the literature on self-monitoring and writing, which notes that novice writers have difficulty identifying the problems with their writing as well as techniques for improving it (Scardamalia & Bereiter, 1985) and that the use of on-line or hard-copy revision checklists can improve students' writing (Daiute, 1986; Scardamalia & Bereiter, 1983). Similarly, research on self-regulated learning and feedback suggests that learning improves when feedback (a) directs students to monitor their learning and (b) shows them how to achieve learning objectives (Bangert-Drowns, Kulik, Kulik, & Morgan, 1991; Butler & Winne, 1995). The Student Self-Assessment Study was based on the hypothesis that students can be their own source of feedback, given the appropriate conditions and supports. For the purposes of this study, the appropriate supports were instructional rubrics that described good and poor writing, and the appropriate conditions were two self-assessment lessons that we designed to help students use a rubric to assess their draft essays.

Instructional rubrics refer to those rubrics designed to support student learning and development and to act as standards-referenced assessment tools (Goodrich Andrade, 2000). Instructional rubrics included the following features that supported learning: They (a) were written in language that students could understand, (b) defined and described quality work in as concrete terms as possible, (c) referred to common weaknesses in students' work and indicated how such weaknesses could be avoided, and (d) could be used by students to evaluate their works in progress and thereby guide revision and improvement.

The historical fiction rubric in Appendix A is an example of an instructional rubric designed for use in this research. Like both of the rubrics that we used, the historical fiction rubric drew on district, state, and national standards, as well as on feedback from colleagues and teachers. The historical fiction rubric articulated the criteria for the assignment, described gradations of quality from excellent to poor, and provided suggestions for avoiding common pitfalls in writing.

Taken together, the research on authentic assessment, assessment of writing, and self-regulated learning indicated the potential for rubric-referenced self-assessment to support learning and skill development. Accordingly, the research question that drove this study was: Can a formal process of rubric-referenced self-assessment have a measurable effect on student writing?

Method

Sample

This project was supported by the Edna McConnell Clark Foundation, which asked that we implement the work in schools with which the foundation collaborates. As a result, we conducted the research in two middle schools in a city in southern California. One of the schools (School A) was located in an upper-middle-class, largely professional, suburban neighborhood with little ethnic diversity. Many of the non-White students who attended School A were bused in from adjacent communities and tended to be placed in lower level classes. The language arts teachers with whom we worked in School A designed their curricula independently of each other. School B, in contrast, was located in an ethnically and linguistically diverse working-class urban community. The teachers with whom we worked at School B collaborated on an integrated curriculum that combined history and language arts. Their shared humanities curriculum drew explicitly on the district's standards and an experimental new portfolio process. The study took place during the 1997-98 school year and involved 13 seventh- and eighth-grade classes in the two schools.

The sample included 397 students: 251 students (63.2%) attended School A; 146 students (36.8%) attended School B. There were 183 (46.1%) seventh graders and 214 (53.9%) eighth graders. One hundred ninety-one of the students were boys (48.1%); 201 of the students were girls (50.6%). Five students (.01%) from the sample were not identified by gender. None of those students were included in the subsample that we analyzed. We also were unable to obtain ethnicity data for 103 of the students (26% of the sample). However, of the remaining 294 students, 126 (42.9%) were White, 68 (23.2%) were Black, 5 (1.7%) were American Indian, 33 (11.2%) were Filipino, 46 (15.6%) were Hispanic, and 16 (5.4%) were students of other ethnicities. The average Abbreviated Stanford Achievement Test (ASAT) score for the sample was 2.19 (SD = 1.80), and the average English/language arts grade for the term prior to this study was 80.53 (SD = 11.43).

Procedure

Students were assigned by classroom to either the self-assessment condition or the control condition. Equal numbers of seventh- and eighth-grade classrooms were in each group, and the groups were counterbalanced according to scores on the ASAT. In those cases in which we could not match the classrooms exactly according to ASAT scores, the control group was favored such that they would have higher ASAT scores. That procedure allowed us to obtain a more conservative estimate of the effect of the self-assessment intervention.
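To make the assignment procedure concrete, the following sketch illustrates one way the counterbalancing described above could be carried out. The article does not specify the exact matching algorithm, so the pairing of same-grade classrooms by mean ASAT score, the data structure, and all values below are illustrative assumptions rather than the authors' procedure.

```python
# Illustrative sketch of counterbalanced assignment by classroom (not the authors' code).
# Classrooms in the same grade are paired by mean ASAT score; within a pair, the
# higher-scoring classroom goes to control, which biases against the treatment
# (the "more conservative" choice described in the text).
from statistics import mean

# (classroom id, grade level, ASAT scores of its students) -- invented values
classrooms = [
    ("7A", 7, [4.1, 5.0, 3.8]), ("7B", 7, [5.2, 4.9, 5.5]),
    ("8A", 8, [4.4, 4.0, 5.1]), ("8B", 8, [5.6, 5.3, 4.8]),
]

assignments = {}
for level in (7, 8):
    # Rank same-grade classrooms by mean ASAT so paired classrooms are as similar as possible.
    ranked = sorted((c for c in classrooms if c[1] == level), key=lambda c: mean(c[2]))
    for lower, higher in zip(ranked[0::2], ranked[1::2]):
        assignments[lower[0]] = "treatment"   # lower mean ASAT -> treatment
        assignments[higher[0]] = "control"    # higher mean ASAT -> control

print(assignments)  # e.g., {'7A': 'treatment', '7B': 'control', '8A': 'treatment', '8B': 'control'}
```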
Students in each class were asked to write two different essays approximately 1 month apart: a historical fiction essay and a response to literature essay (see Appendix A for the historical fiction essay assignment and rubric, and Appendix B for the response to literature assignment and rubric). Students in the treatment and control classes were given identical instructional rubrics with the assignment of each essay, and their teachers briefly reviewed the assignment and the rubric.

After students wrote a first draft of their essays (at least in theory; not all students wrote first drafts on schedule), the first author conducted two 40-min self-assessment lessons with the students in the treatment condition. The lessons involved students in a formal process of guided self-assessment designed in collaboration with a participating teacher at School B.

Students used markers to color code the criteria on the rubric and the evidence in their essays that demonstrated that their writing met each criterion. The first lesson guided students' use of the rubric to evaluate their drafts in terms of the three most global criteria: ideas and content, organization, and paragraphs. For example, the historical fiction rubric included a criterion that required students to bring the time and place in which the character lived alive. During class, students were asked to underline time and place in blue on their rubrics, then underline the information that they provided about the time and place of their story in blue on their essay. If they could not find the information in their essay (and they were often surprised to discover that they could not), they were instructed to write (at the tops of their papers) a reminder to add the missing information when they wrote a second draft. The treatment students were then asked to write a new draft and bring it to class for the second self-assessment lesson.

The second lesson followed the same process of guiding students' use of the second half of the rubric to look at the four finer grained criteria: voice and tone, word choice, sentence fluency, and conventions. The students were then asked to revise their essays without our further intervention. Control classes received copies of the rubrics and were asked to write at least two drafts, but they did not formally assess their own work in class.

Dependent Measures

Students' revised drafts were collected and scored. Each essay was scored by at least one research assistant, according to an adapted version of the rubric used in the classroom. The scoring rubrics consisted of a 6-point ordinal scale in which 0 was the lowest score, and 5 the highest score. A research assistant averaged the scores on each criterion to create an overall score for each essay.

The second author and a research assistant were trained by the first author to score the essays using the scoring rubric. For each essay, the scorers underwent a training period during which they scored several essays in conjunction with discussions with the first author. The scorers then rated several more essays independently. The scores were compared to assure that scoring commenced with a high level of interrater reliability.

We used Pearson correlations and t tests in combination to determine the extent to which the raters were in agreement. That method provided a more accurate assessment of interrater reliability than did correlation alone because it tested the relationship as well as the mean difference between the scores of the two raters.

The scorers rated 10 historical fiction essays independently at the beginning of scoring and 5 historical fiction essays during scoring to guard against drift. The overall (N = 15) Pearson correlation coefficient was extremely high (r = .92), and a t test showed no significant differences between scores (p = .71). In addition to those measures of overall reliability, we judged the extent to which the raters agreed on each of the criteria used to assess the essays. The Pearson correlation coefficients for the 10 criteria ranged from .60-.95, and all but one of the t tests showed no significant differences between the ratings. For the one criterion for which the score differed (A1, historical content), each time the scorers disagreed (4 times out of 15), Scorer 2 rated the essay 1 point higher than did Scorer 1. Because the disagreement was always in the same direction, the somewhat small discrepancy reached statistical significance.

The scorers rated 12 response to literature essays independently; they also rated 4 additional response to literature essays during scoring to guard against drift. The two sets of scores were highly correlated (r = .79), and a t test showed no significant differences between scores (p = .83). We again assessed interrater reliability on the 10 individual criteria. Nine of the 10 correlations were high (.59-.91), but one was only .3. Again, all but one of the t tests (word choice) were nonsignificant, indicating that there was high agreement between the scorers. However, the dependent measure used in the analyses was the average score on each essay. We assumed that that average was resilient to the minor discrepancies between the raters on the two criteria noted in the previous paragraph.
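The combined use of correlations and t tests described above can be illustrated with a brief sketch. The rater scores below are invented examples, not the study data; each value stands for an essay's overall score (the mean of its 0-5 criterion ratings).

```python
# Illustrative interrater-agreement check (invented scores, not the study data).
from scipy.stats import pearsonr, ttest_rel

scorer_1 = [3.2, 4.1, 2.5, 4.8, 3.6, 2.9, 4.0, 3.3, 4.5, 2.7]
scorer_2 = [3.4, 4.0, 2.6, 4.9, 3.5, 3.1, 4.1, 3.2, 4.6, 2.8]

r, r_p = pearsonr(scorer_1, scorer_2)   # do the raters rank the essays similarly?
t, t_p = ttest_rel(scorer_1, scorer_2)  # do their mean scores differ systematically?

# A high r with a nonsignificant t suggests agreement; a high r with a significant t
# signals a consistent one-direction offset, as with the historical content criterion.
print(f"r = {r:.2f} (p = {r_p:.3f}); t = {t:.2f} (p = {t_p:.3f})")
```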

Independent Measures

Treatment group. We used an indicator variable to denote whether students participated in the self-assessment lessons (treatment). That variable served as our primary question variable, and allowed us to test whether membership in the group that engaged in self-assessment was related to higher essay scores.

Previous performance in English. We expected that students who performed well in English in the past would write better essays than would those who had not done as well in English, regardless of whether they participated in the self-assessment lessons. We included two measures of past performance in English to be used as control variables. Each student's English grade from the previous semester was obtained from the school's records (grade). Those grades ranged from 95 (A, highest) to 55 (F, lowest). We also obtained students' scores on the ASAT from the school records. The scores represented an average of the language mechanics and language expression components of the test.

School- and grade-level descriptors. To test whether the effect of the intervention was different across the two schools, we included an indicator variable in our analysis to index the school attended (school). We also indexed the grade level of each student (Grade 7 or 8) to test for grade-level effects (level).

Student descriptors. We obtained four descriptive variables for each student from school records to use as control variables: gender, ethnicity, participation in special education, and participation in ESL (English as a second language) programs.

The ethnicity measure contained eight categories (White, Black, Asian or Pacific Islander, Native American or Alaskan, Portuguese, Filipino, Hispanic, and Other). Because the numbers of children were small in several of the categories, we collapsed the variable into one categorical variable, ethnicity, to represent membership in any minority group.

Results

Descriptive Statistics

We completed scoring the historical fiction essay for a subsample of 119 students (30% of the complete sample). Chi-square tests and t tests revealed no significant differences between the students included in the analysis and those not included in the analysis on any of the descriptive measures, indicating that the subsample did not differ from the overall sample of students. Fifty-one students in the subsample (42.9%) attended School A; 68 students (57.1%) attended School B. Across both schools, there were 59 students (49.6%) in the treatment condition, and 60 students (50.4%) in the control condition. Sixty-two students (52.1%) were in eighth grade; 57 students (47.9%) were in seventh grade.

Scores for the response to literature essay were completed for a subsample of 98 students (24.7% of the complete sample). Because of implementation difficulties at School B, all of the students in the subsample of those writing the response to literature essay attended School A. There were 48 students (49.0%) in the treatment condition, and 50 students (51.0%) in the control condition. Half of the students were in eighth grade; the other half were in seventh grade.

TABLE 1. Frequencies and Means for Descriptive Variables, by Treatment Condition: Historical Fiction Essay (N = 119)

                                           Treatment (n = 59)      Control (n = 60)
Variable                                   n            %          n            %

School A                                   23           39.0       28           46.7
Gender (female)                            33           55.9       32           53.3
Level (Grade 8)                            36           61.0       26           43.3
Ethnicity                                  25 (of 39)   64.1       22 (of 34)   64.7
English as a second language               6            10.2       9            15.0
Special education                          3            5.1        5            8.3

                                           M            SD         M            SD

Abbreviated Stanford Achievement
  Test score                               4.73         1.79       5.47         1.61
Grade                                      80.45        11.99      80.0         11.12

TABLE 2. Frequencies and Means for Descriptive Variables by Treatment Condition, Response to Literature Essay, and School A (N = 98)

                                           Treatment (n = 48)      Control (n = 50)
Variable                                   n            %          n            %

Gender (female)                            20           41.7       25           50
Level (Grade 8)                            23           47.9       26           52
Ethnicity                                  19 (of 43)   44.2       10 (of 41)   24.4
English as a second language               0            0          2            4
Special education                          1            2.1        5            10

                                           M            SD         M            SD

Abbreviated Stanford Achievement
  Test score                               5.95         1.94       5.81         1.65
Grade                                      85.0         9.56       84.0         11.04

We summarized the descriptive statistics for each subsample by treatment condition (see Tables 1 and 2). We assessed the equivalence of the treatment and control groups on each of the descriptive variables by using chi-square tests for categorical variables and t tests for continuous variables.

In the subsample of students who completed the historical fiction essay, significant differences existed between the groups with respect to level (more eighth graders than seventh graders in the treatment group) and ASAT scores (the control group earned higher ASAT scores than did the treatment group). The treatment and control groups of students who completed the response to literature essay were significantly different with respect to the number of minority students (more minority students than nonminorities in the treatment group). Therefore, we included those variables as control variables in all subsequent multivariate analyses.
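As a rough illustration of the equivalence checks just described, the sketch below runs a chi-square test on a categorical descriptor and a t test on a continuous one. The gender counts are taken from Table 1; the ASAT values are simulated around the Table 1 means and standard deviations, so the output is illustrative only.

```python
# Sketch of the treatment/control equivalence checks (illustrative data).
import numpy as np
from scipy.stats import chi2_contingency, ttest_ind

# Female/male counts by condition, from Table 1 (treatment n = 59, control n = 60).
gender_table = np.array([[33, 26],    # treatment: female, male
                         [32, 28]])   # control:   female, male
chi2, p_gender, dof, _ = chi2_contingency(gender_table)

# ASAT scores by condition, simulated around the Table 1 means and SDs.
rng = np.random.default_rng(0)
asat_treatment = rng.normal(4.73, 1.79, 59)
asat_control = rng.normal(5.47, 1.61, 60)
t, p_asat = ttest_ind(asat_treatment, asat_control)

print(f"gender: p = {p_gender:.3f}; ASAT: p = {p_asat:.3f}")
```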
Correlations

Table 3 shows the correlation matrices that we constructed to explore the relationships between the predictors and scores on the historical fiction essay, as well as among the predictors. In Table 4, we summarized the analogous relationships between the predictors and the scores on the response to literature essay.

The first column of Table 3 shows the bivariate correlations between student scores on the historical fiction essay (HF score) and the predictors. Our question predictor (treatment) had essentially no correlation with HF scores, which suggested that there was no effect of treatment on essay scores without controlling for the effects of other variables. ASAT scores, grade, and school attended (school) were all moderately correlated with essay scores. ASAT also was correlated with treatment, indicating that the relationship between treatment condition and average essay score may have changed after we controlled for ASAT. In addition, ethnicity and gender were somewhat correlated with essay scores and correlated with each other, suggesting that although both variables could be important control variables, after controlling for one, the other might not explain much additional variation.

The first column in Table 4 shows the bivariate correlations between student scores on the response to literature essay (RtoL score) and the predictors. Treatment had little or no correlation with RtoL score, whereas ASAT scores and grades were moderately correlated with RtoL scores.

TABLE 3. Correlations Between Historical Fiction Essay Scores, Treatment, Previous Performance in English, and Demographic Predictors

Variable    HF score   Treatment   ASAT      Grade     School    Level   Ethnicity

HF score    1.0
Treatment   -.02       1.0
ASAT        .52***     .22*        1.0
Grade       .54***     .004        .45***    1.0
School      .38***     -.08        .40***    .25**     1.0
Level       .11        .18         -.17      .26**     -.12      1.0
Ethnicity   -.40***    -.01        -.34**    -.41***   -.70***   -.07    1.0
Gender      .18*       .03         -.10      .003      -.20*     .11     .30**

Note. N = 119. ASAT = Abbreviated Stanford Achievement Test.
*p < .05. **p < .01. ***p < .001.

TABLE 4. Correlations Between Scores on the Response to Literature Essay (RtoL), Treatment, Previous Performance in English, and Demographic Predictors

Variable     RtoL score   Treatment   ASAT     Grade     Level   Ethnicity

RtoL score   1.0
Treatment    .02          1.0
ASAT         .31**        .04         1.0
Grade        .42***       .05         .38***   1.0
Level        -.01         -.04        -.14     .40***    1.0
Ethnicity    -.19         .21         -.30**   -.19      -.09    1.0
Gender       .01          -.09        -.10     .13       -.03    .28**

Note. N = 98. ASAT = Abbreviated Stanford Achievement Test.
*p < .05. **p < .01. ***p < .001.

As with the historical fiction essay, those correlations suggest that there was no effect of treatment group on essay scores without our having controlled for the effects of other variables.

Multiple Linear Regression Models

Tables 5 and 6 show separate taxonomies of multiple linear regression models fit to assess the relationship between the scores on each of the essays and the question predictors, and the high-priority control variables and their interactions.

Treatment. As expected, Model 1 in Table 5 shows that participation in the self-assessment lessons was not related to scores on the historical fiction essay. That finding indicated that no treatment effect existed without controlling for other variables in the model. Model 1 in Table 6 reveals the same finding for the response to literature essay scores. We expected that finding because of the low bivariate correlation between treatment and RtoL score.

Previous performance in English. Model 2 in Table 5 shows that there was a significant main effect of ASAT scores and grades on student scores on the historical fiction essay. That is, students who had higher grades and those who had higher scores on the ASAT than did lower achieving students tended to have higher essay scores, controlling for whether they were in the treatment group. A similar effect was seen on the response to literature essay (Model 2, Table 6).

TABLE 5. Multiple Linear Regression Models of the Relationship Between Historical Fiction Essay Scores and Treatment

                                        Model
Variable       1          2          3          4          5          6

Intercept      3.16***    .61        .62        .70        .76        .83*
Treatment      -.03       .10        -.16       .27        .04        -.14
ASAT                      .15***     .11*       .14***     .15***     .15***
Grade                     .02***     .02**      .02***     .02***     .02***
TX*ASAT                              .06
TX*Grades                            -.002
School                               .39*       .33*       .28
TX*School                            -.43       -.37       -.28
Level
TX*Level
Gender                                          -.22*      -.a**      -.47**
TX*Gender                                                  .36        .44*

Model R²       .05%       39.37%     43.50%     45.57%     47.14%     45.29%

Note. N = 119. ASAT = Abbreviated Stanford Achievement Test. In the six models, we controlled for previous performance in English, school, grade level, gender, and interactions with the question predictor.
*p < .05. **p < .01. ***p < .001.

TABLE 6. Multiple Linear Regression Models of the Relationship Between Response to Literature Essay Scores and Treatment

                                        Model
Variable       1          2          3          4          5(a)

Intercept      3.145***   1.10*      1.35       1.41       1.98*
Treatment      .027       .06        .06        .05        .004
ASAT                      .048       .044       .041       .03
Grade                     .02***     .02**      .02**      .02***
Level                                -.04       -.05       -.11
Gender                                          -.04       -.06
Ethnicity                                                  -.001

Model R²       .06%       22.73%     22.84%     22.98%     26.21%

Note. N = 98. ASAT = Abbreviated Stanford Achievement Test. In the models, we controlled for previous performance in English, grade level, gender, ethnicity, and interactions.
(a) This model contains 70 participants.
*p < .05. **p < .01. ***p < .001.
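The taxonomy of models summarized in Tables 5 and 6 amounts to a sequence of ordinary least squares regressions that add control variables and interaction terms one block at a time. The sketch below shows how such a sequence could be fit with the statsmodels formula interface; the column names and the synthetic data frame are assumptions for illustration, not the authors' analysis files.

```python
# Sketch of the model taxonomy in Tables 5 and 6 (synthetic data, illustrative names).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 119
df = pd.DataFrame({
    "hf_score": rng.uniform(1, 5, n),      # average historical fiction essay score (0-5)
    "treatment": rng.integers(0, 2, n),    # 1 = participated in self-assessment lessons
    "asat": rng.uniform(2, 9, n),          # Abbreviated Stanford Achievement Test score
    "grade": rng.uniform(55, 95, n),       # previous English grade
    "female": rng.integers(0, 2, n),       # 1 = girl (inferred coding)
})

# Model 1: the question predictor alone.
m1 = smf.ols("hf_score ~ treatment", data=df).fit()
# Model 2: add the two measures of previous performance in English.
m2 = smf.ols("hf_score ~ treatment + asat + grade", data=df).fit()
# Model 6 analogue: retained main effects plus the treatment-by-gender interaction.
m6 = smf.ols("hf_score ~ treatment + asat + grade + female + treatment:female",
             data=df).fit()

print(m6.summary().tables[1])  # coefficient table for the interaction model
```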

Again, students with higher grades tended to score higher on their essays across treatment and control groups than did students with lower grades. However, on the RtoL essay, ASAT was not a significant predictor of essay scores. (Although not shown in Table 6, we ran models with each of the measures of English performance separately. Without grades in the model, there also was a significant main effect of ASAT. As seen in Table 5, ASAT and grades were correlated and collinear; therefore, we kept them both in the model throughout the model building process.) There were no significant interactions between previous performance in English and treatment in either analysis. That finding indicates that the effect of the treatment does not differ as a function of previous performance in English for either type of essay.

School, level, gender, and ethnicity. We investigated whether there were any main effects of school or of grade level on essay scores, or if the effect of the treatment differed by either of these variables (statistical interaction). When we examined student scores on the historical fiction essay, the treatment-by-school interaction approached statistical significance after we controlled for previous performance in English (Model 3, Table 5). That is, when we controlled for past performance in English, the effect of participation in the self-assessment lesson on historical fiction essay scores differed across the two schools. However, as Model 5 reveals, that effect was no longer significant when the interaction of treatment and gender was added to the model. The interaction shown in Model 5 indicates that the effect of the treatment condition differed for boys and girls. No additional variation in essay scores was explained by any of the remaining variables or interactions. Therefore, we refit Model 5 to remove any predictors that did not meet statistical significance (p < .05); this appears as Model 6.

Analogous models were fit to investigate the effects of the variables on response to literature essay scores (Models 3, 4, and 5 in Table 6). The models indicated that after controlling for treatment, ASAT scores, and grades, there were no main effects of grade level, gender, or ethnicity (all students who completed this essay attended School A). We also tested for interaction effects between each of the predictors and treatment. None of the interaction terms were significant, which indicated that the treatment effect did not differ across any of the subgroups.

Overall Findings

The results from the historical fiction analyses revealed some interesting relationships. Figure 1 shows a series of prototypical plots of the effects revealed by Model 6. Essay scores (Y-axis) are plotted as a function of ASAT scores (X-axis). The black lines refer to the treatment condition, and the gray lines refer to the control condition. Solid lines refer to the effects for boys, and dotted lines represent the effects for girls.

All four lines have an upward slope, which indicates that, as expected, students with higher ASAT scores tended to earn higher essay scores than did students with lower ASAT scores. Also, there was a positive main effect of gender, such that boys consistently scored higher than did girls on the historical fiction essay (the solid lines are always shown above the dotted lines). In addition, Model 6 reveals that although there was an effect of treatment condition, it differed for boys and girls (p < .0389).

[Figure 1 appears here in the original: predicted essay scores (Y-axis) plotted against ASAT scores (X-axis, 2 to 9), with separate lines for treatment/boys, treatment/girls, control/boys, and control/girls.]

FIGURE 1. Plot of multiple linear regression model using treatment, Abbreviated Stanford Achievement Test scores, English grades, and gender to predict scores on the historical fiction essay. English grade is set to the sample mean (N = 119).
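Because only the caption of Figure 1 survives in this copy, the sketch below shows how the prototypical lines could be reconstructed from the Model 6 coefficients reported in Table 5. The coefficients and the English-grade mean (80.53) come from the article; the gender coding (1 = girl) is inferred from the text, and the plotting details are assumptions.

```python
# Reconstruction sketch of Figure 1 from the Model 6 coefficients in Table 5.
import numpy as np
import matplotlib.pyplot as plt

# Coefficients read from Table 5, Model 6 (intercept, treatment, ASAT, grade,
# gender, treatment-by-gender). Gender is assumed to be coded 1 for girls.
b0, b_tx, b_asat, b_grade, b_girl, b_tx_girl = 0.83, -0.14, 0.15, 0.02, -0.47, 0.44
grade_mean = 80.53            # English grade fixed at the reported sample mean
asat = np.linspace(2, 9, 50)  # the range shown on the original X-axis

def predicted(tx, girl):
    return (b0 + b_tx * tx + b_asat * asat + b_grade * grade_mean
            + b_girl * girl + b_tx_girl * tx * girl)

plt.plot(asat, predicted(1, 0), "k-",  label="treatment/boys")
plt.plot(asat, predicted(1, 1), "k--", label="treatment/girls")
plt.plot(asat, predicted(0, 0), "-",  color="gray", label="control/boys")
plt.plot(asat, predicted(0, 1), "--", color="gray", label="control/girls")
plt.xlabel("ASAT scores")
plt.ylabel("Predicted historical fiction essay score")
plt.legend()
plt.show()
```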

That is, we found a positive effect of treatment condition for girls; those in the treatment group scored .31 points higher, on average, on the historical fiction essay than did girls in the control group (the black dotted line is above the gray dotted line). That difference was statistically significant (p < .0395). However, the effect was reversed for boys: the gray solid line is above the black solid line. The boys in the control group scored .14 points higher, on average, than did boys in the treatment group. That effect, however, did not attain statistical significance (p < .3739).

Boys and girls in the treatment groups did not differ from each other; note the distance between the two black lines (p < .3364). It appears that without treatment, on average, girls tended to receive lower scores than did boys (on average, .47 points lower; p < .0018). With treatment, however, girls tended to receive scores similar to those received by the boys in the treatment group.

In contrast, the results from the response to literature essay analyses suggest that there was no effect of treatment. The only significant main effect was the predictable effect of grades. We ran a power analysis to determine whether we had a large enough sample to measure even small between-group differences. The results showed that when the sample size was 98, the multiple linear regression test of R² for six normally distributed covariates would have a 95% power to detect an R² of .1857. That effect size was comparable to those that we observed in previous essays. Therefore, we were confident that we had the statistical power to detect an effect of the self-assessment treatment if one existed.
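The reported power figure can be checked with a standard noncentral-F calculation. The sketch below is our reconstruction of that arithmetic under the usual convention for the noncentrality parameter; it is not the authors' original computation.

```python
# Approximate check of the reported power analysis: N = 98, six covariates,
# alpha = .05, and a target R-squared of .1857 (reconstruction, not the authors' code).
from scipy.stats import f as f_dist, ncf

n, k, r2, alpha = 98, 6, 0.1857, 0.05
f2 = r2 / (1 - r2)            # Cohen's f-squared effect size
df_num, df_den = k, n - k - 1
nc = f2 * n                   # conventional noncentrality parameter

f_crit = f_dist.ppf(1 - alpha, df_num, df_den)   # critical F under the null
power = ncf.sf(f_crit, df_num, df_den, nc)       # P(reject) under the alternative
print(f"power = {power:.2f}")                    # should land near the 95% reported above
```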
In summary, for the first essay (historical fiction), participation in the self-assessment lessons tended to have a positive relationship with girls' scores, but no statistically significant relation to boys' scores. For the second essay (response to literature), the self-assessment lessons had no relationship to either girls' or boys' scores.

Discussion

The results of this study indicate that the self-assessment intervention was insufficient to obtain a consistent, measurable effect. Teachers may have to do more than provide a rubric and facilitate two self-assessment sessions to determine meaningful improvements in students' essays. Guidance from a teacher in solving the problems identified by students during self-assessment lessons could help the students, given the well-established difficulties that they have with revision (e.g., Cooper & Odell, 1999). Other promising additions to the intervention include teachers (a) extending the treatment time; (b) clarifying students' understanding of the rubric, perhaps by co-creating it with them and by making it more central to each writing assignment; and (c) providing feedback about the accuracy of students' self-assessments and the effectiveness of their revisions.

We also have considered other possible influences on our results, including the likelihood of a positive effect of the rubrics on the writing done by the students in the control group and the effects of school conditions, essay topics, and gender differences in response to feedback.

Effects of Rubrics on Students' Writing

Andrade (2001) has suggested that instructional rubrics alone are sometimes, but not always, associated with an increase in the scores assigned to middle school students' writing. In the current study, we investigated the possibility that focused self-assessment could enhance the effect of rubrics by helping students use them to critically evaluate their own work. It is possible that the process of self-assessment (treatment condition) contributed too little to the final essay scores, above and beyond the contribution of the rubrics (control condition), to reliably measure. That is, the rubric may have helped students in each group to write more effectively by advising them of what to expect in the essays, and the self-assessment lessons may have had little additional effect. Further research is needed to determine whether boosting the self-assessment by extending treatment time, co-creating rubrics with students, and providing feedback to students about their self-assessments and their revisions makes a difference in students' writing.

School Conditions

We discounted the results from the response to literature essay treatment because of the difficulties that we experienced when we implemented that portion of the study. The response to literature essay assignment was given at the end of April, presumably well before the usual end-of-the-year chaos that is experienced by most schools. However, the 1997-98 school year in which this study took place was different. Administrators in the district in which we were working had issued new testing requirements in the schools that made doing much else nearly impossible. We did not use the data from School B because the study could not be carried out appropriately in that school, despite the valiant efforts of several teachers. The data from School A was scored, but it was apparent that the essays had not received the same amount of attention from students that they had in the past. One teacher from School A summarized the situation best:

    The last few months of the year brought the normal standardized testing, but there was a twist this past year. Literacy portfolios turned out to be a HUGE deal with administration. Basically, we were pressured to make [School A] look good as compared to the other middle schools in the city. I calculated that we put 11 instructional days exclusively into these folders over the course of the year. In addition, eighth graders had to do an exit exhibition and we had to rehearse them as well as sit on some panels. Teaching in some respects took a back seat to satisfying the needs of the educational machine.

The irony that two assessment initiatives prevented us from properly carrying out our research on self-assessment has not escaped us. Regrettably, the data from the response to literature essay was suspect.

We included it here only to make the obvious point that self-assessment, although potentially beneficial, may not be sufficiently powerful to counteract other negative influences on students' writing.

Essay Topic

We speculated that some portion of the inconsistencies in the findings for the two essays might be explained by the simple fact that the students were asked to write two different kinds of essays. Although the various demands of the two essays almost certainly had an effect on students' writing, one should remember that six of the seven criteria contained in the rubrics for the historical fiction and response to literature essays were either identical or nearly identical. We believe that it is unlikely that the remaining criterion (ideas and content) would have been enough to skew our results.

Gender Differences

The analysis of essay scores for the historical fiction essay indicates that self-assessment has a positive relationship with girls' writing, but little or no relationship with boys' writing. We turned to the research on student responses to feedback to understand the gender difference. In a broad stroke, our finding is consistent with research on sex differences in response to feedback and in achievement motivation and learned helplessness. That body of work has shown generally that girls and boys differ in their attributions of success and failure and in their response to evaluative feedback (Dweck & Bush, 1976; Dweck, Davidson, Nelson, & Enna, 1978). Briefly, research by Dweck and others (Deci & Ryan, 1980; Hollander & Marcia, 1970) has shown that girls are more likely than are boys to be extrinsically motivated and to attribute failure to ability rather than to motivation, effort, or the agent of evaluation. As a result of those attributions, girls' performance following negative adult feedback tends to deteriorate more than does boys' performance.

Furthermore, research by Dweck and her colleagues has demonstrated that boys and girls tend to attribute failure to ability when negative adult feedback is solution specific. A hypothesis based on those findings would predict detriments in boys' and girls' performance in the present study because rubric-referenced self-assessment is solution specific and likely to uncover problems with students' work. That is, because the rubrics set high standards of quality and addressed multiple components of the writing task, self-assessment was likely to emphasize at least one weakness in a student's first draft of an essay. In addition, the rubrics never referred to academically irrelevant criteria such as neatness, timeliness, or effort. Although we did not collect data on students' actual assessments of their essays, the first author's observations in classrooms during the self-assessment lessons revealed that students typically found one or more problems with their essays. On the basis of those observations, we conjectured that the students' self-assessments were often, if not always, at least partially negative and completely solution specific, which would predict a negative impact on boys' and girls' performance. Such a prediction would have been incorrect. We further speculated that students might tend to respond differently to self-generated negative, solution-specific feedback than to feedback from adults.

The findings from the Student Self-Assessment Study are consistent with findings from an earlier study conducted by the first author (Goodrich, 1996), which showed that rubric-referenced self-assessment has a positive relationship with girls' metacognitive processing, but a negative relationship with boys' metacognitive processing. In combination, the earlier study and the Student Self-Assessment Study suggest that self-generated feedback may have a different effect on girls' performance than does negative adult feedback. Some interesting contradictions in the research literature indicate that that finding may not be peculiar to our research. Roberts and Nolen-Hoeksema (1989) found no evidence that evaluative feedback leads to performance decrements in women, suggesting that women's maladaptive response to feedback is not absolute. Also, Bronfenbrenner (1967, 1970) reported that when peers, instead of adults, delivered deficient feedback, the pattern of attribution and response reversed: Boys attributed the failure to a lack of ability and showed impaired problem solving, whereas girls more often viewed the peer feedback as indicative of effort and showed improved performance. Noting that the more traditional finding of greater helplessness among girls was evident only when the evaluators were adults, Dweck and colleagues (1978) reported that boys and girls "have not learned one meaning for failure and one response to it. Rather, they have learned to interpret and respond differently to feedback from different agents" (p. 269). That conclusion seems to be reasonable and relevant to the gender differences found in this study. Although our research was not designed to permit us to examine students' attributions of success or failure, we believe that the results, together with earlier research, raise the possibility that girls' responses to self-generated feedback tend to be constructive, whereas boys' responses tend to be either neutral or possibly even negative. Clearly, the cognitive and emotional mechanisms associated with self-assessment deserve a closer look if they are to serve all students equally well.

Conclusion

Perhaps not surprisingly, we believe that there is reason to hold out hope for self-assessment in spite of the less-than-promising results of the Student Self-Assessment Study. One reason is that the differences in scores between the treatment group and the control group tended to favor the treatment group, even when those differences were too small to attain statistical significance. A more extensive intervention is needed.

We intend to create and test more supportive conditions for self-assessment by extending the treatment time, co-creating rubrics with students, and facilitating revision. By so doing, we believe that we can elevate student self-assessment above its poor stepchild status and, more important, help create effective writing programs and reflective writers.

NOTES

We thank Stacey Koprince for her help in scoring hundreds of essays. We also are grateful to the Edna McConnell Clark Foundation for the financial support for this study. The opinions expressed in this article do not necessarily reflect those of the foundation.

This study was conducted while the first author was a principal investigator at Project Zero, Harvard Graduate School of Education.

REFERENCES

Andrade, H. G. (2001, April 18). The effects of instructional rubrics on learning to write. Current Issues in Education [On-line], 4(4). Available: http://cie.ed.asu.edu/volume4/number4
Bangert-Drowns, R., Kulik, C., Kulik, J., & Morgan, M. (1991). The instructional effect of feedback in test-like events. Review of Educational Research, 61, 213-238.
Bronfenbrenner, U. (1967). Response to pressure from peers versus adults among Soviet and American school children. International Journal of Psychology, 2, 199-207.
Bronfenbrenner, U. (1970). Reactions to social pressure from adults versus peers among Soviet day school and boarding school pupils in the perspective of an American sample. Journal of Personality and Social Psychology, 15, 179-189.
Butler, D., & Winne, P. (1995). Feedback and self-regulated learning: A theoretical synthesis. Review of Educational Research, 65(3), 245-281.
Cooper, C., & Odell, L. (1999). Evaluating writing: The role of teachers' knowledge about text, learning and culture. Urbana, IL: National Council of Teachers of English.
Daiute, C. (1986). Physical and cognitive factors in revising: Insights from studies with computers. Research in the Teaching of English, 20, 141-159.
Deci, E., & Ryan, R. (1980). The empirical exploration of intrinsic motivational processes. In L. Berkowitz (Ed.), Advances in experimental social psychology. New York: Academic Press.
Dweck, C., & Bush, E. (1976). Sex differences in learned helplessness: I. Differential debilitation with peer and adult evaluators. Developmental Psychology, 12, 147-156.
Dweck, C., Davidson, W., Nelson, S., & Enna, B. (1978). Sex differences in learned helplessness: II. Contingencies of evaluative feedback in the classroom and III. An experimental analysis. Developmental Psychology, 14(3), 268-276.
Gardner, H. (1991). Assessment in context: The alternative to standardized testing. In B. R. Gifford & M. C. O'Connor (Eds.), Changing assessments: Alternative views of aptitude, achievement and instruction. Boston: Kluwer.
Goodrich, H. (1996). Student self-assessment: At the intersection of metacognition and authentic assessment. Doctoral dissertation, Harvard University, Cambridge, MA.
Goodrich, H. (1997). Understanding rubrics. Educational Leadership, 54(4), 14-17.
Goodrich Andrade, H. (2000). Using rubrics to promote thinking and learning. Educational Leadership, 57(5), 13-18.
Hollander, E., & Marcia, J. (1970). Parental determinants of peer orientation and self-orientation among preadolescents. Developmental Psychology, 2, 292-302.
Roberts, T., & Nolen-Hoeksema, S. (1989). Sex differences in reactions to evaluative feedback. Sex Roles, 21(11/12), 725-746.
Scardamalia, M., & Bereiter, C. (1983). The development of evaluative, diagnostic and remedial capabilities in children's composing. In M. Martlew (Ed.), The psychology of written language: A developmental approach. London: Wiley.
Scardamalia, M., & Bereiter, C. (1985). Fostering the development of self-regulation in children's knowledge processing. In S. F. Chipman, J. W. Segal, & R. Glaser (Eds.), Thinking and learning skills, Vol. 2: Research and open questions (pp. 563-577). Hillsdale, NJ: Erlbaum.
White, E. (1994). Teaching and assessing writing: Recent advances in understanding, evaluating, and improving student performance (2nd ed.). Portland, ME: Calendar Islands.
Wiggins, G. (1989a). A true test: Toward more authentic and equitable assessment. Phi Delta Kappan, 70(9), 703-713.
Wiggins, G. (1989b). Teaching to the (authentic) test. Educational Leadership, 46(7), 41-47.
Wolf, D., & Pistone, N. (1991). Taking full measure: Rethinking assessment through the arts. New York: College Board Publications.
