
JAL (print) ISSN 1479-7887
JAL (online) ISSN 1743-1743

Journal of Applied Linguistics

Article

Creating and validating assessment


instruments for a discipline-specific
writing course:
an interdisciplinary approach
Fredricka L. Stoller, Bradley Horn, William Grabe and
Marin S. Robinson1
Abstract
This paper reports on a sustained interdisciplinary effort between applied linguists
and chemistry faculty to create and validate writing assessment instruments for
an advanced-level Write Like a Chemist course, one component of a much larger
interdisciplinary project. The article describes a multiple-year effort to form valid
analytic and holistic assessment instruments to be used by chemistry faculty to assess the writing performance of chemistry majors. Emphasis is placed on the joint
contributions of applied linguists and chemists in (a) the identification of meaningful
writing criteria, (b) the development of assessment scales with distinct score points
and descriptors, (c) socialization sessions that prepared chemists to help build the
assessment instruments, and (d) the validation of assessment instruments with other
chemists. Outcomes exemplify the mediating role that applied linguistics can play
in the design of a discipline-specific course, instructional materials, and assessment
instruments that support the development of disciplinary expertise. The results also
demonstrate the positive consequences of crossing disciplinary boundaries for both
subject-area faculty and applied linguists.
Keywords: writing assessment, disciplinary writing, holistic assessment,
analytic scales, interdisciplinary research, mediating role of
applied linguistics
Affiliations
All authors: Northern Arizona University, USA.
Corresponding author: Fredricka L. Stoller, Northern Arizona University, Department of English, PO Box 6032,
Flagstaff, AZ 86011-6032, USA.
email: fredricka.stoller@nau.edu

JAL vol 2.1 2005: 75-104


© 2005, Equinox Publishing, London

doi: 10.1558/japl.2005.2.1.75


1 Introduction
Applied linguists involved in Language for Academic Purposes, Language for
Specific Purposes, and Language for Research Purposes frequently consult
content-area specialists as part of preliminary needs analyses for advanced-level
discipline-specific courses with a language focus (e.g. Dudley-Evans & St. John,
1998; Flowerdew & Peacock, 2001a; Swales, 2004). Content-area specialists
often assist applied linguists in:
(a) defining language-learning, content-learning, and strategy-learning
objectives;
(b) identifying core disciplinary texts;
(c) specifying discipline-specific tasks;
(d) creating instructional materials;
(e) designing assessment instruments for such courses.
Although some applied linguists develop long-term collaborative working
relationships with content-area specialists, it is more often the case that applied
linguists gather information from specialists (along with other stakeholders)
early in course-development stages and then work on their own to develop,
implement, and assess their courses (cf. Flowerdew, 1993; Johns, 1997). Even
when repeated needs analyses are conducted as part of an ongoing process of
course development (as recommended by Tudor, 1996, cited in Flowerdew &
Peacock, 2001b), the actual exchange of information in these consultations is
often unidirectional; that is, applied linguists solicit information from content-area specialists but the reverse rarely occurs.
In this article, we report on a sustained interdisciplinary effort between applied
linguists and chemistry faculty in which there has been a commitment to an
interchange of ideas, joint problematization, and collaboration in action (see
Candlin & Sarangi, 2004), with both parties providing ongoing contributions
to the development and assessment of an advanced-level Write Like a Chemist
course2 offered by university chemistry faculty for chemistry majors. Key to the
project has been the cooperation and collaboration between applied linguists
and chemistry faculty (see Dudley-Evans & St. John, 1998). Our approach to
disciplinary writing, in general (e.g. genre analyses, corpus analyses, course
design, textbook development, instruction), and writing assessment, more
specifically, could not have been accomplished without the equally important
but distinct contributions of chemistry faculty and applied linguists.
We focus here on one component of a much larger project, specifically our
interdisciplinary efforts to form valid writing assessment instruments. The
assessment component of the project represents a particularly public and procedural set of collaborative activities and provides a strong example of patterns of language interactions that had to be negotiated by the applied linguists. Guiding these efforts were the following questions:
(1) What assessment criteria can be adapted from (a) the general
writing-assessment practices used by applied linguists and (b) the discipline-specific writing concerns expressed by chemistry faculty?
(2) How can assessment scales be developed that have meaningful and distinct score points with descriptors that chemistry faculty find acceptable,
understandable, and easy to use?
(3) How can chemistry team members be socialized to provide useful information for the purposes of building valid assessment instruments?
(4) How can the assessment instruments developed by the applied linguist/
chemist team be validated with additional groups of chemists?
Key to these questions has been an effort to build a bridge across disciplinary
discourses surrounding the development of writing abilities. Because the students involved were chemistry majors needing to write chemistry assignments,
the burden to build interdisciplinary discourse practices rested with the applied
linguists in the project. As language specialists, it was their task not only to
achieve specific outcomes tied to a chemistry writing course, but also to negotiate language about teaching, about writing instruction more specifically, and
about how the chemists referred to language itself to ensure the success of the
project. In effect, the applied linguists had to find ways to help chemistry faculty
gain control and ownership of the discourse of writing instruction without
telling them what to do or how to do it.
To frame the discussion, we provide background on the Write Like a Chemist
project. We then discuss the processes and outcomes of our collaborative efforts
to create and validate a rating system to be used by chemistry faculty to evaluate
student writing in a discipline-specific course. By means of this case study, it
is our intention to showcase the mediating role (see Candlin & Sarangi, 2004)
that applied linguistics can play in the design of instructional materials and
related assessment instruments that support the development of disciplinary
expertise. Implications for applied linguists working in less conventional cross-disciplinary areas outside language teaching and language acquisition
(e.g. Candlin & Candlin, 2003; Sarangi & Candlin, 2003; Sarangi & Roberts,
1999) are presented in our concluding remarks.


2 Write Like a Chemist: project background and description


The Write Like a Chemist project was conceived as a response to a Northern
Arizona University mandate to address the writing needs of junior-level students across campus. At the time, departments were given the option of either
developing junior-level writing courses of their own or requiring students to
take English department courses. The chemistry department chose to develop
its own course. The chemistry faculty member who spearheaded the course-development process initiated a cross-disciplinary alliance (Wardle, 2004)
by contacting a faculty member associated with the applied linguistics area
within the English department. Initial success at our home institution led to
the expansion of the project in subsequent years, both in terms of overall goals
and composition of project team. Over time, the emphasis of the project broadened to include the development of instructional materials and a pedagogical
approach for use beyond the confines of our own university. The cross-disciplinary team of two expanded to include six members: two faculty members
and two graduate students in applied linguistics, and a faculty member and a
post-doctoral associate in chemistry. As part of their work, the applied linguists
coordinated the development and validation of writing assessment instruments, the focus of this article.
2.1 Materials-development emphases
Core to the Write Like a Chemist project have been the analysis of the language
of chemistry in three professional genres (i.e. journal articles, scientific posters,
and research proposals) and the development of instructional materials3 that
could be piloted4 (and later used) by chemistry faculty with little, if any, experience in teaching writing (see Stoller et al., 2005). It was assumed from the outset
of the project that most upper-division chemistry majors, at least in the United
States, would have encountered few, if any, of the three targeted genres during
their first two years of university studies. To introduce students to the reading
and writing of these genres, a read-analyze-write approach was formalized to
guide students in reading (and rereading) authentic texts from the targeted
genres, engaging in genre-analysis activities, and then writing (and rewriting)
a piece of their own following the writing conventions of the discipline.
The Write Like a Chemist read-analyze-write approach is dependent upon
authentic excerpts from the targeted genres. The excerpts, chosen to meet select
criteria including topic, length, and challenge, serve as models of preferential
patterns, expectations of the discipline, and the interrelationships between
language and content. Topics were selected to represent a range of areas within
chemistry including agricultural and food chemistry, analytic chemistry, biochemistry, organic chemistry, and toxicology. The topics were deemed to be both of interest to students and within their intellectual grasp, considering only
two prior years of university-level chemistry course work. Short excerpts were
chosen to highlight features of individual sections of the target genres; longer
excerpts were selected to illustrate the interface between and among different
sections of the genres. For example, for writing a methods section suitable
for a journal article, excerpts were chosen with content both familiar (e.g. 1H
NMR and gas chromatography/mass spectrometry) and unfamiliar (e.g. solid-phase microextraction and on-fiber derivatization) to most students. Whenever
possible, the less familiar content is introduced in excerpts we believe will be
of interest to students (e.g. on-fiber derivatization is introduced in an excerpt
about preserving beer flavor). For writing a results section, discussion section,
and introduction for a journal article, excerpts from six different articles are
showcased; the students are asked to read these excerpts section by section.
The first four articles address topics in environmental chemistry and toxicology
with content related to:
(a) the hydrophilic (i.e. water-loving) behavior of cyclodextrins and
their potential role in soil remediation;
(b) the microextraction of polychlorinated biphenyls (PCBs) from
full-fat milk;
(c) the genotoxicity of trivalent chromium in E. coli cells;
(d) the role of free radicals in the toxicity of airborne fine particulate
matter.
The last two articles involve organic synthesis with content related to the preparation of substituted tetrazoles in water (instead of an organic solvent) and an
asymmetric Strecker reaction, a potentially useful reaction in the synthesis of
pharmaceuticals. (See Appendix A for a representative sampling of articles
from which authentic excerpts have been drawn.) The pedagogic emphasis of
the Write Like a Chemist approach, and its reliance on authentic excerpts from
the chemistry literature, has been on 'enablement', facilitating access to valued genres through tasks designed to raise students' awareness of text features (Hyland, 2002: 20).
2.2 Assessment emphases
While language-analysis and materials-development activities have been central to the project, ensuring systematic and consistent scoring of students'
written work has also been a key concern. Scoring procedures were needed
for the purposes of providing feedback to students and evaluating the overall


efficacy of Write Like a Chemist materials. As such, there was a pressing need
for a rating system that could be:
(a) introduced to Write Like a Chemist pilot faculty during summer
training sessions for later use;
(b) used by project team members to evaluate pre- and post-assessment
written tasks submitted by students at pilot institutions;
(c) consulted by external evaluators (all chemistry faculty) at the end
of the first and second year of piloting to assess pilot students' post-course writing abilities.
The assessment-design goals guiding collaboration between the chemists and
applied linguists on the project team were the following:
(1) Development of a valid set of analytic rating scales to assist instructors
in providing feedback on major course writing assignments, including a
paper modeled after a journal article.
(2) Development of a valid set of holistic rating scales for evaluating student
writing outcomes across many instructional sites, socializing chemists to
read student writing in similar ways, and assessing the overall project.
(3) Establishment of an agreed-upon set of student writing samples to serve
as scale anchors.
While working toward these outcomes, the importance of unambiguous rating
descriptors, which could be easily understood and used by chemistry faculty
with little experience assessing student writing for pedagogical purposes,
became clear. With carefully worded descriptors, we could ensure efficient
discussions of student writing between applied linguists and chemistry faculty
at different stages of the project. Furthermore, with the guidance of carefully
articulated rating scales, we could assist chemistry faculty in reading student
papers in a more time-efficient manner, an issue which emerged fairly early
in collaborative efforts. Our target outcomes and associated aims facilitated
dialogue among project participants and guided the chemists, in particular,
in discussing, describing, and evaluating student writing with a consistent
terminology and shared purpose.
2.3 Discipline-specific writing assessment issues
Concerns about discipline-specific writing assessment have emerged in both
the applied linguistics and chemical-education literatures. Maintaining our
commitment to joint problematization signified that neither of these research
traditions should be valued over the other. As such, although the applied linguistics literature on discourse analysis and language testing is quite rich, the applied linguists were aware that they could not cast this literature (their
literature) as the ultimate voice of authority on these matters. Conversely,
while chemists are skilled in communicating within their own discourse community, they have only recently begun wrestling with the question of how to
initiate new members into this community. Thus, though the chemists were
recognized as the content-area experts on our project team, their expertise
(including the text traditions that they draw upon as chemistry educators)
had to be redirected to facilitate valid language description, pedagogy, and
assessment. As such, our task in reviewing the extant literatures from both
fields became one of synthesizing relevant aspects of both traditions for the
purposes of assessment design.
It is beyond the scope of this article to review the extensive literature on
writing assessment in the applied linguistics domain (for an overview, see Ferris
& Hedgcock, 2005; Hamp-Lyons, 2003; Hyland, 2002, 2003; Weigle, 2002).
Important for the present discussion, however, is the question of how linguistic
and non-linguistic elements of communication are seen to be related within
different types of assessment tasks.
Language for Specific Purposes (LSP) test designers often meet with content-area specialists to conduct a linguistic analysis of the target language use
domain.5 Although applied linguists are well equipped to describe the linguistic
characteristics of the target language setting, they are perhaps less prepared
to address non-linguistic elements that characterize successful performance
in different disciplines. The question of how to address non-linguistic factors
therefore becomes fundamental to LSP performance assessment. McNamara
(1996) has suggested the terms 'weak' and 'strong' performance assessment to
distinguish between assessments that focus solely on the linguistic elements of
communicative performance and those in which language is a necessary but
not sufficient condition on the test tasks (1996: 196-7). Although McNamara
presents these terms as a dichotomy, he also notes that the underlying concepts
represent a continuum. In certain contexts, language performance is contingent
upon some degree of content knowledge. For example, within the field of
chemistry, writing might be used as a means to measure how well students
grasp not only the language of the discipline, but also underlying content
knowledge and requisite skills (e.g. basic scientific concepts, laboratory skills,
and data analysis techniques). A weak chemistry performance assessment
would focus solely on the student's ability to comprehend and produce the
language of chemistry, whereas a strong chemistry performance assessment
would require the student to demonstrate mastery of science content and skills
through relevant disciplinary genres (e.g. lab reports, poster presentations,
empirical research articles).


Numerous chemical-education publications confirm that chemistry faculty are grappling with ways to teach writing in university chemistry courses (e.g.
Beall & Trimbur, 2001; Bressette & Breton, 2001; Coppola & Daniels, 1996;
Ebel et al., 2001; Gordon et al., 2001; Kovac & Sherwood, 2001; Kuldell, 2003;
Oliver-Hoyo, 2003; Paulson, 2001; Shibley et al., 2001; Whelan & Zare, 2003).
Nonetheless, most science students, including chemistry students (at least
in the United States), typically learn to write for the discipline in an ad hoc
manner (Learning to speak and write, 2001). To reverse this trend, growing
numbers of chemists now view writing as 'integral to the process of doing and learning chemistry, rather than as a tangential activity' (Klein & Aller, 1998:
31). Numerous efforts, including the Write Like a Chemist project, aim to integrate more formalized writing instruction into chemistry curricula. Although
discussions of writing assessment in chemistry are limited in number, the
publications cited above reveal an interest in both students written expression
and the science content conveyed in their written work. This dual orientation
suggests that chemistry educators' views of writing assessment are similar to McNamara's notion of a strong performance assessment.
Despite the fact that chemists and (some) applied linguists may have similar
views about the utility of writing for assessment purposes, serious validity
concerns can arise in specific purposes language testing when the assessment
criteria (both linguistic and non-linguistic) developed by language-trained test
designers differ from those used by content-area specialists evaluating the same
types of performance (e.g. Douglas, 2000, 2001; Jacoby & McNamara, 1999;
Lumley, 1998). Although many of the assessment criteria mentioned in the
chemistry education literature reveal roots in traditional English composition
concerns (e.g. Gordon et al., 2001; Paulson, 2001; Shibley et al., 2001), in cases
like the Write Like a Chemist project, where content-area specialists (chemistry
faculty) serve as the raters of test-taker performance, there is a potential for a
mismatch between the ways in which content specialists interpret linguistically
focused assessment criteria and the meanings those same criteria hold for
applied linguists (see Brown, 1995; Smith, 2003a, 2003b). To minimize chances
for a mismatch, the project team understood the need for developing a shared
understanding of how both linguistic and non-linguistic features were to be
defined (both individually and relative to one another) for the purposes of
assessing student writing produced for the Write Like a Chemist project. This
need served to further underscore the necessity of close collaboration between
applied linguists and chemists at each stage of the assessment-design process.


3 Development of assessment criteria and scales


To explain the process of developing assessment criteria and scales, we discuss
contributions from interdisciplinary team members, the development of early
and more refined criteria and scales, and the challenges faced. We provide
a detailed discussion so that others can learn from our experiences and use
them as springboards for other interdisciplinary efforts. In identifying assessment criteria that could be adapted from general writing-assessment practices
and discipline-specific writing concerns, it was important from the outset to
consider the views of both chemists and applied linguists.
3.1 Chemistry faculty contributions
Chemistry team members, although not formally trained in writing assessment, had clear ideas about what features of chemistry writing were critical
for them. The chemists had a strong intuitive sense of what counted as good
writing but often experienced difficulty articulating the discoursal aspects of
effective chemistry writing. The chemists brought with them experience in
reading and writing in different chemistry genres and first-hand knowledge of
the writing conventions endorsed by the American Chemical Society (e.g. word
and phrase preferences; the formalization of extreme conciseness in writing;
formatting conventions for tables, figures, and schemes). Furthermore, the
chemistry faculty were familiar with the discipline-specific backgrounds of
junior-level students and the common challenges faced by chemistry students
when attempting to write in discipline-specific genres. They also understood
the frustrations frequently experienced by chemistry faculty when encountering students' writing.
3.2 Applied linguist contributions
The applied linguists, as a group, had decades of experience teaching writing
to native and non-native students, assessing writing, designing LSP courses,
conducting language analyses (though not with chemistry texts), and training
teachers to teach writing. They had an awareness of the general writing skills
needed for successful writing at advanced levels and knowledge of a range of
rating systems used for writing assessment.
The applied linguists drew upon the language testing literature, focusing
mainly on discussions of the creation and validation of rating scales for the
assessment of written performance tasks, as they worked with chemistry
colleagues to address a number of fundamental questions underlying the scale-design process:


(1) What assessment purposes are the rating scales meant to serve?
(2) How many levels of performance could reliably be distinguished?
(3) Should scores be weighted or not?
(4) How should the content and wording of scale descriptors be determined?
(For an overview of these and other key factors in scale design, see Bachman
& Palmer, 1996; Hudson, 2005; Lumley, 2002; McNamara, 1996; North &
Schneider, 1998; Weigle, 2002.) Despite the fact that these basic scale-design
issues are well documented, a specific challenge in this project was to identify
assessment criteria that were meaningful not only to the applied linguists, but
also (and more importantly) to the chemistry faculty who would ultimately be
responsible for assessing student performance.
3.3 Development of early criteria and scales
In the first two years of the project (before the nationwide pilot), the leading
chemist in the project developed a set of analytic rating scales for each major
writing assignment with the help of the applied linguistics graduate student
who co-taught the course. The scales consisted of a series of simple statements
reflecting analytic criteria (e.g. 'Uses correct grammar, tense, voice') and a maximum possible score for each criterion (see Appendix B).
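For readers who find it useful to see the shape of such a scale spelled out, the Python sketch below represents an early analytic scale as a simple mapping from criterion statements to maximum possible points, with a small scoring helper. It is an illustration written for this discussion rather than project software; the first two point values appear in Appendix B, while the remaining values and the helper function are assumptions.

# A minimal sketch (not project software) of an early analytic rating scale:
# each criterion is a simple statement paired with a maximum possible score.
# The first two point values appear in Appendix B; the rest are hypothetical.
journal_article_scale = {
    "Uses language and level of detail appropriate for an expert audience": 15,
    "Includes properly formatted tables, schemes, figures, and NMR data": 10,
    "Shows coherent organization within and between sections": 10,
    "Uses correct grammar, tense, voice": 10,
    "Provides clear and correct scientific information throughout the paper": 15,
}

def total_score(awarded):
    # Sum the points awarded per criterion, capping each at its maximum.
    return sum(min(points, journal_article_scale[criterion])
               for criterion, points in awarded.items())

# Hypothetical use: a paper earning partial credit on two criteria
print(total_score({
    "Uses correct grammar, tense, voice": 8,
    "Provides clear and correct scientific information throughout the paper": 12,
}))  # -> 20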
These early rating scales indicated the perceived importance of general academic and chemistry-specific writing conventions, as well as accurate science.
For example, in the rating scale for the journal article assignment, roughly half
of the grading criteria could be classified as general academic writing conventions (i.e. criteria that are common in university composition instruction):
(1) Shows coherent organization within and between sections, using appropriate transitional devices.
(2) Is properly formatted throughout, including in-text citations and references.
Other grading criteria reflected chemistry-specific writing conventions
(including conciseness, a feature that has become a formal convention of
chemistry writing) and chemists' understanding of standard academic writing
conventions:
(3) Includes properly formatted tables, schemes, figures, and NMR data, and
refers to them appropriately.


(4) Uses language and level of detail appropriate for an expert audience, with
special attention to wordiness.
Finally, the fact that chemists also view accurate subject-area knowledge as
essential in effective writing is reflected in an additional criterion as well as the
non-trivial number of points assigned to it:
(5) Provides clear and correct scientific information throughout the paper.
Although somewhat rudimentary, these early rating scales proved invaluable as
a starting point for more clearly articulated rating scales that would also hold
meaning for project-external audiences.
3.4 Development of more refined scales
The demands of a grant-supported project, as well as the fact that chemists
would be the evaluators of students' writing abilities, required the development
of more refined assessment scales, with distinct and meaningful score points
that chemistry faculty would find acceptable, understandable, and easy to
use. Building upon earlier efforts (described above), the applied linguists first
developed a general analytic rating scale that combined performance criteria
typical of general writing assessment and criteria thought to be specifically
applicable to chemistry writing (see Figure 1):
Criteria typical of general writing assessment:
Audience and level of detail
Organization and purpose
Fluency and mechanics

Criteria specifically applicable to chemistry writing:
Scientific writing conventions
Accuracy of science content

Figure 1: Criteria specified in an early general analytic rating scale

The immediate goal was to produce a working model for assessing the writing of chemistry students more generally, and then to systematize the rating
procedures for each major writing assignment in the course more specifically
(e.g. writing the methods section of a data-driven paper modeled after a
journal article). This early scale underwent numerous revisions, during which
chemistry colleagues played a central role in three areas of the ongoing scale-design process:
(1) Specifying the relative importance of discipline-specific subject matter
and writing conventions.


(2) Providing a chemist's interpretation of the meaning and importance of general writing assessment criteria (e.g. audience, level of detail, grammar, mechanics).
(3) Clarifying usage expectations for chemistry writing.
The third point above provides a ready illustration of the importance of chemist
input in an interdisciplinary effort like ours. In Write Like a Chemist materials,
discipline-specific content and writing usage conventions, as they manifest
themselves in the targeted chemistry genres (see Dodd, 1997), are featured
prominently. The chemists had to decide if scientific content and writing conventions should be assessed separately or incorporated with other characteristics
of good writing. For example, in the earlier rating scales (Appendix B), the use
of scientific abbreviations, superscripting, and subscripting were incorporated
into the same criterion as the use of correct grammar, tense, voice, and punctuation. As a group, we had to decide if that combination should remain or if
chemistry-specific usage should be separate from grammar and language issues.
As another example, we had to decide if misused scientific terminology (e.g.
oxidation, free radical) was to be assessed as part of correct scientific information (a separate criterion) or as part of a more general word-use criterion. As a
third example, chemistry team members had to decide which usage rules were
critical for the target genres and which could realistically be expected from
junior-level students upon their first introduction to professional genres. The
chemists, in accord with the standards codified by their profession in The ACS
Style Guide (Dodd, 1997), proved to be quite demanding in their expectations
of conventional usage rules emphasized in the Write Like a Chemist curriculum,
including the proper use of scientific plurals (e.g. apparatus, data, syntheses);
abbreviations (e.g. s, h, mL, mol, FTIR, UV-vis); capitalization (e.g. molecules,
compounds, and solvents are written in lowercase; positional words, even at
the beginning of sentences are written in lowercase and in italics: o [ortho],
m [meta], p [para]); and numbers and units (e.g. a space is expected between
numbers and units except with %: 300 K, 2.5 L, 35%).
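To make the mechanics of one such usage rule concrete, the toy checker below (an illustration written for this discussion, not a tool from the project) inserts the expected space between a number and its unit while closing up any space before a percent sign; the unit list is deliberately limited to a few of the abbreviations mentioned above.

import re

def fix_number_unit_spacing(text):
    # Close up any space before the percent sign (35 % -> 35%).
    text = re.sub(r"(\d)\s+%", r"\1%", text)
    # Insert a space between a digit and a following unit abbreviation (300K -> 300 K).
    return re.sub(r"(\d)(?=(K|mL|mol|L|h|s)\b)", r"\1 ", text)

print(fix_number_unit_spacing("heated to 300K for 2h, yield 35 %"))
# -> heated to 300 K for 2 h, yield 35%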
As the applied linguists attempted to separate general and chemistry-specific
criteria in their preliminary working drafts, the chemists played a leading role
in determining the wording of scale descriptors. Similarly, the incorporation
of criteria from general academic writing assessment had to be evaluated by
the chemists to provide insights into how those criteria might be interpreted
by a broader population of academic chemists. The rating scale in Appendix
C provides an illustration of how early assessment criteria (Figure 1) were
reconfigured (and reworded) by the applied linguists as a result of ongoing
discussions with the chemists. Some notable modifications included the
following:


(1) The use of the criterion Purpose (i.e. the author's goals for writing and
success in achieving those goals) was dropped from the Organization &
Purpose criterion because chemistry colleagues believed that this notion
could not be rated independently, but rather was reflected in a student's
performance across all other criteria.
(2) Fluency was removed from the Fluency & Mechanics criterion and
instead incorporated into a new and entirely separate Conciseness and
Fluency criterion, further exemplifying the value placed on the economic
use of words by chemists.
(3) The criterion Grammar was paired with Mechanics to capture surface-level syntactic and punctuation errors.
It is instructive to note here that grammar became a site of extended discussions. Although the Grammar & Mechanics criterion might be presumed to
include all linguistic structures, this criterion was influenced by how one might
view certain syntactic errors of a fairly general nature. For example, after some
months' debate, it was decided that students' use of passive voice would be
assessed as part of Scientific Conventions rather than Grammar & Mechanics
because what raters would assess would not be the correct formation of the
passive (not normally a problem for students at this level of instruction) but
rather its conventional use in the appropriate sections of the targeted genres
(e.g. describing laboratory procedures in the Methods section of a research
article; see also Swales, 1990, 2004).
Concurrent with discussions of overarching scoring criteria, the project
team worked together to draft and revise scale descriptors for each score
point in the various rating scales being developed.6 These discussions and the
continual fine-tuning of the rating scales (by chemistry faculty and applied
linguists, in turn) provided the project team with the insights needed to
finalize a more detailed set of analytic rating scales. A key element of the
applied linguists' work here was to ensure that the chemists developed a
sense of ownership of the descriptors and the resulting set of analytic rating
scales. Over time, the various rating scales made use of a fairly common set
of phrases (including the resurrection of Purpose) to describe scoring details:
audience and purpose, organization, writing conventions, grammar and
mechanics, and scientific content.


3.5 Challenges in the development process


For the purposes of project evaluation, the project team also needed to develop
a valid holistic rating system. One difficulty in developing the holistic scale
involved decisions about the number of scale points to include. As indicated
in Appendix C, some analytic scales that we had developed earlier had split
scoring that weighted criteria differently (with either 6 or 4 as top scores).
As a second issue, we had to decide on the number of criteria that would be
incorporated into each holistic score point. Earlier scales had used 5 different
scoring criteria (see Figure 1) while the scale in Appendix C used 6 different
scoring criteria. The key issue here was not to create a rigid semi-analytic set
of descriptors but to provide sufficient description of each holistic score point
so that a flexible interpretation could be made while scoring. A third issue that
we had to address was how to make descriptions for each holistic score point
distinct and easily interpretable for chemistry faculty. Using overly generic
statements for the relevant criteria at each score point would not provide
effective guidance for chemistry faculty who were generally not practiced in
holistic scoring of students' writing and who would inevitably vary in their
interpretations of overly general evaluation statements.
The applied linguists made a set of initial decisions for assembling a first
holistic rating scale based on their prior work with the analytic scales and
continuing conversations with the chemists. For example, they decided to
build a holistic scale on 6 score points, knowing that a 4-point scale would not
provide sufficient variability in scoring and a 5-point scale would too easily
draw scorers to middle-of-the-road scoring (as an average 3 rating). At the same
time, the chemistry team members felt that 6 levels of writing performance
could be distinguished. Subsequent decisions about the holistic scale were
made during the more formal socialization stages of our collaborative work
(described in the next section).

4 Socialization process
The need for a valid holistic assessment scale, along with a set of anchor
benchmark papers at each scale point, led to an extended socialization
process with the chemists. This section outlines steps taken by the applied
linguists to socialize the chemists in their rating decisions and to refine the
holistic rating scale (while incorporating the descriptive language used by the
chemists). We provide a detailed accounting of the socialization process in
part as a response to the paucity of such discussions in the applied linguistics
literature. It is our assumption that applied linguists who engage in other
interdisciplinary efforts will be able to adapt many of the ideas and procedures presented here. Our detailed discussion may also serve as valuable


supporting evidence for those trying to persuade university administrators and/or supporting institutions of the viability of cross-disciplinary efforts
and the role of socialization within them.
The chemists on the project had little experience rating students' writing
holistically. As a group, the applied linguists thought that the chemists would
be able to contribute more to conversations about assessment and agree upon
descriptors for a common scale if they went through a series of student-paper
rating exercises. These exercises became an essential component of our more
formalized three-step socialization process:
Step I: An extended set of tasks requiring chemistry colleagues to read and
assess students journal article papers holistically.
Step II: Discussions of ways to transform comments made about student writing into more accurate rating scale descriptors.
Step III: Agreement on benchmark student journal article papers for each
proficiency band on the 6-point scale.
It was hoped that one outcome of the process would be the development of a
set of consistent terminology that could be used by the chemistry faculty and
applied linguists to describe features of different papers and come to agreement
on holistic scoring guide descriptors.
This socialization process entailed a series of meetings of the entire project
team. As a preliminary step to the process, one chemist and one applied linguist
(team teachers earlier in the project) separated 50 student papers (collected
over a period of five semesters) into three distinct groups (representing high,
intermediate, and low writing proficiency).7 The major task for the subsequent
five meetings, then, was to engage the chemists in the lengthy process of reading
the majority of these papers, determining which were stronger or weaker within
and across the initial groupings, and finally discussing what aspects of the
students' writing led to decisions about the strengths and weaknesses of each
text.8 The chemists' comments on students' writing were recorded, eventually
providing the information needed to fine-tune a strong set of descriptors for
each point on the scale.
The applied linguists developed a general plan for the five meetings:
(1) Distinguish higher- from lower-level writing samples within an initially
labeled middle band of student papers, thereby creating distinctions
between 4- and 3-rated papers on the proposed 6-point scale. This was
considered an important starting point because many holistic rating
scales do not distinguish the middle bands well, even though they are the
most common scores typically assigned (Henning & Davidson, 1987).


(2) Sort the high band of papers to see if distinctions could be made, and
described, between papers rated at 5 and 6.
(3) Identify distinctions within the lowest band of papers, basically separating papers rated as 1s and 2s.
Prior to the first meeting, 11 papers were selected from the mid-range group of
previously sorted journal article papers to establish an initial sample of potential
3s and 4s. The chemists were asked to read 6 of the 11 papers before coming to
the first meeting. They were instructed to read each paper, decide if it should
be rated as a 3 or a 4 (out of a possible, but as yet undefined 6 points), and jot
down comments on aspects of the paper that helped with scoring decisions, all
within a 10-minute per paper limit (if possible). They were asked to refrain from
consulting rating scales that had been created earlier in the project and not to
confer with one another to ensure that diverse views would be heard.
The first meeting began by recording each rater's scores for three of the six
papers read before the session. Each paper was discussed and raters were asked
why they had assigned the scores that they had. Comments were noted on the
blackboard for everyone to see and recorded for future reference. This sequence
was repeated for the remaining three papers read before the meeting. The five
remaining (unread) papers were then distributed and the read-score-record-discuss cycle was repeated; the raters were given 30 minutes to read and rate all
five papers. The imposition of such a tight timeline (essentially six minutes per
paper) stemmed from the fact that the chemists had related that they had spent
around 20 minutes to rate each 5-6 page paper assigned for the first meeting,
even though they had been instructed to read the papers in the most holistic
fashion (to decide whether the papers were strong or weak samples within the
group).9 The imposed time limit helped the chemistry raters learn, over time,
to read student papers more holistically. As a result of the first meeting, we
began to identify not only papers that could be used as benchmarks, but also
the characteristics of papers in this middle score range that could be used to
refine scale descriptors.
In the second training session, we repeated the read-score-record-discuss
sequence to identify benchmarks and characteristic features for the upper half
of the score bands (4, 5, and 6 scores). The chemistry raters spent a great deal
of time reaching agreement on the qualities of papers falling at the highest end
of the scale. Although the raters identified several papers that were potential
benchmark 4s during the second rating session, just one newly introduced
paper was deemed worthy of a 5. (Another paper that had been assigned a 5
in the first session was reintroduced in the second session to see if its earlier
score would stand. It did.) None of the papers introduced in this session were
thought to be representative of the top band of the 6-point scale. By the end


of the first and second training sessions, we had succeeded in identifying a number of mid-range papers (3s and 4s) and corresponding descriptors for the
holistic scale but only two benchmark 5s and no benchmark 6s.
Because of the difficulties experienced trying to define the highest end of
the 6-point scale, the third meeting was devoted to identifying papers characteristic of strong performance. Session three resulted in a number of potential
benchmark 5 papers, but once again no 6s were identified. Our failure to find
a benchmark 6 raised two concerns: that our 6-point scale might be unworkable (i.e. identifying six distinctive levels of performance was not possible) or
alternately, that the level of performance imagined by the chemists to be worthy
of a 6 was unattainable by the target student population. There were indications,
during group discussions, that the chemists had begun to interpret the highest
scale point as a publishable paper rather than the work of junior-level chemistry
students learning to write professionally for the first time.
In our fourth session, we attempted to address these concerns by focusing
solely on those papers that had received the highest ratings in previous sessions
(high 4s and 5s). In this session, the process was altered slightly by asking the
chemists to decide only if a given paper could be classified as high or low within
the set of eight papers that they were asked to reread prior to the meeting.
Following these guidelines, the raters reached consensus on one paper that
they believed to be representative of the very highest performance that could
be expected from junior-level chemistry students.
In the fifth and final session, we focused primarily on the lower half of the
score bands (1s, 2s, and 3s). Although there were still some ongoing difficulties
in distinguishing some papers rated 3 or 4, and other papers rated 3 or 2, the
raters felt that these lower bands were generally more distinct than the higher
levels. Perhaps not surprisingly, papers at the lowest end of the scale (1s and 2s)
proved to be much easier to identify, largely because this task basically entailed
separating student writers who were merely weak from those who were unable
(or unwilling) to do the assignment. From the perspective of the chemists, the
scale descriptors for the lower points in the scale were also easier to describe
and reach reasonable consensus on.
By the conclusion of the fifth session, the entire range of band distinctions
had become more clearly defined, and the holistic rating scale took on a strong
multi-dimensional quality. The holistic descriptors at each band level became
regularized in line with the more important characteristics of the analytic
scales, while still retaining the flexibility and consistent descriptive scheme of
a holistic scale (see Appendix D). By the end of the process, the chemists had
learned how to apply these newly articulated criteria in a more efficient manner,
resulting in a much less labor-intensive process of reading a student paper and
arriving at a consistent score.


In fact, it was sustained input from the chemists that allowed us to disambiguate the different levels of student performance. The applied linguists' role
during the socialization process intentionally remained that of a guide rather
than an authority on language testing. The applied linguists interceded only
when necessary to keep discussion moving and to facilitate more time-efficient
scoring lest the chemists lose motivation to continue participating in this
aspect of the project. For similar reasons, the applied linguists refrained from
offering any theoretical insights into the nature or quality of student writing
during these sessions. Instead, the focus was on the chemists' ways of making
sense of and talking about students' writing performance.

5 Validation of assessment instrument with other chemists


A second step for validating the holistic scale involved its use with additional
groups of chemists. The holistic scale created by means of the extended
socialization process (described above) was used as the primary evaluation
mechanism by chemistry faculty piloting Write Like a Chemist materials and
by those serving as external evaluators of pilot student writing.
In preparation for the first academic year pilot at other universities, seven
chemistry instructors attended a two-day training session organized to familiarize them with the Write Like a Chemist pedagogical approach and socialize
them for the purposes of assessing student writing in much the same way that
chemistry team members were socialized (albeit within a much more condensed timeframe). At the end of the first day, after they had been introduced
to Write Like a Chemist materials, pilot instructors were given a copy of the
holistic rating scale (Appendix D) and six student papers selected from the set
of previously identified benchmarks. For homework, pilot faculty were asked to
review the rating scale and then read over each of the six papers, one at a time,
in any order. They were advised to take about ten minutes to read each paper
and to feel free to identify notable features (strong or weak) by underlining,
circling, or jotting down margin notes. At the end of ten minutes, they were to
assign the paper a 1-6 score. They returned the next day with papers rated and
rationales for assigned scores noted.
The rater training session followed a pattern similar to that established earlier for project team members (see above), with individual instructors taking
turns to report the score they had assigned to a given paper, followed by a
whole group discussion of the strengths and weaknesses of that paper. After
addressing the six papers read before the session, four additional papers were
distributed, followed by the same read-score-record-discuss cycle. The rater
training session went well, in terms of having raters largely agreeing on the
scoring of sample student papers by the end of the workshop. Using Fisher's Z-transformation to determine the average agreement among the seven raters, a Pearson correlation (r) of .71 was obtained,10 which was considered acceptable
for a group so varied in terms of training and experience in writing pedagogy
and assessment.
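As a point of reference for readers less familiar with the procedure, the sketch below shows one way such averaging could be carried out: compute a Pearson correlation for every pair of raters, convert each to Fisher's Z, average the Z values, and back-transform the mean. The small score matrix is invented purely for illustration; it is not the pilot data, and the sketch is not the project's analysis script.

import numpy as np
from itertools import combinations

def average_interrater_correlation(scores):
    # scores: array of shape (n_raters, n_papers) holding holistic ratings.
    z_values = []
    for i, j in combinations(range(scores.shape[0]), 2):
        r = np.corrcoef(scores[i], scores[j])[0, 1]  # Pearson r for one rater pair
        z_values.append(np.arctanh(r))               # Fisher's Z-transformation
    return np.tanh(np.mean(z_values))                # back-transform the mean Z

# Hypothetical example: three raters scoring ten papers on the 6-point scale
ratings = np.array([
    [3, 4, 2, 5, 6, 1, 4, 3, 5, 2],
    [3, 5, 2, 4, 6, 2, 4, 3, 5, 1],
    [4, 4, 3, 5, 5, 1, 3, 4, 6, 2],
])
print(round(float(average_interrater_correlation(ratings)), 2))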
At the end of the first pilot year, the holistic rating scale was used by another
group of eight chemistry faculty who had been recruited to evaluate student
writing produced in the pilot classes. The procedures followed to train these
eight raters were largely the same as those outlined above. During the formal
scoring session, each student paper (N = 88) was independently scored by two
raters. Scores were deemed to be acceptable as long as the two raters' marks
were within one point of each other (on the six-point scale); in instances where
the raters' marks differed by more than one point (n = 24), the lead chemist
from the project team stepped in as the master rater. If the score assigned by the
master rater matched one of the other raters' scores (15 cases), that number was
taken as the paper's true score. In cases where the master rater's score differed
from the other two ratings assigned, the master rater's score was (a) taken as
the true score (when the master rater's score was the average of the other two
scores, which happened in 3 cases) or (b) averaged with the next closest score
(6 cases). Using Spearman's correlation coefficient for ranked data, interrater
reliability for the scores assigned was rS = .85, indicating that the raters were
consistent during the external evaluation session.
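The adjudication procedure just described can be summarized in a short sketch. The function below follows the rules reported in this paragraph, with one labeled assumption not stated in the article: when the two original ratings fall within one point of each other, their average is taken as the recorded score. Names and structure are illustrative only.

from statistics import mean

def resolve_paper_score(r1, r2, master):
    # r1, r2: the two independent holistic ratings (1-6); master: the lead
    # chemist's score, consulted only when r1 and r2 differ by more than one point.
    if abs(r1 - r2) <= 1:
        return mean([r1, r2])        # acceptable agreement (assumed: record the average)
    if master in (r1, r2):
        return master                # master matches one rating: take it as the true score
    if master == mean([r1, r2]):
        return master                # master's score is the average of the two ratings
    closest = min((r1, r2), key=lambda r: abs(r - master))
    return mean([master, closest])   # otherwise average master with the next closest score

# Hypothetical example: ratings of 3 and 5 disagree by two points; the master assigns 4
print(resolve_paper_score(3, 5, master=4))  # -> 4 (master's score equals the average)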
Both the pilot faculty and external evaluators, like the chemists on the project
team, had had little if any experience with holistic scoring prior to the rater
training sessions. The training sessions, although short in duration compared
to earlier socialization sessions, succeeded in assisting pilot faculty and external
evaluators to articulate what is important in student writing, what to prioritize
in student writing assessment, and what characteristics are representative of
effective writing in chemistry. By the end of the sessions, they found the holistic
scoring guide descriptors to be useful and were able to apply them consistently
while discussing student papers.
A chain of additional outcomes resulted from the two rater training sessions
described above. Although the pilot instructors and external evaluators had
grown accustomed to using the holistic rating scale, feedback from some of
them spurred a renewed interest in analytic rating scales amongst project team
members as a means to provide more detailed feedback to students. As a result,
we undertook to (re)design the analytic rating scales developed earlier. This
time around, the process of revising the scale descriptors was left largely in
the hands of the chemists on the project team. Although drafts of the various
scales were circulated among all team members, by this point in the process, the
chemists seemed to have the best idea of what characteristics they were looking
for at each level of student performance and they often passed drafts between


themselves before sending them on to the rest of the group for a final review. A
key modification to note in the revised analytic rating scale (see Appendix E)
involves the chemists' decision to quantify the number of errors permissible at
each score point, a decision that applied linguists may at first glance find too
restrictive, but which may be a logical outcome of the chemists' disciplinary
culture, with its exacting standards of precision and accuracy.11 That the chemists felt comfortable crafting an analytic scale in this way highlights a key goal
of the assessment process undertaken. The chemists on the team had become
skilled not only in evaluating the writing of the chemistry students, but also
at explaining and refining specific criteria to evaluate student writing in their
content-area domain.

6 Conclusion
Through our involvement in designing assessment instruments for the Write
Like a Chemist project, we have learned a number of valuable lessons about
interdisciplinary collaboration that have implications for applied linguists
working with content experts from other disciplines. As mentioned in our
introduction, our experience has confirmed the importance of a long-term
association, rather than a short-term consultation, between content-area specialists and applied linguists. Sufficient time must be allotted for planning,
discussions, negotiations of meaning, and the rethinking of original plans. Time
should also be set aside to allow participants to formulate and voice opinions,
consider the views of others, and then reformulate opinions in light of others
views. The process of developing resources and instruments for other-group
needs is by necessity a dynamic one, and therefore one deserving of a careful,
measured approach. Time and (seemingly infinite) patience are needed for
the drafting and redrafting of relevant documents, with additional time set
aside to debate single words and phrases, if necessary. Equally important is
the establishment of agreed-upon procedures and timelines, accompanied by
the flexibility and willingness of all participants to make changes in response
to unanticipated occurrences.
Our project had one central lesson for the applied linguists in particular. What
applied linguists assume is characteristic of effective language use might not be
viewed as equally relevant by content-area specialists. In our case, the applied
linguists had to remain open to negotiating a shared understanding of writing
assessment criteria, rather than presuming that, as applied linguists, they had a
monopoly on expertise in this area. Our collaboration led to an explicit shared
vocabulary that all participants felt comfortable using in project discussions.
The need for shared terminology pertained not only to the technical aspects of
assessment, but also to descriptions of student performance. By approaching


interdisciplinary projects with an intentionally open mindset, those interested in such collaborative efforts can learn to appreciate the different orientations
and expectations about language use that exist across the disciplines. Because
content-area specialists may not know the terminology of language description,
pedagogy, or assessment, they may benefit from a formal and systematically
planned socialization process before they are able to give detailed insights
into discipline-specific language use. At the same time, applied linguists need
to be conscious of remaining grounded in their approach to collaboration;
steeping interdisciplinary discussions in theory-laden terms relevant only to
applied linguists may ultimately serve to alienate erstwhile collaborators and
undermine the overall success of an interdisciplinary project.
From the outset of our project, we have been conscious of the fact that successful interdisciplinary collaboration requires that applied linguists demonstrate
the value of their contributions and also make contributions that directly
support the goals and needs of the other group. Essentially, applied linguists,
as mediators between language-based issues and the needs of a specific group,
must become subservient to the goals of the other group if they are to be successful in their cross-disciplinary efforts. In this vein, the importance of building
upon the past efforts of the other group, rather than starting from scratch, can
be critical as well; making use of early documents not only assigns some value
to those efforts, but also provides a useful springboard for subsequent discussion, negotiation, and work. Early documents reveal important information,
including disciplinary realities, knowledge, priorities, and terminology, as well
as what might be perceived as omissions.
That our interdisciplinary team functioned so smoothly, with few if any
conflicts, in achieving the aims set forth in this article merits formal analysis.
Informal reflection reveals numerous factors (apart from personality attributes, good luck, and complementary work habits, which cannot be discounted) with implications for other interdisciplinary projects. Important from the start was
the fact that the project, by definition, required the expertise of both chemists
and applied linguists. Both groups of participants developed a sense of ownership in the project, understanding that success could be achieved only through
equally valuable but quite diverse contributions from both groups.
Equally important was the clearly articulated understanding, from the start,
that the end users of our pedagogical approach, textbook, and assessment
instruments would be chemists and chemistry students. Recognizing our target
audience from the outset greatly influenced our short-term and long-term
goals, including our efforts to demystify chemical discourse (see Stoller et al.,
2005). Perhaps most significant was the fact that the chemists approached the
applied linguists, essentially initiating the interdisciplinary alliance, unlike
many interdisciplinary projects that are initiated by applied linguists. The
end result of our successful collaboration demonstrates the value of mixing
disciplinary discourses (Hyland, 2004) and interdiscursivity (Candlin & Maley,
1997).
From an even broader perspective, the processes and outcomes reported
here emphasize the value of crossing disciplinary boundaries and all that
accompanies the crossing, including the consideration of new perspectives,
the melding of diverse realities, engagement in joint problematization, and the
transformation and recontextualization of professional practices (see Candlin
& Sarangi, 2004; Sarangi & Candlin, 2003; Sarangi & Roberts, 1999). Although
partnerships between applied linguists and chemists are relatively rare, the
partnership showcased here demonstrates the power of reciprocity and the
willingness to forge new paths for the benefit of (at least) two disciplines.

Appendix A
Boesten, W. H. J., Seerden, J.-P. G., de Lange, B., Dielemans, H. J. A., Elsenberg, H. L. M.,
Kaptein, B., Moody, H. M., Kellogg, R. M. and Broxterman, Q. B. (2001) Asymmetric
Strecker synthesis of α-amino acids via a crystallization-induced asymmetric transformation using (R)-phenylglycine amide as chiral auxiliary. Organic Letters 3: 1121–4.
Dellinger, B., Pryor, W. A., Cueto, R., Squadrito, G. L., Hegde, V. and Deutsch, W. A.
(2001) Role of free radicals in the toxicity of airborne fine particulate matter. Chemical
Research in Toxicology 14: 1371–7.
Demko, Z. P. and Sharpless, K. B. (2001) Preparation of 5-substituted 1H-tetrazoles from
nitriles in water. The Journal of Organic Chemistry 66: 7945–50.
Jozefaciuk, G., Muranyi, A. and Fenyvesi, E. (2003) Effect of randomly methylated cyclodextrin on physical properties of soils. Environmental Science & Technology 37:
3012–7.
Llompart, M., Pazos, M., Landn, P. and Cela, R. (2001) Determination of polychlorinated
biphenyls in milk samples by saponification-solid-phase microextraction. Analytical
Chemistry 73: 5858–65.
Plaper, A., Jenko-Brinovec, S., Premzl, A., Kos, J. and Raspor, P. (2002) Genotoxicity
of trivalent chromium in bacterial cells. Possible effects on DNA topology. Chemical
Research in Toxicology 15: 943–9.
Vesely, P., Lusk, L., Basarova, G., Seabrooks, J. and Ryder, D. (2003) Analysis of aldehydes
in beer using solid-phase microextraction with on-fiber derivatization and gas chromatography/mass spectrometry. Journal of Agricultural and Food Chemistry 51: 6941–4.


Appendix B
Grading Criteria for Journal Article Paper (Possible Points; Your Score)

• Uses language and level of detail appropriate for an expert audience, with special attention to wordiness: 15
• Includes properly formatted tables, schemes, figures, and NMR data, and refers to them appropriately: 10
• Shows coherent organization within and between sections, using appropriate transitional devices: 10
• Provides clear and correct scientific information throughout the paper: 15
• Each area of the paper includes appropriate information (e.g. the Introduction gives background information and addresses the importance of the work, while the Discussion interprets the results and suggests future research): 15
• Uses correct grammar, tense, voice, punctuation, scientific abbreviation, superscripting and subscripting, etc.: 10
• Is free of surface errors (typos, misspelled or misused words): 10
• Includes an informative title
• Is properly formatted throughout, including in-text citations and references: 10
Total: 100

Appendix C
Analytic rating scale designed by applied linguists in response to earlier scale (Appendix B). Descriptors for each category are listed from strongest to weakest.

Organization of Text
• Move structure followed in all key sections. Logical progression of ideas throughout.
• Move structure followed in all key sections. Logical progression of ideas through majority of text.
• Move structure followed in most key sections. Progression of ideas unclear in a few instances.
• Move structure followed in most key sections. Progression of ideas unclear in several instances.
• Move structure is not followed in several key sections. Progression of ideas unclear in many instances.
• Move structure is not followed. Progression of ideas unclear throughout.

Conciseness & Fluency
• Wording is concise & appropriate throughout. Conceptual parallelism maintained throughout.
• Wording is concise & appropriate in almost all instances. Conceptual parallelism maintained in majority of text.
• Wording is concise & appropriate in most instances. Conceptual parallelism maintained in majority of text.
• Wording is inconcise & inappropriate in some instances. Conceptual parallelism violated several times.
• Wording is inconcise & inappropriate in many instances. Conceptual parallelism violated frequently. Reader distracted from content.
• Wording is inconcise & inappropriate throughout. Concepts are not parallel. Content difficult to understand as a result of these errors.

Science Content
• Text & graphics work together to present science in correct, clear, & logical manner. Appropriate use of terminology throughout.
• Text & graphics present science in correct, clear, & logical manner. Appropriate use of terminology in majority of text.
• Presentation of science mostly correct, clear, & logical. Some inconsistencies in use of terminology.
• Presentation of science mostly correct, clear, & logical. Several inconsistencies in use of terminology.
• Incorrect, unclear, or illogical presentation of science in many sections. Use of terminology often questionable.
• Incorrect, unclear, & illogical presentation of science throughout. Inappropriate use of terminology throughout. Would be dismissed by readers.

Audience
• Level of detail always appropriate. Level of formality always appropriate.
• Too much/too little detail in 1-2 key sections. Level of formality occasionally inappropriate.
• Too much/too little detail in several key sections. Level of formality often inappropriate.
• Too much/too little detail throughout. Level of formality inappropriate throughout.

Grammar & Mechanics
• No grammatical errors. No mechanical errors.
• Few grammatical errors and/or few mechanical errors. Reader not distracted from content.
• Frequent grammatical errors and/or frequent mechanical errors. Reader distracted from content.
• Constant grammatical errors and/or constant mechanical errors. Content difficult to understand as a result of these errors.

Scientific Conventions
• No notable errors.
• A few relatively minor errors. Overall impression not affected.
• Several errors. Overall impression adversely affected.
• Many errors. Scientific value of paper would be disregarded.

Appendix D
Holistic rating scale developed after socialization sessions.

Presentation of science is correct, clear, logical, and sophisticated for course level.
Move structures are followed correctly in all sections.
Writing flows; wording is concise and appropriate.
Graphics are correctly formatted and seamlessly integrated with the text.
Few, if any, grammatical or mechanical errors are present.
Few, if any, errors are made in the use of scientific conventions or terminology.

Presentation of science is mostly correct, clear, and logical, but lacks some sophistication.
Move structures are followed in all sections, though some moves are underdeveloped or problematic.
Writing is generally concise and appropriate, though some areas are wordy or awkward.
Graphics are formatted correctly, but may not be well integrated with the text.
A handful of grammatical or mechanical errors are present.
A handful of errors are made in the use of scientific conventions and/or terminology.

Presentation of science is generally correct, but often lacks clarity and/or sophistication.
Move structures are generally followed, though some moves are missing or misplaced.
Writing is often wordy and awkward.
Graphics contain some formatting errors and are not well integrated with the text.
Grammatical and mechanical errors are noticeable and, at times, distracting.
Errors in scientific conventions and/or terminology are noticeable and, at times, distracting.

Presentation of science evinces some lack of understanding on the part of the author.
Move structures are generally followed, but several are missing, misplaced, and/or underdeveloped.
Writing is consistently wordy and awkward.
Graphics contain some formatting errors and are poorly integrated with the text.
Grammatical and mechanical errors regularly cause the reader distraction.
Errors in scientific conventions and/or terminology are frequent and regularly cause the reader distraction.

Presentation of science evinces general lack of understanding on the part of the author.
An attempt has been made to follow move structures, but most are missing, misplaced, and/or underdeveloped.
Writing is uniformly wordy and awkward.
Graphics are poorly designed and formatted, and disconnected from the text.
Grammatical and mechanical errors cause the reader serious distraction.
Errors in scientific conventions and/or terminology are very frequent and cause the reader serious distraction.

The author does not understand the science and therefore cannot present it in any meaningful way.
The author has not attempted to follow move structures.
Problems with wording and terminology are the rule, rather than the exception.
Graphics, if used, are poorly designed and formatted, and disconnected from the text.
Grammatical and mechanical errors make reading the text difficult.
Errors in scientific conventions and/or terminology predominate and make understanding the paper difficult.

Appendix E
Analytic rating scale fine-tuned primarily by chemistry team members. Descriptors are given for each score level, from 6 (strongest) to 1 (weakest).

Score 6 (strongest)
• Audience & Purpose: Wording is clear and concise. Level of detail, writing style, & formality are appropriate for an expert and/or scientific audience.
• Organization: All (sub)moves are present, fully developed, and in the correct sequence. No extra moves are present.
• Writing Conventions: Few (1-2), if any, errors are made in the use of writing conventions.
• Grammar & Mechanics: Few (1-2), if any, grammatical or mechanical errors are present.
• Scientific Content: Presentation of science is complete, correct, clear, and logical. Level of science conveys an understanding that is sophisticated for course level.

Score 5
• Audience & Purpose: Wordiness and/or errors in level of detail, style, or formality occur in a handful (2-3) of instances.
• Organization: All (sub)moves are present, but one is out of sequence or has minor problems. No extra moves are present.
• Writing Conventions: A handful (3-5) of errors are made in the use of writing conventions.
• Grammar & Mechanics: A handful (3-5) of grammatical or mechanical errors are present.
• Scientific Content: Presentation of science is generally complete and correct, but one element is missing, problematic, or weakly developed.

Score 4
• Audience & Purpose: Wordiness and/or errors in level of detail, style, or formality are noticeable (4-5) and, at times, distracting.
• Organization: All (sub)moves are present, but a few have minor problems or are out of sequence. Extra moves may be present.
• Writing Conventions: Errors in writing conventions are noticeable (6-8) and, at times, distracting.
• Grammar & Mechanics: Grammatical and mechanical errors are noticeable (6-8) and, at times, distracting.
• Scientific Content: Presentation of science is generally correct, but two elements are missing, problematic, or weakly developed.

Score 3
• Audience & Purpose: Wordiness and/or errors in level of detail, style, or formality are frequent (6-7) and regularly distracting.
• Organization: One (sub)move is missing or underdeveloped. (Sub)moves may be out of sequence; extra moves may be present.
• Writing Conventions: Errors in writing conventions are frequent (9-10) and regularly distracting.
• Grammar & Mechanics: Grammatical and mechanical errors are frequent (9-10) and regularly distracting.
• Scientific Content: Presentation of science contains several errors. Three elements are missing, problematic, or weakly developed.

Score 2
• Audience & Purpose: Wordiness and/or inappropriate level of detail, style, or formality are consistent (8-9) and seriously distracting.
• Organization: Two (sub)moves are missing or underdeveloped. (Sub)moves may be out of sequence; extra moves may be present.
• Writing Conventions: Errors in writing conventions are consistent (11-12) and make the writing appear unprofessional.
• Grammar & Mechanics: Grammatical and mechanical errors are consistent (11-12) and seriously distracting.
• Scientific Content: Presentation of science is generally incorrect. Four elements are missing, problematic, or weakly developed.

Score 1 (weakest)
• Audience & Purpose: Wordiness and/or inappropriate level of detail, style, or formality are common (≥10) and cause the reader to dismiss the work.
• Organization: Three (sub)moves are missing or underdeveloped. (Sub)moves may be out of sequence; extra moves may be present.
• Writing Conventions: Errors in writing conventions are common (>12). The writing is unprofessional.
• Grammar & Mechanics: Grammatical and mechanical errors are common (>12) and limit the reader's ability to understand the material.
• Scientific Content: Presentation of science conveys little scientific understanding. Five elements are missing, problematic, or weakly developed.

Score: ___


Notes
1 The authors gratefully acknowledge James K. Jones (applied linguist) and Molly
Costanza-Robinson (chemist) for their participation in the process of creating and
validating instruments discussed in this article. We thank four anonymous reviewers and
the editors of JAL for their thoughtful comments on an earlier version of this paper.
2 Our work has been supported by the US National Science Foundation, with grants
(DUE 0087570 and DUE 0230913) received by authors Marin S. Robinson, Chemistry,
and Fredricka L. Stoller, Applied Linguistics, Northern Arizona University. Note that
any opinions, findings, conclusions, and recommendations expressed in this article are
those of the authors and do not necessarily reflect the views of the National Science
Foundation.
3 For an overview of Write Like a Chemist materials, see www4.nau.edu/chemwrite
4 Write Like a Chemist materials were piloted in eight US colleges and universities in
2004–2005 and are being piloted in another eight US institutions in 2005–2006.
5 The target language use domain concept is attributable to assessment research by
Bachman and Palmer (1996).
6 The wordings of scale descriptors were a source of continual discussion and revision.
In one 15-week period alone, approximately 45 versions of the different analytic scales
were generated and discussed, with changes to the wording of multiple descriptors being
made in each revision.
7 Initial classifications were based partially on the chemist's and applied linguist's recollections of each student's overall performance in the course, in large part because they
remembered students' paper topics.
8 The applied linguistics graduate student who team-taught the course in the first two years
of the project also participated in these sessions. Because we were primarily concerned
with the chemists' interpretations of student written performance, we report only their
views here.
9 This revelation translated into a concern about the practicality of the analytic rating
scales for different writing tasks that had been created in the months preceding the
socialization sessions.
10 Although interrater reliability for rank-order data is typically estimated using Spearman's
rho, use of Fisher's Z-transformation of Pearson product-moment correlations is
necessary when averaging agreement scores among three or more raters.
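For illustration only, a minimal Python sketch (not part of the original study) of the averaging procedure described in this note; the rater names and scores below are hypothetical:

import math
from itertools import combinations

def pearson_r(x, y):
    # Pearson product-moment correlation between two lists of scores
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mx) ** 2 for a in x))
           * math.sqrt(sum((b - my) ** 2 for b in y)))
    return num / den

def average_agreement(ratings_by_rater):
    # Transform each pairwise r with Fisher's Z (arctanh), average the Z values,
    # then back-transform the mean to the correlation scale (tanh)
    zs = [math.atanh(pearson_r(a, b))
          for a, b in combinations(ratings_by_rater.values(), 2)]
    return math.tanh(sum(zs) / len(zs))

# Hypothetical holistic scores from three raters on five student papers
ratings = {
    'rater_A': [5, 4, 3, 6, 2],
    'rater_B': [5, 3, 3, 5, 2],
    'rater_C': [6, 4, 2, 5, 3],
}
print(round(average_agreement(ratings), 3))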
11 At the time of writing this article, the chemists are reconsidering whether this approach
is practical or valid for writing assignments of varying lengths and complexity.


References
Bachman, L. and Palmer, A. (1996) Language Testing in Practice. New York: Oxford
University Press.
Beall, H. and Trimbur, J. (2001) A Short Guide to Writing about Chemistry. (Second
edition) New York: Longman.
Bressette, A. R. and Breton, G. W. (2001) Using writing to enhance the undergraduate
research experience. Journal of Chemical Education 78: 16267.
Brown, A. (1995) The effect of rater variables in the development of an occupation-specific language performance test. Language Testing 12: 1–15.
Candlin, C. N. and Candlin, S. (2003) Health care communication: a problematic site for
applied linguistics research. In M. McGroarty (ed.) Annual Review of Applied Linguistics
134–54. New York: Cambridge University Press.
Candlin, C. N. and Maley, Y. (1997) Intertextuality and interdiscursivity in the discourse
of alternative dispute resolution. In B.-L. Gunnarson, P. Linell and B. Nordberg (eds)
The Construction of Professional Discourse 201–22. London: Longman.
Candlin, C. N. and Sarangi, S. (2004) Making applied linguistics matter. Journal of Applied
Linguistics 1(1): 1–8.
Coppola, B. P. and Daniels, D. S. (1996) The role of written and verbal expression in
improving communication skills for students in an undergraduate chemistry program.
Language and Learning Across the Disciplines 1: 67–86.
Dodd, J. S. (ed.) (1997) The ACS Style Guide. (Second edition) Washington, DC: American
Chemical Society.
Douglas, D. (2000) Assessing Language for Specific Purposes. New York: Cambridge
University Press.
Douglas, D. (2001) Language for specific purposes assessment criteria: where do they
come from? Language Testing 18: 171–85.
Dudley-Evans, T. and St. John, M. J. (1998) Developments in English for Specific Purposes:
a multi-disciplinary approach. New York: Cambridge University Press.
Ebel, H. F., Bliefert, C. and Russey, W. E. (2001) The Art of Scientific Writing: from student
reports to professional publications in chemistry and related fields. (Second edition) New
York: John Wiley.
Ferris, D. R. and Hedgcock, J. S. (2005) Teaching ESL Composition: purpose, process, and
practice. (Second edition) Mahwah, NJ: Lawrence Erlbaum.
Flowerdew, J. (1993) Content-based language instruction in a tertiary setting. English for
Specific Purposes 12: 121–38.
Flowerdew, J. and Peacock, M. (eds) (2001a) Research Perspectives on English for Academic
Purposes. New York: Cambridge University Press.
Flowerdew, J. and Peacock, M. (2001b) The EAP curriculum: issues, methods, and
challenges. In J. Flowerdew and M. Peacock (eds) Research Perspectives on English for
Academic Purposes 177–94. New York: Cambridge University Press.


Gordon, N. R., Newton, T. A., Rhodes, G., Ricci, J. S., Stebbins, R. G. and Tracy, H. J.
(2001) Writing and computing across the USM chemistry curriculum. Journal of
Chemical Education 78: 535.
Hamp-Lyons, L. (2003) Writing teachers as assessors of writing. In B. Kroll (ed.) Second
Language Writing: research insights for the classroom. New York: Cambridge University
Press.
Henning, G. and Davidson, F. (1987) Scalar analysis of composition ratings. In K. M.
Bailey, T. L. Dale, and R. T. Clifford (eds) Language Testing Research: selected papers
from the 1986 colloquium. Monterey, CA: Defense Language Institute.
Hudson, T. (2005) Trends in assessment scales and criterion-referenced language assessment. In M. McGroarty (ed.) Annual Review of Applied Linguistics 205–27. New York:
Cambridge University Press.
Hyland, K. (2002) Teaching and Researching Writing. London: Longman.
Hyland, K. (2003) Second Language Writing. New York: Cambridge University Press.
Hyland, K. (2004) Disciplinary Discourses: social interactions in academic writing. Ann
Arbor: University of Michigan Press.
Jacoby, S. and McNamara, T. (1999) Locating competence. English for Specific Purposes 18:
213–41.
Johns, A. M. (1997) Text, Role, and Context: developing academic literacies. New York:
Cambridge University Press.
Klein, B. and Aller, B. M. (1998) Writing across the curriculum in college chemistry: a
practical bibliography. Language and Learning Across the Disciplines 2: 25–35.
Kovac, J. and Sherwood, D. W. (2001) Writing Across the Chemistry Curriculum: an
instructor's guide. New York: Prentice Hall College Division.
Kuldell, N. (2003) Read like a scientist to write like a scientist: using authentic literature in
the classroom. Journal of College Science Teaching XXXIII(2): 325.
Lumley, T. (1998) Perceptions of language-trained raters and occupational experts in
a test of occupational English language proficiency. English for Specific Purposes 17:
347–67.
Lumley, T. (2002) Assessment criteria in a large-scale writing test: what do they really
mean to the raters? Language Testing 19: 246–76.
McNamara, T. (1996) Measuring Second Language Performance. New York: Longman.
Nature (2001) Learning to speak and write. Nature 411: 1.
North, B. and Schneider, G. (1998) Scaling descriptors for language proficiency scales.
Language Testing 15: 217–63.
Oliver-Hoyo, M. T. (2003) Designing a written assignment to promote the use of critical
thinking skills in an introductory chemistry course. Journal of Chemical Education 80:
899–903.
Paulson, D. R. (2001) Writing for chemists: satisfying the CSU upper-division writing
requirement. Journal of Chemical Education 78: 1047–9.


Sarangi, S. and Candlin, C. N. (2003) Trading between reflexivity and relevance: new challenges for applied linguistics. Applied Linguistics 24: 271–85.
Sarangi, S. and Roberts, C. (eds) (1999) Talk, Work, and Institutional Order: discourse in
medical, mediation, and management settings. Berlin: Mouton de Gruyter.
Shibley, I. A., Milakofsky, L. M. and Nicotera, C. L. (2001) Incorporating a substantial
writing assignment into organic chemistry: library research, peer review, and assessment. Journal of Chemical Education 78: 503.
Smith, S. (2003a) The role of technical expertise in engineering and writing teachers'
evaluations of students' writing. Written Communication 20: 37–80.
Smith, S. (2003b) What is good technical communication? A comparison of the standards
of writing and engineering instructors. Technical Communication Quarterly 12: 7–24.
Stoller, F. L., Jones, J. K., Costanza-Robinson, M. S. and Robinson, M. S. (2005)
Demystifying disciplinary writing: a case study in the writing of chemistry. Across the
Disciplines: interdisciplinary perspectives on language, learning, and academic writing.
Retrieved 30 May 2005. http://wac.colostate.edu/atd/lds/stoller.cfm
Swales, J. M. (1990) Genre Analysis: English in academic and research settings. Cambridge:
Cambridge University Press.
Swales, J. M. (2004) Research Genres: exploration and applications. Cambridge: Cambridge
University Press.
Wardle, E. A. (2004) Can cross-disciplinary links help us teach academic discourse in
FYC? Across the Disciplines: interdisciplinary perspectives on language, learning, and
academic writing, 1. Retrieved 10 January 2004. http://wac.colostate.edu/atd/articles/
wardle2004/
Weigle, S. C. (2002) Assessing Writing. New York: Cambridge University Press.
Whelan, R. J. and Zare, R. N. (2003) Teaching effective communication in a writing-intensive analytical chemistry course. Journal of Chemical Education 80: 904–06.
