Article history:
Received 17 September 2016
Received in revised form 14 July 2017
Accepted 2 August 2017
Available online 13 September 2017

Keywords: Assignments; Teaching quality; Intellectual demand

Abstract. We investigate classroom assignments and resulting student work to identify important characteristics of assignments in terms of instructional quality and their validity as measures of teaching quality. We examine assignment quality within a large-scale project exploring multiple measures including classroom observations, teacher knowledge measures, and value-added estimates based on student achievement scores. Analyses included descriptive statistics, multivariate analyses to understand factors contributing to score variance, and correlational analyses exploring the relationship of assignment scores to other measures. Results indicate relatively low demand levels in all teacher assignments, a marked difference in score distributions for mathematics (math) and English language arts (ELA), and a substantial relationship between what was asked of and produced by students. Relationships between assignment scores, classroom characteristics, and other measures of teaching quality are examined for both domains. These findings help us understand the nature of and factors associated with assignment quality in terms of intellectual demand.

© 2017 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
http://dx.doi.org/10.1016/j.learninstruc.2017.08.001
J. Joyce et al. / Learning and Instruction 54 (2018) 48–61
tests, and in-class work. These artifacts are, in certain ways, more straightforward to interpret than are classroom observations. Specifically, assignment artifacts can make clear what is expected of students and how students respond to those expectations in ways that are not always observable within a set of classroom interactions (see Gitomer & Bell, 2013). This study also contributes further validity evidence (see Kane, 2013) for the interpretation of scores from an artifact protocol.

As part of a larger study of a broad set of measures of teaching quality, this study investigates the relationship of artifact scores to measures of classroom observations, teacher knowledge, and value-added measures based on standardized student achievement tests. Acknowledging that "no single measurement can capture the full range of teacher performance in different contexts or conditions" (Looney, 2011, p. 443), this research enables us to consider what artifacts can contribute to an understanding of teaching quality and how scores on artifacts are related to scores on other teaching quality measures.

As with classroom observations, quality of classroom assignments is a construct that has no absolute definition. Therefore, a given protocol provides a conceptual lens through which quality is defined (e.g., see Dwyer, 1998). For this work, we adopt the framework of authentic intellectual work (Newmann, Bryk, & Nagaoka, 2001). Originally proposed by Archibald and Newmann (1988), the framework characterizes the work students are typically asked to do as contrived and superficial and contrasts that with the kinds of work skilled adults often do. Authentic intellectual work is viewed as relatively complex and socially or personally meaningful.

Newmann et al. (2001) describe authentic intellectual work as having three distinctive characteristics. First, it involves the construction of knowledge: authentic work requires one to go beyond routine use of information and skills previously learned. Problem solvers must construct knowledge that involves "organizing, interpreting, evaluating, or synthesizing prior knowledge to solve new problems" (p. 14). The second characteristic of authentic intellectual work is disciplined inquiry, which involves the use of prior knowledge in a field, in-depth understanding, and elaborated communication. The final characterizing feature is value beyond school, the idea that work that people do authentically is intended to impact or influence others.

The principles of authentic work derive from philosophies and studies from constructivist traditions including Bruer (1993), Dewey (1916), Resnick (1987), and Wolf, Bixby, Glenn, and Gardner (1991). Support for these constructivist pedagogies, on which the authentic work framework is based, includes Carpenter, Fennema, Peterson, Chiang, and Loef (1989), Cobb et al. (1991), and Silver and Lane (1995). In this constructivist tradition, students engage with real-world problems that have legitimacy within their own experiences and that involve the structuring and restructuring of knowledge rather than simply reporting back information that they have reviewed.

A number of studies have provided empirical support. Newmann, Marks, and Gamoran (1996) studied a set of restructured schools that were designed around authentic intellectual engagement and related constructivist practices. They found that authentic pedagogical practice explained approximately 35% of the variance in student performance. D'Agostino (1996) studied compensatory (low-achieving) education third-grade classrooms and found a strong relationship between authentic instruction and math problem solving. Findings for reading comprehension were more ambiguous. Knapp, Shields, and Turnbull (1992) studied high-poverty schools and found that those classrooms that engaged in authentic practices of meaning making, disciplinary thinking, and connections with the real world produced students who were substantially stronger in their academic attainment.

The framework of authentic intellectual engagement is the foundation of the artifact protocol used in this study, the Intellectual Demand Assignment Protocol (IDAP) (Wenzel, Nagaoka, Morris, Billings, & Fendt, 2002). While other assignment protocols build on different frameworks, the assignment protocols cited in the literature all focus on some variation of intellectual demand.

Prior classroom assignment research has provided understanding of the intellectual demands that are placed on students, how students respond, and how these assignments affect student outcomes. Students respond to authentic work that is challenging, constructive, and relevant (American Institutes for Research, 2003; Beane, 2004; Daggett, 2005; Dowden, 2007; Milanowski, Heneman, & Kimbal, 2009; Ng, 2007; Paik, 2015; Prosser, 2006; Woolley, Rose, Orthner, Akos, & Jones-Sanpei, 2013). Intellectual demand has also been measured reliably (Borko, Stecher, Alonzo, Moncure, & McClam, 2005; Clare & Aschbacher, 2001; Matsumura, Garnier, Slater, & Boston, 2008) and is connected to student outcomes (Matsumura & Pascal, 2003; Mitchell et al., 2005; Newmann et al., 2001).

This study is situated within a larger validation effort of measures of teaching quality. Following Messick (1989) and Kane (2013), we investigate evidence of the extent to which scores from an artifact protocol support the appropriateness of inferences about teaching quality in middle school math and ELA classrooms. Adopting a theoretical framework of intellectual demand, this research seeks empirical support for the use of artifacts to make judgments of teaching quality by investigating the following questions:

1. How are scores representing assignment intellectual demand distributed for math and ELA?
2. What is the relationship between the intellectual demand of a given assignment and the student work produced in response?
3. What are the relationships between assignment intellectual demand and other measures of teaching?
4. How are assignment scores related to contextual variables including teacher characteristics, class demographics, schools, or districts?

1.2. Review of research on assignment quality as measures of classroom practice

Initial validation work of IDAP (Wenzel et al., 2002) provided evidence that IDAP scores could support inferences about the quality of classroom assignments in the Chicago Public Schools. Artifacts could be rated reliably, though there was some year-to-year drift. In addition, more challenging artifacts were associated with higher test scores (Newmann et al., 2001). Note, however, these initial validation studies used status scores rather than a value-added measure. They also did not include observation measures as alternative measures. Additional work supporting the validity of using IDAP was done by Mitchell et al. (2005), who also demonstrated that artifacts could be scored reliably and that scores were related to status achievement scores.

Studies using other assignment protocols have also examined the validity and reliability of classroom assignments. Matsumura and colleagues found that a reliable estimate of ELA classroom assignment quality could be attained with three assignments, that there was overlap among the scales, and that there was a relationship between assignment quality and other measures of teaching quality (Clare, 2000; Clare & Aschbacher, 2001; Clare, Valdes, Pascal, & Steinberg, 2001; Matsumura & Pascal, 2003; Matsumura et al., 2008). Similar work looking at middle school
2.2. Instruments

The study used the IDAP (Wenzel et al., 2002) to examine classroom assignments and associated student work within a framework that valued authentic intellectual engagement. Two separate protocols, one for math and one for ELA, were used in the current study. For math, intellectual demand was considered in terms of three dimensions: communication, conceptual understanding, and real-world connections. For ELA, intellectual demand dimensions included: communication, construction of knowledge, and real-world connections. Teacher assignments and student work were each evaluated separately, and the scales for each were similar but not identical. The structure of the scales for assignments and student work is presented in Table 1. The rubrics for each of the respective scales are available in Appendix A.

2.2.1. Raters, training, and scoring

Artifacts were scored within domains by raters who had expertise both in teaching math (n = 17) or ELA (n = 22) and in using scoring rubrics to score constructed responses for other standardized assessment programs. Five highly experienced individuals in constructed-response scoring served as scoring leaders for each domain, providing guidance and oversight throughout the scoring process. Artifacts were distributed and displayed to raters using a proprietary scoring tool. Exemplar selection and rater training were led by experts (one each for math and ELA), who were involved in the original development of the IDAP. Papers were chosen by the experts from the pool of artifacts submitted as part of the study to serve as exemplars for each rubric and each score point.

Training and certification followed the same process for each of the six rubrics in each domain. First, benchmark papers were introduced to illustrate each score point. Then, using a set of 6 training papers, raters first scored the artifact, followed by a discussion with the scoring trainer. Next, each rater independently completed a set of 6 certification papers and was required to assign the pre-determined score in at least four cases (with no deviations of more than 1 score point) in order to begin the actual scoring. Training took between 2.5 and 5 h per scale. During scoring of the assignments, typical and challenging assignments were intermixed and rated in the same session. For student work, only challenging assignments were involved. Scoring leaders regularly read scored samples to ensure reliability. Scoring of the entire sample on each scale took 1–3 days, depending on the complexity of the scale and the number of artifacts.

2.3. Other measures of teaching quality

In addition to the IDAP, the study included other measures associated with teaching quality.

2.3.1. Observation protocols

Each teacher in this study was observed four times, and each lesson was video-recorded. Three different protocols were used to score each of the four videos. All teachers were scored using the Framework for Teaching (Danielson, 2013) and CLASS™ (La Paro & Pianta, 2003). These are protocols designed to capture qualities of instruction, classroom environment, and interpersonal relationships across all subjects and grades. In addition, math teachers were scored using the subject-specific Mathematical Quality of Instruction (MQI) protocol (Learning Mathematics for Teaching Project, 2011), and ELA teachers were scored using the subject-specific Protocol for Language Arts Teaching Observation (PLATO) (Grossman et al., 2009). Additional details on the design, rater training, and collection of the observation protocol data are provided in Bell et al. (2012). The observation lessons were collected independently from the assignments, meaning that the foci of the two sources of evidence were not linked in terms of specific content being studied. The estimated reliability of the observation scores at the teacher level using four observations was approximately 0.65 (Bill & Melinda Gates Foundation, 2012). Reliability was estimated using a variance decomposition analysis that included teacher, section, lesson, rater, and residual factors.

2.3.2. Value-added modeling

VAM scores were computed at the teacher by class section level, using state standardized achievement tests. The dataset also included scores from the state accountability tests in math, ELA, reading, and science, longitudinally linked to individual students across grades. Finally, the data included student demographic and program participation information including race, gender, free or reduced price lunch status, English language learner (ELL) status, and indicators of participation in gifted or special education programs. VAM scores were computed using the latent regression model (Lockwood & McCaffrey, 2014), which regresses outcome test scores on teacher indicator variables, student background characteristics, and student prior test scores while accounting for the measurement error in the prior test scores. Reliabilities for VAM scores using a variance decomposition analysis were 0.89 for math and 0.80 for ELA (Lockwood, Savitsky, & McCaffrey, 2015).

2.3.3. Teacher knowledge

All math teachers took a version of the Mathematics Knowledge for Teaching (MKT) test (Hill, Schilling, & Ball, 2004). The 38-item instrument was designed to assess "the mathematics teachers use in classrooms" (Delaney, Ball, Hill, Schilling, & Zopf, 2008, p. 172) through items that ask questions covering instructional choices and analysis of student answers in the areas of number, operations, and content knowledge as well as patterns, functions, and algebra content knowledge. For ELA, teachers were administered an abridged form (30 items) of the Praxis® test used for ELA teacher certification. This instrument is intended to assess "… knowledge, skills, and abilities believed necessary for competent professional practice" (Educational Testing Service, 2015, p. 5). Reliabilities for teacher knowledge scores were 0.85 for math and 0.78 for ELA (Lockwood et al., 2015).
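The teacher-level reliabilities reported above were estimated from variance decomposition analyses. As a rough sketch of the underlying logic (not the study's actual computation, and with entirely made-up variance components), reliability can be expressed as the teacher variance divided by the variance of a teacher's averaged score, where the non-teacher components shrink as more lessons and sections are averaged:

```python
# Hypothetical variance components for an observation score; these values
# are invented for illustration and are not estimates from the study.
components = {
    "teacher": 0.40,   # stable teacher-level variance (the "signal")
    "section": 0.10,   # class-section variance
    "lesson": 0.25,    # lesson-to-lesson variance
    "rater": 0.05,     # rater variance
    "residual": 0.20,  # everything else
}

def teacher_score_reliability(c, n_lessons, n_sections=2):
    """Reliability of a teacher's mean score: teacher ("true") variance
    divided by the variance of the observed teacher mean.  Components
    other than teacher average away as more lessons/sections contribute."""
    signal = c["teacher"]
    noise = (c["section"] / n_sections
             + (c["lesson"] + c["rater"] + c["residual"]) / n_lessons)
    return signal / (signal + noise)

rel_4_lessons = teacher_score_reliability(components, n_lessons=4)
```

With these invented components, averaging over more lessons raises the reliability, which is why the study's estimate is tied to a specific number of observations.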
Table 1. Scale dimensions (scale range) for teacher assignments and student work in math and ELA.
Table 2. Variance components analysis for total assignment quality scores.

Component        Math Variance (%)   ELA Variance (%)
School           0.07 (6)            0.02 (1)
Teacher          0.04 (4)            0.03 (1)
Class Section    0.17 (15)           0.00 (0)
Assignment       0.81 (72)           2.39 (95)
Rater            0.03 (3)            0.08 (3)

Table 3. Teacher assignment raw and adjusted scores (SD).

               Math Raw      Math Adjusted   ELA Raw       ELA Adjusted
Typical        1.73 (0.41)   1.75 (0.42)     1.61 (0.61)   1.58 (0.59)
Challenging    1.89 (0.49)   1.90 (0.50)     2.03 (0.71)   2.05 (0.74)

Note. Range = 1–3.67. df (Typical) = 904; df (Challenging) = 451.
3.1.1. Math

Average intellectual demand scores are relatively low with respect to the overall scale (see Table 3). Fig. 2 shows a large proportion of scores below 2 on a scale that ranges from 1 to 3.67 for both typical and challenging assignments. Though the challenging assignments show higher intellectual demand, they are still relatively lacking in demand for substantial demonstration of
2 The total scores were modeled as follows: log(P_nmijk / P_nmij(k-1)) = B_n - D_i - C_j - F_k, where B_n is the overall quality of the assignment, D_i is the difficulty of the particular scale i, C_j is the severity of rater j, F_k is the threshold difficulty of scale rating k vs. k-1, P_nmijk is the probability of receiving rating k, and P_nmij(k-1) is the probability of receiving rating k-1.

Fig. 2. Math assignment score distributions.
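The rating model in the footnote above can be sketched numerically: given the adjacent-category log-odds B_n - D_i - C_j - F_k, the probability of each rating category follows from accumulating those log-odds. The parameter values below are invented for illustration; they are not estimates from the study, and the step-threshold parameterization is the standard many-facet rating scale form assumed here.

```python
import math

def rating_probabilities(b, d, c, thresholds):
    """Return P(rating = k) for k = 1..K, given assignment quality b, scale
    difficulty d, rater severity c, and step thresholds F_2..F_K (the
    lowest category has no step).  Built so that
    log(P_k / P_{k-1}) = b - d - c - F_k, as in the footnote's model."""
    # Cumulative sums of the adjacent log-odds give each category's
    # unnormalized log-probability, with category 1 as the reference.
    log_num = [0.0]
    for f in thresholds:
        log_num.append(log_num[-1] + (b - d - c - f))
    denom = sum(math.exp(v) for v in log_num)
    return [math.exp(v) / denom for v in log_num]

# Illustrative 3-category scale with invented parameters.
probs = rating_probabilities(b=1.0, d=0.3, c=0.2, thresholds=[-0.5, 0.8])
```

Raising b (assignment quality) shifts probability mass toward higher ratings, while raising c (rater severity) shifts it down, matching the signs in the footnote.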
3.1.3. Performance on individual scales

Table 4 presents information on each of the scales contributing to the overall intellectual demand score for both typical and challenging assignments in math and ELA. Differences between the typical and challenging assignments are also statistically significant at the scale level in both domains.

For math, all scale scores are slightly higher for challenging than for typical assignments. More challenging assignments are somewhat more likely to ask students to show their work or provide some types of explanation. The other scales have smaller differences for assignment type, although all score differences between challenging and typical assignments are statistically significant.

For ELA, score differences between typical and challenging assignments are larger. Challenging assignments are more likely to ask for some extended writing as a characteristic associated with rigor and construction of knowledge and more frequently ask students to take on a role in their writing. However, very rarely are these roles judged to be real world for the student (as described by the lack of scores of 3 and 4 on the relevance scale).

Table 5a and Table 5b present the correlations among scales for math and ELA, respectively. After aggregating scale scores at the teacher level and pooling across all assignments, the correlations were computed using Spearman's rank correlation Rho. The correlation reported in each series pools scores across all assignments. Analyses that included only typical or only challenging assignments yielded similar results. All correlations are statistically significant, yet moderate, supporting the idea that the IDAP dimensions are each measuring related yet unique constructs.

3.2. What is the relationship between assignment quality and resulting student work in math and ELA?

This analysis examines the relationship between the demands contained in classroom assignments and the corresponding work that students do to fulfill the assignment. Student work scores were regressed onto the teacher assignment ratings using hierarchical linear modeling (HLM; Raudenbush & Bryk, 2002) to describe the overall relationship of teacher assignment quality and quality of student work.

To better understand this relationship, we divided each scale into thirds and categorized each assignment as low, medium, or high and then reported the mean student work score for each category (see Table 6).

3.2.1. Math

For math, more than 2000 student artifacts were produced in response to low-demand assignments and were uniformly rated low in demand. As intellectual demand of the assignment increased, there is a slight increase in quality of associated student work, but student scores remain relatively low. The weak relationship was confirmed by regressing student work quality onto teacher assignment quality ratings with an HLM, which returned a standardized coefficient of only 0.23 (t = 15.5), with an R2 value of only 0.14. That is, math assignment ratings accounted for only 14% of the observed variance in student work scores. The most salient point, however, is the relative absence (~6%) of high-demand teacher assignments: assignments with scores that fell into the top third of the score scale. In fact, only 21 (<10%) of the 225 math teachers had even one assignment score in the high-demand category, and only two (<1%) teachers in the study had more than one assignment rated at that level.

3.2.2. ELA

For ELA, there is a noticeable shift in the quality of student work in response to increased demand in the assignment, with a marked change in distribution as assignments move from low demand to medium and high demand. HLM analysis returned a standardized coefficient of 0.72 (t = 23.8) and an R2 value of 0.52, suggesting a strong relationship between ELA assignment demand and student work.
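The HLM analyses above regress student work scores on assignment demand scores and report a standardized coefficient and R2. The sketch below is a simplified single-level analogue with fabricated data: it ignores the nesting of ratings within teachers that the study's HLM accounts for, but it shows why, with a single predictor, the standardized coefficient and R2 are directly linked (R2 is the square of the standardized slope).

```python
# Toy data: invented assignment-demand and student-work scores on a
# 1-to-3.67-style scale; not data from the study.
assignment = [1.2, 1.5, 1.7, 2.0, 2.3, 2.6, 3.0, 1.4, 1.9, 2.8]
work       = [1.1, 1.3, 1.8, 1.9, 2.0, 2.5, 2.9, 1.2, 2.1, 2.6]

def mean(xs):
    return sum(xs) / len(xs)

def standardized_slope_and_r2(x, y):
    """Simple (one-predictor) OLS in standardized form: the standardized
    slope equals the Pearson correlation, and R^2 equals its square."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    beta_std = sxy / (sxx * syy) ** 0.5
    return beta_std, beta_std ** 2

beta, r2 = standardized_slope_and_r2(assignment, work)
```

In the paper's multilevel models the coefficient and R2 need not obey this simple squaring relationship, which is one reason the math results (0.23 with R2 = 0.14) cannot be read off a single-level formula.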
Table 4. Mean assignment demand by scale (SD) (n), including differences between typical and challenging assignments.

Math:
Scale                                  Typical             Difference a         Challenging
Written Communication (3-point scale)  1.52 (0.60) (900)   0.21***, ES = 0.35   1.73 (0.68) (448)
Conceptual Understanding (4-point)     2.19 (0.44) (901)   0.10***, ES = 0.23   2.29 (0.54) (447)
Real-World Connection (4-point scale)  1.50 (0.63) (901)   0.13***, ES = 0.21   1.61 (0.70) (446)

ELA:
Scale                                  Typical             Difference a         Challenging
Elaborated Communication (4-point)     2.01 (0.96) (895)   0.66***, ES = 0.69   2.66 (1.08) (443)
Construction of Knowledge (3-point)    1.57 (0.69) (899)   0.43, ES = 0.62      1.99 (0.77) (444)
Real-World Connection (4-point scale)  1.27 (0.53) (899)   0.22, ES = 0.42      1.49 (0.64) (448)

Note. For this analysis the n reflects the total number of ratings contributing to the scale, including cases with two ratings.
*p < 0.05, **p < 0.01, ***p < 0.001.
a ES stands for effect size.
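The effect sizes (ES) in Table 4 standardize each typical-vs-challenging mean difference against the scale's variability. A minimal sketch of one common way to compute such a standardized mean difference is below; whether the paper used exactly this n-weighted pooling is an assumption of the sketch, so the result only approximates the tabled value.

```python
def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d-style effect size: mean difference divided by the
    n-weighted pooled standard deviation of the two groups."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (m2 - m1) / pooled_var ** 0.5

# Written Communication (math) row of Table 4: typical vs. challenging.
es_written_comm = cohens_d(1.52, 0.60, 900, 1.73, 0.68, 448)
```

With this pooling the Written Communication row gives an effect size in the low 0.3s, in the neighborhood of the table's reported ES = 0.35.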
Table 5a. Assignment inter-scale correlations for mathematics.

                           Written Comm.   Conceptual Und.   Real-World Conn.
Written Communication      1
Conceptual Understanding   0.52            1
Real-World Connection      0.38            0.33              1

Table 5b. Assignment inter-scale correlations for English language arts (ELA).

                           Elaborated Comm.   Constr. of Knowl.   Real-World Conn.
Elaborated Communication   1
Construction of Knowledge  0.75               1
Real-World Connection      0.46               0.52                1
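The inter-scale correlations in Tables 5a and 5b were computed with Spearman's rank correlation Rho. A minimal pure-Python sketch of that statistic is below; the scores fed to it would be the teacher-level aggregated scale scores, though the inputs shown here are invented examples.

```python
def ranks(xs):
    """Average ranks (1-based), with tied values sharing the mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j across any run of tied values.
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 0-based positions i..j, made 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman's Rho: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

Because it operates on ranks, Rho is insensitive to the exact spacing of the rating-scale scores, which suits ordinal rubric data like the IDAP scales.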
Table 6. Mean student work score by level of assignment demand (SD) (n).
Table 7. Teacher-level correlations between challenging assignment quality, student work, and other measures of teaching quality.

                 Math          Math           ELA           ELA
                 assignments   student work   assignments   student work
VAM              NS            0.15*          NS            NS
MKT/Praxis®      NS            NS             0.22**        0.28**
CLASS™           NS            0.15*          0.17**        0.36***
FFT (total)      NS            0.15*          0.23***       0.33***
MQI (total)      NS            NS             n/a           n/a
PLATO (total)    n/a           n/a            0.13*         0.24**

Note. NS = not statistically significant; n/a = protocol not administered in that domain. *p < 0.05, **p < 0.01, ***p < 0.001.
relationships with student work were observed for the domain-general protocols but not for the subject-specific MQI. For ELA, all three protocol total scores were statistically significantly and positively related to both teacher assignments and student work.

3.3.3. VAM

No relationship was found between VAM and teacher assignment quality in math or ELA, nor is a relationship observed between ELA student work and VAM. For math, there is a statistically significant but small positive association between the quality of student work and VAM.
3.4. How are assignment scores related to contextual variables including teacher characteristics, class demographics, schools, or districts?

In this analysis, we attempt to understand contextual, assignment, and teacher factors associated with assignment quality. These analyses were conducted at the class section level to capture differences between two class sections that a teacher taught. Class section, teacher, and school variables were all incorporated as random effects to account for the nested structure of the data. In addition, a number of contextual variables were included in the model and treated as fixed effects. These variables were time of the school year (season) during which the assignment was given, assignment type, different scales, student grade (grades 6–8), and school district (3 districts).

The model also included predictor groups of variables intended to capture important contextual dimensions. A first group of predictors was considered class section demographic variables and included, for each section, percentage of students in each section identified as free/reduced-price lunch eligible, minority, and English language learners. A second group of predictors was labeled as prior achievement variables and included mean class section prior achievement scores as well as the proportion of students in each class section classified as either gifted or special education. A third group of predictors was treated as teacher quality variables and consisted of teacher knowledge and VAM estimates.

To model the likelihood of different factors being associated with assignment ratings of different quality, scale scores were analyzed via a multi-level cumulative logit mixed model described in Appendix B. A full model incorporating all of the aforementioned contextual variables and predictors was fitted to the data to investigate their relationship with assignment ratings. Estimates from the full model are presented in Table 8 and elaborated on in the following subsections.

There was multicollinearity among the individual variables within each predictor group, making the estimates of single variables within each group problematic to interpret. Therefore, a series of likelihood ratio tests (LRTs) was conducted that evaluated the effect of including the group of variables in the model compared with a model that contained all variables but those belonging to the specific predictor group. The p-values of the LRT are presented in Table 8.

3.4.1. Math

Statistically significant effects were observed for school district such that assignments from one district were more intellectually demanding than from the other two districts. Sixth-grade assignments were more intellectually demanding than assignments in older grades. Both of these findings suggest the possibility of curricular differences across grades and/or districts that might account for intellectual demand of assignments. As observed in other analyses, challenging assignments differed from typical assignments and scale distributions differed as well.

Class sections with larger proportions of minority, English language learner, and free or reduced price lunch students were associated with assignments that had lower intellectual demand. Class sections that had higher prior achievement scores, more students classified as gifted, and fewer classified as special education had assignments with greater intellectual demand. However, there was not a significant difference in intellectual demand among classes that differed with respect to the teacher's knowledge and VAM scores.

3.4.2. ELA

ELA relationships of contextual variables differed somewhat from math. No district effects were observed, and eighth grade had more challenging assignments than the lower grades. Additionally, there was an effect of season, such that assignments given in the second half of the school year were less challenging than those given in the first half. There were substantial differences across
assignment type and scale.

Table 8. Model estimates of contextual variables and assignment demand scores.

Contextual Variables a                    Math          ELA
District A b                              0.06          0.12
District B b                              0.89***       0.06
Season c                                  0.14          1.10***
Assignment Type d                         0.53**        1.96***
Grade 7 e                                 0.41**        0.07
Grade 8 e                                 0.42**        0.41*
Scale 2 f                                 2.45***       1.51***
Scale 3 f                                 2.72***       3.01***
Teacher Quality Variables g               p > 0.05 h    p > 0.05 h
  VAM                                     0.02          0.15*
  MKT/Praxis                              0.01          0.02
Class Section Demographic Variables g     p < 0.01 h    p < 0.01 h
  Minority                                0.28**        0.16**
  FRLP                                    0.19**        0.13**
  ELL                                     0.02**        0.09**
Prior Achievement Variables g             p < 0.05 h    p < 0.001 h
  Gifted                                  0.03*         0.10***
  Special Education                       0.02*         0.02***
  Prior Test Scores                       0.05*         0.16***

Note. *p < 0.05. **p < 0.01. ***p < 0.001. Coefficients are in logits.
a Describe the characteristics of the district, school, and classroom.
b Referenced against District C.
c Negative means Spring scores are lower.
d Positive means Challenging scores are higher.
e Referenced against Grade 6.
f Referenced against Scale 1.
g Represent blocks of variables that are evaluated via the likelihood ratio test.
h The p values in bold indicate whether each block represents a statistically significant improvement over the model including contextual variables only.

Patterns among ELA predictor group variables paralleled math. Class sections with larger proportions of minority, English language learner, and free or reduced price lunch students were associated with assignments that had lower intellectual demand. Class sections that had students with higher prior achievement scores, more students classified as gifted, and fewer students classified as special education were given assignments with greater intellectual demand. However, there was not a significant difference in intellectual demand among classes that differed with respect to the teacher's knowledge and VAM scores.

4. Discussion

This study presents findings that describe how classroom artifacts can provide unique insights into the quality of instruction experienced by students in mathematics and ELA. These insights are complementary to those provided by existing measures, capturing both the instructional expectations set in a particular classroom and the student responses to these expectations. This research demonstrates that artifacts can provide explicit and reliably judged evidence about the intellectual demand of the work students are asked to do and the intellectual demand in the students' response, shedding light on curricular and contextual factors as well as teaching quality. The study also provides evidence about the relationship between intellectual demand of assignments and other measures of teaching quality and contextual correlates.

A critical distinction in examining teaching has been the contrast of teacher quality with teaching quality (Kennedy, 2010).
Teacher quality focuses on the individual and assumes a causal connection between what the teacher does and what the student learns. Teaching quality focuses on the teaching and the teacher as a central actor in a complex system that includes a host of contextual factors including curriculum, policy, and community (Bell et al., 2012). Studies of teaching quality more often include a focus on normative practice as a way of understanding consistencies in practice across a system. This research is an examination of teaching quality and avoids any claims of causal attribution to individual teachers. Instead, this study focuses on both differences and commonalities across classrooms as a way of understanding teaching quality situated in an understanding of the contextual factors. That is, artifacts provide evidence of instructional quality that is likely attributable to factors that include, but are not limited to, the teacher. This is a significant departure from teacher observations, which tend to be centered on a teacher's strengths or weaknesses; the focus here is instead on the actual intellectual demand asked of a student.

Evidence from assignments reinforces findings from observations that the intellectual demand of classroom interactions is very limited and relatively consistent across classrooms. As in observation studies, including the study from which these data are drawn (see Bell et al., 2014), dimensions associated with intellectual demand consistently score relatively low on different protocols (Gitomer et al., 2014) and are clustered around a narrow range at the lower part of respective scales. Score variance attributable to raters is very low, and scale inter-correlations are moderate. While the construct of intellectual demand is the basis of the IDAP for both math and ELA, the specific features of design and performance that define the domain-specific protocols are unique and consistent with the structure of particular disciplines. For both math and ELA, raters are able to reliably score artifacts and to differentiate scores on three scales designed to capture intellectual demand. The IDAP can therefore be used effectively to rate artifacts as evidence of teaching quality. In addition to this general pattern, there are specific results that are appropriately reported separately by domain.

In math, the intellectual demand of artifacts is quite low and narrowly distributed for all scales, regardless of whether the teacher categorized the artifact as typical or challenging. Though there is a statistically significant difference in scores for typical and challenging assignments, those that teachers select as more challenging show only very small increases in each of the scales. To the limited extent that math artifacts show more demand, students are more likely to produce work that reflects this demand. Of course, the demands expressed on an assignment do not always reflect the full expectations of the teacher; some may be communicated orally or through other habits developed in the classroom.

The lack of substantial variation in math assignment scores leads to two sets of findings addressing issues of teaching quality. First, correlations with other variables of interest are highly attenuated. Therefore, these correlations with other measures of teaching quality are either not significant or very small. However, the finding of district and grade effects suggests the potential of this kind of artifact analysis to identify systematic differences that are worthy of further exploration. For example, can curricular differences at either the district or grade level account for these differences? Also, score decreases were found to be attributable to reduced attention to real-world relevance in the higher grades.

From a teaching quality perspective, however, this lack of variation is, in fact, quite important. The consistent lack of intellectual demand in math assignments suggests that issues of teaching quality must be addressed systemically. Even the teachers with the highest artifact scores had artifacts that scored low relative to the IDAP scales. While there are significant contextual relationships with demographic and prior achievement variables, the overarching story is that even in classrooms characterized by favorable predictors, the intellectual demand of assignments is quite low. In terms of the protocol, across the vast majority of classrooms in the study, students are asked to solve routine math problems that require little more than procedural execution and lack extended communication or real-world connections. These insights are valuable to stakeholders in a multitude of ways. The most obvious would be to exert pressure on publishers to deliver materials that are relevant, rigorous, and constructive so that these tasks can then be implemented as such in the classroom. Additionally, the small distinction between typical and challenging assignments can be interpreted as a potential area for professional development for teachers in terms of how to effectively provide challenge in the classroom. Finally, these findings are a clear call to researchers and educators to further explore the role of relevance in middle years education.

Findings in ELA classrooms are somewhat different. For ELA, many of the challenging assignments are qualitatively different from typical assignments. Challenging assignments, while still modest in terms of the overall scale, are more likely to ask students to provide elaborated communication, to engage in greater construction of knowledge, and to complete more authentic writing tasks in which they attempt to take on a role, even if at a relatively superficial level. The more intellectually demanding ELA assignments are strongly associated with student work that evidences greater intellectual demand.

ELA scores are positively correlated with observation scores. Classrooms with higher intellectual demand in assignments are also characterized by observation scores that describe stronger classroom interactions. These classrooms are also taught by teachers with higher teacher knowledge scores. There are also significant contextual relationships with demographic and prior achievement variables.

For ELA only, there is an effect of the season during which the assignment is given. Assignments given in the spring are less demanding than those given in the fall. One conjecture is that the decrease may reflect an increased focus on standardized test preparation in the spring. Another is that teachers may experience planning fatigue as the year progresses. Also, assignments in eighth grade were more challenging than those in sixth grade, although no difference was noted for seventh grade. Further study is needed to understand this trend and whether it may be related to a perceived need for deeper tasks in preparation for high school.

Artifact research does allow for some important insights that are not always picked up by other measures. For example, differences between the intellectual demand asked of students from different subgroups would be critical to key stakeholders in addressing equity of access. Additionally, if it is found that the spring drop in demand is related to the current timeframe of high-stakes testing, a response may be needed in revising the tests themselves in terms of intent and composition.

Across both domains, correlations of assignment scores with VAM are not statistically significant. This finding raises issues deserving of further study and lends weight to the argument that demand drops in the spring in preparation for the standardized testing from which VAM is calculated. It may be that the items on achievement tests that are used to determine VAM estimates are not asking students to engage in intellectually demanding tasks as
valued by the IDAP. Similarly, it may be that improvement in achievement test scores can be made by simply focusing on the improvement of skills associated with low-demand tasks. A much better understanding of the alignment of test demands with assignment demands is needed to better interpret this lack of consistent relationship between measures.

Grade, district, and seasonal effects suggest the possibility that assignment differences reflect differences in mandated curriculum in addition to any differences that can be attributed to individual teachers. Current teacher evaluation efforts have placed a very substantial focus on individual teacher differences, but there are likely broader contextual factors, such as curriculum, that also need to be given attention (Gitomer, 2017). The link between mandated curriculum reform and teacher interpretation, as explored by Cohen and Hill (2000), merits further investigation.

It is important to recognize features of the research that have implications for interpreting the study's findings. As noted, the Newmann et al. (2001) framework represents a particular perspective on instruction. Different frameworks might support different inferences about the quality of assignments. Nor does the framework used in this study consider important teaching and learning issues such as learning progressions, adaptive materials for students with special needs, culturally responsive pedagogy, and the relationship of artifacts to curriculum design. The IDAP may have more or less utility in studying these kinds of problems.

The issue of teachers selecting challenging and typical assignments raises potentially interesting questions as well. In this study, teachers were given limited guidance in deciding what constituted a challenging assignment. Thus, inferences in this study are not simply about the quality of an assignment but also include teacher choice of the assignment. While the student data suggest that the teachers were differentiating assignments in ways that were sensitive to the protocol, there is a question of whether different, and perhaps more directive, sampling procedures would lead to different interpretations.

Finally, as with most artifact studies, judgments are made on the basis of evidence contained in the artifact documents. Evidence of what transpired in the classroom around the enactment of the artifacts is not available. Certainly, interactions between teachers and students might strengthen or weaken the expectations that can be inferred by only considering the physical documentation.

4.2. Conclusions

As a measure of teaching quality, assignments have the potential to contribute to an understanding of the extent to which intellectual demand is present in classrooms and across classrooms within schools and districts. These insights can then be used to address development needs in each of these areas, leading to improvement in teaching quality and ultimately to improvement of student outcomes (Yoon, Duncan, Lee, Scarloss, & Shapley, 2007). By examining score patterns as we have done, schools and districts may be able to better identify both the extent to which students are asked to engage in intellectually meaningful work as well as the teacher, classroom, and institutional factors that contribute to students' learning experience. Further, with this insight, schools and districts could determine priorities for improvement at a local or system level.

Funding

This work was supported by the Bill & Melinda Gates Foundation, Grant# OPP52048, as part of the Understanding Teaching Quality (UTQ) project.

Acknowledgements

The authors are grateful to Steve Leinwand and Carmen Manning for leading the rater training effort and to Lesli Scott and Terri Egan for coordinating data collection and scoring, respectively. We are grateful to our collaborators, Courtney Bell, JR Lockwood, and Dan McCaffrey, for all of their contributions to study design as well as comments on earlier versions of the paper. The authors alone are responsible for the content of this paper.

Appendix A

Math - Teacher assignment scoring

Scale 1 - (Written mathematical communication).

3 = Analysis/Persuasion/Theory. The task requires the student to show his/her solution path and to explain the solution path with evidence.
2 = Report/Summary. The task requires the student to show her/his solution path or requires explanation of why.
1 = Just answers. The task requires little more than answers. Students may be asked to show some work.

Scale 2 - (Conceptual Understanding).

4 = The dominant expectation of the assignment requires substantial demonstration of conceptual understanding that is related to one or more mathematical ideas.
3 = The dominant expectation of the assignment requires some demonstration of conceptual understanding that is related to one or more mathematical ideas.
2 = The dominant expectation of the assignment relates to a mathematical idea but primarily requires demonstration of procedural knowledge.
1 = The mathematical expectation of the assignment relates only tangentially, if at all, to a mathematical idea.

Scale 3 - (Real-world connection).

4 = The assignment clearly addresses an authentic situation that entails relevant mathematical questions, issues, or problems that are encountered in the real world and specifies that the work will be meeting the needs of a real and appropriate audience, one other than a teacher or student peers.
3 = The assignment clearly addresses an authentic situation that entails relevant mathematical questions, issues, or problems that are encountered in the real world but does not specify that the work will be meeting the needs of a real and appropriate audience.
2 = The assignment makes an attempt to address mathematical questions, issues, or problems that are encountered in the real world, but the connection is superficial, tangential, contrived, or weak.
1 = The assignment makes a minimal attempt to address the mathematical questions, issues, or problems that are encountered in the real world.

Scale 3 - (Reasoning).

4 = Student work demonstrates appropriate and successful use of reasoning.
3 = Student work demonstrates reasoning, but it is not entirely appropriate or successful.
2 = Student work demonstrates some use of reasoning strategies.
1 = Student work represents primarily the reproduction of fragments of knowledge OR the application of previously learned procedures OR completely inappropriate reasoning.
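For large-scale scoring of this kind, scale definitions like those above lend themselves to a simple machine-readable encoding. The sketch below is our own illustration (the dictionary, helper names, and range checks are hypothetical, not part of the published IDAP protocol); it only transcribes the score ranges stated in the scales above:

```python
# Illustrative encoding of the IDAP math scales transcribed above.
# This is a sketch for screening and aggregating ratings, not the
# published protocol itself.

IDAP_MATH_SCALES = {
    # Teacher assignment scales
    "written_mathematical_communication": (1, 3),
    "conceptual_understanding": (1, 4),
    "real_world_connection": (1, 4),
    # Student work scale
    "reasoning": (1, 4),
}

def validate_rating(scale: str, score: int) -> int:
    """Reject ratings that fall outside the defined range for a scale."""
    lo, hi = IDAP_MATH_SCALES[scale]
    if not lo <= score <= hi:
        raise ValueError(f"{scale} rating must be in [{lo}, {hi}], got {score}")
    return score

def mean_rating(scores: list[int]) -> float:
    """Average several raters' scores on one scale for one artifact."""
    return sum(scores) / len(scores)
```

A rating file could then be screened before analysis: `validate_rating("reasoning", 4)` passes, while a reasoning score of 5 raises `ValueError`.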
and techniques for part of the task, BUT these skills are not the dominant expectation of the task.
1 = There is very little or no expectation for students to construct knowledge or an idea. Students can satisfy all or almost all of the requirements of the task by reproducing information they have read, heard, or viewed. For imaginative tasks, students are given very little or no direction in the use of literary strategies and techniques in order to construct an idea.

Scale 2 - (Construction of knowledge).

2 = The work includes extended writing that EITHER makes an assertion OR provides evidence for a given assertion but does not do both. // The writing does not demonstrate the characteristics of the genre well, although it offers more sustained writing than isolated sentences or short answers.
1 = The writing is a series of isolated sentences or short answers that do not support a central idea, or the writing responds to fill-in-the-blank or multiple-choice tasks.

The model was coded using the clmm function (with default settings) in the R package ordinal (Christensen, 2015). The dependent variable was the raw scale rating (ordinal), and the predictors in the different models were as described in the manuscript. This is essentially an extension of the well-known multilevel/hierarchical linear models (Raudenbush & Bryk, 2002) to ordinal dependent variables. Fitting the model with ordinal variables requires using a logit link, and thus the estimated coefficients are on the logit scale as well. The hierarchical nature of the data can be accommodated, with the software estimating both the fixed and random effects.

References

American Institutes for Research. (2003). Exploring assignments, student work, and teacher feedback in reforming high schools: 2002–2003 data from Washington state. Menlo Park, CA: SRI International.
Archibald, D. A., & Newmann, F. M. (1988). Beyond standardized testing: Assessing authentic academic achievement in secondary schools. Washington, DC: National Association of Secondary School Principals.
Auerbach, S. (2002). Why do they give the good classes to some and not to others?: Latino parent narratives of struggle in a college access program. Teachers College Record, 104(7), 1369–1392. https://doi.org/10.1111/1467-9620.00207.
Beane, J. (2004). Creating quality in the middle school curriculum. In S. C. Thompson (Ed.), Reforming middle level education: Considerations for policymakers (pp. 49–63). Westerville, OH: National Middle School Association.
Bell, C., Gitomer, D., McCaffrey, D., Hamre, B., Pianta, R., & Qi, Y. (2012). An argument approach to observation protocol validity. Educational Assessment, 17(2–3), 62–87. https://doi.org/10.1080/10627197.2012.715014.
Bell, C. A., Qi, Y., Croft, A. J., Leusner, D., McCaffrey, D. F., Gitomer, D. H., et al. (2014). Improving observational score quality: Challenges in observer thinking. In T. J. Kane, K. A. Kerr, & R. C. Pianta (Eds.), Designing teacher evaluation systems: New guidance from the Measures of Effective Teaching project (pp. 50–97). San Francisco, CA: Jossey-Bass.
Bill & Melinda Gates Foundation. (2012). Gathering feedback for effective teaching: Combining high-quality observations with student surveys and achievement gains. Seattle, WA: Author. Retrieved from: http://files.eric.ed.gov/fulltext/ED540960.pdf.
Borko, H., Stecher, B., Alonzo, A., Moncure, S., & McClam, S. (2005). Artifact packages for characterizing classroom practice: A pilot study. Educational Assessment, 10(2), 73–104. https://doi.org/10.1207/s15326977ea1002_1.
Borko, H., Stecher, B., & Kuffner, K. (2007). Using artifacts to characterize reform-oriented instruction: The Scoop Notebook and rating guide. CSE Technical Report 707. Los Angeles, CA: Center for the Study of Evaluation, National Center for Research on Evaluation, Standards, and Student Testing (CRESST)/UCLA. Retrieved from: https://cset.stanford.edu/publications/reports/using-artifacts-characterize-reform-oriented-instruction-scoop-notebook-and-rat.
Bruer, J. T. (1993). Schools for thought: A science of learning in the classroom. Cambridge, MA: MIT Press.
Carpenter, T. P., Fennema, E., Peterson, P. L., Chiang, C., & Loef, M. (1989). Using knowledge of children's mathematics thinking in classroom teaching: An experimental study. American Educational Research Journal, 26(4), 499–531. https://doi.org/10.3102/00028312026004499.
Christensen, R. H. B. (2015). Package 'ordinal': Regression models for ordinal data (version 2015.6-28).
Clare, L. (2000). Using teachers' assignments as an indicator of classroom practice. CSE Technical Report 532. Los Angeles, CA: Center for Research on Evaluation, Standards, and Student Testing (CRESST)/UCLA. Retrieved from: http://eric.ed.gov/?id=ED454294.
Clare, L., & Aschbacher, P. (2001). Exploring the technical quality of using assignments and student work as indicators of classroom practice. Educational Assessment, 7(1), 39–59. https://doi.org/10.1207/S15326977EA0701_5.
Clare, L., Valdés, R., Pascal, J., & Steinberg, J. R. (2001). Teachers' assignments as indicators of instructional quality in elementary schools. CSE Technical Report 545. Los Angeles, CA: Center for Research on Evaluation, Standards, and Student Testing (CRESST)/UCLA. Retrieved from: http://cresst.org/publications/cresst-publication-2917/.
Cobb, P., Wood, T., Yackel, E., Nicholls, J., Wheatley, G., Trigatti, B., et al. (1991). Assessment of a problem-centered second-grade mathematics project. Journal for Research in Mathematics Education, 22(1), 3–29. https://doi.org/10.2307/749551.
Cohen, D., & Hill, H. (2000). Instructional policy and classroom performance: The mathematics reform in California. Teachers College Record, 102(2), 294–343.
D'Agostino, J. V. (1996). Authentic instruction and academic achievement in compensatory education classrooms. Studies in Educational Evaluation, 22(2), 139–155. https://doi.org/10.1016/0191-491X(96)00008-9.
Daggett, W. R. (2005, September). Achieving academic excellence through rigor and relevance. Retrieved from: http://www.readbag.com/daggett-pdf-academic-excellence.
Danielson, C. (2013). Framework for teaching evaluation instrument. Princeton, NJ: The Danielson Group. Retrieved from: http://www.danielsongroup.org/framework/.
Delaney, S., Ball, D. L., Hill, H. C., Schilling, S. G., & Zopf, D. (2008). "Mathematical knowledge for teaching": Adapting U.S. measures for use in Ireland. Journal of Mathematics Teacher Education, 11(3), 171–197. http://dx.doi.org/10.1007/s10857-008-9072-1.
Denner, P. R., Salzman, S. A., & Bangert, A. W. (2001). Linking teacher assessment to student performance: A benchmarking, generalizability, and validity study of the use of teacher work samples. Journal of Personnel Evaluation in Education, 15(4), 287–307. https://doi.org/10.1023/A:1015405715614.
Dewey, J. (1916). Democracy and education. New York, NY: Free Press.
Dowden, T. (2007). Relevant, challenging, integrative and exploratory curriculum design: Perspectives from theory and practice for middle level schooling in Australia. The Australian Educational Researcher, 34(2), 51–71. https://doi.org/10.1007/BF03216857.
Dwyer, C. A. (1998). Psychometrics of Praxis III: Classroom performance assessments. Journal of Personnel Evaluation in Education, 12(2), 163–187. https://doi.org/10.1023/A:1008033111392.
Educational Testing Service. (2015). The Praxis® Study companion: English language arts: Content knowledge. Retrieved from: http://www.ets.org/s/praxis/pdf/5038.pdf.
Gitomer, D. H. (2017). Promises and pitfalls for teacher evaluation. In R. P. Ferretti, & J. Hiebert (Eds.), Teachers, teaching, and reform: Perspectives on efforts to improve educational outcomes. London, UK: Routledge.
Gitomer, D. H., & Bell, C. A. (2013). Evaluating teaching and teachers. In K. F. Geisinger (Ed.), APA handbook of testing and assessment in psychology (Vol. 3, pp. 415–444). Washington, DC: American Psychological Association.
Gitomer, D. H., Bell, C. A., Qi, Y., McCaffrey, D. F., Hamre, B. K., & Pianta, R. C. (2014). The instructional challenge in improving teaching quality: Lessons from a classroom observation protocol. Teachers College Record, 116(6), 1–32. Retrieved from: http://www.tcrecord.org/Content.asp?ContentId=17460.
Grossman, P., Greenberg, S., Hammerness, K., Cohen, J., Alston, C., & Brown, M. (2009, April). Development of the protocol for language arts teaching observation (PLATO). Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.
Hill, H. C., Schilling, S. G., & Ball, D. L. (2004). Developing measures of teachers' mathematics knowledge for teaching. The Elementary School Journal, 105(1), 11–30. https://doi.org/10.1086/428763.
Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000.
Kane, T. J., Kerr, K. A., & Pianta, R. C. (Eds.). (2014). Designing teacher evaluation systems: New guidance from the Measures of Effective Teaching project. San Francisco, CA: Jossey-Bass.
Kennedy, M. M. (2010). Attribution error and the quest for teacher quality. Educational Researcher, 39(8), 591–598. https://doi.org/10.3102/0013189X10390804.
Knapp, M. S., Shields, P. M., & Turnbull, B. J. (1992). Academic challenge for the children of poverty: Summary report. Washington, DC: Office of Policy and Planning, U.S. Department of Education.
Koh, K., & Luke, A. (2009). Authentic and conventional assessment in Singapore schools: An empirical study of teacher assignments and student work. Assessment in Education: Principles, Policy, & Practice, 16(3), 291–318. https://doi.org/10.1080/09695940903319703.
La Paro, K. M., & Pianta, R. C. (2003). CLASS™: Classroom assessment scoring system™. Charlottesville, VA: Curry School Center for Advanced Study of Teaching and Learning, University of Virginia.
Learning Mathematics for Teaching Project. (2011). Measuring the mathematical quality of instruction. Journal of Mathematics Teacher Education, 14(1), 25–47. https://doi.org/10.1007/s10857-010-9140-1.
Linacre, J. M. (2015). Facets computer program for many-facet Rasch measurement (version 3.71.4). Beaverton, OR: Winsteps.com.
Lockwood, J. R., & McCaffrey, D. F. (2014). Correcting for test score measurement error in ANCOVA models for estimating treatment effects. Journal of Educational and Behavioral Statistics, 39(1), 22–52. https://doi.org/10.3102/1076998613509405.
Lockwood, J. R., Savitsky, T. D., & McCaffrey, D. F. (2015). Inferring constructs of effective teaching from classroom observations: An application of Bayesian exploratory factor analysis without restrictions. Annals of Applied Statistics, 9(3), 1484–1509. https://doi.org/10.1214/15-AOAS833.
Looney, J. (2011). Developing high-quality teachers: Teacher evaluation for improvement. European Journal of Education, 46(4), 440–455. https://doi.org/10.1111/j.1465-3435.2011.01492.x.
Matsumura, L. C., Garnier, H. E., Slater, S. C., & Boston, M. D. (2008). Toward measuring instructional interactions "at-scale." Educational Assessment, 13(4), 267–300. https://doi.org/10.1080/10627190802602541.
Matsumura, L. C., & Pascal, J. (2003). Teachers' assignments and student work: Opening a window on classroom practice. CSE Report 602. Los Angeles, CA: Center for the Study of Evaluation, National Center for Research on Evaluation, Standards, and Student Testing (CRESST)/UCLA.
Messick, S. (1989). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5–11. https://doi.org/10.1002/j.2330-8516.1988.tb00303.x.
Milanowski, A., Heneman, H. G., III, & Kimbal, S. M. (2009). Review of teaching performance assessments for use in human capital management. Working Paper. Madison, WI: Consortium for Policy Research in Education (CPRE).
Mitchell, K., Shkolnik, J., Song, M., Uekawa, K., Murphy, R., Garet, M., et al. (2005). Rigor, relevance, and results: The quality of teacher assignments and student work in new and conventional high schools: Evaluation of the Bill & Melinda Gates Foundation's high school grants. Washington, DC: American Institutes for
Research.
Newmann, F. M., Marks, H. M., & Gamoran, A. (1996). Authentic pedagogy and student performance. American Journal of Education, 104(4), 280–312. https://doi.org/10.1086/444136.
Newmann, F. M., Bryk, A. S., & Nagaoka, J. K. (2001). Authentic intellectual work and standardized tests: Conflict or coexistence? Chicago, IL: Consortium on Chicago School Research. Retrieved from: https://consortium.uchicago.edu/publications/authentic-intellectual-work-and-standardized-tests-conflict-or-coexistence.
Newmann, F., King, M., & Carmichael, D. (2007). Authentic instruction and assessment: Common standards for rigor and relevance in teaching academic subjects. Des Moines, IA: Iowa Department of Education. Retrieved from: http://www.centerforaiw.com/wp-content/uploads/2016/05/authentic-instruction-assessment-bluebook.pdf.
Ng, P. T. (2007). Educational reform in Singapore: From quantity to quality. Educational Research for Policy and Practice, 7(1), 5–15. https://doi.org/10.1007/s10671-007-9042-x.
Organisation for Economic Co-operation and Development (OECD). (2013a). Synergies for better learning: An international perspective on evaluation and assessment. Paris, France: OECD Publishing.
OECD. (2013b). Teachers for the 21st century: Using evaluation to improve teaching. Paris, France: OECD Publishing. Retrieved from: http://www.oecd.org/site/eduistp13/TS2013%20Background%20Report.pdf.
Paik, S. (2015). Teachers' attention to curriculum materials and student contexts: The case of Korean middle school teachers. The Asia-Pacific Education Researcher, 24(1), 235–246. https://doi.org/10.1007/s40299-014-0175-4.
Prosser, B. (2006, November). Reinvigorating the middle years: A review of middle schooling. Paper presented at the Australian Association for Research in Education Conference, Adelaide. Retrieved from: http://www.aare.edu.au/publications-database.php/5229/reinvigorating-the-middle-years-a-review-of-middle-schooling
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: Sage Publications, Inc.
Resnick, L. B. (1987). Learning in school and out. Educational Researcher, 16(9), 13–20.
Rubin, B. C. (2003). Unpacking detracking: When progressive pedagogy meets students' social worlds. American Educational Research Journal, 40(2), 539–573.
Scheerens, J., Ehren, M., Sleegers, P., & de Leeuw, R. (2012). OECD review on evaluation and assessment frameworks for improving school outcomes: Country background report for The Netherlands. The Netherlands: Ministry of Education. Retrieved from: http://www.oecd.org/education/school/oecdreviewonevaluationandassessmentframeworksforimprovingschooloutcomescountrybackgroundreports.htm.
Silver, E., & Lane, S. (1995). Can instructional reform in urban middle schools help students narrow the mathematics performance gap? Research in Middle Level Education, 18(2), 49–70. https://doi.org/10.1080/10825541.1995.11670046.
Stake, R. E., Kushner, S., Ingvarson, L., & Hattie, J. (2004). Assessing teachers for professional certification: The first decade of the National Board for Professional Teaching Standards (Advances in Program Evaluation, Vol. 11). Bingley, UK: Emerald Group Publishing Limited.
Stodolsky, S. S. (1990). Classroom observation. In J. Millman, & L. Darling-Hammond (Eds.), The new handbook of teacher evaluation: Assessing elementary and secondary school teachers (pp. 175–190). Newbury Park, CA: Sage.
Taylor, E. S., & Tyler, J. H. (2012). Can teacher evaluation improve teaching? Education Next, 12(4), 78–84. Retrieved from: http://educationnext.org/can-teacher-evaluation-improve-teaching/.
Wenzel, S., Nagaoka, J. K., Morris, L., Billings, S., & Fendt, C. (2002). Documentation of the 1996–2002 Chicago Annenberg research project strand on authentic intellectual demand exhibited in assignments and student work: A technical process manual. Chicago, IL: Consortium on Chicago School Research. Retrieved from: http://ccsr.uchicago.edu/sites/default/files/publications/p67.pdf.
Wolf, D., Bixby, J., Glenn, J., III, & Gardner, H. (1991). To use their minds well: Investigating new forms of student assessment. Review of Research in Education, 17, 31–74. Retrieved from: http://www.jstor.org/stable/1167329.
Woolley, M. E., Rose, R. A., Orthner, D. K., Akos, P. T., & Jones-Sanpei, H. (2013). Advancing academic achievement through career relevance in the middle grades: A longitudinal evaluation of CareerStart. American Educational Research Journal, 50(6), 1309–1335. https://doi.org/10.3102/0002831213488818.
Yoon, K. S., Duncan, T., Lee, S. W.-Y., Scarloss, B., & Shapley, K. L. (2007). Reviewing the evidence on how teacher professional development affects student achievement (REL 2007–No. 033). Washington, DC: National Center for Educational Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education.
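The appendix above describes the estimation approach only verbally: clmm fits a cumulative link (proportional odds) model, so coefficients live on the logit scale. As a minimal sketch of that link function, the Python fragment below computes category probabilities from cutpoints and a linear predictor; all numeric values are invented for illustration and are not estimates from this study:

```python
import math

def cumulative_logit_probs(eta, thresholds):
    """Category probabilities under a cumulative-logit (proportional odds) model.

    P(Y <= k) = logistic(theta_k - eta), where eta is the linear predictor
    (fixed effects plus any random effects) and theta_1 < ... < theta_{K-1}
    are the estimated cutpoints; coefficients are therefore on the logit scale.
    """
    def logistic(x):
        return 1.0 / (1.0 + math.exp(-x))
    # Cumulative probabilities for categories 1..K-1, with P(Y <= K) = 1.
    cum = [logistic(t - eta) for t in thresholds] + [1.0]
    # Difference adjacent cumulative probabilities to get per-category mass.
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]

# Invented values for a 1-4 rating scale: three cutpoints and a linear
# predictor combining a fixed effect with, say, a teacher-level random intercept.
probs = cumulative_logit_probs(eta=0.8, thresholds=[-1.0, 0.5, 2.0])
assert len(probs) == 4 and abs(sum(probs) - 1.0) < 1e-9
```

Raising eta (a higher expected demand for a given classroom) shifts probability mass toward the higher rating categories, which is the sense in which the reported coefficients should be read.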