
Evaluation of Testing and Criterial Thinking in Education [1]

Robert E. Stake, University of Illinois
I have a month-old grandson, Constantine Knutson Stake. At birth he weighed seven
pounds, six ounces. Constantine's worth is not indicated by his weight. We can put scales under
anything, and the arrow points to something, and it may be important to the doctor, but weight is
seldom what we need to know about our children. We do not need to say that the score is
meaningless in order to say that the score is invalid.
Criterial Thinking. Forty-eight years ago I was in graduate school. One day, sitting at
my desk, I suddenly realized there is a Social Science of Education because educational
psychologists were able to restructure phenomena as variables. They had invented the constructs
of education, building blocks for disciplined thinking about education, and they were called
variables. They were attributes, properties, traits, characteristics, facets, dimensions, criteria. By
reducing the complex phenomena of classroom, boardroom, history, and community aspiration,
to variables, one could get a handle on things.
A variable is an attribute that varies. It can vary in various ways, but we seized upon the
idea that it varies in quantity. Amounts vary up and down a scale. So once we identified the
construct, the scale, the important thing was to measure the quantity. We can use these quantities
to describe, to distribute, to compare, even to make like we are finding causes, and to interpret
the identified causes as bases of control, of improvement, of reform. It looked to me like
harnessing the atom. With criteria, with criterial thinking, we could measure, and with
measurement scales we could move mountains.
With criterial thinking and sampling as our entree, the study of education could be
precise, generalization-producing and useful. Any doubts I had were blown away. I enlisted in
the science of testing. I devoted myself to becoming a measurements man, and, in spite of the heresies you will hear today, I am one still. I am a measurement man. My work is program
evaluation. I try to measure the quantity and quality of Education, the merit and shortcoming, the
elusive criteria of teaching and learning.
Monitoring. Not all educational psychologists, educational researchers, scholars and
practitioners invested heavily in measurement, but many authorities from preschool to graduate
school, adopted criterial thinking for monitoring educational responsibilities. They accepted the
idea that to think about educational problems and to devise improvements in policy and practice,
you needed to speak about the activities of Education in terms of variables. One of my graduate
students, Penha Tres,[2] studying a State Department of Education, found that the Department's curriculum specialists could not be heard unless they converted their ideas into the language of
criterial assessment.
We are compelled to think of teaching and learning, leadership, curricula, finance, and
community service, in terms of criteria. Student achievement, goal fulfillment, drop-outs, staff development, support of business and industry: all the facets and complex relationships, in terms of criteria.

[1] Paper delivered at the annual meeting of the American Psychological Association, San Francisco, August 24, 2001. Thanks to Bob Linn, Dan Heck, Gene Glass, and Rita Davis for comments on an early draft.
[2] Penha Tres Brevig, 1988. Doctoral dissertation. Urbana: University of Illinois.
People recognize that our experiences do not come in the form of criteria. When we ask
our grandchildren what happened in school today, we don't expect criteria. When the newspapers
report on prayer in school, school violence, or replacement of the Superintendent, it is a story
more than a functional relationship. The who, what, where, and why, even the when, are nominal
descriptors more than scalar.
An important feature of criterial thinking is the correlation of variables. Many scales are
similar in the sense that they consistently hold some cases high and other cases low. For
example, the height of a column of mercury is correlated with the heat in the air, and both are
correlated with infant mortality. Measurement depends heavily on correlates, but we sometimes
get in trouble when using correlates as indicators. Infant mortality is not accurately monitored by
measuring air temperature. This is a key concern in my evaluation of testing.
Episodic Thinking. My psychology tells me that the alternative to criterial thinking is
episodic thinking. Educational phenomena come to be known through episodes, happenings,
activities, events. The phenomena have a time and context base. They are populated with people
having personalities, histories, aspiration, frailty. We sometimes talk about personality and
frailty, contexts and episodes, in terms of variables. Almost anything can be converted into
variables. But the conversion often simplifies, under-represents. We gain a handle and lose a
situation.
In 1978, with a large team, Jack Easley and I studied the status of science education in
US public schools.[3] We did case studies in eleven communities across the country. Each school, each teacher had a story, a situation, an enterprise. We investigated a number of issues and drew a number of conclusions. Then, because the sample was so small and the nation so large, we sent 4000 questionnaires to randomly sampled respondents. Their perceptions of what was happening in their schools were consistent with the findings from our case study schools, but we felt we
could not say the surveys confirmed the case studies. The case study knowledge was episodic.
Inventive though we were, we could not convey the same sense of situation in our survey items.
The surveys gave us scalar findings that enriched, but did not confirm, the case experience.
Criterial thinking and episodic thinking exist, side by side in our culture and in our
brains, with a kind of binocular disparity, a disparity we resolve, unconsciously, into a unity not
attainable from either criterial or episodic thinking alone.
Contemporary Achievement Testing: Measuring Aptitude, not Achievement
By mandate, standardized testing is widely used in Education in the United States and
various Western countries today to assess student achievement and teacher and school
effectiveness. Drawing on his studies and those of many others, Bob Linn evaluated this
mandated testing, masterfully, last year in his paper, "Assessments and Accountability."[4] Linn takes the test scores, when used right, to be appropriate representations of scholastic achievement. I do not. I want better representations of Education. In 48 states and beyond, simplistic misrepresentations are lowering the quality of teaching and learning.[5]

[3] Stake and Easley, 1979. Case Studies of Science Education. Urbana, IL: CIRCE, University of Illinois.
[4] Robert L. Linn, 2000. Assessments and accountability. Educational Researcher, March, 4-16.
[5] Linn expressed concern about these negative side effects and set seven worthy constraints as ways of enhancing the validity, credibility, and positive impact of assessment and accountability systems while minimizing their negative effects: (1) Provide safeguards against selective exclusion of students from assessments. This would reduce distortions such as those found for Title 1 in the fall-spring testing cycle. One way of doing this is to include all students in accountability calculations. (2) Make the case that high-stakes accountability requires new high-quality assessments each year that are equated to those of previous years. Getting by on the cheap will likely lead to both distorted results (e.g., inflated, non-generalizable gains) and distortions in education (e.g., the narrow teaching to the test). (3) Don't put all of the weight on a single test. Instead, seek multiple indicators. The choice of construct matters and the use of multiple indicators increases the validity of inferences based upon observed gains in achievement. (4) Place more emphasis on comparisons of performance from year to year than from school to school. This allows for differences in starting points while maintaining an expectation of improvement for all. (5) Consider both value added and status in the system. Value added provides schools that start out far from the mark a reasonable chance to show improvement while status guards against "institutionalizing low expectations" for those same students and schools. (6) Recognize, evaluate, and report the degree of uncertainty in the reported results. (7) Put in place a system for evaluating both the intended positive effects and the more likely unintended negative effects of the system.
To evaluate the testing, we need to examine what is being measured. What is the
construct? I claim that it is not achievement but aptitude. The tests are not telling what the
students are learning about educational subject matter. They are telling what aptitude the students
have for further scholastic learning. Achievement and aptitude are interrelated, correlated criteria
but different.
Aptitude is the ability to reason but reasoning requires understanding of content. Aptitude
includes much long-term acquisition of knowledge. Scholastic achievement is what has been
learned lately during courses, mostly in school, partly from formal teaching, partly from peers
and parents, but only because the student is participating in school work. Achievement includes
advances in reasoning. After a while, some of what was achievement becomes aptitude. It is a
porous membrane between them, yet the distinction is essential when I ask the question,
"Constantine, what did you learn in school today?"
Measuring Aptitudes. The tests mandated by State Departments of Education and many
School Districts are relatively cheap, simple and reliable devices for ranking students as to a kind
of general intellectual functioning. They use subject matter vocabulary as façade. They measure very little of what children are learning this year in school; rather, they measure the child's
aptitude for learning.
The test developers look for items that, using some apparently-relevant content
vocabulary, discriminate between high and low test scorers. Items that could evoke responses
indicating knowledge of an important and relevant topic do not survive preliminary trials if they
do not correlate with the other items of the test. A test is seen as measuring a unidimensional
criterion. It is not a check-off list of what a student has learned. Validity studies to show that the
scores identify the children who have learned this year's subject matter best are rarely done.
In 1999, the NAACP obtained from the New York State Department of Education, by
court order, all papers pertinent to the development of the current State Regents Examinations.
These are highly respected, long standing tests of secondary school achievement. I found no
validity studies (other than internal consistency studies) included in the NAACP package.
The most persuasive demonstration that scholastic achievement is not targeted is the lack
of research evidence that when teaching greatly improves, scores rise accordingly. Such evidence
is sometimes claimed[6] but it is almost completely absent from the tests and measurements literature. There is evidence that teachers can coach students on how to get better scores, but few studies have validated those gains against independent measures of achievement.

[6] And many researchers would be pleased to find it and publish it.
That is the way it is. A major change in teaching can change student learning a
lot, but a major change in teaching can change aptitude for learning only slightly. Thus we
are trying to measure improvement in schooling with tests that do not measure improvements
even when, by other accounts, improvements occurred.
Scholastic Achievement. Most good teachers do not stick to the point. Good teachers
roam the content terrain, point out and extend connections, introduce applications. For them,
knowledge and skill are not collections of discrete elements. Each fact, topic and problem is
linked to various networks of knowledge and systems of thinking. In arithmetic,[7] one-digit addition and two-digit addition are closely linked; one-digit addition and two-digit multiplication
are less close, yet linked in several ways. The several ways become many in their applications.
The applications of mathematics and the other subjects quickly become too numerous to itemize
in tables of content, lists of objectives, lesson plansyet the practicing teacher, partly
unconsciously, expands into further dimensions of meaning for each operation and concept.
Subject matter teaching and learning of the complex curricular domains are represented
deceptively simply in goal statements, topical frameworks, and lesson plans.
The chapter titles of a mathematics textbook seem simple enough. For example, the
chapters of the book used by the upper sixth grade in the Duxbury (Massachusetts) Intermediate
School in 1989 are listed below.
Chapter Titles of a Middle School Math Textbook
1. Addition and Subtraction of Whole Numbers
2. Multiplication and Division of Whole Numbers
3. Decimals
4. Multiplication and Division of Decimals
5. Geometry
6. Factors and Multiples
7. Addition and Subtraction of Fractions
8. Multiplication and Division of Fractions
9. Probability
10. Statistics and Graphing
11. Ratio, Proportion, & Percents
12. Measurement
13. Perimeter, Area and Volume
14. Integers
15. Using Triangles
Please notice Chapter 1, Addition and Subtraction of Whole Numbers. A quick browsing
of the chapter's pages finds subdivision into the topics of: place value, reading and writing
groups of three numerals, one and two digit addition and subtraction, properties of addition, three
and four digit addition, money units, missing numbers, five digit addition, three to six digit
subtraction; subtraction with zeros, comparing numbers, greatest and least numbers, rounding
numbers, estimation of sums and differences, Roman numerals, with some special attention to consumer skills, career interests, and problem solving. And each of these Chapter 1 subtopics could be further subdivided. A full inventory of matters brought into view from all fifteen chapters: could it be longer than the textbook?

[7] Thomas A. Romberg and Thomas P. Carpenter, 1986. Research on teaching and learning mathematics: Two disciplines of scientific inquiry. In Merlin C. Wittrock, editor, Handbook of Research on Teaching. NY: Macmillan.
In the backs of our heads, we have epistemological inventories of subject matter running
deep into and beyond discipline-given subclassifications. These inventories are particularly
broad when we include the many applications and relationships a teacher offers. The inventories
can be organized around a conceptual superstructure, such as that proposed by Ed Haertel and Dave Wiley.[8] These classification schemes gravitate toward a powerful simple structure. Few of
them reflect the complex detail of the actual teaching. Field studies of teaching and learning
show that complexity. They reveal each teacher's massively compounded conception of the
subject being taught. You can classify what teachers do into achievement frameworks but you
cannot derive the teachers' achievement inventory from the frameworks.
I am not saying teachers are wonderful. They are just human beings. Psychologists know
their complexity, and the ambiance any human creates.
Pedagogic and Psychometric Views. Let me overgeneralize. Teachers and test
developers see student achievement differently. Teachers see the students learning separate
pieces of knowledge, having had many discrete experiences. To teachers, education is a
succession of episodes, thousands and thousands, many repetitive. To measure learning, you
need to review student responses to the pieces. Any one piece may be learned better by the
students who learn most pieces better, thus a correlation existsbut how crude it is to say how
well one idea is understood by testing another. An achievement test should check out
understanding on as many important pieces of knowledge as is practical. An item should not be
omitted because all students know it. Low correlation of performance across test items should
not mean that the test is weak. Discriminating between good students and poor should not be the
purpose of this testing. Assessing achievement is the purpose.
Most test developers, I think, see the students developing competencies. A competency
may be narrow or broad, such as to apply Boyle's Law or to perform chemistry tasks well. These
test developers see criterial thinking as the preferred way of measuring student achievement.
Students who score well on items apparently have a high aptitude and can be expected to do well
in the next classes. Overattention to the uniqueness of a student's understanding of chemistry,
episode by episode, renders the assessment of achievement unmanageable. For most purposes,
say most test developers, achievement in any subject domain can be summarized by a single
criterion.
Teachers and test developers usually agree on the importance of tests being aligned with
the goals of instruction and with the teaching. They sometimes disagree on what to do about poor
alignment. The teacher sees the State goals and District curricula as identifying a number of
topics for which learning activities are needed. Each topic has a bunch of subtopics, some more
essential than others, most of which will not be mentioned in their lesson plans but, during
teaching, some of the subtopics arise to elaborate the presentation, the project, or the recitation.
Most of the many episodes of teaching and learning easily fall into the matrices of the curriculum
guide.

[8] Edwin H. Haertel and David E. Wiley, 1990. Poset and lattice representations of ability structures: Implications for test theory. Paper delivered at the annual meeting of the American Educational Research Association, Boston.
Most psychometricians do not perceive teachers as the experts as to what should be
taught. One of the main claims of the accountability movement has been that poor test scores are
attributable to teachers who have not taught the right things. A psychometric view of
achievement as competency, rather than a pedagogical view of achievement as topical
comprehension, lowers the need to create test items which mirror the complexity of what is
taught.
A Family of Seven Mathematics Test Items
1. 20 x 1.8 + 32 = ?
2. (1/5)(20)(9) + 32 = ?
3. (2 x 20 x 9)/10 + 32 = ?
4. y = 1.8x + 32. Solve for y if x = 20.
5. Convert 20°C to Fahrenheit. F = (9/5)C + 32.
6. Convert 20°C to Fahrenheit. C = (5/9)(F - 32).
7. Ann wants to know today's temperature on the Fahrenheit scale. Her thermometer reads 20 degrees Celsius. What is the Fahrenheit temperature?
Measuring Achievement. Test specialists as well as educators and the general public
expect a good test to indicate volume of knowledge attained as well as student ranking. Even in
primary school, knowledge is complex. A test samples the domain. Just above is a topical family
of math items. Two similar looking items from the same family of items require different
knowledge and indicate different competency. An item belonging to a topical family is not
necessarily a good indicator of how well students will perform on other items within the
family. And so, of course, a group of such items is not a good indicator of how well students
have learned the subject matter.
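Read as computations, the seven items above make the point arithmetically. The worked line below is my restatement, not part of the original item set: every item in the family evaluates the same Celsius-to-Fahrenheit conversion at C = 20.

\[
F = \frac{9}{5}C + 32 = 1.8 \times 20 + 32 = \frac{2 \times 20 \times 9}{10} + 32 = 36 + 32 = 68
\]

Item 6 alone presents the formula inverted, C = (5/9)(F - 32), and must be solved for F before substituting; item 7 adds the work of translating words into the formula. Similar-looking items, different knowledge.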
This is shown for Musical Knowledge in Figure 1. The course content is represented
there by Circle A. The much smaller content of the test is Circle B. In the correlation matrix we
see that the same students tend to do well on both. The test tells us which students are better at
learning the musical knowledge of the course.
If you knew how well the student knows everything in the course, then of course you could
accurately predict how much of the test coverage he or she will know. But we find out only how
much of the test the student knows, and unless the course domain is very homogeneous and
randomly sampled, we cannot predict accurately how much of the course content the student
knows.
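The asymmetry can be made concrete with a toy simulation. Everything in the sketch below is invented; the numbers are mine and come from no study. Each hypothetical student gets a general aptitude plus 200 partly independent course skills (Circle A), and a test samples ten of them (Circle B).

import random

random.seed(1)

N_STUDENTS = 500
N_COURSE = 200   # Circle A: every piece of knowledge taught in the course
N_TEST = 10      # Circle B: the few pieces the test can carry

# Each piece is learned partly from general aptitude, partly from an
# independent, episode-specific experience.
students = []
for _ in range(N_STUDENTS):
    aptitude = random.gauss(0.0, 1.0)
    knows = [aptitude + random.gauss(0.0, 1.0) > 0.0 for _ in range(N_COURSE)]
    students.append(knows)

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

course_scores = [sum(k) for k in students]          # command of Circle A
test_scores = [sum(k[:N_TEST]) for k in students]   # command of Circle B

# Ranking: the ten-item test correlates strongly with the whole domain.
print("corr(test, course):", round(pearson(test_scores, course_scores), 2))

# Coverage: among students with a perfect test score, knowledge of the
# 190 untested pieces still varies widely.
perfect = [k for k, t in zip(students, test_scores) if t == N_TEST]
shares = [sum(k[N_TEST:]) / (N_COURSE - N_TEST) for k in perfect]
print("untested content known by perfect scorers:",
      round(min(shares), 2), "to", round(max(shares), 2))

With these assumed parameters the rank correlation prints out high, while the perfect scorers' command of the untested material spans a wide range. The exact numbers mean nothing; the shape of the result, strong ranking and weak coverage, is the point.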
If we want to know what students have learned, the test provides too small a sample.
From the achievement items in Figure 1, you can quickly tell that music educators will not
find out how well the students perform on the last three items, given knowledge of their
performances on the first two.
Evidence of accountability has been taken to be an acceptable percentage of students
meeting a standard score. Here, a full music test should provide that information. If the items are
aligned with the curriculum and cannot be answered by aptitude alone, the tests will do what the
accountability monitors have asked for, but questions about volume of achievement remain
unanswered. Setting aptitude standards as evidence of accountability has been a mistake.
Figure 1. Music Achievement

The large circle A represents all that is taught in a course or curriculum. The small circle B represents all that is included in a test of A.

Correlation matrix: student performances on A and on B correlate .70. The performances are highly correlated, so we might conclude that the test represents the content pretty well.

Prediction matrix: but when you want to know what knowledge and skills the students have learned, and each knowledge and skill is rather independent of the others, then predictions from the test are weak. Predicting achievement on A from B: 15%. Predicting achievement on B from A: 95%.

Notice below how unlikely it is that you will learn how students will perform on the last three items when given knowledge of their performance on the two test items.

Items of a Musical Knowledge Test
6. If you wanted music to sound like falling rain, what instrument would you play?
7. Who composed the Nutcracker Suite (Waltz of the Flowers)?

Items of Course Knowledge not in the Test
1. In 4/4 time, how many beats does a half note receive?
2. The song Soldier, Soldier (music shown) is written in the key of ____.
3. How important would it be (or is it) to have a piano in your home? _______________
The Question of Aptitude as an Indicator of Achievement. Many criterial thinkers
promote the idea that a difficult-to-measure criterion can be adequately approximated by an
easier-to-measure variable when there is high correlation between them. Thus, aptitude testing is
accepted as if it were achievement testing. Except for homogeneous groups, the correlation will
be reasonably high. The substitution works pretty well for ranking students. The same students
usually win or lose on both scales. But for measurement itself, and especially for student
achievement as a measure of teacher performance, working from correlates is risky business. In
Figure 2, I have plotted performances of a hypothetical group of students regarding Item A, a
social studies test item.
Figure 2. Social Studies Achievement

Item A. Who said, "I am fighting this war so that my son may plant grain and his son may paint pictures"?
1. George Washington
2. John Adams
3. Dwight Eisenhower
4. Ronald Reagan

[Scatter plot: knowledge of the answer to Item A is plotted on the Y axis against aptitude for learning history on the X axis. A horizontal line marks having the knowledge to get the item right; a vertical line marks having the aptitude to figure out the right answer. The Southeast quadrant holds students to the right of the vertical line but below the horizontal one.]
Students with higher aptitudes recognize that all four were Presidents and that the
quotation is not modern (in this age, few people aspire to have their sons plant grain) and that in
their long acquaintance with Washington, this quotation hadn't been brought up.
Students plotting above the horizontal line had enough knowledge to get the item right.
Students plotted to the right of the vertical line had enough general aptitude and test experience
to get the item right. From their general knowledge, some recognize that Item A options all are
Presidents, two from Colonial times and two more recent. They reason that the quotation would
be made in agrarian societies, as in Colonial times, and is not a declaration we remember from
the Washington legends. By elimination, many would choose the correct answer, John Adams.
Is this item measuring aptitude or achievement? Both. But accountability testing was set
up to show whether or not right answers are known, not whether or not wrong answers can be
eliminated. Reasoning and problem solving are legitimate academic goals, but this item was
supposed to test knowledge of specific course level, age level, social studies content. By the rules
of the game, all the students plotted in the Southeast quadrant, about 1/6 of the total, will get
credit for the item only because of aptitude. For an achievement test, Item A is not sufficiently
discriminating, but many test developers who do not distinguish between achievement and
aptitude would find it acceptable.
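A back-of-the-envelope probability, my illustration rather than anything from an item analysis, shows how far elimination alone carries a student on Item A. If general aptitude lets a student rule out k of the four options, guessing among the remainder gives

\[
P(\text{correct by elimination alone}) = \frac{1}{4-k}:
\quad \tfrac{1}{4} \ (k = 0), \quad
\tfrac{1}{2} \ (k = 2,\ \text{the two modern Presidents ruled out}), \quad
1 \ (k = 3,\ \text{Washington ruled out as well}).
\]

So even students who hold none of the intended knowledge move from one chance in four toward certainty as their skill at decoding tests grows.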
This situation illustrates a common problem in estimating alignment. Here the History
content and a Revolutionary War spirit are apparent. The item has face validity. In test
development today, the needed scrutiny and performance analysis are seldom applied to tease out
the factors responsible for correct answering. Some items appear to be at the center of desired
achievement, yet are answered right by students skilled in decoding tests.
Here is a principle to note, central to personnel policy. We do injury to some individuals
when, as long as correlation is not perfect, we substitute a correlate for a true criterion. In an
extreme condition, we know of correlation between race and achievement, but we deplore any
drawing of conclusions about the quality of a person's achievement based on race. The principle is the same for concluding an individual's achievement knowing only his or her aptitude. It is a
lesser injury than that of race discrimination, but an injury still. Beyond individuals, is there also
injury to the school system, when we accept aptitude tests as measures of achievement? I think
so.
Psychometricians are responsible for solving such problems, basing test development on
validity research. Are the tests measuring student achievement or something else? That research
is not being conducted on the standardized tests mandated by the states for the measurement of
student achievement. The validity studies being done are internal consistency studies based on
item response theory. Such studies of correlation within the item pool maximize the likelihood
that whatever is actually being measured by these items is measured well. If the items are not
measuring achievement, the present validity studies will not tell us. External construct validity
studies are badly needed.
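To be concrete about what an internal consistency study certifies, here is a minimal sketch of the classical coefficient, Cronbach's alpha. This is my simplification; the mandated programs rest on item response models, which this little function does not reproduce, but the logic of the objection is the same.

# Cronbach's alpha, the classical internal-consistency coefficient.
# A high alpha says only that the items rise and fall together, that
# SOMETHING is being measured consistently. It cannot say whether that
# something is this year's achievement or general aptitude.
def cronbach_alpha(scores):
    """scores[s][i] = score of student s on item i (e.g., 0 or 1)."""
    n_items = len(scores[0])

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = [variance([row[i] for row in scores]) for i in range(n_items)]
    total_var = variance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical 0/1 scores for four students on three items that track a
# single trait; alpha comes out at .75 no matter what the trait is.
print(cronbach_alpha([[1, 1, 1], [1, 1, 0], [0, 1, 0], [0, 0, 0]]))

A set of items tapping aptitude earns a high alpha just as readily as a set tapping the curriculum; the coefficient certifies consistency, not the construct.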
And beyond internal consistency and predictive validity, the research on validity, as Bob
Linn indicates, should assure that the effects of decisions made on the basis of the testing do not
run counter to the best interests of the student, the school and the society. Consequential validity
studies are not a part of the development of contemporary testing. Many testing managers are
sensitive to the effects of their testing, and they tell us their concerns about "high stakes" testing,
with consequences such as: increasing drop-out rates, increasing teaching for the tests,
diminishing teacher ranks in grades being tested, narrowing the curriculum, and lowering support
for public education. The consequential validity studies are not being done and psychometricians
have not been vigorous in demanding they be done.
Testing, a Key Link in the Quality Control of Education
During the last half of the Twentieth Century, the traditional procedures for quality
control of schooling, i.e., informal management oversight (by teachers as well as administrators),
Board review, State guidelines and regional accreditation, have continued to be prominent in
actual school operations. But because the effectiveness of public education has fallen below
expectation and need, other means have been sought to evaluate and to improve teaching and
learning. For thirty years, student assessment has been the main instrument of state-mandated
educational reform.
School Reform. State-mandated standardized testing was set up as an accountability
strategy to show how well the public schools are performing and to give them targets for
improvement. In order to do this, they needed to make the claim that the best education for the
state was, at least at a minimum, common to all children, and that State academic goals and
standards should be the relevant criteria. Even though schools serve many functions, the stated
goals were limited to student performance in traditional subject matters. The tests needed to be
aligned with the goals and needed to be valid measures of student learning.
Earlier, in the Century's third quarter, the impetus for changing schooling was the
appearance of Sputnik. According to Gallup polls year after year, citizens expressed confidence
in the local school but increasingly worried about the national system. It was reasoned that
American schools were unsuccessful if the Soviets could be first to launch space vehicles.
College professors and the National Science Foundation stepped forward to redefine the
curriculum. They created the "new math" and "new science," inquiry teaching, the project method: courses strange to the taste of most teachers and parents. For a decade, curriculum
redevelopment was the main instrument of reform but politicians, reading the public as unhappy
both with the Old Guard and the New, created a reform of their own. This reform was based on
criterial thinking in the form of test scores.
"School assessment" came to be taken to mean the testing of student achievement with
standardized instruments. Student performance goals were made more explicit so that testing
could be more precisely focused, and efforts were made to align curricula with the testing.
Schooling includes many performances, provisions and relationships that could be assessed but
attention was fixed almost entirely on the tests. "If they haven't learned, they haven't been taught."
Now for at least two decades, in almost every school, at every grade level and in each of
the subject matters, student achievement has been assessed and found inching up and inching
down, but largely unchanged. Teaching, on the whole, appears to have been slightly amended,
toward taking tests, certainly not reinvented. Goal setting appears to have been more an exercise
in explication, not in adaptation or in raising aspirations. The last decade has seen efforts to set
standards, to be explicit about what levels of student performance are needed to restore
American Education to a world-leading position. Instead of reading the ineffectiveness of the
reform based on assessment as pointing to need for a different strategy, the summons has been
for even more assessment.
Assessment Systems. When student assessment became a school reform strategy, what
had been aptitude tests were recommissioned as achievement tests. States and some Districts
created or purchased their own tests, sorting the items by content and labeling them as
achievement tests, but they contained essentially the same items that had been used to measure
aptitude. Some of the better achievement items washed out because they correlated poorly with
other items or because the content was not being taught in enough schools.
State leaders were in a difficult situation. Most parents were satisfied with their schools
but the state appeared unable to move directly at school problems. Schools were a
disproportionate burden on the taxpayer, yet seemingly unresponsive to complaints. American
Education traditionally had been under local control of the schools, but that meant Districts and
teachers could set their own standards. While appearing to be supportive of local control, the
States argued that teachers should be re-focused and minimum standards set for all schools and
that tests would be used to guarantee compliance. Pressures increased but schools changed little,
so during the 80s and 90s, more emphasis was given to standardization of goals, setting
standards, and to testing.
The same reform effort has occurred in most States, in many Districts, and in some other
countries. And we do not have one credible success story, where according to multiple criteria of
merit, the schools are better because of the assessment strategy. Claims of success have been
made, for example, in Texas, Chicago, and England, but the evidence falls short. In Chicago, the
press failed to challenge gains reported by the Mayor's office, and President Clinton, clueless,
declared Chicago a city for all to learn from. But when the tests were changed, Chicago's gains disappeared. If the strategy had merit, in some jurisdiction it would have resulted in well-documented improvements. It could then be called a best practice, but it cannot.
The Complexity of Teaching: Even if student achievement were measured well, it
would be wrong to use those tests to measure the quality of teaching and the trustworthiness of
the educational system. There will be some correlation between achievement scales and school
quality, but good schooling is much more than raising the achievement that any tests measure.
To return to my point of the complexity of helping children become educated, I will tell you
about Kimberly Grogan, a young sixth grade teacher in a Hispanic neighborhood in Chicago.
After lunch the last Friday in May, Grogan announces that for the math lesson, they will do research on bubble gum. "Clear your desks, except for your notebook." It becomes quiet, each at their table. Grogan points to five stations around the room, each with a poster identifying a bubble gum brand name and a small supply of gum: Bazooka, BubbleYum, Carefree Bubble Gum, Bubblicious, and Extra Bubble Gum, Sugar Free.
"There are different kinds of bubble gum and we don't know if they are equally good at making bubbles. Each of you will make a bubble with each of five brands and, using plastic dividers, a teammate will measure the diameter. (She has placed meter sticks and crude calipers at each station.) Record the measurements in your journals and on the posted sheet; later we will draw graphs of bubble size."
"As soon as you have made your measurements, wrap that piece of gum in its wrapper and put it in the paper cup. Do not put gum anywhere except in the cup at its station. If you get it on the floor, you must clean it up. We don't want bubble gum on the floor or under the chairs." They organize into groups. It takes a while. They check the stick weight of each brand.
"Now, do your bubbles and record your results. Record your measurements of diameter at each station."
Half an hour later, all are busy chewing gum and measuring bubbles. Miss Grogan
reminds them that they are doing an experiment in mathematics. Omar gently affixes the plastic
tips to measure David's bubble. Touching Anna's pink sphere, Angela reads the angle as 32
degrees. Amelia objects, saying that the others are measuring in centimeters. It is not possible to
keep all the gum off the floor. Someone tells Grogan. She says to stop tattling. She has a smile
on her face and a camera round her neck.
Then, "Stop! Everyone, stop! The Bubblicious is missing! It needs to be returned. Only four people had taken a piece. The box was full. Sit down." (Grogan is shouting in anger.) So the
next 15 minutes is moral and interpersonal trauma. A lesson on the common good replaces the
graphing instruction.
When the experiment resumes, the sixth-graders are kept at their seats and bubble sizes
are recorded and, as on Handout A, means and medians calculated. "Double-check your answers. The decimals seem to invite mistakes. Be sure you have lined your numbers up right."
What happened during this math class illustrated Kimberly Grogan's attachment to the
children, her concern about their social development, and her desire to use the kinds of
mathematics activities promoted by the National Council of Teachers of Mathematics. She was
nearing the end of the year and it was going to be a difficult separation for her and the children.
She had formed a personal attachment to each of them.
On Tuesday of the following week, I return to the school. In the hallway, I notice a large,
new, colorful poster with the question, "Which bubble gum makes the largest bubbles?" The
answer displayed is BUBBLICIOUS. About a dozen 8 1/2 x 11 graphs are displayed, each
showing five monotonic traces, color-coded as to brand of gum.
The experiment had begun in enthusiasm, had moved through an ethical crisis, had then
become pretty much a follow-the-rules routine. It ended with a set of graphs that, unfortunately,
needed further thinking. It was apparent to me that Grogan and the children did not understand
all the mathematics involved in the exercise.
Several days later I asked Grogan the main purpose of the exercise. She said, "Two purposes, to learn more about graphing and finding averages." I asked her how she decided whether or not the students had learned about these things. Her answer was, "I looked at their journals for completeness of information, and for spelling, neatness, penmanship." Did she ask them to interpret the graph? "Yes, of course. They had a good meaning."
What I saw Grogan's students getting from the Bubblegum experiment across several
math periods was the sense of a science or engineering study. It required thinking about
comparison, about causation, and about measurement, representing bubble size with numbers.
They had an experience of trying to solve a problem, to do an experiment, to arrive at an answer
to an interesting question using their mathematical skills. It was not a demonstration. They did
the thinking. The teaching was not the success it might have been, but it provided a sustained
opportunity to carry out an experiment. I thought the teaching was very good.
Of course I can use criterial thinking to express my evaluation of Grogan's teaching. It
was energized, dramatic, personal, methodologically innovative but substantively conforming,
self-confident, honest, dedicated, naïve, evocative. But we know lots more about the quality of
her teaching because we know the bubble gum episode.
The Denial of Complexity: Teaching and learning are massively complex. Luckily we
often teach better than we know how. Luckily children's learning is not limited to what we are
conscious of teaching. Education is enormously an intuitive business. Efforts to rationalize it
sometimes improve things in some ways, but can lower the complexity, the spirit, and the quality
of accomplishment.
Politically, the move has been to remove Education from the control of professionals,
those who know it first hand, those who have worked to understand it, and yes, those who have
not kept all their pledges, and to place it in the hands of economists, politicians, and specialists in
leadership. Those who oppose this testing appear to be people trying to protect their turf.
Reforming education is easier if teaching and learning are seen as simple.
Among those vigorously denying the complexity of Education and supporting its
improvement through measurement are the Business Forum and the Education Trust. From the
cover of the Trust's latest publication, New Frontiers for a New Century, here are a couple of
sentences.
"To move boldly ahead with needed changes, though, educators, policymakers and advocates need rock-solid information to help them shape policies and practices that will ensure the academic success of every student, especially those who have not been served well by our schools in the past. The Education Trust's Education Watch series is our best effort to provide a solid foundation for action."
That "rock-solid" information is no more than test scores, not achievement information, not
teaching information, but aptitude test scores. I believe that information for understanding
and improving our schools is not contained in those test scores, no matter how
sophisticated the analysis of these criteria.
The Evaluation of Testing. For the turning of two centuries, psychologists have helped
to nourish Education, participating vigorously in the training of teachers and providing
disciplined concepts such as motivation and social norms. They created the idea of a people
perpetually tested and pledged to validate measures of disability and attainment. Leading the way
have been measurement specialists from Francis Galton to L. L. Thurstone to Lee Cronbach to
Bob Linn.
Linn's seven urgings, such as "Don't put all the weight on a single test," are sound advice,
often unheeded in State Departments and District Offices. Graduation testing ignores the fact that
no research studies assure that only high scoring students have earned their diplomas. Some low-
scoring schools are put on probation, ignoring the fact that no evidence shows that schools with
higher means cope better with their constraints. Certainly psychologists are not alone in their
inability to authenticate the representation of education.
The achievement information the tests are providing is aptitude information, a correlate,
not a warranted substitute for the criterion. Clever students and knowledgeable students score
well, and most of them are both. What is wrong with that testing? It is a deceit. Students
increasingly think of education as getting good test scores. But the calamity is the effect on
teachers and schools. Teachers and administrators also increasingly think of education as getting
good scores. And they run their classrooms and Districts accordingly.
The answer is not to substitute good achievement items for good aptitude items and
then continue to over-emphasize the tests. A full program of validity studies would attend to the
consequences: the deprofessionalization of teachers; denial of opportunities to people whose
achievement is not mainstream in fashion; the evolving substitution of test-preparation as the
primary goal of the classroom; the loss of anima.
The trouble is, I am sorry to say, that the students are about as weak in educational
achievement as the aptitude tests say they are. Much of the time we fail to give students credit
for what they do accomplish, but they should accomplish so much more. By invalid measures
and faulty logic, we have arrived at the correct assessment: that our children should be learning
more and our schools should be teaching better. If education is what most leading teachers,
philosophers and researchers think it is, then, by concentrating on aptitude, the tests help our
students learn less and help our schools teach poorly.
A longing for criterial thinking helped get us into this mess, but scales and scores are
rooted deep in our culture. We still have mountains to move. One of the best things to do is to
advocate the joint use of criterial and episodic thinking, not as a weighted scale, but as a
dialectic, to evoke alternative, even contradictory, images of good teaching and learning.
Assessment deserves human judgment. Let's put that in the mandate. For Constantine.
