
11 Improving Teacher-Developed Assessments

Chapter Summary:
A focus on two strategies for improving assessment procedures was seen in this chapter. It was suggested that, given sufficient time and motivation, teachers could use judgmental and/or empirical methods of improving their assessments.

Judgmentally based improvement procedures were described for use by teachers, teachers' colleagues, and students. Five criteria for evaluating assessment procedures were presented for use by teachers and their colleagues: (1) adherence to item-specific guidelines and general item-writing commandments, (2) contribution to score-based inferences, (3) accuracy of content, (4) absence of content gaps, and (5) fairness. A set of possible questions to ask students about test items was also provided.

Two empirical item-improvement indices described in the chapter were p values and item-discrimination indices. A procedure for determining item-discrimination values was described. Designed chiefly for use with norm-referenced measurement, item-discrimination indices do not function well when large numbers of students respond correctly (or incorrectly) to the items involved. Distractor analysis was described as highly useful in the improvement of multiple-choice items because the effectiveness of each item's alternatives can be examined. Finally, two indices of item quality for criterion-referenced measurement were described. Although roughly comparable to the discrimination indices widely used with norm-referenced measurement, the two indices for criterion-referenced assessment reflect how many effectively taught students perform well on examinations.

Chapter Outcome:
Sufficient comprehension of both judgmental and empirical test-improvement procedures so that accurate decisions can be made about the way teachers are employing these two test-improvement strategies.
If you've ever visited the manuscript room of the British Museum, you'll recall seeing handwritten manuscripts of the superstars of English literature. It's a moving experience. Delightfully, the museum presents not only the final versions produced by such authors as Milton and Keats but also the early drafts of those works. It is somewhat surprising and genuinely encouraging to see that these giants of literature didn't get it right the first time. They had to cross out words, delete sentences, and substitute phrases. Their early drafts are genuinely messy, reflecting all sorts of rethinking on the part of the author. Well, if the titans of English literature needed to rework their early drafts, is it at all surprising that teachers usually need to spruce up their classroom assessments?
Chapter 11 PowerPoint

Chapter Activity:
Create a teacher-developed assessment.
1. Choose a subject, topic, and grade level to create a teacher-developed assessment.
2. Reread Chapter 11 and review the assessment that you created.
3. Make any revisions necessary to the teacher-developed assessment.

Chapter 11 – Improving Teacher-Developed Assessments


Educational Assessment - Review By Brenda Roof
Classroom Assessment – What Teachers Need to Know - By W. James Popham

Chapter 11 provides procedures to improve teacher-developed assessments. Two general
improvement strategies are discussed. The first is judgmental item improvement and the second is empirical
item improvement. Teachers need to set aside adequate time and be motivated to apply either of these
strategies.
The first improvement strategy is judgmentally based improvement procedures. These judgments are
provided by yourself, your colleagues, and your students, and each of the three judges follows different
procedures. The first judge would be you, judging your own assessments. The five criteria for
judging your own assessments are: first, adherence to item-specific guidelines and general item-writing
commandments; second, contribution to score-based inferences; third, accuracy of content; fourth,
absence of content lacunae, or gaps; and fifth, fairness. The second judge would be a collegial judge. The
collegial judge would be provided the same five criteria used by yourself, along with a brief description of
each of the five criteria. The third judge would be the students. Students are often overlooked as a
source of test analysis, yet who is better positioned or more experienced than the test takers themselves?
There are five improvement questions students can be asked after an assessment is administered. First,
if any of the items seemed confusing, which ones were they? Second, did any items have more than one
correct answer? If so, which ones? Third, did any items have no correct answer? If so, which ones? Fourth,
were there words in any items that confused you? If so, which ones? Fifth, were the directions for the
test, or for particular subsections, unclear? If so, which ones? These questions can be altered slightly for
constructed-response assessments, performance assessments, or portfolio assessments. Ultimately, the
teacher needs to be the final decision maker for any needed changes.
The second improvement strategy is known as empirically based improvement procedures. This
strategy is based on the empirical data supplied when students respond to teacher-developed
assessments. A variety of well-tuned techniques involving simple counts or formula calculations have
been used over the years. The first technique is the difficulty index. A useful index of item quality is its
difficulty, also referred to as the p-value. The formula looks as follows:

p = R / T

where R = the number of students responding correctly to an item, and T = the total number of students responding to the item.

A p-value can range from 0 to 1.00. Higher p-values indicate items that more students answered correctly,
while an item with a lower p-value is one most students missed. The p-value should be viewed in relation
to the students' chance probability of getting a correct response. For example, on a four-option
multiple-choice item, a p-value of .25 should be expected by chance alone. The actual difficulty of an
item should also be tied to the instructional program: a high p-value does not necessarily mean the item
was too easy; it may simply mean the content has been taught well.
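As a concrete illustration of the p-value calculation just described, here is a minimal Python sketch. The scored-response matrix and function name are hypothetical examples, not anything taken from Popham's text.

# Hypothetical scored responses: rows are students, columns are items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
]

def p_values(scored):
    """Return each item's difficulty (p-value): the proportion of students answering it correctly."""
    n_students = len(scored)
    n_items = len(scored[0])
    return [sum(row[item] for row in scored) / n_students for item in range(n_items)]

print(p_values(responses))  # [0.75, 0.75, 0.25, 1.0]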
The second empirically based technique is the item-discrimination index. An item-discrimination index
typically tells how frequently an item is answered correctly by those who perform well on the total test.
It reflects the relationship between students' responses to the total test and their responses to a
particular test item. The author uses a correlation coefficient between students' total scores and their
performance on a particular item; this is also referred to as a point-biserial correlation. A positively
discriminating item is answered correctly more often by those who score well on the total test than by
those who score poorly on it. A negatively discriminating item is answered correctly more often by those
who score poorly on the total test than by those who score well. A nondiscriminating item is one for
which there is no appreciable difference in the proportion of correct responses between those who score
well and those who score poorly on the total test. Four steps can be applied to compute an item's
discrimination index. First, order the test papers from high to low by total score. Second, divide the
papers evenly into a high group and a low group. Third, calculate a p-value for each group (ph and pl).
Lastly, subtract pl from ph to obtain each item's discrimination index (D):

D = ph – pl

Positive discriminator: High scorers > Low scorers
Negative discriminator: High scorers < Low scorers
Nondiscriminator: High scorers = Low scorers
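The four steps above translate directly into a short calculation. The following Python sketch uses hypothetical scores and an even high/low split of the papers; some teachers instead use the top and bottom 25 to 33 percent, a variation not shown here.

# Hypothetical records: each student's total test score and score (1 = correct) on the item of interest.
students = [
    {"total": 38, "item": 1}, {"total": 35, "item": 1}, {"total": 31, "item": 0},
    {"total": 28, "item": 1}, {"total": 22, "item": 0}, {"total": 19, "item": 0},
]

def discrimination_index(records):
    """D = p(high group) - p(low group), after ordering papers by total score and splitting them evenly."""
    ordered = sorted(records, key=lambda r: r["total"], reverse=True)
    half = len(ordered) // 2
    high, low = ordered[:half], ordered[half:]
    p_high = sum(r["item"] for r in high) / len(high)
    p_low = sum(r["item"] for r in low) / len(low)
    return p_high - p_low

print(round(discrimination_index(students), 2))  # p_high ≈ 0.67, p_low ≈ 0.33, so D ≈ 0.33 for this made-up item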

Negatively discriminating items are a sign that something about the item is not working, because the
item is being missed more often by students who are performing well on the test as a whole than by
students who are not doing well. Ebel and Frisbie (1991) offer the following guidelines:

.40 and above: Very good items
.30–.39: Reasonably good items, but possibly in need of improvement
.20–.29: Marginal items, usually needing improvement
.19 and below: Poor items, to be rejected or revised

The third technique offered for empirically based improvement procedures is distractor analysis. A
distractor analysis examines how the high and low groups are responding to an item's distractors. This
analysis is used to dig deeper into items whose p-values or discrimination indices suggest they need
revision. A table such as the following can be set up:

Test Item #28 (p = .50, D = –.33)
                     A    B*   C    D    Omit
Upper 15             2    5    0    8    0
Lower 15             4    10   0    0    1

In this example the correct response is B (marked with an asterisk). Notice that students doing well on
the total test are choosing distractor D, while students not doing well are choosing the keyed answer B.
Also of interest is that no one is choosing C. As a result of this analysis, something in C should be changed
to make it more appealing, and the item should be examined to see what about D is drawing the
stronger students to it and why the keyed answer B is being selected mainly by the weaker students.
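A distractor analysis is just a tally of option choices by group. The Python sketch below reproduces the counts from the illustrative table above; the variable and function names are mine, and a teacher would normally get this table from scoring software rather than from code.

from collections import Counter

# Hypothetical option choices on one multiple-choice item, split by total-score group (None = omitted).
upper_choices = ["A"] * 2 + ["B"] * 5 + ["D"] * 8
lower_choices = ["A"] * 4 + ["B"] * 10 + [None]

def distractor_table(upper, lower, options=("A", "B", "C", "D")):
    """Count how many upper- and lower-group students picked each alternative, plus omissions."""
    up, lo = Counter(upper), Counter(lower)
    rows = {opt: (up.get(opt, 0), lo.get(opt, 0)) for opt in options}
    rows["Omit"] = (up.get(None, 0), lo.get(None, 0))
    return rows

for option, (upper_count, lower_count) in distractor_table(upper_choices, lower_choices).items():
    print(f"{option}: upper = {upper_count}, lower = {lower_count}")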
Criterion-referenced measurements can also use a form of item analysis, because items on a well-taught
criterion-referenced test tend to yield the low discrimination indices that traditional item analysis flags as
poor. The first approach requires administering the assessment to the same group of students prior to
and following instruction. One disadvantage is that the item analysis can't occur until after both
assessments have been given and instruction has occurred. Pretests may also be reactive by sensitizing
students to certain items, so posttest performance becomes a result of the pretest plus instruction, not
instruction alone. The formula for this procedure looks like the following:

Dppd = Ppost – Ppre

where Ppost = the proportion of students answering the item correctly on the posttest, and Ppre = the proportion answering it correctly on the pretest.

The value of Dppd (discrimination based on pretest-posttest differences) can range from –1.00 to
+1.00, with high positive values indicating effective instruction. The second approach to item analysis for
criterion-referenced measurement is to locate two different groups of students, one group having received
instruction and the other group not having received any instruction. By assessing both groups and applying
the item analysis described earlier, you can gain information about item quality more quickly while
avoiding reactive pretests. The disadvantage is that you are relying on human judgment to identify
comparable instructed and uninstructed groups.
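Both criterion-referenced indices reduce to simple differences between proportions. The sketch below uses hypothetical data, and the function names (including the one for the instructed-versus-uninstructed comparison) are my own labels rather than Popham's.

def proportion_correct(item_scores):
    """Proportion of students answering an item correctly (scores are 0 or 1)."""
    return sum(item_scores) / len(item_scores)

def d_ppd(pretest_scores, posttest_scores):
    """Discrimination based on pretest-posttest differences: Dppd = Ppost - Ppre."""
    return proportion_correct(posttest_scores) - proportion_correct(pretest_scores)

def d_instructed_uninstructed(instructed_scores, uninstructed_scores):
    """Same idea using an instructed group and an uninstructed group instead of pre/post testing."""
    return proportion_correct(instructed_scores) - proportion_correct(uninstructed_scores)

# Hypothetical 0/1 scores on one item for ten students, before and after instruction.
pretest = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
posttest = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
print(round(d_ppd(pretest, posttest), 2))  # 0.9 - 0.3 = 0.6, suggesting the item is sensitive to instruction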
Two strategies have been described: judgmental and empirical methods. Both can be applied to
improve assessment procedures. When using judgmental procedures, five criteria should be used by the
assessment's creator as well as by a collegial judge (who should also receive brief descriptions of the
criteria), and a short series of questions can be posed to students about the assessment. When using the
more empirical methods, several different types of item analysis and discrimination calculations can be
applied using the formulas above. Ultimately, the assessment creator needs to apply good techniques and
be willing to spend some time improving the assessments.

Introduction
Significant research supports the use of frequent formative assessment to aid teaching and
learning, particularly when the assessment is well aligned with curriculum and instruction. Yet teachers
may not have had appropriate training to adequately develop and evaluate the quality of their
assessments and test items. We assess our students in order to gain visibility into what they know and
what they do not know. We seek to gain insight into what they can and cannot do. Proper assessment
results inform teacher instruction and student learning. To get proper assessment results, the
assessments and items that we build and create need to be valid and reliable. Consistent and reliable
assessments and items allow teachers and students to make correct inferences about teaching and
learning. This paper gives teachers guidelines on how to create reliable and valid assessments and
procedures to evaluate and improve the quality of their assessments and items. The following topics are
covered:
§ Guidelines for Creating a Test Blueprint
§ Guidelines for Writing Assessment Items
§ Judgmental Item Improvement Procedures
§ Empirical Item Improvement Procedures
Guidelines for Creating a Test Blueprint
Before writing new assessment items, it is important to state the obvious
upfront. To write quality items, it is imperative to have clear targets. What learning targets are to be
assessed? Are the learning targets clear to both the teacher and the students? Once the learning targets
are understood, the design of the assessment and the choice of item types follow. Generally, we
develop and use assessments because we want teachers to receive feedback about the effectiveness of
their teaching. We also want students to receive feedback about the effectiveness of their learning.
Assessments, when done properly, close the loop between curriculum and instruction. So we need to
define a clear purpose for the assessment. Do we want to know how our students are learning math in
general? Or do we want to know how they are learning numbers and operations? Do we want to know if
they understand science? Or do we want to know if they understand the water cycle? When we design
assessments, we need to be clear and specific on the purpose of the assessment. With a clear purpose,
we can define the expected outcomes of the assessment. Will the assessment identify which students
are proficient or not proficient on specific standards or learning targets? Will the assessment identify
student strengths and weaknesses by skill or learning target? Or will the assessment identify or reliably
sort our students by ability? Once we have clearly defined the purpose and expected outcomes from the
assessment, we need to identify the standards or learning targets that will be assessed. So how do we
know what standards or learning targets to assess or include? If you’ve looked at the Common Core or
your state standards, you know there are many more standards than you can possibly have time to
assess. Since we cannot assess every standard, we must identify the priority standards (Ainsworth,
2015). That is, find and prioritize those standards that students must know and be able to do. These are
the key standards. They are clear and can be understood by the teacher and the student. And, most
importantly, these are standards that are measurable. Priority standards receive greater instruction and
assessment emphasis. However, we are not eliminating any standards. All standards must be taught.
The supporting standards or non-priority standards play a role to help students understand the priority
standards. So even though these standards may not be assessed or measured, they must still be taught.
Once you have identified the learning targets to assess, you need to decide on the levels of cognitive
complexity or rigor to assess. Most teachers are familiar with Bloom’s taxonomy or the Revised Bloom’s
taxonomy. In this taxonomy, there are 6 levels. Another common taxonomy that is used, particularly in
assessment, is Norm Webb's Depth of Knowledge. In this taxonomy, there are 4 levels. Both are good
taxonomies. My advice is to use the one that you are most familiar with.
Revised Bloom's Taxonomy (2001): 1. Remember, 2. Understand, 3. Apply, 4. Analyze, 5. Evaluate, 6. Create
Webb's Depth of Knowledge (2002): 1. Recall/Reproduction, 2. Skills and Concepts, 3. Strategic Thinking and Reasoning, 4. Extended Thinking
The
last thing to consider in the design of the assessment is the type of items to include on the assessment.
Generally, there are two types of items. Items that require students to select an answer are called
selected-response items. Items that require students to construct an answer are called
constructed-response items. Common selected-response items are multiple-choice, true/false, yes/no,
and matching items. These items allow you to sample or measure a broader content area or more
learning targets. They can be quickly and objectively scored. Because you can use more of these items,
they lead to higher reliability and greater efficiency in the assessment. However, these items often tend
to overemphasize lower-level recall and reproduction of knowledge. They can be used to assess higher
cognitive levels, but it may be harder to write those items. Also, since students are selecting responses,
they are not constructing or writing, which is the main criticism of selected-response items.
Constructed-response items, on the other hand, can be more cognitively challenging for the students.
They are asked to generate a response. From these responses, particularly the wrong responses,
teachers get a good source of data and see the common errors or misconceptions that the students
have. However, constructed-response items take more time to score and to write. Additionally, because
they are subjectively scored, they tend to lead to lower assessment reliability. The purpose, the
outcomes, the learning targets to be assessed, the level of cognitive complexity to use, and the types of
items to use make up the test blueprint. The test blueprint is your plan for assessment construction.
Below is an example of a test blueprint. This is a basic blueprint that uses Webb’s depth of knowledge
and generic “learning targets” (you should replace these with specific Common Core State Standards or
your local standards or learning targets). More complex blueprints may include the item type. This
blueprint allows teachers to see how many items and points there will be on the assessment and what
type of coverage there is across the learning targets. The key point is to make sure the assessment
covers all the learning targets and is spread across the cognitive levels. We would not want all the
questions and points to come from one target or one cognitive level. The more items or points you have
per learning target, the more reliable an estimate of the student's ability you will receive. To have a
sufficiently reliable estimate of the score (i.e., the student's ability) by learning target, it is best to
include at least 5 points per target.
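The original document refers to an example blueprint that is not reproduced here, so the table below is only an illustrative placeholder built from the ideas above; the learning targets, item counts, and point totals are hypothetical (one point per item is assumed).

Learning Target       DOK 1   DOK 2   DOK 3   DOK 4   Total Points
Learning Target 1     2       2       1       0       5
Learning Target 2     1       2       2       1       6
Learning Target 3     2       1       2       0       5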
Guidelines for Writing Assessment Items
In this paper, we consider the four most common item types developed by teachers for their classroom
assessments. These item types can be developed by any teacher and used effectively in the classroom.
Multiple-Choice: Consists of an introductory part called a stem (either a statement or question) and a set
of answer choices, one of which is the answer. This item type is used when there is only one correct
answer and several plausible incorrect choices that can help teachers diagnose student misunderstandings.
True/False: Consists of a statement or proposition for students to verify as true or false. These are best
used when there is a large body of content to be tested.
Matching: Consists of two lists or phrases where the entries on each list are to be matched. The list on
the left contains the premises. The list on the right contains the responses. Matching items are best used
when there are many related thoughts or facts for students to associate with each other.
Constructed-Response: Items that require students to generate their own response, in contrast to
selecting a response, are called constructed-response items. These can be simple items such as
fill-in-the-blank items or more complex items such as extended-response items. These types of items
often require students to demonstrate more in-depth understanding than selected-response items.
General Item Writing Guidelines
Developing assessment items is a relatively easy task. However, developing quality
assessment items can be more challenging. The following general guidelines can help improve the
quality of all items, regardless of item type. For more detailed guidelines with example items, see
Chappuis, Stiggins, Chappuis, & Arter (2012) and Popham (2011). 1. Keep wording simple and focused.
This adage in effective written communication is just as applicable when writing items as when writing
expository essays. 2. Eliminate clues to the correct answer. Be careful that grammatical clues do not
signal the correct answer within an item or across items on a test. 3. Highlight critical or keywords.
Critical words such as “MOST”, “LEAST”, “EXCEPT” can be easily overlooked so highlight them for the
student. 4. Review and double-check the scoring key. As in any written piece of work, it is always good
practice to review and double-check, especially the scoring key.
Multiple-Choice Guidelines
1. Make the
stem a self-contained problem. Ask a complete question with all the necessary information to answer it
contained in the stem. This aids in clarity and makes it easier for students to read through tersely stated
answer choices to select the correct answer. 2. Make all answer choices plausible. All answer choices
must be plausible with only one correct answer. Incorrect choices must be reasonable so that they can’t
be ruled out without having knowledge or proficiency in the content being assessed. 3. Keep length of
answer choices similar. Answer choices that are parallel and of similar length do not cue the correct
answer. 4. "Of-the-above" options. Never use "all of the above" as an option. If students know that both
A and B are correct, they can simply choose E ("all of the above") without knowing anything about C or D.
Including "none of the above" adds difficulty to the item because students can't simply assume and guess
that one of the options is
correct.
True/False Guidelines
1. Include only one concept or idea in each statement. Make the item
completely true or false as stated. Don’t make it complex, which will confuse the issue and students. 2.
Avoid using negative statements. Negatives are often harder to understand and comprehend, especially
double negatives. Use negative statements sparingly and avoid using double negatives at all costs.
Matching Guidelines
1. Provide clear directions on how to make the match. It's important to let students
know what the basis of the matching is supposed to be. If responses can be matched to multiple
premises, make that clear in the directions. 2. Use homogeneous lists of premises and responses. Keep
the set of premises and responses homogeneous. For example, don't mix events with dates or names. 3.
Keep responses short and brief. Premises should be longer. Thus, the responses should be short and
brief and parallel in their construction. 4. Use more responses than premises. When there are more
responses than premises, this prevents students from arriving at an answer through a process of
elimination.
Constructed-Response Guidelines
1. Use
Direct Questions. For most items, it is better to use direct questions rather than incomplete statements.
This is especially true for constructed-response items with young students. Students are less likely to be
confused by direct questions and they lead the teacher to avoid ambiguity in the item. 2. Encourage
Concise Response. Responses to short-answer items should be short. Although constructed-response
items are open-ended, they should be written so that they encourage short concise answer responses.
3. Keep at the End. When constructing these items as fill-in-the-blank items, place the blank or blanks at
the end of the incomplete statement. Blanks at the beginning are more likely to confuse the students. 4.
Limit the Blanks. Use only one or two blanks at most. The more blanks there are, the more likely it is to
confuse the students. 5. Use Rubrics. For extended-response items that require more complex answers
and higher-order thinking from students than a single-word (or number) answer for a fill-in-the-blank
item, the desired student response can be as long as a paragraph or as extended as an essay. This
item type requires more teacher time to score as the teacher must score it manually. To make scoring
easier and fairer to students, use a rubric or scoring guide when judging student work. This makes the
evaluation criteria clearer to both the teacher and, more importantly, to the student.
Judgmental Item Improvement Procedures
As with any form of written communication, it is wise to create drafts and
then review and edit them as necessary before making them final. The same adage holds true for writing
assessment items. Reviewing and editing items is a judgment-based procedure, whether they be your
own or that of others. There are three sources of test-improvement judgments that you should
consider: (1) yourself, (2) your colleagues, and (3) your students.
Judging Your Own Items
It is often best (and sometimes the only option) to review and try to improve your own items. Popham (2011) provides
the following guidelines to make the review of assessment items systematic and strategic. 1. Adherence
to guidelines. When reviewing your own items, be sure to be familiar with general item-writing
guidelines. Use them to find and fix any violations of item writing principles. 2. Contribution to score-
based inference. The reason we assess is to make valid score-based inferences of student knowledge
and ability. As you review each item, consider whether the item does in fact contribute to the kind of
inference you desire to make about your students. 3. Accuracy of content. Sometimes, previously
accurate content is contradicted by more recent content or findings (e.g., Pluto). Be sure to see if the
content is still accurate. And above all else, always make sure the key is correct. 4. Absence of content
gap. Review your test as a whole to ensure that important content is not overlooked. This is where your
test blueprint will come in handy. Identify the priority standards and make sure your assessment
content adequately covers those standards. 5. Fairness. Assessment items should be free of bias. They
should not favor one group nor discriminate against another. Be attentive to any potential bias so that
you can eliminate it as much as you possibly can.
Collegial Judgments
Often teachers work in teams
or in professional learning communities (PLCs). This school environment provides a great opportunity for
teachers to enlist the help and judgment from those colleagues whom they trust. When you ask another
teacher to review your items, be sure to provide them with guidelines and criteria for review. You can
provide them with a summary of the five criteria above. If you have item-writing guidelines that you
followed when you developed the items,
provide those resources to them too. Teachers are busy professionals. It is often hard to find time to
review your own items, let alone find and ask a fellow teacher to review them for you. If your school is
part of a large district, there may be a district staff member who specializes in assessment. This person
would be a great resource to have review your assessment development procedures and the quality of
your items.
Student Judgments
Another source of good judgment data can come from the students
themselves. Students have a lot of experience taking tests and reading through a lot of items. They are
often happy to provide you with information to help you improve your assessments. When asking for
student feedback, be sure to ask them after they have completed the assessment. Students should not
be asked to take the test and review it at the same time. Doing both simultaneously may lead to poor
test performance and poor feedback. After students have completed the test, give them an opportunity
to provide you with feedback on the assessment as a whole and on each specific item. Gaining
information as to whether they thought the directions were clear and if any of the items were confusing
is valuable information to inform your assessment creation. Giving students the opportunity to provide
feedback to you can not only help you improve your assessments and items, it can also help you further
engage your students and build a critical and trusting teacher-student relationship.
Empirical Item Improvement Procedures
You've written your assessment items adhering to item-writing guidelines and
best practices. You’ve reviewed them and had a colleague review them before you gave the assessment
to your students. And perhaps, students also provided feedback after they completed the test. Now that
your students have taken the test, this provides another rich source of valuable information that you
can use to improve your assessments and items. These empirical methods are based on numbers and
statistics. But they need not be daunting. If you deliver your tests with online assessment software, all
the numbers will most likely be computed for you. All you need to do is know a little about these
statistics and what they can tell you.
Reliability Indices
Reliability refers to the expected consistency of test scores. As shown in the formula below, the reliability
coefficient expresses the consistency of test scores as the ratio of true-score variance to total score
variance (true-score variance plus error variance):

ρ_X = σ²_T / σ²_X = σ²_T / (σ²_T + σ²_E)

If all test score variance were true, the index would equal 1.0. Conversely, the index will be 0.0 if none of
the test score variance was true. Clearly, a larger coefficient is better as it indicates the test scores are
influenced less by random sources of error. Generally speaking, reliabilities go up with an increase in
test length and population heterogeneity and go down with shorter tests and more homogeneous
populations. Although a number of reliability indices exist, a frequently reported index for achievement
tests is Coefficient Alpha, which indicates the internal consistency over the responses to a set of items
measuring an underlying trait. From this perspective, Alpha can be thought of as the correlation
between scores if the students could be tested twice with the same instrument without the second
testing being affected by the first. It can also be conceptualized as the extent to which an exchangeable
set of items from the same domain would result in similar ordering of students. For large-scale
educational assessments, reliability estimates above .80 are common. A reliability coefficient of 0.50
would suggest that there is as much error variance as true-score variance in the test scores. For
classroom assessments, where the number of items on a test may be smaller and the number of students taking a
test may be fewer than on large-scale assessments, it is still important to consider and look at the
reliability estimate of the test when appropriate, such as when giving midterms or final assessments.
When important judgments or inferences are to be made from the test scores, it is important to make
sure that those test scores are sufficiently reliable. For classroom assessments, it is desirable to have a
test reliability greater than 0.70.
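Coefficient alpha is usually reported by assessment software, but it can also be computed directly from a students-by-items matrix of item scores. The following Python sketch is a minimal illustration with hypothetical data (it uses population variances); it is not the exact routine any particular platform uses.

def coefficient_alpha(scored):
    """Cronbach's coefficient alpha for a students-by-items matrix of item scores."""
    k = len(scored[0])  # number of items

    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    item_variances = [variance([row[i] for row in scored]) for i in range(k)]
    total_variance = variance([sum(row) for row in scored])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical 0/1 scores for six students on a five-item quiz.
scores = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
print(round(coefficient_alpha(scores), 2))  # about 0.67 for this small made-up data set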
Difficulty Indices
At the most general level, an item's difficulty is indicated by its mean score in some specified group:

x̄ = (1/n) · Σ xᵢ, summing over i = 1 to n

In the mean score formula above, the individual item scores (xᵢ) are summed and then divided by the
total number of students (n). The mean score is the proportion correct for the item. This is also known
as the p-value. In theory, p-
values can range from 0.0 to 1.0 on the proportion-correct scale. For example, if an item has a p-value of
0.92, it means 92 percent of the students answered the item correctly. Additionally, this value might
also suggest that: 1) the item was relatively easy, and/or 2) the students who attempted the item were
relatively high achievers. For selected-response items such as multiple-choice or true/false items, it is
important to view the p-value in relationship to the chance probability of getting the answer correct. For
example, if there are four response options on a multiple-choice item, by chance alone, a student would
be expected to answer the item correctly a quarter of the time (p-value = 0.25). For true/false items, the
chance of answering that item correctly is 0.50.
Discrimination Indices
Discrimination is an important
consideration in assessment theory. The use of more discriminating items on a test is associated with
more reliable test scores. At the most general level, item discrimination indicates an item’s ability to
differentiate between high and low achievers. It is expected that students with high ability (i.e., those
who perform well on the assessment overall) would be more likely to answer any given item correctly,
while students with low ability (i.e., those who perform poorly on the assessment overall) would be
more likely to answer the same item incorrectly. Most often, Pearson's product-moment correlation
coefficient between item scores and test scores is used to indicate discrimination. This is also known as
the item-total correlation. The correlation coefficient can range from -1.0 to +1.0. A high item-total
correlation indicates that high-scoring students tend to get the item right while low-scoring students tend
to answer the item incorrectly; this indicates a good discriminating item. Good test items are expected
to yield positive values (e.g., greater than .30). Very good items typically have discrimination values
greater than 0.40. Items with low discrimination values (< 0.20) and particularly those with negative
values should be reviewed and revised if possible. If not, they should be discarded and not used on
future assessments.
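The item-total (point-biserial) correlation described above is simply a Pearson correlation between item scores and total scores. A minimal Python sketch with hypothetical data follows; in practice the total is often corrected by removing the item being analyzed, which this sketch does not do.

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return covariance / (sd_x * sd_y)

# Hypothetical data: 0/1 scores on one item and each student's total test score.
item_scores = [1, 1, 1, 0, 1, 0, 0, 0]
total_scores = [40, 37, 35, 33, 30, 24, 20, 15]
print(round(pearson_r(item_scores, total_scores), 2))  # about 0.76: a strongly positive, discriminating item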
Distractor Analysis
In addition to using the statistics provided above, it is often
necessary and informative to look deeper at how the items perform. A distractor analysis gives us this
look into how the students actually performed on the item. This analysis identifies the number (or
percent) of students who picked each option on a selected-response item. It also identifies the number
of students who provided an incorrect response to constructed-response items. For example, high-
quality multiple-choice items should have at least some students picking each of the response options.
Remember, all distractors should be plausible, thus there should be students who pick that distractor.
These distractors can tell us what misinformation or miscalculations the students make. Items with
distractors where no student selected that option should be reviewed and revised if possible.
Putting It Into Practice
Writing quality assessment items and reviewing and evaluating assessment and item
performance need not be daunting and tedious tasks. They can be made much easier with the proper
tools. Good Next Generation Assessment platforms such as Naiku provide teachers with these tools.
Below, the item-writing and item-evaluation processes are illustrated with Naiku.
Item Writing
Item creation forms should be simple and intuitive. They should provide all the necessary elements at
the forefront. Less often used features should be hidden
and made easily available when needed. Below is the form for writing multiple-choice items in Naiku.
The stem is provided at the top. This is where the question or statement goes. The standard four-option
choices are provided next. More options can be added or options can be reduced with either a click of
the + or – buttons. Math formulas, formatted text, rationale statements, and standards alignment can
be added by clicking on the appropriate buttons.
Figure 1. Multiple-Choice Item Creation in Naiku.
Forms for other item types should be equally easy and intuitive to use.
Item Evaluation
Just as important as providing a good tool for creating items, an assessment platform should provide
tools to review and evaluate items. Below, I highlight some of the features of Naiku that allow teachers to
review and evaluate items. Teachers can review and provide comments to their own items. In addition,
other teachers can review and provide comments to the creator of the item. There are multiple ways to
provide feedback for your colleagues. The first and simplest method is to vote the item up/down to
quickly indicate your level of satisfaction with the item. The second and more useful method is to leave
and provide constructive feedback in the comment box. It is important to tell the writer how the item
can be improved and, if so, what changes need to be made.
Figure 2. Peer Item Review and Comment in Naiku.
In addition to review by peer teachers, Naiku encourages students to provide feedback to their
teachers. This is accomplished during the reflection stage, after students have completed the test.
During this process, students can inform their teacher whether they thought the item was confusing,
had more than one correct answer, or had no correct answer.
Figure 3. Student Feedback to the Teacher in Naiku.
After all the students complete the test, you can close and score the assessment. This will
generate both a test result and item analysis report to help you review and evaluate the quality of the
assessment and the items. The test analysis report is shown below. Note the reliability estimate in the
Assessment Statistics box. This statistic is the Coefficient Alpha, which provides a good estimate of the
reliability of the assessment. Also note the
other descriptive statistics such as the minimum score, maximum score, mean score and standard
deviation, which give a good picture of how your students performed on the assessment. At the bottom
of the report, note the scores are presented by standard (or learning target). This helps teachers identify
their students' strengths and areas in need of improvement and perhaps extra instruction.
Figure 4. Test Analysis Report Showing Reliability Index in Naiku.
The item statistics and distractor analysis are
provided in the Item Analysis report. For each item, the estimate of the difficulty of the item (p-value)
and the item’s discrimination power (pbis) are provided. These two statistics provide teachers with a lot
of information about the quality of their items. In addition, the frequency and percent of students
selecting each response option are provided. This information helps teachers revise and improve the
items. In addition, this information is useful for special education teachers should they need to make
modifications to the item (e.g., removal of distractors).
Figure 5. Item Analysis Report Showing Item Difficulty, Discrimination, and Distractor Analysis in Naiku.
At Naiku, we believe that teachers can create
good and effective assessments and items. Teachers should familiarize themselves with the best
practices and guidelines in assessment and item development. Judgmental approaches can be used to
help teachers review and improve their items. Empirical approaches, utilizing item difficulty, item
discrimination, and distractor analysis can also be used to evaluate the quality of the items. Although
these statistics are best used with norm-referenced assessments, they can still be informative for a
teacher's classroom criterion-referenced assessments. However, it is advised that you do not rely too
heavily on the statistics, particularly when the number of students who took the assessment is small and
the number of items on the test is also small. Teachers can use a tool like Naiku to help them develop,
review, evaluate, and revise their
assessments and items.
References
Ainsworth, L. (2015). Common Formative Assessments 2.0: How teacher teams intentionally align standards, instruction, and assessment. Thousand Oaks, CA: Corwin.
Anderson, L. W. (Ed.), Krathwohl, D. R. (Ed.), Airasian, P. W., Cruikshank, K. A., Mayer, R. E., Pintrich, P. R., Raths, J., & Wittrock, M. C. (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom's Taxonomy of Educational Objectives (Complete edition). New York: Longman.
Chappuis, J., Stiggins, R., Chappuis, S., & Arter, J. (2012). Classroom Assessment for Student Learning (2nd ed.). Upper Saddle River, NJ: Pearson Education.
Popham, W. J. (2011). Classroom Assessment: What Teachers Need to Know (6th ed.). Boston, MA: Pearson Education.
Webb, N. (2002, March 28). "Depth-of-Knowledge Levels for Four Content Areas." Unpublished paper.
About the Author
Dr. Adisack Nhouyvanisvong is an
expert in educational assessment, including computer-based and adaptive testing. He has created and
ensured the psychometric integrity of large-scale educational assessments for states and national
organizations, taught at the University of Minnesota, and is an adjunct professor at St. Mary's University
where he has taught educators and graduate students in educational assessment practice and
instructional strategies. He has been published in peer-reviewed journals, regularly speaks at education
conferences and is currently President of Naiku.
Improving Teacher-Developed Assessments

The last chapter in the Popham book that addresses some of the important outcomes to be assessed on
the upcoming EDUC 603 exam is Chapter 11: Improving Teacher-Developed Assessments. This chapter
is important because it directly addresses skills that on one hand reflect an application of many of the
skills facilitated throughout the course, and on the other hand it provides instruction over a couple of
important skills that can help make you more capable of improving your own test development skills (very
important in your growth as a professional educator).

As indicated, three important skills that will be measured on the upcoming exam are facilitated within this
chapter:

11.1 Differentiate between examples of judgmental and empirical assessment item improvement strategies.

11.2 Given a set of scores (pretest and/or posttest scores) for a particular test, calculate
each item’s discrimination index (D) for both norm-referenced as well as criterion-
referenced assessment items.

11.3 Given the results of an item (distracter) analysis for a multiple choice assessment
(including p and D values) along with a general description of the purpose of the test,
determine whether or not specific test items should be included in subsequent versions of
the test.
The first part of the chapter provides you with information and examples highlighting differences between
judgmentally-based and empirically-based procedures for improving the quality of your tests. Popham
points out that perhaps YOU are the most important person capable of judging the quality of your work.
One way you can judge the quality of your work is to apply the rules/guidelines for good test item
construction to your own work (similar to what you did for project #1). You are also in a position to judge
the accuracy and completeness of content addressed within an assessment, as well as the degree to
which bias is mitigated. You just need to take a closer look at your own work! On a side note, articles
were recently published online about a list of words and topics that the New York Department of
Education hopes test-makers will avoid using in order to prepare tests that minimize bias. Popham also
suggests that you enlist the help of colleagues as well as students in helping you determine the overall
quality and effectiveness of assessment instruments you create. Getting feedback from outside sources is
something you are asked to do for project #2.

In addition to judgmentally-based improvement strategies, Chapter 11 also presents some important
ways that you can use the data you collect from students as a means of identifying strengths and areas
for improvement in the assessment instruments you create. Two simple but effective methods for
tabulating and analyzing assessment data are presented in the text. One simple calculation helps you determine
how well a test item seems to discriminate between students who seem to be able to successfully
perform the skills measured on an assessment as a whole and those students who don't appear to have
learned the skills well. This is called a discrimination index, and it's a fun thing to calculate. You can use
calculations like the discrimination index to help you decide if specific assessment items are doing their
job.

In addition to reading Chapters 11-12, you should also carefully review the information presented in the
course notes that addresses identifying differences between formative and summative evaluation, as well
as distinguishing between examples of norm-referenced and criterion-referenced assessments. These
concepts are a bit of a review from earlier material, but they fit well within the overall category of
"Improving Teacher-Developed Assessments." The course notes address these topics as well.

After you complete the practice items included in Chapters 11 & 12 and review all the course notes for
part two of this course, you should be ready to take the practice exam #2. This document is accessible
from the course notes page. This practice test asks you to perform the following important skills defined
within the second part of the course:

6. Strategies for Effective Selected-Response Tests & 7. Strategies for Effective Constructed-Response
Tests

6.1 Edit poorly-worded assessment directions to make them more well-written.

6.2 Edit poorly-constructed selected-response and constructed-response assessment items to make them more well-written.

6.3 Given an instructional objective, write selected-response and/or constructed-response assessment items that appropriately measure the SKA indicated within the objectives under the conditions stated (if applicable).

8. Performance Assessments and Effective Rubrics

8.1 Construct a well-written rubric scoring guide for a given instructional goal
or task.

10. Affective Assessment

10.3 Develop an effective assessment instrument designed to measure attitudes for a specific instructional experience.

11. Improving Teacher-Developed Assessments

11.1 Differentiate between examples of judgmental and empirical assessment item improvement strategies.

11.2 Given a set of scores (pretest and/or posttest scores) for a particular test,
calculate each item’s discrimination index (D) for both norm-referenced as well as
criterion-referenced assessment items.

11.3 Given the results of an item (distracter) analysis for a multiple choice
assessment (including p and D values) along with a general description of the
purpose of the test, determine whether or not specific test items should be included
in subsequent versions of the test.
12. Formative Assessment

12.1 Distinguish between examples of formative and summative assessments.

13. Making Sense Out of Standardized Test Scores

13.1 Distinguish between examples of norm-referenced and criterion-referenced assessments.

Terms in this set (11)

Judgmental Item Improvement Procedures
*use human judgment to improve test items
*3 sources of judgments: self, colleagues, students
Judgment by Self
*can be biased
*allow time between test construction & review
*use the 5 review criteria
Judgment by Colleagues
*provides good, non-partisan review
*helpful for performance or portfolio assessments
*time consuming
Judgment by Students
*overlooked group of reviewers
*spot problems with items and directions due to experience taking tests
*test is administered before the student review
*use a questionnaire to collect data
Empirically Based Item Improvement
*review based on responses to test items
*works well with selected-response items
*three general techniques: difficulty indices, discrimination indices, distractor analysis
Item Difficulty Indices
*useful index of item quality
*common index is p-value
Item Discrimination Indices
*powerful index
*desire positively discriminating items
*reports the frequency of those who score well on the assessment and answer the item
correctly
*procedure
Distractor Analysis
*determine how high and low groups are performing on distractors
*calculate difficulty (p) and discrimination (D) for each response alternative
*review values for distractors
Criterion-Referenced Measurements
*desire high p-values for items
*2 different schemes: Dppd, Duipd
Discrimination based on Pre-test & Post-tests
*test reactivity; Dppd = Ppost – Ppre

Chapter 11
Improving Teacher-Developed Assessments

Sufficient comprehension of both judgmental and empirical test-improvement
procedures so that accurate decisions can be made about the way teachers are
employing these two test-improvement strategies
Chief Chapter Outcome
Outline
Strategies for Item Improvement
Judgmentally-Based Procedures
Empirically-Based Procedures
Difficulty Indices
Discrimination Indices
Distractor Analysis
Item Analysis for Criterion-Referenced Measurement

Strategies for Item Improvement
Two general improvement strategies
Judgmental item-improvement procedures
Human judgment is chiefly used
Empirical item-improvement procedures
Based on students’ responses
Both can be applied to your own classroom assessment devices

Strategies for Judgmentally-Based Item Improvement
Use human judgment to improve test items
Three sources of judgments
Self
Colleagues
Students

Strategies for Judgmentally-Based Item Improvement
Judgment by self
Can be biased
Allow time between test construction & review
Consider these five review criteria
Do the items adhere to the guidelines and rules specified by the text?
Do the items contribute to score-based inferences?
Is the content still accurate?
Are there gaps in the content (lacunae)?
Is the test fair to all?

Strategies for Judgmentally-Based Item Improvement
Judgment by a trusted colleague
Provide colleague with brief description of the previously-mentioned criteria
Describe the key-inference intended by the test
Particularly useful with performance and portfolio assessments
Offer to return the favor

Strategies for Judgmentally-Based Item Improvement
Judgment by students
Typically an overlooked group of reviewers
Students review after completing the test
Particularly useful with
Problems with items and directions
Time allowed for completion of the test
Use a questionnaire to collect data
Expect carping from low-scoring students

Strategies for Judgmentally-Based Item Improvement
Item-Improvement Questionnaire for Students
If any of the items seemed confusing, which ones were they?
Did any items have more than one correct answer? If so, which ones?
Did any items have no correct answers? If so, which ones?
Were there words in any items that confused you? If so, which ones?
Were the directions for the test, or for particular subsections, unclear? If so, which
ones?
An Illustrative Item-Improvement Questionnaire for Students

Strategies for Empirically-Based Item Improvement
Student responses to test items are used
Used with selected-response type
Three useful techniques for evaluating items
Difficulty indices
Discrimination indices
Distractor analysis
Each item is assigned a difficulty index and a discrimination index

Strategies for Empirically-Based Item Improvement
Item difficulty indices
p can range from 0 to 1.00
Values closer to 1 indicate most students answered the item correctly
Values nearer to 0 indicate most students answered the item incorrectly, e.g., A p
value of .15 means 85% of the students answered the item incorrectly

Strategies for Empirically-Based Item Improvement
Item difficulty indices
The p value of an item should be viewed in relation to the probability of getting the
item correct, e.g., For a T/F item, the student should be able to answer the item
correctly ½ of the time by chance alone (p = .50)
Not necessarily an index of “difficulty”
A high p value may just indicate the concept was taught effectively
Strategies for Empirically-Based Item Improvement
Item-discrimination indices
Indicates how frequently an item is answered correctly by those who perform well
on the entire test
Reflects the relationship between responses to the whole test and to a particular
test item
Type of Item / Proportion of Correct Responses on Total Test
Positive discriminator: High Scorers > Low Scorers
Negative discriminator: High Scorers < Low Scorers
Nondiscriminator: High Scorers = Low Scorers

Strategies for Empirically-Based Item Improvement
Item-discrimination indices
Positively-discriminating items are preferred over negatively-discriminating items
A negative discrimination index for an item indicates the item is answered correctly
more often by the least knowledgeable students

Decision Time–To Catch a Culprit: Teacher or Test?
Susan Stevens, a 6th-grade social studies teacher, has developed her own classroom
assessments. Recently she learned to conduct discrimination analyses and has
determined that one of her tests had negative discrimination on 4 of the 30 items
(students who performed well on the total test answered the 4 items incorrectly
more often than those who didn’t perform well). Those 4 items dealt with the same
topic—the relationships among the branches of the US government. Susan is
considering removing the items, but realized that the negative discriminations for
these items may be the result of her instruction. What do you recommend she do?

Strategies for Empirically-Based Item Improvement
Item-discrimination indices procedure
Order papers from high to low by total score
Divide papers into two groups—a high group and a low group
Calculate the p value for each item for the high group and low group
Divide # of students in the high group who answered the item correctly by the # of
students in the high group
Repeat for the low group

Strategies for Empirically-Based Item Improvement
Item discrimination indices procedure
Discrimination Index / Item Evaluation
.40 and above: Very good items
.30–.39: Reasonably good items, but possibly subject to improvement
.20–.29: Marginal items, usually needing improvement
.19 and below: Poor items, to be rejected or improved by revision
Guidelines for Evaluating the Discriminating Efficiency of Items
Source: Ebel, R.L. and Frisbie, D.A. (1991) Essentials of Educational Measurement
(5th ed.). Englewood Cliffs, NJ: Prentice Hall.
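For convenience, the guideline table above can be expressed as a small lookup function. This is a hypothetical Python helper of my own that simply encodes the Ebel and Frisbie (1991) ranges shown on the slide.

def evaluate_discrimination(d):
    """Classify an item-discrimination index using the Ebel and Frisbie (1991) ranges."""
    if d >= 0.40:
        return "Very good item"
    if d >= 0.30:
        return "Reasonably good item, but possibly subject to improvement"
    if d >= 0.20:
        return "Marginal item, usually needing improvement"
    return "Poor item, to be rejected or improved by revision"

print(evaluate_discrimination(0.33))  # Reasonably good item, but possibly subject to improvement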

Strategies for Empirically-Based Item Improvement
Item discrimination indices
Not necessarily appropriate to have highly discriminating items
In an instructionally-oriented setting, we would want low discrimination for all items
on a posttest
Items on a posttest with low discrimination indices are indicators of instructional
excellence
Most suitable for norm-referenced measurement missions

Strategies for Empirically-Based Item Improvement
Distractor Analysis
Examination of the incorrect options (distractors) for multiple-choice or matching
items
Determines how high- and low-group students are responding to an item’s
distractors

Strategies for Empirically-Based Item Improvement
Distractor Analysis
E.g., An inspection of distractors can reveal if there appears something in a distractor
that is enticing students in the high group to select it

Strategies for Empirically-Based Item Improvement
Distractor Analysis
Calculate difficulty (p) and discrimination (D) for each response alternative
Review values for distractors
Are any p values too high or too low?
Are any D values negative?
Are there any patterns in responses that indicate modifications should be made?

Strategies for Empirically-Based Item Improvement
Distractor Analysis
Item No. 28 (p = .50, D = –.33)
Alternatives:         A    B*   C    D    Omit
Upper 15 students     2    5    0    8    0
Lower 15 students     4    10   0    0    1
A Typical Distractor-Analysis Table

Criterion-Referenced Measurements
Desire high values of p for all items
Post-instruction difficulties (p) approach 1
Desire low discriminations (D)
Two different general approaches to item analysis are used
Similar to those used for norm-referenced tests

Item Analysis for Criterion-Referenced Measurement
Same group—Posttest/Pretest analysis disadvantages
Instruction must be completed before securing item analysis
Pretest may be reactive
Students can be sensitized to items on the posttest from their experience on the
pretest
Posttest performance becomes a function of the instruction and the pretest

Item Analysis for Criterion-Referenced Measurement
Instructed/Uninstructed analysis disadvantages
Must rely on human judgment to determine instructed and uninstructed groups
If the two groups are very different, e.g., in intellectual ability, they may differ for reasons other than instruction
Can be difficult to isolate the two groups

Parent Talk
Suppose a parent of one of your students called you this morning, before school, to
complain about his son’s poor performance on your classroom tests. He concludes
his grousing by asking you, “Just how sure are you the fault is in Tony and not your
tests?”
How would you respond to the parent?

Summary
Judgmental and empirical methods are available to improve classroom assessments
Judgmental procedures can be done by the teacher, colleagues, and/or students

Summary
Five review criteria were recommended to systematically improve classroom
assessment procedures:
Adherence to guidelines
Contribution to score-based inferences
Accuracy of content
Absence of content lacunae
Fairness

Summary
Empirical methods are determined using student responses to items
Item difficulties–calculated for each item using proportion correct
Item discriminations–calculated for each item using proportion of an ability group
correct
Distractor analyses–calculated by examining proportions to incorrect options
(distractors)

Summary
Not appropriate to use traditional difficulty and discrimination indices for criterion-referenced tests
Items are analyzed by calculating the differences between:
Pretest vs. posttest
Instructed group vs. uninstructed group
