
TECHNICAL REPORT

ALGEBRA I, BIOLOGY, AND LITERATURE


2013

Provided by Data Recognition Corporation


TABLE OF CONTENTS
Glossary of Common Terms ....................................................................................................................... i
Preface: An Overview of the Assessments ............................................................................................... vii
The Keystone Exams from 2008 to Present ............................................................................................... vii
Assessment Activities Occurring from 2010 to Present............................................................................ viii
Chapter One: Background of the Keystone Exams ..................................................................................... 1
Assessment History in Pennsylvania ............................................................................................................1
The Keystone Exams ....................................................................................................................................1
Chapter Two: Test Development Overview of the Keystone Exams ........................................................... 5
Keystone Blueprint/Assessment Anchors and Eligible Content ..................................................................5
High-Level Test Design Considerations ........................................................................................................7
Online Testing Design Considerations .........................................................................................................8
Algebra I .......................................................................................................................................................9
Biology....................................................................................................................................................... 11
Literature .................................................................................................................................................. 13
Literature Passages ................................................................................................................................... 14
Chapter Three: Item and Test Development Processes ............................................................................ 17
General Keystone Test Development Processes ...................................................................................... 17
General Test Definition ............................................................................................................................. 18
Algebra I Test Definitions .......................................................................................................................... 18
Biology Test Definitions ............................................................................................................................ 20
Literature Test Definitions ........................................................................................................................ 23
Item Development Considerations ........................................................................................................... 25
Item and Test Development Cycle ............................................................................................................ 27
General Item and Test Development Process .......................................................................................... 30
Chapter Four: Universal Design Procedures Applied to the Keystone Exams Test Development Process . 35
Universal Design........................................................................................................................................ 35
Elements of Universally Designed Assessments ....................................................................................... 35
Guidelines for Universally Designed Items ............................................................................................... 37
Item Development .................................................................................................................................... 38
Item Format .............................................................................................................................................. 39
Assessment Accommodations .................................................................................................................. 40

Chapter Five: Field Test Leading to the Spring 2013 Core ......................................................................... 41
Field Test Overview ................................................................................................................................... 41
Spring 2011 Keystone Exams Embedded Field Test ................................................................................. 41
Statistical Analyses and Results ................................................................................................................ 45
Review of Items with Data ........................................................................................................................ 49
Chapter Six: Operational Forms Construction for 2013 Administrations .................................................. 51
Final Selection of Items and Keystone Forms Construction ..................................................................... 51
Special Forms Used with the Operational 2013 Keystone Exams ............................................................ 52
Chapter Seven: Test Administration Procedures ...................................................................................... 57
Sections, Sessions, Timing, and Layout of the Keystone Exams ............................................................... 57
Sections and Sessions ............................................................................................................................... 57
Timing ....................................................................................................................................................... 58
Layout........................................................................................................................................................ 60
Shipping, Packaging, and Delivery of Materials ........................................................................................ 61
Chapter Eight: Processing and Scoring ..................................................................................................... 63
Receipt of Materials .................................................................................................................................. 63
Scanning of Materials ............................................................................................................................... 64
Materials Storage ...................................................................................................................................... 66
Scoring Multiple-Choice Items .................................................................................................................. 67
Rangefinding ............................................................................................................................................. 67
Scorer Recruitment and Qualifications ..................................................................................................... 68
Leadership Recruitment and Qualifications.............................................................................................. 68
Training ..................................................................................................................................................... 69
Handscoring Process ................................................................................................................................. 70
Handscoring Validity Process .................................................................................................................... 70
Quality Control .......................................................................................................................................... 72
Chapter Nine: Description of Data Sources .............................................................................................. 79
Student Filtering Criteria ........................................................................................................................... 79
Key Verification Data ................................................................................................................................ 80
Calibration of Operational Test Data ........................................................................................................ 80
Final Data .................................................................................................................................................. 80
Spiraling of Forms ..................................................................................................................................... 81

Chapter Ten: Summary Demographic and Accommodation Data for Spring 2013 Keystone Exams .......... 83
Assessed Students..................................................................................................................................... 83
Reasons for Student Non-Assessment ...................................................................................................... 85
Demographic Characteristics of Students Receiving Test Scores ............................................................. 87
Test Accommodations Provided ............................................................................................................... 94
Glossary of Accommodation Terms ........................................................................................................ 112
Chapter Eleven: Classical Item Statistics ................................................................................................ 117
Item-Level Statistics ................................................................................................................................ 117
Item Difficulty ......................................................................................................................................... 117
Item Discrimination................................................................................................................................. 118
Scatter Plots of Item Discrimination and Difficulty ................................................................................. 119
Observations and Interpretations ........................................................................................................... 124
Chapter Twelve: Rasch Item Calibration ................................................................................................ 127
Description of the Rasch Model .............................................................................................................. 127
Checking Rasch Assumptions .................................................................................................................. 128
Rasch Item Statistics ............................................................................................................................... 132
Chapter Thirteen: Standard Setting ....................................................................................................... 145
Standard Setting and Performance Level Descriptors ............................................................................ 145
Development Overview for the Performance Level Descriptors ............................................................ 145
Performance Level Descriptors Meeting 1 ............................................................................................. 146
Performance Level Descriptors Meeting 2 ............................................................................................. 149
Standard Setting ..................................................................................................................................... 152
Chapter Fourteen: Scaling ..................................................................................................................... 167
Raw Scores to Rasch Ability Estimates.................................................................................................... 167
Rasch Ability Estimates to Scaled Scores ................................................................................................ 168
Raw-to-Scaled Score Tables .................................................................................................................... 170
Chapter Fifteen: Equating ..................................................................................................................... 171
Pre- vs. Post-Equating ............................................................................................................................. 171
Equating Design for Keystone Exams ...................................................................................................... 172
Post-Equating Check Analyses ................................................................................................................ 172
Equating for the Embedded Field Test Items.......................................................................................... 178

Chapter Sixteen: Scores and Score Reports ............................................................. 179
Scoring..................................................................................................................... 179
Description of Total-Test Scores ............................................................................................................. 179
Description of Module Scores ................................................................................................................. 181
Appropriate Score Use ............................................................................................................................ 182
Cautions for Score Use ............................................................................................................................ 183
Report Development............................................................................................................................... 184
Reports .................................................................................................................................................... 184
Chapter Seventeen: Operational Test Statistics ..................................................................................... 189
Performance Level Statistics ................................................................................................................... 189
Scaled Scores........................................................................................................................................... 189
Raw Scores .............................................................................................................................................. 190
Chapter Eighteen: Reliability ................................................................................................................. 199
Reliability Indices .................................................................................................................................... 200
Coefficient Alpha ..................................................................................................................................... 200
Further Interpretations ........................................................................................................................... 202
Standard Error of Measurement............................................................................................................. 204
Results and Observations........................................................................................................................ 206
Rasch Conditional Standard Errors of Measurement ............................................................................. 207
Results and Observations........................................................................................................................ 208
Reliability of Performance Level Classification Decisions ....................................................................... 210
Rater Agreement..................................................................................................................................... 212
Chapter Nineteen: Validity .................................................................................................................... 217
Purposes and Intended Uses of the Keystone Exams ............................................................................. 217
Evidence Based on Test Content............................................................................................................. 217
Evidence Based on Response Process..................................................................................................... 219
Evidence Based on Internal Structure..................................................................................................... 220
Evidence Based on Relationships with Other Variables ......................................................................... 227
Evidence Based on Consequences of Tests ............................................................................................ 228
Evidence Related to the Use of Rasch Model ......................................................................................... 229
Validity Evidence Summary..................................................................................................................... 230

Chapter Twenty: Special Study on Item Scrambling ............................................... 231
Item Scrambling ...................................................................................................... 231
Analyses of Item-Scrambling Effects ....................................................................................................... 232
References ............................................................................................................................................ 233

Appendices
Appendix A: Understanding Depth of Knowledge and Cognitive Complexity
Appendix B: General Scoring Guidelines
Appendix C: Item and Test Development Process for the Keystone Exams
Appendix D: Item and Data Review Card Examples
Appendix E: Item Rating Sheet and Criteria Guidelines
Appendix F: Keystone Exams Spring 2013 Tally Sheets
Appendix G: Keystone Exams Spring 2013 Module Layout Plans
Appendix H: Mean Raw Scores by Form
Appendix I: Demographic and Accommodation Tables (Winter and Summer)
Appendix J: Item Statistics
Appendix K: Raw-to-Scaled Score Conversion Tables
Appendix L: Post-Equating Check Analyses Results
Appendix M: Reliabilities
Appendix N: Item Scrambling


GLOSSARY OF COMMON TERMS


The following glossary defines terms used in this technical report. Some of these terms are used universally in the assessment community, and some are used most commonly by psychometric professionals.

Ability: In Rasch measurement, ability is a generic term indicating the level of an individual on the construct measured by an exam. For the Keystone Exams, as an example, a student's literature ability is measured by how the student performed on the Keystone Literature exam. A student who answered more items correctly has a higher ability estimate than a student who answered fewer items correctly.
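
As a point of reference, the dichotomous Rasch model (a standard formulation; calibration details for the Keystone Exams appear in Chapter Twelve) expresses the probability that student $n$ answers item $i$ correctly as a function of the gap between ability and item difficulty:

$$P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}$$

where $\theta_n$ is the student's ability and $b_i$ is the item's difficulty; the probability exceeds one half exactly when $\theta_n > b_i$.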

Adjacent Agreement: Adjacent agreement is a score/rating difference of one (1) point in value, usually assigned by two different raters under the same conditions (e.g., two independent raters give the same paper scores that differ by one point).

Alternate Forms: Alternate forms are two or more versions of a test that are considered exchangeable; for example, they measure the same constructs in the same ways, are intended for the same purposes, and are administered using the same directions. More specific terminology applies depending on the degree of statistical similarity between the test forms (e.g., parallel forms, equivalent forms, and comparable forms), with parallel forms denoting the highest degree of similarity between the forms.

Average: Average is a measure of central tendency in a score distribution that usually refers to the arithmetic mean of a set of scores. In this case, it is determined by adding all the scores in a distribution and then dividing the obtained value by the total number of scores. Sometimes people use the word average to refer to other measures of central tendency, such as the median (the score in the middle of a distribution) or the mode (the score value with the greatest frequency).
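
As a quick illustration with a hypothetical score set, all three measures of central tendency can be computed with Python's standard library:

```python
from statistics import mean, median, mode

scores = [2, 7, 7, 8, 10]  # hypothetical raw scores

print(mean(scores))    # arithmetic mean: 6.8
print(median(scores))  # middle score: 7
print(mode(scores))    # most frequent score: 7
```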

Bias: In a statistical context, bias refers to any source of systematic error in the measurement of a test score. In discussing test fairness, bias may refer to construct-irrelevant components of test scores that differentially affect the performance of different groups of test takers (e.g., gender, ethnicity). Attempts are made to reduce bias by conducting item fairness reviews and various differential item functioning (DIF) analyses, detecting potential areas of concern, and either removing or revising the flagged test items prior to the development of the final operational form of the test (see also differential item functioning).

Constructed-Response Item: A constructed-response (CR) item—referred to by some as an open-ended (OE) response item—is an item format that requires examinees to create their own responses, which can be expressed in various forms (e.g., written paragraph, created table/graph, formulated calculation). Such items are frequently scored using more than two score categories, that is, polytomously (e.g., 0, 1, 2, 3). This format contrasts with formats in which students choose from a supplied set of answer options, for example, multiple-choice (MC) items, which are typically dichotomously scored as right = 1 or wrong = 0. When interpreting item difficulty and discrimination indices, it is important to consider whether an item is polytomously or dichotomously scored.

Content Validity Evidence: Content validity evidence shows the extent to which an exam provides an appropriate sampling of a content domain of interest (e.g., assessable portions of the Algebra I curriculum in terms of the knowledge, skills, objectives, and processes sampled).


Criterion-Referenced Interpretation: A criterion-referenced score is interpreted as a measure of a student's performance against an expected level of mastery, educational objective, or standard. The resulting score interpretations provide information about what a student knows or can do in a given content area.

Cut Score: A cut score marks a specified point on a score scale where scores at or above that point are interpreted or acted upon differently from scores below that point (e.g., a score designated as the minimum level of performance needed to pass a competency test). A test can be divided into multiple proficiency levels by setting one or more cut scores. Methods for establishing cut scores vary. For the Keystone Exams, three cut scores are used to place students into one of four performance levels (see also standard setting).
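
The sketch below shows how cut scores partition a score scale; the cut values here are hypothetical placeholders (the operational Keystone cut scores are established through standard setting, described in Chapter Thirteen):

```python
from bisect import bisect_right

CUTS = [1400, 1500, 1600]  # hypothetical cut scores, not operational values
LEVELS = ["Below Basic", "Basic", "Proficient", "Advanced"]

def performance_level(scaled_score: int) -> str:
    # A score at or above a cut falls into the higher level, so the
    # level index is the number of cuts the score meets or exceeds.
    return LEVELS[bisect_right(CUTS, scaled_score)]

print(performance_level(1499))  # Basic
print(performance_level(1500))  # Proficient
```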

Decision Consistency: Decision consistency is the extent to which classifications based on test scores would match the decisions on students' proficiency levels based on scores from a second, parallel form of the same test. It is often expressed as the proportion of examinees who are classified the same way from the two test administrations.
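
A minimal sketch of that proportion, using hypothetical classifications (in operational practice a second parallel administration rarely exists, so decision consistency is usually estimated from a single administration with model-based methods; see Chapter Eighteen):

```python
# Performance-level classifications of the same five students
# on two hypothetical parallel forms.
form_a = ["Basic", "Proficient", "Advanced", "Proficient", "Basic"]
form_b = ["Basic", "Proficient", "Proficient", "Proficient", "Basic"]

consistent = sum(a == b for a, b in zip(form_a, form_b))
print(consistent / len(form_a))  # 0.8
```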

Differential Item Functioning: Differential item functioning is a statistical property of a test item in which different groups of test takers (who have the same total test score) have different average item scores. In other words, students with the same ability level but different group memberships do not have the same probability of answering the item correctly (see also bias).
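
Matching on total score is the essential idea. The sketch below is a deliberately simplified screen that compares an item's proportion correct for a reference group and a focal group within each total-score stratum; operational DIF analyses use formal statistics (e.g., Mantel-Haenszel) rather than this illustration:

```python
from collections import defaultdict

def dif_gaps(records):
    """records: iterable of (total_score, group, item_score) tuples,
    with group in {"ref", "focal"} and item_score in {0, 1}."""
    strata = defaultdict(lambda: {"ref": [], "focal": []})
    for total, group, item_score in records:
        strata[total][group].append(item_score)

    gaps = []
    for total in sorted(strata):
        ref, focal = strata[total]["ref"], strata[total]["focal"]
        if ref and focal:  # compare only strata where both groups appear
            gaps.append((total, sum(ref) / len(ref) - sum(focal) / len(focal)))
    # Large gaps with a consistent sign across strata suggest DIF.
    return gaps
```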

Distractor: An incorrect option in a multiple-choice item (also called a foil).

Equating: The strongest of several linking methods used to establish comparability between scores from multiple tests. Equated test scores should be considered exchangeable. Consequently, the criteria needed to refer to a linkage as equating are strong and somewhat complex (equal construct and precision, equity, and invariance). In practical terms, it is often stated that it should be a 'matter of indifference' to a student whether he or she takes any of the equated tests (see also linking).

Exact Agreement: Exact agreement indicates that identical scores/ratings are assigned by two different raters under the same conditions (e.g., two independent raters give a paper the same score).
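
Both exact and adjacent agreement (defined earlier in this glossary) reduce to simple proportions over pairs of independent ratings, as in this hypothetical example:

```python
# Hypothetical independent ratings of the same six responses.
rater_1 = [2, 3, 1, 0, 3, 2]
rater_2 = [2, 2, 1, 1, 3, 0]

pairs = list(zip(rater_1, rater_2))
exact = sum(a == b for a, b in pairs) / len(pairs)              # 3 of 6 pairs
adjacent = sum(abs(a - b) == 1 for a, b in pairs) / len(pairs)  # 2 of 6 pairs
print(exact, adjacent)  # 0.5 0.3333...
```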

Field Test (FT) Items: The Keystone Exams use multiple test forms for each content-area test. Each form is composed of operational (OP) items and field test (FT) items. An FT item is a newly developed item that is ready to be tried out to determine its statistical properties (e.g., see p-value and point-biserial correlation). Each test form includes a set of FT items, but FT items do not contribute to any student's score.

Frequency: Frequency is the number of times that a certain value or range of values (score interval) occurs in a distribution of scores.

Frequency Distribution: A frequency distribution is a tabulation of scores from low to high or high to low with the number and/or percent of individuals who obtain each score or who fall within each score interval.
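
For illustration, a frequency distribution for a small hypothetical score set can be tabulated directly:

```python
from collections import Counter

scores = [4, 7, 7, 9, 7, 4, 10, 9, 7]  # hypothetical raw scores
distribution = Counter(scores)

for score in sorted(distribution):
    pct = 100 * distribution[score] / len(scores)
    print(f"score {score:>2}: n = {distribution[score]} ({pct:.1f}%)")
```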

Infit/Outfit: Infit and outfit are statistical indicators of the agreement between the data and the measurement model. Infit and outfit are highly correlated, and both are highly correlated with the point-biserial correlation. Underfit can occur when low-ability students correctly answer difficult items (perhaps by guessing or atypical experience) or high-ability students incorrectly answer easy items (perhaps because of carelessness or gaps in instruction). Any model expects some level of variability, so overfit can occur when nearly all low-ability students miss an item while nearly all high-ability students get the item correct.
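
For reference, the usual mean-square forms of these statistics (a standard formulation, not specific to this report's software) are built from standardized residuals $z_{ni} = (x_{ni} - E_{ni})/\sqrt{W_{ni}}$, where $x_{ni}$ is the observed score of student $n$ on item $i$, $E_{ni}$ is its model expectation, and $W_{ni}$ is its model variance:

$$\text{Outfit}_i = \frac{1}{N}\sum_{n=1}^{N} z_{ni}^2 \qquad \text{Infit}_i = \frac{\sum_{n=1}^{N} (x_{ni} - E_{ni})^2}{\sum_{n=1}^{N} W_{ni}}$$

Values near 1 reflect the variability the model expects; values well above 1 signal underfit, and values well below 1 signal overfit.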
