
The Relationship between Online Formative Assessment Scores and State Test Scores: Measure Development and Multilevel Growth Modeling

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By
Aryn C. Karpinski, M.S., M.A.

Graduate Program in Educational Policy and Leadership

The Ohio State University 2010

Dissertation Committee:
Jerome V. D'Agostino, Ph.D., Advisor
Richard G. Lomax, Ph.D.
Dorinda J. Gallant, Ph.D.

Copyright by Aryn C. Karpinski 2010

ABSTRACT

The formative assessment literature has unanimously heralded the benefits of the diagnostic use of assessment to inform curriculum and instruction and improve student performance and achievement. Research in this area has primarily focused on traditional formative assessment practices. More recently, research has begun to examine the effectiveness of technology-based formative assessment, with the latest studies of this mode of formative assessment beginning to replicate the traditional findings. The current study examined one computerized/online formative assessment program, the Diagnostic Online Reading Assessment (DORA), and its relationship to a summative state proficiency test, in addition to examining the multilevel influence of teacher use of this technology-based mode of formative assessment on student DORA growth. Existing state test data from one school district in Colorado and existing DORA scores were obtained. In addition, teacher survey data from the same Colorado school district and from across the United States were collected online. Student data from grades 3 through 11 and teacher survey data were analyzed via Hierarchical Linear Growth Modeling and Rasch Analysis to investigate the following three main objectives: (1) Examining if DORA growth is related to state test score growth, (2) Developing a behavioral frequency measure of teacher use of computerized/online formative assessment, the Online Formative Assessment Survey (OFAS), and (3) Investigating the relationship between the OFAS and student DORA growth.

Specific to the first objective, it was found that DORA subtest growth was significantly and positively related to state reading test score growth. For the second objective, it was found that a psychometrically sound measure of teacher computerized/online formative assessment practices can be developed. The results rendered a 50-question measure focusing on all elements of teacher use of computerized/online formative assessment, and a 10-question measure concentrating solely on how teachers use the results from the computerized/online formative assessment program. For the third objective, it was found that neither the 50-question nor the 10-question OFAS was a significant, positive predictor of student DORA score growth. Although the validation of OFAS scores was not supported, future research should continue to define this theoretical network of relationships and maintain the measure revision and validation process. The Rasch results provide psychometric support for this newly developed measure, particularly the version focused on using the online formative assessment results. Internet-mediated assessment is becoming commonplace in the classroom and is more frequently being used to replace traditional modes of assessment, so there is a pressing need to examine the extent to which these methods are educationally sound. Results from this study can support administrative demands for more efficient, technology-based ways to encourage teachers to use this mode of formative assessment, and in turn, meet state standards and increase student achievement.


DEDICATION

This document is dedicated to my family, friends, and all my valued previous and current educators, teachers, and advisors.

Ora na azu nwa. African Proverb (It takes a village to raise a child.)


ACKNOWLEDGEMENTS

I would like to thank everyone who has helped and supported me throughout my academic career. First, I would like to thank Dr. Jerome V. D'Agostino for his ideas, immense wealth of knowledge, and guidance that I needed to finish my dissertation and the program with great success. Without his support, none of this would be possible. I would like to thank my dissertation committee members, Dr. Richard G. Lomax and Dr. Dorinda J. Gallant, for their advice and encouragement. Thank you also to Dr. Ann A. O'Connell and Dr. William E. Loadman. I wish to acknowledge and thank the administrators and staff in the School of Educational Policy and Leadership in the College of Education and Human Ecology for their support, specifically Barbara Heinlein and Deborah Zabloudil. Many thanks to Sue Ann Highland from the Colorado Department of Education and Anne-Evan K. Williams from Let's Go Learn, Inc. Finally, I would like to thank all of my colleagues and friends here at The Ohio State University for their support. I would also like to thank all my valued previous and current educators, advisors, and teachers. Most of all, thanks to my family and friends for being my support system and cheering me on every step of the way.

VITA

March 1982 .......................... Born: Lorain, Ohio
2000 ..................................... B.A. Psychology, Honors, Miami University
2004-2006 ............................ Graduate Instructor, The Department of Psychology, West Virginia University
2006 ..................................... Graduate Research Assistant, The Department of Psychology, West Virginia University
2006 ..................................... M.S. Life-Span Developmental Psychology, West Virginia University
2007 ..................................... Psychometrist/Researcher, Allegheny General Hospital, Pittsburgh, Pennsylvania
2007-2008 ............................ Graduate Research Associate, The School of Educational Policy and Leadership, The Ohio State University
2009 ..................................... M.A. Educational Policy and Leadership, The Ohio State University


2008-2010 ............................ Graduate Teaching Associate, The School of Educational Policy and Leadership, The Ohio State University

Publications

Rana, S.S., Schramke, C.J., Sangha, A., & Karpinski, A.C. (2009). Comparison of psychosocial factors between patients with benign fasciculations and those with amyotrophic lateral sclerosis. Annals of Indian Academy of Neurology, 12(2), 108-110.
Karpinski, A.C., & Scullin, M.H. (2009). Suggestibility under pressure: Theory of mind and executive function, and suggestibility in preschoolers. Journal of Applied Developmental Psychology, 30, 749-763.
Karpinski, A.C. (2009). Media sensationalization of social science research: Social networking insites. Teachers College Record. Retrieved from http://www.tcrecord.org/Content.asp?ContentID=15642
Karpinski, A.C. (2009). A response to reconciling a media sensation with data. First Monday, 14(5). Retrieved from http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2503/2183
Karpinski, A.C., Scullin, M.H., & Montgomery-Downs, H.E. (2008). Risk for sleep-disordered breathing and executive function in preschoolers. Sleep Medicine, 9, 418-424.

Fields of Study

Major Field: Educational Policy and Leadership
Quantitative Research, Evaluation, and Measurement


TABLE OF CONTENTS

ABSTRACT .......... ii
DEDICATION .......... iv
ACKNOWLEDGEMENTS .......... v
VITA .......... vi
TABLE OF CONTENTS .......... viii
LIST OF TABLES .......... xiii
LIST OF FIGURES .......... xx
CHAPTER 1: INTRODUCTION .......... 1
Formative Assessment .......... 2
Technology and Formative Assessment .......... 5
Validity .......... 6
Computerized/Online Formative Assessment - DORA .......... 8
Research Questions and Hypotheses .......... 8
Contributions .......... 11
Chapter Summary .......... 13
CHAPTER 2: LITERATURE REVIEW .......... 14
Formative Assessment History .......... 15
Theoretical Frameworks .......... 19
The Assessment Cycle .......... 28
Feedback .......... 29
Formative Assessment - Some Evidence .......... 30
Computerized/Online Formative Assessment .......... 34
Computerized/Online Formative Assessment in College Courses .......... 37
CHAPTER 3: METHODOLOGY .......... 46
Objectives and Rationale .......... 46
Context .......... 48
Participants .......... 53
Measures .......... 56
Procedure .......... 61
Data .......... 70
Analyses .......... 75
CHAPTER 4: RESULTS .......... 87
Research Question 1 .......... 87
Original District Sample .......... 106
Cases Selected for Potential Removal .......... 127
Low Frequency CSAP or DORA Scores .......... 127
ESL/ELL Students .......... 130
IEP Students .......... 140
Total Cases Removed .......... 151
Final Analysis Sample .......... 155
Final Analysis Sample - CSAP Scores .......... 157
Final Analysis Sample - DORA Scores .......... 164
DORA Subtests Used .......... 179
Sample Size .......... 180
Hierarchical Linear Growth Modeling Assumptions .......... 182
Two-Level Time-Varying Covariate Hierarchical Linear Growth Model Results .......... 191
Research Question 2 .......... 221
Rasch Analysis .......... 227
Research Question 3 .......... 277
Teacher Sample .......... 277
Student Sample .......... 284
Original District Sample .......... 286
Cases Selected for Potential Removal .......... 296
Low Frequency DORA Scores .......... 296
504 Plan .......... 300
ESL/ELL Students .......... 301
IEP Students .......... 305
Total Cases Removed .......... 311
Final Analysis Sample .......... 313
DORA Subtests Used .......... 320
Teacher OFAS Scores .......... 321
Hierarchical Linear Growth Modeling Assumptions .......... 323
Three-Level Hierarchical Linear Growth Model Results .......... 337
Word Recognition .......... 341
Oral Vocabulary .......... 357
Spelling .......... 371
Reading Comprehension .......... 383
CHAPTER 5: DISCUSSION .......... 396
Purpose .......... 396
Objectives .......... 397
Discussion of Research Question 1 Results .......... 399
Discussion of Research Question 2 Results .......... 418
Discussion of Research Question 3 Results .......... 443
Implications of All Research Questions - The Validation Argument .......... 465
Conclusion .......... 473
REFERENCES .......... 477
STATUTES .......... 491
APPENDICES .......... 492
Appendix A: Online Formative Assessment Survey - Final Version .......... 492
Appendix B: Highland School District Permission Letter .......... 495
Appendix C: Let's Go Learn, Inc. Permission Letter .......... 496
Appendix D: The Ohio State University IRB Exempt Status .......... 497
Appendix E: Informal Interview Questions to Develop the OFAS .......... 498
Appendix F: Reviewer Feedback on the OFAS .......... 499
Appendix G: Invitation to Participate in the Study .......... 505


LIST OF TABLES

Table 1. Self-Regulated Learning Theory and Model ...................................................... 26 Table 2. Student District Demographic Information for Highland Elementary School from the National Council for Education Statistics (NCES) for 2008/2009 .....................91 Table 3. Student District Demographic Information for Highland Middle School from the National Council for Education Statistics (NCES) for 2008/2009 ....................................92 Table 4. Student District Demographic Information for Highland High School from the National Council for Education Statistics (NCES) for 2008/2009 ....................................93 Table 5. Student Demographic Information for the Highland School District from the Colorado Department of Education (CDE) from 2004/2005 to 2007/2008 ......................95 Table 6. Student Demographic Information in the Original Sample from the Highland School District for Grades 3 through 11 by Cohort .........................................................109 Table 7. Descriptive Statistics for the Original District Sample CSAP Reading Scores for the Highland School District by Cohort...........................................................................112 Table 8. DORA Administration Schedule for the 2006/2007 through 2009/2010 Academic Years by Cohort ..............................................................................................116 Table 9. Descriptive Statistics for the Original District Sample DORA Subtest Scores for the Highland School District for the 2006/2007 to 2009/2010 Academic Years ............119 xiii

Table 10. Demographic Information of the Cases Removed Due to Missing CSAP or DORA Scores from the Highland School District for Grades 3 through 11 ...................129 Table 11. Demographic Information of the ESL/ELL Students from the Highland School District for Grades 3 through 11 ......................................................................................132 Table 12. Independent Samples t Tests Comparing ESL/ELL and Non-ESL/ELL Students from the Original District Sample on the CSAP Reading State Test ..............................135 Table 13. Independent Samples t Tests Comparing ESL/ELL and Non-ESL/ELL Students from the Original District Sample on DORA Scores.......................................................138 Table 14. Demographic Information of the IEP Students from the Highland School District for Grades 3 through 11 ......................................................................................144 Table 15. Independent Samples t Tests Comparing IEP and Non-IEP Students from the Original District Sample on the CSAP Reading State Test .............................................146 Table 16. Independent Samples t Tests Comparing IEP and Non-IEP Students from the Original District Sample on DORA Scores .....................................................................149 Table 17. Demographic Information of the Total Cases Removed and the Final Analysis Sample from the Highland School District ......................................................................153 Table 18. Descriptive Statistics for the Final Analysis Sample for the CSAP Reading State Test Scores for the Highland School District by Cohort ........................................159 Table 19. Independent Samples t Tests Comparing the Final Analysis Sample and Total Cases Removed on the CSAP Reading State Test ...........................................................163 Table 20. Descriptive Statistics for the Final Analysis Sample DORA Subtest Scores for the Highland School District for the 2006/2007 to 2009/2010 Academic Years ............167 xiv

Table 21. Independent Samples t Tests Comparing the Final Analysis Sample and the Total Cases Removed on DORA Scores .........................................................................177 Table 22. One-Way Random Effects ANOVA Model with the CSAP Reading Test .....195 Table 23. Unconditional Growth Model with the CSAP Reading State Test..................197 Table 24. Conditional Growth Model with the CSAP Reading State Test as the Outcome and the DORA Word Recognition (WR) Subtest as the Time-Varying Covariate .........200 Table 25. Full Model with the CSAP Reading State Test as the Outcome and the DORA Word Recognition (WR) Subtest as the Time-Varying Covariate ..................................203 Table 26. Conditional Growth Model with the CSAP Reading State Test as the Outcome and the DORA Oral Vocabulary (OV) Subtest as the Time-Varying Covariate .............206 Table 27. Full Model with the CSAP Reading State Test as the Outcome and the DORA Oral Vocabulary (OV) Subtest as the Time-Varying Covariate ......................................209 Table 28. Conditional Growth Model with the CSAP Reading Test as the Outcome and the DORA Spelling (SP) Subtest as the Time-Varying Covariate ..................................212 Table 29. Full Model with the CSAP Reading State Test as the Outcome and the DORA Spelling (SP) Subtest as the Time-Varying Covariate .....................................................215 Table 30. Conditional Growth Model with the CSAP Reading Test as the Outcome and the DORA Reading Comprehension (RC) Subtest as the Time-Varying Covariate .......218 Table 31. Full Model with the CSAP Reading Test as the Outcome and the DORA Reading Comprehension (RC) Subtest as the Time-Varying Covariate .........................221 Table 32. Descriptive Information from Online Formative Assessment Survey (OFAS) Teacher Participants by Gender .......................................................................................225 xv

Table 33. Summary of 47 Measured Persons (56 Measured Items) ................................230 Table 34. Summary of 56 Measured Items ......................................................................230 Table 35. Summary of Category Structure (56 Measured Items) ....................................232 Table 36. Item Statistics: Misfit Order (56 Measured Items) ..........................................234 Table 37. Item Category/Option Frequencies: Misfit Order (56 Measured Items) .........237 Table 38. Summary of 47 Measured Persons (53 Measured Items) ................................241 Table 39. Summary of 53 Measured Items ......................................................................241 Table 40. Item Statistics: Misfit Order (53 Measured Items) ..........................................242 Table 41. Item Category/Option Frequencies: Misfit Order (53 Measured Items) .........244 Table 42. Summary of 47 Measured Persons (50 Measured Items) ................................245 Table 43. Summary of 50 Measured Items ......................................................................245 Table 44. Summary of Category Structure (50 Measured Items) ....................................247 Table 45. Item Statistics: Misfit Order (50 Measured Items) ..........................................249 Table 46. Item Category/Option Frequencies: Misfit Order (50 Measured Items) .........250 Table 47. Summary of 46 (Non-Extreme) Measured Persons (47 Measured Items).......254 Table 48. Summary of 47 (All) Measured Persons (47 Measured Items) .......................254 Table 49. Summary of 47 (Non-Extreme) Measured Items ............................................255 Table 50. Item Statistics: Misfit Order (47 Measured Items) ..........................................256 Table 51. Summary of 46 (Non-Extreme) Measured Persons (11 Measured Items).......259 Table 52. Summary of 47 (All) Measured Persons (11 Measured Items) .......................260 Table 53. Summary of 11 (Non-Extreme) Measured Items ............................................260 Table 54. Summary of Category Structure (11 Measured Items) ....................................261 xvi

Table 55. Item Statistics: Misfit Order (11 Measured Items) ..........................................265 Table 56. Item Category/Option Frequencies: Misfit Order (11 Measured Items) .........265 Table 57. Summary of 46 Measured Persons (10 Measured Items) ................................270 Table 58. Summary of 10 Measured Items .....................................................................270 Table 59. Summary of Category Structure (10 Measured Items) ....................................271 Table 60. Item Statistics: Misfit Order (10 Measured Items) ..........................................274 Table 61. Item Category/Option Frequencies: Misfit Order (10 Measured Items) .........274 Table 62. Demographics for Reading Teachers in the Highland School District ............280 Table 63. Demographics for the Final Analysis Sample of Reading Teachers in the Highland School District..................................................................................................283 Table 64. Student District Demographic Information for the Highland Elementary School from the National Council for Education Statistics (NCES) for 2008/2009 ...................285 Table 65. Student District Demographic Information for the Highland Middle School from the National Council for Education Statistics (NCES) for 2008/2009 ...................286 Table 66. Student Demographic Information for the Original District Sample from the Highland School District by Grade Level ........................................................................289 Table 67. Descriptive Information for DORA Scores from the Highland School District by Grade Level .................................................................................................................294 Table 68. Demographic Information of Cases Removed Due to Missing DORA Scores from the Highland School District .................................................................................. 298 Table 69. Independent Samples t Tests Comparing ESL/ELL and Non-ESL/ELL Students on All DORA Subtests from the Highland School District .............................................303 xvii

Table 70. Demographic Information of the IEP Students from the Highland School District for Grades 3 through 8 ........................................................................................307 Table 71. Independent Samples t Tests Comparing IEP and Non-IEP Students on All DORA Subtests from the Highland School District ........................................................309 Table 72. Demographic Information of the Total Cases Removed from the Highland School District for Grades 3 through 8 ............................................................................312 Table 73. Demographic Information for the Final Analysis Sample from the Highland School District for Grades 3 through 8 ............................................................................314 Table 74. Independent Samples t Tests Comparing the Final Analysis Sample and the Total Cases Removed on All DORA Subtests from the Highland School District .........318 Table 75. One-Way Random Effects ANOVA Model with the DORA Word Recognition (WR) Subtest ....................................................................................................................343 Table 76. Unconditional Growth Model with DORA Word Recognition (WR) .............345 Table 77. Conditional Growth Model with DORA Word Recognition (WR) .................348 Table 78. Full Model with the DORA Word Recognition (WR) Subtest and the 50Question OFAS ................................................................................................................352 Table 79. Full Model with the DORA Word Recognition (WR) Subtest and the 10Question OFAS ................................................................................................................356 Table 80. One-Way Random Effects ANOVA Model with the DORA Oral Vocabulary (OV) Subtest ....................................................................................................................358 Table 81. Unconditional Growth Model with DORA Oral Vocabulary (OV) ................360 Table 82. Conditional Growth Model with DORA Oral Vocabulary (OV) ....................364 xviii

Table 83. Full Model with the DORA Oral Vocabulary (OV) Subtest and the 50-Question OFAS ...............................................................................................................................367 Table 84. Full Model with the DORA Oral Vocabulary (OV) Subtest and the 10-Question OFAS ...............................................................................................................................370 Table 85. One-Way Random Effects ANOVA Model with DORA Spelling (SP) .........372 Table 86. Unconditional Growth Model with DORA Spelling (SP) ...............................374 Table 87. Conditional Growth Model with DORA Spelling (SP) ...................................376 Table 88. Full Model with DORA Spelling (SP) and the 50-Question OFAS ................379 Table 89. Full Model with DORA Spelling (SP) and the 10-Question OFAS ................382 Table 90. One-Way Random Effects ANOVA Model with the DORA Reading Comprehension (RC) Subtest ..........................................................................................384 Table 91. Unconditional Growth Model with DORA Reading Comprehension (RC) ....386 Table 92. Conditional Growth Model with DORA Reading Comprehension (RC) ........389 Table 93. Full Model with the DORA Reading Comprehension (RC) Subtest and the 50Question OFAS ................................................................................................................392 Table 94. Full Model with the DORA Reading Comprehension (RC) Subtest and the 10Question OFAS ................................................................................................................395 Table 95. Comparing the 50-Question OFAS and 10-Question OFAS for all DORA Outcomes for Research Question 3..................................................................................456


LIST OF FIGURES

Figure 1. Colorado Department of Education (CDE) scores for grades 3 through 10 for the Colorado Student Assessment Program (CSAP) reading test......................................74 Figure 2. The state of Colorado median growth percentile by year and content area for all students for 2007 through 2009 .........................................................................................99 Figure 3. The state of Colorado median growth percentile for English Language Learner (ELL) students for 2007 through 2009 ..............................................................................99 Figure 4. The state of Colorado median growth percentile for native English speaking students for 2007 through 2009 .......................................................................................100 Figure 5. The state of Colorado median growth percentile for free/reduced lunch status students for 2007 through 2009 .......................................................................................100 Figure 6. The state of Colorado median growth percentile for non-free/reduced lunch status students for 2007 through 2009 .............................................................................101 Figure 7. The state of Colorado median growth percentile for minority students for 2007 through 2009 ....................................................................................................................101 Figure 8. The state of Colorado median growth percentile for non-minority students for 2007 through 2009 ...........................................................................................................102


Figure 9. The state of Colorado median growth percentile for Individualized Education Program (IEP) students for 2007 through 2009 ...............................................................102 Figure 10. The state of Colorado median growth percentile for non-Individualized Education Program (non-IEP) students for 2007 through 2009 ......................................103 Figure 11. The state of Colorado median growth percentile for female students for 2007 through 2009 ....................................................................................................................103 Figure 12. The state of Colorado median growth percentile for male students for 2007 through 2009 ....................................................................................................................104 Figure 13. A summary of the median growth percentile reading results for three consecutive years for the Highland School District and the state of Colorado ...............106 Figure 14. A positive linear trend demonstrated in the full original district sample .......113 Figure 15. Positive linear trend demonstrated by Cohorts 1 and 4 from the full original district sample ..................................................................................................................114 Figure 16. Plot of Time (X-axis) and the means at each test administration for the HighFrequency Words DORA subtest (Y-axis) from the full original district sample ...........121 Figure 17. Plot of Time (X-axis) and the means at each test administration for the Word Recognition DORA subtest (Y-axis) from the full original district sample ....................122 Figure 18. Plot of Time (X-axis) and the means at each test administration for the Phonics DORA subtest (Y-axis) from the full original district sample .........................................123 Figure 19. Plot of Time (X-axis) and the means at each test administration for the Phonemic Awareness DORA subtest (Y-axis) from the full original district sample .....124


Figure 20. Plot of Time (X-axis) and the means at each test administration for the Oral Vocabulary DORA subtest (Y-axis) from the full original district sample .....................125 Figure 21. Plot of Time (X-axis) and the means at each test administration for the Spelling DORA subtest (Y-axis) from the full original district sample ..........................126 Figure 22. Plot of Time (X-axis) and the means at each test administration for the Reading Comprehension DORA subtest (Y-axis) from the original district sample.......127 Figure 23. Plot of Time (X-axis) and the mean total scaled score for the CSAP reading state test (Y-axis) comparing ESL/ELL students to non-ESL/ELL students ..................136 Figure 24. Plot of Time (X-axis) and the mean DORA composite score (Y-axis) comparing ESL/ELL students to non-ESL/ELL students................................................140 Figure 25. Plot of Time (X-axis) and the mean total scaled score for the CSAP reading state test (Y-axis) comparing IEP students to non-IEP students......................................147 Figure 26. Plot of Time (X-axis) and the mean DORA composite score (Y-axis) comparing IEP students to non-IEP students ...................................................................151 Figure 27. A positive linear trend demonstrated in the final analysis sample .................160 Figure 28. Positive linear trend demonstrated by Cohorts 1 through 4 from the final analysis sample ................................................................................................................161 Figure 29. Plot of Time (X-axis) and the mean total scaled score for the CSAP reading test (Y-axis) comparing students not in the analysis sample to students in the analysis sample ..............................................................................................................................164 Figure 30. Plot of Time (X-axis) and the means for the High-Frequency Words DORA subtest (Y-axis) from the final analysis sample ...............................................................169 xxii

Figure 31. Plot of Time (X-axis) and the means for the Word Recognition DORA subtest (Y-axis) from the final analysis sample ...........................................................................170 Figure 32. Plot of Time (X-axis) and the means for the Phonics DORA subtest (Y-axis) from the final analysis sample .........................................................................................171 Figure 33. Plot of Time (X-axis) and the means for the Phonemic Awareness DORA subtest (Y-axis) from the final analysis sample ...............................................................172 Figure 34. Plot of Time (X-axis) and the means for the Oral Vocabulary DORA subtest (Y-axis) from the final analysis sample ...........................................................................173 Figure 35. Plot of Time (X-axis) and the means for the Spelling DORA subtest (Y-axis) from the final analysis sample .........................................................................................174 Figure 36. Plot of Time (X-axis) and the means for the Reading Comprehension DORA subtest (Y-axis) from the final analysis sample ...............................................................175 Figure 37. Plot of Time (X-axis) and the mean DORA composite score (Y-axis) comparing students not in the analysis sample to students in the analysis sample..........179 Figure 38. Scatterplot to check the linearity assumption at Level 1 of the HLM Growth Model in Research Question 1 .........................................................................................185 Figure 39. Histogram of the Level 1 residuals to examine the normality assumption in the model with Word Recognition as the time-varying covariate .........................................187 Figure 40. Histogram of the Level 2 residuals to examine the normality assumption in the model with Word Recognition as the time-varying covariate (Intercept) .......................188 Figure 41. Residuals plotted against the Sex covariate to examine Level 2 homogeneity of variance in the model with Word Recognition (Intercept) ..............................................190 xxiii

Figure 42. Category probabilities indicating the probability of a response for the 56-item survey ...............................................................................................................................233 Figure 43. The map of persons and items for the 56-item survey ...................................239 Figure 44. Category probabilities indicating the probability of a response for the 50-item survey ...............................................................................................................................248 Figure 45. The map of persons and items for the 50-item survey ...................................252 Figure 46. Category probabilities indicating the probability of a response for the 11-item survey ...............................................................................................................................262 Figure 47. The map of persons and items for the 11-item survey ...................................267 Figure 48. Category probabilities indicating the probability of a response for the 10-item survey ...............................................................................................................................272 Figure 49. The map of persons and items for the 10-item survey ...................................276 Figure 50. Scatterplot to check the linearity assumption at Level 1 of the HLM Growth Model in Research Question 3 .........................................................................................324 Figure 51. Scatterplot to check the linearity assumption at Level 3 of the HLM Growth Model in Research Question 3 (Teacher 2) .....................................................................325 Figure 52. Scatterplot to check the linearity assumption at Level 3 of the HLM Growth Model in Research Question 3 (Teacher 19) ...................................................................326 Figure 53. Histogram of the Level 1 residuals to examine the normality assumption in the model with Word Recognition as the outcome ................................................................327 Figure 54. Histogram of the Level 2 residuals to examine the normality assumption in the model with Word Recognition as the outcome (Intercept) ..............................................329 xxiv

Figure 55. Histogram of the Level 2 residuals to examine the normality assumption in the model with Word Recognition as the outcome (Slope) ...................................................330 Figure 56. Residuals plotted against the Sex covariate to examine Level 2 homogeneity of variance in the model with Word Recognition as the outcome (Intercept) .....................332 Figure 57. Residuals plotted against the Sex covariate to examine Level 2 homogeneity of variance in the model with Word Recognition as the outcome (Slope) ..........................333 Figure 58. Residuals plotted against teacher OFAS score to examine Level 3 homogeneity of variance in the model with Word Recognition as the outcome .............336


CHAPTER 1: INTRODUCTION

The purpose of this study was to examine the relationship between computerized/online formative assessments and summative, yearly state proficiency test scores. Specifically, the relationship between a computerized/online formative assessment program in reading, known as the Diagnostic Online Reading Assessment (DORA), and Colorado state student test scores in reading was examined across elementary, middle, and high school in one school district beginning in the 2004/2005 academic year and ending in 2009/2010. Multilevel growth modeling was used to explore the relationship between the aforementioned variables. Additionally, a measure of teacher use of computerized/online formative assessment was developed, with the aim of beginning to validate the scores on this instrument. This preliminary investigation focused on the following three main goals: (1) Examining if computerized/online formative assessment growth is related to state test score growth, (2) Developing a behavioral frequency measure of teacher use of computerized/online formative assessment programs, and (3) In a preliminary concurrent validation effort, investigating the relationship between the aforementioned measure of teacher computerized/online formative assessment use and student computerized/online formative assessment scores.

Formative Assessment
Formative assessment can be briefly defined as the use of diagnostic formal and informal assessments to provide feedback to teachers and students over the course of instruction for the purpose of improving performance and achievement (Boston, 2002). The existing formative assessment literature has long touted the benefits of the diagnostic use of assessment to inform curriculum and instruction. Previous research in this area has primarily focused on traditional formative assessment practices (i.e., paper-and-pencil quizzes, oral and written feedback to students). More recently, with the technology movement in schools, the current literature is beginning to examine the effectiveness of different modes of formative assessment, namely computerized or Internet-based, automated formative assessment programs. The overall consensus from the traditional body of literature is that formative assessment is an essential component of classroom procedure, and that its proper use can raise standards and achievement (Black & Wiliam, 1998a), with the latest studies of technology-based formative assessment beginning to echo these findings. The current study will add to this burgeoning literature base by thoroughly examining one computerized/online formative assessment program and its relationship to a summative state proficiency test in the same content area, as demonstrated in previous standard formative assessment research.
The formative assessment literature base is expansive and strong, with many significant contributions by Paul Black and Dylan Wiliam. Their research and meta-analytic reviews have demonstrated repeatedly that formative assessment has the ability to raise standards and student achievement.

The authors concluded that classroom assessment (i.e., formative assessment), if properly implemented, could not only improve how well students were learning what was being taught in class, but could also meaningfully boost students' scores on external achievement exams (Black & Wiliam, 1998a). Overall, the evidence shows that high-quality formative assessment has a powerful impact on student learning, with effect sizes on standardized tests ranging between .4 and .7 in formative assessment intervention studies, which is larger than the effects reported for most other educational interventions, as illustrated below. Although formative assessment has been found to be extremely effective in increasing student summative test performance, research also shows that high-quality formative assessment is rare in classrooms, with most teachers having little competency in how to implement such assessment practices. The overall consensus is that teachers' preparation in assessment and in conducting assessment-related activities in the classroom is inadequate (Hills, 1991; Plake, 1993). Black and Wiliam (1998a) also found that most classroom testing encourages rote and superficial learning, with actual teacher assessment practices overemphasizing grading. Useful advice that can inform teachers about their students is underemphasized, with teachers tending to replicate summative standardized testing in their own assessment practices. In general, research has highlighted how to effectively implement formative assessment to raise student performance and meet state standards and objectives, but this is rarely executed, with modern policy expectations partially to blame.
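To put effect sizes of .4 to .7 in concrete terms, note that a standardized effect size here is the difference between treatment and comparison group means expressed in pooled standard deviation units, $d = (\bar{X}_{T} - \bar{X}_{C}) / SD_{pooled}$. Assuming roughly normal score distributions, an assumption made purely for this illustration, $\Phi(0.4) \approx .66$ and $\Phi(0.7) \approx .76$, so effects of this magnitude would place the average student receiving high-quality formative assessment at roughly the 66th to 76th percentile of an otherwise comparable group receiving typical instruction.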

The demand for school systems, individual schools, and teachers to be accountable for student performance has increased considerably in the past two decades. This demand for accountability relates to a direct measurement of the attainment of educational standards and objectives. State standards, and specific performance measures of those standards that track competency gains, are a requirement for most educational systems. The emphasis on external, high-stakes standardized exams, such as those mandated by No Child Left Behind (NCLB), favors summative tests over curriculum development and proper instruction, even though research has shown repeatedly that summative assessments tend to have a negative impact on student learning (Black & Wiliam, 1998a; Crooks, 1988). Most of the current research on the abovementioned topics has documented how the standard mode of formative assessment has the potential to raise student test scores. Related to the current proposal, studies have linked standard formative assessment practices to other external measures of student performance, finding a positive relationship (Wiliam, Lee, Harrison, & Black, 2004). For example, Fontana and Fernandes (1994) demonstrated in one study how formative assessment training had a positive impact on an experimental group's mean gain on a summative math exam compared to the control group. Other reviews have noted the role of the teacher in the effective formative assessment cycle, with teachers who engage in more frequent and higher quality formative assessment practices (e.g., the quality and use of feedback) demonstrating higher learning gains in their students (Elawar & Corno, 1985; Fuchs, Fuchs, Hamlett, & Stecker, 1991; Tenenbaum & Goldring, 1989). Many theories have attempted to describe formative assessment in terms of these multilevel relationships (i.e., students, teachers, schools, school districts, etc.), with few studies focusing on methodologically and statistically accounting for these important nested associations (Black & Wiliam, 2009).

Technology and Formative Assessment
Although formative assessment has been an integral component of teacher assessment practices for decades, the introduction of technology into the classroom, specifically computers and the Internet, has provided more options for teachers to engage in this practice. Technological developments in the classroom have led to the increased use of computerized and online formative assessment in multiple subject areas to supplement or supplant traditional modes of formative assessment. Various technology tools have become more readily accessible and implemented in the 21st century. For example, Tuttle (2008) states that effective observation and diagnosis of student learning can be greatly assisted by online quizzes and web-based surveys. Examining the relationship between the most current mode of formative assessment and external, summative exams can provide teachers and administrators with evidence to warrant the continued use of technology-based formative assessment practices, which provide several practical benefits to teachers (e.g., more frequent results and timely feedback, less paper-and-pencil test administration and grading) and students (e.g., more frequent results and timely feedback, engaging and interactive computer interfaces) alike.

Validity
Standard formative assessment practices have been suggested as a means to improve student achievement, with an evaluation of technology-based formative assessment practices not far behind. However, one complication is the lack of empirical evidence bolstered by reasoned arguments to support the claim that improvements in student achievement are associated with this new mode of formative assessment (Nichols, Meyers, & Burling, 2009). According to the Standards for Educational and Psychological Testing (American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME), 1999), "Validity logically begins with an explicit statement of the proposed interpretation of test scores along with a rationale for the relevance of the interpretation to the proposed use" (p. 9). In accord with this statement, the current study used the above argument-based approach to validation as a guide for the use and interpretation of online formative assessment scores and scores from a measure of teacher online formative assessment practices as predictors of student summative test performance. Two kinds of arguments are used in the aforementioned validation process: interpretive arguments and validation arguments (Kane, 2007). Interpretive arguments require producing claims pertaining to test score interpretation, and validation arguments involve a collection of evidence that either supports or refutes each claim. Interpretive and validation arguments were examined in the current research design. Based on the purpose and goals of this study, propositions can be made about the use of DORA scores and scores on the teacher measure of online formative assessment practices. For example, if online formative assessment, in this quasi-intervention study, has a positive effect on student achievement, and a positive relationship exists between teacher online formative assessment practices and student online formative assessment scores, the validity claim can be partially substantiated (Shepard, 2009).

Future validation studies adding to this preliminary body of evidence will either strengthen or contest the findings of this initial investigation. Few studies involving standard formative assessment practices have been conducted using this validation framework to enhance claims that formative assessment can have a positive impact on student achievement (Shepard, 2009). Predictably, no studies of online formative assessment have used the validation argument to assert similar evidence as defined in the body of traditional formative assessment literature. One reason why no such studies have been performed is that quality online formative assessment programs are just beginning to emerge and see national and worldwide adoption and use. The current study used one such program that has seen success in the United States and Canada: Let's Go Learn, Inc.'s (LGL) online formative assessment program for reading, called the Diagnostic Online Reading Assessment (DORA). Data from DORA and state-level proficiency tests in one school district in Colorado were used to examine their relationship in a longitudinal and multilevel fashion while providing validation evidence for the use of DORA scores and scores from a measure of teacher online formative assessment practices.

Computerized/Online Formative Assessment - DORA
DORA is a Kindergarten through twelfth grade measure that provides objective, individualized assessment data across eight reading measures that profile each student's reading abilities and prescribe individual learning paths (Let's Go Learn, Inc., 2009a). DORA displays a student's unique reading profile, which encourages teachers to utilize the results to tailor instruction to individual student needs. DORA's Internet-based program is adaptive (i.e., Computerized Adaptive Testing; CAT), which means that the program adapts to the student's performance level, varying the difficulty of presented items according to previous answers (Let's Go Learn, Inc., 2009a). Students are asked multiple questions in each subtest area, and once a student reaches a ceiling, the program moves to the next subtest area. Individual student and classroom reports are made available for teachers detailing the student's grade-level equivalency on each subtest. In addition, the reports are aligned to all 50 states' standards for reading, and can help track student achievement of grade-level expectations.
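To make the adaptive, ceiling-based testing logic described above concrete, the following sketch shows one generic way such a subtest loop could be implemented. It is only an illustration under stated assumptions: DORA's actual item-selection and stopping rules are not documented here (operational CAT systems typically select items using item response theory rather than a simple difficulty ladder), and every detail in the sketch, including the function names, the subtest labels, the three-misses ceiling, and the difficulty steps, is a hypothetical stand-in rather than a description of the program's real behavior.

    def administer_subtest(answer_fn, start_level=1, ceiling_misses=3, max_items=30):
        """Present progressively harder items and stop once the student misses
        `ceiling_misses` items in a row (a simple 'ceiling' rule) or the item cap
        is reached."""
        level = start_level
        misses_in_a_row = 0
        highest_correct = 0
        for _ in range(max_items):
            # Placeholder item; a real system would draw from a calibrated bank.
            item = {"level": level, "prompt": "placeholder item at difficulty %d" % level}
            if answer_fn(item):            # True/False supplied by the student's response
                highest_correct = max(highest_correct, level)
                misses_in_a_row = 0
                level += 1                 # adapt upward after a correct answer
            else:
                misses_in_a_row += 1       # repeat this difficulty after an error
                if misses_in_a_row >= ceiling_misses:
                    break                  # ceiling reached; end this subtest
        return highest_correct             # crude grade-level-style estimate

    def administer_battery(subtest_names, answer_fn):
        # Once a ceiling is reached in one area, move on to the next subtest area.
        return {name: administer_subtest(answer_fn) for name in subtest_names}

    if __name__ == "__main__":
        # Toy run: a simulated student who answers correctly only below difficulty 5.
        student = lambda item: item["level"] < 5
        print(administer_battery(["word_recognition", "oral_vocabulary", "spelling"], student))

In this toy run the simulated student tops out at difficulty level 4 on every subtest, mimicking how reaching a ceiling in one area triggers the move to the next subtest area.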

Research Questions and Hypotheses
As mentioned previously, this investigation focused on three main goals that correspond to three main research questions and related hypotheses. The first research question is, "What is the relationship between student online formative assessment growth and student state test score growth?" The hypothesis is that student online formative assessment growth will be significantly and positively related to state test score growth. The second research question is, "What are the psychometric properties of the newly developed behavioral frequency measure of teacher use of computerized/online formative assessment programs?" The hypothesis is that a psychometrically sound measure of teacher use of computerized/online formative assessment practices can be created with acceptable item properties and reliability. Finally, the third research question is, "What is the relationship between the aforementioned measure of teacher computerized/online formative assessment practices and student online formative assessment scores?" The hypothesis is that teacher computerized/online formative assessment practices will be significantly and positively related to student online formative assessment scores.
Research Question 1. Data from DORA were used from one school district in Ault, Colorado. DORA data are collected at three assessment points during the academic year, with two occurring before the state proficiency test in March. The school district in this examination adopted DORA in the 2006/2007 academic year, rendering several data points for each student until January of 2010. Student DORA data were linked with students' state proficiency test scores in reading across several years, beginning before DORA adoption in the 2004/2005 academic year. Four longitudinal cohorts of students were tracked from the 2004/2005 academic year through the 2009/2010 academic year. This study utilized the student-level linked DORA and state test scores to examine whether online formative assessment growth is related to state test score growth.
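Although the models estimated for this question are specified fully in Chapter 3, a two-level, time-varying covariate hierarchical linear growth model of the general form used here can be written as follows. The notation is illustrative only and follows common HLM conventions; the covariates actually entered at each level are described in the Methodology chapter.

Level 1 (testing occasions within students):
$Y_{ti} = \pi_{0i} + \pi_{1i}(\mathrm{Time}_{ti}) + \pi_{2i}(\mathrm{DORA}_{ti}) + e_{ti}$

Level 2 (students):
$\pi_{0i} = \beta_{00} + r_{0i}, \qquad \pi_{1i} = \beta_{10} + r_{1i}, \qquad \pi_{2i} = \beta_{20}$

Here $Y_{ti}$ is the state reading test score for student $i$ at occasion $t$, $\mathrm{DORA}_{ti}$ is the student's concurrent DORA subtest score entered as a time-varying covariate, $\pi_{1i}$ is the student's growth rate, and $\beta_{20}$ summarizes the within-student association between DORA scores and state test scores after accounting for growth over time.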

The purposes for developing this measure were multi-faceted: (1) A measure of online formative assessment practices does not currently exist, (2) A quick and portable measure will potentially allow school districts and schools to examine how teachers are using their online formative assessment programs, diagnose problems, and remedy weaknesses, and (3) The measure will be flexible enough to use with programs similar to DORA, and will be adaptable to other content areas such as math and science. Data from reading teachers familiar with DORA were gathered on two fronts: (1) Teachers from the school district used in this current investigation were asked to complete the newly developed online formative assessment survey, and (2) LGL sent a webpage link with the online formative assessment survey to all the teachers currently using DORA across the United States. All teacher participant data were used in the Rasch Analysis to render reliability diagnostics and eliminate poorly fitting items. The final measure can be used as a diagnostic tool for school districts and schools to evaluate how teachers are using their online formative assessment programs. Research Question 3. As noted previously, studies have shown that teachers who engage in more frequent and higher quality formative assessment practices produce higher learning gains in their students (Elawar & Corno, 1985; Fuchs et al., 1991; Tenenbaum & Goldring, 1989). Thus, theoretically, teachers with higher scores on the newly developed behavioral frequency measure will potentially produce students with higher online formative assessment scores and growth on DORA. To examine this final goal, current reading teachers (i.e., in the 2009/2010 academic year) in grades 3 through 10 completed the developed survey of online formative assessment practices, and their responses were linked to their

students' DORA scores from the 2009/2010 academic year. Because students were linked to their specific reading teachers, this required a multilevel growth analysis in which individual students were nested within teachers. Hierarchical linear growth modeling was used to examine the interrelationships between student-level DORA scores and teacher-level online formative assessment survey scores, in a preliminary attempt at validating the scores on the instrument. Finally, all the evidence from the study's main goals can be culled to bolster a validation argument supporting the following assertions: (1) DORA use, as a formative assessment tool, can improve student achievement on state proficiency tests in reading, (2) DORA growth is reflective of student growth in reading, which is related to growth on state proficiency tests in reading, and (3) Teachers who use DORA more frequently and in multiple capacities are able to diagnose students' reading barriers with specificity, and use that feedback to improve their students' state reading test scores.
Contributions
This study is innovative for a number of reasons, beginning with the development of a measure of teacher online formative assessment practices. To date, no flexible, efficient measure exists with the potential to evaluate teacher use of this mode of formative assessment and diagnose weaknesses in the system. Regarding previous studies of online formative assessment, the overwhelming majority have examined college-age populations in the university setting, usually within one course (Buchanan, 2000; Jenkins, 2004). In addition, past and current formative assessment research has thoroughly examined the relationship between

measures of formative assessment and performance on a summative, usually end-of-course or final, exam, but not state proficiency test scores. Current research is also just beginning to use more experimental and quasi-experimental designs and sophisticated statistical analyses, in contrast to the many qualitative studies summarizing student reactions to and perceptions of a technology-based platform for quizzes and exams (Hunt, Hughes, & Rowe, 2002; Peat & Franklin, 2002). Finally, due to the novelty of online or computerized administration, research is understandably lacking in longitudinal data analysis, with no studies examining multiple years of data across several cohorts. Results from this study can support the burgeoning literature outlining the role of the Internet, and technology in general, in teaching and learning. As Internet-mediated teaching and assessment is gradually used to support and/or replace traditional forms of evaluation, the need to examine the extent to which these methods are educationally sound is in high demand. Results from this study can bolster support for federal initiatives and administrative demands for more efficient ways to meet state standards. For example, computer- and Internet-based formative quizzes can deliver fast, automated, and highly individualized results with tailored feedback that teachers and students can use to improve curriculum and potentially increase student learning gains (Buchanan, 2000). In addition to the implications noted above, higher student numbers in schools and colleges, with resulting larger teacher-to-student ratios, can pressure administrators to find alternative, more efficient routes (i.e., in terms of teacher time and money spent on assessment) to raise test scores and meet federal and state demands. Thus, a positive

relationship between online formative assessment and student state test scores can render support for school districts and individual schools to obtain the grant money and permission needed to acquire online formative assessment programs and alleviate some of the abovementioned strain.
Chapter Summary
The next section of this document (i.e., Chapter 2: Literature Review) will include a review of the general formative assessment literature and the limited technology-based formative assessment literature. Chapter 3 will discuss the methodology of the current study, providing an in-depth description of the methods proposed to address the study's main goals. That chapter will review the objectives, rationale, and research questions, and provide an extensive description of the variables, measures, data, and analyses. Chapter 4 will detail the results from the investigation, organized by research question. Finally, Chapter 5 will include a discussion of the results from Chapter 4, with implications, limitations, and future directions.


CHAPTER 2: LITERATURE REVIEW

Educational assessment serves many functions, one of which is to support learning, which is the definition of formative assessment (Wiliam & Leahy, 2007). Formative assessment is broadly defined as "the diagnostic use of assessment to provide feedback to teachers and students over the course of instruction" (Boston, 2002, p. 1). A more commonly cited definition is provided by Black and Wiliam (1998b), which states that formative assessment encompasses "all those activities undertaken by teachers, and/or by their students, which provide information to be used as feedback to modify the teaching and learning activities in which they are engaged. Such assessment becomes formative assessment when the evidence is actually used to adapt the teaching to meet student needs" (p. 8). Under these definitions, general assessment includes various activities on the part of the teacher, such as classroom observation, discussion, and evaluating student work such as in-class assignments, homework, and exams. General assessment is made formative when the process of teaching and learning is modified, based on assessment results, to meet student needs. In the definition of formative assessment above, the term assessment encompasses both formal and informal mechanisms of estimating the status of the student with regard to the curriculum. Formal formative assessment is more easily recognized as

standard or traditional paper-and-pencil tests. In informal formative assessment, teachers may pose frequent questions to students during class to gather information about the degree to which a topic has been properly understood (Ruiz-Primo & Furtak, 2006). Another feature of the definition is that the information rendered from formative assessment must be used in a timely fashion. For example, results of the formative assessment should be readily available to teachers and students so that critical curricular modifications can be made. Thus, assessment is not formative unless the ability to make changes to the present curriculum or subject being taught is an immediate possibility (Bell & Cowie, 2000). By definition, formative assessment also intends for instructional adjustments to better meet the needs of the students assessed. More specifically, not only does the information garnered from formative assessment need to be used to modify instruction or the curriculum, but the information also should improve student outcomes (Black & Wiliam, 1998b). Therefore, the data from formative assessments must be useful and informative to properly meet student needs, which ultimately serves to raise student test scores and overall achievement. The majority of research in this area has sought to ascertain the ability of various formative assessments to meet these definitional components, with the overwhelming consensus being that formative assessment, when used properly, has the ability to increase student performance and achievement.
Formative Assessment History
The term formative (and summative) evaluation was first used by Michael Scriven (1967) in the context of program evaluation, specifically curriculum evaluation.

He stated that evaluation may have a function in the continuous improvement of the curriculum, and proposed the term "formative evaluation" to qualify evaluation in this role. In 1989, the American Association for the Advancement of Science (AAAS) released Science for All Americans, which circulated the idea that students are influenced by their existing ideas in the learning process, and stressed the importance of informative feedback (AAAS, 1989). Following the AAAS, and continuing for a number of years, the National Research Council (NRC; NRC, 1996) encouraged teachers to repeatedly make adjustments to their teaching in response to information gained during intermittent, informal assessment. Coupled with this, the NRC also pushed teachers to provide students with opportunities to self-assess to enhance self-directed learning. Finally, the term formative assessment began to receive attention and use in the mainstream education research literature in the late 1990s. Paul Black and Dylan Wiliam published Assessment and Classroom Learning in 1998, which reviewed two earlier articles by Natriello (1987) and Crooks (1988). Although Black and Wiliam are considered the foremost researchers in formative assessment, Natriello and Crooks discussed the concept of formative assessment long before it was labeled as such in the literature. Natriello (1987), in a review of assessment and evaluation practices, found previous research on the impact of assessment to be misguided due to a lack of control over certain variables such as the quality and quantity of feedback. Crooks, focusing more specifically on the impact of evaluation on students, concluded that the summative function of evaluation has been too dominant, and that more attention should be given to formative assessments (Black & Wiliam, 1998a).

Over the next decade, Black and Wiliam became the leading researchers in the area of classroom and formative assessment by determining in massive and detailed reviews that formative assessment, when used properly, is a definitive means to improve student achievement. Following the success of Black and Wiliam, the NRC in 2000 described how self-assessment by students, and conversations (i.e., between the student and teacher) instead of inquisitions, were critical attributes of formative assessment. This NRC report promoted formative assessment as a key attribute in the classroom, and advocated an increase in the frequency of formative assessment use (NRC, 2000). The formative assessment movement increased in popularity significantly in 2001 and 2002 due to various educational reforms by the national government (i.e., No Child Left Behind; NCLB). NCLB was/is a re-authorization of the Elementary and Secondary Education Act (ESEA) of 1994 (No Child Left Behind Act of 2001, 115 Stat. 1425). NCLB was/is a form of stringent educational accountability that requires public schools receiving government funds to improve their students' scores on state proficiency exams. Educators, in a frenzied attempt to boost their students' scores, began to incorporate formative assessments into their curricula as a research-proven method to increase students' scores on various achievement tests. Due to the sudden demand for quality formative assessments to meet adequate yearly progress (AYP) on state-mandated end-of-year exams, testing companies such as Educational Testing Service (ETS) began promoting and marketing formative assessment programs. In addition to providing these tests, ETS also developed the Assessment Training Institute (ATI). Currently still in operation, this training institute provides

seminars, school improvement services, and professional development programs aimed at increasing the effective use of formative assessment to promote higher student achievement (ETS, 2009). Following the success of the ATI, ETS was commissioned to examine similar training programs such as The California Formative Assessment and Support System for Teachers (CFASST), which provides guidance for beginning teachers as a central component of California's Beginning Teacher Support and Assessment (BTSA) program. ETS investigated the impact of the program on the teaching practices of beginning teachers, such as the use of formative assessment practices, and how that was related to the learning of their students. The study showed that by encouraging and training teachers to incorporate frequent formative assessment, BTSA/CFASST had a positive impact not only on the teachers, but also on the students, who demonstrated increased learning equivalent to half a year's growth or more (Thompson, Paek, Goe, & Ponte, 2004). Thus, the development of training programs, the promotion of formative assessment by reputable testing companies, and the increase in investigating formative assessment programs and interventions in recent years undoubtedly showcase the popularity of formative assessment as a recognized best practice in the current education system, and consequently, a vital component of education reform. More recently, individual schools and school districts are developing strategies, such as technology-based platforms, for supporting formative assessment at the local and state level. For example, in Texas, the Technology Immersion Pilot provides a platform for administering online diagnostic assessments, with test items being provided by each

district (Gallagher & Worth, 2008). In addition, Louisiana has supported a state grant that allows the Louisiana Department of Education to provide all districts with an online formative assessment system, including both a pool of custom items aligned to state standards and training for teachers and administrators in collecting and reporting data for formative purposes (Gallagher & Worth, 2008). Formative assessment policies are becoming commonplace worldwide as well, with the United States and the United Kingdom as leaders in this field of research. In 2005, the Center for Educational Research and Innovation (CERI) at the Organization for Economic Co-operation and Development (OECD) published research on formative assessment in secondary education, drawing on case studies of eight countries (i.e., Canada, Denmark, England, Finland, Italy, New Zealand, Australia, and Scotland). The review detailed the development and adoption of formative assessment policy in other countries (Seba, 2005).
Theoretical Frameworks
The current study is couched in theory from diverse areas such as cognitive and educational psychology. However, any theoretical framework of computerized/online formative assessment must first address how students learn, beginning with traditional theories of learning (i.e., Information Processing Theory), and progressing towards more contemporary theories that account for the influence of technology on information processing and learning (i.e., The Cognitive Theory of Multimedia Learning). The abovementioned theories will be briefly discussed, followed by Black and Wiliam's (2009) comprehensive explanation of the connection between learning theory and formative assessment.

Information Processing Theory (IPT) is viewed as one of many general and wide-ranging theoretical frameworks for how students learn. This theory uses a computer analogy to explain how students encounter, process, and retain information. Goldhaber (2000) states that in IPT, the mind is analogous to a computer, stressing that "the cognitive structures that store the information are the hardware, and the cognitive processes are the software" (Goldhaber, 2000, p. 113). There are three processing levels in this theory: (1) sensory registers, (2) short-term memory (or working memory), and (3) long-term memory. IPT uses these levels to examine how information is processed, utilized, and learned, and how subsequent behavior is affected. The first level, or sensory registers, is where the information is first received from the surrounding environment (Miller, 1956). This is followed by working memory, or short-term memory, where information may be further processed and prepared for encoding and storage. Working memory processes information for longer periods of time if the individual is actively concentrating on the information. Depending on how deeply the information is processed, it may be transferred to long-term memory via rehearsal, organization, and elaboration (McDevitt & Ormrod, 2004). At its most basic level, IPT explains how students perceive, remember, and store information learned in the classroom on a daily basis. Specific to student learning in the classroom, IPT illustrates that learning something new involves attending to and focusing on the new material, comparing it to stored material existing in long-term memory, and either adding the new material to the long-term memory register, or creating new categories for the information that does not fit in any established group (Svinicki, 2005).

The latter part of that process is the most critical step in demonstrating learning in a classroom setting. Once information is moved from working memory to long-term memory, successful retrieval of that information in the future via different forms of assessment is an indication that learning has occurred (McDevitt & Ormrod, 2004). A more modern theoretical framework, stemming from IPT, which accounts for the influence of multimedia (e.g., technology) on information processing and learning is The Cognitive Theory of Multimedia Learning (Mayer & Moreno, 1998). The overarching belief in this theory is termed the multimedia principle, which states that "people learn more deeply from words and pictures than from words alone" (Mayer, 1997, p. 47). This theory includes three components: (1) Two separate channels (i.e., auditory and visual) exist for processing information, (2) Each channel has a limited capacity, and (3) Learning is an active process of filtering, selecting, organizing, and integrating information based upon prior knowledge. Similar to how learning is explained in IPT, multimedia learning is illustrated in three important cognitive processes: selecting, organizing, and integrating (Mayer & Moreno, 1998). Selecting, similar to the sensory registers in IPT, is applied to separate incoming verbal and visual information, forming two text bases: a word base and an image base. Following this, organizing (i.e., working memory in IPT) occurs when a verbally-based model of the eventual system is created from the word base, and a visually-based model of the pending system is applied to the image base. Finally, similar to the transfer of information from short-term memory to long-term memory in IPT,


integrating involves building connections between corresponding parts of information in the verbally-based model and the visually-based model. The processes of the theory (i.e., selecting, organizing, and integrating) are embedded in five principles and corresponding research demonstrating how student learning occurs in a multimedia and technology-infused environment. The first principle is the multiple representation principle, which states that "it is better to present an explanation using two modes of representation rather than one" (Mayer & Moreno, 1998, p. 2). Research has shown that students presented with auditory and visual guides for procedural learning demonstrated significant learning and retention of information when asked questions about the presented task (Mayer & Anderson, 1991, 1992). The second principle is the contiguity principle, which illustrates that students better understand a presented topic when words and pictures are presented simultaneously compared to separating the words and pictures (Moreno & Mayer, 1999). This was also demonstrated in the same series of research studies as mentioned above, and similar results were found by other researchers (Sweller & Chandler, 1994). The third principle is the split-attention principle, which notes that words should be presented auditorily rather than visually for the best information processing and subsequent learning results. For example, research has shown that students who viewed an animation of a procedure while also listening to a corresponding narration generated more correct answers on a related test than did students who viewed the same animation while reading corresponding on-screen text (Moreno & Mayer, 1999).


The fourth principle is the individual differences principle, which holds that the previously mentioned principles (i.e., multimedia, contiguity, and split-attention) depend on individual differences in the student. Research has shown that there are differences in the impact of multimedia stimuli on students who lack prior knowledge (i.e., stronger effects) compared to students who possess high levels of prior knowledge (i.e., weaker effects; Mayer & Gallini, 1991). Finally, the fifth principle, or the coherence principle, states that students learn better "from a coherent summary which highlights [only] the relevant words and pictures than from a longer version of the summary" (Mayer & Moreno, 1998, p. 4). Related research has shown that students who read a passage explaining only the basic steps of a procedure demonstrated greater learning compared to students who were provided with additional details in the materials (Mayer, Bove, Bryman, Mars, & Tapangco, 1996). By accounting for the multimedia capabilities (i.e., and increasing technological capabilities) of modern society, The Cognitive Theory of Multimedia Learning has enhanced traditional IPT, and made it a suitable theoretical framework for how learning occurs in an increasingly digital world. The five principles outlined by Mayer and Moreno (1998) have a developing research base, which relates back to the three processes outlined in the theoretical model (i.e., selecting, organizing, integrating). Currently, no all-inclusive theory exists linking multimedia learning to formative assessment. IPT and The Cognitive Theory of Multimedia Learning provide a starting point in examining how students learn in a multimedia environment, which can eventually be connected to the


standard formative assessment literature base to develop a comprehensive theoretical framework for computerized/online formative assessment. Black and Wiliam (2009), in a first attempt to fill the conceptual voids in the literature, detail a theoretical framework for formative assessment. Any theoretical framework specific to formative assessment must include three elements: the teacher's agenda, the internal world of the student, and the inter-subjective. The authors state, "Any evidence of formative interaction must be analyzed as reflecting a teacher's chosen plan to develop learning, the formative interactions which that teacher carries out contingently within the framework of that plan as realized in the social world of the classroom and school and the internal cognitive and affective models of each student of which the responses and broader participation of students provide only indirect evidence" (Black & Wiliam, 2009, p. 26). One theoretical framework encompassing some (but not all) of these features is a variation on self-regulated learning (SRL) theory stemming from IPT (Boekaerts & Corno, 2005; Greene & Azevedo, 2007; Winne & Hadwin, 1998). Traditionally, student learning in IPT has been viewed as primarily teacher-driven; however, the interaction of the teacher, learner, and context also plays a major role in the process. A complete theoretical framework of formative assessment includes all factors (i.e., teachers, students, the context), and how these elements interact to establish the current position of the student, where the student is going, and the ultimate goal for the student (Wiliam & Leahy, 2007). Therefore, general learning theories such as IPT and more modern theories incorporating multimedia in learning do not attend to the other

parties involved, such as teachers and context, and more importantly, do not account for the recursive nature of learning between teachers and students in the form of feedback and self-monitoring (i.e., metacognitive self-monitoring; Black & Wiliam, 2009). This is one of the main criticisms of using a general learning theory such as IPT to frame formative assessment, and the impetus for Black and Wiliam's use of SRL theory to fill some of the gaps in previous learning theory frameworks to better explain formative assessment. Self-regulation is defined as a multi-component, multi-level, iterative self-monitoring process that targets metacognition and subsequent action, as well as environmental cues that modulate one's goals (Boekaerts & Corno, 2005). From SRL theory, one model that attempts to incorporate metacognitive self-monitoring from evaluation or feedback was first described by Winne and Hadwin (1998) and further developed by Greene and Azevedo (2007). The authors state that their model specifies "the recursively applied forms of metacognitive monitoring and feedback that change information over time (thus influencing goals) as self-regulated learners engage in an assignment" (Winne & Hadwin, 1998, p. 203). The model combines the stages of producing a response and the various resources that might be used during these stages. The overarching influence on the entire model is the learner's control and self-monitoring, which directs the progress through the stages and the ways in which the resources are used (Black & Wiliam, 2009; Greene & Azevedo, 2007). The model is outlined in an abbreviated chart below in Table 1.


Table 1
Self-Regulated Learning Theory and Model (Greene & Azevedo, 2007; Winne & Hadwin, 1998)

Stage                              Resource
1. Identify a task.                A. Conditions of learner and context.
2. Plan a response.                B. Operations to transform input and own data.
3. Enact a strategy.               C. Standards: Criteria for self-appraisal.
4. Adapt: Review or re-cycle.      D. Evaluation.
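To illustrate the recursive, re-cycling character of the model summarized in Table 1, the following minimal sketch (in Python) treats the stages and resources as data and walks a learner through an assignment. The simple rule of restarting from stage 1 whenever an evaluation fails is an illustrative assumption for the sketch only, not part of Winne and Hadwin's formal specification.

# Minimal sketch of the stage/resource structure in Table 1.
# The stage and resource labels follow the table; the re-cycle rule
# ("restart on a failed evaluation, otherwise advance") is an
# illustrative assumption, not the authors' formal model.

STAGES = ["identify task", "plan response", "enact strategy", "adapt/review"]
RESOURCES = {
    "A": "conditions of learner and context",
    "B": "operations to transform input and own data",
    "C": "standards: criteria for self-appraisal",
    "D": "evaluation",
}

def evaluate(feedback: str) -> bool:
    """Resource D: crude self-appraisal of feedback (placeholder logic)."""
    return feedback != "too difficult"

def work_through_task(feedback_by_stage: dict[str, str]) -> list[str]:
    """Walk the four stages, re-cycling to stage 1 when an evaluation fails."""
    trace, i, attempts = [], 0, 0
    while i < len(STAGES) and attempts < 10:   # cap attempts to avoid endless re-cycling
        stage = STAGES[i]
        trace.append(stage)
        feedback = feedback_by_stage.get(stage, "ok")
        if evaluate(feedback):
            i += 1                              # evaluation passed: advance to the next stage
        else:
            i, attempts = 0, attempts + 1       # evaluation failed: re-cycle from the start
            feedback_by_stage[stage] = "ok"     # assume a new operation (resource B) resolves it
    return trace

# Example: evaluation at "enact strategy" fails once, triggering one re-cycle.
print(work_through_task({"enact strategy": "too difficult"}))

The point of the sketch is simply that evaluation can redirect the learner at any stage, which is the behavior described in the paragraphs that follow.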

The four stages of the production of a response during the course of an assignment or an assessment include: (1) Identifying the task, (2) Planning a response, (3) Enacting a strategy, and (4) Adapting (i.e., reviewing or re-cycling; Greene & Azevedo, 2007). Any of the four stages can be associated with any or all of the resources, which include: (A) Conditions of the learner and context, (B) Operations to transform input and own data, (C) Standards (i.e., criteria for self-appraisal), and (D) Evaluation. Each of these resources will be briefly explained as outlined by Black and Wiliam (2009). Resource A, or the conditions, encompasses the resources available to a student to learn the material, and the environmental or task constraints in completing an assignment or assessment. Conditions specific to the learner (i.e., cognitive conditions of the student such as past experiences, beliefs, and motivations) and the task (i.e., resources, time, instructional cues) are combined as internal and external components influencing the learning capability and successful completion of the assignment. Conditions influence the next two resources (i.e., B and C), which include operations and standards. Operations include the searching, assembling, rehearsing, and translating processes, which are

considered dependent on the ease with which a student can perform the abovementioned processes and on the student's memory capacity. Resource C, the standards or criteria for self-appraisal, is dependent on the interpretation of the assignment or assessment, the perception of the criteria for success, the personal orientation of the student towards the task, and the student's view of time constraints (Black & Wiliam, 2009). Finally, the predominantly metacognitive component, resource D (or evaluations), involves the overall control and monitoring functions, which may lead to significant revisions in or re-cycling of the learning process in light of various forms of feedback (Black & Wiliam, 2009). Evaluation may lead to re-cycling either after a single stage of the production of a response to an assignment, or after the completion of all stages in succession. For example, a student may evaluate at stage 3 (i.e., enacting a strategy) that completing the task may be too difficult given the possible strategies, and may start again from the beginning with identifying or defining a new task. Conversely, the student may instead remain at stage 3, but implement a different resource in the form of a new operation (i.e., resource B) to enact a very different strategy. As noted above, any stage can be terminated or adjusted in light of feedback or self-monitoring, using any or all resources (i.e., A through D) at a single or multiple time points during the assessment process. One of the criticisms of the above model (and related SRL theory) in explaining formative assessment is the lack of definition of the teacher role. Although it is implied that self-monitoring and evaluation may occur in light of feedback provided from an external source, such as a teacher, this is not specifically mentioned, and thus the model

is deficient in explaining all vital components (i.e., teacher, student, and context) of the formative assessment process (Black & Wiliam, 2009). Thus, this model (and SRL in general) is mainly viewed as a guide for teachers in explaining how students employ metacognitive self-monitoring during assessment, and is also a framework for teachers' contingent actions in providing feedback in the formative assessment process.
The Assessment Cycle
Formative assessment is embedded in an assessment cycle, which Wiliam and Black (1996) define as encompassing three main parts: (1) Evidence, (2) Interpretation, and (3) Action. The first part of the cycle suggests that before any inferences or actions are made, the general level of performance needs to be measured. This evidence is usually in the form of artifacts such as writing samples, tests and quizzes, and audio or videotapes. The second part of the assessment cycle is interpretation of the evidence or performance. In most cases, the interpreter is the teacher, who is usually required to decipher the results in comparison with the entire classroom, the school, the district, or state or national standards. The teacher generally will determine if there is a gap between the evidence and the standards, and from this point, progress to the third part of the cycle: action. Action is the decision phase of the cycle, where the interpretation is used to guide future performance or placement. This phase also incorporates one of the most crucial features of successful formative assessment: feedback. Feedback, as part of the action phase of the assessment cycle, pertains to the gap (or lack of a gap) in knowledge between the evidence and the standards of comparison. Many researchers insist that assessment is

made formative through the process of administering feedback, and incorporating that feedback into future assessments, lectures, and curricula (Black & Wiliam, 1998; Wiliam, 2000).
Feedback
Researchers have attempted to define the major features of formative assessment through extensive research and review of the literature base (Wiliam, 2000; Wiliam, 2007b). Based on findings from Black and Wiliam (1998), Clarke (2001) suggests that the key factors of formative assessment include: (1) effective feedback to pupils, (2) active involvement of students in their own learning, (3) adjusting teaching to take account of the result of assessment, (4) a recognition of the influence that assessment has on motivation and self-esteem, and (5) the need for students to be able to assess themselves and to understand how to improve. Although these key factors are all crucial for the proper implementation of formative assessment, the factor that is arguably the most important is feedback (Shute, 2008). The author notes that research conducted in this area [states] that good formative feedback can significantly improve learning processes and outcomes, if delivered correctly (Shute, 2008, p. 154). The caveat that feedback must be delivered correctly envelops the author's review of the literature, emphasizing the importance of quality feedback in order for the assessment cycle to produce the greatest learning gains in students. Feedback can be defined as any communication between the instructor and the student that provides information about the student's performance on an assessment task (Shute, 2008). The term feedback is generally discussed in the larger context of

formative assessment, and as noted in the definition of formative assessment from Black and Wiliam (1998b), is an integral component of the formative assessment cycle. Regarding research on feedback, the authors have summarized the following: (1) All formative assessment involves feedback between student and teacher by definition, (2) The success of this interaction has a direct impact on the learning process, (3) Difficulties arise in analyzing the contribution of feedback alone, (4) Formative assessment necessitates feedback, and (5) Feedback is most effective when it is objective and not subjective (Black & Wiliam, 1998b; Shute, 2008). Research has shown that providing high-quality feedback on student work is a very powerful way to raise standards (Black & Wiliam, 1998; Shute, 2008). Although some evidence of a negative or null impact of feedback on student achievement exists (Kluger & DeNisi, 1996), the majority of research and meta-analyses have found that feedback improves learning, with effect sizes ranging from approximately .4 to .8 compared to control conditions (Guzzo, Jette, & Katzell, 1985; Kluger & DeNisi, 1996). Shute details recommendations for the effective use of formative feedback in her review, and notes that valid, objective, focused, and clear feedback mechanisms can improve not only student learning but also how teachers teach.
Formative Assessment: Some Evidence
It has been shown repeatedly in the literature that students receiving frequent feedback about their progress make substantial learning gains. Evidence from research is cited frequently in several meta-analyses in support of formative assessment as a key factor in increasing student achievement. Fontana and

Fernandes (1994) used an experimental design in which math teachers participated in self-assessment training for students (e.g., teaching students to understand learning objectives and assessment criteria, and providing tasks and tests to students to assess their own learning). This was implemented in a 20-week math course with younger and older groups of students. The control group, which did not use the self-assessment methods, had smaller gains at posttest compared to the experimental group. Black and Wiliam (1998) also summarized a study by Whiting, Van Burgh, and Render (1995), which examined an implementation of mastery learning in classrooms compared to a control learning condition in other classrooms, with approximately 7,000 students over a period of 18 years. Mastery learning was defined as frequent testing and feedback, and students were required to achieve specific mastery criteria before proceeding to the next task or test. The results showed that the final, summative test scores (and the grade point averages) of the students in classes with mastery learning were higher than those of students in the control classrooms. Similarly, in a study of frequent testing, Martinez and Martinez (1992) examined 120 college students in an introductory algebra course. Results revealed that students who were tested more frequently made significant learning gains compared to students who were tested less frequently. Another example involved 44 students, ages 9 or 10 years, in one elementary school who worked over seven days on instructional materials with graduate students (Schunk, 1996). There were four treatment conditions: in two groups the instructors stressed learning goals (e.g., learning how to solve the problem), and in the other two groups the instructors stressed performance goals (e.g., solving the problem). The groups

were further differentiated by either evaluating their problem-solving capabilities or completing an attitude questionnaire. The outcome measures of skill, motivation, and self-efficacy showed that the group focusing primarily on performance goals without self-evaluation scored lower than the other groups. Thus, frequent self-evaluation, as part of the formative assessment process, has been shown in multiple studies to be a key factor in enhancing student performance. Focusing primarily on student performance and mean gains, Fuchs and Fuchs (1986) conducted a meta-analysis of 21 different studies with elementary school students. The studies were all experimental designs with control groups, and compared effect sizes between teachers and schools that implemented various conditions of formative assessment and those that did not. For example, when teachers frequently (i.e., two to five times per week) reviewed the classroom- and individual-level data from assessment activities with the students, a larger effect size was noted (d = .92) compared to teachers who did not, or who relied on more subjective assessment and feedback techniques (d = .42). Significant mean gains were also noted when teachers used graphical displays to chart the progress of classrooms and individuals, compared to teachers who did not do this frequently or at all. More recent research in London on a formative assessment intervention has demonstrated that students make significant learning gains and produce higher achievement scores when teachers implement various formative assessment strategies. The achievement gains made were comparable to raising achievement from the lower quartile in performance on national achievement tests to above average (Black, Harrison,

Lee, Marshall, & Wiliam, 2002). Likewise, in the United States, the relationship between assessment practices and achievement in mathematics was investigated using data from the Third International Math and Science Study (TIMSS). The results indicated that teacher assessment practices were significantly and positively related to classroom performance (Rodriguez, 2004). In a series of studies on the impact of informal formative assessment, Ruiz-Primo and Furtak (2006, 2007) examined the informal formative assessment practices of science teachers and their classrooms. The results illustrated that using informal formative assessment strategies can also lead to improved student performance compared to teachers and classrooms that do not implement such strategies. In addition, Fox-Turnbull (2006) investigated the relationship between teacher knowledge of formative assessment feedback and student achievement. It was found that formative assessment practices, such as various critical thinking tasks and the quality of feedback provided after such tasks, improved students' achievement on a summative assessment. The researcher found that teacher knowledge had an impact on the use and quality of formative assessment feedback, which had a positive influence on students' achievement. In sum, research has highlighted the benefits of formative assessment on learning, motivation, and achievement. Many short-term benefits of formative assessment are explicated: (1) It encourages active learning strategies; (2) Formative assessment provides knowledge of results and corrective feedback; (3) It helps students monitor their own progress; and (4) Formative assessment strategies foster accountability (Brookhart, 2007). Additionally, in the short term, results have shown that formative assessment also

improves overall academic achievement for low-achieving and disabled students (Boston, 2002). Because formative assessment is arguably the most cost-effective technique to improve student achievement, long-term benefits entail higher lifetime earnings and general economic growth for society (Wiliam & Leahy, 2007).
Computerized/Online Formative Assessment
Technology has become central to learning, and consequently is becoming vital to assessment (Bennett, 2002). E-learning (i.e., learning that is facilitated by electronic technologies) is referred to as part of the equipment of 21st Century scholarship, and support for this is evidenced by the success of online universities and high schools in the United States (Buzzetto-More & Guy, 2006). However, e-learning is only half of the equation, as government mandates have required schools to use data to inform decision making, beginning with the ESEA and NCLB. The use of data has necessitated the development of improved information technology and access to computers and high-speed Internet in schools (Petrides, 2006). Thus, the other half of the equation is the use of data rendered from e-learning, or e-assessment, which entails using electronic technologies to drive student learning and assessment (Ridgway, McCusker, & Pead, 2004). Although formative assessment, formally and informally, has been an integral component of teacher assessment practices for decades, the introduction of technology into the classroom, specifically computers and the Internet, has provided more options for teachers to engage in this practice. Technological developments in the classroom have led to the increased use of computerized and online formative assessment in multiple subject

areas to supplement traditional modes of formative assessment. Various technology tools have become more readily accessible and implemented in the 21st century. For example, Tuttle (2008) states that effective observation and diagnosis of student learning can be greatly assisted by current technology such as clickers, online quizzes, web-based surveys, digital logs, and spreadsheets (i.e., all examples of e-assessment). The following paragraphs will detail some of these technologies, specifically online quizzes, as this mode is frequently compared with traditional formative assessment practices. Clickers are personal response systems that allow instructors to get a quick response to a question from a mass of students (Tuttle, 2008). For example, a professor can pose a sample test question to a large, introductory-level class, and all students equipped with a clicker can respond to the test question immediately using various response options on the clicker (e.g., Yes/No, True/False, or Multiple-choice option button(s)). The professor's computer and projection system display the summary results for everyone to see, and provide quick information on any gaps or trends in student comprehension. Clickers are the quickest way to get test results and feedback, and are extremely useful for large, multi-section, introductory-level classes, where the practical logistics of formative assessment prevent the process from being completely and thoroughly actualized. Another technology tool that aids formative assessment is digital logs, which allow teachers to monitor reading progress in elementary school students (Tuttle, 2008). For example, teachers can record how many words their students read in a specific amount of time, and document this information using a computer, Personal Digital

Assistant (PDA), or other digital device. Spreadsheets are also helpful in aiding formative assessment through modern technology. Recording and rating information on a spreadsheet that can be accessed by students and the teacher can provide nearly instantaneous feedback as part of the formative assessment cycle. The previous two aids mentioned, digital logs and spreadsheets, are more organizational tools that aid formative assessment, and they may require more work on the part of the teacher or instructor. This is in comparison to clickers and the next exemplars, web-based surveys and online quizzes, which can collect and process data instantly and disseminate individualized and collective feedback through the software's or website's structure and programming. Web-based surveys are short quizzes (e.g., three or four questions in length) that a teacher may give at the end of a daily or weekly lesson that provide instantaneous feedback to the student and teacher (Tuttle, 2008). The surveys can either be a structured quiz on the information learned during the week or day, or be more opinion-based, where the student provides information to the teacher about what was confusing or easy to understand about the topic. Online quizzes are similar to the web-based surveys, but are longer and typically mimic the types of questions a student may encounter in an end-of-quarter or end-of-year summative assessment. These online quizzes are usually hosted by a large software program or website that can alter questions depending on the student's response, as in item response theory. These quizzes are also more sophisticated in that the questions can be structured hierarchically and aligned with school, district, or national


proficiency standards. Online quizzes are the most frequently evaluated in the literature as the mode of formative assessment integrating technology in the assessment process.
Computerized/Online Formative Assessment in College Courses
Computerized and online formative assessment have been more readily applied to large, multi-section, undergraduate review courses such as Psychology 101 or Biology 101. For example, Buchanan (2000) examined the effectiveness of an Internet-based formative assessment package that was used in an undergraduate psychology course. The main research question asked if students who used the formative assessment package benefitted from the experience in that their performance on an end-of-course summative assessment was better than that of those who did not use the package. The study examined the Internet-based formative assessment package as an integral part of the syllabus and class requirements for one cohort, and compared that with a different cohort where using the package was not compulsory. Unsurprisingly, a higher level of use of the package was related to superior exam performance. Another example, not drawn from the social sciences, involves a similar study implemented in two graduate-level biomedical science courses. Olson and McDonald (2004) investigated the impact of online formative assessment and hypothesized that providing students with practice online exam questions (i.e., formative assessments) would enable those taking these quizzes to perform better on subsequent summative tests. The formative, online quizzes were provided at two points in each course, and the results were provided immediately after completion, along with the correct answers to the questions. The results showed that students who took the online formative assessments

performed a letter grade higher on the summative exam compared to students who did not take advantage of the formative quizzes. Another study conducted at a university, and involving a summative, end-of-course exam, provides insight into the effectiveness of online formative assessment in an undergraduate business mathematics course (Angus & Watson, 2009). The study used a retrospective regression model controlling for other variables such as prior math ability, and examined the influence of online formative assessments on final examination scores in the course. The results indicated that more exposure to online quizzes led to higher student learning. The authors caution that they did not implement an experimental design, or compare regular paper-based quizzes with regular online quizzes. Thus, claims that one mode is better than another are not supported by that study's findings. Advantages. Many advantages of using computerized or online formative assessment, or e-assessment, have been detailed in the literature. One major benefit is the ease of disseminating feedback to students after an assessment. Buchanan (2000) notes that the ability to provide individualized and timely feedback in a flexible, cost-effective manner makes this mode of formative assessment ideal for large, multi-section, introductory-level college courses. Olson and McDonald (2004) note similar benefits by comparing online or computerized formative assessment to paper-based formative assessment. They state that paper-based tests "have numerous limitations, foremost of which is that students must all be gathered together at one specific site and at a specific time. Individualized


feedback with paper-and-pencil tests becomes time-consuming, and analysis of question reliability and validity can be tedious" (p. 656). Feedback should be timed appropriately so that the student can use this information to improve performance (Brown & Knight, 1994). Feedback from e-assessments can be provided instantaneously, or in some cases disseminated at regular intervals or according to a student's needs, to maximally enhance learning potential and achievement. Timing is a practical constraint that limits the usefulness of traditional formative feedback (e.g., paper-and-pencil tests). The use of technology such as automated, computerized formative assessments or Internet-based formative assessment websites can remedy the previously mentioned limitation, and provide timely, and thus more useful, feedback to students to improve overall academic achievement. The variety of feedback options is also enhanced by using e-assessment strategies. For example, automated formative feedback can be tailored to entirely correct, partially correct, or entirely incorrect answers (Wood & Burrow, 2002). Feedback can be given to students with references to textbook chapters or websites, or delivered after each question or at the end of a timed session (Mackenzie, 2003). E-portfolios, or digital portfolios that are a collection of electronic evidence demonstrating a person's ability and achievement, also have several feedback options as a mode of formative assessment. E-portfolios allow teachers and students to view works in progress and drafts, which allows for the continuous administration of feedback rather than concentrating on the final product (Twining et al., 2006). In addition, online communication tools used in the e-portfolio environment can allow for varied feedback with respect to the audience (i.e., individual

students, groups of students, or other teachers), and the mode of communication such as writing or speech (McGuire, 2005). Other advantages include the customization of assessment products, or the ability of educational practitioners to use technologies to provide assessment solutions to suit their particular teaching and learning needs (Bennett, 2001). The computer and the Internet can provide a range of new tools that classroom teachers can use to create formative assessments to suit their students' needs. For example, many computerized or online formative assessments use different task or item types and diverse assessment designs such as variations of multiple-choice questions (Cassady & Grindley, 2005). One study implemented a multiple-choice question format in which students were allowed to indicate how confident they were in a particular answer before submitting it (Gardner-Medwin & Gahan, 2003). More advanced and specialized computerized or online assessment modes that facilitate formative assessment have been realized, including intensely interactive multimedia such as scenario-based assessments that include vignettes, simulations, and collaborative problem solving (Crisp & Ward, 2005; Hsieh & O'Neill, 2002; Young & Cafferty, 2003). Additionally, alternative forms of assessment are made more accessible via technology, such as e-portfolios and communications tools such as electronic discussion boards and forums in e-learning courses (Keppell & Carless, 2006; McGuire, 2005; Woodward & Nanlohy, 2004). All of the aforementioned options are either complicated or not possible in the traditional formative assessment environment.
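As a concrete illustration of the confidence-rating format mentioned above, the short sketch below (in Python) scores a multiple-choice response with a weight attached to the student's stated confidence, rewarding confident correct answers and penalizing confident errors most heavily. The weight values are arbitrary illustrative assumptions and are not the published scoring scheme of Gardner-Medwin and Gahan (2003).

# Illustrative confidence-weighted scoring for a multiple-choice item.
# The weight tables are arbitrary example values, not the scheme
# used in the Gardner-Medwin and Gahan (2003) study.

REWARD  = {1: 1.0, 2: 2.0, 3: 3.0}    # points for a correct answer at confidence 1-3
PENALTY = {1: 0.0, 2: -1.0, 3: -3.0}  # points for an incorrect answer at confidence 1-3

def score_response(chosen: str, correct: str, confidence: int) -> float:
    """Return the item score given the chosen option and stated confidence (1-3)."""
    if confidence not in REWARD:
        raise ValueError("confidence must be 1, 2, or 3")
    return REWARD[confidence] if chosen == correct else PENALTY[confidence]

# A cautious correct answer earns less than a confident one,
# and a confident wrong answer costs the most.
print(score_response("B", "B", 1))  # 1.0
print(score_response("B", "B", 3))  # 3.0
print(score_response("A", "B", 3))  # -3.0

Schemes of this general kind give the automated feedback an extra diagnostic dimension, distinguishing lucky guesses from confidently held misconceptions.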


Other benefits outlined by previous research relate to the sheer number of students being tested. That is, paper-and-pencil tests are becoming more difficult to administer as more students flood the education system at every level, especially colleges and universities (Ridgway, McCusker, & Pead, 2004). Implementing e-assessment systems can help ease the strain of the enormous number of tests administered. Additionally, preparation for the future is another benefit of using technology-based assessments. Much of everyday life requires people to use computers, and requiring students to take assessments via the computer and/or Internet can help prepare individuals for the ever-increasing digital professional world. Another benefit of e-assessments is the flexibility in scheduling assessments to meet students' needs, especially the needs of part-time or commuter students. For example, computerized or online assessments allow for flexible scheduling and timing of tests, so that students are able to take the assessment when they deem it appropriate or are ready (Ridgway & McCusker, 2003). Flexible scheduling also allows for tests to be taken according to rate of progression. Students progress at different rates, and e-assessments can accommodate students who are progressing faster or slower than the norm. For example, advanced placement tests allow high school students to earn college credit for college-level courses taken during high school. If afforded the opportunity to take these exams at their leisure via a computer or online system, students who are progressing faster in high school can be rewarded and not stymied by scheduled group administration of tests (Ridgway, McCusker, & Pead, 2004).


Finally, with regard to the individual student, research has shown that today's students prefer e-assessment to the standard paper-and-pencil format, although this finding depends on the type of student (e.g., cohort, traditional versus non-traditional, familiarity with technology). Most current students in grade school, high school, and college have been raised with technology in the home, and therefore, according to the research, most current students prefer, and future students will prefer, e-assessment (Richardson, Baird, Ridgway, Ripley, Shorrocks-Taylor, & Swan, 2002). Students' reasons for preferring e-assessment include a sense of more control, user-friendly interfaces, and the potential for simulations. Advantages are present not only for students, but also for faculty, the school system, and education in general. One main benefit is the timeliness of results. Not only will quicker results produce more frequent feedback, which is necessary in quality formative assessment, but they can also improve the design of tests and quizzes more quickly. For example, with quicker results, it is easier for teachers to alter poor test questions on the basis of information gathered during testing, because of the immediacy of data collection and processing of results. This can also include changing the test to ensure that there are items representing all skill levels, and detecting biased words or phrasing (Ridgway, McCusker, & Pead, 2004). Other practical advantages include cost, in that automated formative assessments, specifically multiple-choice questions, are cheap to administer frequently and in bulk. One added practical advantage that is just beginning to flourish is the linking of online test scores and e-assessment information to state standards or benchmarks in various subject

areas. Knowledge of how students are progressing toward meeting state standards in subjects such as math and reading can provide diagnostic information that teachers and school systems can use to implement better strategies or maintain current practices to meet state education goals (Ridgway & McCusker, 2003). Overall, the aforementioned evidence on the benefits of computerized and Internet-based formative assessment highlights the practical gains of using this medium and heralds it as ripe for universal adoption by universities, colleges, and even grade schools and high schools in the immediate future. Disadvantages. Compared to the abundance of advantages of using e-assessment or computerized formative assessment strategies, the disadvantages are few. Not all of them can be remedied, however, and some researchers claim that the introduction of online assessment may disadvantage certain students (Ricketts & Wilks, 2002). In one study comparing three groups of students (i.e., a formative paper-based multiple-choice test group, a formative online test group with an interface similar to a paper-based test, and a formative online test group where each question appeared separately on an individual screen), the results showed that students using the online assessments did not perform as well as those who took the paper-based tests. The researchers examined qualitatively why some students felt that the online assessments may have been disadvantageous, with some students acknowledging forms of technology anxiety when using a computer to complete an assessment. As evidenced in the above study, acknowledging the limitations of e-assessment requires recognizing why traditional formative assessment (e.g., paper-and-pencil

tests) is beneficial. In the traditional medium, all people involved are generally familiar with the format, and high-resolution displays are readily available, which may not be the case when using a computer or other digital medium (Ridgway, McCusker, & Pead, 2004). Another limitation of e-assessment (i.e., a benefit of traditional formative assessment) is that in the majority of computerized testing situations students cannot answer test questions in any order. Many computer-based tests administer one question on the screen at a time, and the only way to progress to the next question is to answer the current one. Finally, the traditional format allows flexibility in how responses are delivered, such as the form of writing (i.e., cursive, print, diagrams, graphs, tables), and minimizes other problems associated with access, such as those faced by students with disabilities (Ridgway, McCusker, & Pead, 2004). Methodological Limitations. Some methodological problems exist in the above examples involving the impact of computerized or Internet-based formative assessment programs. For example, Buchanan (2000) mentions the positives of conducting research that is practice-based as opposed to experimentally manipulated. The author observes that experimentally manipulating natural conditions and controlling for confounds may not be practical or ethical in some cases, and is careful to note that one cannot make causal claims about the medium having a direct impact on student achievement. Olson and McDonald (2004) note similar challenges in methodology and in interpreting results, carefully observing the possibility that the better students are simply the ones who take advantage of the online formative assessment opportunities and, as a result, naturally perform better on end-of-course summative assessments. Additionally,

many studies simply fail to control for important preexisting information such as a measure of pre-test knowledge, technology familiarity, or other demographic and personality variables that can confound results. Controlling for grade point average, among other variables, could add to the validity of results. In addition, conducting more true experimental designs, although practically and ethically difficult, that accurately compare students who take online or computerized formative assessments with students who take paper-based formative assessments would improve the conclusions drawn from these studies and strengthen the overall literature base. In general, several studies have claimed that use of e-assessment is associated with superior learning gains and academic achievement compared to more traditional formats. However, this claim is not well founded in the literature, and assertions that computerized or online formative assessment provides learning gains over and above traditional formative assessment should be supported with more rigorous evaluation. For instance, most studies have been conducted with small cohorts, or the results were confounded with other variables (e.g., technology anxiety and familiarity with the mode of administration, comparability of the test formats and questions, degree of summative versus formative information accessed). Thus, the assertion that e-assessment produces greater learning gains than traditional formative assessment is difficult to make in this era of constantly evolving, ephemeral technological trends and the variety of educational applications of those trends.


CHAPTER 3: METHODOLOGY

Objectives and Rationale The objectives of the current study included examining the relationship between computerized/online formative assessment in reading and end-of-year state test scores in the same subject. Specifically, the relationship between DORA and Colorado state test scores in reading was examined beginning in the 2004/2005 academic year and ending in 2009/2010 in one school district. The relationship between the set of scores was explored using Hierarchical Linear Growth Modeling. Additionally, a behavioral frequency measure of computerized/online formative assessment for teachers was developed, with the intent to validate the scores on this instrument. This study included three main goals: (1) Examining if DORA growth is related to state test score growth, (2) Developing a behavioral frequency measure of teacher use of computerized/online formative assessment programs, and (3) Investigating the relationship between the measure of teacher computerized/online formative assessment use and student DORA scores. From a research perspective, most studies on technology-based formative assessment practices have examined college-age populations usually within one course. Thus, this study is unique in extending the research to younger grade levels, and 46

examining a weighty outcome such as end-of-year state proficiency scores. Additionally, the use of quantitative methods and statistics compared to the majority of existing studies that have examined the implementation of online assessment programs qualitatively is a unique contribution to the field and a solid rationale for conducting the study. Finally, as part of the rationale for conducting the study, research was noted to be lacking in longitudinal data analysis, with no studies examining multiple years of data across several cohorts. Thus, the use of these datasets to answer the objectives will allow for an exploration of trends and growth across multiple years and cohorts. Part of the rationale for developing a portable and efficient measure of teacher use of online formative assessment programs is that no flexible, efficient measure exists that has the potential to diagnose weaknesses in the system. Thus, the developing measure can be used as a diagnostic tool for school districts and schools to evaluate how teachers are using their online formative assessment programs, and find ways to increase the effective and efficient use of these programs. In addition, the developing measure will be flexible to use with similar programs like DORA, and will be adaptable to other content areas such as math and science. Therefore, the rationale for developing a flexible measure is in direct response to the market of online formative assessment products that have expanded to encompass many formats and content areas. Overall, the justification for this study involves many components on a number of levels. At the national level, with the accountability movement on the rise ever since the implementation of NCLB, the evaluation and successful use and implementation of programs that can demonstrate competency gains is imperative for schools and school 47

districts needing funding. At the school level, results from this study have the potential to promote more effective and economical ways to meet state and national standards. For example, a positive relationship between the online formative assessment scores and state test scores would indicate that this more efficient mode of assessment can be used with confidence to improve the curriculum and increase student learning and achievement on various outcome measures. Thus, the stress and strain placed on teachers, schools, and school districts can potentially be alleviated through the examination of technology-based formative assessment practices that are shown to increase student achievement on important exams such as end-of-year proficiency exams, where student success and improvement is directly related to government funding. Context It is important to examine the abovementioned company, LGL, and the Colorado Department of Education (CDE) and their defined content standards for reading to fully develop this study's background. Lets Go Learn, Inc. LGL's company mission statement boasts a commitment to creating innovative, scalable, educational assessment and instructional tools that help parents, teachers, and administrators advance students' abilities in major subject areas such as reading and math (Lets Go Learn, Inc., 2009b). LGL stresses that their research-based products, created by experts in reading, math, assessment, curriculum, and instruction, are effective as formative assessment tools when combined with best practices in education. The company adds that it strives to develop solutions that are


practical and easily sustained in todays educational system, which is becoming increasingly computer and Internet-oriented. The company was founded in 2000 by Richard Capone and Dr. Richard McCallum, and is headquartered in Kensington, California. The stated mission of the company is to provide diagnostic testing, data, reporting, and instruction to potentially boost student performance in reading and math. The computerized/online formative assessment tools are available to large school districts, individual schools, and homes for students who are home-schooled. In December of 2001, the company launched its first product, the LGL Reading Assessment, which is now known as the Diagnostic Online Reading Assessment (DORA). In the same year, LGL received a Department of Education grant to compare its assessment system with other assessments offered by trained specialists. The study provided some proof that LGL computerized/online assessments are effective formative assessment tools (Lets Go Learn, Inc. , 2009c). Results also allowed LGL to tout their products as being helpful in rendering individualized, scalable student achievement profiles that lead to data-driven instruction the cornerstone of quality formative assessment. LGLs product line is research-based and developed by skilled professionals in many fields. The company has developed more products, which are now used frequently by multiple school districts such as other reading modules (i.e., DORA Phonemic Awareness, DORA Spanish, Unique Reader), and several math modules including Diagnostic Online Math Assessment (DOMA), Unique Math, and Pre-Algebra Pathways. LGLs products are aligned with all state standards, and with the requirements of No 49

Child Left Behind (NCLB), and have been used to perform over 600,000 assessments in the United States and Canada (Lets Go Learn, Inc. , 2009b). LGL is managed by Mr. Richard Capone, who is the CEO, Chairman of the Board and Cofounder, Dr. Richard McCallum, Professor in the Graduate School of Education at University of California at Berkeley, who is the Chief Education Architect and Cofounder, and Dr. Stephen Moore, who is the Evaluations Advisor. The company also has an extensive network of advisors in education, curriculum, psychometrics, and evaluation. An Educational Advisory Board, Board of Directors, and Business Advisory Board round off the expert panel of advisers and employees. LGL has partnered with several other assessment companies such as Learning Today, Harcourt Achieve, Sonlight Curriculum, Catapult Learning, Edu2000, and Learning Upgrade. The company distributes their products via Curriculum Associates in the United States and Canada (Lets Go Learn, Inc. , 2009b). Colorado Department of Education. The Colorado Department of Education (CDE) is the administrative branch of the Colorado State Board of Education. The CDE assists Colorado's 178 local school districts containing more than 800,000 preKindergarten through twelfth grade students statewide (CDE, 2009c). The CDE has regionalized services with groups of specialists in nine different areas: (1) Literacy, (2) Special Education, (3) At-Risk Students, (4) Regional Services, (5) Language, Culture, and Equity, (6) Educational Technology, (7) Title I, (8) Early Childhood Initiatives, and (9) Academic Standards, with the latter being the focus in the current examination. The Academic Standards specialists are housed in the Office of Standards and Assessments, 50

which analyzes student performance results from state assessments. This office is also responsible for reviewing the Colorado Model Content Standards (CMCS) and the current research on testing content standards. Colorado Model Content Standards. There are three units in the Office of Standards and Assessments that work together to develop and continuously review the CMCSs: (1) the Unit of Academic Standards, (2) the Unit of Student Assessment, and (3) the Unit of Research and Evaluation. These units are required to adopt formal standards for Colorado in reading and math, and to administer assessments based on those standards under the Elementary and Secondary Education Act (ESEA) of 1994 (i.e., the Improving America's Schools Act; Lauer, Snow, Martin-Glenn, Van Buhler, Stoutmeyer, & Snow-Renner, 2005). NCLB in 2001 (i.e., similar to the ESEA) continued the focus on standards-based education and requires that schools receiving Title I funds make adequate yearly progress (AYP) toward achieving high standards as indicated by student performance on standards-based tests (No Child Left Behind Act of 2001, 115 Stat. 1425). These standards, or content standards, are expectations about what a student should know and be able to do in different subjects and grade levels; they define expected student skills and knowledge and what schools should teach (Bhola, Impara, & Buckendahl, 2003). NCLB requires states to develop content standards in reading (among other subjects), and to implement some form of statewide assessment (No Child Left Behind Act of 2001, 115 Stat. 1425). For

the current study, the content standards for the CSAP were developed by highly qualified content-matter experts in reading/writing, mathematics, and science. Six content standards were developed for reading/writing, with several assessment objectives under each standard differing by grade level. All school districts in Colorado are required to adopt individual content standards which meet or exceed the CMCS, and ensure that their curriculum and programs align with the CMCS. Specific to the current investigation, the CMCS for reading and writing aim to help students to: (1) become fluent readers, writers, and speakers, (2) be able to communicate effectively, concisely, coherently, and imaginatively, (3) recognize the power of language and use that power ethically and creatively, and (4) be at ease communicating in an increasingly technological world (CDE, 2009b). Six CMCSs exist for reading and writing for grades Kindergarten through twelfth grade. These standards include that: (1) Students read and understand a variety of materials, (2) Students write and speak for a variety of purposes and audiences, (3) Students write and speak using conventional grammar, usage, sentence structure, punctuation, capitalization, and spelling, (4) Students apply thinking skills to their reading, writing, speaking, listening, and viewing, (5) Students read to locate, select, and make use of relevant information from a variety of media, reference, and technological sources, and (6) Students read and recognize literature as a record of human experience (CDE, 2009b). Thus, since


standards two and three solely concern writing, they are not factored into the reading state test scores. Students in grade 3 are only tested on Standard 1, while the remaining grade levels tests include Standard 1, and 4 through 6 (CDE, 2009b, 2009d). As mentioned previously, each standard contains assessment objectives, which differ by grade level, also known as the framework of the test. Related to the current investigation, each item on the CSAP is developed to measure a single test objective. Thus, the CSAP Assessment Frameworks define specifically what will be assessed on the state's paper and pencil, standardized, timed assessment. For example, in grade 3 for standard 1 (i.e., Students read and understand a variety of materials), which is summarized as Reading Comprehension, seven test objectives exist. The first objective (i.e., 1.a) is that students should be able to use a full range of strategies to comprehend materials (e.g., directions, nonfiction material, rhymes and poems, and stories). This test objective is linked to subcontent area names, which are standard across all grade levels: (1) Fiction, (2) Nonfiction, (3) Vocabulary, and (4) Poetry. For grades 3 through 6, subcontent areas 1 and 4 are combined to make one subcontent area of Fiction and Poetry. Finally, objectives are linked to specific items on the test and consequently the standard. Participants One of the major goals of the current study was the development of a measure of online formative assessment practices for teachers. There were two phases in the creation 53

of this measure, the development phase (i.e., pilot phase) and the actual implementation phase. In the development phase, five volunteers who either used or were familiar with DORA were asked to review the pilot measure for specific validity criteria (i.e., outlined below). These volunteers were asked at random by Ms. Sue Ann Highland, the District Assessment Coordinator and Curriculum and Instruction Director. Only those with DORA experience were asked to review the survey and provide anonymous feedback. Those reviewing the pilot measure were not participants in the later phases of the study to reduce any bias. That is, teachers who completed the OFAS in a later phase of the study did not serve as reviewers in the first phase of the study. Volunteers were compensated for their time. In the implementation phase (i.e., after the survey had been revised with the feedback and suggestions from the abovementioned volunteers), the survey and a brief demographic inventory was uploaded on an online survey administration website used for data collection, organization, and downloading results in various formats such as Microsoft Excel (e.g., www.surveymonkey.com). The webpage link to this survey was distributed via LGLs contact system to all available teachers who use DORA. Any teachers who used DORA were allowed to participate anonymously. In addition to the above sample, approximately 22 reading teachers who use DORA from the Highland School District were asked to complete the revised survey (i.e., in the same implementation phase). Again, grades 3 through 10 were the focus of the current investigation as these teachers and students are typically those who are held accountable to state testing and standards. All reading teachers in the district were asked 54

to participate, as all reading teachers in the district are required to use DORA. Due to the small sample size, teachers were not excluded based on gender, race, teaching experience, and degree, and English Language Learner (ELL), English as a Second Language (ESL), and Special Education (SPED) teachers were also included. Each teacher was linked to a general classroom of students where reading is one of many subjects taught (i.e., younger grade levels), or a group of students who take reading or language arts from a teacher who only instructs that subject area daily (i.e., older grade levels). All existing data used was from the students linked to the aforementioned teachers (i.e., the DORA and CSAP data). Data was only collected from teacher participants, with existing corresponding student data used from the Highland School District in Ault, Colorado. The Weld County School District No. RE 9, or the Highland School District, is comprised of an elementary school serving grades Kindergarten through fifth grade, a middle school housing grades 6 through 8, and a high school containing grades 9 through 12. All are public schools located in a relatively rural area in Weld County near the cities of Ault and Pierce in Northern/Northeastern Colorado (Weld RE-9 School District, 2009). More specific county, district, and school demographic information will be discussed in the results section. Four cohorts were used to examine the first objective in this study. These cohorts were from grades 3 through 10 beginning in the 2004/2005 academic year and ending in the current 2009/2010 academic year. Each cohort included the following: (1) Cohort 1 beginning in third grade and ending in eighth grade, (2) Cohort 2 beginning in fourth 55

grade and ending in ninth grade, (3) Cohort 3 beginning in fifth grade and ending in tenth grade, and (4) Cohort 4 beginning in sixth grade and ending in tenth grade. The third objective contained students across grades 3 through 8, as only the current academic year (i.e., 2009/2010) data were used, and only students in these grade levels could be linked to one reading teacher. This will be discussed further in upcoming sections. Measures Online Formative Assessment Survey. The OFAS is a brief measure, developed by the primary researcher, consisting of 56 questions pertaining to the use of an online formative assessment of student learning (i.e., DORA). The survey is a behavioral frequency measure that asks teachers to respond to a specific prompt: In a given quarter/semester, how often do you... Questions are asked about general DORA use, accessing subscale results, informing the curriculum, providing feedback, communicating the results, grade-level equivalency results, and using the results. Teachers are asked to rate the items on a 4-point Likert scale (i.e., 0 = Never, 1 = Rarely, 2 = Sometimes, 3 = Almost Always). The survey takes approximately 10 to 15 minutes to complete. The psychometric properties will be discussed in the results section. The original version of the survey can be viewed in Appendix A. Diagnostic Online Reading Assessment. DORA is a Kindergarten through twelfth grade measure that provides objective, individualized assessment data across eight reading measures that profile each student's reading abilities and prescribe individual learning paths (Lets Go Learn, Inc., 2009c). The eight subtests of reading assessed by DORA include: (1) High-Frequency Words, (2) Word Recognition, (3) Phonics, (4)

Phonemic Awareness, (5) Oral Vocabulary, (6) Spelling, (7) Reading Comprehension, and (8) Fluency. DORA displays a student's unique reading profile, which encourages teachers to utilize the results to tailor instruction to individual student needs, a hallmark of formative assessment. The High-Frequency Words subtest assesses words from Edward B. Frys 300 sight words, which include three levels of difficulty (Fry, Kress, & Fountoukidis, 2004). The Word Recognition subtest assesses the students ability to recognize words from lists of increasing difficulty. The Oral Vocabulary subtest examines students oral vocabulary using visual definitions. The Phonics subtest assesses a childs ability to recognize basic, high-utility English phonetic principles including: (1) Beginning Sounds, (2) Short Vowel Sounds, (3) Blends, (4) Silent E Rule, (5) Consonant Digraphs, (6) Vowel Digraphs, (7) R-Controlled Vowels, (8) Diphthongs, and (9) Syllabification (Pressley & Woloshyn, 1995). The Phonemic Awareness subtest diagnoses how students use oral and picture-based items such as addition, deletion, substitution, identification, categorization, blending, segmenting, isolation, and rhyming. In the Spelling subtest, students are required to generate correct spellings of words of increasing difficulty based on the number of syllables in a word, regular phonetic patterns within the words, irregular phonetic patterns within the words, vocabulary level, and the expected familiarity with a word based on his or her grade level. Finally, the Reading Comprehension subtest is a variation on protocols of various informal reading inventories (Gillet & Temple, 1994; Leslie & Caldwell, 1994). Students silently read expository passages of increasing difficulty, and answer questions about 57

each passage immediately after they read it. The questions for each passage include three factual questions, two inferential questions, and one contextual vocabulary question. DORA Research. The construct validity of DORA scores was supported by accurately defining the construct (i.e., the knowledge domains and skills) in initial unpublished pilot studies of the program. DORAs construct validity was derived from current research-based and classroom-proven models of reading subtest acquisition and diagnostic reading assessment. According to a study implemented by LGL with the assistance of a United States Department of Education Small Business Innovation Research (SBIR) grant, several experts in using diagnostic assessment to tailor instructional interventions to student individual needs of reading performance built in construct validity while developing the program (Lets Go Learn, Inc. , 2009c). In addition to construct validity, LGL investigated the concurrent validity (i.e., identified as criterion validity on LGLs website) of their reading assessment products, specifically DORA, by comparing the program to the nationally recognized CAL Reads program. CAL Reads uses a battery of diagnostic reading assessments as part of reading remediation. LGL designed its assessment to measure the same reading outcomes as CAL Reads in its reading assessments. In the same unpublished SBIR study, LGL demonstrated that its online assessment was highly correlated to similar reading assessments administered by reading specialists of the CAL Reads program (Lets Go Learn, Inc. , 2009c). Additionally, in a separate study, LGLs reading assessment was found to be 58

highly correlated with other nationally-normed paper-and-pencil reading assessments (e.g., the Slosson Oral Reading Test, the Woodcock Word Identification Test, and the Woodcock Word Attack). The test-retest reliability of LGL's reading assessments was also examined in a series of unpublished studies in 2003 (Lets Go Learn, Inc., 2009c). The results indicated that variability was low, meaning that the LGL reading assessment can be re-administered with low bias. DORA: The Program. DORA's Internet-based program is a mode of assessment administration that purports to save teachers' time and paperwork via automated results and individualized student feedback. The interface graphics are specific to three different schooling levels: elementary, middle, and high school. The tests are adaptive (i.e., Computerized Adaptive Testing; CAT), which means that the program adapts to the student's performance level, varying the difficulty of presented items according to previous answers (Thissen & Mislevy, 2000). The assessment starting point is determined by the age of the user, and students are asked multiple questions in each subtest area. Once a student reaches a ceiling in any particular subtest area, the program moves to the next subtest area. Additionally, the previous subtests are used to gauge whether a student will be given a subsequent subtest (i.e., phonemic awareness), and to determine at what difficulty level a subsequent subtest will begin. For example, if a student performs consistently well on all previous subtests and is about to begin reading comprehension, the computer will note the

students previous performance on the first six assessments, and start at a more difficult reading comprehension level than at the current grade level of the student. All the assessments together (i.e., all subtests combined) are approximately an hour in length depending on the performance of the user. DORA Reports. Individual student and classroom reports are made available for teachers to download, view, or print. The teacher and parent reports are extensive (i.e., 17 pages) and present a detailed description of each individual students reading profile and provides instructional recommendations. The summary report is a 1-page document detailing the students grade level equivalency on each subtest, gives examples of student performance, and provides a descriptive reading profile. In addition, the reports are also aligned to all 50 states standards for reading, and can help track student achievement of gradelevel expectations. Reports are criterion-referenced, which means that the test relates to some sort of established unit of measure (AERA, APA, & NCME, 1999). DORA is criterion-referenced because it reports a grade-level equivalency for each subtest. For example, a students word recognition skills can be reported at the high sixth grade level. DORA scores will be discussed more thoroughly in the data section. Colorado Student Assessment Program. With school districts across the nation engaging in annual state-wide testing, usually mandatory in grades three and above, the CDE in conjunction with their Office of Standards and Assessments, specifically the Unit of Student Assessment, have supported the administration of three assessments: (1) the 60

Colorado Student Assessment Program (CSAP) in grades 3 through 10, (2) the Colorado ACT (i.e., formerly known as the American College Test) for students in grade 11, and (3) the National Assessment of Educational Progress (NAEP) in grades 4, 8, and 12. Of particular interest to the current study is the CSAP, which was first administered in 1997 to the Colorado Public Schools. The purpose of the CSAP is to demonstrate how students in the state of Colorado are progressing toward meeting academic standards, and how schools are doing to ensure learning success of students (CDE, 2009c). CSAP scores will be discussed in detail in the data section below. Procedure Permission to conduct the study was obtained from the Highland School District on September 25, 2009, and permission was granted from LGL on October 22, 2009 (see Appendices B and C). Institutional Review Board (IRB) approval was obtained on December 21, 2009 under exempt status (see Appendix D). After IRB approval, existing, de-identified formative assessment data (i.e., DORA) were sent electronically from LGL for the 2006/2007 academic year to the present academic year (i.e., 2009/2010), and existing, de-identified reading state test scores (i.e., the CSAP) and information were emailed from the Highland School District for grades 3 through 10 beginning in the 2004/2005 academic year and ending in the current academic year. Data from LGL and Highland were linked with randomly assigned identification numbers. From the existing data mentioned above, four cohorts beginning in 2004/2005 and ending in 2009/2010 were used for longitudinal data analysis in the first research question: (1) Cohort 1 beginning in third grade and ending in eighth grade, (2) Cohort 2 61

beginning in fourth grade and ending in ninth grade, (3) Cohort 3 beginning in fifth grade and ending in tenth grade, and (4) Cohort 4 beginning in sixth grade and ending in eleventh grade. For the third research question, data from the current academic year (i.e., 2009/2010) were used from grades 3 through 8. Finally, for the second research question, survey data were collected from DORA-using reading teachers in the Highland School District in Colorado and across the United States in December of 2009 through February of 2010. Research Question 2. The two-phase procedure for the second research question (i.e., the development of the OFAS) will be described first, and this will be followed by a brief procedural discussion of the data management and cleaning process for Research Questions 1 and 3. Phase 1. Measure Development. The main purpose of the second research question was to develop a psychometrically sound measure of online formative assessment practices related to teacher DORA (i.e., the Online Formative Assessment Survey; OFAS). In a preliminary, informal investigation in autumn 2009, extensive interviews were conducted with employees, staff, and teachers affiliated with LGL who were familiar with how DORA is used. The interview questions can be viewed in Appendix E. These interview questions were developed from an extensive literature review on formative assessment and feedback. They were developed to be used as a flexible guide to elicit conversation from the interviewees, although more or less questions may have been asked during a typical interview. 62

The interviews were transcribed and several questions were compiled to create a pilot measure. This preliminary measure was distributed to five volunteers randomly chosen by Ms. Sue Ann Highland, the District Assessment Coordinator and Curriculum and Instruction Director, who were either familiar with or use DORA. The volunteers included three females and two males with an average of 17 years experience in their profession/field. More specifically, these reviewers included: (1) a Director of Curriculum (i.e., 17 years experience), (2) a Director of Student Achievement (i.e., 28 years experience), (3) a technology teacher (i.e., 5 years experience), (4) a Director of Student Services (i.e., 8 years experience), and (5) a Dean of Students (i.e., 27 years experience). During November and December of 2009, these volunteers were asked to review the survey for several validity criteria including: (1) Clarity in wording, (2) Relevance of the items, (3) Use of standard English, (4) Absence of biased words and phrases, (5) Formatting of items, and (6) Clarity of the instructions (Fowler, 2002). The volunteers were also asked to provide general criticisms or offer suggestions to eliminate questions or add questions to the survey. Upon completing the review, which was to take no less than 30 minutes, the feedback was returned to Ms. Sue Ann Highland. Reviewers were compensated with $20 Visa gift cards mailed by the primary researcher to Ms. Sue Ann Highland (i.e., five gift cards for a total of $100) who distributed them to the volunteers. She collated and forwarded the anonymous feedback to the primary investigator. The feedback provided by the 63

reviewers to improve the survey is detailed in Appendix F, along with the original survey questions. Feedback was reported collectively to the primary investigator. This feedback was used to modify the survey before being used in Phase 2 data collection for the second research question. Phase 2. Data Collection. Phase 2 data collection occurred on two fronts online and traditional formats. The online survey data collection will be described first followed by the traditional survey data collection procedures. On January 5, 2010, the revised survey was uploaded onto a website that hosts survey research www.surveymonkey.com. Before making the survey live, the website link was sent to three education research professionals to review for structure and layout. No changes in question content were made, as content was already reviewed and altered based on suggestions from teachers and administrators with DORA familiarity in Phase 1. The main structure and layout recommendations included the following: (1) Defining who is the primary and coinvestigator, (2) Being consistent in language use (e.g., using the word survey not inventory), (3) Defining the Likert scale points more clearly, (4) Displaying the prompt once on each page, (5) Making the demographics section of the survey into actual questions (e.g., What is your gender?), (6) Using different prompts for some sections of the survey, (7) Formatting the survey so the participant cannot refuse to answer a question, (8) Directing participants to include the current school year when answering demographic questions asking for the number of


years a teacher has taught, and (9) Defining what a subscale is (i.e., a subtest or a subdivision of the entire reading scale). The abovementioned feedback was provided between the dates of January 6 and January 8, 2010. Recommendations 6 and 7 were not incorporated into the changes made to the online survey. Recommendation 6 was not used because differences in the prompts from one page to the next may render missing or inaccurate data. Additionally, recommendation 7 could not be implemented due to IRB restrictions that insist a participant not be forced to answer any question he or she does not want to answer. After consideration of these recommendations, the online survey was altered and finalized on January 16, 2010. The finalized survey webpage link was e-mailed to the Director of Marketing and Educational Development at LGL, Ms. Anne-Evan K. Williams, who collaborated with the companys CEO/Chairman of the Board/Cofounder, Mr. Richard Capone, to distribute the information to the target population. Ms. Williams and Mr. Capone directed an anonymous affiliate at LGL to send an email describing the study along with the webpage link to the survey via their mass messaging system to any teachers/administrators who were currently using DORA along with an invitation to participate. The invitation e-mail was designed by the primary investigator with IRB approval (see Appendix G). The informed consent process was embedded in the invitation, which described consent as completion of the survey. Any teachers/administrators who use or are familiar with DORA were eligible to respond to the survey, which also 65

included a brief demographic section of questions. The demographic questions included the following: (1) Gender, (2) Age, (3) Total years teaching/administrating, (4) Total years in current school district, (5) State of residence, (6) Current grade level affiliation, (7) Current specialization (e.g., courses or subjects taught, special education, etc.), (8) Ethnicity, (9) Highest degree earned, and (10) College major(s). The e-mail was sent to all eligible participants across the United States on January 22, 2010. This recruitment was necessary to obtain a large enough sample size to be able to render reliability diagnostics using Rasch analysis. Approximately 500 school districts used DORA as of January 2010, with no upper limit as to how many teachers/administrators could complete the survey. Participants had approximately one month to respond to the invitation to participate. Attempts were made by the primary investigator to have a reminder e-mail sent to the target population beginning on February 8, 2010, and persisting for two weeks until the survey was closed. A reminder e-mail was never sent by LGL. The survey was officially closed as of February 20, 2010, although the webpage link to the survey remained active until the end of the month due to the paid subscription to the website. Traditional (i.e., paper-and-pencil) survey data collection was used to recruit participants in the Highland School District for the current study. A partnership was formed with the District Assessment Coordinator and Curriculum and Instruction Director for the Highland School District, Ms. Sue Ann Highland. 66

Upon IRB approval, Ms. Highland began to administer the revised survey in paper form to all the teachers in the district. During regularly scheduled district meetings, Ms. Highland explained the study to the teachers using the invitation in Appendix G as a guide and administered the surveys. These district meetings took place between December 21, 2009 and February 20, 2010 immediately following IRB approval. Two meetings were conducted, and approximately 22 teachers were eligible to participate. The same informed consent process was utilized as in the online administration of the survey, which dictated that the completion of the survey was consent to participate. Participants had their names placed in a drawing for six Visa gift cards (i.e., two $50 and four $25 gift cards totaling $200). These gift cards were mailed by the primary investigator to Ms. Sue Ann Highland for distribution. The first group meeting was conducted in early January 2010 at the middle school in the Highland School District during the middle to end of the school week around 8:00 am. Teachers met in a language arts classroom where the study was briefly described and the survey was administered. The participants names were entered into a drawing for the Visa gift card prizes. The second group meeting was conducted in late January 2010 in an administration building for the school district at approximately the same time as the first group. The protocol was replicated as described above. After the second meeting, the drawing for the


prizes took place, and Ms. Highland ensured that the winners not present received their gift cards within the following week. After data collection was completed in the district, Ms. Highland mailed the surveys from Colorado to the primary researcher in Ohio. All physical copies of the survey were received by the end of February 2010. Data are currently housed in a secure, locked filing cabinet in the primary investigators office. Differences in format between the online survey and the paper-and-pencil survey should not be problematic as growing research suggests little to no differences in accurate reporting of information and inferences drawn from online surveys as compared with paper-and-pencil surveys (Daley, McDermott, McCormackBrown, & Kittleson, 2003; Fouladi, McCarthy, & Moller, 2002; Vereecken, 2001). Research Questions 1 and 3. As mentioned previously, the de-identified, existing data for Research Questions 1 and 3 were sent electronically from LGL for the 2006/2007 academic year to present academic year (i.e., 2009/2010), and existing, de-identified reading state test scores (i.e., the CSAP) and information were e-mailed from the Highland School District for grades 3 through 10 beginning in the 2004/2005 academic year and ending in the current 2009/2010 academic year. The information from the two datasets was linked with randomly assigned identification numbers, and sent after IRB approval in January of 2010. Data were further managed and cleaned during January and February of 2010 by the primary researcher.


Two datasets were created in Excel and uploaded to use in other software programs. As mentioned previously, for the first research question, four cohorts beginning in 2004/2005 and ending in 2009/2010 were used for longitudinal data analysis. The third research question contained data from the 2009/2010 academic year, utilizing grades 3 through 8. Pertinent demographics (e.g., gender, ethnicity, Socioeconomic Status [SES], English as a Second Language [ESL] status, English Language Learner [ELL] status) from the district dataset were included in the datasets for descriptive purposes, and to use as covariates in the growth models. Data files from LGL and the Highland School District were structured in a specific format in Excel for analysis using growth modeling software. For Research Question 1, each cohort was managed separately, with student identification number, CSAP test dates and scores, DORA test dates and scores (i.e., including all the subscales), and coding scheme in months all structured in columns for the Level 1 data. At Level 2, the demographic information was arranged in columns (i.e., gender, ethnicity, SES, ESL/ELL status), which were linked to the Level 1 dataset with the same student identification number. For Research Question 3, student identification number, DORA test dates and scores from all subscales for the current academic year (i.e., 2009/2010), and a coding scheme in months were structured in columns for the Level 1 data. Gender, ethnicity, SES, and ESL/ELL status were arranged in columns for the Level 2 data file, which were linked to the Level 1 file by the student identification number. Finally, teacher OFAS


score was included in a Level 3 data file linked to the Level 2 data file. Data were double-checked for accuracy during the last week of February 2010. Data OFAS. The response categories on the OFAS are arranged in a 4-point Likert scale from 0 to 3 (i.e., 0 = Never, 1 = Rarely, 2 = Sometimes, 3 = Almost Always). The OFAS score is a continuous variable, with higher scores indicating more frequent use of computerized/online formative assessment practices specific to DORA use. The implication is that higher scores are a desirable attribute in that higher frequency of formative assessment use has been demonstrated to be beneficial in increasing student learning and performance (Black & Wiliam, 1998). In addition to teacher responses on the OFAS, teacher demographics were provided by the school district including such variables as gender, ethnicity, number of years teaching, number of years in the district, and highest degree earned. DORA. Existing data was used from one school district, the Highland School District, in Ault, Colorado. DORA scores for students in the district were obtained from the academic years of 2006/2007 to 2009/2010. Teachers in the school district are required to administer DORA at least twice a year before state testing commences in early March, and once after state testing in May/June. According to LGL, Highland administers their assessments twice before the state tests are administered and once after the state test. Thus, across the current grade levels of interest (i.e., grades 3 through 10), there were a number of DORA scores for each student in the fall and winter preceding the CSAP (i.e., DORA scores from the fall and winter of 2006/2007 to 2009/2010), and in 70

the spring following the CSAP (i.e., DORA scores from spring of 2006/2007 to 2009/2010). Across grade levels, Fluency subtest scores were not reported, as this subtest is teacher-administered, with teachers rarely recording the scores in the LGL database. As mentioned previously, the DORA scores are criterion-referenced scores representing a grade-level equivalency. For example, Kindergarten is at the 0 to 1 level and fifth grade is at the 5 to 6 level. Scores are reported as such: (1) low grade level = .17, mid grade level = .5, high grade level = .83. Therefore, if a student was performing at an average Kindergarten level, their report would show a grade-level equivalency score near .5, and if a student was performing at a high fifth grade level, their report would show a grade-level equivalency score at 5.83 or above. In addition, each subtest has a specified range as follows: (1) High-Frequency Words has a range of 0 to 3.83 (i.e., Kindergarten through high third grade), (2) Word Recognition, Oral Vocabulary, Spelling, and Reading Comprehension all have a range of 0 to 12.83 (i.e., Kindergarten through high twelfth grade), and Phonics has a range of 0 to 4.83 (i.e., Kindergarten through high fourth grade). Phonemic Awareness scores are based on percent correct out of nine questions. The ranges include the following: (1) 0% to 43% means that there are probable weaknesses, (2) 44% to 65% means that the student has partial mastery, and (3) 66% and above means that there are probable effective skills (Lets Go Learn, Inc. , 2009a). CSAP. Existing Colorado state reading test data was used from grades 3 through 10, as most state-mandated testing begins in grade 3 and continues through high school. For consistency purposes, grades 11 and 12 were not examined because the ACT is 71

administered in eleventh grade and the NAEP in twelfth grade. The CSAP tests are designed to be given in three 60-minute sessions, with grade 3 only having two sessions. Grade 3 CSAP testing parameters are slightly different from grades 4 through 10. For grade 3, there are 40 items on the test for a total test score of 52 points. Thirty-two multiple choice and eight constructed response items are on the test, with the multiple choice score points totaling 32 (i.e., 62% of the total) and the constructed response score points totaling 20 (i.e., 38% of the total). In grades 4 through 10, 70 items are on the test with 91 total test score points for grades 4 through 8 and 95 for grades 9 and 10. There are 56 multiple choice and 14 constructed response items, with the multiple choice score points totaling 56 (i.e., grades 4 through 8 = 62% of the total, grades 9 and 10 = 59% of the total) and the constructed response score points totaling 35 for grades 4 through 8 and 39 for grades 9 and 10 (i.e., grades 4 through 8 = 38% of the total, grades 9 and 10 = 41% of the total). The constructed response items range from two to four points each (CDE, 2009d). Scores are reported for grades 3 through 10 for the total reading test in a scaled score format in addition to a performance level ranging from 1 to 4 (i.e., 1 = Unsatisfactory, 2 = Partially Proficient, 3 = Proficient, 4 = Advanced). For the Highland School District, Proficient, or Level 3, is considered the ideal target. For NCLB and requirements for meeting AYP, Partially Proficient, or Level 2, and above are considered passing (No Child Left Behind Act of 2001, 115 Stat. 1425). In the Highland School District, students are not allowed to be retained if their proficiency scores are not passing according to the district or NCLB guidelines, unless parental permission is granted.
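As a quick check on the point totals reported above, the multiple choice (MC) and constructed response (CR) points sum to the stated totals and percentages:

Grade 3: 32 MC + 20 CR = 52 points, with 32/52 ≈ 62% and 20/52 ≈ 38%.
Grades 4 through 8: 56 MC + 35 CR = 91 points, with 56/91 ≈ 62% and 35/91 ≈ 38%.
Grades 9 and 10: 56 MC + 39 CR = 95 points, with 56/95 ≈ 59% and 39/95 ≈ 41%.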

Each of these proficiency levels has a corresponding definition for each subject tested on the CSAP. For reading, a student scoring at the Advanced level (i.e., Level 4) is described as consistently utilizing sophisticated strategies to comprehend and interpret complex text. Students who score at this level demonstrate exceptionally strong academic performance. Students at the Proficient level (i.e., Level 3) routinely utilize a variety of reading strategies to comprehend and interpret grade-level appropriate text, and demonstrate solid academic performance on the subject matter. Students in Level 2, or the Partially Proficient level, utilize some reading strategies to comprehend grade-level text, and demonstrate partial understanding of the knowledge and application of the skills that are fundamental for proficient work; some gaps in knowledge are evident and may require remediation. Finally, the Unsatisfactory Performance level (i.e., Level 1) describes students with below grade-level competency who require extensive support to comprehend and interpret written information. Significant gaps and limited knowledge exist, with these students usually requiring a considerable amount of remediation (CDE, 2009a). The performance level scale ranges for the CSAP are outlined below in Figure 1 from the CDE website.


Figure 1. Colorado Department of Education (CDE) scores for grades 3 through 10 for the Colorado Student Assessment Program (CSAP) reading test in scaled score format in addition to the corresponding performance level ranging from 1 to 4. These levels include: (1) 1 = Unsatisfactory, (2) 2 = Partially Proficient, (3) 3 = Proficient, and (4) 4 = Advanced (CDE, 2009a).

Scaled scores are reported for each content standard along with the performance level for that content standard. As reported previously, six CMCSs exist for reading and writing for grades Kindergarten through twelfth grade. Standards 2 and 3 are specific to writing, while standards 1 and 4 through 6 pertain to reading. Therefore, only four standards exist for reading (CDE, 2009b). Students in grade 3 are only tested on Standard 1, while the remaining grade levels tests include Standard 1 and 4 through 6. Thus, in the current existing data, four standard scaled scores are reported for grades 4 through 10, and only one is reported from grade 3, along with the corresponding reading proficiency levels. Included in the existing dataset for the CSAP were demographics for each student with a state identification number. Thus, the data were de-identified, although basic demographics were included in the report. For example, ethnicity, gender, language


background, ESL status, ELL status, disability, and gifted and talented program status were some of the demographic variables in the existing state reading test data file. Analyses Different analytic methods were used to address each of the current study's objectives. Each objective will be presented below with the related analytic approach detailed in the following paragraphs. Objective 1 DORA Growth Related to CSAP Growth. It is hypothesized that student formative assessment score growth will be significantly and positively related to student state test score growth, and a regression approach was used to test this hypothesis. Linear regression is one approach that considers the relationship between a specified outcome variable and predictor(s); however, in the current study the data were nested, and ordinary linear regression cannot properly account for a nested design. Raudenbush and Bryk (2002) outline that hierarchical linear modeling (HLM), also called a linear mixed model (LMM), is one regression approach that can account for the nested structure of the data. The following paragraphs detail the chosen analytic method, specifically how the nested structure is formed and how it was analyzed in the current study. Two-Level Time-Varying Covariate Hierarchical Linear Growth Model. According to Raudenbush and Bryk (2002), investigating a relationship with the growth trajectory of another variable of interest is common practice in growth modeling. The authors state, In some applications, we may have other level-1 predictors...that explain

variation in Yti . We term these time-varying covariates (Raudenbush & Bryk, 2002, p. 179). Time-varying covariates are defined as person-level characteristics that are measured and may change over time, and are related to the outcome (OConnell & McCoach, 2004). Thus, the measurements across time and other Level 1 predictors form a nested structure when combined with other student level variables (i.e., Level 2). Hierarchical Linear Growth Modeling is a multilevel modeling technique where the model is considered to have a hierarchical structure at Level 1 (i.e., the repeated measures model). In Level 1 for the current study, occasions of measurement are nested within subjects, where each persons growth is represented by an individual trajectory that depends on a unique set of parameters (i.e., the intercept and slope), a distinct average trajectory for that individual. The parameters at Level 1 become outcomes that are modeled as a function of explanatory variables in the Level 2 model (Raudenbush & Bryk, 2002; Singer & Willett, 2003). The current study used a Two-Level Time-Varying Covariate Hierarchical Linear Growth Model to examine if DORA test score growth is related to CSAP test score growth. In other words, are DORA scores predictive of CSAP scores over time? The hypothesis is that DORA scores will be a significant and positive predictor of CSAP scores. By building a two-level growth model with students who were measured at five time points for state test scores and at least three time points for online formative assessment scores, observations and estimates of students growth over time can be examined. In addition, incorporating other demographic covariates such as gender, ethnicity, SES, and ESL/ELL status into Level 2 of the model provides more information 76

about how student characteristics contribute to the relationship between online formative assessment scores and state test scores in reading. Thus, this growth model provides an analysis of the rates of change across individual students and between groups of students as a function of time-varying (i.e., DORA scores) and time-invariant (i.e., demographic information) covariates. The data for this first objective and research question was analyzed using the statistical package Hierarchical Linear Modeling (HLM) 6.08 (Raudenbush, Bryk, & Congdon, 2004). The analysis fit a linear two-level growth model by using CSAP score at each time point as the outcome variable in the Level 1 models and DORA scores as the time-varying covariate in the same level. Several models were run, using each DORA subtest as the time-varying covariate in separate models. The Level 1 model structure will be outlined in the results section. The individual growth parameters became the outcome variables in the Level 2 models, where they were assumed to vary across individuals depending on student demographic information. Gender was included as a Level 2 covariate due to the fact that 2000 National Assessment of Educational Progress (NAEP) data found that girls score higher than boys in reading, and a higher percentage of girls achieve reading proficiency levels in school (NCES, 2008). Additionally, research has shown that girls display higher reading achievement in elementary school (Butler, Marsh, Sheppard, & Sheppard, 1985). Ethnicity was included in the second level because minority students have been shown to be at a disadvantage in schools across the nation (Jencks & Phillips, 1998). Previous research (Ferguson, 2002; Ferguson, Clark, & Stewart, 2002; Harman, 77

Bingham, & Food, 2002) has documented that there are achievement gaps among students from different ethnic groups, namely Whites versus minorities, in reading. Early environmental factors may contribute to this disparity with White children being much more likely to practice early reading skills than other minorities (Hoffman & Liagas, 2003). SES of the parent (i.e., income and education) has also been found to be a significant predictor of reading achievement in school-age children (Dickinson & McCabe, 2001; Smith, Brooks-Gunn, & Klebanov, 1997). A direct measure of SES was not included in the dataset from the CDE; however, student free/reduced lunch status was present. Research has documented that free/reduced lunch status has been frequently used as a proxy for SES in school-based studies. Free/reduced lunch status children enter Kindergarten with math and reading skills substantially lower, on average, than their middle-class or higher counterparts (Kurki, Boyle, & Aladjem, 2005; Merola, 2005). Therefore, free/reduced lunch status was also included in the current model as a Level 2 covariate. ESL/ELL status was also included in Level 2, and this will be discussed further in the results section. Gender was coded as 0 for Male and 1 for Female. Ethnicity was coded as 0 for White and 1 for Minority. Non-free/reduced lunch status was coded 0, and students enrolled in the program were coded 1. Finally, non-ESL/ELL students were coded 0, and their ESL/ELL counterparts were coded 1. The model at Level 2 is outlined in the results section for this research question.
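To make the structure of this analysis concrete before it is formally presented, a minimal sketch of one possible two-level specification is given below in standard Raudenbush and Bryk (2002) notation. The sketch is illustrative only: the particular symbols, the placement of the demographic covariates on the intercept equation, and the choice of which coefficients are allowed to vary randomly are assumptions made for exposition, not the final model reported in the results.

```latex
% Level 1 (repeated measures; occasion t nested within student i):
% CSAP score as a function of time and the time-varying DORA covariate.
\begin{align*}
Y_{ti} &= \pi_{0i} + \pi_{1i}(\text{Time}_{ti}) + \pi_{2i}(\text{DORA}_{ti}) + e_{ti} \\[6pt]
% Level 2 (between students; time-invariant covariates coded 0/1 as described above):
\pi_{0i} &= \beta_{00} + \beta_{01}(\text{Female}_{i}) + \beta_{02}(\text{Minority}_{i})
          + \beta_{03}(\text{FRL}_{i}) + \beta_{04}(\text{ESL/ELL}_{i}) + r_{0i} \\
\pi_{1i} &= \beta_{10} + r_{1i} \\
\pi_{2i} &= \beta_{20}
\end{align*}
```

In this sketch, e_ti is the within-student residual and r_0i and r_1i are student-level random effects; interest centers on the coefficient for the time-varying DORA covariate, which captures the within-student association between DORA and CSAP scores across occasions.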


A third level in the current growth model was not included. For example, schoolor district-level predictors were not incorporated in the model because their effects on students have been shown to be too distal to have any strong effects on student achievement or outcomes (Pascarella & Terenzini, 2005). Additionally, the formative assessment cycle generally involves students and teachers, and rarely incorporates administrators, schools, and districts in the intimate classroom assessment and feedback loop (Clarke, 2001). It is documented that multilevel growth modeling is well suited to handle more than two waves of data. In fact, estimation of parameters improves as the number of time waves increases, as does direct estimation of the reliability of growth parameters (Francis, Fletcher, Stuebing, Davidson, & Thompson, 1991). In the current model, a small number of waves to conduct analyses were used due to the fact that DORA has only been used in the Highland School District since 2006/2007. The district intends to use DORA as long as their students are improving their reading state test scores and the district is meeting AYP. Therefore, the analysis of these variables is ongoing as more data is collected and reported. Objective 2 Developing the OFAS. A measure of online formative assessment practices of teachers was developed in the current study. The purposes in developing this measure include the following: (1) A measure of online formative assessment practices does not currently exist, (2) A quick and portable measure will potentially allow schools to examine how teachers are using their online formative assessment programs, diagnose problems, and remedy weaknesses, and (3) The measure will be flexible to use with 79

similar programs like DORA, and will be adaptable to other content areas such as math and science. One way to examine if attainment of educational standards and objectives has been actualized is through measuring teaching practices and teacher behaviors. Studies have shown that teachers who engage in more frequent and quality formative assessment practices have higher learning gains in their students (Elawar & Corno, 1985; Fuchs et al., 1991; Tenenbaum & Goldring, 1989). Thus, teachers with higher scores on this behavioral frequency measure can potentially produce students with higher online formative assessment score on DORA, which can also indicate higher achievement on state proficiency tests. Therefore, the proposed measure development will not only aid in the assessment of teaching practices, but also give schools and districts an approximate indication of their students progress or potential achievement on state summative exams. Qualitative Data Analysis. Qualitative data analysis was used in the first phase of measure development, followed by Rasch analysis in the second phase. In a preliminary investigation, Qualitative Data Analysis (QDA) was used to examine the interviews with LGL employees to develop and refine a measure of online formative assessment practices (Caudle, 2004). QDA is used to examine the meaningful and symbolic content of qualitative data in order to identify someones interpretations. Caudles framework for QDA involves two steps: (1) Data reduction and pattern identification, and (2) Producing objective analytic conclusions and communicating those conclusions. LGL employees were informally interviewed about DORA use, and specifically, how teachers use DORA in a given quarter/semester. Data from the interviews were reduced to the major themes, 80

and patterns within these themes were identified. Conclusions were drawn based on these main themes and patterns. Using these interviews and research on DORA from the LGL website, a brief survey was created (i.e., the OFAS), which was described previously. This survey was administered to five volunteers for review and to provide feedback. After the survey was revised with the feedback and suggestions from the abovementioned volunteers, the survey and a brief demographic inventory were uploaded on an online survey administration website used for data collection, organization, and downloading results (e.g., www.surveymonkey.com). The webpage link to this survey was distributed via LGLs contact system to all available teachers who use DORA. Any teacher who uses DORA was allowed to participate anonymously. Rasch Analysis. Rasch Analysis was used to examine the psychometric properties of the newly developed OFAS. Rasch Analysis can be considered part of the Item Response Theory (IRT) family, and has been developed to overcome some of the problems and assumptions associated with Classical Test Theory (CTT). IRT does not require assumptions about sampling or normal distributions, which is ideal for performance assessment with different item structures. It also does not require that measurement error be considered the same for all persons taking a test (Bond & Fox, 2007; Wright & Stone, 1979). Rasch Analysis allows for the creation of an interval scale of scores for both item difficulty and person ability. Scores are reported in logits, and are placed on a vertical ruler, which measures person ability and item difficulty (Wright & Stone, 1979). Many 81

IRT models are available, and the simplest and most efficient one is the Rasch (i.e., one-parameter) model. The Rasch model calculates the probability that a given person will answer a given item correctly as a joint function of person ability and item difficulty. If the model-expected probabilities differ from the observed responses, the results will indicate that the data do not fit the model (i.e., using fit statistics; Wright & Stone, 1979). In the current measure, responses to the items in the OFAS were ordered categories (i.e., a Likert scale) ranging from "Never" to "Almost Always." This response format indicates increasing levels of the variable of interest (i.e., teacher online formative assessment practices). A total score was rendered that summarizes the responses to all the items, and a teacher with a higher total score is said to show more of the variable assessed. Based on the uniform response scale used across all items, a Rating Scale Model (RSM) was implemented (Andrich, 1978). In the RSM, a single set of rating scale thresholds is rendered that is common to all the items. A threshold is "[t]he level at which the likelihood of failure to agree with or endorse a given response category turns to the likelihood of agreeing with or endorsing the category" (Bond & Fox, 2007, p. 314). Due to the nature of the RSM in this study (i.e., four response possibilities), three thresholds were rendered:

P_{nik} = e^{(B_n - D_i - F_k)} / [1 + e^{(B_n - D_i - F_k)}]                                   [1]

where the probability of any person choosing any given category on any item as a function of the agreeability of the Person n is Bn, Di is the endorsability of the entire Item i, and Fk is any given threshold (i.e., estimated across all items). For the current study, approximately 50 to 100 teachers were needed for analysis of the OFAS if item calibrations were to be stable within + 1 logits (i.e., 99% CI - 50 people) or + 1/2 logits (i.e., 95% CI - 100 people; Linacre, 1994). Another consideration is the response structure of the items on the scale (i.e., Likert). Research by Linacre (2002, 1999) found that at least 10 observations per category are necessary for sufficient person and item measure estimate stability when developing a measure. Therefore, the largest possible sample size was obtained (N = 47) from the population of DORA users. The teacher analysis sample is described in depth in the results for Research Question 2. The Rasch Analysis of the OFAS produced fit statistics, infit and outfit, which both were examined to determine items or persons that were problematic for model fit. The mean square fit statistics were also scrutinized if they exceeded 1.5 to 2.0. The higher the mean square fit statistic, the more questionable the information (Wright & Stone, 1979). Based on this information, items or persons were eliminated to produce the best possible Rasch model, and the remaining items comprised the revised OFAS that was used in additional analyses. Objective 3 The Relationship between OFAS and DORA. Student formative assessment scores (i.e., growth) are hypothesized to be significantly, positively related to teacher scores on the newly developed measure of online formative assessment practices, and a regression approach was again used to test this hypothesis. As in the first objective, 83

the data were nested, and the application of linear regression to examine the current objective was not appropriate. Multilevel growth modeling is one regression approach that can account for the nested structure of the data (Raudenbush & Bryk, 2002). The next few paragraphs will outline the chosen analytic method, which will detail the nested structure of the data and how it was analyzed. Three-Level Hierarchical Linear Growth Model. A three-level multilevel growth model was employed to examine the relationship between student DORA score growth and teacher OFAS scores. The current study utilized a three-level model to examine if teacher OFAS scores at Level 3 are related to student DORA growth at Level 1 controlling for student demographic variables at Level 2. In other words, are OFAS scores predictive of DORA scores over time (i.e., over the current academic year)? The hypothesis is that OFAS scores will be a significant, positive predictor of DORA scores (i.e., DORA growth). Specifically, student- (Level 1) and teacher-level (Level 3) variables were examined for the current academic year (i.e., 2009/2010). That is, the last two DORA testing points in the current academic year were used as the outcome variable, with DORA scores from the spring of 2009 as the baseline (i.e., three data points). Each DORA subtest was a different outcome in a separate model. Data from previous years were not analyzed due to the timing of questionnaire administration and the shifting of teachers (i.e., new hires, retirements, etc.). Thus, the 2009/2010 configuration of reading teachers and classrooms was the only option for the current analysis.


By building a three-level growth model with students who were measured at three time points for DORA scores, observations and estimates of students growth over time was examined in relationship to teacher OFAS scores. In addition, incorporating other student-level variables such as gender, ethnicity, free/reduced lunch status, and ESL/ELL status into the second level of the model provided more information about how student characteristics contributed to the relationship between online formative assessment scores and teacher online formative assessment use. Raudenbush and Bryk (2002) state that statistical adjustments for individuals background are important because persons are not assigned at random to certain variables like gender and ethnicity. Failure to control for such variables may bias the estimates of teacher OFAS scores. Additionally, if predictors are strongly related to the outcome of interest, controlling for them will increase the precision of any estimates by reducing any unexplained variance. Thus, the current model provided an analysis of the rates of change across individual students and between groups of students as a function of time-invariant (i.e., gender and ethnicity) covariates. As mentioned previously, DORA scores were documented at the same intervals across the school year with two administrations before the CSAP (i.e., August/September and December/January) and one administration after the CSAP (i.e., April/May). The score from the spring of 2009 (i.e., April/May) served as a baseline measure of DORA achievement. The multilevel growth model was analyzed using the statistical package Hierarchical Linear Modeling (HLM) 6.08 (Raudenbush, Bryk, & Congdon, 2004). A linear three-level growth model was fit by using DORA scores at each time point as the outcome variable in the Level 1 model, student demographic 85

information as the time-invariant covariates in the second level, and teacher OFAS score in Level 3. The demographic covariates were coded the same as in the analysis for Research Question 1. The specific structure of the model will be detailed in the results for this research question.
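For exposition, a comparable sketch of one possible three-level specification is given below. Again, the notation and the decision to let the teacher-level OFAS score predict both the classroom mean and the classroom growth rate are illustrative assumptions rather than the model ultimately reported in the results.

```latex
% Level 1 (occasion t within student i within teacher/classroom j):
\begin{align*}
\text{DORA}_{tij} &= \pi_{0ij} + \pi_{1ij}(\text{Time}_{tij}) + e_{tij} \\[6pt]
% Level 2 (students; demographic controls coded 0/1 as in Research Question 1):
\pi_{0ij} &= \beta_{00j} + \beta_{01j}(\text{Female}_{ij}) + \beta_{02j}(\text{Minority}_{ij})
           + \beta_{03j}(\text{FRL}_{ij}) + \beta_{04j}(\text{ESL/ELL}_{ij}) + r_{0ij} \\
\pi_{1ij} &= \beta_{10j} + r_{1ij} \\[6pt]
% Level 3 (teachers; the OFAS score is the predictor of interest):
\beta_{00j} &= \gamma_{000} + \gamma_{001}(\text{OFAS}_{j}) + u_{00j} \\
\beta_{10j} &= \gamma_{100} + \gamma_{101}(\text{OFAS}_{j}) + u_{10j}
\end{align*}
```

Under this sketch, the coefficient attached to OFAS in the growth-rate equation carries the hypothesis of interest: whether teachers with higher OFAS scores have classrooms with steeper DORA growth, after controlling for student demographics.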


CHAPTER 4: RESULTS

Research Question 1 Descriptives. Descriptive information about the sample used to address the first research question (i.e., Is Diagnostic Online Reading Assessment (DORA) growth related to Colorado Student Assessment Program (CSAP) reading growth?) is summarized in the following paragraphs. To examine this research question, data were used from two sources. Existing data was gathered from the Colorado Department of Education (CDE) for the Highland School District in Ault, Colorado (i.e., demographic information, CSAP scores), in addition to existing data from LGL (i.e., DORA scores). Demographic information and CSAP scores from four cohorts beginning in 2004/2005 and ending in 2009/2010 were gathered from existing records from the CDE for longitudinal data analysis in this first research question: (1) Cohort 1 beginning in third grade and ending in eighth grade, (2) Cohort 2 beginning in fourth grade and ending in ninth grade, (3) Cohort 3 beginning in fifth grade and ending in tenth grade, and (4) Cohort 4 beginning in sixth grade and ending in eleventh grade. DORA scores for the academic years between 2006/2007 and 2009/2010 were sent from LGL for all students in the Highland School District.
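As an illustration of how these two sources could be assembled into a single long-format file of the kind HLM requires, a minimal pandas sketch is shown below. The file names, column names, and the matching of DORA administrations to CSAP occasions via the shared time code are hypothetical; they do not reflect the actual CDE or LGL data layouts.

```python
import pandas as pd

# Hypothetical source files; the actual CDE and LGL layouts may differ.
demo = pd.read_csv("cde_demographics.csv")   # assumed: student_id, cohort, female, minority, frl, esl_ell, iep
csap = pd.read_csv("cde_csap_scores.csv")    # assumed: student_id, test_date, csap_total_ss
dora = pd.read_csv("lgl_dora_scores.csv")    # assumed: student_id, time_code, subtest, score

# One row per student per CSAP administration, with the time-invariant
# demographics attached to every row.
long = csap.merge(demo, on="student_id", how="left")

# Map each CSAP administration date to the time code used in the growth model
# (these codes are the ones reported in Table 7).
long["time_code"] = long["test_date"].map(
    {"03/14/05": -1, "03/13/06": 0, "03/12/07": 1.25, "03/10/08": 2, "03/09/09": 3}
)

# Attach one DORA subtest as the time-varying covariate; aligning DORA
# administrations with CSAP occasions is simplified here to a join on the
# shared time code.
rc = dora.loc[dora["subtest"] == "reading_comprehension",
              ["student_id", "time_code", "score"]].rename(columns={"score": "dora_rc"})
long = long.merge(rc, on=["student_id", "time_code"], how="left")

long.to_csv("rq1_long_format.csv", index=False)
```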


In the following pages, a description of the populations from which the data were sampled will be provided; these populations include all students in the Highland School District who are administered the CSAP and DORA. This will be followed by a summary of the descriptive information for the final student sample used to address this first research question. The specific demographic information that will be highlighted includes gender, ethnicity, free/reduced lunch status, combined English as a Second Language/English Language Learner (ESL/ELL) status, and Individualized Education Program (IEP) status, as these are the main factors of interest included (or excluded) in the analysis of the current research question. A more detailed demographic profile (i.e., including grade level information, mean ages, gifted status, individual ESL and ELL statuses separated, accommodations for testing, 504 Plan status, etc.) of the current school district and related analysis sample can be found in the descriptives section for Research Question 3. State reading test score (CSAP) growth will be described for the state and district as well, including the original sample and analysis sample descriptive statistics for student CSAP and DORA scores. Current County Demographic Information. In Weld County, Colorado, the 2009 population estimate was approximately 250,000, with 27% of that population being under 18 years of age (United States Census Bureau, 2010). Of the entire population, 49.6% were female. In terms of ethnic/racial composition, 68.9% identified themselves as White (Non-Hispanic), 27.4% were Hispanic/Latino, and the remainder fell into other ethnic/racial categories. Twenty point three percent of the homes in Weld County


identified as speaking a language other than English in the home (i.e., mostly Spanish), and 12% of persons in the county fell below the poverty level. According to the National Council for Education Statistics (NCES), the Weld County School District No. Re-9 in Ault, Colorado, is a rural district with three schools an elementary school, middle school, and high school (NCES, 2010). The most recent information for the 2007/2008 academic year includes that the district has 832 students, 61 classroom teachers, and a 13.7 student/teacher ratio. There are 83 (10%) students identified as ELL, and no total number reported for IEP status. More specific demographic information is outlined by school in the following paragraphs. Additionally, at the state and district level, ESL/ELL status is not combined in the following sections (i.e., only ELL status is reported). Current District Demographic Information. In Highland Elementary School, NCES 2007/2008 data included 374 total students, 26 classroom teachers, and a 14.3 student/teacher ratio. Sixty-six (17.6%) students are listed in Kindergarten, 78 (20.9%) in first grade, 39 (10.4%) in second grade, 70 (18.7%) in third grade, 63 (16.8%) in fourth grade, and 58 (15.5%) in fifth grade. There were 186 (49.7%) females and 188 (50.3%) males. The enrollment by race/ethnicity included 229 (61.2%) White (Non-Hispanic), 136 (36.4%) Hispanic, three (.8%) Black, two (.5%) Asian, and four (1.1%) American Indian/Alaskan Native students. Lastly, for the elementary school, 163 students were categorized under free lunch, and 32 were reduced lunch eligible (52.1%). In Highland Middle School for the 2007/2008 academic year, NCES reported a total of 184 students, 17 classroom teachers, and an 11:1 student/teacher ratio. Fifty-eight 89

(31.5%) students are listed in sixth grade, 66 (35.9%) in seventh grade, and 60 (32.6%) in eighth grade. There were 88 (47.8%) females and 96 (52.2%) males. The enrollment by race/ethnicity included 129 (70.1%) White (Non-Hispanic), 49 (26.6%) Hispanic, three (1.6%) Black, one (.5%) Asian, and two (1.1%) American Indian/Alaskan Native students. As an indicator of socioeconomic status (SES) for the middle school, 68 students were categorized as free lunch status, and 15 were reduced lunch eligible (45.1%). ESL/ELL status was not provided for each school in the district (i.e., only the total across the district was provided). Finally, in Highland High School for the 2007/2008 academic year, NCES reported a total of 274 students, 18 classroom teachers, and a 15.1 student/teacher ratio. Seventy-eight students are listed in ninth grade, 74 in tenth grade, 68 in eleventh grade, and 54 in twelfth grade. There were 122 females and 152 males. The enrollment by race/ethnicity included 193 White (Non-Hispanic), 74 Hispanic, two Black, two Asian, and three American Indian/Alaskan Native students. Finally, for the high school, 88 students were categorized as free lunch status, and 25 were reduced lunch eligible. The tables below contain the information for the district as outlined above (see Tables 2 through 4 below).


Table 2
Student District Demographic Information for Highland Elementary School from the National Council for Education Statistics (NCES) for 2008/2009 (N = 374)

Demographic Information              n (%)
Grade
  Kindergarten                       66 (17.6)
  1                                  78 (20.9)
  2                                  39 (10.4)
  3                                  70 (18.7)
  4                                  63 (16.8)
  5                                  58 (15.5)
Gender
  Male                               188 (50.3)
  Female                             186 (49.7)
Ethnicity
  White (Non-Hispanic)               229 (61.2)
  Hispanic                           136 (36.4)
  Black (Non-Hispanic)               3 (.8)
  Asian/Pacific Islander             2 (.5)
  American Indian/Alaskan Native     4 (1.1)
Free/Reduced Lunch
  Eligible                           195 (52.1)
  Not Eligible                       179 (47.9)


Table 3
Student District Demographic Information for Highland Middle School from the National Council for Education Statistics (NCES) for 2008/2009 (N = 184)

Demographic Information              n (%)
Grade
  6                                  58 (31.5)
  7                                  66 (35.9)
  8                                  60 (32.6)
Gender
  Male                               96 (52.2)
  Female                             88 (47.8)
Ethnicity
  White (Non-Hispanic)               129 (70.1)
  Hispanic                           49 (26.6)
  Black (Non-Hispanic)               3 (1.6)
  Asian/Pacific Islander             1 (.5)
  American Indian/Alaskan Native     2 (1.1)
Free/Reduced Lunch
  Eligible                           83 (45.1)
  Not Eligible                       101 (54.9)


Table 4
Student District Demographic Information for Highland High School from the National Council for Education Statistics (NCES) for 2008/2009 (N = 274)

Demographic Information              n (%)
Grade
  9                                  78 (28.5)
  10                                 74 (27.0)
  11                                 68 (24.8)
  12                                 54 (19.7)
Gender
  Male                               152 (55.5)
  Female                             122 (44.5)
Ethnicity
  White (Non-Hispanic)               193 (70.4)
  Hispanic                           74 (27.0)
  Black (Non-Hispanic)               2 (.7)
  Asian/Pacific Islander             2 (.7)
  American Indian/Alaskan Native     3 (1.1)
Free/Reduced Lunch
  Eligible                           113 (41.2)
  Not Eligible                       161 (58.8)

Longitudinal District Profile Demographic Information. Demographic information for the most recent academic year (i.e., 2008/2009) was summarized above from the NCES website, which is the same information provided by the CDE. The CDE has summarized and made demographic information accessible for the previous eight academic years on their website, although summaries of the 2009/2010 academic year are not currently available from either NCES or the CDE (CDE, 2009c).


According to information provided on the CDE website, the district demographic profile has remained relatively consistent in the past few years, specifically the years of interest in the current research (i.e., 2004/2005 2008/2009). Gender has been nearly equal around 50%, and the ethnic composition has remained consistent with the majority being White (Non-Hispanic; i.e., around 65% across all years), and the next highest representation being Hispanic (i.e., around 30% across all years). Free/reduced lunch status has been approximately 43% to 48% across all years, and ELL and IEP status has maintained a range of 11% to 15% of students in the district. These same percentages were noted in the most current academic year documented by the NCES, although this information was separated by school in the district. Demographic information has been summarized from 2004/2005 through 2007/2008 in the table below (see Table 5).


Table 5
Student Demographic Information for the Highland School District from the Colorado Department of Education (CDE) from 2004/2005 to 2007/2008

Demographic Information (n (%))           2004/2005     2005/2006     2006/2007     2007/2008
                                          (N = 868)     (N = 844)     (N = 845)     (N = 843)
Gender
  Male                                    466 (53.7)    453 (53.7)    458 (54.2)    437 (51.8)
  Female                                  402 (46.3)    391 (46.3)    387 (45.8)    406 (48.2)
Ethnicity
  White (Non-Hispanic)                    555 (63.9)    558 (66.1)    562 (66.5)    559 (66.4)
  Hispanic                                285 (32.8)    254 (30.0)    259 (30.6)    265 (31.5)
  Black (Non-Hispanic)                    9 (1.1)       9 (1.1)       9 (1.1)       7 (0.8)
  Asian/Pacific Islander                  4 (0.6)       8 (1.0)       7 (0.8)       5 (0.6)
  American Indian/Alaskan Native          15 (1.7)      15 (1.8)      8 (1.0)       7 (0.8)
Free/Reduced Lunch                        378 (43.5)    408 (48.4)    410 (48.5)    389 (46.1)
English Language Learner (ELL)            102 (11.7)    95 (11.2)     98 (11.6)     115 (13.6)
Individualized Education Program (IEP)    135 (15.5)    122 (14.4)    112 (13.2)    102 (12.1)

Note. Information summarized from http://www.cde.state.co.us/Finance_Text/DecEnrollStudy/3145districtprofile.pdf

State/District CSAP Growth. The CSAP is administered in grades 3 through 10 as a general measure of academic standards and progress. The purpose of the CSAP is to demonstrate how students in the state of Colorado are progressing toward meeting academic standards, and how schools are doing to ensure learning success of students (CDE, 2009c). The CSAP tests a range of subjects (e.g., math, science, reading, writing), with the reading test as the focus of the current study. As mentioned previously, scores are reported for grades 3 through 10 for the total reading test in a scaled score format


(i.e., vertical equating) in addition to a performance level ranging from 1 to 4 (i.e., 1 = Unsatisfactory, 2 = Partially Proficient, 3 = Proficient, 4 = Advanced). The State of Colorado CSAP Growth Model for 2007 through 2009. In August of 2009, the Office of Standards and Assessment in the CDE published summative information regarding the state of Colorados overall growth on the CSAP from 2007 to 2009. This information summarizes the states growth on the CSAP for comparison purposes in addressing this first research question. The state of Colorado will be summarized first, followed by a summary of the districts performance over the past few academic years. The Colorado Growth Model results based on analysis of the 2007 to 2009 statelevel growth data illustrate the extent to which Colorados proficiency objectives for its students are being met over time. The proficiency levels, as described earlier, include Levels 1 through 4. For reading, a student scoring at the Advanced level (i.e., Level 4) is described as consistently utilizing sophisticated strategies to comprehend and interpret complex text. Students who score in this level illustrate exceptionally strong academic performance. For the Proficient level (i.e., Level 3), these students routinely utilize a variety of reading strategies to comprehend and interpret grade-level appropriate text. Students in this level demonstrate a solid academic performance on subject matter. In Level 2, or the Partially Proficient level, these students utilize some reading strategies to comprehend grade level text, and demonstrate partial understanding of the knowledge and application of the skills that are fundamental for proficient work. Some gaps in knowledge are 96

evident and may require remediation. Finally, the Unsatisfactory Performance level (i.e., Level 1) describes students with below grade-level competency, and require extensive support to comprehend and interpret written information. Significant gaps and limited knowledge exist, with these students usually requiring a considerable amount of remediation (CDE, 2009a). Overall, the state-level data (i.e., combined for all grades) paint a picture of both short-term and longer-term progress towards the states goals, especially among specific groups (i.e., ESL/ELL, free/reduced lunch status, Minority students, and IEP status). All of these groups are demonstrating a positive trend of moving in increasing numbers into Proficient and Advanced levels, and being able to stay proficient and above over time, across all CSAP content areas. The results show that Proficient students are on track to maintain this level over time; however, the state still faces challenges in getting large numbers of below-proficient students to attain proficiency. One challenge specific to reading is moving already proficient students to advanced-level performance (CDE, 2009e). The CDE has made tables available containing information from the Colorado Growth Model for 2007, 2008 and 2009. Three years of data are included to provide the opportunity to examine the state data for growth trends. Four different types of information are included for each content area: (1) State median growth percentiles for reading, writing, and math, (2) The percent of students catching up to the Proficient level, (3) the percent of students keeping up at the Proficient level, and (4) the percent of students moving up to the Advanced proficiency level. The state median growth 97

percentiles have been extracted from the CDE website (CDE, 2009e) and are displayed in the figures below (see Figures 2 through 12). Percentiles range from 1 to 99. The middle percentile, or median, is 50 at the state level. This makes it possible to determine whether a group is above or below the middle score for the state and by how much. For example, a percentile of 35 would be well below the median while a percentile of 70 would be well above it. The overall summary table shows that the medians are stable across content areas for all three years. The results for ELL students, in general, show growth over time as they acquire English language skills, with the native English speakers showing stability over time. The results for students with free/reduced lunch status demonstrate somewhat lower median growth percentiles than their non-free/reduced lunch status counterparts. Although there is no evidence of a dramatic change for students eligible for free/reduced lunch, there is a slight positive trend across the three content areas. For ethnicity, compared with 2007 growth data, the 2009 results indicate that the gap among racial/ethnic groups has closed slightly. Evidence supporting this is the improved growth results for minority students in all three content areas over the three years of growth data, compared to the relatively stable results displayed in the nonminority graphs. Results for IEP status students demonstrate a noticeable difference (i.e., lower percentiles across all years and content areas) between their non-IEP counterparts, although a slight upward trend is shown for students with IEP plans. Finally, for completeness, graphs for males and females were also extracted. As expected, males


were consistently lower across all years and content areas compared to females, but both groups demonstrated a slight upward trend.

Figure 2. The state of Colorado median growth percentile by year and content area for all students for 2007 through 2009.

Figure 3. The state of Colorado median growth percentile for English Language Learner (ELL) students for 2007 through 2009.


Figure 4. The state of Colorado median growth percentile for native English speaking students for 2007 through 2009.

Figure 5. The state of Colorado median growth percentile for free/reduced lunch status students for 2007 through 2009.


Figure 6. The state of Colorado median growth percentile for non-free/reduced lunch status students for 2007 through 2009.

Figure 7. The state of Colorado median growth percentile for minority students for 2007 through 2009.


Figure 8. The state of Colorado median growth percentile for non-minority students for 2007 through 2009.

Figure 9. The state of Colorado median growth percentile for Individualized Education Program (IEP) students for 2007 through 2009.


Figure 10. The state of Colorado median growth percentile for non-Individualized Education Program (non-IEP) students for 2007 through 2009.

Figure 11. The state of Colorado median growth percentile for female students for 2007 through 2009.


Figure 12. The state of Colorado median growth percentile for male students for 2007 through 2009.

Highland School District CSAP Growth from 2007 through 2009. The CDE also provides growth models for each district. District growth rates are determined by combining growth percentiles from individual students. Growth rates for individual students are calculated by comparing their CSAP scores in the three major content areas over consecutive years. These individual growth scores are combined into a single number the districts median growth percentile. Higher median growth percentiles indicate higher growth rates for students in those districts, regardless of the districts achievement. For example, a low-achieving district can show high growth rates or a highachieving district can show low growth rates. This figure below (see Figure 13) is a summary of the reading results for three consecutive years for the Highland School District (and Colorado state-level data for comparison purposes) to examine trends in the data. The state median growth percentile for any grade is 50. Districts or other groups with medians less than 50 are growing at a slower rate than the state. Districts and grade 104

levels with numbers at or above 50 are growing as fast or at a faster rate than the state. District totals in this report reflect all grades in the district. The total growth percentile across the three years in reading for all grades in the district shows a slight decline compared to the state-level data. For ELL students, the trend is curvilinear with a spike in growth for 2008 compared to 2007 and 2009, compared to the native English speakers who showed more stability over time. The growth for ELL students state-wide was more consistent and positive over time by comparison. The results for students with free/reduced lunch status (i.e., FRL in the figure below) for the district demonstrate somewhat lower median growth percentiles than their non-free/reduced lunch status counterparts. Although there is no evidence of a dramatic change for students eligible for free/reduced lunch, there is a slight negative trend compared to the state-level data. For ethnicity for the district, minority students displayed a negative trend across the three years compared to their non-minority counterparts. This is opposite from the state-level growth information, which shows a slight improvement for minority and nonminority students over the three years of growth data. Results for IEP status students demonstrate a noticeable difference (i.e., lower percentiles across all years) between their non-IEP counterparts, and a noticeable negative trend across the three years for IEP students in general. This is compared to the relatively consistent percentiles across all three years for IEP and non-IEP status students for the state. Finally, for completeness, gender was highlighted as well. As expected, males were consistently lower across all years compared to females for the district (and state). Girls demonstrated a slight upward


trend across the three years, and boys displayed a slight negative trend for the district and the state reading proficiency exam.
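As a small illustration of the summary statistic being described, the sketch below computes median growth percentiles by year and subgroup from a hypothetical long-format file that already contains a student growth percentile (sgp) column. The Colorado Growth Model's own SGP calculation is not reproduced here, and the file and column names are assumptions.

```python
import pandas as pd

# Hypothetical file: one row per student per year with a precomputed
# student growth percentile (sgp) and 0/1 subgroup indicators.
growth = pd.read_csv("district_growth_percentiles.csv")  # assumed: year, sgp, frl, ell, minority, iep, female

# District median growth percentile by year -- the single number reported
# in the CDE growth-model summaries (50 marks the state median).
print(growth.groupby("year")["sgp"].median())

# Median growth percentile by year for a subgroup and its complement,
# e.g., free/reduced lunch (FRL) versus non-FRL students.
print(growth.groupby(["year", "frl"])["sgp"].median().unstack("frl"))
```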

Figure 13. A summary of the median growth percentile reading results for three consecutive years (i.e., 2007 through 2009) for grades 4 through 10 for the Highland School District and the state of Colorado. Minority versus non-minority, free/reduced lunch (FRL) versus non-FRL, Individualized Education Program (IEP) versus non-IEP, English Language Learner (ELL) versus non-ELL, and females versus males are also shown (CDE, 2009e).

Original District Sample As mentioned previously, the group that will be described in the following paragraphs includes all students in grades 3 through 11 who are administered the CSAP (and theoretically DORA) in the Highland School District in Ault, Colorado. The population of students in grades 3 through 11 who are administered DORA across the United States and Canada cannot be described, as LGL does not collect demographic 106

information from students. This is considered one of the limitations of the current study; however, the group of students administered DORA in grades 3 through 11 can be described for Highland, as this information was provided by the district (i.e., the CDE). Students in grades Preschool through 2 and grade 12 were not included in the following description for several reasons. First, the youngest grade levels were not included because this study is focusing primarily on the state test and regularly administered formative assessments. State testing in Colorado begins in grade 3, and grades 11 and 12 are given college preparatory exams and high school exit exams (i.e., not the CSAP). Additionally, DORA is administered more frequently in younger grade levels, and at least three time points are necessary to analyze the data for this research question, which supports the omission of the older grade levels. As mentioned above, the last cohort analyzed in this research question includes grades 6 through 11. Individuals with eleventh grade DORA scores were included even though CSAP testing does not continue after tenth grade. These individuals were retained to increase the sample size to conduct the proposed Hierarchical Linear Growth Model. The original sample demographic information will be discussed as a whole, and outlined by cohort in the descriptive table for ease of presentation, even though the analysis will include all individuals (i.e., cohort is not a factor). Information is not presented longitudinally due to the fact that individuals across the years of interest in the current dataset for this first research question did not change demographic status. The full original sample consisted of 298 students in grades 3 through 11 across the academic years of 2004/2005 to present, which included 135 females (45.3%) and 163 107

males (54.7%). The cohorts included the following: (1) 72 students in Cohort 1 (24.2%), (2) 75 students in Cohort 2 (25.2%), (3) 67 students in Cohort 3 (22.5%), and (4) 84 students in Cohort 4 (28.2%). The ethnic composition of the population included 202 students (67.8%) categorized as White (Non-Hispanic), and the remaining individuals classified as minority (n = 96; 32.2%). The minority students were further differentiated in that 87 were Hispanic (29.2%), three were Black (1.0%), two were Asian/Pacific Islander (.7%), and four were American Indian/Alaskan Native (1.3%). As this studys measure of SES, equal amounts of students (n =149) were eligible for free/reduced lunch status as those not eligible (50.0%). ESL/ELL status comprised 16.4% of the sample (n = 49; 16.4%), and 43 students were categorized as needing an IEP (14.4%). These data can be viewed below in Table 6 separated by cohort.


Table 6 Student Demographic Information in the Original Sample from the Highland School District for Grades 3 through 11 across the 2004/2005 to 2009/2010 Academic Years by Cohort
Demographic Information (n (%)) Cohort 1 (n = 72) Cohort 2 (n = 75) Cohort 3 (n = 67) Cohort 4 (n = 84) Total (N = 298)

Gender Male Female Ethnicity White (Non-Hispanic) Hispanic Black (Non-Hispanic) Asian/Pacific Islander American Indian/Alaskan Native Free/Reduced Lunch Eligible Not Eligible English Language Learner (ELL) Yes No Individualized Education Program (IEP) Yes No

35 (48.6) 37 (51.4) 51 (70.8) 20 (27.8) 1 (1.4) 41 (56.9) 31 (43.1) 11 (15.3) 61 (84.7) 12 (16.7) 60 (83.3)

44 (58.7) 31 (41.3) 51 (68.0) 22 (29.3) 1 (1.3) 1 (1.3) 35 (46.7) 40 (53.3) 12 (16.0) 63 (84.0) 13 (17.3) 62 (82.7)

35 (52.2) 32 (47.8) 45 (67.2) 19 (28.4) 1 (1.5) 2 (.3) 32 (47.8) 35 (52.2) 12 (17.9) 55 (82.1) 11 (16.4) 56 (83.6)

49 (58.3) 35 (41.7) 55 (65.5) 26 (31.0) 1 (1.2) 2 (2.4) 41 (38.8) 43 (51.2) 14 (16.7) 70 (83.3) 7 (8.3) 77 (91.7)

163 (54.7) 135 (45.3) 202 (67.8) 87 (29.2) 3 (1.0) 2 (.7) 4 (1.3) 149 (50.0) 149 (50.0) 49 (16.4) 249 (83.6) 43 (14.4) 255 (85.6)



Original District Sample CSAP Scores. CSAP scores for the district will be described in the following paragraphs, as this information is important for comparison purposes with the final analysis sample. Any given student could have between zero and five CSAP scores. Each of the five CSAP tests in the current data set were administered in the winter/spring of each academic year, with all CSAP data points documented around March of 2005, 2006, 2007, 2008, and 2009. Although at least three data points are needed to model growth in the HLM software, the entire original district sample CSAP scores will be described, including those students lacking the minimum number of data points needed, for comparison purposes. As mentioned previously, all scores will be summarized together, and also separated by cohort. In the full district original sample (i.e., four cohorts from third to tenth grade), there are CSAP scores from the academic years of 2004/2005 to 2008/2009. Each cohort has five CSAP scores, but will be analyzed collectively as one group in HLM. Further examination of cohort effects is beyond the scope of this study. As shown in the table and figures below (see Table 7 and Figures 14 and 15), the total original samples CSAP scores followed a positive linear trend, with the last two time points having nearly the same average CSAP scores demonstrating smaller growth compared to the other time points. In examining each individual cohort, Cohorts 2 and 3 displayed a negative slope between the final two time points from the CSAP administration in 2007 to 2008. Although this may be problematic, the total sample (i.e., all cohorts combined) showed a relatively consistent positive linear trend appropriate for the proposed analysis in the current research question. 110

The highest average CSAP score was in the final CSAP administration in the current dataset for Cohort 4 (M = 646.34, SD = 64.70). This is not surprising, as this time point is from the final cohort with the oldest age group/grade levels represented (i.e., sixth through tenth grade). The lowest average CSAP score was in the first CSAP administration in the current dataset for Cohort 1 (M = 545.78, SD = 63.65). Again, this is not surprising as this cohort has the youngest students/grade levels represented (i.e., third through seventh grade).


Table 7
Descriptive Statistics for the Original District Sample CSAP Reading Scores for the Highland School District for the 2004/2005 to 2008/2009 Academic Years by Cohort

CSAP Total Scaled Reading Score (M (SD))

CSAP Testing Date  Cohort 1 (Grades 3-7)   Cohort 2 (Grades 4-8)   Cohort 3 (Grades 5-9)   Cohort 4 (Grades 6-10)   Total
(Time Code)
03/14/05 (-1)      n = 46                  n = 48                  n = 48                  n = 54                   N = 196
                   545.78 (63.65)          564.56 (84.17)          615.65 (58.84)          614.26 (62.56)           586.36 (73.98)
03/13/06 (0)       n = 50                  n = 56                  n = 55                  n = 61                   N = 222
                   587.54 (56.40)          584.68 (79.65)          630.93 (49.83)          626.64 (64.34)           608.31 (66.89)
03/12/07 (1.25)    n = 57                  n = 64                  n = 58                  n = 69                   N = 248
                   610.09 (63.42)          602.39 (58.88)          646.88 (52.53)          638.45 (63.08)           624.60 (62.25)
03/10/08 (2)       n = 59                  n = 63                  n = 54                  n = 68                   N = 244
                   626.85 (55.30)          628.17 (54.59)          658.28 (47.83)          653.74 (53.94)           641.64 (54.72)
03/09/09 (3)       n = 62                  n = 67                  n = 61                  n = 69                   N = 259
                   638.29 (52.44)          612.40 (80.62)          653.26 (57.39)          680.42 (42.24)           646.34 (64.70)

Figure 14. A positive linear trend demonstrated in the full original district sample (i.e., four cohorts from third to tenth grade combined). There are five CSAP scores from the academic years of 2004/2005 to 2008/2009. Mean TotSS on the Y-axis is the mean of the total scaled score for the CSAP reading state test at each test administration. Time on the X-axis is represented by the time code used in the multilevel growth model. See Table 7 for the date of the test administration.


Figure 15. Positive linear trend demonstrated by Cohorts 1 and 4 from the full original district sample. Cohorts 2 and 3 displayed a negative slope between the final two time points from the CSAP administration in 2007 to 2008. Mean TotSS on the Y-axis is the mean of the total scaled score for the CSAP reading state test at each test administration. Time on the X-axis is represented by the time code used in the multilevel growth model. See Table 7 for the date of the test administration.

Original District Sample DORA Scores. It is important to examine the DORA scores for the district as well, as this information is integral to addressing Research Question 1 and useful for comparison with the final analysis sample. Seven of the eight subtests are presented here, with fluency scores being omitted; the fluency subtest is teacher-administered, and teachers in the current school district rarely recorded these scores in the LGL database. As mentioned previously, the scores are reported in grade-

level equivalency format. Scores are reported as such: (1) low grade level = .17, mid grade level = .5, high grade level = .83. Each subtest has a specified range as follows: (1) High-Frequency Words has a range of 0 to 3.83 (i.e., Kindergarten through high third grade), (2) Word Recognition, Oral Vocabulary, Spelling, and Reading Comprehension all have a range of 0 to 12.83 (i.e., Kindergarten through high twelfth grade), and Phonics has a range of 0 to 4.83 (i.e., Kindergarten through high fourth grade). Phonemic Awareness scores are based on percent correct out of nine questions (Lets Go Learn, Inc. , 2009a). Any given student could have between 0 and 11 DORA scores between the academic years of 2006/2007 to 2009/2010. The DORA subtests in the current data set were administered in the autumn, winter, or spring of each academic year. The exact schedule of DORA administration by cohort is outlined below (see Table 8). Although at least three data points are needed to model growth in HLM, the entire original district sample DORA scores will be described, including those students lacking the minimum number of data points needed, for comparison purposes.
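To make the grade-level equivalency metric concrete, the small sketch below encodes the low/mid/high-within-grade convention described above and clips scores to each subtest's documented maximum. The helper function and its names are hypothetical and are not part of DORA or the LGL reporting system.

```python
# Within-grade positions used by the grade-level equivalency metric
# (Kindergarten corresponds to grade 0).
WITHIN_GRADE = {"low": 0.17, "mid": 0.50, "high": 0.83}

# Documented maximum score for each subtest reported in this study.
SUBTEST_MAX = {
    "high_frequency_words": 3.83,   # Kindergarten through high third grade
    "word_recognition": 12.83,      # Kindergarten through high twelfth grade
    "oral_vocabulary": 12.83,
    "spelling": 12.83,
    "reading_comprehension": 12.83,
    "phonics": 4.83,                # Kindergarten through high fourth grade
}

def grade_equivalent(grade: int, level: str, subtest: str) -> float:
    """Hypothetical helper: encode a grade plus a low/mid/high position as a
    DORA-style grade-level equivalency score, capped at the subtest maximum."""
    return min(grade + WITHIN_GRADE[level], SUBTEST_MAX[subtest])

# Example: a high-second-grade Word Recognition score is reported as 2.83.
print(grade_equivalent(2, "high", "word_recognition"))
```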


Table 8
DORA Administration Schedule for the 2006/2007 through 2009/2010 Academic Years by Cohort

            Au 06  Wi 07  Sp 07  Au 07  Wi 08  Sp 08  Au 08  Wi 09  Sp 09  Au 09  Wi 10
Cohort 1    5th    5th    5th    6th    6th    6th    7th    7th    7th    8th    8th
Cohort 2    6th    6th    6th    7th    7th    7th    8th    8th    8th    9th    9th
Cohort 3    7th    7th    7th    8th    8th    8th    9th    9th    9th    10th   10th
Cohort 4    8th    8th    8th    9th    9th    9th    10th   10th   10th   11th   11th

As done previously in describing CSAP growth, all scores will be summarized together, including a table of descriptives with corresponding figures depicting growth for each of the DORA subtests. In the full district original sample (i.e., four cohorts from third to eleventh grade), there are DORA scores from the academic years of 2006/2007 to 2009/2010. As shown in the table and figures below (see Table 9 and Figures 16 through 22), the total samples DORA scores follow a positive linear trend for four of the seven subtests Word Recognition, Oral Vocabulary, Spelling, and Reading Comprehension. High-Frequency Words, Phonics, and Phonemic Awareness trends were not as apparent, 116

with many peaks and dramatic drops between measured time points. As will be discussed in later sections, the three subtests not demonstrating growth are not included in further analyses due to their obvious ceiling and floor effects. Overall, the total original sample (i.e., all cohorts combined) DORA scores for the four main subtests of interest showed a relatively consistent positive linear trend appropriate for the proposed analysis for the current research question. The highest average DORA subtest score was for Word Recognition (N = 230) in the spring of 2008 (M = 12.27, SD = 1.39). This is not surprising, as Word Recognition is one of four subtests that have a range of 0 to 12.83 (i.e., Kindergarten through high twelfth grade); however, it is unclear why the highest score for this subtest occurred in the middle of the measured time points in the dataset, and not at the end (i.e., the final measured time point for the oldest cohort). The lowest average DORA subtest scores were for Phonemic Awareness across all time points. Again, this is expected, as previous subtests are used to gauge whether a student will be administered the Phonemic Awareness subtest. Ideally, as students progress in their reading ability, this subtest will not be administered as frequently, with students having zeros as their score if they do not need to take this particular test. It is difficult, however, to disentangle which students received zeros because they did not need to take the subtest and which received zeros because their true score was zero. Based on personal communication with experienced DORA users in the district, it would be highly unlikely for a student who was administered this subtest to receive a true score of 0. Overall, the low average for this subtest can be considered an indication of better

reading ability, and also a noticeable floor effect, which prohibits the use of this subtest in further analyses (among other concerns mentioned above) as will be discussed in future sections.


Table 9
Descriptive Statistics for the Original District Sample DORA Subtest Scores for the Highland School District for the 2006/2007 to 2009/2010 Academic Years

DORA Subtest Score (M (SD))

Testing Date (Time Code)       High-Frequency   Word           Phonics      Phonemic     Oral          Spelling      Reading
                               Words            Recognition                 Awareness    Vocabulary                  Comprehension
Autumn 2006 (0) (N = 225)      3.73 (.36)       10.27 (2.65)   4.65 (.42)   .01 (.10)    6.24 (1.86)   4.00 (2.10)   6.30 (3.55)
Winter 2007 (1) (N = 236)      3.80 (.13)       10.62 (2.26)   4.67 (.49)   .01 (.09)    6.55 (1.90)   4.36 (2.24)   7.62 (3.13)
Spring 2007 (1.25) (N = 228)   3.80 (.14)       10.71 (2.26)   4.70 (.38)   .01 (.11)    6.83 (2.04)   4.74 (2.37)   8.13 (3.08)
Autumn 2007 (1.5) (N = 245)    3.77 (.32)       11.35 (2.54)   4.63 (.62)   .03 (.14)    7.22 (2.36)   4.88 (2.48)   8.46 (3.31)
Winter 2008 (1.75) (N = 174)   3.81 (.09)       11.99 (1.86)   4.77 (.13)   .00 (.00)    7.58 (2.28)   5.03 (2.32)   9.21 (3.04)
Spring 2008 (2) (N = 230)      3.82 (.06)       12.27 (1.39)   4.78 (.13)   .00 (.00)    8.09 (2.37)   5.82 (2.55)   9.69 (2.88)
Autumn 2008 (2.25) (N = 245)   3.77 (.30)       11.93 (2.13)   4.73 (.48)   .01 (.09)    8.36 (2.30)   5.91 (2.76)   9.53 (3.08)
Winter 2009 (2.5) (N = 119)    3.75 (.36)       11.94 (2.07)   4.70 (.50)   .01 (.08)    7.94 (2.40)   5.13 (2.58)   9.01 (3.08)
Spring 2009 (3) (N = 247)      3.78 (.28)       12.04 (2.16)   4.69 (.61)   .02 (.11)    8.71 (2.54)   6.20 (2.82)   9.72 (3.37)
Autumn 2009 (3.25) (N = 227)   3.80 (.19)       12.24 (1.61)   4.75 (.37)   .01 (.08)    9.02 (2.41)   6.41 (2.79)   10.16 (2.97)
Winter 2010 (3.5) (N = 64)     3.77 (.30)       11.90 (2.49)   4.64 (.73)   .02 (.10)    8.09 (2.64)   6.19 (2.91)   9.39 (3.51)

Figure 16. Plot of Time (X-axis) and the means at each test administration for the HighFrequency Words DORA subtest (Y-axis) from the full original district sample. Time on the X-axis is represented by the time code used in the multilevel growth model. See Table 9 for the date of the test administration.


Figure 17. Plot of Time (X-axis) and the means at each test administration for the Word Recognition DORA subtest (Y-axis) from the full original district sample. Time on the X-axis is represented by the time code used in the multilevel growth model. See Table 9 for the date of the test administration.


Figure 18. Plot of Time (X-axis) and the means at each test administration for the Phonics DORA subtest (Y-axis) from the full original district sample. Time on the Xaxis is represented by the time code used in the multilevel growth model. See Table 9 for the date of the test administration.


Figure 19. Plot of Time (X-axis) and the means at each test administration for the Phonemic Awareness DORA subtest (Y-axis) from the full original district sample. Time on the X-axis is represented by the time code used in the multilevel growth model. See Table 9 for the date of the test administration.


Figure 20. Plot of Time (X-axis) and the means at each test administration for the Oral Vocabulary DORA subtest (Y-axis) from the full original district sample. Time on the X-axis is represented by the time code used in the multilevel growth model. See Table 9 for the date of the test administration.


Figure 21. Plot of Time (X-axis) and the means at each test administration for the Spelling DORA subtest (Y-axis) from the full original district sample. Time on the Xaxis is represented by the time code used in the multilevel growth model. See Table 9 for the date of the test administration.


Figure 22. Plot of Time (X-axis) and the means at each test administration for the Reading Comprehension DORA subtest (Y-axis) from the full original district sample. Time on the X-axis is represented by the time code used in the multilevel growth model. See Table 9 for the date of the test administration.

Cases Selected for Potential Removal Cases selected for potential removal for various reasons will be discussed in the following paragraphs. Demographic information will be outlined for each group selected for removal, and comparisons with the original district sample will follow. After all problematic cases have been removed from the sample, demographic comparisons will again be made between the original district sample and the final analysis sample, including comparisons between CSAP and DORA scores. Low Frequency CSAP or DORA Scores Although the HLM software screens for missing data, and eliminates cases that do not have enough data to model, the data were examined by the researcher to remove cases 127

without at least three time points of CSAP and DORA scores. A total of 55 cases were found to be without at least three time points of data collection. Fourteen cases were without at least three DORA time points, and 14 cases had both not enough CSAP and DORA time points. The 14 cases that had both were accounted for by those without enough CSAP scores for a total of 55 cases that will be eliminated due to insufficient data to run a growth model in HLM. With a total of 298 cases in the original district sample, the eliminated cases with insufficient data approximated 18.5% of the sample. Demographic information for the 55 removed cases is summarized in Table 10 below. Due to the low frequency of cases, the demographic information will not be separated by cohort.
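The screening rule just described can be expressed compactly; the sketch below is a minimal pandas version, assuming hypothetical long-format score files with one row per student per administration and illustrative column names.

```python
import pandas as pd

# Hypothetical long-format score files: one row per student per administration.
csap = pd.read_csv("csap_long.csv")   # assumed: student_id, test_date, csap_total_ss
dora = pd.read_csv("dora_long.csv")   # assumed: student_id, test_date, dora_score

# Count non-missing administrations per student for each measure.
csap_n = csap.dropna(subset=["csap_total_ss"]).groupby("student_id").size()
dora_n = dora.dropna(subset=["dora_score"]).groupby("student_id").size()

# Retain only students with at least three time points on both measures.
keep = set(csap_n[csap_n >= 3].index) & set(dora_n[dora_n >= 3].index)
csap_kept = csap[csap["student_id"].isin(keep)]
dora_kept = dora[dora["student_id"].isin(keep)]

print(f"Students retained for the growth model: {len(keep)}")
```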


Table 10 Demographic Information of the Cases Removed Due to Missing CSAP or DORA Scores from the Highland School District for Grades 3 through 11 across the 2004/2005 to 2009/2010 Academic Years (N = 55) Demographic Information (n (%)) Cohort 1 2 3 4 Gender Male Female Ethnicity White (Non-Hispanic) Hispanic Black (Non-Hispanic) Asian/Pacific Islander American Indian/Alaskan Native Free/Reduced Lunch Eligible Not Eligible English Language Learner (ESL/ELL) Yes No Individualized Education Program (IEP) Yes No Total

13 (23.6) 14 (25.5) 10 (18.2) 18 (32.7) 31 (56.4) 24 (43.6) 39 (70.9) 15 (27.3) 1 (1.8) 31 (56.4) 24 (43.6) 10 (18.2) 45 (81.8) 8 (14.5) 47 (85.5)

Low Frequency CSAP or DORA Scores and the Original District Sample. The cases removed above due to insufficient CSAP and DORA scores consisted of 24 females (43.6%) and 31 males (56.4%), which is comparable to the original district student sample containing 135 females (45.3%) and 163 males (54.7%). Both the original sample and the removed cases had approximately the same percentages of White (Non-Hispanic; approximately 70%) and Hispanic (approximately 30%) students. As this study's measure of SES, nearly equal numbers of students were eligible for free/reduced lunch as were not eligible in both the original sample and the removed cases. ESL/ELL students comprised 18.2% of the cases removed, and eight students were categorized as having an IEP (14.5%), percentages that again are approximately equal to those of the original district sample. ESL/ELL Students Students involved in the district's ESL/ELL programs were examined for potential removal from the original district sample. There are a few issues to consider before retaining or removing these individuals. First, the measures of interest in this research question are reading tests, which require the use of standard English. ESL/ELL status could therefore be considered problematic in successfully completing the CSAP or DORA subtests, as they are reading exams. Researchers have noted that the results of analyses involving the impact of students' language background on the outcome of various achievement tests might be confounded by that background, specifically ELL status (Abedi, 2002). Conversely, the target population is considered to be the district and all other similar districts that use DORA and have a larger Hispanic population (approximately 30%; i.e., predominantly categorized as ESL/ELL). Therefore, in this first step in the research process, it is acceptable to first investigate trends with these individuals included in the analysis sample. Examining specific populations such as non-English speaking students, ESL/ELL students, and students with disabilities will be considered in future

research. Thus, due to the exploratory nature of this study, and for generalizability purposes to the current district and similar districts, these students were included in the analysis sample. Finally, it is important when conducting HLM to have a larger sample size, and retaining these individuals in the analysis sample may bolster the power to detect significant effects. Thus, ESL/ELL status will be included in the HLM analyses for this first research (i.e., and Research Question 3) as another predictor in the model. A total of 49 cases out of the original 298 were categorized as ESL/ELL (i.e., if a student was categorized as ESL, they were also in the ELL group). Ten cases that were categorized as ESL/ELL were also in the same category of students who were removed for having insufficient CSAP or DORA data. Thus, the 49 ESL/ELL students comprised 16.4% of the original district sample. Demographic information for the 49 ESL/ELL cases is summarized in Table 11 below. Due to the low frequency of cases, the demographic information will not be separated by cohort.


Table 11
Demographic Information of the ESL/ELL Students from the Highland School District for Grades 3 through 11 across the 2004/2005 to 2009/2010 Academic Years (N = 49)

Demographic Information                      n (%)
Cohort
  1                                          11 (22.4)
  2                                          12 (24.5)
  3                                          12 (24.5)
  4                                          14 (28.6)
Gender
  Male                                       23 (46.9)
  Female                                     26 (53.1)
Ethnicity
  Hispanic                                   46 (93.9)
  Asian/Pacific Islander                      1 (2.0)
  American Indian/Alaskan Native              2 (4.1)
Free/Reduced Lunch
  Eligible                                   43 (87.8)
  Not Eligible                                6 (12.2)
Individualized Education Program (IEP)
  Yes                                        14 (28.6)
  No                                         35 (71.4)
Total                                        49 (100.0)

ESL/ELL Students and the Original District Sample. The ESL/ELL students in the above sample consisted of 26 females (53.1%) and 23 males (46.9%), which is comparable to the original district student sample containing 135 females (45.3%) and 163 males (54.7%). The majority of cases in the ESL/ELL sample were Hispanic (93.9%) with the remaining cases of American Indian/Alaskan Native or Asian/Pacific Islander 132

ethnic background. Compared to the original district sample where nearly equal amounts of students were eligible for free/reduced lunch and not eligible, the overwhelming majority of ESL/ELL students (n = 43) were categorized as eligible (87.8%). Finally, a slightly higher percentage of students in the ESL/ELL sample were categorized as in an IEP (28.6%), which was higher than the original district sample. ESL/ELL Students and the Original District Sample CSAP Scores. To further investigate ESL/ELL status, independent samples t tests were conducted between the ESL/ELL (N = 49) and non-ESL/ELL (N = 249) samples to examine performance on the CSAP across the five data time points. The normality assumption was not violated for either group on the variable of interest (i.e., total scaled reading CSAP test score) with the distributions approximating a normal curve. The non-ESL/ELL group displayed more of a negative skew with many students obtaining higher CSAP scores in the final two administrations (i.e., the two most recent academic years). Independence was not violated. Additionally, the homogeneity of variance assumption was not violated, with the CSAP tests at all time points having equal population variances (p > .05). The alpha level used to examine Levenes test (and all other assumptions) can be assessed at the nominal alpha level (i.e., .05) or the adjusted alpha level (i.e., the Bonferroni correction). No consensus exists as to which alpha level is appropriate (i.e., the nominal alpha level or the adjusted alpha level). A liberal test, which is more powerful, is one that is more likely to find statistical significance (i.e., even where it does not truly exist), and consequently more likely to make a Type I Error and less prone to Type II Errors. A conservative test, which has less power, is less likely to find statistical 133

significance (i.e., even where it does truly exist), and consequently less likely to make Type I Errors and more prone to Type II Errors (Lomax, 2007). For assumption tests such as Levene's test, rejecting the null hypothesis signals a violation, so the choice of alpha level determines how readily violations are flagged. A more liberal (i.e., larger, nominal) alpha level in checking assumptions will more often flag the assumptions as not met, whereas a more conservative (i.e., smaller, adjusted) alpha level will more often treat the assumptions as upheld. The more cautious investigation of the assumptions is therefore obtained by using the nominal alpha level. For the current study, because the adjusted alpha level used to judge the statistical significance of the t tests is by nature more conservative, the decision was made to be similarly cautious in checking assumptions. Thus, the nominal alpha level, not the adjusted alpha level, will be used to check assumptions. The results from the t tests are summarized below in Table 12.
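As an illustration of this checking procedure, the sketch below applies Levene's test at the nominal alpha level and then an independent samples t test evaluated against the Bonferroni-adjusted alpha. It is a minimal Python/SciPy sketch with small illustrative score arrays and hypothetical variable names; it is not the software or data used for the dissertation analyses.

    import numpy as np
    from scipy import stats

    # Illustrative CSAP total scaled reading scores at a single time point
    # (hypothetical values, not the district data)
    esl_scores = np.array([531.0, 498.0, 560.0, 612.0, 545.0, 577.0, 520.0])
    non_esl_scores = np.array([598.0, 640.0, 615.0, 587.0, 655.0, 630.0, 602.0])

    # Bonferroni-adjusted alpha for the five CSAP time points: .05 / 5 = .01
    n_tests = 5
    alpha_adjusted = .05 / n_tests

    # Levene's test for homogeneity of variance, judged at the nominal alpha (.05)
    levene_stat, levene_p = stats.levene(esl_scores, non_esl_scores)
    equal_variances = levene_p > .05

    # Independent samples t test; equal_var=False requests Welch's t test
    # when the homogeneity assumption is violated
    t_stat, p_value = stats.ttest_ind(esl_scores, non_esl_scores, equal_var=equal_variances)

    print(f"t = {t_stat:.2f}, p = {p_value:.3f}, "
          f"significant at the adjusted alpha: {p_value < alpha_adjusted}")

With five time points, .05/5 = .01, which matches the adjusted alpha reported in the table notes below.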


Table 12
Independent Samples t Tests Comparing ESL/ELL (n = 49) and Non-ESL/ELL (n = 249) Students from the Original District Sample on the CSAP Reading State Test from the Highland School District for the 2004/2005 to 2008/2009 Academic Years

CSAP Test Date     ESL/ELL               Non-ESL/ELL              t       df    p
(Time Code)        n    M (SD)           n     M (SD)
03/14/05 (-1)      34   531.56 (68.29)   162   597.86 (70.03)     5.04*   194   .000
03/13/06 (0)       38   554.26 (70.36)   184   619.47 (60.58)     5.87*   220   .000
03/12/07 (1.25)    40   567.35 (67.80)   208   635.61 (54.80)     6.93*   246   .000
03/10/08 (2)       37   599.41 (49.87)   207   649.19 (52.15)     5.38*   242   .000
03/10/09 (3)       44   597.36 (63.16)   215   656.37 (60.42)     5.86*   257   .000

Note. The homogeneity assumption was not violated for all five data collection time points (p > .05).
*p < .001 (α = .01; .05/5 = .01 for the Bonferroni correction).

For all five CSAP data collection time points, the ESL/ELL students performed significantly lower on average than the non-ESL/ELL students (p < .001 for all). This is also demonstrated in the line graph below (see Figure 23) where the ESL/ELL students are shown to perform lower on average, and inconsistently, compared to the nonESL/ELL students in the original district sample. The growth curve for the non-ESL/ELL students was more linear compared to the ESL/ELL students.


Figure 23. Plot of Time (X-axis) and the mean total scaled score (i.e., Mean TotSS) for the CSAP reading state test (Y-axis) at each test administration showing that the ESL/ELL students (i.e., coded 1) performed lower on average compared to the non-ESL/ELL students (i.e., coded 0). There are five CSAP scores from the academic years of 2004/2005 to 2008/2009. Time on the X-axis is represented by the time code used in the multilevel growth model. See Table 12 for the date of the test administration.

ESL/ELL Students and the Original District Sample DORA Scores. It is also necessary to investigate ESL/ELL status on DORA scores as well. Independent samples t tests were again conducted between the ESL/ELL (n = 49) and non-ESL/ELL (n = 249) samples to examine performance on all DORA subtests. In order to control the familywise error rate (i.e., reduce the amount of t-tests that would be conducted), a composite of the DORA subtests was created. Z scores were produced for each of the seven subtests, and then these z scores were averaged. Finally, the ESL/ELL and non-ESL/ELL groups


were compared on this composite DORA score at each of the time points across the academic years in question. The normality assumption was violated, with most distributions of the dependent variable (i.e., the DORA composite) displaying a negative skew (i.e., many students obtained the highest possible score on a given subtest). Independence was not violated. Additionally, the homogeneity of variances assumption was violated at 6 of the 11 time points (i.e., time points 1.5 and 2.25 through 3.5; p < .05). These departures from homogeneity, together with the unbalanced group sizes, warranted the use of Welch's t test at those time points. The results from the t tests are summarized below in Table 13.
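The composite construction and the group comparison can be sketched as follows. This is a minimal, hypothetical Python example using pandas and SciPy: the DataFrame, its column names, and the scores are illustrative stand-ins (only four of the seven subtests are shown), not the actual DORA data or the software used in the dissertation.

    import pandas as pd
    from scipy import stats

    # Hypothetical data: one row per student at a single time point, one column
    # per DORA subtest, plus a 0/1 ESL/ELL indicator
    df = pd.DataFrame({
        "word_recognition":      [10.5, 11.2, 9.8, 12.1, 8.9, 11.8],
        "oral_vocabulary":       [6.1, 7.4, 5.9, 8.0, 5.2, 7.9],
        "spelling":              [4.2, 5.0, 3.8, 6.1, 3.5, 5.8],
        "reading_comprehension": [7.3, 8.8, 6.5, 9.9, 6.0, 9.4],
        "esl_ell":               [1, 0, 1, 0, 1, 0],
    })
    subtests = ["word_recognition", "oral_vocabulary", "spelling", "reading_comprehension"]

    # Standardize each subtest and average the z scores into one composite,
    # so a single test per time point controls the family-wise error rate
    z_scores = (df[subtests] - df[subtests].mean()) / df[subtests].std()
    df["dora_composite"] = z_scores.mean(axis=1)

    # Welch's t test (equal_var=False) for unequal variances and unbalanced groups
    esl = df.loc[df["esl_ell"] == 1, "dora_composite"]
    non_esl = df.loc[df["esl_ell"] == 0, "dora_composite"]
    t_stat, p_value = stats.ttest_ind(esl, non_esl, equal_var=False)

Averaging the standardized subtests gives each subtest equal weight in the composite.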


Table 13
Independent Samples t Tests Comparing ESL/ELL (n = 49) and Non-ESL/ELL (n = 249) Students from the Original District Sample on DORA Scores from the Highland School District for the 2006/2007 to 2009/2010 Academic Years

DORA Subtest Date    ESL/ELL             Non-ESL/ELL           t       df       p
(Time Code)          n    M (SD)         n     M (SD)
Autumn 2006 (0)      33   -.64 (.67)     192   -.34 (.49)       2.98*   223.00   .003
Winter 2007 (1)      34   -.45 (.52)     202   -.19 (.41)       3.28*   234.00   .001
Spring 2007 (1.25)   32   -.47 (.57)     196   -.09 (.39)       4.70*   226.00   .000
Autumn 2007 (1.5)    41   -.46 (.83)     204   -.00 (.47)       3.37*    45.31   .002
Winter 2008 (1.75)   23   -.19 (.46)     151    .08 (.40)       2.98*   172.00   .003
Spring 2008 (2)      30   -.10 (.40)     200    .20 (.37)       4.04*   228.00   .000
Autumn 2008 (2.25)   37   -.40 (.95)     208    .23 (.40)       3.92*    38.25   .000
Winter 2009 (2.5)    18   -.51 (1.03)    101    .12 (.43)       2.54     18.08   .020
Spring 2009 (3)      43   -.32 (.98)     205    .29 (.46)       3.99*    46.01   .000
Autumn 2009 (3.25)   37   -.07 (.70)     189    .33 (.42)       3.39*    41.32   .002
Winter 2010 (3.5)    11   -.60 (.90)      53    .25 (.51)       3.04     11.35   .011

Note. There are eight DORA subtests. Fluency scores have been omitted from analyses and reporting due to this test being administered infrequently. Means and standard deviations are based on a z score composite of the DORA subtests. Violations of the homogeneity assumption were noted for time points 1.5 and 2.25 through 3.5 (p < .05).
*p < .005 (α = .005; .05/11 = .005 for the Bonferroni correction).

For nine of the 11 data collection time points, the ESL/ELL students performed significantly lower on average than the non-ESL/ELL students (p < .005). For time points 2.5 (i.e., winter 2009) and 3.5 (i.e., winter 2010), there were no significant differences between the groups. These time points also happen to be the smallest group sizes out of all the time points analyzed. This is also demonstrated in the line graph below where the ESL/ELL students are shown to perform lower on average, and inconsistently, compared to the non-ESL/ELL students in the original district sample. The growth curve for the non-ESL/ELL students was more linear compared to the ESL/ELL students. The problematic time points at 2.5 and 3.5 are shown graphically below, and these time points display sharp drops in the growth trajectory.


Figure 24. Plot of Time (X-axis) and the mean DORA composite score (i.e., Mean DORA_Comp; Y-axis) at each test administration showing that the ESL/ELL students (i.e., coded 1) performed lower on average compared to the non-ESL/ELL students (i.e., coded 0). There are 11 DORA scores from the academic years of 2006/2007 to 2009/2010. Time on the X-axis is represented by the time code used in the multilevel growth model. See Table 13 for the date of the test administration.

IEP Students IEP status was examined for removal from further analysis for Research Question 1 (i.e., and Research Question 3). With the most recent reauthorization of the Individuals with Disabilities Education Act (IDEA), students with disabilities must be included in all large-scale, state-wide testing programs as equally as possible. When the test administration is standardized, student scores are assumed to be comparable, and the inferences made from student performance are assumed to be more equitable (Bond, Braskamp, & Roeber, 1996). Although the use of standard administration conditions 140

allows comparability across students, the validity of the inferences made on the basis of the outcomes may be suspect if unrelated access skills needed to take the test actually impede performance (Messick, 1989). And when high-stakes decisions are made, such potentially invalid inferences cannot be tolerated. Thus, when examining general student performance, and in order to make valid inferences to the general student population, the literature more often than not recommends removing these students from the analysis sample (i.e., unless comparing IEP students to other students is the topic of the investigation; Phillips, 1994; Tindal, Heath, Hollenbeck, Almond, & Harniss, 1998). Aside from the literature base, according to the CDE, most IEP students in the state of Colorado take an alternative or modified state test based on the instructions or accommodations listed in their IEP. For example, most IEP students take the Colorado Student Assessment Program Alternate (CSAPA). The CSAPA is a standards-based assessment designed specifically for students with a number of significant cognitive disabilities, and is meant to provide an idea of student performance relative to the Expanded Benchmarks. The Expanded Benchmarks are an interpretation of The Colorado Model Content Standards at the most foundational level, which provide a framework for students with significant disabilities to access the general curriculum. Students are assessed in reading, writing, and math in grades 3 through 10 and science in grades 5, 8 and 10 (i.e., the same schedule and content areas as the CSAP). The primary purpose of the CSAPA assessment is to determine the level at which Colorado students are meeting the Expanded Benchmarks (CDE, 2009f).


The CSAPA is administered to students individually by the teacher who knows the student best in order to ensure that the student performs optimally, and to ensure that the appropriate expanded accommodations are in place. For the CSAPA, the teacher rates each students response on two data points. The first data point collected is how the student responded to an item (i.e., correct or incorrect, other, or no response). The second data point gathered is the students level of independence (i.e., independent, partial independence, limited independence, and no response). The levels of independence are explained on the student report sent to parents. Both data points provide information on a students performance that can help guide instruction and future IEP decisions (CDE, 2009f). Thus, the majority, if not all students with an IEP prescribed by the district due to some cognitive or physical disability are subject to the Expanded Accommodations for state-mandated testing such as the CSAP and CSAPA. These Expanded Accommodations are what the student and/or teacher use to provide greater access to assessment items and instruction to facilitate student responses. For the CSAPA, many accommodations are built into the assessment. For example, the assessment is administered individually and can be done over several days. However, in order for some students to access the assessment teachers may need to make changes to the student materials. For example, a teacher may need to provide a student with real objects or enlarge the picture symbols (CDE, 2009f). Thus, IEP student state test scores may be based on a modified test, and these scores are commonly known to not be comparable to other students not requiring accommodations due to the vast differences in test administration and adjustments made 142

in scoring. Based on the above literature and reasoning, these students will be removed from the final analysis sample. The demographic information for the IEP students in the original district sample is presented below (see Table 14). Forty-three students (14.4% of the 298 cases in the original district sample) across grades 3 through 11 were categorized as having an IEP. It can be assumed that if the IEP students needed special accommodations to take the state reading test, their scores on the DORA subtests may suffer from the same problems. It was also determined that keeping the IEP students in the sample would be far more detrimental to the validity of the study than retaining the ESL/ELL students. Students with IEP status encompass a range of individuals who may need special accommodations to take tests, whereas this is not as consistent among ESL/ELL students. There is also a smaller number of IEP status individuals, and they appear to be struggling across the board compared to the larger sample. Only 14 students who were categorized as ESL/ELL were also noted to have an IEP. Thus, IEP students will be removed from the final analysis sample.


Table 14
Demographic Information of the IEP Students from the Highland School District for Grades 3 through 11 across the 2004/2005 to 2009/2010 Academic Years (N = 43)

Demographic Information                      n (%)
Cohort
  1                                          12 (27.9)
  2                                          13 (30.2)
  3                                          11 (25.6)
  4                                           7 (16.3)
Gender
  Male                                       33 (76.7)
  Female                                     10 (23.3)
Ethnicity
  White (Non-Hispanic)                       28 (65.1)
  Hispanic                                   13 (30.2)
  American Indian/Alaskan Native              2 (4.7)
Free/Reduced Lunch
  Eligible                                   29 (67.4)
  Not Eligible                               14 (32.6)
English Language Learner (ESL/ELL)
  Yes                                        14 (32.6)
  No                                         29 (67.4)
Total                                        43 (100.0)

IEP Students and the Original District Sample. The IEP students in the above sample consisted of 10 females (23.3%) and 33 males (76.7%), which is not comparable to the original district student sample containing 135 females (45.3%) and 163 males (54.7%). There appears to be a higher percentage of males in the IEP sample compared to the original district sample. With regards to ethnicity, the majority of cases in the IEP 144

sample were White (Non-Hispanic; 65.1%) with the remaining cases of Hispanic (30.2%) or American Indian/Alaskan Native (4.7%) background. This is comparable to the original district sample where Whites (Non-Hispanics) comprised approximately 70% of the sample, and Hispanics totaled near 30%. Compared to the original district sample where nearly equal amounts of students were eligible for free/reduced lunch and not eligible, the overwhelming majority of IEP students (n = 29) were categorized as eligible (67.4%). Finally, a slightly higher percentage of students in the IEP sample were categorized as ESL/ELL (32.6%), which was higher than the original district sample (16.4%). IEP Students and the Original District Sample CSAP Scores. To further investigate IEP status, independent samples t tests were conducted between the IEP (n = 43) and non-IEP (n = 255) samples to examine performance on the CSAP across the five data time points. The normality assumption was not violated for either group (i.e., IEP versus non-IEP) on the variable of interest (i.e., total scaled reading CSAP test score) with the distributions approximating a normal curve. The non-IEP group displayed more of a negative skew with many students obtaining higher CSAP scores in the final administration (i.e., the most recent academic year). Independence was not violated. Additionally, the homogeneity of variances assumption was violated in all cases, with the CSAP tests at all time points having unequal population variances (p < .05). The results from the t tests are summarized below in Table 15.


Table 15
Independent Samples t Tests Comparing IEP (n = 43) and Non-IEP (n = 255) Students from the Original District Sample on the CSAP Reading State Test from the Highland School District for the 2004/2005 to 2008/2009 Academic Years

CSAP Test Date     IEP                   Non-IEP                  t       df      p
(Time Code)        n    M (SD)           n     M (SD)
03/14/05 (-1)      25   479.84 (95.39)   171   601.93 (55.50)     6.25*   26.43   .000
03/13/06 (0)       33   520.15 (95.90)   189   623.70 (45.73)     6.08*   34.58   .000
03/12/07 (1.25)    38   544.92 (76.81)   210   639.01 (46.60)     7.31*   42.06   .000
03/10/08 (2)       30   574.70 (71.85)   214   651.02 (44.64)     5.67*   32.21   .000
03/09/09 (3)       37   560.35 (90.55)   223   660.65 (45.79)     6.60*   39.11   .000

Note. The homogeneity assumption was violated for all five data collection time points (p < .05).
*p < .001 (α = .01; .05/5 = .01 for the Bonferroni correction).

For all five CSAP data collection time points, the IEP students performed significantly lower on average than the non-IEP students (p < .001 for all). This is also demonstrated in the line graph below (see Figure 25), where the IEP students are shown to perform lower on average, and inconsistently from time point two to three, compared to the non-IEP students in the original district sample. Growth across time was demonstrated for the non-IEP students, whereas the IEP students displayed a slight drop in mean CSAP reading state test scores from time point two to three.


Figure 25. Plot of Time (X-axis) and the mean total scaled score (i.e., Mean TotSS) for the CSAP reading state test (Y-axis) at each test administration showing that the IEP students (i.e., coded 1) performed lower on average compared to the non-IEP students (i.e., coded 0). There are five CSAP scores from the academic years of 2004/2005 to 2008/2009. Time on the X-axis is represented by the time code used in the multilevel growth model. See Table 15 for the date of the test administration.

IEP Students and the Original District Sample DORA Scores. It is also necessary to investigate IEP status on DORA scores across the academic years in question in the original district sample. Independent samples t tests were again conducted between the IEP (n = 43) and non-IEP (n = 255) samples to examine performance on all DORA subtests. The same z score composite was used in order to control the family-wise error rate (i.e., reduce the amount of t tests conducted). The IEP and non-IEP groups were compared on this composite DORA score for each of the time points across the academic years from 2006/2007 to 2009/2010. 147

The normality assumption was violated with most distributions on the dependent variable (i.e., DORA) displaying a negative skew (i.e., many students obtained the highest possible score on a given subtest). Independence was not violated. Additionally, the homogeneity of variances assumption was also violated for all time points (i.e., unequal population variances; p < .05). Departures from homogeneity (and the unbalanced groups) warranted the use of Welchs t test. The results from the t tests are summarized below in Table 16.


Table 16
Independent Samples t Tests Comparing IEP (n = 43) and Non-IEP (n = 255) Students from the Original District Sample on DORA Scores from the Highland School District for the 2006/2007 to 2009/2010 Academic Years

DORA Subtest Date    IEP                  Non-IEP               t       df      p
(Time Code)          n    M (SD)          n     M (SD)
Autumn 2006 (0)      26   -1.00 (.80)     199   -.31 (.43)       4.34*   26.95   .000
Winter 2007 (1)      27    -.75 (.62)     209   -.16 (.36)       4.84*   28.30   .000
Spring 2007 (1.25)   25    -.66 (.67)     203   -.08 (.35)       4.20*   25.62   .000
Autumn 2007 (1.5)    35    -.84 (.97)     210    .05 (.34)       5.39*   35.41   .000
Winter 2008 (1.75)   20    -.42 (.57)     154    .11 (.35)       4.04*   20.88   .001
Spring 2008 (2)      22    -.31 (.53)     208    .21 (.34)       4.53*   22.86   .000
Autumn 2008 (2.25)   31    -.66 (.97)     214    .25 (.35)       5.15*   31.15   .000
Winter 2009 (2.5)    19    -.70 (1.03)    100    .16 (.35)       3.55*   18.79   .002
Spring 2009 (3)      36    -.67 (1.09)    212    .32 (.35)       5.43*   36.25   .000
Autumn 2009 (3.25)   29    -.42 (.85)     197    .37 (.32)       4.95*   29.17   .000
Winter 2010 (3.5)    12    -.72 (1.09)     52    .30 (.31)       3.22    11.41   .008

Note. There are eight DORA subtests. Fluency scores have been omitted from analyses and reporting due to this test being administered infrequently. Means and standard deviations are based on a z score composite of the DORA subtests. Violations of the homogeneity assumption were noted for all time points (p < .05).
*p < .005 (α = .005; .05/11 = .005 for the Bonferroni correction).


For ten of the 11 data collection time points, the IEP students performed significantly lower on average than the non-IEP students (p < .005). For the final time point (winter 2010), there were no significant differences between the groups. This time point also contains the smallest group sizes out of all the time points analyzed. These findings are demonstrated in the line graph below (see Figure 26), where the IEP students are shown to perform lower on average, and inconsistently, compared to the non-IEP students in the original district sample. The growth curve for the non-IEP students was more linear compared to the IEP students. Time points 2.25, 2.5, 3.25, and 3.5 display sharp drops in the growth trajectory. Additionally, compared to the ESL/ELL students, the IEP students performed lower, and more inconsistently, across all time points when evaluating the means.


Figure 26. Plot of Time (X-axis) and the mean DORA composite score (i.e., Mean DORA_Comp; Y-axis) at each test administration showing that the IEP students (i.e., coded 1) performed lower on average compared to the non-IEP students (i.e., coded 0). There are 11 DORA scores from the academic years of 2006/2007 to 2009/2010. Time on the X-axis is represented by the time code used in the multilevel growth model. See Table 16 for the date of the test administration.

Total Cases Removed Demographic information is presented below outlining the total cases removed and the final analysis sample. Again, information is not presented longitudinally due to the fact that individuals across the years of interest in the current dataset for this first research question did not change demographic status. The total cases removed from the original district sample include 55 cases with insufficient CSAP or DORA time points to conduct the proposed HLM, and 43 cases categorized as IEP (N = 98). Crosstabulation of these two variables indicated that eight cases were shared by both variables. Thus, the 151

total number of cases that will be removed from the original district sample is 90. The total cases removed account for approximately 30% of the original district sample. This is above the commonly accepted cutoff of 20%; however, HLM will simply exclude the cases with insufficient data when examining growth. Future research will consider using regression analysis to approximate missing values to bolster the sample size at Level 1. The sample of total cases removed (N = 90) included 31 females (34.4%) and 59 males (65.6%). The cohorts included the following: (1) 25 students in Cohort 1 (27.8%), (2) 23 students in Cohort 2 (25.6%), (3) 19 students in Cohort 3 (21.1%), and (4) 23 students in Cohort 4 (25.6%). The ethnic composition of this sample included 60 students (66.7%) categorized as White (Non-Hispanic), with the remaining individuals classified as minority: 27 Hispanic (30%) and three American Indian/Alaskan Native (3.3%). Thirty-four students (37.8%) were eligible for free/reduced lunch, and 56 (62.2%) were not eligible. ESL/ELL status students comprised 25.6% of the sample (n = 23), and 43 students were categorized as needing an IEP (47.8%). This descriptive information can be viewed below in Table 17.
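The overlap between the two removal criteria can be tallied with a simple crosstabulation. The sketch below is a hypothetical pandas example (the flag names and values are illustrative placeholders, not the district file) showing how cases flagged by either criterion reduce to a set of unique removals once shared cases are counted only once.

    import pandas as pd

    # Hypothetical removal flags: one row per student, 1 = flagged for removal
    students = pd.DataFrame({
        "insufficient_scores": [1, 1, 0, 1, 0, 0, 1, 0],
        "iep":                 [0, 1, 1, 1, 0, 0, 0, 1],
    })

    # Crosstabulation of the two removal criteria; the (1, 1) cell is the overlap
    overlap_table = pd.crosstab(students["insufficient_scores"], students["iep"])
    print(overlap_table)

    # Unique cases to remove = flagged on either criterion
    n_removed = ((students["insufficient_scores"] == 1) | (students["iep"] == 1)).sum()
    n_overlap = ((students["insufficient_scores"] == 1) & (students["iep"] == 1)).sum()
    print(f"removed = {n_removed}, shared by both criteria = {n_overlap}")

With the district counts (55 + 43 - 8), the same logic yields the 90 unique cases reported above.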


Table 17
Demographic Information of the Total Cases Removed (N = 90) and the Final Analysis Sample (N = 208) from the Highland School District for Grades 3 through 11 across the 2004/2005 to 2009/2010 Academic Years

Demographic Information (n (%))              Cases Removed    Final Analysis Sample
Cohort
  1                                          25 (27.8)        47 (22.6)
  2                                          23 (25.6)        52 (25.0)
  3                                          19 (21.1)        48 (23.1)
  4                                          23 (25.6)        61 (29.3)
Gender
  Male                                       59 (65.6)        104 (50.0)
  Female                                     31 (34.4)        104 (50.0)
Ethnicity
  White (Non-Hispanic)                       60 (66.7)        142 (68.3)
  Hispanic                                   27 (30.0)        60 (28.8)
  Black (Non-Hispanic)                        0 (0.0)         3 (1.4)
  Asian/Pacific Islander                      0 (0.0)         2 (1.0)
  American Indian/Alaskan Native              3 (3.3)         1 (.5)
Free/Reduced Lunch
  Eligible                                   34 (37.8)        93 (44.7)
  Not Eligible                               56 (62.2)        115 (55.3)
English Language Learner (ESL/ELL)
  Yes                                        23 (25.6)        26 (12.5)
  No                                         67 (74.4)        182 (87.5)
Individualized Education Program (IEP)
  Yes                                        43 (47.8)        0 (0.0)
  No                                         47 (52.2)        208 (100.0)

Total Cases Removed and the District Student Population. Compared to the demographic information provided by the CDE for the district, there was a higher percentage of males in the total cases removed sample (65.6%). This is in comparison to the nearly equal gender distribution (i.e., around 50%) in the CDE profile of the district.

The ethnic composition of the total cases removed sample was consistent with demographic information reported by the CDE for the district with the majority being White (Non-Hispanic; i.e., around 65% across all years), and the next highest representation being Hispanic (i.e., around 30% across all years). Less group membership existed in the total cases removed sample for those who are eligible for free/reduced lunch (i.e., 37.8% compared to the CDE district profile citing approximately 43% to 48% across all years). Higher percentages were also noted in the total cases removed sample for ESL/ELL and IEP status (25.6% and 47.8%, respectively). This is higher than the CDE district profile, which has maintained a range of 11% to 15% of students in the ESL/ELL and IEP categories across previous academic years. Total Cases Removed and the Original District Sample. The sample of total cases removed had a higher proportion of males (65.6%) compared to the original district sample (N = 298), where the proportion of males and females was nearly equivalent. The ethnic compositions of the samples were similar with approximately 70% of students identifying as White (Non-Hispanic), and 30% of students identifying as Hispanic. In the original district sample, the percentage of students eligible and not eligible for free/reduced lunch was equal (50%), and in the total cases removed, 37.8% were eligible for free/reduced lunch and 62.2% were not eligible. ESL/ELL status students comprised 25.6% of the total cases removed sample compared to 16.4% in the original district sample. Finally, IEP status was not comparable across the two samples with 47.8% of students categorized as needing an IEP in the total cases removed sample, and only 14.4% of students categorized as such in the original district sample. 154

Total Cases Removed and the Final Analysis Sample. The proportion of males and females was equal in the final analysis sample (i.e., 50% for both males and females), but there was a higher percentage of males in the total cases removed sample (65.6%). The ethnic composition of the samples was similar, with 66.7% of the total cases removed and 68.3% of individuals in the final analysis sample identifying as White (NonHispanic). The next highest ethnic proportion was Hispanic in both samples (i.e., around 30%). Both samples had higher proportions of individuals not eligible for free/reduced lunch (with the higher percentage in the total cases removed sample), with 62.2% of the total cases removed being not eligible compared to 55.3% of the final analysis sample being not eligible. Finally, ESL/ELL status across the two samples was not equivalent, as 25.6% of the total cases removed sample was categorized as ESL/ELL and only 12.5% of the final analysis sample was in that group. IEP status cannot be compared as the final analysis sample does not contain any students requiring an IEP. Final Analysis Sample In the final analysis sample, there were 208 students in grades 3 through 11 across the academic years of 2004/2005 to present, which included equal numbers of males and females (n = 104; 50%). The cohorts included the following: (1) 47 students in Cohort 1 (22.6%), (2) 52 students in Cohort 2 (25.0%), (3) 48 students in Cohort 3 (23.1%), and (4) 61 students in Cohort 4 (29.3%). The ethnic composition of the final analysis sample included 142 (68.3%) students categorized as White (Non-Hispanic). The minority students were comprised of 60 Hispanic (28.8%), three Black (1.4%), two Asian/Pacific 155

Islander (1.0%), and one American Indian/Alaskan Native (.5%). Free/reduced lunch status contained 93 (44.7%) students, and 115 (55.3%) were not eligible. ESL/ELL status students comprised 12.5% of the sample (n = 26), and as expected due to the elimination of this group, no students were categorized as needing an IEP. These data can also be viewed above in Table 17. Final Analysis Sample and the District Student Population. The CDE and final analysis sample were more than comparable with both demographic profiles citing nearly equal or exactly equal (i.e., 50%) groups of males and females. The ethnic composition reported by the CDE and in the final analysis sample were congruent, with the majority being White (Non-Hispanic; i.e., around 65% to 70%), and the next highest representation being Hispanic (i.e., around 30%). Compared to the CDE district profile, the final analysis sample had nearly equivalent percentages of students eligible for free/reduced lunch (i.e., approximately 44% for both). Finally, ESL/ELL status in the final analysis sample was similar to the CDE profile with 12.5% in the final analysis sample and the CDE reporting a range of 11% to 15% across the previous academic years. There were no IEP students in the final analysis sample for comparison with the CDE district population profile. Final Analysis Sample and the Original District Sample. In the final analysis sample (N = 208), there were equal numbers of males and females (50%). This was virtually equivalent to the original district sample consisting of 54.7% males and 45.3% females. As with previous comparisons, the ethnic composition of the final analysis sample was strikingly similar to the original district sample with approximately 70% in 156

both samples identifying as White (Non-Hispanic) and roughly 30% identifying as Hispanic. Fifty percent of the original district sample was eligible for free/reduced lunch, compared to 44.7% of the final analysis sample. Lastly, ESL/ELL status was almost equal across both samples with the original district sample containing 16.4% ESL/ELL and the final analysis sample containing 12.5%. Again, since IEP students were eliminated from the final analysis sample, this cannot be compared to the original district sample. Based on the above demographic comparisons, specifically the comparisons between the final analysis sample and the original district sample and district population information, generalizations can be safely made from the analysis sample back to the target population with regards to demographic information. Final Analysis Sample CSAP Scores CSAP scores for the final analysis sample (N = 208) will be described in the following paragraphs. As mentioned previously when describing the original district sample CSAP scores, each of the five CSAP tests in the current data set were administered in the winter/spring of each academic year, with all CSAP data points documented in mid-March of 2005, 2006, 2007, 2008, and 2009. All scores will be summarized together, and also separated by cohort. In the final analysis sample (i.e., four cohorts from third to tenth grade), there were CSAP scores from the academic years of 2004/2005 to 2008/2009. As noted previously, each cohort has five CSAP scores, but will be analyzed collectively as one group in the HLM. As shown in the table (see Table 18) and figures below (see Figures 157

27 and 28), the final analysis samples CSAP scores followed a positive linear trend, with Cohort 2 demonstrating the most inconsistent growth compared to the other cohorts over time. Although this may be problematic, the combined final analysis sample (i.e., Cohorts 1 through 4) showed a consistent positive linear trend appropriate for the proposed analyses of the current research question. The highest average CSAP score was in the final CSAP administration (i.e., March 9, 2009) for Cohort 4 (M = 683.72, SD = 34.16). This finding is not surprising, as this time point is from the final cohort with the oldest age group/grade levels represented (i.e., sixth through tenth grade). The lowest average CSAP score was in the first CSAP administration for Cohort 1 (M = 563.46, SD = 47.32). Again, this is not remarkable, as this cohort has the youngest students/grade levels represented (i.e., third through seventh grade).


Table 18
Descriptive Statistics for the Final Analysis Sample for the CSAP Reading State Test Scores for the Highland School District for the 2004/2005 to 2008/2009 Academic Years by Cohort

CSAP Total Scaled Reading Score (M (SD))

CSAP Testing Date   Cohort 1          Cohort 2          Cohort 3          Cohort 4          Total
(Time Code)         (Grades 3-7)      (Grades 4-8)      (Grades 5-9)      (Grades 6-10)
03/14/05 (-1)       n = 37            n = 41            n = 41            n = 50            N = 169
                    563.46 (47.32)    585.59 (53.30)    630.78 (43.86)    621.52 (51.67)    602.34 (55.69)
03/13/06 (0)        n = 40            n = 47            n = 46            n = 54            N = 187
                    605.50 (37.32)    608.34 (49.07)    643.02 (39.22)    633.81 (45.87)    623.62 (45.96)
03/12/07 (1.25)     n = 44            n = 52            n = 48            n = 60            N = 204
                    630.84 (41.05)    617.19 (45.21)    659.04 (44.66)    649.77 (44.85)    639.56 (46.71)
03/10/08 (2)        n = 44            n = 52            n = 44            n = 54            N = 194
                    641.84 (40.63)    637.48 (48.56)    665.73 (42.70)    660.69 (43.42)    651.34 (45.37)
03/09/09 (3)        n = 43            n = 49            n = 44            n = 53            N = 189
                    651.63 (39.65)    638.08 (54.21)    671.82 (33.91)    683.72 (34.16)    661.81 (44.91)



Figure 27. A positive linear trend demonstrated in the final analysis sample (i.e., four cohorts from third to tenth grade combined). There are five CSAP scores from the academic years of 2004/2005 to 2008/2009. Mean TotSS on the Y-axis is the mean of the total scaled score for the CSAP reading state test at each test administration. Time on the X-axis is represented by the time code used in the multilevel growth model. See Table 18 for the date of the test administration.


Figure 28. Positive linear trend demonstrated by Cohorts 1 through 4 from the final analysis sample. Cohort 2 showed the most inconsistent trend. Mean TotSS on the Y-axis is the mean of the total scaled score for the CSAP reading state test at each test administration. Time on the X-axis is represented by the time code used in the multilevel growth model. See Table 18 for the date of the test administration.

Final Analysis Sample and State and District CSAP Growth. In the beginning of this section, Colorado state-level CSAP growth and district-level CSAP growth were outlined. For the state, it was noted that progress was being made towards the states goals, especially among specific groups (i.e., ESL/ELL, free/reduced lunch status, Minority students, and IEP status). All of these groups are demonstrating a positive trend of moving in increasing numbers into Proficient and Advanced levels, and being able to stay proficient and above over time, across all CSAP content areas. For district CSAP growth, according to the CDE, the main findings included that the total growth percentile


across the previous three academic years in reading for all grades in the district shows a slight decline compared to the state-level data. When comparing the above information to the final analysis sample CSAP growth, similarities can be seen between the growth trajectories of the state, but not the district information provided by the CDE. Although a slight decline was noted in the district CSAP reading growth over the previous three academic years, in the final analysis sample, across the 2004/2005 to 2009/2010 academic years, steady growth was noted in the final analysis sample. This is ideal for the proposed analyses for this first research question, and ideal for the goals of the Colorado Growth Model. This discrepancy may be due to the composition of the final analysis sample where problematic cases were removed. The collective district information presented did not screen and remove cases as was demonstrated in the above paragraphs. Additionally, the final analysis sample includes growth into the current academic year, which has not been provided by the CDE for the state or district at this time. Final Analysis Sample and Total Cases Removed CSAP Scores. Independent samples t tests were conducted between the final analysis sample (n = 208) and total cases removed (n = 90) to examine performance on the CSAP across the five data time points. The normality assumption was not violated for either group on the variable of interest with the distributions approximating a normal curve. The final analysis sample displayed more of a negative skew with many students obtaining higher CSAP scores in the final administration (i.e., the most recent academic year). In the total cases removed sample, the CSAP distributions at every time point appeared more platykurtic compared 162

to the final analysis sample. Independence was not violated. Additionally, the homogeneity of variances assumption was violated in all cases, with the CSAP tests at all time points having unequal population variances (p < .05). The results from the t tests are summarized below in Table 19.

Table 19
Independent Samples t Tests Comparing the Final Analysis Sample (n = 208) and Total Cases Removed (n = 90) on the CSAP Reading State Test from the Highland School District for the 2004/2005 to 2008/2009 Academic Years

CSAP Test Date     Total Cases Removed      Final Analysis Sample      t       df      p
(Time Code)        n    M (SD)              n     M (SD)
03/14/05 (-1)      27   486.33 (94.62)      169   602.34 (55.69)       6.20*   28.94   .000
03/13/06 (0)       35   526.51 (96.67)      187   623.62 (45.96)       5.82*   36.92   .000
03/12/07 (1.25)    44   555.20 (77.25)      204   639.56 (46.71)       6.97*   49.98   .000
03/10/08 (2)       50   604.02 (70.16)      194   651.34 (45.37)       4.53*   59.96   .000
03/10/09 (3)       71   605.28 (87.50)      189   661.81 (44.91)       5.19*   84.23   .000

Note. The homogeneity assumption was violated for all five data collection time points (p < .05).
*p < .001 (α = .01; .05/5 = .01 for the Bonferroni correction).

For all five CSAP data collection time points, the total cases removed sample performed significantly lower on average than the final analysis sample (p < .001 for all). This is also demonstrated in the line graph below (see Figure 29) where the total cases removed (i.e., Not in Analysis Sample) are shown to perform lower on average, and inconsistently, compared to the final analysis sample. The growth curve for the final 163

analysis sample was more linear compared to the total cases removed, where the final two data points had approximately the same average CSAP score.

Figure 29. Plot of Time (X-axis) and the mean total scaled score (i.e., Mean TotSS) for the CSAP reading state test (Y-axis) at each test administration showing that the students not in the analysis sample performed lower on average compared to the students in the analysis sample. There are five CSAP scores from the academic years of 2004/2005 to 2008/2009. Time on the X-axis is represented by the time code used in the multilevel growth model. See Table 19 for the date of the test administration.

Final Analysis Sample DORA Scores It is important to examine the DORA scores for the final analysis sample as well. As mentioned when discussing the original district sample descriptive information, any given student could have between 0 and 11 DORA scores between the academic years of 2006/2007 to 2009/2010. Each of the DORA tests in the current data set was administered in either the autumn, winter, or spring of the academic year. As done 164

previously, all scores will be summarized together, including a table of descriptive information with corresponding figures depicting growth for each of the DORA subtests. As shown in the table (see Table 20) and figures below (see Figures 30 through 36), the final analysis samples DORA scores follow a positive linear trend for four of the seven subtests Word Recognition, Oral Vocabulary, Spelling, and Reading Comprehension. High-Frequency Words and Phonics display a linear trend, but this trend is near the ceiling for these subtests for all time points. Phonemic Awareness did not have a trend, as most of the time points had a mean and standard deviation of zero. Overall, the final analysis sample (i.e., all cohorts combined) DORA scores for the four main subtests of interest showed a relatively consistent positive linear trend appropriate for the proposed analyses of the current research question. The highest average DORA subtest score was for Word Recognition (N = 44) in the winter of 2010 (M = 12.67, SD = .27). This is not surprising, as Word Recognition is one of four subtests that have a range of 0 to 12.83 (i.e., Kindergarten through high twelfth grade); however, this highest score for this subtest also had one of the smallest sample sizes across all time points, and may not be extremely reliable. The lowest average DORA subtest scores were for Phonemic Awareness across all time points with means and standard deviations near or at 0. Again, this is expected, as previous subtests are used to gauge if a student will be administered the Phonemic Awareness subtest. Ideally, as students progress in their reading ability, this subtest will not be administered as frequently, with students having zeros as their score if they do not need to take this particular test. The low average for this subtest can be considered an indication of better 165

reading ability, and also a noticeable floor effect, which prohibits the use of this subtest in further analyses as will be discussed in future sections.


Table 20
Descriptive Statistics for the Final Analysis Sample DORA Subtest Scores for the Highland School District for the 2006/2007 to 2009/2010 Academic Years

DORA Subtest Score (M (SD))

DORA Testing Date             High-Frequency   Word           Phonics      Phonemic     Oral           Spelling       Reading
(N; Time Code)                Words            Recognition                 Awareness    Vocabulary                    Comprehension
Autumn 2006 (N = 195; 0)      3.77 (.31)       10.62 (2.25)   4.71 (.16)   .01 (.07)    6.35 (1.85)    4.20 (2.12)     6.61 (3.47)
Winter 2007 (N = 203; 1)      3.82 (.08)       10.94 (1.79)   4.75 (.15)   .00 (.00)    6.67 (1.90)    4.61 (2.25)     7.93 (2.92)
Spring 2007 (N = 197; 1.25)   3.82 (.06)       11.07 (1.68)   4.76 (.14)   .00 (.00)    6.96 (2.04)    4.99 (2.37)     8.47 (2.81)
Autumn 2007 (N = 194; 1.5)    3.83 (.02)       11.96 (1.26)   4.76 (.14)   .00 (.00)    7.57 (2.30)    5.31 (2.41)     9.20 (2.75)
Winter 2008 (N = 140; 1.75)   3.83 (.00)       12.40 (1.03)   4.79 (.11)   .00 (.00)    7.83 (2.31)    5.33 (2.31)     9.71 (2.76)
Spring 2008 (N = 190; 2)      3.83 (.03)       12.52 (.91)    4.79 (.11)   .00 (.00)    8.28 (2.36)    6.13 (2.45)    10.07 (2.68)
Autumn 2008 (N = 185; 2.25)   3.81 (.13)       12.54 (.84)    4.81 (.08)   .00 (.00)    8.68 (2.25)    6.47 (2.54)    10.34 (2.44)
Winter 2009 (N = 87; 2.5)     3.81 (.07)       12.53 (.67)    4.80 (.10)   .00 (.00)    8.22 (2.34)    5.61 (2.47)     9.82 (2.36)
Spring 2009 (N = 180; 3)      3.82 (.06)       12.64 (.61)    4.80 (.09)   .00 (.00)    9.10 (2.32)    6.89 (2.51)    10.60 (2.56)
Autumn 2009 (N = 170; 3.25)   3.83 (.03)       12.60 (.60)    4.80 (.10)   .00 (.00)    9.26 (2.26)    6.91 (2.51)    10.81 (2.21)
Winter 2010 (N = 44; 3.5)     3.83 (.00)       12.67 (.27)    4.80 (.10)   .00 (.00)    8.64 (2.46)    6.86 (2.49)    10.41 (2.59)



Figure 30. Plot of Time (X-axis) and the means at each test administration for the HighFrequency Words DORA subtest (Y-axis) from the final analysis sample. Time on the X-axis is represented by the time code used in the multilevel growth model. See Table 20 for the date of the test administration.


Figure 31. Plot of Time (X-axis) and the means at each test administration for the Word Recognition DORA subtest (Y-axis) from the final analysis sample. Time on the X-axis is represented by the time code used in the multilevel growth model. See Table 20 for the date of the test administration.


Figure 32. Plot of Time (X-axis) and the means at each test administration for the Phonics DORA subtest (Y-axis) from the final analysis sample. Time on the X-axis is represented by the time code used in the multilevel growth model. See Table 20 for the date of the test administration.


Figure 33. Plot of Time (X-axis) and the means at each test administration for the Phonemic Awareness DORA subtest (Y-axis) from the final analysis sample. Time on the X-axis is represented by the time code used in the multilevel growth model. See Table 20 for the date of the test administration.


Figure 34. Plot of Time (X-axis) and the means at each test administration for the Oral Vocabulary DORA subtest (Y-axis) from the final analysis sample. Time on the X-axis is represented by the time code used in the multilevel growth model. See Table 20 for the date of the test administration.


Figure 35. Plot of Time (X-axis) and the means at each test administration for the Spelling DORA subtest (Y-axis) from the final analysis sample. Time on the X-axis is represented by the time code used in the multilevel growth model. See Table 20 for the date of the test administration.


Figure 36. Plot of Time (X-axis) and the means at each test administration for the Reading Comprehension DORA subtest (Y-axis) from the final analysis sample. Time on the X-axis is represented by the time code used in the multilevel growth model. See Table 20 for the date of the test administration.

Final Analysis Sample and the Original District Sample DORA Scores. As shown in the tables and figures when discussing the original district sample (N = 298), the original samples DORA scores follow a positive linear trend for four of the seven subtests Word Recognition, Oral Vocabulary, Spelling, and Reading Comprehension. These were the same findings demonstrated above for the final analysis sample (N = 208). The problematic subtests of High-Frequency Words, Phonics, and Phonemic Awareness were inconsistent or displayed obvious floor or ceiling effects for both the original sample and final analysis sample. Overall, both samples displayed a relatively consistent positive linear trend appropriate for the proposed analyses of the current 175

research question for the four main subtests of Word Recognition, Oral Vocabulary, Spelling, and Reading Comprehension. Final Analysis Sample and the Total Cases Removed DORA Scores. Independent samples t tests were again conducted between the final analysis sample (N = 208) and the total cases removed sample (N = 90) to examine performance on all DORA subtests. In order to control the family-wise error rate (i.e., reduce the amount of t-tests that would be conducted), a composite of the DORA subtests was used. The groups were compared on this composite DORA score for each of the time points across the academic years in question. The normality assumption was violated with most distributions on the dependent variable (i.e., DORA) displaying a negative skew (i.e., many students obtained the highest possible score on a given subtest). Independence was not violated. Additionally, the homogeneity of variances assumption was also violated for all 11 time points (i.e., unequal population variances; p < .05). Departures from homogeneity (and the unbalanced groups) warranted the use of Welchs t test. The results from the t tests are summarized below in Table 21.


Table 21
Independent Samples t Tests Comparing the Final Analysis Sample (n = 208) and the Total Cases Removed (n = 90) on DORA Scores from the Highland School District for the 2006/2007 to 2009/2010 Academic Years

DORA Subtest Date    Total Cases Removed    Final Analysis Sample    t       df      p
(Time Code)          n    M (SD)            n     M (SD)
Autumn 2006 (0)      30   -.93 (.77)        195   -.30 (.43)         4.32*   31.89   .000
Winter 2007 (1)      33   -.65 (.62)        203   -.15 (.36)         4.49*   35.58   .000
Spring 2007 (1.25)   31   -.57 (.64)        197   -.08 (.35)         4.17*   32.93   .000
Autumn 2007 (1.5)    51   -.62 (.89)        194    .06 (.34)         5.34*   53.88   .000
Winter 2008 (1.75)   34   -.33 (.52)        140    .14 (.33)         5.01*   39.64   .000
Spring 2008 (2)      40   -.16 (.49)        190    .23 (.33)         4.72*   46.45   .000
Autumn 2008 (2.25)   60   -.33 (.84)        185    .28 (.32)         5.50*   64.69   .000
Winter 2009 (2.5)    32   -.41 (.92)         87    .18 (.30)         3.62*   33.54   .001
Spring 2009 (3)      68   -.26 (.95)        180    .35 (.33)         5.12*   72.99   .000
Autumn 2009 (3.25)   56   -.06 (.77)        170    .37 (.30)         4.07*   60.74   .000
Winter 2010 (3.5)    20   -.37 (.96)         44    .32 (.30)         3.14*   20.69   .005

Note. There are eight DORA subtests. Fluency scores have been omitted from analyses and reporting due to this test being administered infrequently. Means and standard deviations are based on a z score composite of the DORA subtests. Violations of the homogeneity assumption were noted for all time points (p < .05).
*p < .005 (α = .005; .05/11 = .005 for the Bonferroni correction).


For all data collection time points on the DORA composite, the total cases removed sample performed significantly lower on average than the final analysis sample (p < .005 for all). This is also demonstrated in the line graph below where the total cases removed (i.e., Not in Analysis Sample) are shown to perform lower on average, and inconsistently, compared to the final analysis sample. The growth curve for the final analysis sample was more linear compared to the total cases removed group. A sharp drop in the growth trajectories is displayed for time point 2.5 (i.e., winter 2009) in both groups, and the final time point (i.e., winter 2010) shows a decline in both groups, but especially for the total cases removed sample. The consistent pattern where the total cases removed performed significantly lower across all time points on the DORA composite further supports the exclusion of these cases from the final analysis sample.


Figure 37. Plot of Time (X-axis) and the mean DORA composite score (i.e., Mean DORA_Comp; Y-axis) at each test administration showing that the students not in the analysis sample performed lower on average compared to the students in the analysis sample. There are 11 DORA scores from the academic years of 2006/2007 to 2009/2010. Time on the X-axis is represented by the time code used in the multilevel growth model. See Table 21 for the date of the test administration.

DORA Subtests Used Three subtests (i.e., High-Frequency Words, Phonics, and Phonemic Awareness) do not have sufficient variability to be examined in the analysis of Research Question 1. All three subtests demonstrate either a ceiling effect or floor effect, which is an effect where data cannot assume a value higher (or lower) than a ceiling (or floor) or maximum (or minimum) value on a test. Specifically, the utility of a measurement strategy is compromised by a lack of variability. For example, in the case of a ceiling effect, the majority of scores are at or near the maximum possible for the test, which can present two major problems. First, the test is unable to measure traits above its ceiling, and the 179

test fails to distinguish between the people scoring highest on the test. Second, most statistical procedures rely on scores being variable and evenly distributed (e.g., HLM). With strong ceiling effects, distributions are usually distorted with little variability. This violates statistical assumptions and limits the possibility of finding effects. These are generally the same concerns with floor effects. The first subtest in question is High-Frequency Words, which has a range of 0 to 3.83 (i.e., Kindergarten through high third grade). The second subtest in question is Phonics, which has a range of 0 to 4.83 (i.e., Kindergarten through high fourth grade). Both of these tests consistently demonstrate a ceiling effect across all grade levels. Finally, the third subtest demonstrated a floor effect Phonemic Awareness. Scores on this subtest are based on percent correct out of 9 questions. This test is generally administered to only those students performing poorly. Above the third or fourth grade, this test is usually not administered unless the student has severe problems in a number of other reading subtest areas. Therefore, these three subtests will not be included in the analysis of Research Question 1 (or Research Question 3). Sample Size For the current research question (and Research Question 3), sample size is an issue to consider that could have an impact on the results of this study (e.g., statistical conclusion validity). One of the numerous benefits of HLM is that it is able to reliably perform analyses with missing data, unbalanced data, and small sample sizes (Raudenbush & Bryk, 2002). If a sample size problem exists, it is usually at the group level because the group-level sample size is always smaller than the individual-level 180

sample size (Maas & Hox, 2005). According to simulation research, this is generally problematic for the standard errors of the second-level variances, as they are estimated too small when the number of groups is lower than 100 (Maas & Hox, 2005). For Level 2 data where the group size is smaller than 30, the standard errors are estimated about 15% too small. Fortunately, the regression coefficients and the variance components are all estimated without bias for the individual-level and second-level data. The estimates of the standard errors at Level 2 should be accurate in the current research question, as the Level 2 variables consist of demographic grouping considerations such as gender, ethnicity, ESL/ELL status, and free/reduced lunch status. Therefore, the number of groups will be well over 100. The number of groups will prove to be problematic for the estimation of the standard errors for the third research question, as there are only 11 teachers in the third level of data. This will be described in more detail in the results for Research Question 3. To improve the precision of the estimates with small Level 2 (or Level 3) sample sizes, the choice of estimation method is important. Two main estimation methods exist in using HLM full maximum likelihood (FEML) and restricted maximum likelihood (REML). The main difference between these two estimation methods is that REML maximizes a likelihood function that is invariant for the fixed effects (Goldstein, 1995; Hox, 2002). REML accounts for the uncertainty in the fixed parameters when estimating random parameters, and should, in theory, provide better estimates of the variance components when the number of groups is small (Raudenbush & Bryk, 2002). For the current research question, as mentioned previously, there are over 100 Level 2 units. 181

Therefore, FEML estimates should be akin to REML estimates, and FEML will be used for Research Question 1.

Hierarchical Linear Growth Modeling Assumptions

In Hierarchical Linear Growth Modeling, assumptions are made about the structural and stochastic parts of the model. An assumption of linearity is made about the structural features at each level of the model. Singer and Willett (2003) state that the structural specification embodies assumptions about the true functional form of the relationship between the outcomes and predictors (p. 127). Therefore, a linear relationship should be specified and examined at every level. The only exception may be in the case of Level 2 dichotomous predictors, because a linear model is de facto acceptable for predictors involving two categories, as in the current study (Singer & Willett, 2003, p. 128). Assumptions are also made regarding the stochastic part of the model at each level, which represents the effect of the random error associated with the measurement of student i on occasion j (i.e., in the Level 1 model). According to Singer and Willett (2003), each student's true change trajectory is determined by the structural component, and the observed change trajectory reflects the measurement errors (p. 54). The inclusion of the stochastic part of the model accounts for the differences between the true and observed trajectories. Thus, for the model in this current research question, each residual represents that part of student i's DORA score at time j not predicted by the demographic predictors at Level 2.
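As a sketch of how such a two-level growth model can be specified and estimated, the snippet below fits an unconditional linear growth model with a random intercept and random slope for Time using Python's statsmodels MixedLM, toggling between full maximum likelihood and REML through the reml argument of fit(). The long-format DataFrame, its column names, and the simulated scores are hypothetical placeholders; the dissertation analyses themselves were not run with this code.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical long-format data: one row per student per CSAP occasion,
    # using the study's time codes (-1, 0, 1.25, 2, 3)
    rng = np.random.default_rng(0)
    time_codes = [-1.0, 0.0, 1.25, 2.0, 3.0]
    rows = []
    for student in range(1, 21):
        intercept = 600 + rng.normal(0, 30)   # student-specific starting level
        slope = 15 + rng.normal(0, 5)         # student-specific growth rate
        for t in time_codes:
            rows.append({"student_id": student,
                         "time": t,
                         "csap": intercept + slope * t + rng.normal(0, 10)})
    data = pd.DataFrame(rows)

    # Random intercept and random slope for time across students (Level 2),
    # with occasion-level residuals at Level 1
    model = smf.mixedlm("csap ~ time", data, groups=data["student_id"], re_formula="~time")

    # reml=False requests full maximum likelihood; reml=True (the default) requests REML
    fml_fit = model.fit(reml=False)
    reml_fit = model.fit(reml=True)
    print(fml_fit.summary())

With well over 100 Level 2 units, as in the current research question, the two estimation methods should yield similar estimates, which is the rationale stated above for using FEML here.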


In fitting the submodels to the data, assumptions about the distribution of the residuals at each level, from occasion to occasion and from person to person, are made (Singer & Willett, 2003). Using the Level 1 residuals as an example, the classical notation for assumptions is as follows:

εij ~ N(0, σ²),                                                         [2]

where ~ indicates "is distributed as," N denotes a normal distribution, 0 in the parentheses denotes the distribution's mean, and σ² denotes its variance. These assumptions state that the residuals are independently and identically distributed, with homoscedastic variance across occasions and individuals. As Singer and Willett (2003) note, this assumption is very restrictive and often unrealistic for longitudinal data. For example, when students change across time, their Level 1 error structure may become more complex, resulting in autocorrelation and heteroscedasticity over time. However, multilevel approaches to longitudinal data have some advantages over traditional methods for analyzing change. One such advantage is that they offer an extremely flexible framework for assessing change, particularly with regard to model assumptions (O'Connell & McCoach, 2004). The model assumptions for longitudinal data will be discussed in the following paragraphs, along with an examination of these assumptions as applied to the current research question. The main assumptions in Hierarchical Linear Growth Modeling pertain to the functional form of the model (i.e., linearity) and the stochastic part of the model

involving normality and homoscedasticity. Although compound symmetry (i.e., a sufficient condition for sphericity; Lomax, 2007; Maxwell & Delaney, 1990) can be imposed on the data in a multilevel analysis, this assumption can also be relaxed in a multilevel framework (Raudenbush & Bryk, 2002). Linearity, normality, and homogeneity of variance (i.e., homoscedasticity), however, are assumptions that are typical of any General Linear Model (GLM) approach. Thus, checking and satisfying these assumptions for growth modeling can help to ensure unbiased estimates of population effects. Linearity. Linearity is assessed in Level 1 by examining empirical growth plots with the outcome of interest. Scatterplots were produced for all individuals examining the outcome of interest in the current research question the Colorado state test in reading (i.e., N = 208). Only the mean CSAP score as the outcome in the graph will be displayed in the figure below (see Figure 38). The figure below shows the scatterplot of Time on the X-axis, which was across the academic years of 2004/2005 to 2009/2010 (i.e., five possible time points), and the overall mean across students for the CSAP (i.e., the outcome in the HLM models for Research Question 1) on the Y-axis. The first time point (i.e., -1.00) is the CSAP test date in March of 2005. Zero represents the CSAP in March of 2006, followed by 1.25 as the time point for March of 2007. The last two time points are from March of 2008 (i.e., 2.00) and March of 2009 (i.e., 3.00).


Figure 38. Scatterplot to check the linearity assumption at Level 1 of the HLM Growth Model in Research Question 1. The scatterplot of the overall CSAP means across students displays linear change with time. Time on the X-axis is represented by the time code used in the multilevel growth model.
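For readers who wish to reproduce this kind of empirical growth plot, a minimal sketch is given below. It assumes the same hypothetical long-format data frame as in the earlier sketch (columns student_id, time, csap); the figures in this study were produced separately, so the code is illustrative only.

```python
# Sketch of an empirical growth plot (assumed columns: student_id, time, csap).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("growth_data_long.csv")  # hypothetical file

# Overall mean CSAP score at each coded time point (as in Figure 38).
means = df.groupby("time")["csap"].mean()
plt.scatter(means.index, means.values)
plt.xlabel("Time (coded)")
plt.ylabel("Mean CSAP score")
plt.title("Empirical growth plot of overall CSAP means")
plt.show()

# Per-student empirical growth plots can be spot-checked the same way,
# e.g., for a small random sample of students.
sample_ids = df["student_id"].drop_duplicates().sample(9, random_state=1)
for sid in sample_ids:
    student = df[df["student_id"] == sid].sort_values("time")
    plt.plot(student["time"], student["csap"], marker="o", alpha=0.6)
plt.xlabel("Time (coded)")
plt.ylabel("CSAP score")
plt.title("Empirical growth plots for a sample of students")
plt.show()
```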

Although not shown above, the empirical growth plots for each student for the CSAP suggested that most students have linear change with time. For others (i.e., the minority of cases), some trajectories appeared curvilinear. A very small amount of student trajectories appeared to have no linear relationship between Time and the CSAP (i.e., n < 5). In the scatterplot above of the overall CSAP means across students, the plot demonstrated obvious linear change with time. Although this information was already presented in the descriptive statistics section (i.e., in the form of a line graphs), the linearity information is presented again in this section to highlight how this information is used to assess linearity as an important assumption of Hierarchical Linear Growth Modeling. The only difference between the descriptive presentation and the assumption 185

checking is that the descriptives presented linearity in the form of a line graph, and the assumptions tend to check linearity with empirical growth plots (i.e., scatterplots). Linearity should also be examined for the time-varying covariates in the models for the current research question. Line graphs were produced for all students in the sample examining each time-varying covariate DORA subtest. Only the mean DORA subtest score was displayed in the figures in this document. If this information was to be presented again, the scatterplots would contain Time on the X-axis, which was across the academic years of 2006/2007 to 2009/2010 (i.e., 11 possible time points), and the overall mean across students for each DORA subtest (i.e., Word Recognition, Oral Vocabulary, Spelling, Reading Comprehension) on the Y-axis. As shown in Figures 31 and 34 through 36, the time-varying covariates follow a positive linear trend for the four subtests used in this research questions models. Finally, linearity at Level 2 does not need to be assessed because all Level 2 covariates are dichotomous (i.e., Sex, Ethnicity, Free/Reduced Lunch status, and ESL/ELL status). Normality. The HLM software for the two-level model produces two residual files, one at each level. These files contain the Empirical Bayes (EB) residuals defined at the various levels, fitted values, OLS residuals, EB coefficients, and the Mahalanobis distance measures. Residual files were produced for each level of the two-level model. There were a total of eight residual files with four files at Level 1 for each DORA subtest (i.e., the time-varying covariate) and four files at Level 2. Normality can be assessed in a number of ways. For example, Singer and Willett (2003) demonstrate how to use Normality Probability Plots (i.e., P-P Plots) to graph the 186

raw residuals against their associated normal scores. The assumption is satisfied if the points form a line in the P-P Plot. Normality can also be assessed by simply examining histograms for the Level 1 residuals to ensure the approximation of a normal curve. Using the latter method, all Level 1 residual files for each DORA subtest as the timevarying covariate in the model were examined (see Figure 39 below with the Word Recognition model as an example).

Figure 39. Histogram of the Level 1 residuals to examine the normality assumption in the model with Word Recognition as the time-varying covariate. The residuals (i.e., l1resid above) should approximate a normal curve. All histograms for each DORA subtest as the time-varying covariate in separate models appeared to approximate a normal distribution.
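A sketch of how exported Level 1 residuals might be checked for normality outside of HLM is shown below. The residual file and the column name (l1resid) mirror the labeling described above, but both are assumptions; the code is illustrative rather than the procedure used in the study.

```python
# Sketch: normality checks on exported Level 1 residuals (column name assumed
# to be "l1resid", matching the HLM residual-file label mentioned above).
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

resid = pd.read_csv("level1_residuals.csv")["l1resid"]  # hypothetical file

# Histogram: should approximate a normal curve if the assumption holds.
plt.hist(resid, bins=20, edgecolor="black")
plt.xlabel("Level 1 residual")
plt.ylabel("Frequency")
plt.show()

# Normal probability (quantile) plot: points should fall close to a straight line.
stats.probplot(resid, dist="norm", plot=plt)
plt.show()
```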

This assumption states that all residuals should be normally distributed. All histograms for each DORA subtest as the time-varying covariate in separate models 187

appeared to approximate a normal distribution. The assumption was also checked again for each time-varying covariate model at Level 2. The raw residuals for only the intercept were examined in each model (i.e., not the slope) due to the fact that the demographic covariates were only modeled for the intercept at Level 2. This will be demonstrated in the next section describing the formulas at each level of the model. Figure 40 below shows the Word Recognition model as an example (i.e., intercept only).

Figure 40. Histogram of the Level 2 residuals to examine the normality assumption in the model with Word Recognition as the time-varying covariate. The residuals (i.e., ebintrcp above, which means Empirical Bayes Intercept residuals) should approximate a normal curve. All histograms for each DORA subtest as the time-varying covariate in separate models appeared to approximate a normal distribution at the intercepts.


As with the Level 1 normality analysis, this assumption states that all residuals should be normally distributed. All histograms (i.e., the intercept for each model) for all the DORA subtests as the time-varying covariate appeared to approximate a normal distribution. Homogeneity of Variance. Homogeneity of variance can be checked in a number of ways. The HLM software can test the homogeneity of Level 1 variance with a simple Chi-Square test. In the Full Models (as will be described later) with each time-varying covariate, the homogeneity of variance assumption was examined at Level 1. The results suggested that the assumption of homogeneity of variance for Level 1 was violated for all four models (i.e., Word Recognition: χ² = 264.67, df = 130, p = .000; Oral Vocabulary: χ² = 245.80, df = 132, p = .000; Spelling: χ² = 287.91, df = 135, p = .000; Reading Comprehension: χ² = 239.03, df = 136, p = .000). This violation could be explained in a number of ways. For example, violations of homogeneity of variance at Level 1 can indicate a need for additional important variables at Level 1. Additionally, any substantial violation of normality can lead to an increased risk of violating the homogeneity of variance assumption (Lomax, 2007). Although not apparent from the above histograms examining normality, any nonnormality present in the residual analysis above (i.e., the histograms) may have had an impact on the heterogeneity found when examining this assumption. Homogeneity at Level 2 can be examined by plotting raw residuals against the predictors. Raw residuals were plotted against Sex, Ethnicity, Free/Reduced Lunch status, and ESL/ELL status. If the assumption is satisfied, residual variability will be

approximately equal at every covariate value. Although all demographic covariates will be examined, to minimize the number of scatterplots presented in this document, only those for Sex will be displayed for every DORA subtest as a time-varying covariate in the model. Again, only plots for the intercept were created (see Figure 41 below with the Word Recognition model as an example).

Figure 41. Residuals plotted against the Sex covariate to examine the Level 2 homogeneity of variance assumption in the model with Word Recognition as the timevarying covariate. The residuals (i.e., ebintrcp above on the Y-axis, which means Empirical Bayes Intercept residuals) should be approximately equal at every covariate value to meet this assumption. For Sex, 0 is Male and 1 is Female. All plots for each DORA subtest as the time-varying covariate in separate models do not have approximately equal range and variability for Males and Females at the intercepts. This was the same pattern observed for the other covariates Ethnicity, ESL/ELL, and Free/Reduced Lunch.
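The same kind of check can be scripted: the sketch below plots exported Empirical Bayes intercept residuals against a dichotomous covariate and compares group spread with Levene's test. The column names (ebintrcp, sex) mirror the labels above but are assumptions, and Levene's test is offered only as a supplementary numeric check, not the chi-square test reported by HLM.

```python
# Sketch: comparing Level 2 residual spread across a dichotomous covariate.
# Assumed columns: "ebintrcp" (EB intercept residuals) and "sex" (0 = Male, 1 = Female).
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

df = pd.read_csv("level2_residuals.csv")  # hypothetical file

# Scatter of residuals by group, as in Figure 41.
plt.scatter(df["sex"], df["ebintrcp"], alpha=0.5)
plt.xticks([0, 1], ["Male", "Female"])
plt.ylabel("EB intercept residual")
plt.show()

# Supplementary check of equal spread (not the HLM chi-square test):
male = df.loc[df["sex"] == 0, "ebintrcp"]
female = df.loc[df["sex"] == 1, "ebintrcp"]
stat, p = stats.levene(male, female)
print(f"Levene's test: W = {stat:.2f}, p = {p:.3f}")
```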


For this assumption at Level 2, residual variability should be approximately equal at every predictor value. Thus, for the above figures, 0 is Male and 1 is Female, and the residual spread should be approximately equal for those values. The Level 2 residuals do not have approximately equal range and variability for Males and Females for all DORA subtests as the time-varying covariate. For example, across all the DORA time-varying covariates, there appears to be some outliers for both Males and Females. For Oral Vocabulary and Reading Comprehension, there is more spread (i.e., outliers) in the groups compared to the other DORA subtest plots by Sex. Comparing Males and Females specifically, the residual variability was not approximately equal at every predictor value. Thus, overall, for the Sex predictor, the homoscedasticity assumption does not appear to be satisfied. For the other predictors not depicted in scatterplots in this document, similar patterns were noted in Ethnicity (i.e., 0 = White and 1 = Minority), and ESL/ELL status (i.e., 0 = Non-ESL/ELL status and 1 = ESL/ELL status) in that the variability and spread was not equal between the groups. Two-Level Time-Varying Covariate Hierarchical Linear Growth Model Results A Two-Level Hierarchical Linear Growth Model was run to examine the relationship between student state test scores in reading (i.e., the CSAP) and student DORA scores across grades 3 through 10 from the academic years of 2004/2005 to 2009/2010. More specifically, the goal of this research question was to examine if student CSAP growth is related to student DORA growth controlling for various demographic variables. The hypothesis was that student CSAP score growth will be significantly related to student DORA score growth. 191

Four models, one for each DORA subtest as the time-varying covariate (i.e., Word Recognition, Oral Vocabulary, Spelling, and Reading Comprehension), were run following the usual model-building strategy: the One-Way Random Effects ANOVA, followed by the Unconditional Growth Model (or Random Coefficients Model), followed by various Conditional Growth Models (or Contextual Models), and ending with the Full Model. Included in Level 1 were the time of DORA data collection and one of the four time-varying covariates (i.e., Word Recognition, Oral Vocabulary, Spelling, or Reading Comprehension). Level 2 variables were various demographic covariates, including Sex, Ethnicity, ESL/ELL status, and Free/Reduced Lunch status. Time at Level 1 and the demographic controls at Level 2 were uncentered, and the time-varying covariates were centered around their grand means. The HLM models were analyzed using the statistical package Hierarchical Linear Modeling (HLM) 6.08 (Raudenbush, Bryk, & Congdon, 2004) with Full Maximum Likelihood estimation (FEML). Sex was coded 0 for Male and 1 for Female. Ethnicity was coded 0 for White and 1 for Minority. Non-ESL/ELL and Non-Free/Reduced Lunch status students were coded 0, with their ESL/ELL and Free/Reduced Lunch counterparts coded 1. The full model is described again below. The model at Level 1 describes the general trajectory for each person across time. The model at Level 1 was the following:

Yti = π0i + π1i(DORA)ti + π2i(Time)ti + eti                             [3]


where Yti is the CSAP score at time t for student i, (Time)ti is the elapsed years/months since DORA implementation, (DORA)ti is the time-varying predictor for a student at a given time point, π0i (i.e., the intercept) is a student's initial CSAP score, and π1i is the linear growth coefficient for DORA along with π2i (i.e., the growth rate over all years/months), which represents the child's expected change in CSAP score for a unit year/month increase. This individual growth-curve model assumes that eti is a student- and time-specific residual, and that errors at Level 1 (eti) are independent and normally distributed with a common variance (Raudenbush & Bryk, 2002). The model at Level 2 describes how the above information tends to vary based on various student demographics. The individual growth parameters become the outcome variables in the Level 2 models, where they are assumed to vary across individuals depending on various demographic controls. The model at Level 2 was the following:

π0i = γ00 + γ01(SEX)i + γ02(ETHNIC)i + γ03(ESLELL)i + γ04(FREERED)i + r0i
π1i = γ10
π2i = γ20                                                               [4]

where π0i, π1i, and π2i are the individual-specific CSAP score parameters (i.e., initial status, DORA growth, and growth rate), γ00 is the baseline expectation (i.e., initial CSAP status) for the demographic predictors coded as 0, γ10 is the expected change of the CSAP


controlling for the DORA time-varying covariate, and γ20 is the expected linear change of the CSAP. Finally, r0i is a residual. The demographics were not modeled for all the intercepts and slopes in the Level 2 equations. The final estimates of the Level 2 variance components for the demographic variables in the intercept and slope equations (i.e., the ones removed above) are very small, with most being significant. When this occurs, it is common to fix some of the effects (i.e., eliminate the error term). Modeling small error variances can be problematic, especially when there is a small sample size, as in the current study, due to the increase in the number of parameters estimated and the loss of degrees of freedom. The most parsimonious model necessitates fixing these effects. Thus, for the two equations above not including the demographics, the effects were set as fixed (i.e., the error term was eliminated) in every final model run for all the DORA time-varying covariates. One-Way Random Effects ANOVA Model. Table 22 below shows the results of the One-Way Random Effects ANOVA Model (i.e., the Empty Model). The Empty Model below and the following Unconditional Growth Model are the same two beginning models used for every time-varying covariate (i.e., DORA subtest) in subsequent models. The intercept in this empty model is simply the total reading state test score (i.e., CSAP) average per student regardless of time (i.e., the average of all the student means across all time points together). The reliability was .90 (i.e., the average Ordinary Least Squares (OLS) reliability for all students in the sample), which means the sample means for the CSAP tend to be reliable indicators of true student CSAP scores.


The average student CSAP mean was statistically different from zero (γ00 = 643.76, t = 218.15, df = 207, p = .000). Considerable variation in the student CSAP means still existed (τ00 = 1621.91, χ² = 2068.81, df = 207, p = .000). The proportion of variance within students was 28.55%, indicating that 28.55% of the variability in CSAP scores was within students. Additionally, 71.45% of the differences in total state reading scores lay between students. The total variability was 2269.85. Based on the significant amount of unexplained variability, additional Level 1 predictors (i.e., Time and DORA scores) were added to try to reduce the variation within students, along with Level 2 variables to explain between-student differences in the following models.

Table 22
One-Way Random Effects ANOVA Model with the CSAP Reading State Test

Fixed Effects                            Coefficient (SE)   t (df)            p
Model for Initial CSAP Status (π0i)
  Intercept (γ00)                        643.76 (2.95)      218.15*** (207)   .000

Random Effects                           Variance    df     χ²            p
Level 1 Temporal Variation (eti)         647.94
Level 2 Student Baseline (r0i)           1621.91     207    2068.81***    .000

Note. Deviance (FEML) = 7402.28; 3 estimated parameters. *** p < .001
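The within- and between-student percentages reported above follow directly from the variance components in Table 22; the short calculation below makes the arithmetic explicit.

```python
# Variance decomposition from the One-Way Random Effects ANOVA Model (Table 22).
sigma2 = 647.94    # Level 1 (within-student, temporal) variance
tau00 = 1621.91    # Level 2 (between-student) variance

total = sigma2 + tau00
print(f"Total variability: {total:.2f}")           # 2269.85
print(f"Within students:  {sigma2 / total:.4f}")   # ~0.2855 (28.55%)
print(f"Between students: {tau00 / total:.4f}")    # ~0.7145 (71.45%)
```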


Unconditional Growth Model. Table 23 below shows the results of the Unconditional Growth Model with Time (i.e., in three to six month units) as the sole predictor at Level 1 and no Level 2 variables. After including Time as a predictor of CSAP score, within student variability in CSAP score was reduced by 54.05%, relative to the One-Way Random Effects ANOVA Model. The remaining variation in CSAP score after linear effects of Time at Level 1 were controlled for was 297.71 for Level 1 (i.e., across the current academic year) and 1930.97 for Level 2 (i.e., the between student level). Thus, the variance in CSAP scores after the linear effect of Time was controlled was 13.36% associated with nonlinear and residual effects of Time, and 86.64% associated with variation between students. Overall mean CSAP scores across students at initial status (i.e., Time = 0) was still significantly different from zero (00 = 622.31, t = 190.79, df = 207, p = .000). Also, there was a significant difference in the Time slope (i.e., the effect of Time on CSAP score) across students (10 = 13.81, t = 18.32, df = 207, p = .000), indicating a statistically significant linear trend. For every three to six month increase in Time, there was an average 13.81 points increase in student CSAP scores. The correlation between initial status and linear growth was -.34 (p < .05). This means that students who had a lower initial CSAP score had faster growth (i.e., rate of change). There was significant variability in both the intercepts and slopes. Statistically significant variability in the CSAP means still exist after considering Time (00 = 1930.97, 2 = 1716.47, df = 205, p = .000), as well as statistically significant variability in individual student growth rates (11 = 38.57, 2 = 317.85, df = 205, p = .000). Finally, the 196

scatter of each students variability around his/her trajectory (i.e., the variation within one students trajectory) in this sample was large (i.e., eti = 297.71).

Table 23
Unconditional Growth Model with the CSAP Reading State Test

Fixed Effects                            Coefficient (SE)   t (df)            p
Model for Initial CSAP Status (π0i)
  Intercept (γ00)                        622.31 (3.26)      190.79*** (207)   .000
Model for CSAP Growth Rate (π1i)
  Intercept (γ10)                        13.81 (.75)        18.32* (207)      .000

Random Effects                           Variance    df     χ²            p
Level 1 Temporal Variation (eti)         297.71
Level 2 Student Baseline (r0i)           1930.97     205    1716.47***    .000
  Student Growth Rate (r1i)              38.57       205    317.85***     .000

Note. Deviance (FEML) = 7070.20; 6 estimated parameters. *** p < .001
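The 54.05% reduction and the 13.36% / 86.64% split quoted above can be reproduced from the variance components in Tables 22 and 23, as the short calculation below illustrates.

```python
# Proportional reduction in Level 1 variance after adding Time (Tables 22 and 23).
sigma2_empty = 647.94     # Level 1 variance, One-Way Random Effects ANOVA Model
sigma2_growth = 297.71    # Level 1 variance, Unconditional Growth Model
tau00_growth = 1930.97    # Level 2 intercept variance, Unconditional Growth Model

reduction = (sigma2_empty - sigma2_growth) / sigma2_empty
print(f"Within-student variance explained by Time: {reduction:.4f}")          # ~0.5405

total_growth = sigma2_growth + tau00_growth
print(f"Residual/nonlinear Time share: {sigma2_growth / total_growth:.4f}")   # ~0.1336
print(f"Between-student share:         {tau00_growth / total_growth:.4f}")    # ~0.8664
```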

Conditional Growth Model. Table 24 below shows the results of Conditional Growth Model with Time and the time-varying covariate Word Recognition DORA subtest in Level 1 and no Level 2 variables. After including Word Recognition in the model, there was a 1.03% reduction in within student variance (i.e., compared to the Unconditional Growth Model). The remaining variation in CSAP scores after the linear effects of Time and Word Recognition at Level 1 were controlled for was 294.65 for 197

Level 1 (i.e., across the current academic year) and 1803.64 for Level 2 (i.e., the between student level). Thus, the variance in CSAP scores after the linear effect of Time and Word Recognition were controlled was 14.04% associated with nonlinear and residual effects of Time, and 85.96% associated with variation between students. There was still significant variability in both the intercepts and slopes. After including Time and the Word Recognition covariate in Level 1, 6.60% of the variance in initial CSAP status is explained by these variables. Statistically significant variability in the CSAP means still existed (00 = 1803.63, 2 = 724.36, df = 182, p = .000). This indicates that there are still considerable differences between students that might be explained by Level 2 variables. There was also statistically significant variability in individual student growth rates (11 = 44.73, 2 = 233.58, df = 182, p = .000), meaning that between student differences in the effect of time was not fully accounted for by Word Recognition. That is, there was a 15.97% decrease in the amount of variability explained from the Unconditional Growth Model, or the variability in the effect of time (i.e., -15.97% of the variance in linear growth rates), within students that can be explained by Word Recognition. However, there was no statistically significant variability in student Word Recognition slope (p > .05). That is, there were no statistically significant differences in the rate of Word Recognition change between students as a time-varying covariate in this model. Finally, the scatter of each students variability around his/her trajectory (i.e., the variation within one students trajectory) in this sample was still large (i.e., eti = 294.65).


Overall mean CSAP scores across students was still significantly different from zero (00 = 624.52, t = 193.92, df = 207, p = .000). This is the mean CSAP score when time is at initial status for a student with a mean Word Recognition score. There was also a statistically significant effect of Time on mean CSAP score controlling for Word Recognition score (20 = 12.55, t = 14.60, df = 207, p = .000). This means that there is a 12.55 point increase in a students CSAP score every three to six months adjusted for their average Word Recognition score. Finally, the mean effect of Word Recognition at initial status (i.e., Time = 0) was statistically significant (10 = 1.56, t = 3.28, df = 207, p = .002). This means that on average, across students, the DORA Word Recognition subtest is significantly positively related to the state test in reading, and for every one unit increase in Word Recognition score, there is a 1.56 increase in the state test score.


Table 24
Conditional Growth Model with the CSAP Reading State Test as the Outcome and the DORA Word Recognition (WR) Subtest as the Time-Varying Covariate

Fixed Effects                            Coefficient (SE)   t (df)            p
Model for Initial CSAP Status (π0i)
  Intercept (γ00)                        624.52 (3.22)      193.92*** (207)   .000
Model for WR Growth Rate (π1i)
  Intercept (γ10)                        1.56 (.48)         3.28** (207)      .002
Model for CSAP Growth Rate (π2i)
  Intercept (γ20)                        12.55 (.86)        14.60*** (207)    .000

Random Effects                           Variance    df     χ²            p
Level 1 Temporal Variation (eti)         294.65
Level 2 Student Baseline (r0i)           1803.63     182    724.36***     .000
  Student WR Growth Rate (r1i)           .62         182    168.33        > .500
  Student Growth Rate (r2i)              44.73       182    233.58**      .006

Note. Deviance (FEML) = 7061.62; 10 estimated parameters. ** p < .01; *** p < .001

Full Model. Table 25 below shows the results of the Full Model with Time and Word Recognition in Level 1 and all the demographic variables in Level 2. After including the demographic covariates in the model at the intercept, there was a 19.79% increase in within student variance (i.e., compared to the Unconditional Growth Model). The remaining variation in CSAP scores after controlling for the linear effects of Time and Word Recognition was 356.63 for Level 1 (i.e., across the current academic year) and 1370.00 for Level 2 (i.e., the between student level). Thus, the variance in CSAP scores 200

after the linear effects of Time and Word Recognition were controlled was 20.66% associated with nonlinear and residual effects of Time, and 79.35% associated with variation between students. After including Sex, Ethnicity, ESL/ELL status, and Free/Reduced Lunch status in Level 2, 29.05% of the variance in the between student differences in mean CSAP score was accounted for by these covariates (i.e., 29.05% of the variability in initial status is explained by the demographic controls). However, since this result is statistically significant (00 = 1370.00, 2 = 3055.06, df = 203, p = .000), there are still differences between students that might be explained by other Level 2 variables. Finally, the scatter of each students variability around his/her trajectory (i.e., the variation within one students trajectory) in this sample was still large (i.e., eti = 356.63). Overall mean CSAP scores across students was still significantly different from zero (00 = 633.10, t = 138.33, df = 203, p = .000). This is the mean CSAP score when Sex is 0 (i.e., Male), Ethnicity is 0 (i.e., White), ESL/ELL status is 0 (i.e., not in the ESL/ELL program), Free/Reduced Lunch status is 0 (i.e., not in the Free/Reduced Lunch program), and Time is 0. There was a statistically significant effect of ESL/ELL status and Free/Reduced Lunch status on mean CSAP score. First, the effect of ESL/ELL status was negative and statistically significant (03 = -37.86, t = -4.22, df = 203, p = .000). The coefficient -37.86 represents the decrease in a students mean CSAP score for students who are in the ESL/ELL program on average. Non-ESL/ELL students are predicted to have a mean CSAP score of 633.10, and ESL/ELL students are predicted to have a mean


CSAP score of 595.24 (i.e., 633.10 37.86). Thus, at initial status, non-ESL/ELL students outperform ESL/ELL students on the CSAP on average. Second, the effect of Free/Reduced Lunch status was also negative and statistically significant (04 = -15.03, t = -2.62, df = 203, p = .010). This means that nonFree/Reduced Lunch status students outperformed students enrolled in the program by 15.03 points on the CSAP on average. Finally, there was no statistically significant increase or decrease in student mean CSAP scores for Sex and Ethnicity (p > .05). There was still a significant difference in the Time slope (i.e., the effect of Time on CSAP score) across students (20 = 12.49, t = 14.32, df = 737, p = .000). The effect of Time on mean CSAP score is positive on average when Sex is 0 (i.e., Male), Ethnicity is 0 (i.e., White), ESL/ELL status is 0 (i.e., not in the ESL/ELL program), and Free/Reduced Lunch status is 0 (i.e., not in the Free/Reduced Lunch program). That is, for each three to six month increase in Time, there was an average 12.49 points increase in student CSAP scores. Finally, the mean effect of Word Recognition at initial status was statistically significant (10 = 1.78, t = 3.41, df = 737, p = .001). This means that on average, across students, the DORA Word Recognition subtest is significantly and positively related to the state test in reading. For every one unit increase in Word Recognition score, there is a 1.78 point increase in the CSAP.


Table 25
Full Model with the CSAP Reading State Test as the Outcome and the DORA Word Recognition (WR) Subtest as the Time-Varying Covariate

Fixed Effects                            Coefficient (SE)   t (df)            p
Model for Initial CSAP Status (π0i)
  Intercept (γ00)                        633.10 (4.58)      138.33*** (203)   .000
  Sex (γ01)                              7.02 (5.45)        1.29 (203)        .199
  Ethnicity (γ02)                        -2.51 (6.60)       -.38 (203)        .704
  ESL/ELL (γ03)                          -37.86 (8.98)      -4.22*** (203)    .000
  Free/Reduced Lunch (γ04)               -15.03 (5.74)      -2.62* (203)      .010
Model for WR Growth Rate (π1i)
  Intercept (γ10)                        1.78 (.52)         3.41** (737)      .001
Model for CSAP Growth Rate (π2i)
  Intercept (γ20)                        12.49 (.87)        14.32*** (737)    .000

Random Effects                           Variance    df     χ²            p
Level 1 Temporal Variation (eti)         356.63
Level 2 Student Baseline (r0i)           1370.00     203    3055.06***    .000

Note. Deviance (FEML) = 7039.38; 9 estimated parameters. * p < .05; ** p < .01; *** p < .001
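For readers who want to see how a comparable full model could be specified outside of HLM 6.08, a minimal sketch is given below. Column names are hypothetical, the time-varying DORA covariate is grand-mean centered as described above, and only the intercept is allowed to vary randomly (matching the decision to fix the growth-rate equations); statsmodels is used purely as an illustrative analogue, not as the study's actual procedure.

```python
# Illustrative analogue of the Full Model (Table 25): CSAP regressed on Time,
# the grand-mean-centered WR time-varying covariate, and Level 2 demographics,
# with a random intercept per student and full maximum likelihood estimation.
# Column names (csap, time, wr, sex, ethnic, esl_ell, free_red, student_id)
# are assumptions for this sketch, not the study's actual variable names.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("growth_data_long.csv")  # hypothetical file

# Grand-mean center the time-varying covariate.
df["wr_c"] = df["wr"] - df["wr"].mean()

full = smf.mixedlm(
    "csap ~ time + wr_c + sex + ethnic + esl_ell + free_red",
    data=df,
    groups=df["student_id"],   # random intercept only (growth rates fixed)
)
result = full.fit(reml=False)  # full maximum likelihood, analogous to FEML
print(result.summary())
```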

Conditional Growth Model. Table 26 below shows the results of Conditional Growth Model with Time and the time-varying covariate Oral Vocabulary DORA subtest in Level 1 and no Level 2 variables. After including Oral Vocabulary in the model, there was a 3.87% increase in the within student variance compared to the Unconditional Growth Model. The remaining variation in CSAP scores after the linear effects of Time 203

and Oral Vocabulary at Level 1 were controlled for was 309.24 for Level 1 and 1656.96 for Level 2. Thus, the variance in CSAP scores after the linear effect of Time and Oral Vocabulary were controlled was 15.73% associated with nonlinear and residual effects of Time, and 84.27% associated with variation between students. There was still significant variability in both the intercepts and slopes. After including Time and the Oral vocabulary covariate in Level 1, 14.19% of the variance in initial CSAP status was explained by these variables. Statistically significant variability in the CSAP means still existed (00 = 1656.96, 2 = 516.51, df = 184, p = .000). This indicates that there are still considerable differences between students that might be explained by Level 2 variables. There was also statistically significant variability in individual student growth rates (11 = 31.00, 2 = 225.49, df = 184, p = .020), meaning that between student differences in the effect of time is not fully accounted for by Oral Vocabulary. That is, 19.63% of the variability in the effect of time (i.e., 19.63% of the variance in linear growth rates) within students can be explained by Oral Vocabulary scores (i.e., compared to the Unconditional Growth Model). However, there was no statistically significant variability in student Oral Vocabulary slope (p > .05). That is, there are no statistically significant differences in the rate of Oral Vocabulary change between students as a time-varying covariate in this model. Finally, the scatter of each students variability around his/her trajectory (i.e., the variation within one students trajectory) in this sample was still large (i.e., eti = 309.24). Overall mean CSAP scores across students was still significantly different from zero (00 = 626.38, t = 197.24, df = 207, p = .000). This is the mean CSAP score when 204

time is at initial status for a student with a mean Oral Vocabulary score. There was also a statistically significant and positive effect of Time on mean CSAP score controlling for Oral Vocabulary score (20 = 11.58, t = 12.66, df = 207, p = .000). This means that there is an 11.58 point increase in a students CSAP score every three to six months adjusted for his/her average Oral Vocabulary score. Finally, the mean effect of Oral Vocabulary score at initial status was statistically significant and positive (10 = 2.37, t = 4.24, df = 207, p = .000). This means that on average, across students, the DORA Oral Vocabulary subtest is significantly positively related to the state test in reading. For every one unit increase in Oral Vocabulary score, there is a 2.37 point increase in the state test score.


Table 26
Conditional Growth Model with the CSAP Reading State Test as the Outcome and the DORA Oral Vocabulary (OV) Subtest as the Time-Varying Covariate

Fixed Effects                            Coefficient (SE)   t (df)            p
Model for Initial CSAP Status (π0i)
  Intercept (γ00)                        626.38 (3.18)      197.24* (207)     .000
Model for OV Growth Rate (π1i)
  Intercept (γ10)                        2.37 (.56)         4.24* (207)       .000
Model for CSAP Growth Rate (π2i)
  Intercept (γ20)                        11.58 (.91)        12.66* (207)      .000

Random Effects                           Variance    df     χ²            p
Level 1 Temporal Variation (eti)         309.24
Level 2 Student Baseline (r0i)           1656.96     184    516.51*       .000
  Student OV Growth Rate (r1i)           .54         184    174.05        > .500
  Student Growth Rate (r2i)              31.00       184    225.49*       .020

Note. Deviance (FEML) = 7055.55; 10 estimated parameters. * p < .05; ** p < .001

Full Model. Table 27 below shows the results of the Full Model with Time and Oral Vocabulary in Level 1 and all the demographic variables in Level 2. After including these variables, there was a 23.13% increase in within student variance (i.e., compared to the Unconditional Growth Model). The remaining variation in CSAP scores after controlling for the linear effects of Time and Oral Vocabulary scores was 366.57 for Level 1 (i.e., across the current academic year) and 1238.64 for Level 2 (i.e., the between student level). Thus, the variance in CSAP scores after the linear effects of Time and Oral 206

vocabulary were controlled was 22.84% associated with nonlinear and residual effects of Time, and 77.16% associated with variation between students. After including Sex, Ethnicity, ESL/ELL status, and Free/Reduced Lunch status in Level 2 and Oral Vocabulary in Level 1, 35.85% of the variance in the between student differences in mean CSAP score was accounted for by these covariates. However, since this result is statistically significant (00 = 1238.64, 2 = 2713.69, df = 203, p = .000), there are still differences between students that might be explained by other Level 2 variables. Finally, the scatter of each students variability around his/her trajectory (i.e., the variation within one students trajectory) in this sample was still large (i.e., eti = 366.57). Overall mean CSAP scores across students was still significantly different from zero (00 = 633.17, t = 140.71, df = 203, p = .000). This is the mean CSAP score when the demographics are coded 0 and Time is 0 controlling for Oral Vocabulary scores. There were statistically significant effects of ESL/ELL status and Free/Reduced Lunch status on mean CSAP score. The effect of ESL/ELL status was negative and statistically significant (03 = -35.69, t = -4.21, df = 203, p = .000). Thus, at initial status, nonESL/ELL students outperformed ESL/ELL students on the CSAP on average. In addition, the effect of Free/Reduced Lunch status was also negative and statistically significant (04 = -14.58, t = -2.69, df = 203, p = .008). This means that non-Free/Reduced Lunch status students outperformed students enrolled in the program by 14.58 points on the CSAP on average. Finally, there was no statistically significant increase or decrease in student mean CSAP scores for Sex and Ethnicity (p > .05). 207

There was still a significant difference in the effect of Time on CSAP scores across students (20 = 11.65, t = 12.73, df = 737, p = .000). That is, for every three to six month increase in Time, there was an average 11.65 points increase in student CSAP scores. Finally, the mean effect of Oral Vocabulary at initial status was statistically significant (10 = 2.17, t = 3.71, df = 737, p = .000). This means that on average, across students, the DORA Oral Vocabulary subtest is significantly and positively related to the state test in reading. For every one unit increase in Oral Vocabulary score, there is a 2.17 point increase in the CSAP.


Table 27
Full Model with the CSAP Reading State Test as the Outcome and the DORA Oral Vocabulary (OV) Subtest as the Time-Varying Covariate

Fixed Effects                            Coefficient (SE)   t (df)            p
Model for Initial CSAP Status (π0i)
  Intercept (γ00)                        633.16 (4.50)      140.71*** (203)   .000
  Sex (γ01)                              7.45 (5.22)        1.43 (203)        .155
  Ethnicity (γ02)                        -.66 (6.21)        -.12 (203)        .915
  ESL/ELL (γ03)                          -35.69 (8.48)      -4.21*** (203)    .000
  Free/Reduced Lunch (γ04)               -14.58 (5.41)      -2.69** (203)     .008
Model for OV Growth Rate (π1i)
  Intercept (γ10)                        2.17 (.58)         3.71*** (737)     .000
Model for CSAP Growth Rate (π2i)
  Intercept (γ20)                        11.65 (.92)        12.73*** (737)    .000

Random Effects                           Variance    df     χ²            p
Level 1 Temporal Variation (eti)         366.57
Level 2 Student Baseline (r0i)           1238.64     203    2713.69***    .000

Note. Deviance (FEML) = 7035.16; 9 estimated parameters. * p < .05; ** p < .01; *** p < .001

Conditional Growth Model. Table 28 below shows the results of Conditional Growth Model with Time and the time-varying covariate Spelling DORA subtest in Level 1 and no Level 2 variables. After including Spelling in the model, there was less than a 1.00% reduction in within student variance compared to the Unconditional Growth Model (i.e., .67%). The remaining variation in CSAP scores after the linear effects of Time and Spelling at Level 1 were controlled for was 295.72 for Level 1 and 1629.66 for 209

Level 2. Thus, the variance in CSAP scores after the linear effect of Time and Spelling were controlled was 15.36% associated with nonlinear and residual effects of Time, and 84.64% associated with variation between students. There was still significant variability in both the intercepts and slopes. After including Time and the Spelling covariate in Level 1, 15.60% of the variance in initial CSAP status was explained by these variables. Statistically significant variability in the CSAP means still existed (00 = 1629.66, 2 = 537.06, df = 188, p = .000). This indicates that there are still considerable differences between students that might be explained by adding Level 2 variables. There was also statistically significant variability in individual student growth rates (11 = 39.94, 2 = 244.40, df = 188, p = .004), meaning that between student differences in the effect of time is not fully accounted for by Spelling. That is, there was a small decrease of 3.55% in the variability in the effect of time explained by Spelling scores (i.e., -3.55% of the variance in linear growth rates) within students (i.e., compared to the Unconditional Growth Model). There was no statistically significant variability in student Spelling slopes (p > .05). That is, there are no statistically significant differences in the rate of Spelling change between students as a time-varying covariate in this model. Finally, the scatter of each students variability around his/her trajectory (i.e., the variation within one students trajectory) in this sample was large (i.e., eti = 295.72). Overall mean CSAP scores across students was still significantly different from zero (00 = 626.75, t = 196.21, df = 207, p = .000). This is the mean CSAP score when time is at initial status for a student with a mean Spelling score. There was also a 210

statistically significant and positive effect of Time on mean CSAP score controlling for Spelling score (20 = 10.78, t = 11.43, df = 207, p = .000). This means that there is an 10.78 point increase in a students CSAP score every three to six months adjusted for his/her average Spelling score. Finally, the mean effect of Spelling score at initial CSAP status was statistically significant and positive (10 = 3.49, t = 5.25, df = 207, p = .000). This means that on average, across students, the DORA Spelling subtest is significantly positively related to the state test in reading. For every one unit increase in Spelling score, there is a 3.49 point increase in the state test score.


Table 28
Conditional Growth Model with the CSAP Reading Test as the Outcome and the DORA Spelling (SP) Subtest as the Time-Varying Covariate

Fixed Effects                            Coefficient (SE)   t (df)            p
Model for Initial CSAP Status (π0i)
  Intercept (γ00)                        626.75 (3.19)      196.21*** (207)   .000
Model for SP Growth Rate (π1i)
  Intercept (γ10)                        3.49 (.66)         5.25*** (207)     .000
Model for CSAP Growth Rate (π2i)
  Intercept (γ20)                        10.78 (.94)        11.43*** (207)    .000

Random Effects                           Variance    df     χ²            p
Level 1 Temporal Variation (eti)         295.72
Level 2 Student Baseline (r0i)           1629.66     188    537.06***     .000
  Student SP Growth Rate (r1i)           10.03       188    194.12        .364
  Student Growth Rate (r2i)              39.94       188    244.40**      .004

Note. Deviance (FEML) = 7043.38; 10 estimated parameters. ** p < .01; *** p < .001

Full Model. Table 29 below shows the results of the Full Model with Time and Spelling in Level 1 and all the demographic variables in Level 2. After including these variables, there was a 24.04% increase in within student variance (i.e., compared to the Unconditional Growth Model). The remaining variation in CSAP scores after controlling for the linear effects of Time and Spelling scores was 369.27 for Level 1 (i.e., across the current academic year) and 1132.17 for Level 2 (i.e., the between student level). Thus, the variance in CSAP scores after the linear effects of Time and Spelling were controlled 212

was 24.59% associated with nonlinear and residual effects of Time, and 75.41% associated with variation between students. After including Sex, Ethnicity, ESL/ELL status, and Free/Reduced Lunch status in Level 2 and Spelling in Level 1, 41.37% of the variance in the between student differences in mean CSAP score was accounted for by these covariates. However, since this result is statistically significant (00 = 1132.17, 2 = 2476.06, df = 203, p = .000), there are still differences between students that might be explained by other Level 2 variables. Finally, the scatter of each students variability around his/her trajectory (i.e., the variation within one students trajectory) in this sample was still large (i.e., eti = 369.27). Overall mean CSAP scores across students was still significantly different from zero (00 = 635.45, t = 143.51, df = 203, p = .000). This is the mean CSAP score when the demographics are coded 0 and Time is 0 controlling for Spelling scores. There were statistically significant effects of ESL/ELL status and Free/Reduced Lunch status on mean CSAP score. The effect of ESL/ELL status was negative and statistically significant (03 = -35.53, t = -4.17, df = 203, p = .000). Thus, at initial status, nonESL/ELL students outperformed ESL/ELL students on the CSAP on average by 35.53 points. In addition, the effect of Free/Reduced Lunch status was also negative and statistically significant (04 = -12.98, t = -2.47, df = 203, p = .015). This means that nonFree/Reduced Lunch status students outperformed students enrolled in the program by 12.98 points on the CSAP on average. Finally, there was no statistically significant increase or decrease in student mean CSAP scores for Sex and Ethnicity (p > .05). 213

There was still a significant difference in the effect of Time on CSAP scores across students (20 = 10.67, t = 11.31, df = 737, p = .000). That is, for every three to six month increase in Time, there was an average 10.67 points increase in student CSAP scores. Finally, the mean effect of Spelling on the CSAP at initial status was statistically significant (10 = 3.41, t = 5.08, df = 737, p = .000). This means that on average, across students, the DORA Spelling subtest is significantly and positively related to the state test in reading. For every one unit increase in Spelling score, there is a 3.41 point increase in the CSAP.


Table 29
Full Model with the CSAP Reading State Test as the Outcome and the DORA Spelling (SP) Subtest as the Time-Varying Covariate

Fixed Effects                            Coefficient (SE)   t (df)            p
Model for Initial CSAP Status (π0i)
  Intercept (γ00)                        635.45 (4.42)      143.51*** (203)   .000
  Sex (γ01)                              6.98 (4.99)        1.40 (203)        .163
  Ethnicity (γ02)                        -4.70 (6.16)       -.77 (203)        .445
  ESL/ELL (γ03)                          -35.53 (8.52)      -4.17*** (203)    .000
  Free/Reduced Lunch (γ04)               -12.98 (5.26)      -2.47* (203)      .015
Model for SP Growth Rate (π1i)
  Intercept (γ10)                        3.41 (.67)         5.08*** (737)     .000
Model for CSAP Growth Rate (π2i)
  Intercept (γ20)                        10.69 (.94)        11.31*** (737)    .000

Random Effects                           Variance    df     χ²            p
Level 1 Temporal Variation (eti)         369.27
Level 2 Student Baseline (r0i)           1132.17     203    2476.06***    .000

Note. Deviance (FEML) = 7022.08; 9 estimated parameters. * p < .05; *** p < .001

Conditional Growth Model. Table 30 below shows the results of Conditional Growth Model with Time and the time-varying covariate Reading Comprehension DORA subtest in Level 1 and no Level 2 variables. After including Reading Comprehension in the model, there was a 2.67% reduction in the within student variance compared to the Unconditional Growth Model. The remaining variation in CSAP scores after the linear effects of Time and Reading Comprehension at Level 1 were controlled 215

for was 289.77 for Level 1 and 1313.66 for Level 2. Thus, the variance in CSAP scores after the linear effect of Time and Reading Comprehension were controlled was 18.07% associated with nonlinear and residual effects of Time, and 81.93% associated with variation between students. There was still significant variability in both the intercepts and slopes. After including Time and the Reading Comprehension covariate in Level 1, 31.97% of the variance in initial CSAP status was explained by these variables. Statistically significant variability in the CSAP means still existed (00 = 1313.66, 2 = 426.81, df = 185, p = .000). This indicates that there are still considerable differences between students that might be explained by other Level 2 variables. Conversely, there was no statistically significant variability in individual student growth rates (p > .05), meaning that between student differences in the effect of time was fully accounted for by Reading Comprehension. Additionally, there was no statistically significant variability in student Reading Comprehension slopes (p > .05). That is, there are no statistically significant differences in the rate of Reading Comprehension change between students as a timevarying covariate in this model. Finally, the scatter of each students variability around his/her trajectory (i.e., the variation within one students trajectory) in this sample was still large (i.e., eti = 289.77). Overall mean CSAP scores across students was still significantly different from zero (00 = 628.44, t = 213.28, df = 207, p = .000). This is the mean CSAP score when time is at initial status for a student with a mean Reading Comprehension score. There was also a statistically significant and positive effect of Time on mean CSAP score 216

controlling for Reading Comprehension score (20 = 9.37, t = 9.93, df = 207, p = .000). This means that there is an 9.37 point increase in a students CSAP score every three to six months adjusted for his/her average Reading Comprehension score. Finally, the mean effect of Reading Comprehension score at initial status was statistically significant and positive (10 = 3.48, t = 7.43, df = 207, p = .000). This means that on average, across students, the DORA Reading Comprehension subtest is significantly positively related to the state test in reading. For every one unit increase in Reading Comprehension score, there is a 3.48 point increase in the state test score.


Table 30
Conditional Growth Model with the CSAP Reading Test as the Outcome and the DORA Reading Comprehension (RC) Subtest as the Time-Varying Covariate

Fixed Effects                            Coefficient (SE)   t (df)            p
Model for Initial CSAP Status (π0i)
  Intercept (γ00)                        628.44 (2.95)      213.28*** (207)   .000
Model for RC Growth Rate (π1i)
  Intercept (γ10)                        3.48 (.47)         7.43*** (207)     .000
Model for CSAP Growth Rate (π2i)
  Intercept (γ20)                        9.37 (.94)         9.93*** (207)     .000

Random Effects                           Variance    df     χ²            p
Level 1 Temporal Variation (eti)         289.77
Level 2 Student Baseline (r0i)           1313.66     185    426.81***     .000
  Student RC Growth Rate (r1i)           7.32        185    212.67        .080
  Student Growth Rate (r2i)              36.24       185    198.48        .236

Note. Deviance (FEML) = 7022.98; 10 estimated parameters. *** p < .001

Full Model. Table 31 below shows the results of the Full Model with Time and Reading Comprehension in Level 1 and all the demographic variables in Level 2. After including these variables, there was a 23.20% increase in within student variance (i.e., compared to the Unconditional Growth Model). The remaining variation in CSAP scores after controlling for the linear effects of Time and Reading Comprehension scores was 366.77 for Level 1 (i.e., across the current academic year) and 1086.20 for Level 2 (i.e., the between student level). Thus, the variance in CSAP scores after the linear effects of 218

Time and Reading Comprehension were controlled was 25.24% associated with nonlinear and residual effects of Time, and 74.76% associated with variation between students. After including Sex, Ethnicity, ESL/ELL status, and Free/Reduced Lunch status in Level 2 and Reading Comprehension in Level 1, 43.75% of the variance in the between student differences in mean CSAP score was accounted for by these covariates. However, since this result is statistically significant (00 = 1086.20, 2 = 2396.68, df = 203, p = .000), there are still differences between students that might be explained by other Level 2 variables. Finally, the scatter of each students variability around his/her trajectory (i.e., the variation within one students trajectory) in this sample was still large (i.e., eti = 366.77). Overall mean CSAP scores across students was still significantly different from zero (00 = 634.63, t = 146.86, df = 203, p = .000). This is the mean CSAP score when the demographics are coded 0 and Time is 0 controlling for Reading Comprehension scores. There was a statistically significant effect of ESL/ELL status and Free/Reduced Lunch status on mean CSAP score. The effect of ESL/ELL status was negative and statistically significant (03 = -32.43, t = -4.25, df = 203, p = .000). This means that non-ESL/ELL status students outperformed students enrolled in the program by 32.43 points on the CSAP on average. Free/Reduced Lunch status was negative and statistically significant (04 = -13.56, t = -2.62, df = 203, p = .010). This means that non-Free/Reduced Lunch status students outperformed students enrolled in the program by 13.56 points on the


CSAP on average. Finally, there was no statistically significant increase or decrease in student mean CSAP scores for Sex, Ethnicity, and ESL/ELL status (p > .05). There was still a significant difference in the effect of Time on CSAP scores across students (20 = 10.02, t = 10.34, df = 737, p = .000). That is, for every three to six month increase in Time, there was an average 10.02 points increase in student CSAP scores. Finally, the mean effect of Reading Comprehension on the CSAP at initial status was statistically significant (10 = 2.75, t = 6.04, df = 737, p = .000). This means that on average, across students, the DORA Reading Comprehension subtest is significantly and positively related to the state test in reading. For every one unit increase in Reading Comprehension score, there is a 2.75 point increase in the CSAP.


Table 31
Full Model with the CSAP Reading Test as the Outcome and the DORA Reading Comprehension (RC) Subtest as the Time-Varying Covariate

Fixed Effects                            Coefficient (SE)   t (df)            p
Model for Initial CSAP Status (π0i)
  Intercept (γ00)                        634.63 (4.32)      146.86*** (203)   .000
  Sex (γ01)                              7.83 (4.91)        1.59 (203)        .112
  Ethnicity (γ02)                        -.43 (5.82)        -.07 (203)        .942
  ESL/ELL (γ03)                          -32.43 (7.63)      -4.25*** (203)    .000
  Free/Reduced Lunch (γ04)               -13.56 (5.18)      -2.62* (203)      .010
Model for RC Growth Rate (π1i)
  Intercept (γ10)                        2.75 (.46)         6.04*** (737)     .000
Model for CSAP Growth Rate (π2i)
  Intercept (γ20)                        10.02 (.97)        10.34*** (737)    .000

Random Effects                           Variance    df     χ²            p
Level 1 Temporal Variation (eti)         366.77
Level 2 Student Baseline (r0i)           1086.20     203    2396.68**     .000

Note. Deviance (FEML) = 7010.47; 9 estimated parameters. * p < .05; *** p < .001

Research Question 2 Descriptives. The sample descriptive information used to address the second research question (i.e., What are the psychometric properties of the newly developed behavioral frequency measure of teacher use of a computerized/online formative assessment program?) is summarized in the following paragraphs. As mentioned previously, survey data were collected on two fronts: (1) Physical survey data collection 221

from all teachers in the Highland School District in Ault, Colorado, and (2) Online survey data collection from all teachers/administrators who currently use DORA across the United States. First, a description of the populations from which the survey data have been sampled with be discussed, followed by a summary of the descriptive information for the final sample obtained from the data collection on the two fronts mentioned above. The population from which the first portion of the survey data has been sampled is the state of Colorado, and more specifically, the Highland School District. Gender and ethnicity demographic information is provided to the public for the state and each school district for the 2008/2009 academic year, but more specific information (e.g., demographics by content area taught such as reading, demographics by number of years teaching) is not available to maintain anonymity (CDE, 2009c). In the state of Colorado in 2008/2009, there were 50,294 teachers, which include special education, ESL, and ELL teachers. Approximately 75% of the teachers were female (n = 37,899). Eighty-nine point four percent of the teachers reported their ethnicity/race as White (n = 44,984), 6.9% indicated Hispanic (n = 3,463), and 1.5% described their ethnicity/race as Black (n = 750). All other reported ethnicities/races were categorized as either Asian/Pacific Islander or American Indian/Alaskan Native (2.2%; n = 1097). More specifically, in the Highland School District in the 2008/2009 academic year, of the 59 teachers in the district, 39 were categorized as female (66%). Approximately 96.6% of the faculty reported being White (n = 57), with the two remaining teachers indicating Hispanic backgrounds (3.3%).


The reading teachers in the Highland School District using DORA include 22 individuals who are either reading specialists or are special education, English as a Second Language (ESL), or English Language Learner (ELL) teachers. Of the 22 teachers, 19 individuals completed the survey. The three teachers not participating were all females. Two of the female teachers were high school language arts instructors, with the other female as the designated district contact for Title I. Other demographics were not reported for these non-participants, as completing the survey was voluntary. Reasons for declining to participate were not reported. More specific demographic information for the 19 Highland School District reading teacher participants will be discussed in the descriptive results section for Research Question 3. The second front of data collection occurred by administering an online survey to all teachers/administrators who currently use DORA. Any teacher/administrator in the United States who currently uses or is familiar with DORA was eligible to complete the survey. Thus, the population from which the second portion of the survey data has been sampled is the reading specialists, special education, ESL, and ELL teachers who use or are familiar with DORA in the United States. It is difficult to describe this population with more specificity, as demographic and other descriptive information is not collected, reported, or made public by LGL. However, LGL states that approximately 500 school districts across the United States and Canada currently use DORA, with urban and rural school districts represented. The latest report by the United States Department of Education indicates that there are approximately 16,025 school districts in the United States as of 2003 (NCES, 2003). 223

Approximately 9,054 (56.5%) school districts are located in counties in metro areas with a population of greater than 20,000, and can be categorized as urban school districts. Information from interviewing LGL employees indicated a similar composition in the school districts that use DORA; however, a slightly higher percentage of rural school districts (i.e., more than 56.5%) were reported to use DORA compared to urban. The combined descriptive data from the survey participants on both fronts will be summarized in the following paragraphs. Cases were deleted if more than 75% of the data were missing including demographics. Three participants were deleted who appeared to only respond to the first page of questions, and did not provide any demographic information for comparison purposes. All these participants were from the online data collection. The remaining cases all had complete data (N = 47) and are summarized below in Table 32 by gender. The overwhelming majority of respondents were female (85.1%; n = 40), and White (NonHispanic; n = 35; 74.5%). Other ethnicities/races were represented with six (12.8%) Black (Non-Hispanic) participants, three multi-racial (6.4%) respondents, and one Asian/Pacific Islander, Hispanic, and Other each (2.1%).


Table 32
Descriptive Information from Online Formative Assessment Survey (OFAS) Teacher Participants by Gender

Demographic Information            Male (n = 7)        Female (n = 40)       Total (N = 47)
Age                                33.00 (SD = 6.46)   44.45 (SD = 11.19)    42.74 (SD = 11.34)
Total Years Teaching               5.71 (SD = 4.27)    15.20 (SD = 10.53)    13.79 (SD = 10.39)
Total Years in District            4.29 (SD = 4.07)    9.37 (SD = 7.95)      8.62 (SD = 7.69)
Ethnicity
  White (Non-Hispanic)             4 (57.1%)           31 (77.5%)            35 (74.5%)
  Asian/Pacific Islander           1 (14.3%)           --                    1 (2.1%)
  Black (Non-Hispanic)             --                  6 (15%)               6 (12.8%)
  Hispanic                         1 (14.3%)           --                    1 (2.1%)
  Multi-Racial                     1 (14.3%)           2 (5%)                3 (6.4%)
  Other                            --                  1 (2.5%)              1 (2.1%)
Grade
  Elementary (P-5)                 3 (42.9%)           23 (57.5%)            26 (55.3%)
  Middle (6-8)                     1 (14.3%)           8 (20%)               9 (19.1%)
  High School (9-12)               3 (42.9%)           3 (7.5%)              6 (12.8%)
  All                              --                  6 (15%)               6 (12.8%)
Current Specialization
  General Reading                  4 (57.1%)           19 (47.5%)            23 (48.9%)
  Language Arts                    3 (42.9%)           5 (12.5%)             8 (17%)
  ESL                              --                  2 (5%)                2 (4.3%)
  ELL                              --                  3 (7.5%)              3 (6.4%)
  Title I                          --                  2 (5%)                2 (4.3%)
  Special Education                --                  9 (22.5%)             9 (19.1%)
Highest Degree Earned
  Teaching Credential              --                  1 (2.5%)              1 (2.1%)
  Bachelor's Degree                1 (14.3%)           7 (17.5%)             8 (17%)
  Master's Degree                  5 (71.4%)           29 (72.5%)            34 (72.3%)
  Education Specialist             --                  2 (5%)                2 (4.3%)
  Doctoral Degree/Above Master's   1 (14.3%)           1 (2.5%)              2 (4.3%)

Note. State of residence was omitted due to the number of states represented. This information was outlined specifically in the text.


The majority of participants indicated that their state of residence was Colorado (40.4%; n = 19). Virginia had the next highest number of participants with 14 (29.8%). Other states represented included California, North Carolina, and Pennsylvania with two participants each (4.3% each), and the District of Columbia, Florida, Hawaii, Illinois, Maine, Minnesota, New York, and Ohio with one participant each (2.1% each). Additionally, the participants reported an average age of 42.74 years (SD = 11.34). When asked how many total years they had been teaching, including the current academic year, teachers responded with 13.79 years on average (SD = 10.39), and they reported an average of 8.62 years (SD = 7.69) in their current school district, including the present academic year.

Twenty-six respondents indicated teaching in the elementary grade levels (i.e., Preschool through grade 5; 55.3%). This was followed by nine teachers instructing middle school (i.e., grades 6 through 8; 19.1%), six teaching high school, and six certified to teach all grade levels (12.8% each). The current specializations reported included 23 general reading teachers (48.9%), nine special education teachers (19.1%), eight language arts teachers (17%), three ELL teachers (6.4%), and two ESL teachers and two Title I contacts (4.3% each). Finally, 72.3% indicated that their highest degree obtained was a Master's degree (n = 34), 17% (n = 8) reported having a Bachelor's degree, two respondents indicated having an Education Specialist degree (4.3%), two individuals reported a degree higher than a Master's (i.e., a Doctoral degree; 4.3%), and one participant reported having a teaching credential (2.1%).
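Although the screening was performed outside of Winsteps, the case-deletion rule described earlier in this section (dropping respondents with more than 75% of the data missing, including demographics) can be expressed compactly. The sketch below is a hypothetical illustration using pandas; the file name and column layout are assumptions, not the actual OFAS data set.

```python
import pandas as pd

# Hypothetical layout: one row per respondent, survey items plus demographics.
df = pd.read_csv("ofas_survey_export.csv")  # assumed file name

# Proportion of missing values per case across all columns.
missing_rate = df.isna().mean(axis=1)

# Retain cases with 75% or less of the data missing, mirroring the rule above;
# the three all-but-blank online respondents would be dropped at this step.
screened = df.loc[missing_rate <= 0.75].copy()
print(f"Retained {len(screened)} of {len(df)} cases")
```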


Rasch Analysis

Original Survey. As mentioned previously, a rating scale model was run for all the iterations described in the following paragraphs. Running the data with a uniform rating scale model (i.e., deleting Groups = 0 from the control file) assumes that the response scale functions the same way across items. In the first run of the data, the program converged in six iterations; a small number of iterations suggests good convergence.

After convergence was assessed, person and item separation and the reliability of separation were examined. These indices assess instrument spread across the trait continuum. Separation measures the spread of both items and persons in standard error units (i.e., the number of statistically distinct levels into which the sample of items and persons can be separated). A good measure should exceed 1.0, with higher values of separation representing greater spread of items and persons along the continuum, and lower values indicating redundancy among the items and less variability of persons on the trait. To operationalize a variable with items, each item should mark a different amount of the trait. Separation, in turn, determines reliability: higher separation, in concert with variance in person or item position, yields higher reliability. Reliability of person separation is conceptually equivalent to Cronbach's Alpha (i.e., Coefficient Alpha), though the formulas differ.

For persons, separation was 3.97 for the data at hand (i.e., real) and 4.45 when the data are assumed to have no misfit to the model (i.e., model). This result is desirable, as the persons were measured along a continuum rather than a dichotomy. If separation is 1.0 or below, the items may not have sufficient breadth in position; in that case, revision of the construct definition may be warranted, which can possibly be remedied by adding items that cover a broader range. Item separation for the present data was 3.09, a smaller continuum than for persons. It is typical to find larger separation values for items than for persons, usually because data sets contain a smaller number of items and a much larger number of people; in the current measure, however, there were 56 items and only 47 people. Overall, the model had an item separation value greater than 2 (i.e., 3.09), which shows that true variability among items was much larger than the error variability. A larger item separation would still be preferable, and a larger sample or the removal of poorly fitting items might achieve this. Separation is affected by sample size, as are fit indices and error estimates: with larger samples, separation tends to increase and error decreases.

The person separation reliability estimate for these data was .94. The conceptual analog to person reliability is item reliability, which estimates the internal consistency of the items rather than the persons; it was .90, and the model item reliability was .92, indicating that 92% of the variability among items was due to real item variance.

Item means were analyzed next. Note that the mean for items was 0.0; the mean of the item logit positions is always set at 0.0, similar to a standard z score. The person mean for the current model was .48, which suggests that these items, on average, were easy to agree with: the persons had a higher level of the trait than the items did. If the person mean had been near -2, -1, +1, or +2, this would have indicated that the items were potentially too hard or too easy for the sample, so .48 is a good person mean. Additionally, Cronbach's Alpha was .95, suggesting, along with the other reliability indices, that the scale's reliability was good.

Following the examination of item means, mean infit and outfit mean squares for persons and items were investigated. Mean infit and outfit are expected to be 1.0; for these data, they were 1.01 and 1.08, respectively, for both persons and items. Relatedly, the mean standardized infit and outfit are expected to be 0.0. In the current model, they were -.4 and -.2 for persons, and -.1 and .1 for items, which indicates that the items overfit slightly, on average. Overall, the data fit the model somewhat better than expected, which may signal some redundancy (i.e., possibly redundant items). The standard deviation of the standardized infit is an index of overall misfit for persons and items (Bode & Wright, 1999). Using 2.0 as a cut-off criterion, both persons (i.e., standardized infit SD = .52) and items (i.e., standardized infit SD = .32) showed little overall misfit, so the data evidence acceptable fit overall. This is in contrast to the overall Chi-Square test for this model, which was significant, indicating that the Rasch model does not fit these items well (χ² = 5757.10, df = 2,529, p < .001). All of the above information is presented in Tables 33 and 34 below.


Table 33 Summary of 47 Measured Persons (56 Measured Items) Winsteps Output


-----------------------------------------------------------------------------| RAW MODEL INFIT OUTFIT | | SCORE COUNT MEASURE ERROR MNSQ ZSTD MNSQ ZSTD | |-----------------------------------------------------------------------------| | MEAN 100.9 56.0 .48 .17 1.01 -.4 1.08 -.2 | | S.D. 28.0 .0 .79 .02 .52 2.7 .75 2.7 | | MAX. 149.0 56.0 2.07 .24 2.43 6.6 4.77 6.9 | | MIN. 28.0 56.0 -1.64 .16 .31 -5.9 .33 -5.5 | |-----------------------------------------------------------------------------| |REAL RMSE .19 ADJ.SD .76 SEPARATION 3.97 PERSON RELIABILITY .94 | |MODEL RMSE .17 ADJ.SD .77 SEPARATION 4.45 PERSON RELIABILITY .95 | |S.E. OF PERSON MEAN = .12 | ------------------------------------------------------------------------------PERSON RAW SCORE-TO-MEASURE CORRELATION = 1.00 CRONBACH ALPHA (KR-20) PERSON RAW SCORE RELIABILITY = .95

Table 34 Summary of 56 Measured Items Winsteps Output


-----------------------------------------------------------------------------| RAW MODEL INFIT OUTFIT | | SCORE COUNT MEASURE ERROR MNSQ ZSTD MNSQ ZSTD | |-----------------------------------------------------------------------------| |MEAN 84.7 47.0 .00 .19 1.01 -.1 1.08 .1 | |S.D. 18.7 .0 .65 .02 .32 1.7 .52 1.9 | |MAX. 125.0 47.0 1.58 .27 1.88 4.0 3.76 4.8 | |MIN. 37.0 47.0 -1.68 .17 .51 -3.1 .52 -3.0 | |-----------------------------------------------------------------------------| |REAL RMSE .20 ADJ.SD .61 SEPARATION 3.09 ITEM RELIABILITY .90 | |MODEL RMSE .19 ADJ.SD .62 SEPARATION 3.31 ITEM RELIABILITY .92 | |S.E. OF ITEM MEAN = .09 | ------------------------------------------------------------------------------UMEAN = .000 USCALE = 1.000 ITEM RAW SCORE-TO-MEASURE CORRELATION = -1.00 2632 DATA POINTS. LOG-LIKELIHOOD CHI-SQUARE: 5757.10 with 2529 d.f. p = .0000
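To make the link between these indices explicit, the sketch below recomputes separation and the reliability of separation from the RMSE and adjusted SD values printed in Tables 33 and 34, using the standard definitions (separation = ADJ.SD / RMSE; reliability = "true" variance over observed variance). This is an illustration of the formulas, not output from the analysis, and small differences from the printed values reflect rounding of the reported inputs.

```python
def separation(adj_sd, rmse):
    # Spread of the estimated measures in standard-error units.
    return adj_sd / rmse

def separation_reliability(adj_sd, rmse):
    # "True" variance divided by observed variance; the person version is
    # the Rasch analog of Cronbach's Alpha.
    return adj_sd ** 2 / (adj_sd ** 2 + rmse ** 2)

# Real (data-at-hand) values from Table 33 (persons) and Table 34 (items).
print(separation(0.76, 0.19), separation_reliability(0.76, 0.19))  # ~4.0, ~.94
print(separation(0.61, 0.20), separation_reliability(0.61, 0.20))  # ~3.1, ~.90
```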

The table below (see Table 35) contains information about how the response scale was used. For these data, the response scale was 0 (Never), 1 (Rarely), 2 (Sometimes), and 3 (Almost Always). The step logit position is where a step marks the transition from one rating scale category to the next (e.g., from a 2 to a 3). "Observed Count" is the number of times the category was selected across all items and all persons. "Observed Average" is the average of the person logit positions observed in the category; it should increase with the category value, and the current model demonstrated this. For example, persons responding with a 0 had an average measure (-.51) lower than those responding with a 1 (average measure = -.11). There was no substantial misfit for the categories, as the misfit indices (i.e., mean square misfit) for the categories were below 1.5. "Sample Expected" is the optimum value of the average logit position for these data; sample expected values should not be highly discrepant from the observed averages, and for these data they were not. Infit and outfit mean squares were each expected to equal 1.0, and they were close to this value. The step (i.e., structure) calibration is the calibrated logit difficulty of the step. This is demonstrated pictorially in Figure 42 below, where the transition points between one category and the next are the step (i.e., structure) calibration values in the table. These values are expected to increase with the category value, which they did in the current model. The step standard error is a measure of uncertainty around the step calibration; these were low for every step (i.e., .07, .05, and .05).


Table 35 Summary of Category Structure (56 Measured Items) Winsteps Output


------------------------------------------------------------------|CATEGORY OBSERVED|OBSVD SAMPLE|INFIT OUTFIT||STRUCTURE|CATEGORY| |LABEL SCORE COUNT %|AVRGE EXPECT| MNSQ MNSQ ||CALIBRATN| MEASURE| |-------------------+------------+------------++---------+--------| | 0 0 346 13| -.51 -.63| 1.16 1.47|| NONE |( -2.27)| | 1 1 658 25| -.11 -.01| .87 .98||-.96 .07 | -.62 | | 2 2 800 30| .61 .59| .81 .74|| .09 .05 | .65 | | 3 3 828 31| 1.24 1.23| 1.01 1.02|| .87 .05 |( 2.22)| --------------------------------------------------------------------------------------------------------------------------------------------|CATEGORY STRUCTURE | SCORE-TO-MEASURE | 50% CUM.| COHERENCE|ESTIM| | LABEL MEASURE S.E. | AT CAT. ----ZONE----|PROBABLTY| M->C C->M|DISCR| |------------------------+---------------------+---------+----------+-----| | 0 NONE |( -2.27) -INF -1.50| | 60% 13%| | | 1 -.96 .07 | -.62 -1.50 .02| -1.22 | 45% 51%| .78| | 2 .09 .05 | .65 .02 1.48| .04 | 41% 72%| 1.13| | 3 .87 .05 |( 2.22) 1.48 +INF | 1.19 | 76% 39%| 1.05| --------------------------------------------------------------------------M->C = Does Measure imply Category? C->M = Does Category imply Measure?


A final way of examining step use is via probability curves. These curves display the likelihood of selecting each category (Y-axis) as a function of the person-minus-item measure (X-axis). For example, if the difference in logit position between the person and the item is +1.0, any response is possible, but categories 2 and 3 are the most likely responses; once the difference exceeds the final step calibration (.87), a 3 becomes the single most likely response. If all categories are utilized, each category value will be the most likely response at some point on the continuum (as shown below in Figure 42), and there will be no category inversions, in which a higher category is most likely at a lower point on the continuum than a lower category. No such inversions occurred here; all categories were used and behaved according to expectation.


[Figure 42 graphic: Winsteps probability curve plot ("CATEGORY PROBABILITIES: MODES - Structure Measures at Intersections"), showing the probability of response (0-1) for categories 0-3 against the person-minus-item measure from -3 to +3 logits.]

Figure 42. Category probabilities (i.e., probability curves) indicating the probability of a response for the 56-item survey. These curves display the likelihood of category selection (Y-axis) by the person-minus-item measure (X-axis). All categories in the above figure are being used according to expectation.
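As an illustration of how these curves arise, the sketch below evaluates the Andrich rating scale model using the step calibrations reported in Table 35 (-.96, .09, and .87). It is a minimal re-derivation of the category probabilities, not the author's code; Winsteps performs the actual estimation. Evaluating the function across person-minus-item differences from -3 to +3 reproduces the curves in Figure 42.

```python
import math

def category_probabilities(theta, delta, taus):
    """Rating scale model: probability of each category 0..m for a person with
    measure theta on an item of difficulty delta, given step calibrations taus."""
    cum = [0.0]  # the category-0 term is defined as 0
    for tau in taus:
        cum.append(cum[-1] + (theta - delta - tau))
    exp_cum = [math.exp(c) for c in cum]
    total = sum(exp_cum)
    return [e / total for e in exp_cum]

taus = [-0.96, 0.09, 0.87]  # step calibrations for the 56-item OFAS (Table 35)

# Probabilities when the person is one logit above the item.
for cat, p in enumerate(category_probabilities(1.0, 0.0, taus)):
    print(f"Category {cat}: {p:.2f}")
# Categories 2 and 3 dominate, with 3 narrowly the modal response because the
# person-item difference (1.0) exceeds the final step calibration (.87).
```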

The table below (Table 36) contains item misfit diagnostics. The raw score is the total number of "points" the item received across the entire sample, and the count indicates that all 47 participants responded to each item. The measure is the logit position of the item, and the error is the standard error of measurement for the item. These are followed by the infit and outfit diagnostics. Infit is a t-standardized, information-weighted mean square statistic that is more sensitive to unexpected behavior affecting responses to items near the person's measure level (Linacre, 2009, p. 252). Outfit is a t-standardized, outlier-sensitive mean square fit statistic that is more sensitive to unexpected behavior by persons on items far from the person's measure level (Linacre, 2009, p. 252). Finally, the point measure correlations (i.e., PTMEA) are also reported; a point measure correlation is the correlation between the observations on an item and the corresponding person measures (Linacre, 2009, p. 253). Because this is the correlation between the item score and the measure (i.e., rather than a total score), it functions as an item discrimination index and should therefore be positive.
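The sketch below shows how these three diagnostics are computed under their standard Rasch definitions once the model-expected score and model variance of each response are in hand. The numeric arrays are small hypothetical examples, not the OFAS data; in practice Winsteps supplies the expectations and variances.

```python
import numpy as np

def item_diagnostics(obs, exp, var, person_measures):
    """Outfit MNSQ, infit MNSQ, and point-measure correlation for one item,
    given observed responses, model-expected responses, model variances, and
    person measures (one value per person)."""
    obs, exp, var = map(np.asarray, (obs, exp, var))
    z_sq = (obs - exp) ** 2 / var                 # squared standardized residuals
    outfit = z_sq.mean()                          # unweighted, outlier-sensitive
    infit = ((obs - exp) ** 2).sum() / var.sum()  # information-weighted
    ptmea = np.corrcoef(obs, person_measures)[0, 1]
    return outfit, infit, ptmea

# Hypothetical 0-3 responses from eight persons to a single item.
obs = [3, 2, 3, 1, 0, 2, 3, 1]
exp = [2.6, 2.1, 2.8, 1.2, 0.7, 1.9, 2.4, 1.5]
var = [0.5, 0.7, 0.4, 0.8, 0.6, 0.7, 0.6, 0.8]
theta = [1.4, 0.8, 1.7, -0.2, -1.1, 0.6, 1.2, 0.1]
print(item_diagnostics(obs, exp, var, theta))
```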

Table 36 Item Statistics: Misfit Order (56 Measured Items) Winsteps Output
-----------------------------------------------------------------------------------| ENTRY TOTAL MODEL| INFIT | OUTFIT |PTMEA|EXACT MATCH| | | NUMBER SCORE COUNT MEASURE S.E.|MNSQ ZSTD|MNSQ ZSTD|CORR.| OBS% EXP%|ITEM | |------------------------------------+----------+----------+-----+-----------------| 48 125 47 -1.68 .27|1.13 .5|3.76 4.8|A .32| 80.9 70.0| Q48 | | 47 76 47 .30 .17|1.88 4.0|1.99 4.3|B .34| 29.8 40.7| Q47 | | 46 111 47 -.91 .21|1.39 1.7|1.76 2.5|C .38| 44.7 53.2| Q46 | | 5 37 47 1.58 .20|1.68 2.8|1.63 2.4|D .27| 38.3 49.0| Q5 | | 53 87 47 -.04 .18|1.56 2.7|1.68 3.0|E .29| 31.9 43.0| Q53 | | 4 86 47 .00 .18|1.58 2.8|1.68 3.0|F .29| 38.3 42.9| Q4 | | 54 119 47 -1.30 .24|1.11 .5|1.67 2.0|G .49| 66.0 64.5| Q54 | | 50 88 47 -.07 .18|1.40 2.0|1.58 2.6|H .36| 27.7 43.0| Q50 | | 38 84 47 .06 .18|1.50 2.5|1.57 2.6|I .35| 34.0 42.9| Q38 | | 34 71 47 .46 .17|1.50 2.5|1.49 2.4|J .36| 29.8 40.9| Q34 | | 8 91 47 -.17 .18|1.11 .6|1.47 2.2|K .42| 42.6 43.7| Q8 | | 7 114 47 -1.05 .22|1.40 1.6|1.47 1.6|L .43| 57.4 56.4| Q7 | | 45 92 47 -.20 .18|1.36 1.8|1.44 2.0|M .31| 38.3 43.7| Q45 | | 56 71 47 .46 .17|1.28 1.5|1.41 2.0|N .42| 36.2 40.9| Q56 | ------------------------------------------------------------------------------------

Note. Not all the item statistics are listed above; only the first few misfitting items are shown.

Item fit for the model was determined by: (1) point measure correlations, and (2) infit and outfit measures. A point measure correlation below .15 indicates a potentially misfitting item, and values between .3 and .5 are preferable. Point measure correlations for this model were acceptable, as all items on the scale had point measure correlations above .15.

Infit and outfit measures were also examined for all the items. No definitive rules exist regarding what is considered acceptable and unacceptable fit. Some suggestions for acceptable fit are as follows: (1) a mean square infit or outfit between .6 and 1.4, (2) a mean square infit or outfit between .8 and 1.2 (Bode & Wright, 1999), (3) a mean square less than 1.3 for samples of less than 500, less than 1.2 for samples of 500 to 1,000, and less than 1.1 if the sample is greater than 1,000 (Smith, Schumacker, & Bush, 1995), (4) standardized fit (i.e., infit or outfit) between -2 and +2, (5) standardized fit between -3 and +2, and (6) standardized fit less than 2 (Smith, 1992). Based on the above information, infit and outfit mean square values of less than 2 were considered acceptable for the current study, especially given its exploratory nature.

Infit measures for all items on this measure were less than 2 and were therefore acceptable. Most outfit measures for items on the scale were less than 2 as well. One item (Question 48: In a given quarter/semester, how often do you use DORA results/reports to help the low-achieving students with their reading performance?) had an outfit value larger than 2 (i.e., 3.76). Seventy-four percent of the sample (n = 35) endorsed the highest category on the scale for this item; thus, this item was very easy to endorse (item measure value = -1.68), which could account for the high outfit value. This item was grouped with items 46 and 47 in the scale as being very similar in content and wording, and unsurprisingly, these were the next most misfitting items. Question 47 (In a given quarter/semester, how often do you use DORA results/reports to help the high-achieving students with their reading performance?) had an outfit value near 2 (i.e., 1.99), and Question 46 (In a given quarter/semester, how often do you use DORA results/reports to help all students with their reading performance?) had an outfit value near 2 as well (i.e., 1.76). These items had item measures of .30 and -.91, respectively.

These items were then examined to see whether their average measure per response category changed monotonically. If each item worked the way it should (analogous to a magnified item-total correlation), there should be no asterisk symbols next to the average measure values (see Table 37 below); when an item does not behave as expected, an asterisk is shown, and this change in rank can lead to misfit. Items 48 and 47 were marked with an asterisk and thus did not perform as they should. The average measure logit should increase with every response category, but for item 48 it did not (i.e., 2.07 to -1.18 to -.18 to .70), and the same was found for Question 47. Thus, all three of these items were removed in the next run of the data in an attempt to improve model fit.
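Expressed as a screening rule, the criteria adopted above (mean square infit/outfit below 2 and point measure correlations above .15) flag only Question 48. The values below are transcribed from Table 36; Questions 46 and 47 were removed because of their conceptual grouping with Question 48 and their near-2 outfit values, not by the numeric cut-offs alone.

```python
# Diagnostics for the most misfitting items, transcribed from Table 36.
items = {
    "Q48": {"infit": 1.13, "outfit": 3.76, "ptmea": 0.32},
    "Q47": {"infit": 1.88, "outfit": 1.99, "ptmea": 0.34},
    "Q46": {"infit": 1.39, "outfit": 1.76, "ptmea": 0.38},
    "Q5":  {"infit": 1.68, "outfit": 1.63, "ptmea": 0.27},
}

flagged = [
    name for name, s in items.items()
    if s["infit"] >= 2.0 or s["outfit"] >= 2.0 or s["ptmea"] < 0.15
]
print("Flagged by the numeric cut-offs:", flagged)  # ['Q48']
```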


Table 37 Item Category/Option Frequencies: Misfit Order (56 Measured Items) Winsteps Output
--------------------------------------------------------------------|ENTRY DATA SCORE | DATA | AVERAGE S.E. OUTF PTMEA| | |NUMBER CODE VALUE | COUNT % | MEASURE MEAN MNSQ CORR.| ITEM | |--------------------+------------+--------------------------+------| | 48 A 0 0 | 1 2 | 2.07 10.0 .30 |Q48 | | 1 1 | 2 4 | -1.18* .46 .2 -.45 | | | 2 2 | 9 19 | -.18* .17 .4 -.41 | | | 3 3 | 35 74 | .70* .10 .9 .47 | | | | | | | | 47 B 0 0 | 13 28 | .23 .27 2.1 -.20 |Q47 | | 1 1 | 8 17 | -.01* .26 .7 -.28 | | | 2 2 | 10 21 | .74 .19 .7 .17 | | | 3 3 | 16 34 | .77 .16 1.3 .26 | | | | | | | | 46 C 0 0 | 3 6 | -.25 .70 1.8 -.24 |Q46 | | 1 1 | 5 11 | .01 .52 2.6 -.21 | | | 2 2 | 11 23 | .34 .16 .7 -.10 | | | 3 3 | 28 60 | .70 .13 1.1 .34 | | ---------------------------------------------------------------------


Note. Not all the item category frequencies are listed above. Only the first three items are shown.
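The asterisk convention in Table 37 amounts to a monotonicity check on the average person measure within each response category. A minimal version of that check, using the category averages reported for item 48 in Table 37, is sketched below.

```python
def categories_monotonic(avg_measures):
    """True if the average person measure rises with each successive response
    category, the expected pattern for a well-behaved rating scale item."""
    return all(a < b for a, b in zip(avg_measures, avg_measures[1:]))

# Average measures by category (0-3) for item 48, from Table 37.
q48_averages = [2.07, -1.18, -0.18, 0.70]
print(categories_monotonic(q48_averages))  # False: the progression breaks after category 0
```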

The map of persons and items is displayed below (see Figure 43). The distribution of person positions is on the left side of the vertical line, and the items are on the right. Each "X" represents one person in this figure; "M" marks the person and item mean, "S" is one standard deviation from the mean, and "T" is two standard deviations from the mean. To determine variability, item measure values were investigated using the item/person map for this model, and the degree to which these items are targeted at the teachers was examined. As seen below, the scale appeared to be applicable for its purposes. The items were approximately normally distributed (although slightly leptokurtic), with a few items separated from the others at the far ends of the scale. Unfortunately, the majority of the items were compressed between -.5 and .5 on the scale.

Additionally, a few large gaps between the items were apparent on the variable map. For example, items 5 and 36 were both at the more positive end of the scale (i.e., harder items), and at the negative end of the scale (i.e., easier items) were questions 48, 2, and 54. The harder/more positive side of the vertical ruler reflects items that challenge even the most able teachers (i.e., items that are harder to endorse even for teachers with high use of online formative assessment in their classrooms), and the easier/more negative side reflects items that even the least able teachers can endorse (i.e., items that are easy even for teachers with the least use of online formative assessment in their classrooms). Those at the upper end of the scale agreed with more items and agreed more strongly (i.e., Sometimes or Almost Always). There were numerous persons whose positions were above where the items were measuring. As shown in the map, most of the items fell within a range of about -.5 to .5 logits in difficulty, which was narrower than the range of about -.5 to 1.2 for persons. This indicates that easier and harder items may need to be added in future studies to extend the range of the trait measured. Also, at four points on the scale there were six items at the same position. Usually, one or two could be dropped due to redundancy, but in this case these items may belong together as a family (e.g., items 9-16 and items 17-24) or a subscale.
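For readers unfamiliar with such person-item (Wright) maps, the sketch below shows how one can be assembled once person and item measures are on the same logit scale. The routine and the short arrays of measures are purely illustrative assumptions, not the OFAS estimates.

```python
def wright_map(person_measures, item_measures, hi=2.0, lo=-2.0, step=0.5):
    """Print a crude text Wright map: persons (X's) to the left of the vertical
    line, item labels to the right, binned by logit position."""
    top = hi
    while top >= lo:
        persons = sum(1 for p in person_measures if top - step < p <= top)
        items = [name for name, m in item_measures.items() if top - step < m <= top]
        print(f"{top:5.1f} {'X' * persons:<8}| {' '.join(items)}")
        top -= step

# Illustrative measures only.
persons = [1.4, 1.1, 0.9, 0.6, 0.4, 0.3, 0.1, -0.2, -0.4, -0.9]
items = {"item_a": 1.6, "item_b": 0.9, "item_c": 0.4,
         "item_d": 0.0, "item_e": -0.4, "item_f": -1.4}
wright_map(persons, items)
```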


[Figure 43 graphic: Winsteps person-item map (Wright map) for the 56-item survey. Person positions (each X is one person) appear to the left of the vertical line and item numbers (Q1-Q56) to the right, on a logit scale running from roughly -2 to +2, with M, S, and T marking the mean and one and two standard deviations.]

Figure 43. The map of persons and items for the 56-item survey. The distribution of person positions is on the left side of the vertical line and items on the right. The scale appears to be applicable for its purposes.

53-Item Survey. As mentioned earlier, a few items in the original scale appeared to have potentially bad fit (i.e., MNSQ > 1.5). These items (48, 47, and 46) were removed to improve the fit, and the new 53-item scale was run in Winsteps and examined further. The program converged in five iterations (i.e., one fewer than in the previous run); fewer iterations suggest better convergence. For persons, separation was 4.01 for the data at hand (i.e., real) and 4.49 when the data are assumed to have no misfit to the model (i.e., model), a slight improvement over the first run with 56 items. Item separation for the present case was 2.96, a smaller continuum than for persons and slightly smaller than in the first run of the data. Overall, the model had an item separation value greater than 2 (i.e., 2.96), which shows that true variability among items was much larger than the error variability; a larger item separation would still be preferable.

The person separation reliability estimate for these data was .94, the same as for the 56-question survey (see Tables 38 and 39 below). Item reliability was .90, which was, again, the same as before, and the model item reliability had a value of .91, indicating that 91% of the variability among items was due to real item variance; this is somewhat smaller than in the first run of the data. The person mean in this 53-question model was .46, which suggests that these items, on average, were easy to agree with. This was slightly smaller than before, which is desirable (i.e., converging on 0). Cronbach's Alpha was again .95.

The mean infit and outfit mean squares for persons and items were 1.01 and 1.04, respectively. The mean standardized infit and outfit were expected to be 0.0; here they were -.4 and -.2 for persons, and -.1 and .0 for items, which is nearly the same as before. The standard deviations of the standardized infit of persons and items were approximately the same as before. The data therefore evidence acceptable fit overall. However, the overall Chi-Square test for this model was again significant, indicating that the Rasch model does not fit these items well (χ² = 5443.02, df = 2,391, p < .001).
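Schematically, the process followed across the 56-, 53-, 50-, and 47-item runs is an iterative refinement loop: estimate the model, inspect the misfit diagnostics, remove the worst-behaving items, and re-estimate. The sketch below captures that loop in outline only; fit_rasch is a hypothetical stand-in for a Winsteps run that returns per-item outfit mean squares, and, as described above, the actual decisions also weighed conceptual grouping rather than relying on the numeric cut-off alone.

```python
def refine_scale(item_names, responses, fit_rasch, max_outfit=2.0, max_rounds=5):
    """Iteratively drop items whose outfit mean square meets or exceeds the
    cut-off, re-estimating after each removal. `fit_rasch` is assumed to
    return a dict mapping item name -> outfit MNSQ for the current item set."""
    for _ in range(max_rounds):
        outfit = fit_rasch(item_names, responses)
        misfits = [name for name, mnsq in outfit.items() if mnsq >= max_outfit]
        if not misfits:
            break
        item_names = [name for name in item_names if name not in misfits]
    return item_names
```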

Table 38 Summary of 47 Measured Persons (53 Measured Items) Winsteps Output


------------------------------------------------------------------------------| RAW MODEL INFIT OUTFIT | | SCORE COUNT MEASURE ERROR MNSQ ZSTD MNSQ ZSTD | |-----------------------------------------------------------------------------| | MEAN 94.3 53.0 .46 .18 1.01 -.4 1.04 -.2 | | S.D. 27.1 .0 .83 .03 .51 2.7 .54 2.5 | | MAX. 148.0 53.0 2.64 .31 2.42 6.4 2.73 6.1 | | MIN. 27.0 53.0 -1.64 .16 .28 -6.2 .30 -5.8 | |-----------------------------------------------------------------------------| | REAL RMSE .20 ADJ.SD .81 SEPARATION 4.01 PERSON RELIABILITY .94 | |MODEL RMSE .18 ADJ.SD .81 SEPARATION 4.49 PERSON RELIABILITY .95 | | S.E. OF PERSON MEAN = .12 | ------------------------------------------------------------------------------PERSON RAW SCORE-TO-MEASURE CORRELATION = .99 CRONBACH ALPHA (KR-20) PERSON RAW SCORE RELIABILITY = .95

Table 39 Summary of 53 Measured Items Winsteps Output


------------------------------------------------------------------------------| RAW MODEL INFIT OUTFIT | | SCORE COUNT MEASURE ERROR MNSQ ZSTD MNSQ ZSTD | |-----------------------------------------------------------------------------| | MEAN 83.6 47.0 .00 .19 1.01 -.1 1.04 .0 | | S.D. 18.0 .0 .62 .02 .32 1.7 .39 1.8 | | MAX. 120.0 47.0 1.58 .24 1.72 3.0 1.96 3.9 | | MIN. 37.0 47.0 -1.43 .18 .53 -2.9 .53 -2.9 | |-----------------------------------------------------------------------------| | REAL RMSE .20 ADJ.SD .59 SEPARATION 2.96 ITEM RELIABILITY .90 | |MODEL RMSE .19 ADJ.SD .59 SEPARATION 3.18 ITEM RELIABILITY .91 | | S.E. OF ITEM MEAN = .09 | ------------------------------------------------------------------------------UMEAN = .000 USCALE = 1.000 ITEM RAW SCORE-TO-MEASURE CORRELATION = -1.00 2491 DATA POINTS. LOG-LIKELIHOOD CHI-SQUARE: 5443.02 with 2391 d.f. p=.0000


After running the 53-question OFAS, a few other items were screened for potential removal (i.e., items 53, 50, and 45). Question 53 (i.e., In a given quarter/semester, how often do you compare individual student results with the rest of the class?) had an outfit value of 1.96. Question 50 (i.e., In a given quarter/semester, how often do you compare the results with your other content-related classroom quiz/test/exam results (i.e., quizzes/tests/exams that you have constructed)?) had an outfit value of 1.86, and Question 45 (i.e., In a given quarter/semester, how often do you link the results to your course standards and/or objectives?) had an outfit value of 1.75. Questions 53 and 50 are grouped together conceptually with each other and with items 51 and 52; they all ask about comparing the results to other groups or standards (see Table 40 below).

Table 40 Item Statistics: Misfit Order (53 Measured Items) Winsteps Output
------------------------------------------------------------------------------------|ENTRY TOTAL MODEL| INFIT | OUTFIT |PTMEA|EXACT MATCH| | |NUMBER SCORE COUNT MEASURE S.E. |MNSQ ZSTD|MNSQ ZSTD|CORR.| OBS% EXP%| ITEM | |------------------------------------+----------+----------+-----+-----------+------| | 50 87 47 -.08 .18|1.64 3.0|1.96 3.9|A .25| 29.8 44.3| Q53 | | 47 88 47 -.12 .18|1.48 2.3|1.86 3.6|B .33| 31.9 43.8| Q50 | | 45 92 47 -.25 .18|1.43 2.1|1.75 3.1|C .27| 38.3 44.5| Q45 | | 4 86 47 -.05 .18|1.64 3.0|1.72 3.1|D .28| 36.2 43.8| Q4 | | 5 37 47 1.58 .20|1.72 2.9|1.66 2.5|E .29| 38.3 49.5| Q5 | | 51 119 47 -1.38 .24|1.13 .6|1.67 1.9|F .48| 63.8 64.1| Q54 | | 38 84 47 .01 .18|1.55 2.7|1.61 2.8|G .35| 34.0 43.7| Q38 | | 34 71 47 .42 .18|1.56 2.7|1.53 2.5|H .36| 29.8 42.4| Q34 | | 8 91 47 -.22 .18|1.14 .8|1.48 2.2|I .43| 42.6 44.5| Q8 | | 7 114 47 -1.12 .22|1.42 1.7|1.47 1.6|J .42| 55.3 56.4| Q7 | | 53 71 47 .42 .18|1.32 1.7|1.43 2.1|K .42| 38.3 42.4| Q56 | | 1 111 47 -.98 .21|1.40 1.7|1.34 1.3|L .40| 44.7 53.8| Q1 | -------------------------------------------------------------------------------------

Note. Not all the item statistics are listed above; only the first few misfitting items are shown.


Item fit for this model was again determined by: (1) point measure correlations, and (2) infit and outfit measures. Point measure correlations for this model were acceptable, as all items on the scale had correlations above .15; however, the next few most misfitting items had lower point measure correlations (around .25), whereas most correlations fell between .3 and .5, which is desirable. Infit and outfit measures were also examined for all the items, and again, infit and outfit values of less than 2 were considered acceptable for the current study. Infit measures for all items on this measure were less than 2 and were therefore acceptable, and all outfit measures were less than 2 as well. For the items in question, there was little discrimination along the response scale: nearly equal numbers of people endorsed the middle two categories (i.e., Rarely and Sometimes), and slightly more endorsed the highest category (i.e., Almost Always; see Table 41 below). Removing more items might further improve the person and item separation, as separation was not markedly improved by removing items 46 through 48.


Table 41 Item Category/Option Frequencies: Misfit Order (53 Measured Items) Winsteps Output
--------------------------------------------------------------------|ENTRY DATA SCORE | DATA | AVERAGE S.E. OUTF PTMEA| | |NUMBER CODE VALUE | COUNT % | MEASURE MEAN MNSQ CORR.| ITEM | |--------------------+------------+--------------------------+------| | 50 A 0 0 | 6 13 | .08 .21 1.4 -.17 |Q53 | | 1 1 | 12 26 | .22 .35 3.2 -.16 | | | 2 2 | 12 26 | .61 .20 .9 .11 | | | 3 3 | 17 36 | .64 .16 1.4 .17 | | | | | | | | 47 B 0 0 | 6 13 | -.18 .26 1.2 -.29 |Q50 | | 1 1 | 11 23 | .38 .36 3.7 -.05 | | | 2 2 | 13 28 | .44 .22 1.2 -.01 | | | 3 3 | 17 36 | .74 .14 1.1 .26 | | | | | | | | 45 C 0 0 | 3 6 | .06 .25 1.4 -.12 |Q45 | | 1 1 | 13 28 | .23 .28 2.6 -.17 | | | 2 2 | 14 30 | .42 .17 .8 -.03 | | | 3 3 | 17 36 | .73 .21 1.5 .25 | | ---------------------------------------------------------------------


Note. Not all the item category frequencies are listed above. Only the first three items are shown.

50-Item Survey (Final Survey). As outlined above, items 53, 50, and 45 were removed to potentially improve the fit. This new 50-item scale was run in Winsteps and examined further. The program converged in four iterations (i.e., one fewer than the 53-item scale). For persons, separation was 4.08 for the data at hand and 4.62 when the data are assumed to have no misfit to the model, which was an improvement over the second run with the 53-item survey. Item separation for the present 50-question scale was 3.09, slightly larger than in the previous run and the same value as in the original run of the data. Overall, the model had a separation value of greater than 2 (i.e., 3.09 for real and 3.32 for model); a larger item separation would still be preferable (see Tables 42 and 43 below).

The person separation reliability estimate for these data was .94, the same as in the first two runs. Item reliability was .91, approximately the same as before, and the model item reliability had a value of .92, indicating that 92% of the variability among items was due to real item variance. Cronbach's Alpha was again .95.

Table 42 Summary of 47 Measured Persons (50 Measured Items) Winsteps Output


------------------------------------------------------------------------------| RAW MODEL INFIT OUTFIT | | SCORE COUNT MEASURE ERROR MNSQ ZSTD MNSQ ZSTD | |-----------------------------------------------------------------------------| | MEAN 88.6 50.0 .47 .19 1.00 -.4 1.02 -.2 | | S.D. 26.2 .0 .92 .04 .52 2.6 .55 2.5 | | MAX. 145.0 50.0 3.48 .46 2.45 6.2 2.91 5.9 | | MIN. 24.0 50.0 -1.76 .17 .30 -5.7 .31 -5.4 | |-----------------------------------------------------------------------------| | REAL RMSE .22 ADJ.SD .89 SEPARATION 4.08 PERSON RELIABILITY .94 | |MODEL RMSE .19 ADJ.SD .90 SEPARATION 4.62 PERSON RELIABILITY .96 | | S.E. OF PERSON MEAN = .14 | ------------------------------------------------------------------------------PERSON RAW SCORE-TO-MEASURE CORRELATION = .98 CRONBACH ALPHA (KR-20) PERSON RAW SCORE RELIABILITY = .95

Table 43 Summary of 50 Measured Items Winsteps Output


------------------------------------------------------------------------------| RAW MODEL INFIT OUTFIT | | SCORE COUNT MEASURE ERROR MNSQ ZSTD MNSQ ZSTD | |-----------------------------------------------------------------------------| | MEAN 83.3 47.0 .00 .19 1.00 -.1 1.02 -.1 | | S.D. 18.4 .0 .66 .02 .32 1.6 .38 1.7 | | MAX. 120.0 47.0 1.62 .24 1.80 3.2 1.77 3.1 | | MIN. 37.0 47.0 -1.49 .18 .54 -3.0 .53 -2.8 | |-----------------------------------------------------------------------------| | REAL RMSE .20 ADJ.SD .63 SEPARATION 3.09 ITEM RELIABILITY .91 | |MODEL RMSE .19 ADJ.SD .63 SEPARATION 3.32 ITEM RELIABILITY .92 | | S.E. OF ITEM MEAN = .09 | ------------------------------------------------------------------------------UMEAN = .000 USCALE = 1.000 ITEM RAW SCORE-TO-MEASURE CORRELATION = -1.00 2350 DATA POINTS. LOG-LIKELIHOOD CHI-SQUARE: 5048.94 with 2253 d.f. p=.0000


As shown in the tables above, the mean infit and outfit mean squares for both persons and items were 1.00 and 1.02, respectively. The mean standardized infit and outfit are expected to be 0.0; here they were -.4 and -.2 for persons, and -.1 and -.1 for items, which is essentially the same as before. The standard deviations of the standardized infit of persons and items were also approximately the same as before. The data evidence acceptable fit overall. However, as in the previous runs of the data, the overall Chi-Square test for this model was significant, indicating that the Rasch model does not fit these items well (χ² = 5048.94, df = 2,253, p < .001).

The table below (see Table 44) contains information about how the response scale was used. The step logit position is where a step marks the transition from one rating scale category to the next. "Observed Average" is the average of the person logit positions in the category; it should increase with the category value, and the current model demonstrated this. For example, persons responding with a 0 had an average measure (-.63) lower than those responding with a 1 (average measure = -.18). There was no misfit for the categories, as the misfit indices for all the categories were below 1.5. Sample expected values should not be highly discrepant from the observed averages, and for these data they were not. Infit and outfit mean squares were each expected to equal 1.0, and they were close to this value. The step calibration is the calibrated logit difficulty of the step; the transition points between one category and the next are the step calibration values in the table. These values are expected to increase with the category value, which they did in the current model. The step standard errors were all low (i.e., .07, .05, and .05).

Table 44 Summary of Category Structure (50 Measured Items) Winsteps Output


------------------------------------------------------------------|CATEGORY OBSERVED|OBSVD SAMPLE|INFIT OUTFIT||STRUCTURE|CATEGORY| |LABEL SCORE COUNT %|AVRGE EXPECT| MNSQ MNSQ||CALIBRATN| MEASURE| |-------------------+------------+------------++---------+--------| | 0 0 314 13| -.63 -.75| 1.15 1.25|| NONE |( -2.36)| | 1 1 607 26| -.18 -.07| .88 .98|| -1.06 | -.66 | | 2 2 731 31| .61 .58| .82 .75|| .07 | .68 | | 3 3 698 30| 1.37 1.37| 1.02 1.03|| .99 |( 2.32)| --------------------------------------------------------------------------------------------------------------------------------------------|CATEGORY STRUCTURE | SCORE-TO-MEASURE | 50% CUM.| COHERENCE|ESTIM| | LABEL MEASURE S.E. | AT CAT. ----ZONE----|PROBABLTY| M->C C->M|DISCR| |------------------------+---------------------+---------+----------+-----| | 0 NONE |( -2.36) -INF -1.57| | 61% 15%| | | 1 -1.06 .07 | -.66 -1.57 .02| -1.31 | 46% 55%| .81| | 2 .07 .05 | .68 .02 1.56| .03 | 43% 71%| 1.15| | 3 .99 .05 |( 2.32) 1.56 +INF | 1.28 | 76% 39%| 1.03| --------------------------------------------------------------------------M->C = Does Measure imply Category? C->M = Does Category imply Measure?


Probability curves were examined next; these curves display the likelihood of category selection (Y-axis) by the person-minus-item measure (X-axis). As shown in the figure below (see Figure 44), all categories were utilized, meaning that each category value was the most likely at some point on the continuum, and there were no category inversions in which a higher category was more likely at a lower point than a lower category. Thus, all categories were being used and were behaving according to expectation.


[Figure 44 graphic: Winsteps probability curve plot ("CATEGORY PROBABILITIES: MODES - Structure measures at intersections") for the 50-item survey, showing the probability of response (0-1) for categories 0-3 against the person-minus-item measure from -3 to +3 logits.]

Figure 44. Category probabilities (i.e., probability curves) indicating the probability of a response for the 50-item survey. These curves display the likelihood of category selection (Y-axis) by the person-minus-item measure (X-axis). All categories in the above figure are being used according to expectation.

After items 53, 50, and 45 were removed, item fit for the model was again determined by: (1) point measure correlations, and (2) infit and outfit measures. Point measure correlations for this model were acceptable, as all items on the scale had point measure correlations above .15, and most correlations were between .3 and .5 (see Table 45 below). Infit and outfit measures were also examined for all the items; again, values of less than 2 were considered acceptable. Infit measures for all items on this measure were less than 2 and were therefore acceptable, and all outfit measures were less than 2 as well. Overall, for this 50-question survey, this iteration produced the lowest infit and outfit statistics, better separation, and improved reliability.

Table 45 Item Statistics: Misfit Order (50 Measured Items) Winsteps Output
------------------------------------------------------------------------------------|ENTRY TOTAL MODEL| INFIT | OUTFIT |PTMEA|EXACT MATCH| | |NUMBER SCORE COUNT MEASURE S.E. |MNSQ ZSTD|MNSQ ZSTD|CORR.| OBS% EXP%| ITEM | |------------------------------------+----------+----------+-----+-----------+------| | 5 37 47 1.62 .20|1.80 3.2|1.76 2.8|A .30| 38.3 51.3| Q5 | | 48 119 47 -1.43 .24|1.13 .6|1.77 2.1|B .47| 61.7 64.2| Q54 | | 4 86 47 -.06 .18|1.69 3.2|1.74 3.1|C .30| 36.2 44.8| Q4 | | 38 84 47 .01 .18|1.63 3.0|1.72 3.0|D .34| 34.0 44.9| Q38 | | 47 52 47 1.06 .19|1.32 1.6|1.65 2.8|E .31| 42.6 45.7| Q52 | | 8 91 47 -.23 .19|1.22 1.2|1.61 2.5|F .40| 42.6 45.4| Q8 | | 34 71 47 .43 .18|1.57 2.8|1.55 2.5|G .38| 29.8 43.3| Q34 | | 50 71 47 .43 .18|1.41 2.1|1.53 2.4|H .41| 38.3 43.3| Q56 | | 7 114 47 -1.16 .22|1.47 1.9|1.51 1.6|I .41| 55.3 58.1| Q7 | | 46 56 47 .92 .18|1.18 1.0|1.49 2.2|J .39| 36.2 44.4| Q51 | | 14 85 47 -.03 .18|1.20 1.1|1.45 2.0|K .49| 46.8 44.9| Q14 | | 1 111 47 -1.02 .21|1.44 1.9|1.36 1.3|L .40| 44.7 54.8| Q1 | | 49 65 47 .62 .18|1.35 1.8|1.29 1.4|M .39| 42.6 43.1| Q55 | | 3 52 47 1.06 .19|1.34 1.7|1.33 1.5|N .41| 38.3 45.7| Q3 | | 35 94 47 -.34 .19|1.24 1.3|1.15 .8|O .49| 48.9 45.3| Q35 | | 45 81 47 .10 .18|1.14 .8|1.24 1.2|P .42| 42.6 44.3| Q49 | | 37 102 47 -.64 .20|1.21 1.1|1.23 1.0|Q .37| 46.8 47.9| Q37 | | 44 78 47 .20 .18|1.21 1.2|1.13 .7|R .55| 42.6 43.5| Q44 | -------------------------------------------------------------------------------------

Note. Not all the item statistics are listed above; only the first few misfitting items are shown.

If questions 53 and 50 are removed, then perhaps questions 49, 51, and 52 should be removed as well. Although these items were not misfitting in the last iteration, and did not even have the next highest outfit values, they go together conceptually with questions 53 and 50, as does question 49 (i.e., the Using the Results/Reports section). Question 51 asks, "In a given quarter/semester, how often do you compare your classroom results with the school district?" and Question 52 asks, "In a given quarter/semester, how often do you compare individual student results with the school district?" These items appeared to be positively skewed, with the majority of the respondents indicating that they never or rarely did these things (see Table 46 below). The asterisk symbol was also present, indicating that the progression of average measure logits was not as expected (i.e., from lowest to highest). Question 49 was similar (i.e., "In a given quarter/semester, how often do you use the results linking state standards for reading to DORA scores?"). Thus, for comparison purposes, these items were removed in the following iteration to examine whether model fit improves.

Table 46 Item Category/Option Frequencies: Misfit Order (50 Measured Items) Winsteps Output
---------------------------------------------------------------------
|ENTRY   DATA  SCORE |    DATA    | AVERAGE  S.E.  OUTF PTMEA|       |
|NUMBER  CODE  VALUE |  COUNT   % | MEASURE  MEAN  MNSQ CORR.| ITEM  |
|--------------------+------------+--------------------------+-------|
|  47 E    0      0  |    14   30 |    .05    .14   1.0  -.29 |Q52   |
|          1      1  |    18   38 |    .45    .26   3.6  -.02 |      |
|          2      2  |    11   23 |    .97    .27   1.1   .30 |      |
|          3      3  |     4    9 |    .63*   .36   2.2   .05 |      |
|                    |            |                           |      |
|  46 J    0      0  |    12   26 |   -.05    .15   1.0  -.33 |Q51   |
|          1      1  |    19   40 |    .41    .26   3.7  -.05 |      |
|          2      2  |    11   23 |    .94    .21    .8   .29 |      |
|          3      3  |     5   11 |    .89*   .38   1.7   .16 |      |
|                    |            |                           |      |
|  45 P    0      0  |     5   11 |    .02    .21   1.3  -.17 |Q49   |
|          1      1  |    14   30 |   -.08*   .21    .8  -.39 |      |
|          2      2  |    17   36 |    .76    .23   2.0   .24 |      |
|          3      3  |    11   23 |    .91    .26   1.3   .27 |      |
---------------------------------------------------------------------

Note. Not all the item category frequencies are listed above. Only the first three items are shown.

Finally, the map of persons and items was examined (see Figure 45 below). To determine variability, item measure values were investigated using the item/person map for this model, and the degree to which these items are targeted at the teachers was examined. As seen below, the scale appeared to be applicable for its purposes. The items were approximately normally distributed, although slightly leptokurtic, with some items separated from the others at the far ends of the scale. Many items congregated around 0 on the map, which is an indication of redundancy; this was expected, as some sections of the scale have very similar content and wording. Unfortunately, a few large gaps between the items remained on the variable map (e.g., between questions 3 and 36). In addition, there were some teachers whose positions were above where the items were measuring. As shown in the map, the items covered a range of about -1.6 to 1.6 logits in difficulty, which was narrower than the range of about -1.6 to 3.4 for persons. This indicates the presence of a teacher or teachers who endorsed the highest possible category for the majority of questions, and it suggests that harder items may need to be added in future studies to extend the range of the trait measured.

Because this is the final version of the measure that will be used in the analysis of Research Question 3, basic descriptive statistics for the scale are presented for the final analysis sample (N = 47). The total possible score on the measure is 150. The mean of the 50-question OFAS in the final analysis sample was 88.57 (SD = 26.50), and the range was 121, with a minimum score of 24 and a maximum score of 145. Histograms and skewness and kurtosis statistics revealed an approximately normal distribution.
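A brief sketch of how these scale-level descriptives follow from the retained items, assuming the 0-3 item responses sit in a data frame with one row per respondent; the file name and column labels are hypothetical, and the six columns dropped correspond to the items removed above.

```python
import pandas as pd
from scipy.stats import kurtosis, skew

df = pd.read_csv("ofas_items.csv")          # hypothetical: columns q1..q56, scored 0-3
dropped = ["q45", "q46", "q47", "q48", "q50", "q53"]
retained = [c for c in df.columns if c.startswith("q") and c not in dropped]

total = df[retained].sum(axis=1)            # 50-item OFAS total; 150 points possible
print(total.mean(), total.std(), total.min(), total.max())
print(skew(total), kurtosis(total))         # screen for approximate normality
```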


[Figure 45 graphic: Winsteps person-item map (Wright map) for the 50-item survey. Person positions (each X is one person) appear to the left of the vertical line and item numbers to the right, on a logit scale running from roughly -2 to +4, with M, S, and T marking the mean and one and two standard deviations.]

Figure 45. The map of persons and items for the 50-item survey. The distribution of person positions is on the left side of the vertical line and items on the right. The scale appears to be applicable for its purposes.

47-Item Survey (A Comparison). Items 51, 52, and 49 were removed to potentially improve the fit. This new 47-item scale was run in Winsteps and examined further. One initial concern was that this iteration produced an extreme person, which is not ideal; this alone, without even examining the infit and outfit statistics, suggests that these items should be retained in the final survey. For persons (N = 46; i.e., only non-extreme persons), real and model separation decreased compared to the previous models, and when the extreme person was included (N = 47), separation decreased considerably compared to the other versions of the survey. Item separation also decreased slightly from the previous runs (see Tables 47 through 49 below).

For the non-extreme persons, the person separation reliability estimate for these data was .93, which was lower than in the previous runs, and the model reliability had a value of .95. For the extreme and non-extreme persons combined, the person separation reliability estimate was .92, which was the same as the model reliability and the same as in the first run of the data. Cronbach's Alpha was again .95.

The mean infit and outfit mean squares for persons and items were approximately 1.00 and 1.03, respectively. The mean standardized infit and outfit were expected to be 0.0; in this 47-question scale, they were -.4 and -.2 for persons, and -.1 and -.1 for items, which is basically the same as before. The standard deviations of the standardized infit of persons and items were approximately the same as before. The data evidence acceptable fit overall. However, as in the previous runs of the data, the overall Chi-Square test for this model was significant, indicating that the Rasch model does not fit these items well (χ² = 4676.71, df = 2,069, p < .001).

Table 47 Summary of 46 (Non-Extreme) Measured Persons (47 Measured Items) Winsteps Output
------------------------------------------------------------------------------| RAW MODEL INFIT OUTFIT | | SCORE COUNT MEASURE ERROR MNSQ ZSTD MNSQ ZSTD | |-----------------------------------------------------------------------------| | MEAN 83.3 47.0 .45 .19 1.00 -.4 1.03 -.2 | | S.D. 24.0 .0 .84 .02 .54 2.6 .59 2.5 | | MAX. 125.0 47.0 2.18 .27 2.53 6.2 3.17 6.0 | | MIN. 21.0 47.0 -1.87 .17 .30 -5.4 .33 -5.1 | |-----------------------------------------------------------------------------| | REAL RMSE .21 ADJ.SD .81 SEPARATION 3.79 PERSON RELIABILITY .93 | |MODEL RMSE .19 ADJ.SD .82 SEPARATION 4.24 PERSON RELIABILITY .95 | | S.E. OF PERSON MEAN = .13 | ------------------------------------------------------------------------------MAXIMUM EXTREME SCORE: 1 PERSONS

Table 48 Summary of 47 (Extreme and Non-Extreme) Measured Persons (47 Measured Items) Winsteps Output
------------------------------------------------------------------------------| RAW MODEL INFIT OUTFIT | | SCORE COUNT MEASURE ERROR MNSQ ZSTD MNSQ ZSTD | |-----------------------------------------------------------------------------| | MEAN 84.6 47.0 .57 .23 | | S.D. 25.2 .0 1.18 .24 | | MAX. 141.0 47.0 6.30 1.83 | | MIN. 21.0 47.0 -1.87 .17 | |-----------------------------------------------------------------------------| | REAL RMSE .34 ADJ.SD 1.13 SEPARATION 3.33 PERSON RELIABILITY .92 | |MODEL RMSE .33 ADJ.SD 1.14 SEPARATION 3.47 PERSON RELIABILITY .92 | | S.E. OF PERSON MEAN = .17 | ------------------------------------------------------------------------------PERSON RAW SCORE-TO-MEASURE CORRELATION = .89 CRONBACH ALPHA (KR-20) PERSON RAW SCORE RELIABILITY = .95


Table 49 Summary of 47 (Non-Extreme) Measured Items Winsteps Output


------------------------------------------------------------------------------| RAW MODEL INFIT OUTFIT | | SCORE COUNT MEASURE ERROR MNSQ ZSTD MNSQ ZSTD | |-----------------------------------------------------------------------------| | MEAN 84.6 47.0 .00 .19 1.01 -.1 1.03 -.1 | | S.D. 18.0 .0 .66 .02 .34 1.7 .39 1.8 | | MAX. 120.0 47.0 1.71 .25 1.86 3.4 1.86 3.4 | | MIN. 37.0 47.0 -1.47 .18 .53 -3.0 .53 -2.9 | |-----------------------------------------------------------------------------| | REAL RMSE .21 ADJ.SD .63 SEPARATION 3.05 ITEM RELIABILITY .90 | |MODEL RMSE .19 ADJ.SD .63 SEPARATION 3.28 ITEM RELIABILITY .92 | | S.E. OF ITEM MEAN = .10 | ------------------------------------------------------------------------------UMEAN=.000 USCALE=1.000 ITEM RAW SCORE-TO-MEASURE CORRELATION = -1.00 2162 DATA POINTS. LOG-LIKELIHOOD CHI-SQUARE: 4676.71 with 2069 d.f. p=.0000

After items 52, 51, and 49 were removed, the point measure correlations for this model were acceptable, as all items in the scale had point measure correlations above .15 (see Table 50 below), and most correlations were between .3 and .5. Infit and outfit measures were also examined for all the items. Infit measures for all items in this measure were less than 2 and were therefore acceptable, and all outfit measures for items on the scale were less than 2 as well.


Table 50 Item Statistics: Misfit Order (47 Measured Items) Winsteps Output
------------------------------------------------------------------------------------|ENTRY TOTAL MODEL| INFIT | OUTFIT |PTMEA|EXACT MATCH| | |NUMBER SCORE COUNT MEASURE S.E. |MNSQ ZSTD|MNSQ ZSTD|CORR.| OBS% EXP%| ITEM | |------------------------------------+----------+----------+-----+-----------+------| | 5 37 47 1.71 .21|1.86 3.4|1.85 2.9|A .34| 34.8 52.0| Q5 | | 45 119 47 -1.41 .24|1.13 .6|1.86 2.5|B .42| 60.9 64.0| Q54 | | 38 84 47 .05 .18|1.67 3.1|1.77 3.4|C .33| 34.8 43.7| Q38 | | 4 86 47 -.02 .18|1.71 3.2|1.76 3.3|D .30| 34.8 43.6| Q4 | | 8 91 47 -.19 .19|1.30 1.5|1.72 3.1|E .36| 39.1 45.0| Q8 | | 47 71 47 .48 .18|1.49 2.4|1.64 2.9|F .38| 37.0 42.8| Q56 | | 34 71 47 .48 .18|1.61 2.9|1.60 2.8|G .37| 30.4 42.8| Q34 | | 7 114 47 -1.14 .23|1.50 2.0|1.56 1.9|H .36| 52.2 57.2| Q7 | | 14 85 47 .02 .18|1.22 1.2|1.51 2.4|I .45| 43.5 43.6| Q14 | | 1 111 47 -.99 .22|1.46 1.9|1.42 1.6|J .36| 43.5 54.3| Q1 | | 46 65 47 .68 .18|1.38 1.9|1.34 1.7|K .39| 41.3 42.6| Q55 | | 3 52 47 1.12 .19|1.35 1.7|1.37 1.7|L .42| 39.1 45.3| Q3 | | 37 102 47 -.60 .20|1.27 1.3|1.31 1.4|M .34| 47.8 47.6| Q37 | | 44 78 47 .25 .18|1.28 1.5|1.21 1.1|N .49| 43.5 43.4| Q44 | | 35 94 47 -.30 .19|1.26 1.3|1.20 1.0|O .45| 45.7 44.8| Q35 | -------------------------------------------------------------------------------------

Note. Not all the item statistics are listed above; only the first few misfitting items are shown.

Overall, this iteration produced higher infit and outfit statistics and worse person and item separation and reliability. Thus, the decision was made to leave items 52, 51, and 49 in the measure, although they are conceptually grouped with items 53 and 50. Additionally, if item 49 is left in the measure while items 51 and 52 are removed, the outfit for item 49 increases to above 2 (i.e., 2.13). The final measure was therefore the 50-item survey presented in the second run of the data.

It was noted that the items regularly appearing as misfitting across the runs of the data were all conceptually related: they were contained in the section of the survey entitled Using the Results. Using results is a key component of formative assessment, and it is what makes formative assessment formative. This suggests that this 11-question section of the OFAS is perhaps a subscale that can be used separately or in place of the larger scale to assess teacher use of computerized/online formative assessment. This subscale will be examined in the following paragraphs.

11-Item Survey. As mentioned previously, for all the iterations in the following paragraphs, a rating scale model was run, which assumes that the response scale functions the same way across items. In the first run of the data, the program converged in seven iterations. This small number of iterations suggested good convergence, although it was not the smallest across all the versions of the survey. After convergence was assessed, person and item separation and the reliability of separation were examined. As with the 47-question OFAS, an extreme person was produced. Again, this is not ideal, but further examination of the infit and outfit diagnostics and the removal of poorly fitting items may remedy the problem. This extreme person responded with 3s (i.e., Almost Always) to all 11 questions and perhaps should be removed from further analysis. For persons (i.e., non-extreme; N = 46), separation was 1.83 for the data at hand (i.e., real) and 2.12 when the data are assumed to have no misfit to the model (i.e., model). For the extreme and non-extreme persons combined (N = 47), the real separation was 1.89 and the model separation was 2.10. Thus, with and without the extreme person, the values were approximately the same, and all of these values were considerably lower than in the previous models. As mentioned previously, if separation is 1.0 or below, the items may not have sufficient breadth in position; in that case, revision of the construct definition may be warranted, which can possibly be remedied by adding items that cover a broader range.

Item separation for the 11-question OFAS was 3.91 (real) and 4.13 (model), a larger continuum than for persons. It is typical to find larger separation values for items than for persons. This is usually a function of the data having a smaller number of items and a larger number of people. The item separation was greater than 2, which shows that true variability among items is much larger than the amount of error variability. As noted above, separation is affected by sample size, as are fit indices and error estimates. With larger sample sizes, separation tends to increase and error decreases. The person separation reliability estimate for these data is .77 (i.e., non-extreme) and .78 (i.e., nonextreme and extreme combined). The conceptual analog to person reliability is item reliability, which estimates internal consistency of persons rather than items, is .94. The model reliability was the same, indicating that 94% of the variability among items is due to real item variance. Item means were analyzed next, and it was found that the person mean (i.e., nonextreme) for the current model was .60 and the combined extreme and non-extreme results was .69, which suggests these items, on average, were easy to agree with. The persons had a higher level of the trait than the items did, and if too high, this would be an indication of the items being too easy for the sample. This was higher compared to all other person means in the previous models. Additionally, Cronbachs Alpha was .83, suggesting that the reliability measures indicate that the model fit was good; however, this was the lowest internal consistency value across all the versions of the survey. Mean infit and outfit for person (i.e., non-extreme) and item mean squares were investigated. Mean infit and outfit is expected to be 1.0, and for these data, they were 258

1.03 and 1.06 for persons and 1.02 and 1.06 for items. Related to this, the mean standardized infit and outfit are expected to be 0.0. In the current model, they were both 0.0 for persons, and .1 and .2 for items, indicating that, on average, the items fit close to expectation. A 2.0 cut-off is generally used for the standard deviation of the standardized infit as an index of overall misfit for persons and items. Both persons (i.e., standardized infit SD = 1.3) and items (i.e., standardized infit SD = .9) showed little overall misfit, with persons showing more misfit. Here the data evidence acceptable fit overall. This is in contrast to the overall Chi-Square test for this model, which is significant, indicating that the Rasch model does not fit these items well (χ2 = 1020.94, df = 449, p = .000). All the above information is presented in the tables below (see Tables 51 through 53).
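The separation and reliability-of-separation values reported above follow directly from the root mean square error (RMSE) and the adjusted (true) standard deviation shown in the Winsteps summary tables. The short sketch below is illustrative only; it assumes the standard definitions (separation = true SD / RMSE; reliability = true variance / observed variance) and plugs in the real person values from Table 51 to show that they reproduce the reported 1.83 and .77.

# Illustrative check of separation and reliability of separation,
# assuming the standard definitions reflected in Winsteps-style output.

def separation(adj_sd, rmse):
    # Separation: ratio of "true" (error-adjusted) spread to average error.
    return adj_sd / rmse

def separation_reliability(adj_sd, rmse):
    # Reliability of separation: true variance over (true + error) variance.
    return adj_sd**2 / (adj_sd**2 + rmse**2)

# Real person values from Table 51 (11-item survey, non-extreme persons).
real_rmse, real_adj_sd = 0.47, 0.86

print(round(separation(real_adj_sd, real_rmse), 2))              # approximately 1.83
print(round(separation_reliability(real_adj_sd, real_rmse), 2))  # approximately 0.77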

Table 51 Summary of 46 (Non-Extreme) Measured Persons (11 Measured Items) Winsteps Output
------------------------------------------------------------------------------| RAW MODEL INFIT OUTFIT | | SCORE COUNT MEASURE ERROR MNSQ ZSTD MNSQ ZSTD | |-----------------------------------------------------------------------------| | MEAN 20.2 11.0 .60 .41 1.03 .0 1.06 .0 | | S.D. 6.2 .0 .98 .06 .58 1.3 .95 1.1 | | MAX. 30.0 11.0 2.47 .61 3.05 2.6 6.82 3.9 | | MIN. 6.0 11.0 -1.67 .37 .24 -2.8 .26 -2.3 | |-----------------------------------------------------------------------------| | REAL RMSE .47 ADJ.SD .86 SEPARATION 1.83 PERSON RELIABILITY .77 | |MODEL RMSE .42 ADJ.SD .88 SEPARATION 2.12 PERSON RELIABILITY .82 | | S.E. OF PERSON MEAN = .15 | ------------------------------------------------------------------------------MAXIMUM EXTREME SCORE: 1 PERSONS


Table 52 Summary of 47 (Extreme and Non-Extreme) Measured Persons (11 Measured Items) Winsteps Output
------------------------------------------------------------------------------| RAW MODEL INFIT OUTFIT | | SCORE COUNT MEASURE ERROR MNSQ ZSTD MNSQ ZSTD | |-----------------------------------------------------------------------------| | MEAN 20.4 11.0 .69 .44 | | S.D. 6.4 .0 1.14 .21 | | MAX. 33.0 11.0 4.85 1.84 | | MIN. 6.0 11.0 -1.67 .37 | |-----------------------------------------------------------------------------| | REAL RMSE .54 ADJ.SD 1.01 SEPARATION 1.89 PERSON RELIABILITY .78 | |MODEL RMSE .49 ADJ.SD 1.03 SEPARATION 2.10 PERSON RELIABILITY .82 | | S.E. OF PERSON MEAN = .17 | ------------------------------------------------------------------------------PERSON RAW SCORE-TO-MEASURE CORRELATION = .96 CRONBACH ALPHA (KR-20) PERSON RAW SCORE RELIABILITY = .83

Table 53 Summary of 11 (Non-Extreme) Measured Items Winsteps Output


------------------------------------------------------------------------------| RAW MODEL INFIT OUTFIT | | SCORE COUNT MEASURE ERROR MNSQ ZSTD MNSQ ZSTD | |-----------------------------------------------------------------------------| | MEAN 87.4 47.0 .00 .20 1.02 .1 1.06 .2 | | S.D. 21.9 .0 .87 .03 .18 .9 .38 1.2 | | MAX. 125.0 47.0 1.30 .27 1.28 1.3 2.16 3.3 | | MIN. 52.0 47.0 -1.69 .18 .75 -1.4 .75 -1.3 | |-----------------------------------------------------------------------------| | REAL RMSE .21 ADJ.SD .84 SEPARATION 3.91 ITEM RELIABILITY .94 | |MODEL RMSE .20 ADJ.SD .84 SEPARATION 4.13 ITEM RELIABILITY .94 | | S.E. OF ITEM MEAN = .27 | ------------------------------------------------------------------------------UMEAN=.000 USCALE=1.000 ITEM RAW SCORE-TO-MEASURE CORRELATION = -1.00 506 DATA POINTS. LOG-LIKELIHOOD CHI-SQUARE: 1020.94 with 449 d.f. p=.0000

The table below (see Table 54) contains information about how the response scale was used. The step logit position is where a step marks the transition from one rating scale category to the next. "Observed Average" is the average of logit positions modeled in the category. It should increase by category value, and the current model demonstrated

this. For example, persons responding with a 0 have an average measure (-.86) lower than those responding with a 1 (i.e., average measure = -.20). There was no misfit for the categories, as the misfit indices for all the categories were below 1.5. Sample expected values should not be highly discrepant from the observed averages, and for these data they were not. Infit and outfit mean squares were each expected to equal 1.0, and they are close to this value. Step calibration is the logit calibrated difficulty of the step, which is represented pictorially in Figure 46 below. The transition points between one category and the next are the step calibration values in the table below. These values are expected to increase with category value, which they did in the current model. The step standard errors were all low (i.e., .15, .12, and .12).

Table 54 Summary of Category Structure (11 Measured Items) Winsteps Output


------------------------------------------------------------------|CATEGORY OBSERVED|OBSVD SAMPLE|INFIT OUTFIT||STRUCTURE|CATEGORY| |LABEL SCORE COUNT %|AVRGE EXPECT| MNSQ MNSQ||CALIBRATN| MEASURE| |-------------------+------------+------------++---------+--------| | 0 0 72 14| -.86 -.91| 1.08 1.61|| NONE |( -2.32)| | 1 1 117 23| -.20 -.14| .89 .81|| -1.02 | -.64 | | 2 2 140 28| .73 .70| .87 .74|| .09 | .67 | | 3 3 177 35| 1.63 1.64| 1.09 1.06|| .92 |( 2.27)| --------------------------------------------------------------------------------------------------------------------------------------------|CATEGORY STRUCTURE | SCORE-TO-MEASURE | 50% CUM.| COHERENCE|ESTIM| | LABEL MEASURE S.E. | AT CAT. ----ZONE----|PROBABLTY| M->C C->M|DISCR| |------------------------+---------------------+---------+----------+-----| | 0 NONE |( -2.32) -INF -1.53| | 67% 26%| | | 1 -1.02 .15 | -.64 -1.53 .02| -1.27 | 39% 47%| .90| | 2 .09 .12 | .67 .02 1.52| .04 | 41% 61%| 1.08| | 3 .92 .12 |( 2.27) 1.52 +INF | 1.23 | 75% 55%| .98| --------------------------------------------------------------------------M->C = Does Measure imply Category? C->M = Does Category imply Measure?
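To make the step calibrations concrete, the sketch below computes rating scale model category probabilities at a given person-minus-item measure using the calibrations reported in Table 54 (-1.02, .09, and .92). It is a simplified illustration of how curves like those in Figure 46 arise, not a reproduction of the Winsteps algorithm.

import math

# Step calibrations (structure measures) from Table 54 for categories 1-3.
taus = [-1.02, 0.09, 0.92]

def category_probabilities(person_minus_item, taus):
    # Andrich rating scale model: cumulative sums of (theta - delta - tau_k)
    # give the log-numerators for categories 0..m (category 0 has numerator 0).
    numerators = [0.0]
    running_total = 0.0
    for tau in taus:
        running_total += person_minus_item - tau
        numerators.append(running_total)
    exps = [math.exp(value) for value in numerators]
    denominator = sum(exps)
    return [e / denominator for e in exps]

# Example: a respondent located exactly at the item's position (difference of 0);
# the four probabilities (categories 0-3) sum to 1.
print([round(p, 2) for p in category_probabilities(0.0, taus)])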


Probability curves were examined next, and these curves display the likelihood of category selection (Y-axis) by the person-minus-item measure (X-axis). If all categories are utilized, each category value will be the most likely at some point on the continuum (i.e., as shown below in Figure 46), and there will be no category inversions where a higher category is more likely at a lower point than a lower category. That was the case here: all categories were being used and were behaving according to expectation, with no inversions present.

[Figure 46 appears here: Winsteps category probability curves for the 11-item survey, plotting the probability of response (Y-axis) for categories 0 through 3 against the person-minus-item measure from -3 to +3 logits (X-axis).]

Figure 46. Category probabilities (i.e., probability curves) indicating the probability of a response for the 11-item survey. These curves display the likelihood of category selection (Y-axis) by the person-minus-item measure (X-axis). All categories in the above figure are being used according to expectation.

The table below (Table 55) contains information regarding item misfit diagnostics. Measure is the logit position of the item, with error being the standard error of measurement for the item. This is followed by the infit and outfit diagnostics, and the correlation between the item score and the measure, which should be positive. Item fit for the model was determined by: (1) point measure correlations, and (2) infit and outfit measures. A point measure correlation below .15 indicates a potentially misfitting item, and the values are preferably between .3 and .5. Point measure correlations for this model were slightly higher, as all items on the scale had point measure correlations between .4 and .7. This is reflective of the content of this subscale, which contained items specific to using the DORA results. Infit and outfit measures were also examined for all the items. Infit and outfit values of less than 2 were considered acceptable. Infit measures for all items on this measure were less than 2, and are therefore acceptable. Most outfit measures for items on the scale were less than 2 as well. One item (i.e., Question 46: In a given quarter/semester, how often do you use DORA results to help all students with their reading performance?) had an outfit value larger than 2 (i.e., 2.16). Sixty percent of the sample (n = 28) endorsed the highest category on the scale for this item. Thus, this item was very easy, which could account for the high outfit value. This item was removed in the next run of the data to see if model fit improved. This item was grouped with items 47 and 48 in the scale as being very similar in content and wording. Unsurprisingly, these items were the next highest misfitting items. Question 47 (i.e., In a given quarter/semester, how often do you use DORA

results/reports to help the high-achieving students with their reading performance?) had an outfit value of 1.21, and Question 48 (i.e., In a given quarter/semester, how often do you use DORA results to help the low-achieving students with their reading performance?) had an outfit value of .84. Although they were the next highest misfitting items, their values are not near 2. Thus, Questions 47 and 48 were retained in further analyses of this subscale, and Question 46 was removed. Question 46 was also examined to see whether it followed a monotonically changing average theta per category. If the item worked the way it should, then there should be no asterisk symbols next to the average measure values (see Table 56 below). A change in rank order can lead to misfit. Item 46 was marked with an asterisk; thus, the item did not perform as it should. For example, the logit should increase with every response category. For this item, there were problems in that the average measure logit value did not increase with each response category (i.e., -.26 to -.50 to .41 to 1.12). This was also found for Question 48.
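The screening rules described above (a point-measure correlation below .15, or an infit or outfit mean square of 2 or more) can be expressed as a simple filter. The sketch below applies those rules to three entries copied from Table 55; the cut-offs are the ones stated in the text rather than universal standards.

# Flag items using the misfit rules described in the text:
# point-measure correlation < .15, or infit/outfit mean square >= 2.0.
items = {
    "Q46": {"infit": 1.20, "outfit": 2.16, "pt_measure": 0.48},
    "Q48": {"infit": 1.28, "outfit": 0.84, "pt_measure": 0.43},
    "Q47": {"infit": 1.25, "outfit": 1.21, "pt_measure": 0.66},
}

def flag_misfit(stats, mnsq_cut=2.0, corr_cut=0.15):
    return (stats["infit"] >= mnsq_cut
            or stats["outfit"] >= mnsq_cut
            or stats["pt_measure"] < corr_cut)

for name, stats in items.items():
    print(name, "misfit" if flag_misfit(stats) else "acceptable")
# Only Q46 is flagged (outfit 2.16), matching the decision to drop it.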


Table 55 Item Statistics: Misfit Order (11 Measured Items) Winsteps Output
------------------------------------------------------------------------------------|ENTRY TOTAL MODEL| INFIT | OUTFIT |PTMEA|EXACT MATCH| | |NUMBER SCORE COUNT MEASURE S.E. |MNSQ ZSTD|MNSQ ZSTD|CORR.| OBS% EXP%| ITEM | |------------------------------------+----------+----------+-----+-----------+------| | 4 111 47 -.87 .22|1.20 .9|2.16 3.3|A .48| 56.5 57.5| Q46 | | 6 125 47 -1.69 .27|1.28 1.0| .84 -.3|B .43| 69.6 71.5| Q48 | | 5 76 47 .46 .18|1.25 1.3|1.21 1.0|C .66| 37.0 43.4| Q47 | | 11 87 47 .08 .19|1.24 1.2|1.23 1.1|D .54| 41.3 46.3| Q53 | | 2 78 47 .40 .18|1.07 .4|1.07 .4|E .63| 47.8 43.5| Q44 | | 8 88 47 .05 .19| .87 -.6| .94 -.2|F .67| 52.2 46.3| Q50 | | 10 52 47 1.30 .19| .92 -.4| .93 -.3|e .62| 50.0 47.2| Q52 | | 3 92 47 -.10 .19| .91 -.4| .88 -.5|d .62| 50.0 48.0| Q45 | | 1 115 47 -1.07 .23| .89 -.4| .78 -.7|c .48| 63.0 60.7| Q43 | | 9 56 47 1.15 .19| .87 -.7| .85 -.7|b .64| 47.8 46.2| Q51 | | 7 81 47 .29 .19| .75 -1.4| .75 -1.3|a .67| 47.8 43.9| Q49 | |------------------------------------+----------+----------+-----+-----------+------| | MEAN 87.4 47.0 .00 .20|1.02 .1|1.06 .2| | 51.2 50.4| | | S.D. 21.9 .0 .87 .03| .18 .9| .38 1.2| | 8.8 8.6| | -------------------------------------------------------------------------------------

Table 56 Item Category/Option Frequencies: Misfit Order (11 Measured Items) Winsteps Output
--------------------------------------------------------------------|ENTRY DATA SCORE | DATA | AVERAGE S.E. OUTF PTMEA| | |NUMBER CODE VALUE | COUNT % | MEASURE MEAN MNSQ CORR.| ITEM | |--------------------+------------+--------------------------+------| | 4 A 0 0 | 3 6 | -.26 1.21 6.2 -.22 |Q46 | | 1 1 | 5 11 | -.50* .18 .4 -.36 | | | 2 2 | 11 23 | .41 .13 .5 -.14 | | | 3 3 | 28 60 | 1.12 .21 1.1 .45 | | | | | | | | 6 B 0 0 | 1 2 | -.31 1.7 -.13 |Q48 | | 1 1 | 2 4 | -1.01* .25 .3 -.31 | | | 2 2 | 9 19 | .10 .16 .6 -.25 | | | 3 3 | 35 74 | .97 .20 1.2 .42 | | | | | | | | 5 C 0 0 | 13 28 | -.22 .25 1.3 -.50 |Q47 | | 1 1 | 8 17 | .21 .24 .6 -.19 | | | 2 2 | 10 21 | .73 .17 .6 .02 | | | 3 3 | 16 34 | 1.66 .27 1.0 .61 | | ---------------------------------------------------------------------


The map of persons and items is displayed below (see Figure 47). To determine variability, item measure values were investigated using the item/person map for this

model. The degree to which these items are targeted at the teachers was investigated. As seen below, the scale appeared to be applicable for its purposes. The items were approximately normally distributed, with some items separated from the others on the far ends of the scale. A group of six items was congregated around 0 on the map. Unfortunately, a few large gaps in between the items were shown on the variable map. In addition, there were numerous persons whose position was above where items were measuring. As shown in the map, the items covered a range of 1.3 to -1.7 logits in difficulty, which is slightly narrower than the range of about -1.7 to 3 for persons. Without the extreme person, the upper range for persons narrows to about 2.5 logits. Thus, this person was removed to improve model fit (i.e., along with item 46). This indicates that perhaps harder items may need to be added in future studies to extend the range of the trait measured.
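One way to quantify the large gaps noted on the variable map is to sort the item measures and inspect the spacing between adjacent items. The sketch below does this with the item measures copied from Table 55; the half-logit threshold for calling a spacing a "gap" is an arbitrary choice for illustration, not a published criterion.

# Identify gaps between adjacent item difficulties on the logit scale.
item_measures = {
    "Q48": -1.69, "Q43": -1.07, "Q46": -0.87, "Q45": -0.10, "Q50": 0.05,
    "Q53": 0.08, "Q49": 0.29, "Q44": 0.40, "Q47": 0.46, "Q51": 1.15, "Q52": 1.30,
}

def coverage_gaps(measures, threshold=0.5):
    ordered = sorted(measures.items(), key=lambda pair: pair[1])
    gaps = []
    for (low_item, low), (high_item, high) in zip(ordered, ordered[1:]):
        if high - low >= threshold:
            gaps.append((low_item, high_item, round(high - low, 2)))
    return gaps

print(coverage_gaps(item_measures))
# For example, the jump from Q47 (.46) to Q51 (1.15) shows up as a gap of .69 logits.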


[Figure 47 appears here: the Winsteps variable map (person-item map) for the 11-item survey, with person positions plotted to the left of the vertical line and items Q43 through Q53 to the right, spanning roughly -2 to +3 logits.]

Figure 47. The map of persons and items for the 11-item survey. The distribution of person positions is on the left side of the vertical line and items on the right. The scale appears to be applicable for its purposes, although an extreme person is present.

10-Item Survey (Final Survey). As mentioned previously, item 46 appeared to have potentially bad fit (i.e., MNSQ > 2.0). This item was removed along with the extreme person (i.e., teacher) from above, and the new 10-item scale was run in Winsteps and further examined. It should be noted that retaining the extreme teacher in running the 10-question OFAS (i.e., the current model presented below) did not alter the fit statistics drastically, although model fit was improved slightly by not including this teacher. In addition, simply removing the extreme teacher from the 11-question OFAS without removing item 46 did not improve the model fit, and in fact, item 46 was still targeted as a poorly fitting item (i.e., MNSQ > 2.0). Thus, it was decided to remove both item 46 and the extreme teacher. As a side note, the extreme teacher removed from the analysis sample was not from the Colorado district, and therefore, this person was not used in the analysis of Research Question 3. In the run of the 10-question OFAS, the program went through six iterations (i.e., fewer iterations suggest better convergence). This small number of iterations suggested good convergence, and was smaller compared to the 11-question OFAS. Person and item separation and reliability of separation were examined next. No extreme persons were produced in this run of the data. For persons (N = 46), separation was 1.87 for the data at hand (i.e., real), and was 2.09 when the data have no misfit to the model (i.e., model), which was slightly better than the 11-item survey. Item separation for the 10-question OFAS was 3.93 (real) and 4.17 (model), a larger continuum than for persons and slightly larger than the 11-item survey. The item separation was greater than 2, which shows that true variability among items is much larger than the amount of error variability. The

person separation reliability estimate for these data was .78, and the item reliability was .94, which was approximately the same as the 11-question OFAS. The model item reliability was .95, indicating that 95% of the variability among items is due to real item variance; this was also slightly higher than for the 11-question OFAS. Item means were analyzed next, and it was found that the person mean for the current model was .55, which suggests these items, on average, were easy to agree with. The persons had a higher level of the trait than the items did, and this mean was still higher than all other person means in the models with the full OFAS (i.e., not the current abbreviated version). Additionally, Cronbach's Alpha was .81, suggesting acceptable internal consistency; however, this was the lowest internal consistency value across all the versions of the survey, including the 11-question OFAS. Although .81 is lower than in the other models, the widely accepted social science cut-off is that Alpha should be .70 or higher for a set of items to be considered an internally consistent scale. Mean infit and outfit mean squares for persons and items were investigated. Mean infit and outfit are expected to be 1.0, and for these data, they were 1.01 and .97 for persons and 1.02 and .97 for items. Related to this, the mean standardized infit and outfit are expected to be 0.0. In the current model, they were both 0.0 for persons, and .1 and -.1 for items. A 2.0 cut-off is generally used for the standard deviation of the standardized infit as an index of overall misfit for persons and items. Both persons (i.e., standardized infit SD = 1.2) and items (i.e., standardized infit SD = 1.0) showed little overall misfit, with persons showing more misfit. Here the data evidence acceptable fit overall. This is

in contrast to the overall Chi-Square test for this model (i.e., lower than for the 11-question OFAS), which is significant, indicating that the Rasch model does not fit these items well (χ2 = 926.74, df = 404, p = .000). All the above information is presented in the tables below (see Tables 57 and 58).
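Because Cronbach's Alpha (reported by Winsteps as the KR-20 person raw score reliability) figures in the comparison of survey versions, a minimal computation sketch is given below. The small response matrix is invented purely for illustration; only the formula, k/(k - 1) times one minus the ratio of summed item variances to total score variance, reflects the statistic discussed in the text.

import numpy as np

def cronbach_alpha(responses):
    # responses: persons x items matrix of 0-3 ratings.
    responses = np.asarray(responses, dtype=float)
    k = responses.shape[1]
    item_variances = responses.var(axis=0, ddof=1)
    total_variance = responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical ratings for four respondents on three 0-3 items (illustration only).
demo = [[3, 2, 3],
        [1, 1, 2],
        [0, 1, 0],
        [2, 3, 2]]
print(round(cronbach_alpha(demo), 2))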

Table 57 Summary of 46 Measured Persons (10 Measured Items) Winsteps Output


------------------------------------------------------------------------------| RAW MODEL INFIT OUTFIT | | SCORE COUNT MEASURE ERROR MNSQ ZSTD MNSQ ZSTD | |-----------------------------------------------------------------------------| | MEAN 17.8 10.0 .55 .44 1.01 .0 .97 .0 | | S.D. 5.8 .0 1.05 .10 .52 1.2 .44 1.0 | | MAX. 29.0 10.0 3.57 1.02 2.58 2.5 2.46 2.7 | | MIN. 6.0 10.0 -1.56 .39 .22 -2.8 .23 -2.4 | |-----------------------------------------------------------------------------| | REAL RMSE .49 ADJ.SD .93 SEPARATION 1.87 PERSON RELIABILITY .78 | |MODEL RMSE .45 ADJ.SD .95 SEPARATION 2.09 PERSON RELIABILITY .81 | | S.E. OF PERSON MEAN = .16 | ------------------------------------------------------------------------------PERSON RAW SCORE-TO-MEASURE CORRELATION = .99 CRONBACH ALPHA (KR-20) PERSON RAW SCORE RELIABILITY = .81

Table 58 Summary of 10 Measured Items Winsteps Output


------------------------------------------------------------------------------| RAW MODEL INFIT OUTFIT | | SCORE COUNT MEASURE ERROR MNSQ ZSTD MNSQ ZSTD | |-----------------------------------------------------------------------------| | MEAN 82.0 46.0 .00 .20 1.02 .1 .97 -.1 | | S.D. 21.5 .0 .88 .03 .22 1.0 .20 .9 | | MAX. 122.0 46.0 1.24 .27 1.35 1.7 1.27 1.3 | | MIN. 49.0 46.0 -1.81 .19 .72 -1.6 .71 -1.5 | |-----------------------------------------------------------------------------| | REAL RMSE .22 ADJ.SD .85 SEPARATION 3.93 ITEM RELIABILITY .94 | |MODEL RMSE .20 ADJ.SD .85 SEPARATION 4.17 ITEM RELIABILITY .95 | | S.E. OF ITEM MEAN = .29 | ------------------------------------------------------------------------------UMEAN=.000 USCALE=1.000 ITEM RAW SCORE-TO-MEASURE CORRELATION = -1.00 460 DATA POINTS. LOG-LIKELIHOOD CHI-SQUARE: 926.74 with 404 d.f. p=.0000


The table below (see Table 59) contains information about how the response scale was used. The step logit position is where a step marks the transition from one rating scale category to the next. The observed average increased by category value as it should. There was no misfit for the categories, as the misfit indices for all the categories were below 1.5. Sample expected values should not be highly discrepant from the observed averages, and for these data they were not. Infit and outfit mean squares were each expected to equal 1.0, and they are close to this value. Step calibration is the logit calibrated difficulty of the step, and the transition points between one category and the next are the step calibration values in the table below. These values are expected to increase with category value, which they did in the current model. The step standard errors were all low (i.e., .15, .12, and .13).

Table 59 Summary of Category Structure (10 Measured Items) Winsteps Output


------------------------------------------------------------------|CATEGORY OBSERVED|OBSVD SAMPLE|INFIT OUTFIT||STRUCTURE|CATEGORY| |LABEL SCORE COUNT %|AVRGE EXPECT| MNSQ MNSQ||CALIBRATN| MEASURE| |-------------------+------------+------------++---------+--------| | 0 0 69 15| -.98 -.95| .98 1.06|| NONE |( -2.36)| | 1 1 112 24| -.22 -.20| .92 .84|| -1.06 | -.66 | | 2 2 129 28| .72 .64| .86 .80|| .07 | .68 | | 3 3 150 33| 1.67 1.71| 1.16 1.13|| .99 |( 2.32)| --------------------------------------------------------------------------------------------------------------------------------------------|CATEGORY STRUCTURE | SCORE-TO-MEASURE | 50% CUM.| COHERENCE|ESTIM| | LABEL MEASURE S.E. | AT CAT. ----ZONE----|PROBABLTY| M->C C->M|DISCR| |------------------------+---------------------+---------+----------+-----| | 0 NONE |( -2.36) -INF -1.57| | 63% 20%| | | 1 -1.06 .15 | -.66 -1.57 .02| -1.31 | 40% 51%| 1.08| | 2 .07 .12 | .68 .02 1.56| .03 | 41% 62%| 1.10| | 3 .99 .13 |( 2.32) 1.56 +INF | 1.28 | 75% 52%| .88| --------------------------------------------------------------------------M->C = Does Measure imply Category? C->M = Does Category imply Measure?


Probability curves were examined next, and if all categories are utilized, each category value will be the most likely at some point on the continuum (i.e., as shown below in Figure 48). This means that there should be no category inversions where a higher category is more likely at a lower point than a lower category. That was the case here: all categories were being used and were behaving according to expectation, with no inversions present.

[Figure 48 appears here: Winsteps category probability curves for the 10-item survey, plotting the probability of response (Y-axis) for categories 0 through 3 against the person-minus-item measure from -3 to +3 logits (X-axis).]

Figure 48. Category probabilities (i.e., probability curves) indicating the probability of a response for the 10-item survey. These curves display the likelihood of category selection (Y-axis) by the person-minus-item measure (X-axis). All categories in the above figure are being used according to expectation.


The table below (see Table 60) contains information regarding item misfit diagnostics. A point measure correlation below .15 indicates a potentially misfitting item, and the values are preferably between .3 and .5. Point measure correlations for this model were slightly higher, as all items on the scale had point measure correlations between .4 and .7. These were the same point measure correlations observed in the 11-question OFAS. Again, this is reflective of the content of this subscale, which contained items specific to using the DORA results. Infit measures for all items on this measure were less than 2, and are therefore acceptable. All outfit measures for items on the scale were less than 2 as well. This was in contrast to the 11-question OFAS, where the removed question (i.e., Item 46) had an outfit measure of greater than 2. Thus, all items in this 10-question scale appeared to fit. Question 48 was examined because this item did not follow a monotonically changing average theta per category. If the item worked the way it should, then there should be no asterisk symbols next to the average measure value (see Table 61 below). This change in rank can lead to misfit, although this was not reflected in the misfit diagnostics. For example, the logit should increase with every response category. For this item, there were problems in that the average measure logit value did not increase with each response category (i.e., -.27 to -.94 to .03 to .79).
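The monotonicity check described for Question 48 (whether the average person measure rises with each response category) can be scripted directly. The sketch below uses the average measures copied from Table 61 for Q48 and, for contrast, a hypothetical well-ordered pattern; it illustrates the check itself, not how Winsteps flags the asterisk internally.

# Check whether average person measures increase monotonically across
# response categories 0-3, as expected for a well-functioning item.

def is_monotonic_increasing(values):
    return all(earlier < later for earlier, later in zip(values, values[1:]))

# Average measures per category for Q48, copied from Table 61.
q48_averages = [-0.27, -0.94, 0.03, 0.79]
print("Q48 monotonic:", is_monotonic_increasing(q48_averages))          # False

# A hypothetical well-ordered item for comparison (values are invented).
ordered_example = [-0.90, -0.20, 0.70, 1.60]
print("Example monotonic:", is_monotonic_increasing(ordered_example))   # True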


Table 60 Item Statistics: Misfit Order (10 Measured Items) Winsteps Output
------------------------------------------------------------------------------------|ENTRY TOTAL MODEL| INFIT | OUTFIT |PTMEA|EXACT MATCH| | |NUMBER SCORE COUNT MEASURE S.E. |MNSQ ZSTD|MNSQ ZSTD|CORR.| OBS% EXP%| ITEM | |------------------------------------+----------+----------+-----+-----------+------| | 5 122 46 -1.81 .27|1.35 1.2| .94 .0|A .40| 67.4 71.0| Q48 | | 4 73 46 .38 .19|1.34 1.7|1.27 1.3|B .66| 32.6 44.0| Q47 | | 10 84 46 -.01 .19|1.29 1.4|1.27 1.2|C .54| 41.3 46.5| Q53 | | 2 75 46 .31 .19|1.08 .5|1.21 1.0|D .63| 45.7 44.0| Q44 | | 7 85 46 -.04 .19| .90 -.5| .96 -.1|E .69| 52.2 46.8| Q50 | | 3 89 46 -.19 .19| .94 -.3| .91 -.3|e .62| 50.0 47.3| Q45 | | 1 112 46 -1.18 .23| .91 -.3| .79 -.6|d .47| 58.7 60.4| Q43 | | 9 49 46 1.24 .20| .86 -.7| .83 -.8|c .63| 52.2 48.5| Q52 | | 8 53 46 1.09 .19| .81 -.9| .78 -1.1|b .65| 52.2 47.5| Q51 | | 6 78 46 .21 .19| .72 -1.6| .71 -1.5|a .70| 50.0 44.5| Q49 | |------------------------------------+----------+----------+-----+-----------+------| | MEAN 82.0 46.0 .00 .20|1.02 .1| .97 -.1| | 50.2 50.1| | | S.D. 21.5 .0 .88 .03| .22 1.0| .20 .9| | 8.9 8.3| | -------------------------------------------------------------------------------------

Table 61 Item Category/Option Frequencies: Misfit Order (10 Measured Items) Winsteps Output
--------------------------------------------------------------------|ENTRY DATA SCORE | DATA | AVERAGE S.E. OUTF PTMEA| | |NUMBER CODE VALUE | COUNT % | MEASURE MEAN MNSQ CORR.| ITEM | |--------------------+------------+--------------------------+------| | 5 A 0 0 | 1 2 | -.27 1.7 -.11 |Q48 | | 1 1 | 2 4 | -.94* .19 .4 -.30 | | | 2 2 | 9 20 | .03 .19 .7 -.24 | | | 3 3 | 34 74 | .79 .19 1.2 .40 | | ---------------------------------------------------------------------


Note. Only item 48's category/option frequencies are listed above, as this is the highest misfitting item.

The map of persons and items is displayed below (see Figure 49). As seen below, the scale appeared to be applicable for its purposes. The items were approximately normally distributed, with some items separated from the others on the far ends of the scale. Two items (i.e., questions 50 and 53) were at the same logit value around 0 on the

map. Unfortunately, a few large gaps in between the items were shown on the variable map. In addition, there were some persons whose position was above where items were measuring, although this layout was better compared to the 11-question OFAS. As shown in the map, the items covered a range of 1.5 to near -2.0 logits in difficulty, which is slightly narrower than the range of about 3.5 to -1.5 for persons. The teacher near the high extreme logit value responded with mostly 3s (i.e., with the exception of one 2). Overall, the above evidence indicates that easier and harder items may need to be added in future studies to extend the range of the trait measured. Because this is one of the two final measures that will be used in the analysis of Research Question 3, basic descriptives of the scale should be presented for the final analysis sample (N = 46) for this 10-question OFAS without the extreme teacher. The total possible score on this measure is 30. The mean of the 10-question OFAS in the final analysis sample was 17.83 (SD = 5.86). The range was 23, with a minimum score of 6 and a maximum score of 29. Histograms and skewness and kurtosis statistics revealed an approximately normal distribution.
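The scale descriptives reported here (total score, mean, standard deviation, range, skewness, and kurtosis) can be produced with a few lines once the ten item responses are summed per teacher. The sketch below assumes a pandas DataFrame with one column per retained item; the column names are placeholders, not the actual variable names in the study's data file.

import pandas as pd

def ofas_descriptives(df, item_columns):
    # Sum the ten 0-3 item ratings into a total score (possible range 0-30),
    # then report the descriptives discussed in the text.
    total = df[item_columns].sum(axis=1)
    return {
        "mean": total.mean(),
        "sd": total.std(ddof=1),
        "min": total.min(),
        "max": total.max(),
        "range": total.max() - total.min(),
        "skewness": total.skew(),
        "kurtosis": total.kurt(),
    }

# Usage sketch (file and column names are hypothetical):
# teachers = pd.read_csv("ofas_10item.csv")
# print(ofas_descriptives(teachers, [f"q{i}" for i in (43, 44, 45, 47, 48, 49, 50, 51, 52, 53)]))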


[Figure 49 appears here: the Winsteps variable map (person-item map) for the 10-item survey, with person positions plotted to the left of the vertical line and the ten retained items (Q43 through Q45 and Q47 through Q53) to the right.]

Figure 49. The map of persons and items for the 10-item survey. The distribution of person positions is on the left side of the vertical line and items on the right. The scale appears to be applicable for its purposes.


Both the 50-question OFAS and the 10-question OFAS will be used to examine the third research question to determine whether they are significant predictors of, and which is the better predictor of, student computerized/online formative assessment scores (i.e., DORA). Research Question 3 Descriptives. Descriptive information about the sample used to address this third research question (i.e., What is the relationship between a measure of teacher computerized/online formative assessment use and student computerized/online formative assessment scores?) is summarized in the following paragraphs. To examine this research question, data were used from multiple sources. Survey data were collected from all reading teachers who use DORA in the Highland School District in Ault, Colorado, and existing data (i.e., demographic information, CSAP scores, and DORA scores) for the current academic year (2009/2010) were sent from the Highland School District and LGL for all students in grades 3 through 8. First, a description of the populations from which the data have been sampled will be discussed. This includes two main groups: (1) all Highland School District DORA-using reading teachers, and (2) all students in grades 3 through 8 who are administered DORA in the Highland School District in Ault, CO. This will be followed by a summary of the descriptive information for the final samples used for the teachers and students. Teacher Sample As briefly outlined in the descriptive results section for Research Question 2, the reading teachers in the Highland School District using DORA include 22 individuals who

were either reading specialists or in special education, English as a Second Language (ESL) teachers, or English Language Learner (ELL) teachers. Of the 22 teachers, 19 individuals completed the survey. The three teachers not participating were all female: two were high school language arts instructors, and the third was the designated district contact for Title I. Other demographics were not reported for these non-participants, as completing the survey was voluntary. Reasons for declining to participate were not reported. For this research question, only 11 of the 19 teachers were used in the analysis. The reason for only using 11 of the 19 teachers was twofold: (1) grades 3 through 8 were the only grade levels analyzed due to the more frequent administration of DORA in younger grade levels (i.e., DORA was administered three times in the current academic year compared to only once or twice in older grade levels), and (2) only teachers in these grade levels could be linked with a classroom of students with a large enough sample to conduct the analyses (e.g., ESL/ELL and Special Education reading teachers could not be linked to large groups of students, and there was some overlap in student group/classroom membership between these teachers and their regular reading teachers). A description of the 19 reading teachers who completed the OFAS from the Highland School District will follow, including specifics about the eight teachers removed from the final sample for the analyses. Original District Teacher Sample. The overwhelming majority of the 19 respondents were female (78.9%; n = 15) and White (Non-Hispanic; n = 18; 94.7%). The only other ethnicity/race represented was multi-racial (5.3%; n = 1). These 19 teachers reported an

average age of 36.16 years (SD = 8.82). When asked how many years total they have been teaching including the current academic year, teachers responded with 8.16 years on average (SD = 6.52), and 4.00 (SD = 2.89) years on average in their current school district including the present academic year. Nine respondents indicated teaching in the elementary grade levels (i.e., Preschool through grade 5; 47.4%). This was followed by four teachers that indicated instructing middle school (i.e., grades 6 through 8; 21.1%), and four noted that they teach all grade levels (21.1%). The remaining two teachers instructed at the high school level (10.5%). The current specializations reported included eight general reading teachers (42.1%), five language arts (26.3%), four special education (21.1%), and two ELL (10.5%). Finally, 57.9% indicated that their highest degree obtained was a Masters Degree (n = 11), and 36.8% (n = 7) reported having their Bachelors Degrees. One teacher reported having a Doctoral Degree/Above a Masters Degree (5.3%). This information is presented below in Table 62.


Table 62
Demographic Information for Reading Teachers in the Highland School District (N = 19)

Demographic Information                        M or n (SD or %)
Age                                            36.16 (SD = 8.82)
Total Years Teaching                           8.16 (SD = 6.52)
Total Years in District                        4.00 (SD = 2.89)
Gender
  Male                                         4 (21.1%)
  Female                                       15 (78.9%)
Ethnicity
  White (Non-Hispanic)                         18 (94.7%)
  Multi-Racial                                 1 (5.3%)
Grade
  Elementary (Preschool - 5)                   9 (47.4%)
  Middle (6 - 8)                               4 (21.1%)
  High School (9 - 12)                         2 (10.5%)
  All                                          4 (21.1%)
Current Specialization
  General Reading                              8 (42.1%)
  Language Arts                                5 (26.3%)
  English Language Learner                     2 (10.5%)
  Special Education                            4 (21.1%)
Highest Degree Earned
  Bachelor's Degree                            7 (36.8%)
  Master's Degree                              11 (57.9%)
  Doctoral Degree/Above Master's Degree        1 (5.3%)

Cases Removed from the Original District Teacher Sample. As stated above, the reasons for removing eight teachers from the final analysis sample included the following: (1) Grades 3 through 8 were the only grade levels analyzed due to the more frequent administration of DORA in younger grade levels, and at least three time points are necessary to analyze the data for this research question, and (2) Only teachers in these grade levels could be linked with a classroom of students with a large enough sample to

conduct the analyses (e.g., ESL/ELL, and Special Education reading teachers could not be linked to large groups of students). The eight teachers removed from the above sample include seven females and one male. All teachers removed were White (Non-Hispanic), with an average age of 41.25 years (SD = 8.70). They had an average of 12.12 years (SD = 8.49) total teaching including the current academic year, and 4.25 years (SD = 3.62) total teaching in their current school district including the present academic year. As expected, the teachers removed from further analysis included individuals who indicated that they are certified to teach all grade levels (n = 4; i.e., all ELL and Special Education reading teachers), one elementary-specific ELL teacher, one middle school-specific Special Education teacher, and the high school teachers (n = 2). This corresponds to the removal of four Special Education reading teachers, two ELL teachers, and two Language Arts teachers (i.e., Language Arts is the class title for reading courses at the high school and middle school level in the district). This information will not be presented in a table due to the small number of individuals described above. Cases Removed and the Original District Teacher Sample. The eight teachers removed were predominantly female (87.5%), which is similar to the full population where 15 of the 19 teachers were female. All the teachers that were removed were White (Non-Hispanic), which is again similar to the full population of district reading teachers where 94.7% were White (Non-Hispanic). The average age of the teachers removed (M = 41.25, SD = 8.70) was slightly higher than the entire district population (M = 36.16, SD = 8.82), as was the total years teaching (i.e., M = 12.12, SD = 8.49 for the removed teachers; M = 8.16, SD = 6.52 for the entire district). For the total years teaching in their 281

current school district, the removed teachers, again, had more experience on average (M = 4.25, SD = 3.62) compared to the full district reading teacher population (M = 4.00, SD = 2.89). Perhaps the slightly older average age of the removed teachers compared to the entire district is a product of many of these individuals being ESL/ELL instructors, who typically need additional education (i.e., a Masters degree) and training before working in education. Final Teacher Analysis Sample. The descriptive data from the 11 teacher participants used in the final sample for analysis in Research Question 3 will be summarized in the following paragraphs. The teachers all had complete data and are summarized below in Table 63. Again, the majority of the respondents were female (72.7%; n = 8), and White (Non-Hispanic; n = 10; 90.9%). The other ethnicity/race represented was one multi-racial (9.1%) respondent. Additionally, the final sample of 11 teachers reported an average age of 32.45 years (SD = 7.15). When asked how many years total they had been teaching including the current academic year, teachers responded with 5.27 years on average (SD = 2.05), and 3.82 (SD = 2.40) years in their current school district including the present academic year. As anticipated, eight respondents indicated teaching in the elementary grade levels (i.e., Preschool through grade 5; 72.7%), and three teachers were in the middle school (27.3%). The current specializations reported included eight general reading (72.7%), and three Language Arts (27.3%). Finally, 54.5% indicated that their highest degree obtained was a Bachelors Degree (n = 6), and 36.4% (n = 4) reported having their


Master's Degrees. One teacher reported having a Doctoral Degree/Above a Master's Degree (9.1%).

Table 63
Demographic Information for the Final Analysis Sample of Reading Teachers in the Highland School District (N = 11)

Demographic Information                        M or n (SD or %)
Age                                            32.45 (7.15)
Total Years Teaching                           5.27 (2.05)
Total Years in District                        3.82 (2.40)
Gender
  Male                                         3 (27.3)
  Female                                       8 (72.7)
Ethnicity
  White (Non-Hispanic)                         10 (90.9)
  Multi-Racial                                 1 (9.1)
Grade
  Elementary (P - 5)                           8 (72.7)
  Middle (6 - 8)                               3 (27.3)
Current Specialization
  General Reading                              8 (72.7)
  Language Arts                                3 (27.3)
Highest Degree Earned
  Bachelor's Degree                            6 (54.5)
  Master's Degree                              4 (36.4)
  Doctoral Degree/Above Master's Degree        1 (9.1)

Final Teacher Analysis Sample and the Original District Teacher Sample. Compared to the full sample of 19 teachers, the sample used to address Research Question 3 (N = 11) is similar. For example, for average age only 3.71 years separate the two samples, with the higher average age in the full sample. All other key demographic

variables remained comparable when examining the percentages, with the only noticeable change being the highest degree earned. In the full sample, more teachers indicated having a Master's Degree (61.1%) than a Bachelor's Degree (38.9%), whereas this was reversed in the analysis sample (i.e., Bachelor's Degrees = 60% versus Master's Degrees = 40%; percentages computed among teachers holding one of these two degrees). This is not considered problematic for addressing this research question, as perhaps more of the individuals removed from the final sample required a Master's Degree for their specialization, such as Special Education or ESL/ELL teachers. Student Sample County Demographic Information. See Research Question 1. District Demographic Information. See Research Question 1 for more information. The tables below (see Tables 64 and 65) contain the information for the elementary school and middle school, excluding the high school demographic information, as these data are not being used to address the current research question.


Table 64
Student District Demographic Information for the Highland Elementary School from the National Center for Education Statistics (NCES) for 2008/2009 (N = 374)

Demographic Information                        n (%)
Grade
  Kindergarten                                 66 (17.6)
  1                                            78 (20.9)
  2                                            39 (10.4)
  3                                            70 (18.7)
  4                                            63 (16.8)
  5                                            58 (15.5)
Gender
  Male                                         188 (50.3)
  Female                                       186 (49.7)
Ethnicity
  White (Non-Hispanic)                         229 (61.2)
  Hispanic                                     136 (36.4)
  Black (Non-Hispanic)                         3 (.8)
  Asian/Pacific Islander                       2 (.5)
  American Indian/Alaskan Native               4 (1.1)
Free/Reduced Lunch
  Eligible                                     195 (52.1)
  Not Eligible                                 179 (47.9)


Table 65
Student District Demographic Information for the Highland Middle School from the National Center for Education Statistics (NCES) for 2008/2009 (N = 184)

Demographic Information                        n (%)
Grade
  6                                            58 (31.5)
  7                                            66 (35.9)
  8                                            60 (32.6)
Gender
  Male                                         96 (52.2)
  Female                                       88 (47.8)
Ethnicity
  White (Non-Hispanic)                         129 (70.1)
  Hispanic                                     49 (26.6)
  Black (Non-Hispanic)                         3 (1.6)
  Asian/Pacific Islander                       1 (.5)
  American Indian/Alaskan Native               2 (1.1)
Free/Reduced Lunch
  Eligible                                     83 (45.1)
  Not Eligible                                 101 (54.9)

Original District Sample As mentioned previously, the group that will be described in the following paragraphs includes all students in grades 3 through 8 who are administered DORA in the Highland School District in Ault, Colorado. The population of students in grades 3 through 8 who are administered DORA across the United States and Canada cannot be described, as LGL does not collect demographic information from students. The group of


students administered DORA in grades 3 through 8 can be described for Highland, as this information was provided by the district. Students in grades Preschool through 2 and high school were not included in the following description for several reasons. First, the youngest grade levels were not included because this study is focusing primarily on regularly administered formative assessments and the state test. State testing in Colorado begins in grade 3. Additionally, DORA is administered more frequently in younger grade levels, and at least three time points are necessary to analyze the data for this research question, which supports the omission of high school grade levels. Finally, only teachers in grades 3 through 8 could be linked with a classroom of students with a large enough sample to conduct the analyses (e.g., ESL, ELL, and Special Education reading teachers could not be linked to large groups of students). The population consisted of 374 students across grades 3 through 8, which includes 188 females (50.3%) and 186 males (49.7%). The grade levels included the following: (1) 66 students in third grade (17.6%), (2) 54 students in fourth grade (14.4%), (3) 65 students in fifth grade (17.4%), (4) 70 students in sixth grade (18.7%), (5) 52 students in seventh grade (13.9%), and (6) 67 students in eighth grade (17.9%). The average age of the students in the district in grades 3 through 8 was 11.96 years (SD = 1.78). The ethnic composition of the population included 230 students (61.5%) categorized as White (Non-Hispanic), and the remaining individuals classified as minority (n = 144; 38.5%). The minority students were further differentiated in that 136 were Hispanic (36.4%), five were Black (Non-Hispanic; 1.3%), two were American 287

Indian/Alaskan Native (.5%), and one was Asian/Pacific Islander (.3%). As this studys measure of SES, 186 students fell under the free/reduced lunch status category (49.7%) and 188 were not eligible (50.3%). These data can be viewed below in Table 66 separated by grade level.


Table 66 Student Demographic Information for the Original District Sample from the Highland School District for the 2009/2010 Academic Year by Grade Level Demographic Information (n (%)) Grade 3 (n = 66) Grade 4 (n = 54) Grade 5 (n = 65) Grade 6 (n = 70) Grade 7 (n = 52) Grade 8 (n = 67) Total (N = 374)

Basic Characteristics Age M = 9.43 (SD = .45) 35 (53.0) 31 (47.0) 36 (54.5) 28 (42.4) 1 (1.5) 1 (1.5) 32 (48.5) 34 (51.5) M = 10.44 (SD = .40) 26 (48.1) 28 (51.9) 28 (51.9) 24 (44.4) 2 (3.7) 30 (55.6) 24 (44.4) M = 11.60 (SD = .69) 33 (50.8) 32 (49.2) 41 (63.1) 23 (35.4) 1 (1.5) 31 (47.7) 34 (52.3) M = 12.42 (SD = .47) 38 (54.3) 32 (45.7) 45 (64.3) 24 (34.3) 1 (1.4) 35 (50.0) 35 (50.0) M = 13.37 (SD = .39) 24 (46.2) 28 (53.8) 33 (63.5) 17 (32.7) 1 (1.9) 1 (1.9) 22 (42.3) 30 (57.7) M = 14.48 (SD = .45) 30 (44.8) 37 (55.2) 47 (70.1) 20 (29.9) 36 (53.7) 31 (46.3) M = 11.96 (SD = 1.78) 186 (49.7) 188 (50.3) 230 (61.5) 136 (36.4) 5 (1.3) 1 (.3) 2 (.5) 186 (49.7) 188 (50.3) Continued


Gender Male Female Ethnicity White (Non-Hispanic) Hispanic Black (Non-Hispanic) Asian/Pacific Islander American Indian/Alaskan Native Free/Reduced Lunch Eligible Not Eligible


Table 66 Continued Demographic Information (n (%)) Grade 3 (n = 66) Grade 4 (n = 54) Grade 5 (n = 65) Grade 6 (n = 70) Language (n = 66) English Language Learner (ELL) Yes No English as a Second Language (ESL) Yes No 14 (21.2) 52 (78.8) (n = 38) 1 (2.6) 37 (97.4) 7 (18.4) 31 (81.6) (n = 58) 2 (3.4) 56 (96.6) 11 (19.0) 47 (81.0) (n = 63) 4 (6.3) 59 (93.7) 11 (17.5) 52 (82.5) (n = 47) 3 (6.4) 44 (93.6) 8 (17.0) 39 (83.0) (n = 60) 8 (13.3) 52 (86.7) 10 (16.7) 50 (83.3) (n = 266) 18 (6.8) 248 (93.2) 61 (18.4) 271 (81.6) Grade 7 (n = 52) Grade 8 (n = 67) Total (N = 374)


Disabled/Disadvantaged Gifted Program Yes No Individualized Education Program (IEP) Yes No Accommodations for Testing Yes No 9 (13.6) 57 (86.4) 38 (100) 4 (10.5) 34 (89.5) 9 (23.7) 29 (76.3) 58 (100) 10 (17.2) 48 (82.8) 19 (32.8) 39 (67.2) 8 (12.7) 55 (87.3) 7 (11.1) 56 (88.9) 7 (11.1) 56 (88.9) 4 (8.5) 43 (91.5) 3 (6.4) 44 (93.6) 3 (6.4) 44 (93.6) 1 (1.7) 59 (98.3) 8 (13.3) 52 (86.7) 11 (18.3) 49 (81.7) 13 (4.9) 253 (95.1) 41 (12.3) 291 (87.7) 49 (18.4) 217 (81.6)


For the remaining demographic information, missing data were present. For the current third grade, the majority of demographic details were missing, as this information comes from the CDE. Current third graders were administered their first CSAP state test in March of 2009, and all detailed demographic information from this test will first be reported in August 2010. Thus, the majority of this information is not available for descriptive purposes at this time. Other missing demographic data may be due to students being new to the district and having yet to take the CSAP for the first time, or to students opting out of providing that information or even requesting special permission not to take the state exams. The remaining demographic information available for grade 3 included IEP and ESL status, which, combined with the rest of the district information, provided descriptive information for 332 students (i.e., 42 missing cases) out of 374 total students (i.e., see the footnote in the above table). Descriptive information without the third graders included 266 students (i.e., 108 missing cases). The language background of these students in the district in grades 3 through 8 included 219 native English speakers (82.3%), 46 Spanish speakers (17.3%), and one categorized as other (.4%). These frequencies and percentages, when separated by grade level, were equivalent to the frequencies and percentages for students participating or not participating in the ESL program. Thus, only ESL status will be reported in the current study. The majority of students have been categorized as not participating in an ELL program for three or more years (n = 248; 93.2%), and not participating in the district ESL program (n = 271; 81.6%). Most of the individuals categorized as ESL students

were monitored for three or more years (n = 19; 5.7%). Participation in ESL/ELL will be discussed later in the following paragraphs. The majority of the students in the district in grades 3 through 8 are not in the gifted program (n = 253; 95.1%), and 234 (88%) of the students were categorized as not having a primary disability. Those with a primary disability ranged from most having a specific learning disability (n = 19; 7.1%) to a small fraction having either limited intellectual capacity, or an emotional, hearing, physical, or speech/language disability. A total of 291 students do not have IEPs (87.7%). The frequencies and percentages of students reported as having a primary disability matched those of students recognized as having an IEP, which is expected. Thus, only IEP status will be reported. Two-hundred sixty-four students out of 266 do not require a 504 Plan (99.2%), and the two students who do have a 504 Plan were in grade 5. Due to the low frequency of those requiring 504 Plans, these students will be removed from further analysis. Forty-nine students (18.4%) require some accommodations for the reading, math, or science state tests, including Braille, larger print, a scribe, or extended timing. Finally, according to the CDE website for the current academic year (i.e., 2009/2010), Highland Elementary School runs a school-wide Title I program. The middle school and high school in the district are not eligible to receive Title I funds. Original District Sample DORA Scores. It is important to examine the DORA scores for the district as well, as this information is integral in addressing Research Question 3. As with Research Question 1, seven of the eight subtests are of interest, with the Fluency subtest excluded. This subtest is teacher-administered, and teachers rarely

record the scores in the LGL database. In the full district population from third to eighth grade, there were three groups of DORA scores for the current academic year (2009/2010) for the seven subtests: (1) April/May 2009, (2) August/September 2009, and (3) December 2009/January 2010. Thus, there were 21 columns of scores in the dataset, with each student at the time of this study having at most three DORA scores for each subtest for the current academic year. The highest average score was from the third testing point in December 2009/January 2010 for the Word Recognition subtest (M = 10.29, SD = 3.69), and the lowest was from the third testing point as well for the Phonemic Awareness subtest (M = .05, SD = .18). This is not surprising as previous subtests are used to gauge if a student will be administered the Phonemic Awareness subtest. Ideally, as students progress in their reading ability, this subtest will not be administered as frequently. Thus, the low average in the third time point for this subtest is an indication of better reading ability. These DORA scores for the district for grades 3 through 8 are summarized below in Table 67.
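Because the growth models for Research Question 3 require one row per student per testing occasion rather than the 21-column wide layout described above, a reshaping step is needed before analysis. The sketch below shows one way to do this with pandas for a single subtest; the file and column names are placeholders, not the labels actually used in the exported district data.

import pandas as pd

def to_long_format(wide, subtest, waves=("t1", "t2", "t3")):
    # Convert one subtest's three wide columns (e.g., wordrec_t1 ... wordrec_t3)
    # into long format with a numeric time index (0, 1, 2) for growth modeling.
    value_columns = [f"{subtest}_{w}" for w in waves]
    long = wide.melt(id_vars="student_id", value_vars=value_columns,
                     var_name="wave", value_name=subtest)
    long["time"] = long["wave"].map({c: i for i, c in enumerate(value_columns)})
    return long.drop(columns="wave").sort_values(["student_id", "time"])

# Usage sketch (file and column names are hypothetical):
# wide = pd.read_csv("dora_wide.csv")
# word_rec_long = to_long_format(wide, "wordrec")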


Table 67 Descriptive Information for DORA Scores from the Highland School District for the 2009/2010 Academic Year by Grade Level DORA Subtests (M (SD)) Grade 3 (n = 66) Grade 4 (n = 54) Grade 5 (n = 65) Grade 6 (n = 70) Grade 7 (n = 52) Grade 8 (n = 67) Total (N = 374)

Subtest Administration 1 (04/09 05/09) 1. High-Frequency Words 2. Word Recognition 3. Phonics 4. Phonemic Awareness 5. Oral Vocabulary 6. Spelling 7. Reading Comprehension (n = 60) 3.32 (.87) 5.37 (4.00) 3.67 (1.55) .71 (.21) 4.02 (1.29) 1.96 (.64) 2.49 (2.01) (n = 42) 3.70 (.38) 6.86 (4.18) 4.41 (.87) .16 (.32) 4.87 (1.42) 2.33 (.86) 4.44 (2.39) (n = 58) 3.70 (.44) 10.09 (3.50) 4.48 (.93) .01 (.07) 5.29 (1.16) 3.16 (1.69) 5.97 (3.43) (n = 62) 3.80 (.12) 11.30 (2.56) 4.69 (.52) .01 (.10) 5.99 (1.71) 3.82 (1.96) 7.36 (3.62) (n = 48) 3.79 (.13) 11.68 (2.61) 4.75 (.14) .02 (.12) 6.67 (1.93) 4.79 (2.09) 8.18 (3.16) (n = 61) 3.80 (.12) 12.27 (1.37) 4.78 (.13) .01 (.11) 8.20 (2.57) 5.83 (2.61) 9.23 (3.42) (n = 331) 3.68 (.48) 9.68 (4.05) 4.46 (.94) .16 (.31) 5.87 (2.23) 3.69 (2.26) 6.32 (3.85)


Subtest Administration 2 (08/09 09/09) 1. High-Frequency Words 2. Word Recognition 3. Phonics 4. Phonemic Awareness 5. Oral Vocabulary 6. Spelling 7. Reading Comprehension (n = 66) 3.27 (.89) 5.39 (4.23) 3.56 (1.65) .17 (.25) 4.08 (1.39) 1.85 (.68) 2.32 (2.01) (n = 48) 3.59 (.52) 7.38 (4.27) 4.23 (1.15) .14 (.30) 5.15 (1.51) 2.09 (.72) 3.61 (2.37) (n = 62) 3.74 (.26) 10.19 (3.81) 4.56 (.86) .04 (.14) 5.30 (1.35) 3.26 (1.71) 6.11 (3.67) (n = 65) 3.69 (.46) 11.20 (2.93) 4.63 (.77) .02 (.10) 6.12 (1.89) 3.82 (2.01) 7.03 (3.69) (n = 49) 3.78 (.15) 11.65 (2.65) 4.77 (.15) .03 (.14) 7.18 (1.97) 5.03 (2.21) 7.88 (3.56) (n = 62) 3.80 (.12) 12.37 (1.08) 4.78 (.13) .01 (.10) 8.09 (2.59) 5.82 (2.88) 9.58 (3.28) (n = 352) 3.64 (.52) 9.68 (4.18) 4.41 (1.06) 0.07 (.19) 5.96 (2.28) 3.64 (2.36) 6.09 (4.02) Continued


Table 67 Continued
DORA Subtests (M (SD)) Grade 3 (n = 66) Grade 4 (n = 54) Grade 5 (n = 65) Grade 6 (n = 70) Grade 7 (n = 52) Grade 8 (n = 67) Total (N = 374)

Subtest Administration 3 (12/09 01/10) 1. High-Frequency Words 2. Word Recognition 3. Phonics 4. Phonemic Awareness 5. Oral Vocabulary 6. Spelling 7. Reading Comprehension (n = 66) 3.51 (.77) 6.61 (4.28) 3.91 (1.44) .14 (.26) 4.27 (1.48) 2.09 (.65) 2.78 (2.21) (n = 51) 3.67 (.40) 9.13 (3.57) 4.51 (.68) .07 (.22) 5.19 (1.19) 2.47 (.79) 4.59 (2.86) (n = 65) 3.74 (.35) 10.47 (3.34) 4.62 (.71) .02 (.12) 5.58 (1.69) 3.58 (1.91) 7.10 (3.12) (n = 57) 3.71 (.36) 11.64 (2.68) 4.70 (.54) .04 (.15) 6.63 (2.25) 4.27 (2.09) 7.71 (3.53) (n = 49) 3.81 (.08) 11.76 (2.58) 4.67 (.40) .01 (.06) 7.45 (2.16) 5.25 (2.23) 7.93 (3.32) (n = 67) 3.82 (.07) 12.38 (1.10) 4.79 (.11) .01 (.12) 8.11 (2.49) 6.30 (2.81) 9.39 (3.25) (n = 355) 3.71 (.43) 10.29 (3.69) 4.52 (.84) .05 (.18) 6.18 (2.36) 4.00 (2.44) 6.58 (3.81)


Note. There are eight DORA subtests. Fluency scores have been omitted from analyses and reporting due to this test being administered infrequently. Due to missing data at each test administration, the sample sizes differ across administrations.


Cases Selected for Potential Removal As with Research Question 1, the following sections will outline the treatment of missing data and removal of cases based on missing data and other considerations. The cases removed will be compared with the district population information and original district sample. This will be followed by a summary of the final analysis sample. Low Frequency DORA Scores For Research Question 3, at least three scores on each DORA subtest are needed to conduct the HLM examining growth over the current academic year. As mentioned previously, the three time points for the current academic year include April/May 2009, August/September 2009, and December 2009/January 2010. If any one of the three data points was missing, the case was removed from the district dataset and was not used in the analysis sample. Specifically, 43 cases were missing scores at the first time point, 22 were missing at the second time point, and 19 were missing at the third time point. Fifty-six cases were missing two of the scores, and 14 were missing one score. One case appeared not to be missing any data, but could not be linked to one reading teacher. Missing data in this study could have been handled in a number of ways. One common practice is to impute scores for the missing time points using a regression method that involves predicting a missing score at a time point with all other available scores for a given subtest. However, it was decided that since less than 20% of the entire sample had missing scores, it would be acceptable not to use these cases in the analysis sample. Additionally, these 71 cases did not differ significantly on the overwhelming majority of DORA subtests across all time points (p > .002), and as an outside measure,

did not differ significantly on the state reading test either (p > .05). Overall, the group with missing DORA scores and the remaining cases performed similarly on DORA and the state reading test. This supports removing the cases from the analysis sample without any detriment to the validity of the study's results. This resulted in 71 cases being removed. The demographic information of the cases removed is summarized below in Table 68.
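The case-removal rule and the comparison between removed and retained students can be summarized in a short script. The sketch below assumes the wide-format columns introduced in the previous sketch plus a state reading score column; all column names are placeholders, and the alpha arguments simply mirror the criteria cited in the text.

import pandas as pd
from scipy import stats

def split_complete_cases(wide, score_columns):
    # A case is retained only if all three DORA administrations are present.
    complete = wide[score_columns].notna().all(axis=1)
    return wide[complete], wide[~complete]

def compare_groups(kept, removed, column, alpha=0.002):
    # Independent-samples t-test (Welch) between retained and removed cases.
    t, p = stats.ttest_ind(kept[column].dropna(), removed[column].dropna(),
                           equal_var=False)
    return {"t": round(t, 2), "p": round(p, 4), "significant": p < alpha}

# Usage sketch (column names are hypothetical):
# kept, removed = split_complete_cases(wide, ["wordrec_t1", "wordrec_t2", "wordrec_t3"])
# print(compare_groups(kept, removed, "csap_reading", alpha=0.05))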


Table 68

Demographic Information of Cases Removed Due to Missing DORA Scores from the Highland School District for the 2009/2010 Academic Year (N = 71)

Demographic Information                           Total (n (%))
Age                                               M = 11.96 (SD = 1.78)
Grade
  3                                               6 (8.5)
  4                                               17 (23.9)
  5                                               8 (11.3)
  6                                               23 (32.4)
  7                                               10 (14.1)
  8                                               7 (9.9)
Gender
  Male                                            34 (47.9)
  Female                                          37 (52.1)
Ethnicity
  White (Non-Hispanic)                            34 (47.9)
  Hispanic                                        36 (50.7)
  Black (Non-Hispanic), Asian/Pacific Islander,
    or American Indian/Alaskan Native             1 (1.4)
Free/Reduced Lunch
  Eligible                                        42 (59.2)
  Not Eligible                                    29 (40.8)
English Language Learner (ELL) a
  Yes                                             3 (10.7)
  No                                              25 (89.3)
English as a Second Language (ESL) b
  Yes                                             9 (26.5)
  No                                              25 (73.5)
Gifted Program a
  Yes                                             2 (7.1)
  No                                              26 (92.9)
Individualized Education Program (IEP) b
  Yes                                             4 (11.8)
  No                                              30 (88.2)
Accommodations for Testing a
  Yes                                             5 (17.9)
  No                                              23 (82.1)

Note. a n = 28. b n = 34.

Low Frequency DORA Scores and the District Student Population. In the cases removed (N = 71), the proportion of students in each grade level was not comparable to the proportions reported by NCES for the district in 2008/2009. Similar to the demographic comparisons above, the cases removed were comparable to the NCES population information for gender and free/reduced lunch status, with males and females and eligible and non-eligible students each near 50%. The ethnic composition of the samples was dissimilar, with 47.9% of the cases removed identifying as White (Non-Hispanic) and just over 50% categorized as Hispanic. This is different from the NCES information, where 61.2% of the elementary school population and 70.1% of the middle school population was White (Non-Hispanic), and 36.4% of the elementary school population and 26.6% of the middle school population was Hispanic. In other words, the proportion of cases removed was lower for Whites and higher for Hispanics compared to both the original sample and the NCES district population information. No other demographic information was reported by NCES.

Low Frequency DORA Scores and the Original District Sample. In the cases removed, the proportion of students in each grade level was not comparable to the proportions in the original district sample (N = 374). For example, 9.9% of the cases removed were eighth graders, whereas 17.9% of the original district sample was categorized as eighth grade. The average ages of the samples were exactly the same, with a mean of 11.96 (SD = 1.78). For gender and free/reduced lunch status, the samples were comparable, with males and females and eligible and non-eligible students each split roughly down the middle near 50%. The ethnic composition of the samples was dissimilar, with 47.9% of the cases removed identifying as White (Non-Hispanic) and just over 50% categorized as Hispanic, compared to 61.5% White (Non-Hispanic) and 36.4% Hispanic in the original district sample. All other demographic information (i.e., ESL/ELL status, Gifted status, IEP, and testing accommodations) was nearly equivalent across the two samples.

504 Plan

As mentioned previously, students with a 504 Plan (i.e., a disability plan) will be removed due to the low frequency of students enrolled (n = 2) and the implication that those involved with this program are disadvantaged in some way. The 504 Plan aims to give students with disabilities the opportunity to perform at the same level as other students, and many of these students receive special accommodations when taking various tests. Therefore, due to the low frequency of students with a 504 Plan, and for purposes of generalizability, these students will be removed from further analysis.

Including the 71 cases previously removed with insufficient DORA data, a total of 73 students have been removed up to this point from the final analysis sample.

ESL/ELL Students

Students involved in the district's ESL/ELL programs were examined for potential removal from the final analysis sample. Literature support and considerations pertaining to the inclusion or exclusion of these students were addressed in the results section for Research Question 1. To summarize, these students were included in the final analysis sample to bolster the sample size needed to conduct the HLM analysis. As with the first research question, ESL/ELL status will be included in the current HLM model as another predictor for this third research question.

To further support and investigate ESL/ELL status for inclusion in the final analysis sample, independent samples t tests were conducted between the ESL/ELL (n = 51) and non-ESL/ELL (n = 250) samples to examine performance on all DORA subtests. The normality assumption was violated, with most distributions on the dependent variables displaying a negative skew (i.e., many students obtained the highest possible score on a given subtest). Independence was not violated. The homogeneity of variances assumption was also violated in the majority of cases (p < .05), specifically for High-Frequency Words, Word Recognition, and Oral Vocabulary at all three data collection time points; Spelling at the first two data collection time points; and Phonics and Reading Comprehension at the second data collection time point. These departures from homogeneity (and the unbalanced groups) warranted the use of Welch's t test for those comparisons. The results from the t tests are summarized below in Table 69.


Table 69

Independent Samples t Tests Comparing ESL/ELL (n = 51) and Non-ESL/ELL (n = 250) Students on All DORA Subtests from the Highland School District for the 2009/2010 Academic Year

DORA Subtest Scores           ESL/ELL M (SD)   Non-ESL/ELL M (SD)   t        df       p

Subtest Administration 1 (04/09-05/09)
1. High-Frequency Words       3.57 (.71)       3.69 (.44)           1.21     58.00    .232
2. Word Recognition           8.06 (4.70)      9.94 (3.92)          2.68     64.96    .009
3. Phonics                    4.25 (1.16)      4.47 (.94)           1.45     299.00   .147
4. Phonemic Awareness         .18 (.28)        .17 (.33)            -.11     299.00   .914
5. Oral Vocabulary            4.57 (1.59)      6.19 (2.32)          6.06*    99.05    .000
6. Spelling                   2.89 (1.79)      3.88 (2.37)          3.40*    89.65    .001
7. Reading Comprehension      4.06 (3.25)      6.68 (3.84)          4.56*    299.00   .000

Subtest Administration 2 (08/09-09/09)
1. High-Frequency Words       3.50 (.82)       3.69 (.43)           1.61     55.61    .114
2. Word Recognition           8.36 (4.56)      10.17 (4.00)         2.64     66.60    .010
3. Phonics                    4.25 (1.28)      4.48 (.95)           1.26     61.82    .213
4. Phonemic Awareness         .09 (.21)        .06 (.18)            -1.10    299.00   .273
5. Oral Vocabulary            4.88 (1.71)      6.32 (2.37)          5.08*    93.98    .000
6. Spelling                   3.03 (2.02)      3.88 (2.51)          2.62     84.60    .011
7. Reading Comprehension      4.12 (3.49)      6.74 (4.03)          4.75*    79.67    .000

Subtest Administration 3 (12/09-01/10)
1. High-Frequency Words       3.59 (.64)       3.74 (.37)           1.59     57.01    .117
2. Word Recognition           8.88 (4.12)      10.61 (3.51)         2.80     65.62    .007
3. Phonics                    4.40 (1.02)      4.55 (.81)           1.10     299.00   .271
4. Phonemic Awareness         .04 (.16)        .05 (.16)            .08      299.00   .935
5. Oral Vocabulary            4.66 (1.49)      6.62 (2.40)          7.60*    110.60   .000
6. Spelling                   3.42 (2.29)      4.14 (2.49)          1.91     299.00   .057
7. Reading Comprehension      4.97 (3.49)      7.12 (3.83)          3.70*    299.00   .000

Note. There are eight DORA subtests. Fluency scores have been omitted from analyses and reporting due to this test being administered infrequently. Violations of the homogeneity assumption were noted for High-Frequency Words, Word Recognition, and Oral Vocabulary for all three data collection time points; Spelling for the first two data collection time points; and Phonics and Reading Comprehension for the second data collection time point (p < .05).
* p < .002 (α = .002; .05/21 = .002 for the Bonferroni correction).
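A minimal sketch of how comparisons like those in Table 69 could be computed, using Welch's t test and the Bonferroni-adjusted alpha of .05/21, is shown below. The data frame and column names are hypothetical; the study's analyses were not run with this code.

```python
# Welch's t test (unequal variances) for one DORA subtest, evaluated against the
# Bonferroni-adjusted alpha used in Tables 69, 71, and 74 (illustrative sketch only).
import pandas as pd
from scipy import stats

BONFERRONI_ALPHA = 0.05 / 21  # 7 subtests x 3 administrations = 21 comparisons

def welch_comparison(df: pd.DataFrame, subtest: str, group_col: str = "esl_ell"):
    group_1 = df.loc[df[group_col] == 1, subtest].dropna()
    group_0 = df.loc[df[group_col] == 0, subtest].dropna()
    t_stat, p_value = stats.ttest_ind(group_1, group_0, equal_var=False)  # Welch's t
    return t_stat, p_value, p_value < BONFERRONI_ALPHA
```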


On the Oral Vocabulary and Reading Comprehension subtests, the ESL/ELL students performed significantly lower on average than the non-ESL/ELL students at all three data collection time points (p < .002). On the Spelling subtest, the ESL/ELL students were significantly lower only at the first data collection time point. The High-Frequency Words, Phonics, and Phonemic Awareness subtests were not significantly different between the two groups. This may reflect the scale of measurement of these subtests, where a ceiling effect was consistently present, prohibiting the use of these subtests in further analyses; this will be discussed further in the following paragraphs. For the majority of the DORA subtests, then, ESL/ELL students did not perform statistically significantly differently from non-ESL/ELL students. These results differ from Research Question 1, where the ESL/ELL students performed statistically significantly differently from the non-ESL/ELL students on both the CSAP and DORA scores for all but two time points. Regardless, for consistency purposes, the ESL/ELL students in this research question will be retained in the analysis sample for further examination in the HLM as another predictor.

IEP Students

Finally, IEP status was examined for removal from further analysis for Research Question 3. Literature support and considerations for this group were outlined in the results for Research Question 1. Thirty-seven (12.3%) students across grades 3 through 8 were categorized as having an IEP in the sample from which cases with missing DORA scores had been removed (N = 301). Students with IEP status, as mentioned previously, encompass a range of individuals who may need special accommodations to take tests, whereas this is not as consistently the case for ESL/ELL students. It was determined that keeping the IEP students in the sample would be far more detrimental to the validity of the study than retaining the ESL/ELL students. There are also a smaller number of IEP students, and they appear to be struggling across the board compared to the larger analysis sample. Only 12 students who were categorized as ESL/ELL were also noted to have an IEP. Additionally, in comparing the means on all the DORA subtests at all the data collection time points, the ESL/ELL students performed better than the IEP students on the majority of subtests across the current academic year. Thus, IEP students will be removed from the final analysis sample. Demographic information for the IEP sample is provided in Table 70 below for comparison purposes with the analysis sample.


Table 70

Demographic Information of the IEP Students from the Highland School District for Grades 3 through 8 for the 2009/2010 Academic Year (N = 37)

Demographic Information                           Total (n (%))
Age                                               M = 12.01 (SD = 1.98)
Grade
  3                                               9 (24.3)
  4                                               3 (8.1)
  5                                               10 (27.0)
  6                                               4 (10.8)
  7                                               3 (8.1)
  8                                               8 (21.6)
Gender
  Male                                            22 (59.5)
  Female                                          15 (40.5)
Ethnicity
  White (Non-Hispanic)                            16 (43.2)
  Hispanic                                        20 (54.1)
  Black (Non-Hispanic), Asian/Pacific Islander,
    or American Indian/Alaskan Native             1 (2.7)
Free/Reduced Lunch
  Eligible                                        24 (64.9)
  Not Eligible                                    13 (35.1)
English Language Learner (ELL) a
  Yes                                             5 (17.9)
  No                                              23 (82.1)
English as a Second Language (ESL)
  Yes                                             12 (32.4)
  No                                              25 (67.6)
Gifted Program a
  No                                              28 (100.0)
Accommodations for Testing a
  Yes                                             24 (85.7)
  No                                              4 (14.3)

Note. a n = 28.


To further examine the inclusion of IEP students in the sample, independent samples t tests were conducted between the IEP (n = 37) and non-IEP (n = 264) samples to examine performance on all DORA subtests. Again, the normality assumption was violated, with most distributions on the dependent variables displaying a negative skew (i.e., many students obtained the highest possible score on a given subtest). Independence was not violated. The homogeneity of variances assumption was also violated in the majority of cases (p < .05), specifically for High-Frequency Words, Phonics, Spelling, and Reading Comprehension at all three data collection time points, and for Word Recognition and Phonemic Awareness at the last two data collection time points. These departures from homogeneity (and the unbalanced groups) warranted the use of Welch's t test for those comparisons. The results from the t tests are summarized below in Table 71.


Table 71

Independent Samples t Tests Comparing IEP (n = 37) and Non-IEP (n = 264) Students on All DORA Subtests from the Highland School District for the 2009/2010 Academic Year

DORA Subtest Scores           IEP M (SD)       Non-IEP M (SD)       t        df       p

Subtest Administration 1 (04/09-05/09)
1. High-Frequency Words       3.21 (.89)       3.74 (.36)           3.53*    37.69    .001
2. Word Recognition           5.50 (4.14)      10.20 (3.77)         7.01*    299.00   .000
3. Phonics                    3.48 (1.74)      4.57 (.73)           3.78*    37.81    .001
4. Phonemic Awareness         .24 (.31)        .16 (.32)            -1.39    299.00   .165
5. Oral Vocabulary            4.90 (2.03)      6.05 (2.29)          2.92     299.00   .004
6. Spelling                   1.95 (1.05)      3.96 (2.33)          9.00*    96.53    .000
7. Reading Comprehension      3.27 (3.07)      6.66 (3.79)          6.09*    52.68    .000

Subtest Administration 2 (08/09-09/09)
1. High-Frequency Words       3.26 (.90)       3.71 (.41)           3.04     38.18    .004
2. Word Recognition           5.50 (4.45)      10.47 (3.72)         6.49*    43.36    .000
3. Phonics                    3.42 (1.80)      4.59 (.75)           3.89*    37.78    .000
4. Phonemic Awareness         .24 (.31)        .04 (.15)            -3.81*   38.51    .000
5. Oral Vocabulary            5.04 (2.04)      6.22 (2.34)          2.93     299.00   .004
6. Spelling                   2.02 (1.57)      3.97 (2.46)          6.54*    63.94    .000
7. Reading Comprehension      3.11 (3.16)      6.74 (3.97)          6.32*    53.27    .000

Subtest Administration 3 (12/09-01/10)
1. High-Frequency Words       3.30 (.94)       3.77 (.25)           3.02     36.74    .005
2. Word Recognition           6.54 (4.45)      10.85 (3.23)         5.67*    41.46    .000
3. Phonics                    3.66 (1.61)      4.64 (.59)           3.68*    37.36    .001
4. Phonemic Awareness         .15 (.26)        .03 (.14)            -2.70    38.87    .010
5. Oral Vocabulary            4.89 (2.01)      6.48 (2.37)          3.87*    299.00   .000
6. Spelling                   2.07 (1.59)      4.29 (2.45)          7.37*    63.01    .000
7. Reading Comprehension      3.67 (3.03)      7.19 (3.76)          6.41*    52.93    .000

Note. There are eight DORA subtests. Fluency scores have been omitted from analyses and reporting due to this test being administered infrequently. Violations of the homogeneity assumption were noted for High-Frequency Words, Phonics, Spelling, and Reading Comprehension for all three data collection time points, and Word Recognition and Phonemic Awareness for the last two data collection time points (p < .05).
* p < .002 (α = .002; .05/21 = .002 for the Bonferroni correction).


As shown above, IEP students performed consistently lower on average on the majority of the DORA subtests at most data collection points during the current academic year, and the mean differences were larger than those between ESL/ELL and non-ESL/ELL students. More specifically, on the Word Recognition, Phonics, Spelling, and Reading Comprehension subtests, the IEP students performed significantly lower on average than the non-IEP students at all three data collection time points (p < .002). On the High-Frequency Words subtest, the IEP students were significantly lower only at the first data collection time point. A significant difference at only one time point was likewise found for Phonemic Awareness (the second data collection time point) and Oral Vocabulary (the third data collection time point). This consistent pattern, in which IEP students performed significantly lower on the majority of DORA subtests, warranted the exclusion of these cases from the final analysis sample. Future research may consider examining this specific population, or including IEP status as another variable in modeling growth in HLM.

Total Cases Removed

The total cases removed from the original district sample (N = 374) included the 71 cases with missing DORA scores, the two students with 504 Plan status, and the IEP students (N = 37), for a total of 110 cases removed. This encompassed 29.4% of the original district sample. A summary of the demographic information of the total cases removed is presented below in Table 72.


Table 72

Demographic Information of the Total Cases Removed from the Highland School District for Grades 3 through 8 for the 2009/2010 Academic Year (N = 110)

Demographic Information                           Total (n (%))
Age                                               M = 12.01 (SD = 1.77)
Grade
  3                                               15 (13.6)
  4                                               20 (18.2)
  5                                               20 (18.2)
  6                                               27 (24.5)
  7                                               13 (11.8)
  8                                               15 (13.6)
Gender
  Male                                            56 (50.9)
  Female                                          54 (49.1)
Ethnicity
  White (Non-Hispanic)                            50 (45.5)
  Hispanic                                        58 (52.7)
  Black (Non-Hispanic), Asian/Pacific Islander,
    or American Indian/Alaskan Native             1 (.9); 1 (.9)
Free/Reduced Lunch
  Eligible                                        68 (61.8)
  Not Eligible                                    42 (38.2)
English Language Learner (ELL) a
  Yes                                             8 (13.8)
  No                                              50 (86.2)
English as a Second Language (ESL) b
  Yes                                             22 (30.1)
  No                                              51 (69.9)
Gifted Program a
  Yes                                             2 (3.4)
  No                                              56 (96.6)
Accommodations for Testing a
  Yes                                             31 (53.4)
  No                                              27 (46.6)

Note. a n = 58. b n = 73.


Final Analysis Sample

The final sample consisted of 264 students across grades 3 through 8, including 134 females (50.8%) and 130 males (49.2%). The grade levels included the following: (1) 51 students in third grade (19.3%), (2) 34 students in fourth grade (12.9%), (3) 45 students in fifth grade (17.0%), (4) 43 students in sixth grade (16.3%), (5) 39 students in seventh grade (14.8%), and (6) 52 students in eighth grade (19.7%). The average age of the students in grades 3 through 8 was 11.94 years (SD = 1.79). The ethnic composition of the sample included 180 students (68.2%) categorized as White (Non-Hispanic), with the remaining individuals classified as minority (n = 84; 31.8%). The minority students were further differentiated in that 78 were Hispanic (29.5%), four were Black (Non-Hispanic; 1.5%), and two were American Indian/Alaskan Native (.8%). For this study's measure of SES, 118 students fell under the free/reduced lunch status category (44.7%) and 146 were not eligible (55.3%). Finally, 10 students were categorized as ELL (4.7%), and 39 were categorized as ESL (14.8%). This information is presented below in Table 73.


Table 73

Demographic Information for the Final Analysis Sample from the Highland School District for Grades 3 through 8 for the 2009/2010 Academic Year (N = 264)

Demographic Information                           Total (n (%))
Age                                               M = 11.94 (SD = 1.79)
Grade
  3                                               51 (19.3)
  4                                               34 (12.9)
  5                                               45 (17.0)
  6                                               43 (16.3)
  7                                               39 (14.8)
  8                                               52 (19.7)
Gender
  Male                                            130 (49.2)
  Female                                          134 (50.8)
Ethnicity
  White (Non-Hispanic)                            180 (68.2)
  Hispanic                                        78 (29.5)
  Black (Non-Hispanic)                            4 (1.5)
  American Indian/Alaskan Native                  2 (.8)
Free/Reduced Lunch
  Eligible                                        118 (44.7)
  Not Eligible                                    146 (55.3)
English Language Learner (ELL) a
  Yes                                             10 (4.7)
  No                                              203 (95.3)
English as a Second Language (ESL)
  Yes                                             39 (14.8)
  No                                              225 (85.2)
Gifted Program b
  Yes                                             11 (5.3)
  No                                              197 (94.7)
Accommodations for Testing b
  Yes                                             18 (8.7)
  No                                              190 (91.3)

Note. a n = 213. b n = 208.


Final Analysis Sample and the District Student Population. In the final analysis sample (N = 264), the proportion of students in each grade level was not comparable to the proportions reported by NCES for the district in 2008/2009. For gender and free/reduced lunch status, the final analysis sample was comparable to the NCES population information, with males and females and eligible and non-eligible students each near 50%. Finally, the ethnic composition of the final analysis sample and the NCES population information was similar, with approximately 60% to 70% of the cases in both groups identifying as White (Non-Hispanic) and approximately 30% reporting Hispanic ethnicity.

Final Analysis Sample and the Original District Sample. In the final analysis sample (N = 264), the proportion of students in each grade level was not comparable to the proportions in the original district sample (N = 374). The average ages of the samples were nearly equivalent, with a mean of 11.94 (SD = 1.79) in the final analysis sample and a mean of 11.96 (SD = 1.78) in the original sample. For gender and free/reduced lunch status, the samples were comparable, with males and females and eligible and non-eligible students each split down the middle near 50%. The ethnic composition of the samples was similar, with approximately 60% to 70% of the cases in both samples identifying as White (Non-Hispanic) and approximately 30% to 35% categorized as Hispanic. Other demographic information (i.e., ESL/ELL status and Gifted status) was nearly equivalent across the two samples, although the percentage of students needing accommodations for testing was considerably higher in the original district sample.

Total Cases Removed and Final Analysis Sample. In the total cases removed (N = 110), the proportion of students in each grade level was relatively comparable to the proportions in the final analysis sample (N = 264), except for the sixth grade. For example, 24.5% of the total cases removed were sixth graders, compared to 16.3% of the final analysis sample. The mean ages of the two samples were similar, with the total cases removed averaging 12.01 (SD = 1.77) and the final analysis sample averaging 11.94 (SD = 1.79). For gender, the samples were analogous, with males and females split approximately down the middle around 50%. The ethnic composition of the samples was dissimilar, with 45.5% of the total cases removed identifying as White (Non-Hispanic) and 52.7% categorized as Hispanic, compared to 68.2% White (Non-Hispanic) and 29.5% Hispanic in the final analysis sample. Free/reduced lunch status was also not equivalent, with 61.8% of the total cases removed being eligible compared to 44.7% of the final analysis sample. Other demographic information, such as ESL/ELL status and Gifted status, was nearly equivalent across the two samples. Unsurprisingly, students requiring accommodations for testing were not similar across samples, with a higher percentage reported in the total cases removed (53.4%) than in the final analysis sample (8.7%). This is a product of the removal of the IEP students from the analysis sample, who generally require special accommodations for testing.

Final Analysis Sample and Total Cases Removed DORA Scores. To compare the total cases removed (n = 110) with the final analysis sample (n = 264) on DORA scores, independent samples t tests were again conducted. Again, the normality assumption was violated, with most distributions on the dependent variables displaying a negative skew (i.e., many students obtained the highest possible score on a given subtest). Independence was not violated. The homogeneity of variances assumption was also violated in the majority of cases (p < .05), specifically for High-Frequency Words, Word Recognition, Phonics, and Spelling at all three data collection time points; Reading Comprehension and Oral Vocabulary at the second data collection time point; and Phonemic Awareness at the last two data collection time points. These departures from homogeneity (and the unbalanced groups) warranted the use of Welch's t test for those comparisons. The results from the t tests are summarized below in Table 74.


Table 74

Independent Samples t Tests Comparing the Final Analysis Sample (n = 264) and the Total Cases Removed (n = 110) on All DORA Subtests from the Highland School District for the 2009/2010 Academic Year

DORA Subtest Scores           Cases Removed a M (SD)   Analysis Sample M (SD)   t        df       p

Subtest Administration 1 (04/09-05/09)
1. High-Frequency Words       3.46 (.74)               3.74 (.36)               -3.00    74.39    .004
2. Word Recognition           7.63 (4.48)              10.20 (3.77)             -4.32*   91.20    .000
3. Phonics                    4.02 (1.44)              4.57 (.73)               -3.04    74.90    .003
4. Phonemic Awareness         .15 (.27)                .16 (.32)                -.43     329.00   .667
5. Oral Vocabulary            5.16 (1.79)              6.05 (2.29)              -2.97    329.00   .003
6. Spelling                   2.61 (1.55)              3.96 (2.33)              -5.69*   150.94   .000
7. Reading Comprehension      5.03 (3.86)              6.66 (3.79)              -3.12*   329.00   .002

Subtest Administration 2 (08/09-09/09)
1. High-Frequency Words       3.43 (.73)               3.71 (.41)               -3.49*   106.40   .001
2. Word Recognition           7.30 (4.58)              10.47 (3.72)             -5.88*   127.43   .000
3. Phonics                    3.88 (1.56)              4.59 (.75)               -4.12*   100.76   .000
4. Phonemic Awareness         .15 (.27)                .04 (.15)                3.58*    106.75   .001
5. Oral Vocabulary            5.15 (1.87)              6.22 (2.34)              -4.35*   184.86   .000
6. Spelling                   2.62 (1.70)              3.97 (2.46)              -5.74*   215.79   .000
7. Reading Comprehension      4.12 (3.53)              6.74 (3.97)              -5.83*   165.94   .000

Subtest Administration 3 (12/09-01/10)
1. High-Frequency Words       3.52 (.70)               3.77 (.25)               -3.33*   98.22    .001
2. Word Recognition           8.66 (4.41)              10.85 (3.23)             -4.34*   124.80   .000
3. Phonics                    4.18 (1.26)              4.64 (.59)               -3.42*   103.87   .001
4. Phonemic Awareness         .11 (.24)                .03 (.14)                2.92     110.90   .004
5. Oral Vocabulary            5.32 (2.09)              6.48 (2.37)              -4.15*   353.00   .000
6. Spelling                   3.14 (2.19)              4.29 (2.45)              -4.20*   173.44   .000
7. Reading Comprehension      4.80 (3.38)              7.19 (3.76)              -5.35*   353.00   .000

Note. There are eight DORA subtests. Fluency scores have been omitted from analyses and reporting due to this test being administered infrequently. Violations of the homogeneity assumption were noted for High-Frequency Words, Word Recognition, Phonics, and Spelling for all three data collection time points; Reading Comprehension and Oral Vocabulary for the second data collection time point; and Phonemic Awareness for the last two data collection time points (p < .05).
a For the first administration, n = 67 for the cases removed; for the second administration, n = 88; for the third administration, n = 91.
* p < .002 (α = .002; .05/21 = .002 for the Bonferroni correction).


The total cases removed performed consistently lower on average on the majority of the DORA subtests at most data collection points during the current academic year. More specifically, at the first data collection time point, the cases removed performed significantly lower on average than the analysis sample on the Word Recognition, Spelling, and Reading Comprehension subtests (p < .002). At the second data collection time point, the cases removed differed significantly from the analysis sample on all subtests, scoring lower on all but Phonemic Awareness. Finally, at the third data collection time point, the cases removed performed significantly lower than the analysis sample on all subtests except Phonemic Awareness. This consistent pattern, in which the total cases removed performed significantly lower on the overwhelming majority of DORA subtests, further supported the exclusion of these cases from the final analysis sample.

DORA Subtests Used

As mentioned in the analysis of Research Question 1, three subtests (i.e., High-Frequency Words, Phonics, and Phonemic Awareness) do not have sufficient variability to be examined in the analysis of this final research question. All three subtests demonstrate either a ceiling effect or a floor effect. The first subtest in question is High-Frequency Words, which has a range of 0 to 3.83 (i.e., Kindergarten through high third grade). Across all grade levels for this subtest, the mean was consistently above 3, mostly in the mid to high 3 range, with very small standard deviations. The second subtest in question is Phonics, which has a range of 0 to 4.83 (i.e., Kindergarten through high fourth grade). Again, across all grade levels for this subtest, the mean was consistently above 4, with the averages mostly in the mid to high 4 range.


The standard deviations also reflected this ceiling effect. Finally, the third subtest, Phonemic Awareness, demonstrated a floor effect. Scores on this subtest are based on percent correct out of 9 questions. The score ranges are as follows: (1) 1% to 43% indicates probable weaknesses, (2) 44% to 65% indicates partial mastery, and (3) 66% and above indicates probable effective skills (Let's Go Learn, Inc., 2009a). The averages for this subtest were between 1% and 5% for grades 5 through 8, with the majority having a very low percentage and small standard deviations. Grades 3 and 4 displayed more variation because this subtest is only administered to very poorly performing students. As mentioned previously, the preceding subtests are used to gauge whether a student will be administered the Phonemic Awareness subtest. Ideally, as students progress in their reading ability, this subtest will not be administered as frequently, with students receiving zeros if they do not need to take this particular test. Thus, above the third or fourth grade, this test is usually not administered unless the student has severe problems in a number of other reading subtest areas.

Teacher OFAS Scores

As mentioned in the results for Research Question 2, two final OFAS measures will be used in the analysis of Research Question 3. The first is the 50-item survey presented in the second run of the Rasch Analysis. Additionally, a potential 10-question subscale/separate measure will be examined in Research Question 3 as well. It was noted that the items regularly appearing as misfitting across the runs of the data in Research Question 2 were all conceptually related; these items were contained in the section of the survey entitled Using the Results. This suggests that this 10-question section of the OFAS is perhaps a subscale or separate measure that can be used in place of the larger scale to assess teacher online formative assessment use. Thus, both the 50-question OFAS and the 10-question OFAS will be used in examining this third research question to determine which is the better predictor of student formative assessment scores (i.e., DORA).

For the 50-question OFAS, teachers could indicate on a scale from 0 to 3 how often in a given quarter/semester they engaged in a specific online formative assessment-related activity. Thus, for the 50-question OFAS, scores could range from 0 to 150. For the original sample (N = 19), the mean was 94.53 (SD = 24.01) with a range of 80; the minimum score was 49 and the maximum was 129. Histograms and skewness and kurtosis statistics indicated that these scores were approximately normally distributed. For the final analysis sample (N = 11), the mean was 89.18 (SD = 23.91) with a range of 77; the minimum score was 49 and the maximum was 126. Again, histograms and skewness and kurtosis statistics indicated that the scores were normally distributed. For the 10-question OFAS, the potential range was 0 to 30. For the original sample, the mean was 19.26 (SD = 4.93) with a range of 17; the minimum score was 9 and the maximum was 26. Histograms and skewness and kurtosis statistics indicated that the scores were approximately normally distributed for this abbreviated measure, with a slight negative skew. For the final analysis sample, the mean was 18.55 (SD = 5.66) with a range of 17, and the minimum and maximum scores were the same as in the original sample. The scores were again normally distributed based on histograms and skewness and kurtosis statistics. Overall, these descriptives indicate that the final analysis sample's scores were comparable to the scores of the entire district's reading teachers for both versions of the OFAS. They also indicate that there is sufficient variability to examine both the 50-question and 10-question OFAS in the analysis of the current research question involving multilevel growth modeling.

Hierarchical Linear Growth Modeling Assumptions

As noted when discussing the assumptions in Research Question 1, fitting the submodels to the data entails assumptions about the distribution of the Level 1, Level 2, and Level 3 residuals, from occasion to occasion and from person to person (Singer & Willett, 2003). The model assumptions of linearity, normality, and homoscedasticity for the data in this third research question are examined in the following paragraphs.

Linearity. To examine linearity at Level 1, empirical growth plots of the outcome of interest should be examined for each individual. Thus, four sets of scatterplots were produced, one for each DORA outcome. All individuals were examined for each outcome (i.e., N = 264 for each of the four DORA subtests), but only the overall mean DORA subtest score across students is plotted in the figure below (see Figure 50). The figure shows Time on the X-axis, which spans the 2009/2010 academic year (i.e., three time points), and the overall mean across students for each DORA subtest (i.e., the outcome in each of the HLM models: Word Recognition, Oral Vocabulary, Spelling, and Reading Comprehension) on the Y-axis. To minimize the number of figures in this document, only Word Recognition is displayed below as an example.
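A minimal sketch of how a plot like Figure 50 could be produced from a long-format data file is shown below; the column names (time_code and the subtest column) are hypothetical, and the code is illustrative only.

```python
# Illustrative sketch: mean DORA subtest score at each of the three time codes,
# mirroring the overall-mean scatterplot described for Figure 50.
import matplotlib.pyplot as plt
import pandas as pd

def plot_mean_growth(long_df: pd.DataFrame, subtest: str = "word_recognition"):
    means = long_df.groupby("time_code")[subtest].mean()
    plt.scatter(means.index, means.values)
    plt.xlabel("Time (coded months, 2009/2010 academic year)")
    plt.ylabel("Mean " + subtest + " score")
    plt.show()
```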


Figure 50. Scatterplot to check the linearity assumption at Level 1 of the HLM Growth Model in Research Question 3. The scatterplot of the overall DORA Word Recognition means across students displays linear change with time (i.e., the 2009/2010 academic year). Time on the X-axis is represented by the time code used in the multilevel growth model.

Although not shown above, the empirical growth plots for each student on each DORA subtest suggested that most students show linear change with time. For others (i.e., a minority of cases), the small number of waves of data (i.e., three time points across the 2009/2010 academic year) makes it difficult to assess growth accurately, with some trajectories appearing curvilinear and others seemingly having no linear relationship. In the scatterplots of the overall subtest means across students, the plots suggest linear change with time. Linearity at Level 2 does not need to be assessed because all Level 2 predictors are dichotomous (i.e., Sex, Ethnicity, Free/Reduced Lunch status, and ESL/ELL status). Linearity is difficult to assess at Level 3 of the current model, as there are only 11 teachers. Thus, linearity will be examined for each teacher as was done above with the Level 1 data. The first and last teachers are used as examples in the figures below (see Figures 51 and 52). Again, a linear relationship is hypothesized between Time and the mean of each DORA subtest as the outcome within each teacher. To minimize the number of figures in this document, only Word Recognition is displayed below.

Figure 51. Scatterplot to check the linearity assumption at Level 3 of the HLM Growth Model in Research Question 3. The scatterplot of the overall DORA Word Recognition means across students for Teacher ID 2 displays linear change with time (i.e., the 2009/2010 academic year). Time on the X-axis is represented by the time code used in the multilevel growth model.


Figure 52. Scatterplot to check the linearity assumption at Level 3 of the HLM Growth Model in Research Question 3. The scatterplot of the overall DORA Word Recognition means across students for Teacher ID 19 displays linear change with time (i.e., the 2009/2010 academic year). Time on the X-axis is represented by the time code used in the multilevel growth model.

As demonstrated in the majority of the graphs above, the assumption of linearity was upheld in that the plots of Time versus the mean of each DORA subtest across the teachers indicate a linear relationship. Although some relationships appear curvilinear (e.g., Teacher ID 19 for Spelling and Reading Comprehension), the linear trajectory is a reasonable approximation given the low number of waves of data.

Normality. The HLM software for the three-level model produces three residual files, one at each level. These files contain the Empirical Bayes (EB) residuals defined at the various levels, fitted values, OLS residuals, and EB coefficients. Unfortunately, some statistics provided in the residual file of two-level HLMs, for example the Mahalanobis distance measures, are not available in the residual files produced by three-level HLMs. Thus, residual files were produced for each level of the three-level model for each DORA subtest as the outcome, for a total of 12 residual files. Normality can be assessed in a number of ways. For the current research question, normality was assessed by examining histograms of the Level 1 residuals to check their approximation to a normal curve. The Level 1 residuals for each DORA subtest were examined (see Figure 53 below). Only the histogram for Word Recognition is displayed to minimize the number of figures in this document.

Figure 53. Histogram of the Level 1 residuals to examine the normality assumption in the model with Word Recognition as the outcome. The residuals (i.e., l1resid above) should approximate a normal curve. All histograms for each DORA subtest as the outcome in separate models appeared to approximate a normal distribution.


This assumption states that all residuals should be normally distributed. The histograms for each DORA subtest appeared to approximate a normal distribution, although the distributions were somewhat leptokurtic, with Word Recognition appearing the most problematic, showing visible extreme scores. For Level 2 normality, the assumption was checked again for each DORA subtest as the outcome in the model. The residuals for the intercept and slope in each model were examined, rendering a total of eight histograms for the Level 2 normality analysis (see Figures 54 and 55 below). Only the intercept and slope residuals for the model with Word Recognition as the outcome are depicted below to minimize the number of figures in this document.


Figure 54. Histogram of the Level 2 residuals to examine the normality assumption in the model with Word Recognition as the outcome. The residuals (i.e., ebintrcp above, which means Empirical Bayes Intercept residuals) should approximate a normal curve. All histograms for each DORA subtest as the outcome in separate models appeared to approximate a normal distribution at the intercepts, with Word Recognition appearing slightly leptokurtic.


Figure 55. Histogram of the Level 2 slope residuals to examine the normality assumption in the model with Word Recognition as the outcome. The residuals (i.e., ebslope above, which means Empirical Bayes Slope residuals) should approximate a normal curve. All histograms for each DORA subtest as the outcome in separate models appeared to approximate a normal distribution at the slopes, with Word Recognition appearing slightly leptokurtic.

As with the Level 1 normality analysis, this assumption states that all residuals should be normally distributed. All histograms (i.e., the intercept and slope residuals for each model) for all the DORA subtests as outcomes appeared to approximate a normal distribution, with Reading Comprehension's intercept and slope residuals demonstrating the closest approximation. Word Recognition's intercept and slope residual histograms appeared slightly leptokurtic, and Spelling's intercept residuals displayed a slight positive skew.
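As a numerical complement to these histograms, skewness and excess kurtosis of the exported residuals could be computed directly; a minimal sketch is below. The column names follow the residual-file labels mentioned in the figure captions (ebintrcp, ebslope), but the code itself is illustrative and not part of the study's analyses.

```python
# Illustrative sketch: skewness and excess kurtosis of the Level 2 Empirical Bayes
# residuals as a numerical check on the histogram-based normality assessment.
import pandas as pd
from scipy import stats

def residual_shape(residual_file: pd.DataFrame) -> pd.DataFrame:
    columns = ["ebintrcp", "ebslope"]
    return pd.DataFrame(
        {
            "skewness": [stats.skew(residual_file[c].dropna()) for c in columns],
            "excess_kurtosis": [stats.kurtosis(residual_file[c].dropna()) for c in columns],
        },
        index=columns,
    )
```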


Finally, Level 3 normality will not be examined directly due to the small number of teachers (N = 11).

Homogeneity of Variance. Homogeneity of variance at Level 1 was checked using the HLM software's chi-square test. In the full contextual model (described later), the homogeneity of variance assumption was examined for all outcomes at Level 1. The results suggested that the assumption was violated for all four outcomes (i.e., Word Recognition: χ² = 1092.52, df = 10, p = .000; Oral Vocabulary: χ² = 106.96, df = 10, p = .000; Spelling: χ² = 505.94, df = 10, p = .000; Reading Comprehension: χ² = 47.43, df = 10, p = .000). These violations of homogeneity may be a result of the nonnormality observed in the residual analysis above (i.e., the histograms).

Homogeneity at Level 2 can be examined by plotting raw residuals against the predictors. The raw residuals were plotted against Sex, Ethnicity, Free/Reduced Lunch status, and ESL/ELL status. If the assumption is satisfied, residual variability will be approximately equal at every predictor value. Although all covariates were examined, to minimize the number of scatterplots presented in this document, only those for Sex are reported for the Word Recognition DORA subtest. Because Level 2 homogeneity is being examined, plots for the intercept and slope were created, producing a total of eight scatterplots (see Figures 56 and 57 below, with Word Recognition as an example).


Figure 56. Residuals plotted against the Sex covariate to examine the Level 2 homogeneity of variance assumption in the model with Word Recognition as the outcome. The residuals (i.e., ebintrcp above on the Y-axis, which means Empirical Bayes Intercept residuals) should be approximately equal at every covariate value to meet this assumption. For Sex, 0 is Male and 1 is Female. All plots for each DORA subtest as the outcome in separate models do not have approximately equal range and variability for Males and Females at the intercepts. This was the same pattern observed for the other covariates Ethnicity, ESL/ELL, and Free/Reduced Lunch.


Figure 57. Residuals plotted against the Sex covariate to examine the Level 2 homogeneity of variance assumption in the model with Word Recognition as the outcome. The residuals (i.e., ebslope above on the Y-axis, which means Empirical Bayes Slope residuals) should be approximately equal at every covariate value to meet this assumption. For Sex, 0 is Male and 1 is Female. All plots for each DORA subtest as the outcome in separate models do not have approximately equal range and variability for Males and Females at the slopes. This was the same pattern observed for the other covariates Ethnicity, ESL/ELL, and Free/Reduced Lunch.

For this assumption at Level 2, residual variability should be approximately equal at every predictor value. Thus, for the figures above, where 0 is Male and 1 is Female, the residual spread should be approximately equal for those two values. The Level 2 residuals do not have approximately equal range and variability for Males and Females for all DORA subtests. For example, in Word Recognition, for both the intercept and slope, there appear to be some outliers among the Females. For Reading Comprehension, in the intercept and slope figures, there is more spread (and more outliers) in the Female group than among the Males. Males appear to have more variability in the Oral Vocabulary graphs, with some remarkable outliers. Thus, overall, for the Sex predictor, the homoscedasticity assumption does not appear to be satisfied.

For the other predictors not depicted in scatterplots in this document, similar patterns were noted for Ethnicity (i.e., 0 = White and 1 = Minority) and ESL/ELL status (i.e., 0 = non-ESL/ELL status and 1 = ESL/ELL status), in that the variability or spread was not equal between the groups. For Free/Reduced Lunch status (i.e., 0 = non-Free/Reduced Lunch status and 1 = Free/Reduced Lunch status), the homoscedasticity assumption appears to have been upheld for every DORA subtest for both the intercept and slope figures. Comparing all the predictors, Sex and Free/Reduced Lunch status had the closest variability between the categories examined. These predictors also had nearly equivalent numbers in each group, in contrast to Ethnicity and ESL/ELL status, where the proportions of White and non-ESL/ELL students were considerably higher.

Again, Level 3 homogeneity of variance is difficult to assess due to the small number of teachers (N = 11). If plots were examined for the intercepts and slopes for the OFAS predictor in the Level 3 residual file, there would not be enough values in each OFAS score category to assess spread and variability. However, the main relationship of interest in the current research question is between teacher OFAS score at Level 3 and student DORA score at Level 1. Thus, plotting the student raw residuals at Level 1 for each DORA subtest against the OFAS as a predictor at Level 3 can indirectly assess homoscedasticity at the teacher level. As with the plots at Level 2, if the assumption is satisfied, residual variability will be approximately equal at every predictor value. Technically, the intercept and slope at Level 3 should be examined; however, as mentioned previously, the relationship of interest is between the Level 3 teacher OFAS score and the Level 1 student DORA score for each subtest. The EB intercept and EB slope values for the Level 1 and Level 3 relationship are not provided in the Level 1 residual file (or in any other file). Therefore, only the Level 1 residuals were used at the student level, which can adequately address homoscedasticity at Level 3. Thus, four scatterplots were created, one for each DORA subtest (see Figure 58 below); only the scatterplot for Word Recognition as the outcome is shown to minimize the number of figures in this document. The plots contain the standardized OFAS score, which is the raw score minus the mean across all 11 teachers' scores. Only nine categories are displayed on the scatterplot below because two teachers had the same OFAS score.
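A minimal sketch of the kind of residual-versus-predictor plot used here (Level 1 residuals against the centered teacher OFAS score) follows; the column names are hypothetical apart from l1resid, which appears in the figure caption, and the code is illustrative only.

```python
# Illustrative sketch: Level 1 residuals plotted against the grand-mean-centered
# teacher OFAS score to eyeball homoscedasticity at the teacher level (cf. Figure 58).
import matplotlib.pyplot as plt
import pandas as pd

def plot_level1_resid_vs_ofas(df: pd.DataFrame):
    plt.scatter(df["ofas_centered"], df["l1resid"], alpha=0.6)
    plt.axhline(0.0)
    plt.xlabel("Teacher OFAS score (centered)")
    plt.ylabel("Level 1 residual")
    plt.show()
```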


Figure 58. Residuals plotted against teacher OFAS score to examine the Level 3 homogeneity of variance assumption in the model with Word Recognition as the outcome. The residuals (i.e., l1resid above on the Y-axis) should be approximately equal at every OFAS value to meet this assumption. All plots for each DORA subtest as the outcome in separate models do not have approximately equal range and variability for every OFAS value. This was the same pattern observed for the other DORA subtest outcomes.

For this assumption at Level 3, residual variability should be approximately equal at every predictor value. Thus, in the figure above, the residual spread should be approximately equal across the teachers' OFAS scores. The Level 3 residuals do not have approximately equal range and variability across teachers for all DORA subtests. For example, for Oral Vocabulary, Spelling, and Reading Comprehension, the plots of the Level 1 residuals against the standardized teacher OFAS scores reveal a precipitous drop in variability at the lower predictor values (i.e., OFAS < 14). This suggests some potential heteroscedasticity in this region for these three subtests; however, the smaller sample sizes of some of the teacher groups make it difficult to reach definitive conclusions. Overall, the homoscedasticity assumption does not appear to be satisfied across teachers.

As described above, some assumptions have been violated, which can increase the likelihood of committing a Type I or Type II error. However, due to the small sample size in this research question, eliminating more cases at Level 1 or groups at Levels 2 and 3 is not advisable, as doing so would likely harm the validity of the study as much as the assumption violations do. Furthermore, the data for this research question were already screened, and problematic cases and groups were removed (i.e., see the Descriptives section). Based on this information, the results for the current research question should be interpreted with caution. The HLM results will be reported with robust standard errors, as is advised when assumptions have been violated. If the model-based and robust standard errors for the estimates differ, this gives evidence of a misspecification of the distribution of random effects and suggests that distributions other than the normal distribution may be more appropriate. However, it should be noted that the coefficients in the HLM output with and without robust standard errors did not differ greatly, which supports the interpretation and validity of the following results.

Three-Level Hierarchical Linear Growth Model Results

A three-level Hierarchical Linear Growth Model was computed to examine the relationship between teacher OFAS scores and student DORA scores (i.e., growth) across grades 3 through 8 for the 2009/2010 academic year. More specifically, the goal of this research question was to examine whether teacher OFAS scores are related to student DORA growth, controlling for student demographic variables. The hypothesis is that OFAS scores will be a significant predictor of DORA scores (i.e., teacher OFAS scores will be a significant, positive predictor of student DORA growth). Four models, one for each DORA subtest outcome (i.e., Word Recognition, Oral Vocabulary, Spelling, and Reading Comprehension), were run following the usual model-building strategy: the One-Way Random Effects ANOVA, followed by the Unconditional Growth Model (or Random Coefficients Model), followed by various Conditional Growth Models (or Contextual Models), and ending with the Full Model. Included in Level 1 was the time of DORA data collection across the current academic year (i.e., April/May 2009, August/September 2009, December 2009/January 2010). Level 2 variables were various demographics, including Sex, Ethnicity, ESL/ELL status, and Free/Reduced Lunch status. Finally, the Level 3 variable was teacher OFAS score. Time at Level 1 and the demographic controls at Level 2 were uncentered, and teacher OFAS score at Level 3 was centered around the grand mean. The models were analyzed using the statistical package Hierarchical Linear Modeling (HLM) 6.08 (Raudenbush, Bryk, & Congdon, 2004) with Full Maximum Likelihood estimation (FEML), which is the default when conducting a three-level HLM in the software program (i.e., because more complex models require larger sample sizes). Sex was coded 0 for Male and 1 for Female. Ethnicity was coded 0 for White and 1 for Minority. For ESL/ELL and Free/Reduced Lunch status, students who did not fall under those categories were coded 0, and their ESL/ELL and Free/Reduced Lunch counterparts were coded 1. The full model is described again below. The model at Level 1 describes the general trajectory for each person across time. The model at Level 1 was the following:

Ytij = π0ij + π1ij(Time)tij + etij                                           [5]

where Ytij is the student's DORA score at time t for student i with teacher j, (Time)tij is the elapsed months of DORA use during the current academic year, π0ij (i.e., the intercept) is a student's initial score, and π1ij (i.e., the growth rate over all months, or slope) represents the student's expected change in DORA score for a three- to four-month (i.e., one-unit) increase. Finally, this model assumes that etij is a student-, time-, and teacher-specific residual.

The model at Level 2 describes how the above trajectories tend to vary based on various student demographic information. The individual growth parameters become the outcome variables in the Level 2 models, where they are assumed to vary across individuals depending on various demographic controls. The model at Level 2 was the following:

π0ij = β00j + β01j(SEX)ij + β02j(ETHNIC)ij + β03j(ESLELL)ij + β04j(FREERED)ij + r0ij
π1ij = β10j + β11j(SEX)ij + β12j(ETHNIC)ij + β13j(ESLELL)ij + β14j(FREERED)ij + r1ij    [6]

where π0ij and π1ij are the student- and teacher-specific DORA score parameters, β00j is the baseline expectation (i.e., initial DORA status) for the demographic predictors coded as 0, and β10j is the expected linear change in DORA for the demographic predictors coded as 0. Finally, r0ij and r1ij are residuals.


Level 3 of this model describes how the estimates of the intercepts and time slopes (i.e., the growth curves) vary based on teacher OFAS score. The model at Level 3 was the following:

β00j = γ000 + γ001(OFAS)j + u00j
β01j = γ010
β02j = γ020
β03j = γ030
β04j = γ040
β10j = γ100 + γ101(OFAS)j + u10j                                             [7]
β11j = γ110
β12j = γ120
β13j = γ130
β14j = γ140

where the outcomes (i.e., β00j, etc., and β10j, etc.) are the teacher-specific DORA intercepts and slopes (i.e., β00j = the mean initial status within teacher j, and β10j = the mean academic-year DORA growth rate), γ000 is the baseline expectation (i.e., initial DORA status), and γ100 is the expected linear change in DORA for teacher j (i.e., the direction and strength of the association between teachers), or the overall mean academic-year DORA growth rate. Additionally, u00j through u14j are residuals that represent the deviation of teacher j's coefficient from its predicted value based on this teacher-level model.


The OFAS was not modeled for all of the intercepts and slopes in the Level 3 equations; the Level 3 equations of interest were those for β00j and β10j. Substantively, the research question and related hypothesis focused on the relationship between the OFAS and DORA (i.e., not between the OFAS and the demographic predictors). For example, the current study is not interested in examining whether teacher OFAS scores are a good predictor of Ethnicity slope differences. Therefore, OFAS was not included as a predictor in eight of the 10 equations, as shown above. Additionally, as will be shown later in the Conditional Growth Model with just the demographic predictors at Level 2, the final estimates of the Level 3 variance components for the demographic variables in the intercept and slope equations (i.e., the ones removed above) are very small, with most being significant. When this occurs, it is common to fix some of the effects (i.e., eliminate the error term). Modeling small error variances can be problematic, especially with a small sample size as in the current study, due to the increase in the number of parameters estimated and the loss of degrees of freedom. The most parsimonious model fixes these effects and models only the variances related to the substantive questions of interest. Thus, for the eight equations above not including the OFAS, the effects were set as fixed (i.e., the error term was eliminated) in every final model run for all the DORA outcomes.

Word Recognition

One-Way Random Effects ANOVA Model. Table 75 (below) shows the results of the One-Way Random Effects ANOVA Model (i.e., the Empty Model). The intercept in this empty model is simply the average Word Recognition score per student regardless of time (i.e., the average of all the student means across all time points). The average student Word Recognition mean was statistically different from zero (γ000 = 9.75, t = 12.53, df = 10, p = .000). Considerable variation in the student Word Recognition means exists (τπ00 = 5.37, χ² = 3062.04, df = 253, p = .000), and variation also exists between teachers (τβ00 = 6.36, χ² = 309.04, df = 10, p = .000). The total variability was 13.18. The proportion of variance within students was 11%, the proportion of variance between students within teachers was 40.74%, and the proportion of variance between teachers was 48.25%. Based on the significant amount of unexplained variability, an additional Level 1 predictor (i.e., Time) was added in the following models to try to reduce the variation within students, along with Level 2 and Level 3 variables to explain between-student and between-teacher differences.


Table 75

One-Way Random Effects ANOVA Model with the DORA Word Recognition (WR) Subtest

Fixed Effects                                 Coefficient (SE)   t (df)           p
Model for Initial WR Status (π0ij)
  Intercept (γ000)                            9.75 (.78)         12.53*** (10)    .000

Random Effects                                Variance           df      χ²              p
Level 1 Temporal Variation (etij)             1.45
Level 2 Student Baseline (r0ij)               5.37               253     3062.04***      .000
Level 3 Teacher Baseline (u00j)               6.36               10      309.04***       .000

Note. Deviance (FEML) = 3235.15; 4 estimated parameters.
*** p < .001
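For readers more familiar with general-purpose statistical software than with HLM 6.08, a simplified two-level version of this empty model (repeated measures nested within students, omitting the teacher level) could be fit as sketched below. The data frame and column names (wr, student_id) are hypothetical, and this sketch is not how the study's estimates were obtained.

```python
# Illustrative sketch: a two-level "empty" model and the resulting variance partition,
# analogous in spirit to the One-Way Random Effects ANOVA model in Table 75 but
# omitting the teacher level that the study modeled in HLM 6.08.
import statsmodels.formula.api as smf

def fit_empty_model(long_df):
    model = smf.mixedlm("wr ~ 1", data=long_df, groups=long_df["student_id"])
    result = model.fit(reml=False)               # full maximum likelihood
    between_students = result.cov_re.iloc[0, 0]  # random-intercept variance
    within_students = result.scale               # residual (Level 1) variance
    icc = between_students / (between_students + within_students)
    return result, icc
```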

Unconditional Growth Model. Table 76 (below) shows the results of the Unconditional Growth Model, with Time (i.e., in months) as the sole predictor at Level 1 and no Level 2 or Level 3 variables. After including Time as a predictor of Word Recognition score within students, within-student variability was reduced by 27.58% relative to the One-Way Random Effects ANOVA Model. The percentage of variability between teachers was 55.97% for the intercept (i.e., initial status) and 31.51% for the growth rate (i.e., slope). The remaining variation in Word Recognition score after the linear effect of Time at Level 1 was controlled was 1.05 for Level 1 (i.e., across the current academic year), 6.16 for Level 2 (i.e., the student level), and 7.83 for Level 3 (i.e., the teacher level). Thus, of the variance in Word Recognition scores after the linear effect of Time was controlled, 6.98% was associated with nonlinear and residual effects of Time, 40.89% with variation between students within teachers, and 52.06% with between-teacher variation.

The overall mean Word Recognition score across students was still significantly different from zero (γ000 = 9.32, t = 10.80, df = 10, p = .000). Also, the Time slope (i.e., the effect of Time on Word Recognition score) was significant across students (γ100 = .12, t = 3.79, df = 10, p = .004): for each three- to four-month increase in Time, there was an average increase of .12 points in student Word Recognition scores. The correlation between initial status and linear growth was -.38 (p < .05), meaning that students who had a low initial Word Recognition score had faster growth across the year (i.e., rate of change). However, statistically significant variability in the Word Recognition means still exists after considering Time (τπ00 = 6.16, χ² = 1936.57, df = 253, p = .000), as does statistically significant variability in individual student growth rates (τπ11 = .02, χ² = 348.82, df = 253, p = .000). Similarly, statistically significant variability exists between teachers in mean initial status (τβ00 = 7.83, χ² = 317.95, df = 10, p = .000) and in mean growth (τβ11 = .01, χ² = 48.26, df = 10, p = .000). The between-student variability will be addressed in the following model by incorporating demographic covariates at Level 2.
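The proportional reduction in Level 1 variance reported above follows directly from the variance components in Tables 75 and 76, as the short check below illustrates.

```python
# Proportional reduction in within-student (Level 1) variance after adding Time,
# using the Level 1 variance components reported in Tables 75 and 76.
sigma2_empty = 1.45    # Level 1 variance, One-Way Random Effects ANOVA model
sigma2_growth = 1.05   # Level 1 variance, Unconditional Growth Model
reduction = (sigma2_empty - sigma2_growth) / sigma2_empty
print(round(reduction, 4))  # ~0.2759, i.e., roughly the 27.58% reported above
```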


Table 76
Unconditional Growth Model with the DORA Word Recognition (WR) Subtest

Fixed Effects                              Coefficient (SE)    t (df)           p
Model for Initial WR Status (π0ij)
  Intercept (γ000)                         9.32 (.86)          10.80*** (10)    .000
Model for WR Growth Rate (π1ij)
  Intercept (γ100)                         .12 (.03)           3.79** (10)      .004

Random Effects                             Variance    df     χ²               p
Level 1  Temporal Variation (etij)         1.05
Level 2  Student Baseline (r0ij)           6.16        253    1936.57***       .000
         Student Growth Rate (r1ij)        .02         253    348.82***        .000
Level 3  Teacher Baseline (u00j)           7.83        10     317.95***        .000
         Teacher Mean Growth (u10j)        .01         10     48.26***         .000

Note. Deviance (FEML) = 3156.35; 9 estimated parameters.
** p < .01; *** p < .001
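The deviance statistics reported in the notes to Tables 75 and 76 can also be read as a likelihood-ratio comparison of the two models. The short sketch below is illustrative only (it is not output from the HLM software) and assumes the models are nested and both estimated with full maximum likelihood, as the FEML label indicates.

    # Compare the empty model (Table 75) with the unconditional growth model (Table 76)
    # via the drop in FEML deviance; values are copied from the table notes.
    from scipy.stats import chi2

    dev_empty, k_empty = 3235.15, 4      # deviance and estimated parameters, Table 75
    dev_growth, k_growth = 3156.35, 9    # deviance and estimated parameters, Table 76

    lr = dev_empty - dev_growth          # likelihood-ratio statistic (78.80)
    df = k_growth - k_empty              # difference in estimated parameters (5)
    print(f"LR = {lr:.2f} on {df} df, p = {chi2.sf(lr, df):.3g}")

    # Because variance components sit on the boundary of the parameter space, this
    # chi-square reference is conservative; it is a rough check, not a replacement
    # for the per-component chi-square tests reported in the tables.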

Conditional Growth Model. Table 77 below shows the results of the Conditional Growth Model with Time in Level 1 and all the demographic variables in Level 2. After including Sex, Ethnicity, ESL/ELL status, and Free/Reduced Lunch status in Level 2, 10.71% of the variance in the between student differences in mean Word Recognition was accounted for by these predictors (i.e., 10.71% of the variability in initial status is explained by the demographic controls). However, since this result was statistically significant (τπ00 = 5.50, χ² = 1443.70, df = 201, p = .000), there are still considerable differences between students that might be explained by other Level 2 variables.

Similarly, 26.51% of the variability in the effect of time (i.e., 26.51% of the variance in linear growth rates) within students can be explained by the demographic predictors added. Since this result was found to be statistically significant (τπ11 = .01, χ² = 263.90, df = 201, p = .002), between student differences in the effect of time are not fully accounted for by the demographic controls. The remaining variation in Word Recognition score after linear effects of Time at Level 1 and demographics at Level 2 were controlled for was 1.05 for Level 1 (i.e., across the current academic year), 5.50 for Level 2 (i.e., the student level), and 7.02 for Level 3 (i.e., the teacher level). Thus, the variance in Word Recognition scores after the linear effect of Time was controlled was 7.74% associated with nonlinear and residual effects of Time, 40.53% associated with variation between students within teachers, and 51.73% associated with between-teacher variation. The overall mean Word Recognition score across students was still significantly different from zero (γ000 = 9.81, t = 11.88, df = 10, p = .000). This is the mean Word Recognition score when Sex is 0 (i.e., Male), Ethnicity is 0 (i.e., White), ESL/ELL status is 0 (i.e., not in the ESL/ELL program), and Free/Reduced Lunch status is 0 (i.e., not in the Free/Reduced Lunch program). There was no statistically significant effect of any of the demographic controls on mean Word Recognition score (p > .05 for all). This means that there was no statistically significant increase or decrease in student mean Word Recognition score for Females, Minorities, ESL/ELL and Free/Reduced Lunch status students. In this model, all groups performed equally on average on the Word Recognition subtest.


Also, the Time slope (i.e., the effect of Time on Word Recognition score) across students was still significantly different from zero (γ100 = .13, t = 2.69, df = 10, p = .023). The effect of Time on mean Word Recognition score is positive on average when Sex is 0 (i.e., Male), Ethnicity is 0 (i.e., White), ESL/ELL status is 0 (i.e., not in the ESL/ELL program), and Free/Reduced Lunch status is 0 (i.e., not in the Free/Reduced Lunch program). That is, for each three to four month increase in Time, there was an average increase of .13 points in student Word Recognition scores for the predictors at Level 2 coded 0. For Females, Minorities, ESL/ELL and Free/Reduced Lunch status students, the effects of the demographic predictors on the Word Recognition growth rate were not statistically significant (p > .05 for all). That is, there were no statistically significant differences in the effect of Time (i.e., rate of Word Recognition change) between the different demographic groups as predictors in this model. As shown below, the final estimates of the Level 3 variance components for the demographics are small, although several are statistically significant. For example, Ethnicity and Free/Reduced Lunch have intercept and slope random effects with p values below the alpha level (p < .05). It is common practice to fix these effects (i.e., remove or toggle off the error term) in subsequent models, as noted above when discussing the Level 3 equations for the final model. Doing so can create a more parsimonious model, as modeling all the variances will increase the number of parameters estimated, which can be complicated with undersized samples.
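Written out as prediction equations, the fixed-effect estimates in Table 77 (below) imply the following expected initial status and growth rate, with each dummy variable coded 1 for the group named; since none of the demographic coefficients reach significance, these equations are descriptive only.

    Predicted initial status:  π0ij = 9.81 - .43(Female) + .07(Minority) - .28(ESL/ELL) - .87(Free/Reduced Lunch)
    Predicted growth rate:     π1ij = .13 - .02(Female) - .03(Minority) - .01(ESL/ELL) + .04(Free/Reduced Lunch)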


Table 77
Conditional Growth Model with the DORA Word Recognition (WR) Subtest

Fixed Effects                              Coefficient (SE)    t (df)           p
Model for Initial WR Status (π0ij)
  Intercept (γ000)                         9.81 (.83)          11.88*** (10)    .000
  Sex (γ010)                               -.43 (.34)          -1.27 (10)       .233
  Ethnicity (γ020)                         .07 (.42)           .16 (10)         .877
  ESL/ELL (γ030)                           -.28 (.50)          -.57 (10)        .582
  Free/Reduced Lunch (γ040)                -.87 (.51)          -1.70 (10)       .119
Model for WR Growth Rate (π1ij)
  Intercept (γ100)                         .13 (.05)           2.69* (10)       .023
  Sex (γ110)                               -.02 (.04)          -.67 (10)        .516
  Ethnicity (γ120)                         -.03 (.05)          -.52 (10)        .616
  ESL/ELL (γ130)                           -.01 (.05)          -.28 (10)        .782
  Free/Reduced Lunch (γ140)                .04 (.02)           1.67 (10)        .126

Random Effects                             Variance    df     χ²               p
Level 1  Temporal Variation (etij)         1.05
Level 2  Student Baseline (r0ij)           5.50        201    1443.70***       .000
         Student Growth Rate (r1ij)        .01         201    263.90**         .002
Level 3  Teacher Baseline (u00j)           7.02        9      118.80***        .000
         Sex                               .61         9      9.06             .432
         Ethnicity                         .51         9      20.96*           .013
         ESL/ELL                           1.41        9      13.83            .128
         Free/Reduced Lunch                1.80        9      21.67*           .010
         Teacher Mean Growth (u10j)        .02         9      39.54***         .000
         Sex                               .01         9      14.67            .100
         Ethnicity                         .02         9      23.66**          .005
         ESL/ELL                           .01         9      8.99             > .500
         Free/Reduced Lunch                .00         9      19.29*           .023

Note. Deviance (FEML) = 3124.28; 69 estimated parameters.
* p < .05; ** p < .01; *** p < .001


Full Model (OFAS 50 Questions). Table 78 below shows the results of the Full Model with Time in Level 1, all the demographic variables in Level 2, and the 50-question OFAS in Level 3. After including Level 2 and Level 3 predictors (i.e., compared to the Unconditional Growth Model), 2.92% of the variance in the between student differences in mean Word Recognition score was accounted for by these predictors (i.e., 2.92% of the variability in initial status is explained by the Level 2 and 3 variables). This result was statistically significant (τπ00 = 5.98, χ² = 1888.99, df = 249, p = .000); therefore, there are still considerable differences between students that might be explained by other Level 2 or 3 variables. Similarly, the variability in the effect of time within students that can be explained by the Level 2 and 3 variables decreased slightly from the Unconditional Growth Model to this Full Model (i.e., the demographic controls and teacher OFAS score explained .61% less of the variance in linear growth rates of students). Since this result was found to be statistically significant (τπ11 = .02, χ² = 347.79, df = 249, p = .000), between student differences in the effect of time are not fully accounted for by the demographic controls or 50-question OFAS. The remaining variation in Word Recognition score after controlling for the linear effects of Time at Level 1, the effects of the demographic controls at Level 2, and teacher OFAS score at Level 3 was 1.05 for Level 1, 5.98 for Level 2, and 6.90 for Level 3. Thus, the variance in Word Recognition score after controlling for these variables was 7.54% associated with variation among measurement months, 42.93% associated with variation between students within teachers, and 49.53% between teachers. The overall mean Word Recognition score across students and teachers was still positive and significantly different from zero (γ000 = 9.82, t = 12.99, df = 9, p = .000).

This is the mean Word Recognition score (γ000 = 9.82) when Sex is 0 (i.e., Male), Ethnicity is 0 (i.e., White), ESL/ELL status is 0 (i.e., not in the ESL/ELL program), Free/Reduced Lunch status is 0 (i.e., not in the Free/Reduced Lunch program), and teacher 50-question OFAS score is 0. Furthermore, Sex, Ethnicity, ESL/ELL, and Free/Reduced Lunch did not have a significant effect on the intercepts (p > .05 for all). Additionally, teacher OFAS score did not have a statistically significant influence on initial status (i.e., the intercept), suggesting that initial Word Recognition score was similar across students regardless of teacher OFAS score (p > .05). With regards to the slope, the inclusion of demographic predictors for these linear change models provides information on whether differences exist in the DORA subtest growth rates over the current academic year (i.e., 2009/2010). The Time slope (i.e., the effect of Time on Word Recognition score) across students and teachers was still significantly different from zero (γ100 = .12, t = 2.52, df = 9, p = .032). The effect of Time on mean Word Recognition score was positive on average when Sex is 0 (i.e., Male), Ethnicity is 0 (i.e., White), ESL/ELL status is 0 (i.e., not in the ESL/ELL program), Free/Reduced Lunch status is 0 (i.e., not in the Free/Reduced Lunch program), and teacher OFAS score is 0. That is, the predicted Word Recognition growth rate (i.e., every three to four months) for Males, Whites, non-ESL/ELL, non-Free/Reduced Lunch status students in classrooms with teachers with an OFAS score of 0 was .12. Sex, Ethnicity, ESL/ELL, and Free/Reduced Lunch did not have a significant effect on the slopes of the predictor Time (i.e., rate of change; p > .05 for all). As with the intercept, teacher OFAS score had no statistically significant effect on the Time slope (i.e., rate of increase), suggesting that Word Recognition growth was similar regardless of teacher OFAS score (p > .05). Overall, the effect of adding the Level 2 and Level 3 variables to the Unconditional Growth Model resulted in a reduction in Level 1 variance close to zero. That is, none of the Level 1 remaining variance in the Unconditional Growth Model was accounted for by adding Level 2 and Level 3 predictors to the model.
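Structurally, the only change from the Conditional Growth Model is at Level 3, where the intercept and growth rate are each regressed on teacher OFAS score. A sketch of those two equations with the Table 78 estimates substituted (notation assumed, as above):

    β00j = γ000 + γ001(OFAS_j) + u00j = 9.82 + .05(OFAS_j) + u00j
    β10j = γ100 + γ101(OFAS_j) + u10j = .12 + .00(OFAS_j) + u10j

Neither OFAS coefficient is statistically significant in this 50-question version of the measure.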


Table 78
Full Model with the DORA Word Recognition (WR) Subtest and the 50-Question OFAS

Fixed Effects                              Coefficient (SE)    t (df)           p
Model for Initial WR Status (π0ij)
  Intercept (γ000)                         9.82 (.76)          12.99*** (9)     .000
  OFAS (γ001)                              .05 (.04)           1.20 (9)         .261
  Sex (γ010)                               -.36 (.29)          -1.28 (259)      .203
  Ethnicity (γ020)                         -.04 (.37)          -.11 (259)       .912
  ESL/ELL (γ030)                           -.25 (.57)          -.44 (259)       .662
  Free/Reduced Lunch (γ040)                -.67 (.36)          -1.88 (259)      .061
Model for WR Growth Rate (π1ij)
  Intercept (γ100)                         .12 (.05)           2.53* (9)        .032
  OFAS (γ101)                              .00 (.00)           .06 (9)          .956
  Sex (γ110)                               -.02 (.04)          -.63 (259)       .528
  Ethnicity (γ120)                         .00 (.05)           .03 (259)        .975
  ESL/ELL (γ130)                           -.03 (.05)          -.60 (259)       .550
  Free/Reduced Lunch (γ140)                .02 (.02)           .83 (259)        .409

Random Effects                             Variance    df     χ²               p
Level 1  Temporal Variation (etij)         1.05
Level 2  Student Baseline (r0ij)           5.98        249    1888.99***       .000
         Student Growth Rate (r1ij)        .01         249    347.79***        .000
Level 3  Teacher Baseline (u00j)           6.90        9      268.66***        .000
         Teacher Mean Growth (u10j)        .01         9      48.17***         .000

Note. Deviance (FEML) = 3140.32; 19 estimated parameters.
* p < .05; *** p < .001

Full Model (OFAS 10 Questions). As mentioned when discussing the results for Research Question 2, a subscale from the original OFAS questions was made based on the results of the Rasch Analysis pertaining to the key component of formative assessment, using assessment results. This subscale was used as a predictor in the full model in place of the final OFAS scale with 50 questions as described above. As will be demonstrated here, the 10-question OFAS is a better predictor in the model consistently across the four DORA subtests as outcomes. The 50-question OFAS results will be presented for comparison purposes with the 10-question OFAS for all DORA outcomes in the Full Model.

Table 79 below shows the results of the Full Model with Time in Level 1, all the demographic variables in Level 2, and the 10-question OFAS in Level 3. Compared to the Unconditional Growth Model, 2.92% of the variability in initial status was explained by the Level 2 and 3 variables. This result was statistically significant (τπ00 = 5.98, χ² = 1888.27, df = 249, p = .000). This means that there are still considerable differences between students that might be explained by other covariates. The variability in the effect of time within students that can be explained by the Level 2 and 3 variables was 4.96%. The variability for the effect of time within students in this Full Model with the 10-question OFAS was lower compared to the Full Model with the 50-question OFAS at Level 3 and the Unconditional Growth Model. Since this result was found to be statistically significant (τπ11 = .02, χ² = 347.42, df = 249, p = .000), between student differences in the effect of time were not completely explained by the demographic controls or 10-question OFAS. The remaining variation in Word Recognition score after controlling for the linear effects of Time at Level 1, the effects of the demographic controls at Level 2, and teacher OFAS score at Level 3 was 1.05 for Level 1 (i.e., same as the 50-question OFAS Full Model), 5.98 for Level 2 (i.e., same as the 50-question OFAS Full Model), and 4.69 for Level 3 (i.e., lower than the 50-question OFAS Full Model). Thus, the variance in Word Recognition score after controlling for these variables was 8.96% associated with variation among measurement months, 51.02% associated with variation between students within teachers, and 40.02% between teachers.

The overall mean Word Recognition score across students and teachers was positive and significantly different from zero (γ000 = 9.80, t = 13.74, df = 9, p = .000). This is the mean Word Recognition score (γ000 = 9.80) when the demographic covariates are 0 and teacher 10-question OFAS score is 0. Additionally, Sex, Ethnicity, ESL/ELL, and Free/Reduced Lunch did not have a significant effect on the intercepts (p > .05 for all). Compared to the 50-question OFAS Full Model, the 10-question teacher OFAS score was a statistically significant influence on initial status (i.e., the intercept), suggesting that there are differences in initial Word Recognition score across students depending on teacher OFAS score (γ001 = .36, t = 2.67, df = 9, p = .026). This means that as teacher OFAS score increases by one point, the average student DORA Word Recognition score increases by .36 points (i.e., controlling for the demographics at Level 2). The Time slope (i.e., the effect of Time on Word Recognition score) across students and teachers was also significantly different from zero (γ100 = .12, t = 3.21, df = 9, p = .012). The effect of Time on mean Word Recognition score is positive on average when the demographic covariates are 0 and teacher OFAS score is 0. This means that the predicted Word Recognition growth rate (i.e., every three to four months) for Males, Whites, non-ESL/ELL, non-Free/Reduced Lunch status students in classrooms with teachers with an OFAS score of 0 is .12. The demographic controls did not have a significant effect on the rate of change of the predictor Time (p > .05 for all). Finally, teacher OFAS score had no statistically significant effect on the Time slope, suggesting that the Word Recognition rate of change was similar regardless of teacher OFAS score (p > .05). Similar to the 50-question OFAS Full Model, the effect of adding the Level 2 and Level 3 variables to the Unconditional Growth Model resulted in a reduction in Level 1 variance close to zero. That is, none of the Level 1 remaining variance in the Unconditional Growth Model was accounted for by adding Level 2 and Level 3 predictors to the model. A more extensive comparison and full discussion of the 50-question OFAS Full Model and the 10-question OFAS Full Model will be provided in Chapter 5 (i.e., the discussion of the results).
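To make the cross-level finding concrete, the sketch below computes illustrative predicted Word Recognition trajectories for a reference-group student (all demographic dummies equal to 0) from the fixed-effect estimates in Table 79. It is not HLM output: the coefficient values are copied from that table, and Time is simplified to the testing-occasion index (0, 1, 2, ...), whereas the study coded Time in months.

    def predicted_wr(occasion: int, ofas: float,
                     intercept: float = 9.80,          # γ000, Table 79
                     ofas_on_intercept: float = 0.34,  # γ001, Table 79
                     growth: float = 0.12) -> float:   # γ100, Table 79
        # Predicted DORA Word Recognition score for a reference-group student.
        return intercept + ofas_on_intercept * ofas + growth * occasion

    # A teacher one point higher on the 10-question OFAS is associated with a roughly
    # .34-point higher predicted starting score; the growth rate itself does not
    # differ by OFAS score in this model.
    for ofas_score in (0, 5):
        print([round(predicted_wr(t, ofas_score), 2) for t in range(4)])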


Table 79
Full Model with the DORA Word Recognition (WR) Subtest and the 10-Question OFAS

Fixed Effects                              Coefficient (SE)    t (df)           p
Model for Initial WR Status (π0ij)
  Intercept (γ000)                         9.80 (.71)          13.74*** (9)     .000
  OFAS (γ001)                              .34 (.13)           2.67* (9)        .026
  Sex (γ010)                               -.36 (.33)          -1.09 (259)      .279
  Ethnicity (γ020)                         -.04 (.46)          -.09 (259)       .927
  ESL/ELL (γ030)                           -.27 (.58)          -.47 (259)       .637
  Free/Reduced Lunch (γ040)                -.67 (.38)          -1.76 (259)      .080
Model for WR Growth Rate (π1ij)
  Intercept (γ100)                         .12 (.04)           3.21* (9)        .012
  OFAS (γ101)                              -.00 (.01)          -.43 (9)         .678
  Sex (γ110)                               -.03 (.03)          -.87 (259)       .387
  Ethnicity (γ120)                         .00 (.04)           .06 (259)        .953
  ESL/ELL (γ130)                           -.03 (.05)          -.51 (259)       .608
  Free/Reduced Lunch (γ140)                .01 (.03)           .34 (259)        .731

Random Effects                             Variance    df     χ²               p
Level 1  Temporal Variation (etij)         1.05
Level 2  Student Baseline (r0ij)           5.98        249    1888.27***       .000
         Student Growth Rate (r1ij)        .02         249    347.42***        .000
Level 3  Teacher Baseline (u00j)           4.69        9      163.78***        .000
         Teacher Mean Growth (u10j)        .01         9      45.24***         .000

Note. Deviance (FEML) = 3131.80; 19 estimated parameters.
* p < .05; *** p < .001

Conditional Growth Model (OFAS 10 Questions) with No Demographic Predictors. As mentioned previously, many of the demographic controls at Level 2 were not significant and accounted for very little variability in the model. The decision to exclude or include these variables at Level 2 was examined. An investigation of models excluding the demographic covariates at Level 2 produced similar coefficients, standard errors, t statistics, and p values for the hypothesized relationships of interest across all four DORA outcomes compared to the models that included the demographic covariates. The decision to include the Level 2 demographic covariates was based on other supporting information as well. For example, as mentioned in previous sections, research and theory suggest that Sex, Ethnicity, ESL/ELL status, and SES (i.e., Free/Reduced Lunch status in the current model) may be related to reading achievement growth, especially for younger students in grade school and middle school. Although the inclusion of more variables at any level can lead to an increase in parameters estimated, and potentially a less parsimonious model, the inclusion of the demographic controls at Level 2 did not impact the degrees of freedom for Level 3, which included the primary variable for the hypothesized relationship of interest (i.e., the OFAS). Overall, based on the above rationale, the demographic covariates will be retained in the Full Models for all DORA subtest outcomes.

Oral Vocabulary

One-Way Random Effects ANOVA Model. Table 80 (below) shows the results of the One-Way Random Effects ANOVA Model (i.e., the Empty Model) for Oral Vocabulary as the outcome. The intercept in this empty model is the Oral Vocabulary average per student across all time points. The average student Oral Vocabulary mean was statistically different from zero (γ000 = 5.72, t = 15.03, df = 10, p = .000). Considerable variation in the student Oral Vocabulary means still exists (τπ00 = 2.96, χ² = 3628.00, df = 253, p = .000), and variation still exists between teachers (τβ00 = 1.44, χ² = 186.84, df = 10, p = .000). The proportion of variance within students was 13.15%, indicating that 13.15% of the variability in Oral Vocabulary scores was within students. The proportion of variance in Oral Vocabulary scores between students within teachers was 58.38%, and the proportion of variance in Oral Vocabulary scores between teachers was 28.40%. The total variability was 5.07. Based on the significant amount of unexplained variability, an additional Level 1 predictor (i.e., Time) was added to try to reduce the variation within students, as well as Level 2 and Level 3 variables to explain between student and teacher differences in the following models.

Table 80
One-Way Random Effects ANOVA Model with the DORA Oral Vocabulary (OV) Subtest

Fixed Effects                              Coefficient (SE)    t (df)           p
Model for Initial OV Status (π0ij)
  Intercept (γ000)                         5.72 (.38)          15.03*** (10)    .000

Random Effects                             Variance    df     χ²               p
Level 1  Temporal Variation (etij)         .67
Level 2  Student Baseline (r0ij)           2.96        253    3628.00***       .000
Level 3  Teacher Baseline (u00j)           1.44        10     186.84***        .000

Note. Deviance (FEML) = 2654.93; 4 estimated parameters.
*** p < .001
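The variance proportions reported throughout this chapter can be reproduced directly from the variance components in the tables. The helper below is an illustration added for clarity (not part of the original analysis); applied to the Oral Vocabulary empty model in Table 80, it returns roughly 13%, 58%, and 28%, matching the percentages in the text up to rounding.

    def variance_shares(sigma2: float, tau_pi: float, tau_beta: float) -> dict:
        # Share of total variance within students, between students, and between teachers.
        total = sigma2 + tau_pi + tau_beta
        return {
            "within students": sigma2 / total,
            "between students (within teachers)": tau_pi / total,
            "between teachers": tau_beta / total,
            "total variability": total,
        }

    print(variance_shares(0.67, 2.96, 1.44))  # variance components from Table 80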


Unconditional Growth Model. Table 81 (below) shows the results of the Unconditional Growth Model with Time (i.e., in months) as the sole predictor at Level 1. After including Time as a predictor of Oral Vocabulary score within students, within student variability was reduced by 19.53%, relative to the One-Way Random Effects ANOVA Model. The percent of variability between teachers in the intercept (i.e., initial status) was 33.49% and in the growth rate (i.e., slope) was 13.41%. The remaining variation in Oral Vocabulary score after linear effects of Time were controlled for was .54 for Level 1, 2.84 for Level 2, and 1.43 for Level 3. Thus, the variance in Oral Vocabulary scores after the linear effect of Time was controlled was 11.15% associated with nonlinear and residual effects of Time, 59.04% associated with variation between students within teachers, and 29.73% associated with between-teacher variation. The overall mean Oral Vocabulary score across students was still significantly different from zero (γ000 = 5.50, t = 14.45, df = 10, p = .000). Also, the Time slope (i.e., the effect of Time on Oral Vocabulary score) across students was significantly different from zero (γ100 = .06, t = 4.25, df = 10, p = .002). For every three to four month increase, there was an average increase of .06 points in student Oral Vocabulary scores. The correlation between initial status and linear growth was .09 and not significant (p > .05). Statistically significant variability in the Oral Vocabulary means still exists after considering Time (τπ00 = 2.84, χ² = 1780.36, df = 253, p = .000), as well as statistically significant variability in individual student growth rates (τπ11 = .01, χ² = 323.31, df = 253, p = .002). Similarly, statistically significant variability between teacher mean initial status (τβ00 = 1.43, χ² = 184.48, df = 10, p = .000) and between teacher mean growth (τβ11 = .00, χ² = 19.70, df = 10, p = .032) exists. Adding Level 2 demographic controls will attempt to address the between student variability in the following model.

Table 81
Unconditional Growth Model with the DORA Oral Vocabulary (OV) Subtest

Fixed Effects                              Coefficient (SE)    t (df)           p
Model for Initial OV Status (π0ij)
  Intercept (γ000)                         5.50 (.38)          14.45*** (10)    .000
Model for OV Growth Rate (π1ij)
  Intercept (γ100)                         .06 (.01)           4.25** (10)      .002

Random Effects                             Variance    df     χ²               p
Level 1  Temporal Variation (etij)         .54
Level 2  Student Baseline (r0ij)           2.84        253    1780.36***       .000
         Student Growth Rate (r1ij)        .01         253    323.31**         .002
Level 3  Teacher Baseline (u00j)           1.43        10     184.48***        .000
         Teacher Mean Growth (u10j)        .00         10     19.70*           .032

Note. Deviance (FEML) = 2606.84; 9 estimated parameters.
* p < .05; ** p < .01; *** p < .001

Conditional Growth Model. Table 82 below shows the results of the Conditional Growth Model with Time in Level 1 and all the demographic variables in Level 2. After including Sex, Ethnicity, ESL/ELL status, and Free/Reduced Lunch status in Level 2, 24.65% of the variance in the between student differences in mean Oral Vocabulary was accounted for by these predictors. However, this finding was statistically significant (τπ00 = 2.14, χ² = 1302.79, df = 201, p = .000), which means that there is still a significant amount of differences between students that might be explained by other Level 2 variables. In addition, 22.56% of the variability in the effect of time within students can be explained by the demographic predictors added (i.e., the demographic controls account for 22.56% of the variance in linear growth rates of students). Since this result was found to be statistically significant (τπ11 = .00, χ² = 268.04, df = 201, p = .001), between student differences in the effect of time on Oral Vocabulary score are not completely explained by the demographic covariates. The remaining variation in Oral Vocabulary score after linear effects of Time at Level 1 were controlled was .54 for Level 1, 2.14 for Level 2, and 2.58 for Level 3. Thus, the variance in Oral Vocabulary scores after accounting for the linear effect of time was 10.20% associated with nonlinear and residual effects of Time, 40.68% associated with variation between students within teachers, and 49.05% associated with between-teacher variation. The overall mean Oral Vocabulary score across students was still significantly different from zero (γ000 = 6.18, t = 12.247, df = 10, p = .000). This is the mean Oral Vocabulary score when Sex is 0 (i.e., Male), Ethnicity is 0 (i.e., White), ESL/ELL status is 0 (i.e., not in the ESL/ELL program), and Free/Reduced Lunch status is 0 (i.e., not in the Free/Reduced Lunch program). The effect of Sex on mean Oral Vocabulary was negative and statistically significant (γ010 = -.38, t = -2.93, df = 10, p = .016). The coefficient -.38 represents the average decrease in a student's mean Oral Vocabulary for students coded 1 on Sex (i.e., females). Males were predicted to have a mean Oral Vocabulary score of 6.18, and females were predicted to have a mean Oral Vocabulary score of 5.80 (i.e., 6.18 - .38). Thus, at initial status, males outperformed females on the Oral Vocabulary subtest on average.

There were statistically significant effects of all the demographic controls on mean Oral Vocabulary score (p < .05 for all). That is, the effects of Ethnicity (p = .004), Free/Reduced Lunch status (p = .010), and ESL/ELL status (p = .026) on mean Oral Vocabulary score were statistically significant. For Ethnicity, White was coded 0 and Minority was coded 1. The coefficient -.52 represents the average decrease in a student's mean Oral Vocabulary for Minorities. Whites were predicted to have a mean Oral Vocabulary score of 6.18; therefore, Minorities were predicted to have a mean Oral Vocabulary score of 5.66 (i.e., 6.18 - .52). This means that at initial status, Whites outperformed Minorities on the Oral Vocabulary subtest on average. The coefficient (-.52) and result were the same for Free/Reduced Lunch status. For this covariate, non-Free/Reduced Lunch status students were coded 0 and students enrolled in the program were coded 1. At initial status, non-Free/Reduced Lunch status students (i.e., students of higher SES) outperformed students enrolled in the program (i.e., lower SES students) on the Oral Vocabulary subtest on average. Finally, for ESL/ELL status, non-ESL/ELL status students were coded 0 and ESL/ELL students were coded 1. The coefficient -.67 represents the average decrease in a student's mean Oral Vocabulary for ESL/ELL students. Non-ESL/ELL students were predicted to have a mean Oral Vocabulary score of 6.18, and ESL/ELL students were predicted to have a mean Oral Vocabulary score of 5.51 (i.e., 6.18 - .67). This means that at initial status, non-ESL/ELL students outperformed ESL/ELL students on the Oral Vocabulary subtest on average.

Including the demographic controls, the Time slope (i.e., the effect of Time on Oral Vocabulary score) across students was still significantly different from zero (γ100 = .07, t = 3.80, df = 10, p = .004). This means that the effect of Time on mean Oral Vocabulary score is positive on average, controlling for the demographic variables. For every three to four month increase in Time, there was an average increase of .07 points in student Oral Vocabulary scores. Additionally, the effects of the demographic predictors on the Oral Vocabulary growth rate were not statistically significant (p > .05 for all). That is, there were no statistically significant differences in the rate of Oral Vocabulary change between the different demographic groups as predictors in this model.


Table 82
Conditional Growth Model with the DORA Oral Vocabulary (OV) Subtest

Fixed Effects                              Coefficient (SE)    t (df)           p
Model for Initial OV Status (π0ij)
  Intercept (γ000)                         6.18 (.50)          12.25*** (10)    .000
  Sex (γ010)                               -.38 (.13)          -2.93* (10)      .016
  Ethnicity (γ020)                         -.52 (.14)          -3.80** (10)     .004
  ESL/ELL (γ030)                           -.67 (.25)          -2.62* (10)      .026
  Free/Reduced Lunch (γ040)                -.52 (.16)          -3.22* (10)      .010
Model for OV Growth Rate (π1ij)
  Intercept (γ100)                         .07 (.02)           3.80** (10)      .004
  Sex (γ110)                               -.02 (.02)          -.98 (10)        .535
  Ethnicity (γ120)                         -.02 (.02)          -.69 (10)        .508
  ESL/ELL (γ130)                           -.03 (.03)          -.99 (10)        .348
  Free/Reduced Lunch (γ140)                .02 (.02)           .77 (10)         .461

Random Effects                             Variance    df     χ²               p
Level 1  Temporal Variation (etij)         .54
Level 2  Student Baseline (r0ij)           2.14        201    1302.79***       .000
         Student Growth Rate (r1ij)        .00         201    268.04**         .001
Level 3  Teacher Baseline (u00j)           2.58        9      138.69***        .000
         Sex                               .10         9      4.77             > .500
         Ethnicity                         .05         9      4.49             > .500
         ESL/ELL                           .04         9      8.21             > .500
         Free/Reduced Lunch                .17         9      10.86            .285
         Teacher Mean Growth (u10j)        .00         9      16.23            .062
         Sex                               .00         9      12.09            .207
         Ethnicity                         .00         9      7.90             > .500
         ESL/ELL                           .00         9      13.10            .157
         Free/Reduced Lunch                .00         9      8.86             > .500

Note. Deviance (FEML) = 2535.14; 69 estimated parameters.
* p < .05; ** p < .01; *** p < .001


Full Model (OFAS 50 Questions). Table 83 below shows the results of the Full Model with Time in Level 1, all the demographic variables in Level 2, and the 50-question OFAS in Level 3. After including the demographic covariates and the OFAS in the model (i.e., compared to the Unconditional Growth Model), 18.66% of the variance in the between student differences in mean Oral Vocabulary score (i.e., in initial status) was accounted for by these predictors. This result was statistically significant (τπ00 = 2.31, χ² = 1495.34, df = 249, p = .000). Similarly, the variability in the effect of time within students that can be explained by the Level 2 and 3 variables compared to the Unconditional Growth Model was 2.19%. This result was also found to be statistically significant (τπ11 = .01, χ² = 322.17, df = 249, p = .001). The remaining variation in Oral Vocabulary score after controlling for the linear effects of Time at Level 1, the effects of the demographic controls at Level 2, and teacher OFAS score at Level 3 was .54 for Level 1, 2.31 for Level 2, and 1.26 for Level 3. Thus, the variance in Oral Vocabulary score after controlling for these variables was 13.05% associated with variation among measurement months, 56.20% associated with variation between students within teachers, and 30.66% between teachers. The overall mean Oral Vocabulary score across students and teachers was still positive and significantly different from zero (γ000 = 6.32, t = 16.46, df = 9, p = .000). This is the mean Oral Vocabulary score (γ000 = 6.32) when Sex is 0 (i.e., Male), Ethnicity is 0 (i.e., White), ESL/ELL status is 0 (i.e., not in the ESL/ELL program), Free/Reduced Lunch status is 0 (i.e., not in the Free/Reduced Lunch program), and teacher 50-question OFAS score is 0. The effect of Sex on mean Oral Vocabulary was still negative and statistically significant, with a larger coefficient compared to the Conditional Growth Model (γ010 = -.53, t = -2.55, df = 259, p = .012). Males in this model with the 50-question OFAS were predicted to have a mean Oral Vocabulary score of 6.32, and females were predicted to have a mean Oral Vocabulary score of 5.79. There was still a statistically significant effect of Free/Reduced Lunch status (p = .003). At initial status, non-Free/Reduced Lunch status students (i.e., students of higher SES) outperformed students enrolled in the program (i.e., lower SES students) on the Oral Vocabulary subtest on average. Free/Reduced Lunch status students' average Oral Vocabulary score at initial status was predicted to be 5.59. The Ethnicity and ESL/ELL covariates became nonsignificant when the 50-question OFAS was added to the model, and did not have a significant effect on the intercepts (p > .05 for both). Additionally, teacher OFAS score did not have a statistically significant influence on initial status (i.e., the intercept), suggesting that initial Oral Vocabulary score was similar across students regardless of teacher OFAS score (p > .05). With regards to the slope, the effect of Time on Oral Vocabulary score across students and teachers was still significantly different from zero (γ100 = .07, t = 3.36, df = 9, p = .009). The effect of Time on mean Oral Vocabulary score is positive on average when the demographic covariates are coded 0 and teacher OFAS score is 0. Thus, the predicted Oral Vocabulary score growth rate (i.e., every three to four months) for Males, Whites, non-ESL/ELL, non-Free/Reduced Lunch status students in classrooms with teachers with an OFAS score of 0 was .07. None of the demographic controls had a significant effect on the rate of growth (p > .05 for all). As with the intercept, teacher OFAS score had no statistically significant effect on the Time slope, indicating that Oral Vocabulary growth was similar across students regardless of teacher OFAS score (p > .05).

Table 83
Full Model with the DORA Oral Vocabulary (OV) Subtest and the 50-Question OFAS

Fixed Effects                              Coefficient (SE)    t (df)           p
Model for Initial OV Status (π0ij)
  Intercept (γ000)                         6.32 (.38)          16.46*** (9)     .000
  OFAS (γ001)                              .02 (.02)           1.37 (9)         .204
  Sex (γ010)                               -.53 (.21)          -2.55* (259)     .012
  Ethnicity (γ020)                         -.49 (.29)          -1.69 (259)      .093
  ESL/ELL (γ030)                           -.72 (.37)          -1.96 (259)      .051
  Free/Reduced Lunch (γ040)                -.73 (.24)          -3.04** (259)    .003
Model for OV Growth Rate (π1ij)
  Intercept (γ100)                         .07 (.02)           3.36** (9)       .009
  OFAS (γ101)                              -.00 (.00)          -.06 (9)         .952
  Sex (γ110)                               -.01 (.02)          -.43 (259)       .670
  Ethnicity (γ120)                         .01 (.03)           .29 (259)        .769
  ESL/ELL (γ130)                           -.04 (.04)          -1.21 (259)      .229
  Free/Reduced Lunch (γ140)                .01 (.02)           .35 (259)        .724

Random Effects                             Variance    df     χ²               p
Level 1  Temporal Variation (etij)         .54
Level 2  Student Baseline (r0ij)           2.31        249    1495.34***       .000
         Student Growth Rate (r1ij)        .01         249    322.17**         .001
Level 3  Teacher Baseline (u00j)           1.26        9      198.84***        .000
         Teacher Mean Growth (u10j)        .00         9      18.41*           .030

Note. Deviance (FEML) = 2554.82; 19 estimated parameters.
* p < .05; ** p < .01; *** p < .001


Full Model (OFAS 10 Questions). Table 84 below shows the results of the Full Model with Time in Level 1, all the demographic controls in Level 2, and the 10-question OFAS in Level 3. Compared to the Unconditional Growth Model, 19.01% of the variability in initial status of Oral Vocabulary was explained by the Level 2 and 3 variables. This result was statistically significant (τπ00 = 2.30, χ² = 1495.24, df = 249, p = .000). The variability in the effect of time within students that can be explained by the Level 2 and 3 variables was 2.19%. The variability for the effect of time within students in this Full Model with the 10-question OFAS was the same as in the Full Model with the 50-question OFAS at Level 3. This result was also statistically significant (τπ11 = .01, χ² = 322.16, df = 249, p = .001), meaning that between student differences in the effect of time are not completely explained by the demographic controls or 10-question OFAS. The remaining variation in Oral Vocabulary score after controlling for the linear effects of Time at Level 1, the effects of the demographic covariates at Level 2, and teacher OFAS score at Level 3 was .54 for Level 1 (i.e., same as the 50-question OFAS Full Model), 2.30 for Level 2 (i.e., approximately equal to the 50-question OFAS Full Model), and .75 for Level 3 (i.e., lower than the 50-question OFAS Full Model). Thus, the variance in Oral Vocabulary score after controlling for these variables was 14.94% associated with variation among measurement months, 64.07% associated with variation between students within teachers, and 20.90% between teachers.

The overall mean Oral Vocabulary score across students and teachers was again positive and significantly different from zero (γ000 = 6.33, t = 13.98, df = 9, p = .000). This is the mean Oral Vocabulary score (γ000 = 6.33) when the demographic covariates are 0 and teacher 10-question OFAS score is 0. Interestingly, compared to the 50-question OFAS, with the 10-question OFAS at Level 3 in the model, all the demographic controls had a significant effect on the intercepts (p < .05 for all). Thus, at initial status, males outperformed females on the Oral Vocabulary subtest on average, with females having a predicted Oral Vocabulary score of 5.80. Whites surpassed Minorities on the Oral Vocabulary subtest on average, with Minorities having a predicted score of 5.84. Students of higher SES outperformed students in the lower SES category on the Oral Vocabulary subtest on average, with lower SES students having an average initial score of 5.58. Finally, non-ESL/ELL students surpassed ESL/ELL students on the Oral Vocabulary subtest on average, with ESL/ELL students having an average score of 5.61. Compared to the 50-question OFAS Full Model, the 10-question teacher OFAS score was a statistically significant influence on initial status, suggesting that there were differences in initial Oral Vocabulary score across students depending on teacher OFAS score (γ001 = .16, t = 2.80, df = 9, p = .021). As teacher OFAS score increased by one point, the average student Oral Vocabulary score increased by .16 points. The Time slope (i.e., the effect of Time on Oral Vocabulary score) across students and teachers was also significantly different from zero (γ100 = .07, t = 3.05, df = 9, p = .014). This means that the effect of Time on mean Oral Vocabulary score is positive on average when the demographic covariates are 0 and teacher OFAS score is 0. Thus, the predicted Oral Vocabulary growth rate (i.e., every three to four months) for Males, Whites, non-ESL/ELL, non-Free/Reduced Lunch status students with teachers who have an OFAS score of 0 was .07 (i.e., the same result as the 50-question OFAS Full Model). The demographic controls did not have a significant effect on the rate of change (p > .05 for all). Finally, teacher OFAS score had no statistically significant effect on the Time slope, indicating that the Oral Vocabulary rate of change was similar regardless of teacher OFAS score (p > .05).

Table 84
Full Model with the DORA Oral Vocabulary (OV) Subtest and the 10-Question OFAS

Fixed Effects                              Coefficient (SE)    t (df)           p
Model for Initial OV Status (π0ij)
  Intercept (γ000)                         6.33 (.45)          13.98*** (9)     .000
  OFAS (γ001)                              .16 (.06)           2.80* (9)        .021
  Sex (γ010)                               -.53 (.17)          -3.07** (259)    .003
  Ethnicity (γ020)                         -.49 (.18)          -2.66** (259)    .009
  ESL/ELL (γ030)                           -.72 (.30)          -2.37* (259)     .019
  Free/Reduced Lunch (γ040)                -.75 (.24)          -3.11** (259)    .003
Model for OV Growth Rate (π1ij)
  Intercept (γ100)                         .07 (.02)           3.05* (9)        .014
  OFAS (γ101)                              -.00 (.00)          -.24 (9)         .814
  Sex (γ110)                               -.01 (.02)          -.46 (259)       .646
  Ethnicity (γ120)                         .01 (.03)           .25 (259)        .801
  ESL/ELL (γ130)                           -.04 (.04)          -1.20 (259)      .232
  Free/Reduced Lunch (γ140)                .01 (.03)           .33 (259)        .746

Random Effects                             Variance    df     χ²               p
Level 1  Temporal Variation (etij)         .54
Level 2  Student Baseline (r0ij)           2.30        249    1495.24***       .000
         Student Growth Rate (r1ij)        .01         249    322.16**         .001
Level 3  Teacher Baseline (u00j)           .75         9      113.15***        .000
         Teacher Mean Growth (u10j)        .00         9      18.24*           .032

Note. Deviance (FEML) = 3549.66; 19 estimated parameters.
* p < .05; ** p < .01; *** p < .001

Spelling

One-Way Random Effects ANOVA Model. Table 85 (below) shows the results of the One-Way Random Effects ANOVA Model for the DORA subtest outcome Spelling. The average student Spelling mean was statistically different from zero (γ000 = 3.38, t = 8.03, df = 10, p = .000). Considerable variation in the student Spelling means still exists (τπ00 = 2.99, χ² = 5585.01, df = 253, p = .000), and variation still exists between teachers (τβ00 = 1.92, χ² = 243.03, df = 10, p = .000). The proportion of variance within students was 7.98%. This means that 7.98% of the variability in Spelling scores was within students. The proportion of variance in Spelling scores between students within teachers was 55.99%, and the proportion of variance between teachers was 35.96%. The total variability was 5.34. Based on the significant amount of unexplained variability, an additional Level 1 predictor (i.e., Time) was added to attempt to reduce the variation within students, in addition to adding Level 2 and Level 3 variables to explain between student and teacher differences in the following models.


Table 85
One-Way Random Effects ANOVA Model with the DORA Spelling (SP) Subtest

Fixed Effects                              Coefficient (SE)    t (df)           p
Model for Initial SP Status (π0ij)
  Intercept (γ000)                         3.48 (.43)          8.03*** (10)     .000

Random Effects                             Variance    df     χ²               p
Level 1  Temporal Variation (etij)         .43
Level 2  Student Baseline (r0ij)           2.99        253    5585.01***       .000
Level 3  Teacher Baseline (u00j)           1.92        10     243.03***        .000

Note. Deviance (FEML) = 2418.20; 4 estimated parameters.
*** p < .001

Unconditional Growth Model. Table 86 (below) shows the results of the Unconditional Growth Model with Time as the only predictor at Level 1. After including Time as a predictor of Spelling score, within student variability was reduced by 14.74%, compared to the One-Way Random Effects ANOVA Model. The percent of variability between teachers in initial status (of Spelling) was 38.20%, and the percent of variability between teachers in the growth rate (of Spelling) was 6.93%. The remaining variation in Spelling score after linear effects of Time were controlled for was .36 for Level 1 (i.e., across the current academic year), 2.88 for Level 2 (i.e., the student level), and 1.78 for Level 3 (i.e., the teacher level). Thus, the variance in Spelling scores after the linear effect of Time was controlled was 7.24% associated with nonlinear and residual effects of Time, 57.37% associated with variation between students within teachers, and 35.46% associated with between-teacher variation. The overall mean Spelling score across students was still significantly different from zero (γ000 = 3.34, t = 7.98, df = 10, p = .000). Also, the Time slope (i.e., the effect of Time on Spelling score) across students was significantly different from zero (γ100 = .04, t = 6.16, df = 10, p = .000). This means that for each three to four month increase in time across the 2009/2010 academic year, there was an average increase of .04 points in student Spelling scores. The correlation between initial status and linear growth was .15 and was not statistically significant (p > .05). Statistically significant variability in the Spelling means still existed after considering Time (τπ00 = 2.88, χ² = 2545.00, df = 253, p = .000), as well as statistically significant variability in individual student growth rates (τπ11 = .00, χ² = 309.85, df = 253, p = .009). Similarly, statistically significant variability between teacher mean initial status (τβ00 = 1.78, χ² = 220.00, df = 10, p = .000) exists, but not between teacher mean growth (τβ11 = .00, χ² = 8.65, df = 10, p > .500). This means that between teacher differences in Spelling growth appear to be fully explained by the passing of time.


Table 86
Unconditional Growth Model with the DORA Spelling (SP) Subtest

Fixed Effects                              Coefficient (SE)    t (df)           p
Model for Initial SP Status (π0ij)
  Intercept (γ000)                         3.34 (.42)          7.98*** (10)     .000
Model for SP Growth Rate (π1ij)
  Intercept (γ100)                         .04 (.01)           6.16*** (10)     .000

Random Effects                             Variance    df     χ²               p
Level 1  Temporal Variation (etij)         .36
Level 2  Student Baseline (r0ij)           2.88        253    2545.00***       .000
         Student Growth Rate (r1ij)        .00         253    309.85**         .009
Level 3  Teacher Baseline (u00j)           1.78        10     220.00***        .000
         Teacher Mean Growth (u10j)        .00         10     8.65             > .500

Note. Deviance (FEML) = 2377.54; 9 estimated parameters.
** p < .01; *** p < .001

Conditional Growth Model. Table 87 below shows the results of the Conditional Growth Model with Time in Level 1 and all the demographic covariates in Level 2 for the model of the DORA Spelling subtest. After including Sex, Ethnicity, ESL/ELL status, and Free/Reduced Lunch status in Level 2, 7.99% of the variability in Spelling initial status was explained by the demographic controls. However, since this result was statistically significant (τπ00 = 2.65, χ² = 2147.60, df = 201, p = .000), there are still considerable differences between students that might be explained by other Level 2 covariates. Similarly, the demographic controls accounted for 59.93% of the variance in linear growth rates of students. Since this result was found to be statistically significant (τπ11 = .00, χ² = 239.40, df = 201, p = .033), between student differences in the effect of time are not fully accounted for by the demographic controls. The remaining variation in Spelling scores after linear effects of Time at Level 1 and demographics at Level 2 were controlled for was .36 for Level 1, 2.65 for Level 2, and 1.87 for Level 3. Thus, the variance in Spelling scores after the linear effect of Time was controlled was 7.44% associated with nonlinear and residual effects of Time, 54.30% associated with variation between students within teachers, and 38.32% associated with between-teacher variation. The overall mean Spelling score across students was still significantly different from zero (γ000 = 3.46, t = 7.95, df = 10, p = .000). This is the mean Spelling score when the demographics are coded 0. There was no statistically significant effect of any of the demographic covariates on mean Spelling score (p > .05 for all). This means that there was no statistically significant increase or decrease in student mean Spelling score for Females, Minorities, ESL/ELL and Free/Reduced Lunch status students. In this model, all groups performed equally on average on the Spelling subtest. Also, the Time slope (i.e., the effect of Time on Spelling score) across students was still significantly different from zero (γ100 = .02, t = 2.41, df = 10, p = .037). For every three to four month increase in Time, there was an average increase of .02 points in student Spelling scores for the predictors at Level 2 coded 0. In addition, there were no statistically significant differences in the effect of Time (i.e., rate of Spelling change) between the different demographic groups as predictors in this model (p > .05 for all).


Table 87
Conditional Growth Model with the DORA Spelling (SP) Subtest

Fixed Effects                              Coefficient (SE)    t (df)           p
Model for Initial SP Status (π0ij)
  Intercept (γ000)                         3.46 (.43)          7.95*** (10)     .000
  Sex (γ010)                               .01 (.12)           .12 (10)         .906
  Ethnicity (γ020)                         -.01 (.18)          -.05 (10)        .958
  ESL/ELL (γ030)                           -.15 (.44)          -.35 (10)        .737
  Free/Reduced Lunch (γ040)                -.25 (.21)          -1.19 (10)       .262
Model for SP Growth Rate (π1ij)
  Intercept (γ100)                         .02 (.01)           2.41* (10)       .037
  Sex (γ110)                               .01 (.01)           .36 (10)         .729
  Ethnicity (γ120)                         .02 (.03)           .92 (10)         .380
  ESL/ELL (γ130)                           .02 (.03)           .78 (10)         .452
  Free/Reduced Lunch (γ140)                .02 (.01)           1.56 (10)        .150

Random Effects                             Variance    df     χ²               p
Level 1  Temporal Variation (etij)         .36
Level 2  Student Baseline (r0ij)           2.65        201    2147.60***       .000
         Student Growth Rate (r1ij)        .00         201    239.40*          .033
Level 3  Teacher Baseline (u00j)           1.87        9      83.49***         .000
         Sex                               .07         9      3.73             > .500
         Ethnicity                         .18         9      5.19             > .500
         ESL/ELL                           1.01        9      14.21            .115
         Free/Reduced Lunch                .16         9      7.10             > .500
         Teacher Mean Growth (u10j)        .02         9      6.19             > .500
         Sex                               .01         9      8.86             > .500
         Ethnicity                         .02         9      4.98             > .500
         ESL/ELL                           .01         9      5.80             > .500
         Free/Reduced Lunch                .00         9      4.80             > .500

Note. Deviance (FEML) = 2346.54; 69 estimated parameters.
* p < .05; *** p < .001


Full Model (OFAS 50 Questions). Table 88 below shows the results of the Full Model with Time in Level 1, all the demographic variables in Level 2, and the 50-question OFAS in Level 3. After including Level 2 and Level 3 predictors, 2.79% of the variance in the between student differences in mean Spelling score was accounted for by these predictors. This result was statistically significant (τπ00 = 2.79, χ² = 2474.87, df = 249, p = .000); therefore, there are still considerable differences between students that might be explained by other Level 2 or 3 variables. In addition, the demographic controls and teacher OFAS score accounted for 25.89% of the variance in linear growth rates of students. Since this result was found to be statistically significant (τπ11 = .00, χ² = 297.40, df = 249, p = .019), between student differences in the effect of time are not fully explained by the demographic controls or 50-question OFAS. The remaining variation in Spelling score after controlling for the linear effects of Time at Level 1, the effects of the demographic controls at Level 2, and teacher OFAS score at Level 3 was .36 for Level 1, 2.79 for Level 2, and 1.72 for Level 3. Thus, the variance in Spelling score after controlling for these variables was 7.46% associated with variation among measurement months, 57.29% associated with variation between students within teachers, and 35.32% between teachers. The overall mean Spelling score across students and teachers was still positive and significantly different from zero (γ000 = 3.45, t = 7.85, df = 9, p = .000). This is the mean Spelling score (γ000 = 3.45) when the demographics are coded 0 and teacher 50-question OFAS score is 0. Furthermore, Sex, Ethnicity, ESL/ELL, and Free/Reduced Lunch did not have a significant effect on the intercepts (p > .05 for all). Additionally, teacher OFAS score did not have a statistically significant influence on the intercept, indicating that initial Spelling score was similar across students regardless of teacher OFAS score (p > .05). With regards to the slope, compared to the Conditional Growth Model, the effect of Time on Spelling score across students and teachers was not significant (γ100 = .02, t = 1.30, df = 9, p = .225). Sex, Ethnicity, ESL/ELL, and Free/Reduced Lunch did not have a significant effect on the slopes of the predictor Time (i.e., rate of change; p > .05 for all). Finally, teacher OFAS score had no statistically significant effect on the rate of change, suggesting that Spelling growth was similar regardless of teacher OFAS score (p > .05).


Table 88
Full Model with the DORA Spelling (SP) Subtest and the 50-Question OFAS

Fixed Effects                              Coefficient (SE)    t (df)           p
Model for Initial SP Status (π0ij)
  Intercept (γ000)                         3.45 (.44)          7.86*** (9)      .000
  OFAS (γ001)                              .02 (.02)           .96 (9)          .365
  Sex (γ010)                               .09 (.22)           .41 (259)        .685
  Ethnicity (γ020)                         .33 (.31)           1.06 (259)       .291
  ESL/ELL (γ030)                           -.55 (.39)          -1.43 (259)      .154
  Free/Reduced Lunch (γ040)                -.48 (.25)          -1.90 (259)      .058
Model for SP Growth Rate (π1ij)
  Intercept (γ100)                         .02 (.01)           1.30 (9)         .225
  OFAS (γ101)                              -.00 (.00)          -.61 (9)         .559
  Sex (γ110)                               .00 (.02)           .01 (259)        .995
  Ethnicity (γ120)                         .02 (.02)           .75 (259)        .455
  ESL/ELL (γ130)                           .04 (.03)           1.41 (259)       .161
  Free/Reduced Lunch (γ140)                .02 (.02)           1.21 (259)       .227

Random Effects                             Variance    df     χ²               p
Level 1  Temporal Variation (etij)         .36
Level 2  Student Baseline (r0ij)           2.79        249    2474.87***       .000
         Student Growth Rate (r1ij)        .00         249    297.40*          .019
Level 3  Teacher Baseline (u00j)           1.72        9      212.13***        .000
         Teacher Mean Growth (u10j)        .00         9      9.99             .351

Note. Deviance (FEML) = 2360.59; 19 estimated parameters.
* p < .05; *** p < .001

Full Model (OFAS 10 Questions). Table 89 below shows the results of the Full Model with Time in Level 1, all the demographic variables in Level 2, and the 10-question OFAS in Level 3. Compared to the Unconditional Growth Model, 2.79% of the variability in initial status of Spelling was explained by the Level 2 and 3 variables. This result was statistically significant (τπ00 = 2.79, χ² = 2474.94, df = 249, p = .000), and the same as in the 50-question OFAS Full Model. The variability in the effect of time within students that can be explained by the Level 2 and 3 variables was 24.11%. The variability in this Full Model with the 10-question OFAS was almost the same as in the Full Model with the 50-question OFAS. This result was found to be statistically significant (τπ11 = .00, χ² = 297.42, df = 249, p = .019); therefore, between student differences in the effect of time are not completely explained by the demographic controls or 10-question OFAS. The remaining variation in Spelling score after controlling for the linear effects of Time at Level 1, the effects of the demographic controls at Level 2, and teacher OFAS score at Level 3 was .36 for Level 1 (i.e., same as the 50-question OFAS Full Model), 2.79 for Level 2 (i.e., same as the 50-question OFAS Full Model), and 1.19 for Level 3 (i.e., lower than the 50-question OFAS Full Model). Thus, the variance in Spelling scores after controlling for these variables was 8.37% associated with variation among measurement months, 64.29% associated with variation between students within teachers, and 27.42% between teachers. The overall mean Spelling score across students and teachers was again positive and significantly different from zero (γ000 = 3.46, t = 9.08, df = 9, p = .000). This is the mean Spelling score when the demographic covariates are 0 and teacher 10-question OFAS score is 0. Additionally, Sex, Ethnicity, ESL/ELL, and Free/Reduced Lunch did not have a significant effect on the intercepts (p > .05 for all). Compared to the 50-question OFAS Full Model, the 10-question teacher OFAS score had a statistically significant impact on initial status, suggesting that there are differences in initial Spelling score across students depending on teacher OFAS score (γ001 = .15, t = 2.37, df = 9, p = .042). This means that as teacher OFAS score increased by one point, the average student DORA Spelling subtest score increased by .15 points (i.e., controlling for the demographics at Level 2). The Time slope was not significant (p > .05), meaning that, on average, the rate of change in Spelling score was not significantly different from zero in this model. The demographic controls did not have a significant effect on the rate of change of Spelling score (p > .05 for all). Finally, teacher OFAS score had no statistically significant effect on the Time slope, suggesting that the Spelling rate of change was similar regardless of teacher OFAS score (p > .05).


Table 89
Full Model with the DORA Spelling (SP) Subtest and the 10-Question OFAS

Fixed Effects                              Coefficient (SE)    t (df)           p
Model for Initial SP Status (π0ij)
  Intercept (γ000)                         3.46 (.38)          9.08*** (9)      .000
  OFAS (γ001)                              .15 (.07)           2.37* (9)        .042
  Sex (γ010)                               .08 (.22)           .39 (259)        .700
  Ethnicity (γ020)                         .33 (.31)           1.06 (259)       .291
  ESL/ELL (γ030)                           -.55 (.39)          -1.42 (259)      .157
  Free/Reduced Lunch (γ040)                -.50 (.25)          -1.94 (259)      .053
Model for SP Growth Rate (π1ij)
  Intercept (γ100)                         .02 (.01)           1.25 (9)         .245
  OFAS (γ101)                              .00 (.00)           .20 (9)          .844
  Sex (γ110)                               .00 (.02)           .03 (259)        .979
  Ethnicity (γ120)                         .02 (.02)           .76 (259)        .445
  ESL/ELL (γ130)                           .04 (.03)           1.39 (259)       .165
  Free/Reduced Lunch (γ140)                .02 (.02)           1.17 (259)       .242

Random Effects                             Variance    df     χ²               p
Level 1  Temporal Variation (etij)         .36
Level 2  Student Baseline (r0ij)           2.79        249    2474.94***       .000
         Student Growth Rate (r1ij)        .00         249    297.42*          .019
Level 3  Teacher Baseline (u00j)           1.19        9      133.42***        .000
         Teacher Mean Growth (u10j)        .02         9      9.74             .372

Note. Deviance (FEML) = 2357.70; 19 estimated parameters.
* p < .05; *** p < .001


Reading Comprehension

One-Way Random Effects ANOVA Model. Table 90 (below) describes the results of the One-Way Random Effects ANOVA Model for Reading Comprehension as the outcome. The intercept is the Reading Comprehension average per student across all time points. The average student Reading Comprehension mean was statistically different from zero (γ000 = 6.01, t = 8.17, df = 10, p = .000). A large amount of variation in the student Reading Comprehension means still existed (τπ00 = 6.68, χ² = 2629.10, df = 253, p = .000), and variation still existed between teachers (τβ00 = 5.57, χ² = 249.26, df = 10, p = .000). The proportion of variance within students was 14.81%, indicating that 14.81% of the variability in Reading Comprehension scores was within students. The proportion of variance in Reading Comprehension scores between students within teachers was 46.45%, and the proportion of variance in Reading Comprehension scores between teachers was 38.73%. The total variability was 14.38. Based on the significant amount of unexplained variability, adding Time as a predictor in Level 1 to reduce the variation within students, as well as adding Level 2 and Level 3 variables to explain between student and teacher differences in the following models, was warranted.


Table 90
One-Way Random Effects ANOVA Model with the DORA Reading Comprehension (RC) Subtest

Fixed Effects                              Coefficient (SE)    t (df)           p
Model for Initial RC Status (π0ij)
  Intercept (γ000)                         6.01 (.74)          8.17*** (10)     .000

Random Effects                             Variance    df     χ²               p
Level 1  Temporal Variation (etij)         2.13
Level 2  Student Baseline (r0ij)           6.68        253    2629.10***       .000
Level 3  Teacher Baseline (u00j)           5.57        10     249.26***        .000

Note. Deviance (FEML) = 3497.02; 4 estimated parameters.
*** p < .001

Unconditional Growth Model. Table 91 (below) shows the results of the Unconditional Growth Model with Time (i.e., in months) at Level 1. After including Time as a predictor of Reading Comprehension score, within student variability was reduced by 3.29% compared to the One-Way Random Effects ANOVA Model. The percent of variability between teachers in initial status was 44.05%, and the percent variability in the growth rate between teachers was 23.60%. The remaining variation in Reading Comprehension score after the linear effects of Time were controlled for was 2.06 for Level 1, 7.00 for Level 2, and 5.51 for Level 3. Thus, the variance in Reading Comprehension scores after controlling for the linear effect of Time was 14.14% 384

associated with nonlinear and residual effects of Time, 48.04% associated with variation between students within teachers, and 37.82% associated with between-teacher variation. Overall mean Reading Comprehension scores across students were still significantly different from zero (γ000 = 5.76, t = 7.83, df = 10, p = .000). Also, there was a significant difference in the Time slope (i.e., the effect of Time on Reading Comprehension score) across students (γ100 = .07, t = 3.91, df = 10, p = .003). For every three to four month increase, there was an average .07 point increase in student Reading Comprehension scores. The correlation between initial status and linear growth was -.61 and statistically significant (p < .05), meaning that students who began with low initial Reading Comprehension scores had the fastest growth or rate of change. Statistically significant variability in the Reading Comprehension means still existed after considering Time (τπ00 = 7.00, χ2 = 1226.58, df = 253, p = .000), but there was no statistically significant variability in individual student growth rates (τπ11 = .00, χ2 = 238.27, df = 253, p > .05). This means that between student differences in the rate of Reading Comprehension growth were fully explained by the passing of time. Statistically significant variability in teacher mean initial status was also found (τβ00 = 5.51, χ2 = 206.17, df = 10, p = .000), but significant variability between teacher mean growth rates was not present (τβ11 = .00, χ2 = 12.52, df = 10, p = .251). This result indicates that between teacher differences in the growth rate of student Reading Comprehension scores across the current academic year were completely explained by Time.
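The reductions described above can be recovered from the variance components in Tables 90 and 91. A sketch of the calculation, assuming the standard proportional-reduction-in-variance approach:

\[
\frac{\sigma^2_{\text{ANOVA}} - \sigma^2_{\text{growth}}}{\sigma^2_{\text{ANOVA}}} \;=\; \frac{2.13 - 2.06}{2.13} \;\approx\; .0329,
\]

that is, adding Time reduced within-student variability by about 3.29%. Likewise, the percent of variability between teachers in initial status appears to correspond to 5.51 / (7.00 + 5.51) ≈ .4405, matching the 44.05% reported above.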


Table 91

Unconditional Growth Model with the DORA Reading Comprehension (RC) Subtest

Fixed Effects                                Coefficient (SE)    t (df)           p
Model for Initial RC Status (π0ij)
  Intercept (γ000)                           5.76 (.74)          7.83*** (10)     .000
Model for RC Growth Rate (π1ij)
  Intercept (γ100)                           .07 (.02)           3.91** (10)      .003

Random Effects                               Variance    df     χ2              p
Level 1
  Temporal Variation (etij)                  2.06
Level 2
  Student Baseline (r0ij)                    7.00        253    1226.58***      .000
  Student Growth Rate (r1ij)                 .00         253    238.27          > .500
Level 3
  Teacher Baseline (u00j)                    5.51        10     206.17***       .000
  Teacher Mean Growth (u10j)                 .00         10     12.52           .251

Note. Deviance (FEML) = 3479.57; 9 estimated parameters.


** p < .01; *** p < .001

Conditional Growth Model. Table 92 (below) shows the results of the Conditional Growth Model with Time in Level 1 and all the demographic controls in Level 2. After including Sex, Ethnicity, ESL/ELL status, and Free/Reduced Lunch status in Level 2, 24.29% of the variance in the between student differences in mean Reading Comprehension was accounted for by these predictors. However, the remaining between student variability was still statistically significant (τπ00 = 5.30, χ2 = 832.57, df = 201, p = .000), representing a significant amount of differences between students that might be explained by other Level 2 variables. In addition, 55.88% of the variability in the effect of time within

students can be explained by the demographic predictors added (i.e., the demographic controls accounted for 55.88% of the variance in linear growth rates of students). Unsurprisingly, as with the Unconditional Growth Model, this result was not statistically significant (τπ11 = .00, χ2 = 199.97, df = 201, p > .05), which means that between student differences in the effect of time on Reading Comprehension score were completely explained by the demographic covariates and time. The remaining variation in Reading Comprehension score after the linear effects of Time at Level 1 were controlled was 2.00 for Level 1, 5.30 for Level 2, and 7.15 for Level 3. Thus, the variance in Reading Comprehension scores after controlling for the linear effect of time and the demographic covariates was 13.84% associated with nonlinear and residual effects of Time, 36.68% associated with variation between students within teachers, and 49.48% associated with between-teacher variation. Overall mean Reading Comprehension scores across students were still significantly different from zero (γ000 = 7.10, t = 8.46, df = 10, p = .000). This is the mean Reading Comprehension score when the demographic covariates are coded 0. The effect of Sex on mean Reading Comprehension score was negative and statistically significant (γ010 = -.90, t = -2.68, df = 10, p = .023). The coefficient -.90 represents the decrease in a student's mean Reading Comprehension score for females on average. Males were predicted to have a mean Reading Comprehension score of 7.10, and females were predicted to have a mean score of 6.20 (i.e., 7.10 - .90). Thus, at initial status, males outperformed females on the Reading Comprehension subtest on average. There was also a statistically significant effect of Ethnicity on mean Reading Comprehension score (p = .009). The coefficient -1.41 represents the decrease in a

student's mean Reading Comprehension for Minorities on average. Minorities were predicted to have a mean Reading Comprehension score of 5.69, indicating that at initial status, Whites outperformed Minorities on the Reading Comprehension subtest on average. There was no statistically significant effect of Free/Reduced Lunch status or ESL/ELL status on mean Reading Comprehension score (p > .05 for both). In this model, non-Free/Reduced Lunch and Free/Reduced Lunch status students and non-ESL/ELL and ESL/ELL status students performed statistically equally on average on the Reading Comprehension subtest at initial status. Including the demographic controls, there was no significant difference in the effect of Time on Reading Comprehension score across students (p > .05). This means that the effect of Time on mean Reading Comprehension score was equal on average across students controlling for the demographic variables. Additionally, the effect of the demographic predictors on the rate of Reading Comprehension change was not statistically significant (p > .05 for all). That is, there were no statistically significant differences in the rate of Reading Comprehension change between the different demographic groups as covariates in this model.
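As a brief worked illustration of the adjusted initial-status estimates above, using only the fixed effects in Table 92 and holding the remaining covariates at 0:

\[
\hat{Y}_{\text{males}} = \gamma_{000} = 7.10, \qquad \hat{Y}_{\text{females}} = \gamma_{000} + \gamma_{010} = 7.10 - .90 = 6.20,
\]
\[
\hat{Y}_{\text{Minority}} = \gamma_{000} + \gamma_{020} = 7.10 - 1.41 = 5.69.
\]

These are the predicted subgroup means referenced in the paragraph above.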


Table 92

Conditional Growth Model with the DORA Reading Comprehension (RC) Subtest

Fixed Effects                                Coefficient (SE)    t (df)           p
Model for Initial RC Status (π0ij)
  Intercept (γ000)                           7.10 (.84)          8.46*** (10)     .000
  Sex (γ010)                                 -.90 (.34)          -2.68* (10)      .023
  Ethnicity (γ020)                           -1.41 (.43)         -3.29** (10)     .009
  ESL/ELL (γ030)                             -.58 (.53)          -1.09 (10)       .304
  Free/Reduced Lunch (γ040)                  -.74 (.49)          -1.51 (10)       .161
Model for RC Growth Rate (π1ij)
  Intercept (γ100)                           .05 (.02)           2.17 (10)        .055
  Sex (γ110)                                 .03 (.04)           .76 (10)         .467
  Ethnicity (γ120)                           -.01 (.04)          -.36 (10)        .728
  ESL/ELL (γ130)                             .09 (.05)           1.64 (10)        .131
  Free/Reduced Lunch (γ140)                  .02 (.03)           .58 (10)         .573

Random Effects                               Variance    df     χ2              p
Level 1
  Temporal Variation (etij)                  2.00
Level 2
  Student Baseline (r0ij)                    5.30        201    832.57***       .000
  Student Growth Rate (r1ij)                 .00         201    199.97          > .500
Level 3
  Teacher Baseline (u00j)                    7.15        9      121.52***       .000
    Sex                                      .35         9      11.93           .217
    Ethnicity                                1.48        9      12.60           .181
    ESL/ELL                                  1.33        9      16.01           .066
    Free/Reduced Lunch                       .17         9      17.25*          .045
  Teacher Mean Growth (u10j)                 .00         9      5.50            > .500
    Sex                                      .01         9      11.94           .216
    Ethnicity                                .00         9      5.64            > .500
    ESL/ELL                                  .00         9      10.25           .330
    Free/Reduced Lunch                       .00         9      3.27            > .500

Note. Deviance (FEML) = 3418.14; 69 estimated parameters.


* p < .05; ** p < .01; *** p < .001


Full Model (OFAS 50 Questions). Table 93 below shows the results of the Full Model with Time in Level 1, all the demographic variables in Level 2, and the 50-question OFAS in Level 3 predicting Reading Comprehension score. After including the demographic covariates and the OFAS in the model, 19.43% of the variance in the between student differences in mean Reading Comprehension score was accounted for by these predictors. This result was statistically significant (τπ00 = 5.64, χ2 = 1038.30, df = 249, p = .000). Similarly, the variability in the effect of time within students that can be explained by the Level 2 and 3 variables compared to the Unconditional Growth Model was 32.35%. This result was not statistically significant (τπ11 = .00, χ2 = 235.78, df = 249, p > .05). The remaining variation in Reading Comprehension score after controlling for the linear effects of Time at Level 1, the effects of the demographic controls at Level 2, and teacher OFAS score at Level 3 was 2.05 for Level 1, 5.64 for Level 2, and 4.38 for Level 3. Thus, the variance in Reading Comprehension score after controlling for these variables was 16.98% associated with variation among measurement months, 46.73% associated with variation between students within teachers, and 36.29% between teachers. Overall mean Reading Comprehension scores across students and teachers were still positive and significantly different from zero (γ000 = 7.11, t = 10.20, df = 9, p = .000). This is the mean Reading Comprehension score (γ000 = 7.11) for Males, Whites, non-ESL/ELL status students, and non-Free/Reduced Lunch status students, when the teacher 50-question OFAS score is 0. The effect of Sex on mean Reading Comprehension was still negative and statistically significant with the same coefficient as the Conditional

Growth Model (γ010 = -.90, t = -2.66, df = 259, p = .009). Males in this model with the 50-question OFAS were predicted to have a mean Reading Comprehension score of 7.11, and females were predicted to have a mean score of 6.21. There was still a statistically significant effect of Ethnicity (p = .018). At initial status, White students outperformed Minority students on the Reading Comprehension subtest on average. With the 50-question OFAS in the model, Free/Reduced Lunch status as a predictor was significant (p = .011). Again, students who were not categorized as Free/Reduced Lunch status surpassed those students enrolled in the program on the Reading Comprehension subtest on average. Finally, ESL/ELL was not significant when the 50-question OFAS was added to the model (p > .05). Additionally, teacher OFAS score did not have a statistically significant influence on initial status, suggesting that initial Reading Comprehension score was similar across students regardless of teacher OFAS score (p > .05). With regards to the slope, there was not a significant difference in the effect of Time on Reading Comprehension score across students and teachers (p > .05). The demographic controls also did not have a significant effect on the rate of Reading Comprehension growth (p > .05 for all). Finally, teacher OFAS score had no statistically significant effect on the Time slope, indicating that Reading Comprehension growth was similar across students regardless of teacher OFAS score (p > .05).
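For clarity, the structure being tested in this Full Model can be written out. The following is a minimal sketch in conventional three-level growth notation; the exact centering and fixed/random specification of each term follows the model description above, so this should be read as illustrative rather than as the estimation syntax:

\[
\begin{aligned}
\text{Level 1: } & Y_{tij} = \pi_{0ij} + \pi_{1ij}(\text{Time})_{tij} + e_{tij} \\
\text{Level 2: } & \pi_{0ij} = \beta_{00j} + \beta_{01j}(\text{Sex})_{ij} + \beta_{02j}(\text{Ethnicity})_{ij} + \beta_{03j}(\text{ESL/ELL})_{ij} + \beta_{04j}(\text{Lunch})_{ij} + r_{0ij} \\
& \pi_{1ij} = \beta_{10j} + \dots + r_{1ij} \\
\text{Level 3: } & \beta_{00j} = \gamma_{000} + \gamma_{001}(\text{OFAS})_{j} + u_{00j} \\
& \beta_{10j} = \gamma_{100} + \gamma_{101}(\text{OFAS})_{j} + u_{10j},
\end{aligned}
\]

where γ001 and γ101 are the parameters that test whether teacher OFAS scores predict students' initial Reading Comprehension status and growth, respectively.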


Table 93

Full Model with the DORA Reading Comprehension (RC) Subtest and the 50-Question OFAS

Fixed Effects                                Coefficient (SE)    t (df)           p
Model for Initial RC Status (π0ij)
  Intercept (γ000)                           7.11 (.70)          10.20*** (9)     .000
  OFAS (γ001)                                .05 (.03)           1.63 (9)         .137
  Sex (γ010)                                 -.90 (.34)          -2.66** (259)    .009
  Ethnicity (γ020)                           -1.14 (.48)         -2.39* (259)     .018
  ESL/ELL (γ030)                             -.88 (.60)          -1.48 (259)      .141
  Free/Reduced Lunch (γ040)                  -1.01 (.39)         -2.57* (259)     .011
Model for RC Growth Rate (π1ij)
  Intercept (γ100)                           .06 (.03)           1.84 (9)         .099
  OFAS (γ101)                                .00 (.00)           .58 (9)          .578
  Sex (γ110)                                 -.00 (.04)          -.02 (259)       .984
  Ethnicity (γ120)                           -.01 (.05)          -.14 (259)       .891
  ESL/ELL (γ130)                             .10 (.06)           1.56 (259)       .121
  Free/Reduced Lunch (γ140)                  .01 (.04)           .17 (259)        .865

Random Effects                               Variance    df     χ2              p
Level 1
  Temporal Variation (etij)                  2.05
Level 2
  Student Baseline (r0ij)                    5.64        249    1038.30***      .000
  Student Growth Rate (r1ij)                 .00         249    235.79          > .500
Level 3
  Teacher Baseline (u00j)                    4.38        9      191.01***       .000
  Teacher Mean Growth (u10j)                 .00         9      12.42           .190

Note. Deviance (FEML) = 3429.67; 19 estimated parameters.


* p < .05; ** p < .01; *** p < .001

Full Model (OFAS 10 Questions). Table 94 below shows the results of the Full Model with Time in Level 1, all the demographic controls in Level 2, and the 10-question

OFAS in Level 3. Compared to the Unconditional Growth Model, 19.71% of the variability in initial status of Reading Comprehension was explained by the Level 2 and 3 variables. This result was statistically significant (τπ00 = 5.62, χ2 = 1038.45, df = 249, p = .000), and was almost the same as in the Full Model with the 50-question OFAS. The variability in the effect of time within students that can be explained by the Level 2 and 3 variables was 44.12%. The variability for the effect of time within students in this Full Model with the 10-question OFAS was nearly the same as in the Full Model with the 50-question OFAS at Level 3. This result was not statistically significant (τπ11 = .00, χ2 = 235.81, df = 249, p > .05), meaning that between student differences in the effect of time were completely explained by the demographic controls and the 10-question OFAS. The remaining variation in Reading Comprehension score after controlling for the linear effects of Time at Level 1, the effects of the demographic covariates at Level 2, and teacher OFAS score at Level 3 was 2.05 for Level 1 (i.e., the same as the 50-question OFAS Full Model), 5.62 for Level 2 (i.e., approximately equal to the 50-question OFAS Full Model), and 2.34 for Level 3 (i.e., lower than the 50-question OFAS Full Model). Thus, the variance in Reading Comprehension score after controlling for these variables was 20.48% associated with variation among measurement months, 56.14% associated with variation between students within teachers, and 23.38% between teachers. Overall mean Reading Comprehension scores across students and teachers were again positive and significantly different from zero (γ000 = 7.12, t = 12.99, df = 9, p = .000). This is the mean Reading Comprehension score (γ000 = 7.12) when the demographic covariates are 0 and teacher 10-question OFAS score is 0. Compared to the 50-question OFAS, with the 10-question OFAS at Level 3 in the model, all the same

demographic controls had a significant effect on the intercepts (p < .05 for all). Thus, at initial status, males outperformed females on the Reading Comprehension subtest on average, with females having a predicted Reading Comprehension score of 6.22. Whites surpassed Minorities on the Reading Comprehension subtest on average, with Minorities having a predicted score of 5.97. Students of higher SES outperformed students in the lower SES category on the Reading Comprehension subtest on average, with lower SES students having an average initial score of 6.09. However, non-ESL/ELL students and ESL/ELL students performed statistically equally on the Reading Comprehension subtest on average (p > .05). Compared to the 50-question OFAS Full Model, the 10-question teacher OFAS score was a statistically significant influence on initial status, suggesting that there were differences in initial Reading Comprehension score across students depending on teacher OFAS score (γ001 = .33, t = 3.60, df = 9, p = .007). As teacher OFAS score increased by one point, the average student Reading Comprehension score increased by .33 points. There was not a significant difference in the effect of Time on Reading Comprehension score across students and teachers (p > .05). This means that the predicted Reading Comprehension growth rate (i.e., every three to four months) for Males, Whites, non-ESL/ELL, and non-Free/Reduced Lunch status students with teachers who have an OFAS score of 0 was the same. The demographic controls did not have a significant effect on the rate of change (p > .05 for all). Finally, teacher OFAS score had no statistically significant effect on the Time slope, indicating that Reading Comprehension rate of change was similar regardless of teacher OFAS score (p > .05).
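As a numerical illustration of the γ001 = .33 estimate, the model-implied initial Reading Comprehension status based only on the fixed effects in Table 94 is approximately:

\[
\hat{\pi}_{0ij} \;=\; 7.12 + .33\,(\mathrm{OFAS}_j) - .90\,(\mathrm{Sex}) - 1.15\,(\mathrm{Ethnicity}) - .89\,(\mathrm{ESL/ELL}) - 1.03\,(\mathrm{Lunch}),
\]

so two otherwise identical students whose teachers differ by one point on the 10-question OFAS are expected to differ by .33 points in initial Reading Comprehension score. This is an interpretive sketch, not the full mixed-model equation with its random effects.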


Table 94

Full Model with the DORA Reading Comprehension (RC) Subtest and the 10-Question OFAS

Fixed Effects                                Coefficient (SE)    t (df)           p
Model for Initial RC Status (π0ij)
  Intercept (γ000)                           7.12 (.55)          12.99*** (9)     .000
  OFAS (γ001)                                .33 (.09)           3.60** (9)       .007
  Sex (γ010)                                 -.90 (.34)          -2.65** (259)    .009
  Ethnicity (γ020)                           -1.15 (.48)         -2.40* (259)     .017
  ESL/ELL (γ030)                             -.89 (.60)          -1.48 (259)      .139
  Free/Reduced Lunch (γ040)                  -1.03 (.39)         -2.61* (259)     .010
Model for RC Growth Rate (π1ij)
  Intercept (γ100)                           .06 (.03)           1.84 (9)         .098
  OFAS (γ101)                                .00 (.00)           .35 (9)          .731
  Sex (γ110)                                 -.00 (.04)          -.04 (259)       .972
  Ethnicity (γ120)                           -.01 (.05)          -.13 (259)       .897
  ESL/ELL (γ130)                             .10 (.06)           1.56 (259)       .119
  Free/Reduced Lunch (γ140)                  .01 (.04)           .18 (259)        .861

Random Effects                               Variance    df     χ2              p
Level 1
  Temporal Variation (etij)                  2.05
Level 2
  Student Baseline (r0ij)                    5.62        249    1038.45***      .000
  Student Growth Rate (r1ij)                 .00         249    235.81          > .500
Level 3
  Teacher Baseline (u00j)                    2.33        9      93.10***        .000
  Teacher Mean Growth (u10j)                 .00         9      12.65           .178

Note. Deviance (FEML) = 3423.70; 19 estimated parameters.


* p < .05; ** p < .01; *** p < .001


CHAPTER 5: DISCUSSION

Purpose

The purposes of this study were twofold: (1) To examine the relationship between computerized/online formative assessments and summative, yearly state proficiency test scores, and (2) To validate a newly developed measure of teacher use of computerized/online formative assessment. More specifically, the relationship between a computerized/online formative assessment program in reading, DORA, and Colorado state student test scores in reading (i.e., the CSAP) was examined across several grade levels in one school district beginning in the 2004/2005 academic year and ending in 2009/2010. Multilevel growth modeling was used to explore the relationship between the computerized/online formative assessments in reading and the state test in reading. Finally, a measure of teacher use of computerized/online formative assessment (i.e., the OFAS) was developed to further explore the multilevel influence of teacher formative assessment practices on student formative assessment and state test scores. This study attempted to validate the scores on the OFAS via the following assertions: (1) DORA growth is reflective of student growth in reading, which is related to growth on state proficiency tests in reading, and (2) Teachers who use DORA more frequently are able to diagnose student reading learning barriers with specificity, and use that feedback to improve their students' reading scores. Ideally, the relationship between

student DORA scores and student CSAP scores, and the relationship between teacher OFAS scores and student DORA scores, suggest a multilevel influence of teacher use of computerized/online formative assessment on student reading scores. The above assertions attempt to delineate the influence of teacher use of computerized/online formative assessment on student reading scores, not only on direct measures of student formative assessment in reading, but also on more distal and high-stakes reading tests (i.e., the state tests in reading). The following chapter includes a review of the objectives, a discussion of the results for each research question, a discussion of the combined research question results in a validation argument framework, implications, limitations, future directions, and a conclusion.

Objectives

This preliminary investigation focused on the following three main objectives: (1) Examining if computerized/online formative assessment growth is related to state test score growth, (2) Developing a behavioral frequency measure of teacher use of computerized/online formative assessment programs, and (3) Investigating the relationship between the newly developed measure of teacher computerized/online formative assessment use and student computerized/online formative assessment scores. Specific to the first objective, the traditional formative assessment literature has found that its proper use can raise student standards and achievement (Black & Wiliam, 1998a). The latest studies of technology-based formative assessment have started to replicate these findings. This first objective aimed to add to this growing literature base by examining one computerized/online formative assessment program and its relationship

to a summative state proficiency test in the same content area. This demonstrated relationship can provide teachers and administrators with evidence to warrant the continued use of technology-based formative assessment practices, which provides not only the benefits of increased student achievement, but also several practical benefits to teachers and students alike. Related to the second objective, the role of the teacher in the formative assessment cycle was noted in the literature, with teachers who have more assessment knowledge and use higher-quality formative assessment strategies producing higher-achieving students (Elawar & Corno, 1985; Fuchs, Fuchs, Hamlett, & Stecker, 1991; Tenenbaum & Goldring, 1989). Thus, this second objective aimed to add to the formative assessment research base by creating a measure of the computerized/online formative assessment practices of teachers. Developing a psychometrically sound measure of teacher computerized/online formative assessment practices can facilitate and extend the examination of the multilevel relationships (i.e., students, teachers, schools, school districts, etc.) integral to the study of computerized/online formative assessment. Related to the third objective, as noted previously, studies have shown that teachers who engage in more frequent and higher-quality formative assessment practices have higher learning gains in their students. This objective aimed to demonstrate that teachers with higher scores on a newly developed behavioral frequency measure of teacher use of computerized/online formative assessment will produce students with higher online formative assessment scores (i.e., growth of online formative assessment scores). Establishing this relationship, in concert with the first objective, can begin to validate the scores on this newly developed measure of teacher computerized/online formative

assessment practices as an indicator of student achievement, not only on various online formative assessment tests, but also on other more high-stakes tests. In general, the above purposes and objectives can support the burgeoning literature outlining the role of the Internet, and technology in general, in teaching and learning. Internet-mediated teaching and assessment is becoming commonplace in the modern classroom, and is more frequently being used to support or replace traditional modes of student evaluation. Thus, the need to examine the extent to which these methods are educationally sound is in high demand. Results from this study can not only add to the literature base theoretically and methodologically, but also practically, by bolstering support for federal initiatives and administrative demands for more efficient, technology-based ways to meet state standards and increase student achievement.

Discussion of Research Question 1 Results

For this research question, the discussion of the results will be separated into two sections: Descriptive and Inferential. The first section will contain a discussion of the analysis sample (e.g., demographic information and CSAP and DORA descriptive statistics), which will be followed by a discussion of the substantive results specific to the hypothesis and research question of interest.

Descriptive Discussion - Demographic Information. The sample data used to address this first research question were existing data from two sources: the CDE and LGL. The CDE provided demographic information and state reading test scores for the Highland School District, and LGL supplied the online formative assessment scores for the same district. The county and district demographics were outlined first for comparison purposes with the final analysis sample. The intent in most research studies

such as this is to make inferences back to the population, which in this study was districts and counties containing districts similar to the Highland School District. That is, this study is believed to be generalizable to similar rural school districts with a higher proportion of Hispanic students (i.e., around 30%). Thus, a description of the county and district demographics was imperative to determine if this study's results can be generalized. This information can also provide the necessary resources for other researchers and administrators to make comparisons with their target population or research sample to determine if the results from the current study might be applicable in their contexts. The following information was summarized from the United States Census Bureau, the NCES, and the CDE. Weld County contains only one school district, the Highland School District, with only an elementary, middle, and high school. It is located in a rural area in northeast Colorado with a sizable Hispanic population. For the county and in each school in the district, in terms of ethnic/racial composition, around 70% identified as White (Non-Hispanic), and approximately 30% were Hispanic/Latino. This was consistent across all schools and years of interest in this study. The county and district ESL/ELL population hovered between 10% and 20%, with slight differences among the schools in the district. The gender composition for the county and district was nearly an even split. The only inconsistency noted between the county and district demographic information was that in the county, only 12% of individuals were reported to be below the poverty line, but in each school in the district, around 40% to 50% of the students were identified as free/reduced lunch eligible. This discrepancy may be due to the problems

associated with various measures of SES in education research in that multiple measures exist (e.g., parental combined income, mother's education, state poverty line cutoffs, free/reduced lunch status), with no consensus as to which one is most accurate or preferred (Kurki, Boyle, & Aladjem, 2005). Finally, the longitudinal demographic district profile provided by the NCES and the CDE remained consistent with the above information for all years provided (i.e., 2004/2005 to 2008/2009). Compared to the above population demographics, the final analysis sample appeared similar for the years provided. Although the current demographic information is not available, it can be assumed that since the district profile remained the same for the years above, this trend would also be evidenced in the current academic year. This insight gives confidence to generalizing the results to rural school districts of similar demographic structure, containing a strong Hispanic population, a nearly equivalent gender split, smaller percentages of ESL/ELL students, and almost half of the students qualifying for free/reduced lunch (i.e., as a measure of SES).

Descriptive Discussion - CSAP Scores. The state and district CSAP scores (i.e., CSAP growth), as reported by the CDE, were outlined first for comparison purposes with the final analysis sample. To make inferences back to the population, a description of the state and district CSAP growth was imperative to determine if this study's results can be generalized, and if similar trends were demonstrated across all the academic years of interest for Research Question 1. State and district CSAP growth was outlined for the academic years of 2006/2007 to 2008/2009 in the results section for the first research question. The state-level data demonstrated a positive trend of moving in increasing numbers into Proficient and Advanced levels, and being able to stay proficient and above

over time. The results showed that students in the state were on track to maintain this trend (i.e., growth) over time. For the district (according to the CDE), the total growth percentile across the three years in reading for all grades in the district showed a slight decline. Compared to the state and district CSAP growth as reported by the CDE, the final analysis sample's CSAP scores for the first research question followed a positive linear trend. This is congruent with the state-level growth information, but not the district growth descriptives provided by the CDE. This discrepancy may be due to the fact that the years of data used in the current research question span 2004/2005 to 2008/2009, whereas the growth models on the CDE website contain only 2006/2007 to 2008/2009. Additionally, not all grade levels and students in the district were included in the final analysis sample, and problematic cases that may contribute to the slight decline noted by the CDE, such as IEP students, were removed from the final analysis sample. The first research question used data from grades 3 through 11, and the CDE growth information included all grade levels and students in the district. Although the most current information is not available, it can be assumed that since the state and district profile remained the same for the years above, this positive linear trend would also be evidenced in future growth models. This gives confidence to generalizing the results to populations with similar state test score growth in reading, containing the demographics as mentioned above. Overall, the final analysis sample demonstrated a positive linear trend in reading CSAP growth, similar to the state-level growth, which not only is favorable for the analysis of this research question (i.e., HLM), but is generalizable to populations demonstrating the same trend.

Descriptive Discussion - DORA Scores. DORA score information for the state (or country) was not provided by LGL for comparison purposes. Additionally, DORA was implemented in the district relatively recently (i.e., 2006/2007), as with most districts across the United States. Comparative information may not be available for a few more years. However, the results from the final analysis sample showed a relatively consistent positive linear trend, appropriate for the proposed analysis of the first research question, for the four main subtests of Word Recognition, Oral Vocabulary, Spelling, and Reading Comprehension. This is favorable for the analysis of this research question, as these subtests served as covariates in examining the relationship with another continuous, growth-related variable (i.e., the CSAP).

Inferential Discussion. A Time-Varying Covariate Hierarchical Linear Growth Model was run to examine the relationship between student state test scores in reading (i.e., the CSAP) and student DORA scores across grades 3 through 10 from the academic years of 2004/2005 to 2009/2010. The goal in this first research question was to examine if student CSAP growth is related to student DORA growth. The hypothesis was that student CSAP score growth would be significantly and positively related to student DORA score growth. The multilevel growth models will be discussed below, separated first by DORA subtest, and then summarized collectively. Only the final models will be discussed, as these models address the substantive research question of interest. In the Full Model with Time and Word Recognition in Level 1 and all the demographic variables in Level 2, the initial CSAP average scores were significantly greater than zero. There was a statistically significant effect of ESL/ELL status and Free/Reduced Lunch status on mean CSAP score, with non-ESL/ELL and non-Free/Reduced Lunch status students outperforming their ESL/ELL and Free/Reduced Lunch status counterparts at initial status.

The estimated average student CSAP growth was significant. That is, students, on average, grew 12.49 points every three to six months on the CSAP. The CSAP and Word Recognition covariation results were examined. The results indicated that on average, across students, the DORA Word Recognition subtest was significantly and positively related to the CSAP state test in reading. For every one point increase in Word Recognition score, there was a 1.78 point increase in the state test score. Therefore, overall, students' gain in Word Recognition over time did covary significantly and positively with their CSAP gain. The next DORA subtest examined was Oral Vocabulary. With Time and Oral Vocabulary in Level 1 and all the demographic variables in Level 2, the initial CSAP average scores were significantly greater than zero. There was a significant effect of ESL/ELL status and Free/Reduced Lunch status on mean CSAP score, with non-ESL/ELL and non-Free/Reduced Lunch status students outperforming their ESL/ELL and Free/Reduced Lunch status counterparts at initial status. The estimated average student CSAP growth rate was significant. That is, students, on average, grew 11.65 points every three to six months on the CSAP. Next, the CSAP and Oral Vocabulary covariation results were examined. The results indicated that on average, across students, the Oral Vocabulary subtest was significantly and positively related to the CSAP state test in reading. For every one point increase in Oral Vocabulary score, there was a 2.17 point increase in the state test score. Therefore, overall, students' gain in Oral Vocabulary over time did covary significantly and positively with their CSAP gain.
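The time-varying covariate structure underlying these estimates can be summarized compactly. A minimal Level 1 sketch, assuming standard notation with the DORA subtest entered as a time-varying predictor alongside Time (the demographic controls enter at Level 2, as described above):

\[
\mathrm{CSAP}_{ti} \;=\; \pi_{0i} + \pi_{1i}(\mathrm{Time})_{ti} + \pi_{2i}(\mathrm{DORA\ subtest})_{ti} + e_{ti},
\]

where the average of π2i across students is the covariation coefficient reported for each subtest (e.g., 1.78 for Word Recognition and 2.17 for Oral Vocabulary).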


With Spelling as the time-varying covariate in Level 1 and all the demographic variables in Level 2 in the Full Model, the initial CSAP average scores were still significantly greater than zero. As with the first two DORA covariates, there was still a statistically significant effect of ESL/ELL status and Free/Reduced Lunch status on mean CSAP score, with non-ESL/ELL and non-Free/Reduced Lunch status students outperforming their ESL/ELL and Free/Reduced Lunch status counterparts at initial status. The estimated average student CSAP growth rate was significant and positive. This means that on average, students grew 10.67 points every three to six months on the CSAP. The CSAP and Spelling covariation results were examined next. The Spelling subtest was significantly and positively related to the CSAP state test in reading. For every one point increase in Spelling score, there was a 3.41 point increase in the state test score. Therefore, overall, students' gain in Spelling over time did covary significantly and positively with their CSAP gain. With Reading Comprehension as the time-varying covariate in Level 1 and all the demographic variables in Level 2 in the Full Model, the initial CSAP average scores were significantly greater than zero, as with all the other models. Similar to the other DORA covariates, there was still a statistically significant effect of ESL/ELL status and Free/Reduced Lunch status on mean CSAP score, and it was again in the same direction. Additionally, the average student CSAP growth rate was still significant and positive as with the other models. This means that on average, students grew 10.02 points every three to six months on the CSAP. The CSAP and Reading Comprehension covariation results were examined. The results indicated that on average, the Reading Comprehension subtest was significantly and positively related to the CSAP state test in

reading. For every one point increase in Reading Comprehension score, there was a 2.75 point increase in the state test score. Therefore, students' gain in Reading Comprehension over time did covary significantly and positively with their CSAP gain. As mentioned before, the goal in this first research question was to examine if student CSAP growth is related to student DORA growth. The hypothesis was that student CSAP growth would be significantly and positively related to student DORA growth. Thus, the relationship of interest in addressing this research question is between the time-varying covariate (i.e., the DORA subtest) and the state reading test (i.e., the CSAP). The hypothesis was supported in the Full Models in that all the DORA subtests were positively and significantly related to state reading test scores, indicating that growth on these subtests is correlated with students' reading growth on the state test. In comparing the time-varying growth rates from the Full Models, all were significantly and positively related to the state test in reading. For every one point increase in Word Recognition score, there was a 1.78 point increase in the state test score, and for every one point increase in Oral Vocabulary score, there was a 2.17 point increase in the state test score. For every one point increase in Spelling score, there was a 3.41 point increase in the state test score, and for every one point increase in Reading Comprehension score, there was a 2.75 point increase in the state test score. Therefore, student performance on the Spelling subtest resulted in faster growth on the CSAP compared to the other subtests. This is an interesting finding, as typically the Reading Comprehension subtest is viewed as the most similar in structure and content to state reading tests (Let's Go Learn, Inc., 2009c). As mentioned previously, the Reading Comprehension subtest attempts to

provide a window into the semantic domain of a learner's reading abilities. Children silently read passages of increasing difficulty and answer questions about each passage immediately after they read it. The questions for each passage are broken up into three factual questions, two inferential questions, and one contextual vocabulary question. This is typically how many state reading tests structure their exams. As indicated above, the Spelling subtest surprisingly was related to the fastest CSAP growth rates in students. Spelling is a generative process as opposed to a decoding or meaning-making process as seen in most assessments of reading comprehension, which does not support the finding noted above. Additionally, it is natural for young readers' spelling abilities to lag a few months behind their reading comprehension abilities (Bear, Invernizzi, Templeton, & Johnston, 2000). It is surprising to see that Word Recognition subtest growth was correlated with CSAP growth, as the testing of word identification skills out of context is typically not a skill that is the focus of standardized reading assessments (Let's Go Learn, Inc., 2009c). This is especially true for those above the third grade level, at which time the reading of words in context has a greater instructional emphasis. As for Oral Vocabulary, it was not surprising to see a significant correlation between this subtest and CSAP growth, as a learner's knowledge of words and what they mean is an important part of the reading process (National Reading Panel, 2000). The knowledge of word meanings affects the extent to which the learner comprehends what he or she reads, such as in a more traditional standardized reading test. Overall, the significant findings from the Full Model indicate that as modes of online formative


assessment, growth on the DORA subtests is related to CSAP growth, with Spelling producing the highest growth rate. For all of the DORA subtests, the non-ESL/ELL and non-Free/Reduced Lunch status students outperformed their ESL/ELL and lower SES counterparts. This was not surprising, as the measures of interest in this research question were reading tests, which require the use of standard English. Research has shown that ELL students generally perform lower than non-ELL students in reading, science, and math. Moreover, the level of impact of language proficiency on assessment of ELL students is greater in the content areas with higher language demand (Duran, 1989; Garcia, 1991). For example, analyses showed that ELL and non-ELL students had the greatest performance differences in the language-related subscales of tests in areas such as reading (Abedi, 2002). Additionally, the Free/Reduced Lunch findings were not surprising either, as Free/Reduced Lunch status children have been shown to enter Kindergarten with math and reading skills substantially lower, on average, than their middle-class or higher counterparts (Kurki, Boyle, & Aladjem, 2005; Merola, 2005). It was unexpected that Gender and Minority status were not significant, as research has repeatedly shown that girls score higher than boys in reading, and a higher percentage of girls achieve reading proficiency levels in school (NCES, 2008). As for minorities, previous research has documented that there are achievement gaps among students from different ethnic groups, namely Whites versus minorities, in reading (Ferguson, 2002; Ferguson, Clark, & Stewart, 2002; Harman, Bingham, & Food, 2002). Overall, the small but positive impact of these practice tests on subsequent examination performance provides preliminary evidence that computerized/online formative assessment can serve

as an effective test preparation strategy. This supports the first assertion in the beginning of this section that DORA growth is reflective of student growth in reading, which is related to growth on state proficiency tests in reading. The implications of these findings will be discussed in the following paragraphs.

Implications. The findings from this first research question have implications in a number of contexts and for many groups of people. As noted on multiple occasions, research has established that traditional formative assessment's proper use can raise student standards and achievement, with studies of technology-based formative assessment beginning to replicate these findings. The findings from this first research question begin to contribute to this growing literature base. This first objective examined one computerized/online formative assessment program and its relationship to a summative state proficiency test in the same content area. Overall, the demonstrated relationship can provide teachers and administrators with evidence to warrant the continued use of technology-based formative assessment practices, which provides not only the benefits of increased student achievement, but also several practical benefits to teachers and students alike. Specific to the results from this first research question, use of DORA is associated with higher (and statistically significant) learning gains on a summative, end-of-year state test in reading. This supports DORA as a learning tool to gauge, or perhaps predict, student performance on the reading state test. This relationship is reassuring given the number of educators who are using technology-based teaching methods, and the number of administrators who are seeking to increase the use of technology as a learning tool in their schools. Although causal inference is limited, for teachers/educators, focusing on a

student's DORA growth can potentially add to growth on the state reading test. For example, if a teacher can raise a student's DORA Spelling subtest score by just 1 point, he or she can expect to see a 3.41 point increase on the reading state test on average. Teachers (and administrators) will also benefit from the results of this first research question in garnering support for the use of computerized/online formative assessment from a practical perspective. As noted in the literature review, many advantages exist in using computerized/online formative assessment. One major benefit is the ease of disseminating feedback to students after an assessment, and using the automated, specialized feedback to diagnose problems and quickly remedy these issues in time for the state exam. Buchanan (2000) noted that the individualized feedback makes this mode of computerized/online formative assessment ideal for large, multi-section, introductory-level college courses. In the case of the current study, this mode can also be deemed ideal for large classrooms of elementary, middle, and high school students, in which teachers may not have the time or resources to give specialized attention or feedback to everyone. Another obvious implication of the current finding is the practical advantage of ease of assessment for the large number of students being tested in the educational system. For formative assessment to be most effective, quality feedback should be provided at frequent intervals, and testing a large number of students frequently with specialized feedback can advocate the use of a technology-based mode of formative assessment. Implications also extend to the cost surrounding mass testing. Since a positive relationship was indicated between DORA and the CSAP, this may allow


administrators to have the necessary support to purchase site licenses and invest money in such programs, which are generally cheaper to administer frequently in bulk. For administrators, the demand for school systems, individual schools, and teachers to be accountable for student performance has increased considerably over the past two decades. This demand for accountability relates to a direct measurement of attainment of educational standards and objectives. The results from this research question support the use of DORA as a way to measure and attain various educational standards, such as having students pass and excel on the end-of-year, summative state exam. Overall, these results provide support for administrative demands to find more efficient ways to meet state standards. The positive relationship between DORA and the CSAP may help schools obtain the funds needed for programs to alleviate some constraints of mass assessment. The implications for researchers include the expansion of the study of computerized/online formative assessment in general to younger grade levels (i.e., not just college-level). As mentioned previously, most previous studies of computerized/online formative assessment have examined only college-age populations in the university setting, usually within one course. More specifically, the examination of the relationship between measures of computerized/online formative assessment and performance on state proficiency tests will have implications for more research. Most of the current research has focused on performance on end-of-course or final exams. Research in the area of computerized/online formative assessment also has implications for methodological advances in the area that include more quantitative knowledge of computerized/online formative assessment and its relationship with

high-stakes proficiency exams. Most research in this area is qualitative, summarizing student reactions and perceptions of a technology-based platform for quizzes and exams. Finally, due to the novelty of the mode of online or computerized administration, research is lacking in longitudinal data analysis, with no studies examining multiple years of data across several cohorts. Thus, this first research question can open the door for other researchers to obtain funding to investigate this topic on a larger scale, or provide the necessary background information to determine the logistics and impact of future studies (e.g., sample size considerations, measurement of main variables of interest).

Limitations and Future Directions. The following paragraphs will detail the methodological and statistical limitations in the current research question, which in turn inform the future directions for this investigation. The major methodological limitation in the current research question was the lack of a control group. Although studies such as the current investigation and related research questions have a place in the research process (e.g., theory generation, determining correlates for future experimental research), the conclusions drawn cannot extend past what the design and analyses can demonstrate. One of the more common problems in correlational studies such as this is to imply causation. However, without an adequate control group, inferences are relegated to the correlational. Thus, the findings in the current research question cannot state that computerized/online formative assessment use causes increased state test scores, only that DORA and CSAP growth are positively related. Future research should consider implementing a similar design, but also obtain an adequate control group (e.g., other districts that are not using DORA).


Another major methodological limitation includes the use of one school district, which is a considerable threat to external validity. As mentioned previously, the Highland School District is a rural district in northeastern Colorado with a large Hispanic population. The current research question, and study, could have benefitted from sampling a number of school districts across the United States in other rural areas, as well as in urban areas. For example, as noted in the results, the Highland School District's CSAP growth was slightly different (i.e., at times, it decreased across the academic years from 2006/2007 to 2008/2009) compared to the entire state of Colorado. Sampling a number of school districts with multiple elementary, middle, and high schools could have added to the validity of the results. Future studies should include multiple school districts from a range of rural and urban areas, and perhaps involve public and private schools as well. With regards to general threats to internal validity, or the extent to which the intervention rather than extraneous influences can account for the results, some more limitations are apparent (Cook & Campbell, 1979). For example, history could have impacted the results of the current research question if unplanned events disrupted the administration of the variables of interest (i.e., DORA, CSAP). Additionally, maturation is a considerable limitation in this first research question, since the data sampled took place over a few academic years. It is important to rule out maturation so that any changes over time (i.e., DORA or CSAP growth) can be distinguished from the changes associated with an intervention. Simply growing older and wiser across the academic years of interest could explain the correlated growth demonstrated in this research question.


Thus, future studies should include a no-treatment or control group to specifically identify the impact of the intervention. Some other threats to internal validity to consider as limitations in the current research question are testing and attrition. Testing refers to the effects that taking a test one time may have on subsequent performance on the test (Cook & Campbell, 1979). That is, changes demonstrated on DORA or the CSAP after the first administration may only be attributable to repeated testing. Attrition, or loss of subjects, occurs when an investigation spans more than one session and may last days, weeks, months, or years. This is a threat to internal validity, as changes in overall group performance on the measures may be due to the loss of those subjects who scored in a particular direction, rather than to the impact of an intervention (Kazdin, 2003). Testing may not be an issue in the current research question, as DORA and CSAP tests were administered months and years apart, with new questions used at each administration. Attrition may also not be a concern, as in the presentation of the descriptive results, great care was used to demonstrate that the cases removed from the final analysis sample did not differ considerably on demographics, and justifications were made for removing various cases that were not of interest in the current study (i.e., IEP students). As mentioned previously, the implementation of a no-treatment or control group can greatly assist in the inferences drawn from the results in this first research question. There are other obvious threats to external validity in this study besides sample characteristics and stimulus characteristics and settings. For example, novelty effects can complicate the generality of the findings (Kazdin, 2003). Novelty effects refer to the possibility that the effects of an intervention may in part depend upon their

innovativeness or novelty in the situation (Bracht & Glass, 1968). For example, DORA was implemented only a few years ago in the district (i.e., 2006/2007), and the novelty of this new assessment or program may explain some of the DORA growth or other related findings. Future studies should implement the same design, but analyze DORA and CSAP growth after the program's novelty wears off. Reactivity is another external validity concern. The obtrusiveness of the state test and DORA tests over the course of the academic year may alter a student's performance from what it would otherwise be (Kazdin, 2003). Future studies may consider implementing other unobtrusive measures in an experimental manipulation to determine if the results vary as a function of the awareness of being assessed. One limiting consideration could be the omission of analyzing cohort effects. As mentioned, four cohorts were extracted from the existing data provided by the CDE and LGL. The decision not to analyze cohort was based on a number of reasons. First, at this time, there is no literature to suspect cohort effects. Additionally, DORA as a program is too new to consider the examination of cohort effects. Also, two of the four cohorts analyzed did not display a consistent linear trend in CSAP growth over time, but the combined cohort data did demonstrate a linear trend. If cohort were analyzed separately (e.g., as another covariate in the HLM), then the non-linear trends in Cohorts 2 and 3 would complicate the interpretation of the results. Finally, modeling additional covariates in HLM necessitates a larger sample size, which is already considered smaller than ideal to address this first research question. As indicated above, in examining each individual cohort, Cohorts 2 and 3 displayed a negative slope between the final two time points from the CSAP

administration in 2007 to 2008. Although this may be considered problematic, the total sample (i.e., all cohorts combined) showed a relatively consistent positive linear trend appropriate for the proposed analysis in the current research question. Also, these cohorts could not be dropped because a larger sample size is needed to model growth in HLM. Future studies should examine DORA and CSAP growth with other districts where consistent linear growth is demonstrated. With regards to the DORA and CSAP data, the DORA data range from 2006/2007 to the present and the CSAP data range from 2004/2005 to the present. Moreover, there are more DORA time points than CSAP time points. Therefore, the data collection time points do not line up. Thus, the coding was not exact because the data were not collected at the same time points (i.e., it was approximated). Generally, HLM can accommodate time-unstructured data such as the above; however, the accuracy and validity of results (e.g., statistical conclusion validity) often depend on how closely the data are measured (i.e., same time/day compared to several days/weeks apart) in a time-varying covariate model (Biesanz, Deeb-Sossa, Papadakis, Bollen, & Curran, 2004). Although this could be considered problematic, analyzing only the data points that matched compared to all the data did not change the substantive results for this first research question. Another limitation to consider is the cases removed. As mentioned in the results for this first research question, missing data imputation could have been used (e.g., regression). However, this is complicated if the dataset is missing more than one data point (i.e., as the majority were in this research question), decreasing the accuracy of the imputed values. To address this concern, the models were run both with and without the cases that had

missing values (i.e., on either DORA or CSAP), and the results did not change substantively. Analyzing only students in grades 3 through 11 could be considered another weakness of the current research question. Students in grades Preschool through 2 and grade 12 were not included because this study focused on the state test and regularly administered formative assessments, which only occur in grades 3 through 10 and 11. State testing in Colorado begins in grade 3, and grades 11 and 12 are given college preparatory exams and high school exit exams (i.e., not the CSAP). Additionally, DORA is administered more frequently in younger grade levels, and at least three time points are necessary to analyze the data for this research question, which supports the omission of the older grade levels. However, the omission of the noted grades above has implications for generalizing to the entire district or similar districts. Future studies should consider analyzing all grade levels with more complete data from multiple districts. With regards to the HLM assumptions, some violations were noted in examining linearity, normality, and homogeneity. Violations of one assumption can also complicate the results or interpretation of other assumptions. These violations can impact the statistical conclusion validity for this first research question. Statistical conclusion validity refers to the facets of quantitative evaluation that influence the conclusions drawn about the effect of the experiment or experimental condition (Cook & Campbell, 1979; Kazdin, 2003). Although the violations were minor (i.e., mostly heterogeneity), and the robust standard errors were reported from the HLM models, the results should be interpreted with caution. Future studies should use multiple school districts with more


data collection time points to have more flexibility to eliminate outliers and evaluate assumptions more accurately. Another limitation includes the number of models run in the current research question. This has an impact on statistical conclusion validity in that the more tests that are performed, the more likely a chance difference will be found even if there are no true differences between conditions (i.e., increased Type I Error rate). A series of HLM growth models were run for each DORA subtest as the time-varying covariate, which is problematic for the reason noted above. Future studies may include a DORA composite, which is being created by LGL and not currently available for evaluation. One final limitation may be the absence of directly comparing the variances and deviances of all the models. Generally, deviance comparisons (i.e., via a Chi-Square test) can be used when all models that a researcher wishes to compare are run under FEML, as in the current research study. Typically, if the deviance is smaller for a particular model compared to the others, it is considered the better-fitting model. However, the goal of the current research question and this exploratory study is to examine the data, not to find the best-fitting model. Furthermore, exploratory studies such as this are the first in a long line of research, and calculating and comparing model fit at this first step would be somewhat premature. Future studies that consider some (or all) of these limitations may include calculating deviance comparisons when the major threats to internal, external, and statistical conclusion validity are remedied.

Discussion of Research Question 2 Results

For this second research question, the discussion of the results will be separated into two sections: Descriptive and Rasch Analysis. The first section will contain a

discussion of the analysis sample (e.g., demographic information of the teacher sample), which will be followed by a discussion of the substantive results specific to the hypothesis and research question of interest. Descriptive Discussion Demographic Information. As mentioned previously, survey data were collected on two fronts: (1) all teachers in the Highland School District, and (2) online survey data collection from all teachers who currently use DORA across the United States. The survey data were combined to conduct the Rasch Analysis to address this second research question. The intent in most research studies such as this is to make inferences back to the population, which in this case is teachers who use DORA, or more generally, teachers who use a mode of computerized/online formative assessment. A comparison with the population of teachers who use DORA across the United States and Canada is not possible, as LGL does not collect or provide teacher demographic information. The final analysis sample mostly contained teachers from Colorado, with several other states (N = 12) represented. The overwhelming majority was female (85%). The mean age of teachers was around 42 years, with nearly 14 years teaching experience and 9 years teaching in their current district on average. The majority of teachers were in the elementary grade levels, with approximately the same amounts in middle school and high school. Most teachers categorized themselves as general reading teachers, with special education, language arts, and ESL/ELL teachers also represented. Finally, almost 3/4 of the teachers indicated that their highest degree obtained was a Masters. Overall, the sample used to examine the psychometric properties of the OFAS (i.e., see below) was slightly problematic for the external validity of the measure with 419

regards to the states represented. For example, the majority of teachers used in the analysis sample were from Colorado. Future studies should attempt to collect a teacher sample that represents a majority of states and various types (i.e., rural, urban) of school districts in those states. The gender composition was not problematic, as most elementary, middle, and high school teachers are female. The years experience and years in the district were not deemed obstacles as there was variability in the ranges reported (e.g., SD 10 years for total years teaching). Most of the teachers surveyed were elementary school teachers, and future studies can implement purposive sampling as a way to ensure that all grades levels are nearly equivalent. Although the development and refinement of the OFAS will require several studies and analysis samples, this first sample was acceptable in terms of generalizability to the population of teachers in the United States. Future studies should attempt to describe the population of teachers who use computerized/online formative assessment across the United States for more accurate comparisons and generalizability. Rasch Analysis Discussion. The purpose of this second research question was to examine the psychometric properties of a newly developed measure of computerized/online formative assessment practices of teachers. As mentioned previously, measuring teaching practices and behaviors is one way to gauge how progress is being made towards state standards and benchmarks of achievement. Research has demonstrated that teachers who use more frequent formative assessment practices have higher learning gains in their students (Elawar & Corno, 1985; Fuchs et al., 1991; Tenenbaum & Goldring, 1989). Therefore, the immediate purpose in creating this measure was to eventually evaluate the hypothesis (i.e., in Research Question 3) that 420

teachers with higher scores on the OFAS are related to their students computerized/online formative assessment score growth (i.e., DORA). The results of the measure and its refinement will be discussed in the following paragraphs. In the first run of the data with the original 56-question OFAS, larger separation values for items than for persons were found, which is typical as a function of the data having a smaller number of items and a larger number of people. However, in the current measure there are 56 items and 47 people. The model has a larger separation value indicative of the true variability among items being much larger than the amount of error variability; however, a larger item separation was preferable. Additionally, Cronbachs Alpha was .95, suggesting that the reliability measures indicate that the model fit was good (i.e., good internal consistency). The difficulty of the step where the transition points between one category and the next are expected to increase with each category value demonstrated the necessary information (i.e., increased with each category from Never to Almost Always). To support this, the probability curves displayed no category inversions. With regards to the items, infit measures for all items on this measure were acceptable, as well as most outfit measures; however, one item (i.e., Question 48: In a given quarter/semester, how often do you use DORA results/reports to help the low-achieving students with their reading performance?) had a larger than desired value. Further analysis of the item content revealed that this item was very easy, meaning that most teachers endorsed the highest category on the scale for this item (i.e., Almost Always). This item was grouped with items 46 and 47 in the scale as being very similar in content and wording (i.e., and grouped together in the same subsection, Using the 421

Results). Items 46 and 47 were the next highest misfitting items. Question 47 asked, "In a given quarter/semester, how often do you use DORA results/reports to help the high-achieving students with their reading performance?" and Question 46 asked, "In a given quarter/semester, how often do you use DORA results/reports to help all students with their reading performance?" In reviewing the monotonically changing average theta per category for each item, there were some changes in rank. This was demonstrated for items 48 and 47, but not 46; however, it was decided that all three items should be removed because of their conceptual relation and their high outfit values. The map of persons and items was examined to determine the degree to which these items were targeted at the teachers. The scale appeared to be applicable for its purposes, with items approximately normally distributed and slightly leptokurtic. As mentioned in the results, the majority of the items were compressed within a very small logit range on the scale, which could be a function of how the items were written (i.e., grouped conceptually together by subsection and worded redundantly/similarly). According to the map, items covered a smaller range in difficulty compared to the range for persons. This indicates that easier and harder items may need to be added in future studies to extend the range of the trait measured. Finally, as indicated in the results, at four points on the scale there are six items at the same position. This is considered a problem in that these items are redundant or too similar with regards to the construct being measured (i.e., tapping into the same idea), and are not adding anything new. Typically, a few of the redundant items could be dropped; however, further examination of these grouped items identified them as going together conceptually as a family (e.g., items 9 through 16, and items 17 through 24) or a subscale.

Thus, no redundant items as indicated above were removed in this first run of the data or in this exploratory examination of the construct. One overarching question to address when examining all the Rasch diagnostics is, "Does the order of items make sense?" Pertaining to the current scale, should teachers find it harder to agree that they "compare the classroom results with the school district" (i.e., item 51) than that they "incorporate the results into your instruction" (i.e., item 6)? Although the answer to this question is subjective, item 51 may take a considerable amount of effort (e.g., compiling the district information, compiling the classroom results, entering the data into a spreadsheet, examining the data, comparing the results), compared to item 6. Additionally, there are also some gaps that need to be addressed. For example, item 3 (i.e., "In a given quarter/semester, how often do you download/access the parent report after a completed assessment?") and item 36 (i.e., "In a given quarter/semester, how often do you communicate the results/reports to students in a written format (i.e., either a standard letter or e-mail)?") have a large gap (i.e., .5) between them. In future studies (i.e., discussed below), items should be created that mark or fill in that level of the trait. In the second run of the data with the 53-question OFAS, removing items 46, 47, and 48, the person separation showed a little improvement over the first run with 56 items. Item separation was still smaller than the person separation (i.e., as in the 56-item OFAS), and was slightly smaller than the first run of the data. This was an indication that perhaps subsequent iterations might not improve considerably. Overall, the model had a separation value greater than 2, as with the 56-question OFAS, which shows that true


variability among items is much larger than the amount of error variability; however, a larger item separation was desirable. Many of the other Rasch diagnostics were the same or nearly the same as the 56-question OFAS, such as the person separation reliability, item reliability, model reliability, and Cronbach's Alpha. In examining the misfit order in the item statistics, a few other items were removed (i.e., items 53, 50, and 45). Although none of the following questions had an outfit value above 2, they were considered to be problematic in that they were grouped under the same section as the previously removed questions (i.e., 46, 47, and 48). Question 53 (i.e., "In a given quarter/semester, how often do you compare individual student results with the rest of the class?"), question 50 (i.e., "In a given quarter/semester, how often do you compare the results with your other content-related classroom quiz/test/exam results (i.e., quizzes/tests/exams that you have constructed)?"), and question 45 (i.e., "In a given quarter/semester, how often do you link the results to your course standards and/or objectives?") all had outfit values close to 2. More specifically, questions 53 and 50 were grouped together conceptually, and with items 51 and 52, asking about comparing the results to other groups or standards. Looking at the items more closely, for items 53 and 50, there was little discrimination along the response scale in that a nearly equivalent number of people endorsed the middle two categories (i.e., Rarely and Sometimes), and also endorsed the highest category (i.e., Almost Always) a little more frequently. To explain these findings for question 53 (and perhaps 45), a closer examination of the DORA reports indicated that the information contained in the reports includes various statements of comparison between the individual student and class, and the student and state standards and district information (i.e., as in items 53 and 45).

This information is readily available to teachers if they need it. Perhaps some confusion was created in that this information is already calculated and prepared for the teacher in the DORA reports. Using the word "compare" may have led teacher respondents to believe that they were responsible for doing the comparison. This might have led to confusion about whether to select from the lower or higher categories. Additionally, question 50 may have been too confusing or wordy to answer. Further investigation of the scale was warranted based on the fact that the most poorly fitting items in the first two runs of the data were from the same subscale (i.e., Using the Results). This could indicate a number of issues, specifically that a separate subscale exists or that this subscale is sufficient for examining the construct of interest alone. As noted, using the results is a key component of formative assessment, and makes formative assessment formative. Therefore, to test this assumption (i.e., that a separate subscale or measure exists), further iterations of the data were required, removing the most poorly fitting items. Aside from the reasons noted above, more items were removed in an attempt to improve the psychometric properties of the original OFAS, as these were not markedly improved by removing items 46 through 48 in the 53-question OFAS. After removing items 53, 50, and 45 and running the new 50-item OFAS, person and item separation improved (i.e., was larger) compared to the previous runs. The person separation reliability estimate, item reliability, model reliability, and Cronbach's Alpha were similar to the previous runs, showing some improvement. Infit and outfit measures were again less than 2. Overall, the 50-question survey produced the lowest infit and outfit statistics, better separation, and improved reliability compared to the 56- and 53-question OFAS.

Thus, items 45, 46, 47, 48, 50, and 53 were all removed to create the final 50-question OFAS. As mentioned repeatedly, all the questions removed up to this point have been under the same heading in the survey, Using the Results. Although items 46 through 48 are the most similar conceptually and with regards to word choice, all of these items reflect using the reports generated from each DORA subtest to either improve course structure, improve student performance, or make comparisons with other groups (e.g., comparing classroom results with the school district). The results for Research Question 2 stated that perhaps if the above items were removed, then other questions (i.e., 49, 51, and 52) could be removed as well to further examine the theory that this section of items is indeed representing something unique. Although these items were not misfitting, and did not even have the next highest outfit values overall, they go together conceptually with 53 and 50, as does 49 (i.e., Using the Results); within that section, they were the next three most poorly fitting items. For comparison purposes, items 51, 52, and 49 were therefore removed to examine whether model fit improved and whether this section of items does indeed represent something unique. The 47-item OFAS was run, producing red flags immediately. One initial major concern was that this iteration produced an extreme person, which is not ideal. Person and item separation decreased compared to the previous models, along with the person separation reliability. Model reliability was not improved and neither was Cronbach's Alpha; however, infit and outfit measures for all


items were still less than 2. It should be noted though that the outfit was a little higher for the remaining items (i.e., with 52, 51, and 49 removed) compared to leaving them in. Overall, this iteration produced higher infit and outfit statistics, worse person and item separation and reliability. Thus, the decision was made to leave items 52, 51, and 49 in the measure, although they conceptually are grouped with 53 and 50. The final measure was the 50-item survey presented in the second run of the data, which produced the best reliability diagnostics and item properties and fit. All the items regularly appearing as misfitting were conceptually related, as noted above. These items were contained in the section of the survey entitled Using the Results. As outlined in the literature review, many key factors are a part of the formative assessment cycle, with the appropriate use of results and the administration of feedback (i.e., results) as arguably the most important (Shute, 2008). The author notes that research has found that good formative feedback can significantly improve learning processes and outcomes, if used and delivered correctly. This suggests that this section of the OFAS (i.e., Using the Results) is perhaps a subscale that can be used separately or in place of the larger scale to assess teacher use of computerized/online teacher formative assessment. Thus, this section of questions was also run separately using Rasch Analysis to examine the psychometric properties of this abbreviated measure. For this section of 11 questions, an extreme person was produced. It was noted that this extreme teacher responded with 3s to all 11 questions (i.e., Almost Always) and was removed in the next run of the data. Person separation was considerably lower than the previous models using the full OFAS. The lower person separation may be an indication of insufficient breadth in position. Thus, potential revision of the construct 427

(i.e., using formative assessment results) definition may be warranted, which can possibly be remedied by adding items that cover a broader range. Item separation had a larger continuum than for persons, and was greater than the 50-question OFAS. The person separation reliability was lower than all the previous models, namely the 50-question OFAS, but item and model reliability increased (i.e., was higher than the 50-question OFAS). The item mean for the 11-question OFAS was higher than all the previous models, which suggests these items, on average, were easy to agree with. The persons had a higher level of the trait than the items did. However, the higher item mean can be an indication of the items being too easy for the sample, and future studies should determine the breadth of the construct and add more difficult items to the measure. Interestingly, internal consistency dropped in this abbreviated measure, and was the lowest internal consistency value across all the versions of the survey. It should be noted that the value was still deemed acceptable by common practice in the social sciences (i.e., Cronbachs Alpha = .83). Additionally, if the value is too high, as in previous models with the full OFAS, this may suggest a high level of item redundancy. This means that a number of items are asking the same question in slightly different ways. Therefore, the slightly lower internal consistency value was found acceptable for the abbreviated OFAS. Mean infit and outfit for person and item mean squares were in the expected range, and mean standardized infit and outfit statistics indicated that the items overfit slightly, on average. Both persons and items showed little overall misfit, with persons showing a little more misfit. Based on the step logit position, the category values increased as expected. Additionally, there was no misfit for the categories, and the 428

calibrated difficulty of each step increased with category value. The probability curves also demonstrated that all the categories were being utilized, with each category value as the most likely at some point on the continuum. Taken together, despite the extreme person, these are desirable properties for the current measure, and show a little improvement compared to the larger OFAS. Regarding the item misfit diagnostics, point measure correlations for this model were slightly higher compared to the models with the larger OFAS, with correlations between .4 and .7. This is reflective of the content of this subscale, which contained items specific to using the DORA results. Infit measures for all items on this scale were acceptable, and all except for one outfit measure was under 2. Question 46 had the problematic outfit measure. This question asked, In a given quarter/semester, how often do you use DORA results to help all students with their reading performance? Upon further item analysis, it was discovered that this item was very easy, which could account for the high outfit value. This item was removed in the next run of the data to see if model fit improved. The next highest misfitting items were questions 47 and 48, similar to the findings in the larger OFAS models. However, the outfit measures for these items were still lower than the cut-off, so these items were retained in future analysis of this subscale. To confirm the removal of question 46, the monotonically changing average theta per category was reviewed and it was determined that a change in rank occurred in that the average measure logit value did not increase with each response category. The map of persons and items, overall, showed that the scale appeared to be applicable for its purposes. Unfortunately, a few large gaps in between the items were shown on the 429

variable maps. In addition, there were numerous persons whose position was above where items were measuring. This indicates that harder items may need to be added in future studies to extend the range of the trait measured, although Using the Results as a separate trait measuring computerized/online formative assessment practices of teachers needs to be more clearly explored and defined at this juncture. It was determined that some improvements could be made in this subsection of items based on the Rasch diagnostics from the 11-question OFAS. Item 46 was removed, producing a 10-question OFAS, along with one extreme teacher, reducing the analysis sample to 46. The 10-question OFAS had better convergence than the 11-question OFAS, and no extreme persons were produced in this run of the data. The person and item separation and overall model reliability was larger than the 11-item survey; however, the person separation reliability and item reliability was approximately the same as the 11question OFAS. Additionally, in this 10-question survey, the item separation and overall model reliability was higher than the 50-question OFAS. Item means were slightly lower than the 11-question OFAS, but still considerably elevated compared to the 50-question OFAS. Internal consistency was slightly lower than the 11-question OFAS, and was the lowest internal consistency value across all the versions of the survey. Although it was lower than the other models, the widely-accepted social science cut-off is .70 or higher for a set of items to be considered an internally consistent scale. Thus, the internal consistency of the 10-question OFAS was not considered problematic (i.e., Cronbachs Alpha = .81). The mean infit and outfit for person and item mean squares, the mean standardized infit and outfit, and the standard deviation of the standardized infit as an 430

index of overall misfit for persons and items were all deemed acceptable and an improvement from the 11-question OFAS. The step logit position and step calibration increased by category value as they should, and there was no misfit for the categories. Finally, the probability curves were also behaving as expected in that there were no category inversions where a higher category is more likely at a lower point than a lower category. All these characteristics were nearly the same as the 11-question OFAS, but improved drastically compared to the 50-question OFAS. Point measure correlations for this model were slightly higher, as all items on the scale had point measure correlations between .4 and .7. These were the same point measure correlations observed in the 11-question OFAS. Again, this is reflective of the content of this subscale, which contained items specific to using the DORA results. All item infit and outfit measures on this scale were acceptable, which was in contrast to the results of the 11-question OFAS, where item 46 had an outfit measure of greater than 2. Thus, all items in this 10-question scale appeared to fit. The map of persons and items revealed that the scale appeared to be applicable for its purposes. The items were approximately normally distributed, with some items separated from the others on the far ends of the scale. Questions 50 and 53 were at the same logit value on the map, suggesting redundancy. Question 50 asked, "In a given quarter/semester, how often do you compare the results with your other content-related classroom quiz/test/exam results (i.e., quizzes/tests/exams that you have constructed)?" and question 53 asked, "In a given quarter/semester, how often do you compare individual student results with the rest of the class?" Either both of these questions (especially question 50) were worded poorly and confused the respondent, or teachers interpreted the comparison made in both questions as nearly the same.

That is, asking teachers about comparing individual student results with other results, whether with the results of other tests or of other students, will render equivalent or nearly equivalent interpretations and responses. As with the 11-question OFAS, a few large gaps in between the items were shown on the variable maps. In addition, there were some persons whose position was above where the items were measuring, although this layout was better compared to the 11-question OFAS. Overall, the above evidence indicates that more difficult items may need to be added in future studies to extend the range of the trait measured. In general, many of the 10-question OFAS diagnostics were slightly more desirable than those of the 50-question OFAS (e.g., infit and outfit measures). These small improvements in model and item fit, and the general psychometric properties of this 10-question OFAS, warrant further investigation of this abbreviated measure. Both the 50-question OFAS (i.e., the best model from the full measure) and the 10-question OFAS (i.e., the best model from the Using the Results subscale) were examined further in the third research question to determine if either of these measures is related to or predictive of student computerized/online formative assessment score growth. The implications from the Rasch Analysis results will be discussed in the following paragraphs. Implications. Similar to the first research question, the findings from this second research question have implications in a number of contexts and for many groups of people. As noted on multiple occasions, the demand for school systems, individual schools, and teachers to be accountable for student performance has increased considerably in the past two decades. This demand for accountability relates to a direct

measurement of attainment of educational standards and objectives. Specific performance measures of those standards for teachers and students alike that track competency gains are a requirement for most educational systems. Thus, this behavioral frequency measure of teacher use of computerized/online formative assessment was created to attempt to accurately define and measure the construct and assist in further gauging attainment of educational standards and objectives. This second objective examined the psychometric properties of a newly developed measure of teacher use of computerized/online formative assessment program. The purposes for developing this measure included the following: (1) A measure of online formative assessment practices does not currently exist, (2) A quick and portable measure will potentially allow school districts and schools to examine how teachers are using their computerized/online formative assessment programs, diagnose problems, and remedy weaknesses, and (3) The measure will be flexible to use with similar programs like DORA, and will be adaptable to other content areas such as math and science. Specific to the results from this second research question, the psychometric properties, namely the 50-question OFAS, have good reliability and item properties, acceptable for the purported use of the survey as a measure of teacher use of computerized/online formative assessment. Although some refinement of the measure and more research on the construct needs to be done, the results from the initial development of the OFAS support its use in its current state as a way to gauge teacher use of this technology-based mode of formative assessment, which has implications for future research, administrators and schools, and testing companies.


First, for the research community, as mentioned above, a measure of teacher use of online formative assessment practices does not currently exist. To date, no flexible, efficient measure exists with the potential to evaluate teacher use of this mode of formative assessment and diagnose weaknesses in the system. There are existing measures of teacher assessment practices, which emphasize teachers' general assessment knowledge and skills, such as the Assessment Practices Inventory (API; Zhang & Burry-Stock, 1995, 1996) or the Assessment Literacy Inventory (ALI; Mertler & Campbell, 2005). However, these measures are very general and not specific to a technology-based platform, which requires unique questions and language such as "download" and "online." Defining the construct thoroughly and creating a psychometrically sound measure can open doors for future research that wishes to examine and acknowledge the impact of teacher formative assessment practices on student achievement. Additionally, the flexibility of the measure will allow other researchers to make adjustments appropriate for the content area or specific computerized/online formative assessment program that they are investigating. Second, for school districts, schools, administrators, and teachers, a quick and portable measure will potentially allow them to examine how teachers are using their computerized/online formative assessment programs, diagnose problems, and remedy weaknesses. The measure is intended, among many other reasons, to be used as a diagnostic tool for determining teaching staff needs in the school system (e.g., assessment and measurement skill deficits). With this measure, areas of weakness can be identified, highlighting where teachers may need remediation and further training. Schools and school districts will have a more efficient means to diagnose problems by surveying their

teachers to determine if any problems exist in their current use of a computerized/online formative assessment program. The time and money saved by surveying the teachers can render faster feedback, and potentially remedy problems more rapidly and efficiently. Finally, with regards to the technology-based formative assessment companies, future research may deem the measure flexible to use with similar programs like DORA, and adaptable to other content areas such as math and science. This not only opens doors for the research community, but the industry side as well. For example, companies wishing to perform formative or summative program evaluations or needs assessments with their online formative assessment products may find the measure useful to incorporate in their investigations. With further refinement of the construct and measure, companies may utilize the OFAS, or even specific sections of the survey, to gather data to support the continued use of their product. And competing companies may find the OFAS useful as part of their evaluations to deter schools, school districts, or administrators from investing resources into rival products. The 10-question OFAS as a separate measure addressing the key component of formative assessment has some implications as well. The prospect of an abbreviated measure has implications for many of the groups of people and contexts mentioned above. For example, if indeed this measure of Using the Results is all that is needed to examine the construct of teacher use of online formative assessment, this 10-question measure could save researchers, schools, administrators, and online testing companies more time and money by administering a much smaller survey, as opposed to a more daunting 50-question survey. This has implications for those interested in researching this topic as well. If indeed this measure proves to be sufficient for addressing the construct of 435

teacher use of online formative assessment, then efforts should be made to refine the survey in light of the new construct definition parameters (i.e., only focusing on how the results from online formative assessment are used). Limitations and Future Directions. The following paragraphs will detail the methodological and statistical limitations in this second research question, which in turn inform the future directions for this refinement of the OFAS. One major methodological limitation is the construct definition of the OFAS. In construct-based test construction, the first step in the process is defining the construct and domain of interest. This step includes a review of the literature to examine definitions of what you are trying to measure, or thoroughly defining whatever is being examined as completely as possible (Clark & Watson, 1995). The items were created from interviews with individuals familiar with DORA, and also used a loose framework from the standard formative assessment literature as guidance. More careful consideration of the construct of teacher use of computerized/online formative assessment may have prevented some of the issues that developed in running the Rasch Analysis. For example, the existing measures of teacher assessment practices emphasize teachers general assessment knowledge and skills as defined by The American Federation of Teachers (AFT), the National Council on Measurement in Education (NCME), and the National Education Association (NEA). These organizations have combined efforts to address the problems of inadequate training for teachers by developing The Standards for Teacher Competence in the Educational Assessment of Students. The Standards stress competence in: (1) choosing assessment methods, (2) developing assessment methods, (3) administering, scoring, and interpreting assessment 436

results, (4) using assessment results for decision making, (5) grading, (6) communicating assessment results, and (7) recognizing unethical practices (AFT, NCME, & NEA, 1990). The API and the ALI use these seven standards as the definition of their construct and have made efforts in the measure development phase to include items that represent all seven areas. Perhaps future studies and iterations involving the refinement of the OFAS should consider examining if the same underlying structure for measures of general assessment knowledge (i.e., see above) are the same or approximately the same as the construct of computerized/online formative assessment. In the measure development of the OFAS, a review of the literature took place, including interviews, and subsequent attempts to define the construct based on this information. However, the construct of computerized/online formative assessment is newer compared to a long-standing construct such as standard formative assessment that has been repeatedly defined in research and the literature (i.e., Black and Wiliams five key components of formative assessment; Black & Wiliam, 1998; Clarke, 2001). Thus, the definition of the construct may still be forming, as modes of technology-based assessment become more accessible and commonplace in the classroom. Perhaps the definition of the construct will change between present day and the day when nearly every classroom has some form of technology-based assessment in place. More familiarity with this mode of formative assessment may redefine the construct. Although LGL employees were interviewed, future research should consider developing new items for the measure based on information from individuals who have used a technologybased formative assessment program for years, as well as allowing more research to come to fruition and provide insight into the construct. 437

A more thorough definition of the construct of teacher use of computerized/online formative assessment may have also aided in closing some of the gaps in the item and person maps and/or spreading some of the redundant items out. As mentioned previously, the items covered a smaller range in difficulty than the persons range. This indicated that easier and harder items may need to be added in future studies to extend the range of the trait measured. Also, at a number of points on the scale there were multiple items at the same position indicating redundancy. The removal of redundant items with careful consideration that the entire construct is being measured can remedy this. Finally, there were also some gaps that need to be addressed. In future studies, a more complete definition of the construct should be used to create items that mark or fill in that level of the trait (i.e., all/most of the items should define unique components of the construct). This refers to a revision of the current 50-question and 10-question OFAS focusing on construct validity, and with regards to item writing, content validity. Related to construct and content validity are other issues such as order effects, item phrasing, survey formatting, or other contextual variables. In the current survey, the items in some groups (e.g., Accessing Subscale/Subtest Results, Informing Instruction with Subscale/Subtest Results) may have contributed to some invariance in their phrasing and the formatting of subsections of questions. Future studies should examine the Rasch diagnostics when the items are not phrased as similarly or structured rigidly under headings in the survey. This is related to the effects of item order on responses, which may produce dependencies in the data that lead to invariance. A response to one question should not be dependent upon the context set by the prior question. Thus, the two items would not function independently, which is a violation of an assumption of the 438

Rasch model. One possible solution to the effects of item ordering would be to combine items into "testlets" and treat each set of questions as an item. Future research should more carefully consider the order of items on the OFAS, and consider examining a random or different order for comparison purposes. A more thorough understanding (and definition) of the construct may have created a better-fitting model, as the Chi-Square model fit statistics were all significant. In addition, a more specific definition of the construct and more carefully constructed items reflecting this construct may have also remedied the category inversions noted in the table describing the monotonically changing average theta per category. Also, a more thorough understanding and definition of the construct may have guided the researcher in the development phase of the survey to focus solely on writing items about using the results, which was noted to represent either a separate subscale or a unique measure of teacher use of online formative assessment. Overall, there are obvious limitations with regards to construct and content validity that future refinement of the OFAS and more research will resolve. Another limitation is the sample size used for the Rasch Analysis of the OFAS. As stated previously, approximately 50 to 100 teachers are needed if item calibrations are to be stable within ±1 logit (i.e., 99% CI, 50 people) or ±1/2 logit (i.e., 95% CI, 100 people; Linacre, 1994). Forty-seven teachers were in the final analysis sample, which is just under the ideal number needed to have the item calibrations stable within ±1 logit. Another more flexible standard used to gauge sample size for Rasch Analysis is the response structure of the items on the scale (i.e., Likert). Research has shown that at least 10 observations per category are necessary for sufficient person and item measure estimate

stability when developing a measure. Using this criterion appeared to accommodate the small sample size used in this study of the OFAS. Future studies not restricted by time and money logistics should consider obtaining a larger sample size, representative of teachers across all states and types of school districts (i.e., a good development sample). If a larger sample size is obtained, then Exploratory Factor Analysis (EFA) can become a more viable option for future research involving this newer construct. EFA could have been considered in the current study after observing the pattern of the highest misfitting items all categorized under the same heading. This may be considered a violation of unidimensionality, one of the requirements of the Rasch model. EFA can be used to examine the underlying structure of the measure. EFA will allow for the investigation of theoretical constructs, or factors, which might be represented by a set of items (Tabachnick & Fidell, 2001). EFA is used when researchers have no predetermined hypotheses or prior theory about the nature of the underlying factor structure of their measure as in the current study. It is an inductive approach using factor loadings to uncover the factor structure of the data. Since EFA is exploratory by nature, no inferential statistical processes are used. EFA is not appropriate to use for testing hypotheses or theories, but only to clarify and describe relationships (Tabachnick & Fidell, 2001). As mentioned previously, sample size is an issue that makes EFA almost impossible at this stage in the development of the OFAS. Costello and Osborne (2005) caution researchers that EFA is a large-sample procedure. Generalizable or replicable results are unlikely if the sample is too small. Larger samples will produce correct factor structures and reduce misclassification of items compared to smaller samples. Tabachnick and Fidell (2001) state that it is comforting to have at least 300 cases for 440

factor analysis (p. 588). This large sample size is desirable to have reliably estimated correlations, but was not obtained in the current study due to practical considerations such as time and money. Additionally, running EFA on the OFAS in its current state may be premature in that the construct and content validity issues detailed above may first need to be remedied to have the correct factor structure of the properly defined construct and to reduce misclassification of items. Other limitations in the current research question include the logistics of survey research such as using global ratings, self-report, and the obtrusiveness of the measure itself. First, global ratings refers to the overall impression or summary statement of the construct of interest (Kazdin, 2003). Although they provide a very flexible assessment format and convenient format for soliciting judgments, global ratings are wrought with problems. Global ratings have a tendency to be too general and lack sensitivity to tap into the construct of interest. Defining the points on the Likert scale very clearly (e.g., Rarely was equivalent to 1 time per quarter/semester) was an attempt to alleviate this problem, but this does not remove the initial evaluation on the part of the respondent when he/she reads the word Never, Rarely, Sometimes, or Almost Always. Future studies should examine the use of different response scales for comparison purposes, and determine which response scale or structure is the best for the population of interest. Self-report measures, or surveys, measures, or scales that require individuals to report on aspects of their own personality, emotions, cognitions, or behaviors, are also problematic (Kazdin, 2003). Although there are practical benefits to using self-report measures, this mode of assessment is subject to social desirability, with the OFAS being no different. The possibility of bias and distortion on the part of the subjects in light of 441

their own motives, self-interest, or to look good is elevated with this type of measure. This is also related to the obtrusiveness of the measure. It is very apparent what construct or trait the OFAS is measuring, and it is even more obvious to teachers that responding that they execute these behaviors sometimes or almost always is desirable. Although the directions indicated that these results would not be shared with anyone specifically, and that they would be reported collectively, teachers may still have been worried about the perception of others and responded more favorably. Future studies should consider deviating from this method of assessment and perhaps use direct observation or other measures. The Rasch model is termed a "strong" model since its assumptions are more difficult to meet than those of Classical Test Theory (CTT). However, when data do not adequately fit the model, the instrument construction process must begin anew. An overall failure could occur if items are poorly constructed or are not comprehensible to the sample, or if there is a mismatch between the respondent group's abilities and item difficulties. Although the OFAS in its current state does not appear to be an overall failure with regard to the points above, the instrument should be refined based on the knowledge gained from the multiple runs of the data and the finding that a separate or unique, abbreviated measure may exist (i.e., the 10-question OFAS). Future research should focus on clearly defining the construct, validating the scores on the revised measure, and eventually demonstrate the potential flexibility of the measure with other content areas such as math or science and other technology-based formative assessment programs.
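To make the measurement model referenced throughout this discussion concrete, a minimal sketch of the rating scale form of the Rasch model that the category and step diagnostics above presuppose is given here. The notation is generic and chosen only for illustration (it is not taken from the OFAS documentation): $\theta_n$ is a teacher's position on the trait, $\delta_i$ is the difficulty of item $i$, and $\tau_j$ is the threshold (step) between adjacent categories, with the four OFAS categories scored $k = 0$ (Never) through $m = 3$ (Almost Always).

\[
P\left(X_{ni} = k\right) \;=\; \frac{\exp\!\left(\sum_{j=0}^{k}\left(\theta_n - \delta_i - \tau_j\right)\right)}{\sum_{h=0}^{m}\exp\!\left(\sum_{j=0}^{h}\left(\theta_n - \delta_i - \tau_j\right)\right)}, \qquad \tau_0 \equiv 0 .
\]

Under this parameterization, well-functioning categories require the thresholds $\tau_j$ to advance monotonically along the latent continuum, which is what the ordered step calibrations and the absence of category inversions reported above reflect; infit and outfit statistics compare observed responses with the expectations implied by this probability.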


Discussion of Research Question 3 Results

As with the first research question, the results for this third research question will be discussed in two sections: Descriptive and Inferential. The first section will contain a discussion of the analysis sample (e.g., demographic information and DORA descriptive statistics), which will be followed by a discussion of the substantive results specific to the hypothesis and research question of interest.

Descriptive Discussion

Demographic Information. The data used to address this third research question were existing data from two sources: the CDE and LGL. The CDE provided demographic information for the teachers and students from the Highland School District for the 2009/2010 academic year, and LGL supplied the online formative assessment scores for the same district and academic year. The student district demographics were outlined in the discussion of the first research question for comparison purposes with the final analysis sample. The intent was to make inferences back to the population, which in this study was rural districts containing a demographic composition similar to the Highland School District (i.e., teachers and students). This information can also provide the necessary resources for other researchers and administrators to make comparisons with their target population or research sample to determine if the results from the current study might be applicable in their contexts. First, the teacher population and final analysis sample will be summarized, followed by the student population and final analysis sample. For the population of reading teachers in the Highland School District, 22 individuals were either general reading teachers, reading specialists, special education teachers, ESL teachers, or ELL teachers, with the vast majority being general reading teachers.

Of the 22 teachers, 19 individuals completed the survey. Approximately 1/4 of the 19 teachers were female, and all but one identified as White (Non-Hispanic). An average of nearly 36 years of teaching experience was cited, with roughly 8 years in the current district. The majority were elementary grade level reading teachers, and the bulk of teachers stated having Master's Degrees. Compared to the district population above, the final teacher analysis sample was comparable. Only 11 of the 18 teachers were used in the analysis because only grades 3 through 8 were analyzed due to the reasons specified in the results section. As with the population, the final analysis sample was composed of 1/4 female teachers, and nearly all were White (Non-Hispanic). An average of approximately 32 years of teaching experience was cited, with roughly 5 years in the current district. The majority of teachers were elementary general reading teachers, unsurprisingly, with most only having a Bachelor's Degree. This was the only inconsistency between the district reading teacher population and the final analysis sample. Overall, compared to the population demographics, the final analysis sample appeared similar for the 2009/2010 academic year. This supports generalizing the results to teacher populations of similar demographic structure in comparable districts. The student demographics for the population are the same as in the first research question, with only the current (i.e., 2009/2010) academic year as the focus. As acknowledged previously, around 70% of the district identified as White (Non-Hispanic), and approximately 30% were Hispanic. The district ESL/ELL population was roughly 15%, and the gender composition for the district was nearly an even split. In each school in the district, around 40% to 50% of the students were identified as free/reduced lunch

eligible. The final analysis sample was comparable adding to the external validity of the results. There was a nearly even split for gender and free/reduced lunch. Almost 70% categorized themselves as White (Non-Hispanic), and near 30% claimed Hispanic status. Finally, around 20% of the students were categorized as ESL/ELL. This adds to the generalizability of the results to populations of similar demographic structure, containing a strong Hispanic population, a nearly equivalent gender split, smaller percentages of ESL/ELL students, and almost half of the students qualifying for free/reduced lunch (i.e., as a measure of SES). Descriptive Discussion DORA Scores. As mentioned in the discussion of Research Question 1, DORA score information for the state (or country) was not provided by LGL for comparison purposes. The results from the final analysis sample included that a consistent positive linear trend across the academic year of interest in this third research question was observed (i.e., 2009/2010), which is appropriate for the proposed analysis of the four main subtests of Word Recognition, Oral Vocabulary, Spelling, and Reading Comprehension. This is desirable for the analysis of this third research question, as these subtests served as the outcome of interest in examining their relationship with the OFAS. Inferential Discussion. The goal in this third research question was to examine if teacher OFAS scores are related to student DORA scores (i.e., growth) controlling for various demographic variables. A Three-Level Hierarchical Linear Growth Model was used to examine the relationship between student DORA scores across grades 3 through 8 from the current academic year of 2009/2010 and teacher OFAS scores for the same academic year. The hypothesis was that OFAS scores will be a significant and positive 445

predictor of DORA score growth. Additionally, with regards to the demographic controls, if found significant, it was expected that Female, White, non-ESL/ELL, and non-Free/Reduced Lunch status students would outperform their Male, Minority, ESL/ELL, and Free/Reduced Lunch counterparts. The multilevel growth models will be discussed below, separated first by DORA subtest, and then summarized collectively. Only the final models will be discussed for the 50-question and 10-question OFAS, as these models address the substantive research question of interest. In the Full Model with the 50-question OFAS as a predictor at Level 3, the initial Word Recognition DORA scores across students and teachers were significantly greater than zero. There was no impact of any of the demographic covariates at initial status. Additionally, teacher OFAS scores did not have an influence on initial status, meaning that initial Word Recognition scores were similar across students regardless of teacher OFAS score. There was a significant and positive effect of Time on Word Recognition scores across students and teachers. That is, every three to four months in the current academic year, student Word Recognition scores are expected to grow .12 points. Thus, in a given academic year, a student's Word Recognition score could grow anywhere between .36 and .48 points. There were no differences between the demographic covariates of interest for growth rates, and specific to the research question of interest, teacher 50-question OFAS scores had no effect on the rate of change, suggesting that Word Recognition growth was similar regardless of teacher OFAS scores. For comparison purposes, the 10-question OFAS was used in the third level to determine if this subscale was a better predictor of student DORA scores than the 50-question OFAS. Similar to the 50-question OFAS model, the initial Word Recognition

DORA scores across students and teachers were significantly greater than zero, and there was not a significant effect of any of the demographic covariates at initial status. Unlike the 50-question OFAS Full Model, the 10-question teacher OFAS score had a significant influence on initial status, suggesting that as teacher OFAS score increases by one point, the average student DORA Word Recognition score increases by .36 points. This does not address the substantive question of interest, specific to the relationship between the OFAS and DORA growth. Remember that initial status in this research question is when time is equivalent to zero, which in these models was the late spring of 2009 before the students were with their current academic year reading teacher. Thus, this result simply reflects that teachers with higher OFAS scores had students at initial status (i.e., students who came into that teachers class the following fall) with higher DORA (pretest) scores. Similar to the 50-question OFAS, there was a significant and positive effect of Time on Word Recognition scores across students and teachers. This means that every three to four months across the academic year, a students predicted Word Recognition growth rate is .12, which is the same result as the 50-question OFAS Full Model. The demographic controls did not have a significant effect on the rate of change, which was the same finding for the 50-question OFAS. Finally, teacher 10-question OFAS scores did not have a statistically significant effect on the rate of change, suggesting that Word Recognition rate of change was similar regardless of teacher OFAS scores. This, again, was the same result with the 50-question OFAS Full Model. Overall, a significant relationship was found between teacher OFAS scores and the DORA subtest outcome at initial status in the 10-question OFAS Full Model, but not in the 50-question OFAS Full Model. 447
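For reference, a simplified sketch of the three-level growth specification that produced these coefficients is shown below; the symbols are illustrative rather than reproduced from the fitted HLM output, and the Level 2 demographic covariates (Sex, Minority status, ESL/ELL, and Free/Reduced Lunch) are collapsed into a single term for readability.

\[
\begin{aligned}
\text{Level 1 (occasions):}\quad & Y_{tij} = \pi_{0ij} + \pi_{1ij}\,\mathrm{Time}_{tij} + e_{tij} \\
\text{Level 2 (students):}\quad & \pi_{0ij} = \beta_{00j} + \beta_{01j}\,\mathrm{Dem}_{ij} + r_{0ij}, \qquad \pi_{1ij} = \beta_{10j} + \beta_{11j}\,\mathrm{Dem}_{ij} + r_{1ij} \\
\text{Level 3 (teachers):}\quad & \beta_{00j} = \gamma_{000} + \gamma_{001}\,\mathrm{OFAS}_{j} + u_{00j}, \qquad \beta_{10j} = \gamma_{100} + \gamma_{101}\,\mathrm{OFAS}_{j} + u_{10j}
\end{aligned}
\]

In this illustrative notation, the Word Recognition results above correspond to a growth coefficient of roughly $\gamma_{100} = .12$ per three-to-four-month testing interval and a nonsignificant $\gamma_{101}$, so with three to four such intervals in an academic year the expected annual gain is approximately $3 \times .12 = .36$ to $4 \times .12 = .48$ points, as stated.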

The next subtest examined was Oral Vocabulary with the 50-question OFAS. This Full Model contained Time in Level 1, all the demographic variables in Level 2, and the 50-question OFAS in Level 3. As with Word Recognition, the mean Oral Vocabulary scores across students and teachers were still positive and significantly different from zero. Unlike Word Recognition, two of the demographic covariates were significant at initial status: Sex and Free/Reduced Lunch status. Interestingly, Males outperformed Females in this model by .53 Oral Vocabulary points, but unsurprisingly, non-Free/Reduced Lunch status students (i.e., students of higher SES) outperformed students enrolled in the program (i.e., lower SES students) on the Oral Vocabulary subtest by .73 points on average. The Sex finding is surprising, as 2000 NAEP data found that girls score higher than boys in reading, and a higher percentage of girls achieve reading proficiency levels in school (NCES, 2008). This finding was consistent across all DORA subtests as outcomes in this third research question, and will be examined in the discussion at the end of this section. Finally, similar to Word Recognition, teacher 50-question OFAS scores did not have a significant influence on initial status, meaning that initial Oral Vocabulary scores were similar across students regardless of teacher OFAS scores. There was still a significant effect of Time on Oral Vocabulary scores across students and teachers, with a predicted Oral Vocabulary growth rate of .07 every three to four months. Thus, in a given academic year, a student's Oral Vocabulary score could grow anywhere between .21 and .28 points. There were no differences between the demographic covariates of interest for growth rates, and teacher 50-question OFAS scores had no effect on the rate of change, suggesting that Oral Vocabulary growth was

similar regardless of teacher OFAS scores. All these findings were similar to those observed in the 50-question OFAS model with Word Recognition. For comparisons purposes, the 10-question OFAS was used in the third level to determine if this subscale was a predictor of student DORA Oral Vocabulary scores compared to the 50-question OFAS. Similar to the 50-question OFAS model, the initial Oral Vocabulary DORA scores across students and teachers were significantly greater than zero; however, unlike the 50-question OFAS where only Sex and Free/Reduced Lunch were significant, all the demographic controls had a significant effect on initial status. Thus, at initial status, Males outperformed Females on the Oral Vocabulary subtest by .53 points on average; Whites surpassed Minorities by .49 points on average; Students of higher SES outperformed students in the lower SES category by .75 points on average; and non-ESL/ELL students surpassed ESL/ELL students by .72 points on average. The only unexpected finding from above is that Males outperformed Females. Finally, unlike the 50-question OFAS Full Model, 10-question teacher OFAS scores had a significant influence on initial status, suggesting that as teacher OFAS scores increase by one point, average student Oral Vocabulary scores increase by .16 points. Similar to the 50-question OFAS, there was a significant and positive effect of Time on Oral Vocabulary scores across students and teachers. This means that every three to four months across the academic year, a students Oral Vocabulary growth rate is .07, which is the same result as the 50-question OFAS Full Model. The demographic controls did not have a significant effect on the rate of change, which was the same finding for the 50-question OFAS. Finally, teacher 10-question OFAS scores had no statistically significant effect on the rate of change, suggesting that Oral Vocabulary rate 449

of change was similar regardless of teacher OFAS scores. This, again, was the same result with the 50-question OFAS Full Model. This positive relationship demonstrated between the OFAS and Oral Vocabulary scores does not address the research question of interest focusing on OFAS scores being significantly related to DORA growth. This will be discussed in more detail in future sections below. Following the discussion of results for Oral Vocabulary is Spelling with the 50question OFAS Full Model. As with all the subtests, mean Spelling scores across students and teachers were positive and significantly different from zero. Unlike Oral Vocabulary, none of the demographic covariates were significant at initial status. Finally, similar to the 50-question OFAS models for Word Recognition and Oral Vocabulary, teacher OFAS scores did not have a significant influence on initial status, meaning that initial Spelling scores were similar across students regardless of teacher OFAS scores. With regards to the slope, there was not a significant difference in the effect of Time on Spelling scores across students and teachers, and none of the demographic covariates had a significant effect on the rate of change either. Teacher OFAS scores did not have a significant effect on the rate of change, suggesting that Spelling growth was similar regardless of teacher OFAS scores. For comparison purposes, the 10-question OFAS was used in the third level of the Full Model with Spelling as the outcome. Similar to the 50-question OFAS model, mean Spelling scores across students and teachers were again positive and significantly different from zero, and the demographic covariates did not have a significant impact on initial status. Compared to the 50-question OFAS Full Model, 10-question teacher OFAS scores had a statistically significant impact on initial status, suggesting that there are 450

differences in initial Spelling scores across students depending on teacher OFAS scores. This means that as teacher OFAS scores increase by one point, average student DORA Spelling subtest scores increase by .15 points. Again, this result does not address the current focus of this research question, which was examining the relationship between teacher OFAS scores and DORA growth (i.e., rate of change or slope). As with the 50-question OFAS model, there was not a significant difference in the rate of change of Spelling scores, and the demographic controls did not have a significant effect on the growth rate either. Finally, teacher OFAS scores had no significant effect on Spellings rate of change. Aside from the significant finding that teacher 10-question OFAS score was related to student Spelling scores at initial status, the 10-question OFAS Full Model results were nearly the same as the 50-question OFAS Full Model results. The final model that will be summarized is for the DORA subtest Reading Comprehension as the outcome and the 50-question OFAS at Level 3. As with all the subtests for all the models, mean Reading Comprehension scores across students and teachers were still positive and significantly different from zero. Three of the four demographic covariates were significant in the model. Males outperformed Females at initial status by .9 points; Whites surpassed Minorities by 1.14 points on average; and non-Free/Reduced Lunch status students surpassed their Free/Reduced Lunch counterparts by 1.01 points. ESL/ELL was not significant in the model. Additionally, teacher OFAS scores did not have a significant influence on initial status, suggesting that initial Reading Comprehension scores were similar across students regardless of teacher OFAS scores. There was not a significant difference in the effect of Time on Reading Comprehension scores across students and teachers, and all the demographic controls did 451

not have a significant effect on the rate of Reading Comprehension growth as well. Finally, teacher OFAS scores had no significant impact on Reading Comprehension growth. To compare, the 10-question OFAS was also examined in the Reading Comprehension model. As with all the models, the mean Reading Comprehension scores across students and teachers were again positive and significantly different from zero. All the same demographic controls as in the 50-question OFAS Reading Comprehension model had a significant effect on the intercepts by nearly equivalent margins and in the same direction. Non-ESL/ELL students and ESL/ELL students performed statistically equally on the Reading Comprehension subtest on average. Compared to the 50-question OFAS Full Model, 10-question teacher OFAS scores were a significant influence on initial status, suggesting that as teacher OFAS scores increase by one point, the average student Reading Comprehension scores increase by .33 points. This does not, however, address the substantive research question of interest focusing on the relationship between teacher OFAS scores and student DORA growth. There was not a significant difference in the effect of Time on Reading Comprehension scores meaning that the predicted Reading Comprehension growth rate every three to four months was the same across students and teachers. Finally, the demographic controls and teacher OFAS scores did not have a significant effect on the rate of Reading Comprehension growth. Overall, the significant relationship found between teacher OFAS scores and each DORA subtest outcome at initial status in the 10question OFAS Full Model compared to the non-significant finding in the 50-question OFAS Full Model was a consistent finding. However, it was disappointing to not see a 452

significant relationship between teacher OFAS scores and student DORA growth, which is the focus of this current research question. That table below is a comparison of all the models discussed above (see Table 95 below). The table contains all the Full Models with each DORA subtest outcome and both the 50-question and 10-question OFAS as predictors at Level 3. The Xs denote a statistically significant finding, with the size of the significant effect (i.e., the coefficient) in parentheses. As shown below, across all the DORA subtests as outcomes for both the 50-question and 10-question OFAS, the mean DORA scores across students and teachers were all significantly different from zero at initial status, which is expected. Initial status is when Time is equivalent to zero, or the beginning of the current academic year (i.e., the beginning of the 2009/2010 academic year). The demographic covariates at initial status were significant in both the 50-question and 10-question Full Models for Oral Vocabulary and Reading Comprehension. For Oral Vocabulary, this included just Sex and Free/Reduced Lunch status for the 50-question OFAS Full Model, and all the demographics for the 10-question OFAS Full Model. For Reading Comprehension, Sex, Ethnicity, and Free/Reduced Lunch status were significant in both the 50-question and 10-question OFAS Full Models. Overall, this demonstrated how initial DORA subtest scores vary based on specific demographic orientations. Moving vertically down the table, for all DORA subtests, the 50-question OFAS was not a significant predictor of initial DORA status, suggesting that there were no differences in initial DORA scores depending on teacher 50-question OFAS scores. However, in the 10-question OFAS models for all DORA subtests, the OFAS was a statistically significant and positive influence on initial status, suggesting that there were 453

differences in initial DORA scores across students depending on teacher OFAS scores. The strongest relationship was between the 10-question teacher OFAS and Word Recognition subtest. For every one unit increase in teacher OFAS scores, the average student Word Recognition scores increase by .36 points. The next highest was Reading Comprehension with a .33 point increase, following by Oral Vocabulary and Spelling with .16 and .15 points increase, respectively. Unfortunately, the above finding is difficult to explain, as at initial status, the reading teachers were not specifically associated with their current academic years class. Recall that the first time point measured in this research question was in the late spring of 2009, which served as a baseline time point. Teachers were not associated with their current class of students until the second time point in the fall of 2009. Thus, since A (i.e., spring 2009 DORA scores) preceded B (i.e., 2009/2010 teacher OFAS scores obtained in the winter of 2010), it is impossible to claim that B caused or had an influence on A. One obvious explanation for why a significant and positive relationship between student DORA scores and teacher OFAS scores was observed at the intercepts was the positive correlation between teacher (i.e., student) grade level and teacher OFAS scores. For example, it was noted that the grade level of the teacher was not controlled. There were three teachers in third grade, two teachers in fourth grade, and three teachers in fifth grade (i.e., and one each in sixth, seventh, and eighth grade) in the analysis of the current research question. Post hoc analyses discovered a significant and positive correlation between the grade level of the teacher and teacher 10-question OFAS score, but not teacher 50-question OFAS score (r = .69, p < .05). This indicates that the higher the 454

student grade level, the higher the teacher OFAS score. Thus, the positive relationship between initial student DORA scores and teacher 10-question OFAS scores my have merely been driven by the fact students in higher grade levels have students with higher DORA scores, and consequently, teachers with a higher use of the DORA results. In future studies, upon replication with a reconstructed OFAS and larger districts with more teachers, teacher grade level (i.e., student grade level) should be added as a covariate in the model. The next group of significant findings in the table comparing all the models was the impact of Time on DORA subtest growth. For only Word Recognition and Oral Vocabulary, there was a significant effect of Time on DORA subtest scores for both the 50-question and 10-question OFAS models. Word Recognition had the higher rate with the predicted rate of change every three to four months at .12 for both OFAS models, and Oral Vocabulary had a slightly smaller growth rate at .07 for both OFAS models. Although this is not of particular interest in the current research question, it is remarkable that there was a significant effect of Time on DORA growth in these models due to the small amount (minimum number) of time points used to estimate growth (i.e., only three time points were possible across the current academic year). Additionally, it is disappointing that a significant and positive relationship between Time and DORA subtest growth was not observed for Spelling and Word Recognition.

455

Table 95 Comparing the 50-Question OFAS and 10-Question OFAS for all DORA Outcomes for Research Question 3 DORA Subtest Word Recognition 50a. Xc. X (.12) 10b. X X (.36)d. X (.12) Oral Vocabulary 50 X X X (.07) 10 X X X (.16) X (.07) Spelling Reading Comprehension 50 X X 10 X X X (.33) -

OFAS Initial Status Intercept Demographics OFAS Growth Rate Intercept Demographics OFAS

50 X -

10 X X (.15) -

Note. The relationship between the OFAS and DORA subtests are present at initial status only for the 10-Question OFAS Full Model.
a.

The 50-Question OFAS Full Model. The 10-Question OFAS Full Model. c. X indicates if the effect was statistically significant. d. The numbers in parentheses are the coefficients for the effect (i.e., the size of the effect).
b.

As noted above, for Oral Vocabulary and Reading Comprehension, some or all of the demographic controls were significant in the model at initial status. Specifically, Free/Reduced Lunch status was found to be significant for both the 50-question and 10question OFAS models for Oral Vocabulary and Reading Comprehension, and in the 456

hypothesized direction. It has been well-documents that Free/Reduced Lunch status children enter Kindergarten with math and reading skills substantially lower than their middle-class or higher counterparts (Kurki, Boyle, & Aladjem, 2005; Merola, 2005). Additionally, Ethnicity in both the 50-question and 10-question OFAS models for Reading Comprehension was significant and in the hypothesized direction. Previous research has documented that there are achievement gaps among students from different ethnic groups, namely Whites versus minorities, in reading (Ferguson, 2002; Ferguson, Clark, & Stewart, 2002; Harman, Bingham, & Food, 2002). Sex was a significant demographic covariate in both the 50-question and 10question OFAS models for both Oral Vocabulary and Reading Comprehension, but not in the hypothesized direction. That is, Females were hypothesized to outperform Males on all reading measures based on previous research; however, in the current research question, Males surpassed Females. As noted in the discussion of Research Question 1, research has repeatedly shown that girls score higher than boys in reading, and a higher percentage of girls achieve reading proficiency levels in school (NCES, 2008). Although not the focus of the current research question, this finding is interesting to note. Perhaps there is an interaction of the computer-based program and gender. Research has demonstrated that Males tend to participate in more informal computing experiences than Females, and feel more comfortable with technology (Campbell, 2000). Additionally, studies have indicated that boys attitudes towards computers are generally more positive than those of girls (Clariana & Schultz, 1993; Levine & Gordon, 1989). Therefore, perhaps the reason Males are outperforming Females is that DORA is a computer-based mode of formative assessment, which might favor the familiarity and attitudes of Males. 457

Of particular interest in the current research question is how the slopes (i.e., growth rates or rate of change) vary based on teacher OFAS scores. Studies have shown that teachers who engage in more frequent and quality formative assessment practices have higher learning gains in their students (Elawar & Corno, 1985; Fuchs et al., 1991; Tenenbaum & Goldring, 1989). Therefore, theoretically, it would be reasonable to expect that teacher OFAS scores should have some relationship with DORA subtest growth rates in that teachers with higher scores on the OFAS would render faster DORA growth rates. The small number of time points used to measure growth in this research question may have been problematic in addressing this relationship, which will be discussed below as a limitation and future direction. There are other hypotheses addressing why a significant relationship was not found between teacher OFAS scores and DORA growth. Examining the tau (beta) as correlations in each DORA subtest Unconditional Model (i.e., with just Time as a predictor), which is the average teacher (i.e., class) initial status correlated with average teacher (i.e., class) growth, the correlation between average initial status (i.e., pretest DORA score in the late spring of 2009) and average growth rate was different for each DORA subtest. For example, in Word Recognition, the correlation was -.90, which indicates that students who started off having lower DORA Word Recognition scores had the fastest growth. For Oral Vocabulary, the correlation was almost zero, but negative (r = -.07); for Spelling the correlation was very high and positive (r = .99); and for Reading Comprehension the correlation was again positive and moderate (r = .49). Thus, based on the fact that some relationships were negative and some were positive (i.e., not in the

458

same direction nor nearly equivalent), this could have hindered the detection of the effect of teacher OFAS scores on student DORA growth. There was no random assignment to classrooms (i.e., teachers), and based on the above discrepant correlations, the groups may not have been equivalent at the beginning of the study (i.e., some teachers ended up having more children in their classrooms with higher initial DORA scores, and some teachers ended up having more children in their classrooms with lower initial DORA scores). For example, with regards to the positive relationship as indicated above, students with higher Spelling and Reading Comprehension scores at initial status demonstrated faster growth. If this is the case, then perhaps this is an indication that these students did not grow as much or have much more room to grow (i.e., a ceiling effect). Conversely, the negative relationships revealed that students with lower DORA scores at initial status displayed faster growth, meaning that these students for these subtests had more of a chance to demonstrate growth across the academic year. Thus, the correlations are an indication of a potential uncontrolled confound in that at onset, the groups were not equivalent in this study across the classrooms. Future studies should consider an experimental design where this type of effect can be managed, use a computerized/online software program that does not have a low ceiling, as demonstrated with DORA (i.e., so the detection of growth is possible), or only select student data to analyze that have initial lower DORA scores. As mentioned before, the goal in this third research question was to examine if teacher OFAS scores were related to student DORA scores controlling for various demographic variables. The hypothesis was that OFAS scores will be a significant and 459

positive predictor of DORA scores (i.e., growth). Thus, the substantive relationship of interest in addressing this research question is between teacher OFAS scores and student DORA subtest scores. The hypothesis was not supported for both the 50-question and 10question OFAS Full Models in that teacher OFAS scores were not a significant, positive predictor of student DORA subtest score growth, which is also an indication of the validity and utility of the scores on OFAS. Future research should attempt to refine the OFAS and continue the validation process, specifically for the 10-question OFAS, as this abbreviated measure appears to be promising in measuring the construct of teacher use of computerized/online formative assessment. The findings in this research question provide no support to the second assertion mentioned at the beginning of this chapter that teachers who use DORA more frequently are able to diagnose student reading learning barriers with specificity, and use that feedback to improve their students reading scores. That is, there is no validation evidence at this point for the OFAS scores. It is important to continue the process of validation and to attempt to describe the multilevel influence of teacher use of computerized/online formative assessment on student reading scores. This hypothesized relationship can not only support the continued use of programs such as DORA, but also provide the necessary evidence to school districts and administrators to ensure that their teachers are engaging in frequent in quality computerized/online formative assessment practices as this may have an impact on their students scores. The implications of the findings from this third research question will be outlined in the following paragraphs, along with a discussion of why continuing the investigation of the multilevel influence of

460

teacher computerized/online formative assessment practice on student achievement is important. Implications. As with the discussion of the first research question, the findings from this third research question have implications on a number of fronts. As noted on multiple occasions, studies have shown that teacher assessment practices are significantly and positively related to student classroom performance (Rodriguez, 2004). Teachers who engage in more frequent and quality formative assessment practices have higher learning gains in their students (Elawar & Corno, 1985; Fuchs et al., 1991; Tenenbaum & Goldring, 1989). This third research question attempted to contribute to this growing literature base. The relationship was examined between the scores on a newly developed measure of teacher use of computerized/online formative assessment program and student computerized/online formative assessment score growth. Overall, the relationship was not demonstrated. It is important to continue the process of measure revision and score validation as this can provide teachers and administrators with some evidence to warrant the use of this measure as an efficient means to gauge teacher and student progress towards meeting state educational standards. Although there were some methodological and statistical issues that prevented the validation of the scores on the OFAS, based on the results from Research Question 2, the psychometric properties of the 10-question OFAS make this abbreviated measure worth investigating further (i.e., and the construct of computerized/online formative assessment in general). Perhaps the 10-question OFAS provides the proper amount of focus on the construct of interest compared to the 50-question OFAS, which was observed to have no relationship with student DORA scores. This lack of a relationship between the 50461

question OFAS and student DORA scores is evidence that this longer measure may contain some construct irrelevant variance (Messick, 1989), and perhaps focusing on how the results are used would be more beneficial to adequately measuring the construct of interest. It is important to continue the measure revision and validation attempts in future studies as administrators and teachers may need this research to support their increased use of technology-based formative assessment in the classroom as a means to increase student learning and achievement. Teachers should have confidence in knowing that their increased efforts in the proper use of this mode of formative assessment, namely how they use the results, will be positively related to their students achievement in a number of reading areas. Administrators will also benefit from garnering support for the use of a psychometrically sound diagnostic tool to not only gauge teacher performance and detect potential weaknesses in the system, but distally estimate their students performance on various computerized/online formative assessment outcome measures. If future studies can validate the scores on the OFAS, this may provide the necessary support to engage teachers in the proper use of this mode of formative assessment, and provide administrators and schools with the evidence to stage interventions or workshops for teachers seeking to use the program most effectively to see the highest learning gains in their students. Doran, Lawrenz, and Helgeson (1994) found that teachers do not receive much training in teacher education programs in terms of how to conduct classroom assessment, formative or otherwise, and little technical help is offered to them in their daily practice. Unfortunately, Yin and colleagues (2008) found, even when provided with quality assessment tools and training to implement them, 462

teachers experiences and prior beliefs seemed to override efforts to change teachers practices to integrate formative assessment. Perhaps this can be remedied by providing teachers with the empirical support acknowledging a positive relationship between teacher use of computerized/online formative assessment and student computerized/online formative assessment who may doubt the impact of a programs effectiveness or their efforts in concert with the program to increase student achievement. The practical advantages of detailing this relationship in future studies include minimizing the amount of diagnostic testing that needs to be done in the course of an academic year to measure progress towards state standards. For example, if an administrator is interested in examining how students may perform during the course of the year on DORA, or other computerized/formative assessment programs, measuring their teachers use and knowledge of the program can give a quicker, and more efficient, general estimate of student performance compared to surveying all the students. This also has implications for the general future and marketability of the OFAS, which may allow administrators to have the necessary support to eventually add in the OFAS as part of the regular DORA assessment process during the academic year to monitor not only how teachers are using the program, but also potentially give an indication of overall student performance by classroom. For administrators, the demand for school systems, individual schools, and teachers to be accountable for student performance has increased considerably over the past two decades. This demand for accountability relates to a direct measurement of attainment of educational standards and objectives. Although the results from this research question did not validate the scores on the OFAS, based on the results from 463

Research Question 2, some support was attained for the continued revision of the OFAS, specifically the 10-question OFAS, which demonstrated good psychometric properties. With regards to eventually using the OFAS as a diagnostic tool, administrators can proactively make recommendations that support positive change in the classroom such as prompting teachers to regularly use the assessment data and specialized feedback to modify and inform instruction. As with the first goal, the focus in this third research question was on younger grade levels (i.e., not college-age populations). The implications for researchers include the expansion of studying computerized/online formative assessment in general to elementary and middle school. As mentioned previously, most previous studies of computerized/online formative assessment have examined only college-age populations in the university setting, usually within one course. Research in the area of computerized/online formative assessment also has implications for methodological advances that include more quantitative knowledge of computerized/online formative assessment and the hierarchical relationship with teacher assessment knowledge. Many studies are just beginning to use sophisticated statistical techniques that include the hierarchical influence of classrooms, teachers, and schools on student formative assessment performance. Finally, due to the novelty of the mode of online or computerized administration, research is lacking in longitudinal data analysis, with no studies examining multiple waves of data. Although the current research question only focused on one academic year (i.e., three data points), these preliminary results can open the door for other researchers to obtain funding to investigate this topic on a larger scale, or provide the necessary 464

background information to determine the logistics and impact of future studies (e.g., sample size considerations, measurement of main variables of interest). This study is timely given the emphasis from NCLB on the professional development of teachers, with the expectation that improvements in professional development will promote positive changes in teaching practices, which will in turn enhance student achievement. Although training and workshops in the use of computerized/online formative assessment may cost money, the benefits to teaching and learning can certainly justify the expense. Thus, it is important to evidence the hypothesized multilevel relationship to have an empirical basis for supporting the use of computerized/formative assessment in the classroom, and ensuring that teachers are using this mode of formative assessment frequently and effectively. Implications of All Research Questions The Validation Argument As mentioned in the introduction and literature review, standard formative assessment practices have been suggested as a means to improve student achievement, with an evaluation of technology-based formative assessment practices not far behind. However, one complication is the lack of empirical evidence bolstered by reasoned arguments to support the claim that improvements in student achievement are associated with this new mode of formative assessment (Nichols, Meyers, & Burling, 2009). According to the Standards, validity begins with an explicit statement of the proposed interpretation of test scores along with a rationale for the relevance of the interpretation to the proposed use (AERA, APA, & NCME, 1999). In accord with this statement, the current study attempted an argument-based approach to validation as a guide for the use

465

and interpretation of student DORA scores and teacher OFAS scores as predictors of student summative test performance. The two kinds of arguments used in the validation process are interpretive arguments and validation arguments (Kane, 2007). Interpretive arguments require producing claims pertaining to test score interpretation, and validation arguments involve a collection of evidence that either supports or refutes each claim. Based on the purpose and goals of this study, propositions were made about the use of DORA scores and OFAS scores. The propositions included that if online formative assessment had a positive effect on student achievement, and a positive relationship existed between teacher online formative assessment practices and student online formative assessment scores, the validity claim was partially substantiated (Shepard, 2009). The studys purpose was to support the assertions (i.e., propositions) listed here: (1) DORA growth is reflective of student growth in reading, which is related to growth on state proficiency tests in reading, and (2) Teachers who use DORA more frequently are able to diagnose student reading learning barriers with specificity, and use that feedback to improve their students reading scores. Taking the findings from all the research questions together, this preliminary study was not able to validate the scores on a newly developed measure of teacher use of computerized/online formative assessment, and future research is necessary to support the second assertion. Full support of this validation framework is not possible at this point due to the fact that teacher OFAS scores were not a significant predictor of student DORA growth over the current academic year. Validation is a long process, and this study is only the first of many attempts to validate the scores on a measure of teacher use of computerized/online formative assessment. 466

Future validation studies adding to this preliminary body of evidence will either strengthen or contest the findings in this initial investigation. Limitations and Future Directions. The following paragraphs will detail the methodological and statistical limitations in this third research question, which in turn inform the future directions for this investigation. Many of the major methodological limitations from the first research question, such as a lack of a control group, apply to the current research question as well. Due to the lack of a control group in this correlational study, causation cannot be implied. For example, the findings in the current research question cannot state that teacher use of computerized/online formative assessment causes increased student DORA scores, only that the OFAS scores and DORA scores are positively related (at initial status). Future research should consider obtaining an adequate control group (e.g., other districts who are not using DORA). Other major methodological limitations from the first research question applicable to the current research question include the use of one school district, which is a threat to external validity. Future studies should include multiple school districts from a range of areas, and involve public and private schools as well. Additionally, reactivity is another external validity concern. The obtrusiveness of the OFAS and its face validity, which is also related to social desirability, may alter a teachers responses from what it would otherwise be (Kazdin, 2003). Future studies may consider implementing other unobtrusive measures in an experimental manipulation, or include questions on the OFAS that act as a construct validity check for social desirability, as mentioned in the discussion of Research Question 2.

467

One of the unique limitations of the current research question is the use of only the current academic years DORA scores and teacher configuration (i.e., 2009/2010). This limited the number of time points used in the HLM analysis to the absolute minimum (i.e., three time points). Additionally, the first time point served as a baseline, which was from the end of the previous academic year (i.e., 2008/2009) after the state test in reading. Thus, the baseline, or time point that served as initial status, was not a DORA score specifically associated with their current academic years primary reading teacher. This has implications for the substantive results, as previously discussed. Using only three time points also has severe consequences for the generalizability of the results, since the primary analysis was a multilevel growth model. It is difficult to fully appreciate and assess growth with only three data collection time points. Future studies should consider continuing the partnership formed with the current school district and obtain more DORA (i.e., and OFAS) scores in the upcoming academic years to more accurately assess growth (i.e., for both Research Questions 1 and 3). Using only three time points also is problematic for the stability of the coefficients in the current 3-Level HLM Growth Model, as more time points are desirable for statistical conclusion validity (Raudenbush & Bryk, 2002). Additionally, three time points may not provide enough information about a growth trajectory to claim a linear relationship. For example, previous research has documented that growth trajectories for reading are not linear in younger grade levels, where greater gains are made initially during the academic year and slow down or decrease as summer break approaches (i.e., noninstructional time; McCoach, OConnell, Reis, & Levitt, 2006). This is labeled the linearity bias by Singer and Willett (2003), and future studies should consider fitting 468

different models (e.g., quadratic) by using re-expressed predictors that can better capture nonlinear relationships. With regards to the DORA scores, the coding was not exact because the data were not collected at the same intervals. Therefore, the growth was approximated to every three to four months across the current academic year, instead of a consistent interval of every three months, for example. Generally, HLM can accommodate time-unstructured data such as the above, and alternative coding schemes did not have an impact on the substantive findings. Future research should consider analyzing data from a district that collects data at regular, consistent intervals for accuracy in interpretation. Analyzing only students in grades 3 through 8 could be considered another weakness of the current research question. Students in grades Preschool through 2 and high school were not included because this study focused on students that could be linked to specific reading teachers, which generally only occurs in the younger grade levels. Additionally, DORA is administered more frequently in younger grade levels, and at least three time points are necessary to analyze the data for this research question, which supports the omission of the older grade levels. However, the omission of the noted grades above has implications for generalizing to the entire district or similar districts. Future studies should consider analyzing all grade levels with more complete data from multiple districts. HLM assumptions for this third research question were problematic compared to the first research question. Violations were noted in examining linearity, normality, and homogeneity. As noted above, having only three points complicates the accurate assessment of not only the substantive results, but also the assumptions. In the assessment 469

of linearity, the empirical growth plots for each student for each DORA subtest suggested that most students have linear change with time. For others, the small number of waves of data made it difficult to accurately assess growth, with some trajectories appearing curvilinear and others seemingly having no linear relationship. With more than three time points, a more accurate assessment of linearity at Level 1 can be produced. For normality, the assumption was upheld in some instances, but many distributions were leptokurtic. The nonnormality demonstrated may have had an impact on the heterogeneity of variance assumption (Lomax, 2007). Level 3 homogeneity of variance was difficult to assess due to the small number of teachers (N = 11). Thus, the decision was made to examine homogeneity for the main relationship of interest in the current research question between teacher OFAS score at Level 3 and student DORA score at Level 1. This may not accurately assess homogeneity at Level 3, as some texts such as Singer and Willett (2003) recommend examining plotting the Level 2 predictors against the OFAS as a predictor at Level 3. Overall, the small sample size at Level 3 makes it difficult to reach definitive conclusions in assessing this assumption. Thus, the conclusions drawn from the homogeneity of variance assumption for Level 3 should be interpreted with caution. As described above, the assumptions were violated in this third research question, which can increase the likelihood of committing a Type I or Type II Error. However, due to the small sample size in this research question, eliminating more cases at Level 1 or groups at Level 3 is not advised, which can likely impact the validity of the study as with the assumption violations. These assumption violations can impact the statistical conclusion validity for this third research question. The robust standard errors were 470

reported from the HLM models to combat this problem. Future studies should use multiple school districts with more data collection time points to have more flexibility to eliminate outliers and evaluate assumptions more accurately. Additionally, multiple school districts or a larger school district will include more teachers at the third level, which appeared to be the most problematic for examining the current research questions assumptions. The recurring problem in the current research question is the small sample size at Levels 1 and 3. As mentioned before, if a sample size problem exists, it is usually at the group level because the group-level sample size is always smaller than the individuallevel sample size (Maas & Hox, 2005). According to simulation research, this is generally problematic for the standard errors of the second-level variances, as they are estimated too small when the number of groups is lower than 100 (Maas & Hox, 2005). This is usually applied to the second level, but is generally problematic for the third level, as the third level of data is usually smaller than the second level. The number of groups was problematic for the estimation of the standard errors for the third research question, as there were only 11 teachers in the third level of data. Future studies, as mentioned above, should find larger districts with more reading teachers, or use multiple schools districts to increase the third level sample size. Another limitation, as mentioned in Research Question 1, includes the number of models run in the current research question, specifically because two Full Models were run for each DORA outcome (i.e., one for the 50-question OFAS and another for the 10question OFAS). This has an impact on statistical conclusion validity in that the more tests that are performed, the more likely a chance difference will be found even if there 471

are no true differences between conditions (i.e., increased Type I Error rate). Future studies may include a DORA composite as the outcome, which is being considered by LGL and not currently available for evaluation. One major limitation considering all the research questions combined is the lack of any information regarding the implementation of DORA, and the growth trajectory of student state test scores before and after implementation. Using piecewise growth modeling to examine this information (i.e., the basic, molar level) can add a necessary missing piece to the validation argument. According to Seltzer, Frank, and Bryk (1994), a fundamental aim of schooling is to effect growth in childrens knowledge and skills, [and] longitudinal analysis becomes an indispensible means of assessing the success and health of educational systems (p. 48). In investigations of educational interventions, studies of academic progress before, during, and/or after a treatment period are of particular interest, specifically how well students fare after a program has been implemented (Seltzer & Svartberg, 1998). Piecewise growth modeling provides the appropriate analytical platform for analyzing longitudinal data involving an intervention such as DORA implementation (Raudenbush & Bryk, 2002). The current study could utilize piecewise growth modeling to examine if the rate of CSAP test scores significantly increases after DORA implementation. The main reason for not conducting this part of the analysis is that currently there are only five state test scores available for analysis - three before DORA implementation in the fall of 2007 and two afterwards. The necessary third CSAP score after DORA implementation will become available in August 2010, and future studies should obtain this missing time point to analyze this important baseline research question. 472

Conclusion The body of formative assessment literature has unanimously heralded the benefits of the diagnostic use of assessment to inform curriculum and instruction, and consequently, improve student performance and achievement. Previous research in this area has primarily focused on traditional formative assessment practices (i.e., paper-andpencil quizzes, oral and written feedback to students). More recently with the technology movement in schools, the literature is beginning to examine the effectiveness of computerized or Internet-based formative assessment, with the latest studies of this modern mode formative assessment beginning to replicate these findings. The current study attempted to add to this literature base by examining one computerized/online formative assessment program and its relationship to a summative state proficiency test, in addition to examining the multilevel influence of teacher use of this technology-based mode of assessment on student computerized/online formative assessments. This preliminary investigation focused on the following three main objectives: (1) Examining if computerized/online formative assessment growth is related to state test score growth, (2) Developing a behavioral frequency measure of teacher use of computerized/online formative assessment programs, and (3) Investigating the relationship between the newly developed measure of teacher computerized/online formative assessment use and student computerized/online formative assessment scores (i.e., growth). Specific to the first objective, it was hypothesized that student computerized/online formative assessment growth would be related to state test score growth. This hypothesis was supported in that all the DORA subtests were positively and

473

significantly related to state reading test scores, indicating that these subtests are demonstrating a correlated growth in students reading to the state testing. The second objective aimed to add to the formative assessment research base by creating a measure of computerized/online formative assessment practices of teachers. The hypothesis was that a psychometrically sound measure of teacher computerized/online formative assessment practices can be developed. The results rendered a 50-question OFAS focusing on all elements of computerized/online formative assessment, and a potential abbreviated version of this larger measure, the 10-question OFAS, which focused solely on how teachers use the results from the computerized/online formative assessment program. The immediate purpose in creating this measure was to evaluate the hypothesis in Research Question 3 that teachers with higher scores on the OFAS are related to their students higher computerized/online formative assessment scores (i.e., DORA growth). The measures developed in Research Question 2 were examined in the third research question, which used both the 50-question OFAS and the 10-question OFAS to determine the predictive validity of student computerized/online formative assessment scores. The third objective aimed to demonstrate that teachers with higher scores on the newly developed behavioral frequency measure of teacher use of computerized/online formative assessment will produce students with higher online formative assessment scores. The hypothesis was not supported for both the 50-question and 10-question OFAS Full Models in that teacher OFAS scores were not a significant, positive predictor of student DORA subtest score growth.

474

The above findings were used to support the following assertions in a validation argument: (1) DORA growth is reflective of student growth in reading, which is related to growth on state proficiency tests in reading, and (2) Teachers who use DORA more frequently are able to diagnose student reading problems with greater specificity, and use that feedback to improve their students scores. The purpose of this preliminary study combined across research questions was to begin to validate the scores on a newly developed measure of teacher use of computerized/online formative assessment. Ideally, the relationship between student DORA scores and student CSAP scores, and the relationship between teacher OFAS scores and student DORA scores suggests a multilevel influence of teacher use of computerized/online formative assessment on student reading scores. Although the validation argument was not supported due to the results from Research Question 3, future research should continue to define the theoretical network of relationships that may exist, and continue to attempt the validation of the scores on a measure of teacher use of computerized/online formative assessment practices. The most interesting finding and conclusion gained from the above collection of research question is that the OFAS is worth pursuing further. The results are consistent with the construct and provide preliminary support for this newly developed measure, specifically the focus on using the results. Many questions remain about the construct of computerized/online formative assessment practices, the content of the measure, and its correlates, and the examination of this information has only just begun. Developing a new scale or measure begins the path of validation completely anew, and initial studies such as this are only the first step. Establishing a positive relationship between teacher 475

and student with regards to computerized/online formative assessment, in concert with the first objective, can begin to validate the scores on this newly developed measure of teacher computerized/online formative assessment practices as an indicator of student achievement, not only on various online formative assessment tests, but also on other more high-stakes tests. In general, the above findings can provide some support to the burgeoning literature outlining the role of computerized/online formative assessment in teaching and learning. Internet-mediated teaching and assessment is becoming commonplace in the classroom, and is more frequently being used to replace traditional modes of student assessment. The need to examine the extent to which these methods are educationally sound is in high demand. Results from this study can not only add to the literature base theoretically and methodologically, but also practically, by bolstering support for federal initiatives and administrative demands for more efficient, technology-based ways to encourage teachers to invest their time in this mode of formative assessment, and in turn, meet state standards and increase student achievement.

476

REFERENCES

Abedi, J. (2002). Standardized achievement tests and English language learners: Psychometrics issues. Educational Assessment, 8(3), 231 257. American Association for the Advancement of Science (AAAS). (1989). Project 2061: Science for all Americans. Washington, DC: AAAS. American Educational Research Association (AERA), American Psychological Association (APA), National Council on Measurement in Education (NCME). (1999). Standards for educational and psychological testing. Washington, DC: AERA. American Federation of Teachers (AFT), National Council on Measurement in Education (NCME), National Education Association (NEA). (1990). Standards for teacher competence in educational assessment of students. Retrieved from http://www.unl.edu/buros/article3.html Andrich, D. (1978) A rating scale formulation for ordered response categories. Psychometrika, 43, 561 573. Angus, S.D., & Watson, J. (2009). Does regular online testing enhance student learning in the numerical sciences? Robust evidence from a large data set. British Journal of Educational Technology, 40(2), 255 272. Bear, D.R., Invernizzi, M., Templeton, S., & Johnston, F. (2000). Words their way: Word study for phonics, vocabulary, and spelling instruction (2nd ed.). Upper Saddle River, NJ: Prentice Hall. Bell, B., & Cowie, B. (2000). The characteristics of formative assessment in science education. Science Education, 85, 536 553. Bennett, R.E. (2001). How the internet will help large-scale assessment reinvent itself. Education Policy Analysis Archives, 9(5), 1 23. Bennett, R.E. (2002). Inexorable and inevitable: The continuing story of technology and assessment. Journal of Technology, Learning, and Assessment, 1(1), 1 24. 477

Bhola, D.S., Impara, J.C., & Buckendahl, C.W. (2003). Aligning tests with states content standards: Methods and issues. Educational Measurement: Issues and Practice, 22(3), 21 29. Biesanz, J.C., Deeb-Sossa, N., Papadakis, A.A., Bollen, K.A., & Curran, P.J. (2004). The role of coding time in estimating and interpreting growth curve models. Psychological Methods, 9(1), 30 52. Black, P., Harrison, C., Lee, C., Marshall, B., & Wiliam, D. (2002). Working inside the black box. London, England: Nelson Publishing Company. Black, P., & Wiliam, D. (2009). Developing the theory of formative assessment. Educational Assessment, Evaluation and Accountability, 21, 5 31. Black, P., & Wiliam, D. (1998a). Assessment and classroom learning. Assessment in Education: Principles, Policy, and Practice, 5(1), 7 75. Black, P., & Wiliam, D. (1998b). Inside the black box: Raising standards through classroom assessment. Phi Delta Kappan, 80(2), 139 149. Boekaerts, M., & Corno, L. (2005). Self-regulation in the classroom: A perspective on assessment and intervention. Applied Psychology, 54(2), 199 231. Bond, L.A., Braskamp, D., & Roeber, E. (1996). The status report of the assessment programs in the United States: State students assessment programs database school year 1994-1995. Retrieved from ERIC database. (ED 401333) Bond, T.G., & Fox, C.M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates. Boston, C. (2002). The concept of formative assessment. Practical Assessment, Research & Evaluation, 8(9). Retrieved from http://PAREonline.net/getvn.asp?v=8&n=9 Bracht, G.H., & Glass, G.V. (1968). The external validity of experiments. American Educational Research Journal, 5, 437 474. Brookhart, S.M. (2007). Expanding views about formative classroom assessment: A review of the literature. In J.H. McMillan (Ed.), Formative classroom assessment: Research, theory and practice. New York, NY: Teachers College Press. Brown, S., & Knight, P. (1994). Assessing learners in higher education. London, England: Kogan Page. Buchanan, T. (2000). The efficacy of a world-wide web mediated formative assessment. Journal of Computer Assisted Learning, 16, 193 200. 478



APPENDICES

Appendix A: Online Formative Assessment Survey (Final Version)

The Online Formative Assessment Survey (OFAS)

Directions: This inventory contains 56 items that address issues in online formative assessment use (i.e., the Diagnostic Online Reading Assessment, or DORA). For each item, please indicate how frequently you use the assessment practice described by the item. The rating scale includes Never (i.e., 0 times a quarter/semester), Rarely (i.e., 1 time a quarter/semester), Sometimes (i.e., 2-3 times a quarter/semester), and Almost Always (i.e., 4 or more times a quarter/semester). Your completion of this survey is your consent to participate in this study. If you do not want to participate, simply do not complete the survey.

Respond to each item with the following prompt: In a given quarter/semester, how often do you...

General:
1. ensure that all your students have taken DORA?
2. download/access student results after a completed assessment?
3. download/access the parent report after a completed assessment?
4. share individual results from the DORA reports directly with a student?
5. share group/classroom results from the DORA reports directly with the entire group/classroom?
6. incorporate the results into your instruction?
7. use results to make decisions about student placement (e.g., gifted or remedial)?
8. communicate with the DORA administrators at your school or in your school district?

Accessing Subscale Results:
9. download/access subscale results for high-frequency words?
10. download/access subscale results for word recognition?
11. download/access subscale results for phonics?
12. download/access subscale results for phonemic awareness?
13. download/access subscale results for oral vocabulary?
14. download/access subscale results for spelling?
15. download/access subscale results for reading comprehension?
16. download/access subscale results for fluency?

Informing Instruction with Subscale Results:
17. use subscale results for high-frequency words to inform your instruction?
18. use subscale results for word recognition to inform your instruction?
19. use subscale results for phonics to inform your instruction?
20. use subscale results for phonemic awareness to inform your instruction?
21. use subscale results for oral vocabulary to inform your instruction?
22. use subscale results for spelling to inform your instruction?
23. use subscale results for reading comprehension to inform your instruction?
24. use subscale results for fluency to inform your instruction?

Providing Feedback from Subscale Results:
25. use subscale results for high-frequency words to provide feedback to your students?
26. use subscale results for word recognition to provide feedback to your students?
27. use subscale results for phonics to provide feedback to your students?
28. use subscale results for phonemic awareness to provide feedback to your students?
29. use subscale results for oral vocabulary to provide feedback to your students?
30. use subscale results for spelling to provide feedback to your students?
31. use subscale results for reading comprehension to provide feedback to your students?
32. use subscale results for fluency to provide feedback to your students?

Communicating the Results:
33. communicate the results to parents orally?
34. communicate the results to parents in a written format (e.g., a letter or an e-mail)?
35. communicate the results to students orally?
36. communicate the results to students in a written format (e.g., a letter or an e-mail)?
37. communicate the results to other educators or practitioners orally?
38. communicate the results to other educators or practitioners in a written format (e.g., a letter or an e-mail)?

Grade-Level Equivalency Results:
39. download/access the grade-level equivalency results?
40. use the grade-level equivalency results to inform your instruction?
41. use the grade-level equivalency results to provide feedback to your students?
42. use the grade-level equivalency results to make decisions about student placement (e.g., gifted or remedial)?

Using the Results:
43. use the results to make instructional decisions?
44. use the results to guide your classroom quiz/test/exam construction?
45. link the results to your course standards and/or objectives?
46. use DORA results to help all students with their reading performance?
47. use DORA results to help the high-achieving students with their reading performance?
48. use DORA results to help the low-achieving students with their reading performance?
49. use the results linking state standards for reading to DORA scores?
50. compare the results with your other content-related classroom quiz/test/exam results (i.e., quizzes/tests/exams that you have constructed)?
51. compare your classroom results with the school district?
52. compare individual student results with the school district?
53. compare individual student results with the rest of the class?

Other Questions:
54. use the individual student summary reports (e.g., the abbreviated 1-page reports)?
55. use the individual student full reports (e.g., the extended 17-page reports)?
56. use DORA to prepare your students for your end-of-year state test in reading?


Appendix B: Highland School District Permission Letter

Highland Schools Weld RE-9


210 West First
Ault, CO 80610
970-834-1345

September 25, 2009

To Whom It May Concern,

This letter is intended to act as permission for Aryn C. Karpinski to use our district's student data from the state assessment/district testing for her doctoral program of study (i.e., her dissertation) at The Ohio State University. She will use the data to discuss teacher use of online formative assessment programs and student state test results. No student or teacher names will be used during this study (i.e., no identifiers). We look forward to seeing the results of this study.

In education,

Sue Ann Highland, M.A.
Director of Federal Programs, Curriculum, and Instruction
Weld RE-9 Schools
Ault, CO 80610


Appendix C: Let's Go Learn, Inc. Permission Letter


Appendix D: The Ohio State University Institutional Review Board Exempt Status


Appendix E: Informal Interview Questions to Develop the Online Formative Assessment Survey (OFAS)

1. Choosing assessment methods. How did you come to choose this assessment method (i.e., DORA)? What resources did you use to select this assessment method? How often and in what capacity do you use DORA?

2. Developing assessment methods. Do you use DORA information to develop assessments/tests? What are your general assessment criteria for tests/exams? What are your general assessment criteria for homework assignments? How do you use DORA reports to formulate assessment criteria?

3. Administering, scoring, and interpreting assessment results. How are your assessments administered? How do you interpret your assessments? Do you interpret the assessment results from DORA? If you do not understand how to interpret an assessment, what do you do?

4. Using assessment results for decision making. How do you determine that a student is struggling with a subject/topic? How do you determine if a student is allowed to re-take or re-do an assessment?

5. Grading. What do you base a student's course grade on? How do you determine a student's homework or test grade? To what extent do you allow other teachers/practitioners or administrators to examine your grading rubrics/techniques?

6. Communicating assessment results. What kinds of feedback do you give students? How do you administer feedback individually and/or collectively? How and when do you review homework problems/test questions? How do you incorporate homework/test results into your teaching/lectures? How do you incorporate DORA reports into your instruction/curriculum?

Appendix F: Reviewer Feedback on the Online Formative Assessment Survey (OFAS)

Respond to each item as follows: In a given quarter/semester, how often do you...

1. administer DORA?
Reviewer suggestion: This sounds like it is a question of do you yourself administer it, not the frequency. Our staff doesn't necessarily do the administration of the actual assessment, but we give it three times per year.
2. download/access student results/reports after a completed assessment?
Reviewer suggestion: None.
3. download/access the parent report after a completed assessment?
Reviewer suggestion: None.
4. provide individual student feedback from the DORA results/reports?
Reviewer suggestion: Depends on the grade level. Do you mean share the results directly with the student? Could you add a place to comment on how the feedback is given?
5. provide group/classroom feedback from the DORA results/reports?
Reviewer suggestion: Same as above.
6. incorporate the results/reports into your curriculum?
Reviewer suggestion: Reword curriculum to instruction.
7. use results/reports to make decisions about student placement?
Reviewer suggestion: None.
8. communicate with the DORA contact person at your school or in your school district?
Reviewer suggestion: Who is the contact person? What do you mean by contact person?
9. download/access subscale results/reports for high-frequency words?
Reviewer suggestion: Maybe separate out the results and the reports. They really serve two purposes.
10. download/access subscale results/reports for word recognition?
Reviewer suggestion: Same as question 9.
11. download/access subscale results/reports for phonics?
Reviewer suggestion: Same as question 9.
12. download/access subscale results/reports for phonemic awareness?
Reviewer suggestion: Same as question 9.
13. download/access subscale results/reports for oral vocabulary?
Reviewer suggestion: Same as question 9.
14. download/access subscale results/reports for spelling?
Reviewer suggestion: Same as question 9.
15. download/access subscale results/reports for reading comprehension?
Reviewer suggestion: Same as question 9.
16. download/access subscale results/reports for fluency?
Reviewer suggestion: Same as question 9.
17. use subscale results/reports for high-frequency words to inform your curriculum?
Reviewer suggestion: This should be changed to instruction, not curriculum. Curriculum is not determined by the teacher. It is determined by the district. So teachers can't change that.
18. use subscale results/reports for word recognition to inform your curriculum?
Reviewer suggestion: Same as question 17.
19. use subscale results/reports for phonics to inform your curriculum?
Reviewer suggestion: Same as question 17.
20. use subscale results/reports for phonemic awareness to inform your curriculum?
Reviewer suggestion: Same as question 17.
21. use subscale results/reports for oral vocabulary to inform your curriculum?
Reviewer suggestion: Same as question 17.
22. use subscale results/reports for spelling to inform your curriculum?
Reviewer suggestion: Same as question 17.
23. use subscale results/reports for reading comprehension to inform your curriculum?
Reviewer suggestion: Same as question 17.
24. use subscale results/reports for fluency to inform your curriculum?
Reviewer suggestion: Same as question 17.
25. use subscale results/reports for high-frequency words to provide feedback to your students?
Reviewer suggestion: Maybe split the results/reports. We don't use the reports directly with students.
26. use subscale results/reports for word recognition to provide feedback to your students?
Reviewer suggestion: Same as question 25.
27. use subscale results/reports for phonics to provide feedback to your students?
Reviewer suggestion: Same as question 25.
28. use subscale results/reports for phonemic awareness to provide feedback to your students?
Reviewer suggestion: Same as question 25.
29. use subscale results/reports for oral vocabulary to provide feedback to your students?
Reviewer suggestion: Same as question 25.
30. use subscale results/reports for spelling to provide feedback to your students?
Reviewer suggestion: Same as question 25.
31. use subscale results/reports for reading comprehension to provide feedback to your students?
Reviewer suggestion: Same as question 25.
32. use subscale results/reports for fluency to provide feedback to your students?
Reviewer suggestion: Same as question 25.
33. communicate the results/reports to parents orally?
Reviewer suggestion: You need to say parents and students. This is confusing.
34. communicate the results/reports to parents in a written format (i.e., either a standard letter or e-mail)?
Reviewer suggestion: Same as question 33.
35. communicate the results/reports to students orally?
Reviewer suggestion: Same as question 33.
36. communicate the results/reports to students in a written format (i.e., either a standard letter or e-mail)?
Reviewer suggestion: Same as question 33.
37. communicate the results/reports to other educators or practitioners orally?
Reviewer suggestion: None.
38. communicate the results/reports to other educators or practitioners in a written format (i.e., either a standard letter or e-mail)?
Reviewer suggestion: None.
39. download/access the grade-level equivalency results/reports?
Reviewer suggestion: None.
40. use the grade-level equivalency results/reports to inform your curriculum?
Reviewer suggestion: None.
41. use the grade-level equivalency results/reports to provide feedback to your students?
Reviewer suggestion: None.
42. use the grade-level equivalency results/reports to make decisions about student placement (e.g., gifted or remedial program)?
Reviewer suggestion: None.
43. use the results/reports to make instructional decisions?
Reviewer suggestion: This seems like a repeat of the question above. This is not needed.
44. use the results/reports to guide your classroom quiz/test/exam construction?
Reviewer suggestion: None.
45. link the results/reports to your course objectives?
Reviewer suggestion: Could you add standards with objectives?
46. use DORA results/reports to help all students with their reading performance?
Reviewer suggestion: None.
47. use DORA results/reports to help the high-achieving students with their reading performance?
Reviewer suggestion: None.
48. use DORA results/reports to help the low-achieving students with their reading performance?
Reviewer suggestion: None.
49. use the results/reports linking state standards for reading to DORA scores?
Reviewer suggestion: None.
50. compare the results/reports with your other content-related classroom quiz/test/exam results (i.e., quizzes/tests/exams that you have constructed)?
Reviewer suggestion: None.
51. compare your classroom results/reports with the school district?
Reviewer suggestion: None.
52. compare individual student results/reports with the school district?
Reviewer suggestion: None.
53. compare individual student results/reports with the rest of the class?
Reviewer suggestion: None.
54. use the individual student summary reports (e.g., the abbreviated 1-page reports)?
Reviewer suggestion: None.
55. use the individual student full reports (e.g., the extended 17-page reports)?
Reviewer suggestion: None.
56. use DORA to prepare your students for the CSAP?
Reviewer suggestion: None.

Appendix G: Invitation to Participate in the Study


Date

Dear DORA Teacher/Administrator:

You are invited to participate in a study conducted at The Ohio State University that will provide researchers and professionals with information to examine teacher online formative assessment practices. Participation in this study is completely voluntary. However, information gathered will guide further research in this burgeoning area that can assist researchers and professionals in understanding teacher online formative assessment use.

You have until February 20, 2010 to complete this survey. The survey should take no more than 15 minutes to complete. There are eight sections in this survey composed of closed-response items using a 4-point Likert scale (i.e., Never, Rarely, Sometimes, Almost Always). There is also a brief demographic intake at the end of the survey.

Information obtained from this survey will be kept completely confidential, and will be analyzed and reported collectively. Completing this survey serves as your informed consent, and no compensation will be provided. There are no risks to participating in this study, and you will not directly benefit from participating in this study. You also do not give up any personal legal rights by consenting to participate. Your participation is voluntary, and you can refuse to answer questions that you do not wish to answer. You can also refuse to participate or withdraw at any time without penalty or repercussion.

The survey can be taken online, and the link is provided here: http://www.surveymonkey.com/s/OFAS. Click on the above link to be directed to the survey where you will find directions on how to complete the survey. Upon completion of the study, your anonymous responses will be sent to the primary investigator.

If you have any questions or concerns, or would like to obtain a copy of the results from this study, please e-mail the Co-Investigator at The Ohio State University at karpinski.10@osu.edu. Thank you!

Sincerely,

Aryn C. Karpinski, Co-Investigator
The Ohio State University
Doctoral Candidate, Quantitative Research, Evaluation, and Measurement

505

Você também pode gostar