Facets To Adjust For Rater Discrepancies

Investigating the Effect of Raters' L1 Background on Writing Assessment
A Presentation for IJAS

Paris, France April 8, 2013 by Farah Bahrouni
Sultan Qaboos University (SQU)
OMAN bahrouni@squ.edu.om
Confusion is the beginning of learning.

Socrates (469-399 BC)
If we knew what we were doing, we wouldn't call it research.

Albert Einstein
These 2 quotations might explain why I am here!
Outline:
1) Claim
2) Study
   Data collection
   Analysis: FACETS & One-Way ANOVA
   Results
3) Conclusion
Implication & Significance
1. Claim
Research has established that writing assessment can by no means be objective. Studies have probed the possible reasons extensively:
Weigle (1994: 23, 24) grouped sources of raters' disagreement into three categories:
1) within the text: prompt, writer's background & ability
2) within the rater: physical & psychological conditions
3) within the rating context: when, where & under what conditions the rating is done
She adds that interactions among these sources are also possible: a rater from a certain background may react to a text written in a certain style differently from the way a rater from a different background would (p. 24).
Bachman (1990) refers to the above sources as potential sources of measurement error and categorizes them into three groups:
1) test method factors (e.g. raters, prompt type)
2) personal attributes (e.g. test taker's cognitive style, knowledge of particular content)
3) random factors (e.g. fatigue, time of day)
Most of the other studies revolve around these points with respect to their different contexts.
The claim in this study is that raters' L1, which has been largely neglected, is a significant source of discrepancy between raters and should be studied thoroughly in its own right.
Quantitative Data collection

20 ESL teachers from 4 different language backgrounds (5 native speakers, 5 Arabs sharing the students' mother tongue, 5 Indians, and 5 Russians) scored 3 essays written by 3 Omani university students. All raters are experienced ESL/EFL teachers and have taught in the Omani context for a minimum of 2 years.
2.2 Data collection (II)

Write:
1) construct definitions based on Bachman & Palmer's (1996) communicative approach
2) definitions of performance levels based on LOs, teachers' responses and the 65 studied reports
Piloting: 5 teachers scored 10 samples twice:
RS1
RS2
Analysis: FACETS + One-Way ANOVA
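For readers who have not used FACETS: the program estimates a many-facet Rasch model. The slides do not state the model specification, so the form below is the general textbook one, and the assignment of facets (samples, raters, categories) is an assumption based on the reports that follow.

```latex
\log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - C_j - D_i - F_k
% B_n : measure of writing sample n
% C_j : severity of rater j
% D_i : difficulty of rating category (criterion) i
% F_k : difficulty of moving from score k-1 to score k on the scale
% P_{nijk} : probability that rater j gives sample n a score of k on category i
```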
Results from FACETS:

RS1 Category Measurement Report (arranged by MN).
------------------------------------------------------------------------------------------------
| Obsvd  Obsvd  Obsvd   Fair-M |         Model | Infit       Outfit     | Estim. |             |
| Score  Count  Average Avrage | Measure  S.E. | MnSq  ZStd  MnSq  ZStd | Discrm | N  Category |
------------------------------------------------------------------------------------------------
| 179    50     3.6     3.65   | -.28     .25  | 1.27   1.1  1.31   1.3 |  .74   | 1  CONT     |
| 175    50     3.5     3.58   | -.04     .24  | 1.01    .0  1.01    .1 |  .96   | 2  ORG      |
| 174    50     3.5     3.56   |  .02     .24  | 1.07    .3   .94   -.2 | 1.06   | 4  SCES     |
| 169    50     3.4     3.47   |  .30     .23  |  .67  -1.6   .75  -1.1 | 1.26   | 3  Lge use  |
------------------------------------------------------------------------------------------------
Model, Sample: RMSE .24  Adj (True) S.D. .00  Separation .00  Reliability .00
Model, Fixed (all same) chi-square: 3.0  d.f.: 3  significance (probability): .40

RS2 Category Measurement Report (arranged by MN).
------------------------------------------------------------------------------------------------
| Obsvd  Obsvd  Obsvd   Fair-M |         Model | Infit       Outfit     | Estim. |             |
| Score  Count  Average Avrage | Measure  S.E. | MnSq  ZStd  MnSq  ZStd | Discrm | N  Category |
------------------------------------------------------------------------------------------------
| 187    50     3.7     3.77   | -.71     .26  |  .85   -.5   .89   -.4 | 1.11   | 2  ORG      |
| 184    50     3.7     3.71   | -.51     .26  |  .51  -2.4   .53  -2.4 | 1.48   | 1  CONT     |
| 164    50     3.3     3.39   |  .58     .21  | 2.11   3.5  2.05   3.3 |  .47   | 4  SCES     |
| 163    50     3.3     3.38   |  .63     .21  |  .73  -1.1   .89   -.3 |  .90   | 3  Lge Use  |
------------------------------------------------------------------------------------------------
Model, Sample: RMSE .24  Adj (True) S.D. .66  Separation 2.82  Reliability .89
Model, Fixed (all same) chi-square: 26.7  d.f.: 3  significance (probability): .00
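The summary lines under each report (RMSE, Separation, Reliability) can be checked by hand. The sketch below uses the standard Rasch definitions: RMSE as the root mean square of the standard errors, separation as Adj (True) S.D. divided by RMSE, and reliability as G^2 / (1 + G^2). The function names are illustrative and are not part of FACETS.

```python
import math

def rmse(standard_errors):
    """Root mean square of the model standard errors."""
    return math.sqrt(sum(se ** 2 for se in standard_errors) / len(standard_errors))

def separation(adj_sd, rmse_value):
    """Separation G: error-adjusted spread of the measures in units of RMSE."""
    return adj_sd / rmse_value

def reliability(g):
    """Separation reliability, G^2 / (1 + G^2)."""
    return g ** 2 / (1 + g ** 2)

# Standard errors copied from the RS1 and RS2 category reports.
rs1_se = [0.25, 0.24, 0.24, 0.23]
rs2_se = [0.26, 0.26, 0.21, 0.21]

print("RS1 RMSE:", round(rmse(rs1_se), 2))
print("RS2 RMSE:", round(rmse(rs2_se), 2))

# Reliability recomputed from the reported separation values.
print("RS1 reliability:", round(reliability(0.00), 2))
print("RS2 reliability:", round(reliability(2.82), 2))

# Separation recomputed from the rounded Adj S.D. and RMSE for RS2.
print("RS2 separation (rounded inputs):", round(separation(0.66, 0.24), 2))
```

With the two-decimal inputs printed in the reports, this reproduces the RMSE of .24 for both scales and the RS2 reliability of .89. The recomputed RS2 separation (2.75) is slightly below the reported 2.82, presumably because FACETS works from unrounded values.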
Results from One-Way ANOVA
5 Raters ANOVA: Rater (total scores)

Scale  Source          Sum of Squares  df  Mean Square  F     Sig.
RS1    Between Groups  136.32           4  34.08        2.17  0.088
       Within Groups   708.1           45  15.74
       Total           844.42          49
RS2    Between Groups  88               4  22           1.43  0.239
       Within Groups   692             45  15.38
       Total           780             49

5 Raters ANOVA: Samples (total scores)

Scale  Source          Sum of Squares  df  Mean Square  F     Sig.
RS1    Between Groups  379.22           9  42.136       3.62  0.002
       Within Groups   465.2           40  11.63
       Total           844.42          49
RS2    Between Groups  484.4            9  53.822       7.28  0.000
       Within Groups   295.6           40  7.39
       Total           780             49
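The F values in the ANOVA output follow directly from the sums of squares and degrees of freedom: each mean square is SS / df, and F is the between-groups mean square divided by the within-groups mean square. A minimal sketch of that arithmetic, using the figures reported on the slides (the function and dictionary names are illustrative only):

```python
def f_ratio(ss_between, df_between, ss_within, df_within):
    """Return (MS between, MS within, F) for a one-way ANOVA table row pair."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between, ms_within, ms_between / ms_within

# (SS between, df between, SS within, df within), copied from the slides.
reported = {
    ("Rater", "RS1"): (136.32, 4, 708.1, 45),
    ("Rater", "RS2"): (88, 4, 692, 45),
    ("Samples", "RS1"): (379.22, 9, 465.2, 40),
    ("Samples", "RS2"): (484.4, 9, 295.6, 40),
}

for (factor, scale), values in reported.items():
    ms_b, ms_w, f = f_ratio(*values)
    print(f"{factor:8s}{scale}: MS between = {ms_b:.3f}, "
          f"MS within = {ms_w:.2f}, F = {f:.2f}")
```

The recomputed F values (2.17 and 1.43 for raters, 3.62 and 7.28 for samples) match the reported ones. Read against the conventional .05 level, which the slides do not state explicitly, the Sig. column suggests that the five raters did not differ significantly under either scale, while the samples did.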
3. Implication & significance:

Analysis indicates that RS2 functions more effectively than RS1.
Teachers' involvement in defining what they think should be assessed in students' writing, and in describing the levels of performance (what labels such as Excellent, Good, or Poor stand for), helped them reach a more common understanding of the language aspects being assessed and a shared interpretation of the score descriptions.
The rating scales I have developed are home-made, based on LOs and tailored to the needs of the FPE and, therefore, of the LC. They can be generalised to any similar multi-cultural context to produce a less personalized and more institutionalized objective assessment of students' writing performance.
REFERENCES
Alderson, J. C. (1991). Bands and scores. In J. C. Alderson & B. North (Eds.), Language Testing in the 1990s: The Communicative Legacy (pp. 71-86). London and Basingstoke: Macmillan Publishers Limited.
Alderson, J. C., Clapham, C., & Wall, D. (1995). Language Test Construction and Evaluation. Cambridge: Cambridge University Press.
Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford: Oxford University Press.
Bachman, L. F., & Palmer, A. S. (1996). Language Testing in Practice: Designing and Developing Useful Language Tests. Oxford: Oxford University Press.
Brindley, G. (1998). Describing language development? Rating scales and SLA. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces between Second Language Acquisition and Language Testing Research. Cambridge: Cambridge University Press.
Fulcher, G. (2000). The 'communicative' legacy in language testing. System, 28, 483-497.
Fulcher, G. (2010). Practical Language Testing. Hodder Education.
Fulcher, G., Davidson, F., & Kemp, J. (2011). Effective rating scale development for speaking tests: Performance decision trees. Language Testing, 28(1), 5-29.
Hamp-Lyons, L. (1991). Scoring procedures for ESL contexts. In L. Hamp-Lyons (Ed.), Assessing Second Language Writing in Academic Contexts (pp. 241-276). Norwood, NJ: Ablex Publishing Corporation.
Hunter, D. M., Jones, R. M., & Randhawa, B. S. (1996). The use of holistic versus analytic scoring for large-scale assessment of writing. The Canadian Journal of Program Evaluation, 11(2), 61-85.
North, B. (2000). The Development of a Common Framework Scale of Language Proficiency. Theoretical Studies in Second Language Acquisition. P. Lang.
North, B. (2003). Scales for rating language performance: Descriptive models, formulation styles, and presentation formats. TOEFL Monograph, 24.
North, B., & Schneider, G. (1998). Scaling descriptors for language proficiency scales. Language Testing, 15(2), 217-263.
Weigle, S. C. (1994). Effects of training on raters of English as a second language compositions: Quantitative and qualitative approaches. University of California, Los Angeles.
Weigle, S. C. (2002). Assessing Writing. Cambridge: Cambridge University Press.
Thank you
