
Assessment & Evaluation in Higher Education

Vol. 33, No. 4, August 2008, 431-444

Applying the many-facet Rasch model to evaluate PowerPoint presentation performance in higher education

Ramazan Basturk*

Pamukkale University, Turkey

*Email: rbasturk@pau.edu.tr

This study investigated the usefulness of the many-facet Rasch model (MFRM) in
evaluating the quality of performance related to PowerPoint presentations in higher
education. The Rasch Model utilizes item response theory stating that the probability of
a correct response to a test item/task depends largely on a single parameter, the ability of
the person. MFRM extends this one-parameter model to other facets of the assessment situation,
for example rater severity, rating scale format and task difficulty levels. This paper
specifically investigated presentation ability in terms of items/task difficulty and rater
severity/leniency. First-year science education students prepared and used the
PowerPoint presentation software program during the autumn semester of the 2005-2006
school year in the Introduction to the Teaching Profession course. The students
were divided into six sub-groups and each sub-group was given an instructional topic,
based on the content and objectives of the course, to prepare a PowerPoint presentation.
Seven judges, including the course instructor, evaluated each group's PowerPoint
presentation performance using the A+ PowerPoint Rubric. The results of this study show
that the MFRM technique is a powerful tool for handling polytomous data in
performance and peer assessment in higher education.

Introduction
The educational research community is engaged in serious rethinking about the structure of
measurement and evaluation in education (Stiggins 1987; Wiggins 1989; Mehrens 1992;
Messick 1994; Wolf 1995). Traditional objective tests are being criticized for a variety of
different reasons. For example, test preparation practices may increase the test scores in
high-stakes situations; objective tests can lead to a narrowing of the educational curriculum; there are consistent differences in average performance of racial, ethnic and gender
groups, etc. (Haney and Madaus 1989; Neil and Medina 1989; Shepard 1989; Koretz et al.
1993). Many educational researchers support the use of alternative measurement
approaches over traditional assessment. A variety of labels have been used
to describe these alternatives to traditional objective tests (Worthen 1993). The
most common labels include 'Performance Assessment', 'Portfolio Assessment', 'Authentic Assessment' and 'Alternative Assessment'. Researchers indicate that this kind of assessment has become more and more popular as a means of linking large-scale education
testing to classroom instruction (Lumley and McNamara 1993; Linn 1994; Wolfe and
Chiu 1997).
Performance Assessment refers to a form of evaluation that requires students to perform
a task rather than select an answer from a ready-made list (Stiggins 1987). According to
Stiggins (1987), Performance Assessment is an exercise in which a student shows specific
skills and competences rather than selecting one of several predetermined answers to an
exercise. Performance Assessment provides a means for assessing a variety of student
skills that are not measured well by objective tests. In addition to written responses, such
performance may include oral communication, and the construction of models, graphs,
diagrams and maps. Or, it might include the use of tools and equipment, for example
computers, scientific instruments, etc. (Mearoff 1991; Mehrens 1992). All of these
approaches emphasize application, focus on direct assessment, culminate in a student product/performance, utilize realistic problems, accommodate varied learning styles, encourage open-ended thinking, and reflect real-life situations (Stiggins 1987; Wolfe and Chiu 1997; Linn
and Gronlund 1999). Unlike objective tests, which are limited to the product, performance assessment allows both the process and the product of a performance to be assessed at the
same time. Because this approach is time-consuming both for students to do and for judges
to evaluate, the emphasis in Performance Assessment should be on measuring complex
achievements that cannot be measured well by objective tests (Mehrens 1992; Linn and
Gronlund 1999).
Changing the role of students
Performance Assessment changes the role of students in the assessment process. Instead of
being passive learners and test-takers, students become active participants in learning and
assessment activities. According to Wiggins (1990), traditional tests merely ask students to recall information
learned out of context. They are often simplistic and limited to paper-and-pencil activities.
On the other hand, performance assessment tools offer students a full array of tasks to
accomplish. Zemelman et al. (1998) suggested that authentic tasks help students set goals,
monitor their work, evaluate their efforts, encourage collaboration and simulate real-world
activities.
Performance tasks usually require a student to carry out a task instead of completing a
written test. Students are expected to apply acquired knowledge and use multiple skills to
perform a task rather than recall answers. The tasks are related to real situations that people
encounter every day (Burke 1999). When students demonstrate skills that they have learned,
they have a greater chance of transferring the skills to real-life situations (Kallick 1992;
Burke 1999).
Changing the role of teachers
Performance Assessment changes the role of teachers as well as of students (Tomlinson
2001). Stiggins (2002) stated that performance assessment provides several benefits for
teachers. As students become more motivated to learn, instructional decisions become
easier. Whereas traditional testing promotes a teacher-centered classroom, performance
assessment requires a more student-centered classroom. In such a classroom, the teacher's
main role is to be a facilitator rather than a dispenser of information. The teacher
assists students in taking responsibility for their learning and in becoming accomplished
self-evaluators (Mearoff 1991; Linn and Gronlund 1999).
Presentation software programs
Presentation software is one of the popular performance-based educational applications in
higher education. Several different types of presentation software exist. Some of them are
Gold Disk Astound, Adobe Persuasion, DeltaGraph Pro and Microsoft PowerPoint.

According to McCleland (1994), all of these presentation software packages are user
friendly and compatible. In this research, Microsoft PowerPoint (full name Microsoft Office
PowerPoint) was used because of its advantages over others.
Critics generally agree that PowerPoint's ease of use can save a lot of time for educators
who otherwise would have used other types of visual aids, for example, hand-drawn or
mechanically typeset slides, blackboard or overhead projectors. PowerPoint presentations
have become the most prevalent form of multimedia in education. Students prefer them to
presentations made from transparencies, a preference that has been expressed particularly over the last
decade (Bartsch and Cobern 2003; Jonassen et al. 2003).
Many-facet Rasch model
The many-facet Rasch model (MFRM, Linacre 1993) represents an extension of the one-parameter Rasch Measurement Model (Rasch 1980), which is one of several models
developed within Item Response Theory. MFRM helps overcome some of the problems
and assumptions associated with Classical Test Theory (CTT). It provides information for
decision-making that is not available through CTT. According to Linacre (1993), MFRM
has several distinct advantages over classical data analysis. First, Rasch measurement
places each facet of the measurement context (group performance, item/tasks difficulty,
judge severity) on a common underlying linear scale. This results in a measure that can
be subjected to traditional statistical analysis, while allowing for unambiguous interpretation of group performance as it relates to judge severity and item/task difficulty. Second,
the Rasch-based calibration of examinees, items/tasks and judges is sample-free. In other
words, Rasch techniques remove the influence of sampling variability from its measures
so that valid generalizations can be made beyond the current sample of groups, collections of items/tasks and pool of judges. This feature is useful in the applied setting
because it allows group performance proficiency to be determined even if the study has
missing data, e.g. the group has not responded to all of the assessment tasks, or if some
groups are rated by only some of the judges in the pool. Third, Rasch fit procedures can
be used to detect unexpected response patterns, which is useful for evaluating the extent to
which individual groups, tasks or judges are behaving in ways that are inconsistent with
the measurement model (Engelhard 1992; Linacre and Wright 1993; Schumacker 1996;
Linacre 1999).
MFRM allows users to create a single interval scale of scores relevant to both the
difficulty of the items/tasks and the ability of the persons tested (Linacre 1999). These
scores are reported in units called logits and are typically placed on a vertical ruler called
a logistic ruler. Because logits can be added, subtracted, multiplied and divided, comparisons and statistical studies can be made, which again is useful for assessing educational
gains, displaying strengths and weaknesses, and comparing demographic groups. In
principle, Rasch models can be applied in any experimental context in which persons
interact with assessment items in a manner that provides for comparisons between
persons with regard to the magnitude of some attribute or trait (Engelhard 1992; Linacre
1999).
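For reference, a common way of writing the three-facet version of the model used in this study (the article itself does not print the equation) is

\[
\log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k ,
\]

where P_{nijk} is the probability that group n receives rating category k from judge j on item i, P_{nij(k-1)} the probability of category k-1, B_n the presentation ability of group n, D_i the difficulty of item/task i, C_j the severity of judge j, and F_k the step difficulty of category k. All parameters are expressed in logits on the same scale.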
In this study, MFRM was considered a more effective data analysis procedure in identifying the quality of group PowerPoint presentation performance because multiple raters
observed and evaluated each group's performance (Linacre 1999). Another reason for using
MFRM in this research is that the theoretical and statistical support for performance and
peer assessment has not been well established in higher education. MFRM does not require
assumptions about sampling or normality of distribution; it is therefore useful for
performance assessment with different item/task structures at any educational level,
including higher education.
Specific purpose of study
This study attempts to address some issues facing performance assessment, with particular
focus on instructor assessment and peer assessment of group PowerPoint presentation skills
in an Introduction to the Teaching Profession course, using a real-life classroom setting.
MFRM provides a powerful tool for handling polytomous data involving raters' judgments
(Linacre 1999). The present study explores the multi-faceted assessment reports provided
by MFRM with regard to the following research issues:
(1) the relationship among the three facets of assessment (groups, items/tasks, judges);
(2) judge severity/leniency;
(3) groups' presentation ability;
(4) item/task difficulty;
(5) bias interaction analysis.

Methods
Participants
Participants in this study were first-year science education students from a medium-size
university in Turkey. One class of 30 students was randomly selected from the five classes
that were taught by the same instructor. The class consisted of 14 (47%) male and 16 (53%)
female students. Mean age was 19.3 with a standard deviation of 0.7 for this group. This
class used PowerPoint presentations during the autumn semester of the 2005-2006 school
year. There were six sub-groups, with five students in each sub-group who prepared a
PowerPoint presentation and utilized a 20-minute time slot to make their presentation to
their peers in the classroom. During the presentation all group members were expected to
speak at least once.
All sub-groups were first instructed on how to use PowerPoint's basic applications in
conjunction with an informative PowerPoint presentation assignment. This took three class
periods (50 minutes each) in the educational technology lab. The first class period was for
instruction and demonstration; the second and third allowed students to experience on their
own how to manipulate PowerPoint software.
Students were provided with basic information about the study goals, including the
process they would go through during the research. The presentation topics, based on the
course content and objectives (e.g. the social, psychological, economic, philosophical,
historical and legal foundations of education), were assigned by the instructor to each group
randomly. The students were given a month of free study time to complete their work.
The participants were informed in particular that they needed to conduct research, find
new material and arrive at their own judgements and decisions based on information
obtained through different resources including printed documents (books, journals, etc.),
available experts (scientists or other instructors), and electronic information resources
including the Internet. Students were especially encouraged to use Internet resources to
obtain current electronic material (audiovisual effects, pictures, images, etc.) related to their
topic and to integrate these into their PowerPoint presentations. At the end of the given time,
the presentations were examined, and then each group made its 20-minute PowerPoint
presentation in the classroom. During this study, the primary role of the instructor was to
facilitate the process.
Judges
Each PowerPoint presentation was assessed by seven judges (the course instructor and six
students, chosen at random, one from each group) using nine assessment items
(based on their introduction, content, text elements, layout, graphics, sound, etc.). The three
facets of this study (group performance ability, item/task difficulty, judge severity/leniency)
and the related assessment rating categories will be thoroughly discussed in what follows.
Instrument
The A+ PowerPoint Rubric created by Vandervelde (2006) was chosen and modified for
the present research. This rubric included the topics listed below. Items 1-9 in the rating list
were scored on a six-point scale (0 = none, 6 = excellent).
(1) Research and Methodology.
(2) Preproduction Plan and Storyboard.
(3) Introduction.
(4) Content.
(5) Text Elements.
(6) Layouts.
(7) Citations.
(8) Graphs/Sound/Animation.
(9) Writing Mechanics.

The rubric developed for this study was used for self-assessment and particularly for peer
feedback. For overall and grade-based evaluation, the group PowerPoint presentations were
graded based on the following evaluation scale: A = Exemplary: 50-54 points; B = Proficient: 45-49 points; Incomplete: less than 45 points.
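As a quick illustration of the grading scale described above, a minimal sketch in Python (the function name and structure are hypothetical, not taken from the original study) might look like this:

```python
def grade_presentation(total_points: int) -> str:
    """Map a group's total rubric score (9 items scored 0-6, maximum 54 points)
    to the grade bands described in the text."""
    if total_points >= 50:       # 50-54 points
        return "A (Exemplary)"
    if total_points >= 45:       # 45-49 points
        return "B (Proficient)"
    return "Incomplete"          # fewer than 45 points

# Example: a group scoring 47 of the 54 possible points
print(grade_presentation(47))  # -> "B (Proficient)"
```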
Data analysis
The data were analyzed using FACETS (Linacre 1993), an MFRM computer software
program that provides detailed information on the calibration of the three aspects of the
study (group performance, item/task difficulty, judge severity/leniency). The data were
investigated mainly from the viewpoint of unexpected scores and fit statistics. The benchmark for the acceptable range of the infit and outfit statistics was set between 0.6 and 1.4.
Furthermore, the separation index for the group performance measurement report was
expected to be over 2.0 in theory (Linacre 1993).
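The infit and outfit mean-square statistics referred to throughout the findings are not defined algebraically in the paper; FACETS reports them directly. As a minimal sketch of how they are conventionally computed from model residuals (the numeric values below are invented purely for illustration), consider:

```python
import numpy as np

def fit_mean_squares(observed, expected, variance):
    """Compute Rasch infit and outfit mean-square statistics for one element
    (a group, judge or item) from the observations involving that element.

    observed -- raw ratings associated with the element
    expected -- model-expected values of those ratings
    variance -- model variances of the ratings (statistical information)
    """
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    variance = np.asarray(variance, dtype=float)

    sq_resid = (observed - expected) ** 2
    outfit = np.mean(sq_resid / variance)    # unweighted: sensitive to extreme scores
    infit = sq_resid.sum() / variance.sum()  # information-weighted
    return infit, outfit

# Hypothetical illustration (not data from the study)
obs = [4, 3, 5, 2, 4, 3]
exp = [3.6, 3.1, 4.2, 2.8, 3.9, 3.2]
var = [1.1, 1.2, 0.9, 1.3, 1.0, 1.2]
infit, outfit = fit_mean_squares(obs, exp, var)
flagged = not (0.6 <= infit <= 1.4 and 0.6 <= outfit <= 1.4)  # benchmark from the text
print(f"infit={infit:.2f}, outfit={outfit:.2f}, flagged={flagged}")
```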
Findings
Figure 1 shows the measures for the three facets: group performance, judge severity and
item/task difficulty for the data. Figure 1 is to be interpreted as follows. The scale along the
left of Figure 1 represents the logit scale, ranging from -1 to +1, which is the same for all
three facets. Groups are represented by their group numbers. Judges are ordered so that the
most lenient judges are at the top, while the severe judges are at the bottom. In general,

facets are ordered so that the most difficult element of each facet is seen at the top, while
the least difficult element is at the bottom. In terms of judges, for example, the most lenient
judge is the topmost judge in Figure 1 (in this case Judge 5). Similarly, the most difficult
task in the scale is topmost in Figure 1, in this case Graph/Sound/Animation. Thus,
Figure 1 shows graphically the differences across the different elements of each facet.
In Figure 1, the six groups are ordered with the highest quality performer at the top and
the lowest quality performer at the bottom. As Figure 1 also indicates, the range of their estimates
is quite wide and is not tightly clustered at the mean. Looking now at the column for
judges, we can see that judge measures lie largely around 0, with two judges above
0 and three judges below 0. Looking finally at the column for items in Figure 1, we can see
that there is a spread, and the items as a whole are neither very easy nor very difficult. Three of the nine items tend to
be difficult, while two are easy.
Figure 1. Groups, judges and questions: summary reports.

Group performance analysis


Table 1 presents the overall measure of group PowerPoint presentation performance. In
other words, it shows group PowerPoint presentation ability. Groups are presented in
descending order of quality of group performance. Specifically, the Measure column indicates that Group 3 is the most able (logit measure = 0.49), followed by Group 5 (logit
measure = 0.32), whereas Group 4 is the poorest (logit measure = -0.36).
Table 1 also reports the two fit statistics, infit and outfit. Outfit statistics are more
sensitive to extreme scores; high infit statistics are therefore a little more problematic than
high outfit statistics. In analyzing Rasch data, users typically are concerned when the mean-
square (MNSQ) fit statistics exceed 1.5: the higher the fit statistic, the more
questionable the information. Through the use of fit statistics, the Rasch model helps the user
identify any items that are not fitting the model (thereby decreasing both the validity and
reliability of the test), and any candidate whose scores do not appear to be consistent with
the model.

Table 1. Groups measurement report.

Obsvd   Obsvd   Obsvd    Fair               Model   Infit          Outfit
Score   Count   Average  Average  Measure   S.E.    MnSq   ZStd    MnSq   ZStd    N  Groups
237     63      3.8      3.77     .49       .12     1.2    0       1.2    0       3  Group 3
225     63      3.6      3.59     .32       .12     1.0    0       1.0    0       5  Group 5
203     63      3.2      3.25     .01       .12     1.3    1       1.3    1       1  Group 1
187     63      3.0      3.00     -.21      .12     1.0    0       1.0    0       2  Group 2
184     63      2.9      2.95     -.25      .11     0.7    -2      0.7    -2      6  Group 6
175     63      2.8      2.81     -.36      .11     0.8    -1      0.8    -1      4  Group 4
201.8   63.0    3.2      3.23     .00       .12     1.0    -0.2    1.0    -0.2    Mean (Count: 6)
22.5    0.0     0.4      0.35     .31       .00     0.2    1.3     0.2    1.3     S.D.

RMSE (Model) .12   Adj S.D. .29   Separation 2.44   Reliability .86
Fixed (all same) chi-square: 40.7   d.f.: 5   significance: .00
Random (normal) chi-square: 5.0   d.f.: 4   significance: .29


In Table 1, the fit statistics show that all the groups fit the model: infit and outfit scores are within the acceptable range (0.6-1.4). The separation index (2.44) of
this measure is acceptable as an indicator for separating groups because it exceeds the minimum requirement of 2.0. It can be said that this presentation test was
able to separate the groups in terms of PowerPoint presentation ability.
Judge quality analysis
A more detailed analysis of judges' behavior is shown in Table 2, the judge measurement
report. Judges are presented in descending order of leniency. When we look at the Measure
column, Judge 5 is the most lenient, followed by Judge 1, while Judge 3 is the severest
among the seven judges. According to the benchmark acceptable range for the fit statistics
(0.6-1.4) adopted for this research, in which judges' judgments of PowerPoint
presentation performance are involved, all the judges are working reasonably and none of them is outside
of the critical range for fit statistics.
The separation index (2.66) of this measure is acceptable as an indicator for separating
judges because it exceeds the minimum requirement of 2.0. It can
be said that this presentation test was able to separate the judges reasonably well. It should
be noted that the student judges as a whole do a good job, with all of them functioning at a
consistent level of severity. This is shown by the fit statistics.
Tasks/items difficulty analysis
A more detailed analysis of task/item difficulty is given in Table 3, the item measurement
report. Items are presented in descending order of difficulty. When we look at the Measure
column in Table 3, Item 8 is the most difficult item, followed by Item 7, while Item 3
is the easiest among the nine items.
The fit statistics indicate that all the items are working reasonably except Item 4,
'content', whose infit and outfit statistics are both 1.7, a value beyond the maximum of the acceptable range
(1.4). The cause of the misfit of Item 4 should be examined.
Table 2. Judges measurement report.

Obsvd   Obsvd    Fair               Model   Infit          Outfit
Score   Average  Average  Measure   S.E.    MnSq   ZStd    MnSq   ZStd    N  Judges
211     3.9      3.92     .68       .13     0.8    -1      0.8    -1      5  J5
199     3.7      3.70     .47       .13     0.8    0       0.8    0       1  J1
172     3.2      3.21     .02       .13     1.1    0       1.1    0       4  J4
168     3.1      3.13     -.04      .13     1.2    0       1.2    0       6  J6
166     3.1      3.10     -.07      .13     1.0    0       1.0    0       2  J2
149     2.8      2.79     -.33      .12     0.8    -1      0.8    0       7  J7
146     2.7      2.73     -.37      .12     1.1    0       1.1    0       3  J3
173.0   3.2      3.22     .05       .13     1.0    -0.2    1.0    -0.1    Mean (Count: 7)
22.4    0.4      0.41     .36       .00     0.2    0.9     0.2    0.8     S.D.

RMSE (Model) .13   Adj S.D. .34   Separation 2.66   Reliability .88
Fixed (all same) chi-square: 54.9   d.f.: 6   significance: .00
Random (normal) chi-square: 6.0   d.f.: 5   significance: .31

Table 3. Items measurement report.

Obsvd   Obsvd   Obsvd    Fair               Model   Infit          Outfit
Score   Count   Average  Average  Measure   S.E.    MnSq   ZStd    MnSq   ZStd    N  Questions
111     42      2.6      2.67     .47       .14     1.0    0       1.0    0       8  Graph/Sound/Anim.
113     42      2.7      2.72     .43       .14     0.8    -1      0.8    -1      7  Citations
117     42      2.8      2.82     .36       .14     1.4    1       1.4    1       1  Research & Notetaking
136     42      3.2      3.26     .02       .14     0.9    0       0.9    0       2  Plan and Storyboard
137     42      3.3      3.28     -.04      .14     0.9    0       0.8    0       9  Writing Mechanics
138     42      3.3      3.31     -.06      .14     0.9    0       0.9    0       6  Layouts
145     42      3.5      3.47     -.21      .15     0.7    -1      0.7    -1      5  Text Elements
153     42      3.6      3.66     -.38      .15     1.7    2       1.7    2       4  Content
161     42      3.8      3.84     -.56      .15     0.6    -2      0.6    -2      3  Introduction
134.6   42.0    3.2      3.23     .00       .14     1.0    -0.2    1.0    -0.2    Mean (Count: 9)
16.7    0.0     0.4      0.39     .34       .00     0.3    1.4     0.3    1.4     S.D.

RMSE (Model) .14   Adj S.D. .31   Separation 2.14   Reliability .82
Fixed (all same) chi-square: 50.2   d.f.: 8   significance: .00
Random (normal) chi-square: 8.0   d.f.: 7   significance: .33


One possible explanation is that there were ample resources for students to locate and include in their
PowerPoint presentations, because the instructional topics given by the instructor were well-known topics and easy to
research. The fair-average column suggests that the easiest item is Item 3 (Introduction),
while the hardest is Item 8 (Graphs/Sounds/Animations). One explanation for this result
is that the introduction part was structured and well organized by all groups and this item
does not require much new information. An explanation for Item 8 being hardest is that
graphs, sounds and animations are often very difficult for students with less experience in developing PowerPoint presentations, particularly introductory-level students.
The separation index (2.14) of this measure is acceptable as an indicator for separating
items because it exceeds the minimum requirement of 2.0. It can be
said that these items function well in generating the expected differentiation of groups'
performance on the scale. This is probably because the items discriminate well enough among the groups' various presentation abilities, so that even extremely good or
extremely poor groups were measured well by these items.
The Rasch analysis also provides the root mean-square standard error (RMSE) for all
non-extreme measures over groups, judges and items (0.12, 0.13 and 0.14, respectively).
These RMSE scores illustrate that the measurement errors for groups, judges and items are very
low. After the group, judge and item variances have been adjusted for measurement error,
all three adjusted standard deviations are below 1.0 (Adj. SD = 0.29, Adj. SD = 0.34 and Adj. SD = 0.31 for
groups, judges and items, respectively). The ratio of adjusted SD to RMSE, which gives the separation index (2.44 for
groups, 2.66 for judges and 2.14 for items),
is relatively high due to the low RMSEs.
The reliability statistics provided by the Rasch analysis indicate the degree to which the
analysis reliably distinguishes levels of quality among the groups, judges and items.
For groups, judges and items, Rasch analysis produced reliability scores of 0.86, 0.88 and
0.82, respectively. These reliability scores indicate that the analysis is fairly reliably
separating groups, judges and items into different levels of quality.
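As a worked check on these figures (the formulas below are the standard Rasch definitions of separation and reliability; they are not quoted in the article), the separation index is the ratio of the adjusted SD to the RMSE, and reliability follows from it. Taking the group facet as an example:

\[
G = \frac{\text{Adj. S.D.}}{\text{RMSE}} = \frac{0.29}{0.12} \approx 2.4, \qquad
\text{Reliability} = \frac{G^2}{1+G^2} \approx \frac{5.8}{6.8} \approx 0.86 ,
\]

which agrees, up to rounding of the published values, with the reported separation of 2.44 and reliability of .86 for groups; the judge and item facets check out in the same way.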
Judge/group interaction
Z-scores above +2.0 or below -2.0 would indicate an interaction effect. According to the
Bias/Interaction report from the Rasch analysis in Table 4, one judge-group combination
showed an unexpected rating pattern.
Specifically, Judge 4, with an expected score of 33.6, had an observed score of 40 on
Group 3, translating into a z-score of 2.04. On the other hand, there was no overall
statistically significant Judges-by-Groups interaction effect (chi-square = 39.9, d.f. = 42, p = .56).
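As a rough check (FACETS computes the exact value internally, so small rounding differences are expected), the z-score reported in Table 4 is approximately the bias measure divided by its model standard error:

\[
z \approx \frac{\text{Bias}}{\text{S.E.}} = \frac{0.70}{0.34} \approx 2.1 ,
\]

close to the reported value of 2.04.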
Discussion and implications
This research demonstrates that the Rasch-based analysis provides (a) the relationship
among three facets of evaluation (group presentation ability, judge severity/leniency and
item/task difficulty), (b) judge severity and fit statistics, (c) group presentation ability
with fit statistics, (d) item/task difficulty and fit statistics and (e) bias interaction analysis by model. With all or part of these pieces of information, the facets of the data can
be thoroughly investigated individually, which is not always possible in the traditional
test analysis. Rasch measurement can therefore provide complementary information that
is useful to performance assessment developers and teachers who use performance
assessment techniques in their classroom. In addition to improving overall estimates of
the dependability of assessment results, Rasch measurement allows the identification of
specific elements of assessment and research procedures that are affecting those scores.

Table 4. Bias/interaction analysis by model (Bias/Interaction analysis specified by Model: ?B, ?B, ?, RATINGS).

Obsvd  Exp.   Obsvd  Obs-Exp  Bias+    Model            Infit  Outfit
Score  Score  Count  Average  Measure  S.E.   Z-Score   MnSq   MnSq   Sq  Group (measr)   Judge (measr)
40     33.6   9      .71      .70      .34    2.04      1.2    1.2    21  Group 3 (.49)   J4 (.02)
28.8   28.8   9.0    .00      .01      .31    .00       0.9    0.9        Mean (Count: 42)
5.8    4.9    0.0    .35      .31      .02    .97       0.9    0.3        S.D.

Fixed (all = 0) chi-square: 39.9   d.f.: 42   significance: .56


Our Rasch analysis allowed the identification of specific judges, specific items/tasks and
specific combinations of judges, tasks and groups that are affecting the dependability
of judgments. Thus, the instrument tasks/items can be improved by examining the fit
statistics (infit and outfit). In addition, the assessment can be improved by having
discussions with judges and participating groups/students when the misfit elements are
found.
It can be said that teacher and peer assessment functioned well here for three reasons. One is
that groups recognize the target level of the PowerPoint presentation using the rating criteria that the instructor set up. Group members recognize that the more ratings they do as
judges, the more accurate their evaluations become. The second reason is that by taking
turns (presenter groups and judges) they must always concentrate on the in-class activities.
In other words, the two tasks (rating, presenting) always had groups actively participating
in their classroom activities. Finally, these groups, and specifically students, can learn
better how to present the topics related to the content and objectives of the course when the
rating occurs in a real classroom situation. The group members know that the better organized their
PowerPoint presentation is, the more impact the evaluations have on the
group.
In summary, (1) in-class peer assessment can be successful in keeping students on task
during the entire class period, (2) students' presentation of knowledge and communication
ability was improved by using PowerPoint presentations, (3) students can function as reasonable judges of their peers and their ability to analyze and critique the presented materials is
enhanced, (4) the Rasch model can be useful in yielding a considerable amount of significant data on several factors related to PowerPoint presentation assessment, including group
presentation ability, item difficulty, judge severity and the bias in interaction between
judges, groups and items.
The findings of the present study have several important implications for instructors
and researchers who have used performance and peer assessment techniques in their
classroom in higher education. Educational assessment is not an add-on to instruction.
Rather it is an essential part of the instructional process and it can inform both the teacher
and the learner (Asmus 1999). Educational assessment also is not an enterprise that takes
place outside the classroom; it should be one in which teachers and students are actively
engaged on a recurring basis as they articulate and apply criteria to themselves and one
another (Barrotchi and Keshavarz 2002). Performance Assessment is far more than a
procedure for demonstrating samples of student work. Using peer assessment techniques
changed the climate of the classroom and the nature of teacher-student interaction in this
course. Similar applications of performance and peer assessment in other courses and
subjects in higher education can lead to improvements in both student achievement and
course quality. Furthermore, Performance Assessment allows instruction and assessment
to be woven together in a way that more traditional approaches fail to accomplish.
Finally, this study reveals that analyzing outcomes of peer assessment through MFRM
techniques provides rich approaches suitable for assessing student performance in higher
education.

Notes on contributor
Ramazan Basturk is an assistant professor in the Faculty of Education at Pamukkale University,
Denizli, Turkey. His main interests are measurement and evaluation in education, with
particular reference to performance and peer assessment, the Rasch measurement model and teaching
statistics.


References
Asmus, E. 1999. Music assessment concepts. Music Educators Journal 3: 19-24.
Barrotchi, N., and M.H. Keshavarz. 2002. Assessment of achievement through portfolios and teacher-made tests. Educational Research 44: 279-288.
Bartsch, R.A., and K.M. Cobern. 2003. Effectiveness of PowerPoint presentations in lectures. Computers & Education 41: 77-86.
Burke, K. 1999. How to assess authentic learning. Arlington Heights, IL: Skylight Professional Development.
Engelhard, J.G. 1992. The measurement of writing ability with a many-faceted Rasch model. Applied Measurement in Education 5: 171-191.
Haney, W., and G. Madaus. 1989. Searching for alternatives to standardized tests: whys, whats, and whithers. Phi Delta Kappan 70: 683-687.
Jonassen, D.H., J. Howland, J.L. Moore, and R.M. Marra. 2003. Learning to solve problems with technology: a constructivist perspective. New Jersey: Pearson Education.
Kallick, B. 1992. Evaluation: a collaborative process. In If minds matter: a foreword to the future, eds. A. Costa, J. Bellanca and R. Fogarty, 313-319. Arlington Heights, IL: IRI/Skylights.
Koretz, D., B. Stecher, and E. Deibert. 1993. The reliability of scores from the 1992 Vermont Portfolio Assessment Program. Los Angeles, CA: University of California, CRESST: Center for the Study of Evaluation.
Linacre, J.M. 1993. Many-facet Rasch measurement. Chicago, IL: MESA Press.
Linacre, J.M. 1999. Measurements of judgments. In Advances in measurement in educational research and assessment, eds. G.N. Masters and J.P. Keeves, 244-253. Oxford: Pergamon.
Linacre, J.M., and B.D. Wright. 1993. A user's guide to FACETS: Rasch model computer program, version 2.4 for PC compatible computers. Chicago, IL: MESA Press.
Linn, R.C. 1994. Performance assessment: policy promises and technical measurement standards. Educational Researcher 23, no. 9: 4-14.
Linn, R.L., and N.E. Gronlund. 1999. Measurement and assessment in teaching. 7th edn. Columbus, OH: Merrill.
Lumley, T., and T.F. McNamara. 1993. Rater characteristics and rater bias: implications for training. Paper presented at the Language Testing Research Colloquium, Cambridge, UK, August 1993. ED: 365 091.
McCleland, D. 1994. Presentation software that delivers. Macworld 20: 144-148.
Mearoff, G.I. 1991. Assessing alternative assessment. Phi Delta Kappan 73, no. 4: 272-281.
Mehrens, W.A. 1992. Using performance assessment for accountability purposes. Educational Measurement: Issues and Practice 2, no. 1: 3-9.
Messick, S. 1994. The interplay of evidence and consequences in the validation of performance assessment. Educational Researcher 23, no. 2: 13-23.
Neil, D.M., and N.J. Medina. 1989. Standardized testing: harmful to educational health. Phi Delta Kappan 70: 688-697.
Rasch, G. 1980. Probabilistic models for some intelligence and attainment tests. Chicago, IL: MESA Press.
Schumacker, R.E. 1996. Many-facet Rasch model selection criteria: examining residuals and more. Paper presented at the Annual Meeting of the American Educational Research Association, New York.
Shepard, L.A. 1989. Why we need better assessments. Educational Leadership 46, no. 7: 4-9.
Stiggins, R.J. 1987. Design and development of performance assessment. Educational Measurement: Issues and Practice 6, no. 3: 33-42.
Stiggins, R.J. 2002. Assessment crisis: the absence of assessment for learning. Available online at http://www.pdkintl.org/kappan/k0256sti.htm (accessed 10 September 2005).
Tomlinson, C.A. 2001. Grading for success. Educational Leadership 3: 12-15.
Vandervelde, J. 2006. A+ PowerPoint rubric. Available online at http://www.uwstout.edu/soe/profdev/pptrubric.html (accessed 6 September 2005).
Wiggins, G.P. 1989. A true test: toward more authentic and equitable assessment. Phi Delta Kappan 70: 703-713.
Wiggins, G.P. 1990. The case for authentic assessment. Available online at http://ericae.net/pare/getvn.asp?v=2&n=2 (accessed 19 February 2006).
Wolf, A. 1995. Authentic assessments in a competitive sector: institutional prerequisites and cautionary tales. In Evaluating authentic assessment, ed. H. Torrance, 78-87. Buckingham: Open University Press.

Wolfe, E.W., and C.W.T. Chiu. 1997. Detecting rater effects with a multi-facet rating scale model. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Chicago, IL, 25-27 March. ED: 408 324.
Worthen, B.R. 1993. Critical issues that will determine the future of alternative assessment. Phi Delta Kappan 74, no. 6: 444-454.
Zemelman, S., H. Daniels, and A. Hyde. 1998. Best practices. Portsmouth, NH: Heinemann.
