
Scoring Situational Judgment Tests: Once You Get

the Data, Your Troubles Begin


Mindy E. Bergman*
Department of Psychology,
Texas A&M University
Fritz Drasgow
Department of Psychology, University of
Illinois at Urbana-Champaign
Michelle A. Donovan
Intel Corporation
Jaime B. Henning
Department of Psychology,
Texas A&M University
Suzanne E. Juraska
Personnel Decisions Research Institute
Although situational judgment tests (SJTs) have been in use for decades, consensus has not
been reached on the best way to score these assessments or others (e.g., biodata) whose items
do not have a single demonstrably correct answer. The purpose of this paper is to review and to
demonstrate the scoring strategies that have been described in the literature. Implementation
and relative merits of these strategies are described. Then, several of these methods are applied
to create 11 different keys for a video-based SJT in order to demonstrate how to evaluate the
quality of keys. Implications of scoring SJTs for theory and practice are discussed.
Situational judgment tests (SJTs) have been in use since
the 1920s, but have recently enjoyed a resurgence in
attention in the research literature (e.g., Chan & Schmitt,
1997, 2005; Clevenger, Pereira, Wiechmann, Schmitt, &
Harvey, 2001; Dalessio, 1994; McDaniel, Morgeson,
Finnegan, Campion, & Braverman, 2001; Olson-Buchanan,
Drasgow, Moberg, Mead, Keenan, & Donovan, 1998;
Smith & McDaniel, 1998; Weekley & Jones, 1997, 1999).
McDaniel et al. (2001) recently showed the validity of SJTs
in predicting job performance. SJTs have also been found to
provide incremental validity beyond more typically used
assessments, such as cognitive ability tests, and appear to
have less adverse impact (Chan & Schmitt, 2002; Clevenger
et al., 2001; Motowidlo, Dunnette, & Carter, 1990; Olson-
Buchanan et al., 1998; Weekley & Jones, 1997, 1999).
However, important questions still persist. One critical
issue is the selection of scoring methods (Ashworth &
Joyce, 1994; Desmarais, Masi, Olson, Barbara, & Dyer,
1994; McHenry & Schmitt, 1994). Unlike cognitive ability
tests, SJT items often do not have objectively correct
answers; many of the response options are plausible. It is a
question of which answer is "best" rather than which is
"right." However, there are many ways to determine the
best answer and consensus has not yet been reached as to
which method is superior. This paper delineates scoring
strategies, discusses their merits, and demonstrates them
for one SJT. We hope that this paper stimulates and serves
as an example for further scoring research.
What are SJTs?
SJTs are a measurement method that can be used to assess
a variety of constructs (McDaniel et al., 2001, p. 732;
McDaniel & Nguyen, 2001), although some constructs
might be particularly amenable to SJT measurement (Chan
& Schmitt, 2005). Most SJTs measure a constellation of
job-related skills and abilities, often based on job analyses
(Weekley & Jones, 1997). SJT formats vary, with some
using paper-and-pencil tests with written descriptions of
situations (Chan & Schmitt, 2002) and others using
computerized multimedia scenarios (McHenry & Schmitt,
*Address for correspondence: Mindy Bergman, Department of
Psychology (MC 4235), College Station, TX 77843-4235. E-mail:
mindybergman@tamu.edu
INTERNATIONAL JOURNAL OF SELECTION AND ASSESSMENT VOLUME 14 NUMBER 3 SEPTEMBER 2006
© 2006 The Authors
Journal compilation © 2006 Blackwell Publishing Ltd, 9600 Garsington Road, Oxford, OX4 2DQ, UK and 350 Main St, Malden, MA 02148, USA
1994; Olson-Buchanan et al., 1998; Weekley & Jones,
1997). SJT response options also vary. Some SJTs propose
solutions to problems, to which respondents rate their
agreement (Chan & Schmitt, 2002). Others offer multiple
solutions from which respondents choose the best and/or
worst option (Motowidlo et al., 1990; Olson-Buchanan
et al., 1998).
This study examines scoring issues within the context of
one SJT, the Leadership Skills Assessment (LSA). The
development of this video-based, computer-delivered SJTis
described in the Methods. Briefly, the LSA contains 21
items depicting situations in a leader-led group. Respon-
dents indicate which of four multiple-choice options they
would choose if they were the leader. Response options
vary in degree of participation in decision-making. For the
LSA, "item" refers to a video scenario (i.e., item stem) and
its four response options. Similar to Motowidlo et al.'s
(1990) procedure, correct options are scored as +1,
incorrect options as −1, and all other options as 0. A
"key" is the set of +1, −1, and 0 values assigned to the
options of every item on a test; multiple keys can be created
for a single test through different scoring approaches. A
sample item from the LSA, with a summary of the video
item stem, appears in Table 1, along with the scoring of that
item for each key described in the following section.
Scoring Strategies
Some SJT scoring issues parallel those in the biodata
literature, where consensus on optimal scoring methods
also has not yet been reached. Similar to SJTs, biodata
measures, which ask individuals to report past behaviors
and experiences, often do not have demonstrably or
objectively correct answers (Mael, 1991; Mumford &
Owens, 1987; Mumford & Stokes, 1992; Nickels, 1994).
However, differences between SJTs and biodata limit the
extent to which scoring research can generalize. The
prevailing issue in scoring SJTs, which typically contain
a handful of items, is how to choose the best response from
among the multiple-choice options for each item. In
contrast, because biodata measures are often lengthy
(sometimes containing several hundred items; Schoenfeldt
& Mendoza, 1994), much of the biodata scoring research
examines which items should be included in composites
and which should be eliminated from the assessment. Thus,
Table 1. Example item from the leadership skills assessment and its scoring across 11 keys
Summary of item stem (video scenario) and response options
At the request of the team leader, Brad has reviewed several quarterly reports. Through discussion, Brad and the team
leader come to the conclusion that Steve, another team member, has been making errors in the last several reports and is
currently working on the next one. As the leader in this situation, what would you do?
A. At the next team meeting discuss the report and inform the team that there are a few errors in it. Ask the team
members how they want to revise the report
B. Tell Steve about the errors, then work with him to fix the errors
C. Tell Steve about the errors and have him go over the report and correct the errors
D. Tell Steve that you asked Brad to look over this report and that Brad found some errors in it. Ask Steve to work with
Brad to correct the report
Key                           Score for option
                               A    B    C    D
Empirical                      0   +1    0    0
Initiating structure          −1   +1    0    0
Participation                  0   −1    0   +1
Empowerment                   +1   −1    0    0
Hybrid initiating structure   −1   +1    0    0
Hybrid participation           0    0    0   +1
Hybrid empowerment            +1    0    0    0
Vroom time-based               0    0   +1   +1
Vroom developmental            0   +1    0    0
Subject matter experts         0   +1    0   +1
Novice vs. experts             0    0   +1    0
Note: Entries in the table reflect that options are correct (+1), incorrect (−1), or unscored (0) for that particular key.
although there are similarities among the scoring issues in
SJTs and biodata, there are also important differences
between these methods and their scoring needs.
Three categories of scoring methods have been described
in the SJT literature: (1) empirical, (2) theoretical, and (3)
expert-based. Four scoring methods appear in the biodata
literature: (1) empirical, (2) rational, (3) factorial, and (4)
subgrouping. To these lists we add another method, which
we call hybrid keying. Empirical keying is common across
SJTs and biodata. Elements of the biodata literature's
rational keying appear in the SJT literature's theoretical
keying and expert-based keying. Thus, there are six distinct
strategies for scoring.
Empirical Scoring
In empirical approaches, items or options are scored
according to their relationships with a criterion measure
(Hogan, 1994). Although many specific methods exist
(Devlin, Abrahams, & Edwards, 1992; Legree, Psotka,
Tremble, & Bourne, 2005; Mumford & Owens, 1987), all
share several processes including: choosing a criterion,
developing decision rules, weighting items, and cross-
validating results. Empirical keys often have high validity
coefficients compared with other methods (Hogan, 1994;
Mumford & Owens, 1987), but also have a number of
challenges, including dependency on criterion quality
(Campbell, 1990a, b), questionable stability and general-
izability (Mumford & Owens, 1987), and capitalization on
chance (Cureton, 1950).
In effect, empirical scoring relies on item responses and
criterion scores of a sample. For assessments with non-
continuous multiple-choice options, empirical keys are
constructed by first creating dummy variables for each
option of each item, with 1 indicating that the option was
chosen. Positive (negative) dummy variable-criterion
correlations occur when options are chosen frequently by
individuals with high (low) criterion scores; zero correla-
tions indicate an option is unrelated to the criterion. The
result is a trichotomous scoring key. It should be noted that
some empirical procedures weight response options in
other ways (Devlin et al., 1992; England, 1961).
Although mathematically straightforward, decision
rules must be created in order to implement empirical
keying based on option-criterion correlations. First, a
minimum base rate of option endorsement is required
because spurious option-criterion correlations can occur if
only a handful of individuals endorse an option. Second, a
minimum correlation (in absolute value) must be set to
score an option as correct (positive correlation) or incorrect
(negative correlation). Options are scored when both
scoring standards are met. For some items, all options are
scored as 0. This may be because of low base rates for some
options, near-zero correlations between options and the
criterion, or both. Such items are particularly vexing
because it is not clear whether the underlying construct is
unrelated to the criterion, the item was poorly written, or
there were too few test-takers who had the ability to discern
the correct answer (i.e., the correct option did not meet the
minimum endorsement criterion).
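In code, the trichotomous keying logic described above might look like the following sketch. It is illustrative only: the array layout, function names, and default thresholds are our assumptions, not the authors' implementation.

```python
import numpy as np

def empirical_key(responses, criterion, min_endorse=25, min_r=0.10):
    """Build a trichotomous (+1 / 0 / -1) key from option-criterion correlations.

    responses: (N, n_items) array of chosen option indices (0..3)
    criterion: (N,) array of criterion scores
    """
    n, n_items = responses.shape
    key = np.zeros((n_items, 4), dtype=int)
    for i in range(n_items):
        for opt in range(4):
            dummy = (responses[:, i] == opt).astype(float)  # 1 = option chosen
            # Base-rate screen: too few (or non-varying) endorsements -> leave at 0.
            if dummy.sum() < min_endorse or dummy.std() == 0:
                continue
            r = np.corrcoef(dummy, criterion)[0, 1]
            if r >= min_r:        # endorsed by high scorers -> correct
                key[i, opt] = 1
            elif r <= -min_r:     # endorsed by low scorers -> incorrect
                key[i, opt] = -1
    return key

def score(responses, key):
    """Total score: sum of keyed values of each respondent's chosen options."""
    n_items = key.shape[0]
    return np.array([key[np.arange(n_items), row].sum() for row in responses])
```

Items whose options all fail the two screens simply remain keyed 0, which is the all-zero-item situation discussed above.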
For the LSA, our criteria were: (a) an option endorsement
rate of at least 25 respondents (20.3% of our eligible
sample) and (b) an option-criterion correlation of at least
.10 in absolute value. The entire sample was used
to derive the empirical key, inflating its validity coefficient
due to capitalization on chance. We used N-fold cross-
validation to address this issue (Breiman, Friedman,
Olshen, & Stone, 1984; Mead, 2000). N-fold cross-
validation holds out the responses of person j and computes
an empirical key based on the remaining N − 1 persons,
which is used to score person j. Person j is then returned
to the sample, person j + 1 is removed, and an empirical key
is created on the N − 1 remaining individuals. This
procedure is repeated N times so that every respondent
has a score free from capitalization on chance. Correlating
these holdout scores with the criterion measure provides a
validity estimate that does not capitalize on chance. N-fold
cross-validation is superior to traditional half-sample
approaches because it keys in subsamples of N − 1 (rather
than N/2) and uses all N individuals for cross-validation
(rather than the other N/2). Note that although N samples
of N − 1 people are used to estimate validity, "empirical
key" refers to the key obtained from all N individuals.
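The N-fold procedure itself is a leave-one-out loop. The sketch below is self-contained but hypothetical: `build_key` is a simplified stand-in for the empirical keying rule, and all names and thresholds are illustrative.

```python
import numpy as np

def build_key(responses, criterion, min_endorse=25, min_r=0.10):
    """Trichotomous empirical key from option-criterion correlations (illustrative)."""
    n, n_items = responses.shape
    key = np.zeros((n_items, 4), dtype=int)
    for i in range(n_items):
        for opt in range(4):
            dummy = (responses[:, i] == opt).astype(float)
            if dummy.sum() < min_endorse or dummy.std() == 0:
                continue
            r = np.corrcoef(dummy, criterion)[0, 1]
            key[i, opt] = 1 if r >= min_r else (-1 if r <= -min_r else 0)
    return key

def nfold_validity(responses, criterion, **kw):
    """Leave-one-out ('N-fold') cross-validated validity estimate.

    For each person j, a key is built from the other N - 1 respondents and
    used to score person j; correlating the holdout scores with the
    criterion yields an estimate free of capitalization on chance.
    """
    n, n_items = responses.shape
    holdout_scores = np.empty(n)
    for j in range(n):
        mask = np.arange(n) != j                      # drop person j
        key = build_key(responses[mask], criterion[mask], **kw)
        holdout_scores[j] = key[np.arange(n_items), responses[j]].sum()
    return np.corrcoef(holdout_scores, criterion)[0, 1]
```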
Theoretical Scoring
Similar to biodata's rational method (Hough & Paullin,
1994), SJT keys can reflect theory. Items and options can be
constructed to reflect theory, or theory can be used to
identify the best and worst options in a completed test.
Options reflecting (contradicting) the theory are scored as
correct (incorrect); items or options that are irrelevant or
unrelated to the theory are scored as zero. Theoretical
methods address major criticisms of empirical keying, such
as its atheoretical nature. Theoretical keys may be more
likely to generalize. However, theoretical keys might be
more susceptible to faking due to their greater transparency
(Hough & Paullin, 1994). Further, the theory might be
flawed or fundamentally incorrect.
The LSA's options reflect three graduated levels of
delegation of decision-making to the team. Keys for each of
these leadership styles were developed, as were keys based
on Vroom's (Vroom & Jago, 1978; Vroom & Yetton, 1973)
contingency model.
Theoretical Keys Based on Levels of Delegation
Three scoring keys were created by treating one particular
delegation style as the most effective across situations. For
empowerment, the group makes decisions; for participative
decision making, the leader delegates the decision, or a part
thereof, to the group (e.g., the leader allows the group to
decide, or the leader facilitates group discussion but still
holds veto power); and for initiating structure, the leader
retains the decision-making authority. In this last style, the
leader might gather information from subordinates, but the
subordinates have no voice in the decision that is made
(Liden & Arad, 1996). The empowerment key coded
empowering responses as correct and initiating structure
options as incorrect. This was reversed for the initiating
structure key. For both of these keys, all other options
were scored as zero. For the participative key, participative
responses were scored +1 and initiating structure options
−1; other options were scored as zero, including
empowerment options, because empowerment allows for
participation, but not the best kind if participative
management is the best leadership style.
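Given a style label for each option (assigned during test development), the three delegation-based keys reduce to a lookup. The style codes and function name in this sketch are hypothetical.

```python
# Hypothetical style codes: "EMP" (empowerment), "PART" (participative),
# "IS" (initiating structure), "DIST" (distractor).

def delegation_keys(option_styles):
    """Build the three theoretical keys from per-option style labels.

    option_styles: list of items, each a list of four style strings.
    Returns a dict mapping key name -> list of per-item [+1/-1/0] scores.
    """
    rules = {
        # (style scored +1, style scored -1); all other styles scored 0
        "empowerment": ("EMP", "IS"),
        "initiating_structure": ("IS", "EMP"),
        "participation": ("PART", "IS"),   # EMP left at 0, per the text
    }
    keys = {}
    for name, (best, worst) in rules.items():
        keys[name] = [
            [1 if s == best else (-1 if s == worst else 0) for s in item]
            for item in option_styles
        ]
    return keys
```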
Vroom's Decision Model Keys
The model proposed by Vroom and colleagues (Vroom,
2000; Vroom & Jago, 1978; Vroom & Yetton, 1973)
incorporates the critical notion that leader behavior should
be situation-dependent. The model describes several
factors (e.g., decision significance, team competence) that
should influence leaders' choices among five decision
strategies ranging from AI (autocratic decision by the
leader) to GII (group-based decision-making). Vroom
(2000) delineated two forms of the model. In the first,
time-related factors are emphasized; in the second, group
member development is emphasized. Although there is
considerable overlap, the two models sometimes make
different recommendations. Thus, both a time-based key
and a development-based key were created.
Each LSA scenario was rated on the factors in Vroom's
(2000) model. Raters were 15 students (five males, 10
females; nine Caucasians, six Asian/Pacific Islanders;
M age = 26.4 years) at a large Midwestern university with
substantial backgrounds in psychology (M = 20.67 classes).
Raters averaged 5.6 years in the workforce; 40% had
managerial experience (M = 2.5 years). Mean ratings on
the factors were used to navigate the Vroom (2000)
decision trees to identify the best decision strategy(-ies)
for each situation. Independently, the third author classi-
fied each of the LSA's response options into Vroom's five
decision-strategy categories. Options that matched
Vroom's recommended strategy were scored as correct;
other options were scored as zero.
Hybridized Scoring
Combining different scoring methods could potentially
increase predictive power (Mumford, 1999; Olson-
Buchanan et al., 1998). Hybrid scoring combines two
independently generated keys. Two keys could, for
example, be added at the option level, allowing a positive
score on one key to cancel out a negative score on the other.
Another hybridization approach is substitution for zeroes.
One key is designated as the primary key and the other as
secondary. The hybrid key is initially assigned the keying of
the primary key; the secondary key is used to replace only
zero scores in the primary key. Keys can also be
differentially weighted, such that one key is used with the
full scores and the other is fractionally weighted. Hybrid
keying can be implemented in a straightforward way in any
standard statistical software package.
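With keys stored as item-by-option arrays of +1/−1/0, each of the three hybridization schemes described above is a short array operation. This is a sketch under that assumed representation, not the authors' code.

```python
import numpy as np

def hybrid_additive(key_a, key_b, trichotomize=True):
    """Add two keys at the option level; a +1 on one key can cancel a -1 on the other."""
    h = key_a + key_b
    return np.sign(h) if trichotomize else h

def hybrid_substitute(primary, secondary):
    """Use the secondary key only where the primary key scores an option as zero."""
    return np.where(primary == 0, secondary, primary)

def hybrid_weighted(primary, secondary, w=0.5):
    """Keep the primary key at full weight and fractionally weight the secondary."""
    return primary + w * secondary
```

For example, adding a hypothetical empirical key `[0, +1, 0, 0]` to a participation key `[0, −1, 0, +1]` cancels the disagreement on option B and leaves only option D scored.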
Hybridizing an empirical key and a theoretical key
resolves some concerns about each because it both
recognizes theory and relies less on pure empiricism. It is
partially based on data, so there is an opportunity to
remedy flaws in theory. Empirical-theoretical hybrids can
also shed light on whether an option is unscored or
scored as zero. However, problems attendant in the
original keys' approaches transfer to hybrid keys. Further,
the choice of which keys to hybridize is not simple, as there
are theoretical (e.g., is combining these keys theoretically
justified?) and practical (e.g., is it possible to implement
this key without cross-validation?) issues to consider.
Here, three additive hybrid keys (initiating structure,
participation, and empowerment hybridized with empiri-
cal) were created; N-fold cross-validation was used to
estimate validity.
Expert-Based Scoring
Expert scoring creates keys based on the responses of
individuals with substantial knowledge about the topic.
Decision rules must be implemented in order to identify
consensus around the appropriate answer(s). Two expert-
based keys were created for the LSA.
Subject Matter Experts
The most common way to develop expert-based keys is to
ask subject matter experts (SMEs) to make judgments
about the items. In this sense, SME keying is similar to
rational keying in biodata, in that experts make judgments
about options' relevance to the criterion. SMEs examine
each item and its options to identify the best and worst
choices, which are scored as correct or incorrect, respec-
tively. All other options receive a score of zero.
For the LSA, 15 SMEs were recruited from a graduate-
level I/O psychology course at a large Midwestern
university. Respondents were well versed in leadership
theory through their coursework and reported a mean level
of 20.67 courses in psychology throughout their academic
careers. SMEs individually watched the LSA and identified
which options they believed were best or worst. We
required that at least one-third of the SMEs select an
option as best or as worst and that there also be a
five-response (i.e., one-third of the sample) differential
between the number of endorsements as best vs. worst
before an option could be scored. This allowed for multiple
correct or incorrect answers per item; however, some items
had all options scored as zero.
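The SME consensus rule just described (at least one-third of the 15 SMEs flagging an option, plus a five-response best-versus-worst differential) can be sketched per item as follows; the helper name and example counts are illustrative.

```python
def sme_key_for_item(best_counts, worst_counts, n_smes=15):
    """Score one item's options from SME 'best'/'worst' endorsement counts.

    An option is scored only if at least one-third of the SMEs flagged it
    AND the best-vs-worst margin reaches one-third of the sample
    (five responses when n_smes = 15).
    """
    third = -(-n_smes // 3)          # ceiling of n_smes / 3 -> 5 for 15 SMEs
    scores = []
    for b, w in zip(best_counts, worst_counts):
        if b >= third and (b - w) >= third:
            scores.append(1)         # consensus "best" -> correct
        elif w >= third and (w - b) >= third:
            scores.append(-1)        # consensus "worst" -> incorrect
        else:
            scores.append(0)         # no consensus -> unscored
    return scores
```

Items on which no option reaches consensus come back all-zero, as noted above.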
Comparison of Novices vs. Experts
Another expert scoring approach contrasts experts' and
novices' responses. Instead of asking for judgments about
the best and worst options, experts and novices simply
complete the assessment, choosing the best option. Options
chosen frequently by experts are scored as correct
regardless of novices' endorsement rate; options that
non-experts choose frequently but experts do not are
scored as incorrect. Following this procedure, 128
introductory psychology students and 20 master's stu-
dents in labor and industrial relations at a large Mid-
western university completed the LSA to fulfill a course
requirement. By item, if an option was picked by at least
one-third of the master's group, it was scored as correct.
Where the introductory psychology students' most fre-
quently endorsed option differed from the master's
students' choice and was endorsed by at least one-third,
the option was coded as incorrect. All other options were
scored as zero.
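A sketch of this novice-versus-expert contrast for a single item (the helper name and example counts are hypothetical):

```python
from collections import Counter

def novice_expert_key_for_item(expert_choices, novice_choices, n_options=4):
    """Score one item by contrasting expert and novice choices.

    Options chosen by at least one-third of experts are keyed +1; the
    novices' modal option is keyed -1 if it differs from the experts'
    choices and was itself endorsed by at least one-third of novices.
    """
    scores = [0] * n_options
    exp_counts = Counter(expert_choices)
    nov_counts = Counter(novice_choices)
    for opt in range(n_options):
        if exp_counts[opt] * 3 >= len(expert_choices):   # one-third of experts
            scores[opt] = 1
    nov_modal, nov_n = nov_counts.most_common(1)[0]
    if scores[nov_modal] != 1 and nov_n * 3 >= len(novice_choices):
        scores[nov_modal] = -1                           # popular with novices only
    return scores
```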
Factorial Scoring
The factorial approach forms construct-laden scales based
on factor analysis and item correlations (Hough & Paullin,
1994). Factorial approaches are used when specific,
construct-based scales are not specified a priori and items
are not assumed to measure particular constructs. This
procedure is useful when theory does not define the
relevant constructs, yet the item pool could produce
meaningful dimensions. It can also be used to winnow
item pools, as items that do not belong to a readily
identifiable factor can be dropped. We did not implement
this method.
Subgrouping
Infrequently used, subgrouping attempts to identify natu-
rally occurring groups by sorting individuals based on
similar patterns of responding to biodata items (Devlin et
al., 1992; Hein & Wesley, 1994). Subgrouping might
capture individuals' self-models, which guide their beha-
vior in a range of situations (Mumford & Whetzel, 1997).
We did not use this method for the LSA.
Evaluating Scoring Keys' Relative Utility
In biodata research, although results have been mixed,
empirical keys seem to have the greatest predictive validity,
whereas rational keys allow for greater understanding of
theory (Karas & West, 1999; Mitchell & Klimoski, 1982;
Schoenfeldt, 1999; Such & Hemingway, 2003; Such &
Schmidt, 2004; see Stokes & Searcy, 1999, for an
exception). Empirical keys display greater shrinkage in
cross-validation than other methods, such as rational
scoring, but the cross-validity coefficients of empirical
keys are still often higher than rational keys (Mitchell &
Klimoski, 1982; Schoenfeldt, 1999). Few studies compar-
ing scoring methods for one SJT have been conducted.
Paullin and Hanson (2001) compared one rational and
several empirical keys for a leadership SJT in two U.S.
military samples. There were few differences in the
predictive validity of the various keys for supervisory
ratings or promotion rate; cross-validities for the keys did
not differ. Other studies have found similar results (Krukos,
Meade, Cantwell, Pond, & Wilson, 2004; MacLane,
Barton, Holloway-Lundy, & Nickels, 2001; Such &
Schmidt, 2004).
In sum, few studies have examined the relative
effectiveness of various scoring methods. Further, the
criteria for effectiveness of a key have not been clearly
established; some studies focus on understanding the
constructs underlying a test, others emphasize prediction,
and yet others stress incremental validity. We believe that,
like any test development process, an SJT key's usefulness is
determined with several kinds of information. First,
different keys yield different scores and, therefore, different
correlations with the criterion. Obviously, high criterion-
related validity is one desideratum for a key. Keys will also
be differentially related to other predictors, such as
cognitive ability or personality; therefore, keys' incremen-
tal validities can vary. Thus, high incremental validity is a
second standard for a key. It is also important to consider
adverse impact. Keys from one test might vary in amount
of subgroup differences; minimizing these differences is
the third standard. The fourth standard for a key is
construct validity. The key should produce scores that
correlate with other measures that the SJT should correlate
with (i.e., convergent validity), but do not correlate with
measures that the SJT should not correlate with (i.e.,
discriminant validity; Campbell & Fiske, 1959). Ulti-
mately, the key should produce scores that indeed measure
the construct that the test developer intended. Note that in
the past, developers of SJTs may have paid less attention to
the construct validity of their measures than the developers
of other psychological instruments. Nonetheless, we
believe it is important for SJT researchers to begin to
articulate and investigate theoretical frameworks for their
measures.
In this study, we first evaluated the validities of the LSA's
keys. Next, we examined the keys' incremental validities in
predicting leadership and overall job performance above
and beyond the effects of personality and cognitive ability,
which are related to leadership performance (Borman,
White, Pulakos, & Oppler, 1991; Campbell, Dunnette,
Lawler & Weick, 1970; Judge, Bono, Ilies, & Gerhart,
2002). Given the ubiquity of these types of assessments in
selection, as well as their ready availability, it seems
important to demonstrate SJTs' added value beyond these
more common assessments. Third, we examined subgroup
differences across sex (sample sizes for ethnic minority
groups were too small for meaningful analysis). Finally, we
investigated one aspect of construct validity by examining
the keys' correlations with measures of cognitive ability
and personality. LSA keys, as measures of leadership skill,
should not be redundant with cognitive ability or
personality; correlations with these traits should be
modest.
Method
Participants and Procedure
Participants were 181 non-academic supervisors from a
large Midwestern university who managed other indivi-
duals, not technical functions (e.g., computer system
administrators); their departments included building ser-
vices, housing, grants and contracts, and student affairs.
All completed the LSA and other measures, except the
criterion measures, which were completed by the participants'
supervisors. Because not all criterion ratings forms were
returned, the eligible sample size for this study was 123. Of
the eligible sample, 67.5% were female. The majority were
Caucasian (91.9%), with African Americans (2.4%),
Asians (1.6%), and Hispanics (1.6%) completing the
sample (2.4% chose other or did not respond). The
median age category was 45-49 years; most participants
(80.5%) were older than 40. The sample was highly
educated, with 53.7% holding graduate or professional
degrees; 21.1% held a college degree, with or without some
graduate training; 15.4% had completed some college,
4.1% had a high school diploma with additional technical
training, and 5.7% had a high school diploma or GED
only.
The modal tenure category of the sample was greater
than 16 years (41.5%), followed by 7-10 (26.8%) and 11-
15 years (23.6%). Few reported three or less (1.6%) or 4-6
(6.5%) years of tenure. Most reported supervising small
numbers of employees, with 1-3 (37.4%), 4-6 (24.4%), or
7-10 (8.9%) subordinates. However, larger numbers of
supervisees were not uncommon, with 11-15 (7.3%), 16-
20 (7.3%), or 21 or more (14.6%) subordinates. Partici-
pants with missing data did not differ from the effective
sample on any demographic measures.
Measures
Leadership Skills Assessment. The LSA is a 21-item
computerized multimedia assessment of leadership skills.
Items are set in either an office (N = 10) or a manufacturing
(N = 11) environment. Each environment has its own four-
person team, including a leader. Leaders appear in all
scenes; other team members do not. Scenes were filmed
with the leader's back to the respondent to emphasize the
leader's point of view. Respondents indicated which of the
four response options best describes what they would do if
they were the leader in that situation.
Test development followed a method similar to Olson-
Buchanan et al. (1998) and occurred with employees at a
building products distribution company. These employees
were in leader-led teams that were encouraged to partici-
pate in decisions and were empowered to make some
decisions. Using critical incidents techniques (Flanagan,
1954), we conducted individual and group interviews with
employees and leaders (N = 43). Common and critical
leadership situations were identified from these interviews
and then summarized in descriptive paragraphs.
To generate realistic options, 53 supervisors were
presented with the descriptions and asked how they would
respond. Situations producing little response variability
were discarded. For situations that yielded variability,
responses were clustered into styles based on Liden and
Arads (1996) modified version of Vrooms (2000; Vroom
& Jago, 1978; Vroom & Yetton, 1973) model, which
includes three graduated levels of delegation: the decision is
(a) made entirely by the group (empowerment); (b)
delegated from the leader to the group (participative); or
(c) controlled entirely by the leader (initiating structure).
Short descriptions of each response cluster were written
and became candidate multiple-choice options for the
scenarios.
The situations and their candidate multiple-choice
options were presented to employees (N = 36) who were
asked to identify the option that they thought: (1) was the
best way of responding; and (2) others would most likely
select as the best way of responding (i.e., the socially
desirable response). Situations with little variability for
either question across candidate multiple-choice options
were discarded. For each scene, one option was selected for
each level of delegation; the fourth option was a distractor
(e.g., the leader ignores the problem).
Using details from the interview sessions that began this
test development process, scripts were written to expand
on the paragraph summaries of the item stems. After
revision for clarity and conversational tone, scripts were
filmed at a large Midwestern university using local actors.
The Wonderlic Personnel Test (WPT). Cognitive
ability was assessed by the computerized version of the
WPT (for validity evidence, see Dodrill, 1983; Dodrill &
Warner, 1988; Wonderlic Personnel Test Inc., 1992). Test-
retest reliabilities for the WPT range from .82 to .94
(Dodrill, 1983; Wonderlic Personnel Test Inc., 1992).
Sixteen Personality Factor Questionnaire (16PF). Per-
sonality was assessed with the computerized 16PF, Fifth
Edition (Cattell, Cattell, & Cattell, 1993), which contains
185 items that can be mapped onto 16 primary factors and/
or five global factors. We used the global factors:
introversion, independence (likened to the Big 5 agreeable-
ness dimension), anxiety (neuroticism), self-control
(conscientiousness), and tough-mindedness (openness).
Reliabilities are: Extraversion (.90), Anxiety (.90), Self-
Control (.88), Independence (.81), and Tough-Mindedness
(.84) (S. Bedwell, personal communication, January
17, 2005).
Criterion Measures. A six-item scale, Empowering
Leadership, was based on Arnold, Arad, Rhoades, and
Drasgow's (2000) research on the six dimensions of
empowerment. It assessed the main criterion variable:
leadership performance. Supervisors of the participants
were contacted and asked to indicate how often the
participants engaged in empowering leadership behaviors
(1 = never; 7 = always). Coefficient alpha was .95. This
measure was used to create the empirical key.
Supervisors were asked to rate the overall performance
of participants (1 = very poor; 7 = very good) using three
items, yielding the Overall Job Performance scale. Coeffi-
cient alpha was .96.
Results
Descriptive statistics and correlations for all keys and other
measures appear in Table 2.
Validity
The keys provided a wide range of validity coefficients
(−.03 to .32; see Table 2) and generally had stronger
relationships with Empowering Leadership than the Over-
all Job Performance scale. The empirical, SME, and hybrid
participation keys all showed significant if moderate
correlations with leadership ratings. No other keys were
significantly related to the criteria.
Incremental Validity Analysis
The contribution of each of the keys to the prediction of
leadership performance was tested through a series of
planned hierarchical regression analyses. Each key had its
own set of regressions. For every key, the hierarchical
regressions proceeded by first regressing leadership ratings
onto the WPT. Next, the five personality scales were
entered as a block. Then, each key was entered at the last
step in each hierarchical regression. Steps 1 and 2 and their
results are the same for each set of regressions, as the keys
were entered only in the last step of each set.
The results (Table 3) show that the WPT significantly
predicted leadership ratings, accounting for approximately
10% of the variance in leadership ratings. Personality
factors were not significantly related to the leadership
criterion. Incremental validity of each of the 11 different
keys entered in the third steps in the 11 sets of hierarchical
regressions was assessed by comparing the change in R
2
between Step 2, which included both the WPT and the
16PF, and Step 3, the final equation that also included an
SJT key. The results indicate that four keys (empirical,
2.3%; hybrid initiating structure, 3.0%; hybrid participa-
tion, 1.7%; and SME, 4.9%) accounted for significant
additional variance in leadership ratings.
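The incremental-validity computation used in these analyses can be sketched in code. The following is an illustrative pure-Python sketch, not the authors' analysis scripts; the tiny data set, function names, and variable names are hypothetical. It fits a baseline model and a fuller model by ordinary least squares, then forms the F statistic for the change in R².

```python
# Illustrative sketch of an incremental-validity (Delta R^2) test:
# regress the criterion on a baseline block, then baseline + new predictor,
# and test the change in R^2 with an F statistic.

def ols_r2(X, y):
    """R^2 from OLS of y on the columns of X (an intercept is added)."""
    n = len(y)
    Xd = [[1.0] + list(row) for row in X]
    k = len(Xd[0])
    # Normal equations A b = c, solved by Gaussian elimination with pivoting.
    A = [[sum(Xd[i][p] * Xd[i][q] for i in range(n)) for q in range(k)] for p in range(k)]
    c = [sum(Xd[i][p] * y[i] for i in range(n)) for p in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for q in range(col, k):
                A[r][q] -= f * A[col][q]
            c[r] -= f * c[col]
    b = [0.0] * k
    for r in range(k - 1, -1, -1):
        b[r] = (c[r] - sum(A[r][q] * b[q] for q in range(r + 1, k))) / A[r][r]
    yhat = [sum(bq * x for bq, x in zip(b, row)) for row in Xd]
    ybar = sum(y) / n
    ss_tot = sum((v - ybar) ** 2 for v in y)
    ss_res = sum((v - h) ** 2 for v, h in zip(y, yhat))
    return 1.0 - ss_res / ss_tot

def f_change(r2_full, r2_reduced, n, k_full, k_added):
    """F statistic for the change in R^2 when k_added predictors join a model
    with k_full total predictors, fit on n cases."""
    num = (r2_full - r2_reduced) / k_added
    den = (1.0 - r2_full) / (n - k_full - 1)
    return num / den
```

In the paper's Step 2 versus Step 3 comparison, the baseline block would contain the WPT and the five 16PF factors and the added predictor would be one SJT key; the degrees of freedom then match those given in the notes to Table 3.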
Subgroup Differences by Sex
Table 4 contains the results of sex differences analyses. Job
performance ratings were nearly identical across sex. Only
two predictors (extraversion, toughmindedness) exhibited
significant mean differences across sex and produced
medium effect sizes. Although some keys had larger
cross-sex effect sizes than others, examination of the effect
sizes indicates that the differences across sex were generally
small (.3 or less) for all keys. Building on the results
described in the previous section, additional regression
analyses were conducted that included the main effect for
sex and the sex-by-key interactions. The inclusion of these
variables assesses the across-sex equality of intercepts and
slopes, respectively (Cleary, 1968). None of these
parameters were significant or produced even a moderate
effect size. Thus, none of the keys seem likely to produce
practically significant levels of adverse impact vis-à-vis sex.
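A Cleary-style check of this kind can be illustrated with a small sketch. This is a hedged, hypothetical example (fabricated data, and separate per-group regressions rather than the single moderated regression the authors ran): if the two subgroups share one regression line relating key score to criterion, the key shows no predictive bias across groups.

```python
# Sketch of a Cleary-style fairness check: fit the criterion-on-key
# regression within each subgroup and compare slopes and intercepts.
# Both fabricated groups below are generated from the same line y = 2x + 1.

def fit_line(xs, ys):
    """Least-squares slope and intercept of y regressed on x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical key scores and criterion ratings for two subgroups,
# both drawn from y = 2x + 1 plus small alternating noise.
group_a_x = [1, 2, 3, 4, 5, 6]
group_a_y = [2 * x + 1 + e for x, e in zip(group_a_x, [0.1, -0.1] * 3)]
group_b_x = [2, 3, 4, 5, 6, 7]
group_b_y = [2 * x + 1 + e for x, e in zip(group_b_x, [-0.1, 0.1] * 3)]

slope_a, int_a = fit_line(group_a_x, group_a_y)
slope_b, int_b = fit_line(group_b_x, group_b_y)
# Near-zero gaps are consistent with across-group equality of the
# regression lines, i.e., no slope or intercept bias.
slope_gap = abs(slope_a - slope_b)
intercept_gap = abs(int_a - int_b)
```

In the moderated-regression form actually reported, the same question is asked by testing the sex main effect (intercepts) and the sex-by-key interaction (slopes) in one equation.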
Discriminant Validity
Table 2 contains correlations of the various keys with the
WPT and the personality measures. No correlation exceeds
.32 in absolute value, suggesting that the LSA is not overly
redundant with cognitive ability or personality.
Best Keys for the LSA
Based on test validation standards, we can make recommendations regarding the best keys for the LSA in this
sample. The empirical, SME, and hybrid participation keys
all predicted the leadership criterion. These three keys, as
well as the hybrid initiating structure key, provided
significant incremental validity over cognitive ability and
personality measures. None of the keys showed subgroup
differences by sex and all of the keys showed discriminant
validity. Therefore, for users of the LSA, we would
recommend the empirical, SME, and hybrid participation
keys. However, it is important to note that these results
must be replicated in other samples before it can be
determined whether these particular keys are the best
across samples, settings, and uses.
Discussion
Our analyses demonstrated various approaches to creating
and evaluating SJT keys. The wide variability² in validity
coefficients across the 11 keys emphasizes the importance
of the choice of scoring method. Although some keys had
Table 2. Means, standard deviations, and correlations of variables

Variable                         Mean (SD)       0 items(a)   1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16   17   18
1. Wonderlic                     27.31 (5.77)    –
2. Extraversion                  5.13 (1.89)     –            02
3. Anxiety                       4.55 (1.85)     –            04  -30
4. Tough-mindedness              5.39 (2.05)     –            16  -33   12
5. Independence                  5.20 (1.69)     –            04   34  -14  -26
6. Self-control                  5.81 (1.47)     –            21  -28  -04   46  -22
7. Empirical                     3.30 (2.18)     12           32  -05  -06  -10   09  -18
8. Initiating structure          -1.09 (4.67)    0           -15  -12   06   13  -12   03  -33
9. Participation                 -.38 (2.97)     0            17   25  -16  -24   13  -11   32  -68
10. Empowerment                  1.09 (4.67)     0            15   12  -06  -13   12  -03   33 -100   68
11. Hybrid initiating structure  .64 (3.98)      1            00  -13   06   09  -11  -06   11   89  -55  -89
12. Hybrid participation         1.21 (3.26)     0            23   21  -14  -28   15  -23   64  -64   89   64  -35
13. Hybrid empowerment           3.72 (4.98)     0            23   09  -06  -17   13  -11   59  -93   69   93  -69   78
14. Vroom time-based             8.89 (1.66)     4            21  -15   08   10  -01   02   09   37  -29  -37   45  -23  -31
15. Vroom developmental          7.45 (1.48)     6            22  -03   03  -03   05  -02   39  -40   21   40  -21   33   50   27
16. Subject matter experts       8.67 (2.30)     3            32   10  -11  -14  -04  -12   51  -32   15   32  -10   31   43  -20   14
17. Novice vs. experts           9.20 (2.54)     0            09  -03  -09  -06   01  -02   04   21  -33  -21   23  -22  -19   22   39   12
18. Leadership rating            34.85 (5.01)    –            32   07  -01  -08   00  -15   25  -03   04   03   17   22   17   07   13   32   12
19. Overall performance rating   18.59 (2.94)    –            26  -11   12   01  -05  -01   15  -03  -01   03   11   10   11   11   11   22   04   72

Notes: Decimal points omitted in correlations. Entries with an absolute value of .19 or greater are significant at p < .05.
(a) "0 items" refers to the number of items in the leadership skills assessment (LSA) key that are unscored. A dash indicates that this variable is not relevant to the particular scale.
statistically significant validity coefficients, many more did
not. Obviously, with only one SJT and one sample, we
cannot reach abiding conclusions about the goodness of
scoring approaches for all SJTs, or the boundary conditions
under which some approaches might be better than others.
What we can conclude is that the validity of an SJT depends
in part on its scoring, and that poor choices could lead to
the conclusion that the SJT's content is not valid when it
may only be the scoring key that is not valid.
Our recommendation to carefully follow standard
validation procedures is not surprising. However, doing
so may be especially important for SJTs. Although a key
may be criterion-related, it might add little value once
cognitive ability and personality measures (which are
widely available and relatively inexpensive) are accounted
for. Because some SJTs, especially multimedia, computerized
SJTs, are costly to construct and, importantly, to
administer, it is not enough to know whether an SJT
predicts a criterion; it must also provide incremental value.
Further, because of the difficulty in determining correct
answers, organizations facing legal challenges to their SJT
use will need to be able to explain not only why the SJT's
content is job relevant but also why a particular scoring
strategy was used. Careful validation should both minimize
legal challenges and help organizations survive those that
do arise.
The challenging aspects of scoring are likely to increase
exponentially as the breadth of the SJT increases. Although
empirical keying could proceed in the same general fashion
regardless of the breadth of the SJT (assuming that the
criterion was not deficient), other scoring strategies would
likely become more complicated. For every content subset
in an SJT, there will be different sets of theories to apply for
theoretical keying and different SMEs to query for expert
keying. The various keys for the subtests could be
combined in more or less optimal ways, such that the best
key for one subtest depends in part on the key for another
subtest. Hybrid scoring systems could then be applied to
the various keys, carrying over these same concerns. To
complicate matters, non-linear scoring (e.g., Breiman
et al.'s (1984) classification and regression tree analysis)
might lead to the highest validity. In short, as the breadth
of an SJT increases, scoring can become more complex.
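To make the tree-based remark concrete, here is a minimal regression-stump sketch: a one-split tree fit to fabricated scores. Real classification and regression tree analysis grows and prunes much deeper trees, so this is only an illustration of how a non-linear rule can relate responses to a criterion where a single linear weight cannot.

```python
# Minimal regression stump in the spirit of CART: choose the split on a
# response score that minimizes the squared error of two piecewise means.
# Data are fabricated for illustration.

def best_stump(x, y):
    """Return (threshold, left_mean, right_mean) minimizing squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((v - ml) ** 2 for v in left)
               + sum((v - mr) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, ml, mr)
    return best[1:]
```

For example, if the criterion jumps from 0 to 10 once a score exceeds 3, the stump recovers that threshold exactly, which no single linear weight on the score can express as cleanly.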
These issues speak to the importance of test development.
A well-constructed SJT is one that would reflect clear content
domains, rather than contain a hodgepodge of items. As
difficult as scoring SJTs becomes as the breadth of the test
increases, it would be even more complicated if specific
content areas could not be identified. Without clear content
domains, there is little guidance as to where test constructors
should look for theories to determine the scoring key or how
SMEs should think about the meanings of the items. Thus,
although broader SJTs are likely to have more scoring
difficulties than narrower ones, some of these problems can
be ameliorated if the SJT is carefully constructed to reflect
rational, if not theoretical, content domains.
Table 3. Hierarchical regressions

Step  Variables entered              β      t       R²     F(a)    ΔR²    F for ΔR²(b)
1.    Wonderlic                     .32   13.09*   .101   13.61*    –       –
2.    Extraversion                  .08     .76    .112    2.44*   .011    .29
      Anxiety                       .01     .14
      Toughmindedness               .01     .13
      Independence                  .03     .26
      Self-control                  .07     .68
3a.   Empirical                     .17    1.77    .136    2.58*   .023   3.06*
3b.   Initiating structure         -.03    -.33    .113    2.09*   .001    .13
3c.   Participation                 .03     .35    .113    2.09*   .001    .13
3d.   Empowerment                   .03     .33    .113    2.09*   .001    .13
3e.   Hybrid initiating structure   .18    2.01*   .142    2.72*   .030   4.02*
3f.   Hybrid participation          .14    1.51    .129    2.44*   .017   2.24*
3g.   Hybrid empowerment            .10    1.14    .122    2.28*   .010   1.31
3h.   Time-based Vroom              .02     .25    .113    2.08    .001    .13
3i.   Development-based Vroom       .07     .75    .116    2.17    .004    .52
3j.   SMEs                          .24    2.60*   .161    3.16*   .049   6.72*
3k.   Novice vs. experts            .09    1.05    .121    2.25*   .009   1.18

Notes: Only the incremental additions to the hierarchical regressions are shown. Steps 1 and 2 were the same across all sets of regressions.
(a) Degrees of freedom for F tests were: step 1 (1, 121), step 2 (6, 116), step 3 (7, 115).
(b) Degrees of freedom for F tests of the change in R² were: step 2 (5, 116), step 3 (1, 115).
* p < .05.
Further, test administration can affect responses. For
example, different instruction sets lead to different
responses that, at the key level, are differentially related
to criteria (McDaniel & Nguyen, 2001; Ployhart &
Ehrhart, 2003). Even instructions that differ in seemingly
minor ways, such as 'identify the best option' and
'identify what you would do' (what Ployhart and Ehrhart
(2003) referred to as 'should do' and 'would do'
instructions), appear to lead to different responses. This
suggests that keys and the constructs they represent vary
not only due to chosen scoring strategies but also because
of the ways that respondents approach the assessment.
Different keys can lead to an SJT assessing different
constructs even when it was designed to measure performance
in a specific domain, such as our LSA. For the LSA, we keyed
three different theoretical constructs: initiating structure,
participation, and empowerment. On one hand, these keys all
reflect the domain of leadership skills. On the other hand,
these keys refer to different components of the leadership
skills domain, and it could be argued that they represent
different constructs. For empirical keys, as well as
contingency theory keys such as the Vroom (2000) keys, different
domains could provide the best answer for different questions.
Because of the potentially multi-dimensional nature of
SJTs as well as the many constructs that this method can be
applied to, it may be difficult to conduct meta-analyses on
some of the questions raised here (Hesketh, 1999). To
minimize this potential problem, researchers should
include in their reports, in addition to standard validity
coefficients and effect sizes, descriptions of: (a) the
domain(s) that the items of the SJT measure; (b) the scoring
methods used; and, (c) the instruction set for the SJT.
Without this information, it will be impossible for future
meta-analytic efforts to reach any meaningful conclusions.
Limitations
As with any study, this one has limitations. First, the small
sample size makes strong conclusions difficult. Larger
samples would allow for greater confidence in the results. A
larger sample might also permit additional analyses, such as
subgroup differences across ethnic groups. Further, the
low power afforded by the small sample size makes it
difficult to interpret differences in the validities of the keys.
However, because our predictors and criteria were
collected from different sources (managers and their
supervisors, respectively), some problems common to
concurrent validation, such as common method bias,
are not at issue here.
Further, we must acknowledge that a different empirical
key could emerge with a different or larger sample. The
minimum endorsement criterion is, in part, dependent on
the sample size (i.e., one should not require a minimum
endorsement rate that is unachievable in a particular
sample). Additionally, a sample from a different population
might lead to different results. Large samples from diverse
populations should improve the stability and generalizability
of keys.
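The minimum-endorsement logic can be sketched as follows. This is an illustrative Python sketch, not the authors' exact procedure; the cutoffs (`min_rate`, `min_r`), the +1/0/-1 weights, and the toy data are all hypothetical choices made for the example.

```python
# Hedged sketch of empirical keying with a minimum-endorsement screen:
# an option is weighted +1 or -1 only if enough respondents chose it and
# its endorsement correlates with the criterion beyond a cutoff; otherwise 0.

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 for a constant vector."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)

def empirical_key(endorsements, criterion, min_rate=0.1, min_r=0.15):
    """endorsements: {option: [0/1 per respondent]} -> {option: -1/0/+1}."""
    n = len(criterion)
    key = {}
    for option, chosen in endorsements.items():
        rate = sum(chosen) / n
        r = pearson(chosen, criterion)
        if rate < min_rate:
            key[option] = 0    # unscored: too few endorsements to trust r
        elif r >= min_r:
            key[option] = 1    # endorsing goes with higher criterion scores
        elif r <= -min_r:
            key[option] = -1   # endorsing goes with lower criterion scores
        else:
            key[option] = 0    # scored zero: weak criterion relation
    return key

# Hypothetical data: 8 respondents with criterion ratings 1-8.
criterion = [1, 2, 3, 4, 5, 6, 7, 8]
endorsements = {
    "A": [0, 0, 0, 0, 1, 1, 1, 1],   # chosen mostly by high performers
    "B": [1, 1, 1, 0, 0, 0, 0, 0],   # chosen mostly by low performers
    "C": [0, 0, 0, 1, 0, 0, 0, 0],   # chosen by a single respondent
}
key = empirical_key(endorsements, criterion, min_rate=0.25)
```

The sample-size dependence discussed above is visible here: option C is left unscored only because its endorsement rate falls below the screen, a decision that could flip in a larger sample.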
Table 4. Subgroup differences across sex

                              Mean (SD), male   Mean (SD), female     t      d
Wonderlic Personnel Test      27.73 (5.91)      27.11 (5.73)         .55    .11
Extraversion                  4.50 (1.77)       5.43 (1.88)         2.63*   .51
Anxiety                       4.49 (2.02)       4.59 (1.77)          .27    .05
Tough-mindedness              6.30 (2.05)       4.96 (1.91)         3.59*   .69
Independence                  5.20 (1.48)       5.20 (1.79)          .02    .003
Self-control                  6.11 (1.47)       5.67 (1.46)         1.55    .30
Empirical                     3.70 (2.16)       3.11 (2.18)         1.42    .27
Initiating structure          -1.33 (4.68)      -.98 (4.70)          .39    .07
Participation                 -.40 (3.02)       -.37 (2.96)          .05    .01
Empowerment                   1.33 (4.68)       .98 (4.70)           .39    .07
Hybrid initiating structure   .78 (4.53)        .58 (3.71)           .26    .05
Hybrid participation          1.10 (3.25)       1.27 (3.28)          .26    .05
Hybrid empowerment            4.03 (4.56)       3.58 (5.20)          .46    .09
Time-based Vroom              9.18 (1.68)       8.76 (1.64)         1.31    .25
Development-based Vroom       7.53 (1.40)       7.41 (1.52)          .40    .08
SMEs                          8.30 (1.52)       8.86 (2.58)         1.26    .24
Novice vs. expert             9.08 (2.69)       9.27 (2.48)          .39    .07
Leadership ratings            34.73 (5.01)      34.90 (5.04)         .58    .11
Overall performance ratings   18.60 (2.45)      18.59 (3.17)         .10    .02

Note: N = 40 male, 83 female; degrees of freedom = 121.
* p < .05.
Most notably, we cannot draw a firm conclusion, based
on this single sample in this single organization at this
single time using a single SJT, about which scoring method
is best. There are likely to be boundary conditions on the
best scoring method, based on test content, test instructions,
response options, and the like, that will aid in the
determination of the best scoring method for this SJT and
other assessments used in the future. However, this paper
provides a useful guide to both (a) the methods that are
currently available to create SJT keys and (b) the ways to
evaluate the relative effectiveness of keys.
Practical Issues in SJT Scoring
One important issue in all scoring is that there are other
keys that could be constructed. Although there is not an
infinite number of keys for an assessment, the number of
possible permutations of the pattern of scoring options as
correct, incorrect, and zero across the options and items is
very large for a test of any reasonable length; a non-trivial
number of these mathematically possible keys are
likely to make some rational or theoretical sense. Further,
different criteria, approaches, and minimum scoring
requirements could lead to a multitude of empirical keys
other than the ones constructed. In short, there are many
ways to create a key within each general scoring strategy.
How one chooses keying systems should depend on the
test's intended use, theory (not just for theoretical keys, but
also to determine which scoring strategies are most useful),
and practical considerations.
Potential Effects of Studying Keys on
Broader Theory
One potential application of studying keys could be
providing support for theories about the content domain
of assessments. Support for a theory would be found when
empirical and theoretical keys overlap greatly in their
identification of the best options. For example, the LSA could
be used to provide support for a theory of leadership, such as
Vroom's (2000; Vroom & Jago, 1978; Vroom & Yetton,
1973). The extent to which the empirical key (which, by
design, is related to leadership performance) and the
theoretical key identify the same best and worst options
would indicate support for the theory.
The utility of this approach hinges on three issues. First,
the criterion measure must be reasonably construct valid so
that an effective and appropriate empirical key is created.
Second, the theoretical key must be developed carefully so
that it accurately reflects the theory. Finally, unscored
items on the empirical key must be minimized through the
use of a large sample. This is necessary so that there are
enough opportunities to evaluate the congruence of the
empirical and theoretical keys. Additionally, the unscored
options on the empirical key must be examined so that it is
clear whether they are unscored due to low correlations or
due to low endorsement rates. Options that have low
correlations and meet the minimum endorsement rate on the
empirical key are more informative about the option and its
possible relation to theory than options that are not scored
because they have not met the minimum endorsement
criterion. It may be useful to think of the first case as a score
of 'zero' and the second as 'unscored', because in the second
case it is unclear what the score would be if the minimum
endorsement requirement were met. As noted in the
introduction, there are many reasons why an option is scored
as zero; some of these reasons are mitigated when the
minimum endorsement criterion is met.
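One possible congruence check along these lines can be sketched briefly. This is a hypothetical illustration (the function, data, and agreement metric are not from the paper): options unscored for low endorsement are excluded from the comparison, because their empirical relation to the criterion is unknown, while options scored zero for weak correlations remain comparable.

```python
# Sketch of a congruence index between an empirical and a theoretical key:
# the proportion of comparable options on which the two keys agree.

def key_congruence(empirical, theoretical, low_endorsement):
    """empirical/theoretical: {option: -1/0/+1}. Options in low_endorsement
    were unscored only for too few endorsements and are excluded."""
    comparable = [opt for opt in empirical if opt not in low_endorsement]
    if not comparable:
        return 0.0
    agree = sum(1 for opt in comparable if empirical[opt] == theoretical[opt])
    return agree / len(comparable)
```

High congruence on the comparable options would count as support for the theory behind the theoretical key, subject to the three conditions discussed above.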
Conclusion
We described and illustrated a process for determining the
best key(s) from among many possible keys. Keys should be
assessed for validity, incremental validity, adverse impact,
and construct validity as described in this paper. From these
analyses, the best key(s) can be identified. Although this
validation strategy seems basic, studies in the SJT literature
have rarely addressed the potential differential validity of
the multiple keys available for a given test. As demonstrated
here, it is essential that researchers critically
evaluate their SJT keying choices.
As we noted at the start of this paper, the major purpose
of this paper is to stimulate research on the topic of keying
in SJTs. We have reviewed the six general approaches to
scoring that have been examined or discussed in the SJT or
biodata scoring literatures to date, and we have demonstrated
four of them. Other scoring strategies might be
developed in the future, which will expand the possible
repertoire of scoring methodologies. Our goal is to
encourage SJT developers and researchers to investigate
and implement multiple scoring methods in their research
and to publish the various results of these keys. Ideally, in
10 years' time, we would be able to revisit this topic to
conduct a meta-analysis on scoring strategies in order to
assess which approach is best.
Notes
1. Although the empowerment and initiating structure
keys are perfectly negatively correlated, both are
described because they were used in hybrid scoring;
the hybrid keys were not perfectly negatively correlated.
For ease of comparison, both the empowerment and the
initiating structure keys are presented here and included
in the analyses.
2. We must acknowledge that, due to our small sample size,
sampling variability and error could also contribute to the
variability of validity coefficients across the keys.
References

Arnold, J.A., Arad, S., Rhoades, J.A. and Drasgow, F. (2000) The empowering leadership questionnaire: The construction of a new scale for measuring leader behaviors. Journal of Organizational Behavior, 21, 249–269.
Ashworth, S.D. and Joyce, T.M. (1994) Developing score protocols for a computerized multimedia in-basket exercise. Paper presented at the Ninth Annual Conference of the Society for Industrial and Organizational Psychology, Nashville, TN, April.
Borman, W.C., White, L.A., Pulakos, E.D. and Oppler, S.H. (1991) Models of supervisory job performance ratings. Journal of Applied Psychology, 76, 863–872.
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984) Classification and regression trees. Belmont, CA: Wadsworth.
Campbell, D.T. and Fiske, D.W. (1959) Convergent and discriminant validation by the multitrait–multimethod matrix. Psychological Bulletin, 56, 81–105.
Campbell, J.P. (1990a) Modeling the performance prediction problem in industrial and organizational psychology. In M.D. Dunnette and L.M. Hough (Eds), Handbook of industrial and organizational psychology, Vol. 1 (pp. 687–732). Palo Alto: Consulting Psychologists Press.
Campbell, J.P. (1990b) The role of theory in industrial and organizational psychology. In M.D. Dunnette and L.M. Hough (Eds), Handbook of industrial and organizational psychology, Vol. 1 (pp. 39–73). Palo Alto: Consulting Psychologists Press.
Campbell, J.P., Dunnette, M.D., Lawler, E.E. III and Weick, K.E. (1970) Managerial behavior, performance, and effectiveness. New York: McGraw-Hill.
Cattell, R.B., Cattell, A.K. and Cattell, H.E. (1993) Sixteen Personality Factor Questionnaire (5th Edn). Champaign, IL: Institute for Personality and Ability Testing, Inc.
Chan, D. and Schmitt, N. (1997) Video-based versus paper-and-pencil method of assessment in situational judgment tests: Subgroup differences in test performance and face validity perceptions. Journal of Applied Psychology, 82, 143–159.
Chan, D. and Schmitt, N. (2002) Situational judgment and job performance. Human Performance, 15, 233–254.
Chan, D. and Schmitt, N. (2005) Situational judgment tests. In A. Evers, N. Anderson and O. Voskuijl (Eds), Handbook of personnel selection (pp. 219–246). Oxford: Blackwell.
Cleary, T.A. (1968) Test bias: Prediction of grades of Negro and White students in integrated colleges. Journal of Educational Measurement, 5, 115–124.
Clevenger, J., Pereira, G.M., Wiechmann, D., Schmitt, N. and Harvey, V.S. (2001) Incremental validity of situational judgment tests. Journal of Applied Psychology, 86, 410–417.
Cureton, E.E. (1950) Validity, reliability, and baloney. Educational and Psychological Measurement, 10, 94–96.
Dalessio, A.T. (1994) Predicting insurance agent turnover using a video-based situational judgment test. Journal of Business and Psychology, 9, 23–32.
Desmarais, L.B., Masi, D.L., Olson, M.J., Barbara, K.M. and Dyer, P.J. (1994) Scoring a multimedia situational judgment test: IBM's experience. Paper presented at the Ninth Annual Conference of the Society for Industrial and Organizational Psychology, Nashville, TN, April.
Devlin, S.E., Abrahams, N.M. and Edwards, J.E. (1992) Empirical keying of biographical data: Cross-validity as a function of scaling procedure and sample size. Military Psychology, 4, 119–136.
Dodrill, C.B. (1983) Long-term reliability of the Wonderlic Personnel Test. Journal of Consulting and Clinical Psychology, 51, 316–317.
Dodrill, C.B. and Warner, M.H. (1988) Further studies of the Wonderlic Personnel Test as a brief measure of intelligence. Journal of Consulting and Clinical Psychology, 56, 145–147.
England, G.W. (1961) Development and use of weighted application blanks. Dubuque: Brown.
Flanagan, J.C. (1954) The critical incident technique. Psychological Bulletin, 51, 327–358.
Hein, M. and Wesley, S. (1994) Scaling biodata through subgrouping. In G.S. Stokes, M.D. Mumford and W.A. Owens (Eds), Biodata handbook: Theory, research, and use of biographical information in selection and performance prediction (pp. 171–196). Palo Alto: Consulting Psychologists Press.
Hesketh, B. (1999) Introduction to the International Journal of Selection and Assessment special issue on biodata. International Journal of Selection and Assessment, 7, 55–56.
Hogan, J.B. (1994) Empirical keying of background data measures. In G.S. Stokes, M.D. Mumford and W.A. Owens (Eds), Biodata handbook: Theory, research, and use of biographical information in selection and performance prediction (pp. 69–107). Palo Alto: Consulting Psychologists Press.
Hough, L. and Paullin, C. (1994) Construct-oriented scale construction: The rational approach. In G.S. Stokes, M.D. Mumford and W.A. Owens (Eds), Biodata handbook: Theory, research, and use of biographical information in selection and performance prediction (pp. 109–145). Palo Alto: Consulting Psychologists Press.
Judge, T.A., Bono, J.E., Ilies, R. and Gerhart, M.W. (2002) Personality and leadership: A qualitative and quantitative review. Journal of Applied Psychology, 87, 765–780.
Karas, M. and West, J. (1999) Construct-oriented biodata development for selection to a differentiated performance domain. International Journal of Selection and Assessment, 7, 86–96.
Krukos, K., Meade, A.W., Cantwell, A., Pond, S.B. and Wilson, M.A. (2004) Empirical keying of situational judgment tests: Rationale and some examples. Paper presented at the 19th Annual Conference of the Society for Industrial and Organizational Psychology, Chicago, IL.
Legree, P.J., Psotka, J., Tremble, T. and Bourne, D.R. (2005) Using consensus based measurement to assess emotional intelligence. In R. Schulze and R.D. Roberts (Eds), Emotional intelligence: An international handbook (pp. 155–180). Cambridge, MA: Hogrefe and Huber.
Liden, R.C. and Arad, S. (1996) A power perspective of empowerment and work groups: Implications for human resources management research. In G.R. Ferris (Ed.), Research in personnel and human resources management (pp. 205–252). Greenwich, CT: JAI Press.
MacLane, C.N., Barton, M.G., Holloway-Lundy, A.E. and Nickels, B.J. (2001) Keeping score: Expert weights on situational judgment responses. Paper presented at the 16th Annual Conference of the Society for Industrial and Organizational Psychology, San Diego, CA.
Mael, F.A. (1991) A conceptual rationale for the domain and attributes of biodata items. Personnel Psychology, 44, 763–792.
McDaniel, M.A., Morgeson, F.P., Finnegan, E.B., Campion, M.A. and Braverman, E.P. (2001) Use of situational judgment tests to predict job performance: A clarification of the literature. Journal of Applied Psychology, 86, 60–79.
McDaniel, M.A. and Nguyen, N.T. (2001) Situational judgment tests: A review of practice and constructs assessed. International Journal of Selection and Assessment, 9, 103–113.
McHenry, J.J. and Schmitt, N. (1994) Multimedia testing. In M.J. Rumsey, C.D. Walker and J. Harris (Eds), Personnel selection and classification research (pp. 193–232). Mahwah, NJ: Lawrence Erlbaum.
Mead, A.D. (2000) Properties of a resampling validation technique for empirically scoring psychological assessments. Unpublished doctoral dissertation, University of Illinois at Urbana-Champaign.
Mitchell, T.W. and Klimoski, R.J. (1982) Is it rational to be empirical? A test of methods for scoring biographical data. Journal of Applied Psychology, 67, 411–418.
Motowidlo, S.J., Dunnette, M.D. and Carter, G.W. (1990) An alternative selection procedure: The low-fidelity simulation. Journal of Applied Psychology, 75, 640–647.
Mumford, M.D. (1999) Construct validity and background data: Issues, abuses, and future directions. Human Resource Management Review, 9, 117–145.
Mumford, M.D. and Owens, W.A. (1987) Methodology review: Principles, procedures, and findings in the application of background data measures. Applied Psychological Measurement, 11, 1–31.
Mumford, M.D. and Stokes, G.S. (1992) Developmental determinants of individual action: Theory and practice in applying background measures. In M.D. Dunnette and L.M. Hough (Eds), Handbook of industrial and organizational psychology, 2nd Edn (pp. 61–138). Palo Alto: Consulting Psychologists Press.
Mumford, M.D. and Whetzel, D.L. (1997) Background data. In D. Whetzel and G. Wheaton (Eds), Applied measurement methods in industrial psychology (pp. 207–239). Palo Alto: Davies-Black Publishing.
Nickels, B.J. (1994) The nature of biodata. In G.S. Stokes, M.D. Mumford and W.A. Owens (Eds), Biodata handbook: Theory, research, and use of biographical information in selection and performance prediction (pp. 1–16). Palo Alto: Consulting Psychologists Press.
Olson-Buchanan, J.B., Drasgow, F., Moberg, P.J., Mead, A.D., Keenan, P.A. and Donovan, M.A. (1998) An interactive video assessment of conflict resolution skills. Personnel Psychology, 51, 1–24.
Paullin, C. and Hanson, M.A. (2001) Comparing the validity of rationally-derived and empirically-derived scoring keys for a situational judgment inventory. Paper presented at the 16th Annual Conference of the Society for Industrial and Organizational Psychology, San Diego, CA.
Ployhart, R.E. and Ehrhart, M.G. (2003) Be careful what you ask for: Effects of response instructions on the construct validity and reliability of situational judgment tests. International Journal of Selection and Assessment, 11, 1–16.
Schoenfeldt, L.F. (1999) From dust bowl empiricism to rational constructs in biographical data. Human Resource Management Review, 9, 147–167.
Schoenfeldt, L.F. and Mendoza, J.L. (1994) Developing and using factorially derived biographical scales. In G.S. Stokes, M.D. Mumford and W.A. Owens (Eds), Biodata handbook: Theory, research, and use of biographical information in selection and performance prediction (pp. 147–169). Palo Alto: Consulting Psychologists Press.
Smith, K.C. and McDaniel, M.A. (1998) Criterion and construct validity evidence for a situational judgment measure. Poster presented at the 13th Annual Meeting of the Society for Industrial and Organizational Psychology, Dallas, TX, April.
Stokes, G.S. and Searcy, C.A. (1999) Specification of scales in biodata form development: Rational vs. empirical and global vs. specific. International Journal of Selection and Assessment, 7, 72–85.
Such, M.J. and Hemingway, M.A. (2003) Examining the usefulness of empirical keying in the cross-cultural implementation of a biodata inventory. Paper presented in F. Drasgow (Chair), Resampling and Other Advances in Empirical Keying. Symposium conducted at the 18th Annual Conference of the Society for Industrial and Organizational Psychology.
Such, M.J. and Schmidt, D.B. (2004) Examining the effectiveness of empirical keying: A cross-cultural perspective. Paper presented at the 19th Annual Conference of the Society for Industrial and Organizational Psychology, Chicago, IL.
Vroom, V.H. (2000) Leadership and the decision-making process. Organizational Dynamics, 28, 82–94.
Vroom, V.H. and Jago, A.G. (1978) On the validity of the Vroom–Yetton model. Journal of Applied Psychology, 63, 151–162.
Vroom, V.H. and Yetton, P.W. (1973) Leadership and decision making. Pittsburgh: University of Pittsburgh Press.
Weekley, J.A. and Jones, C. (1997) Video-based situational testing. Personnel Psychology, 50, 25–49.
Weekley, J.A. and Jones, C. (1999) Further studies of situational tests. Personnel Psychology, 52, 679–700.
Wonderlic Personnel Test, Inc. (1992) User's manual for the Wonderlic Personnel Test and the Scholastic Level Exam. Libertyville, IL: Wonderlic Personnel Test, Inc.