
PERSONNEL PSYCHOLOGY

2008, 61, 871–925

DEVELOPMENTS IN THE CRITERION-RELATED VALIDATION OF SELECTION PROCEDURES: A CRITICAL REVIEW AND RECOMMENDATIONS FOR PRACTICE
CHAD H. VAN IDDEKINGE
Florida State University
ROBERT E. PLOYHART
University of South Carolina

The use of validated employee selection and promotion procedures is critical to workforce productivity and to the legal defensibility of the personnel decisions made on the basis of those procedures. Consequently, there have been numerous scholarly developments that have considerable implications for the appropriate conduct of criterion-related validity
studies. However, there is no single resource researchers can consult to
understand how these developments impact practice. The purpose of this
article is to summarize and critically review studies published primarily
within the past 10 years that address issues pertinent to criterion-related
validation. Key topics include (a) validity coefficient correction procedures, (b) the evaluation of multiple predictors, (c) differential prediction
analyses, (d) validation sample characteristics, and (e) criterion issues.
In each section, we discuss key findings, critique and note limitations of
the extant research, and offer conclusions and recommendations for the
planning and conduct of criterion-related studies. We conclude by discussing some important but neglected validation issues for which more
research is needed.

The use of validated employee selection and promotion procedures is crucial to organizational effectiveness. For example, valid selection procedures 1 can lead to higher levels of individual, group, and organizational

We thank the following scientist–practitioners for their insightful comments and suggestions on previous drafts of this article: Mike Campion, Huy Le, Dan Putka, Phil Roth,
and Neil Schmitt.
Correspondence and requests for reprints should be addressed to Chad H. Van Iddekinge,
Florida State University, College of Business, Department of Management, Tallahassee,
FL 32306-1110; cvanidde@fsu.edu.
1 Validity in a selection context does not refer to the validity of selection procedures themselves but rather to the validity of the inferences we draw on the basis of scores from selection procedures (American Educational Research Association [AERA], American Psychological Association, & National Council on Measurement in Education, 1999; Binning & Barrett, 1989; Messick, 1998; Society for Industrial and Organizational Psychology [SIOP], 2003). However, to be concise, we often use phrases such as "valid selection procedures."

performance (Barrick, Stewart, Neubert, & Mount, 1998; Huselid, 1995; Schmidt & Hunter, 1998; Wright & Boswell, 2002). Valid procedures
are also essential for making legally defensible selection decisions. Indeed, selection procedures that have been properly validated should be
more likely to withstand the legal scrutiny associated with employment
discrimination suits (Sharf & Jones, 2000) and may even reduce the likelihood of litigation in the first place.
It is therefore critical that researchers 2 use proper and up-to-date methods to assess the validity of these procedures. For example, using outdated
techniques can impede theory development and evaluation by decreasing
the accuracy of inferences researchers draw from their results. This, in
turn, can lead to incomplete or even inaccurate guidance to researchers
who use this research to inform their selection practices.
Despite the critical importance of selection system validation procedures, no recent publications have reviewed and integrated research
findings in this area. Existing publications also do not give the kinds of
prescriptive guidance researchers may need. For example, the two main
sets of professional guidelines relevant to validation research (i.e., AERA,
1999; SIOP, 2003) discuss the major aspects of criterion-related validation
procedures. However, because neither set of guidelines was meant to be
exhaustive, they devote less attention to the specific analytic decisions and
approaches that research suggests can influence conclusions with respect
to validity. The Principles (SIOP, 2003), for instance, devote less than two
pages to the analysis of validation data. These two guidelines also cite relatively few source materials that interested readers can consult for more
specific guidance. The Uniform Guidelines on Employee Selection Procedures (Equal Employment Opportunity Commission, 1978) similarly
lacks specific guidance and is now 30 years old.
Recent scholarly reviews of the personnel selection literature (e.g., Anderson, Lievens, van Dam, & Ryan, 2004; Ployhart, 2006; Salgado, Ones,
& Viswesvaran, 2001) have also tended to be broad in scope, focusing, for
example, on different types of selection constructs and methods (e.g., cognitive ability, assessment centers), content areas (e.g., applicant reactions,
legal issues), and/or emerging trends (e.g., cross-cultural staffing, internet-based assessments). Thus, there is not a single resource researchers can
consult that summarizes and critiques recent developments in this important area.
The overarching goal of this article is to provide selection researchers
with a resource that supplements the more general type of validation
2 We use the term "researcher" throughout the article to refer to both practitioners and
scholars involved in selection procedure validation.

information contained in the professional guidelines and in recent reviews of the selection literature. To accomplish this goal, we review and
critique articles published within the past decade on issues pertinent to
criterion-related validation research. Given the central role of criteria in
the validation process, we also review new findings in this area that have
direct relevance for validation research. We critically review and highlight
key findings, limitations, and gaps and discrepancies in the literature. We
also offer conclusions and provide recommendations for researchers involved in selection procedure validation. Finally, we conclude by noting
some important but neglected validation issues that future research should
address.

Method and Structure of Review

It is widely accepted that validity is a unitary concept, and that various sources of evidence can contribute to an understanding of the inferences
that can be drawn from scores on a selection procedure (Binning & Barrett,
1989; Landy, 1986; Messick, 1998; Schmitt & Landy, 1993). Nonetheless, the primary inference of concern in an employment context is that
test scores predict subsequent work behavior (SIOP, 2003). Thus, even
though establishing the content- and construct-related validity of selection procedures is highly important, we focus on criterion-related validity
because of its fundamental role in evaluating selection systems. In addition, we limit our review to articles published primarily within the past
10 years because this timeframe yielded a large but manageable number of
relevant articles. This timeframe also corresponds roughly with the most
recent revision of the Standards (AERA, 1999).
With these parameters in mind, we searched the table of contents (between 1997 and 2007) of 12 journals that are the most likely outlets for validation research: Academy of Management Journal, Applied Psychological Measurement, Educational and Psychological Measurement, Human
Performance, International Journal of Selection and Assessment, Journal
of Applied Psychology, Journal of Business and Psychology, Journal of
Management, Journal of Occupational and Organizational Psychology,
Organizational Research Methods, Personnel Psychology, and Psychological Methods. This search yielded over 100 articles relevant to five main
validation research topics: (a) validity coefficient correction procedures,
(b) the evaluation of multiple predictors, (c) differential prediction analysis, (d) validation sample characteristics, and (e) validation criteria. We
believe our coverage of validation criteria is important because many of
the criterion issues reviewed have been studied outside of the validation context; thus, selection researchers may not be aware of this work or its
implications for validation.
Validity Coefficient Corrections

Researchers are typically interested in estimating the relationship between scores on one or more selection procedures and one or more criteria
in some population (e.g., individuals in the relevant applicant pool) on the
basis of the relationship observed within a validation sample (e.g., a group
of job incumbents). It is well known, however, that sample correlations
can deviate from population correlations due to various statistical artifacts, and these statistical artifacts can attenuate the true size of validity
coefficients. Recent studies have focused on two prominent artifacts: measurement error (i.e., unreliability) and range restriction (RR). Researchers
have attempted to delineate the influence these artifacts can have on validity, as well as the most appropriate ways to correct these artifacts when
estimating criterion-related validity.
Corrections for measurement error. Researchers often correct for
unreliability in criterion measures to estimate the operational validity
of selection procedures. This fairly straightforward correction procedure
involves dividing the observed validity coefficient by the square root of
the estimated reliability of the criterion measure. Corrections for predictor
unreliability are made less often because researchers tend to be more
interested in the validity of selection procedures in their current form than
in their potential validity if the predictors measured the target constructs
with perfect reliability.
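To make this correction concrete, the following minimal sketch (in Python) applies the formula just described; the function name and the numeric inputs are hypothetical illustrations, not values from any particular study.

import math

def correct_for_criterion_unreliability(r_xy, r_yy):
    """Disattenuate an observed validity coefficient for measurement error
    in the criterion by dividing by the square root of the estimated
    criterion reliability."""
    return r_xy / math.sqrt(r_yy)

# Hypothetical example: observed validity of .25 and an estimated
# interrater reliability of .52 for the ratings criterion.
r_operational = correct_for_criterion_unreliability(0.25, 0.52)
print(round(r_operational, 3))  # about .347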
Although many experts believe that validity coefficients should be
corrected for measurement error (in the criterion), there is disagreement
about the most appropriate way to estimate the reliability of ratings criteria. This is a concern because performance ratings remain the most
common validation criterion (Viswesvaran, Schmidt, & Ones, 2002), and
the reliability estimate one uses to correct for attenuation can affect the
validity of inferences drawn from validation results (Murphy & DeShon,
2000).
When only one rater (e.g., an immediate supervisor) is available to
evaluate the job performance of each validation study participant, researchers often compute an internal consistency coefficient (e.g., Cronbach's alpha) to estimate reliability. Such coefficients provide estimates of
intrarater reliability and indicate the consistency of ratings across different performance dimensions (Cronbach, 1951). The problem with using
internal consistency coefficients for measurement error corrections is that
they assign specific error (i.e., the rater by ratee interaction effect, which represents a rater's idiosyncratic perceptions of ratees' job performance) to true score variance, and there is evidence that this rater-specific error
is very large for performance ratings (Schmidt & Hunter, 1996). For example, raters often fail to distinguish among multiple dimensions, and as a
result, performance ratings tend to produce high levels of internal consistency (Viswesvaran, Ones, & Schmidt, 1996). Thus, sole use of internal
consistency coefficients to estimate measurement error in ratings criteria will tend to overestimate reliability and thus underestimate corrected
validity coefficients.
When multiple raters are available, the traditional approach has been
to estimate Pearson correlations (two raters) or intraclass correlation coefficients (ICCs; more than two raters), adjust these values for the number
of raters (e.g., using the Spearman–Brown formula for Pearson correlations), and use the adjusted estimates to correct the observed validity
coefficient. For example, use of interrater correlations to estimate the
operational validity of selection procedures has been a common practice in recent meta-analyses of various selection constructs and methods
(e.g., Arthur, Bell, Villado, & Doverspike, 2006; Hogan & Holland, 2003;
Huffcutt, Conway, Roth, & Stone, 2001; Hurtz & Donovan, 2000; McDaniel, Morgeson, Finnegan, Campion, & Braverman, 2001; Roth, Bobko,
& McFarland, 2005). 3
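As an illustration of this traditional multi-rater approach, the sketch below estimates a single-rater ICC(1) from a hypothetical ratee-by-rater matrix using one-way ANOVA mean squares and then applies the Spearman–Brown formula to obtain the reliability of the mean across raters. Depending on the measurement design, researchers may prefer other ICC forms (e.g., ICC(2,1) or ICC(3,1)); this is only a sketch under the simplest assumptions.

import numpy as np

def icc1_and_spearman_brown(ratings):
    """ratings: ratees x raters array (fully crossed, no missing data).
    Returns the single-rater ICC(1) and the Spearman-Brown adjusted
    reliability for the mean of all k raters."""
    n, k = ratings.shape
    grand_mean = ratings.mean()
    row_means = ratings.mean(axis=1)
    # One-way random-effects ANOVA with ratees as the grouping factor
    ss_between = k * np.sum((row_means - grand_mean) ** 2)
    ss_within = np.sum((ratings - row_means[:, None]) ** 2)
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    # Spearman-Brown adjustment: reliability of the mean of k raters
    icc_k = (k * icc1) / (1 + (k - 1) * icc1)
    return icc1, icc_k

# Hypothetical ratings from 5 ratees evaluated by 3 raters
ratings = np.array([[3, 4, 3],
                    [2, 2, 3],
                    [5, 4, 4],
                    [3, 3, 2],
                    [4, 5, 5]], dtype=float)
print(icc1_and_spearman_brown(ratings))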
However, the appropriateness of this approach has been questioned
given that such a correction assumes the differences between raters are
not substantively meaningful but instead reflect sources of irrelevant
(error) variance. Murphy and DeShon (2000) went as far as to suggest
that interrater correlations are not reliability coefficients. They maintained
that different raters rarely can be considered parallel assessments (e.g.,
because different raters often observe ratees performing different aspects
of their job), which is a key assumption of the classical test theory on
which interrater correlations are based. Furthermore, they identified a variety of systematic sources of variance in performance ratings that can lead
raters to agree or to disagree but that are not reflected in rater correlations.
For example, evaluations of different raters may covary due to similar
goals and biases, yet this covariation is considered true score variance in
classical test theory.

3 We note that most of these studies used single-rater reliability estimates (e.g., .52 from
Viswesvaran et al., 1996) to correct the meta-analytically derived validity coefficients.
However, the performance measures from some of the primary studies undoubtedly were
based on ratings from more than one rater per ratee. To the extent that this was the case, use
of single-rater reliabilities will overcorrect the observed validities and thus overestimate
true validity.


Murphy and DeShon (2000) also identified several potential systematic sources of rater disagreement (e.g., rater position level) that are treated
as random error when computing interrater correlations but that may have
different effects on reliability than does disagreement due to nonsystematic
sources. Recent studies using undergraduate raters (e.g., Murphy, Cleveland, Skattebo, & Kinney, 2004; Wong & Kwong, 2007) have provided
some initial evidence that raters' goals (e.g., motivate ratees vs. identify
their strengths and weaknesses) can indeed influence their evaluations
(although neither study examined the effects of goal incongruence on interrater reliability). DeShon (2003) concluded that if rater disagreements
reflect not only random response patterns but also systematic sources,
then conventional validity coefficient corrections do not correct for measurement error but rather for a lack of understanding about what factors
influence the ratings.
In a rejoinder, Schmidt, Viswesvaran, and Ones (2000) argued that
interrater correlations are appropriate for correcting for attenuation. For
example, the researchers maintained that raters can be considered alternative forms of the same measure, and therefore, the correlation between
these forms represents an appropriate estimate of reliability. Schmidt and
colleagues suggested that the fact that different raters observe different
behaviors at different times actually is an advantage of using interrater
correlations because this helps to control for transient error, which, for example, reflects variations in ratee performance over time due to changes in
mood, mental state, and so forth (see Schmidt, Le, & Ilies, 2003). Schmidt
and colleagues also rebuffed the claim that classical measurement methods (e.g., Pearson correlations) model random error only. In fact, they
contended that classical reliability coefficients are the only ones that can
estimate all the main potential sources of measurement error relevant to
job performance ratings, including rater leniency effects, halo effects (i.e.,
rater by ratee interactions), transient error, and random response error (for
an illustration, see Viswesvaran, Schmidt, & Ones, 2005).
Furthermore, some of the concerns Murphy and DeShon (2000) raised
may be less relevant within a validation context. For example, the presence of similar or divergent rater goals (e.g., relationship building vs.
performance motivation) and biases (e.g., leniency) may be less likely
when confidential, research-based performance ratings can be collected
for validation purposes.
An alternative approach to conceptualizing and estimating measurement error, which has been used in the general psychometrics literature
for decades but only recently has made its way into the validation literature,
is generalizability (G) theory (Cronbach, Gleser, Nanda, & Rajaratnam,
1972). Conventional corrections for measurement error are based on classical test theory, which conceptualizes error as any factor that makes an observed score differ from a true score. From this perspective, error is undifferentiated and is considered to be random. In G-theory, measurement
error is thought to comprise a multitude of systematic, unmeasured, and
even interacting sources of error (DeShon, 2002, 2003). Using analysis
of variance (ANOVA), G-theory allows researchers to partition the variance associated with sources of error and, in turn, estimate their relative
contribution to the overall amount of error present in a set of scores.
Potential sources of error (or "facets" in G-theory terminology) for
job performance ratings collected for a validation study might include the
raters, the type of rater (i.e., supervisors vs. peers), and the performance
dimensions being rated. Using G-theory, the validation researcher could
compute a generalizability coefficient that indexes the combined effects of
these error sources. The researcher could also compute separate variances,
and the corresponding generalizability coefficients that account for them,
for each error source to determine the extent to which each source (e.g.,
raters vs. dimensions) contributes to overall measurement error.
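As a minimal illustration of such a G-study, the sketch below estimates variance components for a single-facet ratee-by-rater design (one observation per cell) and the corresponding generalizability coefficient; the data are hypothetical, and a real application would typically include additional facets (e.g., performance dimensions) and dedicated G-theory software.

import numpy as np

def g_study_person_by_rater(ratings, n_raters_decision=None):
    """Single-facet p x r G-study with one observation per cell.
    ratings: ratees (p) x raters (r) array. Returns estimated variance
    components and the relative G coefficient for the mean of
    n_raters_decision raters."""
    n_p, n_r = ratings.shape
    if n_raters_decision is None:
        n_raters_decision = n_r
    grand = ratings.mean()
    p_means = ratings.mean(axis=1)
    r_means = ratings.mean(axis=0)
    ss_p = n_r * np.sum((p_means - grand) ** 2)
    ss_r = n_p * np.sum((r_means - grand) ** 2)
    ss_total = np.sum((ratings - grand) ** 2)
    ss_pr = ss_total - ss_p - ss_r  # interaction confounded with residual error
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))
    var_pr_e = ms_pr
    var_r = max((ms_r - ms_pr) / n_p, 0.0)
    var_p = max((ms_p - ms_pr) / n_r, 0.0)
    # Relative (norm-referenced) G coefficient for the mean across raters
    g_coef = var_p / (var_p + var_pr_e / n_raters_decision)
    return {"var_ratee": var_p, "var_rater": var_r,
            "var_interaction_error": var_pr_e, "g_coefficient": g_coef}

ratings = np.array([[3, 4, 3], [2, 2, 3], [5, 4, 4], [3, 3, 2], [4, 5, 5]], float)
print(g_study_person_by_rater(ratings))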
G-theory has the potential to be a very useful tool for validation researchers, and we encourage more extensive use of this technique in validation research. For example, in addition to more precisely determining
the primary source(s) of error in an existing set of job performance ratings,
G-theory can be useful for planning future validation efforts. Specifically,
G-theory encourages researchers to consider the facets that might contribute to error in ratings and then design their validation studies in a
way that allows them to estimate the relative contribution of each facet.
G-theory can also help determine how altering the number of raters, performance dimensions, and so on will affect the generalizability of ratings
collected in the future.
At the same time, there are some potentially important issues to consider when using a G-theory perspective to estimate measurement error
in validation research. First, inferences from G-theory focus on the generalizability of scores rather than on reliability per se. That is, G-theory
estimates the variance associated with whatever facets are captured in the
measurement design (e.g., raters, items, time periods) used to obtain scores
on which decisions are made. If, for example, the variance associated with
the rater facet is large, then the ratings obtained from different raters are
not interchangeable, and thus decisions made on the basis of those ratings
might not generalize to decisions that would be made if a different set of
raters was used (DeShon, 2003). Likewise, a generalizability coefficient
that considers the combined effects of all measurement facets indicates the
level of generalizability of scores given that particular set of raters, items,
and so on. Further, as with interrater correlations computed on the basis
of classical test theory, unless the assumption of parallel raters is satisfied,
generalizability coefficients cannot be considered reliability coefficients (Murphy & DeShon, 2000). Thus, G-theory may not resolve all concerns
that have been raised about measurement error corrections in performance
ratings.
Finally, to capitalize on the information G-theory provides, researchers
must incorporate relevant measurement facets into their validation study
designs. For instance, to estimate the relative contributions of raters and performance dimensions to measurement error, validation participants must
be rated on multiple performance dimensions by multiple raters. In other
words, G-theory is only as useful as is the quality and comprehensiveness
of the design used to collect the data.
Corrections for RR. Another statistical artifact relevant to validation
research is RR. RR occurs when there is less variance on the predictor,
criterion, or both in the validation sample relative to the amount of variation on these measures in the relevant population. The restricted range of
scores results in a criterion validity estimate that is downwardly biased.
Numerous studies conducted during the past decade have addressed the
issue of RR in validation research. Sackett and Yang (2000) identified three
main factors that can affect the nature and degree of RR: (a) the variable(s)
on which selection occurs (predictor, criterion, or a third variable), (b)
whether the unrestricted variances for the relevant variables are known,
and (c) whether the third variable, if involved, is measured or unmeasured
(e.g., unquantified judgments made on the basis of interviews or letters of
recommendation). The various combinations of these factors resulted in
11 plausible scenarios in which RR may occur.
Yang, Sackett, and Nho (2004) updated the correction procedure for
situations when selection decisions are made on the basis of unmeasured
or partially measured predictors (i.e., scenario 2d in Sackett & Yang, 2000)
to account for the additional influencing factor of applicants' rejection of
job offers. However, modeling the effects of self-selection requires data
concerning plausible reasons why applicants may turn down a job offer,
such as applicant employability as judged by interviewers. Therefore, the
usefulness of this procedure depends on whether reasons for applicant
self-selection can be identified and effectively measured.
A key distinction in conceptualizing RR is that between direct and
indirect restriction. In a selection context, direct RR occurs when individuals were screened on the same procedure that is being validated. This
can occur, for example, when a structured interview is being validated on
a sample of job incumbents who initially were selected solely on the basis
of the interview. In contrast, indirect RR occurs when the procedure being
validated is correlated with one or more of the procedures currently used
for selection. For instance, the same set of incumbents from the above
example also may be given a biodata inventory as part of the validation
study. If biodata scores are correlated with performance in the interview on which incumbents were selected, then the relationship between the biodata inventory and the validation criteria (e.g., job performance ratings) may be downwardly biased due to indirect restriction vis-à-vis the
interview.
Because applicants are rarely selected in a strict top-down manner using a single procedure (a requirement for direct RR; Schmidt, Oh, & Le,
2006), and because researchers often validate selection instruments prior
to using them operationally, it has been suggested that most RR in personnel selection is indirect rather than direct (Thorndike, 1949). However, the
existing correction procedure for indirect restriction, Thorndike's (1949) case 3 correction, rarely can be used given (a) the data assumptions that must be met (e.g., top-down selection on a single predictor) and (b) the (un)availability of information regarding the third variable(s) on which prior selection occurred. Thus, researchers have had to use Thorndike's
case 2 correction for direct restriction in instances in which the restriction
actually is indirect. Unfortunately, using this procedure in cases of indirect RR tends to undercorrect the validity coefficient (Linn, Harnisch, &
Dunbar, 1981).
Recent studies by Schmidt and colleagues have clarified various issues
involving corrections for both direct and indirect RR. Hunter, Schmidt,
and Le (2006) noted that under conditions of direct RR, accurate corrections for both RR and measurement error require a particular sequence
of corrections. Specifically, researchers first should correct for unreliability in the criterion and then correct for RR in the predictor. Hunter and
colleagues described the input data and formulas required for each step.
They also presented a correction method for indirect RR that can be used
when the information needed for Thorndike's (1949) case 3 correction is
not available.
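To illustrate the sequence Hunter and colleagues described for the direct RR case, the sketch below first disattenuates the observed validity for criterion unreliability (using a reliability estimate from the restricted sample) and then applies Thorndike's case 2 correction; all numeric inputs are hypothetical. The additional steps their indirect RR procedure requires (e.g., estimating the restriction ratio for predictor true scores) are not shown here.

import math

def correct_for_criterion_unreliability(r_xy, r_yy):
    """Step 1 (direct RR case): disattenuate for criterion unreliability,
    using a reliability estimate from the restricted (incumbent) sample."""
    return r_xy / math.sqrt(r_yy)

def correct_for_direct_range_restriction(r, u_x):
    """Step 2: Thorndike's (1949) case 2 correction, where u_x is the
    ratio of the restricted to the unrestricted predictor SD."""
    return r / math.sqrt(u_x ** 2 + (r ** 2) * (1 - u_x ** 2))

# Hypothetical inputs
r_observed = 0.20        # observed validity in the incumbent sample
r_yy_restricted = 0.52   # criterion (ratings) reliability in that sample
u_x = 0.70               # SD_restricted / SD_unrestricted for the predictor

step1 = correct_for_criterion_unreliability(r_observed, r_yy_restricted)
step2 = correct_for_direct_range_restriction(step1, u_x)
print(round(step1, 3), round(step2, 3))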
Schmidt et al. (2006) reanalyzed data from several previously published validity generalization studies to compare validity coefficients corrected using the case 2 formula (which assumes direct restriction) to those
obtained using their new correction procedure for indirect restriction.
Results suggested that the direct restriction correction underestimated operational validity by 21% for predicting job performance and by 28%
for predicting training performance. This suggests that prior research (in
which direct RR corrections were applied to situations involving indirect restriction) may have substantially underestimated the validity of
the selection procedures (however, see Schmitt, 2007, for an alternative
perspective on these findings).
Sackett, Lievens, Berry, and Landers (2007) discussed the special case
in which a researcher wants to correct the correlation between two or
more predictors for RR when the predictors comprise a composite used
for selection. Suppose, for example, that applicants were selected on a

880

PERSONNEL PSYCHOLOGY

composite of a cognitive ability test and a personality measure, and the


researcher wants to estimate the correlation between the two predictors
in a sample of applicants who ultimately were selected. Given the compensatory nature of the composite, applicants who obtained low scores
on one predictor must have obtained very high scores on the other predictor in order to obtain a passing score on the overall composite. Sackett
and colleagues demonstrated how this phenomenon can severely reduce
observed correlations between predictors and how applying traditional
corrections for direct RR does not accurately estimate the population
correlation (though the appropriate indirect correction would recover the
population value). The underestimation of predictor correlations, in turn,
can distort conclusions regarding their incremental validity in relation to
the criterion.
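This phenomenon is easy to demonstrate by simulation. In the sketch below (with arbitrary parameter values), applicants are selected top-down on an equally weighted composite of two predictors that correlate .30 in the applicant pool, and the predictor intercorrelation among those selected is then compared to the applicant-pool value.

import numpy as np

rng = np.random.default_rng(0)
n_applicants, selection_ratio, rho = 100_000, 0.20, 0.30  # hypothetical values

# Simulate two standardized predictors correlated .30 in the applicant pool
cov = np.array([[1.0, rho], [rho, 1.0]])
x = rng.multivariate_normal([0.0, 0.0], cov, size=n_applicants)

# Select top-down on an equally weighted composite of the two predictors
composite = x.sum(axis=1)
cutoff = np.quantile(composite, 1 - selection_ratio)
selected = x[composite >= cutoff]

r_applicants = np.corrcoef(x[:, 0], x[:, 1])[0, 1]
r_selected = np.corrcoef(selected[:, 0], selected[:, 1])[0, 1]
# The correlation in the selected group is much smaller, and can be negative
print(round(r_applicants, 2), round(r_selected, 2))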
RR correction formulas assume that the unrestricted SD of predictor
scores can be estimated. When using commercially available selection
tests, it is common for researchers to rely on normative data reported in
test manuals. Ones and Viswesvaran (2003) investigated whether population norms are more heterogeneous than job-specific applicant norms.
They compared the variability in scores on the Comprehensive Personality Profile (Wonderlic Personnel Test, 1998) within the general population to
the variability of scores across 111 job-specific applicant samples. The
researchers concluded that use of population norm SDs to correct for RR
may not appreciably inflate relationships between personality variables
and validation criteria, and that the score variance in most applicant pools,
although affected by self-selection, represents a fairly accurate estimate
of the unrestricted variance. Of course, these results may not generalize
to other personality variables (e.g., the Big Five factors) or to measures of
other selection constructs (e.g., see Sackett & Ostgaard, 1994).
Researchers have also noted that RR affects reliability coefficients in
the same way it affects validity coefficients (Callender & Osburn, 1980;
Guion, 1998; Schmidt, Hunter, & Urry, 1976). To the extent that RR
downwardly biases reliability estimates, correcting validity coefficients
for criterion unreliability may overestimate validity. Sackett, Laczo, and
Arvey (2002) investigated this issue with estimates of interrater reliability. They examined three scenarios in which the range of job performance
ratings could be restricted: (a) indirect RR due to truncation on the predictor (e.g., selection on personality test scores that are correlated with
job performance), (b) indirect RR due to truncation on a third variable
(e.g., retention decisions based on employees' performance during a probationary period that preceded collection of the performance ratings), and
(c) direct RR on the performance ratings (e.g., the ratings originally were
used to make retention and/or promotion decisions and predictor data are
collected from the surviving employees only).


Results of a Monte Carlo simulation revealed that the underestimation of interrater reliability was quite small under the first scenario (i.e.,
because predictor-criterion correlations in validation research tend to be
rather modest), whereas there often was substantial underestimation under
the latter two scenarios. Interrater reliability was underestimated the most
when there was direct RR on the performance ratings (i.e., scenario C)
and when the range of performance ratings was most restricted (i.e., when
the selection/retention ratio was low). In terms of the effects of criterion
RR on validity coefficient corrections, restriction due to truncation on the
predictor (i.e., scenario A) did not have a large influence on the corrected
validities. Overestimation of validity was more likely under the other two
scenarios given the smaller interrater reliability estimates that resulted
from the RR, which, in turn, were used to correct the observed validity coefficients. Nonetheless, when direct RR exists on the performance ratings
(i.e., scenario C), researchers will likely have the data (e.g., performance
ratings on both retained and terminated employees) to correct the reliability estimate for restriction prior to using the estimate to correct the validity
coefficients for attenuation.
LeBreton and colleagues (LeBreton, Burgess, Kaiser, Atchley, &
James, 2003) investigated the extent to which the modest interrater reliability estimates often found for job performance ratings are due to
true discrepancies between raters (e.g., in terms of opportunities to observe certain job behaviors) or to restricted between-target variance due
to a reduced amount of variability in employee job performance resulting
from various human resources systems (e.g., selection). The researchers
noted that whether low interrater estimates are due to rating discrepancies or to lack of variance cannot be determined using correlation-based approaches to estimating interrater reliability alone. Thus, they also examined use of r_wg (James, Demaree, & Wolf, 1984) to estimate
interrater agreement, a statistic unaffected by between-target variance
restrictions.
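For reference, the single-item r_wg index of James et al. (1984) compares the observed variance among raters for a given target to the variance expected under a uniform (random responding) null distribution across the A response options. A minimal sketch, with hypothetical ratings, follows.

import numpy as np

def rwg_single_item(ratings_for_one_target, n_response_options):
    """Single-item r_wg (James, Demaree, & Wolf, 1984): 1 minus the ratio
    of the observed variance among raters to the variance of a uniform
    null distribution across the A response options."""
    s2_observed = np.var(ratings_for_one_target, ddof=1)
    s2_uniform = (n_response_options ** 2 - 1) / 12.0
    return 1.0 - (s2_observed / s2_uniform)

# Hypothetical: four raters evaluate one ratee on a 5-point scale
print(round(rwg_single_item([4, 4, 5, 4], n_response_options=5), 2))  # high agreement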
LeBreton and colleagues conducted a Monte Carlo simulation to
demonstrate the relationship between between-target variance and estimates of interrater reliability. The simulation showed that Pearson correlations decreased from .83 with no between-target variance restriction to .36
with severe variance restriction. The researchers then examined actual data
from several multirater feedback studies. Results revealed that estimates
of interrater reliability (i.e., Pearson correlations and ICCs) consistently
were low (e.g., mean single-rater estimates of .30) in the presence of low to
modest between-target variance. In contrast, estimates of interrater agreement (i.e., r_wg coefficients) were moderate to high (e.g., mean = .71 based
on a slightly skewed distribution). Because the agreement coefficients
were relatively high, LeBreton et al. concluded that the results provided support for the restriction of variance hypothesis rather than for the rater
discrepancy hypothesis. We note, however, that reduction in between-target variance will always decrease interrater reliability estimates given the underlying equations and that agreement estimates such as r_wg will
never be affected by between-target variance because of how they are
computed.
Finally, corrected validity coefficients typically provide closer approximations of the population validity than do uncorrected coefficients. However, there are a few assumptions that, when violated, can lead to biased
estimates of corrected validity. For instance, the Hunter et al. (2006)
indirect RR procedure assumes that the predictor of interest captures
all the constructs that determine the criterion-related validity of whatever process or measures were used to make selection decisions. This
assumption may be violated in some validation contexts. For example,
an organization originally may have used a combination of assessments
(e.g., a cognitive ability test, a semistructured interview, and recommendation letters) to select a group of job incumbents. If the organization
later wants to estimate the criterion-related validity of the cognitive test,
the range of scores on that test will be indirectly restricted by the original selection battery, and thus, the effect of the original battery on the
criterion (e.g., job performance) cannot be fully accounted for by the
predictor of interest. Le and Schmidt (2006) found that although this
violation results in an undercorrection for RR, use of their procedure
still provides less biased validity estimates than does the conventional
correction procedure based on the direct RR model (i.e., Thorndike's
case 2).
Another assumption of the Hunter et al. (2006) procedure is that there
is no indirect RR on the criteria. However, if a restricted range of criterion
values is due to something other than RR on the predictor (e.g., restriction
resulting from a third variable, such as probationary period decisions;
Sackett et al., 2002), then their procedure, along with all other correction
procedures, will undercorrect the observed validity coefficient. Last, the
Thorndike correction procedures (which one might use when predictor
RR truly is direct) require two basic assumptions: (a) There is a linear
relationship between the predictor and criterion, and (b) the conditional
variance of scores on the criterion does not depend on the value of the
predictor (i.e., there is homoscedasticity). If either of these assumptions is
violated, then corrected validities can be underestimated (for a review, see
Dunbar & Linn, 1991).
Conclusions and recommendations. There is convincing evidence
that statistical artifacts such as measurement error and RR can downwardly
bias observed relationships between predictors and criteria, and, in turn, affect the accuracy of conclusions regarding the validity of selection procedures. Although recent research has provided valuable
insights, these studies also underscore how complicated it can be to determine the specific artifacts affecting validity coefficients. Furthermore,
even when the relevant artifacts can be identified, it is not always clear
whether they should be corrected, and if so, the most appropriate way to
make the corrections. With the hope of at least identifying the key issues,
we recommend the following.
(a) When feasible, report both observed validity coefficients and those
corrected for criterion unreliability and/or RR. Always specify the
type of corrections performed, their sequence, and which formulas
were used.
(b) There is disagreement regarding the appropriate correction of ratings criteria for measurement error. Until the profession reaches
some consensus, report validity coefficients corrected for interrater reliability (e.g., using an ICC; Bliese, 1998; McGraw &
Wong, 1996). Alternatively, it may be possible to compute generalizability coefficients that estimate unreliability due to items,
raters, and the combination of these two potential sources of
error.
(c) Be aware that it may be difficult to obtain accurate estimates of
corrected validity coefficients when only one rater is available to
evaluate the job performance of each validation study participant.
Researchers would appear to have two main options, although
neither is ideal. First, they could correct validity coefficients for
measurement error using current meta-analytic estimates of interrater reliability. These values are .52 for supervisor ratings and .42
for peer ratings (Viswesvaran et al., 1996). Second, researchers
could correct validity coefficients for intrarater reliability, such as
by computing coefficient alpha. However, as discussed, this approach will likely provide a conservative estimate of corrected validity because intrarater statistics do not account for rater-specific
error and, in turn, tend to overestimate the reliability of ratings criteria. In this case, researchers might be advised to include multiple
ratings of each performance dimension and then correct each validity coefficient using the internal consistency estimates of these
unidimensional measures (Schmitt, 1996).
(d) Because employees often have only one supervisor, consider collecting performance information from both supervisors and peers.
Although peer ratings may not be ideal for administrative purposes, such ratings would seem to be appropriate for validation purposes. Further, recent research suggests that rater level effects might not be as strong as commonly thought (Viswesvaran et al., 2002). Thus, collecting both supervisor and peer ratings may allow validation researchers to combine the ratings and, in turn, reduce rater-specific error.
(e) Learn the basics of G-theory and consider likely sources of measurement error when designing validation studies. When possible,
collect ratings criteria in a way that will allow you to estimate the
relative contribution of multiple sources to measurement error in
the ratings.
(f) Compute measures of interrater agreement. Although we do not
advocate using agreement coefficients to correct for measurement
error, they can help determine the extent to which limited between-ratee variance may contribute to low interrater reliability estimates.
Examination of within-rater SDs also can be informative in this
regard.
(g) There are at least 11 different types of RR (Sackett & Yang, 2000),
and applying the wrong RR correction can influence conclusions
regarding criterion-related validity. For example, using correction
formulas that assume strict top-down selection when this is not the case
can overestimate the amount of RR and, in turn, inflate corrected
validity estimates. Consult articles by Sackett (e.g., Sackett &
Yang, 2000) and Schmidt and Hunter (e.g., Hunter et al., 2006)
to identify likely sources of RR and the appropriate correction
procedures.
(h) None of the standard predictor RR correction procedures (e.g.,
Hunter et al., 2006) consider whether restriction of range exists in
the criterion. Thus, to the extent that there is RR (direct or indirect)
on one or more validation criteria due to something other than the
restriction on the predictor(s), all standard correction procedures
will underestimate validity (Schmidt et al., 2006). Therefore, if
criterion RR is a concern, researchers will have to correct reliability
estimates for restriction prior to using them to correct validity
coefficients for measurement error (Sackett et al., 2002), assuming
that the information needed to correct for criterion RR is available
(e.g., variance of job performance ratings from an unrestricted
sample).
(i) Be cognizant that there remain some concerns over the use of
corrections when reporting validity coefficients. At issue is not
whether measurement error and RR exist but rather what specific corrections should be applied (see Schmitt, 2007). It must
be remembered that corrections are no substitute for using good
validation designs and measures in the first place.


Evaluation of Multiple Predictors

Organizations frequently use multiple predictors to assess job applicants. This is because job analysis results often reveal a large number of job-relevant knowledge, skills, abilities, and other characteristics
(KSAOs), which may be difficult or impossible to capture with a single
assessment. In addition, use of multiple predictors can increase the validity of the overall selection system beyond that of any individual predictor
(Bobko, Roth, & Potosky, 1999; Schmidt & Hunter, 1998). We discuss
two important considerations for evaluating multiple predictors: relative
importance and cross-validity.
Estimating predictor relative importance. Relative importance refers
to the relative contribution each predictor makes to the predictive power
of an overall regression model. Relative importance statistics are useful
for determining which predictors contribute most to predictive validity, as
well as for evaluating the extent to which a new predictor (or predictors)
contributes to an existing predictor battery. Perhaps the most common approach for assessing relative importance has been to examine the magnitude and statistical significance of the standardized regression coefficients
for individual predictors. When predictors are uncorrelated, the squared
regression coefficient for each variable represents the proportion of variance in the criterion for which that predictor accounts. However, when
the predictors are correlated (as often is the case in selection research),
the squared regression coefficients do not sum to the total variance explained (i.e., R²), which makes conclusions concerning relative validity
ambiguous and possibly misleading (Budescu & Azen, 2004; Johnson &
LeBreton, 2004).
Another common approach used to examine predictor relative importance is to determine the incremental validity of a given predictor beyond
that provided by a different predictor(s). For instance, a researcher may
need to determine whether a new selection procedure adds incremental
prediction of valued criteria beyond that provided by existing procedures.
However, as LeBreton, Hargis, Griepentrog, Oswald, and Ployhart (2007)
noted, new predictors often account for relatively small portions of unique
variance in the criterion beyond that accounted for by the existing predictors. This is because incremental validity analyses assign any shared
variance (i.e., between the new predictor and the existing predictors) to
the existing predictors, which reduces the amount of validity attributed to
the new predictor. Such analyses do not provide information concerning
the contribution the new predictor makes to the overall regression model
(i.e., R²) relative to the other predictors.
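For readers who want a concrete point of comparison, the sketch below computes the change in R-squared when a hypothetical new predictor is added to an existing predictor in a hierarchical regression; the simulated data and variable names are illustrative only and rely on the statsmodels package.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300
existing = rng.normal(size=n)                   # e.g., an existing cognitive ability score
new_pred = 0.5 * existing + rng.normal(size=n)  # correlated new predictor (e.g., biodata)
criterion = 0.4 * existing + 0.3 * new_pred + rng.normal(size=n)

m_base = sm.OLS(criterion, sm.add_constant(existing)).fit()
m_full = sm.OLS(criterion, sm.add_constant(np.column_stack([existing, new_pred]))).fit()

# Unique (incremental) variance credited to the new predictor
delta_r2 = m_full.rsquared - m_base.rsquared
print(round(m_base.rsquared, 3), round(m_full.rsquared, 3), round(delta_r2, 3))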
Relative weight analysis (RWA; Johnson, 2000) and dominance analysis (DA; Budescu, 1993) represent complementary methods for assessing the relative importance of correlated predictors. These statistics indicate the contribution each predictor makes to the regression model, considering both the predictor's individual effect and its effect when combined with
the other predictors in the model (Budescu, 1993; Johnson & LeBreton,
2004). In RWA, the original predictors are transformed into a set of orthogonal variables, the orthogonal variables are related to the criterion, and
the orthogonal variables are then related back to the original predictors.
These steps reveal the percentage of variance in the criterion associated
with each predictor. Similarly, DA yields a weight for each predictor that represents the predictor's average contribution to R² (specifically, its mean squared semipartial correlation, or ΔR²) across all possible subsets
of regression models. The results of DA are practically indistinguishable
from those obtained from RWA (LeBreton, Ployhart, & Ladd, 2004). The
main difference between the two approaches is that relative weights can
be computed more quickly and easily, particularly when analyzing a large
set of predictors (Johnson & LeBreton, 2004).
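A compact sketch of the relative weight computation, working directly from the predictor intercorrelation matrix and the predictor-criterion correlations, appears below. The correlation values are hypothetical, and the sketch follows the standard formulation in which the predictors' correlations with their orthogonal counterparts are taken as the symmetric square root of the predictor intercorrelation matrix; it omits refinements such as significance tests or bootstrapped confidence intervals.

import numpy as np

def relative_weights(R_xx, r_xy):
    """Relative weight analysis from correlations (after Johnson, 2000).
    R_xx: predictor intercorrelation matrix; r_xy: predictor-criterion
    correlations. Returns raw weights (which sum to model R^2) and the
    percentage of R^2 attributable to each predictor."""
    evals, evecs = np.linalg.eigh(R_xx)
    lam = evecs @ np.diag(np.sqrt(evals)) @ evecs.T   # symmetric square root of R_xx
    beta = np.linalg.solve(lam, r_xy)                 # regression of y on the orthogonal variables
    raw = (lam ** 2) @ (beta ** 2)                    # epsilon_j = sum_k lambda_jk^2 * beta_k^2
    return raw, 100 * raw / raw.sum()

# Hypothetical correlations among three predictors and a criterion
R_xx = np.array([[1.00, 0.40, 0.30],
                 [0.40, 1.00, 0.50],
                 [0.30, 0.50, 1.00]])
r_xy = np.array([0.30, 0.35, 0.25])
raw, pct = relative_weights(R_xx, r_xy)
print(np.round(raw, 3), np.round(pct, 1), round(raw.sum(), 3))  # raw weights sum to R^2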
RWA and DA offer two main advantages over traditional methods for
assessing relative importance. First, these procedures provide meaningful
estimates of relative importance in the presence of multicollinearity. Second, they indicate the percentage of model R² attributable to each predictor. The percentage of R² helps determine the relative magnitude of predictor
importance, and it provides researchers with a relatively straightforward
metric to communicate validation study results to decision makers.
Because these relative importance methods are relatively new, only
a few empirical studies have examined their use. LeBreton et al. (2004)
used a Monte Carlo simulation to compare DA results to those obtained
using traditional methods (i.e., squared correlations and squared regression coefficients). Results revealed that as predictor validity, predictor
collinearity, and number of predictors increased, so did the divergence
between the traditional and newer relative importance approaches. These
findings led the researchers to suggest caution when using correlation and
regression coefficients as indicators of relative importance. In a related
study, Johnson (2004) discussed the need to consider sampling error and
measurement error when interpreting the results of relative importance
analyses. His Monte Carlo simulation revealed that conclusions regarding
relative importance can depend on whether measures of the constructs of
interest are corrected for unreliability.
We found very few studies that have used RWA or DA to evaluate
the relative importance of predictors within a selection context. LeBreton
et al. (2007) used DA to reanalyze data originally reported by Mount, Witt,
and Barrick (2000). Mount et al. examined the incremental validity of biodata scales beyond measures of mental ability and the Big Five factors for
predicting performance within a sample of clerical workers. The original incremental validity results suggested that the biodata scales accounted for relatively small increases in model R². However, by using relative
importance analysis, LeBreton and colleagues showed that biodata consistently emerged as the most important predictor of performance. Ladd,
Atchley, Gniatczyk, and Baumann (2002) also assessed the relative importance of predictors within a selection context. These researchers found
that dimension-based assessment center ratings tended to be relatively
more important to predicting managerial success than did exercise-based
ratings.
Estimating cross-validity. Cross-validation involves estimating the
validity of an existing predictor battery on a new sample. This is important because when predictors are chosen on the basis of a given sample, validity estimates are likely to be higher than if the same predictors
were administered to new samples (a phenomenon known as shrinkage;
Larson, 1931). The traditional approach for estimating cross-validity has
been to split a sample into a two-thirds development sample and a one-third
cross-validation sample. However, splitting a sample creates less stable
regression weights (i.e., because it reduces sample size), and therefore,
formula-based estimation is preferable. Although a variety of formula-based estimates exist, two recent studies have helped identify which formulas work best.
Schmitt and Ployhart (1999) used a Monte Carlo simulation to examine
11 different cross-validity formulas when stepwise regression is used
to select and weight predictor composites. They found that no single
formula consistently produced more accurate cross-validity estimates than
the other formulas in terms of the discrepancy between the population
values and the obtained estimates. However, when a reasonable sample
size-to-predictor ratio is achieved (see below), a formula from Burket
(1964) provided slightly superior estimates. They also found that cross-validity cannot be accurately estimated when sample sizes are small and
that the ratio of sample size to number of predictors should be about 10:1
for accurate cross-validity estimates.
Raju, Bilgic, Edwards, and Fleer (1999) compared a similar set of
cross-validity formulas using data obtained from some 85,000 U.S. Air
Force enlistees. Consistent with the results of Schmitt and Ployhart (1999),
Burket's (1964) shrinkage formula provided the most accurate estimates
of cross-validity when compared to empirical cross-validity estimates in
which the regression weights from one randomly drawn sample were
applied to another random sample.
Raju et al. (1999) also compared ordinary least squares (OLS) and
equal weights procedures for empirical cross-validation. In the equal
weights procedure, scores on each predictor are converted to z scores,
or each predictor is weighted by the reciprocal of its SD (these approaches yield identical multiple correlations; Raju, Bilgic, Edwards, & Fleer, 1997). Results revealed that differences between the sample validities
and the cross-validities were smaller for equal weights than for OLS.
Furthermore, although OLS consistently yielded higher initial validities,
the cross-validities always were higher for the equal weights procedure.
These findings are consistent with earlier research (e.g., Schmidt, 1971)
that showed that unit-weighted predictors are as or more predictive than
OLS-weighted predictors, particularly when sample sizes are small and
when there is an absence of suppressor variables.
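The OLS versus equal weights comparison is straightforward to examine empirically. The sketch below, using arbitrary simulated data rather than the formula-based estimates discussed above, derives both sets of weights in a modest calibration sample and compares how well each predicts the criterion in a large holdout sample; it illustrates the kind of comparison Raju et al. (1999) conducted rather than reproducing their analyses.

import numpy as np

rng = np.random.default_rng(2)

def simulate(n, n_pred=4, validity=0.25):
    x = rng.normal(size=(n, n_pred))
    y = x.sum(axis=1) * validity + rng.normal(size=n)
    return x, y

x_cal, y_cal = simulate(120)     # modest calibration (validation) sample
x_new, y_new = simulate(5000)    # large "new applicant" sample

# OLS weights estimated in the calibration sample, applied to the new sample
design_cal = np.column_stack([np.ones(len(y_cal)), x_cal])
ols_w, *_ = np.linalg.lstsq(design_cal, y_cal, rcond=None)
ols_pred_new = np.column_stack([np.ones(len(y_new)), x_new]) @ ols_w

# Equal (unit) weights: standardize each predictor using calibration statistics, then sum
z_new = (x_new - x_cal.mean(axis=0)) / x_cal.std(axis=0, ddof=1)
unit_pred_new = z_new.sum(axis=1)

r_ols = np.corrcoef(ols_pred_new, y_new)[0, 1]
r_unit = np.corrcoef(unit_pred_new, y_new)[0, 1]
print(round(r_ols, 3), round(r_unit, 3))  # cross-validities of OLS vs. equal weights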
Finally, we wish to clarify a common misconception concerning cross-validity estimates provided by software programs such as SPSS and SAS. Specifically, researchers sometimes report adjusted R² values from the output of linear regression analysis as estimates of cross-validity. Actually, these values estimate the population squared multiple correlation rather than the cross-validated squared multiple correlation (Cattin, 1980). Therefore, we urge researchers to refrain from using these adjusted R² values to estimate cross-validity.
Conclusions and recommendations. Researchers must often choose
a subset of validated predictors for use in operational selection. This task
is complicated because a variety of statistical methods exist for determining predictor relative importance and estimating cross-validity. Recent
research has helped clarify some of these complex issues.
(a) Different statistics (e.g., zero-order correlations, regression coefficients from incremental validity analyses, relative importance
indices) provide different information concerning predictor importance; they are not interchangeable. Therefore, it is frequently
useful to consider multiple indices when evaluating predictor importance.
(b) RWA and DA are useful supplements to the information provided
by traditional multiple regression analysis. These methods can
help determine which predictors contribute most to R², and they
are most useful when evaluating a large number of predictors with
moderate-to-severe collinearity. For instance, if two predictors
increase validity by about the same amount and a researcher wants
to keep only one of them, he or she could retain the one with the
larger relative weight.
(c) Note, however, that RWA and DA were not intended for use in
developing regression equations for future prediction or for identifying the set of predictors that will yield the largest R². Multiple
linear regression analysis remains the preferred approach for these
goals.


(d) Relatively little is known about how these methods function in relation to one another or to correlation and regression approaches
using data from actual selection system validation projects. Therefore, we suggest using relative importance methodologies in conjunction with traditional statistics until additional research evidence is available. LeBreton et al. (2007) may be useful in this
regard, as the authors outlined a series of steps for evaluating
predictor importance.
(e) If statistical artifacts such as RR and criterion unreliability are a
concern, use the corrected validities (and the predictor intercorrelations corrected for RR, as applicable) when comparing predictors
and assessing incremental validity. This is particularly important
when varying degrees of RR exist across the predictors being
evaluated.
(f) Use Burket's (1964) formula to estimate the cross-validity of predictors. Never use the adjusted R² values from SPSS and SAS for
this purpose.
(g) Weighting predictors equally (rather than by OLS-based weights)
may provide larger and more accurate cross-validity estimates,
particularly when validation samples are modest (i.e., N =
150 or less) and when predictor-criterion relations are small to
moderate.
Differential Prediction

An important aspect of any validation study is the examination of differential prediction. Differential prediction (also referred to as predictive bias) occurs when the relationship between predictor and criterion
in the subgroups of interest (e.g., men vs. women) cannot be accurately
described by a common regression line (SIOP, 2003). Differences in regression intercepts indicate that members of one subgroup tend to obtain
lower predicted scores than members of another group, whereas regression slope differences indicate that the selection procedure predicts performance better for one group than for another (Bartlett, Bobko, Mosier,
& Hannan, 1978; Cleary, 1968). Existence of differential prediction typically is examined using moderated multiple regression (MMR) in which
the criterion is regressed on the predictor score, a dummy-coded variable
representing subgroup membership, and the interaction term between the
predictor and subgroup variable. A significant increase in R² when the subgroup variable is added to the predictor indicates an intercept difference between the two groups, and a significant increase in R² when the interaction is added indicates a difference in slopes.
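A minimal sketch of this MMR sequence, using simulated data and the statsmodels package, appears below; the variable names and effect sizes are hypothetical. The first F test (Model 1 vs. Model 2) evaluates intercept differences, and the second (Model 2 vs. Model 3) evaluates slope differences.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 400
group = rng.integers(0, 2, size=n)      # dummy-coded subgroup membership
score = rng.normal(size=n)              # selection procedure score
perf = 0.4 * score + 0.2 * group + rng.normal(size=n)
df = pd.DataFrame({"perf": perf, "score": score, "group": group})

m1 = smf.ols("perf ~ score", data=df).fit()                         # predictor only
m2 = smf.ols("perf ~ score + group", data=df).fit()                 # adds intercept difference
m3 = smf.ols("perf ~ score + group + score:group", data=df).fit()   # adds slope difference

# Hierarchical F tests for the subgroup and interaction terms
print(sm.stats.anova_lm(m1, m2, m3))
print(round(m2.rsquared - m1.rsquared, 4), round(m3.rsquared - m2.rsquared, 4))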


Recent studies have examined a variety of differential prediction issues. Saad and Sackett (2002) investigated differential prediction of
personality measures with regard to gender using data from nine military occupational specialties collected during Project A. The MMR results revealed evidence of differential prediction in about one-third of the
predictor-criterion combinations. Most instances of differential prediction
(about 85%) were due to intercept differences rather than to slope differences. Interestingly, differential prediction appeared to be more a function
of the criteria than of the predictors. Instances of differential prediction for
three personality measures were roughly equal across criteria, whereby
the intercepts for male scores tended to be higher than the intercepts for
female scores. This occurred despite the fact that women tended to score
between one-third and one-half SDs higher than men on a dependability
scale (men scored around one-third SDs higher than women on adjustment, and there were small or no subgroup differences on achievement
orientation). In contrast, about 90% of the instances of differential prediction involved the effort and leadership criterion (i.e., female performance
was consistently overpredicted using a common regression line).
Given these results, it is important to consider how subgroup differences on validation criteria may affect conclusions researchers draw from
differential prediction analyses. Two recent meta-analyses have examined
subgroup differences on job performance measures. Roth, Huffcutt, and
Bobko (2003) investigated White–Black and White–Hispanic differences
across various criteria. For ratings of overall job performance, the mean
observed and corrected (for measurement error) d values were .27 and .35
for Whites versus Blacks and .04 and .05 for Whites versus Hispanics.
In terms of moderator effects, differences between Whites and minorities
tended to be larger for objective than for subjective criteria and larger for
work samples and job knowledge tests than for promotion and on-the-job
training performance.
McKay and McDaniel (2006) also examined White–Black differences
in performance. Across criteria, their results were nearly identical to those
of Roth et al. (2003). Specifically, the rated performance of White employees tends to be roughly one-third of a SD higher than that of Black employees
(corrected d = .38). However, the researchers found notable variation in
subgroup differences among the individual criteria, which ranged from
.09 for accidents to .60 for job knowledge test scores. As for moderators,
larger differences were found for cognitively oriented criteria, data from
unpublished sources, and performance measures that comprised multiple
items.
The above results lead to the question of whether criterion subgroup
differences in subjective criteria reflect true performance disparities or
rater bias. Rotundo and Sackett (1999) addressed this issue by examining
whether rater race influenced conclusions regarding differential prediction
of a cognitive ability composite. The researchers created three subsamples
of data: White and Black employees rated by White supervisors, White
and Black employees rated by a supervisor of the same race, and Black
and White employees rated by both a White and a Black supervisor.
Results revealed no evidence of differential prediction, which suggests
that conclusions concerning predictive bias were not a function of whether
performance ratings were provided by a supervisor of the same or a
different race.
Other studies have investigated various methodological issues in the
assessment of differential prediction. The omitted variables problem occurs when a variable that is related to the criterion and to the other predictor(s) is excluded from the regression model (James, 1980). This can
result in a misspecified model in which the regression coefficient for the
included predictor(s) is biased. This problem can also occur when examining differential prediction if the omitted predictor is related to both the
criterion and subgroup membership. Using data from Project A, Sackett,
Laczo, and Lippe (2003) found that inclusion of a previously omitted
predictor can change conclusions of differential prediction analyses. For
example, existence of significant intercept differences when personality
variables were used to predict core task performance dropped from 100%
to 25% when the omitted predictor (i.e., a measure of general mental
ability) was included in the model.
Several studies have examined the issue of statistical power in MMR
analysis. Insufficient power is particularly problematic in the assessment
of differential prediction given the potential consequences of test use.
Aguinis and Stone-Romero (1997) conducted a Monte Carlo study to
examine the extent to which various factors influence power to detect a
statistically significant categorical moderator. Results indicated that factors such as predictor RR, total sample size, subgroup sample size, and
predictor-subgroup correlations can have considerable effects on statistical power.
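To illustrate why these factors matter, the brief Monte Carlo sketch below (a hypothetical illustration in Python, not a reproduction of Aguinis and Stone-Romero's actual simulation) estimates the power to detect a predictor-subgroup interaction for user-specified sample sizes, subgroup proportions, and subgroup slopes.

# Monte Carlo sketch of power to detect a categorical moderator in MMR.
# All parameter values are hypothetical illustrations.
import numpy as np
import statsmodels.api as sm

def mmr_power(n=200, prop_group1=0.30, slope_group0=0.30, slope_group1=0.15,
              alpha=0.05, n_sims=2000, seed=1):
    rng = np.random.default_rng(seed)
    significant = 0
    for _ in range(n_sims):
        group = rng.binomial(1, prop_group1, size=n)    # dummy-coded subgroup
        x = rng.normal(size=n)                          # standardized predictor
        slope = np.where(group == 1, slope_group1, slope_group0)
        y = slope * x + rng.normal(size=n)              # criterion with unit-variance error
        X = sm.add_constant(np.column_stack([x, group, x * group]))
        fit = sm.OLS(y, X).fit()
        if fit.pvalues[3] < alpha:                      # p-value for the interaction term
            significant += 1
    return significant / n_sims

# For example, mmr_power(n=200) typically returns a value well below the
# conventional .80 benchmark, illustrating how difficult slope differences
# can be to detect in samples of moderate size.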
It often is difficult to obtain an individual validation sample large
enough to provide sufficient power to detect differential prediction. To
help address this problem, Johnson, Carter, Davison, and Oliver (2001)
developed a synthetic validity-based approach to assess differential prediction. In synthetic validity, justification for use of a selection procedure
is based upon relations between scores on the procedure and some assessment of performance with respect to one or more domains of work
within a single job or across different jobs (SIOP, 2003). This technique
involves the identification of different jobs that have common components
(e.g., customer service), collecting data on the same predictor and criterion measures from job applicants or incumbents, and combining the data
across jobs to estimate an overall validity coefficient. Johnson and colleagues outlined the necessary formulas and data requirements to conduct
differential prediction analyses within this framework.
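The sketch below conveys only the general pooling idea in simplified form: a sample-size-weighted average of job-level validity coefficients with hypothetical jobs and values. It is not Johnson et al.'s (2001) actual formulas, which also address component-level criteria and the associated data requirements.

# Simplified illustration of pooling validity evidence across jobs that share
# a common work component. Jobs, correlations, and sample sizes are hypothetical.
import numpy as np

job_results = [("claims processor", 0.22, 180),   # (job, observed r, N)
               ("call-center agent", 0.28, 240),
               ("bank teller", 0.18, 120)]

ns = np.array([n for _, _, n in job_results], dtype=float)
rs = np.array([r for _, r, _ in job_results])

pooled_r = np.sum(ns * rs) / np.sum(ns)   # sample-size-weighted mean validity
print(f"Pooled validity estimate across jobs: {pooled_r:.3f}")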
Finally, Aguinis and colleagues have developed several freely available
computer programs that can aid in the assessment of differential prediction. For instance, Aguinis, Boik, and Pierce (2001) developed a program
called MMRPOWER that allows researchers to approximate power by inputting parameters such as sample size, predictor RR, predictor-subgroup
correlations, and reliability of measurement. Also, the MMR approach to
testing for differential prediction assumes that the variance in the criterion
that remains after predicting the criterion from the predictor is roughly
equal across subgroups. Violating this assumption can influence type I
error rates and reduce power, which, in turn, can lead to inaccurate conclusions regarding moderator effects (Aguinis, Peterson, & Pierce, 1999;
Oswald, Saad, & Sackett, 2000). Aguinis et al. (1999) developed a program
(i.e., ALTMMR) that determines whether the assumption of homogeneity of error variance has been violated and, if so, computes alternative
inferential statistics that test for a moderating effect. Last, Aguinis and
Pierce (2006) described a program for computing the effect size (f²) of a
categorical moderator.
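For reference, the conventional (Cohen) definition of f² for the incremental variance explained by the predictor-subgroup interaction is shown below, where the R² values come from the MMR models with and without the interaction term; Aguinis and Pierce's (2006) program may compute a refined variant of this quantity, so the expression is offered only as the standard form.

f^2 = \frac{R^2_{\text{with interaction}} - R^2_{\text{without interaction}}}{1 - R^2_{\text{with interaction}}}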
Conclusions and recommendations. Accurate assessment of differential prediction is critically important for predictor validation. Interestingly,
recent research has tended to focus on issues associated with detecting
differential prediction rather than estimating the differential prediction
associated with the predictors themselves. One possible reason for this is
that many researchers appear to have generalized earlier findings of no differential prediction for cognitive ability tests (e.g., Dunbar & Novick, 1988;
Houston & Novick, 1987; Schmidt, Pearlman, & Hunter, 1981) to other
predictors and to demographic group comparisons other than Black-White
and male-female. The results of our review suggest that such generalizations
may be unfounded. With this in mind, we offer the following recommendations.
(a) When technically feasible (e.g., when there is sufficient statistical
power, when unbiased criteria are available), conduct differential prediction analyses as a standard component of validation
research.
(b) Be aware that even predictor constructs with small subgroup mean
differences (e.g., certain personality variables) can contribute to
predictive bias.
(c) Differential prediction may be as much a function of the criteria as
it is a function of the predictor(s) being validated (Saad & Sackett,
2002). Thus, think carefully about the choice of validation criteria
and always estimate and report criterion subgroup differences. If
there is evidence of differential prediction, examine whether it
appears to be due to the predictors, criteria, or both.
(d) Avoid the omitted variables problem by including all relevant
predictors in the MMR model. Furthermore, if a composite of
predictors is to be used, then the composite (and not the individual
predictors that comprise it) should be the focus of the differential
prediction analyses (Sackett et al., 2003).
(e) Use power analysis to determine the sample size required to draw
valid inferences regarding differential prediction, and report the
actual level of power for all relevant analyses. Also, assess possible
violations of homogeneity of error variance, and report the effect
size associated with the predictor-subgroup interaction term(s).
Validation Sample Characteristics

Selection researchers have long been concerned about how various
sample characteristics may affect conclusions regarding criterion-related
validity (e.g., Barrett, Phillips, & Alexander, 1981; Dunnette, McCartney,
Carlson, & Kirchner, 1962; Guion & Cranny, 1982). The sampling issues
that have received the most recent research attention are the validation
design/sample and the inclusion of repeat applicants.
Validation design and sample. Validation design refers to when the
predictor and criterion data are collected. In concurrent designs, predictors and criteria are collected at the same time. Conversely, in a predictive
design, the predictors are administered first and the criteria are collected
at a later point (see Guion & Cranny, 1982, for a description of the various
kinds of predictive designs). Validation sample refers to the individuals from whom the validation data are collected. The two main types of
participants are job incumbents and job applicants. Generally speaking,
concurrent validation designs tend to use existing employees, whereas
predictive designs tend to use job applicants. However, data from incumbents can be collected using a predictive design. For example, incumbents
may be administered an experimental test battery during new hire training
and then their job performance is assessed after 6 months on the job. Thus,
validation study design and sample are not necessarily isomorphic, though
researchers often treat them as such.
Validation design is an important issue because conclusions with regard to relations between predictors and criteria can differ depending
on when the variables are measured. For example, as we discuss later,
correlations between predictors and criteria can decrease as the time lag
between their measurement increases. Validation sample is an important
issue because applicants and incumbents may think and behave differently
when completing predictor measures. For example, applicants are likely to
have higher levels of test-taking motivation (Arvey, Strickland, Drauden,
& Martin, 1990) than incumbents because they want to be selected. Thus,
applicants may take assessments more seriously than incumbents and, for
example, devote more careful thought to their responses. Applicants are
also thought to be more likely than incumbents to attempt to distort their
responses (i.e., fake) on noncognitive predictors to increase their chances
of being selected.
Meta-analysis has been the primary method by which recent studies
have examined the effects of study design/sample on the validity of selection constructs and methods. For example, Hough (1998) reanalyzed
personality data collected during Project A (Hough, 1992). In addition to
the Big Five factors of Agreeableness, Emotional Stability, and Openness,
Hough compared validity estimates for rugged individualism, the achievement and dependability facets of Conscientiousness, and the affiliation and
potency facets of Extraversion. The criteria were job proficiency, training success, educational success, and counterproductive behavior. Across
criteria, observed correlations were between .04 and .15 smaller for predictive designs than for concurrent designs, with an average difference of
.07. Although small, a difference of .07 represents approximately half of
the observed validity for personality dimensions such as Conscientiousness and Emotional Stability (Ployhart, Schneider, & Schmitt, 2006).
Other studies have examined whether validation design affects the
criterion validity of particular selection methods. For example, McDaniel
et al. (2001) used meta-analysis to estimate the criterion validity of situational judgment tests (SJTs) in relation to job performance. Results
revealed mean validity coefficients (corrected for criterion unreliability)
of .18 for predictive designs and .35 for concurrent designs. However,
the predictive validity estimate was based on only six studies and 346
individuals (vs. k = 96 and N = 10,294 for concurrent designs); thus, the
researchers urged caution in interpreting this finding.
Huffcutt, Conway, Roth, and Klehe (2004) compared the validity of
situational and behavior description interviews (BDI) for predicting overall job performance. They discovered correlations (corrected for predictor
RR and criterion unreliability) of .38 and .48 for predictive and concurrent
designs, respectively. The mean difference between the two validation designs was greater for BDI studies (r = .33 vs. .54) than for situational
interview studies (r = .41 vs. .44). Nonetheless, these results should be
interpreted cautiously because the .33 predictive validity estimate for BDIs is based
on only three studies. Most recently, Arthur et al. (2006) examined the
validity of selection-oriented measures of personorganization (P-O) fit.
In comparing validities based on predictive versus concurrent designs, the
researchers found correlations (corrected for unreliability in both predictor
and criterion) of .12 and .14, respectively, in relation to job performance
ratings.
We found only two primary studies that examined the effects of validation design/sample on criterion-related validity. 4 Weekley, Ployhart,
and Harold (2004) compared the validity of SJTs and measures of three of
the Big Five factors (i.e., Agreeableness, Conscientiousness, and Extraversion) across three predictive studies with job applicants and five concurrent
studies with job incumbents. The overall results revealed nonsignificant
validity differences between the applicant and incumbent samples.
Harold, McFarland, and Weekley (2006) estimated the validity of a
biodata inventory administered to incumbents during a concurrent validation study and to applicants as part of the selection process. Supervisors of
the incumbents and selected applicants provided job performance ratings
(though the time-lag between predictor and criterion measurement was
not reported). Observed correlations with ratings of overall job performance were .30 and .24, respectively, for the two groups. Interestingly,
validity coefficients for verifiable biodata items were comparable between
incumbents and applicants (r = .21 vs. .22), whereas the validity of nonverifiable items was stronger in the incumbent sample (r = .30 vs. .18).
The researchers did not appear to correct the incumbent biodata scores for
RR. Thus, operational validity differences between the two samples may
have been even larger.
Inclusion of repeat applicants. Another validation sample issue that
has received recent attention is the inclusion of repeat applicants. This is an
important issue because many organizations allow previously unsuccessful applicants to retake selection tests. Indeed, current professional guidelines state that employers should provide opportunities for reassessment
and reconsidering candidates whenever technically and administratively
feasible (SIOP, 2003, p. 57). A pertinent question becomes whether the
validity of inferences drawn from the selection procedures differs for
first-time and repeat applicants. This question addresses concerns such as
whether repeat test takers score higher (thereby changing rank ordering
and affecting validity), whether higher scores are due to changes in the latent construct or to extraneous factors (e.g., practice effects), and whether
the percentage of repeat test takers in a sample affects validity.

4 Many recent studies have examined differences in the psychometric properties of
noncognitive predictors (e.g., personality measures) between applicant and nonapplicant
groups. Some studies have found evidence of measurement invariance (e.g., Robie, Zickar,
& Schmit, 2001; D. B. Smith & Ellingson, 2002; D. B. Smith, Hanges, & Dickson,
2001), whereas other studies have found nontrivial between-group differences, such as
the existence of an ideal-employee factor among applicants but not among nonapplicants (e.g., Cellar, Miller, Doverspike, & Klawsky, 1996; Schmit & Ryan, 1993). One
consistent finding is that applicants tend to receive higher scores than do incumbents
(Birkeland, Kisamore, Brannick, & Smith, 2006). Higher mean scores may affect criterion-related validity estimates to the extent they reduce score variability on the predictor. High
mean scores also can reduce the extent to which predictors differentiate among applicants, result in an increased number of tie scores, and create challenges for setting cut
scores.
Several recent articles have addressed the issue of repeat applicants.
Hausknecht, Halpert, Di Paolo, and Moriarty Gerrard (2007) used meta-analysis to estimate mean differences in cognitive ability scores across
testing occasions (most data were obtained from education settings). Test
takers increased their scores between .18 and .51 SDs upon retesting. Score
improvements were larger when respondents received coaching prior to
retesting and when identical (rather than alternate) test forms were used.
Raymond, Neustel, and Anderson (2007) reported somewhat larger mean
score gains (d = .79 and .48) in two samples of individuals who repeated
certification exams. Interestingly, score gains did not vary on the basis of
whether participants received an identical exam or a parallel exam upon
retesting.
Hausknecht, Trevor, and Farr (2002) investigated retesting effects on
a cognitive ability test and an oral presentation exercise. Using scores
on a final training exam as criteria, results revealed that the validity coefficients were somewhat higher for first-time than for repeat applicants
on both the cognitive test (r = .36 vs. .24) and the presentation exercise
(r = .16 vs. .07). The researchers also found that applicants tended to
increase their scores upon retesting, such that the standardized mean difference between applicants' initial cognitive scores and their second and
third scores were .34 and .76, respectively. Given this, Hausknecht and
colleagues speculated that the score improvements likely did not represent
increases in job-relevant KSAOs (i.e., because cognitive ability tends to
be fairly stable over time) but rather some form of construct-irrelevant
variance (e.g., test familiarity). Last, for both the cognitive test and the
oral presentation, the number of retests was positively related to training
performance and negatively related to turnover. Thus, persistence in test
taking may be an indication of an applicant's motivation and commitment
to the organization.
Lievens, Buyse, and Sackett (2005) examined retest effects in a sample
of medical school applicants. The predictors were a science knowledge
test, a cognitive ability test, and a SJT, and the criterion was grade point
average (GPA). Retest status (i.e., first-time or repeat applicant) provided
incremental prediction of GPA beyond that provided by a composite of
the above predictors. Specifically, the corrected validity estimates tended
to be higher for first-time applicants on the knowledge test (r = .54
vs. .37) and the cognitive test (r = .28 vs. .03), but not on the SJT
(r = .20 vs. .28). When comparing criterion validity within repeat
applicants, Lievens and colleagues found that scores on the most recent
knowledge test were better predictors than were scores on the initial test (r
= .37 vs. .23), whereas nonsignificant validity differences were observed
for the cognitive test and SJT.
In a subsequent study of these data, Lievens, Reeve, and Heggestad
(2007) examined the existence of measurement bias and predictive bias on
the cognitive ability test. Results revealed a lack of metric invariance (i.e., the same factor accounted for different amounts of variance
in each of its indicators across groups) and a lack of uniqueness invariance (i.e., indicator error terms differed across groups) upon retesting. Consistent with
the results of their first study, there was also evidence of predictive bias in
that initial test scores demonstrated criterion-related validity (in relation
to GPA), whereas retest scores did not. Together, these results suggest that
the constructs measured by the cognitive test may have changed from the
initial test to the retest. Indeed, Lievens and colleagues found that retest
scores were more correlated with scores on a memory association test
than were initial test scores.
Finally, Hogan, Barrett, and Hogan (2007) examined retesting effects
on measures of the Big Five personality factors across multiple samples.
Applicants for a customer service job, who were originally rejected because they did not pass a battery of selection tests, reapplied for the same
job and completed the same set of tests. Results revealed mean score
improvements that neared zero for all Big Five factors. In fact, change
scores were normally distributed across applicants, who were as likely to
achieve lower scores upon retesting as they were to achieve higher scores.
In addition, there was evidence that applicants who obtained higher scores
on social skills and social desirability and lower scores on integrity were
somewhat more likely to achieve higher retest scores.
Conclusions and recommendations. The composition of validation
samples can have important implications for conclusions regarding the
criterion-related validity of selection procedures. Recent research suggests
this may be true not only for comparing whether the sample comprises
applicants or incumbents but also for whether applicant samples comprise
first-time and/or repeat test takers. Furthermore, longitudinal research on
relations between predictors and criteria over time (which we discuss
later) suggests that validation design (predictive vs. concurrent) also may
influence validity. With these issues in mind, we recommend the following.
(a) Criterion-related validity estimates based on incumbent samples
may overestimate the validity one can expect when the predictor(s) are used for applicant selection. Validity estimates derived
on the basis of predictive designs are slightly to moderately lower
(i.e., by about .05 to .10) than those based on concurrent designs for personality
inventories, structured interviews, P-O fit measures, and biodata
inventories. Although based on a small number of studies, predictive designs may yield considerably lower validity estimates
for SJTs. We found no studies that examined the effects of study
design/sample on cognitive ability, though the results of earlier
research (e.g., Barrett et al., 1981) have suggested that the effects
tend to be negligible (although see Guion and Cranny, 1982, for a
counterargument).
(b) Be precise in reporting sample characteristics, such as whether the
term "predictive design" designates that the predictor data were
collected prior to the criterion (and if so, what the time-lag was)
or whether this term represents a proxy for an applicant sample.
This is important because the extent to which some of the aforementioned studies confounded validation design and validation
sample is unclear. Thus, it is uncertain whether some of the observed validity differences are due to design differences, sample
differences, or some combination of the two.
(c) If the validation sample comprises job applicants, examine the
effects of retesting on criterion validity (e.g., compare validity of
initial vs. subsequent test scores).
(d) Inclusion of repeat applicants in validity studies likely will result
in higher mean scores but lower validity coefficients. Therefore,
estimating validity using a sample of first-time applicants may
overestimate the validity that will be obtained when selection includes repeat applicants. Retesting may also alter the construct(s)
assessed by the selection procedures. Nonetheless, applicants who
choose to reenter the selection process can be productive employees, and their subsequent test scores can be as or more predictive
than their initial scores.
Validation Criteria

One of the most noticeable recent trends in selection research has been
the increased attention given to the criterion domain. This is a welcome
trend because accurate specification and measurement of criteria is vital
for the effective selection, development, and validation of predictors. After all, predictors derive their importance from criteria (Wallace, 1965).
Recent studies have examined a wide range of criterion issues, and a
comprehensive treatment of this literature is well beyond the scope of
this article. Instead, we focus on key findings that have the most direct
implications for use of criteria in predictor validation.
Expanding the Performance Domain

In contrast to decades of research on task performance, studies conducted during the past decade have increasingly focused on expanding the
criterion domain to include behaviors that may fall outside of job-specific
task requirements. The main implication of the research for validation
work is that these newer criteria may allow for (or require) the development of different and/or additional predictors. We discuss three types
of criteria: citizenship performance, counterproductive performance, and
adaptive performance.
Citizenship performance. By far the most active line of recent criterion research has concerned the consideration of citizenship performance,
which also has been referred to as contextual performance (Borman & Motowidlo, 1993, 1997), organizational citizenship behavior (C. A. Smith,
Organ, & Near, 1983), and prosocial organizational behavior (Brief &
Motowidlo, 1986). Task performance involves behaviors that are a formal
part of one's job and that contribute directly to the products or services
an organization provides. It represents activities that differentiate one job
from another. Citizenship performance, on the other hand, involves behaviors that support the organizational, social, and psychological context
in which task behaviors are performed. Examples of citizenship behaviors
include volunteering to complete tasks not formally part of one's job,
persisting with extra effort and enthusiasm, helping and cooperating with
coworkers, following company rules and procedures, and supporting and
defending the organization (Borman & Motowidlo, 1993). Thus, whereas
task behaviors tend to vary from job-to-job, citizenship behaviors are quite
similar across jobs.
Although there is considerable overlap among the various models
of citizenship behavior, researchers have had differing views concerning
whether such behaviors are required or discretionary. For example, Organ (1988) originally indicated that organizational citizenship behaviors
(OCBs) were discretionary and not formally rewarded, whereas Borman
and Motowidlo (1993) did not state that contextual behaviors had to be
discretionary. However, Organ (1997) revised his definition of OCB and
dropped the discretionary aspect, which resulted in a definition more
aligned with that of contextual performance (see Motowidlo, 2000).
The results of recent citizenship performance research have at least
two key implications for validation research. First, research suggests that
ratings of task performance and citizenship performance are moderately
to strongly correlated, and that correlations tend to increase when the
same individuals provide both ratings (e.g., Chan & Schmitt, 2002; Conway, 1999; Ferris, Witt, & Hochwarter, 2001; Hattrup, O'Connell, &
Wingate, 1998; Johnson, 2001; McManus & Kelly, 1999; Morgeson, Reider, & Campion, 2005; Van Scotter, Motowidlo, & Cross, 2000). Hoffman,
Blair, Meriac, and Woehr (2007) constructed a meta-analytic correlation
matrix that included ratings of task performance (one overall dimension)
and ratings of Organ's (1988) five OCBs, including behaviors directed
toward the organization (three dimensions) and behaviors directed toward individuals within the organization (two dimensions). Results of a

900

PERSONNEL PSYCHOLOGY

confirmatory factor analysis (CFA) suggested that OCB is best viewed


as a single latent factor, rather than as two separate factors that reflect
organizational- and individual-directed OCBs. Further, although overall
OCB ratings were highly correlated with task performance ratings (ρ =
.74), model fit was somewhat better when the two types of performance
comprised separate factors.
Other research has investigated the relative contribution of task and
citizenship behaviors to ratings of overall job performance. In general, although supervisors tend to assign greater weight to task performance, they
also consider citizenship performance when evaluating workers (Conway,
1999; Johnson, 2001; Rotundo & Sackett, 2002). Interestingly, Conway
(1999) found that supervisors may give more weight to task performance,
whereas peers may give more weight to citizenship performance. There
also is evidence that the two types of performance make independent
contributions to employees attainment of rewards and promotions (Van
Scotter et al., 2000).
A second implication of recent research on the task-citizenship distinction concerns whether the two performance dimensions have different
antecedents. Because task performance concerns the technical core of
one's job, it is thought to be best predicted by ability and experience-related individual differences. Alternatively, because some dimensions of
citizenship performance are discretionary and/or interpersonally oriented,
citizenship is thought to be best predicted by dispositional constructs such
as personality. The results of some studies have provided evidence that
task and citizenship performance do have different correlates (e.g., Hattrup
et al., 1998; LePine & Van Dyne, 2001; Van Scotter & Motowidlo, 1996).
For example, Van Scotter and Motowidlo found that job experience was
a significantly stronger predictor of task performance than of citizenship
performance. Similarly, LePine and Van Dyne discovered that cognitive
ability was a better predictor of individual decision making than of cooperation, whereas Agreeableness, Conscientiousness, and Extraversion
were better predictors of cooperation.
Other researchers, however, have found less consistent support for
different predictors of task and citizenship performance (e.g., Allworth
& Hesketh, 1999; Ferris et al., 2001; Hurtz & Donovan, 2000; Johnson,
2001). When validity differences between task and citizenship performance are found, they tend to be specific to a given predictor or to emerge
only for very specific predictors (i.e., rather than for broad constructs, such as the
Big Five factors). To illustrate, Johnson (2001) examined the validity of
measures of cognitive ability and personality (i.e., Agreeableness, dependability, and achievement) in relation to dimensions of task, citizenship, and
adaptive performance. He found that cognitive ability correlated more strongly
with task than with citizenship performance, though the differences were
not consistently large, and ability was similarly related to some facets of
task and citizenship performance. Further, although Agreeableness tended
to be more related to citizenship performance than to task performance,
dependability and achievement were similarly related to the two types of
performance.
Counterproductive work behavior. The second major expansion of
the criterion space involves counterproductive work behavior (CWB).
CWBs reflect voluntary actions that violate organizational norms and
threaten the well-being of the organization and/or its members (Bennett &
Robinson, 2000; Robinson & Bennett, 1995). Researchers have identified
a large number of CWBs, including theft, property destruction, unsafe
behavior, poor attendance, and intentional poor performance. However,
empirical research typically has found evidence for a general CWB factor
(e.g., Bennett & Robinson, 2000; Lee & Allen, 2002), or for a small set
of subfactors (e.g., Gruys & Sackett, 2003; Sackett, Berry, Wiemann, &
Laczo, 2006). For example, Sackett et al. (2006) found two CWB factors,
one that reflected behaviors aimed at the organization (i.e., organizational
deviance) and another that reflected behaviors aimed at other individuals
within the organization (i.e., interpersonal deviance). Moreover, results
of a recent meta-analysis revealed that although highly related (ρ = .62),
interpersonal and organizational deviance had somewhat different correlates (Berry, Ones, & Sackett, 2007). For example, interpersonal deviance
was more strongly related to Agreeableness, whereas organizational deviance was more strongly related to Conscientiousness and citizenship
behaviors.
As with citizenship performance, an important issue for researchers is
whether CWBs can be measured in such a way that they provide performance information not captured by other criterion measures. Preliminary
evidence suggests some reason for optimism in this regard. For instance, a
meta-analysis by Dalal (2005) revealed a modest relationship (ρ = −.32)
between CWBs and citizenship behaviors (though the relationship was
much stronger [ρ = −.71] when supervisors rated both sets of behaviors).
There also was some evidence that the two types of behaviors were differentially related to variables such as job satisfaction and negative affect.
Sackett et al. (2006) found that treating CWB and citizenship behaviors as
separate factors in a CFA provided a better fit to the data than did treating
them as a single entity. In addition, the Big Five personality factors exhibited somewhat different relations with the two criteria. Similarly, Dudley,
Orvis, Lebiecki, and Cortina (2006) found that CWBs were predicted
by different facets of Conscientiousness than were task and citizenship
performance.
Adaptive performance. Todays work environment often requires
workers to adapt to new and ever-changing situations, including
globalization, mergers, technological advances, and diversity (Ilgen &
Pulakos, 1999). Adaptive performance has been defined as the proficiency with which individuals alter their behavior to meet work demands
(Pulakos, Arad, Donovan, & Plamondon, 2000). Pulakos and colleagues
analyzed critical incidents from 21 different jobs that described adaptive
behaviors. They identified eight dimensions of adaptive performance, such
as handling emergencies or crisis situations, handling work stress, solving
problems creatively, and demonstrating cultural adaptability. However, results of a subsequent study revealed evidence of a strong general adaptability factor rather than more specific facets (Pulakos et al., 2002). Further,
although measures of adaptive experiences, interests, and self-efficacy
predicted an overall adaptive performance composite, the criterion was
best predicted by a measure of achievement orientation.
Relatively few published studies have investigated whether adaptive
performance is distinct from task and citizenship performance. Allworth
and Hesketh (1999) reported the results of a principal components analysis of supervisor ratings of task, citizenship, and adaptive performance.
Results revealed the existence of a strong general ratings factor. However,
when the researchers specified three factors, there was some evidence that
the three performance dimensions comprised separate (yet highly correlated) factors. Allworth and Hesketh also examined whether the performance dimensions were differentially predicted by biodata, personality,
and cognitive ability. In general, measures of constructs thought to be
particularly relevant to adaptive performance (e.g., emotional stability,
openness) were not better predictors of adaptability than they were of
task and citizenship performance. The one exception was that adaptive
performance was somewhat better predicted by a set of biodata scales
measuring prior experience with change, although cognitive ability was
an equally strong predictor of adaptability.
Johnson (2001) suggested that most dimensions of adaptive performance can also be classified as elements of task performance, citizenship
performance, or both. For instance, he found some evidence that handling
work stress loaded with citizenship performance rather than with task
performance or as its own factor. Johnson (2007) concluded that dealing
with uncertain and unpredictable work situations may be the only adaptive
dimension that is conceptually distinct from task and citizenship performance. He found that although highly correlated with task and citizenship
performance, there was some evidence that this dimension of adaptability
comprised its own factor. However, Johnson (2001, 2007) and Allworth
and Hesketh (1999) measured only one to three of the eight adaptive performance
dimensions identified by Pulakos et al. (2000). Thus, additional research
is needed to replicate and extend these initial findings concerning whether
adaptive performance measures can be distinguished from measures of
other types of performance.

Conclusions and recommendations. Recent research has made great
strides in expanding how job performance is conceptualized. Our review
of this literature identified some recommendations for researchers who
wish to incorporate these newer criteria into validation research but also
several challenges and areas of caution.
(a) Carefully consider whether all valued criteria for the target job
have been identified. Be aware that behaviors related to citizenship
performance and CWB may not emerge from job analysis practices
that focus more narrowly on task performance. Unfortunately, this
literature does not provide guidance as to how these types of
behaviors should be identified. Subject matter experts could rate
the importance of such behaviors. However, there currently is no
single set of performance models to serve as a basis for such
ratings (e.g., there are several similar yet different taxonomies of
citizenship performance; see Rotundo & Sackett, 2002).
(b) Discriminant validity evidence for task and citizenship performance might not always be very strong within the validation context. Similarly, despite widespread interest in adaptability on the
job, there is very limited empirical evidence that adaptive performance is distinct from overall task performance.
(c) Nearly all of the studies we reviewed used research-based performance ratings. Our experience suggests that evidence of discriminant validity (i.e., among ratings of different performance
dimensions) tends to be much weaker for administrative ratings,
which, unfortunately, are the type of criteria organizational decision makers tend to prefer.
(d) The task-cognitive and citizenship-personality distinction may not
be as large as previously thought. Even when differences in validity are found, they often are modest and may not be practically
significant. In contrast, it appears that CWBs often have somewhat different antecedents than do other types of performance
dimensions and that different CWBs (e.g., interpersonal vs. organizational deviance) also have different correlates.
(e) Researchers need further guidance concerning the measurement
of some of these constructs. For example, initial evidence suggests
that CWB may be distinct from task and citizenship performance,
but there would appear to be a variety of potential measurement
challenges, including the low base rate of some CWBs (e.g., theft
and sabotage), the inability of others (e.g., supervisors) to observe
such behaviors, and the likelihood of socially desirable responding
for self-reports.
(f) The discretionary nature of some types of citizenship behaviors
may have implications for the legal defensibility of selection
decisions made on the basis of selection procedures that predict
only such behaviors. Behaviors such as putting forth extra effort
to perform one's job, cooperating with others, and following organizational policies are likely to be required by the job and thus
are appropriate validation criteria. In contrast, behaviors such as
promoting and defending the organization and helping coworkers
with personal matters, although important, seem more peripheral
to one's job. We suggest that researchers exercise caution when
validating predictors using criterion measures that include these
latter types of citizenship behaviors.
(g) We do not mean to paint a pessimistic picture about the implications of this research for validation practice. Nonetheless, theory
appears to have outpaced practice with respect to the use of some
of these newer types of criteria. Therefore, researchers should be
cautious when considering such criteria for validation until research has addressed some of the issues noted herein.
Broad versus Narrow Criteria

A related decision researchers face is whether to use a single broad
criterion or multiple narrower criteria to validate selection procedures.
This has been a longstanding issue within the literature (e.g., Schmidt &
Kaplan, 1971). Some scholars have suggested that different performance
dimensions should be combined into a single composite that reflects overall performance (e.g., Nagle, 1953), whereas others have suggested that
different dimensions be examined independently (e.g., Dunnette, 1963).
Several recent studies have demonstrated the potential benefits of using multiple criteria. Murphy and Shiarella (1997) examined the effects
of varying weights for two predictors (i.e., cognitive ability and conscientiousness), weights for two criteria (i.e., task and citizenship performance),
and the SD of the criteria on the correlation between the predictor composite and the criterion composite. Results indicated that the weights assigned
to different predictors and to different criteria can have a dramatic effect
on criterion-related validity. Specifically, the 95% confidence intervals for
validities of the weighted predictor composites ranged from .20 to .78. 5
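The analytic machinery behind such analyses is the standard expression for the correlation between weighted composites. In the general matrix form sketched below (a textbook formula, not a reproduction of Murphy and Shiarella's derivations), w_x and w_y denote the predictor and criterion weight vectors, R_xx and R_yy the predictor and criterion intercorrelation matrices, and R_xy the matrix of predictor-criterion correlations.

r_{\text{composite}} = \frac{\mathbf{w}_x' \mathbf{R}_{xy} \mathbf{w}_y}{\sqrt{\mathbf{w}_x' \mathbf{R}_{xx} \mathbf{w}_x} \, \sqrt{\mathbf{w}_y' \mathbf{R}_{yy} \mathbf{w}_y}}

Varying w_x and w_y in this expression is what produces the wide range of composite validities reported above.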
5 See related articles by De Corte and colleagues (e.g., De Corte, 1999; De Corte &
Lievens, 2003; De Corte, Lievens, & Sackett, 2006, 2007), Hattrup, Rock, and Scalia (1997),
and Schmitt, Rogers, Chan, Sheppard, and Jennings (1997) for information concerning the
effects of weighting predictors and/or criteria on the tradeoff between criterion-related
validity and adverse impact. In addition, see Aguinis and Smith (2007) for a framework
that allows researchers to estimate the effects of criterion-related validity and predictive
bias on selection errors and adverse impact.

Other studies have investigated the effects of matching conceptually
related predictors and criteria on validity. A meta-analysis by Hogan and
Holland (2003) showed that aligning personality scales with narrow criteria, instead of with an overall criterion composite, resulted in higher
validity estimates. For instance, operational validities based on narrow
and aligned criteria versus broad criteria were .31 versus .20 for Conscientiousness, and .29 versus .08 for Openness. In a similar study, Bartram
(2005) found that observed correlations between personality test scores
and performance ratings were larger when the predictors and criteria were
theoretically aligned than when they were not (mean r = .16 vs. .02).
Although the idea of multidimensional criteria is appealing, in practice, ratings criteria often are so intercorrelated that the use of separate
criterion variables cannot be justified empirically. Indeed, researchers frequently describe the diverse nature of the performance dimensions rated,
but then end up having to use a single criterion composite because the
dimension ratings are highly correlated or because some type of factor
analysis suggested the existence of a dominant single factor. For example,
a meta-analysis by Viswesvaran et al. (2005) found that when all forms
of measurement error were removed from the ratings, a general factor
accounted for 60% of the total variance. They concluded that the common practice of combining individual dimension ratings into an overall
performance composite to serve as a validation criterion is indeed justifiable. However, the researchers also noted that more specific performance
factors could account for some of the variance unexplained by the general
factor.
Conclusions and recommendations. The results of our review generally support the conclusions of Schmidt and Kaplan (1971): Broad criteria
are best predicted by broad predictors, and narrow criteria are best predicted by narrow predictors. That said, there are some important new
developments.

(a) Criterion-related validity may be enhanced by linking narrow criteria to narrow predictors. However, such relationships may not
generalize across jobs to the same extent that relations between
broad criteria and broad predictors generalize (i.e., because narrow
criteria are less likely to be equally important for each job).
(b) Although weighting conceptually aligned predictor and criterion composites may enhance predictive validity, weighting also
poses some potential limitations. First, it assumes that researchers
can indeed extract multiple criterion variables from their performance measures, which, as we discussed, remains a persistent
challenge. Second, if researchers choose to derive empirically
weighted predictor composites on the basis of validation results
(rather than, for example, rationally weighted composites according to criterion importance to the organization), the predictors
likely will experience shrinkage upon cross-validation (Raju
et al., 1999).
(c) Develop criterion measures that are conceptually aligned with the
latent criterion constructs and that maximize the potential use
of multiple criteria for predictor validation. Some ideas for developing multiple criterion measures include collecting ratings
data from multiple sources (e.g., supervisors to assess task performance, peers to assess citizenship performance, and even self-ratings to assess certain types of CWBs not easily observed by
others), using multiple items to measure each criterion dimension
so that reliable composites can be created, and using nonratings
criteria such as job knowledge tests and work sample tests, as
appropriate.
(d) Researchers may also wish to consider withdrawal criteria such
as absenteeism and turnover as additional validation criteria. Unfortunately, we are not aware of any research that has provided
specific guidance for incorporating withdrawal variables into validation research. For instance, the vast majority of turnover research
has been conducted with current employees, and as a result, relations between preemployment predictors and turnover are not well
understood (Barrick & Zimmerman, 2005). Further, withdrawal
behaviors tend to be complex (e.g., turnover can be voluntary or
involuntary, avoidable or unavoidable, functional or dysfunctional;
Abelson, 1987), and thus, developing valid and reliable measures
of such criteria can be challenging.
(e) Ultimately, use an overall criterion composite to estimate the validity of an overall predictor composite for operational use (Guion,
1961). This suggestion does not argue against the value of multidimensional criteria. Indeed, the items that comprise the criterion composite should be broadly based to consider aspects of
task, citizenship, counterproductive, and adaptive performance, as
relevant.
Maximum versus Typical Performance

Another way to categorize validation criteria is according to whether
they reflect maximum performance or typical performance (DuBois, Sackett, Zedeck, & Fogli, 1993; Sackett, Zedeck, & Fogli, 1988). Three factors
distinguish the two types of performance (Sackett et al., 1988). Under
maximum performance conditions, those being rated know they are being
evaluated, they accept instructions to maximize their performance on the
task, and the task is of short enough duration such that ratees can devote
maximum effort. Conversely, typical performance is defined by conditions
in which ratees are not directly or continuously evaluated, there are no
explicit instructions to exert total effort, and performance occurs over an
extended period (e.g., an annual performance appraisal period).
Despite its theoretical appeal, we found only three field studies that
investigated the extent to which measures of maximum and typical performance are empirically distinct (Klehe & Latham, 2006; Marcus, Goffin,
Johnston, & Rothstein, 2007; Ployhart, Lim, & Chan, 2001). Both Ployhart et al. (2001) and Marcus et al. (2007) used assessment center ratings
to measure maximum performance and peer and/or supervisor ratings of
training or on-the-job performance to measure typical performance. In
contrast, Klehe and Latham (2006) used peer ratings of a 5-day academic
team project as the maximum criteria and peer ratings of semester-long
performance as the typical criteria. All three studies reported (a) small
to moderate observed correlations (r = .25 to .40) between maximum
and typical performance measures and (b) evidence that the two sets of
measures were better described by two separate factors than by a single,
overall factor.
These and other studies also have investigated whether the two types
of performance have different antecedents. Sackett et al. (1988) suggested
that motivation is the primary difference between maximum and typical
performance. Specifically, motivation is thought to play a large role in what
workers will do from day-to-day (i.e., typical performance) and to play a
lesser role in what workers can do in those relatively shorter situations that
require maximum performance because motivation is constrained to be
high. Thus, typical performance is expected to be best predicted by dispositional constructs related to motivation, whereas maximum performance
is expected to be best predicted by ability-related constructs.
Subsequent empirical research has provided somewhat mixed support for ability and motivation as differential antecedents of maximum and typical performance. Marcus et al. (2007) found that cognitive ability was a better predictor of maximum performance, whereas
personality variables such as Conscientiousness and Extraversion were
better predictors of typical performance. Kirk and Brown (2003) found
that two motivation-oriented predictors (general self-efficacy and need
for achievement) correlated with scores on a maximum performance-oriented test used for promotion. These results appear to challenge the notion
that motivation does not have predictive value in a maximum performance context. In a laboratory setting, Klehe and Anderson (2007) found
evidence that measures of motivation (e.g., self-efficacy) correlated higher
with typical performance than with maximum performance, whereas the
opposite was true for knowledge- and skill-based predictors. However,
some of the motivational predictors were also predictive of performance
under maximum conditions, which is consistent with the findings of Kirk
and Brown (2003). Finally, Klehe and Latham (2006) discovered that
ratings of both situational and BDI questions were stronger predictors of
typical performance than of maximum performance, despite the fact that
selection interviews would seem to represent a maximum performance-oriented context.
Conclusions and recommendations. The distinction between maximum and typical performance is a potentially important but frequently
overlooked issue in validity research. We believe that much more research is needed to further understand this distinction and its implications
for predictor development and validation. On the basis of the relatively
limited amount of research that has been conducted, we recommend the
following.
(a) Devote careful thought to whether the performance domain of the
target job reflects maximum performance, typical performance, or
some combination of the two and then develop selection procedures accordingly. A mismatch of predictors and typical/maximum
criteria could obscure evidence of criterion-related validity and
lead to inaccurate conclusions about predictor effectiveness.
(b) Be aware, however, that recent findings suggest that traditional
distinctions between maximum and typical performance are not
always very clear. For example, there is evidence that motivation is
not constant in maximum performance contexts. Likewise, the distinction between predictors of typical and maximum performance
is not clear-cut. It appears that maximum and typical performance
represent a continuum rather than a strict dichotomy.
(c) The maximum-typical distinction may be most important for jobs
that have a strong maximum performance component. In such
cases, failure to include a maximum criterion would neglect a
potentially important aspect of the performance domain and, in
turn, the identification of valid predictors. As an example, Jackson
and colleagues (Jackson, Harris, Ashton, McCarthy, & Tremblay,
2000) noted that law enforcement jobs often require maximum
performance (e.g., apprehending suspects) but that such performance often is not easily observed by supervisors. They described
how standardized work sample tests can serve as useful validation
criteria for such jobs.
Dynamic Criteria

A final criterion issue concerns the stability of job performance over
time, or what often is referred to as dynamic criteria (Ghiselli, 1956;
Humphreys, 1960). To predict performance, one must be sure that it will
be stable over some reasonable amount of time. If not, then the performance prediction problem becomes considerably more difficult because
one must not only determine the types of performance to predict, but also
when to predict them (Ployhart et al., 2006). Although some debate has
existed about the presence of dynamic criteria (e.g., Barrett, Caldwell, &
Alexander, 1985), several recent studies have provided important insights
concerning performance dynamism.
First, recent research has discovered evidence that the mean and rank
order of employee performance tends to change over time. Ployhart and
Hakel (1998), for example, found that changes in employee sales performance across eight quarters approximated a learning curve by increasing
initially and then leveling off over time. Results also revealed nontrivial
rank-order changes in performance across employees. However, there still
appears to be a stable component to performance. For instance, Sturman,
Cheramie, and Cashen (2005) meta-analyzed subjective and objective performance data collected over time. Estimated correlations (corrected for
measurement error, but not for RR) ranged from .67 to .85 after 1 year to
between .44 and .49 after 3 years.
Researchers also have explored the extent to which relations between
predictors and criteria change over time. The most common finding is
that validity coefficients decrease as the time-lag between predictor and
criterion measurement increases (i.e., a simplex pattern). A meta-analysis
by Keil and Cortina (2001), for example, revealed that the validity of
ability tests deteriorated over time. This result was fairly consistent across
predictors (i.e., cognitive, perceptual, and psychomotor ability), criteria
(i.e., performance of consistent and inconsistent tasks), and time periods
(i.e., short- and long-term performance).
Last, several studies have discovered that different predictors relate
to different types of performance change (Deadrick, Bennett, & Russell, 1997; Farrell & McDaniel, 2001; Keil & Cortina, 2001; Stewart,
1999; Thoresen, Bradley, Bliese, & Thoresen, 2004). Study results tend
to provide support for Murphy's (1989) framework of transition versus
maintenance stages. In the transition stage, job tasks are novel and employees need to learn new skills and tasks. In the maintenance stage,
employees have learned how to perform major job tasks, and performance
becomes more automatized. Therefore, ability constructs should be most
predictive of transition performance and dispositional constructs should
be most predictive of maintenance performance.
For example, Stewart (1999) found that the order facet of Conscientiousness was more strongly related to sales performance for transitional
stage employees than for maintenance stage employees (r = .27 vs. .06),
whereas the opposite was true for the achievement facet (r = .01 vs.
.22). Farrell and McDaniel (2001) examined task consistency as a potential moderator of predictor-criterion relations over time. Results revealed
that for jobs with primarily consistent tasks, cognitive ability was the best
predictor of initial performance, whereas psychomotor ability was a better
predictor of long-term performance. Conversely, for jobs with inconsistent
tasks, cognitive ability was the best predictor of both initial and long-term
performance.
Conclusions and recommendations. Performance is dynamic over
time yet is still explainable and predictable. Recent studies have helped
clarify the nature of criterion dynamism.
(a) Most research has found that performance follows a learning
curve, such that it tends to increase rapidly when employees begin
a job and then reaches an asymptote and levels off thereafter. Thus,
even though performance changes over time, there is systematic
variability present in the form of performance trends that can be
predicted and explained by individual differences.
(b) Different predictor constructs often are related to different stages
of job performance. Using Murphy's (1989) framework, transition
performance tends to be best predicted by cognitively oriented
constructs, whereas maintenance performance tends to be best
predicted by motivational/dispositional constructs.
(c) Think carefully about when to collect the criterion data used to
validate selection procedures. As a general rule, it may be advantageous to collect performance data from employees who are in
the maintenance stage of their jobs. However, this may not always be easy to determine, as the duration of transition stages can
vary depending on the job (e.g., complexity), the individual (e.g.,
prior experience), and the situation (e.g., quality of supervision;
Deadrick et al., 1997). Furthermore, transition times are becoming
increasingly important given the changing nature of work (Ilgen
& Pulakos, 1999), and early performance matters a great deal for
many jobs (e.g., first responders).
(d) If the validation sample comprises job incumbents, examine whether tenure has any substantive influence on predictor scores, criterion scores, or criterion-related validity. For example, researchers might calculate partial validity coefficients that remove variance due to tenure when estimating predictor validity. Similarly, one could test whether tenure moderates validity. These types of analyses may help researchers determine whether tenure has a practically meaningful influence on predictive validity (a brief sketch of both analyses follows this list).
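To illustrate recommendation (d), the following sketch shows one way to compute a tenure-partialled validity coefficient and a simple moderated regression test of whether tenure moderates validity. The data and variable names are hypothetical, and the code is a minimal illustration rather than a prescribed analysis; in practice researchers would also want significance tests, effect size benchmarks, and regression diagnostics.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after partialling z (e.g., tenure) out of both."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

def moderated_regression(x, y, z):
    """OLS of y on x, z, and their product; a nonzero x*z coefficient
    suggests that z (tenure) moderates predictor validity."""
    xc, zc = x - x.mean(), z - z.mean()          # center to reduce collinearity
    X = np.column_stack([np.ones_like(x), xc, zc, xc * zc])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(["intercept", "predictor", "tenure", "interaction"], b))

# Hypothetical incumbent data: predictor scores, performance ratings, tenure (years)
rng = np.random.default_rng(1)
n = 300
tenure = rng.uniform(0, 10, n)
predictor = rng.normal(size=n)
performance = 0.4 * predictor + 0.1 * tenure + rng.normal(size=n)

print("Zero-order validity:", np.corrcoef(predictor, performance)[0, 1])
print("Tenure-partialled validity:", partial_corr(predictor, performance, tenure))
print("Moderated regression:", moderated_regression(predictor, performance, tenure))
```

If the partialled coefficient closely tracks the zero-order validity and the interaction term is negligible, tenure is unlikely to have a practically meaningful influence on the validity estimate.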

Directions for Future Research

We conclude by noting several key validation issues for which more research is necessary, as well as some emerging issues that thus far have
received very little attention. Space limitations prevent us from providing
more comprehensive suggestions, but we highlight the issues we feel are
particularly critical.
(a) Questions surrounding the appropriate types and sequence of corrections for measurement error and RR remain paramount. There is no question that these artifacts can affect the validity of inferences drawn from selection procedure validation research, but many questions remain about how best to address them. For example, does Hunter et al.'s (2006) recent correction approach trump existing approaches? How does the research by Sackett and colleagues (e.g., Sackett & Yang, 2000) fit with Hunter's approach? Answers to such questions are critical for the conduct of both primary and meta-analytic research relevant to selection procedure validation (a simple illustration of the classic corrections appears after this list).
(b) One of the most vexing issues concerning validity coefficient correction procedures is how to correct ratings criteria for measurement error. Rather than continue to debate the appropriateness
of this type of correction, or which estimate of criterion reliability should be preferred, it may be productive for researchers to
focus on more fundamental questions, such as which factors contribute most to the unreliability of performance ratings and how interrater consistency can be increased. We also
encourage further research on alternative validation criteria that
may be more objective and contain less measurement error than
performance ratings, yet still capture the motivational aspects of
employee performance.
(c) RWA and DA represent promising new approaches for assessing the relative validity of predictors. However, very few studies
have implemented these methods within a validation context. Future research should examine the use and implications of these
methods for validation research, particularly when the information
that RWA and/or DA provides is inconsistent with the information
provided by more traditional methods, such as regression weights
and incremental validity analyses.
(d) We know very little about the degree of differential prediction for
a host of selection constructs (e.g., adaptability, customer service
orientation) and selection methods (e.g., structured interviews,
SJTs, assessment centers). We also know little about differential
prediction with regard to gender, age, or for ethnic groups such as Hispanics and Asians. These issues are particularly important given the changing demographics of the domestic workforce.
(e) Given the importance of sample selection, and the common decision to use job incumbents to validate predictors, we call for
research that directly tests alternate explanations for the validation
study design and sample differences noted in our review. For example, Huffcutt et al. (2004) speculated that the higher validity
estimates for concurrent validity studies of behavioral interviews
may be partially due to the fact that interviewer and supervisor
ratings are based on similar sets of job behaviors. That is, job incumbents serving as validity study participants are likely to base
their answers, at least in part, on how they behaved in their current
position, and supervisors may consider these same behaviors in
their performance evaluations of the incumbents. There is a need
to go beyond simple documentation of validation design/sample
differences; we need more theory and data to help explain why
such differences often exist.
(f) Researchers should also attempt to examine the relative contribution of various validation design factors to criterion-related validity estimates within a multivariate framework. Variance due to job experience, test-taking motivation and social desirability, and time lag between predictor and criterion data collection could be estimated. Studies also could use a within-subjects design, whereby
predictor data are collected from applicants during the selection
process and then again after the selected individuals are on the job
to determine more clearly how test context may influence validity
(for a recent example, see Ellingson, Sackett, & Connelly, 2007).
(g) We need more research that examines the effects of retesting on
the validation of selection procedures for noncognitive constructs.
This literature has focused primarily on cognitively oriented predictors, and there is little or no evidence regarding retest effects on
the validity (or differential prediction) of constructs and methods
such as personality, biodata, and interviews. Further, conclusions
regarding effects on criterion-related validity are tentative because
studies have not used job performance as a criterion.
(h) Much has been written about the construct-method distinction in
selection research (e.g., Arthur & Villado, 2008; Hough, 2001;
Schmitt & Chan, 2006). Nonetheless, the literature provides relatively little guidance concerning which method(s) are best suited
for assessing specific constructs (i.e., few studies have compared
the validity of alternative methods while holding the construct constant). For example, it would be useful to know whether the predictive validity of personality variables such as Conscientiousness
varies depending on whether they are measured using self-report

questionnaires, biodata inventories, structured interviews, and so forth.
(i) Validation researchers need more guidance for developing valid
and reliable criterion measures that cover the range of behaviors important to job performance. For example, how might job
analysis be used to better identify relevant citizenship and adaptive performance dimensions? What are the primary factors that
comprise the domain of CWBs, and what are the most valid
and reliable methods for measuring CWBs within a validation
context?
(j) Finally, there are several emerging validation issues that require
additional research. These issues include validation of selection
procedures using unit-level predictors and criteria, validation of
selection-based measures of person-environment fit, and cross-cultural differences on validation predictors and criteria.
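To make item (a) above concrete, the sketch below applies two textbook corrections to a hypothetical observed validity: disattenuation for criterion unreliability and Thorndike's Case II correction for direct range restriction on the predictor. It deliberately does not implement Hunter et al.'s (2006) indirect RR procedure; it is only meant to show why the type and sequence of corrections matter, and all input values are invented.

```python
import math

def correct_attenuation(r_xy, r_yy):
    """Disattenuate an observed validity for criterion unreliability."""
    return r_xy / math.sqrt(r_yy)

def correct_direct_rr(r, u):
    """Thorndike's Case II correction for direct range restriction on the
    predictor, where u = SD(applicants) / SD(incumbents)."""
    return (r * u) / math.sqrt(1 + r**2 * (u**2 - 1))

# Hypothetical values: observed validity, interrater reliability of the ratings
# criterion, and the ratio of applicant to incumbent predictor SDs.
r_obs, r_yy, u = .25, .52, 1.4

r_step1 = correct_attenuation(r_obs, r_yy)   # remove criterion measurement error
r_step2 = correct_direct_rr(r_step1, u)      # then undo direct range restriction
print(round(r_step1, 3), round(r_step2, 3))
```

Reversing the sequence, choosing a different reliability estimate, or substituting an indirect RR correction can produce noticeably different corrected validities, which is precisely the ambiguity item (a) highlights.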
Concluding Thoughts

The use of validated procedures is vital for selecting a productive workforce, as well as for ensuring that the decisions made on the basis of
those procedures are legally defensible. Validation work is challenging,
and researchers are faced with many critical choices that require informed
decisions. Scholarly research has focused on a multitude of issues that may
improve validity research, and this literature has grown rapidly in recent
years. This makes it difficult for researchers in general, and practitioners
in particular, who already are pressed with limited time, to keep up with
this large and often complex literature.
This article was written to help address this gap. We located over
100 studies published in the past decade whose results have direct relevance for validation research. We reviewed and summarized this research to identify key findings, recommendations, and areas for which
researchers should be cautious. We hope our review helps clarify issues
important to those involved in validation research. For example, we hope
our review of the growing and complex literature on validity coefficient
corrections helps researchers better understand the likely sources of measurement error and RR in the planning of validation studies, as well as
how to treat these artifacts when they cannot be controlled. Similarly,
we hope our review of the vast research base regarding the conceptualization and measurement of job performance highlights some of the key
implications this research has for the development and use of validation
criteria.
We conclude with one additional contribution. Table 1 provides a quick reference guide of source materials relevant to the five topics we reviewed. We identify articles that describe the key issues, outline relevant procedures and formulas, and include example measures. We also provide Web addresses for freely available computer programs that can aid in the analysis of validation data.

TABLE 1
Source Materials for Conducting Criterion-Related Validity Studies

I. Validity coefficient corrections
Schmidt and Hunter (1996). Describe various situations requiring measurement error corrections that have relevance for validation research.
DeShon (2002). Comprehensible overview of G-theory, including SPSS and SAS syntax for computing g-coefficients.
Materials from a conference workshop presented by R. A. McCloy and D. J. Putka on estimating interrater reliability in the real world: http://www.humrro.org/corpsite/siop_reliability.php.
Sackett and Yang (2000). Describe various scenarios in which RR may be present and suggested treatments.
Hunter et al. (2006). Step-by-step procedure for indirect RR corrections.
Sackett et al. (2002) and LeBreton et al. (2003). Discuss criterion RR and suggested treatments.

II. Evaluation of multiple predictors
Johnson and LeBreton (2004). Describe concept of relative importance and various types of importance indices.
LeBreton et al. (2007). Step-by-step procedure for evaluating predictor importance (see Table 3 of their article).
SPSS program for conducting a relative weight analysis: Contact Jeff Johnson (jeff.johnson@pdri.com).
SAS program for conducting dominance analysis: http://www.uwm.edu/azen.damacro.html.
Excel program for computing DA with six or fewer predictors: http://www2.psych.purdue.edu/%7Elebreton/

III. Differential prediction
Saad and Sackett (2002). Discuss differential validity resulting from predictors versus criteria.
Sackett et al. (2003). Describe omitted variables problem and potential solutions.
Computer programs by Aguinis and colleagues for estimating power, checking violations of assumptions, and computing predictor-subgroup effect sizes: http://carbon.cudenver.edu/haguinis/mmr.
Murphy and Myors (2004). Readable book on statistical power.

IV. Validation sample characteristics
Lievens et al. (2005). Framework for retesting effects and associated analyses.

V. Validation criteria
Coleman and Borman (2000), Lee and Allen (2002), and Van Scotter et al. (2000). Sample citizenship performance measures.
Bennett and Robinson (2000) and Gruys and Sackett (2003). Sample CWB measures.
Pulakos et al. (2002). Definitions and behavioral examples of adaptive performance dimensions.
Murphy and Shiarella (1997). Step-by-step procedure for computing validity coefficients for multivariate models.
Klehe, Anderson, and Viswesvaran (2007). Entire volume of Human Performance devoted to the maximum-typical performance distinction.
Ployhart et al. (2006). Chapter 4 covers a range of key issues regarding the conceptualization and measurement of performance.

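As a companion to the relative importance tools listed under Topic II of Table 1, the sketch below computes relative weights in the manner described by Johnson (2000), starting from a predictor intercorrelation matrix and a vector of predictor-criterion validities. The matrices are hypothetical, and the code is an illustrative sketch of the general approach rather than a substitute for the SPSS, SAS, and Excel programs cited in the table.

```python
import numpy as np

def relative_weights(Rxx, rxy):
    """Relative weights in the spirit of Johnson (2000): transform the predictors
    to orthogonal variables, regress the criterion on them, and allocate the
    resulting R-squared back to the original predictors."""
    eigvals, eigvecs = np.linalg.eigh(Rxx)
    # Lambda = Rxx^(1/2): correlations between original and orthogonalized predictors
    Lambda = eigvecs @ np.diag(np.sqrt(eigvals)) @ eigvecs.T
    beta_z = np.linalg.solve(Lambda, rxy)      # weights for the orthogonal variables
    eps = (Lambda**2) @ (beta_z**2)            # relative weight of each predictor
    return eps, eps.sum()                      # weights and total R-squared

# Hypothetical correlations among three predictors and with the criterion
Rxx = np.array([[1.00, 0.30, 0.20],
                [0.30, 1.00, 0.25],
                [0.20, 0.25, 1.00]])
rxy = np.array([0.40, 0.30, 0.25])

weights, r2 = relative_weights(Rxx, rxy)
print(np.round(weights, 3), round(float(r2), 3))
```

Each relative weight estimates the share of criterion variance attributable to a predictor after accounting for its correlations with the other predictors, and the weights sum to the model R-squared.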
This article provided a comprehensive treatment of several core validation issues. Nonetheless, there are several issues we did not address that researchers also should remember when validating selection procedures. First, although highly important, planning and carrying out criterion-related validity analyses represents only one aspect of selection research. There are a variety of other important issues researchers must consider, such as the identification of the performance domain for the target job and the KSAOs that impact performance, the initial choice of selection procedures, the development of the procedures (e.g., mode of administration, item formats, development of alternative forms), assessment of content- and construct-related validity, and the development of cutoff scores, just to name a few. Furthermore, although critical, criterion-related validity is one of several factors researchers should consider when selecting a final set of predictors for operational use. Cost, subgroup differences and adverse impact, likelihood of faking, administrator and applicant reactions, and time for administration and scoring also must be carefully weighed.
Second, there are various reasons why local validation studies may not
always be feasible, including lack of access to large samples, inability to
collect valid and reliable criterion measures, and lack of resources to conduct a comprehensive validity study. In such instances, researchers should
consider alternative validation strategies, such as validity transportation,
synthetic validation, and validity generalization (for recent reviews, see
McPhail, 2007, and Scherbaum, 2005). Further, Newman, Jacobs, and
Bartram (2007) recently discussed how the combined use of meta-analysis
and a local validation study (via Bayesian estimation) can lead to more
accurate estimates of criterion-related validity than can either method
alone.
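To convey the general logic behind the Newman et al. (2007) recommendation, the sketch below combines a meta-analytic validity estimate (treated as a normal prior) with a local study estimate using simple precision weighting. This is a generic normal-normal Bayesian update with hypothetical numbers, not Newman et al.'s actual procedure, so their article should be consulted for the recommended estimator.

```python
import math

def combine_meta_and_local(rho_meta, se_meta, r_local, n_local):
    """Precision-weighted (normal-normal) combination of a meta-analytic
    validity estimate and a local study validity estimate."""
    se_local = (1 - r_local**2) / math.sqrt(n_local - 1)   # approximate SE of r
    w_meta, w_local = 1 / se_meta**2, 1 / se_local**2      # precisions
    posterior_mean = (w_meta * rho_meta + w_local * r_local) / (w_meta + w_local)
    posterior_se = math.sqrt(1 / (w_meta + w_local))
    return posterior_mean, posterior_se

# Hypothetical inputs: meta-analytic mean validity and its uncertainty,
# plus a small local study.
print(combine_meta_and_local(rho_meta=.30, se_meta=.05, r_local=.18, n_local=120))
```

The posterior estimate is pulled toward whichever source is more precise, which is why a large meta-analytic database can stabilize a small local study without ignoring it.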
Finally, above all else, effective validation research requires sound
professional judgment (Guion, 1991). Being knowledgeable about the
validation literature certainly can contribute to sound judgment. However, researchers also must possess a thorough understanding of the target
job and the organizational context in which the job is performed. Moreover, every validation study presents unique challenges and opportunities.
Therefore, researchers should apply what they have learned from the literature in light of their given circumstances.
REFERENCES
Abelson MA. (1987). Examination of avoidable and unavoidable turnover. Journal of
Applied Psychology, 72, 382386.

Aguinis H, Boik RJ, Pierce CA. (2001). A generalized solution for approximating the
power to detect effects of categorical moderator variables using multiple regression.
Organizational Research Methods, 4, 291323.
Aguinis H, Peterson SA, Pierce CA. (1999). Appraisal of the homogeneity of error variance assumption and alternatives to multiple regression for estimating
moderating effects of categorical variables. Organizational Research Methods, 2,
315329.
Aguinis H, Pierce CA. (2006). Computation of effect size for moderating effects of categorical variables in multiple regression. Applied Psychological Measurement, 30,
440442.
Aguinis H, Smith MA. (2007). Understanding the impact of test validity and bias on
selection errors and adverse impact in human resource selection. PERSONNEL PSYCHOLOGY, 60, 165199.
Aguinis H, Stone-Romero EF. (1997). Methodological artifacts in moderated multiple
regression and their effects on statistical power. Journal of Applied Psychology, 82,
192206.
Allworth E, Hesketh B. (1999). Construct-oriented biodata: Capturing change-related and
contextually relevant future performance. International Journal of Selection and
Assessment, 7, 97111.
American Educational Research Association, American Psychological Association, National Council on Measurement in Education. (1999). Standards for educational
and psychological testing. Washington, DC: AERA.
Anderson N, Lievens F, van Dam K, Ryan AM. (2004). Future perspectives on employee
selection: Key directions for future research and practice. Applied Psychology: An
International Review, 53, 487501.
Arthur WJ, Bell ST, Villado AJ, Doverspike D. (2006). The use of person-organization
fit in employment decision making: An assessment of its criterion-related validity.
Journal of Applied Psychology, 91, 786801.
Arthur WJ, Villado AJ. (2008). The importance of distinguishing between constructs and
methods when comparing predictors in personnel selection research and practice.
Journal of Applied Psychology, 93, 435442.
Arvey RD, Strickland WJ, Drauden G, Martin C. (1990). Motivational components of test
taking. PERSONNEL PSYCHOLOGY, 43, 695716.
Barrett GV, Caldwell MS, Alexander RA. (1985). The concept of dynamic criteria: A
critical reanalysis. PERSONNEL PSYCHOLOGY, 38, 4156.
Barrett GV, Philips JS, Alexander RA. (1981). Concurrent and predictive validity designs:
A critical reanalysis. Journal of Applied Psychology, 66, 16.
Barrick MR, Stewart GL, Neubert MJ, Mount MK. (1998). Relating member ability and
personality to work-team processes and team effectiveness. Journal of Applied
Psychology, 83, 377391.
Barrick MR, Zimmerman RD. (2005). Reducing voluntary, avoidable turnover through
selection. Journal of Applied Psychology, 90, 159166.
Bartlett CJ, Bobko P, Mosier SB, Hannan R. (1978). Testing for fairness with a moderated
multiple regression strategy: An alternative to differential prediction. PERSONNEL PSYCHOLOGY, 31, 233241.
Bartram D. (2005). The great eight competencies: A criterion-centric approach to validation.
Journal of Applied Psychology, 90, 11851203.
Bennett RJ, Robinson SL. (2000). Development of a measure of workplace deviance.
Journal of Applied Psychology, 85, 349360.
Berry CM, Ones DS, Sackett PR. (2007). Interpersonal deviance, organizational deviance,
and their common correlates: A review and meta-analysis. Journal of Applied Psychology, 92, 410424.

Binning JF, Barrett GV. (1989). Validity of personnel decisions: A conceptual analysis of the inferential and evidential bases. Journal of Applied Psychology, 74,
478494.
Birkeland SA, Kisamore JL, Brannick MT, Smith MA. (2006). A meta-analytic investigation of job applicant faking on personality measures. International Journal of
Selection and Assessment, 14, 317335.
Bliese PD. (1998). Group size, ICC values, and group-level correlations: A simulation.
Organizational Research Methods, 1, 355373.
Bobko P, Roth PL, Potosky D. (1999). Derivation and implications of a meta-analysis
matrix incorporating cognitive ability, alternative predictors, and job performance.
PERSONNEL PSYCHOLOGY, 52, 561589.
Borman WC, Motowidlo SJ. (1993). Expanding the criterion domain to include elements
of contextual performance. In Schmitt N, Borman WC (Eds.), Personnel selection
in organizations (pp. 7198). San Francisco: Jossey-Bass.
Borman WC, Motowidlo SJ. (1997). Task performance and contextual performance: The
meaning for personnel selection research. Human Performance, 10, 99109.
Brief AP, Motowidlo SJ. (1986). Prosocial organizational behaviors. Academy of Management Review, 11, 710725.
Budescu DV. (1993). Dominance analysis: A new approach to the problem of relative
importance of predictors in multiple regression. Psychological Bulletin, 114, 542
551.
Budescu DV, Azen R. (2004). Beyond global measures of relative importance:
Some insights from dominance analysis. Organizational Research Methods, 7,
341350.
Burket GR. (1964). A study of reduced rank models for multiple prediction. Psychometrika
Monograph Supplement, 12.
Callender JC, Osburn HG. (1980). Development and test of a new model for validity
generalization. Journal of Applied Psychology, 65, 543558.
Cattin P. (1980). Estimating the predictive power of a regression model. Journal of Applied
Psychology, 65, 407414.
Cellar DF, Miller ML, Doverspike DD, Klawsky JD. (1996). Comparison of factor
structures and criterion-related validity coefficients for two measures of personality based on the five factor model. Journal of Applied Psychology, 81,
694704.
Chan D, Schmitt N. (2002). Situational judgment and job performance. Human Performance, 15, 233254.
Cleary TA. (1968). Test bias: Prediction of grades of negro and white students in integrated
colleges. Journal of Educational Measurement, 5, 115124.
Coleman VI, Borman WC. (2000). Investigating the underlying structure of the citizenship
performance domain. Human Resource Management Review, 10, 2544.
Conway JM. (1999). Distinguishing contextual performance from task performance for
managerial jobs. Journal of Applied Psychology, 84, 313.
Cronbach LJ. (1951). Coefficient alpha and the internal structure of tests. Psychometrika,
16, 297334.
Cronbach LJ, Gleser GC, Nanda H, Rajaratnam N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York:
Wiley.
Dalal RS. (2005). Meta-analysis of the relationship between organizational citizenship
behavior and counterproductive work behavior. Journal of Applied Psychology, 90,
12411255.

De Corte W. (1999). Weighting job performance predictors to both maximize the quality of
the selected workforce and control the level of adverse impact. Journal of Applied
Psychology, 84, 695702.
De Corte W, Lievens F. (2003). A practical procedure to estimate the quality and the adverse
impact of single-stage selection decisions. International Journal of Selection and
Assessment, 11, 8997.
De Corte W, Lievens F, Sackett PR. (2006). Predicting adverse impact and mean criterion performance in multistage selection. Journal of Applied Psychology, 91,
523537.
De Corte W, Lievens F, Sackett PR. (2007). Combining predictors to achieve optimal tradeoffs between selection quality and adverse impact. Journal of Applied Psychology,
92, 13801393.
Deadrick DL, Bennett N, Russell CJ. (1997). Using hierarchical linear modeling to examine
dynamic performance criteria over time. Journal of Management, 23, 745757.
DeShon RP. (2002). Generalizability theory. In Drasgow F, Schmitt N (Eds.), Measuring and analyzing behavior in organizations: Advances in measurement and data
analysis (pp. 189220). San Francisco: Jossey-Bass.
DeShon RP. (2003). A generalizability theory perspective on measurement error corrections
in validity generalization. In Murphy KR (Ed.), Validity generalization: A critical
review (pp. 365402). Mahwah, NJ: Erlbaum.
DuBois CLZ, Sackett PR, Zedeck S, Fogli L. (1993). Further exploration of typical and
maximum performance criteria: Definitional issues, prediction, and white-black
differences. Journal of Applied Psychology, 78, 205211.
Dudley NM, Orvis KA, Lebiecki JA, Cortina JM. (2006). A meta-analytic investigation of
conscientiousness in the prediction of job performance: Examining the intercorrelations and the incremental validity of narrow traits. Journal of Applied Psychology,
91, 4057.
Dunbar SB, Linn RL. (1991). Range restriction adjustments in the prediction of military
job performance. In Wigdor AK, Green BF Jr. (Eds.), Performance assessment for
the workplace (Vol. 2, pp. 127157). Washington, DC: National Academy Press.
Dunbar SB, Novick MR. (1988). On predicting success in training for men and women:
Examples from Marine Corps clerical specialties. Journal of Applied Psychology,
75, 545550.
Dunnette MD. (1963). A note on the criterion. Journal of Applied Psychology, 47, 251254.
Dunnette MD, McCartney J, Carlson HC, Kirchner WK. (1962). A study of faking behavior
on a forced-choice self-description checklist. PERSONNEL PSYCHOLOGY, 15, 13
24.
Ellingson JE, Sackett PR, Connelly BS. (2007). Personality assessment across selection
and development contexts: Insights into response distortion. Journal of Applied
Psychology, 92, 386395.
Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, and Department of Justice. (1978). Uniform guidelines on employee selection procedures. Federal Register, 43(166), 38295-38309.
Farrell JN, McDaniel MA. (2001). The stability of validity coefficients over time: Ackerman's (1988) model and the General Aptitude Test Battery. Journal of Applied
Psychology, 86, 6079.
Ferris GR, Witt LA, Hochwarter WA. (2001). Interaction of social skill and general mental
ability on job performance and salary. Journal of Applied Psychology, 86, 1075
1082.
Ghiselli EE. (1956). Dimensional problems of criteria. Journal of Applied Psychology, 40,
14.

Gruys ML, Sackett PR. (2003). Investigating the dimensionality of counterproductive work
behavior. International Journal of Selection and Assessment, 11, 3042.
Guion RM. (1961). Criterion measurement and personnel judgments. PERSONNEL PSYCHOLOGY, 14, 141149.
Guion RM. (1991). Personnel assessment, selection, and placement. In Dunnette MD,
Hough LM (Eds.), Handbook of industrial and organizational psychology (2nd ed.,
Vol. 2, pp. 327397). Palo Alto, CA: Consulting Psychologists Press.
Guion RM. (1998). Assessment, measurement, and prediction for personnel decisions.
Mahwah, NJ: Erlbaum.
Guion RM, Cranny CJ. (1982). A note on concurrent versus predictive validity designs: A
critical reanalysis. Journal of Applied Psychology, 67, 239244.
Harold CM, McFarland LA, Weekley JA. (2006). The validity of verifiable and nonverifiable biodata items: An examination across applicants and incumbents. International Journal of Selection and Assessment, 14, 336346.
Hattrup K, OConnell MS, Wingate PH. (1998). Prediction of multidimensional criteria:
Distinguishing task and contextual performance. Human Performance, 11, 305319.
Hattrup K, Rock J, Scalia C. (1997). The effects of varying conceptualizations of job
performance on adverse impact, minority hiring, and predicted performance. Journal
of Applied Psychology, 82, 656664.
Hausknecht JP, Halpert JA, Di Paolo NT, Moriarty Gerrard MO. (2007). Retesting in
selection: A meta-analysis of coaching and practice effects for tests of cognitive
ability. Journal of Applied Psychology, 92, 373385.
Hausknecht JP, Trevor CO, Farr JL. (2002). Retaking ability tests in a selection setting:
Implications for practice effects, training performance, and turnover. Journal of
Applied Psychology, 87, 243254.
Hoffman BJ, Blair CA, Meriac JP, Woehr DJ. (2007). Expanding the criterion domain?
A quantitative review of the OCB literature. Journal of Applied Psychology, 92,
555566.
Hogan J, Barrett P, Hogan R. (2007). Personality measurement, faking, and employment
selection. Journal of Applied Psychology, 92, 12701285.
Hogan J, Holland B. (2003). Using theory to evaluate personality and job-performance
relations: A socioanalytic perspective. Journal of Applied Psychology, 88, 100112.
Hough LM. (1992). The Big Five personality variables-construct confusion: Description
versus prediction. Human Performance, 5, 139155.
Hough LM. (1998). Personality at work: Issues and evidence. In Hakel M (Ed.), Beyond multiple choice: Evaluating alternatives to traditional testing for selection
(pp. 131166). Mahwah, NJ: Erlbaum.
Hough LM. (2001). I/Owes its advances to personality. In Roberts B, Hogan RT (Eds.),
Applied personality psychology: The intersection between personality and I/O psychology (pp. 1944). Washington, DC: American Psychological Association.
Houston WM, Novick MR. (1987). Race-based differential prediction in Air Force technical
training programs. Journal of Educational Measurement, 24, 309320.
Huffcutt AI, Conway JM, Roth PL, Klehe U-C. (2004). The impact of job complexity and
study design on situational and behavior description interview validity. International
Journal of Selection and Assessment, 12, 262273.
Huffcutt AI, Conway JM, Roth PL, Stone NJ. (2001). Identification and meta-analytic
assessment of psychological constructs measured in employment interviews. Journal
of Applied Psychology, 86, 897913.
Humphreys LG. (1960). Investigations of the simplex. Psychometrika, 25, 313323.
Hunter JE, Schmidt FL, Le H. (2006). Implications for direct and indirect range restriction
for meta-analysis methods and findings. Journal of Applied Psychology, 91, 594
612.

Hurtz GM, Donovan JJ. (2000). Personality and job performance: The Big Five revisited.
Journal of Applied Psychology, 85, 869879.
Huselid MA. (1995). The impact of human resource management practices on turnover, productivity, and corporate financial performance. Academy of Management Journal,
38, 635672.
Ilgen DR, Pulakos ED. (1999). Employee performance in today's organizations. In Ilgen DR, Pulakos ED (Eds.), The changing nature of work performance: Implications for staffing, motivation, and development (pp. 120). San Francisco:
Jossey-Bass.
Jackson DN, Harris WG, Ashton MC, McCarthy JM, Tremblay PF. (2000). How useful are work samples in validation studies. International Journal of Selection and
Assessment, 8, 2933.
James LR. (1980). The unmeasured variables problem in path analysis. Journal of Applied
Psychology, 65, 415421.
James LR, Demaree RG, Wolf G. (1984). Estimating within-group interrater reliability with
and without response bias. Journal of Applied Psychology, 69, 8598.
Johnson JW. (2000). A heuristic method for estimating the relative weights of predictor variables in multiple regression. Multivariate Behavioral Research, 35,
119.
Johnson JW. (2001). The relative importance of task and contextual performance dimensions to supervisor judgments of overall performance. Journal of Applied Psychology, 86, 984996.
Johnson JW. (2004). Factors affecting relative weights: The influence of sampling and
measurement error. Organizational Research Methods, 7, 283299.
Johnson JW. (2007, April). Distinguishing adaptive performance from task and citizenship
performance. In Oswald FL, Oberlander EM (Chairs), Adaptive skills and adaptive
performance: Todays organizational reality. Symposium conducted at the 22nd
Annual Conference of the Society for Industrial and Organizational Psychology,
New York.
Johnson JW, Carter GW, Davison HK, Oliver DH. (2001). A synthetic validity approach
to testing differential prediction hypotheses. Journal of Applied Psychology, 86,
774780.
Johnson JW, LeBreton JM. (2004). History and use of relative importance indices in
organizational research. Organizational Research Methods, 7, 238257.
Keil CT, Cortina JM. (2001). Degradation of validity over time: A test and extension of
Ackerman's model. Psychological Bulletin, 127, 673697.
Kirk AK, Brown DF. (2003). Latent constructs of proximal and distal motivation predicting
performance under maximum test conditions. Journal of Applied Psychology, 88,
4049.
Klehe U-C, Anderson N. (2007). Working hard and working smart: Motivation and ability
during typical and maximum performance. Journal of Applied Psychology, 92, 978
992.
Klehe U-C, Anderson N, Viswesvaran C. (2007). More than peaks and valleys: Introduction
to the special issue on typical and maximum performance. Human Performance, 20,
173178.
Klehe U-C, Latham GP. (2006). What would you do - really or ideally? Constructs underlying the behavioral description interview and situational interview for predicting
typical versus maximum performance. Human Performance, 19, 357381.
Ladd RT, Atchley EK, Gniatczyk LA, Baumann LB. (2002, April). An evaluation of
the construct validity of an assessment center using multiple-regression importance analysis. Paper presented at the 17th Annual Conference of the Society for Industrial
and Organizational Psychology, Toronto.
Landy FJ. (1986). Stamp collecting versus science: Validation as hypothesis testing. American Psychologist, 41, 11831192.
Larson SC. (1931). The shrinkage of the coefficient of multiple correlation. Journal of
Educational Psychology, 22, 4555.
Le H, Schmidt FL. (2006). Correcting for indirect range restriction in meta-analysis: Testing
a new meta-analytic procedure. Psychological Methods, 4, 416438.
LeBreton JM, Burgess JRD, Kaiser RB, Atchley EK, James LR. (2003). The restriction
of variance hypothesis and interrater reliability and agreement: Are ratings from
multiple sources really dissimilar? Organizational Research Methods, 6, 80128.
LeBreton JM, Hargis MB, Griepentrog B, Oswald FL, Ployhart RE. (2007). A multidimensional approach for evaluating variables in organizational research and practice.
PERSONNEL PSYCHOLOGY, 60, 475498.
LeBreton JM, Ployhart RE, Ladd RT. (2004). A Monte Carlo comparison of relative
importance methodologies. Organizational Research Methods, 7, 258282.
Lee K, Allen NJ. (2002). Organizational citizenship behavior and workplace deviance: The
role of affect and cognitions. Journal of Applied Psychology, 87, 131142.
LePine JA, Van Dyne L. (2001). Voice and cooperative behavior as contrasting forms
of contextual performance: Evidence of differential relationships with Big Five
personality characteristics and cognitive ability. Journal of Applied Psychology, 86,
326336.
Lievens F, Buyse T, Sackett PR. (2005). Retest effects in operational selection settings: Development and test of a framework. PERSONNEL PSYCHOLOGY, 58,
9811007.
Lievens F, Reeve CL, Heggestad ED. (2007). An examination of psychometric bias due to
retesting on cognitive ability tests in selection settings. Journal of Applied Psychology, 92, 16721682.
Linn RL, Harnisch DL, Dunbar SB. (1981). Corrections for range restriction: An empirical
investigation of conditions resulting in conservative corrections. Journal of Applied
Psychology, 66, 655663.
Marcus B, Goffin RD, Johnston NG, Rothstein MG. (2007). Personality and cognitive
ability as predictors of typical and maximum managerial performance. Human
Performance, 20, 275285.
McDaniel MA, Morgeson FP, Finnegan EB, Campion MA, Braverman EP. (2001). Predicting job performance from common sense. Journal of Applied Psychology, 86,
730740.
McGraw KO, Wong SP. (1996). Forming inferences about some intraclass correlation
coefficients. Psychological Methods, 1, 3046.
McKay PF, McDaniel MA. (2006). A reexamination of Black-White mean differences in
work performance: More data, more moderators. Journal of Applied Psychology,
91, 538554.
McManus MA, Kelly ML. (1999). Personality measures and biodata: Evidence regarding their incremental predictive value in the life insurance industry. PERSONNEL PSYCHOLOGY, 52, 137148.
McPhail SM. (2007). Alternative validation strategies: Developing new and leveraging
existing validity evidence. San Francisco: Wiley.
Messick SJ. (1998). Alternative models of assessment, uniform standards of validity. In
Hakel MD (Ed.), Beyond multiple choice: Evaluating alternatives to traditional
testing for selection (pp. 5974). Mahwah, NJ: Erlbaum.

Morgeson FP, Reider MH, Campion MA. (2005). Selecting individuals in team settings: The
importance of social skills, personality characteristics, and teamwork knowledge.
PERSONNEL PSYCHOLOGY, 58, 583611.
Motowidlo SJ. (2000). Some basic issues related to contextual performance and organizational citizenship behavior in human resource management. Human Resource
Management Review, 10, 115126.
Mount MK, Witt LA, Barrick MR. (2000). Incremental validity of empirically keyed
biodata scales over GMA and the five-factor personality constructs. PERSONNEL PSYCHOLOGY, 53, 299323.
Murphy KR. (1989). Is the relationship between cognitive ability and job performance
stable over time? Human Performance, 2, 183200.
Murphy KR, Cleveland JN, Skattebo AL, Kinney TB. (2004). Raters who pursue different
goals give different ratings. Journal of Applied Psychology, 89, 158164.
Murphy KR, DeShon RP. (2000). Interrater correlations do not estimate the reliability of
job performance ratings. PERSONNEL PSYCHOLOGY, 53, 873900.
Murphy KR, Myors B. (2004). Statistical power analysis: A simple and general
model for traditional and modern hypothesis tests (2nd ed.). Mahwah, NJ:
Erlbaum.
Murphy KR, Shiarella AH. (1997). Implications of the multidimensional nature of job
performance for the validity of selection tests: Multivariate frameworks for studying
test validity. PERSONNEL PSYCHOLOGY, 50, 823854.
Nagle BF. (1953). Criterion development. PERSONNEL PSYCHOLOGY, 6, 271289.
Newman DA, Jacobs RR, Bartram D. (2007). Choosing the best method for local validity
estimation: Relative accuracy of meta-analysis versus a local study versus Bayes-analysis. Journal of Applied Psychology, 92, 13941413.
Ones DS, Viswesvaran C. (2003). Job-specific applicant pools and national norms for personality scales: Implications for range-restriction corrections in validation research.
Journal of Applied Psychology, 88, 570577.
Organ DW. (1988). Organizational citizenship behavior: The good soldier syndrome.
Lexington, MA: Lexington.
Organ DW. (1997). Organizational citizenship behavior: It's construct clean-up time.
Human Performance, 10, 8597.
Oswald FL, Saad S, Sackett PR. (2000). The homogeneity assumption in differential
prediction analysis: Does it really matter? Journal of Applied Psychology, 85, 536
541.
Ployhart RE. (2006). Staffing in the 21st century: New challenges and strategic opportunities. Journal of Management, 32, 868897.
Ployhart RE, Hakel MD. (1998). The substantive nature of performance variability: Predicting interindividual differences in intraindividual performance. PERSONNEL PSYCHOLOGY, 51, 859901.
Ployhart RE, Lim B-C, Chan K-Y. (2001). Exploring relations between typical and maximum performance ratings and the five-factor model of personality. PERSONNEL PSYCHOLOGY, 54, 809843.
Ployhart RE, Schneider B, Schmitt N. (2006). Staffing organizations: Contemporary practice and theory (3rd ed.). Mahwah, NJ: Erlbaum.
Pulakos ED, Arad S, Donovan MA, Plamondon KE. (2000). Adaptability in the workplace: Development of a taxonomy of adaptive performance. Journal of Applied
Psychology, 85, 612624.
Pulakos ED, Schmitt N, Dorsey DW, Arad S, Hedge JW, Borman WC. (2002). Predicting
adaptive performance: Further tests of a model of adaptability. Human Performance,
15, 299323.

Raju NS, Bilgic R, Edwards JE, Fleer PF. (1997). Methodology review: Estimation of
population validity and cross-validity, and the use of equal weights in prediction.
Applied Psychological Measurement, 21, 291305.
Raju NS, Bilgic R, Edwards JE, Fleer PF. (1999). Accuracy of population validity and
cross-validity estimation: An empirical comparison of formula-based, traditional
empirical, and equal weights procedures. Applied Psychological Measurement, 23,
99115.
Raymond MR, Neustel S, Anderson D. (2007). Retest effects and parallel forms in certification and licensure testing. P ERSONNEL P SYCHOLOGY, 60, 367396.
Robie C, Zickar MJ, Schmit MJ. (2001). Measurement equivalence between applicant
and incumbent groups: An IRT analysis of personality. Human Performance, 14,
187207.
Robinson SL, Bennett RJ. (1995). A typology of deviant workplace behaviors: A multidimensional scaling study. Academy of Management Journal, 38, 555572.
Roth PL, Bobko P, McFarland LA. (2005). A meta-analysis of work sample test validity:
Updating and integrating some classic literature. PERSONNEL PSYCHOLOGY, 58,
10091037.
Roth PL, Huffcutt AI, Bobko P. (2003). Ethnic group differences in measures of job
performance: A new meta-analysis. Journal of Applied Psychology, 88, 694706.
Rotundo M, Sackett PR. (1999). Effect of rater race on conclusions regarding differential
prediction in cognitive ability tests. Journal of Applied Psychology, 84, 815822.
Rotundo M, Sackett PR. (2002). The relative importance of task, citizenship, and counterproductive performance to global ratings of job performance: A policy-capturing
approach. Journal of Applied Psychology, 87, 6680.
Saad S, Sackett PR. (2002). Investigating differential prediction by gender in employmentrelated personality measures. Journal of Applied Psychology, 87, 667674.
Sackett PR, Berry CM, Wiemann SA, Laczo RM. (2006). Citizenship and counterproductive
behavior: Clarifying relations between the two domains. Human Performance, 19,
441464.
Sackett PR, Laczo RM, Arvey RD. (2002). The effects of range restriction on estimates
of criterion interrater reliability: Implications for validation research. PERSONNEL PSYCHOLOGY, 55, 807825.
Sackett PR, Laczo RM, Lippe ZP. (2003). Differential prediction and the use of multiple
predictors: The omitted variables problem. Journal of Applied Psychology, 88, 1046
1056.
Sackett PR, Lievens F, Berry CM, Landers RN. (2007). A cautionary note on the effects of
range restriction on predictor intercorrelations. Journal of Applied Psychology, 92,
538544.
Sackett PR, Ostgaard DJ. (1994). Job-specific applicant pools and national norms for
cognitive ability tests: Implications for range restriction corrections in validation
research. Journal of Applied Psychology, 79, 680684.
Sackett PR, Yang H. (2000). Correction for range restriction: An expanded typology.
Journal of Applied Psychology, 85, 112118.
Sackett PR, Zedeck S, Fogli L. (1988). Relations between measures of typical and maximum
job performance. Journal of Applied Psychology, 73, 482486.
Salgado JF, Ones DS, Viswesvaran C. (2001). Predictors used for personnel selection: An
overview of constructs, methods and techniques. In Anderson N, Ones DS, Sinangil
HK, Viswesvaran C (Eds.), Handbook of industrial, work, and organizational psychology (Vol. 1, pp. 165199). Thousand Oaks, CA: Sage.
Scherbaum CA. (2005). Synthetic validity: Past, present, and future. PERSONNEL PSYCHOLOGY, 58, 481515.

Schmidt FL. (1971). The relative efficiency of regression and simple unit predictor weights
in applied differential psychology. Educational and Psychological Measurement,
31, 699714.
Schmidt FL, Hunter JE. (1996). Measurement error in psychological research: Lessons
from 26 research scenarios. Psychological Methods, 1, 199223.
Schmidt FL, Hunter JE. (1998). The validity and utility of selection methods in personnel
selection: Practical and theoretical implications of 85 years of research findings.
Psychological Bulletin, 124, 262274.
Schmidt FL, Hunter JE, Urry VW. (1976). Statistical power in criterion-related validation
studies. Journal of Applied Psychology, 61, 473485.
Schmidt FL, Kaplan LB. (1971). Composite versus multiple criteria: A review and resolution of the controversy. PERSONNEL PSYCHOLOGY, 24, 419434.
Schmidt FL, Le H, Ilies R. (2003). Beyond alpha: An empirical examination of the effects
of different sources of measurement error on reliability estimates for measures of
individual-differences constructs. Psychological Methods, 8, 206224.
Schmidt FL, Oh I-S, Le H. (2006). Increasing the accuracy of corrections for range restriction: Implications for selection procedure validities and other research results.
PERSONNEL PSYCHOLOGY, 59, 281305.
Schmidt FL, Pearlman K, Hunter JE. (1981). The validity and fairness of employment
and educational tests for Hispanic Americans: A review and analysis. PERSONNEL PSYCHOLOGY, 33, 705724.
Schmidt FL, Viswesvaran C, Ones DS. (2000). Reliability is not validity and validity is not
reliability. PERSONNEL PSYCHOLOGY, 53, 901912.
Schmit MJ, Ryan AM. (1993). The Big Five in personnel selection: Factor structure in
applicant and nonapplicant populations. Journal of Applied Psychology, 78, 966
974.
Schmitt N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8,
350353.
Schmitt N. (2007). The value of personnel selection: Reflections on some remarkable
claims. Academy of Management Perspectives, 21, 1923.
Schmitt N, Chan D. (2006). Situational judgment tests: Method or construct. In Weekley JA, Ployhart RE (Eds.), Situational judgment tests: Theory, measurement, and
application (pp. 135154). Mahwah, NJ: Erlbaum.
Schmitt N, Landy FJ. (1993). The concept of validity. In Schmitt N, Borman W
(Eds.), Personnel selection in organizations (pp. 275309). San Francisco:
Jossey-Bass.
Schmitt N, Ployhart RE. (1999). Estimates of cross-validity for stepwise regression and
with predictor selection. Journal of Applied Psychology, 84, 5057.
Schmitt N, Rogers W, Chan D, Sheppard L, Jennings D. (1997). Adverse impact and predictor efficiency of various predictor combinations. Journal of Applied Psychology,
82, 719730.
Sharf JC, Jones DP. (2000). Employment risk management. In Kehoe JF (Ed.), Managing
selection in changing organizations (pp. 271318). San Francisco: Jossey-Bass.
Smith CA, Organ DW, Near JP. (1983). Organizational citizenship behavior: Its nature and
antecedents. Journal of Applied Psychology, 68, 653663.
Smith DB, Ellingson JE. (2002). Substance versus style: A new look at social desirability
in motivating contexts. Journal of Applied Psychology, 87, 211219.
Smith DB, Hanges PJ, Dickson MW. (2001). Personnel selection and the five-factor model:
Reexamining the effects of applicants frame of reference. Journal of Applied Psychology, 86, 304315.

Society for Industrial and Organizational Psychology, Inc. (2003). Principles for the validation
and use of personnel selection procedures (4th ed.). Bowling Green, OH: Author.
Stewart GL. (1999). Trait bandwidth and stages of job performance. Journal of Applied
Psychology, 84, 959968.
Sturman MC, Cheramie RA, Cashen LH. (2005). The impact of job complexity and performance measurement on the temporal consistency, stability, and test-retest reliability
of employee job performance ratings. Journal of Applied Psychology, 90, 269283.
Thoresen CJ, Bradley JC, Bliese PD, Thoresen JD. (2004). The big five personality traits
and individual job performance growth trajectories in maintenance and transitional
job stages. Journal of Applied Psychology, 89, 835853.
Thorndike RL. (1949). Personnel selection. New York: Wiley.
Van Scotter JR, Motowidlo SJ. (1996). Interpersonal facilitation and job dedication as
separate facets of contextual performance. Journal of Applied Psychology, 81, 525
531.
Van Scotter JR, Motowidlo SJ, Cross TC. (2000). Effects of task performance and contextual
performance on systemic rewards. Journal of Applied Psychology, 85, 526535.
Viswesvaran C, Ones DS, Schmidt FL. (1996). Comparative analysis of the reliability of
job performance ratings. Journal of Applied Psychology, 81, 557574.
Viswesvaran C, Schmidt FL, Ones DS. (2002). The moderating influence of job performance
dimensions on convergence of supervisory and peer ratings of job performance:
Unconfounding construct-level convergence and rating difficulty. Journal of Applied
Psychology, 87, 345354.
Viswesvaran C, Schmidt FL, Ones DS. (2005). Is there a general factor in ratings of job
performance? A meta-analysis framework for disentangling substantive and error
influences. Journal of Applied Psychology, 90, 108131.
Wallace SR. (1965). Criteria for what? American Psychologist, 20, 411417.
Weekley JA, Ployhart RE, Harold CM. (2004). Personality and situational judgment tests
across applicant and incumbent settings: An examination of validity, measurement,
and subgroup differences. Human Performance, 17, 433461.
Wonderlic Personnel Test, Inc. (1998). Comprehensive personality profile. Libertyville, IL:
Author.
Wong KFE, Kwong JYY. (2007). Effects of rater goals and rating patterns: Evidence from
an experimental field study. Journal of Applied Psychology, 92, 577585.
Wright PM, Boswell WR. (2002). Desegregating HRM: A review and synthesis of micro
and macro human resources management research. Journal of Management, 28,
247276.
Yang H, Sackett PR, Nho Y. (2004). Developing a procedure to correct for range restriction
that involves both institutional selection and applicants rejection of job offers.
Organizational Research Methods, 7, 442455.
