We thank the following scientist-practitioners for their insightful comments and suggestions on previous drafts of this article: Mike Campion, Huy Le, Dan Putka, Phil Roth,
and Neil Schmitt.
Correspondence and requests for reprints should be addressed to Chad H. Van Iddekinge,
Florida State University, College of Business, Department of Management, Tallahassee,
FL, 32306-1110; cvanidde@fsu.edu.
1. Validity in a selection context does not refer to the validity of selection procedures
themselves but rather to the validity of the inferences we draw on the basis of scores from
selection procedures (American Educational Research Association [AERA], American
Psychological Association, National Council on Measurement in Education, 1999; Binning
& Barrett, 1989; Messick, 1998; Society for Industrial and Organizational Psychology
[SIOP], 2003). However, to be concise, we often use phrases such as valid selection
procedures.
information contained in the professional guidelines and in recent reviews of the selection literature. To accomplish this goal, we review and
critique articles published within the past decade on issues pertinent to
criterion-related validation research. Given the central role of criteria in
the validation process, we also review new findings in this area that have
direct relevance for validation research. We critically review and highlight
key findings, limitations, and gaps and discrepancies in the literature. We
also offer conclusions and provide recommendations for researchers involved in selection procedure validation. Finally, we conclude by noting
some important but neglected validation issues that future research should
address.
context; thus, selection researchers may not be aware of this work or its
implications for validation.
Validity Coefficient Corrections
Researchers are typically interested in estimating the relationship between scores on one or more selection procedures and one or more criteria
in some population (e.g., individuals in the relevant applicant pool) on the
basis of the relationship observed within a validation sample (e.g., a group
of job incumbents). It is well known, however, that sample correlations
can deviate from population correlations due to various statistical artifacts, and these artifacts often attenuate observed validity coefficients relative to their population values. Recent studies have focused on two prominent artifacts: measurement error (i.e., unreliability) and range restriction (RR). Researchers
have attempted to delineate the influence these artifacts can have on validity, as well as the most appropriate ways to correct these artifacts when
estimating criterion-related validity.
Corrections for measurement error. Researchers often correct for
unreliability in criterion measures to estimate the operational validity
of selection procedures. This fairly straightforward correction procedure
involves dividing the observed validity coefficient by the square root of
the estimated reliability of the criterion measure. Corrections for predictor
unreliability are made less often because researchers tend to be more
interested in the validity of selection procedures in their current form than
in their potential validity if the predictors measured the target constructs
with perfect reliability.
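This correction can be expressed in a couple of lines; a minimal sketch follows, with illustrative numbers only (the .52 value echoes the single-rater reliability estimate from Viswesvaran et al., 1996, cited later in this section).

```python
from math import sqrt

def correct_for_criterion_unreliability(r_xy, r_yy):
    """Disattenuate an observed validity coefficient for measurement error
    in the criterion: divide by the square root of criterion reliability."""
    return r_xy / sqrt(r_yy)

# Illustrative values: observed validity of .25; single-rater interrater
# reliability of .52 for supervisory performance ratings.
print(correct_for_criterion_unreliability(0.25, 0.52))  # about .35
```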
Although many experts believe that validity coefficients should be
corrected for measurement error (in the criterion), there is disagreement
about the most appropriate way to estimate the reliability of ratings criteria. This is a concern because performance ratings remain the most
common validation criterion (Viswesvaran, Schmidt, & Ones, 2002), and
the reliability estimate one uses to correct for attenuation can affect the
validity of inferences drawn from validation results (Murphy & DeShon,
2000).
When only one rater (e.g., an immediate supervisor) is available to
evaluate the job performance of each validation study participant, researchers often compute an internal consistency coefficient (e.g., Cronbach's alpha) to estimate reliability. Such coefficients provide estimates of
intrarater reliability and indicate the consistency of ratings across different performance dimensions (Cronbach, 1951). The problem with using
internal consistency coefficients for measurement error corrections is that
they assign specific error (i.e., the rater by ratee interaction effect, which
3. We note that most of these studies used single-rater reliability estimates (e.g., .52 from
Viswesvaran et al., 1996) to correct the meta-analytically derived validity coefficients.
However, the performance measures from some of the primary studies undoubtedly were
based on ratings from more than one rater per ratee. To the extent that this was the case, use
of single-rater reliabilities will overcorrect the observed validities and thus overestimate
true validity.
Murphy and DeShon (2000) also identified several potential systematic sources of rater disagreement (e.g., rater position level) that are treated
as random error when computing interrater correlations but that may have
different effects on reliability than does disagreement due to nonsystematic
sources. Recent studies using undergraduate raters (e.g., Murphy, Cleveland, Skattebo, & Kinney, 2004; Wong & Kwong, 2007) have provided
some initial evidence that raters' goals (e.g., motivate ratees vs. identify
their strengths and weaknesses) can indeed influence their evaluations
(although neither study examined the effects of goal incongruence on interrater reliability). DeShon (2003) concluded that if rater disagreements
reflect not only random response patterns but also systematic sources,
then conventional validity coefficient corrections do not correct for measurement error but rather for a lack of understanding about what factors
influence the ratings.
In a rejoinder, Schmidt, Viswesvaran, and Ones (2000) argued that
interrater correlations are appropriate for correcting for attenuation. For
example, the researchers maintained that raters can be considered alternative forms of the same measure, and therefore, the correlation between
these forms represents an appropriate estimate of reliability. Schmidt and
colleagues suggested that the fact that different raters observe different
behaviors at different times actually is an advantage of using interrater
correlations because this helps to control for transient error, which, for example, reflects variations in ratee performance over time due to changes in
mood, mental state, and so forth (see Schmidt, Le, & Ilies, 2003). Schmidt
and colleagues also rebutted the claim that classical measurement methods (e.g., Pearson correlations) model random error only. In fact, they
contended that classical reliability coefficients are the only ones that can
estimate all the main potential sources of measurement error relevant to
job performance ratings, including rater leniency effects, halo effects (i.e.,
rater by ratee interactions), transient error, and random response error (for
an illustration, see Viswesvaran, Schmidt, & Ones, 2005).
Furthermore, some of the concerns Murphy and DeShon (2000) raised
may be less relevant within a validation context. For example, the presence of similar or divergent rater goals (e.g., relationship building vs.
performance motivation) and biases (e.g., leniency) may be less likely
when confidential, research-based performance ratings can be collected
for validation purposes.
An alternative approach to conceptualizing and estimating measurement error, which has been used in the general psychometrics literature
for decades but has only recently made its way into the validation literature,
is generalizability (G) theory (Cronbach, Gleser, Nanda, & Rajaratnam,
1972). Conventional corrections for measurement error are based on classical test theory, which conceptualizes error as any factor that makes an
observed score differ from a true score. From this perspective, error is undifferentiated and is considered to be random. In G-theory, measurement
error is thought to comprise a multitude of systematic, unmeasured, and
even interacting sources of error (DeShon, 2002, 2003). Using analysis
of variance (ANOVA), G-theory allows researchers to partition the variance associated with sources of error and, in turn, estimate their relative
contribution to the overall amount of error present in a set of scores.
Potential sources of error (or facets in G-theory terminology) for
job performance ratings collected for a validation study might include the
raters, the type of rater (i.e., supervisors vs. peers), and the performance
dimensions being rated. Using G-theory, the validation researcher could
compute a generalizability coefficient that indexes the combined effects of
these error sources. The researcher could also compute separate variances,
and the corresponding generalizability coefficients that account for them,
for each error source to determine the extent to which each source (e.g.,
raters vs. dimensions) contributes to overall measurement error.
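As a concrete illustration of the variance partitioning just described, the following sketch estimates variance components and a generalizability coefficient for the simplest fully crossed, one-facet design (ratees by raters). It is a toy example under assumed data: a real G-study would typically include additional facets (rater type, performance dimensions) and might rely on dedicated variance-component software (e.g., the SPSS and SAS syntax in DeShon, 2002).

```python
import numpy as np

def one_facet_g_study(X, n_raters_decision=None):
    """Ratee x rater G-study for a fully crossed one-facet design.
    X: ratings matrix with ratees in rows and raters in columns."""
    n_p, n_r = X.shape
    grand = X.mean()
    ms_p = n_r * ((X.mean(axis=1) - grand) ** 2).sum() / (n_p - 1)   # ratee mean square
    ms_r = n_p * ((X.mean(axis=0) - grand) ** 2).sum() / (n_r - 1)   # rater mean square
    resid = X - X.mean(axis=1, keepdims=True) - X.mean(axis=0, keepdims=True) + grand
    ms_res = (resid ** 2).sum() / ((n_p - 1) * (n_r - 1))            # residual mean square
    var_p = max((ms_p - ms_res) / n_r, 0.0)    # ratee (universe-score) variance
    var_r = max((ms_r - ms_res) / n_p, 0.0)    # rater main effect (e.g., leniency)
    var_pr = ms_res                            # ratee x rater interaction plus residual error
    k = n_raters_decision or n_r
    g_relative = var_p / (var_p + var_pr / k)  # relative generalizability coefficient
    return {"ratee": var_p, "rater": var_r, "interaction/error": var_pr, "g": g_relative}

# Illustrative data: 5 ratees evaluated by 3 raters on a single dimension.
ratings = np.array([[4, 5, 4], [3, 3, 2], [5, 5, 5], [2, 3, 3], [4, 4, 3]], dtype=float)
print(one_facet_g_study(ratings))
print(one_facet_g_study(ratings, n_raters_decision=5))  # projected G with 5 raters
```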
G-theory has the potential to be a very useful tool for validation researchers, and we encourage more extensive use of this technique in validation research. For example, in addition to more precisely determining
the primary source(s) of error in an existing set of job performance ratings,
G-theory can be useful for planning future validation efforts. Specifically,
G-theory encourages researchers to consider the facets that might contribute to error in ratings and then design their validation studies in a
way that allows them to estimate the relative contribution of each facet.
G-theory can also help determine how altering the number of raters, performance dimensions, and so on will affect the generalizability of ratings
collected in the future.
At the same time, there are some potentially important issues to consider when using a G-theory perspective to estimate measurement error
in validation research. First, inferences from G-theory focus on the generalizability of scores rather than on reliability per se. That is, G-theory
estimates the variance associated with whatever facets are captured in the
measurement design (e.g., raters, items, time periods) used to obtain scores
on which decisions are made. If, for example, the variance associated with
the rater facet is large, then the ratings obtained from different raters are
not interchangeable, and thus decisions made on the basis of those ratings
might not generalize to decisions that would be made if a different set of
raters was used (DeShon, 2003). Likewise, a generalizability coefficient
that considers the combined effects of all measurement facets indicates the
level of generalizability of scores given that particular set of raters, items,
and so on. Further, as with interrater correlations computed on the basis
of classical test theory, unless the assumption of parallel raters is satisfied,
generalizability coefficients cannot be considered reliability coefficients
(Murphy & DeShon, 2000). Thus, G-theory may not resolve all concerns
that have been raised about measurement error corrections in performance
ratings.
Finally, to capitalize on the information G-theory provides, researchers
must incorporate relevant measurement facets into their validation study
designs. For instance, to estimate the relative contributions of raters and performance dimensions to measurement error, validation participants must
be rated on multiple performance dimensions by multiple raters. In other
words, G-theory is only as useful as the quality and comprehensiveness
of the design used to collect the data.
Corrections for RR. Another statistical artifact relevant to validation
research is RR. RR occurs when there is less variance on the predictor,
criterion, or both in the validation sample relative to the amount of variation on these measures in the relevant population. The restricted range of
scores results in a criterion validity estimate that is downwardly biased.
Numerous studies conducted during the past decade have addressed the
issue of RR in validation research. Sackett and Yang (2000) identified three
main factors that can affect the nature and degree of RR: (a) the variable(s)
on which selection occurs (predictor, criterion, or a third variable), (b)
whether the unrestricted variances for the relevant variables are known,
and (c) whether the third variable, if involved, is measured or unmeasured
(e.g., unquantified judgments made on the basis of interviews or letters of
recommendation). The various combinations of these factors resulted in
11 plausible scenarios in which RR may occur.
Yang, Sackett, and Nho (2004) updated the correction procedure for
situations when selection decisions are made on the basis of unmeasured
or partially measured predictors (i.e., scenario 2d in Sackett & Yang, 2000)
to account for the additional influencing factor of applicants' rejection of
job offers. However, modeling the effects of self-selection requires data
concerning plausible reasons why applicants may turn down a job offer,
such as applicant employability as judged by interviewers. Therefore, the
usefulness of this procedure depends on whether reasons for applicant
self-selection can be identified and effectively measured.
A key distinction in conceptualizing RR is that between direct and
indirect restriction. In a selection context, direct RR occurs when individuals were screened on the same procedure that is being validated. This
can occur, for example, when a structured interview is being validated on
a sample of job incumbents who initially were selected solely on the basis
of the interview. In contrast, indirect RR occurs when the procedure being
validated is correlated with one or more of the procedures currently used
for selection. For instance, the same set of incumbents from the above
example also may be given a biodata inventory as part of the validation
study. If biodata scores are correlated with performance in the interview
Results of a Monte Carlo simulation revealed that the underestimation of interrater reliability was quite small under the first scenario (i.e.,
because predictor-criterion correlations in validation research tend to be
rather modest), whereas there often was substantial underestimation under
the latter two scenarios. Interrater reliability was underestimated the most
when there was direct RR on the performance ratings (i.e., scenario C)
and when the range of performance ratings was most restricted (i.e., when
the selection/retention ratio was low). In terms of the effects of criterion
RR on validity coefficient corrections, restriction due to truncation on the
predictor (i.e., scenario A) did not have a large influence on the corrected
validities. Overestimation of validity was more likely under the other two
scenarios given the smaller interrater reliability estimates that resulted
from the RR, which, in turn, were used to correct the observed validity coefficients. Nonetheless, when direct RR exists on the performance ratings
(i.e., scenario C), researchers will likely have the data (e.g., performance
ratings on both retained and terminated employees) to correct the reliability estimate for restriction prior to using the estimate to correct the validity
coefficients for attenuation.
LeBreton and colleagues (LeBreton, Burgess, Kaiser, Atchley, &
James, 2003) investigated the extent to which the modest interrater reliability estimates often found for job performance ratings are due to
true discrepancies between raters (e.g., in terms of opportunities to observe certain job behaviors) or to restricted between-target variance due
to a reduced amount of variability in employee job performance resulting
from various human resources systems (e.g., selection). The researchers
noted that whether low interrater estimates are due to rating discrepancies or to lack of variance cannot be determined using correlation-based approaches to estimating interrater reliability alone. Thus, they
also examined use of r_wg (James, Demaree, & Wolf, 1984) to estimate
interrater agreement, a statistic unaffected by between-target variance
restrictions.
LeBreton and colleagues conducted a Monte Carlo simulation to
demonstrate the relationship between between-target variance and estimates of interrater reliability. The simulation showed that Pearson correlations decreased from .83 with no between-target variance restriction to .36
with severe variance restriction. The researchers then examined actual data
from several multirater feedback studies. Results revealed that estimates
of interrater reliability (i.e., Pearson correlations and ICCs) consistently
were low (e.g., mean single-rater estimates of .30) in the presence of low to
modest between-target variance. In contrast, estimates of interrater agreement (i.e., r_wg coefficients) were moderate to high (e.g., mean = .71 based
on a slightly skewed distribution). Because the agreement coefficients
were relatively high, LeBreton et al. concluded that the results provided
support for the restriction of variance hypothesis rather than for the rater
discrepancy hypothesis. We note, however, that reduction in between-target variance will always decrease interrater reliability estimates given
the underlying equations and that agreement estimates such as r_wg will
never be affected by between-target variance because of how they are
computed.
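A small, hypothetical simulation (loosely patterned after the LeBreton et al. design) makes this asymmetry visible: shrinking between-target variance pulls the interrater correlation down while leaving r_wg, which depends only on within-target variance, essentially unchanged. All values below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = 5                             # 5-point rating scale
sigma2_eu = (A ** 2 - 1) / 12     # variance expected under purely random responding

def rwg(ratings_for_one_target):
    """Single-item r_wg (James et al., 1984): 1 minus observed within-target
    variance relative to the random-response (uniform) variance."""
    return 1 - ratings_for_one_target.var(ddof=1) / sigma2_eu

def simulate(true_sd, n_targets=500, rater_error_sd=0.5):
    true = rng.normal(3, true_sd, n_targets)                    # between-target variance
    r1 = np.clip(np.round(true + rng.normal(0, rater_error_sd, n_targets)), 1, A)
    r2 = np.clip(np.round(true + rng.normal(0, rater_error_sd, n_targets)), 1, A)
    interrater_r = np.corrcoef(r1, r2)[0, 1]
    mean_rwg = np.mean([rwg(np.array(pair)) for pair in zip(r1, r2)])
    return interrater_r, mean_rwg

print(simulate(true_sd=1.0))   # ample between-target variance: both indices are high
print(simulate(true_sd=0.2))   # restricted variance: interrater r drops, r_wg stays high
```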
Finally, corrected validity coefficients typically provide closer approximations of the population validity than do uncorrected coefficients. However, there are a few assumptions that, when violated, can lead to biased
estimates of corrected validity. For instance, the Hunter et al. (2006)
indirect RR procedure assumes that the predictor of interest captures
all the constructs that determine the criterion-related validity of whatever process or measures were used to make selection decisions. This
assumption may be violated in some validation contexts. For example,
an organization originally may have used a combination of assessments
(e.g., a cognitive ability test, a semistructured interview, and recommendation letters) to select a group of job incumbents. If the organization
later wants to estimate the criterion-related validity of the cognitive test,
the range of scores on that test will be indirectly restricted by the original selection battery, and thus, the effect of the original battery on the
criterion (e.g., job performance) cannot be fully accounted for by the
predictor of interest. Le and Schmidt (2006) found that although this
violation results in an undercorrection for RR, use of their procedure
still provides less biased validity estimates than does the conventional
correction procedure based on the direct RR model (i.e., Thorndike's
case 2).
Another assumption of the Hunter et al. (2006) procedure is that there
is no indirect RR on the criteria. However, if a restricted range of criterion
values is due to something other than RR on the predictor (e.g., restriction
resulting from a third variable, such as probationary period decisions;
Sackett et al., 2002), then their procedure, along with all other correction
procedures, will undercorrect the observed validity coefficient. Last, the
Thorndike correction procedures (which one might use when predictor
RR truly is direct) require two basic assumptions: (a) There is a linear
relationship between the predictor and criterion, and (b) the conditional
variance of scores on the criterion does not depend on the value of the
predictor (i.e., there is homoscedasticity). If either of these assumptions is
violated, then corrected validities can be underestimated (for a review, see
Dunbar & Linn, 1991).
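For reference, the conventional direct-RR correction mentioned above (Thorndike's Case 2) can be written as a short function; the inputs are illustrative, and the formula presumes the linearity and homoscedasticity assumptions just noted.

```python
from math import sqrt

def thorndike_case2(r_restricted, sd_restricted, sd_unrestricted):
    """Correct an observed validity for direct range restriction on the predictor.
    u is the ratio of unrestricted to restricted predictor standard deviations."""
    u = sd_unrestricted / sd_restricted
    return (r_restricted * u) / sqrt(1 - r_restricted**2 + (r_restricted**2) * (u**2))

# Illustrative values: observed r of .20 in a sample whose predictor SD is 70%
# of the applicant-pool SD.
print(thorndike_case2(0.20, sd_restricted=0.7, sd_unrestricted=1.0))  # about .28
```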
Conclusions and recommendations. There is convincing evidence
that statistical artifacts such as measurement error and RR can downwardly
bias observed relationships between predictors and criteria, and, in
Organizations frequently use multiple predictors to assess job applicants. This is because job analysis results often reveal a large number of job-relevant knowledge, skills, abilities, and other characteristics
(KSAOs), which may be difficult or impossible to capture with a single
assessment. In addition, use of multiple predictors can increase the validity of the overall selection system beyond that of any individual predictor
(Bobko, Roth, & Potosky, 1999; Schmidt & Hunter, 1998). We discuss
two important considerations for evaluating multiple predictors: relative
importance and cross-validity.
Estimating predictor relative importance. Relative importance refers
to the relative contribution each predictor makes to the predictive power
of an overall regression model. Relative importance statistics are useful
for determining which predictors contribute most to predictive validity, as
well as for evaluating the extent to which a new predictor (or predictors)
contributes to an existing predictor battery. Perhaps the most common approach for assessing relative importance has been to examine the magnitude and statistical significance of the standardized regression coefficients
for individual predictors. When predictors are uncorrelated, the squared
regression coefficient for each variable represents the proportion of variance in the criterion for which that predictor accounts. However, when
the predictors are correlated (as often is the case in selection research),
the squared regression coefficients do not sum to the total variance explained (i.e., R²), which makes conclusions concerning relative validity
ambiguous and possibly misleading (Budescu & Azen, 2004; Johnson &
LeBreton, 2004).
Another common approach used to examine predictor relative importance is to determine the incremental validity of a given predictor beyond
that provided by a different predictor(s). For instance, a researcher may
need to determine whether a new selection procedure adds incremental
prediction of valued criteria beyond that provided by existing procedures.
However, as LeBreton, Hargis, Griepentrog, Oswald, and Ployhart (2007)
noted, new predictors often account for relatively small portions of unique
variance in the criterion beyond that accounted for by the existing predictors. This is because incremental validity analyses assign any shared
variance (i.e., between the new predictor and the existing predictors) to
the existing predictors, which reduces the amount of validity attributed to
the new predictor. Such analyses do not provide information concerning
the contribution the new predictor makes to the overall regression model
(i.e., R²) relative to the other predictors.
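The point is easy to see with a small numerical sketch. Using made-up correlations among two existing predictors and one new predictor, the incremental validity of the new predictor is the gain in R² over the existing pair; any criterion variance the new predictor shares with the existing predictors is credited to the latter.

```python
import numpy as np

def r_squared(R_xx, r_xy):
    """Model R^2 for standardized predictors: r_xy' * inverse(R_xx) * r_xy."""
    return float(r_xy @ np.linalg.solve(R_xx, r_xy))

# Hypothetical correlations: x1 and x2 are the existing predictors, x3 is new.
R = np.array([[1.0, 0.3, 0.4],
              [0.3, 1.0, 0.5],
              [0.4, 0.5, 1.0]])
validities = np.array([0.30, 0.25, 0.28])

r2_existing = r_squared(R[:2, :2], validities[:2])
r2_full = r_squared(R, validities)
print(r2_full - r2_existing)   # incremental validity (delta R^2) attributed to x3
```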
Relative weight analysis (RWA; Johnson, 2000) and dominance analysis (DA; Budescu, 1993) represent complementary methods for assessing
across jobs to estimate an overall validity coefficient. Johnson and colleagues outlined the necessary formulas and data requirements to conduct
differential prediction analyses within this framework.
Finally, Aguinis and colleagues have developed several freely available
computer programs that can aid in the assessment of differential prediction. For instance, Aguinis, Boik, and Pierce (2001) developed a program
called MMRPOWER that allows researchers to approximate power by inputting parameters such as sample size, predictor RR, predictor-subgroup
correlations, and reliability of measurement. Also, the MMR approach to
testing for differential prediction assumes that the variance in the criterion
that remains after predicting the criterion from the predictor is roughly
equal across subgroups. Violating this assumption can influence type I
error rates and reduce power, which, in turn, can lead to inaccurate conclusions regarding moderator effects (Aguinis, Peterson, & Pierce, 1999;
Oswald, Saad, & Sackett, 2000). Aguinis et al. (1999) developed a program
(i.e., ALTMMR) that determines whether the assumption of homogeneity of error variance has been violated and, if so, computes alternative
inferential statistics that test for a moderating effect. Last, Aguinis and
Pierce (2006) described a program for computing the effect size (f²) of a
categorical moderator.
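The basic MMR test for differential prediction referred to above can be sketched in a few lines; this is a generic illustration rather than the cited programs, and the data and variable names are hypothetical. A significant subgroup term points to intercept differences, a significant predictor-by-subgroup term points to slope differences, and the final lines compute a conventional Cohen-type f² for the interaction step (not necessarily the exact computation implemented by Aguinis and Pierce, 2006).

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical validation data: predictor scores, criterion scores, subgroup code.
df = pd.DataFrame({
    "criterion": [3.1, 2.8, 3.9, 2.5, 4.0, 3.3, 2.9, 3.6, 2.2, 3.8, 3.0, 3.4],
    "test":      [48,  41,  59,  37,  62,  50,  44,  55,  33,  58,  46,  52],
    "group":     ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"],
})

# Moderated multiple regression: predictor, group (intercept differences),
# and predictor x group interaction (slope differences).
full = smf.ols("criterion ~ test * C(group)", data=df).fit()
print(full.summary())

# Cohen-type f^2 for the interaction: gain in R^2 from adding the interaction,
# scaled by the unexplained variance of the full model.
reduced = smf.ols("criterion ~ test + C(group)", data=df).fit()
print((full.rsquared - reduced.rsquared) / (1 - full.rsquared))
```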
Conclusions and recommendations. Accurate assessment of differential prediction is critically important for predictor validation. Interestingly,
recent research has tended to focus on issues associated with detecting
differential prediction rather than estimating the differential prediction
associated with the predictors themselves. One possible reason for this is
that many researchers appear to have generalized earlier findings of no differential prediction for cognitive ability tests (e.g., Dunbar & Novick, 1988;
Houston & Novick, 1987; Schmidt, Pearlman, & Hunter, 1981) to other
predictors and to demographic group comparisons other than Black-White
and male-female. The results of our review suggest that such generalizations
may be unfounded. With this in mind, we offer the following recommendations.
(a) When technically feasible (e.g., when there is sufficient statistical
power, when unbiased criteria are available), conduct differential prediction analyses as a standard component of validation
research.
(b) Be aware that even predictor constructs with small subgroup mean
differences (e.g., certain personality variables) can contribute to
predictive bias.
(c) Differential prediction may be as much a function of the criteria as
it is a function of the predictor(s) being validated (Saad & Sackett,
2002). Thus, think carefully about the choice of validation criteria
4. Many recent studies have examined differences in the psychometric properties of
noncognitive predictors (e.g., personality measures) between applicant and nonapplicant
groups. Some studies have found evidence of measurement invariance (e.g., Robie, Zickar,
& Schmit, 2001; D. B. Smith & Ellingson, 2002; D. B. Smith, Hanges, & Dickson,
2001), whereas other studies have found nontrivial between-group differences, such as
the existence of an ideal-employee factor among applicants but not among nonapplicants (e.g., Cellar, Miller, Doverspike, & Klawsky, 1996; Schmit & Ryan, 1993). One
consistent finding is that applicants tend to receive higher scores than do incumbents
(Birkeland, Kisamore, Brannick, & Smith, 2006). Higher mean scores may affect criterion-related validity estimates to the extent they reduce score variability on the predictor. High
mean scores also can reduce the extent to which predictors differentiate among applicants, result in an increased number of tied scores, and create challenges for setting cut
scores.
One of the most noticeable recent trends in selection research has been
the increased attention given to the criterion domain. This is a welcome
trend because accurate specification and measurement of criteria are vital
for the effective selection, development, and validation of predictors. After all, predictors derive their importance from criteria (Wallace, 1965).
Recent studies have examined a wide range of criterion issues, and a
comprehensive treatment of this literature is well beyond the scope of
this article. Instead, we focus on key findings that have the most direct
implications for use of criteria in predictor validation.
Expanding the Performance Domain
In contrast to decades of research on task performance, studies conducted during the past decade have increasingly focused on expanding the
criterion domain to include behaviors that may fall outside of job-specific
not consistently large, and ability was similarly related to some facets of
task and citizenship performance. Further, although Agreeableness tended
to be more related to citizenship performance than to task performance,
dependability and achievement were similarly related to the two types of
performance.
Counterproductive work behavior. The second major expansion of
the criterion space involves counterproductive work behavior (CWB).
CWBs reflect voluntary actions that violate organizational norms and
threaten the well-being of the organization and/or its members (Bennett &
Robinson, 2000; Robinson & Bennett, 1995). Researchers have identified
a large number of CWBs, including theft, property destruction, unsafe
behavior, poor attendance, and intentional poor performance. However,
empirical research typically has found evidence for a general CWB factor
(e.g., Bennett & Robinson, 2000; Lee & Allen, 2002), or for a small set
of subfactors (e.g., Gruys & Sackett, 2003; Sackett, Berry, Wiemann, &
Laczo, 2006). For example, Sackett et al. (2006) found two CWB factors,
one that reflected behaviors aimed at the organization (i.e., organizational
deviance) and another that reflected behaviors aimed at other individuals
within the organization (i.e., interpersonal deviance). Moreover, results
of a recent meta-analysis revealed that although highly related (ρ = .62),
interpersonal and organizational deviance had somewhat different correlates (Berry, Ones, & Sackett, 2007). For example, interpersonal deviance
was more strongly related to Agreeableness, whereas organizational deviance was more strongly related to Conscientiousness and citizenship
behaviors.
As with citizenship performance, an important issue for researchers is
whether CWBs can be measured in such a way that they provide performance information not captured by other criterion measures. Preliminary
evidence suggests some reason for optimism in this regard. For instance, a
meta-analysis by Dalal (2005) revealed a modest relationship (ρ = -.32)
between CWBs and citizenship behaviors (though the relationship was
much stronger [ρ = -.71] when supervisors rated both sets of behaviors).
There also was some evidence that the two types of behaviors were differentially related to variables such as job satisfaction and negative affect.
Sackett et al. (2006) found that treating CWB and citizenship behaviors as
separate factors in a CFA provided a better fit to the data than did treating
them as a single entity. In addition, the Big Five personality factors exhibited somewhat different relations with the two criteria. Similarly, Dudley,
Orvis, Lebiecki, and Cortina (2006) found that CWBs were predicted
by different facets of Conscientiousness than were task and citizenship
performance.
Adaptive performance. Today's work environment often requires
workers to adapt to new and ever-changing situations, including
Holland (2003) showed that aligning personality scales with narrow criteria, instead of with an overall criterion composite, resulted in higher
validity estimates. For instance, operational validities based on narrow
and aligned criteria versus broad criteria were .31 versus .20 for Conscientiousness, and .29 versus .08 for Openness. In a similar study, Bartram
(2005) found that observed correlations between personality test scores
and performance ratings were larger when the predictors and criteria were
theoretically aligned than when they were not (mean r = .16 vs. .02).
Although the idea of multidimensional criteria is appealing, in practice, ratings criteria often are so intercorrelated that the use of separate
criterion variables cannot be justified empirically. Indeed, researchers frequently describe the diverse nature of the performance dimensions rated,
but then end up having to use a single criterion composite because the
dimension ratings are highly correlated or because some type of factor
analysis suggested the existence of a dominant single factor. For example,
a meta-analysis by Viswesvaran et al. (2005) found that when all forms
of measurement error were removed from the ratings, a general factor
accounted for 60% of the total variance. They concluded that the common practice of combining individual dimension ratings into an overall
performance composite to serve as a validation criterion is indeed justifiable. However, the researchers also noted that more specific performance
factors could account for some of the variance unexplained by the general
factor.
Conclusions and recommendations. The results of our review generally support the conclusions of Schmidt and Kaplan (1971): Broad criteria
are best predicted by broad predictors, and narrow criteria are best predicted by narrow predictors. That said, there are some important new
developments.
(a) Criterion-related validity may be enhanced by linking narrow criteria to narrow predictors. However, such relationships may not
generalize across jobs to the same extent that relations between
broad criteria and broad predictors generalize (i.e., because narrow
criteria are less likely to be equally important for each job).
(b) Although weighting conceptually aligned predictor and criterion composites may enhance predictive validity, weighting also
poses some potential limitations. First, it assumes that researchers
can indeed extract multiple criterion variables from their performance measures, which, as we discussed, remains a persistent
challenge. Second, if researchers choose to derive empirically
weighted predictor composites on the basis of validation results
(rather than, for example, rationally weighted composites according to criterion importance to the organization), the predictors
and Brown (2003). Finally, Klehe and Latham (2006) discovered that
ratings of both situational and BDI questions were stronger predictors of
typical performance than of maximum performance, despite the fact that
selection interviews would seem to represent a maximum performance-oriented context.
Conclusions and recommendations. The distinction between maximum and typical performance is a potentially important but frequently
overlooked issue in validity research. We believe that much more research is needed to further understand this distinction and its implications
for predictor development and validation. On the basis of the relatively
limited amount of research that has been conducted, we recommend the
following.
(a) Devote careful thought to whether the performance domain of the
target job reflects maximum performance, typical performance, or
some combination of the two and then develop selection procedures accordingly. A mismatch of predictors and typical/maximum
criteria could obscure evidence of criterion-related validity and
lead to inaccurate conclusions about predictor effectiveness.
(b) Be aware, however, that recent findings suggest that traditional
distinctions between maximum and typical performance are not
always very clear. For example, there is evidence that motivation is
not constant in maximum performance contexts. Likewise, the distinction between predictors of typical and maximum performance
is not clear-cut. It appears that maximum and typical performance
represent a continuum rather than a strict dichotomy.
(c) The maximum-typical distinction may be most important for jobs
that have a strong maximum performance component. In such
cases, failure to include a maximum criterion would neglect a
potentially important aspect of the performance domain and, in
turn, the identification of valid predictors. As an example, Jackson
and colleagues (Jackson, Harris, Ashton, McCarthy, & Tremblay,
2000) noted that law enforcement jobs often require maximum
performance (e.g., apprehending suspects) but that such performance often is not easily observed by supervisors. They described
how standardized work sample tests can serve as useful validation
criteria for such jobs.
Dynamic Criteria
whereas the opposite was true for the achievement facet (r = .01 vs.
.22). Farrell and McDaniel (2001) examined task consistency as a potential moderator of predictor-criterion relations over time. Results revealed
that for jobs with primarily consistent tasks, cognitive ability was the best
predictor of initial performance, whereas psychomotor ability was a better
predictor of long-term performance. Conversely, for jobs with inconsistent
tasks, cognitive ability was the best predictor of both initial and long-term
performance.
Conclusions and recommendations. Performance is dynamic over
time yet is still explainable and predictable. Recent studies have helped
clarify the nature of criterion dynamism.
(a) Most research has found that performance follows a learning
curve, such that it tends to increase rapidly when employees begin
a job and then reaches an asymptote and levels off thereafter. Thus,
even though performance changes over time, there is systematic
variability present in the form of performance trends that can be
predicted and explained by individual differences.
(b) Different predictor constructs often are related to different stages
of job performance. Using Murphy's (1989) framework, transition
performance tends to be best predicted by cognitively oriented
constructs, whereas maintenance performance tends to be best
predicted by motivational/dispositional constructs.
(c) Think carefully about when to collect the criterion data used to
validate selection procedures. As a general rule, it may be advantageous to collect performance data from employees who are in
the maintenance stage of their jobs. However, this may not always be easy to determine, as the duration of transition stages can
vary depending on the job (e.g., complexity), the individual (e.g.,
prior experience), and the situation (e.g., quality of supervision;
Deadrick et al., 1997). Furthermore, transition times are becoming
increasingly important given the changing nature of work (Ilgen
& Pulakos, 1999), and early performance matters a great deal for
many jobs (e.g., first responders).
(d) If the validation sample comprises job incumbents, examine
whether tenure has any substantive influence on predictor scores,
criterion scores, or criterion-related validity. For example, researchers might calculate partial validity coefficients that remove
variance due to tenure when estimating predictor validity. Similarly, one could see whether tenure moderates validity. These types
of analyses may help researchers determine whether tenure has a
practically meaningful influence on predictive validity.
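As a rough sketch of the kind of analysis suggested in point (d), the partial correlation below removes variance due to tenure from a predictor-criterion relationship; the data are simulated and purely illustrative, and a parallel moderated regression could test whether tenure moderates validity.

```python
import numpy as np

def partial_r(x, y, z):
    """Correlation between x and y with variance due to z partialed out
    (e.g., predictor-criterion validity controlling for tenure)."""
    rxy = np.corrcoef(x, y)[0, 1]
    rxz = np.corrcoef(x, z)[0, 1]
    ryz = np.corrcoef(y, z)[0, 1]
    return (rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2))

# Hypothetical incumbent data: predictor scores drift with tenure in this toy
# example, so part of the zero-order validity reflects tenure rather than the
# predictor construct itself.
rng = np.random.default_rng(1)
tenure = rng.uniform(0, 10, 200)
predictor = 0.1 * (tenure - 5) + rng.normal(size=200)
performance = 0.3 * predictor + 0.1 * tenure + rng.normal(size=200)

print(np.corrcoef(predictor, performance)[0, 1])   # zero-order validity estimate
print(partial_r(predictor, performance, tenure))   # validity with tenure partialed out
```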
TABLE 1
Source Materials for Conducting Criterion-Related Validity Studies
Validation topic
I. Validity coefficient corrections
II. Evaluation of multiple predictors
III. Differential prediction
Source materials
Schmidt and Hunter (1996). Describe various situations requiring
measurement error corrections that have relevance for validation
research.
DeShon (2002). Comprehensible overview of G-theory, including
SPSS and SAS syntax for computing g-coefficients.
Materials from a conference workshop presented by R. A. McCloy
and D. J. Putka on estimating interrater reliability in the real
world: http://www.humrro.org/corpsite/siop_reliability.php.
Sackett and Yang (2000). Describe various scenarios in which RR
may be present and suggested treatments.
Hunter et al. (2006). Step-by-step procedure for indirect RR
corrections.
Sackett et al. (2002) and LeBreton et al. (2003). Discuss criterion
RR and suggested treatments.
Johnson and LeBreton (2004). Describe concept of relative
importance and various types of importance indices.
LeBreton et al. (2007). Step-by-step procedure for evaluating
predictor importance (see Table 3 of their article).
SPSS program for conducting a relative weight analysis: Contact
Jeff Johnson (jeff.johnson@pdri.com).
SAS program for conducting dominance analysis:
http://www.uwm.edu/azen.damacro.html.
Excel program for computing DA with six or fewer predictors:
http://www2.psych.purdue.edu/%7Elebreton/
Saad and Sackett (2002). Discuss differential validity resulting
from predictors versus criteria.
Sackett et al. (2003). Describe omitted variables problem and
potential solutions.
Computer programs by Aguinis and colleagues for estimating
power, checking violations of assumptions, and computing
predictor-subgroup effect sizes:
http://carbon.cudenver.edu/haguinis/mmr.
Murphy and Myors (2004). Readable book on statistical power.
Lievens et al. (2005). Framework for retesting effects and
associated analyses.
Coleman and Borman (2000), Lee and Allen (2002), and Van
Scotter et al. (2000). Sample citizenship performance measures.
Bennett and Robinson (2000) and Gruys and Sackett (2003).
Sample CWB measures.
Pulakos et al. (2002). Definitions and behavioral examples of
adaptive performance dimensions.
Murphy and Shiarella (1997). Step-by-step procedure for
computing validity coefficients for multivariate models.
Klehe, Anderson, and Viswesvaran (2007). Entire volume of
Human Performance devoted to the maximum-typical performance
distinction.
Ployhart et al. (2006). Chapter 4 covers a range of key issues
regarding the conceptualization and measurement of performance.
reviewed. We identify articles that describe the key issues, outline relevant
procedures and formulas, and include example measures. We also provide
Web addresses for freely available computer programs that can aid in the
analysis of validation data.
This article provided a comprehensive treatment of several core validation issues. Nonetheless, there are several issues we did not address that
researchers also should remember when validating selection procedures.
First, although highly important, planning and carrying out criterion-related validity analyses represents only one aspect of selection research.
There are a variety of other important issues researchers must consider,
such as the identification of the performance domain for the target job and
the KSAOs that impact performance, the initial choice of selection procedures, the development of the procedures (e.g., mode of administration,
item formats, development of alternative forms), assessment of content- and construct-related validity, and the development of cutoff scores, just
to name a few. Furthermore, although critical, criterion-related validity is
one of several factors researchers should consider when selecting a final
set of predictors for operational use. Cost, subgroup differences and adverse impact, likelihood of faking, administrator and applicant reactions,
and time for administration and scoring also must be carefully weighed.
Second, there are various reasons why local validation studies may not
always be feasible, including lack of access to large samples, inability to
collect valid and reliable criterion measures, and lack of resources to conduct a comprehensive validity study. In such instances, researchers should
consider alternative validation strategies, such as validity transportation,
synthetic validation, and validity generalization (for recent reviews, see
McPhail, 2007, and Scherbaum, 2005). Further, Newman, Jacobs, and
Bartram (2007) recently discussed how the combined use of meta-analysis
and a local validation study (via Bayesian estimation) can lead to more
accurate estimates of criterion-related validity than can either method
alone.
Finally, above all else, effective validation research requires sound
professional judgment (Guion, 1991). Being knowledgeable about the
validation literature certainly can contribute to sound judgment. However, researchers also must possess a thorough understanding of the target
job and the organizational context in which the job is performed. Moreover, every validation study presents unique challenges and opportunities.
Therefore, researchers should apply what they have learned from the literature in light of their given circumstances.
REFERENCES
Abelson MA. (1987). Examination of avoidable and unavoidable turnover. Journal of
Applied Psychology, 72, 382386.
Aguinis H, Boik RJ, Pierce CA. (2001). A generalized solution for approximating the
power to detect effects of categorical moderator variables using multiple regression.
Organizational Research Methods, 4, 291323.
Aguinis H, Peterson SA, Pierce CA. (1999). Appraisal of the homogeneity of error variance assumption and alternatives to multiple regression for estimating
moderating effects of categorical variables. Organizational Research Methods, 2,
315329.
Aguinis H, Pierce CA. (2006). Computation of effect size for moderating effects of categorical variables in multiple regression. Applied Psychological Measurement, 30,
440442.
Aguinis H, Smith MA. (2007). Understanding the impact of test validity and bias on
selection errors and adverse impact in human resource selection. Personnel Psychology, 60, 165-199.
Aguinis H, Stone-Romero EF. (1997). Methodological artifacts in moderated multiple
regression and their effects on statistical power. Journal of Applied Psychology, 82,
192206.
Allworth E, Hesketh B. (1999). Construct-oriented biodata: Capturing change-related and
contextually relevant future performance. International Journal of Selection and
Assessment, 7, 97111.
American Educational Research Association, American Psychological Association, National Council on Measurement in Education. (1999). Standards for educational
and psychological testing. Washington, DC: AERA.
Anderson N, Lievens F, van Dam K, Ryan AM. (2004). Future perspectives on employee
selection: Key directions for future research and practice. Applied Psychology: An
International Review, 53, 487501.
Arthur WJ, Bell ST, Villado AJ, Doverspike D. (2006). The use of person-organization
fit in employment decision making: An assessment of its criterion-related validity.
Journal of Applied Psychology, 91, 786801.
Arthur WJ, Villado AJ. (2008). The importance of distinguishing between constructs and
methods when comparing predictors in personnel selection research and practice.
Journal of Applied Psychology, 93, 435442.
Arvey RD, Strickland WJ, Drauden G, Martin C. (1990). Motivational components of test
taking. Personnel Psychology, 43, 695-716.
Barrett GV, Caldwell MS, Alexander RA. (1985). The concept of dynamic criteria: A
critical reanalysis. Personnel Psychology, 38, 41-56.
Barrett GV, Phillips JS, Alexander RA. (1981). Concurrent and predictive validity designs:
A critical reanalysis. Journal of Applied Psychology, 66, 1-6.
Barrick MR, Stewart GL, Neubert MJ, Mount MK. (1998). Relating member ability and
personality to work-team processes and team effectiveness. Journal of Applied
Psychology, 83, 377391.
Barrick MR, Zimmerman RD. (2005). Reducing voluntary, avoidable turnover through
selection. Journal of Applied Psychology, 90, 159166.
Bartlett CJ, Bobko P, Mosier SB, Hannan R. (1978). Testing for fairness with a moderated
multiple regression strategy: An alternative to differential prediction. Personnel
Psychology, 31, 233-241.
Bartram D. (2005). The great eight competencies: A criterion-centric approach to validation.
Journal of Applied Psychology, 90, 11851203.
Bennett RJ, Robinson SL. (2000). Development of a measure of workplace deviance.
Journal of Applied Psychology, 85, 349360.
Berry CM, Ones DS, Sackett PR. (2007). Interpersonal deviance, organizational deviance,
and their common correlates: A review and meta-analysis. Journal of Applied Psychology, 92, 410424.
Binning JF, Barrett GV. (1989). Validity of personnel decisions: A conceptual analysis of the inferential and evidential bases. Journal of Applied Psychology, 74,
478494.
Birkeland SA, Kisamore JL, Brannick MT, Smith MA. (2006). A meta-analytic investigation of job applicant faking on personality measures. International Journal of
Selection and Assessment, 14, 317335.
Bliese PD. (1998). Group size, ICC values, and group-level correlations: A simulation.
Organizational Research Methods, 1, 355373.
Bobko P, Roth PL, Potosky D. (1999). Derivation and implications of a meta-analysis
matrix incorporating cognitive ability, alternative predictors, and job performance.
Personnel Psychology, 52, 561-589.
Borman WC, Motowidlo SJ. (1993). Expanding the criterion domain to include elements
of contextual performance. In Schmitt N, Borman WC (Eds.), Personnel selection
in organizations (pp. 7198). San Francisco: Jossey-Bass.
Borman WC, Motowidlo SJ. (1997). Task performance and contextual performance: The
meaning for personnel selection research. Human Performance, 10, 99109.
Brief AP, Motowidlo SJ. (1986). Prosocial organizational behaviors. Academy of Management Review, 11, 710725.
Budescu DV. (1993). Dominance analysis: A new approach to the problem of relative
importance of predictors in multiple regression. Psychological Bulletin, 114, 542
551.
Budescu DV, Azen R. (2004). Beyond global measures of relative importance:
Some insights from dominance analysis. Organizational Research Methods, 7,
341350.
Burket GR. (1964). A study of reduced rank models for multiple prediction. Psychometrika
Monograph Supplement, 12.
Callender JC, Osburn HG. (1980). Development and test of a new model for validity
generalization. Journal of Applied Psychology, 65, 543558.
Cattin P. (1980). Estimating the predictive power of a regression model. Journal of Applied
Psychology, 65, 407414.
Cellar DF, Miller ML, Doverspike DD, Klawsky JD. (1996). Comparison of factor
structures and criterion-related validity coefficients for two measures of personality based on the five factor model. Journal of Applied Psychology, 81,
694704.
Chan D, Schmitt N. (2002). Situational judgment and job performance. Human Performance, 15, 233254.
Cleary TA. (1968). Test bias: Prediction of grades of negro and white students in integrated
colleges. Journal of Educational Measurement, 5, 115124.
Coleman VI, Borman WC. (2000). Investigating the underlying structure of the citizenship
performance domain. Human Resource Management Review, 10, 2544.
Conway JM. (1999). Distinguishing contextual performance from task performance for
managerial jobs. Journal of Applied Psychology, 84, 313.
Cronbach LJ. (1951). Coefficient alpha and the internal structure of tests. Psychometrika,
16, 297334.
Cronbach LJ, Gleser GC, Nanda H, Rajaratnam N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York:
Wiley.
Dalal RS. (2005). Meta-analysis of the relationship between organizational citizenship
behavior and counterproductive work behavior. Journal of Applied Psychology, 90,
12411255.
De Corte W. (1999). Weighting job performance predictors to both maximize the quality of
the selected workforce and control the level of adverse impact. Journal of Applied
Psychology, 84, 695702.
De Corte W, Lievens F. (2003). A practical procedure to estimate the quality and the adverse
impact of single-stage selection decisions. International Journal of Selection and
Assessment, 11, 8997.
De Corte W, Lievens F, Sackett PR. (2006). Predicting adverse impact and mean criterion performance in multistage selection. Journal of Applied Psychology, 91,
523537.
De Corte W, Lievens F, Sackett PR. (2007). Combining predictors to achieve optimal tradeoffs between selection quality and adverse impact. Journal of Applied Psychology,
92, 13801393.
Deadrick DL, Bennett N, Russell CJ. (1997). Using hierarchical linear modeling to examine
dynamic performance criteria over time. Journal of Management, 23, 745757.
DeShon RP. (2002). Generalizability theory. In Drasgow F, Schmitt N (Eds.), Measuring and analyzing behavior in organizations: Advances in measurement and data
analysis (pp. 189220). San Francisco: Jossey-Bass.
DeShon RP. (2003). A generalizability theory perspective on measurement error corrections
in validity generalization. In Murphy KR (Ed.), Validity generalization: A critical
review (pp. 365402). Mahwah, NJ: Erlbaum.
DuBois CLZ, Sackett PR, Zedeck S, Fogli L. (1993). Further exploration of typical and
maximum performance criteria: Definitional issues, prediction, and white-black
differences. Journal of Applied Psychology, 78, 205211.
Dudley NM, Orvis KA, Lebiecki JA, Cortina JM. (2006). A meta-analytic investigation of
conscientiousness in the prediction of job performance: Examining the intercorrelations and the incremental validity of narrow traits. Journal of Applied Psychology,
91, 4057.
Dunbar SB, Linn RL. (1991). Range restriction adjustments in the prediction of military
job performance. In Wigdor AK, Green BF Jr. (Eds.), Performance assessment for
the workplace (Vol. 2, pp. 127157). Washington, DC: National Academy Press.
Dunbar SB, Novick MR. (1988). On predicting success in training for men and women:
Examples from Marine Corps clerical specialties. Journal of Applied Psychology,
75, 545550.
Dunnette MD. (1963). A note on the criterion. Journal of Applied Psychology, 47, 251254.
Dunnette MD, McCartney J, Carlson HC, Kirchner WK. (1962). A study of faking behavior
on a forced-choice self-description checklist. Personnel Psychology, 15, 13-24.
Ellingson JE, Sackett PR, Connelly BS. (2007). Personality assessment across selection
and development contexts: Insights into response distortion. Journal of Applied
Psychology, 92, 386395.
Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, and Department of Justice. (1978). Uniform guidelines on employee selection procedures.
Federal Register, 43(166), 38295-38309.
Farrell JN, McDaniel MA. (2001). The stability of validity coefficients over time: Ackerman's (1988) model and the General Aptitude Test Battery. Journal of Applied
Psychology, 86, 6079.
Ferris GR, Witt LA, Hochwarter WA. (2001). Interaction of social skill and general mental
ability on job performance and salary. Journal of Applied Psychology, 86, 1075
1082.
Ghiselli EE. (1956). Dimensional problems of criteria. Journal of Applied Psychology, 40,
14.
Gruys ML, Sackett PR. (2003). Investigating the dimensionality of counterproductive work
behavior. International Journal of Selection and Assessment, 11, 3042.
Guion RM. (1961). Criterion measurement and personnel judgments. Personnel Psychology, 14, 141-149.
Guion RM. (1991). Personnel assessment, selection, and placement. In Dunnette MD,
Hough LM (Eds.), Handbook of industrial and organizational psychology (2nd ed.,
Vol. 2, pp. 327397). Palo Alto, CA: Consulting Psychologists Press.
Guion RM. (1998). Assessment, measurement, and prediction for personnel decisions.
Mahwah, NJ: Erlbaum.
Guion RM, Cranny CJ. (1982). A note on concurrent versus predictive validity designs: A
critical reanalysis. Journal of Applied Psychology, 67, 239244.
Harold CM, McFarland LA, Weekley JA. (2006). The validity of verifiable and nonverifiable biodata items: An examination across applicants and incumbents. International Journal of Selection and Assessment, 14, 336346.
Hattrup K, O'Connell MS, Wingate PH. (1998). Prediction of multidimensional criteria:
Distinguishing task and contextual performance. Human Performance, 11, 305319.
Hattrup K, Rock J, Scalia C. (1997). The effects of varying conceptualizations of job
performance on adverse impact, minority hiring, and predicted performance. Journal
of Applied Psychology, 82, 656664.
Hausknecht JP, Halpert JA, Di Paolo NT, Moriarty Gerrard MO. (2007). Retesting in
selection: A meta-analysis of coaching and practice effects for tests of cognitive
ability. Journal of Applied Psychology, 92, 373385.
Hausknecht JP, Trevor CO, Farr JL. (2002). Retaking ability tests in a selection setting:
Implications for practice effects, training performance, and turnover. Journal of
Applied Psychology, 87, 243254.
Hoffman BJ, Blair CA, Meriac JP, Woehr DJ. (2007). Expanding the criterion domain?
A quantitative review of the OCB literature. Journal of Applied Psychology, 92,
555566.
Hogan J, Barrett P, Hogan R. (2007). Personality measurement, faking, and employment
selection. Journal of Applied Psychology, 92, 12701285.
Hogan J, Holland B. (2003). Using theory to evaluate personality and job-performance
relations: A socioanalytic perspective. Journal of Applied Psychology, 88, 100112.
Hough LM. (1992). The Big Five personality variables-construct confusion: Description
versus prediction. Human Performance, 5, 139155.
Hough LM. (1998). Personality at work: Issues and evidence. In Hakel M (Ed.), Beyond multiple choice: Evaluating alternatives to traditional testing for selection
(pp. 131166). Mahwah, NJ: Erlbaum.
Hough LM. (2001). I/Owes its advances to personality. In Roberts B, Hogan RT (Eds.),
Applied personality psychology: The intersection between personality and I/O psychology (pp. 1944). Washington, DC: American Psychological Association.
Houston WM, Novick MR. (1987). Race-based differential prediction in Air Force technical
training programs. Journal of Educational Measurements, 24, 309320.
Huffcutt AI, Conway JM, Roth PL, Klehe U-C. (2004). The impact of job complexity and
study design on situational and behavior description interview validity. International
Journal of Selection and Assessment, 12, 262273.
Huffcutt AI, Conway JM, Roth PL, Stone NJ. (2001). Identification and meta-analytic
assessment of psychological constructs measured in employment interviews. Journal
of Applied Psychology, 86, 897913.
Humphreys LG. (1960). Investigations of the simplex. Psychometricka, 25, 313323.
Hunter JE, Schmidt FL, Le H. (2006). Implications of direct and indirect range restriction for meta-analysis methods and findings. Journal of Applied Psychology, 91, 594–612.
Hurtz GM, Donovan JJ. (2000). Personality and job performance: The Big Five revisited. Journal of Applied Psychology, 85, 869–879.
Huselid MA. (1995). The impact of human resource management practices on turnover, productivity, and corporate financial performance. Academy of Management Journal, 38, 635–672.
Ilgen DR, Pulakos ED. (1999). Employee performance in today's organizations. In Ilgen DR, Pulakos ED (Eds.), The changing nature of work performance: Implications for staffing, motivation, and development (pp. 1–20). San Francisco: Jossey-Bass.
Jackson DN, Harris WG, Ashton MC, McCarthy JM, Tremblay PF. (2000). How useful are work samples in validation studies? International Journal of Selection and Assessment, 8, 29–33.
James LR. (1980). The unmeasured variables problem in path analysis. Journal of Applied Psychology, 65, 415–421.
James LR, Demaree RG, Wolf G. (1984). Estimating within-group interrater reliability with and without response bias. Journal of Applied Psychology, 69, 85–98.
Johnson JW. (2000). A heuristic method for estimating the relative weights of predictor variables in multiple regression. Multivariate Behavioral Research, 35, 1–19.
Johnson JW. (2001). The relative importance of task and contextual performance dimensions to supervisor judgments of overall performance. Journal of Applied Psychology, 86, 984–996.
Johnson JW. (2004). Factors affecting relative weights: The influence of sampling and measurement error. Organizational Research Methods, 7, 283–299.
Johnson JW. (2007, April). Distinguishing adaptive performance from task and citizenship performance. In Oswald FL, Oberlander EM (Chairs), Adaptive skills and adaptive performance: Today's organizational reality. Symposium conducted at the 22nd Annual Conference of the Society for Industrial and Organizational Psychology, New York.
Johnson JW, Carter GW, Davison HK, Oliver DH. (2001). A synthetic validity approach to testing differential prediction hypotheses. Journal of Applied Psychology, 86, 774–780.
Johnson JW, LeBreton JM. (2004). History and use of relative importance indices in organizational research. Organizational Research Methods, 7, 238–257.
Keil CT, Cortina JM. (2001). Degradation of validity over time: A test and extension of Ackerman's model. Psychological Bulletin, 127, 673–697.
Kirk AK, Brown DF. (2003). Latent constructs of proximal and distal motivation predicting performance under maximum test conditions. Journal of Applied Psychology, 88, 40–49.
Klehe U-C, Anderson N. (2007). Working hard and working smart: Motivation and ability during typical and maximum performance. Journal of Applied Psychology, 92, 978–992.
Klehe U-C, Anderson N, Viswesvaran C. (2007). More than peaks and valleys: Introduction to the special issue on typical and maximum performance. Human Performance, 20, 173–178.
Klehe U-C, Latham GP. (2006). What would you do - really or ideally? Constructs underlying the behavioral description interview and situational interview for predicting typical versus maximum performance. Human Performance, 19, 357–381.
Ladd RT, Atchley EK, Gniatczyk LA, Baumann LB. (2002, April). An evaluation of the construct validity of an assessment center using multiple-regression importance
analysis. Paper presented at the 17th Annual Conference of the Society for Industrial
and Organizational Psychology, Toronto.
Landy FJ. (1986). Stamp collecting versus science: Validation as hypothesis testing. American Psychologist, 41, 1183–1192.
Larson SC. (1931). The shrinkage of the coefficient of multiple correlation. Journal of Educational Psychology, 22, 45–55.
Le H, Schmidt FL. (2006). Correcting for indirect range restriction in meta-analysis: Testing a new meta-analytic procedure. Psychological Methods, 11, 416–438.
LeBreton JM, Burgess JRD, Kaiser RB, Atchley EK, James LR. (2003). The restriction of variance hypothesis and interrater reliability and agreement: Are ratings from multiple sources really dissimilar? Organizational Research Methods, 6, 80–128.
LeBreton JM, Hargis MB, Griepentrog B, Oswald FL, Ployhart RE. (2007). A multidimensional approach for evaluating variables in organizational research and practice. Personnel Psychology, 60, 475–498.
LeBreton JM, Ployhart RE, Ladd RT. (2004). A Monte Carlo comparison of relative importance methodologies. Organizational Research Methods, 7, 258–282.
Lee K, Allen NJ. (2002). Organizational citizenship behavior and workplace deviance: The role of affect and cognitions. Journal of Applied Psychology, 87, 131–142.
LePine JA, Van Dyne L. (2001). Voice and cooperative behavior as contrasting forms of contextual performance: Evidence of differential relationships with Big Five personality characteristics and cognitive ability. Journal of Applied Psychology, 86, 326–336.
Lievens F, Buyse T, Sackett PR. (2005). Retest effects in operational selection settings: Developments and test of a framework. Personnel Psychology, 58, 981–1007.
Lievens F, Reeve CL, Heggestad ED. (2007). An examination of psychometric bias due to retesting on cognitive ability tests in selection settings. Journal of Applied Psychology, 92, 1672–1682.
Linn RL, Harnisch DL, Dunbar SB. (1981). Corrections for range restriction: An empirical investigation of conditions resulting in conservative corrections. Journal of Applied Psychology, 66, 655–663.
Marcus B, Goffin RD, Johnston NG, Rothstein MG. (2007). Personality and cognitive ability as predictors of typical and maximum managerial performance. Human Performance, 20, 275–285.
McDaniel MA, Morgeson FP, Finnegan EB, Campion MA, Braverman EP. (2001). Predicting job performance from common sense. Journal of Applied Psychology, 86, 730–740.
McGraw KO, Wong SP. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1, 30–46.
McKay PF, McDaniel MA. (2006). A reexamination of Black-White mean differences in work performance: More data, more moderators. Journal of Applied Psychology, 91, 538–554.
McManus MA, Kelly ML. (1999). Personality measures and biodata: Evidence regarding their incremental predictive value in the life insurance industry. Personnel Psychology, 52, 137–148.
McPhail SM. (2007). Alternative validation strategies: Developing new and leveraging existing validity evidence. San Francisco: Wiley.
Messick SJ. (1998). Alternative models of assessment, uniform standards of validity. In Hakel MD (Ed.), Beyond multiple choice: Evaluating alternatives to traditional testing for selection (pp. 59–74). Mahwah, NJ: Erlbaum.
Morgeson FP, Reider MH, Campion MA. (2005). Selecting individuals in team settings: The importance of social skills, personality characteristics, and teamwork knowledge. Personnel Psychology, 58, 583–611.
Motowidlo SJ. (2000). Some basic issues related to contextual performance and organizational citizenship behavior in human resource management. Human Resource Management Review, 10, 115–126.
Mount MK, Witt LA, Barrick MR. (2000). Incremental validity of empirically keyed biodata scales over GMA and the five-factor personality constructs. Personnel Psychology, 53, 299–323.
Murphy KR. (1989). Is the relationship between cognitive ability and job performance stable over time? Human Performance, 2, 183–200.
Murphy KR, Cleveland JN, Skattebo AL, Kinney TB. (2004). Raters who pursue different goals give different ratings. Journal of Applied Psychology, 89, 158–164.
Murphy KR, DeShon RP. (2000). Interrater correlations do not estimate the reliability of job performance ratings. Personnel Psychology, 53, 873–900.
Murphy KR, Myors B. (2004). Statistical power analysis: A simple and general model for traditional and modern hypothesis tests (2nd ed.). Mahwah, NJ: Erlbaum.
Murphy KR, Shiarella AH. (1997). Implications of the multidimensional nature of job performance for the validity of selection tests: Multivariate frameworks for studying test validity. Personnel Psychology, 50, 823–854.
Nagle BF. (1953). Criterion development. Personnel Psychology, 6, 271–289.
Newman DA, Jacobs RR, Bartram D. (2007). Choosing the best method for local validity estimation: Relative accuracy of meta-analysis versus a local study versus Bayes-analysis. Journal of Applied Psychology, 92, 1394–1413.
Ones DS, Viswesvaran C. (2003). Job-specific applicant pools and national norms for personality scales: Implications for range-restriction corrections in validation research. Journal of Applied Psychology, 88, 570–577.
Organ DW. (1988). Organizational citizenship behavior: The good soldier syndrome. Lexington, MA: Lexington.
Organ DW. (1997). Organizational citizenship behavior: It's construct clean-up time. Human Performance, 10, 85–97.
Oswald FL, Saad S, Sackett PR. (2000). The homogeneity assumption in differential prediction analysis: Does it really matter? Journal of Applied Psychology, 85, 536–541.
Ployhart RE. (2006). Staffing in the 21st century: New challenges and strategic opportunities. Journal of Management, 32, 868–897.
Ployhart RE, Hakel MD. (1998). The substantive nature of performance variability: Predicting interindividual differences in intraindividual performance. Personnel Psychology, 51, 859–901.
Ployhart RE, Lim B-C, Chan K-Y. (2001). Exploring relations between typical and maximum performance ratings and the five-factor model of personality. Personnel Psychology, 54, 809–843.
Ployhart RE, Schneider B, Schmitt N. (2006). Staffing organizations: Contemporary practice and theory (3rd ed.). Mahwah, NJ: Erlbaum.
Pulakos ED, Arad S, Donovan MA, Plamondon KE. (2000). Adaptability in the workplace: Development of a taxonomy of adaptive performance. Journal of Applied Psychology, 85, 612–624.
Pulakos ED, Schmitt N, Dorsey DW, Arad S, Hedge JW, Borman WC. (2002). Predicting adaptive performance: Further tests of a model of adaptability. Human Performance, 15, 299–323.
Raju NS, Bilgic R, Edwards JE, Fleer PF. (1997). Methodology review: Estimation of population validity and cross-validity, and the use of equal weights in prediction. Applied Psychological Measurement, 21, 291–305.
Raju NS, Bilgic R, Edwards JE, Fleer PF. (1999). Accuracy of population validity and cross-validity estimation: An empirical comparison of formula-based, traditional empirical, and equal weights procedures. Applied Psychological Measurement, 23, 99–115.
Raymond MR, Neustel S, Anderson D. (2007). Retest effects and parallel forms in certification and licensure testing. Personnel Psychology, 60, 367–396.
Robie C, Zickar MJ, Schmit MJ. (2001). Measurement equivalence between applicant and incumbent groups: An IRT analysis of personality. Human Performance, 14, 187–207.
Robinson SL, Bennett RJ. (1995). A typology of deviant workplace behaviors: A multidimensional scaling study. Academy of Management Journal, 38, 555–572.
Roth PL, Bobko P, McFarland LA. (2005). A meta-analysis of work sample test validity: Updating and integrating some classic literature. Personnel Psychology, 58, 1009–1037.
Roth PL, Huffcutt AI, Bobko P. (2003). Ethnic group differences in measures of job performance: A new meta-analysis. Journal of Applied Psychology, 88, 694–706.
Rotundo M, Sackett PR. (1999). Effect of rater race on conclusions regarding differential prediction in cognitive ability tests. Journal of Applied Psychology, 84, 815–822.
Rotundo M, Sackett PR. (2002). The relative importance of task, citizenship, and counterproductive performance to global ratings of job performance: A policy-capturing approach. Journal of Applied Psychology, 87, 66–80.
Saad S, Sackett PR. (2002). Investigating differential prediction by gender in employment-related personality measures. Journal of Applied Psychology, 87, 667–674.
Sackett PR, Berry CM, Wiemann SA, Laczo RM. (2006). Citizenship and counterproductive behavior: Clarifying relations between the two domains. Human Performance, 19, 441–464.
Sackett PR, Laczo RM, Arvey RD. (2002). The effects of range restriction on estimates of criterion interrater reliability: Implications for validation research. Personnel Psychology, 55, 807–825.
Sackett PR, Laczo RM, Lippe ZP. (2003). Differential prediction and the use of multiple predictors: The omitted variables problem. Journal of Applied Psychology, 88, 1046–1056.
Sackett PR, Lievens F, Berry CM, Landers RN. (2007). A cautionary note on the effects of range restriction on predictor intercorrelations. Journal of Applied Psychology, 92, 538–544.
Sackett PR, Ostgaard DJ. (1994). Job-specific applicant pools and national norms for cognitive ability tests: Implications for range restriction corrections in validation research. Journal of Applied Psychology, 79, 680–684.
Sackett PR, Yang H. (2000). Correction for range restriction: An expanded typology. Journal of Applied Psychology, 85, 112–118.
Sackett PR, Zedeck S, Fogli L. (1988). Relations between measures of typical and maximum job performance. Journal of Applied Psychology, 73, 482–486.
Salgado JF, Ones DS, Viswesvaran C. (2001). Predictors used for personnel selection: An overview of constructs, methods and techniques. In Anderson N, Ones DS, Sinangil HK, Viswesvaran C (Eds.), Handbook of industrial, work, and organizational psychology (Vol. 1, pp. 165–199). Thousand Oaks, CA: Sage.
Scherbaum CA. (2005). Synthetic validity: Past, present, and future. Personnel Psychology, 58, 481–515.
Schmidt FL. (1971). The relative efficiency of regression and simple unit predictor weights in applied differential psychology. Educational and Psychological Measurement, 31, 699–714.
Schmidt FL, Hunter JE. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1, 199–223.
Schmidt FL, Hunter JE. (1998). The validity and utility of selection methods in personnel selection: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262–274.
Schmidt FL, Hunter JE, Urry VW. (1976). Statistical power in criterion-related validation studies. Journal of Applied Psychology, 61, 473–485.
Schmidt FL, Kaplan LB. (1971). Composite versus multiple criteria: A review and resolution of the controversy. Personnel Psychology, 24, 419–434.
Schmidt FL, Le H, Ilies R. (2003). Beyond alpha: An empirical examination of the effects of different sources of measurement error on reliability estimates for measures of individual-differences constructs. Psychological Methods, 8, 206–224.
Schmidt FL, Oh I-S, Le H. (2006). Increasing the accuracy of corrections for range restriction: Implications for selection procedure validities and other research results. Personnel Psychology, 59, 281–305.
Schmidt FL, Pearlman K, Hunter JE. (1981). The validity and fairness of employment and educational tests for Hispanic Americans: A review and analysis. Personnel Psychology, 33, 705–724.
Schmidt FL, Viswesvaran C, Ones DS. (2000). Reliability is not validity and validity is not reliability. Personnel Psychology, 53, 901–912.
Schmit MJ, Ryan AM. (1993). The Big Five in personnel selection: Factor structure in applicant and nonapplicant populations. Journal of Applied Psychology, 78, 966–974.
Schmitt N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8, 350–353.
Schmitt N. (2007). The value of personnel selection: Reflections on some remarkable claims. Academy of Management Perspectives, 21, 19–23.
Schmitt N, Chan D. (2006). Situational judgment tests: Method or construct? In Weekley JA, Ployhart RE (Eds.), Situational judgment tests: Theory, measurement, and application (pp. 135–154). Mahwah, NJ: Erlbaum.
Schmitt N, Landy FJ. (1993). The concept of validity. In Schmitt N, Borman W (Eds.), Personnel selection in organizations (pp. 275–309). San Francisco: Jossey-Bass.
Schmitt N, Ployhart RE. (1999). Estimates of cross-validity for stepwise regression and with predictor selection. Journal of Applied Psychology, 84, 50–57.
Schmitt N, Rogers W, Chan D, Sheppard L, Jennings D. (1997). Adverse impact and predictor efficiency of various predictor combinations. Journal of Applied Psychology, 82, 719–730.
Sharf JC, Jones DP. (2000). Employment risk management. In Kehoe JF (Ed.), Managing selection in changing organizations (pp. 271–318). San Francisco: Jossey-Bass.
Smith CA, Organ DW, Near JP. (1983). Organizational citizenship behavior: Its nature and antecedents. Journal of Applied Psychology, 68, 653–663.
Smith DB, Ellingson JE. (2002). Substance versus style: A new look at social desirability in motivating contexts. Journal of Applied Psychology, 87, 211–219.
Smith DB, Hanges PJ, Dickson MW. (2001). Personnel selection and the five-factor model: Reexamining the effects of applicants' frame of reference. Journal of Applied Psychology, 86, 304–315.
Society for Industrial and Organizational Psychology, Inc. (2003). Principles for the validation and use of personnel selection procedures (4th ed.). Bowling Green, OH: Author.
Stewart GL. (1999). Trait bandwidth and stages of job performance. Journal of Applied Psychology, 84, 959–968.
Sturman MC, Cheramie RA, Cashen LH. (2005). The impact of job complexity and performance measurement on the temporal consistency, stability, and test-retest reliability of employee job performance ratings. Journal of Applied Psychology, 90, 269–283.
Thoresen CJ, Bradley JC, Bliese PD, Thoresen JD. (2004). The big five personality traits and individual job performance growth trajectories in maintenance and transitional job stages. Journal of Applied Psychology, 89, 835–853.
Thorndike RL. (1949). Personnel selection. New York: Wiley.
Van Scotter JR, Motowidlo SJ. (1996). Interpersonal facilitation and job dedication as separate facets of contextual performance. Journal of Applied Psychology, 81, 525–531.
Van Scotter JR, Motowidlo SJ, Cross TC. (2000). Effects of task performance and contextual performance on systemic rewards. Journal of Applied Psychology, 85, 526–535.
Viswesvaran C, Ones DS, Schmidt FL. (1996). Comparative analysis of the reliability of job performance ratings. Journal of Applied Psychology, 81, 557–574.
Viswesvaran C, Schmidt FL, Ones DS. (2002). The moderating influence of job performance dimensions on convergence of supervisory and peer ratings of job performance: Unconfounding construct-level convergence and rating difficulty. Journal of Applied Psychology, 87, 345–354.
Viswesvaran C, Schmidt FL, Ones DS. (2005). Is there a general factor in ratings of job performance? A meta-analysis framework for disentangling substantive and error influences. Journal of Applied Psychology, 90, 108–131.
Wallace SR. (1965). Criteria for what? American Psychologist, 20, 411–417.
Weekley JA, Ployhart RE, Harold CM. (2004). Personality and situational judgment tests across applicant and incumbent settings: An examination of validity, measurement, and subgroup differences. Human Performance, 17, 433–461.
Wonderlic Personnel Test, Inc. (1998). Comprehensive personality profile. Libertyville, IL: Author.
Wong KFE, Kwong JYY. (2007). Effects of rater goals and rating patterns: Evidence from an experimental field study. Journal of Applied Psychology, 92, 577–585.
Wright PM, Boswell WR. (2002). Desegregating HRM: A review and synthesis of micro and macro human resources management research. Journal of Management, 28, 247–276.
Yang H, Sackett PR, Nho Y. (2004). Developing a procedure to correct for range restriction that involves both institutional selection and applicants' rejection of job offers. Organizational Research Methods, 7, 442–455.