Você está na página 1de 8

Information School, University of Sheffield - 50th Anniversary

Journal of Information Science


1–8
Comparison of chemical similarity Ó The Author(s) 2013
Reprints and permission: sagepub.
measures using different numbers of co.uk/journalsPermissions.nav
DOI: 10.1177/0165551512470042
query structures jis.sagepub.com

Shereena M. Arif
Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Malaysia

John D. Holliday
Information School, University of Sheffield, UK

Peter Willett
Information School, University of Sheffield, UK

Abstract
Many different similarity measures have been described for searching chemical databases. Drawing on previous work in textual infor-
mation retrieval, this paper investigates the numbers of queries that are required to make robust statements as to the relative retrieval
effectiveness of different similarity measures. Experiments with the MDL Drug Data Report database suggest that much larger numbers
of queries are ideally required for this purpose than has been the case in previous comparative studies in chemoinformatics.

Keywords
chemoinformatics; sample size; similarity searching; test collection; virtual screening

1. Introduction
The work of Cleverdon and his colleagues in the Cranfield experiments established the central role of the document test
collection in information retrieval (IR) [1]. Although much extended in later studies, most obviously in the long-running
TREC conferences [2] and comparable evaluations such as ImageCLEF [3], the test-collection approach continues to pro-
vide an invaluable source for the development and evaluation of new approaches to retrieval [4].
A classical document test collection has three components (although the precise nature of the three components will
depend on the type of evaluation that is being carried out): a set of documents; a set of queries that can be matched
against that set of documents; and a set of relevance judgements stating which documents are relevant to each of
the queries. Work in Sheffield since the mid-1980s has used a suitably modified form of the test-collection approach for
the development and evaluation of methods for searching databases of chemical structures. This is an important part
of the discipline known as chemoinformatics, that is, ‘the application of informatics methods to solve chemical prob-
lems’ [5, 6]. The approach has been widely adopted in chemoinformatics, initially in the context of similarity searching
[7], but subsequently more widely across the whole range of methods that are used for what is generally known as vir-
tual screening [8–11]. This brief communication investigates one aspect of test-collections that has not previously been
studied in a chemoinformatics context, specifically the numbers of queries that are required to make robust statements
as to the relative merits of different similarity measures for searching chemical databases.

Corresponding author:
Peter Willett, Information School, The University of Sheffield, Regent Court, 211 Portobello Street, Sheffield S1 4DP, UK.
Email: p.willett@sheffield.ac.uk

Downloaded from jis.sagepub.com at PENNSYLVANIA STATE UNIV on May 12, 2016


Arif et al. 2

2. Chemoinformatics and information retrieval


Given a database of molecules, as represented by their chemical structures, and some property of interest, such as the
ability to reduce a person’s cholesterol levels, a virtual screening method will rank those molecules in order of decreas-
ing probability that they will exhibit the desired property. The top-ranked molecules are then those that are most likely
to be of interest for further research, for example, in a drug discovery programme at a pharmaceutical company.
It will be realized that ranking a database of chemical molecules in order of decreasing probability of biological activ-
ity is analogous, in principle at least, to ranking a database of textual documents in order of decreasing probability of
relevance. This analogy suggests that methods that have been developed previously by the IR community might well be
applicable, with suitable modification, to research in chemoinformatics. In fact, there are many such analogies between
these two, ostensibly disparate, disciplines [12, 13], an important one being the way that database records are repre-
sented for search. IR has long used a ‘bag of words’ approach in which a document is indexed by its constituent words,
without taking account of the precise way in which these words are related to each other (which lies in the realm of the
cognate discipline of natural language processing). The similarity between a database document and a query, and
hence the document’s probability of relevance to that query, is then based on measuring the extent of the overlap
between the words representing a document and the words representing the query. In much the same way, a database
structure can be indexed for searching by a ‘bag of substructures’ representation that lists its constituent substruc-
tures, for example, a molecule containing an aromatic carboxylic acid grouping might be indexed as containing ben-
zene and acid substructures inter alia. The substructures indexing a molecule are encoded in a fingerprint, a binary
vector in which specific bits are switched ‘on’ to indicate the presence of specific substructures (non-binary finger-
prints can also be used in which the individual vector elements contain weights denoting the relative importance of
each specific substructure).
In chemical similarity searching, the query is typically a structure, often called the reference structure, that has already
been found to exhibit the property of interest, and database structures that are similar to this query are then assumed also
to exhibit that property. This assumption is commonly referred to as the similar property principle and is a direct analo-
gue of the cluster hypothesis (i.e. that similar documents tend to be relevant to the same requests) that underlies the use
of clustering methods in IR [7]. The similarity between a database structure and the reference structure is computed by
comparing their fingerprints to identify bits, and hence substructures, that they share in common (as against the words in
common that form the basis for IR searches of text databases). The overall similarity measure hence involves the type of
fingerprint that is used to represent the molecules and the similarity coefficient that is used to compare two such represen-
tations. The present study compared the effectiveness of 20 different similarity measures (as described in Methodology
below) when used for similarity searching. More specifically, the focus of interest was not which measures were the most
effective (since there have been many such comparative studies [14–16]) but, rather, how many query reference struc-
tures needed to be used to ensure that the correct ‘best’ measures were indeed identified. Time and effort can clearly be
saved if just a few queries need to be searched to obtain a robust ranking of the various similarity measures under investi-
gation, and previous comparative studies have typically used just small numbers of queries. However, this may not be
desirable if there is a marked trade-off between the number of queries used and the accuracy of the rankings resulting
from those queries, that is, between the efficiency and the effectiveness of the comparison. This question has been studied
previously in the IR context on several occasions [17–19]: the rather different nature of the queries in similarity searching
raises the question of whether such a trade-off exists in the chemoinformatics context.

3. Methodology
3.1. The test collection
The test collection used here, called MDDR, is derived from the MDL Drug Data Report database (available from
Accelrys Inc. at http://www.accelrys.com). This database contains the structures and pharmacological class information
for molecules that have been reported in patents, journals and conference proceedings as exhibiting biological activity,
and the version used here contained 102,514 different molecules. A collaboration with the Novartis Institutes for
Biomedical Research in Basel [20] identified several activity classes in the database that were of current interest to the
pharmaceutical industry and six of these classes were used:

• 827 5HT1A agonists (subsequently abbreviated to 5HT, and studied as treatments for depression);
• 636 cyclooxygenase inhibitors (COX, non-steroidal anti-inflammatory action);
• 395 D2 antagonists (D2, blocking dopamine receptors associated with central nervous system diseases);
• 750 HIV protease inhibitors (HIV, HIV and AIDS);

Journal of Information Science, 2013, pp. 1–8 Ó The Author(s), DOI: 10.1177/0165551512470042

Downloaded from jis.sagepub.com at PENNSYLVANIA STATE UNIV on May 12, 2016


Arif et al. 3

• 435 protein kinase C inhibitors (PKC, prevents the initiation of cancer tumours);
• 1130 renin inhibitors (REN, treatment of high blood pressure).

For example, each of the 750 HIV compounds, when used as a reference structure, seeks to retrieve the other 749 inhibi-
tors at the top of the ranking generated using one of our 20 similarity measures. By comparing the numbers of inhibitors
that are actually retrieved (for which a cut-off of the top 1% of the database was used, that is, the top-ranked 1025 mole-
cules), the similarity measures can be ranked in order of decreasing retrieval effectiveness. Averaging over searches
using the 750 different query molecules we then obtain an overall ranking of the 20 similarity measures. As noted in the
Introduction, previous comparative studies of similarity measures have used only small numbers of queries, and the
research question posed here is hence as follows: is the same ranking of the 20 measures obtained if the average is calcu-
lated over not the full set of 750 HIV molecules but over just 10 of them selected at random, just 25 selected at random,
just 50 selected at random, etc.?

3.2. The similarity measures


The 20 different similarity measures that have been tested derive from all possible combinations of five different types
of fingerprint and four different types of similarity coefficient. The five types of fingerprint that were used to index the
query and database structures were as follows:

• ECFP4 (for Extended Connectivity Fingerprint with diameter of four bonds) fingerprints (2048 bits, available
from http://www.accelrys.com);
• Tripos Unity fingerprints (988 bits, available from http://www.tripos.com);
• Digital Chemistry fingerprints (1052 bits, available from http://www.digitalchemistry.co.uk);
• Daylight fingerprints (2048 bits, available from http://www.daylight.com);
• MDL-key fingerprints (166 bits, available from http://www.accelrys.com).

These fingerprints are described by Gardiner et al. [21]. The four coefficients were as follows:

c
Tanimoto
a + b c
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
Euclidean a + b  2c
c
RussellRao
n
cn
Forbes
ab
where it is assumed that the fingerprints describing the query structure and a database structure have a and b bits, respec-
tively, set in bit-strings containing a total of n bits (so that, for example, n = 2048 for the ECFP4 fingerprints), and that c
of these bits are common to the two fingerprints that are being compared. Values for the Tanimoto, Russell–Rao and
Forbes coefficients of either 0 or 1 denote fingerprints having no bits (or substructures) in common or fingerprints that
are identical (all substructures the same), respectively; for the Euclidean (which is a distance coefficient rather than a
pffiffiffiffiffiffiffiffiffiffiffi
similarity coefficient), the corresponding extremal values are a þ b and 0, respectively.

3.3. Evaluation of performance


Given a specific combination of one type of fingerprint and one type of similarity coefficient, some number, q (q ≤ N,
where N is the total number of molecules with the chosen bioactivity), of similarity searches are carried out using q query
molecules having a particular bioactivity. The top-ranked 1% of the database resulting from each such search is inspected
to determine how many of these nearest neighbour structures have the same bioactivity as the reference structure and then
the mean number of actives retrieved (when averaged over the q searches) is calculated to represent the retrieval effec-
tiveness of that particular similarity measure (this is a common evaluation criterion in chemoinformatics research). The
search process is repeated for each of the 20 similarity measures in turn, thus providing a quantitative basis for ranking
the similarity measures in order of decreasing effectiveness.

Journal of Information Science, 2013, pp. 1–8 Ó The Author(s), DOI: 10.1177/0165551512470042

Downloaded from jis.sagepub.com at PENNSYLVANIA STATE UNIV on May 12, 2016


Arif et al. 4

This procedure is summarized below, with the innermost loop represented diagrammatically in Figure 1:

FOR i := 1 TO 20 DO
Select the ith similarity measure
FOR j := 1 TO q DO
Select the jth reference structure at random (without replacement)
Rank the database in decreasing similarity to the jth reference structure using the ith similarity measure
Note the number of actives retrieved in the top 1% of the ranked database
END DO
Calculate the mean number of actives retrieved using the ith similarity measure
END DO
Rank the 20 similarity measures in order of decreasing mean number of actives retrieved

The search procedure above was carried out for each type of bioactivity class using values for q of 10, 25, 50, 100, 200,
300, . and N. The resulting rankings R10, R25, R50, etc., hence provided the basic data for a study of the extent to which
the rankings of similarity measures obtained using a sample of the query molecules mirror the ranking, RN, obtained
when the entire set of query molecules is used. For example, the ranking (highest to lowest mean numbers of actives
retrieved) of the 20 similarity measures obtained using the entire set of 395 D2 actives was as follows: ECFP4-Tan,
ECFP4-For, ECFP4-RR, DY-Tan, DC-Tan, DC-For, DY-For, DC-Euc, MDL-Tan, MDL-For, MDL-Euc, TU-For, TU-
Tan, TU-Euc, ECFP4-Euc, DY-Euc, DC-RR, DY-RR, MDL-RR and TU-RR (where ECFP4, TU, DC, DY and MDL
denote the ECFP4, Tripos Unity, Digital Chemistry, Daylight and MDL fingerprints listed in Section 3.2 above, and
where Tan, Euc, RR and For denote the Tanimoto, Euclidean, Russell–Rao and Forbes coefficients listed in the same
section). This ranking, RN, acts as the population-based gold standard with which each of the sample-based rankings
(R10, R25, R50, etc.) is compared to determine the effect of variations in the number of query molecules.
The obvious way of comparing two rankings of the same set of objects (such as R25 and RN in the present context) is
to use the Spearman rank correlation coefficient [22], which gives values of 1 or −1 for rankings that are identical or

Figure 1. Chemical similarity searching. A reference structure is compared with each of the molecules in a chemical database, and
the similarity computed using some particular similarity measure. The resulting similarities are then sorted into decreasing order
and the activity or inactivity noted (in this hypothetical example, two of the three most similar molecules have the similarities in
bold to denote that they are active). The procedure is repeated for each of the reference structures that is to be searched against
the database and then for each of the 20 different similarity measures.

Journal of Information Science, 2013, pp. 1–8 Ó The Author(s), DOI: 10.1177/0165551512470042

Downloaded from jis.sagepub.com at PENNSYLVANIA STATE UNIV on May 12, 2016


Arif et al. 5

are exact inverses of each other, respectively. Although very widely used, it may be argued that the Spearman coeffi-
cient is not an ideal evaluation criterion for this particular application. The purpose of comparing different similarity
measures is to identify the measure or measures that give the best overall level of performance, that is, that come at or
near the top of the rankings, with those measures occurring at the bottom of the rankings being considered of less impor-
tance (except, possibly, in a negative sense as measures that should be avoided in general). The Spearman coefficient,
however, considers the entire ranking, giving as much importance to the top of the ranking as it does to the bottom [23,
24] (and a related problem has been identified when the popular receiver operating characteristic curves are used for
evaluating virtual screening experiments [25]). An additional, purely empirical scoring function is hence proposed here
that attaches more importance to agreements at the top of the rankings that are being compared (and that accordingly
acts as a check on the well-established Spearman coefficient). Assume that some particular similarity measure is ranked
at position i (1 ≤ i ≥ 20) in the population-based ranking RN, and that it is ranked at position r in a sample-based rank-
ing. Then the function

21  i
1 + ji  r j
describes the degree of difference between the two rankings of that particular similarity measure. Summing over all of
the 20 measures that are under test, the function

X20
21  i
,
i=1
1 + ji  rj

which will be referred to as the accuracy score, then describes the overall degree of difference between the two rankings.
The maximum and minimal values of this expression are 20 + 19 + 18 + 17 . + 2 + 1 and (20/20) + (19/18) +
(18/16) + (17/14) + (16/12) + . + (2/18) + (1/20), respectively.

4. Results and discussion


The results obtained are exemplified by those for the D2 and HIV molecules shown in Figure 2. Figure 2a plots the
change in the Spearman rank correlation coefficient as the size of the query sample increases through the range R10, R25,
R50, R100, R200, etc. It will be seen that the coefficient has its lowest value for the smallest sample, R10, but that the value
then rises, albeit hardly smoothly, towards the limiting value for RN. Thus, while the ranking generated using just 10 ref-
erence structures is noticeably different from that obtained when the full set of reference structures is used, this differ-
ence soon disappears as the sample size grows. For this dataset, then, a sample of 100 queries yields a ranking of the 20
similarity measures that is little different from that obtained when all 395 queries are used.
The minimal values observed for the Spearman coefficient (and also for the accuracy score, where the scores have
been rounded to the nearest integer) for each activity class are listed in Table 1. The great majority of these minimal val-
ues were observed when the smallest samples were used, for example, the R10 rankings with the D2 molecules. The only
exceptions to this general rule were as follows: for the Spearman coefficient, the smallest value for COX was obtained
using R25; and for the accuracy score, the smallest values for COX and HIV were obtained with R50 and for PKC with
R25.
Inspection of the Spearman values in Table 1 suggests that, in some cases at least, the use of only small numbers of
queries is unlikely to change dramatically the conclusions that can be drawn as to the relative merits of the various simi-
larity measures that are being compared. For example, even the smallest HIV samples suggest a high level of agreement
between the sample-based rankings and RN, as illustrated by the plot in Figure 2c. This finding supports much previous
work on the comparison of similarity measures for virtual screening; for example, the many studies carried out in
Sheffield have typically used sets of 10 or 20 reference structures for each activity class under study; while Sastry et al.
[16] and McGaughey et al. [26] have used just a single reference structure for each class. In other cases, however, the
Spearman results indicate that considerably larger numbers are advisable when comparing different similarity measures,
with the D2 molecules providing an obvious example of this behaviour.
The need for large samples is much more marked if the accuracy score is used to compare the sample-based rankings
with RN, as demonstrated by Figure 2b and d. In both cases, it will be seen that, while the general form of the plots is
similar to that in Figure 2a and c, in that the value rises with an increase in the number of queries, even the largest sam-
ples yield accuracy scores that are rather less than the maximal value of 210. In some cases, such as D2, PKC and REN,
Table 1 shows that the minimal value is only about one-half of the maximum value, indicating a marked difference in

Journal of Information Science, 2013, pp. 1–8 Ó The Author(s), DOI: 10.1177/0165551512470042

Downloaded from jis.sagepub.com at PENNSYLVANIA STATE UNIV on May 12, 2016


Arif et al. 6

1 250
0.98

Spearman correlation
200

Accuracy score
0.96
150
0.94
100
0.92
0.9 50

0.88 0
0 100 200 300 400 500 0 100 200 300 400 500
Number of reference structures Number of reference structures

(a) (b)
1 250
0.98
Spearman correlation

200

Accuracy score
0.96
150
0.94
100
0.92
0.9 50

0.88 0
0 200 400 600 800 0 200 400 600 800
Number of reference structures Number of reference structures

(c) (d)

Figure 2. Comparison with RN of sample-based rankings using different numbers of query molecules: (a) Spearman correlation
coefficient for the D2 molecules; (b) accuracy score for the D2 molecules; (c) Spearman correlation coefficient for the HIV
molecules; (d) accuracy score for the HIV molecules.

Table 1. Smallest observed values for the Spearman correlation coefficient and the accuracy score in comparisons with RN of
sample-based rankings using different numbers of query molecules

Activity class Spearman coefficient Accuracy score

5HT 0.98 147


COX 0.96 121
D2 0.90 109
HIV 0.96 147
PKC 0.90 106
REN 0.92 96

the sample-based and population-based rankings of the 20 similarity measures. The accuracy score plots hence suggest
that notably more searches need to be carried out to approximate the full set of reference structures than when the
Spearman coefficient is used to assess the accuracy of the rankings. That said, the two criteria are in agreement to the
extent that both identify the 5HT, COX and HIV sample-based rankings as differing less from RN than do those for D2,
PKC and REN.
It has been noted previously that the purpose of this work is not to identify the best similarity measure: this is a topic
that has been extensively studied in the past, with the previous studies demonstrating the general effectiveness of the
Tanimoto coefficient and the ECFP4 fingerprint (as is the case, for example, for the D2 RN ranking listed previously). It
is hence hardly surprising that this combination was the best single measure when averaged over the entire sets of experi-
ments carried out here, with the combination of the Forbes coefficient and the ECFP4 fingerprint the next best. However,
there were occasions when the rankings of the 20 measures had these two in reverse order. Specifically, the latter

Journal of Information Science, 2013, pp. 1–8 Ó The Author(s), DOI: 10.1177/0165551512470042

Downloaded from jis.sagepub.com at PENNSYLVANIA STATE UNIV on May 12, 2016


Arif et al. 7

combination would have been judged the better of the two (and the best of the 20 measures) for the R10 to R300 rankings
for the PKC molecules and for the R10, R25 and R50 rankings for the COX molecules.
The results hence demonstrate clearly that there can be discrepancies in the rankings of similarity measures that are
obtained using different numbers of bioactive reference structures. In particular, rankings obtained using small numbers
can be markedly different from those obtained using all of the available active molecules. Since previous similarity-
measure comparisons have used only small, or very small, sets of queries, one might reasonably question the validity of
the results that have been obtained to date. However, it must be emphasized that such comparisons have typically used
multiple activity classes, combining the ranking obtained from one class with those from each of the other classes to
obtain an overall ranking of the available measures that is expected to give a better guide as to effectiveness than would
the use of just a single activity class. For example, the Sheffield group (and subsequently others) have made extensive
use of 11 (and sometimes up to 30) activity classes from the MDDR database, the Maximum Unbiased Validation data-
set of Rohrer and Baumann contains 17 classes from the PubChem database [27], and the Bajorath group has recently
described a set of 50 activity classes from the ChEMBL database [28].

5. Conclusions
The comparison of different types of similarity measure is an important aspect of research into virtual screening.
Previous comparative studies have used only limited numbers of query molecules, but the results presented here suggest
that this can be far from ideal, with some hundreds of query molecules being required to mirror closely the results
obtained using the full sets of query molecules for each of six types of biological activity. This report hence suggests that
future experimental studies could usefully employ larger numbers of queries than has been the case in the past.

Acknowledgements
We thank the Universiti Kebangsaan Malaysia for funding, and Accelrys Inc., Daylight Chemical Information Systems Inc., Digital
Chemistry Limited and Tripos Inc. for data and software support.

References
[1] Cleverdon CW. The Cranfield tests on index language devices. Aslib Proceedings 1967; 19(6): 173–94.
[2] Voorhees EM and Harman DK. TREC: Experiment and evaluation in information retrieval. Cambridge, MA: MIT Press, 2005.
[3] Müller H, Clough P, Deselaers T and Caputo B (eds) ImageCLEF – experimental evaluation of visual information retrieval.
Heidelberg: Springer, 2010.
[4] Sanderson M. Test collection based evaluation of information retrieval systems. Foundations and Trends in Information
Retrieval 2010; 4: 247–375.
[5] Gasteiger J and Engel T (eds) Chemoinformatics: A textbook. Weinheim: Wiley-VCH, 2003.
[6] Willett P. From chemical documentation to chemoinformatics: fifty years of chemical information science. Journal of
Information Science 2008; 34: 477–499.
[7] Willett P. Similarity methods in chemoinformatics. Annual Review of Information Science and Technology 2009; 43: 3–71.
[8] Alvarez J and Shoichet B (eds) Virtual screening in drug discovery. Boca Raton, FL: CRC Press, 2005.
[9] Geppert H, Vogt M and Bajorath J. Current trends in ligand-based virtual screening: molecular representations, data
mining methods, new application areas, and performance evaluation. Journal of Chemical Information and Modeling 2010; 50:
205–16.
[10] Rippenhausen P, Nisius B, Peltason L and Bajorath J. Quo vadis, virtual screening? A comprehensive survey of prospective
applications. Journal of Medicinal Chemistry 2010; 53: 8461–8467.
[11] Ripphausen P, Nisius B and Bajorath J. State-of-the-art in ligand-based virtual screening. Drug Discovery Today 2011; 16:
372–376.
[12] Willett P. Textual and chemical information retrieval: Different applications but similar algorithms. Information Research
2000; 5(2), http://InformationR.net/ir/5-2/infres52.html
[13] Willett P. Information retrieval and chemoinformatics: Is there a bibliometric link? In: Larsen B, Schneider JW, Åström F and
Schlemmer B (eds) The Janus faced scholar – A Festschrift in honour of Peter Ingwersen. Copenhagen: Royal School of
Library and Information Science, 2010, pp. 107–115.
[14] Bender A, Jenkins JL, Scheiber J, Sukuru SCK, Glick M and Davies JW. How similar are similarity searching methods? A prin-
cipal components analysis of molecular descriptor space. Journal of Chemical Information and Modeling 2009; 49: 108–119.
[15] Duan J, Dixon SL, Lowrie JF and Sherman W. Analysis and comparison of 2D fingerprints: Insights into database screening
performance using eight fingerprint methods. Journal of Molecular Graphics and Modelling 2010; 29: 157–170.
[16] Sastry M, Lowrie JF, Dixon SL and Sherman W. Large-scale systematic analysis of 2D fingerprint methods and parameters to
improve virtual screening enrichments. Journal of Chemical Information and Modeling 2010; 50: 771–748.

Journal of Information Science, 2013, pp. 1–8 Ó The Author(s), DOI: 10.1177/0165551512470042

Downloaded from jis.sagepub.com at PENNSYLVANIA STATE UNIV on May 12, 2016


Arif et al. 8

[17] Sparck Jones K and Bates RG. Report on a design study for the ‘ideal’ information retrieval test collection. Cambridge:
Computer Laboratory, University of Cambridge, 1977.
[18] Robertson SE. On sample sizes for non-matched-pair IR experiments. Information Processing and Management 1990; 26(6):
739–753.
[19] Voorhees EM. On test collections for adaptive information retrieval. Information Processing and Management 2008; 44: 1879–
1885.
[20] Hert J, Willett P, Wilton DJ, Acklin P, Azzaoui K, Jacoby E et al. Comparison of fingerprint-based methods for virtual
screening using multiple bioactive reference structures. Journal of Chemical Information and Computer Sciences 2004; 44:
1177–1185.
[21] Gardiner EJ, Gillet VJ, Haranczyk M, Hert J, Holliday JD, Malim N et al. Turbo similarity searching: Effect of fingerprint and
dataset on virtual-screening performance. Statistical Analysis and Data Mining 2009; 2: 103–114.
[22] Siegel S and Castellan NJ. Nonparametric statistics for the behavioural sciences, 2nd edn. New York: McGraw-Hill, 1988.
[23] Yilmaz E, Aslam JA and Robertson SE. A new rank correlation coefficient for information retrieval. Proceedings of the
31st annual ACM SIGIR conference on research and development in information retrieval, 2008. New York: ACM, 2008,
pp. 587–594.
[24] Carterette B. On rank correlation and the distance between rankings. Proceedings of the 32nd annual ACM SIGIR conference
on research and development in information retrieval. NewYork: ACM, 2009, pp. 436–443.
[25] Truchon J-F and Bayly CI. Evaluating virtual screening methods: good and bad metrics for the ‘early recognition’ problem.
Journal of Chemical Information and Modeling 2007; 47: 488–508.
[26] McGaughey GB, Sheridan RP, Bayly CI, Culberson JC, Kreatsoulas C, Lindsley S et al. Comparison of topological, shape, and
docking methods in virtual screening. Journal of Chemical Information and Modeling 2007; 47: 1504–1519.
[27] Rohrer SG and Baumann K. Maximum Unbiased Validation (MUV) data sets for virtual screening based on PubChem bioactiv-
ity data. Journal of Chemical Information and Modeling 2009; 49: 169–184.
[28] Heikamp K and Bajorath J. Large-scale similarity search profiling of ChEMBL compound data sets. Journal of Chemical
Information and Modeling 2011; 51: 1831–1839.

Journal of Information Science, 2013, pp. 1–8 Ó The Author(s), DOI: 10.1177/0165551512470042

Downloaded from jis.sagepub.com at PENNSYLVANIA STATE UNIV on May 12, 2016

Você também pode gostar