Shereena M. Arif
Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Malaysia
John D. Holliday
Information School, University of Sheffield, UK
Peter Willett
Information School, University of Sheffield, UK
Abstract
Many different similarity measures have been described for searching chemical databases. Drawing on previous work in textual information retrieval, this paper investigates the numbers of queries that are required to make robust statements as to the relative retrieval
effectiveness of different similarity measures. Experiments with the MDL Drug Data Report database suggest that much larger numbers
of queries are ideally required for this purpose than has been the case in previous comparative studies in chemoinformatics.
Keywords
chemoinformatics; sample size; similarity searching; test collection; virtual screening
1. Introduction
The work of Cleverdon and his colleagues in the Cranfield experiments established the central role of the document test
collection in information retrieval (IR) [1]. Although much extended in later studies, most obviously in the long-running
TREC conferences [2] and comparable evaluations such as ImageCLEF [3], the test-collection approach continues to pro-
vide an invaluable source for the development and evaluation of new approaches to retrieval [4].
A classical document test collection has three components (although the precise nature of the three components will
depend on the type of evaluation that is being carried out): a set of documents; a set of queries that can be matched
against that set of documents; and a set of relevance judgements stating which documents are relevant to each of
the queries. Work in Sheffield since the mid-1980s has used a suitably modified form of the test-collection approach for
the development and evaluation of methods for searching databases of chemical structures. This is an important part
of the discipline known as chemoinformatics, that is, ‘the application of informatics methods to solve chemical problems’ [5, 6]. The approach has been widely adopted in chemoinformatics, initially in the context of similarity searching
[7], but subsequently more widely across the whole range of methods that are used for what is generally known as vir-
tual screening [8–11]. This brief communication investigates one aspect of test-collections that has not previously been
studied in a chemoinformatics context, specifically the numbers of queries that are required to make robust statements
as to the relative merits of different similarity measures for searching chemical databases.
Corresponding author:
Peter Willett, Information School, The University of Sheffield, Regent Court, 211 Portobello Street, Sheffield S1 4DP, UK.
Email: p.willett@sheffield.ac.uk
3. Methodology
3.1. The test collection
The test collection used here, called MDDR, is derived from the MDL Drug Data Report database (available from
Accelrys Inc. at http://www.accelrys.com). This database contains the structures and pharmacological class information
for molecules that have been reported in patents, journals and conference proceedings as exhibiting biological activity,
and the version used here contained 102,514 different molecules. A collaboration with the Novartis Institutes for
Biomedical Research in Basel [20] identified several activity classes in the database that were of current interest to the
pharmaceutical industry and six of these classes were used:
• 827 5HT1A agonists (subsequently abbreviated to 5HT, and studied as treatments for depression);
• 636 cyclooxygenase inhibitors (COX, non-steroidal anti-inflammatory action);
• 395 D2 antagonists (D2, blocking dopamine receptors associated with central nervous system diseases);
• 750 HIV protease inhibitors (HIV, HIV and AIDS);
• 435 protein kinase C inhibitors (PKC, prevents the initiation of cancer tumours);
• 1130 renin inhibitors (REN, treatment of high blood pressure).

Journal of Information Science, 2013, pp. 1–8 © The Author(s), DOI: 10.1177/0165551512470042
In each search, one of the active molecules is used as the query (the ‘reference structure’), and the aim is to retrieve the remaining actives in its class. For example, each of the 750 HIV compounds, when used as a reference structure, seeks to retrieve the other 749 inhibitors at the top of the ranking generated using one of our 20 similarity measures. By comparing the numbers of inhibitors that are actually retrieved (using a cut-off of the top 1% of the database, that is, the top-ranked 1025 molecules), the similarity measures can be ranked in order of decreasing retrieval effectiveness. Averaging over searches using the 750 different query molecules, we then obtain an overall ranking of the 20 similarity measures. As noted in the
Introduction, previous comparative studies of similarity measures have used only small numbers of queries, and the
research question posed here is hence as follows: is the same ranking of the 20 measures obtained if the average is calculated not over the full set of 750 HIV molecules but over just 10 of them selected at random, or just 25, or just 50, and so on?
3.2. The similarity measures
Each of the 20 similarity measures studied here combines one of five types of 2D fingerprint with one of four similarity coefficients. The fingerprints were:
• ECFP4 (for Extended Connectivity Fingerprint with diameter of four bonds) fingerprints (2048 bits, available
from http://www.accelrys.com);
• Tripos Unity fingerprints (988 bits, available from http://www.tripos.com);
• Digital Chemistry fingerprints (1052 bits, available from http://www.digitalchemistry.co.uk);
• Daylight fingerprints (2048 bits, available from http://www.daylight.com);
• MDL-key fingerprints (166 bits, available from http://www.accelrys.com).
These fingerprints are described by Gardiner et al. [21]. The four coefficients were as follows:
Tanimoto: c / (a + b − c)
Euclidean: √(a + b − 2c)
Russell–Rao: c / n
Forbes: cn / (ab)
where it is assumed that the fingerprints describing the query structure and a database structure have a and b bits, respectively, set in bit-strings containing a total of n bits (so that, for example, n = 2048 for the ECFP4 fingerprints), and that c
of these bits are common to the two fingerprints that are being compared. Values for the Tanimoto, Russell–Rao and
Forbes coefficients of either 0 or 1 denote fingerprints having no bits (or substructures) in common or fingerprints that
are identical (all substructures the same), respectively; for the Euclidean (which is a distance coefficient rather than a
similarity coefficient), the corresponding extremal values are √(a + b) and 0, respectively.
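As an illustration only (not the authors' code), the four coefficients can be computed from bit-string fingerprints; in this sketch a fingerprint is represented as a Python set of its 'on' bit positions, and the function name is an assumption:

```python
# Illustrative sketch: the four similarity coefficients, computed from binary
# fingerprints represented as sets of "on" bit positions.
import math

def coefficients(query_bits, db_bits, n):
    """Return (Tanimoto, Euclidean, Russell-Rao, Forbes) for two fingerprints
    of length n; a and b are the on-bit counts and c the overlap."""
    a, b = len(query_bits), len(db_bits)
    c = len(query_bits & db_bits)
    tanimoto = c / (a + b - c) if (a + b - c) else 0.0
    euclidean = math.sqrt(a + b - 2 * c)   # a distance, not a similarity
    russell_rao = c / n
    forbes = (c * n) / (a * b) if a and b else 0.0
    return tanimoto, euclidean, russell_rao, forbes

# Identical fingerprints give Tanimoto = 1 and Euclidean distance = 0,
# the extremal values noted in the text.
fp = {1, 5, 9}
print(coefficients(fp, fp, 2048))
```

With disjoint fingerprints the Tanimoto, Russell–Rao and Forbes values fall to 0, and the Euclidean distance reaches √(a + b), matching the extremal values given above.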
This procedure is summarized below, with the innermost loop represented diagrammatically in Figure 1:
FOR i := 1 TO 20 DO
    Select the ith similarity measure
    FOR j := 1 TO q DO
        Select the jth reference structure at random (without replacement)
        Rank the database in decreasing similarity to the jth reference structure using the ith similarity measure
        Note the number of actives retrieved in the top 1% of the ranked database
    END DO
    Calculate the mean number of actives retrieved using the ith similarity measure
END DO
Rank the 20 similarity measures in order of decreasing mean number of actives retrieved
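The loop above can be sketched in Python; here `search(measure, query)` is a hypothetical stand-in for a real similarity search that returns the number of actives retrieved in the top 1% of the ranked database:

```python
# Sketch of the experimental loop, with a hypothetical search function
# standing in for the actual fingerprint-based similarity searches.
import random

def rank_measures(measures, queries, q, search, seed=0):
    """Rank measures by mean number of actives retrieved over q randomly
    chosen queries (sampled without replacement)."""
    rng = random.Random(seed)
    sample = rng.sample(queries, q)
    means = {}
    for m in measures:
        scores = [search(m, query) for query in sample]
        means[m] = sum(scores) / len(scores)
    # Order of decreasing mean number of actives retrieved.
    return sorted(measures, key=lambda m: -means[m])

# Toy example: measure i retrieves i actives regardless of the query,
# so the resulting ranking is simply decreasing i.
measures = list(range(1, 21))
queries = list(range(750))
print(rank_measures(measures, queries, 10, lambda m, q: m))
```

Repeating this for q = 10, 25, 50, … and comparing each sample-based ranking with the full-population ranking reproduces the design of the experiment described in the text.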
The search procedure above was carried out for each type of bioactivity class using values for q of 10, 25, 50, 100, 200, 300, …, and N. The resulting rankings R10, R25, R50, etc., hence provided the basic data for a study of the extent to which
the rankings of similarity measures obtained using a sample of the query molecules mirror the ranking, RN, obtained
when the entire set of query molecules is used. For example, the ranking (highest to lowest mean numbers of actives
retrieved) of the 20 similarity measures obtained using the entire set of 395 D2 actives was as follows: ECFP4-Tan,
ECFP4-For, ECFP4-RR, DY-Tan, DC-Tan, DC-For, DY-For, DC-Euc, MDL-Tan, MDL-For, MDL-Euc, TU-For, TU-
Tan, TU-Euc, ECFP4-Euc, DY-Euc, DC-RR, DY-RR, MDL-RR and TU-RR (where ECFP4, TU, DC, DY and MDL
denote the ECFP4, Tripos Unity, Digital Chemistry, Daylight and MDL fingerprints listed in Section 3.2 above, and
where Tan, Euc, RR and For denote the Tanimoto, Euclidean, Russell–Rao and Forbes coefficients listed in the same
section). This ranking, RN, acts as the population-based gold standard with which each of the sample-based rankings
(R10, R25, R50, etc.) is compared to determine the effect of variations in the number of query molecules.
Figure 1. Chemical similarity searching. A reference structure is compared with each of the molecules in a chemical database, and the similarity is computed using some particular similarity measure. The resulting similarities are then sorted into decreasing order and the activity or inactivity noted (in this hypothetical example, two of the three most similar molecules have their similarities in bold to denote that they are active). The procedure is repeated for each of the reference structures that is to be searched against the database and then for each of the 20 different similarity measures.

The obvious way of comparing two rankings of the same set of objects (such as R25 and RN in the present context) is to use the Spearman rank correlation coefficient [22], which gives values of 1 or −1 for rankings that are identical or
are exact inverses of each other, respectively. Although very widely used, it may be argued that the Spearman coefficient is not an ideal evaluation criterion for this particular application. The purpose of comparing different similarity
measures is to identify the measure or measures that give the best overall level of performance, that is, that come at or
near the top of the rankings, with those measures occurring at the bottom of the rankings being considered of less importance (except, possibly, in a negative sense as measures that should be avoided in general). The Spearman coefficient,
however, considers the entire ranking, giving as much importance to the top of the ranking as it does to the bottom [23,
24] (and a related problem has been identified when the popular receiver operating characteristic curves are used for
evaluating virtual screening experiments [25]). An additional, purely empirical scoring function is hence proposed here
that attaches more importance to agreements at the top of the rankings that are being compared (and that accordingly
acts as a check on the well-established Spearman coefficient). Assume that some particular similarity measure is ranked at position i (1 ≤ i ≤ 20) in the population-based ranking RN, and that it is ranked at position r in a sample-based ranking. Then the function

(21 − i) / (1 + |i − r|)

describes the degree of difference between the two rankings of that particular similarity measure. Summing over all of the 20 measures that are under test, the function

Σ (from i = 1 to 20) (21 − i) / (1 + |i − r|),

which will be referred to as the accuracy score, then describes the overall degree of difference between the two rankings. The maximum and minimum values of this expression are 20 + 19 + 18 + 17 + … + 2 + 1 (i.e. 210, obtained when the two rankings are identical) and (20/20) + (19/18) + (18/16) + (17/14) + (16/12) + … + (2/18) + (1/20) (obtained when one ranking is the exact inverse of the other), respectively.
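A minimal sketch of the two evaluation criteria, assuming each ranking is stored as a dictionary mapping a measure to its position 1..20 (the tie-free Spearman formula is used, since all 20 positions are distinct):

```python
# Sketch: the accuracy score defined above, alongside the Spearman rank
# correlation (tie-free formula), for two rankings of the same 20 measures.
def accuracy_score(population_rank, sample_rank):
    """Sum of (21 - i) / (1 + |i - r|) over all measures, where i and r are
    the population-based and sample-based positions of a measure."""
    return sum((21 - i) / (1 + abs(i - sample_rank[m]))
               for m, i in population_rank.items())

def spearman(population_rank, sample_rank):
    """Spearman rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without ties."""
    n = len(population_rank)
    d2 = sum((i - sample_rank[m]) ** 2 for m, i in population_rank.items())
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

measures = [f"m{k}" for k in range(1, 21)]
pop = {m: k for k, m in enumerate(measures, start=1)}
rev = {m: 21 - k for k, m in enumerate(measures, start=1)}
print(accuracy_score(pop, pop))   # identical rankings give the maximum, 210
print(spearman(pop, rev))         # exact inverse rankings give -1
```

The accuracy score of the inverted ranking reproduces the minimum value listed above, term by term: (20/20) + (19/18) + (18/16) + … + (1/20).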
Figure 2. Comparison with RN of sample-based rankings using different numbers of query molecules: (a) Spearman correlation
coefficient for the D2 molecules; (b) accuracy score for the D2 molecules; (c) Spearman correlation coefficient for the HIV
molecules; (d) accuracy score for the HIV molecules.
Table 1. Smallest observed values for the Spearman correlation coefficient and the accuracy score in comparisons with RN of
sample-based rankings using different numbers of query molecules
Table 1 summarizes the smallest observed levels of agreement between the sample-based and population-based rankings of the 20 similarity measures. The accuracy score plots hence suggest
that notably more searches need to be carried out to approximate the full set of reference structures than when the
Spearman coefficient is used to assess the accuracy of the rankings. That said, the two criteria are in agreement to the
extent that both identify the 5HT, COX and HIV sample-based rankings as differing less from RN than do those for D2,
PKC and REN.
It has been noted previously that the purpose of this work is not to identify the best similarity measure: this is a topic
that has been extensively studied in the past, with the previous studies demonstrating the general effectiveness of the
Tanimoto coefficient and the ECFP4 fingerprint (as is the case, for example, for the D2 RN ranking listed previously). It
is hence hardly surprising that this combination was the best single measure when averaged over the entire sets of experiments carried out here, with the combination of the Forbes coefficient and the ECFP4 fingerprint the next best. However,
there were occasions when the rankings of the 20 measures had these two in reverse order. Specifically, the latter
combination would have been judged the better of the two (and the best of the 20 measures) for the R10 to R300 rankings
for the PKC molecules and for the R10, R25 and R50 rankings for the COX molecules.
The results hence demonstrate clearly that there can be discrepancies in the rankings of similarity measures that are
obtained using different numbers of bioactive reference structures. In particular, rankings obtained using small numbers
can be markedly different from those obtained using all of the available active molecules. Since previous similarity-
measure comparisons have used only small, or very small, sets of queries, one might reasonably question the validity of
the results that have been obtained to date. However, it must be emphasized that such comparisons have typically used
multiple activity classes, combining the ranking obtained from one class with those from each of the other classes to
obtain an overall ranking of the available measures that is expected to give a better guide as to effectiveness than would
the use of just a single activity class. For example, the Sheffield group (and subsequently others) have made extensive
use of 11 (and sometimes up to 30) activity classes from the MDDR database, the Maximum Unbiased Validation dataset of Rohrer and Baumann contains 17 classes from the PubChem database [27], and the Bajorath group has recently
described a set of 50 activity classes from the ChEMBL database [28].
5. Conclusions
The comparison of different types of similarity measure is an important aspect of research into virtual screening.
Previous comparative studies have used only limited numbers of query molecules, but the results presented here suggest
that this can be far from ideal, with some hundreds of query molecules being required to mirror closely the results
obtained using the full sets of query molecules for each of six types of biological activity. This report hence suggests that
future experimental studies could usefully employ larger numbers of queries than has been the case in the past.
Acknowledgements
We thank the Universiti Kebangsaan Malaysia for funding, and Accelrys Inc., Daylight Chemical Information Systems Inc., Digital
Chemistry Limited and Tripos Inc. for data and software support.
References
[1] Cleverdon CW. The Cranfield tests on index language devices. Aslib Proceedings 1967; 19(6): 173–194.
[2] Voorhees EM and Harman DK. TREC: Experiment and evaluation in information retrieval. Cambridge, MA: MIT Press, 2005.
[3] Müller H, Clough P, Deselaers T and Caputo B (eds) ImageCLEF – experimental evaluation of visual information retrieval.
Heidelberg: Springer, 2010.
[4] Sanderson M. Test collection based evaluation of information retrieval systems. Foundations and Trends in Information
Retrieval 2010; 4: 247–375.
[5] Gasteiger J and Engel T (eds) Chemoinformatics: A textbook. Weinheim: Wiley-VCH, 2003.
[6] Willett P. From chemical documentation to chemoinformatics: fifty years of chemical information science. Journal of
Information Science 2008; 34: 477–499.
[7] Willett P. Similarity methods in chemoinformatics. Annual Review of Information Science and Technology 2009; 43: 3–71.
[8] Alvarez J and Shoichet B (eds) Virtual screening in drug discovery. Boca Raton, FL: CRC Press, 2005.
[9] Geppert H, Vogt M and Bajorath J. Current trends in ligand-based virtual screening: molecular representations, data mining methods, new application areas, and performance evaluation. Journal of Chemical Information and Modeling 2010; 50: 205–216.
[10] Ripphausen P, Nisius B, Peltason L and Bajorath J. Quo vadis, virtual screening? A comprehensive survey of prospective
applications. Journal of Medicinal Chemistry 2010; 53: 8461–8467.
[11] Ripphausen P, Nisius B and Bajorath J. State-of-the-art in ligand-based virtual screening. Drug Discovery Today 2011; 16:
372–376.
[12] Willett P. Textual and chemical information retrieval: Different applications but similar algorithms. Information Research
2000; 5(2), http://InformationR.net/ir/5-2/infres52.html
[13] Willett P. Information retrieval and chemoinformatics: Is there a bibliometric link? In: Larsen B, Schneider JW, Åström F and
Schlemmer B (eds) The Janus faced scholar – A Festschrift in honour of Peter Ingwersen. Copenhagen: Royal School of
Library and Information Science, 2010, pp. 107–115.
[14] Bender A, Jenkins JL, Scheiber J, Sukuru SCK, Glick M and Davies JW. How similar are similarity searching methods? A principal components analysis of molecular descriptor space. Journal of Chemical Information and Modeling 2009; 49: 108–119.
[15] Duan J, Dixon SL, Lowrie JF and Sherman W. Analysis and comparison of 2D fingerprints: Insights into database screening
performance using eight fingerprint methods. Journal of Molecular Graphics and Modelling 2010; 29: 157–170.
[16] Sastry M, Lowrie JF, Dixon SL and Sherman W. Large-scale systematic analysis of 2D fingerprint methods and parameters to improve virtual screening enrichments. Journal of Chemical Information and Modeling 2010; 50: 771–784.
[17] Sparck Jones K and Bates RG. Report on a design study for the ‘ideal’ information retrieval test collection. Cambridge:
Computer Laboratory, University of Cambridge, 1977.
[18] Robertson SE. On sample sizes for non-matched-pair IR experiments. Information Processing and Management 1990; 26(6):
739–753.
[19] Voorhees EM. On test collections for adaptive information retrieval. Information Processing and Management 2008; 44: 1879–
1885.
[20] Hert J, Willett P, Wilton DJ, Acklin P, Azzaoui K, Jacoby E et al. Comparison of fingerprint-based methods for virtual
screening using multiple bioactive reference structures. Journal of Chemical Information and Computer Sciences 2004; 44:
1177–1185.
[21] Gardiner EJ, Gillet VJ, Haranczyk M, Hert J, Holliday JD, Malim N et al. Turbo similarity searching: Effect of fingerprint and
dataset on virtual-screening performance. Statistical Analysis and Data Mining 2009; 2: 103–114.
[22] Siegel S and Castellan NJ. Nonparametric statistics for the behavioural sciences, 2nd edn. New York: McGraw-Hill, 1988.
[23] Yilmaz E, Aslam JA and Robertson SE. A new rank correlation coefficient for information retrieval. Proceedings of the
31st annual ACM SIGIR conference on research and development in information retrieval, 2008. New York: ACM, 2008,
pp. 587–594.
[24] Carterette B. On rank correlation and the distance between rankings. Proceedings of the 32nd annual ACM SIGIR conference
on research and development in information retrieval. New York: ACM, 2009, pp. 436–443.
[25] Truchon J-F and Bayly CI. Evaluating virtual screening methods: good and bad metrics for the ‘early recognition’ problem.
Journal of Chemical Information and Modeling 2007; 47: 488–508.
[26] McGaughey GB, Sheridan RP, Bayly CI, Culberson JC, Kreatsoulas C, Lindsley S et al. Comparison of topological, shape, and
docking methods in virtual screening. Journal of Chemical Information and Modeling 2007; 47: 1504–1519.
[27] Rohrer SG and Baumann K. Maximum Unbiased Validation (MUV) data sets for virtual screening based on PubChem bioactivity data. Journal of Chemical Information and Modeling 2009; 49: 169–184.
[28] Heikamp K and Bajorath J. Large-scale similarity search profiling of ChEMBL compound data sets. Journal of Chemical
Information and Modeling 2011; 51: 1831–1839.