

Enrichment of Extremely Noisy High-Throughput Screening Data Using a Naïve Bayes Classifier
MEIR GLICK, ANTHONY E. KLON, PIERRE ACKLIN, and JOHN W. DAVIES

The noise level of a high-throughput screening (HTS) experiment depends on various factors such as the quality and robustness of the assay itself and the quality of the robotic platform. Screening of compound mixtures is noisier than screening single compounds per well. A classification model based on naïve Bayes (NB) may be used to enrich such data. The authors studied the ability of the NB classifier to prioritize noisy primary HTS data of compound mixtures (5 compounds/well) in 4 campaigns in which the percentage of noise presumed to be inactive compounds ranged between 81% and 91%. The top 10% of the compounds suggested by the classifier captured between 26% and 45% of the active compounds. These results are reasonable and useful, considering the poor quality of the training set and the short computing time that is needed to build and deploy the classifier. (Journal of Biomolecular Screening 2004:32-36)

Key words: high-throughput screening, compound mixtures, molecular similarity, extended-connectivity fingerprints, naïve Bayes

Novartis Institute for Biomedical Research, Cambridge, MA. Received Jun 13, 2003, and in revised form Aug 13, 2003. Accepted for publication Sep 5, 2003. Journal of Biomolecular Screening 9(1); 2004. DOI: 10.1177/1087057103260590. Published by Sage Publications in association with The Society for Biomolecular Screening.

INTRODUCTION

Naïve Bayes (NB) is a simple classification method based on Bayes' rule for conditional probability.1 Bayes' rule states that the probability of event A, given that event B has occurred, P(A|B), is

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)},$$

where P(A) and P(B) denote the probabilities of events A and B, respectively. NB may be used in high-throughput screening (HTS) data analysis as a classifier between active and inactive compounds, guided by the frequency of occurrence of various molecular descriptors,2 D, as shown in the following equation:

$$P(\mathrm{Active} \mid D) = \frac{P(D \mid \mathrm{Active})\, P(\mathrm{Active})}{P(D)}.$$

NB classification relies on 2 basic assumptions. First, the descriptors in the training samples are equally important. Second, the method naïvely (hence naïve Bayes) assumes independence of the descriptors from each other.3 Therefore, their combined probability is obtained by multiplying the individual probabilities, so

$$P(\mathrm{Active} \mid D_1, D_2, \ldots, D_n) = \frac{P(D_1 \mid \mathrm{Active})\, P(D_2 \mid \mathrm{Active}) \cdots P(D_n \mid \mathrm{Active})\, P(\mathrm{Active})}{P(D_1, D_2, \ldots, D_n)}.$$
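As an illustration only (this is not the authors' Pipeline Pilot implementation), the scoring rule above can be sketched in a few lines of Python, assuming each compound is reduced to a set of binary descriptors; the add-one smoothing and the log-space accumulation are implementation choices of this sketch, not details taken from the paper.

```python
import math
from collections import Counter

def train_nb(samples):
    """Count descriptor occurrences among active compounds and among all
    compounds. `samples` is an iterable of (descriptor_set, is_active) pairs,
    where the descriptor sets stand in for the D_i of the equations above."""
    active, total = Counter(), Counter()
    n_active = n_total = 0
    for descriptors, is_active in samples:
        n_total += 1
        total.update(descriptors)
        if is_active:
            n_active += 1
            active.update(descriptors)
    return active, total, n_active, n_total

def nb_score(descriptors, active, total, n_active, n_total):
    """Relative estimate of P(Active | D_1, ..., D_n): log P(Active) plus
    sum_i log [P(D_i | Active) / P(D_i)], which follows from the equation
    above when the independence assumption is also applied to the
    denominator P(D_1, ..., D_n). Add-one (Laplace) smoothing avoids zero
    probabilities; the score is only used to rank compounds."""
    score = math.log(n_active / n_total)                 # prior P(Active)
    for d in descriptors:
        p_d_given_active = (active[d] + 1) / (n_active + 2)
        p_d = (total[d] + 1) / (n_total + 2)
        score += math.log(p_d_given_active / p_d)
    return score
```

Compounds are then simply ranked by this score, and the highest-scoring ones are picked first.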

In practice, these assumptions are often violated, and full independence is rarely observed. Studies on HTS data show, however, that NB is very robust to violations of these assumptions.4 NB is also known to be tolerant toward noise.4 Indeed, HTS data can be extremely noisy when screening in mixtures.5 In our case, the compound collection was screened in mixtures of 5 compounds/well. Given that a compound is active, the mixture that contains it is likely to give a signal. The probability that any of the remaining 4 compounds in the same well is also active is remote because the hit rate in an HTS campaign is low (ca. 0.1%). To pinpoint which of the 5 compounds is the active one, each compound is retested as 1 compound/well. In this article, we refer to the remaining 4 compounds as "noise presumed to be inactive compounds." An assessment of NB noise tolerance when the HTS data are extremely noisy is therefore highly desirable. The scientific literature contains empirical evaluations of various ensemble methods,6 but the tolerance of NB toward noise in HTS data is unclear because the performance of NB is determined by how well its underlying assumptions describe the activity of the compounds in a given HTS campaign. Here we studied the ability of NB to enrich compound-mixture data in 4 HTS campaigns.


We also tested the accuracy of an NB classifier in mining other compound collections for additional active compounds, where the classifier was trained on a data set containing an increasing amount of noise presumed to be inactive compounds (up to 99% of the training set).

METHODS

We used the extended-connectivity fingerprints (ECFPs) implementation in Pipeline Pilot2 in all the test cases in this study. ECFPs are a new class of fingerprints for molecular characterization, developed by the SciTegic group,7 that rely on the Morgan algorithm.8 Fingerprints normally represent the molecule as a collection of substructures such as functional groups. In contrast, ECFP features correspond to the presence of an exact structure with limited, specified attachment points. To generate the fingerprints, the program assigns an initial code to each atom. Each atom code is then updated in an iterative manner to reflect the codes of each atom's neighbors. When the desired neighborhood size is reached, the process is complete, and the set of all features is returned as the fingerprint. ECFPs can represent a much larger set of features than many other fingerprints and contain a significant number of different structural units that are crucial for comparing the compounds. To accommodate this amount of information, a hashing scheme into a space of 2^32 feature codes is used. The Bayesian models were generated and deployed using Pipeline Pilot, in which each bin contains the number of occurrences of a given hashed fingerprint bit. The normalized probability was then calculated to provide the final contribution of each feature to the total relative estimate (a minimal sketch of this procedure is given below).

RESULTS

Can the NB classifier prioritize primary data of compound mixtures?

Advances in combinatorial and parallel chemistry, robotics, and miniaturization enable screening of large compound collections at an increasing rate. With the soaring costs of a typical screen, devising time-efficient and cost-effective strategies is highly desirable. By prioritizing the primary data, one can avoid picking false positives and pinpoint false negatives, hence reducing the overall cost of the screen and increasing its yield. We conducted a retrospective study on 4 in-house HTS campaigns. For each drug target, roughly 300,000 compounds were screened in mixtures of 5 compounds/well. The compounds were randomly assigned to these mixtures; issues such as intermolecular chemical reactivity, cell toxicity, similarity, diversity, and redundancy were disregarded. The number of hits in the primary screen varied considerably, between 811 and 2792 (Table 1). It should be noted that a few of the wells had fewer than 5 compounds (but more than 1). The Z factors for these screens ranged between 0.55 in screen III and 0.87 in screen IV. All the primary hits were deconvoluted; the compounds in the active wells were then screened again (cherry-picked for secondary screening) as 1 compound/well. The percentage of noise presumed to be inactive compounds ranged between 81% and 91% among the 4 campaigns (Table 1). We defined all the compounds in the active wells from the primary screen (including the noise presumed to be inactive compounds) as active. These active compounds, together with additional inactive compounds numbering between 124,133 and 189,910, were used to construct a Bayesian model of an active compound for each of the campaigns.
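The following is a minimal sketch of the fingerprint-and-count procedure described in the Methods, using the open-source RDKit Morgan fingerprints as a stand-in for the proprietary SciTegic ECFPs; the radius value and the specific RDKit calls are assumptions of this sketch, not details given by the authors, and `train_nb()` refers to the scoring sketch in the Introduction.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def circular_features(smiles, radius=2):
    """Sparse Morgan fingerprint feature IDs, analogous to the hashed ECFP
    features described above: one integer code per circular substructure,
    drawn from a very large identifier space."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                 # unparsable structure: no features
        return set()
    fp = AllChem.GetMorganFingerprint(mol, radius)   # unfolded, count-based
    return set(fp.GetNonzeroElements())

# The per-feature occurrence counts ("bins") of the Bayesian model are then
# accumulated exactly as in the train_nb() sketch of the Introduction, e.g.:
#   samples = ((circular_features(s), label) for s, label in training_data)
#   active, total, n_active, n_total = train_nb(samples)
```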
The Bayesian model was then employed to prioritize the compounds from the primary screen for each of the 4 campaigns. The results were compared to the secondary screen (1 compound/well) in terms of enrichment and receiver operating characteristic (ROC) curves.9 An enrichment curve attempts to answer whether testing compounds in a particular order is beneficial. To generate such a plot, the Bayesian classifier was used to order the samples; the curve is then a cumulative count of the number of active compounds found when testing the compounds in that order. Enrichment and ROC curves are closely related. ROC plots are a robust method for understanding the utility of a model or score in classifying data samples. A ROC curve describes the trade-off between sensitivity and specificity, where sensitivity is the ability of the model to avoid false negatives and specificity is the ability to avoid false positives. An ideal classifier would have an area under the curve of 1, whereas a random classifier would have a value of 0.5. Roughly, an area from 0.7 to 0.8 is considered reasonable, 0.8 to 0.9 is good, and 0.9 to 1 is excellent. In 3 of the 4 cases, the quality of the model was reasonable: the area under the ROC curve was 0.75 or higher. In terms of enrichment, the top 10% of compounds suggested by the Bayesian model captured between 26% and 45% of the confirmed hits (Table 1).

Can an NB classifier based on extremely noisy data be used to identify additional compounds that were not tested in the campaign?

Screening of external compound collections is routine in the drug discovery process, so prioritizing these collections at an early stage is desirable. The ability of an NB classifier to prioritize primary data of compound mixtures implies success in mining and prioritizing other compound collections, even if the training of the classifier is based on highly noisy primary data. Our strategy was to leave the NB algorithm unmodified and alter the training examples to introduce the factor believed to be important, in this case an increasing amount of noise presumed to be inactive compounds. Two HTS experiments were studied. In the first, around 300,000 compounds were tested for their inhibitory activity. These compounds were purchased from external vendors based on their drug-likeness10 and chemical diversity; other compounds were selected from in-house lead optimization projects. In the primary screening, compounds were tested as 1 compound/well at a concentration of 10 µM. Compounds showing a change of 49% or greater were considered hits and were tested again at 10 µM. The Z factor for this screen was 0.84. The percent change is the difference between the signal of the sample well and the median of the signals obtained from the blank wells on the same plate, divided by the difference between the median signal of the control wells (wells with the target protein but without the test compound) and the median signal of the blank wells on the same plate.
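Written out as an equation, the percent-change definition above corresponds to

$$\%\,\mathrm{change} = 100 \times \frac{S_{\mathrm{sample}} - \operatorname{median}(S_{\mathrm{blank}})}{\operatorname{median}(S_{\mathrm{control}}) - \operatorname{median}(S_{\mathrm{blank}})},$$

where $S_{\mathrm{sample}}$ is the signal of the sample well and the medians are taken over the blank and control wells of the same plate; the factor of 100, which expresses the ratio as a percentage, is implied rather than stated in the text.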


Table 1. Prioritization of Compound Mixtures Using Naïve Bayes

Drug Target | Number of Hits Taken from the Primary Screen (5 Compounds/Well) | Number of Confirmed Hits after Retesting as 1 Compound/Well | Noise Presumed to Be Inactive Compounds (%) | Enrichment Factor in the First 10% | Area under the ROC Curve
I | 1826 | 347 | 81 | 3.8 | 0.87
II | 2792 | 363 | 87 | 4.5 | 0.83
III | 2584 | 258 | 90 | 2.6 | 0.61
IV | 811 | 73 | 91 | 2.7 | 0.75

ROC, receiver operating characteristic.
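The two deployment columns of Table 1 can be computed directly from the classifier scores and the confirmed 1-compound/well outcomes. The following is a minimal sketch under those assumptions; the variable names are illustrative, and scikit-learn's roc_auc_score is used for the ROC area.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def top_fraction_enrichment(scores, labels, fraction=0.10):
    """Enrichment factor in the top `fraction` of compounds ranked by score:
    (hit rate within the selected top fraction) / (overall hit rate)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n_top = max(1, int(round(fraction * len(scores))))
    order = np.argsort(-scores)                  # best-scored compounds first
    top_hit_rate = labels[order[:n_top]].mean()
    overall_hit_rate = labels.mean()             # assumes at least one hit
    return top_hit_rate / overall_hit_rate

# scores: Bayesian model output; labels: confirmed activity at 1 compound/well
# ef10 = top_fraction_enrichment(scores, labels, 0.10)
# auc  = roc_auc_score(labels, scores)           # area under the ROC curve
```

Note that an enrichment factor of 3.8 in the first 10%, for example, corresponds to capturing 38% of the confirmed hits in that fraction.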

From these retests, 242 validated hits were identified, with IC50 values ranging between 0.02 and 23.3 µM. The 242 compounds were divided at random into 2 equal-sized groups of 121. We generated a Bayesian model from the 121 actives of the first group and 12,100 inactive compounds picked at random from the experiment. We then tested the ability of the Bayesian model to pinpoint the other 121 actives, to which an additional 84,416 inactive compounds (not used for training) were added as decoys. We studied the enrichment in terms of percent actives captured in the first 1% (845 compounds). We focused on the first 1% because it is not practical to retest (cherry-pick) tens of thousands of compounds; rather, the aim of the Bayesian model here was to identify a small number of active compounds inside a large compound collection. We then introduced noise in increasing amounts, up to 99% of the training compounds. The noise was created by picking inactive compounds at random from the training set and labeling them as active. New Bayesian models were then created from these increasingly noisy data, and each model was evaluated on the same test set. The enrichment values obtained with Bayesian models based on an increasing amount of noise are shown in Figure 1a. The training set without any noise yielded the best Bayesian model (25% of the actives captured in the first 1% of the compounds). Strikingly, the precision of the Bayesian model was preserved even as increasing levels of noise were introduced (up to 80%). At 70% noise, the model was nearly as good as the one created without any noise (identification of 20% of the actives vs. 25%).

The second HTS experiment was an agonist assay in which the same compound collection was screened. In the primary screening, compounds were tested as 1 compound/well at a concentration of 10 µM. Compounds at 10% change or greater (10% to 90% agonism) were considered hits and were tested again in duplicate at 10 µM. The Z factor for this screen was 0.80. From these, 170 validated hits were identified, with EC50 values ranging between 0.2 and 69.9 µM.

The 170 validated hits were divided at random into 2 equal-sized groups of 85 compounds. We generated a Bayesian model from the 85 actives of the first group and 8500 inactive compounds picked at random from the experiment. We then used the model in an attempt to identify the second group of 85 actives, to which an additional 100,000 inactive compounds (not part of the training set) were added as decoys. Again, we studied the enrichment in the first 1% (1002 compounds) when using Bayesian models based on an increasing amount of noise. The results are depicted in Figure 1b. The noise-free training set yielded the Bayesian model with the highest precision (25% enrichment). Unlike the previous case (the antagonist assay), the accuracy of the Bayesian model decayed with the increasing amount of noise, and at 40% noise the model was ineffective.

FIG. 1. The enrichment (percent actives captured in the first 1%) when using Bayesian models based on an increasing amount of noise, for 2 assays. (a) Antagonist assay. (b) Agonist assay.
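Below is a sketch of the noise-injection protocol described above, under one reading of it: a chosen fraction of the nominally "active" training labels is made up of randomly relabeled inactive compounds. The helper and its names are illustrative, and the model-building and evaluation steps refer to the earlier sketches.

```python
import random

def add_label_noise(training_set, noise_fraction, seed=0):
    """Return a copy of `training_set` (a list of (features, is_active) pairs)
    in which randomly chosen inactive compounds are relabeled as active, so
    that the mislabeled compounds make up `noise_fraction` of the nominally
    active training examples."""
    rng = random.Random(seed)
    actives = [(f, True) for f, y in training_set if y]
    inactives = [(f, y) for f, y in training_set if not y]
    # x / (n_true_actives + x) = noise_fraction  =>  x = f/(1-f) * n_true_actives
    n_noise = int(round(noise_fraction / (1.0 - noise_fraction) * len(actives)))
    n_noise = min(n_noise, len(inactives))
    rng.shuffle(inactives)
    relabeled = [(f, True) for f, _ in inactives[:n_noise]]
    return actives + relabeled + inactives[n_noise:]

# for pct in (0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 99):
#     noisy = add_label_noise(train, pct / 100.0)
#     active, total, n_active, n_total = train_nb(noisy)   # earlier sketch
#     # ...score the held-out test set and record the top-1% capture
```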


To gain insight into the difference between the 2 experiments, we derived a similarity matrix between the training and test sets for each of the 2 HTS campaigns. The feature set was the same as that used for the classifier (ECFPs). Figure 2a shows the number of active compounds at increasing similarity levels, divided by the number of cells in the similarity matrix (i.e., the percentage of similar compounds). The data are displayed in this normalized manner because the number of compounds in the training and test sets differed between the 2 HTS campaigns (85 vs. 121). Figure 2b depicts the similarity between the active compounds used for training and the inactive compounds used as decoys, again divided by the number of cells in the similarity matrix. Figure 2c shows the similarity of the training-set compounds with themselves; the compounds on the diagonal (self-comparisons) were not included in the matrix. In the antagonist assay, the compounds used for training are more similar to the test set; however, they are also more similar to the decoy set. In both cases, there is some structural redundancy within the training sets, indicated by the percentage of compounds with a high value of the Tanimoto coefficient. The differences in the intercompound similarity distributions in Figure 2a indicate that the compounds in the training and test sets had a higher degree of similarity in the antagonist assay, which may account for the differences in sensitivity to noise.
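One way such binned similarity distributions might be derived is sketched below, again using RDKit Morgan bit vectors in place of ECFPs; the fingerprint size, radius, and bin edges are assumptions of the sketch, with the 0.05-wide Tanimoto bins mirroring those shown in Figure 2.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_bitvect(smiles, radius=2, n_bits=2048):
    """Folded Morgan bit vector for one (assumed valid) SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)

def tanimoto_histogram(smiles_a, smiles_b, bins=np.arange(0.60, 1.0001, 0.05)):
    """Fraction of (a, b) pairs falling into each Tanimoto similarity bin,
    i.e., the count per bin divided by the number of cells in the matrix."""
    fps_a = [morgan_bitvect(s) for s in smiles_a]
    fps_b = [morgan_bitvect(s) for s in smiles_b]
    sims = [DataStructs.TanimotoSimilarity(fa, fb)
            for fa in fps_a for fb in fps_b]
    counts, _ = np.histogram(sims, bins=bins)
    return counts / float(len(sims))
```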
FIG. 2. The percentage of compounds for 2 high-throughput screening campaigns at increasing similarity levels (Tanimoto coefficient bins from 0.60 to 1.00). (a) The active compounds in the training and test sets. (b) The active and inactive compounds in the training set. (c) The active compounds in the training set and themselves.
DISCUSSION

One way of reducing the high costs and increasing the throughput of an HTS campaign is to use mixtures in which 2 or more compounds are present in the same well. In the case of 5 compounds/well, the primary screening is 5 times faster, compound management and plate supply are more efficient, and fewer reagents are needed. When screening in mixtures of 5 compounds/well, around 80% of the compounds identified in the primary screening are noise that is presumed to be inactive. In practice, the amount of noise is higher than 80%, and in 2 of our 4 experiments it was around 90%, although the final percentage of DMSO in each of the 4 screens was the same. This may be due to various experimental artifacts such as cell toxicity or chemical reactions that might occur between compounds in the same mixture. Indeed, mixture design11 may be one way to improve the quality of the experiment. The significant amount of noise poses a serious problem when one needs to build a mathematical model based on the primary data. We have shown that an NB classifier can be used to prioritize such data; the results were still reasonable even when the amount of noise was around 90%. The area under the ROC curve and the enrichment curves show that the model tends to be more accurate when the amount of noise is lower: the predictions for 81% and 87% noise are significantly better than those for 90% and 91% noise.

An NB classifier is a more attractive avenue than deriving an interwell similarity matrix and prioritizing the compounds in order of decreasing Tanimoto coefficient.5


The number of compounds that can be included in the similarity matrix is a prohibitive factor in terms of computing time.


Using NB is a more efficient approach because the computing time grows linearly, which enables the prioritization of large compound collections (1 million compounds and more) within minutes on a desktop PC. ECFPs represent a much larger and sparser set of features than is common for other fingerprints: the virtual size of the fingerprint is 4 billion (4,000,000,000) different features, and in practice only a small subset of those features is present in a given molecule. The use of ECFPs resulted in a large number of bins in the NB model (on the order of thousands). With a diverse compound collection such as the one in this study, the noise is represented by descriptors with a low frequency, and the signal-to-noise ratio is therefore amplified. This may explain the tolerance of NB demonstrated in this work. We believe that the similarities within the training sets (Fig. 2c) were enough to yield significant frequencies of descriptors that encoded for true active compounds.

A similarity analysis cannot always foresee the NB classifier's degree of tolerance. The precision of the Bayesian model was preserved in the first case (Fig. 1a), unlike the second (Fig. 1b), in which it decayed rapidly. Only retrospectively, once all the actives were known, did the higher degree of similarity between the training and test sets for the antagonist assay (Fig. 2a) rationalize the results. Identification of additional compounds from external compound collections may be more difficult than prioritization of primary screening data: in the latter case, the model is constructed and deployed on the same compounds, whereas in the former, the model is used to prioritize new compounds. The success of the classifier therefore depends not only on the amount of noise in the training set but also on the degree of similarity between the training set and the new compound collection, as indicated by the results in Figure 1b.

A compound is considered active in the primary screen when its activity lies at or beyond a set threshold. Assigning this threshold, however, is not straightforward.12 The primary screen measures efficacy, usually in terms of percent inhibition, and not potency (IC50 or EC50 values). One would like, however, to capture the maximal number of potent compounds based on the primary data. One way of achieving this goal is to cherry-pick additional compounds based on a Bayesian model. Such a model needs to balance the amount of noise it can handle against the precision it must retain. Regrettably, a priori assessment of NB noise tolerance is a challenging task because such a problem in machine learning is inherently empirical. Nevertheless, we showed that NB can be used as a classifier in most of the test cases in this work, even when significant noise was introduced. It can also be argued that the best approach to enriching noisy HTS data is not post facto application of statistical methods but up-front improvement of screen quality and switching to a single compound per well.
REFERENCES

1. Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference and Prediction. Berlin: Springer, 2001.
2. Pipeline Pilot, SciTegic, Inc., 9665 Chesapeake Drive, Suite 401, San Diego, CA 92123.
3. Insightful Corporation: Insightful Miner 2 User's Guide. Seattle, WA: Insightful Corporation, 2001.
4. Labute P: Binary QSAR: a new method for the determination of quantitative structure activity relationships. Pac Symp Biocomput 1999;444-455.
5. Glick M, Klon AE, Acklin P, Davies JW: Prioritization of high throughput screening data of compound mixtures using molecular similarity. Mol Phys 2003;101:1325-1328.
6. Opitz D, Maclin R: Popular ensemble methods: a survey. J Artif Intell Res 1999;11:169-198.
7. Rogers D, Hahn M: High-throughput data analysis: extended-connectivity fingerprints: a high-dimensional descriptor for molecular data analysis. In preparation.
8. Morgan HL: The generation of a unique machine description for chemical structures: a technique developed at Chemical Abstracts Service. J Chem Doc 1965;5:107-112.
9. Witten IH, Frank E: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. New York: Morgan Kaufmann, 1999.
10. Lipinski CA, Lombardo F, Dominy BW, Feeney PJ: Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev 2001;46:3-26.
11. Hann M, Hudson B, Lewell X, Lifely R, Miller L, Ramsden N: Strategic pooling of compounds for high-throughput screening. J Chem Inf Comput Sci 1999;39:897-902.
12. Zhang JH, Chung TD, Oldenburg KR: Confirmation of primary active substances from high throughput screening of chemical and biological populations: a statistical approach and practical considerations. J Comb Chem 2000;2:258-265.

Address reprint requests to: Meir Glick, Novartis Institute for Biomedical Research, 100 Technology Square, Cambridge, MA 02139. E-mail: meir.glick@pharma.novartis.com

