Equbits product suite helps High-Throughput Screening (HTS) scientists develop accurate predictive models to support virtual screening efforts

[Image: http://w3markets.smugmug.com/photos/6280807-M.jpg]

Overview

Based on a common and open platform, Equbits Insight™ sets new standards in predictive modeling accuracy for HTS prediction. By exploiting novel technologies, automating the model-tuning process, and packaging results in an intuitive, interpretable manner, Equbits Insight™ provides the most accurate solution in an easy-to-use software application geared towards scientists. The Equbits technology effectively handles high-dimensional data with sparse representations, even when only a limited number of examples is available for training.

 

History of Existing Technologies

Artificial intelligence techniques have been applied to QSAR analysis since the late 1980s, mainly in response to the need for more accurate solutions. Intelligent classification techniques, including neural networks, genetic algorithms and decision trees, have come to the forefront. Machine learning techniques have, in general, offered greater accuracy than their statistical forebears, but not without accompanying problems for the QSAR analyst to consider.

Expert Systems

Expert systems leverage the knowledge of subject-matter experts to provide a methodology for modeling systems. Their main limitation is that they are optimized for a specific problem: though highly accurate for discrete, well-bounded problems, expert systems lack extensibility and cannot be generalized to solve a variety of predictive modeling applications.

Neural Networks

Neural networks, developed from a framework that mimics a biological nervous system, are a machine learning methodology that involves tuning discrete “nodes” to generate a model. Neural networks suffer from over-fitting, producing models that are accurate on training data but significantly less accurate on actual prediction data. Other problems with neural networks concern the reproducibility of results, owing largely to set-up and stopping criteria, and the lack of information about the classification produced. Scalability issues also arise when medium-to-large, and especially high-dimensional, datasets are modeled.

Genetic Algorithms

Genetic algorithms suffer from their stochastic nature: results may be hard to reproduce, and the resulting classification may not be optimal. By their inherent nature, genetic algorithms can converge to local minima, resulting in sub-optimal models and under-fitting.

Decision Trees

Decision trees offer a large amount of information regarding their decisions, in the form of predictive rules, but struggle to provide the accuracy supplied by more powerful techniques. Decision trees have trouble with small data sets and data sets where one class tends to be very small, such as HTS data sets. Even though the models are interpretable, it is debatable whether they provide insight into the structure activity relationship. As users move to high dimensional descriptors and larger training sets, decision trees have trouble effectively scaling. 

Naïve Bayes

The application of existing machine learning algorithms such as Naïve Bayes and other Bayesian inference models has recently taken hold in drug discovery. Naïve Bayes is a classification technique that assigns each compound a prediction score for every class. As a result, Naïve Bayes requires a significant number of training examples to establish separation of the class boundaries. With limited training data, a Naïve Bayes model can assign a molecule to two mutually exclusive categories with the same likelihood score.

Naïve Bayes also assumes that attributes are independent of one another, and hence uncorrelated, when generating classification models. This assumption does not always hold for QSAR data.
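
To make the independence assumption concrete, the following minimal sketch (pure Python; all names are illustrative, not Equbits or any vendor code) implements a Bernoulli Naïve Bayes classifier over binary descriptors. The per-feature log-likelihoods are simply summed, which is only justified when the descriptors really are independent:

```python
import math

def train_bernoulli_nb(X, y, alpha=1.0):
    """Estimate per-class feature probabilities with Laplace smoothing.
    X: list of binary feature vectors; y: list of class labels."""
    n_features = len(X[0])
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior = math.log(len(rows) / len(X))
        # P(feature_j = 1 | class c), smoothed to avoid zero probabilities
        p = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
             for j in range(n_features)]
        model[c] = (log_prior, p)
    return model

def predict(model, x):
    """Pick the class with the highest score. The sum over features is
    exactly where the independence assumption enters the model."""
    best_class, best_score = None, float("-inf")
    for c, (log_prior, p) in model.items():
        score = log_prior + sum(
            math.log(p[j]) if x[j] else math.log(1.0 - p[j])
            for j in range(len(x)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

With correlated descriptors, the summed likelihoods double-count shared evidence, which is the weakness noted above for QSAR data.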

 

The Challenge

Over the past decade, high-throughput screening (HTS) has become a cornerstone technology of pharmaceutical research. A current estimate is that biological screening and preclinical pharmacological testing alone account for ~14% of the total research and development (R&D) expenditures of the pharmaceutical industry. Moreover, it is anticipated that screening one million compounds per target will become a gold standard for the major pharmaceutical companies. The current trend is to re-rationalize drug discovery research by carrying out fewer, smarter experiments.

Various computational approaches have been designed to complement the array of high-throughput discovery technologies by searching large compound databases in silico and selecting a limited number of molecules for testing to identify novel chemical entities that have desired biological activity.

A number of technologies have been evaluated with respect to the virtual screening problem in order to maximize the quality of prediction accuracy. The better the prediction accuracy, the better choices a scientist can make in selecting the appropriate compounds. Currently, most predictive modeling methods are extremely inaccurate even though many novel techniques have been applied to the predictive modeling problem. Difficulties in modeling high dimensional data, significant levels of experimental noise resulting in false positives/false negatives, and limited number of training examples represent challenges in accurately modeling structure activity relationships (SAR).

 

Equbits Core Technology — Support Vector Machines

SVM Overview

The general problem of machine learning is to search a usually very large space of potential hypotheses for the one that best fits the data and any prior knowledge. The task is to learn a hypothesis that correctly predicts the labels of previously unseen data. Normally, the data is split into a training set and a test set: the hypothesis is learned using the training data, and an unbiased estimate of the generalization error is then given by the error on the test set. To reduce the variance in this estimate, the procedure can be repeated several times and the results averaged over the different partitions (cross-validation). Classifiers typically learn by empirical risk minimization (ERM); that is, they search for the hypothesis with the lowest error on the training set. Unfortunately, this approach is doomed to failure without some form of capacity control.
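
The cross-validation procedure described above can be sketched as follows (illustrative Python; `train_and_score` is a hypothetical user-supplied hook that fits a model on the training partition and returns its test-set error, not part of any product):

```python
import random

def k_fold_cv(data, k, train_and_score):
    """Estimate generalization error by averaging the test error over
    k train/test partitions of the data (k-fold cross-validation)."""
    shuffled = data[:]
    random.shuffle(shuffled)
    # Partition into k folds of (nearly) equal size by striding.
    folds = [shuffled[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        errors.append(train_and_score(train, test))
    return sum(errors) / k
```

Averaging over the k partitions reduces the variance of the single train/test estimate, as noted above.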

The Equbits Insight™ implementation of Support Vector Machines is based on the structural risk minimization (SRM) principle from computational learning theory. SVMs construct a hyper-plane that separates two classes (the approach can be extended to multi-class problems). Separating the classes with a large margin minimizes the bound on the expected generalization error; by searching for large-margin hyper-planes, the SVM limits the complexity of the hypothesis space.

DIAGRAM 1

[Image: http://w3markets.smugmug.com/photos/6280804-M.jpg]

In the case of non-separable classes, SVM minimizes the number of misclassifications whilst maximizing the margin with respect to the correctly classified examples. The hyper-plane output by the SVM is given as an expansion on a small number of training points known as support vectors.
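
As a concrete, simplified illustration of the soft-margin idea (not the Equbits implementation, which is proprietary), a linear SVM can be trained by sub-gradient descent on the regularized hinge loss. Only examples whose margin falls below 1 pull on the solution, mirroring the role of the support vectors described above:

```python
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimize 0.5*||w||^2 + C * sum of hinge losses by sub-gradient
    descent. X: list of feature vectors; y: labels in {-1, +1}."""
    n = len(X[0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        gw = list(w)           # gradient of the 0.5*||w||^2 regularizer
        gb = 0.0
        for x, label in zip(X, y):
            margin = label * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:     # inside the margin or misclassified
                for j in range(n):
                    gw[j] -= C * label * x[j]
                gb -= C * label
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b

def classify(w, b, x):
    """Side of the hyper-plane determines the predicted class."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
```

The regularizer term keeps the margin wide while the hinge term penalizes misclassifications, which is the trade-off the paragraph above describes for the non-separable case.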

 

Unlike other algorithms, SVMs make no assumptions about the relationships between features in a feature space. This allows an SVM to identify the features most relevant to a model and the model’s feature dependencies. As a result, SVMs lend themselves well to accurate non-linear modeling.

 

SVMs are very powerful learners. In the separable case SVMs are fully automatic in that they need no fine-tuning. Moreover, SVMs are relatively insensitive to variation in the parameters and are not prone to over-fitting even when using high degree polynomial kernels.

 

Equbits Insight™ – Providing SVM Interpretability

Most modeling software packages tout their ability to provide interpretable models, but even an interpretable model may not provide much insight into the structure activity relationship. In Diagram 1, the points closest to the hyper-plane intuitively correspond to the points that are hardest to classify; these vectors provide the most information about the impact of each feature on the model. For each feature relevant to model building, the support vector machine algorithm calculates a weighting and identifies feature relationships. Unlike other algorithms, SVMs consider all features simultaneously when determining the impact of individual features and the relationships among them, providing a more consistent method for model interpretation. With SVM, model interpretation is stable even when features or training data are rearranged, giving the user more confidence in the predictions and interpretations the model makes.

 

Equbits Insight™ allows you to rank features by their impact on the model. In the diagram below, you get information not only about feature ranking but also about which features are closely correlated.
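
Assuming a linear model, the ranking and correlation ideas can be sketched as follows (illustrative Python; Equbits’ actual interpretation method is not published, so this is only a generic stand-in):

```python
def rank_features(weights, names):
    """Order features by the absolute magnitude of their model weights,
    largest impact first. The sign indicates the direction of influence."""
    return sorted(zip(names, weights), key=lambda nw: abs(nw[1]), reverse=True)

def correlation(col_a, col_b):
    """Pearson correlation between two descriptor columns; values near
    +1 or -1 flag closely related features."""
    n = len(col_a)
    ma = sum(col_a) / n
    mb = sum(col_b) / n
    cov = sum((a - ma) * (b - mb) for a, b in zip(col_a, col_b))
    va = sum((a - ma) ** 2 for a in col_a)
    vb = sum((b - mb) ** 2 for b in col_b)
    return cov / (va * vb) ** 0.5 if va and vb else 0.0
```

Ranking by weight magnitude only makes sense when the features share a scale, which is one reason descriptor normalization matters in practice.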

[Image: http://w3markets.smugmug.com/photos/6280803-M.jpg]

Advantages of Applying SVM to QSAR Data

When learning QSARs, the algorithm must deal with high-dimensional feature spaces. SVMs use VC theory to avoid over-fitting and hence can handle a large number of features, even in the low-throughput case where there are few training examples. In practice, SVMs routinely learn in very high-dimensional spaces without over-fitting.

Using an SVM, high-degree polynomial classifiers can be generated with confidence that they will not over-fit, and the degree of the polynomial can be chosen in an automated manner. Moreover, SVMs do not make any assumptions about correlations between features, as opposed to techniques that assume statistical independence. Because SVMs are robust learners, they are not as badly affected by noise as other classification algorithms. This is particularly important in high-throughput screening, where there may be a significant amount of noise in the data labels.

 

Equbits Insight™ Product Suite- A Tool for Predictive Modeling

A variety of algorithms have been applied to the SAR problem with limited success. When learning QSARs, Equbits Insight™ effectively deals with high-dimensional feature spaces. Equbits Insight™’s patent-pending learning method avoids over-fitting and hence can handle a large number of features, even in low-throughput cases where there are few training examples.

QSARs are typically highly non-linear. Using Equbits Insight™, users can automatically generate a high-degree polynomial classifier and be confident that it will not over-fit. The Equbits Insight™ tuning methodology does not make any assumptions about correlations between features, as opposed to techniques that assume statistical independence, and thus generates robust learners that are less affected by noise than other classification algorithms.

Equbits Insight™ – Open Standards Make for Easy Integration

Equbits Insight™ provides a platform for predictive modeling that accepts any data type and format when building models. With Equbits Insight™, a user can build models based on any descriptor type to predict the properties of interest.

 

[Image: http://w3markets.smugmug.com/photos/6280802-M.jpg]

 

Equbits Insight™ leverages your existing informatics infrastructure, reducing the need to change processes, data formats, or data modeling frameworks to generate Equbits Insight™ models.

Equbits Insight™ Advantage – Automated Model Generation

With Equbits Insight™, users can automatically generate optimized predictive models without being machine learning experts. Equbits Insight™’s patent-pending automated model optimization technique allows users to import data into the system and generate models without any need for fine-tuning.

 

[Image: http://w3markets.smugmug.com/photos/6280806-M.jpg]

Equbits Insight™ scans the search space for the optimized model in an efficient, scalable manner and delivers results with no manual intervention. Expert users can bypass this mode to analyze the model parameters further.
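
A generic parameter scan of this kind can be pictured as a simple exhaustive grid search (illustrative Python; Equbits’ patent-pending optimization technique is not published, so this is only a stand-in, and `evaluate` is a hypothetical scoring hook such as cross-validated accuracy):

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Scan every combination in a hyper-parameter grid and return the
    setting with the highest score, with no manual intervention."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Real automated tuners prune or adaptively sample this space rather than enumerating it, which is where efficiency and scalability come in.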

 

Case Study

NCI HIV HTS Dataset

A National Cancer Institute (NCI) library of 32,344 compounds was tested in a whole-cell assay against HIV-infected cells. The dataset involves at least three known modes of interaction, making it a challenging dataset to model. The results of the assay are reported as “CI” (confirmed inactive), “CM” (confirmed moderately active) and “CA” (confirmed active). The library contained 450 moderately active compounds and 230 active compounds; the low number of active and moderately active compounds makes predictive modeling difficult. Each compound was represented by 2D pharmacophore fingerprints folded into 2048 binary descriptors (attributes). The fingerprints are bit strings in which each bit represents the presence or absence of a particular 2-node pattern. The descriptors are sparse bit strings, as most patterns do not occur in most molecules.
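
Sparse bit-string fingerprints of this kind are often stored as sets of on-bit indices rather than dense 2048-element vectors (an illustrative encoding in Python, not the dataset’s actual file format), together with the commonly used Tanimoto similarity for comparing molecules:

```python
def to_sparse(bits):
    """Convert a dense binary fingerprint to the set of indices whose
    bit is set; efficient when most bits are zero."""
    return {i for i, b in enumerate(bits) if b}

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two sparse fingerprints: shared
    on-bits over total distinct on-bits."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)
```

Because most of the 2048 patterns are absent in most molecules, the sparse form saves both memory and comparison time.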

 

The CM compounds were dropped to cast the dataset as a two-class classification problem. The library was split equally into a training set and a test set. The predictive modeling methods were trained on 15,778 compounds, of which 115 (0.718%) were active. Each method generated a model that was then used on a validation set to make predictions; the predictions were compared against the ground truth and the results recorded. This validation set consisted of 16,114 molecules, of which 117 (0.710%) were active.

 

Two different predictive modeling methods were compared: the Equbits Insight™ product, which is based on Support Vector Machines (SVMs), and a Naïve Bayes classifier. Support Vector Machines separate the data points in a high-dimensional feature space in a way that preserves correlations between features while minimizing error. The Naïve Bayes classifier instead calculates, for each feature, the probability that it indicates membership in a class, and then computes for each compound the probability of being active and of being inactive; whichever probability is higher determines the compound’s class.

 

The predicted activity of each molecule was calculated for the training set and for the test set and then compared against the ground truth. For each set, the number of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) was determined. From these counts, the following measures were calculated:

  • Precision – The percentage of molecules predicted to be active that are truly active
  • Recall – Often referred to as Sensitivity or True Positive Rate. A measure of the model’s ability to find all of the active molecules (100% minus the false negative rate)
  • Specificity – Often referred to as True Negative Rate. The probability of predicting a negative given that its true state is negative
  • Enrichment – The ratio between the percentage of actives among the model’s predicted hits and the percentage of actives found through random selection
  • Area Under the Curve – The area under the Receiver Operating Characteristic (ROC) curve, which plots Sensitivity (true positive rate) against 1 − Specificity (false positive rate)
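
Assuming the standard definitions and treating “active” as the positive class, the listed measures follow directly from the four counts (a minimal Python sketch):

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard measures derived from confusion-matrix counts, with
    'active' as the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity / TPR
    specificity = tn / (tn + fp) if tn + fp else 0.0   # true negative rate
    # Enrichment: fraction of actives among predicted positives relative
    # to the fraction of actives in the whole set (random selection).
    total = tp + fp + tn + fn
    base_rate = (tp + fn) / total if total else 0.0
    enrichment = precision / base_rate if base_rate else 0.0
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "enrichment": enrichment}
```

On a set with a 0.7% active base rate, even modest precision at high recall translates into a large enrichment factor, which is why enrichment is the metric of interest for virtual screening.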

[Image: http://w3markets.smugmug.com/photos/6280805-M.jpg]

[Image: http://w3markets.smugmug.com/photos/6280801-M.jpg]

 

On the NCI HIV data set, Equbits Insight™ using SVM significantly outperformed the Naïve Bayes classifier on all measured metrics. SVM demonstrates an ability to provide high precision at high recall rates, even on challenging data sets where multiple modes of interaction are known. An additional level of performance and accuracy can be expected when SVM is applied to HTS in silico screening results where the data is known to represent a single mode of interaction.

 

Conclusions

Equbits Insight™ provides a framework for developing QSAR predictive models. Equbits Insight™ provides open standards that integrate with your existing technical and data modeling framework. Equbits Insight™ utilizes a patent-pending automated model-building implementation of support vector machines that provides the most accurate predictive models in an easy-to-use software application.

 

About Equbits LLC

Equbits LLC provides software that helps scientists at pharmaceutical companies accelerate lead optimization. Equbits applies advanced machine learning techniques, including Support Vector Machines, to QSAR predictive modeling in an easy-to-use, intuitive software application geared towards HTS and ADME chemists.

 

Founded by experts in the field of machine learning, chemistry and software development, Equbits makes drug and chemical discovery possible with first-in-class computational applications in predictive modeling and chemistry.

 

Equbits has a world-class scientific advisory board consisting of:

 

Professor Trevor John Hastie has been a faculty member of the Stanford University Statistics and Biostatistics Departments since 1990. He has authored many publications and three books: Generalized Additive Models (1990), Statistical Models in S (1991) and The Elements of Statistical Learning: Prediction, Inference and Data Mining (2001).

 

Dr. Jennifer Miller led the discovery chemistry group as Senior Director at Signature BioScience and is a member of the ACS Computers in Chemistry Executive Committee.

 

Dr. Isabelle Guyon is a research leader in the machine learning community and co-invented Support Vector Machines with Vapnik at AT&T.

 

Dr. Kristin Bennett is an Associate Professor at RPI in the Department of Mathematical Sciences where she has conducted extensive ADME predictive modeling research.

 

For More Information

Equbits LLC

Address: PO Box 51977, Palo Alto, CA 94303

Website: www.equbits.com

Email: sales@equbits.com

Phone: 1-888-318-3377