Molecular Medicine Israel

Contrastive learning in protein language space predicts interactions between drugs and protein targets

Significance

In time and money, one of the most expensive steps of the drug discovery pipeline is the experimental screening of small molecules to determine binding to a protein target of interest. Therefore, accurate high-throughput computational prediction of drug-target interactions would unlock significant value, guiding and prioritizing promising candidates for experimental screening. We introduce ConPLex, a machine learning method for predicting drug-target binding which achieves state-of-the-art accuracy on many types of targets by using a pretrained protein language model. The approach co-locates the proteins and potential drug molecules in a shared feature space while learning to contrast true drugs from similar nonbinding “decoy” molecules. ConPLex is extremely fast, which allows it to rapidly shortlist candidates for deeper investigation.

Abstract

Sequence-based prediction of drug–target interactions has the potential to accelerate drug discovery by complementing experimental screens. Such computational prediction needs to be generalizable and scalable while remaining sensitive to subtle variations in the inputs. However, current computational techniques fail to simultaneously meet these goals, often sacrificing performance of one to achieve the others. We develop a deep learning model, ConPLex, successfully leveraging the advances in pretrained protein language models (“PLex”) and employing a protein-anchored contrastive coembedding (“Con”) to outperform state-of-the-art approaches. ConPLex achieves high accuracy, broad adaptivity to unseen data, and specificity against decoy compounds. It makes predictions of binding based on the distance between learned representations, enabling predictions at the scale of massive compound libraries and the human proteome. Experimental testing of 19 kinase-drug interaction predictions validated 12 interactions, including four with subnanomolar affinity, plus a strongly binding EPHB1 inhibitor (KD = 1.3 nM). Furthermore, ConPLex embeddings are interpretable, which enables us to visualize the drug–target embedding space and use embeddings to characterize the function of human cell-surface proteins. We anticipate that ConPLex will facilitate efficient drug discovery by making highly sensitive in silico drug screening feasible at the genome scale. ConPLex is available open source at https://ConPLex.csail.mit.edu.

In the drug discovery pipeline, a key rate-limiting step is the experimental screening of potential drug molecules against a protein target of interest. Thus, fast and accurate computational prediction of drug–target interactions (DTIs) could be extremely valuable, accelerating the drug discovery process. One important class of computational DTI methods, molecular docking, uses 3D structural representations of both the drug and target. While the recent availability of high-throughput accurate 3D protein structure prediction models (1–3) means that these methods can be employed starting only from a protein’s amino acid sequence, the computational expense of docking (4) and other structure-based approaches [e.g., rational design (5), active site modeling (6), template modeling (7, 8)] unfortunately remains prohibitive for large-scale DTI screening. An alternative class of DTI prediction methods uses 3D structure only implicitly, making rapid DTI predictions when the inputs consist only of a molecular description of the drug [such as the SMILES string (9)] and the amino acid sequence of the protein target. This class of sequence-based DTI approaches enables scalable DTI prediction, but there have been barriers to matching the levels of accuracy obtained by structure-based approaches.

In this paper, we introduce ConPLex, a rapid purely sequence-based DTI prediction method that leverages rich featurizations from pretrained protein language models (PLMs) and show that it can produce state-of-the-art performance on the DTI prediction task at scale. The advance provided by ConPLex comes from two main ideas that together overcome some of the limitations of previous approaches: informative PLM-based representations and contrastive learning. While many methods have been proposed for the sequence-based setting of the DTI problem (10) [e.g., using secure multiparty computation (11), convolutional neural networks (12), or transformers (13)], their protein and drug representations are constructed solely from DTI ground truth data. The high level of diversity among the DTI inputs, combined with the limited availability of DTI training data, limits the accuracy of these methods and their generalizability beyond their training domain. Furthermore, the methods that do generalize often do so by sacrificing fine-grained specificity, i.e., are unable to distinguish true-positive binding compounds from false positives with similar physicochemical properties (“decoys”).

In contrast, the “PLex” (Pretrained Lexicographic) part of ConPLex helps alleviate the problem of limited DTI training data. As we showed in our preliminary work (14), one way to get around the limited size of DTI datasets that has hampered the quality of the representations learned by previous methods is to transfer learned protein representations from pretrained PLMs to the DTI prediction task. PLMs learn the distributional characteristics of amino acid sequences over millions of proteins in an unsupervised fashion, generating sequence-based representations that encode deep structural insights. A design paradigm in machine learning is that an informative featurization of the input can enhance the power of even simple models. For DTI, where task-specific data are limited, using PLM-generated representations as the input features allows us to borrow strength from the much larger corpus of single protein sequences (14). Starting with the PLMs, our second insight directly addresses the fine-grained specificity problem in our architecture by using the “Con” (Contrastive learning) part: a protein-anchored contrastive coembedding that colocates the proteins and the drugs into a shared latent space. We show that this coembedding enforces separation between true interacting partners and decoys to achieve both broad generalization and high specificity (Fig. 2).
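
As a rough illustration of what a protein-anchored contrastive objective can look like, the sketch below uses a triplet margin loss over cosine distance: the protein is the anchor, a true binder is the positive, and a decoy is the negative. The choice of cosine distance, the margin value, the function names, and the two-dimensional toy vectors are all assumptions made for illustration; the text above does not specify the exact loss used by ConPLex.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor_protein, true_drug, decoy, margin=0.5):
    """Hypothetical protein-anchored triplet margin loss.

    The loss is zero once the true drug is closer to the protein than
    the decoy by at least `margin` (an assumed hyperparameter).
    """
    d_pos = cosine_distance(anchor_protein, true_drug)
    d_neg = cosine_distance(anchor_protein, decoy)
    return max(0.0, d_pos - d_neg + margin)


# Toy embeddings in a shared 2D space (illustrative values only).
protein = [1.0, 0.0]
binder = [0.9, 0.1]   # lies close to the protein
decoy = [0.0, 1.0]    # lies far from the protein

# Well-separated case: the binder is already much closer than the decoy.
loss_separated = triplet_loss(protein, binder, decoy)

# Swapped roles: the "positive" is far away, so the loss is large.
loss_confused = triplet_loss(protein, decoy, binder)
```

Minimizing a loss of this shape pulls true binders toward their target and pushes decoys away, which is one way to obtain the separation between interacting partners and decoys described above.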

Putting these two ideas together gives us ConPLex, a representation learning approach that enables both broad generalization and high specificity. We show that ConPLex enables more accurate prediction of DTIs than competing methods while avoiding many of the pitfalls suffered by currently available approaches. Thus, our work constitutes a concrete demonstration of the power of a well-designed transfer learning approach that adapts foundation models for a specific task (15, 16). In particular, we found that the performance of existing sequence-based DTI prediction methods could be sensitive to variation in drug-vs-protein coverage in the dataset, whereas ConPLex performs well in multiple coverage regimes. Indeed, ConPLex performs especially well relative to other methods in the zero-shot prediction setting where no information is available about a given protein or drug at training time. Experimental validation of ConPLex yielded a 63% hit rate (12/19), including four hits with subnanomolar binding affinity, demonstrating the value of ConPLex as an accurate, highly scalable, in silico screening tool.

ConPLex can also be adapted beyond the binary case to make predictions about binding affinity. Furthermore, the shared representation also offers advantages beyond prediction accuracy. The coembedding of both proteins and drugs in the same space offers interpretability, and we show that distances in this space meaningfully reflect protein domain structure and binding function: We leverage ConPLex representations to functionally characterize cell-surface proteins from the Surfaceome database (17), a set of 2,886 proteins localized to the external plasma membrane that participate in signaling and are likely able to be easily targeted by ligands.

ConPLex is extremely fast: As a proof of concept, we make predictions for the human proteome against all drugs in ChEMBL (18) (≈2 × 10^10 pairs) in just under 24 h using a single NVIDIA A100 GPU. Thus, ConPLex has the potential to be applied for tasks which would require prohibitive amounts of computation for purely structure-based approaches or less efficient sequence-based methods, such as genome-scale side-effect screens, identifying drug repurposing candidates via massive compound library searches, or in silico deep mutational scans to predict variant effects on binding with currently approved or potential new therapeutics. We note that most DTI methods require significant computation on each drug–target pair (i.e., have quadratic time complexity). Because ConPLex predictions rely only on the distance in the shared space, predictions can be made highly efficiently once embeddings (which have linear time complexity) are computed.
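
The scaling argument can be sketched in a few lines: each protein and each drug is embedded once, and every pair is then scored with a cheap vector comparison rather than a full model evaluation. In the sketch below, the embeddings are hard-coded toy vectors standing in for the output of a trained encoder, and the names (`rank_drugs`, `kinase_x`, `drug_a`, and so on) and the use of cosine similarity as the score are hypothetical choices for illustration, not the ConPLex API.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_drugs(protein_vec, drug_vecs):
    """Return drug names ordered from most to least similar to the protein."""
    scores = {name: cosine_similarity(protein_vec, vec)
              for name, vec in drug_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical precomputed embeddings (one encoder pass per molecule).
protein_vecs = {"kinase_x": [1.0, 0.0], "receptor_y": [0.0, 1.0]}
drug_vecs = {"drug_a": [0.9, 0.1], "drug_b": [0.1, 0.9], "drug_c": [0.6, 0.4]}

# Scoring every pair only needs the cached vectors.
shortlists = {name: rank_drugs(vec, drug_vecs)
              for name, vec in protein_vecs.items()}

# Expensive encoder passes grow with P + D; cheap comparisons grow with P * D.
n_proteins, n_drugs = len(protein_vecs), len(drug_vecs)
encoder_calls = n_proteins + n_drugs
pair_scores = n_proteins * n_drugs
```

The number of pairs is still quadratic, but the per-pair work shrinks to a single distance computation, which is what makes proteome-by-library screens tractable in this setting.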

Distinguishing between Low- and High-Coverage DTI Prediction.

We benchmark performance of ConPLex and competing methods in two different regimes, which we term low-coverage and high-coverage DTI prediction (Fig. 1C). We show that ConPLex outperforms its competitors in both settings, but note that separating the two regimes helps clarify an often-seen issue in the field: methods whose performance varies substantially across different proposed DTI benchmarks. Several prior attempts have been made to standardize DTI benchmarking and develop a consistent framework for model evaluation (19, 20). However, much of this work has overlooked a key aspect of benchmarking that we find to significantly affect model performance: differing per-biomolecule data coverage. We define coverage as the average proportion of drugs or targets for which a data point exists in that dataset, whether that is a positive or negative interaction (Methods). Depending on the per-biomolecule data coverage of the benchmark dataset, we claim that these benchmarks are looking at very different problems. In particular, low-coverage datasets (Fig. 1A) tend to measure the broad strokes of the DTI landscape, containing a highly diverse set of drugs and targets. Such datasets can present a modeling challenge due to the diverse nature of targets covered but allow for a broad assessment of compatibility between classes of compounds and proteins. High-coverage datasets (Fig. 1B) represent the opposite trade-off: They contain limited diversity in drug or target type but report a dense set of potential pairwise interactions. Thus, they capture the fine-grained details of a specific subclass of drug–target binding and enable distinguishing between similar biomolecules in a particular context…
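
One possible reading of the coverage definition can be made concrete as follows: for each drug, take the fraction of the dataset's targets with which it has any recorded data point (positive or negative), and likewise for each target. The precise formula is given in the paper's Methods, which are not reproduced here, so the function below is an interpretation, and the dataset contents and names are invented for illustration.

```python
from collections import defaultdict


def per_molecule_coverage(observations):
    """Compute per-drug and per-target coverage.

    `observations` is an iterable of (drug, target) pairs for which any
    data point exists, regardless of whether the label is positive or
    negative. This is an assumed reading of the coverage definition.
    """
    by_drug = defaultdict(set)
    by_target = defaultdict(set)
    for drug, target in set(observations):
        by_drug[drug].add(target)
        by_target[target].add(drug)
    drug_cov = {d: len(ts) / len(by_target) for d, ts in by_drug.items()}
    target_cov = {t: len(ds) / len(by_drug) for t, ds in by_target.items()}
    return drug_cov, target_cov


def mean(values):
    """Arithmetic mean of a non-empty collection of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Low coverage: many distinct drugs and targets, each seen in one pair.
sparse = [("d1", "t1"), ("d2", "t2"), ("d3", "t3"), ("d4", "t4")]

# High coverage: few drugs and targets, every pair measured.
dense = [(d, t) for d in ("d1", "d2") for t in ("t1", "t2")]

sparse_drug_cov, sparse_target_cov = per_molecule_coverage(sparse)
dense_drug_cov, dense_target_cov = per_molecule_coverage(dense)

sparse_mean = mean(sparse_drug_cov.values())
dense_mean = mean(dense_drug_cov.values())
```

Under this reading, the sparse toy dataset has a mean drug coverage of 0.25 and the dense one a mean of 1.0, mirroring the low-coverage and high-coverage regimes contrasted above.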
