publications
publication list.
2025
2024
- eMOSAIC: Multi-modal Out-of-distribution Uncertainty Quantification Streamlines Large-scale Polypharmacology. Amitesh Badkul, Li Xie, Shuo Zhang, and Lei Xie. bioRxiv, 2024.
Polypharmacology has emerged as a new paradigm to discover novel therapeutics for unmet medical needs. Accurate, reliable and scalable predictions of protein-ligand binding affinity across multiple proteins are essential for polypharmacology. Machine learning is a promising tool for multi-target binding affinity predictions, often formulated as a multi-modal regression problem. Despite considerable efforts, three challenges remain: out-of-distribution (OOD) generalizations for compounds with new chemical scaffolds, uncertainty quantification of OOD predictions, and scalability to billions of compounds, which structure-based methods fail to achieve. To address these challenges, we propose a new model-agnostic anomaly detection-based uncertainty quantification method, embedding Mahalanobis Outlier Scoring and Anomaly Identification via Clustering (eMOSAIC). eMOSAIC uniquely quantifies distribution similarities or differences between the multi-modal representation of known cases and that of a new unseen one. We apply eMOSAIC to a multi-modal deep neural network model for multi-target ligand binding affinity predictions, leveraging a pre-trained structure-informed large protein language model. We extensively validate eMOSAIC in OOD settings, showing that it outperforms state-of-the-art sequence-based deep learning and structure-based protein-ligand docking (PLD) methods by a large margin, as well as existing uncertainty quantification methods. This finding highlights eMOSAIC’s potential for real-world polypharmacology and other applications.
- Comparative study of DCNN and image processing based classification of chest X-rays for identification of COVID-19 patients using fine-tuning. Amitesh Badkul, Inturi Vamsi, and Radhika Sudha. Journal of Medical Engineering & Technology, 2024.
The conventional detection of COVID-19 by evaluating CT scan images is tiresome and often suffers from high inter-observer variability and uncertainty. This work proposes the automatic detection and classification of COVID-19 by analysing chest X-ray (CXR) images with deep convolutional neural network (DCNN) models through a pre-training and fine-tuning approach. CXR images pertaining to four health scenarios, namely healthy, COVID-19, bacterial pneumonia and viral pneumonia, are considered and subjected to data augmentation. Two input datasets are prepared: dataset I contains the original CXR images categorised under the four classes, whereas dataset II contains the same images after pre-processing with the Contrast Limited Adaptive Histogram Equalisation (CLAHE) algorithm and the Blackhat Morphological Operation (BMO). Both datasets are supplied as input to various DCNN models, such as DenseNet, MobileNet, ResNet, VGG16, and Xception, for multi-class classification. Image pre-processing is observed to improve the classification accuracies and reduce the classification errors. Overall, the VGG16 model yielded better classification accuracies and lower classification errors in the multi-class classification task. The proposed work could thus assist clinical diagnosis and reduce the workload of the front-line healthcare workforce and medical professionals.
2023
- End-to-end sequence-structure-function meta-learning predicts genome-wide chemical-protein interactions for dark proteins. Tian Cai, Li Xie, Shuo Zhang, Muge Chen, Di He, Amitesh Badkul, Yang Liu, and 4 more authors. PLOS Computational Biology, 2023.
Discovering chemical-protein interactions for millions of chemicals across the entire human and pathogen genomes is instrumental for chemical genomics, protein function prediction, drug discovery, and other applications. However, more than 90% of gene families remain dark, i.e., their small molecular ligands are undiscovered due to experimental limitations and human biases. Existing computational approaches typically fail when the unlabeled dark protein of interest differs from those with known ligands or structures. To address this challenge, we developed a deep learning framework PortalCG. PortalCG consists of four novel components: (i) a 3-dimensional ligand binding site enhanced sequence pre-training strategy to represent the whole universe of protein sequences in recognition of evolutionary linkage of ligand binding sites across gene families, (ii) an end-to-end pretraining-fine-tuning strategy to simulate the folding process of protein-ligand interactions and reduce the impact of inaccuracy of predicted structures on function predictions under a sequence-structure-function paradigm, (iii) a new out-of-cluster meta-learning algorithm that extracts and accumulates information learned from predicting ligands of distinct gene families (meta-data) and applies the meta-data to a dark gene family, and (iv) stress model selection that uses different gene families in the test data from those in the training and development data sets to facilitate model deployment in a real-world scenario. In extensive and rigorous benchmark experiments, PortalCG considerably outperformed state-of-the-art techniques of machine learning and protein-ligand docking when applied to dark gene families, and demonstrated its generalization power for off-target predictions and compound screenings under out-of-distribution (OOD) scenarios. Furthermore, in an external validation for the multi-target compound screening, the performance of PortalCG surpassed the human design. 
Our results also suggested that a differentiable sequence-structure-function deep learning framework, in which protein structure information serves as an intermediate layer, could be superior to the conventional methodology in which predicted protein structures are used to predict protein functions from sequences. We applied PortalCG to two case studies to exemplify its potential in drug discovery: designing selective dual-antagonists of dopamine receptors for the treatment of Opioid Use Disorder, and illuminating the undruggable human genome for targeting diseases that do not have effective and safe therapeutics. Our results suggested that PortalCG is a viable solution to the OOD problem in exploring the understudied protein functional space.
- TrustAffinity: accurate, reliable and scalable out-of-distribution protein-ligand binding affinity prediction using trustworthy deep learning. Amitesh Badkul, Li Xie, Shuo Zhang, and Lei Xie. NeurIPS 2023 Workshop on New Frontiers of AI for Drug Discovery and Development & AAAI 2024 Workshop on LLMs4Bio, 2023.
Accurate, reliable and scalable predictions of protein-ligand binding affinity have great potential to accelerate drug discovery. Despite considerable efforts, three challenges remain: out-of-distribution (OOD) generalizations for understudied proteins or compounds from unlabeled protein families or chemical scaffolds, uncertainty quantification of individual predictions, and scalability to billions of compounds. We propose a sequence-based deep learning framework, TrustAffinity, to address these challenges. TrustAffinity synthesizes a structure-informed protein language model, efficient uncertainty quantification based on residue-estimation, and novel uncertainty regularized optimization. We extensively validate TrustAffinity in multiple OOD settings, where it outperforms state-of-the-art computational methods by a large margin. It achieves a Pearson’s correlation between predicted and actual binding affinities above 0.9 with high confidence and is at least three orders of magnitude faster than protein-ligand docking, highlighting its potential in real-world drug discovery. We further demonstrate TrustAffinity’s practicality through an Opioid Use Disorder lead discovery case study.
2022
- RNN-driven Approaches to Self-healing Compound Synthesis. Amitesh Badkul and Ashif Iquebal. 2022.
3D printing technology has revolutionized manufacturing processes, offering enhanced precision and versatility in product design. However, the materials commonly used in this domain often exhibit brittleness, leading to concerns about their durability. The frequent and irreversible damage to these materials necessitates a solution to enhance their longevity and reduce maintenance. Self-healing materials, characterized by their ability to recover from damage autonomously, present a promising avenue to address this challenge. Hydrogen bonding, a fundamental atomic interaction, plays a pivotal role in facilitating the self-healing properties of materials. Yet, systematically exploring the chemical space to identify compounds with optimal hydrogen bonding for self-healing remains a complex task. This research aims to employ Recurrent Neural Network (RNN)-based algorithms to navigate this vast chemical space, striving to design compounds that harness the potential of hydrogen bonding for enhanced self-healing properties.