
Protein - Ligand Affinity Prediction: Strategizing Data Usage for Virtual Screening, Elix, CBI 2023

Elix
October 25, 2023

Transcript

  1. Protein - ligand affinity prediction: strategizing data usage for virtual screening. Thomas Auzard (presenter) - Elix, Inc.; David Jimenez Barrero, Nazim Medzhidov, PhD - Elix, Inc.; Naoki Tarui, PhD - SEEDSUPPLY. Chem-Bio Informatics Society (CBI) Annual Meeting 2023, Tokyo, Japan | October 25th, 2023
  2. Protein - ligand binding prediction: strategizing data usage. A proposed solution for virtual hit screening under various data availability scenarios.
  3. Proprietary protein-ligand binding dataset: positive samples and negative samples.
     • Proprietary training dataset
       ◦ 689 proteins (GPCRs and SLC transporters)
       ◦ Binder or non-binder molecules (SMILES)
     • Testing dataset
       ◦ 446,559 molecules
       ◦ Activity for 4 proteins
         ▪ GPR87, MC4R, GLP2R, SLC40A1
         ▪ Respectively 26, 4, 7, and 9 positive hits
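Concretely, such a dataset can be thought of as (protein, molecule, label) triples. A minimal sketch of that layout and of checking the per-protein positive/negative balance discussed later in the deck; the column names, SMILES, and values are illustrative assumptions, not the actual proprietary schema:

```python
import pandas as pd

# Hypothetical layout for a protein-ligand binding dataset like the one above:
# one row per (protein, molecule) pair with a binary binder/non-binder label.
train = pd.DataFrame(
    {
        "protein_id": ["GPR87", "GPR87", "MC4R"],  # illustrative proteins
        "smiles": ["CCO", "c1ccccc1", "CC(=O)O"],  # placeholder SMILES
        "label": [1, 0, 1],                        # 1 = binder, 0 = non-binder
    }
)

# Per-protein positive fraction: the positive/negative balance the deck
# later uses to judge the "quality of coverage" of a target.
print(train.groupby("protein_id")["label"].mean())
```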
  4. Model training and inference pipeline.
     [1] Esposito C, Landrum GA, Schneider N, Stiefl N, Riniker S. GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning. J Chem Inf Model. 2021;61(6):2623-2640. doi:10.1021/acs.jcim.1c00160
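GHOST [1], cited above, replaces the fixed 0.5 decision cutoff with the threshold that maximizes Cohen's kappa on training predictions (the paper does this over bootstrapped training subsets). A minimal sketch of the core idea, assuming predicted probabilities are already available; the grid and toy data are illustrative, and this is not the deck's actual pipeline code:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def ghost_threshold(y_true, y_prob, grid=np.arange(0.05, 0.55, 0.05)):
    """Pick the decision threshold that maximizes Cohen's kappa on the
    training predictions: the core idea of GHOST [1] for imbalanced data."""
    kappas = [cohen_kappa_score(y_true, y_prob >= t) for t in grid]
    return grid[int(np.argmax(kappas))]

# Toy example with heavy class imbalance, as in virtual screening.
rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.05).astype(int)            # ~5% positives
p = np.clip(0.3 * y + 0.2 * rng.random(1000), 0, 1)  # toy probabilities
print(ghost_threshold(y, p))
```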
  5. Does it work for any unseen target?
     Neighborhood of GPR87 and MC4R: each point represents a protein. Neighbors of MC4R have a better balance of positive and negative samples than GPR87.

     Protein                      GPR87           MC4R           GLP2R           SLC40A1
     Prediction / ground truth    0 / 26          3 / 7          0 / 4           2 / 9
     Shortlist size               49521 (11.0%)   37885 (8.4%)   45496 (10.2%)   43794 (9.8%)
     Confidence                   40%             64%            40%             42%

     Here, 3 different architectures were combined. The 64% and 40% confidence levels are respectively equivalent to 20 and 12 models agreeing on the prediction. The 4 test proteins are unseen (not in the training dataset).
     Coverage of the target protein in the training dataset impacts performance. How so?
     • Defining protein coverage
       ◦ Presence of similar proteins in the training data
       ◦ Quality of the coverage
         ▪ Positive / negative balance
         ▪ Anything else?
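The confidence figures in the table read as the fraction of ensemble members voting "binder" (e.g. 64% ≈ 20 agreeing models). A minimal sketch of that voting scheme, assuming a binary-vote ensemble; the ensemble size and threshold below are illustrative, not the deck's exact configuration:

```python
import numpy as np

def ensemble_confidence(votes: np.ndarray) -> np.ndarray:
    """votes: (n_models, n_molecules) binary predictions.
    Confidence = fraction of models that predict 'binder'."""
    return votes.mean(axis=0)

rng = np.random.default_rng(1)
votes = (rng.random((30, 5)) < 0.4).astype(int)  # hypothetical 30-model ensemble
conf = ensemble_confidence(votes)
print(conf, conf >= 0.64)  # shortlist molecules where >= 64% of models agree
```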
  6. Training dataset augmentation scenario
     • Adding samples (including positives) for the target to the training dataset
     • 3 training modalities
       1. Target unseen
       2. Target seen
       3. Fine-tuning on target
     Can we improve performance with additional data collection?
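Of the three modalities, fine-tuning is the only one that reuses previously learned weights rather than retraining from scratch. A minimal PyTorch sketch of that modality, assuming a generic pretrained binding classifier and a dataloader of target-specific samples; the function name, learning rate, and epoch count are illustrative assumptions:

```python
import torch

def fine_tune(model, target_loader, epochs=5, lr=1e-4):
    """Modality 3: start from the pretrained weights and continue training
    on target-specific samples with a small learning rate."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    model.train()
    for _ in range(epochs):
        for inputs, labels in target_loader:
            opt.zero_grad()
            logits = model(inputs).squeeze(-1)
            loss = loss_fn(logits, labels.float())
            loss.backward()
            opt.step()
    return model
```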
  7. Training dataset augmentation scenario results (fine-tuning vs. retraining)
     • Re-trained and fine-tuned models consistently beat the baseline model
       ◦ GPR87: 5 / 15 hits in ~3,100 molecules
       ◦ MC4R: 2 / 3 hits in ~80,000 molecules
     • GPR87 showed the best improvement
       ◦ It had the worst baseline
     • 50% of the augmentation data belongs to GPR87
       ◦ The predominance of GPR87 samples could have biased the models towards GPR87
     Improving coverage can improve performance. However, the model seems to be biased towards this added coverage. How so?
  8. Understanding data bias with protein clustering
     • Cluster preparation
       ◦ Based on protein similarity: the inverse of the Euclidean distance between ESM2-generated feature vectors
     • 9 clusters based on similarity, 2 clusters based on dissimilarity
     Protein similarity map, using multidimensional scaling (MDS). The bigger the dot, the higher the intra-similarity of the cluster. This is only a representation, not an actual depiction of the distances between proteins.
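A minimal sketch of this clustering and visualization step, assuming per-protein ESM-2 feature vectors are already computed; the eps and min_samples values are illustrative, and the appendix notes the actual clusters came from manual inspection of DBSCAN output:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN
from sklearn.manifold import MDS

# Placeholder for real per-protein ESM-2 embeddings, shape (n_proteins, d).
rng = np.random.default_rng(2)
embeddings = rng.normal(size=(50, 1280))

# Pairwise Euclidean distances (similarity is taken as their inverse).
dist = squareform(pdist(embeddings, metric="euclidean"))

# DBSCAN on the precomputed distance matrix; eps/min_samples are assumptions.
labels = DBSCAN(eps=1.5, min_samples=5, metric="precomputed").fit_predict(dist)

# 2-D layout for a similarity map like the one on the slide.
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)
print(labels, coords.shape)
```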
  9. Protein clustering - impact of training coverage
     • Intra-cluster similarity
       ◦ Sensitivity: strong correlation
       ◦ Precision: moderate correlation
     (The 4 lowest-similarity clusters are removed in 3 of the graphs below.)
     High intra-cluster similarity is correlated with good performance. So, adding more coverage should boost performance, right?
     • Cluster topology (among high intra-similarity clusters)
       ◦ No significant correlation with performance metrics
       ◦ Reasonable % positive (> 15%)
     The topology of the high-performing clusters defines what a good cluster is.
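A minimal sketch of how such a correlation could be quantified, assuming per-cluster intra-similarity and performance metrics have been tabulated; the numbers and the choice of Pearson correlation are illustrative assumptions, not the deck's data:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-cluster statistics (not the deck's numbers).
intra_similarity = np.array([0.9, 0.8, 0.75, 0.6, 0.5, 0.4])
sensitivity = np.array([0.85, 0.8, 0.7, 0.5, 0.45, 0.3])
precision = np.array([0.6, 0.55, 0.6, 0.4, 0.5, 0.35])

r_sens, _ = pearsonr(intra_similarity, sensitivity)
r_prec, _ = pearsonr(intra_similarity, precision)
print(f"sensitivity r={r_sens:.2f}, precision r={r_prec:.2f}")
```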
  10. Focus on SLC7A1 - is it always worth collecting more data?
     • SLC7A1 (part of the training data)
       ◦ 23 positive samples
       ◦ Belongs to cluster 3
         ▪ High intra-similarity
         ▪ 18 proteins
         ▪ 37% positive samples
       ◦ Receives coverage from clusters 4, 5, and 7
     • 3 training modalities
       1. Without cluster 3
       2. With samples of SLC7A1
       3. With the cluster of SLC7A1
  11. Improving coverage with similar proteins - results
     • No improvement with enrichment
       ◦ The baseline is already "good"
       ◦ SLC7A1 is covered by clusters 4, 5, and 7
     Clusters 3, 4, 5, and 7 cover a similar space. Compared to clusters 1 or 2, they receive coverage from other clusters.
     Improving coverage has its limits. What about the impact of distant data points?
     (Plots: fine-tuned with SLC7A1 samples; fine-tuned with clusters.)
  12. Tailored clusters from the proprietary dataset
     • Training the model on a limited protein space
     • Preparation of custom clusters for the protein of interest
       1. The protein of interest is the centroid of the cluster
       2. A good cluster needs to satisfy the topology metrics
     • Low to no impact for GPR87 and MC4R
     • Lowered performance for SLC40A1
     In both cases, it is better to use the whole training dataset.
     (% of positive samples in the three custom clusters: 11%, 36%, and 38%; intra-cluster similarity and size are detailed in the appendix.)
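A minimal sketch of building such a tailored cluster, assuming precomputed protein embeddings; the 2.0 distance threshold comes from the appendix (SLC40A1 control cluster), the > 15% positive check from slide 9, and the function and argument names are hypothetical:

```python
import numpy as np

def tailored_cluster(target_emb, protein_embs, labels_by_protein, max_dist=2.0):
    """Custom cluster with the protein of interest as centroid: keep proteins
    within max_dist of the target embedding, then check the topology metric
    (a reasonable fraction of positive samples, > 15%)."""
    dists = np.linalg.norm(protein_embs - target_emb, axis=1)
    members = np.where(dists <= max_dist)[0]
    pos = sum(labels_by_protein[i].sum() for i in members)
    total = sum(len(labels_by_protein[i]) for i in members)
    frac_pos = pos / total if total else 0.0
    return members, frac_pos, frac_pos > 0.15
```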
  13. Conclusion
     • Protein coverage and what makes a good cluster depend on the data
       → A similar analysis should be performed with the available training IP data and the library to screen
     • Performance is conditioned by the models
       ◦ Protein-ligand interactions are not used!
         ▪ Pharmacophore models, …
     • Another use case leveraging our approach
       ◦ Human proteins that are hard to express → use animal proteins for experiments
       ◦ Data for animal proteins could be used to enrich the training data
       → More confidence in the binding affinity with ML
  14. Model architectures
     (Architecture diagram: molecular SMILES → molecular graph → graph convolutional NN → graph features; protein sequence → ESM-2 network → protein features; the ligand and protein features are aggregated into protein-molecule features and passed through a linear network to produce the prediction. Two variants are shown, an augmented tiered GCN and a feature-aggregation model, each combining ligand featurization, protein featurization, and aggregation.)
     • GCN pretrained on publicly available GPCR data
     • Combines numerous featurization and aggregation processes for both ligand and protein
     Since the augmented tiered GCN performs slightly better and is lighter, it is the default model unless stated otherwise.
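A minimal PyTorch Geometric sketch of a model in this family: a GCN over the molecular graph, mean-pooled into a ligand embedding, concatenated with an ESM-2 protein embedding, and passed through a linear head. The 1280-dim protein embedding (esm2_t33_650M) and all layer sizes are assumptions; this illustrates the feature-aggregation pattern, not the deck's augmented tiered GCN itself:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool

class AffinityClassifier(nn.Module):
    def __init__(self, atom_dim=32, protein_dim=1280, hidden=128):
        super().__init__()
        self.conv1 = GCNConv(atom_dim, hidden)  # molecular graph -> graph features
        self.conv2 = GCNConv(hidden, hidden)
        self.head = nn.Sequential(              # aggregation -> linear network
            nn.Linear(hidden + protein_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, edge_index, batch, protein_emb):
        h = self.conv1(x, edge_index).relu()
        h = self.conv2(h, edge_index).relu()
        g = global_mean_pool(h, batch)           # ligand-level embedding
        z = torch.cat([g, protein_emb], dim=-1)  # protein-molecule features
        return self.head(z).squeeze(-1)          # binder/non-binder logit
```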
  15. Appendix
     Cluster      Average intra-distance   Unique / total samples   Proteins   % positive
     Cluster 1    1.89                     1703 / 1728              33         16.72%
     Cluster 2    2.26                     5882 / 6192              117        33.43%
     Cluster 3    1.64                     859 / 864                18         37.38%
     Cluster 4    1.51                     432 / 432                9          15.74%
     Cluster 5    1.36                     240 / 240                5          14.16%
     Cluster 6    1.54                     432 / 432                9          29.86%
     Cluster 7    1.02                     232 / 240                5          46.67%
     Cluster 8    1.41                     236 / 240                5          18.75%
     Cluster 9    1.51                     320 / 336                7          37.80%
     Cluster 10   5.36                     262 / 288                5          35.42%
     Cluster 11   4.43                     5746 / 5953              117        19.05%
     Detailed distributions of the clusters, obtained by manual inspection of DBSCAN clustering.
     Protein similarity map, using multidimensional scaling (MDS). The bigger the dot, the higher the intra-similarity of the cluster. This is only a representation, not an actual depiction of the distances between proteins.
  16. Detailed distributions of the custom clusters:
     Cluster    Average intra-distance   Unique / total samples   Proteins   % positive
     GPR87      1.22                     717 / 720                13         11.11%
     MC4R       1.25                     240 / 240                5          35.58%
     SLC40A1    1.82                     192 / 192                4          38.02%
     SLC40A1 is a control cluster: it has only 4 proteins, and the threshold distance for selection was 2.0.