Benchmarking Deployed Generative Models on Elix Discovery, Elix, CBI 2023

Benchmarking Deployed Generative Models on Elix Discovery
Elix, Inc.
Vincent Richard, Jun Jin Choong, Ph.D.
Chem-Bio Informatics Society (CBI) Annual Meeting 2023, Tokyo, Japan | October 24th, 2023

2 Introduction

3 General drug discovery flow in machine learning
Datasets: ZINC, CHEMBL, MOSES, …
Representations: O=C(NCCCn1ccnc1)c1cccs1
Evaluation: How to assess the performance of a molecular generative model?
Models: VAE, GAN, RNN, …
Existing trained models: weights shared by the authors, or existing internal models

4 Production environment difficulties
Model 1 Dataset: ZINC
Model 2 Dataset: CHEMBL with custom filters
Model 3 Dataset: ZINC + CHEMBL
How to evaluate models trained with various datasets? Is it possible to have a fair evaluation?

5 Current Benchmarking Solutions

6 The Current State of Evaluation Metrics for Generative Models
Distribution metrics (from MOSES [1])
- FCD (Fréchet ChemNet Distance)
- SNN (Similarity to Nearest Neighbor)
- Scaffold similarity
- Validity
- Uniqueness
- Filters (% passing MOSES-defined SMARTS filters)
- Novelty
- IntDiv (can detect mode collapse of generative models)
- Fragment similarity

Oracle-based metrics (from TDC [2] and GuacaMol [3]): molecule generation with a desired property in mind.
- Docking score
- ML-based score (DRD2, JNK3, GSK3B)
- Similarity to another molecule
- Rediscovery
- Isomer identification
- Property optimization (LogP, QED, SA)
- Scaffold hops
- …

[Diagram labels: Training, Learned distribution; Generate, Oracle, Feedback]

[1] Polykovskiy et al. “Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models” https://arxiv.org/abs/1811.12823
[2] Huang et al. “Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development” https://arxiv.org/abs/2102.09548
[3] Brown et al. “GuacaMol: Benchmarking Models for De Novo Molecular Design” https://arxiv.org/abs/1811.09621
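To make three of the distribution metrics above concrete, here is a minimal sketch in plain Python. It is not the MOSES implementation. It assumes the generated SMILES have already been validated and canonicalized by a cheminformatics toolkit, and that fingerprints are given as sets of on-bit indices. The function names are illustrative, and `internal_diversity` is simplified to average over distinct pairs only.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def uniqueness(valid_smiles):
    """Fraction of valid generated molecules that are distinct."""
    if not valid_smiles:
        return 0.0
    return len(set(valid_smiles)) / len(valid_smiles)


def novelty(valid_smiles, training_smiles):
    """Fraction of distinct generated molecules absent from the training set.

    This is exact-match novelty on canonical SMILES (an assumption of this
    sketch), not a similarity-threshold criterion.
    """
    generated = set(valid_smiles)
    if not generated:
        return 0.0
    return len(generated - set(training_smiles)) / len(generated)


def internal_diversity(fingerprints):
    """One minus the mean pairwise Tanimoto similarity over distinct pairs."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
```

A model that keeps emitting the same molecule gets an internal diversity of 0, which is how IntDiv flags mode collapse.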

7 Internal result

Benchmark                 model 1         model 2         model 3         model 4
albuterol_similarity      0.5 ± 0.0065    0.638 ± 0.0328  0.857 ± 0.0077  0.598 ± 0.0052
amlodipine_mpo            0.506 ± 0.0086  0.546 ± 0.0056  0.62 ± 0.0071   0.492 ± 0.0044
celecoxib_rediscovery     0.546 ± 0.0959  0.626 ± 0.0365  0.841 ± 0.0107  0.321 ± 0.0041
deco_hop                  0.632 ± 0.0553  0.631 ± 0.0093  0.729 ± 0.0612  0.586 ± 0.0023
drd2                      0.948 ± 0.0224  0.973 ± 0.0047  0.978 ± 0.005   0.847 ± 0.0321
fexofenadine_mpo          0.678 ± 0.0072  0.716 ± 0.0114  0.801 ± 0.0059  0.695 ± 0.0037
gsk3b                     0.779 ± 0.0693  0.784 ± 0.0142  0.877 ± 0.0197  0.586 ± 0.0332
isomers_c7h8n2o2          0.834 ± 0.0421  0.794 ± 0.022   0.898 ± 0.0112  0.481 ± 0.0741
isomers_c9h10n2o2pf2cl    0.591 ± 0.0489  0.644 ± 0.0102  0.745 ± 0.0069  0.524 ± 0.0149
jnk3                      0.387 ± 0.0315  0.439 ± 0.0298  0.573 ± 0.0455  0.361 ± 0.0262
median1                   0.183 ± 0.0073  0.268 ± 0.0064  0.338 ± 0.0038  0.198 ± 0.0045
median2                   0.219 ± 0.0148  0.221 ± 0.0035  0.28 ± 0.0093   0.165 ± 0.0027
mestranol_similarity      0.367 ± 0.0189  0.618 ± 0.0374  0.835 ± 0.0105  0.367 ± 0.004
osimertinib_mpo           0.796 ± 0.0111  0.795 ± 0.0027  0.825 ± 0.0023  0.756 ± 0.0075
perindopril_mpo           0.444 ± 0.0128  0.471 ± 0.0049  0.538 ± 0.0181  0.463 ± 0.0056
qed                       0.936 ± 0.0021  0.937 ± 0.0013  0.939 ± 0.0006  0.931 ± 0.0038
ranolazine_mpo            0.445 ± 0.0213  0.734 ± 0.0036  0.785 ± 0.0017  0.714 ± 0.0053
scaffold_hop              0.498 ± 0.0047  0.475 ± 0.0031  0.498 ± 0.0032  0.464 ± 0.0019
sitagliptin_mpo           0.281 ± 0.0198  0.297 ± 0.0169  0.371 ± 0.0295  0.217 ± 0.0185
thiothixene_rediscovery   0.368 ± 0.0115  0.388 ± 0.0094  0.605 ± 0.0471  0.29 ± 0.0068
troglitazone_rediscovery  0.26 ± 0.007    0.301 ± 0.0083  0.5 ± 0.041     0.234 ± 0.0075
valsartan_smarts          0.033 ± 0.0653  0.0 ± 0.0       0.0 ± 0.0       0.0 ± 0.0002
zaleplon_mpo              0.484 ± 0.0125  0.444 ± 0.0054  0.49 ± 0.0038   0.412 ± 0.0055
Sum                       11.715          12.74           14.9            10.7

Mol opt benchmark: focus on sample efficiency, restricted to 10,000 oracle calls. The core metric is the area under the curve (AUC) of the top-10 average score.
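The core metric can be sketched as follows. This is a simplified illustration, not the benchmark's reference code. It assumes higher scores are better, evaluates the top-k average after every oracle call, holds the last value if the run stops before the budget is used, and normalizes by the budget. The name `top_k_auc` and its parameters are hypothetical.

```python
import heapq


def top_k_auc(scores, k=10, budget=10_000):
    """Normalized area under the 'top-k average score vs. oracle calls' curve.

    `scores` lists oracle scores in the order the molecules were evaluated.
    """
    if not scores:
        return 0.0
    top = []      # min-heap holding the k best scores seen so far
    area = 0.0
    calls = 0
    for s in scores[:budget]:
        if len(top) < k:
            heapq.heappush(top, s)
        elif s > top[0]:
            heapq.heapreplace(top, s)
        area += sum(top) / len(top)   # top-k average after this call
        calls += 1
    # Assumption: a run that ends early keeps its final top-k average
    # for the remaining, unused oracle calls.
    area += (budget - calls) * (sum(top) / len(top))
    return area / budget
```

Under this metric, a model that finds a good molecule on its first call scores higher than one that finds the same molecule on its last call, which is the sense in which the benchmark rewards sample efficiency.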

8 Limitations of current benchmark
Current Limitations
- Distribution Metrics: Easily Beatable
Can be fooled by a simple task. Most models achieve high accuracy. Useful for debugging when training a model.
[1] Renz et al. “On failure modes in molecule generation and optimization” https://www.sciencedirect.com/science/article/pii/S1740674920300159

9 Limitations of current benchmark
Current Limitations
- Distribution Metrics: Easily Beatable
- Oracle tasks and training data
Certain tasks can be addressed by the training dataset alone. The model learns the dataset's distribution, but novelty isn't always guaranteed.
A molecule is novel if its nearest neighbor in the training set has a similarity of less than 0.4 (ECFP4 based) [1].
[1] Franco et al. “The use of 2D fingerprint methods to support the assessment of structural similarity in orphan drug legislation” https://jcheminf.biomedcentral.com/counter/pdf/10.1186/1758-2946-6-5.pdf
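The novelty criterion above can be sketched in a few lines. This is a minimal illustration that assumes each fingerprint is already available as a set of on-bit indices (in practice an ECFP4 fingerprint produced by a cheminformatics toolkit). The function names are illustrative, and only the 0.4 threshold comes from the slide.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def nearest_neighbor_similarity(fp, training_fps):
    """Highest Tanimoto similarity between `fp` and any training fingerprint."""
    return max((tanimoto(fp, t) for t in training_fps), default=0.0)


def is_novel(fp, training_fps, threshold=0.4):
    """A molecule is novel if its nearest training neighbor is below the threshold."""
    return nearest_neighbor_similarity(fp, training_fps) < threshold


# Toy fingerprints (hypothetical bit indices, not real ECFP4 output).
training = [{1, 2, 3, 4}, {10, 11}]
close_analog = {1, 2, 3, 5}      # shares 3 of 5 bits with the first training molecule
distant = {1, 20, 21, 22}        # shares 1 of 7 bits with the first training molecule
```

Here `is_novel(close_analog, training)` is false (nearest-neighbor similarity 0.6) and `is_novel(distant, training)` is true.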

10 Oracle task bias by the training data
[Figure panels: Current Oracles; Real Case Scenarios]

11 Limitations of current benchmark
Current Limitations
- Distribution Metrics: Easily Beatable
- Oracle tasks and training data
- Oracle tasks don’t focus on synthesizable molecules
Focusing only on the objective is insufficient: a molecule has no value if it cannot be made.
Biasing the model toward synthesizable molecules through an oracle.
[1] Gao et al. “The Synthesizability of Molecules Proposed by Generative Models” https://pubs.acs.org/doi/10.1021/acs.jcim.0c00174
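One way to bias a model toward synthesizable molecules through the oracle is to wrap the objective so that hard-to-make molecules receive a fixed low score. The sketch below is an assumption about how this could be done, not the method used on Elix Discovery. `oracle` and `sa_score` are hypothetical callables supplied by the user, higher oracle scores are assumed to be better, and the default cut-off of 4 mirrors the SA < 4 filter proposed later in the deck.

```python
def synthesizability_aware(oracle, sa_score, max_sa=4.0, penalty=0.0):
    """Wrap `oracle` so molecules with SA score >= `max_sa` get `penalty`.

    `oracle(smiles)` returns the objective score (higher is better, assumed).
    `sa_score(smiles)` returns a synthetic accessibility score where lower
    values mean easier to synthesize.
    """
    def wrapped(smiles):
        if sa_score(smiles) >= max_sa:
            # The underlying objective is not evaluated for rejected molecules.
            return penalty
        return oracle(smiles)
    return wrapped


# Toy stand-ins for a real objective and a real SA scorer (hypothetical values).
toy_sa = {"easy_molecule": 2.5, "hard_molecule": 6.0}
scorer = synthesizability_aware(lambda s: 0.9, toy_sa.get)
```

With these stand-ins, `scorer("easy_molecule")` returns the objective score and `scorer("hard_molecule")` returns the penalty.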

12 Proposed Solution

13 Proposed solution
How to evaluate models trained with various datasets? Is it possible to have a fair evaluation?
One solution is to always evaluate on novelty.
Advantages: Track the ability of the model to generate out-of-distribution molecules.
Disadvantages: Tasks are not equally difficult for each model. A need for diversity.
[Figure labels: First model objective set; Second model objective set]

14 Proposed solution: The task
Docking [1]
- Difficult and reflects real-case scenarios
- Each target has a unique objective chemical space
- Can be extended to fragment-based optimization
[1] Cieplinski et al. “We Should at Least Be Able to Design Molecules That Dock Well” https://arxiv.org/abs/2006.16955

15 Proposed solution
Oracle Task (5k and 10k oracle calls):
- Docking Score
- Drug-like Objective: MW < 600, Rotatable Bonds < 10, pass set of chemical rules, …
- Synthesizability Objective: add a Synthetic Accessibility score filter, SA < 4
Evaluation:
- Filter on drug-like objective
- Filter on SA scores
- Filter out molecules similar to the training set
- Compute evaluation: AUC of top 5% docking score of novel molecules
- Distribution metrics for sanity check
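The evaluation steps can be sketched as a filter followed by a top-5% summary. This is a hedged illustration, not Elix's implementation. It assumes every property has been precomputed per molecule, that lower docking scores are better, that "similar to the training set" means a nearest-neighbor similarity of 0.4 or more, and that the top 5% is taken over the molecules that survive all filters. Only the two drug-likeness rules stated on the slide are encoded. The names `Candidate`, `passes_filters` and `top_fraction_mean` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    smiles: str
    docking_score: float    # assumed: lower (more negative) is better
    mol_weight: float
    rotatable_bonds: int
    sa_score: float         # synthetic accessibility, lower is easier
    nn_similarity: float    # similarity to the nearest training molecule


def passes_filters(c, novelty_threshold=0.4):
    """Drug-like, synthesizable and novel according to the stated thresholds."""
    return (
        c.mol_weight < 600
        and c.rotatable_bonds < 10
        and c.sa_score < 4
        and c.nn_similarity < novelty_threshold
    )


def top_fraction_mean(candidates, fraction=0.05):
    """Mean docking score of the best `fraction` of molecules passing all filters.

    Returns None when no molecule survives the filters.
    """
    kept = sorted(c.docking_score for c in candidates if passes_filters(c))
    if not kept:
        return None
    n = max(1, math.ceil(fraction * len(kept)))
    return sum(kept[:n]) / n
```

Tracking this value at regular intervals over the 5k or 10k oracle calls and integrating it would give the AUC named on the slide. That last step is left out of the sketch.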

16 Conclusion
• We detailed the limitations of existing benchmarks in the literature.
• We shared a new benchmark direction, which reflects real-case scenarios and solves the issue of evaluating models in production.
• In particular, with the Elix benchmark, all high-performing models can be trusted in real-case scenarios, which was not the case for existing benchmarks.
Future Directions:
• Extend the benchmark to other drug discovery tasks not present in current literature, especially lead optimization tasks.

17 Thank you for your attention. Q & A

www.elix-inc.com

19 Appendix

20 Oracle task bias by the training data
From our previous results with various models:
In real scenarios we are looking for novel molecules. We observed that no model produces sufficiently novel molecules.
All similarities are computed as Tanimoto similarity on ECFP4-based fingerprints.
[Figure label: Novelty threshold]
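For reference, the similarity and the novelty test can be written out as follows. The notation is introduced here for illustration: fp(m) denotes the set of on-bits of the ECFP4 fingerprint of molecule m, D_train is the training set, and 0.4 is the novelty threshold stated earlier in the deck.

```latex
T(A, B) = \frac{|A \cap B|}{|A \cup B|},
\qquad
m \text{ is novel} \iff
\max_{t \in \mathcal{D}_{\text{train}}} T\big(\mathrm{fp}(m), \mathrm{fp}(t)\big) < 0.4
```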

21 Explanation of the benefit of Novelty for evaluation
How to evaluate those models? Is it possible to have a fair evaluation with various training data?

Current Benchmark
Various training data
Good Result: Can’t be trusted | Checking training data is needed
Poor Result: Can be trusted | Can be trusted

Elix Benchmark
Same training data | Various training data
Good Result: Can be trusted | Can be trusted
Poor Result: Checking training data is needed | Can be trusted
Same training data

22 Sample Efficiency Matters [1]
Generation process with an oracle
[Diagram labels: Generate, Feedback, Oracle]
Oracle types, compared on Speed and Accuracy: Descriptors; Machine learning models (Activity, ADME, Property, Toxicity, …); Docking [2]; Human in the loop [3]; … Others
Speed / Accuracy ratings: +++ ++ ++ ~ – +++ – – ++++
[1] Gao et al. “Sample Efficiency Matters: A Benchmark for Practical Molecular Optimization” https://arxiv.org/abs/2206.12411
[2] Cieplinski et al. “We Should at Least Be Able to Design Molecules That Dock Well” https://arxiv.org/abs/2006.16955
[3] Sundin et al. “Human-in-the-loop assisted de novo molecular design” https://jcheminf.biomedcentral.com/articles/10.1186/s13321-022-00667-8
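Because slow oracles such as docking dominate the cost of a run, a benchmark that restricts oracle calls needs a way to count them. Below is a minimal sketch of such a wrapper. The class names are hypothetical, the 10,000 default mirrors the budget quoted earlier in the deck, and not charging repeated molecules against the budget is a design choice assumed here.

```python
class BudgetExhausted(RuntimeError):
    """Raised when a new molecule is scored after the budget is spent."""


class BudgetedOracle:
    """Wrap a scoring function with a call budget and a result cache."""

    def __init__(self, score_fn, budget=10_000):
        self.score_fn = score_fn
        self.budget = budget
        self.cache = {}      # smiles -> score, so repeats are free
        self.history = []    # scores in evaluation order, one per charged call

    @property
    def calls(self):
        return len(self.history)

    def __call__(self, smiles):
        if smiles in self.cache:
            return self.cache[smiles]
        if self.calls >= self.budget:
            raise BudgetExhausted(f"budget of {self.budget} oracle calls used")
        score = self.score_fn(smiles)
        self.cache[smiles] = score
        self.history.append(score)
        return score


# Toy usage: string length stands in for an expensive oracle.
oracle = BudgetedOracle(len, budget=2)
first = oracle("CC")
repeat = oracle("CC")     # served from the cache, not charged
second = oracle("CCO")
```

The `history` list records scores in evaluation order, which is the input a sample-efficiency metric computed over oracle calls would need.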
