Datasets: CHEMBL, MOSES, …
Representations: O=C(NCCCn1ccnc1)c1cccs1
Models: VAE, GAN, RNN, …
Existing trained models: weights shared by the authors, or existing internal models
Evaluation: How can the performance of a molecular generative model be assessed?
Model 2
Dataset: CHEMBL with custom filters
Model 3
Dataset: ZINC + CHEMBL
How can models trained on different datasets be evaluated? Is a fair evaluation possible?
Distribution metrics (from MOSES [1])
- FCD (Fréchet ChemNet Distance)
- SNN (Similarity to Nearest Neighbor)
- Scaffold similarity
- Validity
- Uniqueness
- Filters (% passing MOSES-defined SMARTS filters)
- Novelty
- IntDiv (internal diversity; can detect mode collapse of generative models)
- Fragment similarity

Oracle-based metrics (from TDC [2] and GuacaMol [3])
Molecule generation with a desired property in mind.
- Docking score
- ML-based score (DRD2, JNK3, GSK3B)
- Similarity to another molecule
- Rediscovery
- Isomer identification
- Property optimization (LogP, QED, SA)
- Scaffold hops
- …

[Diagram labels: Training, Learned distribution, Generate, Oracle, Feedback]

[1] Polykovskiy et al. “Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models” https://arxiv.org/abs/1811.12823
[2] Huang et al. “Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development” https://arxiv.org/abs/2102.09548
[3] Brown et al. “GuacaMol: Benchmarking Models for De Novo Molecular Design” https://arxiv.org/abs/1811.09621
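To make a few of the distribution metrics concrete, here is a minimal sketch in plain Python. It is not the MOSES implementation. It assumes that molecules arrive as already-canonicalized SMILES strings and that fingerprints are given as sets of integer feature ids (hypothetical stand-ins for real fingerprints). The function names are illustrative, and the internal diversity shown is a simplified variant that averages over distinct pairs only.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of feature ids."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def uniqueness(smiles_list):
    """Fraction of generated strings that are distinct (assumes canonical SMILES)."""
    if not smiles_list:
        return 0.0
    return len(set(smiles_list)) / len(smiles_list)


def exact_match_novelty(smiles_list, training_smiles):
    """Fraction of distinct generated molecules that do not appear in the training set."""
    generated = set(smiles_list)
    if not generated:
        return 0.0
    return len(generated - set(training_smiles)) / len(generated)


def internal_diversity(fingerprints):
    """1 minus the mean pairwise Tanimoto similarity (simplified: distinct pairs only)."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    mean_similarity = sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_similarity
```

A value of internal diversity close to 0 means the generated molecules are nearly identical to each other, which is how this metric can flag mode collapse.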
Beatable
- Can be fooled by a simple task [1].
- Most models achieve high accuracy.
- Useful for debugging while training a model.

[1] Renz et al. “On failure modes in molecule generation and optimization” https://www.sciencedirect.com/science/article/pii/S1740674920300159
Oracle tasks and training data
- Certain tasks can be solved from the training dataset alone.
- The model learns the dataset's distribution, but novelty is not guaranteed.
- A molecule is novel if its nearest neighbor in the training set has a similarity below 0.4 (ECFP4-based) [1].

[1] Franco et al. “The use of 2D fingerprint methods to support the assessment of structural similarity in orphan drug legislation” https://jcheminf.biomedcentral.com/counter/pdf/10.1186/1758-2946-6-5.pdf
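The novelty rule above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: fingerprints are represented as sets of integer feature ids, standing in for ECFP4 bit sets that a cheminformatics toolkit would normally produce, and the names `nearest_neighbor_similarity` and `is_novel` are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of feature ids."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nearest_neighbor_similarity(fingerprint, training_fingerprints):
    """Highest Tanimoto similarity between one molecule and any training molecule."""
    return max((tanimoto(fingerprint, t) for t in training_fingerprints), default=0.0)


def is_novel(fingerprint, training_fingerprints, threshold=0.4):
    """Novel if the nearest training neighbor is strictly below the threshold."""
    return nearest_neighbor_similarity(fingerprint, training_fingerprints) < threshold


# Toy fingerprints (assumed data, for illustration only).
training = [frozenset({1, 2, 3, 4}), frozenset({10, 11})]
close_analogue = frozenset({1, 2, 3, 5})
distant_molecule = frozenset({1, 20, 21, 22})
print(is_novel(close_analogue, training), is_novel(distant_molecule, training))
```

The comparison is strict, so a molecule whose nearest neighbor sits exactly at 0.4 is not counted as novel; whether the boundary should be inclusive is a choice the slide does not specify.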
Oracle tasks don’t focus on synthesizable molecules
- Focusing only on the objective is insufficient: a molecule has no value if it cannot be made [1].
- Bias the model toward synthesizable molecules through an oracle.

[1] Gao et al. “The Synthesizability of Molecules Proposed by Generative Models” https://pubs.acs.org/doi/10.1021/acs.jcim.0c00174
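One possible way to bias a model toward synthesizable molecules through the oracle is to gate the objective on a synthetic accessibility (SA) score. The sketch below is an assumption about how this could be done, not the method used in the deck: `objective` and `sa_score` are hypothetical callables supplied by the user, the objective is assumed to be "higher is better", and the default cutoff of 4 is only an example value.

```python
from typing import Callable


def make_sa_gated_oracle(
    objective: Callable[[str], float],
    sa_score: Callable[[str], float],
    max_sa: float = 4.0,
    rejected_value: float = 0.0,
) -> Callable[[str], float]:
    """Wrap an objective so molecules judged hard to synthesize get a fixed low score."""

    def oracle(smiles: str) -> float:
        # Lower SA scores mean easier synthesis; reject anything at or above the cutoff.
        if sa_score(smiles) >= max_sa:
            return rejected_value
        return objective(smiles)

    return oracle


# Toy stand-ins (assumed values, for illustration only).
toy_sa = {"CCO": 1.5, "C1CC2CC1C2": 5.2}
toy_objective = {"CCO": 0.3, "C1CC2CC1C2": 0.9}
oracle = make_sa_gated_oracle(toy_objective.__getitem__, toy_sa.__getitem__)
print(oracle("CCO"), oracle("C1CC2CC1C2"))
```

A hard gate like this is the simplest option; a smooth penalty on the SA score would give the generator a gradient of feedback instead of a cliff, at the cost of one more tunable parameter.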
How can models trained on different datasets be evaluated? Is a fair evaluation possible?
One solution is to always evaluate on novelty.
Advantages:
- Tracks the ability of the model to generate out-of-distribution molecules.
Disadvantages:
- Tasks are not equally difficult for each model.
- A need for diversity.
[Figure labels: First model objective set; Second model objective set]
Docking [1]
- Difficult, and reflects a real-case scenario.
- Each target has a unique objective chemical space.
- Can be extended to fragment-based optimization.

[1] Cieplinski et al. “We Should at Least Be Able to Design Molecules That Dock Well” https://arxiv.org/abs/2006.16955
Pass a set of chemical rules
- Rotatable bonds < 10
- …
Synthesizability objective
- Add a synthetic accessibility (SA) score filter: SA < 4
Oracle task: (5k and 10k oracle calls)
Evaluation
- Filter on the drug-like objective
- Filter on SA scores
- Filter out molecules similar to the training set
- Compute the evaluation metric: AUC of the top 5% docking scores of novel molecules
- Distribution metrics as a sanity check
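The evaluation steps above can be sketched as follows. The slide does not define how the AUC is computed, so this is one possible reading, stated as an assumption: at each checkpoint of the oracle-call history, keep the molecules that pass all filters, average the best 5% of their docking scores, and report the mean of that curve over checkpoints (a normalized rectangle-rule area). The `Scored` record, its field names, the 0.4 novelty threshold, and the convention that lower docking scores are better are all assumptions for illustration.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Scored:
    rotatable_bonds: int
    sa_score: float
    nn_similarity: float  # max Tanimoto similarity to the training set
    docking_score: float  # assumed convention: lower (more negative) is better


def passes_filters(m, max_rotatable=10, max_sa=4.0, novelty_threshold=0.4):
    """Drug-like rule, SA filter, and novelty filter applied together."""
    return (
        m.rotatable_bonds < max_rotatable
        and m.sa_score < max_sa
        and m.nn_similarity < novelty_threshold
    )


def top_fraction_mean(scores, fraction=0.05):
    """Mean of the best `fraction` of scores (at least one); None if there are none."""
    if not scores:
        return None
    k = max(1, ceil(len(scores) * fraction))
    return sum(sorted(scores)[:k]) / k


def auc_top_fraction(history, checkpoint_every=1, fraction=0.05, empty_value=0.0):
    """Mean over checkpoints of the top-fraction docking score of filtered molecules.

    `history` lists molecules in the order the oracle was called.
    """
    values = []
    for n_calls in range(checkpoint_every, len(history) + 1, checkpoint_every):
        kept = [m.docking_score for m in history[:n_calls] if passes_filters(m)]
        value = top_fraction_mean(kept, fraction)
        values.append(empty_value if value is None else value)
    return sum(values) / len(values) if values else empty_value
```

Because the curve is averaged over the whole call budget, a model that finds good novel molecules early scores better than one that reaches the same final quality late, which is the point of using an AUC instead of a final top score. For budgets of 5k to 10k calls, a larger `checkpoint_every` keeps this naive version fast.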
benchmark in the literature
• We shared a new benchmark direction, which reflects real-case scenarios and addresses the issue of evaluating models in production.
• In particular, with the Elix benchmark, every high-performing model can be trusted in real-case scenarios, which was not the case for existing benchmarks.
Future directions:
• Extend the benchmark to other drug discovery tasks not present in the current literature, especially lead optimization tasks.
previous results with various models
Novelty threshold
- In real scenarios, we are looking for novel molecules.
- We observed that no model produces sufficiently novel molecules.
- All similarities are Tanimoto similarities computed on ECFP4-based fingerprints.
Benchmark
How should those models be evaluated? Is a fair evaluation possible with various training data?

Various training data
Good result: Can’t be trusted | Checking training data is needed
Poor result: Can be trusted | Can be trusted

Elix Benchmark
             | Same training data               | Various training data
Good result  | Can be trusted                   | Can be trusted
Poor result  | Checking training data is needed | Can be trusted

Same training data
an oracle [1]
Oracle types: Descriptors; Machine learning models (Activity, ADME, Property, Toxicity, …); Docking [2]; Human in the loop [3]; …; Others
Compared on: Speed, Accuracy
Speed / Accuracy ratings: +++ ++ ++ ~ – +++ – – ++++

[1] Gao et al. “Sample Efficiency Matters: A Benchmark for Practical Molecular Optimization” https://arxiv.org/abs/2206.12411
[2] Cieplinski et al. “We Should at Least Be Able to Design Molecules That Dock Well” https://arxiv.org/abs/2006.16955
[3] Sundin et al. “Human-in-the-loop assisted de novo molecular design” https://jcheminf.biomedcentral.com/articles/10.1186/s13321-022-00667-8