

wing.nus
October 15, 2024

Triaging Scholarly Information Overload: Using LLMs and Computation Chemistry to find promising candidates for X

by Ziyang Zhang, Yixi Ding, Thushari Pahalage, Jiaying Wu, Giorgia Pastorin, Raye Yeow, Min-Yen Kan.
Presented by Min-Yen Kan at the TUMCreate DRAGON Symposium.

https://www.tum-create.edu.sg/event/dragon-symposium


Transcript

  1. Triaging Scholarly Information Overload: Using LLMs and Computation Chemistry to find promising candidates for X

    Ziyang Zhang, Yixi Ding, Thushari Pahalage, Jiaying Wu, Giorgia Pastorin, Raye Yeow, Min-Yen Kan. 15 Oct 2024, TUM-NUS DRAGON Symposium (Singapore). Slides here: soc-n.us/241015-dragon-kan
  2. 🫣 Spoiler Alert

    How can mining the literature potentially help you? Where do we experience high volumes of tangentially relevant signals for the problems of interest? AI is inherently interdisciplinary, but we need to ask the right questions.

    Topics on the slide: Predict ROS Selectivity; Extrapolating efficacy over K or M candidates; Designing Divergent and Convergent SCC; Selecting SCC metallacages; Characterising host–guest chemistry; EPR effect prediction; Forecasting Translation Scalability; Synthesis Cost Estimation; Discerning Trends in PFAS Removal; Creativity in System Design
  3. Signal to Noise

    “While astronomers often have access to efficient and robust mechanisms that serve to archive, curate, and make primary data available. But very few parallel systems exist for derived data. Because most, if not all, scientific articles in Astronomy are based on derived data, making such data visible, intelligible and available to the public is of fundamental importance.” How Do Astronomers Share Data? (Pepe et al., PLOS One 2014)
  9. Large Language Models encode knowledge about drugs

    o The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4. (Microsoft Research AI4Science, arXiv: 2311.07361)
    o Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. (Lee et al., N Engl J Med 2023)
  10. LLMs can be further augmented by retrieved knowledge

    Almanac — Retrieval-Augmented Language Models for Clinical Medicine (Zakka et al., NEJM AI 2024)
  11. LLM Retrieval + Extraction

    Structured information extraction from scientific text with large language models. (Dagdelen et al., Nature Communications 2024)
  12. Outline

    o Motivation
    o AI + CC Information Curation: A case study of DFU
      • Diabetic Foot Ulcer
      • Qualify & Quantify: AI + CC Workflow
    o Qualify: AI LLM Screening
    o Quantify: CC in silico Simulation
    o Validate: Assay in vitro validation
    o Towards the Future
  13. Diabetic Foot Ulcers

    A serious complication that affects individuals with diabetes.

    Prevalence & Health Impact
    o Globally, every 20 seconds, someone loses a leg due to DFUs. Regardless of the amputation scale, the five-year survival rate is less than 50%.
    o Diabetic (14.9% in SG) → hospitalization due to DFU (6%)

    Challenges in Current Therapies
    o DFU healthcare is highly complex and hard to customize due to personal characteristics (gait, wound shape and micro-environment shift).
    o Conventional drug discovery is time-consuming and costly.

    Source: Guidelines on interventions to enhance healing of foot ulcers in people with diabetes (Chen et al., IWGDF 2023 update)
  14. Theoretical Basis

    Impaired blood flow caused by diabetes mellitus leads to protein dysregulation. In the case of DFUs, such dysregulation hinders the body’s ability to heal wounds effectively.

    o Hypothesis 1: When diabetic wounds induce damaging changes in the concentration of specific proteins, medications inhibiting such changes may prove beneficial.
    o Hypothesis 2: When a particular efficacious therapeutic intervention induces positive changes in the concentration of proteins within the diabetic wound, stimulating such changes may be helpful.
  15. Qualify, then Quantify, then Validate

    o Qualify whether an effect might be present through an LLM search of the literature
    o Quantify effect in silico through quantum chemistry
    o Validate using in vitro assays

    (Figure labels: Larger Number of Candidates → More expense per candidate)
  16. Outline

    o Motivation
    o AI + CC Information Curation: A case study of DFU
    o Qualify: AI LLM Screening
    o Quantify: CC in silico Simulation
    o Validate: Assay in vitro validation
    o Towards the Future
  17. Protein Concentration Change Identification

    1. Signal (disease): Comparison in Diseased and Healthy Tissues. Change in concentration levels of a protein in DFU patients when compared to those in the healthy control group.
    2. Signal (treatment): Comparison Before and After Treatment. Change in concentration levels of a protein after the treatment of DFU, compared to pre-treatment levels.
  18. Step 1: Identify protein concentration changes in the literature

    Identify the recommended therapeutic regulation direction:
    ◦ Increase
    ◦ Decrease
    ◦ Unknown
  19. Step 2: Drug–Protein Matching

    Identify whether the drug's action is consistent with the recommended therapeutic regulation direction:
    • Yes
    • No
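
The two steps above, read together with Hypotheses 1 and 2, can be sketched in code. This is only an illustrative sketch, not the authors' implementation: the names `Direction`, `recommended_direction`, and `is_consistent` are hypothetical, and it assumes that a disease-induced change is damaging (so it should be counteracted) while a change seen after an effective treatment is beneficial (so it should be reinforced).

```python
from enum import Enum


class Direction(Enum):
    INCREASE = "increase"
    DECREASE = "decrease"
    UNKNOWN = "unknown"


def recommended_direction(signal_type: str, observed_change: Direction) -> Direction:
    """Map an observed protein change to a recommended regulation direction.

    Assumption (not from the slides): signal_type is either "disease"
    (DFU vs. healthy tissue) or "treatment" (after vs. before treatment).
    """
    if observed_change is Direction.UNKNOWN:
        return Direction.UNKNOWN
    if signal_type == "treatment":
        # Hypothesis 2: reinforce the change an effective treatment induces.
        return observed_change
    if signal_type == "disease":
        # Hypothesis 1: counteract the change the disease induces.
        if observed_change is Direction.INCREASE:
            return Direction.DECREASE
        return Direction.INCREASE
    raise ValueError(f"unexpected signal type: {signal_type!r}")


def is_consistent(drug_action: Direction, recommended: Direction) -> bool:
    """Step 2: a drug matches only if it pushes the protein the recommended way."""
    return recommended is not Direction.UNKNOWN and drug_action is recommended


# Example: a protein is elevated in DFU tissue, so a drug that lowers it matches.
target = recommended_direction("disease", Direction.INCREASE)
print(target.value, is_consistent(Direction.DECREASE, target))
```

Treating "unknown" as a non-match keeps the shortlist conservative; a real pipeline might instead route such pairs to manual review.
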
  20. Does it work? Validating AI LLM Extraction

    Prompt Strategy: Few-shot setting

    Test Set:
    • 97 (protein, document) pairs
    • 136 manually annotated sentences (“Evidence”)
    • For proteins with multiple detected pieces of evidence, decide the result by voting

    Metrics: Precision
    • Count predictions that are correct with respect to the gold standard
    • Ignore those predicted as unknown
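
The voting and precision rules on this slide can be written down concretely. The sketch below is an assumption-laden illustration rather than the evaluation code used in the talk: the function names are hypothetical, "unknown" evidence is assumed not to vote, and ties are assumed to resolve to "unknown" (the slides do not state a tie-breaking rule).

```python
from collections import Counter


def vote(labels):
    """Majority vote over per-sentence labels for one protein.

    Assumptions: "unknown" labels do not vote; a tie between the
    top two directions (or no votes at all) yields "unknown".
    """
    counts = Counter(label for label in labels if label != "unknown")
    if not counts:
        return "unknown"
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return "unknown"
    return ranked[0][0]


def precision_ignoring_unknown(predicted, gold):
    """Fraction of non-"unknown" predictions that match the gold label."""
    scored = [(p, g) for p, g in zip(predicted, gold) if p != "unknown"]
    if not scored:
        return 0.0
    return sum(p == g for p, g in scored) / len(scored)


# Example with made-up labels (not data from the talk).
predicted = [vote(["increase", "increase", "decrease"]), vote(["increase", "decrease"])]
gold = ["increase", "decrease"]
print(predicted, precision_ignoring_unknown(predicted, gold))
```

Because "unknown" predictions are excluded from the denominator, this metric rewards abstaining on conflicting evidence; it says nothing about how many proteins are left undecided.
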
  21. Evaluation: ~70% precision

    From over 8K protein and 2K drug candidates, our process shortlists 35 drug candidates from 756 protein–drug pairs for correlation study. Folic Acid is newly identified as the most promising candidate.
  22. Challenges

    Phases of Protein Changes
    E.g. "The TGFβ levels significantly increased after 5 days of Hyperbaric Oxygen Therapy (HBOT) and decreased progressively until the end of the treatment, when the lowest plasma levels were observed."

    Imbalance in the number of instances for each protein
    o Instance: a (paper, protein) pair
    o Tend to mark it as “unknown”, e.g., when the two signals conflict
    • Result: Proteins with less evidence are more prone to being overlooked
  23. Outline

    o Motivation
    o AI + CC Information Curation: A case study of DFU
    o Qualify: AI LLM Screening
    o Quantify: CC in silico Simulation
    o Validate: Assay in vitro validation
    o Towards the Future
  24. Quantum Chemistry wavefunction analysis

    (Figure labels: δg inter = 0.001; Folic Acid; FGF: Lysine 74; FGF: Glutamine 64; FGF: Glutamic Acid 66; ΔΔG = –86.20 kcal/mol)
  26. Outline

    o Motivation
    o AI + CC Information Curation: A case study of DFU
    o Qualify: AI LLM Screening
    o Quantify: CC in silico Simulation
    o Validate: Assay in vitro validation
    o Towards the Future
  27. in vitro Wound Healing Assay

    1. Create a gap (wound)
    2. Apply different treatments (drugs)
    3. Measure the cell migration at gap (wound closure)
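
Step 3 is commonly summarised as a percentage of the initial gap that has closed. The slides do not state which measure the authors used, so the following is a sketch under that assumption; `wound_closure_percent` and the area values are hypothetical.

```python
def wound_closure_percent(initial_gap_area: float, current_gap_area: float) -> float:
    """Percent of the initial gap area that cells have closed.

    Assumption: both areas are measured in the same units
    (for example, pixels from images of the same field of view).
    """
    if initial_gap_area <= 0:
        raise ValueError("initial gap area must be positive")
    return 100.0 * (initial_gap_area - current_gap_area) / initial_gap_area


# Example with made-up measurements: the gap shrinks from 200 to 50 units.
print(wound_closure_percent(200.0, 50.0))
```

Comparing this percentage between treated and untreated wells at the same time point is one simple way to contrast treatments.
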
  29. Outline

    o Motivation
    o AI + CC Information Curation: A case study of DFU
    o Qualify: AI LLM Screening
    o Quantify: CC in silico Simulation
    o Validate: Assay in vitro validation
    o Towards the Future
      • at ground level: on Folic Acid
      • at 30,000 feet: Literature Whispers
  30. At the Ground Level: Analysing Byproducts of Folic Acid

    Folic Acid breaks down into:
    o Pyrrolidone Carboxylic Acid (PCA)
    o Pteroylglutamic Acid (PGA)
    o Xanthopterin (XA)
    o Pterin

    How do these byproducts cause or perturb efficacy? How well can we attribute the efficacy of folic acid to its byproducts?
  32. At 30,000 feet: Literature Whispers

    How can mining the literature potentially help you? Where do we experience high volumes of tangentially relevant signals for the problems of interest? AI is inherently interdisciplinary, but we need to ask the right questions.

    Topics on the slide: Predict ROS Selectivity; Extrapolating efficacy over K or M candidates; Functionalising Divergent Convergent SCC; Synthesis Cost Estimation; Characterising host–guest chemistry; EPR effect prediction; Forecasting Translation Scalability; Forecasting Lifetimes of Supramolecular Materials; Discerning Trends in PFAS Removal; Creativity in System Design
  34. Thank you! Hope to discuss with you over dinner!

    Ziyang Zhang, Yixi Ding, Thushari Pahalage, Jiaying Wu, Giorgia Pastorin, Raye Yeow, Min-Yen Kan. Get the slides here: https://soc-n.us/241015-dragon-kan