2012-04-27_PhD_defense

Playing hide and seek on the genomic playground: Unveiling biological function from literature

Public PhD Defense by Sofie Van Landeghem

Sofie Van Landeghem

April 27, 2012

Transcript

  1. Playing hide and seek on the genomic playground: Unveiling biological

    function from literature Sofie Van Landeghem Promotor: Prof. Dr. Yves Van de Peer Co-promotor: Prof. Dr. Bernard De Baets Co-promotor: Dr. Yvan Saeys Zwijnaarde, April 27th, 2012 Public PhD defense
  2. Outline • Introduction • Part 1: Theoretical text mining algorithms

    • Protein-protein interactions • Event extraction • Non-causal relations • Part 2: Practical applications • EVEX: bibliome-wide text mining • Manual browsing • Database and pathway curation • Future prospects
  3. Genes and DNA • DNA: carries genetic information ➔ 4

    nucleobases: A, C, G, T • Gene: basic unit of heredity ➔ Codes for a specific function ➔ All genes present in all cells
  4. Gene prediction agaaaatctgggagctatgtatgtacagctgttggaggaggactgtaaaggggctggtatagacatctacagaaggttgatggttct

    gtgtctacttcctctgtgaatgtgcagacaagcatctcatagggagcctggggctgctattggtccatagagctagtagttgaggagg agacctgggcatggaatagggagagacagggcaaactggaaaccaccagcacctttgcatctgtctctgactgtctccaaccaaca gtaaccttagaacaaaatgactactgctcactgccacctcccaaatattattcttttggccaagtgtagctgggatccattcagggaag gtgattctgcaaaacatagttcccagaataatcaaggtgaaaatagaataatcttgcacacaggtctttgataagactaggaatatat aatacataacagctaggaaaaaatatataaattttccccaagtgcttataatgaacaaacatatttgggaaccatatccacctagcct agtcaactaaattaaaagctggagtcatcttagatgctttc... (DISC1 gene: 421,458 bp) DNA mRNA MPGGGPQGAPAAAGGGGVSHRAGSRDCLPPAACFRRRRLARRPGYMRSSTGPGIGFLSPAVGTLF RFPGGVSGEESHHSESRARQCGLDSRGLLVRSPVSKSAAAPTVTSVRGTSAHFGIQLRGGTRLPDRLS WPCGPGSAGWQQEFAAMDSSETLDASWEAACSDGARRVRAAGSLPSAELSSNSCSPGCGPEVPP TPPGSHSAFTSSFSFIRLSLGSAGERGEAEGCPPSREAESHCQSPQEMGAKAASLDGPHEDPRCLSRP FSLLATRVSADLAQAARNSSRPERDMHSLPDMDPGSSSSLDPSLAGCGGDGSSGSGDAHSWDTLLR KWEPVLRDCLLRNRRQMEVISLRLKLQKLQEDAVENDDYDKAETLQQRLEDLEQEKISLHFQLPSRQ PALSSFLGHLAAQVQAALRRGATQQASGDDTHTPLRMEPRLLEPTAQDSLHVSITRRDWLLQEKQQ LQKEIEALQARMFVLEAKDQQLRREIEEQEQQLQWQGCDLTPLVGQLSLGQLQEVSKALQDTLASA GQIPFHAEPPETIRRYC (547 aa) Protein Transcription Translation Gene expression
  5. Function of Disc1?

    “ ... we have cloned and sequenced the breakpoints of a (1;11)(q42.1;q14.3) translocation linked to schizophrenia ...” Millar et al. Human Molecular Genetics, 2000 “ ... results of karyotypic, clinical, and ERP investigations ...” Blackwood et al. American Journal of Human Genetics, 2001 “... microarray data analysis ... ontological profiling ...” Glatt et al. PNAS, 2005 “... using a combination of recombinant and neuronal cell models ...” Wang et al. Molecular Psychiatry, 2011 MPGGGPQGAPAAAGGGGVSHRAGSRDCLPPAACFRRRRLARRPGYMRSSTGPGIGFLSPAVGTLFRFPGGVSGEESHH SESRARQCGLDSRGLLVRSPVSKSAAAPTVTSVRGTSAHFGIQLRGGTRLPDRLSWPCGPGSAGWQQEFAAMDSSETLD ASWEAACSDGARRVRAAGSLPSAELSSNSCSPGCGPEVPPTPPGSHSAFTSSFSFIRLSLGSAGERGEAEGCPPSREAESHC QSPQEMGAKAASLDGPHEDPRCLSRPFSLLATRVSADLAQAARNSSRPERDMHSLPDMDPGSSSSLDPSLAGCGGDGS SGSGDAHSWDTLLRKWEPVLRDCLLRNRRQMEVISLRLKLQKLQEDAVENDDYDKAETLQQRLEDLEQEKISLHFQLPSR QPALSSFLGHLAAQVQAALRRGATQQASGDDTHTPLRMEPRLLEPTAQDSLHVSITRRDWLLQEKQQLQKEIEALQAR MFVLEAKDQQLRREIEEQEQQLQWQGCDLTPLVGQLSLGQLQEVSKALQDTLASAGQIPFHAEPPETIRRYC
  6. “ ... DISC1 and DISC2 should be considered formal candidate

    genes for susceptibility to psychiatric illness ...” Millar et al. Human Molecular Genetics, 2000 “ ... the recently described genes DISC1 and DISC2, ..., may have a role in the development of a disease phenotype that includes schizophrenia as well as unipolar and bipolar affective disorders ...” Blackwood et al. American Journal of Human Genetics, 2001 “... TNIK mRNA expression was increased in the dorsolateral prefrontal cortex of schizophrenia subjects ... ” Glatt et al. PNAS, 2005 “... DISC1 and TNIK interact to regulate synapse composition and function ...” Wang et al. Molecular Psychiatry, 2011 Function revealed!
  7. “ ... 砫粍 榃 斪昮朐 玾珆玸 橍殧澞 泏狔狑 賌輈鄍 趍

    椻楒 嬽 騩鰒 鰔 狅妵妶 漀 鷵鷕 橚橍 膣 濇燖燏 ...” 齞齝囃 狅妵妶 跠跬 滘 “ ... 稢綌 幋暕楋 糲蘥蠩 慖 酳銪 鶷鷇鶾 蒠蓔蜳 濷瓂癚 蛶 蛃袚觙 麷劻穋 糋罶羬 墏 檎檦 澂 釢髟偛 翍脝艴 麷劻穋 臷菨 輐銛 ...” 譾躒鑅 嬏嶟 劁, 蝑蝞 “... 潧潣瑽 溗煂獂 酳 笢笣雗雘雝 齈龘墻 馻噈嫶 斖蘱 蔏 礛嬨嶵 徲 倱哻圁 毚丮厹 鬄鵊 蜸 躆 ... ” 轕 媝寔嵒 鋧鋓頠 茺苶 “... 轖嫀嵥嵧 枅杺枙 慔 浶洯 轖轕 刲匊呥 鶷鷇鶾 鉌 蜭 鋄銶 邆錉 霋 裍裚詷 軿鉯頏 藙藨蠈 ...” 犤繵 沀皯竻 歅 筩筡 譒蹸 烺焆琀 7 Sofie Van Landeghem, public PhD defense Function revealed?
  8. BioNLP • Natural Language Processing for Biomedical texts • Exponential

    growth of available literature • Goal: formal summarization, hypothesis generation • Challenge 1: Highly ambiguous gene/protein symbols ◦ Synonymy: ESR1 = NR3A1 ◦ Lexical variants: Esr-1, ESR1, Era, ESRα ◦ Abbreviations ◦ Ambiguity: Wasp, diabetes, CAT • Challenge 2: Complexity of natural language ◦ Complex grammatical structures ◦ Speculation, negation
  9. Natural language “Time

    flies like an arrow” “Fruit flies like a banana” “I once shot an elephant in my pajamas. How he got in my pajamas, I’ll never know” - Groucho Marx “The government plans to raise taxes were defeated” “You can always count on the Americans to do the right thing – after they have tried everything else” - Winston Churchill
  10. “ ... DISC1 and DISC2 should be considered formal candidate

    genes for susceptibility to psychiatric illness ...” Millar et al. Human Molecular Genetics, 2000 “ ... the recently described genes DISC1 and DISC2, ..., may have a role in the development of a disease phenotype that includes schizophrenia as well as unipolar and bipolar affective disorders ...” Blackwood et al. American Journal of Human Genetics, 2001 “... TNIK mRNA expression was increased in the dorsolateral prefrontal cortex of schizophrenia subjects ... ” Glatt et al. PNAS, 2005 “... DISC1 and TNIK interact to regulate synapse composition and function ...” Wang et al. Molecular Psychiatry, 2011 BioNLP target
  11. Toolbox: machine learning • Learning complex properties from input data

    • Supervised learning • Training data with known class labels • Hidden test data with unknown class labels • Goal: Automatically predict these unknown labels • Example: classification of horses • Class labels: positive (horse) or negative (not a horse)
  12. Machine learning: example • TRAINING: positive examples, negative examples

    features: brown, hooves, mane, 4 legs • brown, paws, mane, 4 legs • black, hooves, mane, 4 legs • black, paws, no mane, 4 legs • white, hooves, mane, 4 legs • white, hooves, no mane, 4 legs • ...
  13. Machine learning: example • TESTING: unlabeled instances

    features: white, hooves, mane, 4 legs → HORSE • white, paws, mane, 4 legs → NOT A HORSE
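
The horse example can be sketched as a tiny supervised learner. The class labels of the training instances are not preserved in this transcript, so the labels below are an assumption (hooves = horse, paws = not a horse), and the one-rule learner is only an illustrative stand-in, not the SVM used in the thesis.

```python
from collections import Counter

# Training instances from the slides. The True/False labels are ASSUMED,
# because the slide images that carried them are not in the transcript.
TRAIN = [
    (("brown", "hooves", "mane", "4 legs"), True),
    (("brown", "paws", "mane", "4 legs"), False),
    (("black", "hooves", "mane", "4 legs"), True),
    (("black", "paws", "no mane", "4 legs"), False),
    (("white", "hooves", "mane", "4 legs"), True),
    (("white", "hooves", "no mane", "4 legs"), True),
]


def train_one_rule(examples):
    """Pick the single feature position whose values best predict the label."""
    best = None
    for i in range(len(examples[0][0])):
        votes = {}
        for features, label in examples:
            votes.setdefault(features[i], Counter())[label] += 1
        # Each feature value predicts its majority label.
        rule = {value: c.most_common(1)[0][0] for value, c in votes.items()}
        correct = sum(rule[f[i]] == label for f, label in examples)
        if best is None or correct > best[0]:
            best = (correct, i, rule)
    # Fall back to the overall majority label for unseen feature values.
    default = Counter(label for _, label in examples).most_common(1)[0][0]
    _, index, rule = best
    return index, rule, default


def predict(model, features):
    index, rule, default = model
    return rule.get(features[index], default)


model = train_one_rule(TRAIN)
print(predict(model, ("white", "hooves", "mane", "4 legs")))
print(predict(model, ("white", "paws", "mane", "4 legs")))
```

Under the assumed labels, the learner selects the hooves/paws feature, because it is the only one that separates all six training instances.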
  14. Outline • Introduction • Part 1: Theoretical text mining algorithms

    • Protein-protein interactions • Event extraction • Non-causal relations • Part 2: Practical applications • EVEX: bibliome-wide text mining • Manual browsing • Database and pathway curation • Future prospects
  15. NLP framework

    Information retrieval (documents) → Named entity recognition (e.g. CDC42, PAK4, KTN1 marked in the text) → Structured text: <sentence> <text>xxxxx</text> <prot>CDC42</prot> <prot>PAK4</prot> … </sentence> → Relation extraction: interaction(CDC42, PAK4), interaction(CDC42, KTN1), … → Network construction
  16. Relation extraction • Feature generation • Lexical, syntactic, grammatical features

    • Stanford parser: constituency tree and dependency parse • Feature selection • Identifying subset of automatically generated features • Reducing model complexity • Avoiding overfitting of the classifier • Classification: WEKA implementation of LibSVM • Evaluation • Precision: percentage of predictions that are correct • Recall: percentage of annotated statements that are identified • F-score: harmonic mean of precision and recall
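
The three evaluation measures can be written out as a minimal sketch. The function names and the counts in the first example are hypothetical; the second example reuses the precision and recall of the "All events" row reported later in this transcript.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(true_positives, false_positives, false_negatives):
    """Precision, recall and F-score computed from raw prediction counts."""
    predicted = true_positives + false_positives
    annotated = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / annotated if annotated else 0.0
    return precision, recall, f_score(precision, recall)


# Hypothetical counts: 30 correct predictions, 10 spurious, 20 missed.
print(precision_recall_f(30, 10, 20))

# Precision 51.55 and recall 33.41 give the reported F-score of 40.54.
print(round(f_score(51.55, 33.41), 2))
```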
  17. Constituency tree

    • Tree represents the full syntax of a sentence • Phrase chunking: constituents • Part-of-speech (POS) tags • Example: "The tyrosine phosphorylation of STAT1 was enhanced significantly."
  18. Dependency parsing

    • Grammatical relations represented as a graph • More compact • Robust to syntactic variation • Example: "The tyrosine phosphorylation of STAT1 was enhanced significantly."
  19. PPI: feature generation

    Example: "We show here that c-Rel binds to kappa B sites as heterodimers with p50" (root: "binds"; both proteins blinded as PROTX) • Bag-of-words from the sentence: “we”, “show”, “that”, "binds", ... • Vertex walks from connecting path: PROTX nsubj “binds”; “binds” prep_as “heterodimer”; “heterodimer” prep_with PROTX • Edge walks from connecting path: nsubj “binds” prep_as; prep_as “heterodimer” prep_with (Van Landeghem et al. 2008. SMBM)
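
The vertex and edge walks listed on this slide can be reproduced with a short sketch. It assumes the connecting path is already available as an alternating list of vertices and dependency labels; the function names are hypothetical and the parsing step itself is not shown.

```python
def vertex_walks(path):
    """Vertex-edge-vertex triples along an alternating vertex/edge path."""
    return [tuple(path[i:i + 3]) for i in range(0, len(path) - 2, 2)]


def edge_walks(path):
    """Edge-vertex-edge triples along the same path."""
    return [tuple(path[i:i + 3]) for i in range(1, len(path) - 2, 2)]


# Connecting path between the two blinded proteins in the example sentence.
PATH = ["PROTX", "nsubj", "binds", "prep_as", "heterodimer", "prep_with", "PROTX"]

for walk in vertex_walks(PATH):
    print("vertex walk:", walk)
for walk in edge_walks(PATH):
    print("edge walk:", walk)
```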
  20. Protein-protein interactions (Van Landeghem et al. 2008. SMBM; Van Landeghem et al. 2008. BeneLearn)

    • PPI datasets: performance discrepancy (20-27 pp.) • Different initial selection of corpus scope and size • Some only annotate entities involved in relations • Some only include sentences with interactions
    F-score                AIMed   HPRD50   IEPA   LLL
    Co-occurrence          30      55       58     66
    Inst. CV               62      71       71     82
    Abstr. CV              46      55       67     73
    Cross-dataset (test)   38      57       41     40
    → PPI corpora and evaluations are highly biased
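
The gap between the "Inst. CV" and "Abstr. CV" rows can be illustrated with a sketch of the two ways of building cross-validation folds. It assumes these rows stand for instance-level and abstract-level cross-validation; the fold assignment and the toy data are hypothetical and only show why instance-level splitting lets the same abstract appear in both training and test folds.

```python
def instance_level_folds(instances, n_folds):
    """Assign instances to folds one by one, ignoring their source abstract."""
    folds = [[] for _ in range(n_folds)]
    for index in range(len(instances)):
        folds[index % n_folds].append(index)
    return folds


def abstract_level_folds(instances, n_folds):
    """Assign whole abstracts to folds, so their instances stay together."""
    abstract_ids = sorted({abstract_id for abstract_id, _ in instances})
    fold_of = {a: k % n_folds for k, a in enumerate(abstract_ids)}
    folds = [[] for _ in range(n_folds)]
    for index, (abstract_id, _) in enumerate(instances):
        folds[fold_of[abstract_id]].append(index)
    return folds


def shared_abstracts(instances, folds):
    """Abstract ids that occur in more than one fold (a source of leakage)."""
    seen = {}
    for k, fold in enumerate(folds):
        for index in fold:
            seen.setdefault(instances[index][0], set()).add(k)
    return sorted(a for a, ks in seen.items() if len(ks) > 1)


# Hypothetical candidate pairs, two per abstract.
INSTANCES = [
    ("abs1", "pair A-B"), ("abs1", "pair A-C"),
    ("abs2", "pair D-E"), ("abs2", "pair D-F"),
    ("abs3", "pair G-H"), ("abs3", "pair G-I"),
]

print(shared_abstracts(INSTANCES, instance_level_folds(INSTANCES, 2)))
print(shared_abstracts(INSTANCES, abstract_level_folds(INSTANCES, 2)))
```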
  21. Shared Task on Event Extraction

    • BioNLP Shared Task corpus • 800 training, 150 development and 260 test abstracts • 9 event types (“task 1”) • 6 different "physical" event types involving genes/proteins: gene expression, localization, transcription, binding, protein catabolism, phosphorylation • 3 regulation event types, which may also take events as arguments: positive regulation, negative regulation, regulation • Additional information (“task 2”) • E.g. phosphorylation site, localization information, ... • Negation and speculation (“task 3”) • Participation: 24 international teams
  22. Event characterization

    • Trigger word: refers to event type • "phosphorylated", "interaction", "mediates", ... • Theme (affectee) and cause (affector) arguments
  23. Extended framework

    • Separate SVM pipeline for each event type • Trigger detection • Instance creation: trigger + co-occurring genes/proteins • Feature generation (cf. next slide) • Classification using LibSVM • Post-processing • Parallelization • 6 physical event types • 3 regulatory event types • Task 3 • Negation & speculation • Rule-based system (Van Landeghem et al. 2009. BioNLP)
  24. Feature generation

    Example: "We show here that c-Rel binds to kappa B sites as heterodimers with p50" (trigger "binds", a verb, blinded as TRIGGERX; both proteins blinded as PROTX) • Bag-of-words from the sub-graph: “binds”, “heterodimer”, ... • Trigrams from the sub-sentence: “PROTX binds to”, “binds to kappa”, “to kappa B”, ... • Vertex walks from the sub-graph: PROTX nsubj TRIGGERX; TRIGGERX prep_as “heterodimer”; “heterodimer” prep_with PROTX (Van Landeghem et al. 2009. BioNLP)
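
The trigram features from the sub-sentence can be generated with a few lines. This is a minimal sketch that assumes whitespace tokenization and that both proteins have already been blinded as PROTX; the function name is hypothetical.

```python
def ngrams(tokens, n=3):
    """All contiguous n-token windows, joined into strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Sub-sentence spanning the two blinded proteins of the example.
SUB_SENTENCE = "PROTX binds to kappa B sites as heterodimers with PROTX"

for trigram in ngrams(SUB_SENTENCE.split()):
    print(trigram)
```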
  25. Event extraction: results (Van Landeghem et al. 2009. BioNLP)

    F-score per team, in the order of the columns Task1, Task2, Task3 (only the tasks a team entered are listed): UTurku 51.95 • JULIELab 46.66 • ConcordU 44.62, 42.52 • UT+DBCLS 44.35, 43.12 • VIBGhent 40.54, 37.80 • UTokyo 36.88
                        recall   precision   F-score
    Physical events     50.75    67.24       57.85
    Regulatory events   17.36    31.61       22.41
    All events          33.41    51.55       40.54
    Task 3              30.55    49.57       37.80
  26. Feature selection

    • High dimensional datasets • Between 1,800 and 30,000 features • Automatically generated • High complexity for classification algorithm • State-of-the-art performance is still limited (~ 65% F) • We need to understand what is going on • Supervised machine learning (ML) systems dominate the top ranked systems • Black box behaviour • Difficult to understand the nature of the predictions (Van Landeghem, Abeel et al. 2010. Bioinformatics)
  27. Ensemble feature selection (Van Landeghem, Abeel et al. 2010. Bioinformatics)

    • Aggregation of multiple weak feature selectors • Baseline: 65.02 F-score (physical events) • 100 runs: small but consistent improvement
    Feature space   Min. F   Max. F   Avg. F
    75%             64.85    65.33    65.26
    50%             65.60    66.43    65.88
    30%             64.94    66.60    65.86
    25%             65.51    66.82    66.14
    20%             65.08    66.56    65.85
    10%             61.75    64.90    63.59
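
One way to aggregate multiple weak feature selectors is to combine their rankings. The sketch below assumes each selector returns a full ranking of the same features and aggregates them by summed rank position; the slides do not state which aggregation scheme was used, so this choice and all names are assumptions for illustration.

```python
def aggregate_rankings(rankings, keep_fraction):
    """Combine several feature rankings (best first) and keep the top share.

    Features are ordered by the sum of their positions over all rankings,
    with ties broken alphabetically so the result is deterministic.
    """
    features = rankings[0]
    total_rank = {feature: 0 for feature in features}
    for ranking in rankings:
        for position, feature in enumerate(ranking):
            total_rank[feature] += position
    ordered = sorted(features, key=lambda f: (total_rank[f], f))
    n_keep = max(1, int(len(ordered) * keep_fraction))
    return ordered[:n_keep]


# Hypothetical rankings produced by three weak selectors.
RANKINGS = [
    ["a", "b", "c", "d"],
    ["b", "a", "d", "c"],
    ["a", "c", "b", "d"],
]

# Keep 50% of the feature space.
print(aggregate_rankings(RANKINGS, 0.5))
```

In the table above, the "Feature space" column corresponds to the fraction of features retained, which plays the role of `keep_fraction` here.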
  28. Feature clouds

    Visualisation of most discriminative features • Example: binding vertex walks (Van Landeghem, Abeel et al. 2010. Bioinformatics)
  29. Non-causal relations • Relations between named entities (genes/gene products or

    "GGPs") and general domain terms • Termed "non-causal", "static" or "entity" relations • Provide a more detailed view on the information in the sentence, on top of event extraction
  30. Non-causal relations: results (Van Landeghem et al. 2011. BioNLP; Van Landeghem, Björne et al. 2012. BMC Bioinformatics)

    • BioNLP'11 Shared Task: 4 participants (for this task)
                 recall   precision   F-score
    UTurku       50.10    68.04       57.71
    Ghent        47.48    37.04       41.62
    ConcordiaU   24.35    46.85       32.04
    UoS          15.69    23.26       18.74
    • Analysed the 16 pp. discrepancy between the first-ranked (UTurku) and second-ranked (Ghent) systems • Hybrid system • Turku term detection + Ghent relation detection • Conclusion: rule-based term detection module of Ghent system is responsible for performance gap
  31. Outline •Introduction •Part 1: Theoretical text mining algorithms • Protein-protein

    interactions • Event extraction • Non-causal relations •Part 2: Practical applications • EVEX: bibliome-wide text mining • Manual browsing • Database and pathway curation • Future prospects
  32. NLP framework

    Information retrieval (documents) → Named entity recognition (e.g. CDC42, PAK4, KTN1 marked in the text) → Structured text: <sentence> <text>xxxxx</text> <prot>CDC42</prot> <prot>PAK4</prot> … </sentence> → Relation extraction: interaction(CDC42, PAK4), interaction(CDC42, KTN1), … → Network construction
  33. Bibliome-scale event extraction

    • Information retrieval • 21M PubMed abstracts and 400K PMC Open-Access full-texts • Named entity recognition • BANNER (Leaman and Gonzalez, 2008) • Relation extraction • Turku Event Extraction System (TEES) • First in the ST'09, further improved for the ST'11 • Automatically assigned confidences • Analysis: full-text vs. abstracts • 13.5M events from full text, 20.8M events from abstracts • Full text more difficult: 50.72% F vs. 54.37% F • Only 37% of the full-text events are also found in abstracts (Björne et al. 2010. BioNLP; Björne, Van Landeghem et al. 2012. Bioinformatics, under review)
  34. EVEX

    • Original data distributed as millions of flat files • Not trivial to search through • No notion of “equality” of events across articles • Distribution of MySQL database: "EVEX" • Normalization of text symbols: From textual strings to unique identities • Aggregation of equivalent event structures: Identifying equal events within and across articles • Integration with gene families: Enables information retrieval of homologs • Future work: release of API (Van Landeghem et al. 2011. BioNLP)
  35. Refining event structures

    "Ang II induces a rapid increase in MAPK activity." • E1: Positive-Regulation(T:MAPK) • E2: Positive-Regulation(C:Ang II, T:E1) • Final structure: Pos-Reg(C:Ang II, T:Pos-Reg(T:MAPK)) • Refined to: Pos-Reg(C:Ang II, T:MAPK) → Helps to determine equivalence with similar events: number of distinct event structures reduced by 60% → Original structures are preserved (Van Landeghem et al. 2012. Advances in Bioinformatics)
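
The refinement of the Ang II example can be sketched as a recursive rewrite. The rule used here, collapsing a regulation event whose theme is a cause-less event of the same type, is inferred from this single example and is an assumption; the `Event` class and `refine` function are hypothetical names, not the EVEX implementation.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Event:
    type: str
    theme: Union[str, "Event"]
    cause: Optional[Union[str, "Event"]] = None


def refine(event):
    """Collapse X(C:c, T:X(T:t)) into X(C:c, T:t); leave other events as-is."""
    if not isinstance(event, Event):
        return event  # plain entity name or a missing cause
    theme = refine(event.theme)
    cause = refine(event.cause)
    if isinstance(theme, Event) and theme.type == event.type and theme.cause is None:
        theme = theme.theme
    # A new object is returned, so the original structure is preserved.
    return Event(event.type, theme, cause)


original = Event(
    "Positive-Regulation",
    theme=Event("Positive-Regulation", theme="MAPK"),
    cause="Ang II",
)
print(refine(original))
```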
  36. Pairwise relations from events

    "Thrombin augmented EGF-stimulated Akt phosphorylation" • E1: Phosphorylation(T:Akt) • E2: Positive-Regulation(C:EGF, T:E1) • E3: Positive-Regulation(C:Thrombin, T:E2) • Final structure: Pos-Reg(C:Thrombin, T:Pos-Reg(C:EGF, T:Pho(T:Akt))) • Pairwise relations • EGF → Akt • Thrombin → Akt • Indirect relations • Co-regulators: Thrombin and EGF (Van Landeghem et al. 2012. Advances in Bioinformatics)
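
The pairwise relations of the Thrombin example can be derived by walking down the nested structure. This is a minimal sketch that only handles causes given as plain entity names and follows the theme chain to its innermost entity; the `Event` class and helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Event:
    type: str
    theme: Union[str, "Event"]
    cause: Optional[str] = None


def innermost_theme(event):
    """Follow theme arguments down to the entity that is finally affected."""
    while isinstance(event, Event):
        event = event.theme
    return event


def pairwise_relations(event):
    """(cause, target) pairs for every cause found along the theme chain."""
    pairs = []
    while isinstance(event, Event):
        if event.cause is not None:
            pairs.append((event.cause, innermost_theme(event)))
        event = event.theme
    return pairs


def co_regulators(pairs):
    """Group causes that act on the same target entity."""
    by_target = {}
    for cause, target in pairs:
        by_target.setdefault(target, []).append(cause)
    return {t: causes for t, causes in by_target.items() if len(causes) > 1}


e1 = Event("Phosphorylation", theme="Akt")
e2 = Event("Positive-Regulation", theme=e1, cause="EGF")
e3 = Event("Positive-Regulation", theme=e2, cause="Thrombin")

pairs = pairwise_relations(e3)
print(pairs)
print(co_regulators(pairs))
```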
  37. Canonicalization

    • Automatically tagged gene/protein symbols • Often whole noun phrase, e.g. “human Esr-1 gene” • Need to identify common affixes for removal • Dictionary of affixes, listed by occurrence count • Recognition of organism names (Linnaeus + SimString) • Example: “human anti-inflammatory IL-10 gene” → “-ORG- anti-inflammatory IL-10 gene” → “anti-inflammatory IL-10 gene” → “anti-inflammatory IL-10” → final canonical form = “il10” (Van Landeghem et al. 2011. BioNLP)
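
The canonicalization steps of the IL-10 example can be sketched as follows. The organism and affix sets are tiny hypothetical stand-ins for the Linnaeus/SimString recognition and the affix dictionary, and the final lowercasing and removal of non-alphanumeric characters is inferred from the “il10” result rather than stated on the slide.

```python
import re

# Hypothetical stand-ins for organism recognition and the affix dictionary.
ORGANISMS = {"human", "mouse"}
AFFIXES = {"gene", "protein", "anti-inflammatory"}


def canonicalize(mention):
    """Reduce a tagged gene/protein noun phrase to a canonical symbol."""
    # Step 1: drop recognized organism names.
    tokens = [t for t in mention.split() if t.lower() not in ORGANISMS]
    # Step 2: strip known affixes from the end, then from the start.
    while len(tokens) > 1 and tokens[-1].lower() in AFFIXES:
        tokens.pop()
    while len(tokens) > 1 and tokens[0].lower() in AFFIXES:
        tokens.pop(0)
    # Step 3 (assumed): lowercase and keep only letters and digits.
    return re.sub(r"[^a-z0-9]", "", " ".join(tokens).lower())


print(canonicalize("human anti-inflammatory IL-10 gene"))
print(canonicalize("human Esr-1 gene"))
```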
  38. Canonicalization results

    • Originally 67.3M gene/protein mentions • 3235K canonical forms • 2-5% can be linked to gene families (cf. next slides), accounting for 52-60% of all occurrences • Long tail of infrequent gene symbols! • Evaluation on the ST'09 gene/protein mentions • Aims at identifying mentions likely to match databases • Original BANNER matches: 50.2% F-score • After canonicalization: 61.1% F-score (Van Landeghem et al. 2011. BioNLP)
  39. Gene normalization

    • Map textual, ambiguous gene symbols to unique IDs • E.g. “Esr-1” → “AT1G12980” • Context: Arabidopsis thaliana article • Highly ambiguous gene/protein symbols • GenNorm system, developed by Wei et al. 2011 • Ranked first on several criteria in the BioCreative 3 task • 28.6M (43%) gene/protein symbols could be normalized • 120 thousand unique genes • More than 4800 species (bacteria, plants, animals, ...) (Björne, Van Landeghem et al. 2012. Bioinformatics, under review)
  40. Gene family assignment • Retrieve gene families from HomoloGene /

    Ensembl • Assign a family according to normalized gene mention • Rely on canonical form when there is still ambiguity • Inter-species: resolve to the same gene family • Intra-species: resolve to the most commonly used family • Resolve "esr1" to a default family? • Reliable synonym for "estrogen receptor" • Not a reliable synonym for "enhancer of shoot regeneration" • Manual evaluation: 72% of a set of correct event occurrences had both of their arguments assigned to the correct family (Van Landeghem et al. 2011. BioNLP)
  41. Event generalizations (Van Landeghem et al. 2011. BioNLP)

    • Aggregate multiple occurrences of the same event • Define equivalence of events: same type, same structure, equivalent arguments • Different definitions of "equivalence" of arguments:
                                Events   Occurrences   Occ %
    Canonical form              2953K    34.3M         100.0%
    Entrez Gene normalization   748K     15.8M         46.2%
    HomoloGene                  1006K    21.8M         63.5%
    Ensembl                     1042K    23.5M         68.5%
    Ensembl Genomes             1001K    21.4M         62.5%
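
The effect of the different equivalence definitions can be sketched by grouping event occurrences under an interchangeable argument key. The family mapping and the occurrences below are hypothetical toy data; occurrences with an argument that cannot be mapped are skipped, which mirrors how the occurrence counts shrink under the stricter definitions in the table.

```python
from collections import Counter

# Hypothetical mapping from canonical forms to gene family identifiers.
FAMILY = {"esr1": "FAM_ER", "era": "FAM_ER", "tnik": "FAM_TNIK", "disc1": "FAM_DISC1"}

# Hypothetical occurrences: (event type, canonical forms of the arguments).
OCCURRENCES = [
    ("Binding", ("disc1", "tnik")),
    ("Binding", ("disc1", "tnik")),
    ("Phosphorylation", ("esr1",)),
    ("Phosphorylation", ("era",)),
    ("Phosphorylation", ("xyz9",)),
]


def generalize(occurrences, argument_key):
    """Count occurrences per generalized event (same type, equivalent arguments)."""
    counts = Counter()
    for event_type, arguments in occurrences:
        keys = tuple(argument_key(a) for a in arguments)
        if None in keys:
            continue  # at least one argument has no identifier at this level
        counts[(event_type, keys)] += 1
    return counts


by_form = generalize(OCCURRENCES, lambda symbol: symbol)
by_family = generalize(OCCURRENCES, FAMILY.get)
print(len(by_form), sum(by_form.values()))
print(len(by_family), sum(by_family.values()))
```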
  42. Outline •Introduction •Part 1: Theoretical text mining algorithms • Protein-protein

    interactions • Event extraction • Non-causal relations •Part 2: Practical applications • EVEX: bibliome-wide text mining • Manual browsing • Database and pathway curation • Future prospects
  43. EVEX website

    http://www.evexdb.org (Van Landeghem et al. 2012. Advances in Bioinformatics)
  44. EVEX website

    http://www.evexdb.org (Van Landeghem et al. 2012. Advances in Bioinformatics)
  45. Pathway reconstruction

    • For each pair in the pathway, retrieve EVEX events • Visualisation of highest ranked pair for each interaction (Björne, Van Landeghem et al. 2012. Bioinformatics, under review)
  46. Hypothesis generation • Use case on NADP(H)-metabolism in E. coli

    • Integration of EVEX with microarray expression data • Automated text analysis: high recall, fast retrieval • Manual expert validation: ensure high precision (Kaewphan et al. 2012. LREC)
  47. Conclusions (1) • Developed a novel NLP framework for extraction

    of... • Protein-protein interactions • Various other biomolecular events • Non-causal relations • Thorough evaluations on publicly available corpora • Ensemble feature selection techniques in combination with a text mining framework • Producing more cost-effective classifiers • Clues for enhanced feature generation modules • Offering insight into the black-box behaviour of the SVM
  48. Conclusions (2) • EVEX: a large-scale text mining resource •

    34.3M events • Gene normalization • Integration with gene families • Applications and future work • Data integration: locating inconsistencies, aggregating confidences, ... • Network construction • Pathway curation • Hypothesis generation • Automated analysis + manual evaluation! • Playing hide and seek: Human vs. computer: 1-1
  49. Acknowledgments

    • Promotor: Yves Van de Peer • Co-promotors: Bernard De Baets, Yvan Saeys • Ensemble FS: Thomas Abeel • Manual evaluation: Zuzanna Drebert • EVEX DB: Filip Ginter, Jari Björne, Sampo Pyysalo, Chih-Hsuan Wei • EVEX website & API: Kai Hakala • E. coli use-case: Suwisa Kaewphan, Sanna Kreula • Technical support: Marijn Vandevoorde, Michiel Van Bel, IT • Funding: BOF, FWO • Bioinformatics group @ UGent
  51. PPI corpora

    • AIMed • 225 abstracts from DIP: protein-protein interactions • Around 1000 annotated interactions • HPRD50 • 50 abstracts from HPRD, with 92 distinct relations • IEPA • 303 abstracts with protein-protein interactions • LLL • 164 sentences with protein-gene interactions • Training + testing: 164 + 106 interactions
  52. PPI corpora properties • Gene-protein or protein-protein interactions • Symmetrical

    interactions or agent/target roles • Including homodimers or not • Negative instances specified or Closed World Assumption • Explicit test set available or only CV possible • Complete abstracts included or merely a collection of sentences • Different data formats
  53. Event extraction: Precision-recall

    (Van Landeghem et al. 2011. Computational Intelligence)
  54. Event extraction: learning curve

    (Van Landeghem et al. 2011. Computational Intelligence)
  55. Non-causal relations and events

    • Provide clues to extract events • Improvement of event extraction with entity relations: • Overall only marginal increases of performance • Behaviour dependent on specific event types • Phosphorylation and localization are most relevant (Van Landeghem et al. 2010. BioNLP)
  56. Non-causal relations: classification

    • Based on the PPI extraction framework • Named entities (genes, proteins) annotated • Non-named entities • Dictionary approach • Rule-based recognition • Semantic clustering using LSA and MCL • Types • Protein-Component • Subunit-Complex • Equivalence • Member-Collection (Van Landeghem et al. 2011. BioNLP)
  57. PLEV evaluation • 1176 Arabidopsis articles: 1792 of 7691 events

    evaluated • Judge correctness of extracted statements: precision • Recall not as easy: requires annotation of full documents • Classifier perfectly capable of generalizing results • Confidence scores can be used for ranking