function from literature
Sofie Van Landeghem
Promotor: Prof. Dr. Yves Van de Peer
Co-promotor: Prof. Dr. Bernard De Baets
Co-promotor: Dr. Yvan Saeys
Zwijnaarde, April 27th, 2012
Public PhD defense
nucleobases: A, C, G, T
Gene: basic unit of heredity
➔ Codes for a specific function
➔ All genes present in all cells
3 Sofie Van Landeghem, public PhD defense
“ ... we have cloned and sequenced the breakpoints of a (1;11)(q42.1;q14.3) translocation linked to schizophrenia ...” Millar et al. Human Molecular Genetics, 2000 “ ... results of karyotypic, clinical, and ERP investigations ...” Blackwood et al. American Journal of Human Genetics, 2001 “... microarray data analysis ... ontological profiling ...” Glatt et al. PNAS, 2005 “... using a combination of recombinant and neuronal cell models ...” Wang et al. Molecular Psychiatry, 2011 MPGGGPQGAPAAAGGGGVSHRAGSRDCLPPAACFRRRRLARRPGYMRSSTGPGIGFLSPAVGTLFRFPGGVSGEESHH SESRARQCGLDSRGLLVRSPVSKSAAAPTVTSVRGTSAHFGIQLRGGTRLPDRLSWPCGPGSAGWQQEFAAMDSSETLD ASWEAACSDGARRVRAAGSLPSAELSSNSCSPGCGPEVPPTPPGSHSAFTSSFSFIRLSLGSAGERGEAEGCPPSREAESHC QSPQEMGAKAASLDGPHEDPRCLSRPFSLLATRVSADLAQAARNSSRPERDMHSLPDMDPGSSSSLDPSLAGCGGDGS SGSGDAHSWDTLLRKWEPVLRDCLLRNRRQMEVISLRLKLQKLQEDAVENDDYDKAETLQQRLEDLEQEKISLHFQLPSR QPALSSFLGHLAAQVQAALRRGATQQASGDDTHTPLRMEPRLLEPTAQDSLHVSITRRDWLLQEKQQLQKEIEALQAR MFVLEAKDQQLRREIEEQEQQLQWQGCDLTPLVGQLSLGQLQEVSKALQDTLASAGQIPFHAEPPETIRRYC
genes for susceptibility to psychiatric illness ...” Millar et al. Human Molecular Genetics, 2000
“... the recently described genes DISC1 and DISC2, ..., may have a role in the development of a disease phenotype that includes schizophrenia as well as unipolar and bipolar affective disorders ...” Blackwood et al. American Journal of Human Genetics, 2001
“... TNIK mRNA expression was increased in the dorsolateral prefrontal cortex of schizophrenia subjects ...” Glatt et al. PNAS, 2005
“... DISC1 and TNIK interact to regulate synapse composition and function ...” Wang et al. Molecular Psychiatry, 2011
Function revealed!
flies like an arrow”
“Fruit flies like a banana”
“I once shot an elephant in my pajamas. How he got in my pajamas, I’ll never know” - Groucho Marx
“The government plans to raise taxes were defeated”
“You can always count on the Americans to do the right thing – after they have tried everything else” - Winston Churchill
genes for susceptibility to psychiatric illness ...” Millar et al. Human Molecular Genetics, 2000
“... the recently described genes DISC1 and DISC2, ..., may have a role in the development of a disease phenotype that includes schizophrenia as well as unipolar and bipolar affective disorders ...” Blackwood et al. American Journal of Human Genetics, 2001
“... TNIK mRNA expression was increased in the dorsolateral prefrontal cortex of schizophrenia subjects ...” Glatt et al. PNAS, 2005
“... DISC1 and TNIK interact to regulate synapse composition and function ...” Wang et al. Molecular Psychiatry, 2011
BioNLP target
• Supervised learning
• Training data with known class labels
• Hidden test data with unknown class labels
• Goal: Automatically predict these unknown labels
• Example: classification of horses
• Class labels: positive (horse) or negative (not a horse)
Machine learning: example
features:
brown, hooves, mane, 4 legs
brown, paws, mane, 4 legs
black, hooves, mane, 4 legs
black, paws, no mane, 4 legs
white, hooves, mane, 4 legs
white, hooves, no mane, 4 legs
... ...
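The horse example can be sketched in a few lines of Python. This is only an illustration: the feature sets are taken from the slide, but the class labels are assumed (the slide's labels are not preserved in this text), and a simple nearest-neighbour rule stands in for a real classifier.

```python
# Hypothetical training data: the feature sets come from the slide,
# the horse / not-a-horse labels are assumed for illustration only.
TRAINING = [
    ({"brown", "hooves", "mane", "4 legs"}, "horse"),
    ({"brown", "paws", "mane", "4 legs"}, "not a horse"),
    ({"black", "hooves", "mane", "4 legs"}, "horse"),
    ({"black", "paws", "no mane", "4 legs"}, "not a horse"),
]


def predict(features, training=TRAINING):
    """Return the label of the training example that differs in the fewest features."""
    def distance(example):
        example_features, _label = example
        # Size of the symmetric difference = number of features not shared.
        return len(features ^ example_features)

    _best_features, best_label = min(training, key=distance)
    return best_label


# An unseen animal from the hidden test data.
unseen = {"white", "hooves", "mane", "4 legs"}
print(predict(unseen))
```

The colour differs from every training example, but hooves and a mane are shared with the assumed positive examples, so the nearest neighbour is a horse.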
• Stanford parser: constituency tree and dependency parse
• Feature selection
• Identifying subset of automatically generated features
• Reducing model complexity
• Avoiding overfitting of the classifier
• Classification: WEKA implementation of LibSVM
• Evaluation
• Precision: percentage of predictions that are correct
• Recall: percentage of statements that are identified
• F-score: harmonic mean of precision and recall
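The three evaluation measures can be written out as a short Python sketch. The function name and the statement identifiers below are hypothetical; only the definitions of precision, recall and F-score follow the slide.

```python
def precision_recall_f(predicted, gold):
    """Compute precision, recall and F-score for two collections of statements."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: fraction of predictions that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of gold-standard statements that are identified.
    recall = true_positives / len(gold) if gold else 0.0
    # F-score: harmonic mean of precision and recall.
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical example: 4 predictions, 3 of them among the 6 gold statements.
gold = {"s1", "s2", "s3", "s4", "s5", "s6"}
predicted = {"s1", "s2", "s3", "s9"}
p, r, f = precision_recall_f(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

With these numbers precision is 3/4, recall is 3/6, and the F-score lies between the two, closer to the lower value.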
Tree represents the full syntax of a sentence
• Phrase chunking: constituents
• Part-of-speech (POS) tags
"The tyrosine phosphorylation of STAT1 was enhanced significantly".
Grammatical relations represented as a graph
• More compact
• Robust to syntactic variation
"The tyrosine phosphorylation of STAT1 was enhanced significantly".
Bag-of-words from the sentence: “we”, “show”, “that”, "binds" ...
Vertex walks from connecting path:
PROTX nsubj “binds”
“binds” prep_as “heterodimer”
“heterodimer” prep_with PROTX
Edge walks from connecting path:
nsubj “binds” prep_as
prep_as “heterodimer” prep_with
Figure labels: Root "binds", PROTX, PROTX
"We show here that c-Rel binds to kappa B sites as heterodimers with p50"
Van Landeghem et al. 2008. SMBM
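The vertex and edge walks listed on this slide can be reproduced with a minimal Python sketch. It assumes the connecting path is stored as a flat list that alternates vertices and edge labels; the function names are hypothetical and this is not the code used in the cited work.

```python
def vertex_walks(path):
    """Vertex-edge-vertex triples along a path that alternates vertices and edges."""
    return [tuple(path[i:i + 3]) for i in range(0, len(path) - 2, 2)]


def edge_walks(path):
    """Edge-vertex-edge triples along the same alternating path."""
    return [tuple(path[i:i + 3]) for i in range(1, len(path) - 2, 2)]


# Connecting path between the two blinded proteins in the example sentence.
path = ["PROTX", "nsubj", "binds", "prep_as", "heterodimer", "prep_with", "PROTX"]

for walk in vertex_walks(path):
    print("vertex walk:", " ".join(walk))
for walk in edge_walks(path):
    print("edge walk:  ", " ".join(walk))
```

Each walk becomes one feature, so the classifier sees small, overlapping fragments of the dependency path rather than the full path as a single string.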
2008. BeneLearn
• PPI datasets: performance discrepancy (20-27 pp.)
• Different initial selection of corpus scope and size
• Some only annotate entities involved in relations
• Some only include sentences with interactions
F-score                 AIMed  HPRD50  IEPA  LLL
Co-occurrence             30     55     58    66
Inst. CV                  62     71     71    82
Abstr. CV                 46     55     67    73
Cross-dataset (test)      38     57     41    40
→ PPI corpora and evaluations are highly biased
Protein-protein interactions
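One way to read the gap between the "Inst. CV" and "Abstr. CV" rows is sketched below in Python. It assumes that abstract-wise cross-validation keeps all candidate instances from one abstract in the same fold, whereas instance-wise cross-validation may spread them over training and test folds; the slide does not spell this out, and the function name and abstract identifiers are hypothetical.

```python
def abstract_wise_folds(instance_abstracts, k):
    """Assign a fold to every instance so that instances from one abstract share a fold.

    instance_abstracts: list with the abstract ID of each instance.
    k: number of cross-validation folds.
    """
    fold_of_abstract = {}
    for abstract_id in instance_abstracts:
        if abstract_id not in fold_of_abstract:
            # Distribute abstracts over folds in round-robin order of first appearance.
            fold_of_abstract[abstract_id] = len(fold_of_abstract) % k
    return [fold_of_abstract[abstract_id] for abstract_id in instance_abstracts]


# Six candidate protein pairs originating from three abstracts.
abstracts = ["a1", "a1", "a2", "a3", "a3", "a3"]
print(abstract_wise_folds(abstracts, k=2))
```

Under this assumption, no test instance shares an abstract with a training instance, which removes one source of optimistic bias.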
from the sub-graph “binds”, “heterodimer”, ...
Trigrams from the sub-sentence: “PROTX binds to”, “binds to kappa”, “to kappa B”, ...
Vertex walks from the sub-graph:
PROTX nsubj TRIGGERX
TRIGGERX prep_as “heterodimer”
“heterodimer” prep_with PROTX
Figure labels: Trigger "binds", Verb, PROTX, PROTX, TRIGGERX
We show here that c-Rel binds to kappa B sites as heterodimers with p50
Van Landeghem et al. 2009. BioNLP
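The trigram features can be illustrated with a short Python sketch. It assumes whitespace tokenisation and that both protein names have already been replaced by PROTX; the function name is hypothetical.

```python
def ngrams(tokens, n=3):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Sub-sentence between the two blinded proteins of the example.
sub_sentence = "PROTX binds to kappa B sites as heterodimers with PROTX"
trigrams = ngrams(sub_sentence.split(), n=3)
print(trigrams[:3])
```

The first three trigrams printed here are the ones quoted on the slide.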
High dimensional datasets
• Between 1,800 and 30,000 features
• Automatically generated
• High complexity for classification algorithm
• State-of-the-art performance is still limited (~ 65% F)
• We need to understand what is going on
• Supervised machine learning (ML) systems dominate the top ranked systems
• Black box behaviour
• Difficult to understand the nature of the predictions
"GGPs") and general domain terms • Termed "non-causal", "static" or "entity" relations • Provide a more detailed view on the information in the sentence, on top of event extraction 30 Sofie Van Landeghem, public PhD defense
2012. Bioinformatics (under review)
Bibliome-scale event extraction
• Information retrieval
• 21M PubMed abstracts and 400K PMC Open-Access full texts
• Named entity recognition
• BANNER (Leaman and Gonzalez, 2008)
• Relation extraction
• Turku Event Extraction System (TEES)
• First in the ST'09, further improved for the ST'11
• Automatically assigned confidences
• Analysis: full text vs. abstracts
• 13.5M events from full text, 20.8M events from abstracts
• Full text more difficult: 50.72% F vs. 54.37% F
• Only 37% of the full-text events are also found in abstracts
as millions of flat files
• Not trivial to search through
• No notion of “equality” of events across articles
• Distribution of MySQL database: "EVEX"
• Normalization of text symbols: From textual strings to unique identities
• Aggregation of equivalent event structures: Identifying equal events within and across articles
• Integration with gene families: Enables information retrieval of homologs
• Future work: release of API
EVEX
induces a rapid increase in MAPK activity."
E1: Positive-Regulation(T: MAPK)
E2: Positive-Regulation(C:Ang II, T:E1)
Final structure: Pos-Reg(C:Ang II, T:Pos-Reg(T:MAPK))
Refining event structures
Refined to: Pos-Reg(C:Ang II, T:MAPK)
→ Helps to determine equivalence with similar events: number of distinct event structures reduced by 60%
→ Original structures are preserved
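The refinement shown on this slide can be sketched in Python. The rule used here is an assumption inferred from the single example: a nested event of the same type that has a theme but no cause is collapsed into its parent. The `Event` class and `refine` function are hypothetical names, not part of EVEX.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Event:
    type: str
    theme: Union["Event", str]   # T: a gene/protein or a nested event
    cause: Optional[str] = None  # C: optional cause argument


def refine(event):
    """Return a refined copy; the original (immutable) event is left untouched."""
    theme = event.theme
    if not isinstance(theme, Event):
        return event
    theme = refine(theme)
    # Assumed rule: collapse a cause-less nested event of the same type.
    if theme.type == event.type and theme.cause is None:
        return Event(event.type, theme.theme, event.cause)
    return Event(event.type, theme, event.cause)


e1 = Event("Pos-Reg", theme="MAPK")
e2 = Event("Pos-Reg", theme=e1, cause="Ang II")
print(refine(e2))
```

Because refined events compare equal when their type and arguments match, equivalent statements from different articles can be grouped with an ordinary dictionary or set.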
Originally 67.3M gene/protein mentions
• 3235K canonical forms
• 2-5% can be linked to gene families (cf. next slides), accounting for 52-60% of all occurrences
• Long tail of infrequent gene symbols!
• Evaluation on the ST'09 gene/protein mentions
• Aims at identifying mentions likely to match databases
• Original BANNER matches: 50.2% F-score
• After canonicalization: 61.1% F-score
Van Landeghem et al. 2011. BioNLP
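A rough idea of what a canonical form could look like is sketched below in Python. The actual canonicalization rules are not given on the slide; lowercasing and stripping non-alphanumeric characters is an assumption made purely for illustration, and the function name is hypothetical.

```python
import re
from collections import defaultdict


def canonical_form(symbol):
    """Assumed rule: lowercase the symbol and drop everything except a-z and 0-9."""
    return re.sub(r"[^a-z0-9]", "", symbol.lower())


# Group surface variants of gene/protein mentions by their canonical form.
mentions = ["Esr-1", "ESR1", "esr 1", "TNIK", "Tnik"]
groups = defaultdict(list)
for mention in mentions:
    groups[canonical_form(mention)].append(mention)

for form, variants in groups.items():
    print(form, "<-", variants)
```

Under this rule the spelling variants "Esr-1", "ESR1" and "esr 1" collapse to one canonical form; symbols containing Greek letters would need an extra mapping step.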
normalization
• Map textual, ambiguous gene symbols to unique IDs
• E.g. “Esr-1” → “AT1G12980”
• Context: Arabidopsis thaliana article
• Highly ambiguous gene/protein symbols
• GenNorm system, developed by Wei et al. 2011
• Ranked first on several criteria in the BioCreative 3 task
• 28.6M (43%) gene/protein symbols could be normalized
• 120 thousand unique genes
• More than 4800 species (bacteria, plants, animals, ...)
Ensembl
• Assign a family according to normalized gene mention
• Rely on canonical form when there is still ambiguity
• Inter-species: resolve to the same gene family
• Intra-species: resolve to the most commonly used family
• Resolve "esr1" to a default family?
• Reliable synonym for "estrogen receptor"
• Not a reliable synonym for "enhancer of shoot regeneration"
• Manual evaluation: 72% of a set of correct event occurrences had both of their arguments assigned to the correct family
Van Landeghem et al. 2011. BioNLP
(under review)
• For each pair in the pathway, retrieve EVEX events
• Visualisation of highest ranked pair for each interaction
• Integration of EVEX with microarray expression data
• Automated text analysis: high recall, fast retrieval
• Manual expert validation: ensure high precision
Kaewphan et al. 2012. LREC
of...
• Protein-protein interactions
• Various other biomolecular events
• Non-causal relations
• Thorough evaluations on publicly available corpora
• Ensemble feature selection techniques in combination with a text mining framework
• Producing more cost-effective classifiers
• Clues for enhanced feature generation modules
• Offering insight into the black-box behaviour of the SVM
34.3M events
• Gene normalization
• Integration with gene families
• Applications and future work
• Data integration: locating inconsistencies, aggregating confidences, ...
• Network construction
• Pathway curation
• Hypothesis generation
• Automated analysis + manual evaluation!
Playing hide and seek
Human vs. computer: 1-1
interactions or agent/target roles
• Including homodimers or not
• Negative instances specified or Closed World Assumption
• Explicit test set available or only CV possible
• Complete abstracts included or merely a collection of sentences
• Different data formats
• Provide clues to extract events:
• Improvement of event extraction with entity relations:
• Overall only marginal increases of performance
• Behaviour dependent on specific event types
• Phosphorylation and localization are most relevant
evaluated
• Judge correctness of extracted statements: precision
• Recall not as easy: requires annotation of full documents
• Classifier perfectly capable of generalizing results
• Confidence scores can be used for ranking