2009-06-05-BioNLP_ST

Sofie Van Landeghem
June 05, 2009

Analyzing text in search of bio-molecular events: a high-precision machine learning framework

Presentation given by Sofie Van Landeghem at BioNLP 2009, Boulder, Colorado

Transcript

  1. Analyzing text in search of bio-molecular events: a high-precision machine learning framework
    Sofie Van Landeghem, Yvan Saeys, Bernard De Baets, Yves Van de Peer
    Friday June 5th, 2009
    BioNLP 2009, Boulder, Colorado
  2. General ML pipeline
    Pipeline for one specific event type:
    Trigger extraction -> Candidate arguments -> Candidate instances -> Negative instances filter -> Feature extraction -> Classification -> Predictions
  3. Framework overview
    - Transcription pipeline, Phosphorylation pipeline, ... -> Merged predictions
    - 3 regulation pipelines
    - Consistency check -> Results task 1
    - Negation rules, Speculation rules -> Results task 3
  4. Trigger dictionaries
    - E.g. “phosphorylated”, “overexpression”
    - Porter stemming
    - Separate dictionary for each type of event
    - Compiled automatically from training data
    - Manually filtered to remove general words
      - Such as “are”, “via” or “through”
    - Binding: distinction between
      - “Single” (e.g. “homodimer”, “binding site”)
      - “Multiple” (e.g. “heterodimer”, “complex”)
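
The dictionary-building steps on this slide can be sketched roughly as follows. This is an illustration only: `crude_stem` is a toy suffix stripper standing in for the Porter stemmer, the stop list contains just the three words named on the slide, and the event-type labels and the `(event_type, trigger_word)` input format are assumptions.

```python
from collections import defaultdict

# General words removed from the dictionaries; only these three are named on the slide.
GENERAL_WORDS = {"are", "via", "through"}


def crude_stem(word):
    """Toy suffix stripper used as a stand-in for Porter stemming (NOT the Porter algorithm)."""
    word = word.lower()
    for suffix in ("ation", "ated", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_trigger_dictionaries(annotations):
    """Build one set of stemmed triggers per event type.

    annotations: iterable of (event_type, trigger_word) pairs (assumed format).
    """
    dictionaries = defaultdict(set)
    for event_type, trigger in annotations:
        if trigger.lower() in GENERAL_WORDS:
            continue  # a fixed list mimics the manual filtering step
        dictionaries[event_type].add(crude_stem(trigger))
    return dict(dictionaries)


# Hypothetical training annotations.
training = [
    ("Phosphorylation", "phosphorylated"),
    ("Phosphorylation", "phosphorylation"),
    ("Gene_expression", "overexpression"),
    ("Gene_expression", "via"),
]
trigger_dicts = build_trigger_dictionaries(training)
```

Stemming lets both “phosphorylated” and “phosphorylation” fall under one dictionary entry, while “via” is dropped as a general word.
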
  5. Instance creation
    1. Finding triggers in the text
    2. Initially: all (combinations of) proteins / events that appear in the same sentence as the trigger
      - Lots of noise -> high-dimensional datasets
    3. Implementation of Negative-instances (NI) filter
      - Checks the length of the sub-sentence spanned by the candidate event & applies a cut-off value
      - Checks the size of the subgraph of the dependency graph, corresponding to the candidate event & applies a cut-off
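
A minimal sketch of candidate generation followed by the sub-sentence length check of the NI filter. Token positions stand in for real sentence data, and the function names, the `max_args` limit and the cut-off value of 10 are all assumptions chosen for illustration; the dependency-subgraph size check is not shown.

```python
from itertools import combinations


def candidate_instances(trigger_idx, protein_idxs, max_args=2):
    """All argument combinations (size 1..max_args) for one trigger in one sentence."""
    candidates = []
    for size in range(1, max_args + 1):
        for args in combinations(protein_idxs, size):
            candidates.append((trigger_idx, args))
    return candidates


def span_length(candidate):
    """Number of tokens in the sub-sentence spanned by the trigger and its arguments."""
    trigger_idx, args = candidate
    positions = (trigger_idx, *args)
    return max(positions) - min(positions) + 1


def ni_filter(candidates, max_span):
    """Drop candidates whose spanned sub-sentence is longer than the cut-off."""
    return [c for c in candidates if span_length(c) <= max_span]


# Hypothetical sentence: trigger at token 3, proteins at tokens 1, 5 and 20.
candidates = candidate_instances(3, [1, 5, 20])
kept = ni_filter(candidates, max_span=10)  # cut-off value is arbitrary here
```

In this toy sentence every candidate involving the distant protein at token 20 is discarded, which is the kind of noise reduction the filter is meant to provide.
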
  6. Dimensionality reduction
    Distribution of Binding instances for different design choices (training data)

    Design choice                                     Type       # instances   positive instances
    Manually filtered dictionaries                    Binding    34612         2%
    Distinction between single and multiple binding   Single     4708          11%
                                                      Multiple   3861          5%
    Application of the NI filter                      Single     4070          13%
                                                      Multiple   2365          8%
  7. Feature generation
    - Stanford dependency parsing: smallest subgraph
    - Vertex walks extracted from the dependency subgraph
      - Vertex – edge – vertex
      - Lexical variant: trigger/protein blinded, e.g. “trigger nsubj protx”
      - Syntactic variant: e.g. “nn nsubj nn”
    - Bag-of-words: nodes of dependency graph
      - Excludes uninformative words such as prepositions
    - Stemmed trigrams, e.g. “by induc transcript”
    - Lexical and syntactic information of the triggers
    - Length of the sub-sentence & size of the subgraph
    - Regulation: whether arguments are proteins or events
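
To make the vertex-walk and trigram features concrete, here is a small sketch. The edge triples, tokens (`ProtA`, `ProtB`, `activates`), POS tags and helper names are hypothetical, and the input to `trigrams` is assumed to be stemmed already; a real system would take these from the dependency parser and stemmer.

```python
def vertex_walks(edges, relabel):
    """Turn (head, relation, dependent) edges into 'vertex edge vertex' strings."""
    return [f"{relabel(head)} {rel} {relabel(dep)}" for head, rel, dep in edges]


def trigrams(stems):
    """Sliding window of three consecutive (already stemmed) tokens."""
    return [" ".join(stems[i:i + 3]) for i in range(len(stems) - 2)]


# Hypothetical dependency subgraph for one candidate event.
edges = [("activates", "nsubj", "ProtA"), ("activates", "dobj", "ProtB")]
blinded = {"activates": "trigger", "ProtA": "protx", "ProtB": "protx"}
pos_tags = {"activates": "vbz", "ProtA": "nn", "ProtB": "nn"}

# Lexical variant: trigger and protein names are blinded.
lexical = vertex_walks(edges, lambda w: blinded.get(w, w))
# Syntactic variant: every vertex is replaced by its part-of-speech tag.
syntactic = vertex_walks(edges, lambda w: pos_tags[w])
stem_trigrams = trigrams(["by", "induc", "transcript", "of"])
```

Blinding keeps the features from memorizing specific protein names, so a pattern learned on one protein can transfer to another.
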
  8. Classification
    - High-dimensional and highly unbalanced datasets
    - Support vector machine (SVM)
      - LibSVM implementation as provided by WEKA
      - Kernel type: radial basis function (default)
      - Internal 5-fold CV loop to tune parameters
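
The internal cross-validation loop can be sketched independently of WEKA. The code below only shows the tuning logic: the actual SVM training is hidden behind a caller-supplied `fit_and_score` function, and the grid over `C` and `gamma` (the usual parameters of an RBF-kernel SVM) is an assumption, since the slide does not say which parameters were tuned or over which values.

```python
def k_fold_indices(n, k=5):
    """Yield (train, test) index lists for k folds over range(n)."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def tune(param_grid, n_samples, fit_and_score, k=5):
    """Return the parameter setting with the best mean score over k folds."""
    best_params, best_score = None, float("-inf")
    for params in param_grid:
        scores = [
            fit_and_score(params, train, test)
            for train, test in k_fold_indices(n_samples, k)
        ]
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score


# Hypothetical grid and a placeholder scorer that ignores the data.
grid = [{"C": c, "gamma": g} for c in (1, 10, 100) for g in (0.01, 0.1)]


def toy_score(params, train, test):
    # Placeholder: pretends that C=10, gamma=0.1 generalizes best.
    return 1.0 - abs(params["C"] - 10) / 100 - abs(params["gamma"] - 0.1)


best, score = tune(grid, n_samples=20, fit_and_score=toy_score)
```

In practice `fit_and_score` would train the classifier on the `train` indices and return a metric on the `test` indices; for unbalanced data a metric such as F-score would likely be more informative than accuracy.
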
  9. Consistency check (1): Overlapping triggers of different event types
    - Predictions for different event types are processed in parallel, independently of each other, and merged afterwards
    - One word in the text might lead to two distinct triggers of different types
      - E.g. “expression” can lead to both a Transcription and a Gene expression event
      - But in real life, this never happens at the same time!
    -> Keep only the prediction with the highest SVM score
    -> Minimal overlap between dictionaries can prevent this inconsistency from occurring in the first place
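
One possible reading of this rule, as a sketch: for each trigger span, the event type of the highest-scoring prediction wins, and predictions of other types on that span are dropped. The dictionary keys (`span`, `type`, `score`) and the scores are hypothetical, and keeping several same-type predictions on a span is an interpretation, not something the slide states.

```python
def resolve_type_conflicts(predictions):
    """Per trigger span, keep only predictions of the best-scoring event type."""
    best_type = {}
    for p in sorted(predictions, key=lambda p: p["score"], reverse=True):
        # The first prediction seen for a span is its highest-scoring one.
        best_type.setdefault(p["span"], p["type"])
    return [p for p in predictions if p["type"] == best_type[p["span"]]]


# Hypothetical merged predictions; spans are character offsets of the trigger.
merged = [
    {"span": (10, 20), "type": "Transcription", "score": 0.4},
    {"span": (10, 20), "type": "Gene_expression", "score": 0.9},
    {"span": (30, 35), "type": "Binding", "score": 0.2},
]
consistent = resolve_type_conflicts(merged)
```
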
  10. Consistency check (2): Different events of the same type, based on the same trigger
    - One trigger is involved in many events of the same type
      - E.g. “It induces expression of STAT5-regulated genes in CTLL-2, i.e. beta-casein, and oncostatin M (OSM)”
      - 2 Gene expression events based on the trigger “expression”, one with beta-casein and one with OSM.
      - This happens often, and the predictions will have similar SVM scores
    -> However, for some types (Protein catabolism & Phosphorylation), usually only one true event is linked to each trigger -> keep only top-ranked result
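
A sketch of the top-ranked rule for the two event types named on the slide. The record layout (`span`, `type`, `theme`, `score`), the type labels and the example scores are assumptions for illustration.

```python
# Types for which usually only one true event is linked to each trigger.
SINGLE_EVENT_TYPES = {"Protein_catabolism", "Phosphorylation"}


def keep_top_ranked(predictions):
    """For single-event types, keep only the best-scoring prediction per trigger."""
    kept, seen = [], set()
    for p in sorted(predictions, key=lambda p: p["score"], reverse=True):
        key = (p["span"], p["type"])
        if p["type"] in SINGLE_EVENT_TYPES:
            if key in seen:
                continue  # a higher-ranked event already uses this trigger
            seen.add(key)
        kept.append(p)
    return kept  # ordered by descending score


# Hypothetical predictions sharing triggers.
preds = [
    {"span": (5, 20), "type": "Phosphorylation", "theme": "P1", "score": 0.7},
    {"span": (5, 20), "type": "Phosphorylation", "theme": "P2", "score": 0.3},
    {"span": (40, 50), "type": "Gene_expression", "theme": "P3", "score": 0.6},
    {"span": (40, 50), "type": "Gene_expression", "theme": "P4", "score": 0.5},
]
filtered = keep_top_ranked(preds)
```

Both Gene expression events survive because that type may legitimately have several events per trigger, while only the top Phosphorylation event is kept.
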
  11. Negation
    - Three cases
      1. Negation construct in close vicinity of the trigger
        -> “CsA was found not to inhibit lck gene expression.”
      2. Trigger is inherently negative
        -> “This was associated with a reduction in endothelial MCP-1 secretion and GRO-alpha immobilization.”
      3. The “but not” pattern
        -> “Overexpression of Vav, but not SLP-76, augments CD28-induced IL-2 promoter activity.”
    - Custom-made rule-based system: locating certain words, triggers and patterns that indicate negation
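
The three cases could be approximated with rules like the ones below. This is a rough sketch, not the rule set used in the talk: the cue words, the inherently negative trigger list (here assumed to contain “immobilization”) and the three-token window are all assumptions.

```python
import re

NEGATION_CUES = {"not", "no", "never"}      # assumed cue words
NEGATIVE_TRIGGERS = {"immobilization"}      # assumed inherently negative triggers


def is_negated(sentence, trigger, theme, window=3):
    """Apply three simple negation rules to one (trigger, theme) event."""
    lowered = sentence.lower()
    tokens = re.findall(r"[\w-]+", lowered)
    trigger, theme = trigger.lower(), theme.lower()
    # Case 3: the "but not <theme>" pattern.
    if re.search(rf"\bbut not {re.escape(theme)}\b", lowered):
        return True
    # Case 2: the trigger is inherently negative.
    if trigger in NEGATIVE_TRIGGERS:
        return True
    # Case 1: a negation cue within a few tokens before the trigger.
    if trigger in tokens:
        i = tokens.index(trigger)
        return any(t in NEGATION_CUES for t in tokens[max(0, i - window):i])
    return False


s1 = "CsA was found not to inhibit lck gene expression."
s3 = "Overexpression of Vav, but not SLP-76, augments CD28-induced IL-2 promoter activity."
case1 = is_negated(s1, "inhibit", "lck")
case3_slp = is_negated(s3, "Overexpression", "SLP-76")
case3_vav = is_negated(s3, "Overexpression", "Vav")
```

The fixed token window is deliberately crude; a real system would more likely check the negation cue against the dependency structure around the trigger.
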
  12. Speculation
    - Two cases
      1. Uncertainty (stating the research)
        -> “We examined the ability of type I and type II IFNs to regulate activation of STAT6 by IL-4 in primary human monocytes.”
      2. Hypothesis (interpreting the research)
        -> “(…) suggesting that these nuclear proteins may determine the IP-10 mRNA inducibility by IFNgamma.”
    - Custom-made rule-based system: locating certain words that indicate speculation
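
A word-list lookup in the spirit of this slide might look like the sketch below. The cue lists are assumptions drawn from the two example sentences plus a few similar words, and the check works at sentence level, whereas the shared task asks for speculation per event.

```python
import re

# Assumed cue words; the slide does not enumerate the actual lists.
UNCERTAINTY_CUES = {"examined", "investigated", "studied"}
HYPOTHESIS_CUES = {"suggesting", "may", "might"}


def speculation_type(sentence):
    """Return 'uncertainty', 'hypothesis' or None for a sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    if words & UNCERTAINTY_CUES:
        return "uncertainty"
    if words & HYPOTHESIS_CUES:
        return "hypothesis"
    return None


u = speculation_type(
    "We examined the ability of type I and type II IFNs to regulate "
    "activation of STAT6 by IL-4 in primary human monocytes."
)
h = speculation_type(
    "(...) suggesting that these nuclear proteins may determine "
    "the IP-10 mRNA inducibility by IFNgamma."
)
```
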
  13. Performance - Task 3
    - The results of this subtask heavily depend on the results of subtask 1
    - When we only consider events found in subtask 1, recall of the rule-based system is actually higher: above 50%.
  14. Thanks to ...
    - My supervisors
      - Yvan Saeys
      - Yves Van de Peer
      - Bernard De Baets
    - The whole Bioinformatics team @ University of Ghent, Belgium
    - Many thanks to the BioNLP’09 organizers, for offering to the community a very valuable and well organized task about event extraction!