bio-molecular events: a high-precision machine learning framework
Sofie Van Landeghem, Yvan Saeys, Bernard De Baets, Yves Van de Peer
Friday June 5th, 2009
BioNLP 2009, Boulder, Colorado
[Pipeline overview diagram: ... ; merged predictions; 3 regulation pipelines; consistency check; results task 1; negation rules; speculation rules; results task 3]
Separate dictionary for each type of event
- Compiled automatically from training data
- Manually filtered to remove general words, such as “are”, “via” or “through”
Binding: distinction between
- “Single” (e.g. “homodimer”, “binding site”)
- “Multiple” (e.g. “heterodimer”, “complex”)
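The dictionary step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `(event_type, trigger_text)` input format, the function name and the automatic stop list are assumptions (the slide states that the filtering was done manually and names only three general words).

```python
from collections import defaultdict

# Assumed stop list: the slide names only these three general words.
GENERAL_WORDS = {"are", "via", "through"}

def build_trigger_dictionaries(annotations, general_words=GENERAL_WORDS):
    """Collect one set of trigger words per event type.

    `annotations` is an iterable of (event_type, trigger_text) pairs,
    a hypothetical simplification of the training annotations.
    """
    dictionaries = defaultdict(set)
    for event_type, trigger in annotations:
        word = trigger.lower().strip()
        # Skip empty strings and overly general words.
        if word and word not in general_words:
            dictionaries[event_type].add(word)
    return dict(dictionaries)

training = [
    ("Binding", "heterodimer"),
    ("Binding", "via"),
    ("Gene_expression", "expression"),
    ("Transcription", "expression"),
]
dictionaries = build_trigger_dictionaries(training)
```

Note that the same word may end up in more than one dictionary, as “expression” does here.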
: all (combinations of) proteins / events that appear in the same sentence as the trigger
- Lots of noise -> high-dimensional datasets
3. Implementation of Negative-instances (NI) filter
- Checks the length of the sub-sentence spanned by the candidate event & applies a cut-off value
- Checks the size of the subgraph of the dependency graph corresponding to the candidate event & applies a cut-off
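A minimal sketch of such a filter is given below. The `Candidate` class and both cut-off values are hypothetical; the slides do not state the actual thresholds or how they were chosen.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    tokens_spanned: int  # length of the sub-sentence covering trigger and arguments
    subgraph_size: int   # number of nodes in the corresponding dependency subgraph

def passes_ni_filter(candidate, max_tokens=20, max_subgraph=8):
    """Return True if the candidate is kept. Both cut-offs are illustrative only."""
    return (candidate.tokens_spanned <= max_tokens
            and candidate.subgraph_size <= max_subgraph)

candidates = [Candidate(6, 3), Candidate(35, 5), Candidate(12, 11)]
kept = [c for c in candidates if passes_ni_filter(c)]
```

Candidates that fail either check are discarded before classification, which reduces the number of negative instances.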
dictionaries
  Binding                                          34,612    2%
Distinction between single and multiple binding
  Single                                             4708   11%
  Multiple                                           3861    5%
Application of the NI filter
  Single                                             4070   13%
  Multiple                                           2365    8%
Distribution of Binding instances for different design choices (training data)
Vertex walks extracted from the dependency subgraph
- Vertex – edge – vertex
- Lexical variant: trigger/protein blinded, e.g. “trigger nsubj protx”
- Syntactic variant: e.g. “nn nsubj nn”
Bag-of-words: nodes of dependency graph
- Excludes uninformative words such as prepositions
Stemmed trigrams, e.g. “by induc transcript”
Lexical and syntactic information of the triggers
Length of the sub-sentence & size of the subgraph
Regulation: whether arguments are proteins or events
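Two of these feature types, vertex walks and stemmed trigrams, could be generated along the following lines. This is only a sketch: the edge representation as `(head, label, dependent)` triples, the `blinded` and `pos_tags` mappings and the function names are assumptions, and the stems are taken as given rather than produced by a stemmer.

```python
def vertex_walks(edges, blinded, pos_tags):
    """Build vertex-edge-vertex features from (head, label, dependent) triples.

    `blinded` maps trigger/protein tokens to placeholders such as "trigger"
    or "protx"; `pos_tags` maps every token to its part-of-speech tag.
    """
    lexical, syntactic = [], []
    for head, label, dep in edges:
        # Lexical variant: words, with triggers and proteins blinded.
        lexical.append(f"{blinded.get(head, head)} {label} {blinded.get(dep, dep)}")
        # Syntactic variant: part-of-speech tags instead of words.
        syntactic.append(f"{pos_tags[head]} {label} {pos_tags[dep]}")
    return lexical, syntactic

def trigrams(stems):
    """Return all runs of three consecutive (already stemmed) tokens."""
    return [" ".join(stems[i:i + 3]) for i in range(len(stems) - 2)]

edges = [("binds", "nsubj", "STAT6")]
lexical, syntactic = vertex_walks(
    edges,
    blinded={"binds": "trigger", "STAT6": "protx"},
    pos_tags={"binds": "vbz", "STAT6": "nn"},
)
stem_trigrams = trigrams(["by", "induc", "transcript", "factor"])
```

Blinding the trigger and protein names keeps the features from memorising specific words, so a pattern seen with one protein can match another.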
machine (SVM)
- LibSVM implementation as provided by WEKA
- Kernel type: radial basis function (default)
- Internal 5-fold CV loop to tune parameters
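The internal cross-validation loop can be pictured with the library-independent sketch below. It does not call LibSVM or WEKA. The parameter names `C` and `gamma` are the usual ones for an RBF kernel, but the slides do not say which parameters or value ranges were tuned, and `score_fold` is a hypothetical callback that would train on one split and return a score on the held-out fold.

```python
from itertools import product

def k_fold_indices(n_items, k=5):
    """Yield (train, test) index lists for k contiguous folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        yield train, test
        start += size

def tune(n_items, grid, score_fold, k=5):
    """Return the parameter setting with the best mean score over k folds."""
    best_params, best_score = None, float("-inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        scores = [score_fold(params, train, test)
                  for train, test in k_fold_indices(n_items, k)]
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score

# Dummy scorer standing in for "train an SVM, evaluate on the held-out fold".
def dummy_score(params, train, test):
    return -abs(params["C"] - 10) - abs(params["gamma"] - 0.1)

grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]}
best_params, best_score = tune(12, grid, dummy_score)
```

In a real setup the folds would normally be shuffled or stratified, which matters for data sets as imbalanced as these.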
Predictions for different event types are processed in parallel, independently of each other, and merged afterwards
One word in the text might lead to two distinct triggers of different types
- E.g. “expression” can lead to both a Transcription and a Gene expression event
- But in real life, this never happens at the same time!
- Keep only the prediction with the highest SVM score
- Minimal overlap between dictionaries can prevent this inconsistency from occurring in the first place
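One way to implement this consistency check is sketched below. The prediction format (dicts with `trigger_span`, `event_type` and `score`) is hypothetical, and so is the decision to keep every prediction of the winning type on a trigger; the slide only says that the highest-scoring prediction wins.

```python
def resolve_type_conflicts(predictions):
    """Keep, for each trigger span, only predictions of the best-scoring event type."""
    best_type = {}  # trigger span -> (event type, best score seen so far)
    for p in predictions:
        span = p["trigger_span"]
        if span not in best_type or p["score"] > best_type[span][1]:
            best_type[span] = (p["event_type"], p["score"])
    return [p for p in predictions
            if best_type[p["trigger_span"]][0] == p["event_type"]]

merged = [
    {"trigger_span": (10, 20), "event_type": "Transcription", "score": 0.4},
    {"trigger_span": (10, 20), "event_type": "Gene_expression", "score": 0.9},
    {"trigger_span": (10, 20), "event_type": "Gene_expression", "score": 0.7},
    {"trigger_span": (42, 49), "event_type": "Binding", "score": 0.3},
]
consistent = resolve_type_conflicts(merged)
```

Here the Transcription prediction is dropped, while both Gene expression predictions on the same trigger survive.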
on the same trigger
One trigger is involved in many events of the same type
- E.g. “It induces expression of STAT5-regulated genes in CTLL-2, i.e. beta-casein, and oncostatin M (OSM)”
- 2 Gene expression events based on the trigger “expression”, one with beta-casein and one with OSM
- This happens often, and the predictions will have similar SVM scores
- However, for some types, usually only one true event is linked to each trigger (Protein catabolism & Phosphorylation) → keep only the top-ranked result
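The top-ranked rule could look roughly like this. As before, the prediction dicts and the exact label strings are assumptions made for the sake of a runnable example.

```python
# Event types for which only one event per trigger is kept (from the slide);
# the label spelling is an assumption.
SINGLE_EVENT_TYPES = {"Protein_catabolism", "Phosphorylation"}

def keep_top_ranked(predictions, single_event_types=SINGLE_EVENT_TYPES):
    """For the listed types keep only the best-scoring event per trigger."""
    top = {}  # (trigger span, event type) -> best prediction seen so far
    for p in predictions:
        if p["event_type"] in single_event_types:
            key = (p["trigger_span"], p["event_type"])
            if key not in top or p["score"] > top[key]["score"]:
                top[key] = p
    return [p for p in predictions
            if p["event_type"] not in single_event_types
            or top[(p["trigger_span"], p["event_type"])] is p]

events = [
    {"trigger_span": (5, 20), "event_type": "Phosphorylation", "score": 0.8},
    {"trigger_span": (5, 20), "event_type": "Phosphorylation", "score": 0.6},
    {"trigger_span": (30, 40), "event_type": "Gene_expression", "score": 0.7},
    {"trigger_span": (30, 40), "event_type": "Gene_expression", "score": 0.65},
]
filtered = keep_top_ranked(events)
```

Gene expression events are left untouched, since several true events per trigger are common for that type.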
of the trigger
- “CsA was found not to inhibit lck gene expression.”
2. Trigger is inherently negative
- “This was associated with a reduction in endothelial MCP-1 secretion and GRO-alpha immobilization.”
3. The “but not” pattern
- “Overexpression of Vav, but not SLP-76, augments CD28-induced IL-2 promoter activity.”
Custom-made rule-based system: locating certain words, triggers and patterns that indicate negation
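The three patterns can be approximated by simple token-level rules, as in the sketch below. The cue lists, the window size and the function signature are all hypothetical; the actual rule set is not spelled out on the slide.

```python
NEGATION_WORDS = {"not", "no"}           # assumed cue list
NEGATIVE_TRIGGERS = {"lack", "absence"}  # hypothetical examples of negative triggers

def is_negated(tokens, trigger_index, theme_index, window=3,
               negation_words=NEGATION_WORDS,
               negative_triggers=NEGATIVE_TRIGGERS):
    """Apply three simple negation rules to one candidate event."""
    # Rule 1: a negation word shortly before the trigger.
    start = max(0, trigger_index - window)
    if any(t.lower() in negation_words for t in tokens[start:trigger_index]):
        return True
    # Rule 2: the trigger word itself expresses negation.
    if tokens[trigger_index].lower() in negative_triggers:
        return True
    # Rule 3: the theme is directly preceded by "but not".
    before_theme = [t.lower() for t in tokens[max(0, theme_index - 2):theme_index]]
    return before_theme == ["but", "not"]

s1 = "CsA was found not to inhibit lck gene expression .".split()
s3 = ("Overexpression of Vav , but not SLP-76 , augments "
      "CD28-induced IL-2 promoter activity .").split()
neg_inhibit = is_negated(s1, trigger_index=5, theme_index=6)
neg_slp76 = is_negated(s3, trigger_index=0, theme_index=6)
neg_vav = is_negated(s3, trigger_index=0, theme_index=2)
```

With these rules the overexpression of SLP-76 is marked as negated, while the overexpression of Vav is not.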
- “We examined the ability of type I and type II IFNs to regulate activation of STAT6 by IL-4 in primary human monocytes.”
2. Hypothesis (interpreting the research)
- “(…) suggesting that these nuclear proteins may determine the IP-10 mRNA inducibility by IFNgamma.”
Custom-made rule-based system: locating certain words that indicate speculation
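A cue-word lookup of this kind might be sketched as follows. The cue list is an assumption, read off the two example sentences only; the words actually used by the system are not listed on the slide.

```python
import re

# Assumed cue words, taken from the two example sentences.
SPECULATION_CUES = {"examined", "suggesting", "may"}

def speculation_cues(sentence, cues=SPECULATION_CUES):
    """Return the speculation cue words found in a sentence, in order."""
    words = re.findall(r"[A-Za-z]+", sentence.lower())
    return [w for w in words if w in cues]

cues_1 = speculation_cues(
    "We examined the ability of type I and type II IFNs to regulate "
    "activation of STAT6 by IL-4 in primary human monocytes.")
cues_2 = speculation_cues(
    "suggesting that these nuclear proteins may determine the IP-10 mRNA "
    "inducibility by IFNgamma.")
cues_3 = speculation_cues("CsA was found not to inhibit lck gene expression.")
```

A full system would also need to check that the cue actually relates to the event in question, for example by requiring it to be close to the trigger.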
heavily depend on the results of subtask 1
When we only consider events found in subtask 1, recall of the rule-based system is actually higher: above 50%.
Yvan Saeys
Yves Van de Peer
Bernard De Baets
The whole Bioinformatics team @ University of Ghent, Belgium
Many thanks to the BioNLP’09 organizers, for offering to the community a very valuable and well-organized task about event extraction!