2009-06-05-BioNLP_ST

Analyzing text in search of bio-molecular events: a high-precision machine learning framework
Sofie Van Landeghem, Yvan Saeys, Bernard De Baets, Yves Van de Peer
Friday June 5th, 2009
BioNLP 2009, Boulder, Colorado

General ML pipeline
Pipeline for one specific event type:
Trigger extraction -> Candidate arguments -> Candidate instances -> Negative instances filter -> Feature extraction -> Classification -> Predictions

Framework overview
[Diagram: Transcription pipeline, Phosphorylation pipeline, ... and 3 regulation pipelines; Merged predictions; Consistency check; Results task 1; Negation rules and Speculation rules; Results task 3]

Trigger dictionaries
- E.g. “phosphorylated”, “overexpression”
- Porter stemming
- Separate dictionary for each type of event
- Compiled automatically from training data
- Manually filtered to remove general words, such as “are”, “via” or “through”
- Binding: distinction between
  - “Single” (e.g. “homodimer”, “binding site”)
  - “Multiple” (e.g. “heterodimer”, “complex”)
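The dictionary compilation described on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' code: `crude_stem` is a hypothetical stand-in that strips a few suffixes and is not the Porter algorithm, and the stop list contains only the three general words named on the slide.

```python
from collections import defaultdict

# Only the general words named on the slide; the real manual filter is unknown.
GENERAL_WORDS = {"are", "via", "through"}

def crude_stem(word):
    """Hypothetical stand-in for Porter stemming: strips a few common suffixes."""
    word = word.lower()
    for suffix in ("ation", "ated", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_dictionaries(annotated_triggers):
    """Build one set of stemmed triggers per event type.

    annotated_triggers: iterable of (event_type, trigger_word) pairs,
    as they might be read from the training annotations.
    """
    dictionaries = defaultdict(set)
    for event_type, word in annotated_triggers:
        if word.lower() in GENERAL_WORDS:
            continue  # mimics the manual removal of general words
        dictionaries[event_type].add(crude_stem(word))
    return dict(dictionaries)

training = [
    ("Phosphorylation", "phosphorylated"),
    ("Phosphorylation", "phosphorylation"),
    ("Gene expression", "overexpression"),
    ("Binding", "via"),
]
dicts = build_dictionaries(training)
```

Here both inflected forms of the Phosphorylation trigger collapse to one stem, and the general word “via” never enters the Binding dictionary.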

Instance creation
1. Finding triggers in the text
2. Initially: all (combinations of) proteins / events that appear in the same sentence as the trigger
   Lots of noise -> high-dimensional datasets
3. Implementation of Negative-instances (NI) filter
   - Checks the length of the sub-sentence spanned by the candidate event & applies a cut-off value
   - Checks the size of the subgraph of the dependency graph corresponding to the candidate event & applies a cut-off
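The three steps above could look something like the sketch below. It assumes token positions stand in for the trigger and the proteins, and that the dependency subgraph size has already been computed elsewhere; the cut-off values `max_span` and `max_subgraph` are hypothetical, since the slides do not state the ones actually used.

```python
from itertools import combinations

def candidate_instances(trigger_idx, protein_idxs, max_args=2):
    """Yield (trigger, arguments) for every combination of up to max_args
    proteins found in the same sentence as the trigger."""
    for size in range(1, max_args + 1):
        for args in combinations(protein_idxs, size):
            yield (trigger_idx, args)

def passes_ni_filter(instance, subgraph_size, max_span=20, max_subgraph=8):
    """Keep a candidate only if both size measures stay under their cut-offs.

    max_span and max_subgraph are illustrative values, not the authors'.
    """
    trigger_idx, args = instance
    positions = (trigger_idx,) + tuple(args)
    span_length = max(positions) - min(positions) + 1  # tokens in the sub-sentence
    return span_length <= max_span and subgraph_size <= max_subgraph

# Trigger at token 5, proteins at tokens 1 and 30 of the same sentence.
candidates = list(candidate_instances(5, [1, 30]))
kept = [c for c in candidates if passes_ni_filter(c, subgraph_size=4)]
```

With these toy values, the two candidates involving the distant protein at token 30 are discarded and only the short-range candidate remains.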

Negative instances filter
[Figure: Distribution of Multiple binding instances (training data)]

Dimensionality reduction
Design choice                                     Type       # instances   Positive instances
Manually filtered dictionaries                    Binding    34,612        2%
Distinction between single and multiple binding   Single     4,708         11%
                                                  Multiple   3,861         5%
Application of the NI filter                      Single     4,070         13%
                                                  Multiple   2,365         8%
Distribution of Binding instances for different design choices (training data)

Feature generation
- Stanford dependency parsing: smallest subgraph
- Vertex walks extracted from the dependency subgraph
  - Vertex - edge - vertex
  - Lexical variant: trigger/protein blinded, e.g. “trigger nsubj protx”
  - Syntactic variant: e.g. “nn nsubj nn”
- Bag-of-words: nodes of dependency graph
  - Excludes uninformative words such as prepositions
- Stemmed trigrams, e.g. “by induc transcript”
- Lexical and syntactic information of the triggers
- Length of the sub-sentence & size of the subgraph
- Regulation: whether arguments are proteins or events
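To make the vertex-walk and trigram features concrete, here is a small sketch. It assumes the dependency subgraph is given as (head, relation, dependent) triples and that the syntactic variant replaces each word with its lowercased part-of-speech tag; both the input format and that reading of “nn nsubj nn” are assumptions, and the function names are made up for illustration.

```python
def lexical_walks(edges, trigger, proteins):
    """Vertex-edge-vertex walks with the trigger and protein names blinded."""
    def blind(token):
        if token == trigger:
            return "trigger"
        if token in proteins:
            return "protx"
        return token
    return [f"{blind(head)} {rel} {blind(dep)}" for head, rel, dep in edges]

def syntactic_walks(edges, pos_tags):
    """Same walks, with each word replaced by its lowercased POS tag."""
    return [f"{pos_tags[head].lower()} {rel} {pos_tags[dep].lower()}"
            for head, rel, dep in edges]

def stemmed_trigrams(stems):
    """All runs of three consecutive stems, joined by spaces."""
    return [" ".join(stems[i:i + 3]) for i in range(len(stems) - 2)]

# Toy subgraph for the phrase "activation of STAT6".
edges = [("activation", "prep_of", "STAT6")]
lexical = lexical_walks(edges, trigger="activation", proteins={"STAT6"})
syntactic = syntactic_walks(edges, {"activation": "NN", "STAT6": "NN"})
trigrams = stemmed_trigrams(["by", "induc", "transcript"])
```

Blinding the trigger and protein names keeps the walk features from memorising specific words, so the same pattern can match unseen proteins.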

Classification
- High-dimensional and highly unbalanced datasets
- Support vector machine (SVM)
  - LibSVM implementation as provided by WEKA
  - Kernel type: radial basis function (default)
  - Internal 5-fold CV loop to tune parameters
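The internal cross-validation loop might be organised along these lines. This sketch is in Python rather than WEKA, and it leaves the classifier out entirely: `evaluate` is a hypothetical caller-supplied function that would train an RBF-kernel SVM and return a score. The parameter grids and the stratified splitting are assumptions too, the latter chosen because the datasets are described as highly unbalanced.

```python
import random
from itertools import product

def stratified_folds(labels, k=5, seed=0):
    """Split instance indices into k folds with similar class ratios."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(set(labels)):
        idxs = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(idxs)
        for pos, i in enumerate(idxs):
            folds[pos % k].append(i)
    return folds

def tune(labels, evaluate, costs=(1, 10, 100), gammas=(0.01, 0.1, 1.0), k=5):
    """Return the (C, gamma) pair with the best mean score over k folds.

    evaluate(train_idx, test_idx, C, gamma) must train on train_idx and
    return a score measured on test_idx; it is not provided here.
    """
    folds = stratified_folds(labels, k)
    best_params, best_score = None, float("-inf")
    for C, gamma in product(costs, gammas):
        scores = []
        for held_out in range(k):
            test_idx = folds[held_out]
            train_idx = [i for f, fold in enumerate(folds)
                         if f != held_out for i in fold]
            scores.append(evaluate(train_idx, test_idx, C, gamma))
        mean_score = sum(scores) / k
        if mean_score > best_score:
            best_params, best_score = (C, gamma), mean_score
    return best_params, best_score

# Toy run: 10 positive and 40 negative instances, and a dummy scorer
# that simply prefers C=10 and gamma=0.1.
labels = [1] * 10 + [0] * 40
folds = stratified_folds(labels)
best_params, _ = tune(labels, lambda tr, te, C, g: -abs(C - 10) - abs(g - 0.1))
```

For an unbalanced dataset, the score returned by `evaluate` would more sensibly be an F-score than plain accuracy.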

Consistency check (1)
Overlapping triggers of different event types
- Predictions for different event types are processed in parallel, independently of each other, and merged afterwards
- One word in the text might lead to two distinct triggers of different types
  - E.g. “expression” can lead to both a Transcription and a Gene expression event
  - But in real life, this never happens at the same time!
-> Keep only the prediction with the highest SVM score
-> Minimal overlap between dictionaries can prevent this inconsistency from occurring in the first place
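One way to implement this first check is sketched below. The prediction records and their keys are hypothetical. The sketch also makes an interpretive choice the slide does not spell out: for each trigger span it picks the event type of the single best-scoring prediction and keeps every prediction of that type, so that several same-type events on one trigger survive.

```python
def resolve_overlapping_triggers(predictions):
    """For each trigger span, keep only predictions of the best-scoring event type.

    predictions: list of dicts with 'trigger_span', 'event_type' and 'score'.
    """
    best_type = {}  # trigger span -> (event type, highest score seen)
    for p in predictions:
        span = p["trigger_span"]
        if span not in best_type or p["score"] > best_type[span][1]:
            best_type[span] = (p["event_type"], p["score"])
    return [p for p in predictions
            if best_type[p["trigger_span"]][0] == p["event_type"]]

merged = [
    {"trigger_span": (10, 20), "event_type": "Transcription", "score": 0.4},
    {"trigger_span": (10, 20), "event_type": "Gene expression", "score": 0.9},
    {"trigger_span": (10, 20), "event_type": "Gene expression", "score": 0.7},
    {"trigger_span": (30, 35), "event_type": "Binding", "score": 0.2},
]
consistent = resolve_overlapping_triggers(merged)
```

In this toy input the Transcription prediction is dropped because a Gene expression prediction on the same word scored higher.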

Consistency check (2)
Different events of the same type, based on the same trigger
- One trigger is involved in many events of the same type
  - E.g. “It induces expression of STAT5-regulated genes in CTLL-2, i.e. beta-casein, and oncostatin M (OSM)”
  - 2 Gene expression events based on the trigger “expression”, one with beta-casein and one with OSM.
- This happens often, and the predictions will have similar SVM scores
-> However, for some types, usually only one true event is linked to each trigger (Protein catabolism & Phosphorylation) -> keep only top-ranked result
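The second check can be sketched in a similar way. The two event types come from the slide; the record layout and the function name are assumptions made for illustration.

```python
# Event types for which the slide says one true event per trigger is usual.
SINGLE_EVENT_TYPES = {"Protein catabolism", "Phosphorylation"}

def keep_top_ranked(predictions):
    """Keep only the best-scoring prediction per trigger for the types in
    SINGLE_EVENT_TYPES; predictions of other types pass through untouched."""
    kept, best = [], {}
    for p in predictions:
        if p["event_type"] not in SINGLE_EVENT_TYPES:
            kept.append(p)
            continue
        key = (p["trigger_span"], p["event_type"])
        if key not in best or p["score"] > best[key]["score"]:
            best[key] = p
    return kept + list(best.values())

predicted = [
    {"trigger_span": (0, 10), "event_type": "Gene expression", "arg": "beta-casein", "score": 0.81},
    {"trigger_span": (0, 10), "event_type": "Gene expression", "arg": "OSM", "score": 0.79},
    {"trigger_span": (40, 55), "event_type": "Phosphorylation", "arg": "A", "score": 0.60},
    {"trigger_span": (40, 55), "event_type": "Phosphorylation", "arg": "B", "score": 0.35},
]
filtered = keep_top_ranked(predicted)
```

Both Gene expression events remain, while only the higher-scoring Phosphorylation event is kept.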

Performance - Task 1

Negation
Three cases
1. Negation construct in close vicinity of the trigger
   -> “CsA was found not to inhibit lck gene expression.”
2. Trigger is inherently negative
   -> “This was associated with a reduction in endothelial MCP-1 secretion and GRO-alpha immobilization.”
3. The “but not” pattern
   -> “Overexpression of Vav, but not SLP-76, augments CD28-induced IL-2 promoter activity.”
Custom-made rule-based system: locating certain words, triggers and patterns that indicate negation
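A toy version of such a rule set is sketched below, covering the three cases in order. The cue words, the list of inherently negative triggers, and the three-token window are all hypothetical choices based only on the slide's examples; the actual rules are not given in the slides.

```python
import re

NEGATION_CUES = {"not", "no", "without"}  # hypothetical cue list
NEGATIVE_TRIGGERS = {"reduction"}         # hypothetical, taken from the slide example

def is_negated(sentence, trigger, argument, window=3):
    """Return True if any of the three negation cases applies to the event."""
    tokens = re.findall(r"[\w-]+", sentence)
    lowered = [t.lower() for t in tokens]
    # Case 3: the "but not <argument>" pattern.
    for i in range(len(tokens) - 2):
        if lowered[i] == "but" and lowered[i + 1] == "not" and tokens[i + 2] == argument:
            return True
    # Case 2: the trigger itself is inherently negative.
    if trigger.lower() in NEGATIVE_TRIGGERS:
        return True
    # Case 1: a negation cue within `window` tokens before the trigger,
    # ignoring a "not" that belongs to a "but not" pattern.
    if trigger.lower() in lowered:
        pos = lowered.index(trigger.lower())
        for j in range(max(0, pos - window), pos):
            part_of_but_not = lowered[j] == "not" and j > 0 and lowered[j - 1] == "but"
            if lowered[j] in NEGATION_CUES and not part_of_but_not:
                return True
    return False

s1 = "CsA was found not to inhibit lck gene expression."
s3 = "Overexpression of Vav, but not SLP-76, augments CD28-induced IL-2 promoter activity."
case1 = is_negated(s1, "inhibit", "lck")
case3_negated = is_negated(s3, "augments", "SLP-76")
case3_affirmed = is_negated(s3, "augments", "Vav")
```

The “but not” pattern is checked against the argument, so in the third example only the event involving SLP-76 is flagged while the one involving Vav is not.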

Speculation
Two cases
1. Uncertainty (stating the research)
   -> “We examined the ability of type I and type II IFNs to regulate activation of STAT6 by IL-4 in primary human monocytes.”
2. Hypothesis (interpreting the research)
   -> “(…) suggesting that these nuclear proteins may determine the IP-10 mRNA inducibility by IFNgamma.”
Custom-made rule-based system: locating certain words that indicate speculation
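Locating speculation cues could be as simple as the following sketch. The cue list is hypothetical and contains only words that appear in the two example sentences; the slides do not list the words the actual system looks for.

```python
import re

# Hypothetical cue words, drawn only from the slide's example sentences.
SPECULATION_CUES = {"examined", "suggesting", "may"}

def speculation_cues(sentence):
    """Return the speculation cue words found in the sentence, in order."""
    words = re.findall(r"[A-Za-z]+", sentence.lower())
    return [w for w in words if w in SPECULATION_CUES]

uncertainty = speculation_cues(
    "We examined the ability of type I and type II IFNs to regulate "
    "activation of STAT6 by IL-4 in primary human monocytes."
)
hypothesis = speculation_cues(
    "suggesting that these nuclear proteins may determine the IP-10 mRNA "
    "inducibility by IFNgamma."
)
```

An event would then be marked as speculated when its sentence yields at least one cue; a fuller system would presumably also check how close the cue is to the trigger.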

Performance - Task 3
- The results of this subtask heavily depend on the results of subtask 1
- When we only consider events found in subtask 1, recall of the rule-based system is actually higher: above 50%.

Thanks to ...
- My supervisors
  - Yvan Saeys
  - Yves Van de Peer
  - Bernard De Baets
- The whole Bioinformatics team @ University of Ghent, Belgium
- Many thanks to the BioNLP’09 organizers, for offering to the community a very valuable and well organized task about event extraction!
