Computational Semantics and Evaluation Benchmark for Interrogative Sentences via Combinatory Categorial Grammar
Hayate Funakura and Koji Mineshima, Computational Semantics and Evaluation Benchmark for Interrogative Sentences via Combinatory Categorial Grammar, PACLIC 37, 2023.
Hayate Funakura (Kyoto University) and Koji Mineshima (Keio University). This work is partially supported by JST CREST grant number JPMJCR2114.
The FraCaS test suite (Cooper+ 1996)
• A collection of inference problems in the recognizing textual entailment (RTE) format
• Covers a wide range of linguistic phenomena, e.g., generalized quantifiers, plurals, and anaphora
Evaluation benchmark — Background
• The semantics of questions has been less addressed:
• FraCaS covers only polar questions
• Watanabe+ (2019) provide a benchmark for question semantics, but the dataset is limited in variation: it does not include wh-words other than who, and there are no instances where the object is a wh-word
Evaluation benchmark — Background
• We propose QSEM, a benchmark for evaluating the syntax-semantics interface for various types of questions.
• QSEM tests understanding of the following:
• Quantificational expressions
• Multiple wh-questions
• Scope ambiguity
• Near-real text wh-questions
Evaluation benchmark — Proposal
• Quantificational expressions: questions and responses were extracted from sections 1.1 and 1.2 of FraCaS as samples for quantification.
• Multiple wh-questions: the QA pairs on multiple wh-questions are taken from Dayal (2016).
• Scope ambiguity: samples for scope ambiguity were taken from Chierchia (1993) and Krifka (2003).
• Near-real text wh-questions: we randomly sampled questions from the SQuAD dataset as more real-text-like data and created sentential answers based on the non-sentential answers in the original dataset.
Evaluation benchmark — Proposal
Label definitions: yes: the premises directly answer the question; no: the premises negate the presupposition of the question; unknown: none of the above.
Example (ID: 053, created based on examples from Krifka 2003):
P1: Bill made every dish.
P2: Bill is a boy.
Q: Which boy made every dish?
Label: yes
Evaluation benchmark — Proposal
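For concreteness, a problem of this shape could be stored as a simple record; the field names below are hypothetical and do not claim to match the actual QSEM distribution format.

# Hypothetical record layout for a QSEM-style problem (field names illustrative only).
qsem_item = {
    "id": "053",
    "premises": ["Bill made every dish.", "Bill is a boy."],
    "question": "Which boy made every dish?",
    "label": "yes",  # one of: yes / no / unknown, as defined above
    "source": "created based on examples from Krifka 2003",
}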
• Scope ambiguity: questions in which a certain quantificational expression is the subject.
Example (ID: 048, created based on examples from Krifka 2003):
P: Bill likes Smith and Sue likes Jones.
Q: Who does everyone like?
Label: yes (∀ > wh reading)
Example (ID: 049, created based on examples from Krifka 2003):
P: Everyone likes Smith.
Q: Who does everyone like?
Label: yes (wh > ∀ reading)
Evaluation benchmark — Proposal
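Informally, the two readings can be glossed as follows (a rough sketch only, not the paper's exact meaning representations, with "?x" used as an informal question-forming operator): under the wh > ∀ reading the question asks for a single individual liked by everyone, roughly ?x.∀y[person(y) → like(y, x)], which the premise of ID 049 answers; under the ∀ > wh reading it asks, for each person, whom that person likes, roughly ∀y[person(y) → ?x.like(y, x)], a pair-list question, which the premise of ID 048 answers.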
• Each of these systems depends on a specific semantic representation.
• There is still no system that supports diverse analyses of questions.
System | Semantic representation
NatLog (MacCartney+ 2007) | Natural logic (MacCartney+ 2007)
Boxer (Bos 2008) | DRT
LangPro (Abzianidze 2015) | Natural logic (Muskens 2010)
ccg2lambda (Mineshima+ 2015) | Higher-order logic
Software — Background
• ccg2lambda is an implementation of the syntax-semantics-prover interface.
Benchmark (P1, P2, …, H) → CCG parser (e.g., depccg) → CCG trees → semantic composition (semantic template: lexical meaning representations and semantic rules) → MRs for P1, P2, …, Q in higher-order logic → prover (e.g., Coq) → yes/no/unk
yes: the Ps entail the H; no: the Ps contradict the H; unk: none of the above
• High customizability: applied to FraCaS (RTE, Mineshima+ 2015), SICK (Yanaka+ 2018), etc.
Software — Background
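As a rough illustration of this flow (the function names below are hypothetical stand-ins, not the actual depccg, ccg2lambda, or Coq interfaces), the pipeline can be sketched in Python as follows:

# Hypothetical sketch of the parser -> semantic-composition -> prover pipeline.
# None of these functions is a real depccg / ccg2lambda / Coq API; they only
# mirror the stages in the diagram above.
def parse_ccg(sentence: str) -> object:
    """Return a CCG derivation tree for the sentence (stub)."""
    raise NotImplementedError

def compose_semantics(ccg_tree: object, semantic_template: dict) -> str:
    """Compositionally map a CCG tree to a higher-order logic formula (stub)."""
    raise NotImplementedError

def prove(premise_mrs: list[str], hypothesis_mr: str) -> str:
    """Ask a theorem prover whether the premises entail or contradict the hypothesis (stub)."""
    raise NotImplementedError

def judge(premises: list[str], hypothesis: str, semantic_template: dict) -> str:
    """Run one benchmark problem through the pipeline and return yes / no / unknown."""
    premise_mrs = [compose_semantics(parse_ccg(p), semantic_template) for p in premises]
    hypothesis_mr = compose_semantics(parse_ccg(hypothesis), semantic_template)
    return prove(premise_mrs, hypothesis_mr)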
• We extend ccg2lambda and call the new system ccg2hol.
• New features:
• Utilization of semantic tags (Abzianidze+ 2017) as a resource for determining lexical meaning representations (a minimal sketch follows below)
• Support for various analyses of question semantics
• Our system allows for a more flexible design of lexical meaning representations and provides a unified evaluation platform for question semantics.
Software — Proposal
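A minimal sketch of the semantic-tag idea; the tag names and templates below are illustrative examples only, not the actual resources shipped with ccg2hol.

# Illustrative only: choose a lexical meaning template by semantic tag.
# The tags ("QUE", "NOT") and templates are hypothetical examples, not the
# actual semantic-tag resources used by ccg2hol.
SEMTAG_TEMPLATES = {
    "QUE": r"\F. Q(\x. F(x))",      # wh-word: wrap the abstracted property with Q
    "NOT": r"\F x. -F(x)",          # negation
    "DEFAULT": r"\x. {lemma}(x)",   # content word: predicate named after its lemma
}

def lexical_entry(lemma: str, semtag: str) -> str:
    """Pick a meaning representation template based on the token's semantic tag."""
    template = SEMTAG_TEMPLATES.get(semtag, SEMTAG_TEMPLATES["DEFAULT"])
    return template.format(lemma=lemma)

# e.g. lexical_entry("run", "VRB") -> "\x. run(x)"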
• Different analyses of questions can be performed depending on the definition of the operators Q and ?.
• Who does John like? → Q(λx.∃e.[Like(e) ∧ Subj(e) = John ∧ Obj(e) = x])
• Does John smoke? → ?(∃e.[Smoke(e) ∧ Subj(e) = John])
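As one concrete possibility (an illustration only, not necessarily the definition adopted in the paper), Q and ? can be given Karttunen-style definitions on which a question denotes the set of its true answers:
Q(P) = {p : ∃x[p = P(x) ∧ p is true]}
?(q) = {p : p ∈ {q, ¬q} ∧ p is true}
Under such definitions, "Who does John like?" denotes the set of true propositions of the form "John likes x", and "Does John smoke?" denotes the singleton containing whichever of "John smokes" and "John does not smoke" is true; other choices for Q and ? yield other analyses.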
• By the following procedure, we conducted the evaluation of our analysis and the annotation of QSEM in parallel:
QSEM → ccg2hol → manual correction of semtags
68/138 problems: ✅ CCG tree, ✅ MR, ✅ inference
70/138 problems: not annotated, primarily due to parsing errors
* For error analysis, please refer to the paper.
Evaluation and annotation
• By replacing syntactic parsing and semantic composition with a seq2seq model, a universal semantic parser can be realized without having a model for each representation (a rough sketch follows below).
Sentence → seq2seq model → LF / HOL / DRS
Future prospects
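A minimal sketch of that idea, assuming a fine-tuned encoder-decoder checkpoint; the model path and the task-prefix convention are hypothetical.

# Hypothetical sketch: one seq2seq model emitting different target representations
# (HOL, DRS, ...) selected by a task prefix. The checkpoint path is a placeholder.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "path/to/finetuned-semantic-parser"  # hypothetical fine-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def parse_to(sentence: str, formalism: str = "HOL") -> str:
    """Map a sentence to a meaning representation in the requested formalism."""
    prompt = f"to {formalism}: {sentence}"            # prefix selects the output formalism
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# e.g. parse_to("Who does John like?", formalism="HOL")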
Table: Comparison of existing benchmarks and ours
Benchmark | Types of questions | Size
FraCaS | polar |
Watanabe+ (2019) | who, polar, alternative |
Ours | who, which, what, when, where, polar |
Evaluation benchmark — Proposal
• Each FraCaS problem contains P(remise)s, a Q(uestion), a H(ypothesis), and a label (yes/no/unk).
Label definitions: yes: the Ps entail the H / the Ps provide a positive answer to the Q; no: the Ps contradict the H / the Ps provide a negative answer to the Q; unknown: none of the above.
Example (ID: 001):
P1: An Italian became the world's greatest tenor.
Q: Was there an Italian who became the world's greatest tenor?
H: There was an Italian who became the world's greatest tenor.
Label: yes (P1 entails H, and P1 provides a positive answer to Q)
Evaluation benchmark — Background
• Each problem in Watanabe+ (2019) contains an A(nswer), a Q(uestion), and a label (yes/no/unk).
• It does not include wh-words other than who, and there are no instances where the object is a wh-word.
Example (ID: 001):
A: John ran.
Q: Who ran?
Label: yes (A provides a positive answer to Q)
Evaluation benchmark — Background