
Computational Semantics and Evaluation Benchmark
for Interrogative Sentences via Combinatory Categorial Grammar

Hayate Funakura and Koji Mineshima, Computational Semantics and Evaluation Benchmark
for Interrogative Sentences via Combinatory Categorial Grammar, PACLIC 37, 2023.

Conference website: https://paclic2023.github.io/

hfunakura

December 04, 2023

Transcript

  1. 4 Dec 2023, PACLIC 37 — Computational Semantics and Evaluation Benchmark for Interrogative Sentences via Combinatory Categorial Grammar. Hayate Funakura (Kyoto University), Koji Mineshima (Keio University). This work is partially supported by JST CREST grant number JPMJCR2114.
  2. Agenda • Background • Proposals • Benchmark • Software • Evaluation and annotation • Future prospects
  3. Background • Formal semantics has been developed since the 1970s. • Various analyses have been proposed for individual phenomena: • Argument wh-questions (Hirsch 2019, Xiang 2020, etc.) • Adjective wh-questions (Nelken+ 1998, Tellings 2019, etc.) • Polar/alternative questions (Biezma 2011, Roelofsen+ 2015, etc.) • Embedded questions (Lahiri 2002, Ciardelli+ 2013, etc.)
  4. Background • Formal semantics has been developed since the 1970s. • Various analyses have been proposed for individual phenomena: • Argument wh-questions (Hirsch 2019, Xiang 2020, etc.) • Adjective wh-questions (Nelken+ 1998, Tellings 2019, etc.) • Polar/alternative questions (Biezma 2011, Roelofsen+ 2015, etc.) • Embedded questions (Lahiri 2002, Ciardelli+ 2013, etc.) ⇒ Data and a system are needed for the unified evaluation of these various analyses.
  5. Our proposals 1. Evaluation benchmark for question semantics 2. Software for implementing and evaluating NL semantics 3. Evaluation and annotation
  6. Our proposals 1. Evaluation benchmark for question semantics 2. Software for implementing and evaluating NL semantics 3. Evaluation and annotation
  7. Evaluation benchmark — Background • Creating benchmarks for evaluating theoretical linguistics began with the FraCaS test suite (Cooper+ 1996). • A collection of inference problems in the recognizing textual entailment (RTE) format. • Covers a wide range of linguistic phenomena, e.g. generalized quantifiers, plurals, anaphora, etc.
  8. Evaluation benchmark — Background • Subsequent benchmarks have been proposed, but questions remain less addressed. • FraCaS covers only polar questions. • Watanabe+ (2019) provide a benchmark for question semantics, but the dataset is limited in variation: it does not include wh-words other than who, and there are no instances where the object is a wh-word.
  9. Evaluation benchmark — Proposal • We have created a benchmark, QSEM, for evaluating the syntax-semantics interface for various types of questions. • QSEM tests understanding of the following: • Quantificational expressions • Multiple wh-questions • Scope ambiguity • Near-real-text wh-questions
  10. Evaluation benchmark — Proposal • Quantificational expressions: pairs of polar questions and responses were extracted from sections 1.1 and 1.2 of FraCaS as samples for quantification. • Multiple wh-questions: the QA pairs on multiple wh-questions are taken from Dayal (2016). • Scope ambiguity: samples for scope ambiguity were taken from Chierchia (1993) and Krifka (2003). • Near-real-text wh-questions: we randomly sampled questions from the SQuAD dataset as more real-text-like data and created sentential answers based on the non-sentential answers in the original dataset.
  11. Evaluation benchmark — Proposal • QSEM contains P(remises), a Q(uestion), and a label (yes/no/unknown).
  Example (ID: 053, created based on examples from Krifka 2003): P1 Bill made every dish. P2 Bill is a boy. Q Which boy made every dish? Label: yes.
  Labels: yes — the premises directly answer the question; no — the premises negate the presupposition of the question; unknown — none of the above.
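To make the record layout concrete, here is a minimal sketch of how a QSEM problem could be held in code; the `QSemProblem` class and its field names are illustrative assumptions for this writeup, not the benchmark's released format.

```python
# Hypothetical sketch of a QSEM problem record; the class and field names are
# illustrative, not the actual distribution format of the benchmark.
from dataclasses import dataclass
from typing import List

LABELS = ("yes", "no", "unknown")  # yes: premises directly answer Q
                                   # no: premises negate Q's presupposition
                                   # unknown: neither

@dataclass
class QSemProblem:
    problem_id: str
    premises: List[str]
    question: str
    label: str

    def __post_init__(self) -> None:
        assert self.label in LABELS, f"unexpected label: {self.label}"

# The slide's example (ID 053), rendered in this hypothetical format.
example_053 = QSemProblem(
    problem_id="053",
    premises=["Bill made every dish.", "Bill is a boy."],
    question="Which boy made every dish?",
    label="yes",
)
```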
  12. Evaluation benchmark — Proposal • Scope ambiguity: ambiguity arises in wh-questions when a certain quantificational expression is the subject.
  Example (ID: 048, created based on examples from Krifka 2003): P Bill likes Smith and Sue likes Jones. Q Who does everyone like? Label: yes.
  Example (ID: 049, created based on examples from Krifka 2003): P Everyone likes Smith. Q Who does everyone like? Label: yes.
  13. Evaluation benchmark — Proposal • Scope ambiguity: ambiguity arises in wh-questions when a certain quantificational expression is the subject.
  Example (ID: 048, created based on examples from Krifka 2003): P Bill likes Smith and Sue likes Jones. Q Who does everyone like? Label: yes. (∀ > wh reading)
  Example (ID: 049, created based on examples from Krifka 2003): P Everyone likes Smith. Q Who does everyone like? Label: yes. (wh > ∀ reading)
  14. Evaluation benchmark — Proposal • Table: Comparison of existing benchmarks and ours
  Benchmark       | Types of questions                   | Size
  FraCaS          | polar                                | 346
  Watanabe+ 2019  | who, polar, alternative              | 49
  Ours            | who, which, what, when, where, polar | 138
  15. Our proposals 1. Evaluation benchmark for question semantics 2. Software for implementing and evaluating NL semantics 3. Evaluation and annotation
  16. Software — Background • Several implementations of semantic parsing have been proposed. • Each of these systems depends on a specific semantic representation. • There is still no system that supports the diverse analyses of questions.
  System                       | Semantic representation
  NatLog (MacCartney+ 2007)    | Natural logic (MacCartney+ 2007)
  Boxer (Bos 2008)             | DRT
  LangPro (Abzianidze 2015)    | Natural logic (Muskens 2010)
  ccg2lambda (Mineshima+ 2015) | Higher-order logic
  17. Software — Background • ccg2lambda (Mineshima+ 2015): an implementation of the syntax-semantics-prover interface. [Pipeline figure: Benchmark (P1, P2, …, H) → CCG parser (e.g. depccg) → CCG trees → semantic composition with a semantic template → MR for P1, MR for P2, MR for Q → prover (e.g. Coq) → yes/no/unk]
  18. Software — Background • ccg2lambda (Mineshima+ 2015): an implementation of the syntax-semantics-prover interface. [Same pipeline figure; the semantic template consists of lexical meaning representations and semantic rules.]
  19. Software — Background • ccg2lambda (Mineshima+ 2015): an implementation of the syntax-semantics-prover interface. [Same pipeline figure; the meaning representations are in higher-order logic.]
  20. Software — Background • ccg2lambda (Mineshima+ 2015): an implementation of the syntax-semantics-prover interface. [Same pipeline figure; labels: yes — the Ps entail the H; no — the Ps contradict the H; unk — none of the above.]
  21. Software — Background • ccg2lambda (Mineshima+ 2015): an implementation of the syntax-semantics-prover interface. [Same pipeline figure; high customizability: FraCaS (RTE, Mineshima+ 2015), SICK (Yanaka+ 2018), etc.]
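To make the flow of the figure concrete, here is a minimal sketch of a parse → compose → prove driver with the three stages passed in as callables; `parse`, `compose`, `entails`, and `negate` are hypothetical stand-ins for the depccg, semantic-composition, and Coq steps, not the actual ccg2lambda interface.

```python
# Minimal sketch of the pipeline in the figure above. The stage callables are
# hypothetical stand-ins (e.g. a depccg wrapper, a composition routine, a Coq call);
# this is not the real ccg2lambda/ccg2hol API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Problem:
    premises: List[str]   # P1, P2, ...
    query: str            # H (RTE hypothesis) or Q (question)

def solve(problem: Problem,
          parse: Callable[[str], object],             # CCG parsing, e.g. via depccg
          compose: Callable[[object], str],           # semantic composition -> HOL MR
          entails: Callable[[List[str], str], bool],  # prover call, e.g. Coq
          negate: Callable[[str], str]) -> str:
    """Return 'yes', 'no', or 'unk' for one benchmark problem."""
    mrs = [compose(parse(s)) for s in problem.premises + [problem.query]]
    *premise_mrs, query_mr = mrs
    if entails(premise_mrs, query_mr):
        return "yes"
    if entails(premise_mrs, negate(query_mr)):
        return "no"
    return "unk"
```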
  22. Software — Proposal • We have proposed an extended version of ccg2lambda, and call this system ccg2hol. • The new features: • Utilization of semantic tags (Abzianidze+ 2017) as a resource for judging lexical meaning representations • Support for various analyses of question semantics • Our system allows us to design more flexible lexical meaning representations, and gives a unified evaluation platform for question semantics.
  23. Software — Proposal • We have proposed an extended version of ccg2lambda, and call this system ccg2hol. • The new features: • Utilization of semantic tags (Abzianidze+ 2017) as a resource for judging lexical meaning representations • Support for various analyses of question semantics • Our system allows for: • more flexible design for lexical meaning • unified evaluation for question semantics
  24. Software — Proposal • Note: ccg2hol is not a question-specific system. • It supports a wide range of questions in addition to other constructions.
  25. Software — Proposal • System output for Who does John like? (see our paper for details of the analysis). [Output shown: CCG category, HOL (our own definitions), standard logical expression, DRS (not displayed here)]
  26. Software — Proposal • System output for Who does John like? (see our paper for details of the analysis): ∃x . ∃e . [Like(e) ∧ Subj(e) = John ∧ Obj(e) = x]
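For readers who want to poke at an MR like this programmatically, here is a small sketch that writes the same formula as an NLTK logic expression; this only illustrates the shape of the term, and is not the output format ccg2hol itself emits.

```python
# Illustrative only: the MR for "Who does John like?" written as an NLTK logic term.
# ccg2hol's actual output format may differ; this just shows the formula's shape.
from nltk.sem.logic import Expression

read = Expression.fromstring
mr = read(r"exists x. exists e. (Like(e) & (Subj(e) = John) & (Obj(e) = x))")

print(mr)         # echoes the parsed formula
print(mr.free())  # no free individual variables: the MR is a closed formula
```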
  27. Software — Proposal • Our system can realize other analyses (e.g. Karttunen semantics, inquisitive semantics): λp . ∃x . [p(w_a) ∧ p = λw . […]]
  28. Software — Proposal • Our theoretical assumptions: • Various analyses can be performed depending on the definition of the operators 𝖰 and ?. • Who does John like? ⟹ 𝖰(λx . ∃e . [Like(e) ∧ Subj(e) = John ∧ Obj(e) = x]) • Does John smoke? ⟹ ?(∃e . [Smoke(e) ∧ Subj(e) = John])
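To show what "depending on the definition of the operators" could mean in practice, here is a hypothetical sketch in which 𝖰 and ? are ordinary higher-order functions whose bodies can be swapped to emulate different theories; the string-building definitions are placeholders for exposition, not the HOL definitions used in the paper.

```python
# Hypothetical sketch: the question operators as swappable functions over MRs.
# The bodies below only build strings for display; the real definitions are HOL terms.
from typing import Callable

def Q_answer_set(body: Callable[[str], str]) -> str:
    """One possible reading of Q: the set of individuals satisfying the body."""
    return "{x | " + body("x") + "}"

def Q_proposition_set(body: Callable[[str], str]) -> str:
    """Another possible reading of Q: a Karttunen-style set of propositions."""
    return "{p | exists x. p = ^" + body("x") + "}"

def polar(prop: str) -> str:
    """The ? operator: a polar question denotes {p, not p} in this toy rendering."""
    return "{" + prop + ", -(" + prop + ")}"

like_body = lambda x: f"exists e. (Like(e) & Subj(e) = John & Obj(e) = {x})"
print(Q_answer_set(like_body))       # Who does John like?  (answer-set style)
print(Q_proposition_set(like_body))  # Who does John like?  (proposition-set style)
print(polar("exists e. (Smoke(e) & Subj(e) = John)"))  # Does John smoke?
```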
  29. Our proposals 1. Evaluation benchmark for question semantics 2. Software for implementing and evaluating NL semantics 3. Evaluation and annotation
  30. Evaluation and annotation • By the following procedure, we conducted the evaluation of our analysis and the annotation of QSEM in parallel: [Procedure figure: QSEM → ccg2hol → manual correction of semtags → ✅ CCG tree, ✅ MR, ✅ inference for 68/138 problems; the remaining 70/138 were not annotated, primarily due to parsing errors.] *For error analysis, please refer to the paper.
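A minimal sketch of how this parallel evaluation/annotation loop could be scripted is given below, reusing the hypothetical `solve` driver sketched earlier; the bookkeeping (answered vs. parse-failure counts) mirrors the 68/138 vs. 70/138 split only in spirit and is not the actual evaluation code.

```python
# Illustrative evaluation loop over QSEM, assuming the hypothetical solve(...) driver above.
# Problems that fail during parsing/composition are skipped, mirroring "not annotated" cases.
def evaluate(problems, gold_labels, run_system):
    answered, correct, failed = 0, 0, 0
    for problem, gold in zip(problems, gold_labels):
        try:
            prediction = run_system(problem)  # e.g. a closure around solve(...)
        except RuntimeError:                  # stand-in for a parsing/composition failure
            failed += 1
            continue
        answered += 1
        correct += prediction == gold
    return {"answered": answered, "correct": correct, "failed": failed}
```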
  31. Future prospects • Currently, our system derives each representation separately. [Figure: CCG parsing → semantic composition → LF / HOL / DRS, each derived separately]
  32. Future prospects • In the future, we aim to implement a feature that will convert from HOL to the other representations. [Figure: Sentence → CCG parsing → semantic composition → HOL → LF / DRS]
  33. Future prospects • By replacing syntactic parsing and semantic composition with a seq2seq model, a universal semantic parser can be realized without having a model for each representation. [Figure: Sentence → seq2seq model → LF / HOL / DRS]
  34. Summary • We have proposed an extended version of ccg2lambda, and call this system ccg2hol. • The new features: • Utilization of semantic tags (Abzianidze+ 2017) as a resource for judging lexical meaning representations • Support for various analyses of question semantics • Our system allows for: • more flexible design for lexical meaning • unified evaluation for question semantics • Comparison of existing benchmarks and ours: FraCaS (polar, 346); Watanabe+ 2019 (who, polar, alternative, 49); Ours (who, which, what, when, where, polar, 138). • We conducted the evaluation of our analysis and the annotation of QSEM in parallel: 68/138 problems annotated (✅ CCG tree, ✅ MR, ✅ inference); 70/138 not annotated, primarily due to parsing errors (see the paper for error analysis). • By replacing syntactic parsing and semantic composition with a seq2seq model, a universal semantic parser can be realized without having a model for each representation.
  35. Evaluation benchmark — Background • Each problem in FraCaS contains P(remises), a Q(uestion), a H(ypothesis), and a label (yes/no/unknown).
  Example (ID: 001): P1 An Italian became the world's greatest tenor. Q Was there an Italian who became the world's greatest tenor? H There was an Italian who became the world's greatest tenor. Label: yes (P1 entails H, and P1 provides a positive answer to Q).
  Labels: yes — the Ps entail H / provide a positive answer to Q; no — the Ps contradict H / provide a negative answer to Q; unknown — none of the above.
  36. Evaluation benchmark — Background • The dataset created by Watanabe+ (2019) contains an A(nswer), a Q(uestion), and a label (yes/no/unknown). • It does not include wh-words other than who, and there are no instances where the object is a wh-word.
  Example (ID: 001): A John ran. Q Who ran? Label: yes (A provides a positive answer to Q).
 A(nswer), Q(uestion), and a label (yes/no/unk) • It does not include wh-words other than who, and there are no instances where the object is a wh-word. 38 A John ran. Q Who ran? Label yes (A provides a positive answer to Q) ID: 0 0 1 Evaluation benchmark — Background