Watson
July 30, 2021

MioGatto: A Math Identifier-oriented Grounding Annotation Tool / mathui2021

We present MioGatto, a new annotation tool for efficiently building large corpora for grounding math formulae. Because math identifiers in science, technology, engineering, and mathematics documents can carry multiple meanings within a single document, corpora annotated with coreference relations between identifiers are crucial for the grounding task. Using MioGatto, annotators can produce a list of math concepts for each document, associate one of those concepts with each occurrence of a math identifier, and annotate the text span that is the source of the grounding. Manual annotation of coreference relations is generally very demanding, but this tool is specialized for building grounding corpora and lets annotators work more efficiently than existing general-purpose annotation tools. The tool can be obtained from https://github.com/wtsnjp/MioGatto.

Transcript

  1. MioGatto: A Math Identifier-oriented Grounding Annotation Tool

    Takuto Asakura (UTokyo), Yusuke Miyao (UTokyo), Akiko Aizawa (NII), and Michael Kohlhase (FAU)
    MathUI Workshop 2021, 2021-07-30
  2. The Annotation Tool: MioGatto

    Math Identifier-oriented Grounding Annotation Tool
    A novel annotation tool for math formulae grounding
    Its aim is to build large corpora for Math Language Processing (MLP)
    It can perform two types of annotations:
    1. Math concept for each math identifier occurrence
    2. Sources of grounding, i.e., definitions and declarations
    It is open source! (MIT License) https://github.com/wtsnjp/MioGatto
  3. Background: Ambiguities in Math Formulae

    Various ambiguities similar to natural languages [Kohlhase+, 2014]
    A symbol (token) can be used in several meanings
    Syntactic ambiguity, e.g. f(a + b)
    Formulae cannot be understood without reading surrounding texts
    Common sense and domain knowledge may be required, e.g. π is Archimedes' constant

    Usage of character y in the first chapter of PRML (except exercises):

    Text fragment from PRML Chap. 1                       | Meaning of y
    ... can be expressed as a function y(x) ...           | a function which takes an image as input
    ... an output vector y, encoded in ...                | an output vector of function y(x)
    ... two vectors of random variables x and y ...       | a vector of random variables
    Suppose we have a joint distribution p(x, y) ...      | a part of pairs of values, corresponding to x
  4. Grounding of Formulae [Asakura+, 2020]

    1. Finding groups of tokens which refer to mathematical concepts, e.g. α, cos, =, ×, etc.
    2. Associating a corresponding mathematical concept with each group
  5. Grounding of Formulae [Asakura+, 2020]

    ≈ Description Alignment + Coreference Resolution
    Description alignment: a task of associating a description with each math identifier occurrence
    There is some existing work [Aizawa+ 2013, Alexeeva+ 2020, etc.]
  6. Grounding of Formulae [Asakura+, 2020]

    ≈ Description Alignment + Coreference Resolution

    Coreference in Natural Languages
    Bob told Alice that he wants to study NLP. (Coreference: "Bob" and "he")

    Coreference in Formulae
    "The result of running the machine learning algorithm can be expressed as a function y(x) which takes a new digit image x as input and that generates an output vector y, encoded in the same way as the target vectors. The precise form of the function y(x) is determined during the training phase" (PRML, p. 2)
  7. Prior Results (1): Pilot Annotation [Asakura+, 2020]

    Annotating all occurrences of math identifiers* in an academic paper
    * Identifiers: variables, functions, and constants, e.g. y, θ, sin
    A suitable paper was taken from arXiv:
    A Very Brief Introduction to Machine Learning With Applications to Communication Systems [Simeone, 2018]

    Basic statistics of the target paper:

    #words in texts   10,616 | #<mi> tags      937
    #sections              7 | #inline math    331
    #pages (in PDF)       20 | #display math    23

    All existing data are distributed to SIGMathLing members (please join!)
    https://sigmathling.kwarc.info/resources/grounding-dataset/
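Statistics such as the number of `<mi>` tags and of inline and display formulae can be gathered directly from an XHTML file that embeds MathML. The following is a minimal sketch, not the script used for the pilot annotation. It assumes that display formulae carry `display="block"` on their `<math>` element; the sample document and the function name `count_math_statistics` are made up for illustration.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# Tiny hand-written sample; a real input would be a converted paper.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<p>A function
<math xmlns="http://www.w3.org/1998/Math/MathML" display="inline">
<mi>y</mi><mo>(</mo><mi>x</mi><mo>)</mo>
</math>
takes an input.</p>
<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
<mi>y</mi><mo>=</mo><mi>sin</mi><mi>θ</mi>
</math>
</body>
</html>"""


def count_math_statistics(xhtml_text):
    """Count <mi> elements and inline/display <math> elements."""
    root = ET.fromstring(xhtml_text)
    mi_tags = root.findall(f".//{{{MATHML_NS}}}mi")
    math_tags = root.findall(f".//{{{MATHML_NS}}}math")
    # Assumption: display math is marked with display="block".
    display = sum(1 for m in math_tags if m.get("display") == "block")
    return {
        "mi_tags": len(mi_tags),
        "inline_math": len(math_tags) - display,
        "display_math": display,
    }


print(count_math_statistics(SAMPLE_XHTML))
```

Every `<mi>` element counted this way is one identifier occurrence that an annotator has to ground, which is why this number gives a rough estimate of the annotation workload for a paper.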
  8. Source of Grounding

    Bases of grounding of formulae inside or outside documents:
    inner-document: surrounding texts, formulae (e.g. apposition noun, def =)
    outer-document: common sense, domain knowledge (e.g. Wikidata)
  9. Definition, Declaration, and Registration

    Sources of grounding are normally one of the following.
    Definition and Declaration
    [Figure labels: definition, declaration, definiendum]
    Registration (Others)
    "Throughout, we use Roman font to denote random variables and the corresponding letter in regular font for realizations." [Simeone, 2018]
  10. How MioGatto Works

    Input and Output
    Input: XHTML, i.e. LaTeX documents converted by LaTeXML [Miller, 2018]
    → Files in the arXMLiv dataset [Ginev, 2020] satisfy this requirement
    Output: annotation data in JSON format*

    Brief Procedure for Annotators
    1. Create items in the math concept dictionary
    2. Associate one of the items with each identifier occurrence
    3. Register text spans for the grounding sources

    Let me show you a demonstration!
    * Please refer to https://github.com/wtsnjp/MioGatto/wiki for the detailed spec.
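The three annotator steps can be pictured as operations on a small data structure that is finally serialized to JSON. The sketch below is only an illustration of that idea: the class names, field names, occurrence ids, and character offsets are all hypothetical and do not reproduce MioGatto's actual JSON format, which is specified in the project wiki.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Concept:
    description: str  # free-text meaning written by the annotator
    arity: int        # 0 for a variable, 1 for a unary function, ...


@dataclass
class AnnotationData:
    concepts: dict = field(default_factory=dict)     # symbol -> [Concept, ...]
    occurrences: dict = field(default_factory=dict)  # occurrence id -> concept reference
    sources: list = field(default_factory=list)      # grounding-source text spans

    def add_concept(self, symbol, description, arity):
        """Step 1: create an item in the math concept dictionary."""
        items = self.concepts.setdefault(symbol, [])
        items.append(Concept(description, arity))
        return len(items) - 1  # index of the new item for this symbol

    def associate(self, occurrence_id, symbol, concept_index):
        """Step 2: associate one dictionary item with an identifier occurrence."""
        if not 0 <= concept_index < len(self.concepts.get(symbol, [])):
            raise ValueError(f"no concept {concept_index} for symbol {symbol!r}")
        self.occurrences[occurrence_id] = {"symbol": symbol, "concept": concept_index}

    def register_source(self, symbol, concept_index, start, end):
        """Step 3: register a text span as the grounding source of a concept."""
        self.sources.append(
            {"symbol": symbol, "concept": concept_index, "start": start, "end": end}
        )

    def to_json(self):
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)


# Two meanings of y, as in the PRML example; ids and offsets are made up.
data = AnnotationData()
func = data.add_concept("y", "a function which takes an image as input", arity=1)
vec = data.add_concept("y", "an output vector of function y(x)", arity=0)
data.associate("occ-1", "y", func)
data.associate("occ-2", "y", vec)
data.register_source("y", func, start=53, end=68)
print(data.to_json())
```

Keeping the concept dictionary separate from the occurrences means that two occurrences of the same symbol can point at different concepts, which is exactly the situation the grounding task has to handle.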
  11. Variety of Annotation Tools in NLP

    General Tools (not specific to NLP)
    Some commercial tools provide basic annotation functionality:
    Adobe Acrobat can add free-text notes and highlight text in PDFs
    hypothes.is can do something similar for web pages

    Annotation Tools for NLP
    A number of efforts have been made. Examples:
    brat [Stenetorp+, 2012] is a highly functional tool that is well known in NLP
    WebAnno [Yimam+, 2013] has extensive features for collaboration
    PDFAnno [Shindo+, 2018] can annotate PDFs directly
    SACR [Oberle, 2018] is specialized for coreference relations
  12. Comparison with Other Tools for MLP

    KAT: KWARC Annotation Tool [Ginev+, 2015]
    A web-based annotation tool for STEM documents
    Annotating attributes for the OMDoc format [Kohlhase 2006]
    Input: HTML5, Output: annotation expressed in RDF
    Currently not actively maintained

    AnnoMathTeX [Scharpf+, 2019 & 2021]
    An annotation recommender system for math identifiers
    Document-global annotations unless a 'local' option is specified
    Input: Wikitext or LaTeX, Output: annotation expressed in JSON

    MioGatto made some additions, including:
    extra information for math concepts, e.g. math type and arity
    annotation of grounding sources
  13. Current Status: Working on the Actual Annotation

    Our Team
    We are currently working with 8 part-time annotators
    They are mostly graduate students:
    Four in Natural Language Processing
    Two in Logic
    One in Mathematics
    One in Physics

    Annotation
    Math concepts
    Sources of grounding
  14. Future Plan (1): Automating the Grounding

    Three Steps of Automation
    1. Detecting/retrieving inner-document sources of grounding
       → Pattern matching + POS tagging
    2. 'Dictionary' generation by clustering the sources
       → Short text clustering [Xu+, 2017] may be applicable
    3. Associating each occurrence with the entry in the 'dictionary'
       → Pattern matching + POS tagging + text classification

    [Figure labels: Source Detection, Dictionary Generation, Associating, Repeat & Improvement, Proposing Dataset, Enhancement, Evaluation]
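To make the first two steps more concrete, here is a deliberately simplified sketch. It uses a single regular expression for apposition patterns such as "a function y(x)" and omits POS tagging entirely. The dictionary step is replaced by plain grouping on the identifier's first letter rather than short text clustering. The pattern, the function names, and the grouping rule are assumptions made for illustration, not the authors' planned method.

```python
import re
from collections import defaultdict

# Article, one to three lowercase words, then a single-letter identifier
# that may be followed by one argument in parentheses, e.g. "x" or "y(x)".
APPOSITION = re.compile(
    r"\b(?:a|an|the)\s+"
    r"((?:[a-z]{2,}\s+){1,3})"
    r"([a-z](?:\([a-z]\))?)"
    r"(?![a-z])"
)

TEXT = (
    "The result of running the machine learning algorithm can be expressed "
    "as a function y(x) which takes a new digit image x as input and that "
    "generates an output vector y, encoded in the same way as the target vectors."
)


def detect_sources(text):
    """Step 1 (simplified): return (identifier, description) pairs."""
    return [(m.group(2), m.group(1).strip()) for m in APPOSITION.finditer(text)]


def build_dictionary(sources):
    """Step 2 (simplified): group descriptions by the identifier's base symbol."""
    dictionary = defaultdict(list)
    for identifier, description in sources:
        dictionary[identifier[0]].append(description)
    return dict(dictionary)


sources = detect_sources(TEXT)
print(sources)
print(build_dictionary(sources))
```

A pattern this naive would also treat a standalone article "a" as an identifier in other sentences, which illustrates why the plan combines pattern matching with POS tagging.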
  15. Future Plan (2): Enhancement for MioGatto

    Review Mode
    Clearly show discrepancies between annotators
    Enable commenting on annotations for discussion

    Other enhancements
    Output format standardization
    More improvements to the UI for efficient annotation

    MioGatto is open source! You are welcome to use it, request new features, and send patches for improvements.
    https://github.com/wtsnjp/MioGatto
  16. References (1)

    Akiko Aizawa, Michael Kohlhase, and Iadh Ounis. “NTCIR-10 Math Pilot Task Overview.” In Proceedings of NTCIR-10 (2013).
    Maria Alexeeva et al. “MathAlign: Linking Formula Identifiers to their Contextual Natural Language Descriptions.” In Proceedings of LREC 2020.
    Takuto Asakura et al. “Towards Grounding of Formulae.” In Proceedings of SDP2020.
    Christopher M. Bishop. Pattern Recognition and Machine Learning (2006).
    Deyan Ginev et al. “KAT: an annotation tool for STEM documents.” In Proceedings of MathUI Workshop 2015.
    Deyan Ginev. arXMLiv:2020 dataset, an HTML5 conversion of arXiv.org. SIGMathLing (2020).
    Jiaming Xu et al. “Self-taught convolutional neural networks for short text clustering.” Neural Networks 88 (2017).
    Michael Kohlhase and Mihnea Iancu. “Co-representing structure and meaning of mathematical documents” (2014).
    Bruce Miller. LaTeXML The Manual: A LaTeX to XML/HTML/MathML Converter (2018).
    Bruno Oberle. “SACR: A Drag-and-Drop Based Tool for Coreference Annotation.” In Proceedings of LREC 2018.
  17. References (2)

    Hiroyuki Shindo, Yohei Munesada, and Yuji Matsumoto. “PDFAnno: a web-based linguistic annotation tool for pdf documents.” In Proceedings of LREC 2018.
    Philipp Scharpf et al. “AnnoMathTeX - a Formula Identifier Annotation Recommender System for STEM Documents.” In Proceedings of RecSys 2019.
    Philipp Scharpf et al. “Fast Linking of Mathematical Wikidata Entities in Wikipedia Articles Using Annotation Recommendation.” In Proceedings of WWW 2021.
    Osvaldo Simeone. “A very brief introduction to machine learning with applications to communication systems.” IEEE Transactions on Cognitive Communications and Networking (2018).
    Pontus Stenetorp et al. “brat: a Web-based Tool for NLP-Assisted Text Annotation.” In Proceedings of EACL 2012.
    Seid Muhie Yimam, Iryna Gurevych, Richard Eckart de Castilho, and Chris Biemann. “WebAnno: A flexible, web-based and visually supported system for distributed annotations.” In Proceedings of ACL 2013.