

Watson
July 08, 2019

Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language / cicm2019

Converting science, technology, engineering, and mathematics (STEM) documents to formal expressions is beneficial. Achieving that conversion requires analyzing formulae and text together, with each analysis informing the other. We have begun to tackle the conversion from two foundational parts of this synthetic analysis. In this abstract, we briefly introduce our aim, our planned approaches, and the current status of the work.


Transcript

  1. Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language

    Takuto ASAKURA, SOKENDAI / Miyao Group at UTokyo, 2019-07-08
  2. About Me

    Takuto ASAKURA (a.k.a. wtsnjp)
    - A graduate student at SOKENDAI
    - A member of Miyao Group, UTokyo
    - Supervisors: Prof. Yusuke Miyao, Prof. Akiko Aizawa
    - I studied bioinformatics at UTokyo for my bachelor's degree

    I'm also a heavy TeX user:
    - A member of the TeX Live Team, maintaining Texdoc (a documentation search tool) and support for Japanese
    - A contributor to the LaTeX3 Project
  3. Targets: STEM Documents

    The targets of our work are Science, Technology, Engineering, and Mathematics (STEM) documents, e.g. papers, textbooks, and manuals.

    STEM documents are:
    - the essence of human knowledge
    - well organized (semi-structured)
    - texts with mathematical expressions
  4. Long-term Goal: Converting STEM Documents to Formal Expressions

    STEM documents (natural language + formulae: papers, textbooks, manuals, etc.) → conversion → computational form (formal language: executable code, first-order logic, etc.)

    The conversion enables us to:
    - construct databases of mathematical knowledge
    - search for formulae
  5. Necessity of Synthetic Analysis

    Importance of formulae in STEM documents:
    - Mathematical expressions are commonly used in scientific communication in numerous fields, e.g. mathematics, physics, informatics.
    - They often express key ideas in STEM documents.

    Interaction between text and formulae. Text and formulae are complementary to each other [Kohlhase and Iancu, 2015]:
    - Text explains formulae (and vice versa)
    - Text appears inside formulae, e.g. {x ∈ ℕ | x is prime}
    - Notations and verbalizations, e.g. 1 + 2 and "one plus two"

    Deep synthetic analyses of natural language and mathematical expressions are necessary.
  6. Short-term Goals

    At first, we focus on token-level analysis in formulae, which is:
    - an initial step of the conversion
    - almost untouched

    Examples of tokens: ε, ×, log, etc.

    We will work on both algorithm and theory:
    1. Automatically associating formula tokens with mathematical objects
    2. Discussing morphology and lexical semantics for mathematical expressions

    [Figure: levels of analysis for mathematical expressions and natural language: token-level (grounding), word-level (morphology, lexical semantics), fragment-level (parsing), phrase-level (syntax, PSG), formulae-level (SP), sentence-level (semantics), and applications (conversion, IR, searching, etc.)]
  7. Grounding Tokens to Mathematical Objects

    - Tokens in formulae, and combinations of them, can refer to mathematical objects.
    - Detecting these references is fundamental for understanding STEM documents.

    Example: "For example, x might describe the outcome of flipping a coin, with x = 1 representing 'heads', and x = 0 representing 'tails'. We can imagine that this is a damaged coin so that the probability of landing heads is not necessarily the same as that of landing tails. The probability of x = 1 will be denoted by the parameter μ. The probability distribution over x can therefore be written in the form $\mathrm{Bern}(x \mid \mu) = \mu^{x}(1 - \mu)^{1 - x}$, which is known as the Bernoulli distribution." (PRML, pp. 86–87)

    Annotations on the example:
    - x: the result of coin flipping, int, x ∈ {0, 1}
    - μ: the probability of 'heads' on top, float, 0 ≤ μ ≤ 1
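To make the annotations on this example concrete, here is a minimal Python sketch of what grounded tokens could look like and what they make possible. The `GroundedToken` record and its field names are assumptions made for illustration, not a schema from the talk; only the descriptions, the `int`/`float` types, and the two conditions come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundedToken:
    """A formula token linked to the object it denotes (hypothetical schema)."""
    token: str
    description: str
    py_type: type
    condition: str


# Groundings read off the annotated example on the slide.
GROUNDINGS = [
    GroundedToken("x", "the result of coin flipping", int, "x in {0, 1}"),
    GroundedToken("mu", "the probability of 'heads' on top", float, "0 <= mu <= 1"),
]


def bern(x: int, mu: float) -> float:
    """Bernoulli probability: mu**x * (1 - mu)**(1 - x)."""
    # The two checks mirror the conditions attached to the grounded tokens.
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    return mu**x * (1.0 - mu) ** (1 - x)
```

Once each token carries a type and a condition, the formula can be turned into a function with a checked signature, which is the kind of output the conversion aims at.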
  8. Difficulty of the Grounding

    Factors that make the detection highly challenging:
    - ambiguity of tokens (see the table below)
    - syntactic ambiguity of formulae, e.g. $f(a + b)$
    - necessity of common sense and domain knowledge
    - severe abbreviation

    Usage of the character y in the first chapter of PRML (except exercises):

    | Text fragment from PRML Chap. 1 | Meaning of y |
    | --- | --- |
    | . . . can be expressed as a function y(x) . . . | a function which takes an image as input |
    | . . . an output vector y, encoded in . . . | an output vector of function y(x) |
    | . . . two vectors of random variables x and y . . . | a vector of random variables |
    | Suppose we have a joint distribution p(x, y) . . . | a part of pairs of values, corresponding to x |
  9. Related Work

    [Aizawa+, 2013] NTCIR-10 Math Pilot Task:
    - annotating a description for each token in formulae
    - an object can be described in several ways → difficult to make the annotation coherent

    [Stathopoulos+, 2018] Variable Typing:
    - assigning a mathematical type to each token, e.g. set, monoid
    - a sort of subtask of our grounding

    (The table from the previous slide, on the usage of the character y in PRML Chap. 1, is shown again here.)
  10. A Simplified Task

    What are mathematical objects? Each mathematical object should have a description and some attributes, including:
    - mathematical type
    - condition, e.g. larger than 0

    The necessary and sufficient attributes are still unclear → we will see after some experiments.

    Clustering for tokens: giving the same label to tokens that refer to the same mathematical object is an easier task (cf. co-reference in NLP).

    Example: "The result of running the machine learning algorithm can be expressed as a function y(x) which takes a new digit image x as input and that generates an output vector y, encoded in the same way as the target vectors. The precise form of the function y(x) is determined during the training phase, also known as the learning phase, on the basis of the training data." (PRML, p. 2)
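The two ideas on this slide, an object record with attributes and a clustering of tokens by shared label, can be sketched as a small data structure. This is only an illustration: the class names, field names, and label strings are hypothetical, and the labels below are one hand-made reading of the PRML passage, not output of the annotation tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MathObject:
    """A mathematical object with the attributes named on the slide."""
    description: str
    math_type: str
    condition: Optional[str] = None  # e.g. "larger than 0"


@dataclass(frozen=True)
class Occurrence:
    """One appearance of a token in the text (hypothetical record)."""
    surface: str   # the token as written
    context: str   # the formula or phrase it appears in
    label: str     # cluster label shared by co-referring tokens


def cluster(occurrences):
    """Group occurrences that carry the same label."""
    clusters = defaultdict(list)
    for occ in occurrences:
        clusters[occ.label].append(occ)
    return dict(clusters)


# Hand-labelled token occurrences from the PRML p. 2 passage.
OCCURRENCES = [
    Occurrence("y", "a function y(x)", "function"),
    Occurrence("x", "a new digit image x", "input_image"),
    Occurrence("y", "an output vector y", "output_vector"),
    Occurrence("y", "the function y(x)", "function"),
]

CLUSTERS = cluster(OCCURRENCES)
```

Here the same surface token `y` ends up in two clusters, which is the ambiguity that makes clustering a useful first step before attaching full `MathObject` records.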
  11. Morphology for Mathematical Expressions

    Morphemes and words (terms of morphology):
    - morpheme: the smallest meaningful unit in a language
    - word: a morpheme, or a combination of a few morphemes, that can refer to an object

    Example: the word "un-break-able" comprises three morphemes.

    Words in mathematical expressions: as a matter of fact, words also exist in formulae.

    Example: M is a word in "Matrix M", but M is not a word in "An entry $M_{i,j}$" ($M_{i,j}$ is a word).
  12. Semantics Over Natural Language and Mathematical Expressions

    Some ambiguities arise only when context exists. For instance, equals signs (=) in formulae have at least three meanings: definition, identity, and equation.

    Example: Let $a = 4$, $b = 3$. Suppose we have to solve $ax^4 + bx^2 + 1 = 0$. To reach the answer, the "difference of two squares" is helpful: $p^2 - q^2 = (p + q)(p - q)$.
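As a worked version of this example, assume the equation on the slide is $ax^4 + bx^2 + 1 = 0$ with $a = 4$ and $b = 3$ (the variable is lost in the transcript, so this reading is a reconstruction). The difference-of-two-squares identity then factors the quartic as follows. On this reading, "Let a = 4" uses "=" as a definition, "... = 0" as an equation, and the factoring rule as an identity.

```latex
\begin{align*}
4x^4 + 3x^2 + 1 &= (4x^4 + 4x^2 + 1) - x^2 \\
                &= (2x^2 + 1)^2 - x^2 \\
                &= (2x^2 + x + 1)(2x^2 - x + 1),
\end{align*}
so $4x^4 + 3x^2 + 1 = 0$ holds exactly when
\[
x = \frac{-1 \pm i\sqrt{7}}{4}
\quad\text{or}\quad
x = \frac{1 \pm i\sqrt{7}}{4}.
\]
```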
  13. Dataset

    arXMLiv:
    - papers from arXiv in XML format [Ginev+, 2009]
    - converted from LaTeX via LaTeXML
    - formulae are in MathML markup

    Pipeline: arXiv.org → LaTeXML → XHTML/XML
  14. A Little Note for MathML

    - a W3C Recommendation [Ausbrooks+, 2014]
    - includes two markups: presentation and content

    Example expression: $(a + b)^2$

    Presentation Markup (this shows syntax): <msup> <mfenced> <mi>a</mi> <mo>+</mo> <mi>b</mi> </mfenced> <mn>2</mn> </msup>

    Content Markup (this shows semantics): <apply> <power/> <apply> <plus/> <ci>a</ci> <ci>b</ci> </apply> <cn>2</cn> </apply>
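One way to see why content markup "shows semantics" is that it can be evaluated directly. The sketch below walks the content tree for the slide's expression using Python's standard `xml.etree.ElementTree`. It is a simplified illustration under stated assumptions: the markup is written with an empty `<power/>` operator element and without the MathML namespace that real arXMLiv files carry, only `plus` and `power` are supported, and the values bound to `a` and `b` are made up.

```python
import xml.etree.ElementTree as ET

# Content markup for (a + b)^2, without namespaces for brevity.
CONTENT_MATHML = """
<apply>
  <power/>
  <apply>
    <plus/>
    <ci>a</ci>
    <ci>b</ci>
  </apply>
  <cn>2</cn>
</apply>
"""

# Only the two operators used in the example are handled.
OPERATORS = {
    "plus": lambda args: sum(args),
    "power": lambda args: args[0] ** args[1],
}


def evaluate(node, env):
    """Recursively evaluate a content MathML element."""
    if node.tag == "ci":                 # identifier: look up its value
        return env[node.text.strip()]
    if node.tag == "cn":                 # number literal
        return float(node.text.strip())
    if node.tag == "apply":              # first child is the operator
        operator, *operands = list(node)
        values = [evaluate(child, env) for child in operands]
        return OPERATORS[operator.tag](values)
    raise ValueError(f"unsupported element: {node.tag}")


root = ET.fromstring(CONTENT_MATHML.strip())
result = evaluate(root, {"a": 1, "b": 2})  # (1 + 2) ** 2
```

The presentation markup for the same expression has no operator elements to dispatch on, so an equivalent evaluator would first have to decide what the superscript and the fence mean.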
  15. Pilot Annotation

    As a first attempt, we are now carrying out a pilot annotation of 3 papers from arXMLiv in the following manner:
    1. Detecting minimal groups of tokens (i.e., words), each of which refers to a mathematical object.
    2. Categorizing words by the mathematical object they refer to.

    Example: "The result of running the machine learning algorithm can be expressed as a function y(x) which takes a new digit image x as input and that generates an output vector y, encoded in the same way as the target vectors. The precise form of the function y(x) is determined during the training phase, also known as the learning phase, on the basis of the training data." (PRML, p. 2)

    Let me show you a demonstration!
  16. What's Next?

    Creating a dataset:
    - completing the annotation for ≥ 10 papers from arXiv
    - I would also like to do it for some textbooks
    - checking the reproducibility of the annotation

    Automating the detection with a combination of rule-based methods and machine learning, using features such as:
    - apposition nouns, e.g. "a function f"
    - syntactic information in formulae, e.g. does the token appear inside an argument or not?
    - distance from the previous appearance
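The apposition-noun feature can be illustrated with a small rule. The sketch below is an assumption-laden toy, not the planned system: the noun list `TYPE_NOUNS` is invented for the example, tokens are limited to single Latin letters, and a plain regular expression stands in for real linguistic analysis.

```python
import re

# Hypothetical, tiny vocabulary of type-indicating nouns.
TYPE_NOUNS = ("function", "vector", "matrix", "set", "parameter")

# Article, optional modifier, type noun, then a one-letter token,
# as in "a function y" or "an output vector y".
APPOSITION = re.compile(
    r"\b(?:an?|the)\s+(?:\w+\s+)?(" + "|".join(TYPE_NOUNS) + r")\s+([A-Za-z])\b"
)


def apposition_candidates(sentence):
    """Return (token, noun) pairs found by the apposition pattern."""
    return [(m.group(2), m.group(1)) for m in APPOSITION.finditer(sentence)]


SENTENCE = (
    "The result of running the machine learning algorithm can be expressed "
    "as a function y(x) which takes a new digit image x as input and that "
    "generates an output vector y, encoded in the same way as the target vectors."
)

print(apposition_candidates(SENTENCE))
# [('y', 'function'), ('y', 'vector')]
```

The two hits give the same token `y` two different nouns, so such a feature is a hint for the classifier rather than a decision on its own.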
  17. Possible Applications

    - Mathematical Information Retrieval (MIR) → enables us to create scientific knowledge bases
    - Automatic code generation, e.g. Python, Coq
    - Searching for mathematical expressions

    Example: let us think about searching for $x^n + y^n = z^n$ ($n \geq 3$). It is easy to search if you know the keyword "Fermat's Last Theorem", but otherwise...
  18. Today's Conclusions

    - Converting STEM documents to computational form is beneficial and challenging.
    - For the conversion, synthetic analysis of natural language and mathematical expressions is required.
    - At first, we focus on token-level analyses:
      - grounding tokens to mathematical objects
      - discussing morphology for formulae
    - Currently, we are working on the pilot annotation.
    - Possible applications: MIR, code generation, searching for formulae

    Thanks for your time! Questions?