LR parsing. “On the translation of languages from left to right” • 1975: Yacc is published • 1985: GNU Bison initial release • 1989: Berkeley Yacc initial release • 2006: GCC migrates it’s parser from Bison to hand- written recursive-descent parsers (C++ was 2004) • 2015: Go migrates it’s parser from Bison to hand-written recursive- descent parsers
code • If valid • Build internal representation (AST) for subsequent process (“compile.c”) • If invalid • Report Syntax Error • Build AST as far as possible (New!)
invalid cases • A rule failure doesn’t imply a parsing failure like in context free grammars https://github.com/python/cpython/blob/889b0b9bf95651fc05ad532cc4a66c0f8ff29fc2/Grammar/python.gram
Go skips one or more tokens until one of “followlist” appears https://github.com/golang/go/blob/157aae6eed1c092fd9e8ead3527185378eb828e1/src/cmd/compile/internal/syntax/parser.go#L1032 https://github.com/golang/go/blob/157aae6eed1c092fd9e8ead3527185378eb828e1/src/cmd/compile/internal/syntax/parser.go#L321
cases • hand-written parser requires developers to cover all error cases • panic-mode error recovery loses part of input, sometimes which is most important for complementation
template fi le (“yacc.c”) • Template fi le is a detail of implementation • Installed version of Bison depends on environments • Expanding Bison template is not easy https://github.com/akimd/bison/blob/25b3d0e1a3f97a33615099e4b211f3953990c203/data/skeletons/yacc.c#L1640
implementation • Will be installed ruby/ruby tool directory • Input fi le is Bison format fi le (“parse.y”) • Output is LALR parser written by C • Generate 100% compatible C fi le for Ruby 3.0.5, 3.1.0, 3.2.0 • https://bugs.ruby-lang.org/issues/19637
parser need to be aware of detail of grammar • Bison’s panic mode loses part of input • Token based error recovery is fl exible, no need to know the detail of grammar • We can ride on DFA’s theory if we use LR parser • Defeated the fi rst boss !
prevailing consensus, we can have the best of both worlds: for a very general, practical class of grammars—a strict superset of Knuth’s canonical LR—we can generate parsers automatically, and such that the resulting parser code, as well as the generation procedure itself, is highly ef fi cient. “Practical LR Parser Generation”
(about 15,000 lines) 2. LALR is dif fi cult, e.g. S/R con fl ict, R/R con fl ict 3. Bison doesn’t provide syntax sugar like option, list 4. It’s a mixture of parser and ripper 5. Parser and Lexer are tightly-coupled
types of state. • int paren_nest: Nest level of (, [, {. Used for parsing -> {}. • int lpar_beg: Stores paren_next when parsing lambda starts. • int brace_nest: Next level of {. Used for parsing “#{var}". • stack_type cond_stack: Used for parsing condition like “while ... do” • stack_type cmdarg_stack: Used for parsing command call like “foo 1, 2 do”
https://matz.rubyist.net/20040426.html#p02 1. Write two full set of rules, one is with do, another is without do. 2. Hack a lexer so that a lexer returns different tokens for same “do” string based on the context (= state) • CRuby selected the later
• > yacc doesn’t support conditions for rules, we can not omit some rules when some conditions are met. • Matz daily 2004/04/26 • https://matz.rubyist.net/20040426.html#p02
the fact does not mean it’s impossible! • Joe Zimmerman “Practical LR Parser Generation”, Sep 2022 • https://arxiv.org/pdf/2209.08383.pdf • “Nonterminal attributes”
ne attributes 2. “do” in f_larglist has less precedence than “do” in lambda_body -> “do” is reduced 3. “do” in () is shifted 4. “do” in top level is shifted
between parser and lexer • Nonterminal attributes solves a part of problems • Nonterminal attributes for precedence solves “do” overload • We have not leveraged the potential of LR parser • Defeated the second boss !!
CRuby parser • mruby, PicoRuby: Other Ruby implementation by C • JRuby, Truf fl eRuby, ruruby: Other Ruby implementation by non-C • sorbet, typeprof: Tools • Implementing 100 % compatible Ruby parser is a bit dif fi cult • Managing parser for each version is dif fi cult
function pointer 2. Linking functions into a parser shared library 1. parse.o: Generated from “parse.y” 2. node2.o: Separate AST/Node codes from “node.c” 3. st2.o: Copy “st.c” and remove unnecessary codes • https://github.com/yui-knk/ruby/tree/universal-parser
It is something that we can decide, and to the extent that we do not violate any known laws of the universe, we can probably make it work the way that we want to. Alan Curtis Kay
Universal Parser • We can ride on DFA’s theory when we use LR parser • We have not leveraged the potential of LR parser • Lrama LALR parser generator is an infrastructure for Ruby parser
from Bison to hand- written recursive-descent parsers (C++ was 2004) • 2015: Go migrates it’s parser from Bison to hand-written recursive- descent parsers
from Bison to hand- written recursive-descent parsers (C++ was 2004) • 2015: Go migrates it’s parser from Bison to hand-written recursive- descent parsers • 2020: “Don't Panic! Better, Fewer, Syntax Errors for LR Parsers” • 2022: “Practical LR Parser Generation”
from Bison to hand- written recursive-descent parsers (C++ was 2004) • 2015: Go migrates it’s parser from Bison to hand-written recursive- descent parsers • 2020: “Don't Panic! Better, Fewer, Syntax Errors for LR Parsers” • 2022: “Practical LR Parser Generation” • 2023: “The future vision of Ruby Parser” in RubyKaigi 2023
basic building blocks • Expanding grammar DSL to leverage LR parser • Moving lexer’s logic into parser grammar rules • Multiple parser algorithms for multiple purposes • Users can focus on writing grammar
to Lrama • Install Lrama into CRuby • Usability (Error-tolerant parser) • Integrate Lrama error-tolerant functions with CRuby • Maintainability • Use Nonterminal attributes precedence • Universal Parser • Sort out the interface then merge the PR
parser • Any feedbacks are welcome • For developers who will use Universal Parser and AST • Share me use cases • For developers who has interest in implementing Universal Parser • Let me know