of parsers. People are setting sail into the vast sea of parsers. - RubyKaigi 2023 LT- Yuichiro Kaneko https://twitter.com/kakutani/status/1657762294431105025/
Backend and Code Generator. Each component is independent from others so that we need to touch only necessary components when new feature is enhanced. - BuriKaigi 2024 in Toyama -
codes, they belong to Ruby language. https://github.com/tric/trick2022/blob/master/01-tompng/entry.rb https://github.com/tric/trick2022/blob/master/06-mame/entry.rb
range of languages • Major parser algorithm • To be precise, LR-attributed grammar • I believe grammar easy for human is close to LR grammar • LL parser • Has has less power than LR parser • PEG • It’s difficult to create Error Tolerant parser • A rule failure doesn’t imply a parsing failure like in context free grammars
generator gives accurate feedback for grammar • BNF is very declarative • No gap between grammar and parser implementation • LR parser is based on theory of computer science
is CFG or not • This is a trick used in TRICK 2022 • This is NOT CFG because existence of the variable affects the following codes https://www.slideshare.net/mametter/trick-2022-results
Knuth and Peter Wegner • Original paper is Knuth, Donald E. (1968) "Semantics of context-free languages" • “An attribute grammar is a formal way to supplement a formal grammar with semantic information processing.” • https://en.wikipedia.org/wiki/Attribute_grammar
and usages • Type checking • Check control flow function f1() { var i = 1; i + j; } function f1() { var i = 1; var j = 2; i + j; } Error: Not declared variable “j” is used
}} expr: ident_1 '+' ident_2 ';' {{ vars = expr.var_list raise "#{ident_1.value} is not declared" unless vars[ident_1.value] raise "#{ident_2.value} is not declared" unless vars[ident_2.value] expr.value = vars[ident_1.value] + vars[ident_2.value] }} decls: decls decl {{ decls.var_list = decls.var_list.merge(decl.var_list) }} decls: decl {{ decls.var_list = decl.var_list }} func_body: decls expr {{ expr.var_list = decls.var_list }} Add an identi fi er to a list Check identi fi ers are declared Use identi fi er’s value Merge identi fi er lists to one Pass identi fi er list to expr so that we can access identi fi er list in expr * Only important production rules and semantic rules Copy identi fi er list
update the list decls decls func_body expr + decl j = 2 decl i = 1 ident i ident j function f1() { var i = 1; var j = 2; i + j; } (1) list = {i: 1} (2) list = {i: 1, j: 2}
the variable list decls decls func_body expr + decl j = 2 decl i = 1 ident i ident j function f1() { var i = 1; var j = 2; i + j; } (1) list = {i: 1} (2) list = {i: 1, j: 2} (3) list = {i: 1, j: 2}
the variable list decls decls func_body expr + decl j = 2 decl i = 1 ident i ident j function f1() { var i = 1; var j = 2; i + j; } (1) list = {i: 1} (2) list = {i: 1, j: 2} (3) list = {i: 1, j: 2} (4) list = {i: 1, j: 2} (5) list = {i: 1, j: 2}
not declared decls func_body expr + decl i = 1 ident i ident j function f1() { var i = 1; // var j = 2; i + j; } (1) list = {i: 1} (2) list = {i: 1} (3) list = {i: 1} (4) Error!!!
Attribute • In expr, var list is Inherited Attribute • Inherited Attribute allows to pass from parent to children decls decls func_body expr + decl j = 2 decl i = 1 ident i ident j (ˑ) list = {i: 1} (ˑ) list = {i: 1, j: 2} (˒) list = {i: 1, j: 2} (˒) list = {i: 1, j: 2} (˒) list = {i: 1, j: 2}
not tree • It may be circular • It may require exponential time for calculation • Subset of attribute grammar • L-attributed grammar • LR-attributed grammar • S-attributed grammar
is that some automatons are managed by a stack • Generate automatons from each rule program : class_def class_def : "class" id body "end" body : method_def method_def : "def" id "end" M1 M2 M3 M4 B1 B2 C1 C2 C3 C5 P1 P2 method_def def end id class_def C4 class id body end
then current automaton reaches to the accepting state class A def m end end P1 P2 class_def M1 M2 M3 M4 B1 B2 C1 C2 C3 C5 method_def def end id C4 class id body end
move next automaton state to “B2” • Next automaton also reaches to the accepting state class A def m end end P1 P2 class_def B1 B2 C1 C2 C3 C5 method_def C4 class id body end
… • Which automatons the parser should put to the stack? M1 M2 M3 M4 S1 S2 S3 S5 P1 P2 def end id defn / defs S4 def S6 self . id end def m end Option 1 Option 2
LR parser can evaluate when the parser parse codes • Condition #1: All attribute dependencies are left-to-right direction • Condition #2: All inherited attributes in the same state has unique values
be handled by LR parser • Class can not be defined in def scope • Variable list also can be handled class A def m end end P1 P2 class_def M1 M2 M3 M4 B1 B2 C1 C2 C3 C5 method_def def end id C4 class id body end in_class = true in_def = true
attribute can’t be decided just after “def” • The attribute doesn’t exist in Ruby • The attribute can be decided just after “self” D1 D2 D5 D7 D6 D8 def D3 D4 self . id end id end def m end def self.m end in_singleton_def = false in_singleton_def = true in_singleton_def = false in_singleton_def = true
inherited attributes • Inherited attributes carries contexts from top to bottom • In short, LR parser can manage contexts with some limitation • The direction of context flow is left-to-right • I believe this is reasonable because we read codes from top-to- bottom and left-to-right • If multiple production rules are expected, the value of contexts should be unique • I believe this is reasonable to reduce cognitive cost of human
and invalid input • Parser gives structure to the input correctly • Grammar defines the boundary of the language and structure of the language • Use appropriate grammar class • With great power comes great difficulties • LR-attributed grammar is the foundations theory of current Ruby parser
for machines • What is the key properties of good programing language for programmers? • However sometimes it’s difficult for programmers to understand which sentences are connected
design the grammar whose grammar rule conflicts • Example: “Dangling else” • https://en.wikipedia.org/wiki/Dangling_else // Rules if a then s if b then s1 else s2 // Code if a then if b then s else s2 // #1 (if a then (if b then s else s2)) // #2 (if a then (if b then s) else s2)
the precedence locally • Then it’s not the limitation of parser but the design of grammar • In the discussion, consistency of precedence between “=” & “and” are kept
start and end are clear, the chance of conflict will decrease • Informally: I consider how left context is powerful enough to minimize the rule candidates • In change precedence case • The appearance of “{“ on the left is enough powerful to distinguish normal expressions and inverse precedences expressions • The appearance of “}” determines the end of inverse precedences expressions
because method definition always starts with “def” • End is clear because method definition always ends with “end” until Ruby 2.7.0 • Endless method definition is introduced from Ruby 3.0.0
is the cause of conflict • Design the precedence based on human cognitive ability • E.g. ‘+’ < ‘*’ • Modifier has similar characteristic with infix operator
In textbooks, lexer and parser are completely separated components • However both of them are tightly coupled with in Ruby • Sometimes it’s called “Monstrous lex_state”
r51617 but r51616 fixed the issue, right? • The error was unexpected “keyword_do_cond” and COND_PUSH(1) is called after ‘?’ • There is a space between “end” and ‘:’
managing where label is disallowed (EXPR_VALUE) to where label is allowed (EXPR_LABEL) {foo: ("" rescue "")} { label:<<-DOC Some text for a heredoc goes here DOC } label label
tokenized • Then it’s IS_ARG() { label:<<-DOC Some text for a heredoc goes here DOC } case '<': last_state = lex_state; c = nextc(); if (c == '<' && !IS_lex_state(EXPR_DOT | EXPR_CLASS) && !IS_END() && (!IS_ARG() || space_seen)) { int token = heredoc_identi fi er(); if (token) return token; } ...
for programmers not for machines • Ruby syntax changes • Parser needs to • have theory and mechanism which mitigate implementation complexities • give the language designer feedbacks about syntax changes
Ruby • Lexer generates different tokens depending on the parser state • Tokens with same length but different identity • Tokens with different length • By the way, parser knows what kind of tokens itself can accept on each parser state
lexer then change to manage states on parser side • Joel E. Denny. “PSLR(1): Pseudo-Scannerless Minimal LR(1) for the Deterministic Parsing of Composite Languages”, May 2010. • https://tigerprints.clemson.edu/cgi/viewcontent.cgi? article=1519&context=all_dissertations • PSLR stands for Pseudo-Scannerless Minimal LR
to generate loosely coupled scanners and parsers, so the user must maintain these tightly coupled scanner and parser specifications separately but consistently. • > Scanner and parser specifications would be significantly more maintainable if all sub-language transitions were instead computed from a grammar by a parser generator and recognized automatically by the scanner using the parser’s stack.
Scoped Declarations” of the paper • C++0x • and ‘>>’ has higher precedence than ‘>’, vice verse in Lc Lc Lt Y<X<(6>>1)>> x4; Lc Lt Lp Lc : main C++0x language Lt : template argument list sub-language Lp : parenthesized expression sub-sub-language %lex-prec ’>’ -< ’>>’ for Lc and Lp %lex-prec ’>>’ -< ’>’ for Lt Y<X<(6>>1)>> x4;
• Collecting tokens which are the last token of “opt_block_param” -> ‘|’ • Parser update the lexer precedence to sub-language mode after “do” and restore it after the second ‘|’ primary: method_call brace_block brace_block: k_do do_body k_end do_body: opt_block_param bodystmt opt_block_param: block_param_def block_param_def: '|' opt_bv_decl '|' | '|' … '|'
defined, the parser state has scope conflict • Split the state again so that each state doesn’t have contradictional lexer precedence • In this case, the states can be separated because one follows “{” and other follows “do” %nterm opt_block_param { %lex-prec ‘||’ < ‘|’ } %nterm brace_body { %lex-prec ‘||’ > ‘|’ } %% brace_block: k_do do_body k_end do_body: opt_block_param bodystmt brace_block: '{' brace_body '}' brace_body: opt_block_param compstmt
keyword_if if the lex state is EXPR_BEG • modifier_if if the lex state is not EXPR_BEG “if” keyword_if modi fi er_if if cond then … … if cond EXPR_BEG ! EXPR_BEG
accepts both keyword_if and modifier_if • If the state accepts keyword_if, it doesn’t accept modifier_if • If the state accepts modifier_if, it doesn’t accept keyword_if • Always current state knows how to handle “if”
Then “if … end” is accepted • After the number, state is EXPR_END. Then modifier if is accepted • It’s clear which type of if can be written 1 + 2 BEG END BEG END 1 + if true; 1 else 2 end 1 + 2 if true EXPR_BEG ! EXPR_BEG
is powerful enough to distinguish which tokens are accepted • #2: A lot of token types can be determined on parser side • If so, sub-language model is not the best mental model in Ruby
sentence with operator • It’s not clear just after control syntax • If the relation between modifier_if and keyword_if are specified, parser inform conflicts to us • How conflicts are resolved in the language is important insight when new syntax is added
the surrounding sentences • lex_state is complicated • Need to mitigate the complexities for further syntax extensions • Tight communication between scanner and parser will reduce the complexities • Explicitly declaration of conflict resolution recodes what the language designer decided • Able to refer to the past decisions when similar pattern appears
optional • Argument is optional • Parentheses around arguments are optional • Block is optional • (The symbol of pattern matching, `in` or `=>) • Need to discuss grammar rules as group • E.g. “a == b”, “1 + 2” and “1..2” are in same “arg” group • If change “arg” rules, need to consider the impact on “expr” and “stmt” too
simple • Parser implementation is a combination of parts • Parser generator: combination of rules • Recursive Descent Parser: combination of functions, e.g. “parse_pattern_matching”, “parse_arguments”
good practice • Divide the difficulties • However it requires mechanism to integrate these parts • LR parser generator has the mechanism, conflict detection • Hand written parser doesn’t have such mechanism • Parser generator works as checker/linter for grammar • Can not keep soundness of grammar without the help from computer science
lexer level conflict • Because parser doesn’t know relationship between keyword_if and modifier_if • However it conflicts on some points from programmers viewpoint • The detection is helpful for syntax discussion return if … return (if …) (return) if … keyword_if modifier_if
are the most mysterious syntax part of Ruby • “space_seen” variable • ‘\n’ token and tIGNORED_NL token • How to include space and newline into parser context is open problem
• Can not keep soundness of grammar without the help from computer science • PSLR is key concept for checking soundness of lexer state sensitive grammar
and components • Compiler, Type System, LSP, Linter and Code Formatter • Therefore what’s the use case of Syntax Tree ? • How to satisfy the use cases ?
to analyze comments (LSP DocumentLink) • Need to walk through parent node from child node (LSP SelectionRange) • Want to rewrite codes (LSP & Code Formatter) • This is the most difficult use case, right now
action_b else action_c end end if condition_a action_a elsif condition_b if condition_b action_b else action_c end end if condition_a action_a elsif condition_b action_b action_b else action_c end end if condition_a action_a elsif condition_b action_b action_b else action_c end if condition_a action_a elsif condition_b action_b else action_c end 1. else to elsif 2. Delete “if condition_b” 3. Delete end 4.Delete duplicated “action_b”
string every time • In both cases, need to move/copy sub-strings after “else” if condition_a action_a else action_b end if condition_a action_a action_b end Delete else if condition_a action_a else action_b end if condition_a action_a elsif action_b end Replace with elsif
affects the rest of nodes if condition_a action_a else action_b end Parser::Source::Bu ff er if condition_a action_a action_b end Parser::Source::Bu ff er NODE_VCALL action_b Range (3, 2)-(3, 10) Delete else
Need to understand current status of each step if condition_a action_a else if condition_b action_b else action_c end end if condition_a action_a elsif condition_b if condition_b action_b else action_c end end if condition_a action_a elsif condition_b action_b action_b else action_c end end if condition_a action_a elsif condition_b action_b action_b else action_c end if condition_a action_a elsif condition_b action_b else action_c end 1. else to elsif 2. Delete “if condition_b” 3. Delete end 4.Delete duplicated “action_b”
syntax tree, rendering source code from Syntax Tree • However AST doesn’t have spaces, newline and so on… NODE_IF condition_a action_a NODE_ELSIF condition_b action_b action_c
information for new node • Calculation is still based on text oriented approach • Can not fully leverage the syntax tree transformation if condition_a action_a else if condition_b action_b else action_c end end if condition_a action_a elsif condition_b action_b else action_c end Start is same with else The last line is “action_c line - 1” The last column is “tail of action_c - 2”
restoration • CST preserves information which AST omits, e.g. spaces, newlines, parentheses • AST focuses on semantics, CST focuses on Syntax • Implementation • Introduce data structure for token • Keep information on token which lexer omitted • Node has child nodes and tokens
restoration • CST preserves information which AST omits, e.g. spaces, newlines, parentheses • AST focuses on semantics, CST focuses on Syntax • Implementation • Introduce data structure for token • Keep information on token which lexer omitted • Node has child nodes and tokens Syntax Tree having complete information of source code
Tree • Invented by C# (Roslyn) • Swift (SwiftSyntax) and rust-analyzer (LSP) uses this • Represent Syntax Tree with Red Node and Green Node • Let’s read swift-syntax • https://github.com/swiftlang/swift-syntax
chide elements • has width • Red Node • has reference to parent elements • has offset Token Green NODE Legend Red NODE NODE_IF width: 90 IF width: 3 NODE_IF width: 56 condition_a width: 11 action_a width: 11 NODE_ELSE width: 61 END width: 4 ELSE width: 5 NODE_IF o ff set: 0 NODE_ELSE o ff set: 25 NODE_IF o ff set: 30
& Code Formatter • Test oriented code rewriting is difficult • Syntax Tree rewriting • Generate codes from Syntax Tree • Concrete Syntax Tree !! • Editable Syntax Tree • Red Green Tree !!
• If it’s not correct, hopefully want to auto correct • Syntax Tree rewriting can create Syntax Tree which parser never generates + 1 2 3 * * + 1 2 3 parse Rewrite Dump
compare syntax tree with the syntax tree before dump • Grammar might know which syntax tree parser can generate Grammar fi le Parser generator Parser Syntax Tree Checker
is the most difficult use case, right now • Test oriented code rewriting is difficult • Syntax Tree rewriting • Generate codes from Syntax Tree • How to keep edited Syntax Tree correct • Open Problem
structure of the language • What parsers do • Parser distinguishes valid input and invalid input • Parser gives structure to the input correctly • LR-attributed grammar is the foundations theory of current Ruby parser
for programmers not for machines • Ruby syntax changes • Parser needs to • have theory and mechanism which mitigate implementation complexities • give the language designer feedbacks about syntax changes • PSLR is key concept for lexer state sensitive grammar
is the most difficult use case, right now • Test oriented code rewriting is difficult • Syntax Tree rewriting • Generate codes from Syntax Tree • How to keep edited Syntax Tree correct • Open Problem
syntax tree • By grammar, we can reveal what Ruby is • It’s very interesting to expose the secret of Ruby syntax from grammar • I want to reveal what is the key of Ruby’s programmer friendly syntax • Hypothesis: • Programmers • can understand expression beginning and ending • feel it’s natural that conditional branch follows flow control keywords, like “return” • can understand space sensitive grammar • sometimes fails to understand precedence of non-arithmetic operator