What is Parser

September 7, 2024 • Fukuoka RubyistKaigi 04 • Yuichiro Kaneko (@yui-knk)

About Me The world is now in the great age
of parsers. People are setting sail into the vast sea of parsers. - RubyKaigi 2023 LT- Yuichiro Kaneko https://twitter.com/kakutani/status/1657762294431105025/

Self Introduction • Yuichiro Kaneko • yui-knk (GitHub) / spikeolaf
(Twitter) • Treasure Data • Engineering Manager of Applications Backend

In OSS world • Yuichiro Kaneko • yui-knk (GitHub) / spikeolaf (Twitter)
• CRuby committer, mainly developing the parser generator and parser • Lrama, an LALR(1) parser generator (2023, Ruby 3.3) • The Bison Slayer • Ripper Rearchitecture (2024, Ruby 3.4) • Code positions to RNode (2018, Ruby 2.6) • RubyVM::AbstractSyntaxTree (2018, Ruby 2.6)

What is Parser Parser generator consists of three parts, Frontend,
Backend and Code Generator. Each component is independent from others so that we need to touch only necessary components when new feature is enhanced. - BuriKaigi 2024 in Toyama -

What parser does • Parser gives the structure to input
string • Really ? Class Method Method Assignment @name Call name capitalize

What parser does • Parser gives the structure to input
string bytes 636c61737320477265657465720a2020646566 20696e697469616c697a65286e616d65290a20 202020406e616d65203d206e616d652e636170 6974616c697a650a2020656e640a0a2020646 5662073616c7574650a2020202070757473202 248656c6c6f20237b406e616d657d21220a202 0656e640a656e640a

What lexer does • Cut bytes into chunks (tokens) 636c61737320477265657465720a2020646566
20696e697469616c697a65286e616d65290a20 202020406e616d65203d206e616d652e636170 6974616c697a650a2020656e640a0a2020646 5662073616c7574650a2020202070757473202 248656c6c6f20237b406e616d657d21220a202 0656e640a656e640a class Greeter def

Parser & Lexer • Lexer generates tokens from bytes •
Parser gives structure to tokens Class Method Method Assignment @name Call name capitalize class Greeter def Lexer Parser …
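
As a rough illustration of the two stages, CRuby's bundled Ripper library can show both a token stream and a tree for the Greeter source that the hex dump encodes. This is only a sketch: Ripper is used here as a convenient viewer, its S-expression is not the node structure the parser builds internally, and the variable names are chosen for this example.

```ruby
require "ripper"

# The first bytes of the hex dump decode back to the start of the source text.
prefix = ["636c61737320"].pack("H*")

src = <<~RUBY
  class Greeter
    def initialize(name)
      @name = name.capitalize
    end
  end
RUBY

# Lexer: bytes -> tokens. Each Ripper.lex entry is [[line, column], type, text, state].
tokens = Ripper.lex(src).map { |entry| [entry[1], entry[2]] }

# Parser: tokens -> structure, shown as a nested S-expression.
tree = Ripper.sexp(src)

p prefix           # decoded bytes
p tokens.first(3)  # the first three tokens
p tree[1][0][0]    # the kind of the first top-level node
```

The lexer output is flat, while the parser output nests the method definition inside the class definition.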

Very easy.

Very easy, right ?

I will guide you to the frontline

Theory of formal language Defeat calamities by more powerful theory,
abstraction and Refactoring - RubyKaigi 2024 in Okinawa -

Language • A (formal) language is a subset of words
• Some words belong to Ruby language • Others don’t

Ruby • Even though this code looks transcendental and convoluted,
it belongs to the Ruby language. https://github.com/tric/trick2022/blob/master/01-tompng/entry.rb https://github.com/tric/trick2022/blob/master/06-mame/entry.rb

Not Ruby • At a glance, this code looks like Ruby
code; however, it doesn’t belong to the Ruby language.

Grammar • The (Ruby) language is an infinite set of words
• A grammar is a finite set of rules that defines the language [Figure: Grammar and Language]

Grammar • Grammar provides structure to the language + 1
2 3 * * + 1 2 3 Correct Wrong

Grammar class and automaton • Chomsky hierarchy • Four formal
grammar classes form a hierarchy • Each grammar class corresponds to a class of automaton • Regular: finite-state automaton • Context-free: non-deterministic pushdown automaton • Context-sensitive: linear-bounded non-deterministic Turing machine • Recursively enumerable: Turing machine

Use appropriate grammar class • With great power comes great
difficulty • For a context-sensitive grammar, production rules are harder to read and design than for a context-free grammar
Production rules for {a^n b^n c^n : n ≥ 1} (multiple terminals and nonterminals appear on the left-hand side):
S → abc | aSBc
cB → Bc
bB → bb
Generate “aabbcc”: S → aSBc → aabcBc → aabBcc → aabbcc
Generate “aaabbbccc”: S → aSBc → aaSBcBc → aaabcBcBc → aaabcBBcc → aaabBcBcc → aaabBBccc → aaabbBccc → aaabbbccc
https://ja.wikipedia.org/wiki/%E6%96%87%E8%84%88%E4%BE%9D%E5%AD%98%E6%96%87%E6%B3%95
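
The derivations can be replayed mechanically as string rewriting. The following is a minimal sketch, assuming a fixed strategy (expand S first, then always rewrite the leftmost match, trying cB before bB); the names REORDER_RULES and derive are made up for this example, and for n = 3 the intermediate steps may come out in a different order than on the slide while reaching the same word.

```ruby
# Rewriting rules whose left-hand sides contain more than one symbol.
REORDER_RULES = [["cB", "Bc"], ["bB", "bb"]].freeze

# Returns every sentential form from "S" to the final word a^n b^n c^n.
def derive(n)
  steps = ["S"]
  # Apply S -> aSBc (n - 1) times, then S -> abc once.
  (n - 1).times { steps << steps.last.sub("S", "aSBc") }
  steps << steps.last.sub("S", "abc")
  # Move every B to the left of the c's and turn it into b.
  loop do
    lhs, rhs = REORDER_RULES.find { |l, _r| steps.last.include?(l) }
    break unless lhs
    steps << steps.last.sub(lhs, rhs)
  end
  steps
end

puts derive(2).join(" -> ")
puts derive(3).last
```

Note that each rewriting step has to look at neighbouring symbols, which is exactly what makes the rules harder to read than context-free ones.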

Context-free grammar (CFG) • Context-free grammar is readable • Then
you can read it and try it CFG Single nonterminal appears

if + class • The code raises NoMethodError; however, it’s
syntactically valid
$ ruby -c test.rb
Syntax OK
$ ruby test.rb
test.rb:3:in '<main>': undefined method '+' for nil (NoMethodError)
end + class C
^

Context-free grammar (CFG) • Context-free grammar is widely used in
programming languages • To be accurate, deterministic context-free language (DCFL) • DCFLs are a subset of the context-free languages • An LR parser analyzes a DCFL in linear time

In Chomsky hierarchy • Regular: finite-state automaton • DCFL: deterministic pushdown automaton
• Context-free: non-deterministic pushdown automaton • Context-sensitive: linear-bounded non-deterministic Turing machine • Recursively enumerable: Turing machine

Why LR parser? • LR parser • Can handle a large
range of languages • Major parser algorithm • To be precise, LR-attributed grammar • I believe a grammar that is easy for humans is close to an LR grammar • LL parser • Has less power than LR parser • PEG • It’s difficult to create an error-tolerant parser • A rule failure doesn’t imply a parsing failure like in context-free grammars

How to create parser? • Use parser generator • Lrama
(CRuby) • Bison (Perl, PHP, PostgreSQL) • ANTLR (Hive, Trino) • Hand written parser • Go, Rust, C# • Prism

Why LR parser generator is the best? • LR parser
generator gives accurate feedback for grammar • BNF is very declarative • No gap between grammar and parser implementation • LR parser is based on theory of computer science

RubyKaigi 2024 • Check slides and video for more detail
• https://rubykaigi.org/2024/presentations/spikeolaf.html

Actually context-free grammar? • Whether Ruby’s grammar is a CFG
or not is sometimes discussed • This is a trick used in TRICK 2022 • This is NOT CFG because the existence of a variable affects how the following code is parsed https://www.slideshare.net/mametter/trick-2022-results

However • The current LR parser can parse such code •
Ruby committers have hacked the parser but have NOT hacked the LR parser algorithm • There must be some tricks somewhere

LR-attributed grammar (LR ଐੑจ๏) • The key concept is LR-attributed
grammar • LR parser can handle LR-attributed grammar

Attribute Grammar (属性文法) • Attribute grammars were invented by Donald
Knuth and Peter Wegner • Original paper is Knuth, Donald E. (1968) "Semantics of context-free languages" • “An attribute grammar is a formal way to supplement a formal grammar with semantic information processing.” • https://en.wikipedia.org/wiki/Attribute_grammar

Static semantic analysis • Use cases • Check variable declarations
and usages • Type checking • Check control flow
function f1() { var i = 1; i + j; } (Error: undeclared variable “j” is used)
function f1() { var i = 1; var j = 2; i + j; } (valid)

Check variable declarations and usages • This language has a
semantic: “variable should be declared before used” • Represent the semantic formally in a grammar

decl: 'var' ident '=' integer ';'
  {{ decl.var_list[ident.value] = integer.value }}   (Add an identifier to a list)
expr: ident_1 '+' ident_2 ';'
  {{ vars = expr.var_list
     raise "#{ident_1.value} is not declared" unless vars[ident_1.value]   (Check identifiers are declared)
     raise "#{ident_2.value} is not declared" unless vars[ident_2.value]
     expr.value = vars[ident_1.value] + vars[ident_2.value] }}   (Use identifier’s value)
decls: decls decl
  {{ decls.var_list = decls.var_list.merge(decl.var_list) }}   (Merge identifier lists to one)
decls: decl
  {{ decls.var_list = decl.var_list }}   (Copy identifier list)
func_body: decls expr
  {{ expr.var_list = decls.var_list }}   (Pass identifier list to expr so that we can access identifier list in expr)
* Only important production rules and semantic rules

Syntax Tree • Create syntax tree from input string decls
decls func_body expr + decl j = 2 decl i = 1 function f1() { var i = 1; var j = 2; i + j; } ident i ident j

Analyze dependency of the variable list • In “expr”, “ident_1”
and “ident_2” need the variable list of “expr” expr: ident_1 '+' ident_2 ';' {{ vars = expr.var_list … expr.value = vars[ident_1.value] + vars[ident_2.value] }} [Figure: syntax tree of func_body with decls, decl (i = 1), decl (j = 2), expr (+), ident i, ident j]

Analyze dependency of the variable list • In “func_body”, “expr”
needs the variable list of “decls” func_body: decls expr {{ expr.var_list = decls.var_list }} [Figure: the same syntax tree]

Analyze dependency of the variable list • In “decls”, the parent “decls”
needs the variable lists of its child “decls” and “decl” decls: decls decl {{ decls.var_list = decls.var_list.merge( decl.var_list ) }} [Figure: the same syntax tree]

Create attribute evaluator • Reverse the dependency direction to get the attribute
evaluator [Figure: syntax tree for “function f1() { var i = 1; var j = 2; i + j; }”]

How attribute evaluator works • Visit “i = 1” then
update the list decls decls func_body expr + decl j = 2 decl i = 1 ident i ident j function f1() { var i = 1; var j = 2; i + j; } (1) list = {i: 1}

How attribute evaluator works • Visit “j = 2” then
update the list decls decls func_body expr + decl j = 2 decl i = 1 ident i ident j function f1() { var i = 1; var j = 2; i + j; } (1) list = {i: 1} (2) list = {i: 1, j: 2}

How attribute evaluator works • Visit “i + j” with
the variable list decls decls func_body expr + decl j = 2 decl i = 1 ident i ident j function f1() { var i = 1; var j = 2; i + j; } (1) list = {i: 1} (2) list = {i: 1, j: 2} (3) list = {i: 1, j: 2}

How attribute evaluator works • Resolve “i” and “j” with
the variable list decls decls func_body expr + decl j = 2 decl i = 1 ident i ident j function f1() { var i = 1; var j = 2; i + j; } (1) list = {i: 1} (2) list = {i: 1, j: 2} (3) list = {i: 1, j: 2} (4) list = {i: 1, j: 2} (5) list = {i: 1, j: 2}

Semantically invalid code • Failed to resolve “j” because it’s
not declared decls func_body expr + decl i = 1 ident i ident j function f1() { var i = 1; // var j = 2; i + j; } (1) list = {i: 1} (2) list = {i: 1} (3) list = {i: 1} (4) Error!!!
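
The evaluation order walked through above can be sketched in a few lines of Ruby. This is a hand-written illustration under assumptions, not a generated attribute evaluator: Decl, Expr, UndeclaredVariable and the method names are invented for the example, and the syntax tree is built by hand instead of being parsed.

```ruby
Decl = Struct.new(:name, :value)   # var name = value;
Expr = Struct.new(:left, :right)   # left + right;

class UndeclaredVariable < StandardError; end

# Synthesized attribute: var_list is built bottom-up by merging each decl.
def synthesize_var_list(decls)
  decls.reduce({}) { |list, decl| list.merge(decl.name => decl.value) }
end

# Inherited attribute: var_list is handed down from func_body to expr.
def evaluate_expr(expr, var_list)
  [expr.left, expr.right].each do |name|
    raise UndeclaredVariable, "#{name} is not declared" unless var_list.key?(name)
  end
  var_list[expr.left] + var_list[expr.right]
end

# func_body: decls expr  {{ expr.var_list = decls.var_list }}
def evaluate_func_body(decls, expr)
  evaluate_expr(expr, synthesize_var_list(decls))
end

valid = evaluate_func_body([Decl.new("i", 1), Decl.new("j", 2)], Expr.new("i", "j"))
puts valid

begin
  evaluate_func_body([Decl.new("i", 1)], Expr.new("i", "j"))
rescue UndeclaredVariable => e
  puts "Error: #{e.message}"
end
```

The sketch checks `key?` instead of truthiness so that a declared variable holding a falsy value would still count as declared; that is a choice of this example.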

Automatic generation • Automatic generation of attribute evaluators from semantic rules has been studied
[Figure: a grammar file with attributes is fed to a parser generator and an attribute evaluator generator, which produce the parser and the attribute evaluator of the program]

Inherited & Synthesized • Attributes are divided into two groups
• Inherited Attribute (継承属性): Attribute calculated based on a parent and siblings • Synthesized Attribute (合成属性): Attribute calculated based on children
decl: 'var' ident '=' integer ';' {{ decl.var_list[ident.value] = integer.value }}   (var_list is a synthesized attribute)
decls: decls decl {{ decls.var_list = decls.var_list.merge(decl.var_list) }}   (var_list is a synthesized attribute)
expr: ident_1 '+' ident_2 ';' {{ vars = expr.var_list … expr.value = vars[ident_1.value] + vars[ident_2.value] }}   (var_list is an inherited attribute)

Inherited & Synthesized • In decls, var list is a Synthesized
Attribute • In expr, var list is an Inherited Attribute • An Inherited Attribute allows values to be passed from parent to children [Figure: syntax tree; synthesized in decls: list = {i: 1}, then list = {i: 1, j: 2}; inherited by expr, ident i and ident j: list = {i: 1, j: 2}]

Attribute grammar can be complex • Dependency is a graph
not tree • It may be circular • It may require exponential time for calculation • Subset of attribute grammar • L-attributed grammar • LR-attributed grammar • S-attributed grammar

How LR parser works • Mental model of LR parser
is that several automatons are managed on a stack • Generate an automaton from each rule
program : class_def
class_def : "class" id body "end"
body : method_def
method_def : "def" id "end"
[Figure: one automaton per rule: program (P1, P2), class_def (C1 to C5), body (B1, B2), method_def (M1 to M4)]

How LR parser works • At the beginning, one automaton
exists on the stack [Figure: input “class A def m end end”; stack: program (P1, P2)]

How LR parser works • The parser reads “class”, then a new
automaton is pushed onto the stack [Figure: stack: program, class_def (C1 to C5)]

How LR parser works • The parser reads “A”, then the current
automaton state is updated [Figure: stack: program, class_def]

How LR parser works • The parser reads “def”, then new
automatons are pushed onto the stack [Figure: stack: program, class_def, body (B1, B2), method_def (M1 to M4)]

How LR parser works • The parser reads “m” and “end”,
then the current automaton reaches the accepting state [Figure: stack: program, class_def, body, method_def]

How LR parser works • Pop the current automaton, then
move the next automaton’s state to “B2” • The next automaton also reaches the accepting state [Figure: stack: program, class_def, body]

How LR parser works • Pop the current automaton, then move the next automaton’s state to “C4” [Figure: stack: program, class_def]

How LR parser works • The parser reads “end”, then the current
automaton reaches the accepting state [Figure: stack: program, class_def]

How LR parser works • Pop the current automaton, then move the next automaton’s state to “P2” • It reaches the accepting state and no input is left, so the program is accepted [Figure: stack: program]
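
The walkthrough can be imitated with a tiny Ruby simulation. It is a sketch of the mental model only, assuming that each automaton is just a position inside its rule; it is not how an LR parser generator builds its tables, and GRAMMAR, tokenize and accept? are names invented for this example.

```ruby
# One "automaton" per rule: the sequence of symbols the rule must see.
GRAMMAR = {
  program:    [:class_def],
  class_def:  ["class", :id, :body, "end"],
  body:       [:method_def],
  method_def: ["def", :id, "end"],
}.freeze
KEYWORDS = %w[class def end].freeze

def tokenize(src)
  src.split.map { |word| KEYWORDS.include?(word) ? word : :id }
end

# The symbol that the automaton on top of the stack expects next.
def expected_symbol(stack)
  rule, position = stack.last
  GRAMMAR[rule][position]
end

def accept?(src)
  stack = [[:program, 0]]                       # (P1)
  tokenize(src).each do |token|
    # A nonterminal is expected: push the automaton of that rule.
    stack.push([expected_symbol(stack), 0]) while GRAMMAR.key?(expected_symbol(stack))
    return false unless expected_symbol(stack) == token
    stack.last[1] += 1                          # advance the current automaton
    # Pop automatons in the accepting state and advance the one below.
    while stack.size > 1 && expected_symbol(stack).nil?
      stack.pop
      stack.last[1] += 1
    end
  end
  stack.size == 1 && expected_symbol(stack).nil?   # (P2) and no input left
end

puts accept?("class A def m end end")
puts accept?("class A def m end")
```

Because this toy grammar never offers two rules for the same token, the simulation can always decide which automaton to push; the second example below is exactly the case where that stops being true.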

How LR parser works (2) • Program has one method
definition (defn) or one singleton method definition (defs)
program: defn | defs
defn: "def" id "end"
defs: "def" "self" "." id "end"
[Figure: automatons: program (P1, P2), defn (M1 to M4), defs (S1 to S6)]

How LR parser works (2) • At the beginning, one
automaton exists on the stack P1 P2 defn / defs def m end

How LR parser works (2) • The parser reads “def”, then
… • Which automaton should the parser push onto the stack? [Figure: input “def m end”; Option 1: defn (M1 to M4); Option 2: defs (S1 to S6)]

How LR parser works (2) • Merge these two automatons
to one automaton M1 M2 M3 M4 S1 S2 S3 S5 def end id S4 def S6 self . id end D1 D2 D5 D7 D6 D8 def D3 D4 self . id end id end

How LR parser works (2) • The parser reads “def”, then
pushes the new merged automaton onto the stack [Figure: input “def m end”; stack: program (P1, P2), merged automaton (D1 to D8)]

How LR parser works (2) • LR Parser can postpone
the decision of automaton def m end D1 D2 D5 D7 D6 D8 def D3 D4 self . id end id end D1 D2 D5 D7 D6 D8 def D3 D4 self . id end id end def self.m end
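
A merged automaton can be written down as a plain transition table. The sketch below assumes one particular wiring of the states D1 to D8 (the figure's exact transitions are not recoverable from the text, so the table is a guess that is consistent with the two rules); MERGED, ACCEPTING and recognize are names invented here.

```ruby
# Shared prefix "def", then the automaton branches on the next token.
MERGED = {
  D1: { "def" => :D2 },
  D2: { id: :D3, "self" => :D5 },   # the decision is postponed until here
  D3: { "end" => :D4 },
  D5: { "." => :D6 },
  D6: { id: :D7 },
  D7: { "end" => :D8 },
}.freeze
ACCEPTING = { D4: :defn, D8: :defs }.freeze

def tokenize(src)
  src.scan(/\w+|\./).map { |word| ["def", "self", "end", "."].include?(word) ? word : :id }
end

# Returns :defn, :defs, or nil when the input is not accepted.
def recognize(src)
  state = :D1
  tokenize(src).each do |token|
    state = MERGED.fetch(state, {})[token]
    return nil unless state
  end
  ACCEPTING[state]
end

p recognize("def m end")
p recognize("def self.m end")
```

After reading “def” both inputs are in the same state; only the following token tells the two rules apart.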

LR-attributed grammar • LR-attributed grammar is an attribute grammar which
an LR parser can evaluate while it parses code • Condition #1: All attribute dependencies run left-to-right • Condition #2: All inherited attributes in the same state have unique values

#1: left-to-right direction • “in_class” & “in_def” inherited attributes can
be handled by LR parser • Class can not be defined in def scope • Variable list also can be handled class A def m end end P1 P2 class_def M1 M2 M3 M4 B1 B2 C1 C2 C3 C5 method_def def end id C4 class id body end in_class = true in_def = true

#2: Unique values for the same state • “in_def” inherited
attribute can be decided just after “def” D1 D2 D5 D7 D6 D8 def D3 D4 self . id end id end def m end def self.m end in_def = true

#2: Unique values for the same state • “in_singleton_def” inherited
attribute can’t be decided just after “def” • The attribute doesn’t exist in Ruby • The attribute can be decided just after “self” D1 D2 D5 D7 D6 D8 def D3 D4 self . id end id end def m end def self.m end in_singleton_def = false in_singleton_def = true in_singleton_def = false in_singleton_def = true

LR-attributed grammar • LR-attributed grammar enables LR parsing to handle
inherited attributes • Inherited attributes carry contexts from top to bottom • In short, an LR parser can manage contexts with some limitations • The direction of context flow is left-to-right • I believe this is reasonable because we read code from top-to-bottom and left-to-right • If multiple production rules are expected, the value of contexts should be unique • I believe this is reasonable to reduce the cognitive cost for humans
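
A left-to-right context such as “in_def” can be sketched as a stack that is updated while tokens are read. This is an illustration under assumptions, not the code in parse.y: the token list is pre-split, and ContextError and check_contexts are names made up for the example.

```ruby
class ContextError < StandardError; end

# Walks the tokens left to right and keeps the enclosing scopes on a stack.
# The stack plays the role of the inherited attributes in_class / in_def.
def check_contexts(tokens)
  contexts = []
  tokens.each do |token|
    case token
    when "class"
      # The context is already known when "class" is read, so the check is immediate.
      raise ContextError, "class definition in method body" if contexts.include?(:def)
      contexts.push(:class)
    when "def"
      contexts.push(:def)
    when "end"
      contexts.pop
    end
  end
  contexts.empty?
end

puts check_contexts(%w[class A def m end end])
begin
  check_contexts(%w[def m class A end end])
rescue ContextError => e
  puts "Error: #{e.message}"
end
```

Nothing to the right of the current token is needed to decide the context, which is the left-to-right condition in miniature.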

Leverage Theory • It’s not currently popular to generate attribute
evaluators from semantic rules • However, attribute grammar theory tells us what parsers can and cannot do

Summary • What parsers do • Parser distinguishes valid input
from invalid input • Parser gives structure to the input correctly • Grammar defines the boundary of the language and the structure of the language • Use appropriate grammar class • With great power comes great difficulty • LR-attributed grammar is the theoretical foundation of the current Ruby parser

Programming Language Ruby Users can focus on writing grammar -
RubyKaigi 2023 in Matsumoto -

Use case oriented approach • The parser’s use case is
Ruby • Understand Ruby syntax characteristics to understand what are important aspects of parser

Simple Syntax !! https://github.com/ruby/ruby

https://www.ruby-lang.org/en/about/

Ruby Syntax • Ruby syntax is designed for programmers, not
for machines • What are the key properties of a good programming language for programmers? • However, sometimes it’s difficult for programmers to understand which sentences are connected

Grammar rule conflict • It’s not unusual to
design a grammar whose rules conflict • Example: “Dangling else” • https://en.wikipedia.org/wiki/Dangling_else
// Rules
if a then s
if b then s1 else s2
// Code
if a then if b then s else s2
// #1
(if a then (if b then s else s2))
// #2
(if a then (if b then s) else s2)

Grammar with conflict • For example, infix operators are a
cause of conflict • In many languages, + has lower precedence than *, following the arithmetic operators we know [Figure: two syntax trees over 1, 2, 3 with + and *: #1 has + at the root, #2 has * at the root]
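
How a precedence table picks one of the two trees can be sketched with a small precedence-climbing parser. This technique is swapped in for illustration (an LR parser generator resolves the same conflict through precedence declarations instead), and parse_expr and the tables are names assumed for the example.

```ruby
# Parses a flat token list into a nested [operator, left, right] tree.
# Operators in `precedence` with a higher number bind tighter.
def parse_expr(tokens, precedence, min_precedence = 0)
  left = tokens.shift
  while (operator = tokens.first) && precedence.key?(operator) &&
        precedence[operator] >= min_precedence
    tokens.shift
    # The right operand may only absorb operators that bind tighter.
    right = parse_expr(tokens, precedence, precedence[operator] + 1)
    left = [operator, left, right]
  end
  left
end

usual    = { "+" => 1, "*" => 2 }   # + has lower precedence than *
inverted = { "+" => 2, "*" => 1 }   # the opposite decision

p parse_expr("1 + 2 * 3".split, usual)
p parse_expr("1 + 2 * 3".split, inverted)
```

The same token sequence yields tree #1 or tree #2 depending only on the table, so choosing the table is a language design decision rather than a parser limitation.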

Grammar without conflict • For example, Polish Notation has no
conflict
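
A short sketch shows why prefix notation needs no precedence: each operator is followed by exactly two operands, so there is only one way to read the input. The evaluator below assumes binary + and * over integers, and eval_prefix is a name invented for the example.

```ruby
# Evaluates Polish (prefix) notation by consuming tokens from the front.
def eval_prefix(tokens)
  token = tokens.shift
  case token
  when "+" then eval_prefix(tokens) + eval_prefix(tokens)
  when "*" then eval_prefix(tokens) * eval_prefix(tokens)
  else Integer(token)
  end
end

puts eval_prefix("+ 1 * 2 3".split)   # 1 + (2 * 3)
puts eval_prefix("* + 1 2 3".split)   # (1 + 2) * 3
```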

Polish Notation + () • Polish Notation + () seems
to be a good idea • But Ruby didn’t choose this direction

Conflict is a design matter • https://bugs.ruby-lang.org/issues/19392 • Endless
method definition with “or”

Conflict is a design matter • We can change
the precedence locally • So it’s not a limitation of the parser but a matter of grammar design • In the discussion, the consistency of precedence between “=” and “and” is kept

Change precedence in some scopes • I implemented “change precedence
declaration” as PoC • Within { … }, + has higher precedence than * https://github.com/ruby/lrama/pull/254

Flash point of conflict • If the rule’s
start and end are clear, the chance of conflict decreases • Informally: I consider whether the left context is powerful enough to minimize the rule candidates • In the change-precedence case • The appearance of “{” on the left is powerful enough to distinguish normal expressions from inverse-precedence expressions • The appearance of “}” determines the end of inverse-precedence expressions

Case #1: Method definition • Start is clear
because a method definition always starts with “def” • End is clear because a method definition always ends with “end” until Ruby 2.7.0 • Endless method definition was introduced in Ruby 3.0.0

Case #2: Modifier • As explained, infix operators
are a cause of conflict • Design the precedence based on human cognitive ability • E.g. ‘+’ < ‘*’ • Modifiers have characteristics similar to infix operators

Case #3: parentheses • Parentheses are great • Start is
clear because the rule starts with “(” • End is clear because the rule ends with “)” • Why do you omit parentheses ???

Polish Notation + () • What do you think about
Polish Notation + () ? +

Ruby Syntax complexities • “The Big Five parse.y calamities” in
RubyKaigi 2024 • Today’s topic is “Lex State” https://speakerdeck.com/yui_knk/the-grand-strategy-of-ruby-parser?slide=58

What’s Lex State ? • The state of the lexer •
In textbooks, lexer and parser are completely separated components • However, they are tightly coupled in Ruby • Sometimes it’s called “Monstrous lex_state”

Why lex_state is needed • In general, a lexer checks the input
text in longest-match order; otherwise the longer token would never match • E.g. Check “||” then check “|”

Why lex_state is needed • However, in some cases the shorter
token should be returned • “||” in a block parameter position is two “|” tokens

EXPR_BEG or not • If the lex state is EXPR_BEG then
“|” is returned, otherwise “||” is returned • A lot of conditional branches based on lex state • Too complicated [Figure: check lex state, then return “|” or “||”]
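
The branch on the lex state can be sketched as follows. This is a deliberately reduced model, assuming the state is a single symbol instead of the bit flags used in parse.y; OPERATORS and lex_pipe are hypothetical names.

```ruby
OPERATORS = ["||", "|"].freeze   # longest candidate first

# Returns the operator token starting at `position`, or nil if there is none.
def lex_pipe(src, position, lex_state)
  # State-dependent exception: at the beginning of an expression,
  # "||" must be returned as two separate "|" tokens.
  return "|" if lex_state == :EXPR_BEG && src[position] == "|"
  # Default: longest match.
  OPERATORS.find { |operator| src[position, operator.size] == operator }
end

p lex_pipe("m do || end", 5, :EXPR_BEG)   # empty block parameters
p lex_pipe("a || b", 2, :EXPR_END)        # logical or
```

Every state-sensitive token needs a branch like the first line of the method, which is where the conditional branches pile up.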

Monstrous lex_state • Ruby’s lexer has 13 state bits!

Why it’s terrible • “All bugfixes are incompatibilities” • 36:00
~ https://rubykaigi.org/2019/presentations/nagachika.html

Fixing a bug caused other bugs • Fixing [Bug #10653]
caused [Bug #11456] and [Bug #11849]

‘:’ is difficult • All of them include ‘:’ … ?
true ? 1.tap do |n| p n end : 0
{foo: ("" rescue "")}
{ label:<<-DOC
Some text for a heredoc goes here
DOC
}

Fix [Bug #10653] • By the way, I guess not
r51617 but r51616 fixed the issue, right? • The error was unexpected “keyword_do_cond” and COND_PUSH(1) is called after ‘?’ • There is a space between “end” and ‘:’

Fix [Bug #10653] • Anyway, r51617 changed the logic from
managing where label is disallowed (EXPR_VALUE) to where label is allowed (EXPR_LABEL) {foo: ("" rescue "")} { label:<<-DOC Some text for a heredoc goes here DOC } label label

[Bug #11456] • Lex state is “EXPR_ARG|EXPR_LABELED” after a label is
tokenized • Then it’s NOT IS_BEG() • tLPAREN_ARG is returned • Only expr is allowed after tLPAREN_ARG !!! • expr doesn’t allow modifier rescue
{foo: ("" rescue "")}
case '(':
  if (IS_BEG()) {
    c = tLPAREN;
  }
  else if (IS_SPCARG(-1)) {
    c = tLPAREN_ARG;
  }
  paren_nest++;
  COND_PUSH(0);
  CMDARG_PUSH(0);
  lex_state = EXPR_BEG|EXPR_LABEL;
  return c;
primary: tLPAREN_ARG expr rparen
Before: EXPR_LABELARG After: EXPR_ARG|EXPR_LABELED

Fix [Bug #11456] • r51624 fixed the bug by adding
“EXPR_ARG|EXPR_LABELED” to IS_BEG() https://github.com/ruby/ruby/commit/0958af2ad4e83400f35c296e9ed9cf021b1675b4

[Bug #11849] • Lex state is “EXPR_ARG|EXPR_LABELED” after a label is
tokenized • Then it’s IS_ARG()
{ label:<<-DOC Some text for a heredoc goes here DOC }
case '<':
  last_state = lex_state;
  c = nextc();
  if (c == '<' &&
      !IS_lex_state(EXPR_DOT | EXPR_CLASS) &&
      !IS_END() &&
      (!IS_ARG() || space_seen)) {
    int token = heredoc_identifier();
    if (token) return token;
  }
  ...

Fix [Bug #11849] • r53214 fixed the bug by adding
“EXPR_LABELED” check https://github.com/ruby/ruby/commit/9d5abbff9754589483938dc539226c2ad4895140

Ruby Syntax changes • 3.2.0 (2022-12-25): Anonymous rest and keyword
rest arguments can now be passed as arguments • 3.1.0 (2021-12-25): Anonymous block argument

Ruby Syntax changes • 3.0.0 (2020-12-25): Endless method definition •
2.7.0 (2019-12-25): Pattern matching, beginless range • 2.6.0 (2018-12-25): Endless range

What will happen by the change? • Proposal for existing
grammar • https://bugs.ruby-lang.org/issues/18080

Summary • Use Case: Ruby • Ruby syntax is designed
for programmers, not for machines • Ruby syntax changes • Parser needs to • have theory and mechanisms which mitigate implementation complexities • give the language designer feedback about syntax changes

Fight with implementation complexities We have not leveraged the potential
of LR parser - RubyKaigi 2023 in Matsumoto -

Monstrous lex_state • Ruby’s lexer has 13 state bits!

Parser & Lexer • Assume parser and lexer can be
separated Class Method Method Assignment @name Call name capitalize class Greeter def Lexer Parser …

Parser & Lexer • However, the lexer depends on the parser in
Ruby • The lexer generates different tokens depending on the parser state • Tokens with the same length but different identity • Tokens with different length • By the way, the parser knows which kinds of tokens it can accept in each parser state

PSLR(1) • It seems a good idea to integrate parser and
lexer, then manage the states on the parser side • Joel E. Denny. “PSLR(1): Pseudo-Scannerless Minimal LR(1) for the Deterministic Parsing of Composite Languages”, May 2010. • https://tigerprints.clemson.edu/cgi/viewcontent.cgi?article=1519&context=all_dissertations • PSLR stands for Pseudo-Scannerless Minimal LR

PSLR(1) • > Nevertheless, traditional scanner and parser generators attempt
to generate loosely coupled scanners and parsers, so the user must maintain these tightly coupled scanner and parser specifications separately but consistently. • > Scanner and parser specifications would be significantly more maintainable if all sub-language transitions were instead computed from a grammar by a parser generator and recognized automatically by the scanner using the parser’s stack.

Sub-languages and scopes • The example comes from “Figure 2.6:
Scoped Declarations” of the paper • C++0x • ‘>’ has higher precedence than ‘>>’ in Lt, and vice versa in Lc and Lp
Y<X<(6>>1)>> x4;
Lc : main C++0x language
Lt : template argument list sub-language
Lp : parenthesized expression sub-sub-language
%lex-prec ’>’ -< ’>>’ for Lc and Lp
%lex-prec ’>>’ -< ’>’ for Lt

Sub-languages and scopes • In Ruby case, ‘|’ has higher
precedence than ‘||’ in Lbp obj.m do || end Lrb Lbp Lrb : main ruby Lbp : block parameters %lex-prec ’|’ -< ’||’ for Lrb %lex-prec ’||’ -< ’|’ for Lbp

Scanner conflict • Identity conflict: Tokens with the same
length but different identity • E.g. do, do_cond, do_block, do_LAMBDA • Length conflict: Tokens with different length • E.g. ‘|’, ‘||’

How to specify Sub-languages scopes • Specify nonterminals as the
scope of a sub-language • See: “3.7 Scoped Declarations” [Figure: parse tree of “obj.m do |var| expr end”: method_call brace_block; brace_block: k_do do_body k_end; do_body: opt_block_param bodystmt; opt_block_param: ‘|’ block_param ‘|’]

How to specify Sub-languages scopes • Within “opt_block_param”, ‘|’
has higher precedence than ‘||’
%nterm opt_block_param { %lex-prec ‘||’ < ‘|’ }
%%
primary: method_call brace_block
brace_block: k_do do_body k_end
do_body: opt_block_param bodystmt
opt_block_param: block_param_def
block_param_def: '|' opt_bv_decl '|'
               | '|' block_param opt_bv_decl '|'

How it works • Collect the tokens that come before “opt_block_param” -> “do”
• Collect the tokens that are the last token of “opt_block_param” -> ‘|’ • The parser switches the lexer precedence to sub-language mode after “do” and restores it after the second ‘|’
primary: method_call brace_block
brace_block: k_do do_body k_end
do_body: opt_block_param bodystmt
opt_block_param: block_param_def
block_param_def: '|' opt_bv_decl '|'
               | '|' … '|'

How it works • Some states are marked • ‘||’
is split into two ‘|’ in marked states (in the items below, “•” marks the parser position)
primary: method_call • brace_block
brace_block: • k_do do_body k_end
brace_block: k_do • do_body k_end
do_body: • opt_block_param bodystmt
opt_block_param: • block_param_def
block_param_def: • '|' block_param '|'
block_param_def: '|' • block_param '|'
block_param_def: '|' block_param • '|'
block_param_def: '|' block_param '|' •
do_body: opt_block_param • bodystmt
...
primary: method_call brace_block •

Scope conflict • If contradictory lexer precedences are
defined, the parser state has a scope conflict • Split the state again so that each state doesn’t have contradictory lexer precedences • In this case, the states can be separated because one follows “{” and the other follows “do”
%nterm opt_block_param { %lex-prec ‘||’ < ‘|’ }
%nterm brace_body { %lex-prec ‘||’ > ‘|’ }
%%
brace_block: k_do do_body k_end
do_body: opt_block_param bodystmt
brace_block: '{' brace_body '}'
brace_body: opt_block_param compstmt

IELR • IELR can split such a state • IELR is
more powerful than LALR • PSLR is an extension of IELR • Both PSLR and IELR were invented by Joel E. Denny

ʮφϯτΧLRʯΛ੔ཧ͢Δ / Clarifying LR Algorithms https://speakerdeck.com/junk0612/clarifying-lr-algorithms?slide=5

• https://rubykaigi.org/2024/presentations/junk0612.html

Reconsider lex_state • Reconsider block parameter syntax • “||” is
not accepted after “do” • “||” is not accepted after “var” • No scanner conflict • It’s enough for the parser to pass the list of acceptable tokens to the lexer
obj.m do |var| expr end
// After “do”
do_body: … • opt_block_param bodystmt
opt_block_param: • none
opt_block_param: • block_param_def
block_param_def: • '|' opt_bv_decl '|'
block_param_def: • '|' block_param opt_bv_decl '|'
// After “var”
$@23: ε • ['=']
f_eq: • $@23 '='
f_opt_primary_value: f_arg_asgn • f_eq primary_value
f_arg_item: f_arg_asgn • ['|', '\n', ',', ';']
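
Passing the acceptable tokens to the lexer can be sketched like this. It is an illustration of the idea only: PIPE_TOKENS, next_token and the acceptable lists are assumptions of the example, and a real parser would derive those lists from its state table.

```ruby
PIPE_TOKENS = ["||", "|"].freeze   # longest candidate first

# Returns the longest candidate at `position` that the parser can accept
# in its current state, or nil if none fits.
def next_token(src, position, acceptable)
  PIPE_TOKENS.find do |token|
    src[position, token.size] == token && acceptable.include?(token)
  end
end

src = "obj.m do || end"
p next_token(src, 9, ["|"])            # after "do": only "|" is acceptable
p next_token("a || b", 2, ["||", "+"]) # in an expression: "||" is acceptable
```

The lexer no longer needs its own state here; it only filters candidates through what the parser state accepts.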

Reconsider modifier if • “if” will be •
keyword_if if the lex state is EXPR_BEG • modifier_if if the lex state is not EXPR_BEG [Figure: “if cond then …” is lexed as keyword_if (EXPR_BEG); “… if cond” is lexed as modifier_if (not EXPR_BEG)]

Checking states table • If I checked correctly, no state
accepts both keyword_if and modifier_if • If the state accepts keyword_if, it doesn’t accept modifier_if • If the state accepts modifier_if, it doesn’t accept keyword_if • Always current state knows how to handle “if”
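
If no state accepts both token kinds, the identity of “if” follows from the parser state alone. The sketch below assumes hypothetical acceptable-token lists for two states; classify_if is a name invented here, and the error branch stands for the conflict a parser generator could report.

```ruby
IF_KINDS = [:keyword_if, :modifier_if].freeze

# Picks the kind of "if" from the tokens the current parser state accepts.
def classify_if(acceptable)
  candidates = IF_KINDS & acceptable
  unless candidates.size == 1
    raise ArgumentError, "state accepts #{candidates.size} kinds of if"
  end
  candidates.first
end

after_operator = [:keyword_if, :tINTEGER, :tLPAREN]   # e.g. after "1 +"
after_number   = [:modifier_if, :"+", :"*"]           # e.g. after "1 + 2"

p classify_if(after_operator)
p classify_if(after_number)
```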

EXPR and if • After the operator, state is EXPR_BEG.
Then “if … end” is accepted • After the number, state is EXPR_END. Then modifier if is accepted • It’s clear which type of if can be written 1 + 2 BEG END BEG END 1 + if true; 1 else 2 end 1 + 2 if true EXPR_BEG ! EXPR_BEG

Hypothesis • #1: In Ruby, the end of nonterminal symbol
is powerful enough to distinguish which tokens are accepted • #2: A lot of token types can be determined on parser side • If so, sub-language model is not the best mental model in Ruby

I forgot command-like control syntax • Tweak parse.y to
replace modifier_if with keyword_if • These grammar rules conflict: “return if …” can be read as “return (if …)” with keyword_if or as “(return) if …” with modifier_if

Insight • modifier_if or keyword_if • It’s clear in a
sentence with an operator • It’s not clear just after control syntax • If the relation between modifier_if and keyword_if is specified, the parser can report the conflicts to us • How conflicts are resolved in the language is an important insight when new syntax is added

Summary • In Ruby, how to extract tokens depends on
the surrounding sentences • lex_state is complicated • Need to mitigate the complexities for further syntax extensions • Tight communication between scanner and parser will reduce the complexities • Explicit declaration of conflict resolution records what the language designer decided • Able to refer to past decisions when a similar pattern appears

Give feedback to the language designer It’s fun to hack
parser generator - RubyKaigi 2024 LT in Okinawa -

What will happen by the change? • Proposal for existing
grammar • https://bugs.ruby-lang.org/issues/18080

It’s possible to implement • > but nobu said it's
hard to support because of parse.y limitation. • No, it’s possible!! • https://github.com/yui-knk/ruby/tree/bugs_18080

Need to consider these patterns • There is an argument
or not • The arguments are surrounded by parentheses or not • There is a block or not • The symbol of pattern matching, `in` or `=>`

Need to consider these patterns • There is one combination
which is suspicious

Conflicts with existing grammar • There is no
block • The arguments are not surrounded by parentheses • The symbol of pattern matching is `=>`

LR parser generator knows this issue • S/R or R/R
conflict detection is a friend of programming language designers

Why this issue is difficult to detect?
• Need to check all combinations of grammar rules • Discussion of grammar and implementation of parser are localized

Combination of grammar rules • A lot of rules are
optional • Argument is optional • Parentheses around arguments are optional • Block is optional • (The symbol of pattern matching, `in` or `=>`) • Need to discuss grammar rules as a group • E.g. “a == b”, “1 + 2” and “1..2” are in the same “arg” group • If “arg” rules change, need to consider the impact on “expr” and “stmt” too

Localized discussion & implementation • Examples in a ticket are
simple • Parser implementation is a combination of parts • Parser generator: combination of rules • Recursive Descent Parser: combination of functions, e.g. “parse_pattern_matching”, “parse_arguments”

Localized discussion & implementation • Localized discussion and implementation are
good practice • Divide the difficulties • However it requires a mechanism to integrate these parts • LR parser generator has the mechanism, conflict detection • Handwritten parser doesn’t have such a mechanism • Parser generator works as checker/linter for grammar • Cannot keep soundness of grammar without the help from computer science

• https://rubykaigi.org/2024/presentations/spikeolaf.html

Lexer level conflict • Current parser doesn’t warn
lexer level conflict • Because parser doesn’t know the relationship between keyword_if and modifier_if • However it conflicts on some points from the programmer’s viewpoint • The detection is helpful for syntax discussion return if … return (if …) (return) if … keyword_if modifier_if
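The two readings of `if` after `return` can be told apart in the parser events that Ripper reports. A small sketch under the assumption that Ripper's event names (`:if_mod` for the modifier form, `:if` for the keyword form) stand in for modifier_if and keyword_if; `parser_events` is a hypothetical helper.

```ruby
require "ripper"

# Collects the event names Ripper emits while parsing the code.
# `parser_events` is a helper made up for this sketch.
def parser_events(code)
  Array(Ripper.sexp(code)).flatten.grep(Symbol)
end

# (return) if ... : `if` is lexed as modifier_if.
modifier_form = parser_events("def m; return if true; end")
# return (if ...) : inside parentheses `if` is lexed as keyword_if.
keyword_form = parser_events("def m; return (if true then 1 end); end")

puts modifier_form.include?(:if_mod)
puts keyword_form.include?(:if)
```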

Endless range • Endless range literal is cutting-edge syntax •
Traditional range ends with EXPR_END however endless range ends with EXPR_BEG 1 ... 2 BEG END BEG END 1 ... BEG END BEG

Endless range • Concerning lex state sensitive tokens • ‘%’:
is interpreted as a start of % string literal if EXPR_BEG • ‘||’: is divided into two ‘|’ if EXPR_BEG
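The state sensitivity of ‘%’ can be seen with `Ripper.lex`. In this sketch `token_pairs` is a hypothetical helper; the first input puts ‘%’ after an operand, the second puts it where an expression begins.

```ruby
require "ripper"

# Returns [token_type, text] pairs from the lexer, skipping whitespace tokens.
# `token_pairs` is a helper made up for this sketch.
def token_pairs(code)
  Ripper.lex(code)
        .map { |elem| [elem[1], elem[2]] }
        .reject { |type, _text| type == :on_sp }
end

# After an operand, `%` is the modulo operator.
p token_pairs("1 % 2")

# At the beginning of an expression, `%(` starts a string literal.
p token_pairs("x = %(2)")
```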

Endless range • However it might not matter • ‘..’
and ‘...’ have relatively low precedence

Endless range • “and” also doesn’t matter • “and” has
lower precedence than “...”

Endless range • “rescue” might matter ? • Parser generator
could help [Feature #12912] discussion more

Last battle with Space • I think space and newline
are the most mysterious syntax part of Ruby • “space_seen” variable • ‘\n’ token and tIGNORED_NL token • How to include space and newline into parser context is an open problem
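A familiar case of space-sensitive parsing, sketched with Ripper. The identifier `foo` is hypothetical and is not a local variable here, and `top_level_event` is a helper made up for this example.

```ruby
require "ripper"

# Returns the event name of the first top-level statement.
# `top_level_event` is a helper made up for this sketch.
def top_level_event(code)
  Ripper.sexp(code)[1][0][0]
end

# Space before `-` but not after: a method call with the argument -1.
p top_level_event("foo -1")

# Space on both sides: a binary subtraction.
p top_level_event("foo - 1")
```

The only difference between the two inputs is one space, yet the first yields a `:command` event and the second a `:binary` event.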

Summary • It’s difficult for humans to understand the combination
• Cannot keep soundness of grammar without the help from computer science • PSLR is key concept for checking soundness of lexer state sensitive grammar

What parser generates I want to know truth of syntax
tree design - Osaka RubyKaigi 04 -

Recap Osaka RubyKaigi 04 https://yui-knk.hatenablog.com/entry/2024/08/23/113543

Use cases of Syntax Tree and its design in 10
mins

Syntax Tree • Parser generates Syntax Tree for other libraries
and components • Compiler, Type System, LSP, Linter and Code Formatter • Therefore what’s the use case of Syntax Tree ? • How to satisfy the use cases ?

Use cases • They want to execute codes • compile.c
• Type System • They want to analyze codes • LSP (ruby-lsp) • Linter & Code Formatter (RuboCop)

Code analysis • Need token information (Syntax Highlight) • Need
to analyze comments (LSP DocumentLink) • Need to walk through parent node from child node (LSP SelectionRange) • Want to rewrite codes (LSP & Code Formatter) • This is the most difficult use case, right now

Code rewriting • Style::IfInsideElse Cop • Unnest if inside if
https://github.com/rubocop/rubocop/blob/v1.65.1/lib/rubocop/cop/style/if_inside_else.rb#L10-L29

How RuboCop rewrites codes if condition_a action_a else if condition_b
action_b else action_c end end if condition_a action_a elsif condition_b if condition_b action_b else action_c end end if condition_a action_a elsif condition_b action_b action_b else action_c end end if condition_a action_a elsif condition_b action_b action_b else action_c end if condition_a action_a elsif condition_b action_b else action_c end 1. else to elsif 2. Delete “if condition_b” 3. Delete end 4. Delete duplicated “action_b”

Problem of TreeRewriter #1 • Implementation is complex • TreeRewriter
doesn’t edit source code directly • Create TreeRewriter::Action instances, store the actions then apply the changes at once Action. :replace (2, 0)-(2, 4) “elsif condition_b” Action. :replace (3, 2)-(3, 16) “action_b” Action. :replace (7, 0)-(7, 6) “” Action. :replace (4, 0)-(4, 13) “”
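The idea of storing actions and applying them at once can be sketched in plain Ruby. `ActionQueue` below is a hypothetical stand-in written for this illustration. It is not the API of `Parser::Source::TreeRewriter`, and it uses character offsets instead of (line, column) pairs and assumes the ranges do not overlap.

```ruby
# Hypothetical, simplified action queue; not the parser gem's implementation.
class ActionQueue
  Action = Struct.new(:range, :text)

  def initialize(source)
    @source = source
    @actions = []
  end

  # Record a replacement of source[range] with text. Nothing is edited yet.
  def replace(range, text)
    @actions << Action.new(range, text)
    self
  end

  # Apply every recorded action in one pass, starting from the end of the
  # source so that the offsets of earlier ranges remain valid.
  def process
    result = @source.dup
    @actions.sort_by { |action| -action.range.begin }.each do |action|
      result[action.range] = action.text
    end
    result
  end
end

source = "if condition_a\n  action_a\nelse\n  action_b\nend\n"
else_at = source.index("else")
body_at = source.index("action_b")

queue = ActionQueue.new(source)
queue.replace(else_at...(else_at + 4), "elsif condition_b")
queue.replace(body_at...(body_at + 8), "action_c")
puts queue.process
```

Both ranges are computed against the original source. Applying them front to back on a mutable string would shift the second range, which is the problem the stored actions avoid.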

Why Action is needed #1 • It’s costly to edit
string every time • In both cases, need to move/copy sub-strings after “else” if condition_a action_a else action_b end if condition_a action_a action_b end Delete else if condition_a action_a else action_b end if condition_a action_a elsif action_b end Replace with elsif

Why Action is needed #2 • Directly editing the code
affects the rest of nodes if condition_a action_a else action_b end Parser::Source::Buffer if condition_a action_a action_b end Parser::Source::Buffer NODE_VCALL action_b Range (3, 2)-(3, 10) Delete else

Problem of TreeRewriter #2 • Rewriting operations are complicated •
Need to understand current status of each step if condition_a action_a else if condition_b action_b else action_c end end if condition_a action_a elsif condition_b if condition_b action_b else action_c end end if condition_a action_a elsif condition_b action_b action_b else action_c end end if condition_a action_a elsif condition_b action_b action_b else action_c end if condition_a action_a elsif condition_b action_b else action_c end 1. else to elsif 2. Delete “if condition_b” 3. Delete end 4. Delete duplicated “action_b”

Rewriting Syntax Tree • Can leverage Tree Structure • Change
NODE_IF to NODE_ELSIF then delete NODE_ELSE NODE_IF condition_a action_a NODE_ELSE NODE_IF condition_b action_b NODE_ELSE action_c NODE_IF condition_a action_a NODE_ELSIF condition_b action_b action_c

Generate source code from Syntax Tree • Once the syntax tree is
rewritten, render source code from the Syntax Tree • However AST doesn’t have spaces, newlines and so on… NODE_IF condition_a action_a NODE_ELSIF condition_b action_b action_c

Other difficulties #1 • Need to pass range
information for new node • Calculation is still based on text oriented approach • Cannot fully leverage the syntax tree transformation if condition_a action_a else if condition_b action_b else action_c end end if condition_a action_a elsif condition_b action_b else action_c end Start is same with else The last line is “action_c line - 1” The last column is “tail of action_c - 2”

Other difficulties #2 • All nodes following the
updated node are affected NODE_CLASS condition_a action_a NODE_ELSIF NODE_DEF NODE_IF NODE_DEF expr1 expr2 expr2 expr2 condition_b action_b action_c Update!!

Migration Tool • Code rewriting is not only for LSP
and code formatter • But also migration tool like Transpec http://yujinakayama.me/transpec/

How to solve the problem

Concrete Syntax Tree • Concrete Syntax Tree (CST) for code
restoration • CST preserves information which AST omits, e.g. spaces, newlines, parentheses • AST focuses on semantics, CST focuses on Syntax • Implementation • Introduce data structure for token • Keep information on token which lexer omitted • Node has child nodes and tokens Syntax Tree having complete information of source code

Trivia • Trivia is information which lexer omits • Spaces,
Newlines, comments and so on Trivia (comment) Trivia (spaces) Trivia (new line)

Node, Token and Trivia • Token has trailing trivia and
leading trivia • Node holds nodes and tokens NODE_IF IF cond action_a END Token NODE Legend space (1) NL (1) + space (2) NL (1) Trivia

Syntax Tree to code • Dump codes with Depth-first search
to get the whole codes Token NODE Legend NODE_IF IF cond action_a END space (1) NL (1) + space (2) NL (1) Trivia
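The node, token and trivia layout above can be sketched in a few lines of Ruby. `Token` and `Node` are hypothetical structures written for this illustration, not taken from any existing library, and trivia is kept as plain strings for brevity.

```ruby
# Hypothetical, simplified CST structures for illustration only.
# A token carries its own text plus the trivia around it.
Token = Struct.new(:text, :leading, :trailing) do
  def to_source
    "#{leading}#{text}#{trailing}"
  end
end

# A node holds child nodes and tokens in source order.
Node = Struct.new(:type, :children) do
  # Depth-first, left-to-right walk that concatenates every token.
  def to_source
    children.map(&:to_source).join
  end
end

source = "if cond\n  action_a\nend"

tree = Node.new(:NODE_IF, [
  Token.new("if", "", " "),                                  # space (1)
  Node.new(:NODE_VCALL, [Token.new("cond", "", "\n  ")]),    # NL (1) + space (2)
  Node.new(:NODE_VCALL, [Token.new("action_a", "", "\n")]),  # NL (1)
  Token.new("end", "", ""),
])

puts tree.to_source == source
```

Because every character of the input belongs to some token or its trivia, dumping the tree restores the original source exactly.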

Red Green Tree • Red Green Tree is editable Syntax
Tree • Invented by C# (Roslyn) • Swift (SwiftSyntax) and rust-analyzer (LSP) use this • Represent Syntax Tree with Red Node and Green Node • Let’s read swift-syntax • https://github.com/swiftlang/swift-syntax

Red Green Tree • Green Node • has reference to
child elements • has width • Red Node • has reference to parent elements • has offset Token Green NODE Legend Red NODE NODE_IF width: 90 IF width: 3 NODE_IF width: 56 condition_a width: 11 action_a width: 11 NODE_ELSE width: 61 END width: 4 ELSE width: 5 NODE_IF offset: 0 NODE_ELSE offset: 25 NODE_IF offset: 30
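The division of responsibilities between green and red nodes can be sketched as follows. All class names here are hypothetical and much simpler than Roslyn or swift-syntax; trivia is folded into the token text so that widths add up to the source length.

```ruby
# Hypothetical sketch of the red/green split; names are illustrative only.

# Green token: a leaf that knows only its own text, hence its width.
GreenToken = Struct.new(:kind, :text) do
  def width
    text.length
  end

  def children
    []
  end
end

# Green node: references its children and derives its width from them.
# It knows neither its parent nor its position.
GreenNode = Struct.new(:kind, :children) do
  def width
    children.sum(&:width)
  end
end

# Red node: a view over a green node that adds the parent and the offset.
class RedNode
  attr_reader :green, :parent, :offset

  def initialize(green, parent = nil, offset = 0)
    @green = green
    @parent = parent
    @offset = offset
  end

  # Red children are created on demand; offsets accumulate green widths.
  def children
    position = offset
    green.children.map do |child|
      red = RedNode.new(child, self, position)
      position += child.width
      red
    end
  end
end

green = GreenNode.new(:NODE_IF, [
  GreenToken.new(:IF, "if "),
  GreenNode.new(:NODE_VCALL, [GreenToken.new(:IDENT, "cond\n  ")]),
  GreenNode.new(:NODE_VCALL, [GreenToken.new(:IDENT, "action_a\n")]),
  GreenToken.new(:END, "end\n"),
])

root = RedNode.new(green)
p root.green.width
p root.children.map(&:offset)
```

Since green nodes store only widths, an edit does not need to touch the positions of the nodes that follow; offsets are recomputed when red nodes are built.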

Recap • Execute codes: compile.c • Analyze codes: LSP, Linter
& Code Formatter • Text oriented code rewriting is difficult • Syntax Tree rewriting • Generate codes from Syntax Tree • Concrete Syntax Tree !! • Editable Syntax Tree • Red Green Tree !!

Problem • How to keep edited Syntax Tree correct ?
• If it’s not correct, hopefully want to auto correct • Syntax Tree rewriting can create Syntax Tree which parser never generates + 1 2 3 * * + 1 2 3 parse Rewrite Dump

Open Problem • Simple approach: Parse the dump code and
compare syntax tree with the syntax tree before dump • Grammar might know which syntax tree parser can generate Grammar file Parser generator Parser Syntax Tree Checker
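The simple approach can be sketched with Ripper. The tree shape (`[operator, left, right]` with Integer leaves), the naive `dump` that never emits parentheses, and the helper names are all assumptions made for this illustration.

```ruby
require "ripper"

# Hypothetical tree shape: [operator, left, right] or an Integer leaf.
# Naive dump that ignores precedence and never adds parentheses.
def dump(tree)
  return tree.to_s unless tree.is_a?(Array)

  op, left, right = tree
  "#{dump(left)} #{op} #{dump(right)}"
end

# Converts the subset of Ripper's output used here back into that shape.
def from_ripper(sexp)
  case sexp[0]
  when :binary then [sexp[2], from_ripper(sexp[1]), from_ripper(sexp[3])]
  when :@int   then sexp[1].to_i
  else raise ArgumentError, "unsupported node: #{sexp[0]}"
  end
end

# Dump the tree, parse the dumped code, and compare with the tree before dump.
def round_trips?(tree)
  reparsed = from_ripper(Ripper.sexp(dump(tree))[1][0])
  reparsed == tree
end

# A tree the parser would produce for "1 + 2 * 3".
p round_trips?([:+, 1, [:*, 2, 3]])

# A rewritten tree that the naive dump cannot represent: it dumps to
# "1 + 2 * 3" as well, which parses back to a different tree.
p round_trips?([:*, [:+, 1, 2], 3])
```

The check detects the broken rewrite but says nothing about how to repair it, which is where knowledge from the grammar could help.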

Summary • Research Syntax Tree use case • Code rewriting
is the most difficult use case, right now • Text oriented code rewriting is difficult • Syntax Tree rewriting • Generate codes from Syntax Tree • How to keep edited Syntax Tree correct • Open Problem

Conclusion The world is now in the great age of
parsers. People are setting sail into the vast sea of parsers. - RubyKaigi 2023 LT in Matsumoto -

Summary • Grammar defines the boundary of the language and
structure of the language • What parsers do • Parser distinguishes valid input and invalid input • Parser gives structure to the input correctly • LR-attributed grammar is the foundational theory of current Ruby parser

Summary • Use Case: Ruby • Ruby syntax is designed
for programmers not for machines • Ruby syntax changes • Parser needs to • have theory and mechanism which mitigate implementation complexities • give the language designer feedback about syntax changes • PSLR is key concept for lexer state sensitive grammar

Summary • Research Syntax Tree use case • Code rewriting
is the most difficult use case, right now • Text oriented code rewriting is difficult • Syntax Tree rewriting • Generate codes from Syntax Tree • How to keep edited Syntax Tree correct • Open Problem

What is Grammar • The ruler of lexer, parser and
syntax tree • By grammar, we can reveal what Ruby is • It’s very interesting to expose the secret of Ruby syntax from grammar • I want to reveal what is the key of Ruby’s programmer friendly syntax • Hypothesis: • Programmers • can understand expression beginning and ending • feel it’s natural that conditional branch follows flow control keywords, like “return” • can understand space sensitive grammar • sometimes fail to understand precedence of non-arithmetic operators

What is Parser
