Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Kingdom of the Machine

Kingdom of the Machine

Avatar for yui-knk

yui-knk

April 22, 2026

More Decks by yui-knk

Other Decks in Programming

Transcript

  1. ˒ Chomsky Hierarchy ˒ Equivalence Checking for (Regular) Languages ˒

    Attribute Grammars and Their Subclasses ˒ Automatic Parser Generation from Grammars ˒ etc. The Golden Age of Formal Languages
  2. ˒ Ruby ˒ PHP ˒ Perl ˒ C ˒ Go

    ˒ PostgreSQL ˒ etc. The Golden Age of Formal Parser Generators
  3. ˒ Were people fed up with dif fi cult theory?

    ˒ Had GNU Bison stopped evolving? ˒ Go ahead - put your favorite complaint about parsing theory here!!! ˒ 2006: GCC migrates it’s parser from Bison to hand-written recursive- descent parsers (C++ was 2004) ˒ 2015: Go migrates it’s parser from Bison to hand-written recursive- descent parsers They are like a dream on a spring night…
  4. Fin

  5. ˒ A Combination of Recursive Descent and Pratt Parsing ˒

    Largely absent from major textbooks, such as the Dragon Book ˒ Vaughan R. Pratt, “Top Down Operator Precedence.” (1973) Implementing a Hand-Written Parser Today
  6. ˒ In Pratt parsing, each operator has binding power on

    both its left and right sides. ˒ Higher numbers mean higher precedence, so * binds more tightly than +. ˒ If an operator’s right binding power is greater than its left, it is left-associative; otherwise, it is right-associative. 1. Binding power (10) + (11) (10) - (11) (20) * (21) (20) / (21) (30) ^ (30)
  7. De fi ning Simple Arithmetic Expressions ˒ Even though we

    only want to de fi ne simple arithmetic expressions …
  8. ˒ A string is said to be ambiguously derived in

    a grammar when the grammar generates that string in more than one way. Ambiguity E 1 2 3 E + * E 1 2 3 E + *
  9. Tweak the Grammar ˒ This change lets us express the

    precedence and associativity people would naturally expect.
  10. ˒ Determine which rule to apply by always looking ahead

    a token. ˒ LL(k) allows us to look ahead k tokens, but here we assume the usual case: k = 1. ˒ In other words, this forbids multiple rules from beginning with the same token. ˒ In an actual implementation, left-recursive rules cause an in fi nite loop. Recursive Descent Parser (LL parser)
  11. ˒ Both issues can be addressed through mechanical rewriting. ˒

    This man tends to steer clear of LL parsers for one simple reason: he dislikes this kind of rewriting. Tweak the Grammar again
  12. ˒ LR parser generators avoid the tedious grammar rewriting often

    required by LL parser generators. LR parser (1) 2 E: T • ["end of fi le", '+'] 3 T: T • '*' F 3 T: T '*' • F 5 F: • number 0 $accept: E • "end of fi le" 1 E: E • '+' T 1 • * 2 + 3 1 • 1 • + 2 * 3
  13. ˒ Some LR parser generators let you de fi ne

    precedence and associativity directly on tokens. ˒ In that case, the original grammar works just fi ne. LR parser (2)
  14. ˒ A binary operator appears in pre fi x position.

    ˒ Humans use parentheses to explicitly mark syntax tree boundaries. Parser-Friendly Grammar
  15. ˒ The parser works by looping over comparisons between the

    binding power of the current operator and that of the next one. Pseudocode for a Pratt Parser
  16. ˒ Since * has the larger number, parsing continues with

    * rather than + ˒ Since - has the smaller number, parsing stops once at *. How it works (10) + (11) (10) - (11) (20) * (21) (20) / (21) (30) ^ (30) Binding power
  17. ˒ The code denoted by a token ˒ nud: Null

    denotation ˒ led: Left denotation ˒ It is something like a semantic action in an LR parser. 2. nud & led
  18. ˒ nud function and led function are called for each

    operator Pseudocode for a Pratt Parser
  19. ˒ It parses the expression on the right. ˒ It

    constructs a node. Pseudocode for a Pratt Parser
  20. ˒ > BNF grammars alone do not deal adequately with

    either of these issues … ˒ > writing an LL(k) grammar, and keeping it LL(k) after extending it, seems to be a black art … ˒ > However, this is really a draw-back of BNF; the non-terminals tempt one to try to say everything with just context-free rules … ˒ One thing I can say is that, in the pseudocode, what nud and led receive is a token. 3. Anti-BNF
  21. ˒ The thrill of seeing operators handled so simply ˒

    LL parsers require grammar rewriting. ˒ LR parsers are not as simple as Pratt parsers, either. ˒ There was the confusion—or perhaps unease—of not using BNF at all. Excitement and Unease
  22. ˒ This function takes a token and determines whether that

    token can appear at the start of an expression. 1. token_begins_expression_p
  23. ˒ We will not discuss here what “expression” means in

    Ruby. ˒ Normally, statements represent actions, while expressions yield values—but in Ruby, even actions have values. ˒ In parse.y terms, an expr is not an arg, so it cannot be used as an argument, but it is still a smaller unit than a stmt—though that is probably not worth getting into here. Note: What is the “expression”?
  24. ˒ In some cases, parsing of the range stops right

    after ..., and in others, it does not. ˒ So how do we decide? We have to look at the token that comes right after … . #1: Endless range
  25. ˒ when the operator is .. or ..., this function

    looks at the next token to decide whether the right-hand side of the range should be parsed. Where this function is called?
  26. ˒ How is token_begins_expression_p function itself implemented? ˒ Fortunately, the

    number of token types is fi nite, so each one can be checked individually. By the way
  27. ˒ It can be the right-hand side of the range,

    so parsing continues. ˒ token_begins_expression_p returns true For literals
  28. ˒ This becomes an endless range, so parsing stops here.

    ˒ token_begins_expression_p returns false For “&&” “||”
  29. ˒ It can be the right-hand side of the range,

    so parsing continues. ˒ More precisely, + and - can be either binary or unary: if binary, parsing stops; if unary, parsing continues. ˒ Since the lexer is in EXPR_BEG here, it should produce a unary-operator token. For ‘+’ ‘-’
  30. ˒ * can be either binary or unary; in the

    unary case, it denotes a splat. ˒ However, unary * is not allowed in this position. ˒ Not sure it is entirely right to say that it is “not an expression” simply because it cannot appear here, but for now, let’s treat it as not being an expression. ˒ In fact, token_begins_expression_p explicitly returns true for tokens like UPLUS (+), but not for USTAR (*). For ‘*’
  31. ˒ / is fi ne here, since it can start

    a regular expression literal. For ‘/’
  32. ˒ But there is some good news: Rust is another

    language that has endless ranges. In Rust
  33. ˒ Another example of where token_begins_expression_p is used is right

    after return. ˒ If the next token is a literal, such as 1, parsing continues. ˒ If the next token is a newline or a post fi x if, parsing stops at that point. ˒ if can also appear in two forms: pre fi x and post fi x. #2: return
  34. ˒ * can appear immediately after return, where it means

    splat. ˒ But earlier, we decided to treat something beginning with * as not being an expression. Can ‘*’ start expression?
  35. ˒ In the current implementation, when checking whether an expression

    follows return, the * and ** cases are handled separately as an additional check. Responsibility of the caller
  36. ˒ Another example, ** is allowed right after return, but

    not right after rescue. ˒ Anything that token_begins_expression_p function cannot account for has to be handled on the caller side. Responsibility of the caller
  37. ˒ It is probably a set of tokens with fairly

    similar properties. ˒ Good thing it wasn’t a FOLLOW set :) ˒ The FIRST set of a nonterminal or production is the set of terminal symbols that may appear at the beginning of a string derived from it. FIRST sets?
  38. ˒ Fundamentally, the original Pratt parser framework only gives us

    one mechanism: de fi ne binding powers for operators, and the parser follows them. What pratt parser provides?
  39. ˒ Since nud and led can contain arbitrary code, that

    is where all the hard work ends up being done. ˒ To implement if …, you have to write the code for parsing the entire if construct yourself inside nud. So what happens then?
  40. ˒ With an LR parser, the generator just handles it,

    so I do not really have to think about it. How is this handled in LR parsing?
  41. ˒ In the previous case, the rule was that &&

    would make it an endless range, while a literal such as 1 would make it an ordinary range. #1: Endless range
  42. ˒ For example, state 380 represents the point just after

    reading … . ˒ There, if the next token is something like an integer literal or class, parsing continues as the right-hand side of a range; otherwise—for example, &&—it is treated as an endless range. #1: Endless range
  43. ˒ First, consider a production rule with a dot inserted

    between the symbols on its right-hand side. This is what is called an item. 1. Constructing items from production rules
  44. ˒ Next, compute the closure to expand the set of

    items in the state. 2. Computing the closure
  45. ˒ Connect states by transitions labeled with terminal and nonterminal

    symbols. 3. Consider transitions between states
  46. ˒ That is how, by building the states this way,

    we can tell which tokens let parsing continue in each state. 4. States indicates next tokens
  47. ˒ This function determines whether parsing should stop based on

    the node and the binding power of the next token. 2. parse_expression_terminator
  48. ˒ Method calls can appear on the left-hand side of

    pattern matching, but commands—method calls without parentheses— cannot, according to the grammar. Example: command
  49. ˒ If this were just about precedence between + and

    in, we could handle it in Pratt parsing by assigning binding powers. Pratt parsing can handle operator precedence
  50. ˒ What matters here is the grammatical structure of whatever

    appears to the left of in. ˒ Since the issue involves the in fi x operator in, checking only the fi rst token of the left-hand side is not enough to resolve it. What’s the problem?
  51. ˒ So the idea is probably to check the left-hand-side

    node together with in to ensure grammatical validity. Checking nodes and operators ⭕ ❌
  52. ˒ Is there a one-to-one correspondence between nodes and grammar

    rules? ˒ For example, what about something preceded by not and followed by do ... end? A simple question (1)
  53. ˒ Given the vast universe of valid strings, some can

    appear on the left-hand side of in and some cannot. Doesn’t this ultimately turn into a function that enumerates all the syntax trees that are not allowed there? A simple question (2)
  54. ˒ With an LR parser, the generator just handles it,

    so I do not really have to think about it. How is this handled in LR parsing?
  55. ˒ In State 79, in cannot be shifted. That is

    consistent with the grammar, since in is not allowed after an expr. Symbols are manage by the stack fcall cmd command_args 1, 2 expr cmd 1, 2 Reduce
  56. ˒ Traces of nonterminals keep showing up. ˒ The parse_expression

    function for parsing expressions takes binding power as its second argument. 3. Something that looks like a nonterminal
  57. ˒ For && or and, for example, pass the current

    operator’s right binding power to parse_expression. ˒ This is exactly the kind of thing Pratt parsing is naturally designed for. #1: “&&” “and”
  58. ˒ In Ruby, what can follow a modi fi er

    rescue depends on where it appears. ˒ Modi fi er rescue has different precedence depending on where it appears. #2: modi fi er rescue
  59. ˒ As a result, in some cases we end up

    hard-coding an appropriate binding power for the situation and then calling parse_expression. How to represent this rule
  60. ˒ A Pratt parser assumes that each operator has a

    single fi xed precedence within the grammar. ˒ To handle operators like modi fi er rescue, whose precedence changes depending on the situation, you need some extra mechanism outside the Pratt parser itself. ˒ From a parser-generator person’s point of view, this really starts to look like a nonterminal. Assumption
  61. ˒ With an LR parser, the generator just handles it,

    so I do not really have to think about it. ˒ Write stmt and it parses stmt; write arg and it parses arg. How is this handled in LR parsing?
  62. ˒ Knowledge speci fi c to parser generators ˒ Knowledge

    speci fi c to parser ˒ Grammar knowledge of the target language Three Distinct Kinds of Knowledge
  63. ˒ How to declare precedence, how to use %prec, and

    how to read con fl ict reports and so on. ˒ This is knowledge you only need when working with parser generators. 1. Knowledge speci fi c to parser generators
  64. ˒ How operators—especially in fi x operators—are handled, including precedence

    and associativity. ˒ How to tell whether a token can start an expression ˒ How to determine where an expression ends 2. Knowledge speci fi c to parser
  65. ˒ For operators, you need to understand how Pratt parsing

    works. ˒ You may also need to understand functions like token_begins_expression_p and parse_expression_terminator. ˒ My personal impression is that Pratt parsing has a fairly narrow scope: it really only handles operators. As a result, you have to face the problem of how to design everything outside that scope, and what makes it dif fi cult is that those parts tend to become highly language-speci fi c. With a hand-written parser.
  66. ˒ You normally do not have to think about this

    when things are going well, but you may still need this kind of knowledge when investigating the cause of a con fl ict. ˒ Of course, anyone implementing a parser generator is expected to have a solid understanding of parsing. ˒ But that’s part of the fun, isn’t it? With parser generators
  67. ˒ What can and cannot appear in each grammatical position.

    ˒ For an ambiguous grammar, what choices were made to resolve the ambiguity. ˒ This is necessary whether the parser is hand-written or built with a parser generator. 3. Grammar knowledge of the target language
  68. ˒ “I am not sure if this is going to

    work in all cases, it may need to be refactored later. But it appears to work for now.” Comment on the function
  69. ˒ There are always some people who want to add

    new grammar. ˒ One of the points that comes up in such discussions is whether adding that grammar is syntactically sound. ˒ In other words, whether it introduces ambiguity. ˒ At the same time, this also leads to discussions about how far such additions can go. Grammar is a living thing
  70. ˒ That naturally raises the question: how do languages with

    hand- written parsers verify ambiguity? ˒ You think it’s a stretch to make the parser responsible for feedback on new grammar? ˒ Perhaps. But surely it remains the case that someone, somewhere, must do it, does it not? Grammar is a living thing
  71. ˒ Q1. What is this? ˒ Q2. What could this

    be ambiguous with? #1: Type parameter in Go
  72. ˒ P is a type parameter ˒ *C is a

    type constraint ˒ *C is a pointer type of C A. Generic type?
  73. ˒ P * C denotes the length of an array

    type. ˒ In Go, any expression can be used to specify the length of an array type. B. Array type?
  74. ˒ In these rare cases, the type parameter list is

    indistinguishable from an expression and the type declaration is parsed as an array type declaration. To resolve the ambiguity, embed the constraint in an interface or use a trailing comma: ˒ https://go.dev/ref/spec#Type_parameter_declarations This is an array type
  75. ˒ We will not go into the pros and cons

    of this grammar here. ˒ My personal view is that operators often end up being contested between types and the core grammar, which makes it dif fi cult to integrate types cleanly into the grammar. ˒ What interests me here is whether this was actually designed that way. What to Focus On
  76. ˒ https://groups.google.com/g/golang-nuts/c/iAD0NBz3DYw ˒ “We’re going to settle on square brackets

    for the generics syntax.” ˒ “To avoid the ambiguity with array declarations, we will require that all type parameters provide a constraint.” 2020/08/21 [ generics] Moving forward with the generics design draft
  77. ˒ The right-hand side of the assignment can be read

    as two comparison expressions. ˒ It can also be read as a call to w, instantiated with the types x and y, taking z as an argument. Why not use the syntax F<T> like C++ and Java?
  78. ˒ It can be read as a single unnamed parameter

    of type x(T)—that is, the type x instantiated with the type parameter T. ˒ But it can also be read as a parameter named x with type T. Why not use the syntax F(T)?
  79. ˒ They call it a “known ambiguity,” but I could

    not fi nd any earlier discussion of this speci fi c case. ˒ Seriously, if it was already known, put it in the proposal. ˒ So my guess is that they noticed this at some point during implementation. (known) ambiguity?
  80. ˒ This ticket is about cases where pattern matching cannot

    be written under certain conditions. Bug #18080
  81. That said, the grammar still has to be managed, and

    the syntax still has to be parsed, right? Why not try using AI or something? As they say, use whatever is available—even your parents, if they are standing there. そうは言っても文法は管理しないといけないし 構文は解析しないといけないでしょ? AIとか試してみたら? 立ってるものは親でも使えってね
  82. ˒ In short, this ticket is about changing how expressions

    are parsed when assignment, rescue, and in interact. Bug #21097
  83. Please change the current parse.y behavior so that x =

    a rescue b in c, which is currently parsed as (x = (a rescue b)) in c, is instead parsed as x = (a rescue (b in c)). … In parse.y, when handling arg keyword_in, I changed it so that b in c is folded into the body of rescue only when the left-hand side is a simple assignment or an endless def / defs, and the right-hand side is a modi fi er rescue. The core of the implementation is in parse.y (line 1660), and the call site is in parse.y (line 3565).
  84. ˒ There is no change to the grammar itself; only

    the semantic action was rewritten. Iter. 1: Modify AST in the action
  85. ˒ Inside the newly introduced function, it branches based on

    the kind of node. Iter. 1: Modify AST in the action
  86. ˒ It inspects the structure of the node on the

    left-hand side of in, and if it matches a particular pattern—say, an assignment combined with rescue—it rewrites the node structure into a more convenient form. Iter. 1: Modify AST in the action Assign Rescue x a b Rescue a Case b In Assign x
  87. Instead of achieving this by adjusting the node construction in

    the semantic action, try achieving it by changing the production rules themselves. Revised parse.y so that it no longer relies on post hoc AST rewriting, and instead resolves the issue through the production rules alone. I added a long, dedicated rule for rescue ... in ..., and assigned %prec tLOWEST to the shorter ... rescue ... side so that keyword_in is picked up fi rst. At the same time, I introduced a pattern entry point speci fi cally for the right-hand side of rescue, so that |, .., …, and => can be parsed without breaking ordinary pattern matching.
  88. ˒ It adds cases for body of endless method de

    fi nitions and right- hand sides of assignments that involve both rescue and in. ˒ It also adjusts precedence to prevent reduction when in is present. ˒ Fair enough, I guess. Iter. 2: New production rules
  89. ˒ For patterns, it introduces p_rescue_ pre fi xed rules

    such as p_rescue_expr, and these are used to exclude forms like parenthesis-less array patterns. Iter. 2: Sub rules of patterns
  90. Compared with p_top_expr_body, what is restricted in p_rescue_expr? … As

    a result, compared with p_top_expr_body, what p_rescue_expr mainly restricts is top-level abbreviated forms. What it does not allow: * bare array patterns: a, b * bare rest/tail patterns: *a, a, *rest * bare fi nd patterns * bare hash patterns: a:, a: 1, **rest
  91. ˒ The relatively low precedence assigned to the reduction from

    p_rescue_alt to p_rescue_as is there to make the parser choose shift when it con fl icts with |. Iter. 2: Sub rules of patterns
  92. Please handle the => case in the same way, not

    just the in case. I changed the => side as well so that it is handled directly by the grammar rules. I added long, dedicated rescue ... => ... rules to arg_rhs and endless_arg, so that x = a rescue b => c and def f = a rescue b => c are now parsed as x = (a rescue (b => c)) and def f = (a rescue (b => c)), respectively.
  93. ˒ This change also adds new rules for cases where

    both rescue and the arrow appear, just as it does for in. ˒ Fair enough, I guess. Iter. 3: Treat “=>” the same way as “in”
  94. ˒ Other approaches are possible, but the approach taken here—

    adding restrictions to patterns—is also one worth considering. ˒ Allow it only in assignments that aren't arg rule ˒ As for the implementation, I would like to see it shared a bit more with the existing rules. ˒ That would require adding new functionality to the generator. ˒ Overall, it is not as bad as I expected. Impressions
  95. Why is %prec keyword_in necessary in p_rescue_as: p_rescue_alt %prec keyword_in?

    Adding %prec keyword_in gives that reduce side the same low precedence as keyword_in. Since | has higher precedence than that, the parser chooses to shift |. In fact, the report says: * Con fl ict between reduce by "p_rescue_as -> p_rescue_alt" and shift '|' resolved as shift ("'in'" < '|')
  96. ˒ This report fi le contains everything you need to

    debug the parser. ˒ If it’s not there, it’s just not there. ˒ I added this Precedences section myself because I felt it was absolutely necessary, so I’m happy to see someone actually using it. ˒ Pulling something like this feels unfair—it makes me like it more. The report fi le