Ruby's Line Breaks

yui-knk
April 16, 2025

Transcript

  1. ✦ Yuichiro Kaneko ✦ yui-knk (GitHub) / spikeolaf (Twitter) ✦

    Treasure Data ✦ Engineering Manager of Applications Backend About me
  2. TD and Ruby committers twitter: @nalsh GitHub: @nurse twitter: @k_tsj

    GitHub: @k-tsj twitter: @spikeolaf GitHub: @yui-knk twitter: @mineroaoki GitHub: @aamine twitter: @nahi GitHub: @nahi Applications Backend
  3. ✦ Yuichiro Kaneko ✦ yui-knk (GitHub) / spikeolaf (Twitter) ✦

    CRuby committer, mainly develop parser generator and parser ✦ Lrama LALR (1) parser generator (2023, Ruby 3.3) ✦ The Bison Slayer ✦ The parser monster ✦ The dawn bringer of the parser world ✦ Ripper Rearchitecture (2024, Ruby 3.4) ✦ Code positions to RNode (2018, Ruby 2.6) ✦ RubyVM::AbstractSyntaxTree (2018, Ruby 2.6) About me
  4. ✦ This code consists of two method dispatches ✦ Calls `#p`

    method with 1 ✦ Calls `#p` method with 2 ✦ In Ruby, you can use line breaks to separate meaningful chunks of code Line Breaks in Ruby Grammar
  5. ✦ This code consists of one method dispatch ✦ `Integer#+` is

    called for `1` with `2` Line Breaks in Ruby Grammar
  6. ✦ There is a significant difference between these two

    snippets ✦ #1: Due to the line break, the first and second lines are interpreted as distinct method calls ✦ #2: The two lines are combined into a single line of code ✦ In other words, the second snippet behaves as if the line break is ignored Line Breaks are the question
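
The code on these slides did not survive in the transcript. As a minimal sketch, assuming the two snippets were along the lines of `p 1` / `p 2` and `1 +` / `2`, the difference can be observed by evaluating them:

```ruby
# Assumed reconstructions of the two snippets; the original slide code is not in the transcript.

# Snippet 1: the line break ends the first statement, so `p` is called twice.
snippet1 = "p 1\np 2"

# Snippet 2: the line break is ignored, so this is the single expression `1 + 2`.
snippet2 = "1 +\n2"

result1 = eval(snippet1) # prints 1 and 2; the value is that of the last statement
result2 = eval(snippet2) # Integer#+ is called once
```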
  7. ✦ Hypothesis: In principle, a statement terminates if a line

    break is present at a point where a statement can be completed Principles of Line Breaks in Ruby https://x.com/tanaka_akr/status/1870679443376947467
  8. ✦ The first line of this code becomes a

    complete statement as a method call without arguments ✦ Therefore this code is two method calls, not a single method call with an argument ✦ The principle holds true #1: method call w/o args !=
  9. ✦ The first line of this code isn’t a

    complete statement because it raises a syntax error on its own ✦ Therefore this code is one method call with two arguments ✦ The principle holds true #2: method call w/ args == Syntax Error
  10. ✦ “1 +” isn’t a complete statement ✦ Therefore this

    code is interpreted as “1 + 2” ✦ The principle holds true #3: binary operator Syntax Error ==
  11. ✦ “-” isn’t a complete statement ✦ Therefore this code

    is interpreted as ‘-“str”’ ✦ The principle holds true #4: unary operator Syntax Error ==
  12. ✦ Because the ternary operator's scope includes the expression following

    the colon, the statement cannot be split in the middle of that expression ✦ The principle holds true #5: ternary operator == Syntax Error
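
The "can the statement be completed here?" test used in cases #1 to #4 can be sketched with Ripper, the parse.y-based parser in the standard library. The helper name `complete_statement?` and the sample first lines are assumptions for illustration, since the slide code is not in the transcript:

```ruby
require "ripper"

# Hypothetical helper: a fragment counts as a "complete statement" when
# Ripper accepts it on its own. Ripper.sexp returns nil on a syntax error.
def complete_statement?(fragment)
  !Ripper.sexp(fragment).nil?
end

# Assumed first lines for cases #1 to #4.
FIRST_LINES = {
  "method_1"    => complete_statement?("method_1"),    # call without arguments
  "method_1 1," => complete_statement?("method_1 1,"), # another argument is expected
  "1 +"         => complete_statement?("1 +"),         # right operand is missing
  "-"           => complete_statement?("-"),           # operand is missing
}
```

Only the first fragment is complete, so only there does a following line break terminate the statement.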
  13. ✦ So far, the hypothesis seems to be correct ✦

    Hypothesis: In principle, a statement terminates if a line break is present at a point where a statement can be completed ✦ However, are there truly no counterexamples in every part of Ruby's grammar …? No counterexamples exist …?
  14. ✦ The relationships of Language, Grammar, Automaton and Parser ✦

    Investigate how line breaks are treated ✦ Verify the hypothesis ✦ Understand the principles of line breaks in Ruby grammar Today’s topics
  15. ✦ A (formal) language is a subset of words ✦

    Some words belong to Ruby language ✦ Others don’t What is Language?
  16. ✦ Even though these programs are Transcendental Ruby Imbroglio Contest

    (TRICK) entries, they belong to the Ruby language. Ruby https://github.com/tric/trick2022/blob/master/01-tompng/entry.rb https://github.com/tric/trick2022/blob/master/06-mame/entry.rb
  17. ✦ At a glance, this code looks like Ruby code; however,

    it doesn’t belong to the Ruby language Not Ruby
  18. ✦ (Ruby) language is an infinite set of

    words ✦ Grammar is a finite set of rules which define a language What is Grammar? Grammar Language …
  19. ✦ A finite automaton is a 5-tuple (Q, Σ, δ, q0, F) ✦ Q is a finite

    set of states ✦ Σ is a finite set of input symbols ✦ δ is a transition function, from state to state by input symbol ✦ q0 is an initial state ✦ F is a set of accepting states What is Automaton?
  20. ✦ Let's consider a vending machine that sells juice at

    ¥110 ✦ Put in a 100-yen coin and 10-yen coin ✦ Press the “purchase” button Vending machine
  21. ✦ In the case of a vending machine, ✦ There

    are 3 inputs: a 100-yen coin, a 10-yen coin and pressing the “purchase” button ✦ The initial state is that no coins are inserted ✦ The accepting state is that a 100-yen coin and a 10-yen coin have been inserted and then the “purchase” button has been pressed Vending machine as automaton q0 ¥100 q1 q2 ¥10 q3 q4 ¥100 ¥10 “purchase”
  22. ✦ Only limited inputs are accepted in each state ✦

    For example, on q1 ✦ ¥10 coin is accepted ✦ ¥100 coin is not accepted ✦ “purchase” button is not accepted Vending machine as automaton q0 ¥100 q1 q2 ¥10 q3 q4 ¥100 ¥10 “purchase”
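
The vending machine can be written down as a small deterministic finite automaton. This is a sketch only: the edge layout (q0 to q1 on ¥100, q0 to q2 on ¥10, both paths meeting in q3, then “purchase” to q4) is an assumption reconstructed from the diagram labels, and the names `TRANSITIONS` and `accepts?` are made up for illustration.

```ruby
# Transition function δ as a nested Hash: state -> input -> next state.
# The edge layout is an assumption reconstructed from the slide's labels.
TRANSITIONS = {
  q0: { yen100: :q1, yen10: :q2 },
  q1: { yen10: :q3 },
  q2: { yen100: :q3 },
  q3: { purchase: :q4 },
  q4: {},
}.freeze

INITIAL_STATE = :q0
ACCEPTING_STATES = [:q4].freeze

# Returns true when the input sequence ends in an accepting state.
# An input with no transition from the current state is rejected.
def accepts?(inputs)
  state = INITIAL_STATE
  inputs.each do |input|
    state = TRANSITIONS.fetch(state)[input]
    return false if state.nil?
  end
  ACCEPTING_STATES.include?(state)
end
```

For instance, `accepts?([:yen100, :yen10, :purchase])` is true, while `accepts?([:yen100, :yen100])` is false because q1 has no transition for a second ¥100 coin.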
  23. ✦ There are known methods to convert a Non-deterministic Finite

    Automaton (NFA) to a Deterministic Finite Automaton (DFA) ✦ It is known that the minimum DFA is unique ✦ It's possible to combine two automata into a single automaton Theory of Automaton
  24. ✦ If you attended Fujinami-san's talk this morning, you likely

    heard about research on automata in detail Theory of Automaton https://speakerdeck.com/makenowjust/make-parsers-compatible-using-automata-learning
  25. ✦ @.bookstore at RubyKaigi 2025 ✦ “計算理論の基礎 [原著第3版] 1.オートマトンと言語” ✦ “白と黒のとびら:

    オートマトンと形式言語をめぐる冒険” ✦ “正規表現技術入門――最新エンジン実装と理論的背景” Books for Automaton
  26. ✦ Is the vending machine related to a parser? ✦

    No, parser is not a vending machine, but parser is an automaton Parser as finite-state automaton
  27. ✦ Let’s think about a grammar which only includes a single

    class definition ✦ “class”, identifier (id) then “end” ✦ This can be represented as an automaton with four states, taking “class”, id and “end” as inputs Parser for very simple grammar class A end program : "class" id "end" P1 P2 P3 P4 class id end Grammar Language Automaton
  28. ✦ For example, on P1 ✦ “class” is accepted then

    goes to state P2 ✦ id is not accepted then syntax error ✦ “end” is not accepted then syntax error How the parser works program : "class" id "end" P1 P2 P3 P4 class id end Grammar Automaton
  29. ✦ For example, on P2 ✦ “class” is not accepted

    then syntax error ✦ id is accepted then goes to state P3 ✦ “end” is not accepted then syntax error How the parser works program : "class" id "end" P1 P2 P3 P4 class id end Grammar Automaton
  30. ✦ For example, on P4 ✦ “class” is not accepted

    then syntax error ✦ id is not accepted then syntax error ✦ “end” is not accepted then syntax error ✦ End of Input is accepted then the parsing process completed without errors How the parser works program : "class" id "end" P1 P2 P3 P4 class id end Grammar Automaton
  31. ✦ Let’s allow arbitrary levels of nesting for class

    definitions Parser for complex grammar class A class B … end end program : class_def class_def : "class" id body "end" body : class_def | /* empty */ Grammar Language Automaton B1 B2 C1 C2 C3 C5 P1 P2 class_def class_def C4 class id body end B1 /* empty */
  32. ✦ When parsing nested class definitions, a new

    automaton corresponding to the class definition is created ✦ Parsed until “class A”, so that the second automaton is on C3 Automaton with a stack class A class B end end Input C1 C2 C3 C5 P1 P2 class_def C4 class id body end
  33. ✦ To parse “class B end” as the body of

    class A, create new automatons Automaton with a stack class A class B end end Input B1 B2 C1 C2 C3 C5 P1 P2 class_def class_def C4 class id body end C1 C2 C3 C5 C4 class id body end
  34. ✦ After reading to “class B end”, the bottom automaton

    enters the accepting state Automaton with a stack class A class B end end Input B1 B2 C1 C2 C3 C5 P1 P2 class_def class_def C4 class id body end C1 C2 C3 C5 C4 class id body end
  35. ✦ When the automaton reaches the accepting state, it is

    popped from the stack, and the original automaton transitions to the next state (C4) Automaton with a stack class A class B end end Input C1 C2 C3 C5 P1 P2 class_def C4 class id body end
  36. ✦ Parser is an automaton that takes tokens such as

    “class”, id, and “end” as input ✦ A finite automaton with a stack is called a pushdown automaton ✦ By using a stack, a pushdown automaton can handle languages with infinite nesting ✦ An LR parser is implemented as a pushdown automaton LR Parser as pushdown automaton
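
The stack of automata from the preceding slides can be sketched as follows. This is a simplified illustration for the nested class grammar only, not how an LR parser generator such as Lrama builds its tables; the method name `parse_nested_classes` and the token symbols are assumptions.

```ruby
# Tokens are symbols: :class, :id, :end. Each stack entry is the current
# state (C1..C5 in the slides) of one class_def automaton.
def parse_nested_classes(tokens)
  stack = [:C1]
  tokens.each do |token|
    return false if stack.empty? # input continues after the program is complete

    case [stack.last, token]
    when [:C1, :class] then stack[-1] = :C2
    when [:C2, :id]    then stack[-1] = :C3
    when [:C3, :class]
      # body : class_def -> push a new automaton, which has now read "class"
      stack.push(:C2)
    when [:C3, :end], [:C4, :end]
      # On C3 the next token "end" selects body : /* empty */.
      # Reading "end" reaches the accepting state C5, so the automaton is popped
      # and the parent, which was waiting on C3, moves to C4 (its body is read).
      stack.pop
      stack[-1] = :C4 unless stack.empty?
    else
      return false # no transition for this token: syntax error
    end
  end
  stack.empty?
end
```

Because the grammar's `body` allows at most one nested `class_def`, two sibling classes inside one body are rejected by this sketch as well.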
  37. ✦ LR parsers have two main operations ✦ Shift: Moves

    the automaton to the next state ✦ Reduce: Pops the automaton that has reached the accepting state from the stack LR parser actions C1 C2 C3 C5 C4 C1 C2 C3 C5 C4 Shift Reduce
  38. ✦ How does LR parser choose the correct automaton when

    multiple automatons are applicable? ✦ “body : class_def” is correct for the left, “body : /* empty */” is correct for the right Choose correct automaton class A end B1 B2 C1 C2 C3 C5 class_def C4 class id body end C1 C2 C3 C5 C4 class id body end class A class B end end B1 /* empty */
  39. ✦ Determine the correct rule by looking at the next

    token ✦ In the right case, next token is “end” ✦ In the case of the empty string rule, “end” can be shifted after the automaton is popped ✦ The set of tokens that can follow a certain rule is called a lookahead set Lookahead set class A end C1 C2 C3 C5 C4 class id body end B1 /* empty */ Next token is “end” Match with new token Input
  40. ✦ Chomsky hierarchy ✦ Four formal grammar classes form a hierarchy

    ✦ There are correspondences between grammars and automatons Grammar class and automaton ✦ Regular: Finite-state automaton ✦ Context-free: Non-deterministic pushdown automaton ✦ Context-sensitive: Linear-bounded non-deterministic Turing machine ✦ Recursively enumerable: Turing machine
  41. ✦ Grammar determines the scope of the language ✦ Grammar

    can be converted to an automaton ✦ Parser is an automaton that takes tokens as input Language, Grammar, Automaton and Parser class A … end program : class_def class_def : "class" id body "end" … Grammar Language Automaton = Parser C1 C2 C3 C5 C4 class id body end Determine Convert
  42. ✦ A lexer is the component that divides input string

    into meaningful chunks, which are called tokens Parser and Lexer C1 C2 C3 C5 P1 P2 class_def C4 class id body end class A class B end end Input class A class B end end Tokens Parser Lexer
  43. ✦ It's important to understand that the lexer intelligently handles

    line breaks, sometimes ignoring them and sometimes not #1: Lexer ignores Line Breaks method_1 arg 1 + 2 Input method_1 ‘\n’ arg 1 + 2 Tokens Not ignored Ignored
  44. ✦ On the other hand, regarding the grammar, statements are

    separated by line breaks (‘\n’) by a rule Grammar for statements
  45. ✦ Lexer returns or ignores ‘\n’ based on lex state

    How lexer works Returns ‘\n’ token Ignores ‘\n’ Returns ‘\n’ token Checks lex state
  46. ✦ 13 lex state flags exist! ✦ EXPR_BEG: BEGinning

    of expression ✦ EXPR_END: END of expression ✦ EXPR_ENDARG: END of ARGument ✦ EXPR_ENDFN: END of Function NAME ✦ EXPR_ARG: ARGument ✦ EXPR_CMDARG: CoMmanD ARGument ✦ EXPR_MID: MIDdle of expression ✦ EXPR_FNAME: immediately after “def” keyword, might be Function NAME ✦ EXPR_DOT: immediately after DOT (dot includes ‘.’ ‘&.’ ‘::’) ✦ EXPR_CLASS: immediately after “class” keyword ✦ EXPR_LABEL: label is possible, label is `a:` ✦ EXPR_LABELED: immediately after label ✦ EXPR_FITEM: just before fitem. fitem is the token after undef or alias ✦ Only the typical meanings are written here; there are also exceptions They are lex state flags!
  47. ✦ EXPR_CLASS means immediate after “class” keyword ✦ The lexer

    ignores line breaks when EXPR_CLASS, so this code works without any issues EXPR_CLASS ==
  48. ✦ EXPR_DOT means immediate after DOT ✦ Dot includes ‘.’

    ‘&.’ ‘::’ ✦ The lexer ignores line breaks when EXPR_DOT, so this code works without any issues EXPR_DOT ==
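
As a small sketch of the EXPR_CLASS and EXPR_DOT cases, the following snippets put a line break right after `class` and right after `.`. They are assumed examples, since the slide code is not in the transcript.

```ruby
require "ripper"

# Assumed examples; the original slide code is not in the transcript.
class_source = "class\nFoo\nend\n"      # line break right after the `class` keyword
dot_source   = "\"ruby\".\nupcase\n"    # line break right after the dot

# Ripper.sexp returns nil on a syntax error, so non-nil means the code parses.
class_ok   = !Ripper.sexp(class_source).nil?
dot_result = eval(dot_source)           # same as "ruby".upcase
```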
  49. ✦ Ruby's grammar has the concepts of EXPR_BEG and EXPR_END

    ✦ In the '1 + 2' example, the EXPR_BEG and EXPR_END states are repeated like this EXPR_BEG and EXPR_END 1 + 2 EXPR_BEG EXPR_END EXPR_BEG EXPR_END
  50. ✦ Lexer ignores line breaks when EXPR_BEG ✦ Therefore, this

    code is equivalent to “1 + 2” EXPR_BEG 1 + 2 Input 1 + 2 EXPR_BEG EXPR_END EXPR_BEG EXPR_END ‘\n’ 1 + 2 ==
  51. ✦ Lexer emits line break tokens when EXPR_END ✦ Therefore,

    in this code, it's treated as two lines of code: “1”, and “+2”, rather than "1 + 2” EXPR_END 1 + 2 Input 1 ‘\n’ 2 EXPR_BEG EXPR_END EXPR_BEG EXPR_END 1 + 2 != + EXPR_BEG
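
The two cases can be inspected with `Ripper.lex`, which reports each token together with the lex state after it. In this sketch an ignored line break shows up as the `on_ignored_nl` event and an emitted one as `on_nl`; the helper name `lex_with_state` is made up for illustration.

```ruby
require "ripper"

# Returns [event, token, lex state name] triples produced by Ripper's lexer.
def lex_with_state(source)
  Ripper.lex(source).map { |_pos, event, token, state| [event, token, state.to_s] }
end

# Line break after `+` (EXPR_BEG): expected to be ignored.
after_operator = lex_with_state("1 +\n2\n")

# Line break after `1` (EXPR_END): expected to be emitted as a token.
after_integer = lex_with_state("1\n+ 2\n")

after_operator.each { |event, token, state| puts "#{event} #{token.inspect} #{state}" }
```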
  52. ✦ Lex states can be thought of as automata that

    take tokens as input ✦ While lexers are typically described as automata that take characters as input, Ruby's lexer is an automaton in a dual sense Lexer is an automaton Automaton for characters ‘|’ ‘|’ ‘=’ |= || | ||= Otherwise ‘=’ Otherwise Automaton for tokens BEG END tINTEGER +, \n +@
  53. ✦ What is this exception case? How lexer works again

    What is this? Returns ‘\n’ token Understand!
  54. ✦ ‘\n’ is needed after mandatory keyword arguments for method

    definition without parentheses ✦ Ref: [Bug #9669] Method definition w/o parentheses ‘\n’ is needed !=
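
A sketch of this case, assuming the slide showed something like a parenthesis-free `def` whose last parameter is a mandatory keyword argument. If the line break after `name:` were ignored, the next line would be read as the default value of `name` rather than as the method body.

```ruby
# Assumed example of the situation in [Bug #9669]; not the slide's original code.
source = <<~RUBY
  def greet name:
    "hello, " + name
  end
  greet(name: "ruby")
RUBY

# The line break after `name:` ends the parameter list,
# so the second line is the method body.
result = eval(source)
```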
  55. ✦ Even though the lexer intelligently ignores line breaks, there

    are situations where that doesn't work ✦ Lex state is EXPR_END after “arg”, so the lexer emits line breaks ✦ For example, “1 + 2” is “arg” ✦ In such cases, the line break token needs to be written in the grammar #2: Grammar allows Line Breaks ‘\n’ is needed in grammar rules EXPR_END EXPR_END
  56. ✦ In a 2006 mailing list discussion, they were talking

    about ignoring line breaks where expressions cannot end ✦ [ruby-dev: 29206] Another evidence https://public-inbox.org/ruby-dev/[email protected]/
  57. ✦ Personally, I feel it's too difficult. Why

    can we maintain this… ✦ Then I asked other committers what they think about lex state ✦ “This is impossible to understand, isn't it?” by usa ✦ “I'm keeping my distance because it's scary” by ko1 ✦ “It's hard to believe anyone could completely understand and work with all of lex state” by akr Monstrous Lex state
  58. ✦ lex_state changes so intricately based

    on the script's reading position that it seems beyond human comprehension. How could we describe this poetically or in a literary sense? Q&A session with Gemini
  59. ✦ “The hidden

    flow of Qi within the code” ✦ Reason: An unseen force that forms the semantic flow and structure of the code, directing its interpretation, with a touch of Eastern mysticism Lex state is Qi
  60. ✦ These are all nice, but

    they describe lex_state quite favorably. I perceive it as a form of malevolent black magic. So could you consider it from that perspective?
  61. ✦ “An indescribable

    state transition that erodes the developer's sanity” ✦ Nuance: A madness that transcends human comprehension, a forbidden insanity that drains the mind upon close inspection. (With a Cthulhu Mythos vibe.) An indescribable state transition
  62. ✦ “A doorway

    to chaos lurking in the parser's abyss” ✦ Nuance: The unfathomable eeriness of its intricate mechanism, which leads to the deepest, incomprehensible realms of language interpretation (abyss and chaos). A doorway to chaos
  63. ✦ The independence of the parser and lexer ✦ Even

    though we've seen so many examples of the parser and lexer cooperating What makes it so difficult?
  64. ✦ It needs to set the state immediately before the

    binary plus operator to be EXPR_END ✦ Grammar rule for the binary plus operator expression is “arg : arg + arg” ✦ Therefore, lex state should be EXPR_END when it comes to the end of arg Where should be EXPR_END? 1 + 2 EXPR_BEG EXPR_END EXPR_BEG EXPR_END arg + arg EXPR_BEG EXPR_END EXPR_BEG EXPR_END
  65. ✦ That means that it's only necessary to transition to

    EXPR_END for all the tokens that come at the end of the 'arg' rule ✦ In relation to the First set, could we refer to them as the Last set? ✦ These tokens can be the last token of arg ✦ tINTEGER ✦ tSTRING_END ✦ “end” Last set of rules
  66. ✦ To determine the last set, expand the rule until

    its final symbol is a token ✦ “arg” includes “primary” ✦ “primary” can be expanded to tINTEGER and so on ✦ “primary” ends with ‘]’, “end” and so on ✦ Need to set the state to EXPR_END immediately after all the tokens obtained in this way How to know Last set arg : arg ‘+’ arg | arg '-' arg | … | primary primary : literal | tLBRACK aref_args ‘]' | k_if … k_end | … simple_numeric : tINTEGER | tFLOAT | tRATIONAL | tIMAGINARY literal : numeric | symbol
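
The expansion described above can be sketched as a small recursive function. The grammar below is a heavily simplified fragment of the one on the slide (for instance, `literal` goes straight to `simple_numeric`, and no rule is empty), and the names `GRAMMAR` and `last_set` are assumptions for illustration.

```ruby
# Grammar fragment as a Hash: nonterminal -> list of right-hand sides.
# Any symbol that is not a key is treated as a token.
# Simplified from the slide; empty rules are assumed not to occur.
GRAMMAR = {
  arg:            [[:arg, "+", :arg], [:arg, "-", :arg], [:primary]],
  primary:        [[:literal], [:tLBRACK, :aref_args, "]"], [:k_if, :arg, :k_end]],
  literal:        [[:simple_numeric]],
  simple_numeric: [[:tINTEGER], [:tFLOAT]],
}.freeze

# Collects the tokens that can appear as the final token of `symbol`
# by repeatedly expanding the last symbol of each right-hand side.
def last_set(symbol, grammar = GRAMMAR, visiting = [])
  return [symbol] unless grammar.key?(symbol) # a token is its own last token
  return [] if visiting.include?(symbol)      # recursive rule: already being expanded

  grammar[symbol]
    .flat_map { |rhs| last_set(rhs.last, grammar, visiting + [symbol]) }
    .uniq
end
```

With this fragment, `last_set(:arg)` yields `:tINTEGER`, `:tFLOAT`, `"]"` and `:k_end`, which are the tokens after which the state would need to become EXPR_END.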
  67. ✦ Now, let's focus on the tokens themselves ✦ For

    example, ‘^’ (caret) ✦ This is a binary operator used to calculate the exclusive OR for integers ✦ As it's a binary operator, similar to ‘+’, it should transition to EXPR_BEG immediately after ‘^’ The context of the token
  68. ✦ The caret is now used not only as a

    binary operator but also as a unary operator, specifically as a pin operator in pattern matching ✦ For now, even in the unary operator case, simply transitioning to EXPR_BEG works well ✦ However, if we change the lex state transition for the caret in the future, we'll need to consider both cases ✦ Lex state management is also difficult because a single token can be used in grammatically very different places ‘^’ is not always binary operator
  69. ✦ Another reason why lex state is so difficult

    is that a state is used for various purposes ✦ For example, EXPR_BEG is not only used to determine whether to ignore or emit a line break but also whether to treat two bars as a set or as separate bars Lex state overloading This is “||” (!EXPR_BEG) This is ‘|’ and ‘|’ (EXPR_BEG)
  70. ✦ Fixing [Bug #10653] caused [Bug #11456] and [Bug #11849]

    ✦ “All bug fixes are incompatibilities” in RubyKaigi 2019 Fixing a bug caused other bugs
  71. ✦ Ruby's parser handles line breaks by having the lexer

    decide whether to ignore or emit it, depending on the context ✦ This context-dependent behavior is managed by something called lex state ✦ It’s clear that lex state is incredibly hard to manage, and it can be described as “a doorway to chaos” A doorway to chaos lurking in the parser's abyss
  72. Day 0: Interview for nobu Nakada-san, what do you think

    about Lex state? Well, that's a necessary evil, isn't it?
  73. Day 0: Interview for nobu Could you tell me how

    much you know about Lex state behavior? Hmm, I'm at about 50% understanding at the moment.
  74. Day 0: Interview for nobu What's your usual method when

    you need to modify things related to Lex state? I check all the areas that might be affected, but the last part is just a feeling.
  75. ✦ Can subtraction be written in every place where addition

    can be written in Ruby? ✦ Yes, it's clear from the grammar ✦ Not only subtraction but also all binary operators can be checked by the grammar Grammar as Order arg : arg '+' arg | arg '-' arg | arg '*' arg | arg '/' arg | arg '^' arg | arg tCMP arg ... Grammar
  76. ✦ Is it possible to write the same element for

    both the default value of a formal argument (“expr1”) and the actual argument (“expr2”)? ✦ Yes, it's also clear from the grammar that both of them allow “arg_value” Grammar as Order f_args : f_arg ',' f_optarg(arg_value) ',' … call_args : args opt_block_arg args : arg_value Grammar
  77. ✦ What are the problems if lex state is chaotic?

    ✦ Today's theme is the fundamental principles of line breaks in Ruby's grammar ✦ However, some parts of Ruby's grammar are controlled by lex state. This means it's difficult to find fundamental principles from chaos Problems with being chaotic program : class_def class_def : "class" id body "end" … Grammar Automaton = Parser Lex state “A doorway to chaos” Order Chaos Chaos
  78. ✦ Defeat chaos by modeling lex state ✦ Modeling is

    simplifying an object by focusing on its important properties to make it easier to comprehensively understand the structure of the object What’s modeling and why
  79. ✦ Parser and lex states are automatons that take tokens

    as input ✦ Parser is an automaton that takes tokens such as “class”, id, and “end” as input ✦ Lex states can be thought of as automata that take tokens as input Parser and Lex State Parser Lex state BEG END tINTEGER +, \n +@ A1 A2 A3 primary + A4 primary P1 P2 tINTEGER
  80. ✦ It's possible to combine two automata into a single

    automaton ✦ Therefore, it's possible to build a new automaton from these two ✦ For example, the state after reading '+' is the A3 state in the parser, and the lex state is EXPR_BEG Combine automatons Parser BEG END tINTEGER +, \n +@ A1 A2 A3 primary + A4 primary BEG END BEG END Lex state
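
Combining the two automata can be sketched with the standard product construction, where a combined state is a pair of a parser state and a lex state. The tables below are simplified assumptions based on the diagram labels (`primary` is stood in for by the token tINTEGER, and the `+@` edge is omitted); they are not taken from Lrama.

```ruby
# Simplified parser automaton: A1 -tINTEGER-> A2 -plus-> A3 -tINTEGER-> A4.
PARSER_TRANSITIONS = {
  A1: { tINTEGER: :A2 },
  A2: { plus: :A3 },
  A3: { tINTEGER: :A4 },
  A4: {},
}.freeze

# Simplified lex state automaton over the same tokens.
LEX_TRANSITIONS = {
  BEG: { tINTEGER: :END },
  END: { plus: :BEG, newline: :BEG },
}.freeze

# Product construction: a combined state is a [parser state, lex state] pair,
# and a transition exists only when both automata can move on the token.
def product(parser, lexer)
  parser.keys.product(lexer.keys).to_h do |p_state, l_state|
    tokens = parser[p_state].keys & lexer[l_state].keys
    moves = tokens.to_h { |tok| [tok, [parser[p_state][tok], lexer[l_state][tok]]] }
    [[p_state, l_state], moves]
  end
end

COMBINED = product(PARSER_TRANSITIONS, LEX_TRANSITIONS)

# Feeds tokens to the combined automaton; returns nil when a token has no transition.
def run_automaton(combined, start, tokens)
  tokens.reduce(start) { |state, tok| state && combined[state][tok] }
end

state_after_plus = run_automaton(COMBINED, [:A1, :BEG], [:tINTEGER, :plus])
```

Here `state_after_plus` is `[:A3, :BEG]`, matching the statement that the parser is in A3 and the lex state is EXPR_BEG after reading '+'.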
  81. ✦ Extend Lrama’s grammar for describing lexer state Describe lex

    state transition on grammar file Type of states Initial state Aliases Transitions
  82. ✦ To extract the lex state transitions from an existing

    implementation, simply focus on the tokens the lexer returns How to extract lex state transition EXPR_ARG if IS_AFTER_OPERATOR EXPR_BEG if ! IS_AFTER_OPERATOR
  83. ✦ As the lex state is updated within certain grammar

    rules, introduce new syntax to the grammar that allows specifying lex state ✦ “%ls” stands for Lexer State Transition in grammar Transitions
  84. ✦ Run the lrama command once the lexer state is written ✦

    Lex state transitions for each token are shown ✦ For example, “end” token has two transitions ✦ If “end” is a method name, it transitions to EXPR_ENDFN ✦ Otherwise it transitions to EXPR_END Lex state for tokens EXPR_ENDFN EXPR_END
  85. ✦ Lex state transitions for each grammar rule are shown

    ✦ Method definitions always result in the EXPR_END state ✦ Even though the lex state after the “end” token can be EXPR_ENDFN or EXPR_END Lex state for rules EXPR_END
  86. ✦ By combining lex state transitions for each token and

    each rule, the lex state transitions for each parser state become clear ✦ For example, the parser immediately after reading the binary operator ‘+’ is always in the EXPR_BEG state Lex state for parser state EXPR_END
  87. ✦ To test the hypothesis about line breaks, let's look

    for counterexamples ✦ Hypothesis: In principle, a statement terminates if a line break is present at a point where a statement can be completed Verify the hypothesis https://x.com/tanaka_akr/status/1870679443376947467
  88. ✦ Grammar allows ‘\n’ for reduce action but lexer ignores

    ‘\n’ ✦ The lexer state can be EXPR_BEG ✦ Then ‘\n’ is ignored ✦ The grammar state’s lookahead set includes ‘\n’ token #1. Unexpectedly ignores ‘\n’
  89. ✦ After the dots of an endless range, it becomes

    EXPR_BEG, and the following line break is ignored ✦ Therefore the code below is not an endless range and “b” but a normal range Endless range == != Based on the principles
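
A sketch of this counterexample, using integer literals in place of the slide's code (which is not in the transcript). Ruby may print a warning about `...` or `..` at the end of a line here.

```ruby
# Assumed example: the line break after `..` is ignored,
# so the next line becomes the end of the range.
source = "r = 1..\n5\nr"
range = eval(source)
```

The value of `range` is `1..5`, not the endless range `1..` followed by a separate statement `5`.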
  90. ✦ After the “*” of arguments, it becomes EXPR_BEG, and

    the following line break is ignored ✦ Therefore the code below is not anonymous arguments ✦ “**” and “&” behave the same way Anonymous arguments == != Based on the principles
  91. ✦ Since both endless ranges and anonymous arguments were added

    later, the current behavior of ignoring line breaks is reasonable from compatibility perspective Intentional or not?
  92. ✦ Grammar allows ‘\n’ for shift action but lexer ignores

    ‘\n’ ✦ The lexer state can be EXPR_BEG ✦ Then ‘\n’ is ignored ✦ The grammar state shifts ‘\n’ token #2. Unexpectedly ignores ‘\n’
  93. ✦ After the ‘(’ of ‘not’, it becomes EXPR_BEG |

    EXPR_LABEL, and the following line break is ignored ✦ Because this ‘\n’ is optional, it doesn't cause any problems if the lexer ignores it Line Breaks before ‘)’ '\n'? ')' ==
  94. ✦ Grammar doesn’t allow ‘\n’ but lexer emits ‘\n’ ✦

    The state can’t be EXPR_BEG ✦ Then ‘\n’ is emitted ✦ The state’s lookahead set doesn’t include ‘\n’ token nor does the state shift ‘\n’ token #3. Unexpectedly emits ‘\n’
  95. ✦ A syntax error occurs specifically when a

    line break is placed between global variables within the alias definition ✦ ‘\n’ is ignored when EXPR_FNAME but emitted when EXPR_END Line Breaks in “alias” Syntax Error unexpected '\n' Ignore ‘\n’ Emit ‘\n’
  96. ✦ When it's not a global variable, lex state is

    explicitly set to EXPR_FNAME | EXPR_FITEM then ‘\n’ is ignored ✦ I think this is not intentional Line Breaks in “alias”
  97. ✦ A syntax error occurs when a line break is

    placed after “BEGIN” or “END” ✦ Is this intentional? BEGIN and END Syntax Error unexpected '\n' Syntax Error unexpected '\n'
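
The `alias` and `BEGIN` cases can be checked against the parse.y-based parser through Ripper. The snippets are assumed examples (the slide code is not in the transcript) and the helper name `parses?` is made up for illustration.

```ruby
require "ripper"

# Hypothetical helper: Ripper.sexp returns nil when the source has a syntax error.
def parses?(source)
  !Ripper.sexp(source).nil?
end

LINE_BREAK_CASES = {
  method_alias: parses?("alias new_name\nold_name"),   # '\n' ignored between method names
  gvar_alias:   parses?("alias $new_name\n$old_name"), # '\n' emitted after a global variable
  begin_inline: parses?("BEGIN { }"),
  begin_broken: parses?("BEGIN\n{ }"),                 # '\n' emitted after BEGIN
}
```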
  98. ✦ Hypothesis: In principle, a statement terminates if a line

    break is present at a point where a statement can be completed ✦ Exception: A statement doesn’t terminate for endless range and anonymous arguments (#1) ✦ Hypothesis: Ignoring line breaks where expressions cannot end ✦ Exception: A line break is emitted for global variable alias, BEGIN and END (#3) Verify the hypothesis
  99. ✦ It's difficult to understand the fundamental principles

    of grammar with the chaos of lex state ✦ The order of grammar is necessary to face chaos ✦ Use the fact that both the parser and lex state are automata that take tokens as input, and that we can create a new automaton by combining two automata ✦ To validate the principles regarding line breaks in Ruby grammar, we searched for exceptions and found some ✦ However, for the most part, the hypothesis seems to be correct Chaos and Order
  100. ✦ I don't want to use the new features of

    Lrama to manage lex state ✦ I want to remove lex state completely ✦ In principle, a statement terminates if a line break is present at a point where a statement can be completed ✦ Therefore, the parser simply needs to send instructions to the lexer like 'ignore line breaks' or ‘return line break as token' depending on the current parser state Next step arg + arg Ignore ‘\n’ Emit ‘\n’ Ignore ‘\n’ Lexer Emit ‘\n’ Parser
  101. ✦ Furthermore, it needs to be able to handle the

    exceptions we've identified this time by writing instructions in the grammar ✦ Why describe it in the grammar? Because it's the grammar that decides the language and the parser Next step arg : arg tDOT2 arg | arg tDOT2 %ignore-token('\n') arg + Ignore ‘\n’ Emit ‘\n’ Ignore ‘\n’ Lexer Parser
  102. ✦ Look at line breaks in Ruby's grammar, a character

    that's surprisingly interesting when you consider it ✦ To understand how line breaks are treated in current Ruby, it is necessary to understand the behavior of lex state, which is a doorway to chaos ✦ Bringing lex state into the ordered systems of grammar and automata makes it possible to understand lex state behavior Ruby's Line Breaks
  103. ✦ In principle, a statement terminates if a line break

    is present at a point where a statement can be completed ✦ Exception ✦ A statement doesn’t terminate for endless range and anonymous arguments ✦ A line break is emitted for global variable alias, BEGIN and END Principle and exceptions
  104. ✦ akr, nurse and other committers ✦ ESM, Inc. Parser

    Club ✦ Dragon Book study group ✦ LR parser gangs ✦ Contributors and supporters for Lrama and parse.y ✦ sakahukamaki Acknowledgements