Lock in $30 Savings on PRO—Offer Ends Soon! ⏳

RubyConf Taiwan / Understanding Parser Generato...

RubyConf Taiwan / Understanding Parser Generators surrounding Ruby with Contributing Lrama

Junichi Kobayashi

December 14, 2023
Tweet

More Decks by Junichi Kobayashi

Other Decks in Programming

Transcript

  1. Understanding Parser Generators surrounding Ruby with Contributing Lrama Junichi Kobayashi

    (@junk0612) ESM, Inc. RubyConf Taiwan 2023 National Taipei University of Education 2023/12/15(Fri.)
  2. Junichi Kobayashi (@junk0612) • Working as Rails engineer at ESM,

    Inc. ◦ Agile division and RubyxAgile group ◦ A member of Parser Club • Hobbies ◦ Parsers ◦ Games (Rhythm games / Sim games / Tabletop games) ◦ Speed cubes
  3. ESM, Inc. • An IT development company in Japan •

    The sponsor of this presentation • "ESM" is the initials of "Eiwa System Management" ◦ Japanese: "永和システムマネジメント"
  4. Today's topics • Basic Knowledge of Parsing • My Contributions

    to Lrama • Understanding Internal Structure of Lrama through Implementation • Future Endeavors
  5. Basic Knowledge of Parsing • Components of Parsing ◦ Lexer

    ◦ Parser ◦ Parser Generator • Terms of Programming Language Processor ◦ Formal Language ◦ Context Free Grammar ◦ Backus-Naur Form
  6. Components of Parsing • Lexer ◦ Program that splits text

    into tokens (Tokenization) • Parser ◦ Program that constructs a structure from token stream ▪ Compilers: source code -> Abstract Syntax Tree (AST) ▪ JSON or CSV parsers: text -> some data structure • Parser Generator ◦ Program that generates a parser from grammar files
  7. CRuby Environment .rb file Ruby VM Components of Parsing Lexer

    Parser Generator Token Stream AST Byte codes
  8. CRuby Environment .rb file Ruby VM Components of Parsing Lexer

    Parser Generator Token Stream AST Byte codes Parser Generator grammar
  9. Terms of PL Processor • Formal Language ◦ The field

    of linguistics that deals with "Language" in a mathematical and set-theoretical way ◦ Considers how a language is represented as text ▪ Does not consider the semantics ▪ e.g., English is represented as sequences of alphabets, interspersed with symbols and spaces ◦ Composed of Symbols and Grammar
  10. Terms of PL Processor • Context Free Grammar (CFG) ◦

    A kind of Formal Language that is represented as follows: ▪ rule: A B C ... | D E F ... ▪ This notation is called Backus-Naur Form (BNF) ◦ Almost Programming Languages belong to this category ◦ Used in the grammar file which is the input of Parser Generator
  11. Terms of PL Processor • Context Free Grammar (CFG) ◦

    Nonterminal Symbol ▪ A symbol that can be replaced by other symbols ◦ Terminal Symbol ▪ A symbol that appears in input text
  12. CFG and BNF number: digit digit digit: '0' | '1'

    | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
  13. CFG and BNF number: digit digit digit: '0' | '1'

    | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' Language for 2-digit integers (00 to 99)
  14. CFG and BNF expression: digit '+' digit | digit '-'

    digit | digit '*' digit | digit '/' digit digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
  15. expression: digit '+' digit | digit '-' digit | digit

    '*' digit | digit '/' digit digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' Language for 2-term arithmetic expressions with 1-digit integers CFG and BNF
  16. expression: digit | expression '+' digit | expression '-' digit

    | expression '*' digit | expression '/' digit | '(' expression ')' digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' CFG and BNF
  17. expression: digit | expression '+' digit | expression '-' digit

    | expression '*' digit | expression '/' digit | '(' expression ')' digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' e.g., expression CFG and BNF
  18. expression: digit | expression '+' digit | expression '-' digit

    | expression '*' digit | expression '/' digit | '(' expression ')' digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' CFG and BNF e.g., expression / digit
  19. expression: digit | expression '+' digit | expression '-' digit

    | expression '*' digit | expression '/' digit | '(' expression ')' digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' CFG and BNF e.g., ( expression ) / digit
  20. expression: digit | expression '+' digit | expression '-' digit

    | expression '*' digit | expression '/' digit | '(' expression ')' digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' CFG and BNF e.g., ( expression + digit ) / digit
  21. expression: digit | expression '+' digit | expression '-' digit

    | expression '*' digit | expression '/' digit | '(' expression ')' digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' CFG and BNF e.g., ( digit + digit ) / digit
  22. expression: digit | expression '+' digit | expression '-' digit

    | expression '*' digit | expression '/' digit | '(' expression ')' digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' CFG and BNF e.g., ( 2 + digit ) / digit
  23. expression: digit | expression '+' digit | expression '-' digit

    | expression '*' digit | expression '/' digit | '(' expression ')' digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' CFG and BNF e.g., ( 2 + 7 ) / digit
  24. expression: digit | expression '+' digit | expression '-' digit

    | expression '*' digit | expression '/' digit | '(' expression ')' digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' CFG and BNF e.g., ( 2 + 7 ) / 3
  25. expression: digit | expression '+' digit | expression '-' digit

    | expression '*' digit | expression '/' digit | '(' expression ')' digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' Language for arithmetic expressions of 1 digit integers with parentheses CFG and BNF
  26. Operation of Parser expression: digit | expression '+' digit |

    expression '-' digit | expression '*' digit | expression '/' digit | '(' expression ')' digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' e.g., ( 2 + 7 ) / 3
  27. expression: digit | expression '+' digit | expression '-' digit

    | expression '*' digit | expression '/' digit | '(' expression ')' digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' Operation of Parser e.g., ( digit + 7 ) / 3
  28. expression: digit | expression '+' digit | expression '-' digit

    | expression '*' digit | expression '/' digit | '(' expression ')' digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' Operation of Parser e.g., ( digit + digit ) / 3
  29. expression: digit | expression '+' digit | expression '-' digit

    | expression '*' digit | expression '/' digit | '(' expression ')' digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' Operation of Parser e.g., ( digit + digit ) / digit
  30. expression: digit | expression '+' digit | expression '-' digit

    | expression '*' digit | expression '/' digit | '(' expression ')' digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' Operation of Parser e.g., ( expression + digit ) / digit
  31. expression: digit | expression '+' digit | expression '-' digit

    | expression '*' digit | expression '/' digit | '(' expression ')' digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' Operation of Parser e.g., ( expression ) / digit
  32. expression: digit | expression '+' digit | expression '-' digit

    | expression '*' digit | expression '/' digit | '(' expression ')' digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' Operation of Parser e.g., expression / digit
  33. expression: digit | expression '+' digit | expression '-' digit

    | expression '*' digit | expression '/' digit | '(' expression ')' digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' Operation of Parser e.g., expression
  34. expression: digit | expression '+' digit | expression '-' digit

    | expression '*' digit | expression '/' digit | '(' expression ')' digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' Accepted! Operation of Parser
  35. Summary • The Lexer and Parser analyze the program and

    create its data structures when source code is executed or compiled • A program which generates parser from grammar file is called Parser Generator • The input grammar file of Parser Generator is written in CFG • BNF is one of the representation of CFG
  36. Lrama • A Parser Generator built with Ruby as a

    replacement for Bison ◦ https://github.com/ruby/lrama • Presented in RubyKaigi 2023 by Yuichiro Kaneko ◦ https://youtu.be/IhfDsLx784g?si=kO1q6mLpTa1bIRYL • Use in CRuby 3.3 build process ◦ You can try now by building HEAD of Ruby ◦ Ruby's behavior is NOT changed
  37. Benefits of Using Lrama • No longer dependent on Bison's

    version ◦ Since Bison versions vary among users, it's necessary to assume that older versions may be installed ◦ Cannot be used even if new features are introduced • Allows for the implementation of Ruby-specific features ◦ Parsing unfinished code for LSP ◦ Making the complex parse.y more readable
  38. Ruby-Specific Features • Parameterizing Rules ◦ https://github.com/ruby/lrama/pull/181 ◦ Specific pattern

    rules can be created by attaching a symbol after symbols ▪ sym*: represents a list of 0+ syms ▪ sym+: represents a list of 1+ syms ▪ sym?: represents sym is appeared or not
  39. Other parser tools surrounding Ruby • Prism ◦ Hand-written parser

    for CRuby ◦ Built as a replacement for CRuby parser generated by Bison ◦ Compatible with both JRuby and TruffleRuby
  40. Other parser tools surrounding Ruby • Bison ◦ A next-generation

    parser generator developed by GNU, following Yacc ◦ Used for generating CRuby parser from parse.y • Racc ◦ A parser generator developed by Minero Aoki ◦ Used in Parser gem (RuboCop dependency) and others
  41. Can Bison be replaced with Racc? • It's impractical because

    while the generation algorithms are the same, there are few parts that can be commonly used, and it's less costly to create new ones ◦ Input file grammar ▪ Bison: Yacc-like / Racc: Original ◦ Generated parser's language ▪ Bison: C / Racc: Ruby
  42. Named References • A feature of Bison • Symbol names

    can be used as References in Action
  43. 🤔 Named References • A feature of Bison • Nonterminal

    symbol can be used as References in Action
  44. %{ Prologue (~ 1500 Lines) %} Bison declarations (~ 200

    Lines) %% Grammar rules (~ 4500 Lines) %% Epilogue (~ 8300 Lines) Structure of Bison Grammar File Lines in () indicate CRuby's parse.y
  45. %{ Prologue (~ 1500 Lines) %} Bison declarations (~ 200

    Lines) %% Grammar rules (~ 4500 Lines) ← Today's topic %% Epilogue (~ 8300 Lines) Structure of Bison Grammar File
  46. Structure of Grammar rules rule_name: rule rule ... rule {

    action } | rule rule ... rule { action } expression: NUMBER '+' expression { $$ = $1 + $3 } | NUMBER '-' expression { $$ = $1 - $3 } | '(' expression ')' { $$ = $2 }
  47. rule_name: rule rule ... rule { action } | rule

    rule ... rule { action } expression: NUMBER '+' expression { $$ = $1 + $3 } ← | NUMBER '-' expression { $$ = $1 - $3 } ← Today's topic | '(' expression ')' { $$ = $2 } ← Structure of Grammar rules
  48. Action • A parser generated by Bison, if left unmodified,

    only tells you whether the input adheres to the grammar or not ◦ It does not create an AST, nor does it save any information necessary for subsequent processing • You can write programs in {} following each grammar rule • Can use $n or @n, as the values of symbols in grammar ◦ This feature is known as (Numbered) References
  49. An Example of Actions expression: NUMBER '+' expression { $$

    = $1 + $3 } The return value when this grammar is accepted is the sum of the 1st and 3rd elements
  50. expression: NUMBER '+' expression { $$ = $1 + $3

    } An Example of Actions The return value when this grammar is accepted is the sum of the 1st and 3rd elements
  51. expression: NUMBER '+' expression { $$ = $1 + $3

    } An Example of Actions The return value when this grammar is accepted is the sum of the 1st and 3rd elements
  52. expression: NUMBER '+' expression { $$ = $1 + $3

    } An Example of Actions The return value when this grammar is accepted is the sum of the 1st and 3rd elements
  53. Named References • Issues in Numbered References ◦ Hard to

    understand due to lack of declarativeness ◦ If the grammar changes, it must be rewritten since it specifies by position number • Named References was developed to resolve these issues, enabling the use of values through referencing nonterminal symbol names
  54. Named References expression: NUMBER '+' expression { $$ = $1

    + $3 } | NUMBER '-' expression { $$ = $1 - $3 } | '(' expression ')' { $$ = $2 } expression[result]: NUMBER '+' expression[rest] { $result = $NUMBER + $rest } | NUMBER '-' expression[rest] { $result = $NUMBER - $rest } | '(' expression[inside-exp] ')' { $result = $[inside-exp] }
  55. expression: NUMBER '+' expression { $$ = $1 + $3

    } | NUMBER '-' expression { $$ = $1 - $3 } | '(' expression ')' { $$ = $2 } expression[result]: NUMBER '+' expression[rest] { $result = $NUMBER + $rest } | NUMBER '-' expression[rest] { $result = $NUMBER - $rest } | '(' expression[inside-exp] ')' { $result = $[inside-exp] } Values can be referenced by rule names prefixed with $ Named References
  56. expression: NUMBER '+' expression { $$ = $1 + $3

    } | NUMBER '-' expression { $$ = $1 - $3 } | '(' expression ')' { $$ = $2 } expression[result]: NUMBER '+' expression[rest] { $result = $NUMBER + $rest } | NUMBER '-' expression[rest] { $result = $NUMBER - $rest } | '(' expression[inside-exp] ')' { $result = $[inside-exp] } Enclosing with [] in rule descriptions allows for assigning aliases Named References
  57. expression: NUMBER '+' expression { $$ = $1 + $3

    } | NUMBER '-' expression { $$ = $1 - $3 } | '(' expression ')' { $$ = $2 } expression[result]: NUMBER '+' expression[rest] { $result = $NUMBER + $rest } | NUMBER '-' expression[rest] { $result = $NUMBER - $rest } | '(' expression[inside-exp] ')' { $result = $[inside-exp] } If rule names or aliases contain symbols, enclosing them in [] on the calling side is fine Named References
  58. Named References expression: NUMBER '+' expression { $$ = $1

    + $3 } | NUMBER '-' expression { $$ = $1 - $3 } | '(' expression ')' { $$ = $2 } expression[result]: NUMBER '+' expression[rest] { $result = $NUMBER + $rest } | NUMBER '-' expression[rest] { $result = $NUMBER - $rest } | '(' expression[inside-exp] ')' { $result = $[inside-exp] }
  59. Summary • Implement Named References to Lrama ◦ Lrama is

    a parser generator built with Ruby as a replacement for Bison ◦ Named References is a feature of Bison, allowing the use of symbol names as References within Actions
  60. Implementation Approach • Numbered References was already implemented ◦ If

    the symbol names can be associated with calls within Actions, the part for generating code from the association already exists ◦ Decided to associate symbol names with calls, taking inspiration from the implementation of Numbered References
  61. Reason of Editing Lexer • Lexer knows the parsing context

    ◦ Information about what is currently being parsed • When loading Actions, it checks the location of the rule just read to determine 'which symbols are being referenced', completing the association process • Transferring this entire process to the parser is extremely challenging and does not align with the original design
  62. • Parsing context is required when tokenizing ◦ Ignore comments

    ◦ Does not raise ParseError when the newline is included inside of HereDoc ◦ etc. • Who should know it? (parser or lexer) Who knows parsing context
  63. Who knows parsing context • If the lexer knows it

    ◦ Possible to tokenize all the input in one go by changing its own state, allowing the process to be divided into phases • If the parser knows it ◦ Since the parser will receive the next token from the lexer based on its state, the lexer can focus on tokenization
  64. Future Work • Implement Bison's feature ◦ Generating IELR parser

    • Todo list for Lrama written by Yuichiro Kaneko ◦ https://docs.google.com/document/d/1EAZzYMXBOdzK-6 mMIj2YNJxZZRVcpJxE7-4zXbHn8JA/edit?usp=sharing ◦ (Japanese only)
  65. The world is now in the great age of parsers.

    People are setting sail into the vast sea of parsers. ――RubyKaigi 2023 LT Yuichiro Kaneko