Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Let's Write a Parser

Let's Write a Parser

Ionuț G. Stan

May 19, 2016
Tweet

More Decks by Ionuț G. Stan

Other Decks in Programming

Transcript

  1. • Software Developer at Eloquentix • I work mostly with

    Scala • I like FP, programming languages, compilers About Me
  2. • Software Developer at Eloquentix • I work mostly with

    Scala • I like FP, programming languages, compilers • I started the Bucharest FP meet-up group About Me
  3. • Software Developer at Eloquentix • I work mostly with

    Scala • I like FP, programming languages, compilers • I started the Bucharest FP meet-up group • I occasionally blog on igstan.ro About Me
  4. 1. Integers: 1, 23, 456, etc. 2. Identifiers (only letters):

    inc, cond, a, etc. Vehicle Language: µML
  5. 1. Integers: 1, 23, 456, etc. 2. Identifiers (only letters):

    inc, cond, a, etc. 3. Booleans: true and false Vehicle Language: µML
  6. 1. Integers: 1, 23, 456, etc. 2. Identifiers (only letters):

    inc, cond, a, etc. 3. Booleans: true and false 4. Single-argument anonymous functions: fn a => a Vehicle Language: µML
  7. 1. Integers: 1, 23, 456, etc. 2. Identifiers (only letters):

    inc, cond, a, etc. 3. Booleans: true and false 4. Single-argument anonymous functions: fn a => a 5. Function application: inc 42 Vehicle Language: µML
  8. 1. Integers: 1, 23, 456, etc. 2. Identifiers (only letters):

    inc, cond, a, etc. 3. Booleans: true and false 4. Single-argument anonymous functions: fn a => a 5. Function application: inc 42 6. If expressions: if cond then t else f Vehicle Language: µML
  9. 1. Integers: 1, 23, 456, etc. 2. Identifiers (only letters):

    inc, cond, a, etc. 3. Booleans: true and false 4. Single-argument anonymous functions: fn a => a 5. Function application: inc 42 6. If expressions: if cond then t else f 7. Addition and subtraction: a + b, a - b Vehicle Language: µML
  10. 1. Integers: 1, 23, 456, etc. 2. Identifiers (only letters):

    inc, cond, a, etc. 3. Booleans: true and false 4. Single-argument anonymous functions: fn a => a 5. Function application: inc 42 6. If expressions: if cond then t else f 7. Addition and subtraction: a + b, a - b 8. Parenthesized expressions: (a + b) Vehicle Language: µML
  11. Abstract Syntax Tree T Compiler ) 2 Parser APP FUN

    a VAR a INT 2 Abstract Syntax Tree (AST) uage (funct
  12. Code Generation T Compiler ) 2 Parser CodeGen APP FUN

    a VAR a INT 2 Abstract Syntax Tree (AST) uage (funct
  13. Last Year's Talk T Compiler ) 2 Parser CodeGen Type

    Checker AST Typed AST Last Year ... uage (funct
  14. Today's Talk T Compiler ) 2 Parser CodeGen Type Checker

    AST Typed AST Today ... uage (funct
  15. Lexing Compiler ) 2 Parser Lexer Tokens ( fn a

    => a ) 2 Parser APP FUN a VAR a INT 2 AST uage
  16. Parsing Compiler ) 2 Parser Lexer Tokens ( fn a

    => a ) 2 Parser APP FUN a VAR a INT 2 AST uage
  17. Parsing Compiler ) 2 Parser Lexer Tokens ( fn a

    => a ) 2 Parser APP FUN a VAR a INT 2 AST uage
  18. Parsing Compiler ) 2 Parser Lexer Tokens ( fn a

    => a ) 2 Parser APP FUN a VAR a INT 2 AST uage
  19. Parsing Compiler ) 2 Parser Lexer Tokens ( fn a

    => a ) 2 Parser APP FUN a VAR a INT 2 AST uage
  20. Parsing Compiler ) 2 Parser Lexer Tokens ( fn a

    => a ) 2 Parser APP FUN a VAR a INT 2 AST uage
  21. Lexing Compiler ) 2 Parser Lexer Tokens ( fn a

    => a ) 2 Parser APP FUN a VAR a INT 2 AST uage
  22. Lexing Compiler ) 2 Parser Lexer Tokens ( fn a

    => a ) 2 Parser uage • Expects a stream of characters or bytes • Groups them into semantically atomic units: tokens! • These are the words of the language! • What are the rules for grouping them, though?
  23. • Grouping can be thought of as "split by space"

    • Why not exactly that, though? Consider: Lexing
  24. • Grouping can be thought of as "split by space"

    • Why not exactly that, though? Consider: Lexing val sum = 1 + 2 ! val sum=1+2 ! val str = "spaces matter here"
  25. • We need rules for grouping characters into tokens •

    These rules form the lexical grammar Lexing
  26. • We need rules for grouping characters into tokens •

    These rules form the lexical grammar • Can be defined using regular expressions Lexing
  27. • We need rules for grouping characters into tokens •

    These rules form the lexical grammar • Can be defined using regular expressions • Conducive to easy and efficient implementations Lexing
  28. • We need rules for grouping characters into tokens •

    These rules form the lexical grammar • Can be defined using regular expressions • Conducive to easy and efficient implementations • Using a RegExp library Lexing
  29. • We need rules for grouping characters into tokens •

    These rules form the lexical grammar • Can be defined using regular expressions • Conducive to easy and efficient implementations • Using a RegExp library • By hand isn't hard either, just a little cumbersome Lexing
  30. • We need rules for grouping characters into tokens •

    These rules form the lexical grammar • Can be defined using regular expressions • Conducive to easy and efficient implementations • Using a RegExp library • By hand isn't hard either, just a little cumbersome • Lexer generators: Lex, Flex, Alex, ANTLR, etc. Lexing
  31. • We need rules for grouping characters into tokens •

    These rules form the lexical grammar • Can be defined using regular expressions • Conducive to easy and efficient implementations • Using a RegExp library • By hand isn't hard either, just a little cumbersome • Lexer generators: Lex, Flex, Alex, ANTLR, etc. • Lexing is what you need for syntax definition files Lexing
  32. µML — Lexical Grammar integers 0|[1-9][0-9]* identifiers [a-zA-Z]+ symbols (,

    ), +, -, =, => keywords if, then, else, let, val, in, end, fn, true, false
  33. integers 0|[1-9][0-9]* identifiers [a-zA-Z]+ symbols (, ), +, -, =,

    => keywords if, then, else, let, val, in, end, fn, true, false µML — Lexical Grammar
  34. integers 0|[1-9][0-9]* identifiers [a-zA-Z]+ symbols (, ), +, -, =,

    => keywords if, then, else, let, val, in, end, fn, true, false µML — Lexical Grammar
  35. integers 0|[1-9][0-9]* identifiers [a-zA-Z]+ symbols (, ), +, -, =,

    => keywords if, then, else, let, val, in, end, fn, true, false µML — Lexical Grammar
  36. integers 0|[1-9][0-9]* identifiers [a-zA-Z]+ symbols (, ), +, -, =,

    => keywords if, then, else, let, val, in, end, fn, true, false µML — Lexical Grammar
  37. Parsing Compiler ) 2 Parser Lexer Tokens ( fn a

    => a ) 2 Parser APP FUN a VAR a INT 2 AST uage
  38. • The lexer recognizes valid words in the language •

    Not all combinations of valid words form valid phrases in a language Parsing
  39. • The lexer recognizes valid words in the language •

    Not all combinations of valid words form valid phrases in a language • Syntactically correct: val a = 1 Parsing
  40. • The lexer recognizes valid words in the language •

    Not all combinations of valid words form valid phrases in a language • Syntactically correct: val a = 1 • Syntactically incorrect: val val val Parsing
  41. • The lexer recognizes valid words in the language •

    Not all combinations of valid words form valid phrases in a language • Syntactically correct: val a = 1 • Syntactically incorrect: val val val • We must define the structure of phrases Parsing
  42. • The lexer recognizes valid words in the language •

    Not all combinations of valid words form valid phrases in a language • Syntactically correct: val a = 1 • Syntactically incorrect: val val val • We must define the structure of phrases • A syntactical grammar achieves that Parsing
  43. • Regular expressions are not powerful enough • REs can't

    recognize nested structures • Because they use a finite amount of memory Parsing
  44. • Regular expressions are not powerful enough • REs can't

    recognize nested structures • Because they use a finite amount of memory • Nesting needs a stack to remember the upper structures you're traversing Parsing
  45. • Regular expressions are not powerful enough • REs can't

    recognize nested structures • Because they use a finite amount of memory • Nesting needs a stack to remember the upper structures you're traversing • Syntactical grammars express nesting using recursion Parsing
  46. µML — Syntactical Grammar expr = int | var |

    bool | ( expr ) | fn var => expr | if expr then expr else expr | let val var = expr in expr end | expr oper expr | expr expr oper = + | - bool = true | false
  47. expr = int | var | bool | ( expr

    ) | fn var => expr | if expr then expr else expr | let val var = expr in expr end | expr oper expr | expr expr oper = + | - bool = true | false Here, blue symbols represent tokens coming from the lexer, not keywords. µML — Syntactical Grammar
  48. expr = int | var | bool | ( expr

    ) | fn var => expr | if expr then expr else expr | let val var = expr in expr end | expr oper expr | expr expr oper = + | - bool = true | false Here, blue symbols represent tokens coming from the lexer, not keywords. µML — Syntactical Grammar
  49. expr = int | var | bool | ( expr

    ) | fn var => expr | if expr then expr else expr | let val var = expr in expr end | expr oper expr | expr expr bool = true | false oper = + | - Here, blue symbols represent tokens coming from the lexer, not keywords. µML — Syntactical Grammar
  50. expr = int | var | bool | ( expr

    ) | fn var => expr | if expr then expr else expr | let val var = expr in expr end | expr oper expr | expr expr bool = true | false oper = + | - Here, blue symbols represent tokens coming from the lexer, not keywords. µML — Syntactical Grammar
  51. expr = int | var | bool | ( expr

    ) | fn var => expr | if expr then expr else expr | let val var = expr in expr end | expr oper expr | expr expr bool = true | false oper = + | - Here, blue symbols represent tokens coming from the lexer, not keywords. µML — Syntactical Grammar
  52. expr = int | var | bool | ( expr

    ) | fn var => expr | if expr then expr else expr | let val var = expr in expr end | expr oper expr | expr expr bool = true | false oper = + | - Here, blue symbols represent tokens coming from the lexer, not keywords. µML — Syntactical Grammar
  53. expr = int | var | bool | ( expr

    ) | fn var => expr | if expr then expr else expr | let val var = expr in expr end | expr oper expr | expr expr bool = true | false oper = + | - Here, blue symbols represent tokens coming from the lexer, not keywords. µML — Syntactical Grammar
  54. expr = int | var | bool | ( expr

    ) | fn var => expr | if expr then expr else expr | let val var = expr in expr end | expr oper expr | expr expr bool = true | false oper = + | - Here, blue symbols represent tokens coming from the lexer, not keywords. µML — Syntactical Grammar
  55. expr = int | var | bool | ( expr

    ) | fn var => expr | if expr then expr else expr | let val var = expr in expr end | expr oper expr | expr expr bool = true | false oper = + | - Here, blue symbols represent tokens coming from the lexer, not keywords. µML — Syntactical Grammar
  56. expr = int | var | bool | ( expr

    ) | fn var => expr | if expr then expr else expr | let val var = expr in expr end | expr oper expr | expr expr bool = true | false oper = + | - Here, blue symbols represent tokens coming from the lexer, not keywords. µML — Syntactical Grammar
  57. expr = int | var | bool | ( expr

    ) | fn var => expr | if expr then expr else expr | let val var = expr in expr end | expr oper expr | expr expr bool = true | false oper = + | - Here, blue symbols represent tokens coming from the lexer, not keywords. µML — Syntactical Grammar
  58. • Function application has higher precedence over infix expressions in

    ML • double 1 + 2 = (double 1) + 2 Introducing Precedence
  59. • Function application has higher precedence over infix expressions in

    ML • double 1 + 2 = (double 1) + 2 • double 1 + 2 ≠ double (1 + 2) Introducing Precedence
  60. • Function application has higher precedence over infix expressions in

    ML • double 1 + 2 = (double 1) + 2 • double 1 + 2 ≠ double (1 + 2) • A rule's alternatives don't encode precedence Introducing Precedence
  61. • Function application has higher precedence over infix expressions in

    ML • double 1 + 2 = (double 1) + 2 • double 1 + 2 ≠ double (1 + 2) • A rule's alternatives don't encode precedence • Grammars convey this by chaining rules in order of precedence Introducing Precedence
  62. • Function application has higher precedence over infix expressions in

    ML • double 1 + 2 = (double 1) + 2 • double 1 + 2 ≠ double (1 + 2) • A rule's alternatives don't encode precedence • Grammars convey this by chaining rules in order of precedence • Doesn't scale with many infix operators Introducing Precedence
  63. • Function application has higher precedence over infix expressions in

    ML • double 1 + 2 = (double 1) + 2 • double 1 + 2 ≠ double (1 + 2) • A rule's alternatives don't encode precedence • Grammars convey this by chaining rules in order of precedence • Doesn't scale with many infix operators • Use a special parser for that, e.g., the Shunting Yard algorithm Introducing Precedence
  64. Introducing Precedence expr = int | var | bool |

    ( expr ) | fn var => expr | if expr then expr else expr | let val var = expr in expr end | expr oper expr | expr expr bool = true | false oper = + | -
  65. Introducing Precedence expr = infix | fn var => expr

    | if expr then expr else expr ! infix = app | infix oper infix ! app = atomic | app atomic ! atomic = int | var | bool | ( expr ) | let val var = expr in expr end bool = true | false oper = + | -
  66. Introducing Precedence expr = infix | fn var => expr

    | if expr then expr else expr ! infix = app | infix oper infix ! app = atomic | app atomic ! atomic = int | var | bool | ( expr ) | let val var = expr in expr end bool = true | false oper = + | -
  67. Introducing Precedence expr = infix | fn var => expr

    | if expr then expr else expr ! infix = app | infix oper infix ! app = atomic | app atomic ! atomic = int | var | bool | ( expr ) | let val var = expr in expr end bool = true | false oper = + | -
  68. Introducing Precedence expr = infix | fn var => expr

    | if expr then expr else expr ! infix = app | infix oper infix ! app = atomic | app atomic ! atomic = int | var | bool | ( expr ) | let val var = expr in expr end bool = true | false oper = + | -
  69. • Two styles: • Top-down parsing: builds tree from the

    root • Bottom-up parsing: builds tree from the leaves Parsing Strategies
  70. • Two styles: • Top-down parsing: builds tree from the

    root • Bottom-up parsing: builds tree from the leaves • Top-down is easy to write by hand Parsing Strategies
  71. • Two styles: • Top-down parsing: builds tree from the

    root • Bottom-up parsing: builds tree from the leaves • Top-down is easy to write by hand • Bottom-up is not, but it's used by generators Parsing Strategies
  72. • Two styles: • Top-down parsing: builds tree from the

    root • Bottom-up parsing: builds tree from the leaves • Top-down is easy to write by hand • Bottom-up is not, but it's used by generators • Parser generators: YACC, ANTLR, Bison, etc. Parsing Strategies
  73. • The simplest known parsing strategy; amenable to hand-coding •

    Builds the tree top to bottom, from root to leaves, hence Descent Recursive Descent Parser
  74. • The simplest known parsing strategy; amenable to hand-coding •

    Builds the tree top to bottom, from root to leaves, hence Descent • Parallels the structure of the grammar Recursive Descent Parser
  75. • The simplest known parsing strategy; amenable to hand-coding •

    Builds the tree top to bottom, from root to leaves, hence Descent • Parallels the structure of the grammar • Main idea: each grammar production becomes a function Recursive Descent Parser
  76. • The simplest known parsing strategy; amenable to hand-coding •

    Builds the tree top to bottom, from root to leaves, hence Descent • Parallels the structure of the grammar • Main idea: each grammar production becomes a function • Recursion in the grammar translates to recursion in the code, hence Recursive Recursive Descent Parser
  77. • The simplest known parsing strategy; amenable to hand-coding •

    Builds the tree top to bottom, from root to leaves, hence Descent • Parallels the structure of the grammar • Main idea: each grammar production becomes a function • Recursion in the grammar translates to recursion in the code, hence Recursive • Recursion is the main difference compared to regexes; it needs a stack Recursive Descent Parser
  78. • The simplest known parsing strategy; amenable to hand-coding •

    Builds the tree top to bottom, from root to leaves, hence Descent • Parallels the structure of the grammar • Main idea: each grammar production becomes a function • Recursion in the grammar translates to recursion in the code, hence Recursive • Recursion is the main difference compared to regexes; it needs a stack • Very popular, e.g., Clang uses it for C/C++/Obj-C Recursive Descent Parser
  79. • The simplest known parsing strategy; amenable to hand-coding •

    Builds the tree top to bottom, from root to leaves, hence Descent • Parallels the structure of the grammar • Main idea: each grammar production becomes a function • Recursion in the grammar translates to recursion in the code, hence Recursive • Recursion is the main difference compared to regexes; it needs a stack • Very popular, e.g., Clang uses it for C/C++/Obj-C • Parser combinators are an abstraction over this idea Recursive Descent Parser
  80. • The current grammar has a problem • But, it's

    only a problem for our current parsing strategy; others can easily cope with it Removing Left-Recursion
  81. • The current grammar has a problem • But, it's

    only a problem for our current parsing strategy; others can easily cope with it • The problem is that some rules are left-recursive, i.e., the rule itself appears as the first symbol on the left Removing Left-Recursion
  82. • The current grammar has a problem • But, it's

    only a problem for our current parsing strategy; others can easily cope with it • The problem is that some rules are left-recursive, i.e., the rule itself appears as the first symbol on the left • This is problematic for a recursive descent parser because the structure of function calls follow the structure of rule definitions Removing Left-Recursion
  83. • The current grammar has a problem • But, it's

    only a problem for our current parsing strategy; others can easily cope with it • The problem is that some rules are left-recursive, i.e., the rule itself appears as the first symbol on the left • This is problematic for a recursive descent parser because the structure of function calls follow the structure of rule definitions • That means infinite recursion in the parser, which isn't good Removing Left-Recursion
  84. expr = infix | fn var => expr | if

    expr then expr else expr ! infix = app | infix oper infix ! app = atomic | app atomic ! atomic = int | var | bool | ( expr ) | let val var = expr in expr end bool = true | false oper = + | - Left-Recursive Grammar
  85. expr = infix | fn var => expr | if

    expr then expr else expr ! infix = app | infix oper infix ! app = atomic | app atomic ! atomic = int | var | bool | ( expr ) | let val var = expr in expr end bool = true | false oper = + | - Left-Recursive Grammar
  86. expr = infix | fn var => expr | if

    expr then expr else expr ! infix = app | infix oper infix ! app = atomic | app atomic Left-Recursive Grammar
  87. expr = infix | fn var => expr | if

    expr then expr else expr ! infix = app | infix oper infix ! app = atomic | atomic atomic | atomic atomic atomic | atomic atomic atomic atomic ... Left-Recursive Grammar
  88. expr = infix | fn var => expr | if

    expr then expr else expr ! infix = app | infix oper infix ! app = atomic | atomic atomic | atomic (atomic atomic) | atomic (atomic (atomic atomic)) ... Left-Recursive Grammar
  89. expr = infix | fn var => expr | if

    expr then expr else expr ! infix = app | infix oper infix ! app = atomic { app } Left-Recursive Grammar
  90. Removing Left-Recursion expr = infix | fn var => expr

    | if expr then expr else expr ! infix = app | infix oper infix ! app = atomic { app } ! atomic = int | var | bool | ( expr ) | let val var = expr in expr end bool = true | false oper = + | -
  91. Removing Left-Recursion expr = infix | fn var => expr

    | if expr then expr else expr ! infix = app | infix oper infix
  92. Removing Left-Recursion expr = infix | fn var => expr

    | if expr then expr else expr ! infix = app | app oper infix
  93. Removing Left-Recursion expr = infix | fn var => expr

    | if expr then expr else expr ! infix = app | app oper infix | app oper app oper infix
  94. Removing Left-Recursion expr = infix | fn var => expr

    | if expr then expr else expr ! infix = app | app oper infix | app oper app oper infix | app oper app oper app oper infix
  95. Removing Left-Recursion expr = infix | fn var => expr

    | if expr then expr else expr ! infix = app | app oper infix | app oper app oper infix | app oper app oper app oper infix ...
  96. Removing Left-Recursion expr = infix | fn var => expr

    | if expr then expr else expr ! infix = app | app (oper infix) | app (oper app (oper infix)) | app (oper app (oper app (oper infix))) ...
  97. Removing Left-Recursion expr = infix | fn var => expr

    | if expr then expr else expr ! infix = app { oper infix }
  98. Removing Left-Recursion expr = infix | fn var => expr

    | if expr then expr else expr ! infix = app { oper infix } ! app = atomic { app } ! 12 14 13 (12 14) 13 ! atomic = int | var | bool | ( expr ) | let val var = expr in expr end bool = true | false oper = + | -
  99. • Write a lexer for JSON • Write a recursive

    descent parser for JSON • It's way easier than today's vehicle language • I promise! • Specification: json.org Homework