Let's Write a Parser

Ionuț G. Stan, I T.A.K.E., May 2016

About Me

• Software Developer at Eloquentix
• I work mostly with Scala
• I like FP, programming languages, compilers
• I started the Bucharest FP meet-up group
• I occasionally blog on igstan.ro

Plan

• Vehicle Language: µML
• Compilers Overview
• Parsing: Intuitions and Live Coding

Vehicle Language: µML

1. Integers: 1, 23, 456, etc.
2. Identifiers (only letters): inc, cond, a, etc.
3. Booleans: true and false
4. Single-argument anonymous functions: fn a => a
5. Function application: inc 42
6. If expressions: if cond then t else f
7. Addition and subtraction: a + b, a - b
8. Parenthesized expressions: (a + b)
9. Let blocks/expressions: let val name = ... in name end

Small Example

    let
      val inc = fn a => a + 1
    in
      inc 42
    end

Compilers Overview

[Diagram: Source Language → Compiler → Target Language]

Source Language: (fn a => a) 2
Target Language: (function(a){return a})(2)

Parsing

[Diagram: inside the compiler, the Parser is the first phase]

Abstract Syntax Tree

[Diagram: the Parser produces an Abstract Syntax Tree (AST). For (fn a => a) 2:]

    APP
      FUN a
        VAR a
      INT 2

Code Generation

[Diagram: Parser → AST → CodeGen]

Many Intermediate Phases

[Diagram: Parser → AST → ... → CodeGen]

Type Checking

[Diagram: Parser → AST → Type Checker → Typed AST → ... → CodeGen]

Last Year's Talk

[Diagram: the same pipeline, with the Type Checker phase marked "Last Year"]

Today's Talk

[Diagram: the same pipeline, with the Parser phase marked "Today"]
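The AST and code generation steps above can be sketched in a few lines of Python. This is only an illustration: the class names `Int`, `Var`, `Fun`, `App` and the function `codegen` are assumptions made for this sketch, not names from the talk, and the output format simply mirrors the JavaScript shown on the overview slide.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical AST node types, named after the labels in the AST diagram.
@dataclass
class Int:
    value: int

@dataclass
class Var:
    name: str

@dataclass
class Fun:
    param: str
    body: "Node"

@dataclass
class App:
    fn: "Node"
    arg: "Node"

Node = Union[Int, Var, Fun, App]

def codegen(node: Node) -> str:
    """Translate a µML AST node into a JavaScript expression string."""
    if isinstance(node, Int):
        return str(node.value)
    if isinstance(node, Var):
        return node.name
    if isinstance(node, Fun):
        return "function(" + node.param + "){return " + codegen(node.body) + "}"
    if isinstance(node, App):
        fn = codegen(node.fn)
        # A function literal must be parenthesized before it can be called.
        if isinstance(node.fn, Fun):
            fn = "(" + fn + ")"
        return fn + "(" + codegen(node.arg) + ")"
    raise TypeError(f"unknown node: {node!r}")

# AST for the µML program `(fn a => a) 2`
ast = App(Fun("a", Var("a")), Int(2))
js = codegen(ast)
print(js)  # (function(a){return a})(2)
```

The code generator is a plain recursive walk over the tree, which is why the parser's job is to produce a tree in the first place.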

Lexing + Parsing

[Diagram: source text → Lexer → Tokens → Parser → AST]

Source: (fn a => a) 2
Tokens: (   fn   a   =>   a   )   2
AST:

    APP
      FUN a
        VAR a
      INT 2

Lexing

• Expects a stream of characters or bytes
• Groups them into semantically atomic units: tokens!
• These are the words of the language!
• What are the rules for grouping them, though?
• Grouping can be thought of as "split by space"
• Why not exactly that, though? Consider:

    val sum = 1 + 2
    val sum=1+2
    val str = "spaces matter here"

• We need rules for grouping characters into tokens
• These rules form the lexical grammar
• Can be defined using regular expressions
• Conducive to easy and efficient implementations
• Using a RegExp library
• By hand isn't hard either, just a little cumbersome
• Lexer generators: Lex, Flex, Alex, ANTLR, etc.
• Lexing is what you need for syntax definition files
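To see why "split by space" is not enough, here is a small Python sketch that applies a naive whitespace split to the three sample lines. The variable names are made up for this illustration.

```python
samples = [
    "val sum = 1 + 2",
    "val sum=1+2",
    'val str = "spaces matter here"',
]

# Naive "lexer": split on whitespace and hope for the best.
naive = [s.split() for s in samples]

for source, tokens in zip(samples, naive):
    print(source, "->", tokens)
```

The first line splits as hoped, the second glues `sum=1+2` into a single "token", and the third cuts the string literal into three pieces. Token boundaries depend on the characters themselves, not on the spaces between them.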

µML: Lexical Grammar

    integers     0|[1-9][0-9]*
    identifiers  [a-zA-Z]+
    symbols      (, ), +, -, =, =>
    keywords     if, then, else, let, val, in, end, fn, true, false
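The lexical grammar above maps fairly directly onto a regular-expression library. The following is a minimal sketch using Python's standard `re` module; the function name `lex`, the token kind names, and the `(kind, text)` tuple shape are assumptions of this sketch, not the implementation from the talk.

```python
import re

KEYWORDS = {"if", "then", "else", "let", "val", "in", "end", "fn", "true", "false"}

# One named group per token class. "=>" is listed before "=" so that the
# longer symbol wins.
TOKEN_RE = re.compile(r"""
    (?P<int>0|[1-9][0-9]*)
  | (?P<word>[a-zA-Z]+)
  | (?P<symbol>=>|[()+\-=])
  | (?P<space>\s+)
""", re.VERBOSE)

def lex(source):
    """Turn a µML source string into a list of (kind, text) tokens."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind, text = m.lastgroup, m.group()
        pos = m.end()
        if kind == "space":
            continue  # whitespace separates tokens but is not a token itself
        if kind == "word":
            # Keywords match the identifier pattern, so tell them apart afterwards.
            kind = "keyword" if text in KEYWORDS else "identifier"
        tokens.append((kind, text))
    return tokens

print(lex("val sum=1+2"))
print(lex("(fn a => a) 2"))
```

Matching every word with the identifier pattern and checking the keyword set afterwards is one common way to keep the regular expression small; listing each keyword in the pattern would work too.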

Parsing

• The lexer recognizes valid words in the language
• Not all combinations of valid words form valid phrases in a language
• Syntactically correct: val a = 1
• Syntactically incorrect: val val val
• We must define the structure of phrases
• A syntactical grammar achieves that
• Regular expressions are not powerful enough
• REs can't recognize nested structures
• Because they use a finite amount of memory
• Nesting needs a stack to remember the upper structures you're traversing
• Syntactical grammars express nesting using recursion
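The point about nesting and stacks can be illustrated with the smallest nested language there is: balanced parentheses. The sketch below is an assumption-laden toy (the function names `balanced` and `groups` are invented here, and the input may contain only parentheses); the call stack of the recursive helper plays the role of the stack that remembers the enclosing structures.

```python
def balanced(text):
    """Return True if text consists only of correctly nested parentheses."""

    def groups(pos):
        # Consume zero or more "( ... )" groups starting at pos and return
        # the position just after the last one.
        while pos < len(text) and text[pos] == "(":
            pos = groups(pos + 1)  # recurse into the nested content
            if pos >= len(text) or text[pos] != ")":
                raise SyntaxError("missing )")
            pos += 1
        return pos

    try:
        # The whole input must be consumed for it to count as balanced.
        return groups(0) == len(text)
    except SyntaxError:
        return False

print(balanced("(()())"))
print(balanced("(()"))
```

Each recursive call corresponds to one level of nesting, so the depth of nesting the function can follow is bounded only by the stack, not by a fixed number of states.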

It's not weird-looking Unicode characters that make regexes unsuitable for
parsing.

Syntactical Grammar

Introducing Precedence

• Function application has higher precedence than infix expressions in ML
• double 1 + 2 = (double 1) + 2
• double 1 + 2 ≠ double (1 + 2)
• A rule's alternatives don't encode precedence
• Grammars convey this by chaining rules in order of precedence
• Doesn't scale with many infix operators
• Use a special parser for that, e.g., the Shunting Yard algorithm
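As a rough idea of what the Shunting Yard algorithm looks like, here is a minimal Python sketch for left-associative binary operators. Several things are assumptions of this sketch: the function name `parse_infix`, the precedence table, the nested-tuple output, and the `*` operator, which is not part of µML and is included only to show two precedence levels. Parentheses and error handling are left out.

```python
# "*" is NOT part of µML; it is here only to illustrate a second precedence level.
PRECEDENCE = {"+": 1, "-": 1, "*": 2}

def parse_infix(tokens):
    """Shunting-yard over a flat list of operands and binary operators.

    Returns a nested tuple (op, left, right). Assumes well-formed input.
    """
    operands, operators = [], []

    def reduce_top():
        # Combine the two most recent operands with the most recent operator.
        op = operators.pop()
        right = operands.pop()
        left = operands.pop()
        operands.append((op, left, right))

    for tok in tokens:
        if tok in PRECEDENCE:
            # Left-associative: reduce while the stacked operator binds at
            # least as tightly as the incoming one.
            while operators and PRECEDENCE[operators[-1]] >= PRECEDENCE[tok]:
                reduce_top()
            operators.append(tok)
        else:
            operands.append(tok)

    while operators:
        reduce_top()
    return operands[0]

print(parse_infix(["1", "+", "2", "*", "3"]))
print(parse_infix(["a", "-", "b", "+", "c"]))
```

Adding an operator means adding one entry to the table, rather than one more rule to the grammar chain, which is the scaling advantage the last bullet refers to.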

Parsing Strategies

• Two styles:
  • Top-down parsing: builds tree from the root
  • Bottom-up parsing: builds tree from the leaves
• Top-down is easy to write by hand
• Bottom-up is not, but it's used by generators
• Parser generators: YACC, ANTLR, Bison, etc.

Recursive Descent Parser

• The simplest known parsing strategy; amenable to hand-coding
• Builds the tree top to bottom, from root to leaves, hence Descent
• Parallels the structure of the grammar
• Main idea: each grammar production becomes a function
• Recursion in the grammar translates to recursion in the code, hence Recursive
• Recursion is the main difference compared to regexes; it needs a stack
• Very popular, e.g., Clang uses it for C/C++/Obj-C
• Parser combinators are an abstraction over this idea
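The "each production becomes a function" idea can be sketched on a small, non-left-recursive fragment of µML: `expr = fn var => expr | atomic` and `atomic = int | var | ( expr )`. This is a hypothetical sketch: the `Parser` class, its method names, the pre-split token list, and the tuple-shaped AST are all assumptions made here, and keywords other than `fn` are not distinguished from identifiers.

```python
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens  # a list of token strings, already lexed
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        tok = self.peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return tok

    def expect(self, expected):
        tok = self.next()
        if tok != expected:
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")

    # expr = fn var => expr | atomic
    def expr(self):
        if self.peek() == "fn":
            self.next()
            param = self.next()
            if not param.isalpha():
                raise SyntaxError(f"expected a parameter name, got {param!r}")
            self.expect("=>")
            return ("FUN", param, self.expr())  # grammar recursion = code recursion
        return self.atomic()

    # atomic = int | var | ( expr )
    def atomic(self):
        tok = self.next()
        if tok == "(":
            inner = self.expr()
            self.expect(")")
            return inner
        if tok.isdigit():
            return ("INT", int(tok))
        if tok.isalpha():
            return ("VAR", tok)
        raise SyntaxError(f"unexpected token {tok!r}")

tree = Parser(["fn", "a", "=>", "(", "a", ")"]).expr()
print(tree)
```

Each method is a direct transcription of the rule in the comment above it, which is what makes this style comfortable to write by hand.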

Removing Left-Recursion

• The current grammar has a problem
• But it's only a problem for our current parsing strategy; others can easily cope with it
• The problem is that some rules are left-recursive, i.e., the rule itself appears as the leftmost symbol of one of its own alternatives
• This is problematic for a recursive descent parser because the structure of function calls follows the structure of rule definitions
• That means infinite recursion in the parser, which isn't good
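The infinite recursion is easy to reproduce. The sketch below transcribes a left-recursive rule, `app = app atomic | atomic`, into a Python function in the most literal way; the function name `parse_app` and the token list are made up for this illustration.

```python
def parse_app(tokens, pos=0):
    # app = app atomic | atomic
    # The first alternative starts with `app` itself, so the function calls
    # itself before consuming any token: `pos` never advances.
    left, pos = parse_app(tokens, pos)
    return ("APP", left, tokens[pos]), pos + 1

try:
    parse_app(["inc", "42"])
    outcome = "parsed"
except RecursionError:
    outcome = "infinite recursion"

print(outcome)  # infinite recursion
```

Python stops the runaway recursion with a `RecursionError`; in a language without such a guard the parser would simply overflow the stack.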

Left-Recursive Grammar

    expr  = infix
          | fn var => expr
          | if expr then expr else expr

    infix = app
          | infix oper infix

    app   = atomic
          | app atomic

Expanding the app rule step by step:

    app   = atomic
          | atomic atomic
          | atomic atomic atomic
          | atomic atomic atomic atomic
          ...

    app   = atomic
          | atomic atomic
          | atomic (atomic atomic)
          | atomic (atomic (atomic atomic))
          ...

    app   = atomic { app }

Removing Left-Recursion

    expr   = infix
           | fn var => expr
           | if expr then expr else expr

    infix  = app
           | infix oper infix

    app    = atomic { app }

    atomic = int
           | var
           | bool
           | ( expr )
           | let val var = expr in expr end

    bool   = true | false
    oper   = + | -

    infix  = app
           | infix oper infix

    infix  = app
           | app oper infix

    infix  = app
           | app oper infix
           | app oper app oper infix
           | app oper app oper app oper infix
           ...

    infix  = app
           | app (oper infix)
           | app (oper app (oper infix))
           | app (oper app (oper app (oper infix)))
           ...

    infix  = app { oper infix }

The grammar with left recursion removed:

    expr   = infix
           | fn var => expr
           | if expr then expr else expr

    infix  = app { oper infix }

    app    = atomic { app }

    12 14 13    (12 14) 13

    atomic = int
           | var
           | bool
           | ( expr )
           | let val var = expr in expr end

    bool   = true | false
    oper   = + | -
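In code, the braces in `infix` and `app` typically turn into loops. The sketch below covers only the `infix`, `app`, and `atomic` rules, with `atomic` limited to integers, variables, and parentheses. It is an assumption-heavy illustration, not the talk's live-coded parser: the function names and tuple-shaped AST are invented, tokens arrive pre-split, keywords are not handled, and the loops fold to the left so that `f a b` reads as `(f a) b` and `a - b - c` as `(a - b) - c`, which is a design choice of this sketch rather than a literal reading of the rules.

```python
OPERATORS = {"+", "-"}

def parse(tokens):
    """Parse a list of token strings into a nested-tuple AST."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def starts_atomic(tok):
        return tok is not None and (tok == "(" or tok.isdigit() or tok.isalpha())

    # atomic = int | var | ( infix )   (subset of the full rule)
    def atomic():
        nonlocal pos
        tok = peek()
        if not starts_atomic(tok):
            raise SyntaxError(f"unexpected token {tok!r}")
        pos += 1
        if tok == "(":
            inner = infix()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            pos += 1
            return inner
        return ("INT", int(tok)) if tok.isdigit() else ("VAR", tok)

    # app: one atomic followed by zero or more arguments, folded to the left
    def app():
        node = atomic()
        while starts_atomic(peek()):
            node = ("APP", node, atomic())
        return node

    # infix: one app followed by zero or more (operator, app) pairs
    def infix():
        nonlocal pos
        node = app()
        while peek() in OPERATORS:
            op = tokens[pos]
            pos += 1
            node = (op, node, app())
        return node

    tree = infix()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree

print(parse(["double", "1", "+", "2"]))
print(parse(["f", "a", "b"]))
```

Because `infix` calls `app` for its operands, application binds tighter than `+` and `-`, so `double 1 + 2` comes out as `(double 1) + 2`.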

github.com/igstan/itake-2016

Homework

• Write a lexer for JSON
• Write a recursive descent parser for JSON
• It's way easier than today's vehicle language
• I promise!
• Specification: json.org

Thank You!

Questions!
