Since the previous talk at Go Con 2014 Autumn, lots of things in the internals have changed. In this talk, I will try to give an overview of Go compiler internals and update the information as much as possible, along with my new hacks.
history of Go language infrastructure A brief history of Go language infrastructure Hacking the internals Hacking the internals Who Who am am I I: : @moriyoshi at github.com / @moriyoshit at twitter.com. @moriyoshi at github.com / @moriyoshit at twitter.com. An early Go contributor. An early Go contributor. Reviewed the Japanese translation of "Concurrency in Go". Reviewed the Japanese translation of "Concurrency in Go". commit a8fbf5dc2cd5b58167402df47bb06217c5e8fd22 commit a8fbf5dc2cd5b58167402df47bb06217c5e8fd22 Author: Moriyoshi Koizumi <[email protected]> Author: Moriyoshi Koizumi <[email protected]> Date: Tue Dec 15 21:24:17 2009 -0800 Date: Tue Dec 15 21:24:17 2009 -0800 This patch enables cgo utility to correctly convert enums in the C source This patch enables cgo utility to correctly convert enums in the C source into consts in the resulting Go source. Previously known as issue 161047, into consts in the resulting Go source. Previously known as issue 161047, which I deleted accidentally. Fixes issue 207. which I deleted accidentally. Fixes issue 207. R=rsc R=rsc https://golang.org/cl/166059 https://golang.org/cl/166059
cut it into a series of meaningful chunks. The lexer scans over the source code and cut it into a series of meaningful chunks. src/cmd/compile/internal/syntax/tokens.go src/cmd/compile/internal/syntax/tokens.go src/cmd/compile/internal/syntax/scanner.go src/cmd/compile/internal/syntax/scanner.go a := b + c(12) _Name _Define _IncOp _Name _Lparen _Rparen _Name _Literal
an AST (abstract syntax tree). The parser reads the generated tokens and build an AST (abstract syntax tree). src/cmd/compile/internal/syntax/nodes.go src/cmd/compile/internal/syntax/nodes.go src/cmd/compile/internal/syntax/parser.go src/cmd/compile/internal/syntax/parser.go CallExpr Operation Name _Name _Define _IncOp _Name _Lparen _Rparen _Name _Literal BasicLit Name AssignStmt Name Tokens AST + a b c 12 :=
a *syntax.File node, which consist of declaration nodes. A Go source file corresponds to a *syntax.File node, which consist of declaration nodes. package main import "fmt" const world = "world" type myString string var hello myString = "hello, " + world func main() { fmt.Println(hello) } FuncDecl File ImportDecl ConstDecl TypeDecl VarDecl ImportDecl ImportDecl ConstDecl ConstDecl TypeDecl TypeDecl VarDecl VarDecl FuncDecl FuncDecl CallExpr SelectorExpr Name Name Name fmt Println hello
to the annotated AST (node tree). "Noder" translates the AST to the annotated AST (node tree). src/cmd/compile/internal/gc/noder.go src/cmd/compile/internal/gc/noder.go Translation is done on a one-to-one basis. Translation is done on a one-to-one basis. CallExpr Operation Name BasicLit Name AssignStmt Name AST OCALL OADD ONAME OLITERAL ONAME OAS ONAME Node Tree + a b c 12 :=
to... "Typecheck" walks through the node tree and tries to... Determine the type of each node. Determine the type of each node. Annotate the nodes with the information used in later stages. Annotate the nodes with the information used in later stages. Replace the nodes for special function calls (len, cap, append and so on) to dedicated Replace the nodes for special function calls (len, cap, append and so on) to dedicated nodes. nodes. Translate the references to methods to closures. Translate the references to methods to closures. etc. etc. Typechecking happens occationally in the following stages to deal with the "synthesized" Typechecking happens occationally in the following stages to deal with the "synthesized" nodes. nodes. src/cmd/compile/internal/gc/typecheck.go src/cmd/compile/internal/gc/typecheck.go
ONAME OCALLFUNC OADD ONAME OLITERAL ONAME OAS ONAME + a b c 12 := + a b c 12 := ONAME OCALL ONAME len d ONAME OLEN d type=string type=int type=int type=int type=int type=func() type=int type=int
its body contains any closure function and finds out On each declared function, checks if its body contains any closure function and finds out how the outer variables are referenced in the closure. how the outer variables are referenced in the closure. a := 1 b := 2 func () { fmt.Println(a, b) }() a = 1 OCALLFUNC OCLOSURE OLITERAL OAS ONAME type=int assigned type=int ODCLFUNC = a 1 ODCL ONAME a type=func() ONAME type=int assigned type=int b ODCL Chosen strategies: a → pass by reference b → pass by value
address leaks off the stack. Checks on every variable if its address leaks off the stack. src/cmd/compile/internal/gc/escape.go src/cmd/compile/internal/gc/escape.go var z *int func main() { ... a := 1 z = &a ... } OLITERAL OAS ONAME type=int OADDR OAS ONAME type=*int type=*int ONAME type=int addrtaken ODCLFUNC = a 1 := z & a ODCL ONAME a type=int addrtaken type=int addrtaken typecheck escAnalyze OADDR ONAME type=*int type=*int ONAME type=int addrtaken class(PAUTOHEAP) z & a Located off-stack →mark the right operand as PAUTOHEAP
closure to a simpler form. Transform the immediate call to the closure to a simpler form. src/cmd/compile/internal/gc/closure.go src/cmd/compile/internal/gc/closure.go func do() { a := 1 func() { fmt.Println(a) a = 2 }() } func func1(a *int) { fmt.Println(*a) *a = 2 } func do() { a := 1 func1(&a) } This can be done after escape analysis because there is no chance the outer variables will This can be done after escape analysis because there is no chance the outer variables will leak. leak.
takes place. "Walk" phase happen right before the code generation takes place. Transforms the nodes to simpler forms Transforms the nodes to simpler forms Transforms the nodes into some function calls Transforms the nodes into some function calls Promote PAUTOHEAP variables (see Escape Analysis) into pointers and initialize them Promote PAUTOHEAP variables (see Escape Analysis) into pointers and initialize them with ONEWOBJ. with ONEWOBJ. etc. etc. src/cmd/compile/internal/gc/walk.go src/cmd/compile/internal/gc/walk.go
an intermediate representation that is often used SSA (Static Single Assignment) form is an intermediate representation that is often used to mediate an AST form of source code with corresponding machine code. to mediate an AST form of source code with corresponding machine code. As SSA form, every variable gets assigned only once during its lifecycle. As SSA form, every variable gets assigned only once during its lifecycle. This ensures each basic block has exactly a single path, and thus the same This ensures each basic block has exactly a single path, and thus the same optimization strategy can be applied. optimization strategy can be applied. A basic block: a node of a control flow graph, which contains no branch operation by A basic block: a node of a control flow graph, which contains no branch operation by definition (branches are represented as edges in CFG.) definition (branches are represented as edges in CFG.) GOSSAFUNC environment variable GOSSAFUNC environment variable ~$ GOSSAFUNC=foo go tool compile foo.go ~$ GOSSAFUNC=foo go tool compile foo.go dumped SSA to /home/moriyoshi/ssa.html dumped SSA to /home/moriyoshi/ssa.html
tools derived from Plan 9 / Inferno Most of the bootstrapping tools derived from Plan 9 / Inferno 6a / 6c / 6g / lib9... 6a / 6c / 6g / lib9... The toolchain was written in The toolchain was written in Plan-9 Plan-9 flavored flavored C C (http://doc.cat-v.org/plan_9/4th_edition/papers/comp) (http://doc.cat-v.org/plan_9/4th_edition/papers/comp) Naive code generation facility (~1.6) Naive code generation facility (~1.6) Semi-concrete IRs were generated directly from annotated ASTs. Semi-concrete IRs were generated directly from annotated ASTs.
of pointers in stack. Based on the construction of control flow graphs (CFG). Based on the construction of control flow graphs (CFG). Contiguous stack model. Contiguous stack model. 1.5 1.5 Achieved complete self-hosting. Achieved complete self-hosting. 1.7 1.7 SSA-based codegen introduced. SSA-based codegen introduced.
1: Let Go accept emojis for various identifiers. Modify the portion of the lexer so it will accept "" (U+1F363) for identifiers: Modify the portion of the lexer so it will accept "" (U+1F363) for identifiers: func (s *scanner) isIdentRune(c rune, first bool) bool { func (s *scanner) isIdentRune(c rune, first bool) bool { switch { switch { case unicode.IsLetter(c) || c == '_' || c == 0x1f363: // Modified case unicode.IsLetter(c) || c == '_' || c == 0x1f363: // Modified // ok // ok case unicode.IsDigit(c): case unicode.IsDigit(c): if first { if first { s.errorf("identifier cannot begin with digit %#U", c) s.errorf("identifier cannot begin with digit %#U", c) } } case c >= utf8.RuneSelf: case c >= utf8.RuneSelf: s.errorf("invalid identifier character %#U", c) s.errorf("invalid identifier character %#U", c) default: default: return false return false } } return true return true } }
New Operator Modify the lexer so it will generate a token value for the added operator. Modify the lexer so it will generate a token value for the added operator. Modify the parser so it will understand the token and give a right AST Node for it. Modify the parser so it will understand the token and give a right AST Node for it. Modify either the typecheck or walk facility so an annotated node that corresponds to Modify either the typecheck or walk facility so an annotated node that corresponds to the AST node will be transformed into a non-fancy branch of nodes. the AST node will be transformed into a non-fancy branch of nodes. type Foo struct { type Foo struct { } } func (*Foo) ->B(i int) { func (*Foo) ->B(i int) { fmt.Println(i) fmt.Println(i) } } func (*Foo) B->() int { func (*Foo) B->() int { return 1 return 1 } } type X interface { type X interface { ->B(int) ->B(int) B->() int B->() int } } func do(x X) { func do(x X) { x->B = 0 x->B = 0 x->B, c := 2, 3 x->B, c := 2, 3 fmt.Println(c) fmt.Println(c) } }