Miniboxing presentation at OOPSLA 2013

Miniboxing: Load-time Specialization on the JVM
OOPSLA, 29th of October 2013
Vlad Ureche, Cristian Talau, Martin Odersky
scala-miniboxing.org

We all like generics

A trivial example:

    def identity[T](t: T): T = t

• will take any type and
• will return that same type

But under erasure:

    def identity(t: Any): Any = t

Any is the top of the Scala type system.

A call such as

    val x = identity(3)

becomes, under erasure:

    val x = unbox(identity(box(3)))
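The boxing step can be observed directly. The sketch below is only an illustration: `identityErased` is a hand-written stand-in for the erased method, and the object name `ErasureDemo` is an assumption, not something from the talk.

```scala
object ErasureDemo {
  // The generic method from the slide.
  def identity[T](t: T): T = t

  // Hand-written stand-in for what the method looks like after erasure.
  def identityErased(t: Any): Any = t

  def main(args: Array[String]): Unit = {
    // Passing an Int where Any is expected boxes it into a java.lang.Integer.
    val boxed: Any = identityErased(3)
    println(boxed.isInstanceOf[java.lang.Integer])

    // Getting the primitive back requires an explicit unboxing step.
    val x: Int = boxed.asInstanceOf[Int]
    println(x == identity(3))
  }
}
```

Both lines print `true`: the value travels through the erased method as an object and has to be converted back afterwards.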

But under erasure

Generics execute similarly to dynamic languages:
– generic values lose their type information
– primitives need boxing
– performance is affected

Dynamic language VMs use specialization to improve performance*

* but the HotSpot JVM doesn't

Outline (we are here): Generics, Specialization, Miniboxing, Performance, Evaluation

Scala has a solution

It's called specialization*

Compile-time (static) transformation:
– duplicates the original code
– adapts it for each primitive type
– rewrites programs to use the adapted code

Adapted code doesn't need to box. Performance is regained.

* Iulian Dragos, PhD thesis, EPFL, 2010

Specialization

Let's revisit `def identity`:

    def identity[T](t: T): T = t

    def identity_V(t: Unit): Unit = t
    def identity_Z(t: Boolean): Boolean = t
    def identity_B(t: Byte): Byte = t
    def identity_C(t: Char): Char = t
    def identity_S(t: Short): Short = t
    def identity_I(t: Int): Int = t
    def identity_J(t: Long): Long = t
    def identity_F(t: Float): Float = t
    def identity_D(t: Double): Double = t

Generates 10 times the original code
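In Scala 2 source code, specialization is requested with the standard `@specialized` annotation. The following is a minimal sketch; limiting the annotation to `Int` and `Double` is a choice made here to keep the number of generated variants small, and the object name is an assumption.

```scala
object SpecializedIdentity {
  // Asks the Scala 2 compiler to generate primitive variants of the method
  // for Int and Double, next to the generic version.
  def identity[@specialized(Int, Double) T](t: T): T = t

  def main(args: Array[String]): Unit = {
    val i: Int = identity(3)              // can use the Int variant, no boxing
    val d: Double = identity(2.5)         // can use the Double variant
    val s: String = identity("generic")   // uses the generic version
    println(s"$i $d $s")
  }
}
```

Without arguments, `@specialized` covers all primitive types, which is the case shown on the slide.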

Specialization: it gets even worse

    def pack[T1, T2](t1: T1, t2: T2) = ...

    def pack_VV(t1: Unit, t2: Unit)
    def pack_VZ(t1: Unit, t2: Boolean)
    def pack_VB(t1: Unit, t2: Byte)
    def pack_VC(t1: Unit, t2: Char)
    def pack_VS(t1: Unit, t2: Short)
    def pack_VI(t1: Unit, t2: Int)
    def pack_VJ(t1: Unit, t2: Long)
    def pack_VF(t1: Unit, t2: Float)
    def pack_VD(t1: Unit, t2: Double)

10^n variants, where n is the number of type params.
And this is common: Maps, Tuples, Functions
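The 10^n growth can be checked by enumerating the suffix combinations. This sketch follows the simplified naming used on the slides; the suffix "L" for the generic (reference) variant is a hypothetical placeholder.

```scala
object VariantCount {
  // One-letter suffixes for the nine primitive types, as on the slide.
  val primitives: Seq[String] = Seq("V", "Z", "B", "C", "S", "I", "J", "F", "D")

  // "L" is a hypothetical suffix standing for the generic (reference) variant.
  val kinds: Seq[String] = primitives :+ "L"

  // All suffix combinations for a method with n type parameters.
  def suffixes(n: Int): Seq[String] =
    if (n == 0) Seq("")
    else for (k <- kinds; rest <- suffixes(n - 1)) yield k + rest

  def main(args: Array[String]): Unit = {
    println(suffixes(1).size) // 10
    println(suffixes(2).size) // 100
    println(suffixes(2).take(3).map("pack_" + _).mkString(", "))
  }
}
```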

Miniboxing reduces the variants

by using something like a tagged union:

    TAG | DATA (VALUE)

– TAG stores the original type
– DATA stores the encoded value in a long integer

    TAG     DATA (VALUE)
    BOOL    0x0     = false
    BOOL    0x1     = true
    INT     0x2A    = 42

and using the static type information
– tags are attached to code, not to values
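The encoding above can be sketched as a few conversion functions. The numeric tag values and all function names below are assumptions made for illustration; the slides only name the tags.

```scala
object MiniboxEncoding {
  // Hypothetical tag values; the slides only name the tags (BOOL, INT, ...).
  final val BOOL: Byte = 1
  final val INT: Byte = 2
  final val DOUBLE: Byte = 3

  def boolToMinibox(b: Boolean): Long = if (b) 1L else 0L
  def miniboxToBool(data: Long): Boolean = data != 0L

  def intToMinibox(i: Int): Long = i.toLong
  def miniboxToInt(data: Long): Int = data.toInt

  // Doubles keep their bit pattern, so the round trip is lossless.
  def doubleToMinibox(d: Double): Long = java.lang.Double.doubleToRawLongBits(d)
  def miniboxToDouble(data: Long): Double = java.lang.Double.longBitsToDouble(data)

  // Decoding needs the tag: the same long means different things per type.
  def show(tag: Byte, data: Long): String = tag match {
    case BOOL   => miniboxToBool(data).toString
    case INT    => miniboxToInt(data).toString
    case DOUBLE => miniboxToDouble(data).toString
    case _      => sys.error(s"unknown tag $tag")
  }

  def main(args: Array[String]): Unit = {
    println(show(BOOL, 0x0L)) // false
    println(show(BOOL, 0x1L)) // true
    println(show(INT, 0x2AL)) // 42
  }
}
```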

Miniboxing

Let's revisit `def identity`:

    def identity[T](t: T): T = t

    def identity_M(T_tag: Byte, t: Long): Long

Here T_tag is the TAG and t is the DATA (VALUE).

T_tag corresponds to the type parameter, instead of the values being passed around. This is tag hoisting.

Two variants per type parameter (reference + minibox)
– `def pack` will have 4 variants
– tag hoisting is instrumental in obtaining good performance
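Written out by hand, the two variants and a rewritten call site could look as follows. This is a sketch under assumptions: the tag value, the shape of the call site and the helper `variants` are not taken from the slides.

```scala
object MiniboxedIdentity {
  // Hypothetical tag value.
  final val INT: Byte = 2

  // Reference variant: used for objects, which are passed as they are.
  def identity[T](t: T): T = t

  // Miniboxed variant: one method serves all primitive types.
  // The tag describes the type parameter T, the long carries the value.
  def identity_M(T_tag: Byte, t: Long): Long = t

  // Number of variants for n type parameters: two choices per parameter.
  def variants(n: Int): Int = 1 << n

  def main(args: Array[String]): Unit = {
    // What a rewritten call site for identity(3) could look like.
    val x: Int = identity_M(INT, 3.toLong).toInt
    println(x)           // 3
    println(variants(2)) // 4, as for `def pack`
  }
}
```

The tag is passed once per call, next to the value, rather than being stored with every value.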

Performance needs one more ingredient

    Miniboxing = Tagged union + Tag hoisting + ???

Why do we need a secret ingredient?

Switching on tags kills performance

    def toString(T_tag: Byte, value: Long): String = T_tag match {
      case UNIT => ...
      case BOOL => ...
      ...
    }

Even more so for consecutive switches:

    T_tag match { case X => op1 }
    T_tag match { case X => op2 }

The second switch is redundant, so the two can be fused together:

    T_tag match { case X => op1; op2 }

This is critical for array operations.

Switching: ArrayBuffer.reverse()

    def reverse(): Unit = {
      var index = 0
      while (index * 2 < length) {
        val opposite = length-index-1
        val tmp1: T = array(index)
        val tmp2: T = array(opposite)
        array(index) = tmp2
        array(opposite) = tmp1
        index += 1
      }
    }

Each of the four array accesses in the loop body becomes a switch:

    T_tag match { case INT => ... ... }

Fuse the operations together?

    T_tag match {
      case INT =>
        val tmp1 = ...
        val tmp2 = ...
        array(.) = ...
        array(.) = ...
      ...
    }

Hoist the switch out of the loop?

    T_tag match {
      case INT =>
        var index = 0
        while (...) {
          ...
          index += 1
        }
    }

Is that enough? The method may itself be called from a loop.
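The difference between switching on every array access and hoisting the switch can be sketched with a standalone `reverse`. Everything below is an illustration under assumptions: the tag values, the helper names and the choice to pass the array as `Any` are not taken from the slides.

```scala
object SwitchingReverse {
  // Hypothetical tag values.
  final val INT: Byte = 0
  final val LONG: Byte = 1

  // Per-access switch: every read inspects the tag again.
  def arrayGet(T_tag: Byte, array: Any, i: Int): Long = T_tag match {
    case INT =>
      val a = array.asInstanceOf[Array[Int]]
      a(i).toLong
    case LONG =>
      val a = array.asInstanceOf[Array[Long]]
      a(i)
    case _ => sys.error("unsupported tag")
  }

  // Per-access switch: every write inspects the tag again.
  def arraySet(T_tag: Byte, array: Any, i: Int, v: Long): Unit = T_tag match {
    case INT =>
      val a = array.asInstanceOf[Array[Int]]
      a(i) = v.toInt
    case LONG =>
      val a = array.asInstanceOf[Array[Long]]
      a(i) = v
    case _ => sys.error("unsupported tag")
  }

  // Four switches per loop iteration.
  def reverseSwitching(T_tag: Byte, array: Any, length: Int): Unit = {
    var index = 0
    while (index * 2 < length) {
      val opposite = length - index - 1
      val tmp1 = arrayGet(T_tag, array, index)
      val tmp2 = arrayGet(T_tag, array, opposite)
      arraySet(T_tag, array, index, tmp2)
      arraySet(T_tag, array, opposite, tmp1)
      index += 1
    }
  }

  // One switch, hoisted out of the loop; each branch is a plain typed loop.
  def reverseHoisted(T_tag: Byte, array: Any, length: Int): Unit = T_tag match {
    case INT =>
      val a = array.asInstanceOf[Array[Int]]
      var index = 0
      while (index * 2 < length) {
        val opposite = length - index - 1
        val tmp = a(index)
        a(index) = a(opposite)
        a(opposite) = tmp
        index += 1
      }
    case LONG =>
      val a = array.asInstanceOf[Array[Long]]
      var index = 0
      while (index * 2 < length) {
        val opposite = length - index - 1
        val tmp = a(index)
        a(index) = a(opposite)
        a(opposite) = tmp
        index += 1
      }
    case _ => sys.error("unsupported tag")
  }

  def main(args: Array[String]): Unit = {
    val xs = Array(1, 2, 3, 4)
    reverseSwitching(INT, xs, xs.length)
    println(xs.mkString(",")) // 4,3,2,1
    reverseHoisted(INT, xs, xs.length)
    println(xs.mkString(",")) // 1,2,3,4
  }
}
```

Hoisting by hand duplicates the loop body once per tag, and the remaining switch is still executed on every call.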

Performance needs one more ingredient

    Miniboxing = Tagged union + Tag hoisting + ???

The missing ingredient can't be switching. It must be something else.

Dispatching

• Dispatch object
  – Encodes array interactions

    abstract class Dispatcher[T] {
      def array_get(...): Long
      def array_set(...): Unit
    }

    def identity_M(T_dispatcher: Dispatcher[T], t: Long): Long

The dispatcher is passed instead of the tag.

    object IntDispatcher extends Dispatcher[Int] {
      def array_get(...): Long = ...
      def array_set(...): Unit = ...
    }
    object LongDispatcher ...
    object CharDispatcher ...

Passing a dispatcher = hoisted already

In ArrayBuffer.reverse(), the two array reads become T_dispatcher.array_get calls and the two array writes become T_dispatcher.array_set calls.

With inlining, we get good performance
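A runnable version of the dispatcher idea might look like this. The parameter lists are elided on the slides, so the ones used here (`array: Any`, `index: Int`, `value: Long`) are assumptions, as is the standalone `reverse_M` helper.

```scala
object DispatchingReverse {
  // Parameter lists are elided on the slides; the ones below are assumptions.
  abstract class Dispatcher[T] {
    def array_get(array: Any, index: Int): Long
    def array_set(array: Any, index: Int, value: Long): Unit
  }

  object IntDispatcher extends Dispatcher[Int] {
    def array_get(array: Any, index: Int): Long = {
      val a = array.asInstanceOf[Array[Int]]
      a(index).toLong
    }
    def array_set(array: Any, index: Int, value: Long): Unit = {
      val a = array.asInstanceOf[Array[Int]]
      a(index) = value.toInt
    }
  }

  object LongDispatcher extends Dispatcher[Long] {
    def array_get(array: Any, index: Int): Long = {
      val a = array.asInstanceOf[Array[Long]]
      a(index)
    }
    def array_set(array: Any, index: Int, value: Long): Unit = {
      val a = array.asInstanceOf[Array[Long]]
      a(index) = value
    }
  }

  // No switch in the loop: each array access is a virtual call on the dispatcher.
  def reverse_M[T](T_dispatcher: Dispatcher[T], array: Any, length: Int): Unit = {
    var index = 0
    while (index * 2 < length) {
      val opposite = length - index - 1
      val tmp1 = T_dispatcher.array_get(array, index)
      val tmp2 = T_dispatcher.array_get(array, opposite)
      T_dispatcher.array_set(array, index, tmp2)
      T_dispatcher.array_set(array, opposite, tmp1)
      index += 1
    }
  }

  def main(args: Array[String]): Unit = {
    val xs = Array(1, 2, 3)
    reverse_M(IntDispatcher, xs, xs.length)
    println(xs.mkString(",")) // 3,2,1
  }
}
```

Whether these calls are fast depends on the JVM being able to inline them at the call site.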

Dispatching: ArrayBuffer.reverse()

The T_dispatcher.array_get call site:
– only IntDispatcher seen: monomorphic, okay
– LongDispatcher also seen: polymorphic, okay
– DoubleDispatcher also seen: megamorphic*, no more inlining

No more inlining means bad performance.

* for the HotSpot JVM

Object-oriented dispatch isn't the missing ingredient either.

The secret ingredient

• Switch-based dispatching

    T_tag match { case INT => ... ... }

• When instantiating the class
  – T_tag is known
  – T_tag is a constant

Encode T_tag in the class name? Statically? Code explosion!

Load-time specialization

    T_tag match {
      case INT => ...
      case CHAR => ...
      case UNIT => ...
      ...
    }

• Load-time transformation
  – set T_tag statically (for example, to INT)
  – perform constant folding
  – perform dead code elimination

What remains is only the useful code, with no dispatching.

Is this the secret ingredient? Yes!
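The effect of these steps can be pictured with two hand-written classes. This is only an illustration of the intended result, not the transformation itself, which the slides describe as happening when the class is loaded. The class names `Box_M` and `Box_M_INT` and the tag values are assumptions.

```scala
object LoadTimeSketch {
  // Hypothetical tag values.
  final val INT: Byte = 0
  final val BOOL: Byte = 1

  // Template as compiled: the tag is a field, so every call switches on it.
  class Box_M(T_tag: Byte, value: Long) {
    override def toString: String = T_tag match {
      case INT  => value.toInt.toString
      case BOOL => (value != 0L).toString
      case _    => sys.error("unsupported tag")
    }
  }

  // Hand-written picture of the INT variant after specialization:
  // T_tag = INT was folded into the switch and the dead cases were removed.
  class Box_M_INT(value: Long) {
    override def toString: String = value.toInt.toString
  }

  def main(args: Array[String]): Unit = {
    println(new Box_M(INT, 42L).toString)  // 42
    println(new Box_M_INT(42L).toString)   // 42
  }
}
```

Both classes behave the same for Int values; the second one simply contains no switch.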

Performance

    Miniboxing = Tagged union + Tag hoisting + Load-time specialization

Attaching tags to code enables load-time specialization.

Evaluation - Performance

[Chart: benchmark results (less is better), with the best and worst performance marked]

– Predictable performance
– 5x less bytecode
– Similar results on other benchmarks

Evaluation - Code size

[Chart: bytecode size (less is better)]

Spire: numeric abstractions library (12 KLOC)
– 2.8x bytecode reduction (4.7x for Vector in std. lib)

Contributions

    Miniboxing = Tagged union + Tag hoisting + Load-time specialization

Conclusions

• improves performance
• reduces bytecode size

Visit scala-miniboxing.org!
