Engineering fast indexes

ENGINEERING FAST INDEXES
Daniel Lemire
https://lemire.me
Joint work with lots of super smart people

Our recent work: Roaring Bitmaps
http://roaringbitmap.org/
Used by Apache Spark, Netflix Atlas, LinkedIn Pinot, Apache Lucene, Whoosh, Metamarkets' Druid, eBay's Apache Kylin
Further reading: Frame of Reference and Roaring Bitmaps (at Elastic, the company behind Elasticsearch)

Set data structures
We focus on sets of integers: S = {1, 2, 3, 1000}. Ubiquitous in databases and search engines.
- tests: x ∈ S?
- intersections: S2 ∩ S1
- unions: S2 ∪ S1
- differences: S2 ∖ S1
- Jaccard index (Tanimoto similarity): ∣S1 ∩ S2∣ / ∣S1 ∪ S2∣
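The operations listed above can be sketched with Python's built-in set type. This is only an illustration of the definitions, not how an index would implement them; the function name jaccard and the choice to return 1.0 for two empty sets are assumptions made for this example.

```python
def jaccard(s1, s2):
    """Jaccard index |S1 ∩ S2| / |S1 ∪ S2| of two sets of integers."""
    union = s1 | s2
    if not union:
        return 1.0  # assumed convention for two empty sets
    return len(s1 & s2) / len(union)


s1 = {1, 2, 3, 1000}
s2 = {2, 3, 4}

print(3 in s1)          # membership test: x ∈ S?
print(sorted(s1 & s2))  # intersection
print(sorted(s1 | s2))  # union
print(sorted(s1 - s2))  # difference
print(jaccard(s1, s2))  # similarity between 0 and 1
```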

"Ordered" Set iterate in sorted order, in reverse order, skippable
iterators (jump to first value ≥ x)
- Rank: how many elements of the set are smaller than k?
- Select: find the kth smallest value
- Min/max: find the maximal and minimal value
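On a sorted array, the ordered-set operations reduce to binary search and indexing. The sketch below uses Python's standard bisect module; the function names rank, select and skip_to are hypothetical, and select is assumed to count from 0.

```python
from bisect import bisect_left


def rank(sorted_values, k):
    """Number of elements strictly smaller than k."""
    return bisect_left(sorted_values, k)


def select(sorted_values, k):
    """k-th smallest value, counting from 0 (an assumption of this sketch)."""
    return sorted_values[k]


def skip_to(sorted_values, x):
    """Index of the first value >= x, or len(sorted_values) if there is none."""
    return bisect_left(sorted_values, x)


s = [1, 2, 3, 1000]
print(rank(s, 3))        # how many values are smaller than 3
print(select(s, 0))      # the minimum
print(s[skip_to(s, 4)])  # first value >= 4
```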

Let us make some assumptions...
- Many sets containing more than a few integers
- Integers span a wide range (e.g., [0, 100000))
- Mostly immutable (read often, write rarely)

How do we implement integer sets?
Assume sets are mostly immutable.
- sorted arrays (std::vector)
- hash sets (java.util.HashSet, std::unordered_set)
- …
- bitsets (java.util.BitSet)
- compressed bitsets

What is a bitset???
Efficient way to represent a set of integers. E.g., 0, 1, 3, 4 becomes 0b11011 or "27".
Also called a "bitmap" or a "bit array".

Add and contains on bitset
Most processors work on 64-bit words. Given index x, the corresponding word index is x / 64 and the within-word bit index is x % 64.

    add(x) {
      array[x / 64] |= (1 << (x % 64))
    }
    contains(x) {
      return array[x / 64] & (1 << (x % 64))
    }
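A runnable version of the same idea, as a minimal sketch: the class name Bitset and its constructor argument are hypothetical, and a Python list of integers stands in for an array of 64-bit words.

```python
class Bitset:
    """Uncompressed bitset stored as a list of 64-bit words."""

    def __init__(self, universe_size):
        # One word per 64 possible values, rounded up.
        self.words = [0] * ((universe_size + 63) // 64)

    def add(self, x):
        # Word index is x // 64, bit index within the word is x % 64.
        self.words[x // 64] |= 1 << (x % 64)

    def contains(self, x):
        return ((self.words[x // 64] >> (x % 64)) & 1) == 1


b = Bitset(128)
for value in (0, 1, 3, 4):
    b.add(value)
print(bin(b.words[0]), b.contains(3), b.contains(2))
```

In C or Java the shifted constant has to be a 64-bit value (for example 1ULL or 1L), otherwise the shift is carried out on a 32-bit integer; Python integers do not have that pitfall.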

How fast can you set bits in a bitset?
Very fast! Roughly three instructions (on x64)...

    index = x / 64          -> a single shift
    mask = 1 << (x % 64)    -> a single shift
    array[index] |= mask    -> a logical OR to memory

(Or can use the x64 bts instruction.)
On recent x64 can set one bit every ≈ 1.65 cycles (in cache)
Recall: Modern processors are superscalar (more than one instruction per cycle)

Bit-level parallelism
Bitsets are efficient: intersections
Intersection between {0, 1, 3} and {1, 3} can be computed as an AND operation between 0b1011 and 0b1010. Result is 0b1010 or {1, 3}.
Enables branchless processing.
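The worked example can be checked directly. In this sketch a single Python integer plays the role of one machine word; the helper names to_bits and to_values are hypothetical.

```python
def to_bits(values):
    """Pack a set of small non-negative integers into one integer, one bit per value."""
    bits = 0
    for v in values:
        bits |= 1 << v
    return bits


def to_values(bits):
    """List the positions of the set bits, in increasing order."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


a = to_bits({0, 1, 3})   # 0b1011
b = to_bits({1, 3})      # 0b1010
print(bin(a & b), to_values(a & b))  # intersection is a single AND
print(bin(a | b), to_values(a | b))  # union is a single OR
```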

Bitsets are efficient: in practice

    for i in [0...n]
      out[i] = A[i] & B[i]

Recent x64 processors can do this at a speed of ≈ 0.5 cycles per pair of input 64-bit words (in cache) for n = 1024.
memcpy runs at ≈ 0.3 cycles.

Bitsets can be inefficient
Relatively wasteful to represent {1, 32000, 64000} with a bitset. Would use about 8000 bytes (1000 64-bit words) to store 3 numbers.
So we use compression...
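The arithmetic behind this slide can be spelled out as follows. This is a back-of-the-envelope sketch: the function names are hypothetical, the bitset is assumed to be stored in whole 64-bit words starting at value 0, and the array is assumed to use 4 bytes per value.

```python
def bitset_bytes(max_value):
    """Bytes used by an uncompressed bitset covering 0..max_value, in whole 64-bit words."""
    words = max_value // 64 + 1
    return words * 8


def array_bytes(count, bytes_per_value=4):
    """Bytes used by a plain array of 32-bit integers."""
    return count * bytes_per_value


values = [1, 32000, 64000]
print(bitset_bytes(max(values)))  # cost depends on the largest value
print(array_bytes(len(values)))   # cost depends on the number of values
```

The bitset cost grows with the largest value stored, while the array cost grows with the number of values, which is why sparse sets favor arrays.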

Memory usage example
dataset: census1881_srt

format                          bits per value
hash sets                       200
arrays                          32
bitsets                         900
compressed bitsets (Roaring)    2

https://github.com/RoaringBitmap/CBitmapCompetition

Performance example (unions)
dataset: census1881_srt

format                          CPU cycles per value
hash sets                       200
arrays                          6
bitsets                         30
compressed bitsets (Roaring)    1

https://github.com/RoaringBitmap/CBitmapCompetition

What is happening? (Bitsets)
Bitsets are often best... except if data is very sparse (lots of 0s). Then you spend a lot of time scanning zeros.
- Large memory usage
- Bad performance
Threshold? ~1/100

Hash sets are not always fast
Hash sets have great one-value look-up. But they have poor data locality and non-trivial overhead...

    h1 <- some hash set
    h2 <- some hash set
    ...
    for (x in h1) {
      insert x in h2 // "sure" to hit a new cache line!!!!
    }

Want to kill Swift?
Swift is Apple's new language. Try this:

    var d = Set<Int>()
    for i in 1...size {
      d.insert(i)
    }
    //
    var z = Set<Int>()
    for i in d {
      z.insert(i)
    }

This blows up! Quadratic-time. Same problem with Rust.

What is happening? (Arrays)
Arrays are your friends. Reliable. Simple. Economical.
But... binary search is branchy and has bad locality...

    while (low <= high) {
      int middleIndex = (low + high) >>> 1;
      int middleValue = array.get(middleIndex);
      if (middleValue < ikey) {
        low = middleIndex + 1;   // key is in the upper half
      } else if (middleValue > ikey) {
        high = middleIndex - 1;  // key is in the lower half
      } else {
        return middleIndex;      // found
      }
    }
    return -(low + 1);           // not found: encodes the insertion point

Performance: value lookups (x ∈ S)
dataset: weather_sept_85

format                          CPU cycles per query
hash sets (std::unordered_set)  50
arrays                          900
bitsets                         4
compressed bitsets (Roaring)    80

How do you compress bitsets?
We have long runs of 0s or 1s. Use run-length encoding (RLE).
Example: 000000001111111100 can be coded as 00000000 − 11111111 − 00 or <8><0><8><1><2><0> using the format <number of repetitions><value being repeated>
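The <number of repetitions><value being repeated> format can be sketched on a string of bits. This is a toy illustration only: practical RLE-compressed bitsets operate on machine words rather than individual characters, and the function names here are hypothetical.

```python
from itertools import groupby


def rle_encode(bits):
    """Return (repetitions, value) pairs for a string of '0' and '1' characters."""
    return [(len(list(group)), value) for value, group in groupby(bits)]


def rle_decode(pairs):
    """Rebuild the bit string from (repetitions, value) pairs."""
    return "".join(value * count for count, value in pairs)


encoded = rle_encode("000000001111111100")
print(encoded)
print(rle_decode(encoded))
```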

RLE-compressed bitsets
- Oracle's BBC
- WAH (FastBit)
- EWAH (Git + Apache Hive)
- Concise (Druid)
- …
Further reading: http://githubengineering.com/counting-objects/

Hybrid Model
Decompose the 32-bit space into 16-bit spaces (chunks). Given value x, its chunk index is x ÷ 2^16 (16 most significant bits).
For each chunk, use the best container to store the 16 least significant bits:
- a sorted array ({1,20,144})
- a bitset (0b10000101011)
- a sequence of sorted runs ([0,10],[15,20])
That's Roaring!
Prior work: O'Neil's RIDBit + BitMagic
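The decomposition into chunks amounts to a shift and a mask. A minimal sketch follows, with hypothetical function names and a plain dictionary standing in for the top-level structure:

```python
from collections import defaultdict


def split(x):
    """Split a 32-bit value into (chunk index, 16 least significant bits)."""
    return x >> 16, x & 0xFFFF


def group_by_chunk(values):
    """Map each chunk index to the sorted list of low 16-bit parts it contains."""
    chunks = defaultdict(list)
    for x in sorted(values):
        high, low = split(x)
        chunks[high].append(low)
    return dict(chunks)


print(split(65536 + 5))
print(group_by_chunk([1, 20, 144, 65536, 131073]))
```

Each list of low parts would then be stored in whichever container suits it best.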

Roaring
- All containers fit in 8 kB (several fit in L1 cache)
- Attempts to select the best container as you build the bitmaps
- Calling runOptimize will scan (quickly!) non-run containers and try to convert them to run containers
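One way to picture the container choice is to compare byte costs for a single chunk. The sketch below uses an assumed, simplified cost model (2 bytes per array value, a fixed 8192-byte bitset, and 4 bytes per run plus a 2-byte counter); it is not the library's actual selection logic, and the function names are hypothetical.

```python
BITSET_BYTES = (1 << 16) // 8  # 65536 bits = 8192 bytes


def count_runs(sorted_lows):
    """Number of maximal runs of consecutive values in a sorted list."""
    runs = 0
    previous = None
    for v in sorted_lows:
        if previous is None or v != previous + 1:
            runs += 1
        previous = v
    return runs


def best_container(sorted_lows):
    """Pick the cheapest container under the assumed cost model."""
    costs = {
        "array": 2 * len(sorted_lows),           # one 16-bit value each
        "bitset": BITSET_BYTES,                  # fixed size
        "run": 2 + 4 * count_runs(sorted_lows),  # 16-bit start and length per run
    }
    return min(costs, key=costs.get)


print(best_container([1, 20, 144]))
print(best_container(list(range(0, 11)) + list(range(15, 21))))
print(best_container(list(range(0, 65536, 2))))
```

Under this model the chosen container never exceeds 8192 bytes, because the bitset is always available at that price.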

Performance: union (weather_sept_85)

format      CPU cycles per value
bitsets     0.6
WAH         4
EWAH        2
Concise     5
Roaring     0.6

What helps us...
- All modern processors have fast population-count functions (popcnt) to count the number of 1s in a word.
- Cheap to keep track of the number of values stored in a bitset!
- Choice between array, run and bitset covers many use cases!
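To illustrate why population counts make cardinalities cheap, here is a sketch that counts the values in a bitset word by word. Python has no direct popcnt instruction, so bin(word).count("1") stands in for it (recent Python versions also offer int.bit_count); the function names are hypothetical.

```python
def popcount(word):
    """Number of 1 bits in a non-negative integer."""
    return bin(word).count("1")


def cardinality(words):
    """Number of values stored in a bitset given as a list of words."""
    return sum(popcount(w) for w in words)


def intersection_cardinality(a, b):
    """Size of the intersection without materializing the result."""
    return sum(popcount(x & y) for x, y in zip(a, b))


print(popcount(0b11011))
print(cardinality([27, 1 << 63]))
print(intersection_cardinality([0b1011], [0b1010]))
```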

Go try it out!
- Java, Go, C, C++, C#, Rust, Python... (soon: Swift)
- http://roaringbitmap.org
- Documented interoperable serialized format.
- Free. Well-tested. Benchmarked.
- Peer reviewed:
  - Consistently faster and smaller compressed bitmaps with Roaring. Softw., Pract. Exper. (2016)
  - Better bitmap performance with Roaring bitmaps. Softw., Pract. Exper. (2016)
  - Optimizing Druid with Roaring bitmaps, IDEAS 2016
- Wide community (dozens of contributors).
