Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Not Exactly! Approximate Algorithms for Big Data

Druid
October 28, 2013

Not Exactly! Approximate Algorithms for Big Data

Strata + Hadoop World, NYC, Oct. 28. 2013

Druid

October 28, 2013
Tweet

More Decks by Druid

Other Decks in Technology

Transcript

  1. THE PROBLEM MANAGE DATA COST EFFICIENTLY THE DATA DEALING WITH

    EVENT STREAMS SIMPLIFYING STORAGE DATA SUMMARIZATION FINDING UNIQUES HYPERLOGLOG ESTIMATING DISTRIBUTION APPROXIMATE HISTOGRAMS OVERVIEW
  2. Fangjin Yang & Nelson Ray 2013 PROBLEMS ‣ Storing/processing billions

    of rows is expensive ‣ Reduce storage, improve performance ‣ Reduce storage by throwing away information ‣ Throwing away information reduces accuracy
  3. Fangjin Yang & Nelson Ray 2013 THE DATA Timestamp Bid

    Price 2013-10-28T02:13:43Z 1.19 2013-10-28T02:14:21Z 0.05 2013-10-28T02:55:32Z 1.04 2013-10-28T03:07:28Z 0.16 2013-10-28T03:13:43Z 1.03 2013-10-28T04:18:19Z 0.15 2013-10-28T05:36:34Z 0.01 2013-10-28T05:37:59Z 1.03
  4. Fangjin Yang & Nelson Ray 2013 DATA SUMMARIZATION Timestamp Revenue

    Number of Prices 2013-10-28T02 2.28 3 2013-10-28T03 1.19 2 2013-10-28T04 0.15 1 2013-10-28T05 1.04 2 Timestamp Bid Price 2013-10-28T02:13:43Z 1.19 2013-10-28T02:14:21Z 0.05 2013-10-28T02:55:32Z 1.04 2013-10-28T03:07:28Z 0.16 2013-10-28T03:13:43Z 1.03 2013-10-28T04:18:19Z 0.15 2013-10-28T05:36:34Z 0.01 2013-10-28T05:37:59Z 1.03
  5. Fangjin Yang & Nelson Ray 2013 COMBINING SUMMARIZATIONS Timestamp Revenue

    Number of Prices 2013-10-28T02 2.28 3 2013-10-28T03 1.19 2 2013-10-28T04 0.15 1 2013-10-28T05 1.04 2 Timestamp Revenue Number of Prices 2013-10-28 4.66 8
  6. Fangjin Yang & Nelson Ray 2013 ‣ Throw away information

    about individual events ‣ Drastically reduce storage and improve query speed • on average, 40x reduction in storage on with our own data ‣ We’ve lost info about individual prices ‣ Data summarization is not always trivial SUMMARIZATION SUMMARY
  7. Fangjin Yang & Nelson Ray 2013 ‣ Problem: determine unique

    number of elements in a set ‣ Use case: measuring number of unique users CASE STUDY 1 DATA BIG DATA
  8. Fangjin Yang & Nelson Ray 2013 ‣ Store every single

    username (in a Java HashSet) ‣ No loss of information, no accuracy tradeoff EXACT SOLUTION
  9. Fangjin Yang & Nelson Ray 2013 HASHSET Timestamp Username 2013-10-28T02:13:43Z

    user1 2013-10-28T02:14:21Z user2 2013-10-28T02:55:32Z user1 2013-10-28T03:07:28Z user4 2013-10-28T03:13:43Z user97 2013-10-28T04:18:19Z user2 2013-10-28T05:36:34Z user9834 2013-10-28T05:37:59Z user97 Timestamp Usernames 2013-10-28T02 {user1, user2} 2013-10-28T03 {user4, user97} 2013-10-28T04 {user2} 2013-10-28T05 {user9834, user97}
  10. Fangjin Yang & Nelson Ray 2013 HASHSET Timestamp Usernames 2013-10-28

    {user1, user2, user4, user97, user9834} Timestamp Usernames 2013-10-28T02 {user1, user2} 2013-10-28T03 {user4, user97} 2013-10-28T04 {user2} 2013-10-28T05 {user9834, user97}
  11. Fangjin Yang & Nelson Ray 2013 ‣ Storage/Computation: O(# uniques)

    ‣ We’re not throwing away any information about usernames ‣ Accuracy: 100% EXACT SOLUTION
  12. Fangjin Yang & Nelson Ray 2013 ‣ High cardinality user

    dimensions == infeasible storage • Storage cost for 10^9 unique elements == ~48GB of storage INFEASIBLE STORAGE
  13. Fangjin Yang & Nelson Ray 2013 ‣ Plenty of literature

    • Linear Counting • Count-Min Sketch • Bloom Filters • LogLog CARDINALITY ESTIMATION
  14. Fangjin Yang & Nelson Ray 2013 ‣ Storage: 1.5 KB

    ( for cardinalities 10^9 and above) • 99.999997% decrease in storage size ‣ Computation: O(1) (for cardinalities < ~10^10) ‣ Accuracy: 97% HYPERLOGLOG
  15. Fangjin Yang & Nelson Ray 2013 ‣ Instead of storing

    all the data, let’s store a “sketch” of the data that represents some result that we care about ‣ Analogy: Imagine we wanted to know how many times we flipped a coin • ~50 % heads/tails • We could store the result of every coin flip as it occurs (HHTTTHTHHT) • Or we could just store the number of times heads appeared as we ingest data and use the magic of probability HYPERLOGLOG
  16. Fangjin Yang & Nelson Ray 2013 HYPERLOGLOG ‣ Maintain a

    series of buckets ‣ Each bucket is storing a number ‣ Each time we see a user, we only update a bucket value if a specific phenomenon is seen ‣ The phenomenon we care about is based on how bits are distributed when we hash a username ‣ We are looking for the position of the first ‘1’ bit ‣ Update a bucket if this position is greater than the existing value
  17. Fangjin Yang & Nelson Ray 2013 HYPERLOGLOG HashFn 01xxx...x user1

    Buckets 2 2 2 1 HashFn 01xxx...x user4 HashFn 01xxx...x user12 HashFn 1xxxx...x user7
  18. Fangjin Yang & Nelson Ray 2013 HYPERLOGLOG Timestamp Buckets 2013-10-28T02

    [3, 2, 2, 1] 2013-10-28T03 [1, 2, 1, 2] 2013-10-28T04 [2, 1, 4, 1] 2013-10-28T05 [2, 2, 3, 1]
  19. Fangjin Yang & Nelson Ray 2013 ‣ Problem: determine distribution

    of values ‣ Use case: quantiles and histograms ‣ Hourly truncation CASE STUDY 2
  20. Fangjin Yang & Nelson Ray 2013 THE DATA Timestamp Bid

    Price 2013-10-28T02:13:43Z 1.19 2013-10-28T02:14:21Z 0.05 2013-10-28T02:55:32Z 1.04 2013-10-28T03:07:28Z 0.16 2013-10-28T03:13:43Z 1.03 2013-10-28T04:18:19Z 0.15 2013-10-28T05:36:34Z 0.01 2013-10-28T05:37:59Z 1.03
  21. Fangjin Yang & Nelson Ray 2013 EXACT SOLUTION Timestamp Bid

    Price 2013-10-28T02:13:43Z 1.19 2013-10-28T02:14:21Z 0.05 2013-10-28T02:55:32Z 1.04 2013-10-28T03:07:28Z 0.16 2013-10-28T03:13:43Z 1.03 2013-10-28T04:18:19Z 0.15 2013-10-28T05:36:34Z 0.01 2013-10-28T05:37:59Z 1.03 Timestamp Bid Prices 2013-10-28T02 [1.19, 0.05, 1.04] 2013-10-28T03 [0.16, 1.03] 2013-10-28T04 [0.15] 2013-10-28T05 [0.01, 1.03]
  22. Fangjin Yang & Nelson Ray 2013 EXACT SOLUTION Timestamp Bid

    Prices 2013-10-28 [1.19, 0.05, 1.04, 0.16, 1.03, 0.15, 0.01, 1.03] Timestamp Bid Prices 2013-10-28T02 [1.19, 0.05, 1.04] 2013-10-28T03 [0.16, 1.03] 2013-10-28T04 [0.15] 2013-10-28T05 [0.01, 1.03]
  23. Fangjin Yang & Nelson Ray 2013 ‣ Arrays of values

    ‣ Storage: Linear ‣ Computation: Linear ‣ Accuracy: 100% ‣ Problem: Storing raw values can often be more expensive than storing the rest of the row. ‣ Solution: Store an approximate representation! EXACT SOLUTION
  24. Fangjin Yang & Nelson Ray 2013 ‣ “A Streaming Parallel

    Decision Tree Algorithm” ‣ Yael Ben-Haim & Elad Tom-Tov ‣ Storage: Sublinear/Linear ‣ Computation: Sublinear/Linear ‣ Accuracy: pretty good APPROXIMATE HISTOGRAMS
  25. Fangjin Yang & Nelson Ray 2013 RAW DATA • 40

    Prices: 3.46, 5.37, 5.62, 5.87, 6.21, 6.79, 7.11, 7.36, 7.55, 7.64, 7.89, 7.9, 8.07, 8.44, 8.62, 8.78, 8.87, 9.03, 9.24, 9.36, 9.58, 9.59, 9.81, 10.31, 10.35, 10.39, 10.47, 10.77, 10.93, 11.04, 11.1, 13.1, 13.27, 13.29, 13.87, 14.29, 14.51, 14.9, 15.75, 17.07
  26. Fangjin Yang & Nelson Ray 2013 • 100 cc2.8xlarge (1600

    cores, 6TB RAM) Druid cluster • 27B summarized rows/s scan rate • Add 16B summarized (~640B raw) rows/s • Combine 4B HyperLogLog objects/s • Combine 1.5B ApproximateHistogram objects/s BENCHMARKS
  27. Fangjin Yang & Nelson Ray 2013 • Summarization for sums:

    substantially (e.g. ~40x for us) faster/less storage • 100% accuracy • Sketches for cardinality/distribution: 1-2 orders of magnitude faster/ less storage than raw • 97% accuracy • 40x lower costs is make or break • interactive queries that are accurate enough CONCLUSIONS
  28. MORE INFORMATION? OFFICE HOURS time Tuesday, Oct. 29, 3:25pm place

    Third Floor Foyer, Table E DRINKS time Monday, Oct. 28, 6:00pm place Old Castle Pub & Restaurant
  29. Fangjin Yang & Nelson Ray 2013 • Eric Tschetter •

    Xavier Léauté • Gian Merlino • Aggregate Knowledge Blog • High Scalability ACKNOWLEDGEMENTS
  30. Fangjin Yang & Nelson Ray 2013 ‣ “HyperLogLog: the analysis

    of a near-optimal cardinality estimation algorithm” • Flajolet et al. ‣ “A Streaming Parallel Decision Tree Algorithm” • Yael Ben-Haim & Elad Tom-Tov ‣ http://metamarkets.com/2012/fast-cheap-and-98-right- cardinality-estimation-for-big-data/ ‣ http://metamarkets.com/2013/histograms/ REFERENCES
  31. Fangjin Yang & Nelson Ray 2013 HYPERLOGLOG HashFn 00x...x user1

    HashFn 10x...x user2 HashFn 01x...x user3 HashFn 11x...x user4
  32. Fangjin Yang & Nelson Ray 2013 ‣ 50% of hashed

    values will look like this: 1xxxxx…x ‣ 25% of hashed values will look like this: 01xxxx…x ‣ 12.5% of hashed values will look like this: 001xxx…x ‣ 6.25% of hashed values will look like this: 0001xx…x HYPERLOGLOG
  33. Fangjin Yang & Nelson Ray 2013 ‣ Invert this logic

    • If highest index of ‘1’ is 2, we saw 4 unique values • If highest index of ‘1’ is 4, we saw 16 unique values ‣ Use the highest index of ‘1’ to determine cardinality ‣ For better accuracy, the highest index of ‘1’ is stored in a series of buckets HYPERLOGLOG