Upgrade to Pro — share decks privately, control downloads, hide ads and more …

確率的データ構造を Java で扱いたい! #JJUG

確率的データ構造を Java で扱いたい! #JJUG

JJUG ナイト・セミナー 「ビール片手にLT&納涼会 2017」 の発表資料です。
https://jjug.doorkeeper.jp/events/63719

KOMIYA Atsushi

August 23, 2017
Tweet

More Decks by KOMIYA Atsushi

Other Decks in Programming

Transcript

  1. ֬཰తσʔλߏ଄ͷ Java ࣮૷ • stream-lib ‘com.addthis:stream-lib’ • Membership query /

    cardinality estimation / frequency counting / quantile estimation • Google Guava ‘com.google.guava:guava’ • Membership query • java-hll ‘net.agkn:hll’ • Cardinality estimation • t-digest ‘com.tdunning:t-digest’ • Quantile estimation
  2. ֬཰తσʔλߏ଄ͷ Java ࣮૷ • stream-lib ‘com.addthis:stream-lib’ • Membership query /

    cardinality estimation / frequency counting / quantile estimation • Google Guava ‘com.google.guava:guava’ • Membership query • java-hll ‘net.agkn:hll’ • Cardinality estimation • t-digest ‘com.tdunning:t-digest’ • Quantile estimation
  3. Bloom filter • ֬཰తʹؒҧͬͨ౴͑ʢଘ൱݁ՌʣΛฦ͢ • ِཅੑ (ଘࡏ͠ͳ͍΋ͷΛଘࡏ͢Δͱޡೝ͢ Δࣄ৅) ͸ੜ͡Δ͕ɺِӄੑ͸ੜ͡ͳ͍ •

    ʮ૝ఆ͞ΕΔཁૉͷछྨ਺ʯ΍ʮڐ༰Ͱ͖Δِ ཅੑͷ֬཰ʯΛࢦఆͯ͠ɺώʔϓ࢖༻ྔΛ੍ޚ Ͱ͖Δ • ཁૉͷ௥Ճ͸Ͱ͖Δ͕ɺ࡟আ͸೉͍͠
  4. ώʔϓ࢖༻ྔΛ֬ೝͯ͠ΈΔ • “Lorem ipsum” ͷςΩετΛྫʹɺJOL (Java Object Layout) Ͱώʔϓ࢖༻ྔΛଌఆ •

    http://openjdk.java.net/projects/code- tools/jol/ • Set: 6,032 bytes • stream-lib BloomFilter: 136 bytes 97.8% smaller !
  5. HyperLogLog++ (1/2) • ҟͳΓ਺Λਪఆ͢Δσʔλߏ଄ • ಘΒΕΔਪఆ஋͸ɺຊདྷͷҟͳΓ਺ʹର্ͯ͠ৼΕɾԼৼ Εͱ΋ʹى͜Γ͏Δ • Redshift /

    BigQuery / Presto ͳͲͰ΋ɺCOUNT(DISTINCT x) Λۙࣅ͢Δखஈͱͯ͠࢖ΘΕ͍ͯΔ • https://aws.amazon.com/jp/about-aws/whats-new/ 2013/11/11/amazon-redshift-new-performance-data- loading-security-features/ • https://cloud.google.com/blog/big-data/2017/07/ counting-uniques-faster-in-bigquery-with-hyperloglog
  6. HyperLogLog++ (2/2) • ʮਪఆ஋ͷਫ਼౓ pʯΛௐ੔͢Δ͜ͱͰɺώʔϓ࢖༻ྔΛ੍ ޚ͢Δ͜ͱ͕Ͱ͖Δ • ஋Λେ͖͘͢Δͱਫ਼౓͕ߴ͘ͳΔ & ۭؒޮ཰͸ѱԽ͢Δ

    • ૝ఆ͞ΕΔҟͳΓ਺΍ඞཁͱ͞ΕΔਫ਼౓ɺώʔϓͷ੍໿ Λߟྀͯ͠ p Λܾఆ͢Δ • HyperLogLog ͷ࢓૊ΈΛཧղ͢Δʹ͸ɺҎԼͷϒϩάΤϯ τϦ͕͓͢͢Ί • http://blog.brainpad.co.jp/entry/2016/06/27/110000
  7. Count-min sketch (2/2) • width ͱ depth ͷೋͭͷύϥϝʔλͰɺۭؒ ޮ཰΍ਫ਼౓Λ੍ޚ͢Δ •

    width * depth ͷݸ਺ͷΧ΢ϯλ͕࡞ΒΕΔ • Χ΢ϯλ͸ 2࣍ݩ഑ྻͰදݱ • depth ͷ਺͚ͩϋογϡؔ਺͕࣮ߦ͞ΕΔͷ Ͱɺ଎౓తͳύϑΥʔϚϯεʹӨڹΛ༩͑Δ
  8. stream-lib ͷ Count-min sketch width:10 * depth:30 ͷΧ΢ϯλʹΑΔ Count-Min sketch

    Λ༻ҙ͢Δ CountMinSketch#add(String, int) ͰΧ΢ϯτ͍ͯ͘͠
  9. ·ͱΊ • ֬཰తσʔλߏ଄Λ༻͍Δ͜ͱͰɺେن໛σʔλॲཧ΍ ΦϯϥΠϯॲཧΛޮ཰తʹ࣮ݱͰ͖Δʢ͔΋ʣ • Java Ͱ֬཰తσʔλߏ଄Λ͓खܰʹѻ͍͍ͨͳΒɺ
 ·ͣ͸stream-lib ͷར༻Λݕ౼ͯ͠ΈΔ •

    ਪఆਫ਼౓ͱۭؒޮ཰ͷτϨʔυΦϑΛ੍ޚ͢Δ
 ύϥϝʔλͷௐ੔͸ɺ৬ਓܳʹͳΓ͕ͪ • JOL ΍ JMH Λ༻͍ͯɺ࣮ࡍͷۭؒޮ཰ͱ࣌ؒޮ཰Λ ͖ͪΜͱଌఆ͠ͳ͕Βௐ੔͢Δ͜ͱΛ͓͢͢Ί͍ͨ͠