Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Scalable Uniques in Postgres with HLL

Scalable Uniques in Postgres with HLL

Avatar for Craig Kerstiens

Craig Kerstiens

September 18, 2013
Tweet

More Decks by Craig Kerstiens

Other Decks in Technology

Transcript

  1. Truviso • Extended Postgres to do streaming • Various markets

    • Ad space • Wanted unique impressions • Sort of wanted unique impressions
  2. HyperLogLog • KMV - K minimum value • Bit observable

    patterns • Stochastic averaging • Harmonic averaging
  3. HyperLogLog • KMV - K minimum value • Bit observable

    patterns • Stochastic averaging • Harmonic averaging
  4. HyperLogLog • KMV - K minimum value • Bit observable

    patterns • Stochastic averaging • Harmonic averaging • Implemented by Aggregate Knowledge
  5. Use cases • Semi distinct count • Think pg_stat_statements •

    Ad networks • Web traffic • With rollups/groupings
  6. Digging in CREATE  EXTENSION  hll;    CREATE  TABLE  helloworld  (

               id        integer,            set      hll    );
  7. Digging in CREATE  EXTENSION  hll;    CREATE  TABLE  helloworld  (

               id        integer,            set      hll    );
  8. Inserting data UPDATE  helloworld   SET  set  =  hll_add(set,  hll_hash_integer(12345))

      WHERE  id  =  1; UPDATE  helloworld   SET  set  =  hll_add(set,  hll_hash_text('hello  world'))   WHERE  id  =  1;
  9. Real world CREATE  TABLE  daily_uniques  (        date

                           date  UNIQUE,        users                      hll );
  10. Real world SELECT              

     EXTRACT(MONTH  FROM  date)  AS  month,                hll_cardinality(hll_union_agg(users)) FROM  daily_uniques WHERE  date  >=  '2012-­‐01-­‐01'  AND            date  <    '2013-­‐01-­‐01' GROUP  BY  1;
  11. Real world SELECT              

     EXTRACT(MONTH  FROM  date)  AS  month,                hll_cardinality(hll_union_agg(users)) FROM  daily_uniques WHERE  date  >=  '2012-­‐01-­‐01'  AND            date  <    '2013-­‐01-­‐01' GROUP  BY  1;
  12. Good practices • It uses update • Do as a

    batch in most cases • Tweak the config
  13. Tuning Parameters • log2m - log base 2 of registers

    • Between 4 and 17 • Each 1 increase doubles storage
  14. Tuning Parameters • log2m - log base 2 of registers

    • Between 4 and 17 • Each 1 increase doubles storage • regwidth - bits per register
  15. Tuning Parameters • log2m - log base 2 of registers

    • Between 4 and 17 • Each 1 increase doubles storage • regwidth - bits per register • expthresh - threshold for explicit vs sparse
  16. Tuning Parameters • log2m - log base 2 of registers

    • Between 4 and 17 • Each 1 increase doubles storage • regwidth - bits per register • expthresh - threshold for explicit vs sparse • spareson - on/off for sparse