Realtime Systems for Social Data Analysis (RICON East 2013)

Presentation delivered by Hilary Mason at RICON East 2013.

It's one thing to have a lot of data, and another to make it useful. This talk explores the interplay between infrastructure, algorithms, and data necessary to design robust systems that produce useful and measurable insights for realtime data products. We'll walk through several examples and discuss the design metaphors that bitly uses to rapidly develop these kinds of systems.

About Hilary

Hilary is the Chief Scientist at bitly, the URL-shortening and bookmarking service, where she makes beautiful things with data. She is a former computer science professor with a background in machine learning and data mining. A native New Yorker, Hilary was appointed to Mayor Bloomberg’s Technology and Innovation Advisory Council. She also co-founded HackNY, created dataists, and is a member of NYCResistor.

Basho Technologies

May 14, 2013

Transcript

  1. {"a": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31", "c": "US", "nk": 0, "tz": "America/Chicago", "gr": "TX", "g": "126F3CN", "i": "xx.xxx.xxx.xxx", "h": "126F3CM", "k": "xxxxxx-xxxxx-xxxxx-xxxxxxx", "l": "raycom", "al": "en-US,en;q=0.8", "hh": "bit.ly", "r": "https://www.facebook.com/", "u": "http://www.kltv.com/story/22237743/document-sheds-light-on-events-leading-up-to-longview-standoff?utm_content=bufferf4a8d&utm_source=buffer&utm_medium=facebook&utm_campaign=Buffer", "t": 1368478799, "hc": 1368476067, "cy": "Longview", "ll": [32.500701904296875, -94.740501403808594]}
  2. 10s of millions of URLs per day
     100s of millions of clicks per day
     10s of billions of URLs
  3. 51

  4. Data engineering is when the architecture of your system is dependent on characteristics of the data flowing through that system.
  5. 1. Research offline
     2. Do fancy math – find the shortcuts
     3. Design infrastructure
     4. Re-design to run at scale and speed
  6. use an entropy calculation!

         import numpy as np

         def ghash2lang(g, R, min_count=3, max_entropy=0.2):
             """
             returns the majority vote of a language for a given hash
             """
             # R is a Redis connection; g keys a sorted set of language -> count
             lang = R.zrevrange(g, 0, 0)[0]
             # let's calculate the entropy!
             # possible languages
             x = R.zrange(g, 0, -1)
             # distribution over those languages
             p = np.array([R.zscore(g, langi) for langi in x])
             p /= p.sum()
             # info content
             I = [pi * np.log(pi) for pi in p]
             # entropy: smaller the more certain we are! - i.e. the lower our surprise
             H = -sum(I) / len(I)  # in nats!
             # note that this will give a perfect zero for a single count in one language
             # or for 5K counts in one language. So we also need the count..
             count = R.zscore(g, lang)
             # trust the vote only with enough counts and low entropy
             if count >= min_count and H <= max_entropy:
                 return lang, count
             else:
                 return None, 1
  7. {"ck": 1, "gr": "X4", "al": "en-US,en;q=0.8", "topic": "Sports", "cy": "Bargoed", "hc": 1368535661.0000002, "ovi": {"count": 124.0, "proba": [0.935458874, 0.064541125]}, "hh": "mirr.im", "a": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31", "c": "GB", "nk": 1, "tz": "Europe/London", "g": "16a3eXk", "i": "xxx.xxx.xxx.xxx", "h": "16a3eXi", "k": "xxxxxxxx-xxxxx-xxxxxx-xxxxxxx", "l": "dailymirror", "p": "fans", "r": "http://t.co/hSpdnJzMIh", "u": "http://www.mirror.co.uk/sport/football/news/picture-special-david-beckhams-paris-1888664?utm_source=twitterfeed&utm_medium=twitter", "t": 1368536389.0, "ll": [51.683300018, -3.23329997]}
  8. ‘Realtime’ Search
     • built on Zoie (Solr plugin)
     • only keeps documents in the index if they have been clicked* in the previous 24 hours
  9. Choosing is important. It must be interpretable, and smooth (but not too smooth). We use a distribution for that is a function that sums to 1. The function is 0 at the origin. Dragoneye
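
The click records on slides 1 and 7 are plain JSON objects, one per click. A minimal sketch of decoding one is below. The slides do not explain the field names, so the meanings used here are assumptions inferred from the values: `c` as country code, `cy` as city, `g` as a link hash, `t` as the click timestamp, `hc` as the hash creation time, and `ll` as latitude and longitude. `summarize_click` is a hypothetical helper, not part of any bitly tooling.

```python
import json
from datetime import datetime, timezone

# Abbreviated copy of the record on slide 1 (most fields omitted).
RAW = (
    '{"c": "US", "tz": "America/Chicago", "g": "126F3CN", '
    '"t": 1368478799, "hc": 1368476067, "cy": "Longview", '
    '"ll": [32.500701904296875, -94.740501403808594]}'
)

def summarize_click(raw):
    """Decode one click record into a small, readable dict.

    Field meanings are assumptions; see the note above the code.
    """
    rec = json.loads(raw)
    lat, lon = rec["ll"]
    return {
        "hash": rec["g"],
        "country": rec["c"],
        "city": rec["cy"],
        "clicked_at": datetime.fromtimestamp(rec["t"], tz=timezone.utc),
        # Gap between the two timestamps carried by the record.
        "seconds_since_created": rec["t"] - rec["hc"],
        "lat": lat,
        "lon": lon,
    }

summary = summarize_click(RAW)
print(summary["city"], summary["seconds_since_created"])
```

Under that reading of `t` and `hc`, the click on slide 1 arrived 2732 seconds after the link was created.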
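
The point made in the comments on slide 6, that entropy alone cannot tell one vote from five thousand, can be checked without Redis. The sketch below works on an in-memory dict of language counts and is not the slide's code: it uses the standard Shannon entropy in nats (without the slide's division by the number of languages), and it assumes the intended rule is to accept the top language only when it has at least `min_count` votes and the entropy is at most `max_entropy`. The names `entropy_nats` and `majority_language` are hypothetical.

```python
import math

def entropy_nats(counts):
    """Shannon entropy, in nats, of a collection of non-negative counts."""
    counts = list(counts)
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

def majority_language(lang_counts, min_count=3, max_entropy=0.2):
    """Return the top language if the vote is both large and lopsided enough."""
    if not lang_counts:
        return None
    lang, count = max(lang_counts.items(), key=lambda kv: kv[1])
    h = entropy_nats(lang_counts.values())
    # Entropy is zero whenever only one language was seen,
    # so the raw count has to be checked as well.
    if count >= min_count and h <= max_entropy:
        return lang
    return None

print(majority_language({"en": 1}))            # one vote: zero entropy, too few counts
print(majority_language({"en": 5000}))         # many votes, zero entropy
print(majority_language({"en": 50, "es": 50})) # evenly split: high entropy
```

The first and third calls return `None` and the second returns `"en"`: a single vote fails the count check, and an even split fails the entropy check.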
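
Slide 8's retention rule, keeping a document only while it has been clicked in the previous 24 hours, can be illustrated with a small in-memory structure. This is a sketch of the rule only and says nothing about how Zoie or bitly's search actually implement it; `RecentClickIndex` and its methods are hypothetical names, and timestamps are assumed to be Unix seconds.

```python
WINDOW_SECONDS = 24 * 60 * 60  # the 24-hour window from slide 8

class RecentClickIndex:
    """Tracks documents that were clicked within a sliding time window."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.last_click = {}  # doc_id -> timestamp of the most recent click

    def record_click(self, doc_id, ts):
        # Keep the latest timestamp even if clicks arrive out of order.
        prev = self.last_click.get(doc_id, ts)
        self.last_click[doc_id] = max(prev, ts)

    def expire(self, now):
        """Drop documents whose last click is older than the window."""
        cutoff = now - self.window
        stale = [d for d, ts in self.last_click.items() if ts < cutoff]
        for d in stale:
            del self.last_click[d]
        return stale

    def __contains__(self, doc_id):
        return doc_id in self.last_click

index = RecentClickIndex()
index.record_click("a", 0)
index.record_click("b", 50_000)
print(index.expire(now=90_000))  # "a" was last clicked more than 24 hours before now
```

A fresh click on an expired document simply re-adds it, so the set of searchable documents follows what people are clicking on.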