Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Moneyballing: Using Data to Win At Fantasy Foot...

Moneyballing: Using Data to Win At Fantasy Football

An overview of data modeling in Hadoop for entities with complex event histories. Includes and introduction to the supernova schema as well as Exhibit--a collection of Hive user defined functions for creating and querying data stored in a supernova schema.

Juliet Hougland

February 11, 2015
Tweet

More Decks by Juliet Hougland

Other Decks in Technology

Transcript

  1. ‹#› © Cloudera, Inc. All rights reserved. Moneyballing: Using Data

    to Win At Fantasy Football Juliet Hougland @j_houg
  2. ‹#› © Cloudera, Inc. All rights reserved. What is Fantasy

    Football? • Draft players • Play players from your pool each week • Those players score points for you according to some rules =>
  3. ‹#› © Cloudera, Inc. All rights reserved. Two 500lb Gorillas

    in the Room img: [5], [6] 3NF Dimensionally Modeled Data Warehouse
  4. ‹#› © Cloudera, Inc. All rights reserved. • Application backend

    • No data duplication => • Low latency CRUD ops Engineered for Different Workloads • Ease and flexibility in per-key aggregations • Business Intelligence (BI) reporting • Conformed dimensions • Obvious sensible joins 3NF Dimensionally Modeled Data Warehouse img: [7]
  5. ‹#› © Cloudera, Inc. All rights reserved. Two ways to

    initially receive data CSVs 3NF img: [8]
  6. ‹#› © Cloudera, Inc. All rights reserved. • Easy to

    update • Data only gets written once 3ND DB Format
  7. ‹#› © Cloudera, Inc. All rights reserved. • Star Schema

    • Facts are in the center of star • Dimensions are the points • Optimized for ease of aggregate queries • Conformed dimensions Data Warehouse
  8. ‹#› © 2014 Cloudera, Inc. All rights reserved. Schema for

    Football Play Prediction SQL statements on columns make sense, because columns are like per-player tables
  9. ‹#› © Cloudera, Inc. All rights reserved. Exhibit? • Exhibit:

    https://github.com/jwills/exhibit • I heard you like SQL…
  10. ‹#› © Cloudera, Inc. All rights reserved. CREATE TABLE IF

    NOT EXISTS passing_game_sess AS SELECT a.psr as player, b.gid, collect_all(named_struct( 'pid', a.pid, 'yds', a.yds, 'intcpt', cast(e.pid is not null as int), 'td', cast(coalesce(c.pts, 0) as int), 'conv', cast(d.pid is not null as int))) as passes FROM football.pass …other stuff… How do we make a (Hive) Supernova table?
  11. ‹#› © Cloudera, Inc. All rights reserved. DESCRIBE TABLE passing_game_sess;

    What does this table look like? Column Name Data Type player string gid int passes array<struct<pid:int, yds:int, intcpt:int, td:int, conv:int>>
  12. ‹#› © Cloudera, Inc. All rights reserved. CREATE TABLE player_game_points

    STORED AS PARQUET AS SELECT a.player, a.gid, b.*, …other stuff… FROM scoring_game_sessions a LATERAL VIEW within_table(array( "select sum(yds) yds, sum(intcpt) intcpt, sum(td) td, sum(conv) conv from t1", "select (yds/25) - 2*intcpt + 2*conv + 4*td as passing from last"), passes) b …other stuff… How do we query a supernova table?
  13. ‹#› © Cloudera, Inc. All rights reserved. DESCRIBE TABLE player_game_points;

    What does this table look like? Column Name Data Type player string gid int passing int rushing int receiving int
  14. ‹#› © Cloudera, Inc. All rights reserved. What are we

    predicting? • We want to maximize points • … and manage risk • Historic fantasy points are easy to calculate given granular play history • => Regression problem • => Start simple. Predict the average pts scored last year. • What is risk for us? • => Variance in points scored.
  15. ‹#› © Cloudera, Inc. All rights reserved. The predictor •

    E[points | year = 2014] = E[point | year =2013] • I always start with that predictor: • E[X | this interval] = E[X | last interval] • We can define risk as the std deviation of their expected points per game, over the season. • Risk(Player) = Sqrt(Variance(X))
  16. ‹#› © Cloudera, Inc. All rights reserved. How do I

    compute per player predictions? CREATE TABLE player_prediction AS select player_game_points.player AS player, games.seas as seas, sum(passing + rushing + receiving) AS totalPts, stddev_pop(sum(passing +rushing + receiving)) AS stddev, players.pos1 FROM player_game_points LEFT JOIN games ON player_game_points.gid = games.gid LEFT JOIN players ON players.player = player_game_points.player GROUP BY player_game_points.player, players.pos1, games.seas; ! <This exact query doesn’t because I removed some null handling to make it fit on the slide.>
  17. ‹#› © Cloudera, Inc. All rights reserved. So, what do

    our rosters look like? file:///home/juliet/src/footballtimeviz/index.html
  18. ‹#› © Cloudera, Inc. All rights reserved. • Yep, but

    the workflow is modular • I always start with the simplest plausible model • Any other model, and an estimate of it’s variance for each player (bootstrap?) would work • What is your favorite regression algorithm? My model is simple
  19. ‹#› © Cloudera, Inc. All rights reserved. Can I find

    data to justify this? from BeautifulSoup import BeautifulSoup from urllib2 import urlopen import csv ! soup = BeautifulSoup(urlopen('http://www.cbssports.com/nfl/injuries')) ! outputFile = open("injuries.csv", "w") csvwriter = csv.writer(outputFile) ! for table in soup.findAll("table", attrs = {"class" : "data"}): rows = table.findAll("tr") if len(rows) > 0: team_name = rows.pop(0).find("a") for row in rows: cells = [cell.text for cell in row.findAll('td')] if len(cells) == 6: cells += team_name csvwriter.writerow(cells) ! outputFile.close()
  20. ‹#› © Cloudera, Inc. All rights reserved. Photo Credits 1.

    bit.ly/1xCEAGE 2. bit.ly/1DJvZGH 3. bit.ly/1D6wb4E 4. bit.ly/1yRu7Zp 5. bit.ly/1Fg3TEt 6. bit.ly/1zgDLnp 7. bit.ly/1CCmOuI 8. bit.ly/1zYnRmN 9. bit.ly/1uAWEWC 10. bit.ly/17bRg22 11. bit.ly/1AidZEP Find the code (but not the proprietary data) on github: https://github.com/jhlch/fantasy-football
  21. ‹#› © Cloudera, Inc. All rights reserved. 1. Get NFL

    data from armchair analysis 2. Read “Dimensional Modeling Toolkit” by Ralph Kimball 3. ??? 4. Profit Next Steps img: 1