Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Alex Petralia - Analyzing Data: What pandas and...

Alex Petralia - Analyzing Data: What pandas and SQL Taught Me About Taking an Average

“So tell me,” my manager said, “what is an average?”

There’s probably nothing worse than that sinking feeling when you finish an analysis, email it to your manager or client to review, and they point out a mistake so basic you can’t even fathom how you missed it.

This talk is about mine: how to take an average.

Averages are something we use everywhere - it’s a simple np.mean() in pandas or AVG() in SQL. But recently I’ve come to appreciate just how easy it is to calculate this statistic incorrectly. We learn once - in middle school no less - how to take an average, and never revisit it. Then, when we are faced with multidimensional datasets (ie. pretty much every dataset out there), we never reconsider whether we should be taking an average the same way.

In this talk, we follow my arduous and humbling journey of learning how to properly take an average with multidimensional data. We will cover how improperly calculating it can produce grossly incorrect figures, which can slip into publications, research analyses and management reports.

https://us.pycon.org/2018/schedule/presentation/84/

PyCon 2018

May 11, 2018
Tweet

More Decks by PyCon 2018

Other Decks in Programming

Transcript

  1. pandas vol = pd.read_sql_table(‘trades’, cnxn) vol = pd.to_datetime(vol[‘Date’]).set_index() vol2018 =

    vol[‘2018’] df = vol2018.groupby(‘Exchange’) \ [‘Volume’].mean() SELECT Exchange, AVG(Volume) FROM `trades` WHERE Date LIKE ‘2018%’ GROUP BY Exchange
  2. pandas vol = pd.read_sql_table(‘trades’, cnxn) vol = pd.to_datetime(vol[‘Date’]).set_index() vol2018 =

    vol[‘2018’] df = vol2018.groupby(‘Exchange’) \ [‘Volume’].mean() SELECT Exchange, AVG(Volume) FROM `trades` WHERE Date LIKE ‘2018%’ GROUP BY Exchange “What is our average daily trading volume for each exchange we trade on?”
  3. x

  4. Outline for this talk 1. One-dimensional data 2. Two-dimensional data

    3. Three-dimensional data 4. The pitfall of multi-dimensional averages 5. The magic formula for taking averages
  5. One-dimensional data Id ApplesSold 1 5 2 4 3 8

    4 10 5 6 SUM(AppleSold) COUNT(*) 5 + 4 + 8 + 10 + 6 5 = 6.6 “What is the average amount of apples sold?”
  6. One-dimensional data Id ApplesSold 1 5 2 4 3 8

    4 10 5 6 SELECT SUM(ApplesSold)/COUNT(ApplesSold) FROM apples SELECT AVG(ApplesSold) FROM apples “What is the average amount of apples sold?”
  7. Two-dimensional data SUM(AppleSold) COUNT(DISTINCT Date) 5 + 4 + 8

    + 10 + 6 3 = 11 Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 “What is the average daily amount of apples sold?”
  8. Two-dimensional data Id Date ApplesSold 1 Monday 5 2 Monday

    4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 SELECT AVG(ApplesSold) FROM apples “What is the average daily amount of apples sold?”
  9. Two-dimensional data Date NumSold Monday 9 Tuesday 8 Thursday 16

    SELECT Date, SUM(ApplesSold) AS NumSold FROM apples GROUP BY Date 3 dates, after aggregating!
  10. SELECT AVG(NumSold) FROM ( SELECT Date, SUM(ApplesSold) AS NumSold FROM

    apples GROUP BY Date ) AS tmp Two-dimensional data AVG(NumSold) 11
  11. Two-dimensional data Id Date ApplesSold 1 Monday 5 2 Monday

    4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 The question matters... ...it determines our denominator! (our row COUNT()) “What is the average daily amount of apples sold?”
  12. Two-dimensional data our level of analysis primary key SELECT AVG(NumSold)

    FROM ( SELECT Date, SUM(ApplesSold) AS NumSold FROM apples GROUP BY Date ) AS tmp “collapsing key” Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6
  13. x

  14. Three-dimensional data Id Date Seller ApplesSold 1 Monday Mary 5

    2 Monday Bob 4 3 Tuesday Bob 8 4 Thursday Jane 10 5 Thursday Jane 6 “What’s the average daily amount of apples sold for each seller?”
  15. Three-dimensional data Id Date Seller ApplesSold 1 Monday Mary 5

    2 Monday Bob 4 3 Tuesday Bob 8 4 Thursday Jane 10 5 Thursday Jane 6 Seller AVG( ApplesSold) Mary ? Bob ? Jane ? “What’s the average daily amount of apples sold for each seller?”
  16. Three-dimensional data Id Date Seller ApplesSold 1 Monday Mary 5

    2 Monday Bob 4 3 Tuesday Bob 8 4 Thursday Jane 10 5 Thursday Jane 6 Seller AVG( ApplesSold) Mary Bob Jane 5 6 16 “What’s the average daily amount of apples sold for each seller?”
  17. SELECT Date, Seller, SUM(ApplesSold) AS total FROM apples GROUP BY

    Date, Seller Three-dimensional data Date Seller total Monday Mary 5 Monday Bob 4 Tuesday Bob 8 Thursday Jane 16 replacing the primary key with a new key relevant to this analysis “collapsing key” “What’s the average daily amount of apples sold for each seller?”
  18. SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS

    total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller Three-dimensional data “collapsing key” Seller AVG(total) Mary 5 Bob 6 Jane 16 “What’s the average daily amount of apples sold for each seller?”
  19. SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS

    total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller Three-dimensional data “collapsing key” Seller AVG(total) Mary 5 Bob 6 Jane 16 “grouping key” “What’s the average daily amount of apples sold for each seller?”
  20. Definitions Collapsing key: the collapsed/aggregated data relevant to this analysis

    - we are “overriding” the primary key (ie. what a table defines as an observation) Grouping key: the key defining a group - eg. “for each Seller” is (Seller), “for each Country and City” is (Country, City)
  21. SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS

    total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller Three-dimensional data “collapsing key” Seller AVG(total) Mary 5 Bob 6 Jane 16 “grouping key” “What’s the average daily amount of apples sold for each seller?”
  22. SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS

    total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller Three-dimensional data “collapsing key” Seller AVG(total) Mary 5 Bob 6 Jane 16 “grouping key” “What’s the average daily amount of apples sold for each seller?” ?
  23. Multi-dimensional data Id Date Seller ApplesSold 1 Monday Mary 5

    2 Monday Bob 4 3 Tuesday Bob 8 4 Thursday Jane 10 5 Thursday Jane 6 “What’s the average daily amount of apples sold for each seller?”
  24. Definitions Collapsing key: the collapsed/aggregated data relevant to this analysis

    - we are “overriding” the primary key (ie. what a table defines as an observation) Grouping key: the key defining a group - eg. “for each Seller” is (Seller), “for each Country and City” is (Country, City) Observation key: a unit of observation for this analysis - eg. “daily average” is (Date), “across regions” is (Region) - this defines how many rows are in the denominator
  25. SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS

    total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller Multi-dimensional data “collapsing key” Seller AVG(total) Mary 5 Bob 6 Jane 16 “grouping key” collapsing key - grouping key = observation key “What’s the average daily amount of apples sold for each seller?”
  26. SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS

    total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller Multi-dimensional data Seller AVG(total) Mary 5 Bob 6 Jane 16 (Date, Seller) - (Seller) = (Date) observation key “per day” is implied
  27. SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS

    total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller Multi-dimensional data “collapsing key” Seller AVG(total) Mary 5 Bob 6 Jane 16 “grouping key” collapsing key - grouping key = observation key “per day” is implied
  28. pd.groupby([‘Date’, ‘Seller’]) \ [‘ApplesSold’].sum() \ .groupby(level=‘Seller’).mean() Multi-dimensional data (pandas) Seller

    AVG(total) Mary 5 Bob 6 Jane 16 (Date, Seller) - (Seller) = (Date) observation key “per day” is implied
  29. Order of operations 1. Observation key: “daily” 2. Grouping key:

    “for each seller” 3. Collapsing key: [observation key] + [grouping key] = (Date, Seller) “What’s the average daily amount of apples sold for each seller?” collapsing key - grouping key = observation key
  30. SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS

    total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller SELECT Date, Seller, SUM(ApplesSold) AS total FROM apples GROUP BY Date, Seller Order of operations Seller AVG(total) Mary 5 Bob 6 Jane 16 observation key “per day” is implied 1 2 collapsing key - grouping key = observation key 3
  31. SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS

    total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller Order of operations Seller AVG(total) Mary 5 Bob 6 Jane 16 (Date, Seller) - (Seller) = (Date) observation key “per day” is implied 2 1 1 2 3 3
  32. x

  33. What the blank 1. What if the collapsing key is

    blank? 2. What if the grouping key is blank? 3. What if both are blank? collapsing key - grouping key = observation key inner query outer query the difference
  34. Two-dimensional data Id Date ApplesSold 1 Monday 5 2 Monday

    4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 SELECT AVG(ApplesSold) FROM apples “What is the average daily amount of apples sold?” implied primary key
  35. Two-dimensional data SELECT AVG(NumSold) FROM ( SELECT Date, SUM(ApplesSold) AS

    NumSold FROM apples GROUP BY Date ) AS tmp “collapsing key” Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 “What is the average daily amount of apples sold?”
  36. If you don’t define a collapsing key, then the primary

    key is automatically your collapsing key inner query
  37. Two-dimensional data Id Date ApplesSold 1 Monday 5 2 Monday

    4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 SELECT AVG(ApplesSold) FROM apples “What is the average daily amount of apples sold?”
  38. SELECT AVG(total) FROM ( SELECT Date, SUM(ApplesSold) AS total FROM

    apples GROUP BY Date ) AS t Two-dimensional data implied “Date” observation key “collapsing key” “What is the average daily amount of apples sold?” Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 (no grouping asked)
  39. SELECT AVG(ApplesSold) FROM apples One-dimensional data Id ApplesSold 1 5

    2 4 3 8 4 10 5 6 implied observation key is simply the primary key “What is the average amount of apples sold?”
  40. Two-dimensional data Id Date ApplesSold 1 Monday 5 2 Monday

    4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 SELECT AVG(ApplesSold) FROM apples implied observation key is simply the primary key “What is the average daily amount of apples sold?”
  41. SELECT AVG(total) FROM ( SELECT Date, SUM(ApplesSold) AS total FROM

    apples GROUP BY Date ) AS t Two-dimensional data implied “Date” observation key “collapsing key” “What is the average daily amount of apples sold?” Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6
  42. x

  43. x

  44. Two-dimensional data 5 + 4 + 8 + 10 +

    6 3 = 11 Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 “What is the average daily amount of apples sold?” SELECT AVG(total) FROM ( SELECT Date, SUM(ApplesSold) AS total FROM apples GROUP BY Date ) AS t
  45. Two-dimensional data 5 + 4 + 8 + 10 +

    6 5 = 6.6 Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 “What is the average daily amount of apples sold?” SELECT AVG(ApplesSold) FROM apples
  46. Two-dimensional data 5 + 4 + 8 + 10 +

    6 5 = 6.6 “What is the average daily amount of apples sold?” 5 + 4 + 8 + 10 + 6 3 = 11 vs.
  47. x

  48. Beware the pitfall of multi-dimensional averages! collapsing key - grouping

    key = observation key inner query outer query the difference