Statistics for Hackers

(Presented at PyCon 2016. Early version presented at StitchFix, Sept 2015. See the PyCon video at https://www.youtube.com/watch?v=Iq9DzN6mvYA)

The field of statistics has a reputation for being difficult to crack: it revolves around a seemingly endless jargon of distributions, test statistics, confidence intervals, p-values, and more, with each concept subject to its own subtle assumptions. But it doesn't have to be this way: today we have access to computers that Neyman and Pearson could only dream of, and many of the conceptual challenges in the field can be overcome through judicious use of these CPU cycles. In this talk I'll discuss how you can use your coding skills to "hack statistics" – to replace some of the theory and jargon with intuitive computational approaches such as sampling, shuffling, cross-validation, and Bayesian methods – and show that with a grasp of just a few fundamental concepts, if you can write a for-loop you can do statistical analysis.

Jake VanderPlas

May 31, 2016

Transcript

  1. Jake VanderPlas
    PyCon 2016

  2. < About Me >
    - Astronomer by training
    - Statistician by accident
    - Active in Python science & open source
    - Data Scientist at UW eScience Institute
    - @jakevdp on Twitter & Github

  4. Hacker (n.)
    1. A person who is trying to steal
    your grandma’s bank password.
    2. A person whose natural approach
    to problem-solving involves
    writing code.

  6. Statistics is Hard.
    Using programming skills,
    it can be easy.

  7. My thesis today:
    If you can write a for-loop,
    you can do statistics

  8. Statistics is fundamentally about
    Asking the Right Question.

  9. – Dr. Seuss (attr)

  10. Warm-up: Coin Toss
    You toss a coin 30 times
    and see 22 heads.
    Is it a fair coin?

  11. One view: “A fair coin should
    show 15 heads in 30 tosses.
    This coin is biased.”
    The Skeptic: “Even a fair coin
    could show 22 heads in 30
    tosses. It might be just chance.”

  12. Classic Method:
    Assume the Skeptic is correct:
    test the Null Hypothesis.
    What is the probability of a fair
    coin showing 22 heads simply
    by chance?

  13. Classic Method:
    Start computing probabilities . . .

  15. Classic Method:
    The probability of getting N_H heads and N_T tails
    in N = N_H + N_T tosses of a fair coin:

        P = \binom{N}{N_H} \left(\frac{1}{2}\right)^{N_H} \left(\frac{1}{2}\right)^{N_T}

    (the binomial coefficient counts the number of arrangements;
    the other two factors are the probability of N_H heads
    and the probability of N_T tails)

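    (The intermediate computation appears in the deck only as images;
    summing the binomial tail for 22 or more heads out of 30 gives

        p = \sum_{k=22}^{30} \binom{30}{k} \left(\frac{1}{2}\right)^{30} \approx 0.008

    i.e. about 0.8%.)
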
  19. Classic Method:
    0.8 %
    A 0.8% probability (i.e. p = 0.008) of observations
    this extreme given a fair coin.
    → reject fair coin hypothesis at p < 0.05

  20. Could there be
    an easier way?

  21. Easier Method:
    Just simulate it!

    from numpy.random import randint

    M = 0
    for i in range(10000):
        trials = randint(2, size=30)
        if trials.sum() >= 22:
            M += 1
    p = M / 10000  # 0.008149

    → reject fair coin at p = 0.008

  23. In general . . .
    Computing the Sampling
    Distribution is Hard.
    Simulating the Sampling
    Distribution is Easy.

  24. Four Recipes for
    Hacking Statistics:
    1. Direct Simulation
    2. Shuffling
    3. Bootstrapping
    4. Cross Validation

  25. Sneetches: Stars and Intelligence
    Now, the Star-Belly Sneetches
    had bellies with stars.
    The Plain-Belly Sneetches
    had none upon thars . . .
    *inspired by John Rauser’s
    Statistics Without All The Agonizing Pain

  26. Sneetches: Stars and Intelligence
    Test Scores
    ★ (8 students): 84 72 57 46 63 76 99 91
    ❌ (12 students): 81 69 74 61 56 87 69 65 66 44 62 69
    ★ mean: 73.5
    ❌ mean: 66.9
    difference: 6.6

  27. ★ mean: 73.5
    ❌ mean: 66.9
    difference: 6.6
    Is this difference of 6.6
    statistically significant?

  28. Classic Method
    (Welch’s t-test)
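    (The formula itself appears in the deck only as an image; for
    reference, Welch’s t statistic for two groups with means
    \bar{x}_1, \bar{x}_2, variances s_1^2, s_2^2, and sizes n_1, n_2 is

        t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} )
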
  31. Classic Method
    (Student’s t distribution)
    Degree of Freedom: “The number of independent
    ways by which a dynamic system can move,
    without violating any constraint imposed on it.”
    -Wikipedia

  33. Classic Method
    (Welch–Satterthwaite equation)
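    (Likewise an image in the deck; the standard Welch–Satterthwaite
    estimate of the effective degrees of freedom is

        \nu \approx \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}
                         {\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}} )
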
  37. Classic Method
    1.7959

  41. “The difference of 6.6 is not
    significant at the p=0.05 level”

  42. The biggest problem:
    We’ve entirely lost track
    of what question we’re
    answering!

  44. < One popular alternative . . . >
    “Why don’t you just . . .”

    from statsmodels.stats.weightstats import ttest_ind

    t, p, dof = ttest_ind(group1, group2,
                          alternative='larger',
                          usevar='unequal')
    print(p)  # 0.186

    . . . But what question is
    this answering?

  45. Stepping Back...
    The deep meaning lies in the
    sampling distribution:
    Same principle as the
    coin example: 0.8 %

  46. Let’s use a sampling
    method instead

  48. The Problem:
    Unlike coin flipping, we don’t
    have a generative model . . .
    Solution:
    Shuffling

  49. Idea:
    Simulate the distribution by
    shuffling the labels repeatedly
    and computing the desired statistic.
    Motivation:
    if the labels really don’t matter,
    then switching them shouldn’t
    change the result!
    ★ (8): 84 72 57 46 63 76 99 91
    ❌ (12): 81 69 74 61 56 87 69 65 66 44 62 69

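    A minimal sketch of this shuffle test in Python (the code is mine,
    not from the deck; the scores are those of slide 26, and the result
    matches the 16% shown a few slides below):

    import numpy as np

    star = np.array([84, 72, 57, 46, 63, 76, 99, 91])
    cross = np.array([81, 69, 74, 61, 56, 87, 69, 65, 66, 44, 62, 69])
    observed = star.mean() - cross.mean()        # 6.6

    scores = np.concatenate([star, cross])
    count = 0
    for i in range(10000):
        np.random.shuffle(scores)                # 1. shuffle the labels
        diff = scores[:8].mean() - scores[8:].mean()   # 2-3. rearrange & compute means
        if diff >= observed:                     # at least as extreme as observed
            count += 1
    print(count / 10000)                         # ≈ 0.16
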
  50. 1. Shuffle Labels
    2. Rearrange
    3. Compute means
    ★: 84 72 57 46 63 76 99 91
    ❌: 81 69 74 61 56 87 69 65 66 44 62 69

  53. 1. Shuffle Labels
    2. Rearrange
    3. Compute means
    ★: 84 81 61 69 65 76 99 44
    ❌: 72 69 74 57 56 87 46 63 66 91 62 69
    ★ mean: 72.4
    ❌ mean: 67.6
    difference: 4.8

  56. 1. Shuffle Labels
    2. Rearrange
    3. Compute means
    ★: 84 56 61 63 65 66 62 44
    ❌: 72 69 74 57 81 87 46 69 76 91 99 69
    ★ mean: 62.6
    ❌ mean: 74.1
    difference: -11.6

  58. 1. Shuffle Labels
    2. Rearrange
    3. Compute means
    ★: 74 56 61 63 87 76 91 99
    ❌: 72 69 84 57 81 65 46 69 66 62 44 69
    ★ mean: 75.9
    ❌ mean: 65.3
    difference: 10.6

  65. [Histogram of the shuffled score differences
    (x-axis: score difference, y-axis: number);
    16% of shuffles give a difference at least as
    large as the observed 6.6]

  66. “A difference of 6.6 is not
    significant at p = 0.05.”
    That day, all the Sneetches
    forgot about stars
    And whether they had one,
    or not, upon thars.

  67. Notes on Shuffling:
    - Works when the Null Hypothesis assumes
    two groups are equivalent
    - Like all methods, it will only work if your
    samples are representative – always be
    careful about selection biases!
    - Needs care for non-independent trials.
    Good discussion in Simon’s Resampling:
    The New Statistics

  68. Four Recipes for
    Hacking Statistics:
    1. Direct Simulation
    2. Shuffling
    3. Bootstrapping
    4. Cross Validation

  69. Yertle’s Turtle Tower
    On the far-away island
    of Sala-ma-Sond,
    Yertle the Turtle
    was king of the pond. . .

  70. How High can Yertle
    stack his turtles?
    Observe 20 of Yertle’s turtle towers . . .
    # of turtles:
    48 24 32 61 51 12 32 18 19 24
    21 41 29 21 25 23 42 18 23 13
    - What is the mean of the number of
    turtles in Yertle’s stack?
    - What is the uncertainty on this
    estimate?

  71. Classic Method:
    Sample Mean:
    Standard Error of the Mean:

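    (The formulas are images in the deck; the standard definitions
    shown, for N observations x_i with sample standard deviation s, are

        \bar{x} = \frac{1}{N}\sum_i x_i \qquad \mathrm{SEM} = \frac{s}{\sqrt{N}} )
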
  72. What assumptions go into
    these formulae?
    Can we use
    sampling instead?

  74. Problem:
    As before, we don’t have a
    generating model . . .
    Solution:
    Bootstrap Resampling

  75. Bootstrap Resampling:
    Idea:
    Simulate the distribution by drawing
    samples with replacement.
    Motivation:
    The data estimates its own
    distribution – we draw random
    samples from this distribution.
    Data:
    48 24 51 12 21 41 25 23 32 61
    19 24 29 21 23 13 32 18 42 18

  97. Bootstrap Resampling:
    One bootstrap sample of 20 values,
    drawn with replacement:
    21 19 25 24 23 19 41 23 41 18
    61 12 42 42 42 19 18 61 29 41
    → 31.05

  98. Repeat this
    several thousand times . . .

  99. import numpy as np

    # N is the array of the 20 observed tower heights
    xbar = np.zeros(10000)
    for i in range(10000):
        sample = N[np.random.randint(20, size=20)]
        xbar[i] = np.mean(sample)
    print(np.mean(xbar), np.std(xbar))
    # (28.9, 2.9)

    Recovers The Analytic Estimate!
    Height = 29 ± 3 turtles

  100. Bootstrap sampling
    can be applied even to
    more involved statistics

  101. Bootstrap on Linear
    Regression:
    What is the relationship between the speed
    of the wind and the height of Yertle’s
    turtle tower?

  102. Bootstrap on Linear
    Regression:

    results = np.zeros((10000, 2))
    for i in range(10000):
        idx = np.random.randint(20, size=20)  # resample indices
        # np.polyfit(..., 1) in place of the slide’s schematic fit()
        slope, intercept = np.polyfit(x[idx], y[idx], 1)
        results[i] = (slope, intercept)

  103. Notes on Bootstrapping:
    - Bootstrap resampling is well-studied and
    rests on solid theoretical grounds.
    - Bootstrapping often doesn’t work well for
    rank-based statistics (e.g. maximum value)
    - Works poorly with very few samples
    (N > 20 is a good rule of thumb)
    - As always, be careful about selection
    biases & non-independent data!

  104. Four Recipes for
    Hacking Statistics:
    1. Direct Simulation
    2. Shuffling
    3. Bootstrapping
    4. Cross Validation

  105. Onceler Industries:
    Sales of Thneeds
    I'm being quite useful!
    This thing is a Thneed.
    A Thneed's a Fine-Something-
    That-All-People-Need!

  106. Thneed sales seem to show a
    trend with temperature . . .

  107. y = a + bx
    y = a + bx + cx²
    But which model is a better fit?

  108. Can we judge by root-mean-square error?
    y = a + bx        → RMS error = 63.0
    y = a + bx + cx²  → RMS error = 51.5

  109. In general, more flexible models will
    always have a lower RMS error.
    y = a + bx
    y = a + bx + cx²
    y = a + bx + cx² + dx³
    y = a + bx + cx² + dx³ + ex⁴
    y = a + ⋯

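    A quick numerical check of that claim, with synthetic stand-in data
    (the Thneed measurements are not in the transcript): training RMS
    error never increases as the polynomial degree grows.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 30)
    y = 20 + 5 * x + rng.normal(0, 5, 30)    # stand-in “sales vs temperature”

    for degree in (1, 2, 3, 4):
        coeffs = np.polyfit(x, y, degree)    # least-squares polynomial fit
        rms = np.sqrt(np.mean((y - np.polyval(coeffs, x)) ** 2))
        print(degree, rms)                   # RMS shrinks (or stays flat)
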
  110. y = a + bx + cx² + dx³ + ex⁴ + fx⁵ + ⋯ + nx¹⁴
    RMS error does not
    tell the whole story.

  111. Not to worry:
    Statistics has figured this out.

  115. Classic Method
    Difference in Mean Squared Error
    follows a chi-square distribution:
    Can estimate degrees of freedom easily
    because the models are nested . . .
    Plug in our numbers . . .
    Wait… what question were we
    trying to answer again?

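    (The distribution itself is an image in the deck; the standard
    nested-model result is that, under the simpler model with Gaussian
    noise of known variance \sigma^2,

        \chi^2 = \frac{\mathrm{RSS}_{\text{simple}} - \mathrm{RSS}_{\text{complex}}}{\sigma^2}

    follows a chi-square distribution with degrees of freedom equal to
    the difference in the number of model parameters.)
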
  116. Another Approach:
    Cross Validation

  118. Cross-Validation
    1. Randomly Split data

  120. Cross-Validation
    2. Find the best model for each subset

  121. Cross-Validation
    3. Compare models across subsets

  125. Cross-Validation
    4. Compute RMS error for each
    RMS = 48.9
    RMS = 55.1
    RMS estimate = 52.1

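    (The combined estimate is presumably the root-mean-square of the
    two fold errors: \sqrt{(48.9^2 + 55.1^2)/2} \approx 52.1.)
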
  126. Cross-Validation
    Repeat for as long as
    you have patience . . .

  128. Cross-Validation
    Best model minimizes the
    cross-validated error.
    5. Compare cross-validated RMS for models:

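    A minimal 2-fold cross-validation sketch in Python, mirroring steps
    1-5 above; x and y are synthetic stand-ins, since the Thneed data is
    not in the transcript:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 100)
    y = 20 + 5 * x + rng.normal(0, 5, 100)

    def cv_rms(degree):
        idx = rng.permutation(len(x))           # 1. randomly split the data
        first, second = idx[:50], idx[50:]
        rms = []
        for train, test in [(first, second), (second, first)]:
            coeffs = np.polyfit(x[train], y[train], degree)   # 2. best fit per subset
            resid = y[test] - np.polyval(coeffs, x[test])     # 3. compare across subsets
            rms.append(np.sqrt(np.mean(resid ** 2)))          # 4. RMS error for each
        return np.sqrt(np.mean(np.square(rms)))               #    combined RMS estimate

    for degree in (1, 2, 3):                    # 5. compare cross-validated RMS
        print(degree, cv_rms(degree))           # best model minimizes CV error
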
  129. . . . I biggered the loads
    of the thneeds I shipped out!
    I was shipping them forth,
    to the South, to the East
    to the West, to the North!

  130. Notes on Cross-Validation:
    - This was “2-fold” cross-validation; other
    CV schemes exist & may perform better
    for your data (see e.g. scikit-learn docs)
    - Cross-validation is the go-to method for
    model evaluation in machine learning,
    as statistics of the models are often not
    known in the classical sense.
    - Again: caveats about selection bias and
    independence in data.

  131. Four Recipes for
    Hacking Statistics:
    1. Direct Simulation
    2. Shuffling
    3. Bootstrapping
    4. Cross Validation

  132. Sampling Methods
    allow you to use intuitive computational
    approaches in place of often
    non-intuitive statistical rules.
    If you can write a for-loop
    you can do statistical analysis.

  133. Things I didn’t have time for:
    - Bayesian Methods: very intuitive & powerful
    approaches to more sophisticated modeling.
    (see e.g. Bayesian Methods for Hackers by Cam Davidson-Pilon)
    - Selection Bias: if you get data selection
    wrong, you’ll have a bad time.
    (See Chris Fonnesbeck’s Scipy 2015 talk, Statistical Thinking for Data Science)
    - Detailed considerations on use of sampling,
    shuffling, and bootstrapping.
    (I recommend Statistics Is Easy by Shasha & Wilson
    And Resampling: The New Statistics by Julian Simon)

  134. – Dr. Seuss (attr)

  135. ~ Thank You! ~
    Email: [email protected]
    Twitter: @jakevdp
    Github: jakevdp
    Web: http://vanderplas.com/
    Blog: http://jakevdp.github.io/
    Slides available at
    http://speakerdeck.com/jakevdp/statistics-for-hackers/
