
Ensuring statistics have power: sample sizes, effect sizes and confidence intervals (and how to use them)

Anderson, Ben, Rushby, Tom, Bahaj, Abubakr and James, Patrick (2021). Ensuring statistics have power: sample sizes, effect sizes and confidence intervals (and how to use them). Energy Evaluation Europe 2021 Conference: Accelerating the energy transition for all: Evaluation's role in effective policy making, Online, 10-16 Mar 2021.


Ben Anderson

March 16, 2021


Transcript

  1. Ensuring statistics have power
     Sample sizes, effect sizes and confidence intervals (and how to use them)
     Ben Anderson @dataknut
     11th March 2021
  2. The Menu
     • What do we need to know?
     • Effect sizes, precision and the risk of getting it ‘wrong’
     • Case studies:
       – An actual small sample
       – A simulated large(r) sample
     • Decisions:
       – Before: study design
       – After: evidence, certainty and risk
     • Summary
  3. Evaluation: we need to know
     • Is the result important or useful? (“What is the estimated bang for buck?”)
       -> Difference or effect size: is it 2% or 22%?
     • Is there uncertainty or variation in the response? (“How uncertain is the estimated bang?”)
       -> Statistical confidence intervals: 15-29%?
     • What is the risk of a Type I error / false positive? (“What is the risk the bang isn’t real?”)
       -> Statistical p values: p = 0.1? We might waste £ on something that doesn’t work.
     • What is the risk of a Type II error / false negative? (“What is the risk there is a bang when we concluded there wasn’t?”)
       -> Statistical power: power = 0.8? We might not do something that does work.
     Throughout: Is it useful? Are we sure enough? (See the sketch below.)
  4. An example…
     • Heat pump power demand*
     • Total sample = 53
       – There are ‘useful’ differences
       – But the 95% confidence intervals overlap
       – So none of the differences are ‘statistically significant’
       – And all are imprecise
     Is it useful? Are we sure enough? (See the sketch below.)
     *Data source: B. Anderson et al., ‘New Zealand GREEN Grid household electricity demand study 2014-2018’, Sep. 2018
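To see why overlapping intervals signal trouble, here is a small simulated stand-in (the GREEN Grid measurements themselves are not reproduced here): 53 observations split across three hypothetical groups, each with a 95% confidence interval around its mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Three hypothetical groups totalling 53 observations (kW-like values)
groups = {"Group A": rng.normal(2.0, 1.0, 18),
          "Group B": rng.normal(2.4, 1.0, 17),
          "Group C": rng.normal(2.8, 1.0, 18)}

for name, x in groups.items():
    mean, se = x.mean(), stats.sem(x)
    lo, hi = stats.t.interval(0.95, df=len(x) - 1, loc=mean, scale=se)
    print(f"{name}: mean = {mean:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
# With ~18 observations per group the intervals are wide and overlap,
# so 'useful'-looking differences are not statistically significant.
```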
  5. An example… 2
     • Heat pump power demand*
     • Simulated sample^ = 1,040
       – There are ‘very useful’ differences
       – The 95% confidence intervals do not overlap
       – All differences are ‘statistically significant’
       – And all are much more precise
     Is it useful? Are we sure enough?
     *Data source: B. Anderson et al., ‘New Zealand GREEN Grid household electricity demand study 2014-2018’, Sep. 2018
     ^Repeated random sampling from the 53 with replacement (sketched below)
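The footnote says the larger sample was built by repeated random sampling from the 53 with replacement. A minimal sketch of that resampling, again with stand-in data rather than the real measurements:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
original = rng.normal(2.4, 1.0, 53)        # stand-in for the real n = 53 sample

# Resample with replacement up to the simulated n = 1,040
resampled = rng.choice(original, size=1040, replace=True)

for label, x in [("n = 53", original), ("n = 1,040", resampled)]:
    mean, se = x.mean(), stats.sem(x)
    lo, hi = stats.t.interval(0.95, df=len(x) - 1, loc=mean, scale=se)
    print(f"{label}: mean = {mean:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
# The standard error shrinks roughly by sqrt(1040/53), about 4.4x, so the
# intervals narrow. Caveat: resampling adds no new information; it only
# illustrates the precision a genuinely larger sample might give.
```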
  6. Decisions before: power analysis
     Fix, in advance:
     • The effect size we can ‘robustly’ detect (effect size)
     • The ‘false positive’ risk, e.g. 5% (p < 0.05) (Type I error: we might waste £ on something that doesn’t work)
     • The ‘false negative’ risk, e.g. power = 0.8 (Type II error: we might not do something that does work)
     • With this sample size (N)
  7. Power Analysis: start here…
     The effect size we can ‘robustly’ detect, at this ‘false positive’ risk and this ‘false negative’ risk, and… with this sample size. (See the sketch below.)
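Slides 6 and 7 list the four inputs to an a-priori power analysis; fix any three and the fourth is determined. The deck does not name a tool; as one hedged option, statsmodels can solve for whichever input is left unspecified:

```python
# Requires: pip install statsmodels
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Solve for the sample size needed to detect a given standardised effect
# (Cohen's d = 0.3 is an assumption, not a number from the deck) with a
# 5% false-positive risk and an 80% chance of detecting a real effect.
n_per_group = analysis.solve_power(effect_size=0.3, alpha=0.05, power=0.8)
print(f"Required sample size per group: {n_per_group:.0f}")

# Or work backwards: what power does a fixed sample size actually give?
achieved = analysis.solve_power(effect_size=0.3, nobs1=53, alpha=0.05)
print(f"Power with n = 53 per group: {achieved:.2f}")
```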
  8. Decisions after: evidence, certainty and risk
     Suppose:
     – Trial 1 needs a 4% effect to be worthwhile
     – Trial 2 needs an 18% effect to be worthwhile

                               Trial 1      Trial 2
     Mean effect size          6%           16%
     95% Confidence Interval   -1% to 13%   10% to 22%
     Test p value (Type I)     0.12         0.04
     Power (Type II)           0.8          0.8

     Trial 1:
     1. The mean effect size is large enough
     2. The 95% CI includes the target, but is wide and includes 0
     3. The effect is not significant at p = 0.05 or at p = 0.1
     Trial 2:
     1. The mean effect size is not quite large enough
     2. The 95% CI includes the target and is wide, but does not include 0
     3. The effect is statistically significant at p = 0.05
     (See the sketch below.)
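The decision logic on this slide mechanises neatly. A small sketch: the assess helper is mine, the numbers are the slide's.

```python
def assess(name, target, effect, ci, p, alpha=0.05):
    """Compare an estimated effect against the threshold that makes it worthwhile."""
    lo, hi = ci
    print(f"{name}: mean effect {effect:.0%} vs worthwhile threshold {target:.0%}")
    print(f"  mean effect {'meets' if effect >= target else 'misses'} the target")
    print(f"  95% CI ({lo:.0%} to {hi:.0%}) "
          f"{'includes' if lo <= target <= hi else 'excludes'} the target and "
          f"{'includes' if lo <= 0 <= hi else 'excludes'} zero")
    print(f"  p = {p}: {'significant' if p < alpha else 'not significant'} at alpha = {alpha}")

assess("Trial 1", target=0.04, effect=0.06, ci=(-0.01, 0.13), p=0.12)
assess("Trial 2", target=0.18, effect=0.16, ci=(0.10, 0.22), p=0.04)
```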
  9. Summary
     Reporting evidence:
     • Sample size -> is it big enough?
     • Effect sizes -> is it useful enough?
     • Confidence intervals -> is it precise enough?
     • Statistical significance thresholds -> is it random chance?
     Thresholds depend on your appetite for:
     • Type I error (test p value): you conclude it ‘worked’ when (in fact) it didn’t, so we might waste £ on something that doesn’t work
     • Type II error (statistical power): you conclude it ‘didn’t work’ when (in fact) it did, so we might not do something that does work
     Which in turn depend on:
     • The social, reputational and £ costs if you’re wrong
     • The benefits if you’re right
     Is it useful? Are we sure enough?