
Should all statistics students be programmers?

Hadley Wickham
July 12, 2018

A presentation at ICOTS 10 (Kyoto, Japan)


Transcript

  1. What should a statistics student be able to do?

    Import
    Tidy: store data consistently
    Transform: create new variables & new summaries
    Visualise: surprises, but doesn't scale
    Model: scales, but doesn't (fundamentally) surprise
    Communicate
    Understand
    Automate
  2. Why is programming preferable for statistics?

    1. Code is text
    2. Code is readable
    3. Code is shareable
    4. Code is open
  3. A small example

    library(tidycensus)

    geo <- get_acs(
      geography = "metropolitan statistical area...",
      variables = "DP03_0021PE",
      summary_var = "B01003_001",
      survey = "acs1",
      endyear = 2016
    )
    # Thanks to Kyle Walker (@kyle_e_walker)
    # For package and example
  4. Followed by data munging

    library(tidyverse)  # provides %>%, filter(), separate(), str_extract()

    big_metro <- geo %>%
      filter(summary_est > 2e6) %>%
      select(-variable) %>%
      mutate(
        NAME = gsub(" Metro Area", "", NAME)
      ) %>%
      separate(NAME, c("city", "state"), ", ") %>%
      mutate(
        city = str_extract(city, "^[A-Za-z ]+"),
        state = str_extract(state, "^[A-Za-z ]+"),
        name = paste0(city, ", ", state),
        summary_moe = na_if(summary_moe, -555555555)
      )
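The munging steps on this slide can be tried without a Census API key. The following is a minimal sketch on a hypothetical two-row table shaped like the `get_acs()` result; the rows and numbers in `toy` are invented for illustration and are not from the talk. It shows that the `^[A-Za-z ]+` pattern keeps only the leading run of letters and spaces, so multi-city and multi-state names are cut at the first hyphen or period.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical rows (assumption): same column names as the slide, invented values.
toy <- tibble(
  NAME = c("St. Louis, MO-IL Metro Area", "Austin-Round Rock, TX Metro Area"),
  estimate = c(1.5, 2.5),
  summary_moe = c(-555555555, 1200)
)

cleaned <- toy %>%
  # Drop the " Metro Area" suffix
  mutate(NAME = gsub(" Metro Area", "", NAME)) %>%
  # Split "city, state" into two columns
  separate(NAME, c("city", "state"), ", ") %>%
  mutate(
    # Keep only the leading run of letters and spaces
    city = str_extract(city, "^[A-Za-z ]+"),
    state = str_extract(state, "^[A-Za-z ]+"),
    name = paste0(city, ", ", state),
    # Treat the -555555555 sentinel as missing
    summary_moe = na_if(summary_moe, -555555555)
  )

print(cleaned)
```

In this sketch "St. Louis" is shortened to "St" because the period ends the match, which is worth knowing before using the cleaned names as plot labels.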
  5. big_metro %>%
      ggplot(aes(
        x = estimate,
        y = reorder(name, estimate))
      ) +
      geom_errorbarh(
        aes(
          xmin = estimate - moe,
          xmax = estimate + moe
        ),
        width = 0.1
      ) +
      geom_point(color = "navy")
  6. [Figure: point plot with horizontal error bars, one row per large metro area from Indianapolis, IN to New York, NY. x axis: estimate (0 to 30); y axis: reorder(name, estimate).]
  7. And hence you can read unfamiliar code

    What does this code do?

    library(tidyverse)
    library(magick)

    dir(pattern = ".png") %>%
      map(image_read) %>%
      image_join() %>%
      image_animate(fps = 1, loop = 25) %>%
      image_write("my_animation.gif")

    https://twitter.com/ricardokriebel/status/849626401611411458
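One way to answer the question on this slide is to unroll the pipeline into named steps. The sketch below is an illustration, not part of the talk: it assumes the purrr and magick packages are installed, creates three solid-colour PNG frames in a temporary folder (the folder, file names and colours are made up), and uses a stricter `"\\.png$"` pattern than the slide.

```r
library(purrr)
library(magick)

# Assumption for illustration: write three small solid-colour frames to a temp folder.
frame_dir <- file.path(tempdir(), "frames")
dir.create(frame_dir, showWarnings = FALSE)
colours <- c("red", "green", "blue")
for (i in seq_along(colours)) {
  image_write(
    image_blank(100, 100, color = colours[i]),
    path = file.path(frame_dir, sprintf("frame-%d.png", i)),
    format = "png"
  )
}

# 1. List the PNG files
paths <- dir(frame_dir, pattern = "\\.png$", full.names = TRUE)
# 2. Read each file into an image object
frames <- map(paths, image_read)
# 3. Combine the list into one multi-frame image
stack <- image_join(frames)
# 4. Turn the frames into an animation: 1 frame per second, repeated 25 times
anim <- image_animate(stack, fps = 1, loop = 25)
# 5. Save the animation as a GIF
out <- file.path(tempdir(), "my_animation.gif")
image_write(anim, out)
```

Each intermediate name corresponds to one stage of the piped version, which is why the original can be read top to bottom as a sequence of verbs.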
  8. All modern programming languages are open source

    Free
    Students can use same tools as practitioners.
    Anyone can use best tools regardless of wealth.
    Anyone can re-run your analysis

    Fluid
    You can fix problems
    You can build your own tools
  9. Why is programming preferable for statistics?

    1. Code is text
    2. Code is readable
    3. Code is shareable
    4. Code is open