
Data Rectangling

Jennifer (Jenny) Bryan

November 18, 2016

Transcript

  1. Big Data Borat: 80% time spent prepare data 20% time spent complain about need for prepare data.
  2. Lessons from my fall 2016 teaching: https://jennybc.github.io/purrr-tutorial/
     repurrrsive package (non-boring examples): https://github.com/jennybc/repurrrsive
     I am the Annie Leibovitz of lego mini-figures: https://github.com/jennybc/lego-rstats
  3. vectors of same length? DATA FRAME!
     vectors don’t have to be atomic
     works for lists too! LOVE THE LIST COLUMN!
  4. {
       "url": "http://www.anapioficeandfire.com/api/characters/1303",
       "id": 1303,
       "name": "Daenerys Targaryen",
       "gender": "Female",
       "culture": "Valyrian",
       "born": "In 284 AC, at Dragonstone",
       "died": "",
       "alive": true,
       "titles": [
         "Queen of the Andals and the Rhoynar and the First Men, Lord of the Seven Kingdoms",
         "Khaleesi of the Great Grass Sea",
         "Breaker of Shackles/Chains",
         "Queen of Meereen",
         "Princess of Dragonstone"
       ],
       "aliases": [
         "Dany",
         "Daenerys Stormborn",
  5. titles
     #> # A tibble: 29 × 2
     #>    name               titles
     #>    <chr>              <list>
     #>  1 Theon Greyjoy      <chr [3]>
     #>  2 Tyrion Lannister   <chr [2]>
     #>  3 Victarion Greyjoy  <chr [2]>
     #>  4 Will               <list [0]>
     #>  5 Areo Hotah         <chr [1]>
     #>  6 Chett              <list [0]>
     #>  7 Cressen            <chr [1]>
     #>  8 Arianne Martell    <chr [1]>
     #>  9 Daenerys Targaryen <chr [5]>
     #> 10 Davos Seaworth     <chr [4]>
     #> # ... with 19 more rows
  6. Why would you do this to yourself? The list is forced on you by the problem.
     • String processing, e.g., regex
     • JSON or XML
     • Split-Apply-Combine
  7. But why lists in a data frame? All the usual reasons!
     • Keep multiple vectors intact and “in sync”
     • Use existing toolkit for filter, select, ….
  8. map(got_chars, "name")
     #> [[1]]
     #> [1] "Theon Greyjoy"
     #>
     #> [[2]]
     #> [1] "Tyrion Lannister"
     #>
     #> [[3]]
     #> [1] "Victarion Greyjoy"
     query
  9. map_chr(got_chars, "name")
     #>  [1] "Theon Greyjoy"    "Tyrion Lannister" "Victarion Greyjoy"
     #>  [4] "Will"             "Areo Hotah"       "Chett"
     #>  [7] "Cressen"          "Arianne Martell"  "Daenerys Targaryen"
     #> [10] "Davos Seaworth"   "Arya Stark"       "Arys Oakheart"
     #> [13] "Asha Greyjoy"     "Barristan Selmy"  "Varamyr"
     #> [16] "Brandon Stark"    "Brienne of Tarth" "Catelyn Stark"
     #> [19] "Cersei Lannister" "Eddard Stark"     "Jaime Lannister"
     #> [22] "Jon Connington"   "Jon Snow"         "Aeron Greyjoy"
     #> [25] "Kevan Lannister"  "Melisandre"       "Merrett Frey"
     #> [28] "Quentyn Martell"  "Sansa Stark"
     simplify
  10. map_df(got_chars, `[`, c("name", "culture", "gender", "born"))
      #> # A tibble: 29 × 4
      #>    name               culture  gender born
      #>    <chr>              <chr>    <chr>  <chr>
      #>  1 Theon Greyjoy      Ironborn Male   In 278 AC or 279 AC, at Pyke
      #>  2 Tyrion Lannister            Male   In 273 AC, at Casterly Rock
      #>  3 Victarion Greyjoy  Ironborn Male   In 268 AC or before, at Pyke
      #>  4 Will                        Male
      #>  5 Areo Hotah         Norvoshi Male   In 257 AC or before, at Norvos
      #>  6 Chett                       Male   At Hag's Mire
      #>  7 Cressen                     Male   In 219 AC or 220 AC
      #>  8 Arianne Martell    Dornish  Female In 276 AC, at Sunspear
      #>  9 Daenerys Targaryen Valyrian Female In 284 AC, at Dragonstone
      #> 10 Davos Seaworth     Westeros Male   In 260 AC or before, at King's Landing
      #> # ... with 19 more rows
      simplify
  11. got_chars %>% {
        tibble(name = map_chr(., "name"),
               houses = map(., "allegiances"))
      } %>%
        filter(lengths(houses) > 1) %>%
        unnest()
      #> # A tibble: 15 × 2
      #>   name           houses
      #>   <chr>          <chr>
      #> 1 Davos Seaworth House Baratheon of Dragonstone
      #> 2 Davos Seaworth House Seaworth of Cape Wrath
      #> 3 Asha Greyjoy   House Greyjoy of Pyke
      #> 4 Asha Greyjoy   House Ironmaker
      simplify
  12. gap_nested <- gapminder %>%
        group_by(country, continent) %>%
        nest()
      gap_nested
      #> # A tibble: 142 × 3
      #>    country     continent data
      #>    <fctr>      <fctr>    <list>
      #>  1 Afghanistan Asia      <tibble [12 × 4]>
      #>  2 Albania     Europe    <tibble [12 × 4]>
      #>  3 Algeria     Africa    <tibble [12 × 4]>
      #>  4 Angola      Africa    <tibble [12 × 4]>
      #>  5 Argentina   Americas  <tibble [12 × 4]>
      #>  6 Australia   Oceania   <tibble [12 × 4]>
      #>  7 Austria     Europe    <tibble [12 × 4]>
      #>  8 Bahrain     Asia      <tibble [12 × 4]>
      #>  9 Bangladesh  Asia      <tibble [12 × 4]>
      #> 10 Belgium     Europe    <tibble [12 × 4]>
      #> # ... with 132 more rows
  13. modify
      gap_nested %>%
        mutate(fit = map(data, ~ lm(lifeExp ~ year, data = .x))) %>%
        filter(continent == "Oceania") %>%
        mutate(coefs = map(fit, coef))
      #> # A tibble: 2 × 5
      #>   country     continent data              fit      coefs
      #>   <fctr>      <fctr>    <list>            <list>   <list>
      #> 1 Australia   Oceania   <tibble [12 × 4]> <S3: lm> <dbl [2]>
      #> 2 New Zealand Oceania   <tibble [12 × 4]> <S3: lm> <dbl [2]>
  14. simplify
      gap_nested %>% …
        mutate(intercept = map_dbl(coefs, 1),
               slope = map_dbl(coefs, 2)) %>%
        select(country, continent, intercept, slope)
      #> # A tibble: 2 × 4
      #>   country     continent intercept slope
      #>   <fctr>      <fctr>    <dbl>     <dbl>
      #> 1 Australia   Oceania   -376.1163 0.2277238
      #> 2 New Zealand Oceania   -307.6996 0.1928210
  15. How are you doing such things today?
      maybe you don’t, because you don’t know how
      for loops
      apply(), [slvmt]apply(), split(), by()
      with plyr: [adl][adl_]ply()
      with dplyr: df %>% group_by() %>% do()
  16. map(.x, .f, ...)
      .f is function to apply
      name & position shortcuts
      concise ~ formula syntax
  17. “return results like so”
      map_lgl(.x, .f, ...)
      map_chr(.x, .f, ...)
      map_int(.x, .f, ...)
      map_dbl(.x, .f, ...)
      map(.x, .f, ...) can be thought of as map_list(.x, .f, ...)
      map_df(.x, .f, ...)
  18. walk(.x, .f, ...) can be thought of as map_nothing(.x, .f, ...)
      map2(.x, .y, .f, ...): f(.x[[i]], .y[[i]], ...)
      pmap(.l, .f, ...): f(tuple of i-th elements of the vectors in .l, ...)
  19. workflow
      1 do something easy with the iterative machine
      2 do the real, hard thing with one representative unit
      3 insert logic from 2 into template from 1
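
The three-step workflow and the typed map variants from the last few slides can be sketched roughly as follows. This is only an illustration, not code from the talk: `chars` is a hypothetical two-element stand-in for `got_chars`, built from values that appear on the slides, and the summary sentence is made up for the example. It assumes the purrr package is installed. Note that `map_list()` and `map_nothing()` on the slides are mental models, not real functions, so they are not called here.

```r
library(purrr)

# Hypothetical stand-in for got_chars: two characters, values taken from the slides.
chars <- list(
  list(
    name = "Daenerys Targaryen", gender = "Female", culture = "Valyrian",
    titles = c(
      "Queen of the Andals and the Rhoynar and the First Men, Lord of the Seven Kingdoms",
      "Khaleesi of the Great Grass Sea",
      "Breaker of Shackles/Chains",
      "Queen of Meereen",
      "Princess of Dragonstone"
    )
  ),
  list(name = "Will", gender = "Male", culture = "", titles = list())
)

# Step 1: do something easy with the iterative machine.
n_fields <- map_int(chars, length)

# Step 2: do the real thing with one representative unit.
one <- chars[[1]]
one_summary <- paste0(one$name, " holds ", length(one$titles), " title(s)")

# Step 3: insert the logic from step 2 into the template from step 1.
summaries <- map_chr(
  chars,
  ~ paste0(.x$name, " holds ", length(.x$titles), " title(s)")
)

# Name and position shortcuts: both extract the first field, "name".
nms <- map_chr(chars, "name")
nms_by_position <- map_chr(chars, 1)

# A typed variant with the ~ formula syntax.
has_titles <- map_lgl(chars, ~ length(.x$titles) > 0)

# map2(): iterate over two vectors in parallel.
n_titles <- map_int(chars, ~ length(.x$titles))
labels <- map2_chr(nms, n_titles, ~ paste(.x, .y, sep = ": "))

# pmap(): iterate over the i-th elements of every vector in a list.
tagged <- pmap_chr(
  list(name = nms, gender = map_chr(chars, "gender")),
  function(name, gender) paste0(name, " (", gender, ")")
)

# walk(): call a function for its side effect only.
walk(summaries, ~ cat(.x, "\n"))
```

Developing the logic on `chars[[1]]` first, then pasting it into a `map_*()` call, keeps debugging confined to a single element before the iteration is scaled up.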