TokyoR#103_DataProcessing

kilometer
January 21, 2023

Slides from my talk at the 103rd Tokyo.R.

Transcript

  1. BeginneR Advanced Hoxo_m If I have seen further it is

    by standing on the shoulders of Giants. -- Sir Isaac Newton, 1676
  2. import Tidy Transform Visualize Model Communicate Modified from “R for

    Data Science”, H. Wickham, 2017 Data Science ① ②
  3. import Tidy Transform Visualize Model Communicate Modified from “R for

    Data Science”, H. Wickham, 2017 preprocessing Data processing Data science Data Science
  4. import Tidy Transform Visualize Model Communicate Modified from “R for

    Data Science”, H. Wickham, 2017 preprocessing Data science Data Observation Hypothesis feedback Data processing Data Science
  5. import Tidy Transform Visualize Model Communicate Modified from “R for

    Data Science”, H. Wickham, 2017 preprocessing Data science Data Observation Hypothesis feedback Data processing Narrative of data
  6. import Tidy Transform Visualize Model Communicate Modified from “R for

    Data Science”, H. Wickham, 2017 preprocessing Data science Data Observation Hypothesis Narrative of data feedback Data processing
  7. import Tidy Transform Visualize Model Communicate Modified from “R for

    Data Science”, H. Wickham, 2017 Data processing Data Science
  8. read_csv() write_csv() Table Data Wide form Long form pivot_longer() Nested

    form pivot_wider() Plot group_nest() unnest() {ggplot2} {patchwork} Image Files ggsave() Data Processing
  9. data.frame tibble read_csv() write_csv() Table Data Wide form Long form

    pivot_longer() Nested form pivot_wider() Plot group_nest() unnest() {ggplot2} {patchwork} Image Files ggsave() Data Processing
  10. vector in R in Excel pre <- c(1, 2, 3,

    4, 5) post <- pre * 5 > pre [1] 1 2 3 4 5 > post [1] 5 10 15 20 25
  11. vector vec1 <- c(1, 2, 3, 4, 5) vec2 <-

    1:5 vec3 <- seq(from = 1, to = 5, by = 1) > vec1 [1] 1 2 3 4 5 > vec2 [1] 1 2 3 4 5 > vec3 [1] 1 2 3 4 5
  12. vector vec1 <- seq(from = 1, to = 5, by

    = 1) vec2 <- seq(1, 5, 1) > vec1 [1] 1 2 3 4 5 > vec2 [1] 1 2 3 4 5
  13. > ?seq vector seq{base} Sequence Generation Description Generate regular sequences.

    seq is a standard generic with a default method. … Usage seq(...) ## Default S3 method: seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)), length.out = NULL, along.with = NULL, ...)
  14. vector vec1 <- rep(1:3, times = 2) vec2 <- rep(1:3,

    each = 2) vec3 <- rep(1:3, times = 2, each = 2) > vec1 [1] 1 2 3 1 2 3 > vec2 [1] 1 1 2 2 3 3 > vec3 [1] 1 1 2 2 3 3 1 1 2 2 3 3
  15. vector vec1 <- 11:15 > vec1 [1] 11 12 13

    14 15 > vec1[1] [1] 11 > vec1[3:5] [1] 13 14 15 > vec1[c(1:2, 5)] [1] 11 12 15
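The vector slides above can be condensed into one short, hedged sketch in base R. The vector `scores` and the variable names are made up for illustration; negative and logical indexing are standard R features that the slides do not show explicitly.

```r
# Base R only; `scores` is a hypothetical vector used purely for illustration.
scores <- c(11, 12, 13, 14, 15)

# Arithmetic is applied element by element, so no loop is needed.
doubled <- scores * 2

# Positional indexing, as on the slide.
first_and_last <- scores[c(1, 5)]

# Negative indices drop elements; logical vectors keep the TRUE positions.
without_first <- scores[-1]
above_twelve <- scores[scores > 12]

print(doubled)
print(above_twelve)
```

The same element-wise behaviour is what makes `post <- pre * 5` work without an explicit loop.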
  16. list list1 <- list(1:6, 11:15, c("a", "b", "c")) > list1

    [[1]] [1] 1 2 3 4 5 6 [[2]] [1] 11 12 13 14 15 [[3]] [1] "a" "b" "c"
  17. list list1 <- list(1:6, 11:15, c("a", "b", "c")) > list1[[1]]

    [1] 1 2 3 4 5 6 > list1[[3]][2:3] [1] "b" "c" > list1[[2]] * 3 [1] 33 36 39 42 45
  18. named list list2 <- list(A = 1:6, B = 11:15,

    C = c("a", "b", "c")) > list2 $A [1] 1 2 3 4 5 6 $B [1] 11 12 13 14 15 $C [1] "a" "b" "c"
  19. > list2$A [1] 1 2 3 4 5 6 >

    list2$C[2:3] [1] "b" "c" > list2$B * 3 [1] 33 36 39 42 45 named list list2 <- list(A = 1:6, B = 11:15, C = c("a", "b", "c"))
  20. list1 <- list(1:6, 11:15, c("a", "b", "c")) > class(list1) [1]

    "list" > names(list1) NULL list2 <- list(A = 1:6, B = 11:15, C = c("a", "b", "c")) > class(list2) [1] "list" > names(list2) [1] "A" "B" "C" named list list
  21. list3 <- list(A = 1:3, B = 11:13) > class(list3)

    [1] "list" > names(list3) [1] "A" "B" df1 <- data.frame(A = 1:3, B = 11:13) > class(df1) [1] "data.frame" > names(df1) [1] "A" "B" named list & data.frame
  22. > str(list3) List of 2 $ A: int [1:3] 1

    2 3 $ B: int [1:3] 11 12 13 > str(df1) 'data.frame': 3 obs. of 2 variables: $ A: int 1 2 3 $ B: int 11 12 13 list3 <- list(A = 1:3, B = 11:13) df1 <- data.frame(A = 1:3, B = 11:13) named list & data.frame
  23. > list3 $A [1] 1 2 3 $B [1] 11

    12 13 > df1 A B 1 1 11 2 2 12 3 3 13 named list & data.frame
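A minimal sketch of the point behind the "named list & data.frame" slides: a data.frame behaves like a named list whose elements all have the same length. It reuses `list3` and `df1` from the slides; the other variable names are illustrative.

```r
# Base R only; list3 and df1 are the objects defined on the slides.
list3 <- list(A = 1:3, B = 11:13)
df1 <- data.frame(A = 1:3, B = 11:13)

# A data.frame is itself a list, so the list accessors work on it.
is_df_a_list <- is.list(df1)
same_column <- identical(df1$A, list3$A)
same_via_brackets <- identical(df1[["B"]], list3[["B"]])

# length() counts list elements, i.e. columns, not rows.
n_cols <- length(df1)

# A named list with equal-length elements converts directly to a data.frame.
df_from_list <- as.data.frame(list3)

print(c(is_df_a_list, same_column, same_via_brackets))
print(dim(df_from_list))
```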
  24. data.frame vs. matrix A B 1 1 11 2 2

    12 3 3 13 [,1] [,2] [1,] 1 11 [2,] 2 12 [3,] 3 13 df1 <- data.frame(A = 1:3, B = 11:13) > str(mat1) int [1:3, 1:2] 1 2 3 11 12 13 > str(df1) 'data.frame': 3 obs. of 2 vars.: $ A: int 1 2 3 $ B: int 11 12 13 mat1 <- matrix(c(1:3, 11:13), nrow = 3, ncol = 2)
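One practical difference between the two structures, sketched under the assumption that it is the contrast this slide is after: a matrix stores a single type, while each data.frame column keeps its own. `df_mixed` and `mat_mixed` are hypothetical names introduced here.

```r
# Base R only; df_mixed and mat_mixed are illustrative objects, not from the slides.
mat1 <- matrix(c(1:3, 11:13), nrow = 3, ncol = 2)

df_mixed <- data.frame(
  A = 1:3,
  C = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Each data.frame column keeps its own type.
col_classes <- sapply(df_mixed, class)

# Converting to a matrix forces one common type for every cell.
mat_mixed <- as.matrix(df_mixed)

print(typeof(mat1))
print(col_classes)
print(typeof(mat_mixed))
```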
  25. data.frame tibble read_csv() write_csv() Table Data Wide form Long form

    pivot_longer() Nested form pivot_wider() Plot group_nest() unnest() {ggplot2} {patchwork} Image Files ggsave() Data Processing
  26. Wide form data

    > anscombe
       x1 x2 x3 x4    y1   y2    y3    y4
    1  10 10 10  8  8.04 9.14  7.46  6.58
    2   8  8  8  8  6.95 8.14  6.77  5.76
    3  13 13 13  8  7.58 8.74 12.74  7.71
    4   9  9  9  8  8.81 8.77  7.11  8.84
    5  11 11 11  8  8.33 9.26  7.81  8.47
    6  14 14 14  8  9.96 8.10  8.84  7.04
    7   6  6  6  8  7.24 6.13  6.08  5.25
    8   4  4  4 19  4.26 3.10  5.39 12.50
    9  12 12 12  8 10.84 9.13  8.15  5.56
    10  7  7  7  8  4.82 7.26  6.42  7.91
    11  5  5  5  8  5.68 4.74  5.73  6.89
  27. Wide form data

    df <- rownames_to_column(anscombe, var = "tag")

    > df
      tag x1 x2 x3 x4   y1   y2    y3   y4
    1   1 10 10 10  8 8.04 9.14  7.46 6.58
    2   2  8  8  8  8 6.95 8.14  6.77 5.76
    3   3 13 13 13  8 7.58 8.74 12.74 7.71
    4   4  9  9  9  8 8.81 8.77  7.11 8.84
    5   5 11 11 11  8 8.33 9.26  7.81 8.47
    6   6 14 14 14  8 9.96 8.10  8.84 7.04
  28. Wide form → Long form data df_long_1 <- pivot_longer( data

    = df, cols = !tag ) df_long_2 <- pivot_longer( data = df, cols = !tag, names_to = c(".value", "key"), names_pattern = "(.)(.)" )
  29. Long form → Wide form data pivot_wider( data = df_long_1,

    values_from = value, names_from = name ) pivot_wider( data = df_long_2, values_from = c(x, y), names_from = key )
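A hedged sketch of the full round trip, wide to long and back. It assumes the {tibble} and {tidyr} packages are installed. The `names_sep = ""` argument is an addition not shown on the slides, used here so that the restored columns are named `x1` … `y4` again rather than `x_1` … `y_4`.

```r
# Assumes {tibble} and {tidyr} are installed; `anscombe` ships with base R.
library(tibble)
library(tidyr)

df <- rownames_to_column(anscombe, var = "tag")

# Wide -> long: "x1" is split into the value column "x" and key "1".
df_long <- pivot_longer(
  data = df,
  cols = !tag,
  names_to = c(".value", "key"),
  names_pattern = "(.)(.)"
)

# Long -> wide: glue value name and key back together without a separator.
df_wide <- pivot_wider(
  data = df_long,
  names_from = key,
  values_from = c(x, y),
  names_sep = ""
)

print(dim(df_long))
print(names(df_wide))
```

The long form has one row per (tag, key) pair, which is the shape the later ggplot2 slides rely on.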
  30. data.frame / tibble read_csv() write_csv() Table Data Wide form Long

    form pivot_longer() pivot_wider() Plot {ggplot2} Image Files ggsave() Data Processing
  31. read_csv() write_csv() Table Data Wide form Long form pivot_longer() pivot_wider()

    Plot {ggplot2} Image Files ggsave() Data Processing Long form (×8) data.frame / tibble
  32. It (dplyr) provides simple “verbs” to help you translate your

    thoughts into code. Functions that correspond to the most common data manipulation tasks. Introduction to dplyr https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html verbs {dplyr}
  33. 1. mutate() 2. filter() 3. select() 4. group_by() 5. summarize()

    6. left_join() 7. arrange() Data.frame manipulation
  34. 1. mutate() 2. filter() 3. select() 4. group_by() 5. summarize()

    6. left_join() 7. arrange() Data.frame manipulation 0. %>%
  35. Pipe algebra: %>% from {magrittr}

    X %>% f          is  f(X)
    X %>% f(y)       is  f(X, y)
    X %>% f %>% g    is  g(f(X))
    X %>% f(y, .)    is  f(y, X)

    Source: 「dplyr再入門(基本編)」 ("Re-introduction to dplyr: Basics"), yutannihilation https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-ji-ben-bian
  36. Bring milk from the kitchen!

    ① lift: lift(Robot, glass, table) -> Robot'
    ② take: take(Robot', fridge, milk) -> Robot''
  37. Bring milk from the kitchen!

    Robot' <- lift(Robot, glass, table)       # ①
    Robot'' <- take(Robot', fridge, milk)     # ②
    Robot''' <- pour(Robot'', milk, glass)    # ③
    result <- put(Robot''', glass, table)     # ④

    By using the pipe:

    result <- Robot %>%
      lift(glass, table) %>%     # ①
      take(fridge, milk) %>%     # ②
      pour(milk, glass) %>%      # ③
      put(glass, table)          # ④
  39. Bring milk from the kitchen!

    Robot' <- lift(Robot, glass, table)       # ①
    Robot'' <- take(Robot', fridge, milk)     # ②
    Robot''' <- pour(Robot'', milk, glass)    # ③
    result <- put(Robot''', glass, table)     # ④

    By using the pipe:

    result <- Robot %>%
      lift(glass, table) %>%     # ①
      take(fridge, milk) %>%     # ②
      pour(milk, glass) %>%      # ③
      put(glass, table)          # ④

    Thinking / Reading
  40. Pipe algebra: %>% from {magrittr}

    X %>% f          is  f(X)
    X %>% f(y)       is  f(X, y)
    X %>% f %>% g    is  g(f(X))
    X %>% f(y, .)    is  f(y, X)

    Source: 「dplyr再入門(基本編)」 ("Re-introduction to dplyr: Basics"), yutannihilation https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-ji-ben-bian
  41. 1. mutate() 2. filter() 3. select() 4. group_by() 5. summarize()

    6. left_join() 7. arrange() Data.frame manipulation 0. %>% ✔
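The four pipe identities can be checked directly. This is a small sketch assuming {magrittr} is installed; the vector `x` and the functions chosen (`sqrt`, `rep`, `sum`, `paste`) are illustrative stand-ins for the slides' abstract `f` and `g`.

```r
# Assumes {magrittr} is installed; x and the chosen functions are illustrative.
library(magrittr)

x <- c(1, 4, 9)

# X %>% f  is  f(X)
r1 <- identical(x %>% sqrt, sqrt(x))

# X %>% f(y)  is  f(X, y): the left-hand side becomes the first argument
r2 <- identical(x %>% rep(times = 2), rep(x, times = 2))

# X %>% f %>% g  is  g(f(X))
r3 <- identical(x %>% sqrt %>% sum, sum(sqrt(x)))

# X %>% f(y, .)  is  f(y, X): the dot marks where the left-hand side goes
r4 <- identical(x %>% paste("value", .), paste("value", x))

print(c(r1, r2, r3, r4))
```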
  42. Boolean operators / Boolean Algebra

    A == B      # equal to
    A != B      # not equal to
    A | B       # or
    A & B       # and
    A %in% B    # is A in B?

    George Boole, 1815 - 1864 (Wikipedia)
  43. Boolean operators / Boolean Algebra

    "a" != "b"       # not equal to
    [1] TRUE
    1 %in% 10:100    # is A in B?
    [1] FALSE
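These operators are vectorized, which is what later makes them useful inside `filter()`. A brief base R sketch follows; the vector `a` and the thresholds are made up for illustration.

```r
# Base R only; `a` and the thresholds are hypothetical.
a <- c(1, 5, 20)

is_five <- a == 5               # equal to
not_five <- a != 5              # not equal to
outside <- a < 3 | a > 10       # or
inside <- a > 3 & a < 10        # and
in_range <- a %in% 10:100       # is each element of a in 10:100?

print(rbind(is_five, not_five, outside, inside, in_range))
```

Each comparison returns one TRUE/FALSE per element, so the result can be used directly to index the vector, as in `a[inside]`.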
  44. George Boole, 1815 - 1864, Mathematician & Philosopher

    A Class-Room Introduction to Logic https://niyamaklogic.wordpress.com/category/laws-of-thoughts/
  45. verbs {dplyr}

    # Select helper functions
    starts_with("s")
    ends_with("s")
    contains("se")
    matches("^.e")
    one_of(c("tag", "B"))
    everything()

    Source: 「dplyr::selectの活用例メモ」 ("Notes on practical uses of dplyr::select"), kazutan https://kazutan.github.io/blog/2017/04/dplyr-select-memo/
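To see what the select helpers pick, here is a hedged sketch on the built-in `iris` data. Using `iris` is an assumption, since the slide does not name its data set, and {dplyr} must be installed. `any_of()` is shown in place of `one_of()`, which current tidyselect documentation marks as superseded.

```r
# Assumes {dplyr} is installed; iris is an assumed example data set.
library(dplyr)

# Helpers match column names case-insensitively by default.
starts_s <- names(select(iris, starts_with("s")))
ends_s <- names(select(iris, ends_with("s")))
has_se <- names(select(iris, contains("se")))
second_e <- names(select(iris, matches("^.e")))

# any_of() silently skips names that are absent ("tag" is not in iris).
some_cols <- names(select(iris, any_of(c("Species", "tag"))))

print(starts_s)
print(second_e)
print(some_cols)
```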
  46. 1. mutate() 2. filter() 3. select() 4. group_by() 5. summarize()

    6. left_join() 7. arrange() Data.frame manipulation 0. %>% ✔ ✔ ✔ ✔
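All seven verbs can be chained in one pipeline. The sketch below assumes {dplyr} is installed; the `sales` and `regions` tables and their values are invented purely for illustration.

```r
# Assumes {dplyr} is installed; sales and regions are hypothetical tables.
library(dplyr)

sales <- data.frame(
  shop = c("A", "A", "B", "B", "C"),
  price = c(100, 200, 150, 50, 300),
  qty = c(1, 2, 3, 4, 5)
)
regions <- data.frame(
  shop = c("A", "B", "C"),
  region = c("east", "west", "east")
)

result <- sales %>%
  mutate(amount = price * qty) %>%          # 1. add a column
  filter(amount >= 200) %>%                 # 2. keep rows meeting a condition
  select(shop, amount) %>%                  # 3. keep only some columns
  group_by(shop) %>%                        # 4. define groups
  summarize(total = sum(amount)) %>%        # 5. one row per group
  left_join(regions, by = "shop") %>%       # 6. attach columns from another table
  arrange(desc(total))                      # 7. sort rows

print(result)
```

Each step takes a data frame and returns a data frame, which is why the verbs compose so naturally with the pipe.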
  47. Grammar of data manipulation: By constraining your options, it helps you think about

    your data manipulation challenges. Introduction to dplyr https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
  48. "The more constraints one imposes, the more one frees one's self of the chains that shackle the spirit."

    Igor Stravinsky (Игорь Ф. Стравинский), 1882 - 1971. The slide pairs the quote with a Japanese rendering, marked as a fairly free translation.
  49. import Tidy Transform Visualize Model Communicate Modified from “R for

    Data Science”, H. Wickham, 2017 Data Science ① ②
  50. Text Image First, A. Next, B. Then C. Finally D.

    time Intention encode "Frozen" structure A B C D time value α β
  51. [Figure: a mapping between X and Y, with x_1, x_2 in X and y_1, y_2 in Y]

    Visualization ⊂ mapping
  52. [Figure: a mapping between X and Y, with x_1, x_2 in X and y_1, y_2 in Y]

    Visualization ⊂ mapping. Aesthetic channels: x axis, y axis, color, fill, shape, linetype, alpha…
  53. [Figure: a mapping between X and Y, with x_1, x_2 in X and y_1, y_2 in Y]

    Mapping from data to aesthetic channels (x axis, y axis, color, fill, shape, linetype, alpha…): ggplot2 package
  54. ggplot2 # install.packages("tidyverse") library(tidyverse) dat <- data.frame(a = 1:3, b

    = 8:10) Attach package Simple example > dat a b 1 1 8 2 2 9 3 3 10
  55. dat <- data.frame(a = 1:3, b = 8:10) ggplot(data =

    dat) + geom_point(mapping = aes(x = a, y = b)) ggplot2
  56. dat <- data.frame(a = 1:3, b = 8:10) ggplot(data =

    dat) + geom_point(mapping = aes(x = a, y = b)) [Figure: a mapping between X and Y, with x_1, x_2 in X and y_1, y_2 in Y] Mapping from data to aesthetic channels (x axis, y axis, color, fill, shape, linetype, alpha…): ggplot2
  57. dat <- data.frame(a = 1:3, b = 8:10) ggplot(data =

    dat) + geom_point(mapping = aes(x = a, y = b)) + geom_path(mapping = aes(x = a, y = b))
  58. dat <- data.frame(a = 1:3, b = 8:10) ggplot(data =

    dat, mapping = aes(x = a, y = b)) + geom_point() + geom_path() inheritance
  59. dat <- data.frame(a = 1:3, b = 8:10) ggplot(data =

    dat) + aes(x = a, y = b) + geom_point() + geom_path()
  61. dat <- data.frame(a = 1:3, b = 8:10) g <-

    ggplot(data = dat) + aes(x = a, y = b) + geom_point() g + geom_path()
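Because a plot is an ordinary R object, it can be extended and then written to an image file with `ggsave()`, the last step in the deck's data-processing diagram. This is a sketch assuming {ggplot2} and a working PNG graphics device; the file name, size, and resolution are arbitrary choices.

```r
# Assumes {ggplot2} is installed; file name and dimensions are arbitrary.
library(ggplot2)

dat <- data.frame(a = 1:3, b = 8:10)

# Store the base plot, then add a layer to a copy of it.
g <- ggplot(data = dat) +
  aes(x = a, y = b) +
  geom_point()

g_path <- g + geom_path()

# Write the extended plot to a temporary PNG file.
out_file <- file.path(tempdir(), "example_plot.png")
ggsave(filename = out_file, plot = g_path, width = 4, height = 3, dpi = 150)

print(file.exists(out_file))
```

Adding a layer returns a new object, so `g` itself still holds only the point layer.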
  62. Anscombe's quartet

    > anscombe
       x1 x2 x3 x4    y1   y2    y3    y4
    1  10 10 10  8  8.04 9.14  7.46  6.58
    2   8  8  8  8  6.95 8.14  6.77  5.76
    3  13 13 13  8  7.58 8.74 12.74  7.71
    4   9  9  9  8  8.81 8.77  7.11  8.84
    5  11 11 11  8  8.33 9.26  7.81  8.47
    6  14 14 14  8  9.96 8.10  8.84  7.04
    7   6  6  6  8  7.24 6.13  6.08  5.25
    8   4  4  4 19  4.26 3.10  5.39 12.50
    9  12 12 12  8 10.84 9.13  8.15  5.56
    10  7  7  7  8  4.82 7.26  6.42  7.91
    11  5  5  5  8  5.68 4.74  5.73  6.89
  63. ggplot(data = anscombe) + aes(x = x1, y = y1)

    + geom_point() Anscombe's quartet
  64. Anscombe's quartet: mapping columns to x and y

    > anscombe
       x1 x2 x3 x4    y1   y2    y3    y4
    1  10 10 10  8  8.04 9.14  7.46  6.58
    2   8  8  8  8  6.95 8.14  6.77  5.76
    3  13 13 13  8  7.58 8.74 12.74  7.71
    4   9  9  9  8  8.81 8.77  7.11  8.84
    5  11 11 11  8  8.33 9.26  7.81  8.47
    6  14 14 14  8  9.96 8.10  8.84  7.04
    7   6  6  6  8  7.24 6.13  6.08  5.25
    8   4  4  4 19  4.26 3.10  5.39 12.50
    9  12 12 12  8 10.84 9.13  8.15  5.56
    10  7  7  7  8  4.82 7.26  6.42  7.91
    11  5  5  5  8  5.68 4.74  5.73  6.89
  65. Wide form

    > anscombe
       x1 x2 x3 x4    y1   y2    y3    y4
    1  10 10 10  8  8.04 9.14  7.46  6.58
    2   8  8  8  8  6.95 8.14  6.77  5.76
    3  13 13 13  8  7.58 8.74 12.74  7.71
    4   9  9  9  8  8.81 8.77  7.11  8.84
    5  11 11 11  8  8.33 9.26  7.81  8.47
    6  14 14 14  8  9.96 8.10  8.84  7.04
    7   6  6  6  8  7.24 6.13  6.08  5.25
    8   4  4  4 19  4.26 3.10  5.39 12.50
    9  12 12 12  8 10.84 9.13  8.15  5.56
    10  7  7  7  8  4.82 7.26  6.42  7.91
    11  5  5  5  8  5.68 4.74  5.73  6.89

    Long form

    > anscombe_long
    # A tibble: 44 x 3
      key       x     y
      <chr> <dbl> <dbl>
    1 1        10  8.04
    2 2        10  9.14
    3 3        10  7.46
    4 4         8  6.58
    5 1         8  6.95
    6 2         8  8.14
    7 3         8  6.77
    8 4         8  5.76
  66. Wide -> Long form

    > anscombe
       x1 x2 x3 x4   y1   y2    y3   y4
    1  10 10 10  8 8.04 9.14  7.46 6.58
    2   8  8  8  8 6.95 8.14  6.77 5.76
    3  13 13 13  8 7.58 8.74 12.74 7.71

    anscombe_long <- pivot_longer(data = anscombe, cols = everything(), names_pattern = "(.)(.)", names_to = c(".value", "key"))
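Once the data are in long form, the dplyr verbs apply per data set. The sketch below assumes {tidyr} and {dplyr} are installed; it summarizes each of the four sets by `key`, and the summary column names are illustrative. It shows the well-known property of Anscombe's quartet: near-identical summary statistics despite very different shapes.

```r
# Assumes {tidyr} and {dplyr} are installed; `anscombe` ships with base R.
library(tidyr)
library(dplyr)

anscombe_long <- pivot_longer(
  data = anscombe,
  cols = everything(),
  names_pattern = "(.)(.)",
  names_to = c(".value", "key")
)

# One summary row per data set (key = "1" ... "4").
anscombe_summary <- anscombe_long %>%
  group_by(key) %>%
  summarize(
    mean_x = mean(x),
    mean_y = mean(y),
    cor_xy = cor(x, y)
  )

print(anscombe_summary)
```

The four rows come out almost identical, which is why plotting the points, rather than summarizing them, is needed to tell the sets apart.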
  67. Anscombe's quartet

    anscombe_long <- pivot_longer(data = anscombe, cols = everything(), names_pattern = "(.)(.)", names_to = c(".value", "key"))

    > anscombe_long
    # A tibble: 44 x 3
      key       x     y
      <chr> <dbl> <dbl>
    1 1        10  8.04
    2 2        10  9.14
    3 3        10  7.46
    4 4         8  6.58
    5 1         8  6.95
    6 2         8  8.14
    7 3         8  6.77
    8 4         8  5.76
  68. Anscombe's quartet

    anscombe_long <- pivot_longer(data = anscombe, cols = everything(), names_pattern = "(.)(.)", names_to = c(".value", "key"))

    > anscombe_long
    # A tibble: 44 x 3
      key       x     y
      <chr> <dbl> <dbl>
    1 1        10  8.04
    2 2        10  9.14
    3 3        10  7.46
    4 4         8  6.58
    5 1         8  6.95
    6 2         8  8.14
    7 3         8  6.77
    8 4         8  5.76

    g_anscomb <- ggplot(data = anscombe_long) + aes(x = x, y = y, color = key) + geom_point()
  69. Anscombe's quartet

    anscombe_long <- pivot_longer(data = anscombe, cols = everything(), names_pattern = "(.)(.)", names_to = c(".value", "key"))

    g_anscomb <- ggplot(data = anscombe_long) + aes(x = x, y = y, color = key) + geom_point()

    > anscombe_long
    # A tibble: 44 x 3
      key       x     y
      <chr> <dbl> <dbl>
    1 1        10  8.04
    2 2        10  9.14
    3 3        10  7.46
    4 4         8  6.58
    5 1         8  6.95
    6 2         8  8.14
    7 3         8  6.77
    8 4         8  5.76
  70. import Tidy Transform Visualize Model Communicate Modified from “R for

    Data Science”, H. Wickham, 2017 Data Science ① ②
  71. import Tidy Transform Visualize Model Communicate Modified from “R for

    Data Science”, H. Wickham, 2017 preprocessing Data science Data Observation Hypothesis Narrative of data feedback Data processing
  72. data.frame / tibble read_csv() write_csv() Table Data Wide form Long

    form pivot_longer() pivot_wider() Plot {ggplot2} Image Files ggsave() Data Processing
  73. read_csv() write_csv() Table Data Wide form Long form pivot_longer() pivot_wider()

    Plot {ggplot2} Image Files ggsave() Data Processing Long form (×8) data.frame / tibble
  74. It (dplyr) provides simple “verbs” to help you translate your

    thoughts into code. Functions that correspond to the most common data manipulation tasks. Introduction to dplyr https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html verbs {dplyr}
  75. 1. mutate() 2. filter() 3. select() 4. group_by() 5. summarize()

    6. left_join() 7. arrange() Data.frame manipulation
  76. import Tidy Transform Visualize Model Communicate Modified from “R for

    Data Science”, H. Wickham, 2017 Data Science ① ②
  77. [Figure: a mapping between X and Y, with x_1, x_2 in X and y_1, y_2 in Y]

    Mapping from data to aesthetic channels (x axis, y axis, color, fill, shape, linetype, alpha…): ggplot2 package
  78. dat <- data.frame(a = 1:3, b = 8:10) ggplot(data =

    dat) + aes(x = a, y = b) + geom_point() + geom_path()