
TokyoR#104_DataProcessing


Slides from my talk at the 104th Tokyo.R meetup.

kilometer

March 04, 2023

Transcript

  1. BeginneR / Advanced / Hoxo_m
    "If I have seen further it is by standing on the shoulders of Giants." -- Sir Isaac Newton, 1676
  2. Data Science: Import, Tidy, Transform, Visualize, Model, Communicate (① ②)
    Modified from "R for Data Science", H. Wickham, 2017
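  3. Map
    Set X (source set, domain), set Y (target set, codomain), element x, element y
    Map f: X → Y, or f: x ⟼ y
    [Map] A rule that associates each element of one set with exactly one element of another set.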
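  4. Existence → (map) → information
    Map space: [latitude, longitude]
    Species-name space: Homo sapiens
    Name space: coffee
    Monetary-value space (yen): ¥290
    Monetary-value space (dollars): $2.53
    [Map] A rule that associates each element of one set with exactly one element of another set.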
  5. Data Processing
    Table Data: read_csv(), write_csv()
    Wide form / Long form: pivot_longer(), pivot_wider()
    Long form / Nested form: group_nest(), unnest()
    Plot: {ggplot2}, {patchwork}
    Image Files: ggsave()
  6. Data Processing (data.frame / tibble)
    Table Data: read_csv(), write_csv()
    Wide form / Long form: pivot_longer(), pivot_wider()
    Long form / Nested form: group_nest(), unnest()
    Plot: {ggplot2}, {patchwork}
    Image Files: ggsave()
  7. vector (in R, in Excel)
    pre <- c(1, 2, 3, 4, 5)
    post <- pre * 5
    > pre
    [1] 1 2 3 4 5
    > post
    [1]  5 10 15 20 25
  8. vector
    vec1 <- c(1, 2, 3, 4, 5)
    vec2 <- 1:5
    vec3 <- seq(from = 1, to = 5, by = 1)
    > vec1
    [1] 1 2 3 4 5
    > vec2
    [1] 1 2 3 4 5
    > vec3
    [1] 1 2 3 4 5
  9. vector
    vec1 <- seq(from = 1, to = 5, by = 1)
    vec2 <- seq(1, 5, 1)
    > vec1
    [1] 1 2 3 4 5
    > vec2
    [1] 1 2 3 4 5
  10. vector
    > ?seq
    seq {base}: Sequence Generation
    Description: Generate regular sequences. seq is a standard generic with a default method. …
    Usage:
    seq(...)
    ## Default S3 method:
    seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
        length.out = NULL, along.with = NULL, ...)
  11. vector
    vec1 <- rep(1:3, times = 2)
    vec2 <- rep(1:3, each = 2)
    vec3 <- rep(1:3, times = 2, each = 2)
    > vec1
    [1] 1 2 3 1 2 3
    > vec2
    [1] 1 1 2 2 3 3
    > vec3
    [1] 1 1 2 2 3 3 1 1 2 2 3 3
  12. vector
    vec1 <- 11:15
    > vec1
    [1] 11 12 13 14 15
    > vec1[1]
    [1] 11
    > vec1[3:5]
    [1] 13 14 15
    > vec1[c(1:2, 5)]
    [1] 11 12 15
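Slide 12 indexes by position. As a small supplementary sketch (not on the slide), base R also accepts negative and logical indices; the variable names below are only illustrative.

```r
vec1 <- 11:15

# Negative index: drop the first element
without_first <- vec1[-1]

# Logical index: keep the elements that satisfy a condition
over_13 <- vec1[vec1 > 13]

print(without_first)  # 12 13 14 15
print(over_13)        # 14 15
```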
  13. list
    list1 <- list(1:6, 11:15, c("a", "b", "c"))
    > list1
    [[1]]
    [1] 1 2 3 4 5 6
    [[2]]
    [1] 11 12 13 14 15
    [[3]]
    [1] "a" "b" "c"
  14. list
    list1 <- list(1:6, 11:15, c("a", "b", "c"))
    > list1[[1]]
    [1] 1 2 3 4 5 6
    > list1[[3]][2:3]
    [1] "b" "c"
    > list1[[2]] * 3
    [1] 33 36 39 42 45
  15. named list
    list2 <- list(A = 1:6, B = 11:15, C = c("a", "b", "c"))
    > list2
    $A
    [1] 1 2 3 4 5 6
    $B
    [1] 11 12 13 14 15
    $C
    [1] "a" "b" "c"
  16. named list
    list2 <- list(A = 1:6, B = 11:15, C = c("a", "b", "c"))
    > list2$A
    [1] 1 2 3 4 5 6
    > list2$C[2:3]
    [1] "b" "c"
    > list2$B * 3
    [1] 33 36 39 42 45
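Slides 14 to 16 extract elements with `[[ ]]` and `$`. A minimal sketch of how these differ from single brackets, which return a sub-list rather than the element itself (the names `sub_list` and `element` are illustrative):

```r
list2 <- list(A = 1:6, B = 11:15, C = c("a", "b", "c"))

sub_list <- list2["A"]    # single bracket: still a list, of length 1
element  <- list2[["A"]]  # double bracket: the integer vector itself

print(class(sub_list))             # "list"
print(class(element))              # "integer"
print(identical(element, list2$A)) # TRUE: `$` is shorthand for `[[ ]]` by name
```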
  17. list / named list
    list1 <- list(1:6, 11:15, c("a", "b", "c"))
    > class(list1)
    [1] "list"
    > names(list1)
    NULL
    list2 <- list(A = 1:6, B = 11:15, C = c("a", "b", "c"))
    > class(list2)
    [1] "list"
    > names(list2)
    [1] "A" "B" "C"
  18. named list & data.frame
    list3 <- list(A = 1:3, B = 11:13)
    > class(list3)
    [1] "list"
    > names(list3)
    [1] "A" "B"
    df1 <- data.frame(A = 1:3, B = 11:13)
    > class(df1)
    [1] "data.frame"
    > names(df1)
    [1] "A" "B"
  19. named list & data.frame
    list3 <- list(A = 1:3, B = 11:13)
    df1 <- data.frame(A = 1:3, B = 11:13)
    > str(list3)
    List of 2
     $ A: int [1:3] 1 2 3
     $ B: int [1:3] 11 12 13
    > str(df1)
    'data.frame': 3 obs. of 2 variables:
     $ A: int 1 2 3
     $ B: int 11 12 13
  20. named list & data.frame
    > list3
    $A
    [1] 1 2 3
    $B
    [1] 11 12 13
    > df1
      A  B
    1 1 11
    2 2 12
    3 3 13
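Slides 18 to 20 put a named list and a data.frame side by side. The sketch below checks the relationship directly: a data.frame is stored as a list whose elements (columns) all have the same length. The `tryCatch()` wrapper and the name `bad` are my additions for illustration.

```r
list3 <- list(A = 1:3, B = 11:13)
df1 <- data.frame(A = 1:3, B = 11:13)

print(is.list(df1))   # TRUE: a data.frame is a list underneath
print(length(df1))    # 2: length counts columns, as with a list
print(df1[["B"]])     # list-style access works on columns

# A named list with equal-length elements converts cleanly
converted <- as.data.frame(list3)
print(all.equal(converted, df1))

# Unequal lengths that cannot be recycled are rejected
bad <- tryCatch(data.frame(A = 1:3, B = 1:2), error = function(e) "error")
print(bad)
```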
  21. data.frame vs. matrix
    df1 <- data.frame(A = 1:3, B = 11:13)
    > df1
      A  B
    1 1 11
    2 2 12
    3 3 13
    > str(df1)
    'data.frame': 3 obs. of 2 variables:
     $ A: int 1 2 3
     $ B: int 11 12 13
    mat1 <- matrix(c(1:3, 11:13), nrow = 3, ncol = 2)
    > mat1
         [,1] [,2]
    [1,]    1   11
    [2,]    2   12
    [3,]    3   13
    > str(mat1)
     int [1:3, 1:2] 1 2 3 11 12 13
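The `str()` output on slide 21 hints at the practical difference: a matrix is one vector with dimensions, so it holds a single type, while each data.frame column keeps its own type. A small sketch with mixed data (`mat2` and `df2` are illustrative names, not from the slides):

```r
# Mixing integers and strings in a matrix coerces everything to character
mat2 <- cbind(1:3, c("a", "b", "c"))
print(typeof(mat2))   # "character"

# A data.frame keeps one type per column
df2 <- data.frame(A = 1:3, B = c("a", "b", "c"), stringsAsFactors = FALSE)
print(class(df2$A))   # "integer"
print(class(df2$B))   # "character"
```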
  22. Data Processing (data.frame / tibble)
    Table Data: read_csv(), write_csv()
    Wide form / Long form: pivot_longer(), pivot_wider()
    Long form / Nested form: group_nest(), unnest()
    Plot: {ggplot2}, {patchwork}
    Image Files: ggsave()
  23. Wide form data
    > anscombe
       x1 x2 x3 x4    y1   y2    y3    y4
    1  10 10 10  8  8.04 9.14  7.46  6.58
    2   8  8  8  8  6.95 8.14  6.77  5.76
    3  13 13 13  8  7.58 8.74 12.74  7.71
    4   9  9  9  8  8.81 8.77  7.11  8.84
    5  11 11 11  8  8.33 9.26  7.81  8.47
    6  14 14 14  8  9.96 8.10  8.84  7.04
    7   6  6  6  8  7.24 6.13  6.08  5.25
    8   4  4  4 19  4.26 3.10  5.39 12.50
    9  12 12 12  8 10.84 9.13  8.15  5.56
    10  7  7  7  8  4.82 7.26  6.42  7.91
    11  5  5  5  8  5.68 4.74  5.73  6.89
  24. Wide form data
    df <- rownames_to_column(
      anscombe,
      var = "tag"
    )
    > df
      tag x1 x2 x3 x4   y1   y2    y3   y4
    1   1 10 10 10  8 8.04 9.14  7.46 6.58
    2   2  8  8  8  8 6.95 8.14  6.77 5.76
    3   3 13 13 13  8 7.58 8.74 12.74 7.71
    4   4  9  9  9  8 8.81 8.77  7.11 8.84
    5   5 11 11 11  8 8.33 9.26  7.81 8.47
    6   6 14 14 14  8 9.96 8.10  8.84 7.04
  25. Wide form → Long form data
    df_long_1 <- pivot_longer(
      data = df,
      cols = !tag
    )
    df_long_2 <- pivot_longer(
      data = df,
      cols = !tag,
      names_to = c(".value", "key"),
      names_pattern = "(.)(.)"
    )
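To make the difference between the two calls on slide 25 concrete, here is a self-contained sketch that runs both on the built-in `anscombe` data and inspects the resulting shapes. It assumes the {tidyr} and {tibble} packages are installed.

```r
library(tidyr)
library(tibble)

df <- rownames_to_column(anscombe, var = "tag")

# Default: every non-tag column becomes one (name, value) row
df_long_1 <- pivot_longer(data = df, cols = !tag)

# ".value" splits names like "x1" into a column name ("x") and a key ("1")
df_long_2 <- pivot_longer(
  data = df,
  cols = !tag,
  names_to = c(".value", "key"),
  names_pattern = "(.)(.)"
)

print(dim(df_long_1))    # 88 rows, 3 columns
print(names(df_long_1))  # "tag" "name" "value"
print(dim(df_long_2))    # 44 rows, 4 columns
print(names(df_long_2))  # "tag" "key" "x" "y"
```

The first form is fully long (11 rows × 8 columns = 88 values); the second keeps `x` and `y` as paired columns, which is the shape used later in the deck for plotting.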
  26. Long form → Wide form data
    pivot_wider(
      data = df_long_1,
      values_from = value,
      names_from = name
    )
    pivot_wider(
      data = df_long_2,
      values_from = c(x, y),
      names_from = tag
    )
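As a round-trip check, the sketch below widens both long forms back toward the original layout. Note one deliberate difference from the slide: the slide's second call uses `names_from = tag`, which yields one row per `key`; here I use `names_from = key` with `names_sep = ""` (my choice, not the slide's) so that the original column names `x1` … `y4` are recovered.

```r
library(tidyr)
library(tibble)

df <- rownames_to_column(anscombe, var = "tag")
df_long_1 <- pivot_longer(data = df, cols = !tag)
df_long_2 <- pivot_longer(data = df, cols = !tag,
                          names_to = c(".value", "key"),
                          names_pattern = "(.)(.)")

# (name, value) pairs back to one column per name
wide_1 <- pivot_wider(data = df_long_1,
                      names_from = name, values_from = value)

# x/y columns spread by key; names_sep = "" joins "x" and "1" into "x1"
wide_2 <- pivot_wider(data = df_long_2,
                      names_from = key, values_from = c(x, y),
                      names_sep = "")

print(names(wide_1))
print(names(wide_2))
```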
  27. Data Processing (data.frame / tibble)
    Table Data: read_csv(), write_csv()
    Wide form / Long form: pivot_longer(), pivot_wider()
    Plot: {ggplot2}
    Image Files: ggsave()
  28. Data Processing (data.frame / tibble)
    Table Data: read_csv(), write_csv()
    Wide form / Long form: pivot_longer(), pivot_wider()
    Plot: {ggplot2}
    Image Files: ggsave()
    [Diagram: multiple "Long form" tables]
  29. verbs {dplyr}
    "It (dplyr) provides simple "verbs", functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code."
    Introduction to dplyr: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
  30. Grammar of data manipulation
    "By constraining your options, it helps you think about your data manipulation challenges."
    Introduction to dplyr: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
  31. Data.frame manipulation
    1. mutate()
    2. filter()
    3. select()
    4. group_by()
    5. summarize()
    6. left_join()
    7. arrange()
  32. Data.frame manipulation
    0. %>%
    1. mutate()
    2. filter()
    3. select()
    4. group_by()
    5. summarize()
    6. left_join()
    7. arrange()
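The seven verbs plus the pipe can be seen together in one short chain. This is a sketch, not an example from the talk: it reuses the long-form `anscombe` data built elsewhere in the deck, and `key_labels` is a hypothetical lookup table invented here so that `left_join()` has something to join. It assumes {dplyr} and {tidyr} are installed.

```r
library(dplyr)
library(tidyr)

anscombe_long <- pivot_longer(data = anscombe, cols = everything(),
                              names_to = c(".value", "key"),
                              names_pattern = "(.)(.)")

# Hypothetical lookup table, for illustration only
key_labels <- data.frame(key = as.character(1:4),
                         label = paste("set", 1:4))

result <- anscombe_long %>%
  mutate(ratio = y / x) %>%              # 1. add a derived column
  filter(x > 5) %>%                      # 2. keep rows matching a condition
  select(key, x, y, ratio) %>%           # 3. keep only some columns
  group_by(key) %>%                      # 4. define groups
  summarize(mean_x = mean(x),            # 5. one row per group
            mean_ratio = mean(ratio),
            n = n()) %>%
  left_join(key_labels, by = "key") %>%  # 6. attach columns from another table
  arrange(desc(mean_ratio))              # 7. sort rows

print(result)
```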
  33. Pipe algebra: %>% {magrittr}
    X %>% f          is  f(X)
    X %>% f(y)       is  f(X, y)
    X %>% f %>% g    is  g(f(X))
    X %>% f(y, .)    is  f(y, X)
    "dplyr再入門(基本編)" ("Reintroduction to dplyr: Basics"), yutannihilation: https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-ji-ben-bian
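The four identities on this slide can be checked directly. In the sketch below, `f` and `g` are toy functions made up for the check; only `%>%` comes from {magrittr}.

```r
library(magrittr)

# Toy functions (illustrative only); f is not symmetric, so argument order matters
f <- function(a, b = 1) a - b
g <- function(a) a * 10

X <- 5
y <- 2

check_1 <- identical(X %>% f,       f(X))
check_2 <- identical(X %>% f(y),    f(X, y))
check_3 <- identical(X %>% f %>% g, g(f(X)))
check_4 <- identical(X %>% f(y, .), f(y, X))  # "." places X explicitly

print(c(check_1, check_2, check_3, check_4))  # TRUE TRUE TRUE TRUE
```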
  34. Bring milk from the kitchen!
    ① lift: lift(Robot, glass, table) -> Robot'
    ② take: take(Robot', fridge, milk) -> Robot''
  35. Bring milk from the kitchen!
    Robot' <- lift(Robot, glass, table)      # ①
    Robot'' <- take(Robot', fridge, milk)    # ②
    Robot''' <- pour(Robot'', milk, glass)   # ③
    result <- put(Robot''', glass, table)    # ④
    by using pipe,
    result <- Robot %>%
      lift(glass, table) %>%   # ①
      take(fridge, milk) %>%   # ②
      pour(milk, glass) %>%    # ③
      put(glass, table)        # ④
  36. Bring milk from the kitchen!
    Robot' <- lift(Robot, glass, table)      # ①
    Robot'' <- take(Robot', fridge, milk)    # ②
    Robot''' <- pour(Robot'', milk, glass)   # ③
    result <- put(Robot''', glass, table)    # ④
    by using pipe,
    result <- Robot %>%
      lift(glass, table) %>%   # ①
      take(fridge, milk) %>%   # ②
      pour(milk, glass) %>%    # ③
      put(glass, table)        # ④
  37. Bring milk from the kitchen! (Thinking / Reading)
    Robot' <- lift(Robot, glass, table)      # ①
    Robot'' <- take(Robot', fridge, milk)    # ②
    Robot''' <- pour(Robot'', milk, glass)   # ③
    result <- put(Robot''', glass, table)    # ④
    by using pipe,
    result <- Robot %>%
      lift(glass, table) %>%   # ①
      take(fridge, milk) %>%   # ②
      pour(milk, glass) %>%    # ③
      put(glass, table)        # ④
  38. Pipe algebra: %>% {magrittr}
    X %>% f          is  f(X)
    X %>% f(y)       is  f(X, y)
    X %>% f %>% g    is  g(f(X))
    X %>% f(y, .)    is  f(y, X)
    "dplyr再入門(基本編)" ("Reintroduction to dplyr: Basics"), yutannihilation: https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-ji-ben-bian
  39. Data.frame manipulation
    0. %>% ✔
    1. mutate()
    2. filter()
    3. select()
    4. group_by()
    5. summarize()
    6. left_join()
    7. arrange()
  40. Boolean operators (Boolean algebra)
    A == B     # equal to
    A != B     # not equal to
    A | B      # or
    A & B      # and
    A %in% B   # is A in B?
    George Boole, 1815 - 1864 (image: Wikipedia)
  41. Boolean operators (Boolean algebra)
    > "a" != "b"      # not equal to
    [1] TRUE
    > 1 %in% 10:100   # is A in B?
    [1] FALSE
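The operators on slides 40 and 41 are vectorized, which is what makes them useful for filtering. A minimal base-R sketch (the vector `x` and the result names are illustrative):

```r
x <- c(1, 5, 10, 50)

is_ten   <- x == 10          # element-wise comparison
between  <- x > 1 & x < 50   # and
outside  <- x < 5 | x > 10   # or
in_range <- x %in% 10:100    # membership test, one result per element of x

print(is_ten)      # FALSE FALSE  TRUE FALSE
print(between)     # FALSE  TRUE  TRUE FALSE
print(outside)     #  TRUE FALSE FALSE  TRUE
print(in_range)    # FALSE FALSE  TRUE  TRUE

# A logical vector can be used directly to subset
print(x[in_range]) # 10 50
```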
  42. George Boole, 1815 - 1864: Mathematician & Philosopher
    A Class-Room Introduction to Logic: https://niyamaklogic.wordpress.com/category/laws-of-thoughts/
  43. verbs {dplyr}
    # Select helper functions
    starts_with("s")
    ends_with("s")
    contains("se")
    matches("^.e")
    one_of(c("tag", "B"))
    everything()
    "dplyr::selectの活用例メモ" ("Notes on using dplyr::select"), kazutan: https://kazutan.github.io/blog/2017/04/dplyr-select-memo/
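A short sketch of the select helpers applied to the `anscombe` data with the `tag` column added earlier in the deck. The patterns are adapted to that data's column names rather than copied from the slide. One substitution to note: current tidyselect documents `one_of()` as superseded by `any_of()` and `all_of()`, so the sketch uses `any_of()`. It assumes {dplyr} and {tibble} are installed.

```r
library(dplyr)
library(tibble)

df <- rownames_to_column(anscombe, var = "tag")

cols_start  <- names(select(df, starts_with("x")))          # x1 x2 x3 x4
cols_end    <- names(select(df, ends_with("1")))            # x1 y1
cols_has    <- names(select(df, contains("ag")))            # tag
cols_regex  <- names(select(df, matches("^y[12]$")))        # y1 y2
cols_any    <- names(select(df, any_of(c("tag", "B"))))     # tag ("B" is absent and skipped)
cols_front  <- names(select(df, y1, everything()))          # y1 first, then the rest

print(cols_start)
print(cols_end)
print(cols_has)
print(cols_regex)
print(cols_any)
print(cols_front)
```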
  44. Data.frame manipulation (✔ ✔ ✔ ✔)
    0. %>%
    1. mutate()
    2. filter()
    3. select()
    4. group_by()
    5. summarize()
    6. left_join()
    7. arrange()
  45. "The more constraints one imposes, the more one frees one's self of the chains that shackle the spirit."
    Igor Stravinsky (И́горь Ф. Страви́нский), 1882 - 1971
    Note: the Japanese rendering shown on the slide is a fairly loose translation.
  46. Data Science: Import, Tidy, Transform, Visualize, Model, Communicate (① ②)
    Modified from "R for Data Science", H. Wickham, 2017
  47. Text vs. Image
    Text: "First, A. Next, B. Then C. Finally D." (unfolds over time)
    Image: "Frozen" structure of A, B, C, D (axes: time, value; α, β)
    Intention → encode
  48. Data visualization
    [Diagram: data variables mapped onto a plot]
    mapping
  49. Data visualization
    [Diagram: data variables mapped onto a plot]
    mapping
    aesthetic channels: x axis, y axis, color, fill, shape, linetype, alpha…
  50. Data visualization
    [Diagram: data variables mapped onto a plot]
    mapping
    aesthetic channels: x axis, y axis, color, fill, shape, linetype, alpha…
    Plotting with ggplot:
    ggplot(data = my_data) +
      aes(x = X, y = Y) +
      geom_point()
  51. Data visualization
    Existence → map (observation) → data → map (data visualization) → graph
    [Diagram: data, mapping, aesthetic channels]
  52. Your first ggplot
    library(tidyverse)
    dat <- data.frame(tag = rep(c("a", "b"), each = 2),
                      X = c(1, 3, 5, 7),
                      Y = c(3, 9, 4, 2))
    ggplot() +
      geom_point(data = dat,
                 mapping = aes(x = X, y = Y))
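  53. Your first ggplot
    library(tidyverse)
    dat <- data.frame(tag = rep(c("a", "b"), each = 2),
                      X = c(1, 3, 5, 7),
                      Y = c(3, 9, 4, 2))
    ggplot() +                    # declare the start of drawing; connect with the + sign
      geom_point(                 # use the geom_*() function that matches the type of plot
        data = dat,               # specify the data.frame
        mapping = aes(x = X, y = Y))   # inside aes(), specify which variable goes to which channel
    In aes(x = X, y = Y), the left-hand sides (x, y) are argument names of aes(); the right-hand sides (X, Y) are variable names in dat.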
  54. Your second ggplot
    library(tidyverse)
    dat <- data.frame(tag = rep(c("a", "b"), each = 2),
                      X = c(1, 3, 5, 7),
                      Y = c(3, 9, 4, 2))
    ggplot() +
      geom_point(data = dat, mapping = aes(x = X, y = Y)) +
      geom_path(data = dat, mapping = aes(x = X, y = Y))
  55. Various ways to write ggplot code
    ggplot() +
      geom_point(data = dat, mapping = aes(x = X, y = Y)) +
      geom_path(data = dat, mapping = aes(x = X, y = Y))
    Shared settings can be given inside ggplot() and omitted from the layers that follow:
    ggplot(data = dat, mapping = aes(x = X, y = Y)) +
      geom_point() +
      geom_path()
    The aes() call that carries the mapping can also be placed outside ggplot():
    ggplot(data = dat) +
      aes(x = X, y = Y) +
      geom_point() +
      geom_path()
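One way to convince yourself that the three spellings describe the same plot is to compare the data each layer ends up drawing. The sketch below does this with `layer_data()` from {ggplot2}; treating equal layer data as "the same plot" is my simplification, since axis labels and other plot-level details are not compared.

```r
library(ggplot2)

dat <- data.frame(tag = rep(c("a", "b"), each = 2),
                  X = c(1, 3, 5, 7),
                  Y = c(3, 9, 4, 2))

# Form 1: data and mapping repeated in every layer
p1 <- ggplot() +
  geom_point(data = dat, mapping = aes(x = X, y = Y)) +
  geom_path(data = dat, mapping = aes(x = X, y = Y))

# Form 2: shared data and mapping given once in ggplot()
p2 <- ggplot(data = dat, mapping = aes(x = X, y = Y)) +
  geom_point() +
  geom_path()

# Form 3: mapping added as a separate aes() term
p3 <- ggplot(data = dat) +
  aes(x = X, y = Y) +
  geom_point() +
  geom_path()

# Compare the computed data of layer 1 (points) and layer 2 (path)
same_points <- isTRUE(all.equal(layer_data(p1, 1), layer_data(p2, 1))) &&
  isTRUE(all.equal(layer_data(p2, 1), layer_data(p3, 1)))
same_paths <- isTRUE(all.equal(layer_data(p1, 2), layer_data(p2, 2))) &&
  isTRUE(all.equal(layer_data(p2, 2), layer_data(p3, 2)))

print(c(same_points, same_paths))
```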
  56. Various ways to write ggplot code
    Specify the mapping for the point color:
    ggplot() +
      geom_point(data = dat, mapping = aes(x = X, y = Y, color = tag)) +
      geom_path(data = dat, mapping = aes(x = X, y = Y))
    ggplot(data = dat) +
      aes(x = X, y = Y) +   # factor out only what the layers share
      geom_point(mapping = aes(color = tag)) +
      geom_path()
  57. Various ways to write ggplot code
    ggplot(data = dat) +
      aes(x = X, y = Y) +
      geom_point(aes(color = tag)) +
      geom_path()
    ggplot(data = dat) +
      aes(x = X, y = Y) +
      geom_path() +
      geom_point(aes(color = tag))
    Elements added later with + are drawn in front.
  58. Saving a ggplot image
    library(tidyverse)
    dat <- data.frame(tag = rep(c("a", "b"), each = 2),
                      X = c(1, 3, 5, 7),
                      Y = c(3, 9, 4, 2))
    g <- ggplot(data = dat) +
      aes(x = X, y = Y) +
      geom_path() +
      geom_point(mapping = aes(color = tag))
    ggsave(filename = "fig/demo01.png",
           plot = g,
           width = 4, height = 3, dpi = 150)
  59. Saving a ggplot image
    library(tidyverse)
    dat <- data.frame(tag = rep(c("a", "b"), each = 2),
                      X = c(1, 3, 5, 7),
                      Y = c(3, 9, 4, 2))
    g <- ggplot(data = dat) +
      aes(x = X, y = Y) +
      geom_path() +
      geom_point(mapping = aes(color = tag))
    ggsave(filename = "fig/demo01.png",
           plot = g,
           width = 4, height = 3, dpi = 150)
    By default, the size is specified in inches.
  60. Saving a ggplot image
    library(tidyverse)
    dat <- data.frame(tag = rep(c("a", "b"), each = 2),
                      X = c(1, 3, 5, 7),
                      Y = c(3, 9, 4, 2))
    g <- ggplot(data = dat) +
      aes(x = X, y = Y) +
      geom_path() +
      geom_point(mapping = aes(color = tag))
    ggsave(filename = "fig/demo01.png",
           plot = g,
           width = 10, height = 7.5, dpi = 150,
           units = "cm")   # "cm", "mm", or "in" can be specified
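The slides write to `fig/demo01.png`, which presumes that a `fig/` directory already exists. As a self-contained variant that can be run anywhere, the sketch below saves the same plot into R's temporary directory instead; the use of `tempdir()` and the name `out_file` are my choices, not the slides'.

```r
library(ggplot2)

dat <- data.frame(tag = rep(c("a", "b"), each = 2),
                  X = c(1, 3, 5, 7),
                  Y = c(3, 9, 4, 2))

g <- ggplot(data = dat) +
  aes(x = X, y = Y) +
  geom_path() +
  geom_point(mapping = aes(color = tag))

# Write to a temporary location so no directory has to be created first
out_file <- file.path(tempdir(), "demo01.png")

ggsave(filename = out_file, plot = g,
       width = 10, height = 7.5, units = "cm", dpi = 150)

print(file.exists(out_file))
```

To keep the slides' path, creating the directory first with `dir.create("fig", showWarnings = FALSE)` is one option.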
  61. Drawing multiple series
    > head(anscombe)
      x1 x2 x3 x4   y1   y2    y3   y4
    1 10 10 10  8 8.04 9.14  7.46 6.58
    2  8  8  8  8 6.95 8.14  6.77 5.76
    3 13 13 13  8 7.58 8.74 12.74 7.71
    4  9  9  9  8 8.81 8.77  7.11 8.84
    5 11 11 11  8 8.33 9.26  7.81 8.47
    6 14 14 14  8 9.96 8.10  8.84 7.04
    ggplot(data = anscombe) +
      geom_point(aes(x = x1, y = y1)) +
      geom_point(aes(x = x2, y = y2), color = "Red") +
      geom_point(aes(x = x3, y = y3), color = "Blue") +
      geom_point(aes(x = x4, y = y4), color = "Green")
    With only what we have covered so far, this is what you end up writing.
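  62. Data visualization with ggplot
    Existence → map (observation) → data → map (data visualization) → graph
    raw data → transform → data in a form suited to visualization → map (aesthetic channels) → graph
    One aesthetic channel of the figure corresponds to one variable of the data.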
  63. Transform the data so that each variable corresponds to an aesthetic channel (x axis, y axis, color)
    > head(anscombe)
      x1 x2 x3 x4   y1   y2    y3   y4
    1 10 10 10  8 8.04 9.14  7.46 6.58
    2  8  8  8  8 6.95 8.14  6.77 5.76
    3 13 13 13  8 7.58 8.74 12.74 7.71
    4  9  9  9  8 8.81 8.77  7.11 8.84
    5 11 11 11  8 8.33 9.26  7.81 8.47
    6 14 14 14  8 9.96 8.10  8.84 7.04
    > head(anscombe_long)
      key  x    y
    1   1 10 8.04
    2   2 10 9.14
    3   3 10 7.46
    4   4  8 6.58
    5   1  8 6.95
    6   2  8 8.14
    ggplot(data = anscombe_long) +
      aes(x = x, y = y, color = key) +
      geom_point()
    The visualization becomes simple and easy to follow.
  64. Transform the data so that each variable corresponds to an aesthetic channel (x axis, y axis, color)
    Wide data:
    > head(anscombe)
      x1 x2 x3 x4   y1   y2    y3   y4
    1 10 10 10  8 8.04 9.14  7.46 6.58
    2  8  8  8  8 6.95 8.14  6.77 5.76
    3 13 13 13  8 7.58 8.74 12.74 7.71
    4  9  9  9  8 8.81 8.77  7.11 8.84
    5 11 11 11  8 8.33 9.26  7.81 8.47
    6 14 14 14  8 9.96 8.10  8.84 7.04
    anscombe_long <- pivot_longer(data = anscombe,
                                  cols = everything(),
                                  names_to = c(".value", "key"),
                                  names_pattern = "(.)(.)")
    Long data:
    > head(anscombe_long)
      key  x    y
    1   1 10 8.04
    2   2 10 9.14
    3   3 10 7.46
    4   4  8 6.58
    5   1  8 6.95
    6   2  8 8.14
  65. Splitting the figure by level
    ggplot(data = anscombe_long) +
      aes(x = x, y = y, color = key) +
      geom_point()
    ggplot(data = anscombe_long) +
      aes(x = x, y = y, color = key) +
      geom_point() +
      facet_wrap(facets = . ~ key, nrow = 1)
  66. Data Science: Import, Tidy, Transform, Visualize, Model, Communicate (① ②)
    Modified from "R for Data Science", H. Wickham, 2017
  67. Data science: Import, Tidy, Transform, Visualize, Model, Communicate
    Modified from "R for Data Science", H. Wickham, 2017
    [Diagram labels: preprocessing, Data processing, Data, Observation, Hypothesis, Narrative of data, feedback]
  68. Data Processing (data.frame / tibble)
    Table Data: read_csv(), write_csv()
    Wide form / Long form: pivot_longer(), pivot_wider()
    Plot: {ggplot2}
    Image Files: ggsave()
  69. Data Processing (data.frame / tibble)
    Table Data: read_csv(), write_csv()
    Wide form / Long form: pivot_longer(), pivot_wider()
    Plot: {ggplot2}
    Image Files: ggsave()
    [Diagram: multiple "Long form" tables]
  70. verbs {dplyr}
    "It (dplyr) provides simple "verbs", functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code."
    Introduction to dplyr: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
  71. Data.frame manipulation
    1. mutate()
    2. filter()
    3. select()
    4. group_by()
    5. summarize()
    6. left_join()
    7. arrange()
  72. Data Science: Import, Tidy, Transform, Visualize, Model, Communicate (① ②)
    Modified from "R for Data Science", H. Wickham, 2017
  73. ggplot2 package
    [Diagram: data variables mapped onto a plot]
    data → mapping → aesthetic channels: x axis, y axis, color, fill, shape, linetype, alpha…
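  74. Data visualization with ggplot
    Existence → map (observation) → data → map (data visualization) → graph
    raw data → transform → data in a form suited to visualization → map (aesthetic channels) → graph
    One aesthetic channel of the figure corresponds to one variable of the data.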