The Data Analysis Language R: A Year in Review

Sinhrks
December 02, 2017
@ Japan.R 2017

Transcript

  1. Self-introduction
     • R
     • Package development, etc.
     • Git Awards: ranked 1st in Japan
     • Python
     • http://git-awards.com/users/search?login=sinhrks
  2. 2017 in review
     • R 3.4.x released
     • RStudio 1.1 released
     • IEEE "The 2017 Top Programming Languages": 6th place
     • CRAN passed 10,000 packages
     • Growth of documentation-related packages (blogdown, xaringan)
     • FFI (reticulate)
     • prophet, tensorflow
     • Various books
  3. What is tidy data?
     1. Put each dataset in a tibble.
     2. Put each variable in a column.

         #> # A tibble: 6 × 4
         #>       country  year  cases population
         #>         <chr> <int>  <int>      <int>
         #> 1 Afghanistan  1999    745   19987071
         #> 2 Afghanistan  2000   2666   20595360
         #> 3      Brazil  1999  37737  172006362
         #> 4      Brazil  2000  80488  174504898
         #> 5       China  1999 212258 1272915272
         #> 6       China  2000 213766 1280428583

     • R for Data Science http://r4ds.had.co.nz/tidy-data.html
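The two rules above can be tried on a small untidy table. This is a minimal sketch, not taken from the slides: it assumes the `table4a` sample dataset that ships with tidyr (one column per year) and uses `gather()`, which was the reshaping function at the time of this talk. Newer tidyr releases offer `pivot_longer()` for the same job.

```r
library(tidyr)

# table4a (bundled with tidyr) spreads one variable, the year, across
# two columns named `1999` and `2000`, so it is not tidy.
table4a

# gather() collects those columns into key/value pairs so that each
# variable (country, year, cases) gets its own column.
tidy_cases <- gather(table4a, `1999`, `2000`, key = "year", value = "cases")
tidy_cases
```

The result has one row per country and year, which is the shape the tibble on this slide uses.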
  4. What is tidy data? (personal view)
     • Think in terms of columns.
     [Figure: column-oriented operations on a table. Recoverable labels: filter, map, indexing, summarize_all, aggregation]
  5. What is the tidyverse?
     • Opinionated collection of R packages designed for data science.
     • All packages share an underlying philosophy and common APIs.
  6. tidyverse
     • dplyr, tidyr 0.7.0
     • purrr 0.2.3
     • forcats, stringr
     • reprex
     • glue (*)
     * Not included in tidyverse 1.2.1
  7. dplyr, tidyr 0.7.0
     • dplyr 0.7.0
       • Colwise functions
       • tidyeval
       • Databases (dbplyr)
       • UTF-8
     • tidyr 0.7.0
       • tidyeval
       • tidyselect
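  8. Colwise functions
     • Generalization of mutate_xxx and summarise_xxx → indexing over columns

     Function name   Columns processed
     xxx_all         All columns
     xxx_at          Columns specified by name or by index
     xxx_if          Columns that satisfy a condition

     • dplyr再入門（Colwise編）(Re-introduction to dplyr: Colwise edition) https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-colwisebian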
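The three suffixes can be compared side by side with the summarise family. This is a minimal sketch under assumptions: the data frame and the choice of `mean` are illustrative, and the `_all` / `_at` / `_if` variants are the dplyr 0.7-era API described in the talk (later dplyr versions recommend `across()` instead).

```r
library(dplyr)

# Illustrative data: two "a" columns and two "b" columns.
df <- tibble(a1 = 1:5, a2 = 5:1, b1 = 11:15, b2 = 15:11)

# _all: apply the function to every column.
all_means <- df %>% summarise_all(mean)

# _at: apply it only to columns picked by name (here, names starting with "a").
by_name <- df %>% summarise_at(vars(starts_with("a")), mean)

# _if: apply it only to columns for which the predicate returns TRUE.
by_predicate <- df %>% summarise_if(is.numeric, mean)

all_means
by_name
by_predicate
```

Every column here is numeric, so `summarise_if(is.numeric, mean)` selects the same columns as `summarise_all(mean)`. With mixed column types the predicate is what keeps `mean` away from character columns.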
  9. Colwise functions
     • dplyr再入門（Colwise編）(Re-introduction to dplyr: Colwise edition) https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-colwisebian

         library(dplyr)

         df <- data_frame(a1 = 1:5, a2 = 5:1, b1 = 11:15, b2 = 15:11)

         # rename_all: every column                   -> A1 A2 B1 B2
         df %>% rename_all(toupper)
         # rename_at: the columns at positions 1, 2   -> A1 A2 b1 b2
         df %>% rename_at(c(1, 2), toupper)
         # rename_if: columns whose mean exceeds 10   -> a1 a2 B1 B2
         df %>% rename_if(summarise_all(., mean) > 10, toupper)
  10. tidyeval
      • Goal: group by the column specified in `by`

          df <- data_frame(g1 = c(1, 2, 1, 2, 1),
                           g2 = c(1, 1, 1, 2, 2),
                           aa = 1:5, bb = 5:1)

          # Does not work: group_by() looks for a column literally named `by`
          group_sum_ng <- function(df, by) {
            df %>%
              group_by(by) %>%
              summarise_all(sum)
          }
          group_sum_ng(df, g1)
          #> Error in grouped_df_impl(data, unname(vars), drop): Column `by` is unknown

      • dplyr再入門（Tidyeval編）(Re-introduction to dplyr: Tidyeval edition) https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-tidyevalbian
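  11. tidyeval
      • NSE (non-standard evaluation)

          group_sum_enquo <- function(df, by) {
            # Capture the unevaluated argument, then unquote it with !!
            qby <- enquo(by)
            df %>%
              group_by(!! qby) %>%
              summarise_all(sum)
          }
          group_sum_enquo(df, g1)  # OK

      • dplyr再入門（Tidyeval編）(Re-introduction to dplyr: Tidyeval edition) https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-tidyevalbian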
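  12. tidyeval
      • SE (standard evaluation)

          group_sum_sym <- function(df, by) {
            # Turn the column name given as a string into a symbol, then unquote it with !!
            qby <- rlang::sym(by)
            df %>%
              group_by(!! qby) %>%
              summarise_all(sum)
          }
          group_sum_sym(df, "g1")  # OK

      • dplyr再入門（Tidyeval編）(Re-introduction to dplyr: Tidyeval edition) https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-tidyevalbian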
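  13. tidyeval
      • The backend is provided by the rlang package
      • If you only care about building pipelines:
        • Variable names passed in from outside via NSE: enquo -> !!
        • Variable names passed in from outside via SE: sym -> !!
      • dplyr再入門（Tidyeval編）(Re-introduction to dplyr: Tidyeval edition) https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-tidyevalbian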
  14. Databases (dbplyr)

          library(dplyr)

          con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
          # Write the sample data
          DBI::dbWriteTable(con, "iris", iris)

          # Read the specified table
          df <- dplyr::tbl(con, "iris")
          head(df)
          #>   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
          #> 1          7.0         3.2          4.7         1.4 versicolor
          #> 2          6.4         3.2          4.5         1.5 versicolor
          #> 3          6.9         3.1          4.9         1.5 versicolor
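  15. purrr 0.2.3
      • pluck
      • map functions

          library(purrr)

          a <- list(a = 1, b = list(x = 1, y = 2), c = 3)

          # Extract the nested element a$b$x
          pluck(a, "b", "x")
          #> [1] 1

          # imap() makes each element's key available as .y
          imap(a, ~toupper(.y))
          #> $a
          #> [1] "A"
          #>
          #> $b
          #> [1] "B"
          #>
          #> $c
          #> [1] "C"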
  16. forcats, stringr

          library(forcats)

          x <- factor(c("a", "b", "a", "c", "d"))
          x %>% forcats::fct_other(keep = c("a", "b"))
          #> [1] a     b     a     Other Other
          #> Levels: a b Other

          library(stringr)

          vals <- c("a1", "a2", "b1", "b2")
          stringr::str_which(vals, "b")
          #> [1] 3 4
  17. reprex

          library(reprex)

          reprex(1 + 3)
          # Saved to the clipboard:
          #   ``` r
          #   1 + 3
          #   #> [1] 4
          #   ```

          reprex(1 + 3, venue = "so")
          # Saved to the clipboard:
          #   <!-- language-all: lang-r -->
          #   <br/>
          #   1 + 3
          #   #> [1] 4
  18. glue
      • glue_sql

          library(glue)

          con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
          DBI::dbWriteTable(con, "iris", iris)

          var <- "Sepal.Length"
          tbl <- "iris"
          num <- 5
          # Backticked names are quoted as identifiers; plain values are quoted as literals
          q <- glue_sql("SELECT * FROM {`tbl`} WHERE {`tbl`}.{var} > {num} ", .con = con)
          q
          #> <SQL> SELECT * FROM `iris` WHERE `iris`.'Sepal.Length' > 5
  19. glue
      • glue_sql

          df <- as_data_frame(DBI::dbGetQuery(con, q))
          df
          #> # A tibble: 61 x 5
          #>   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
          #>          <dbl>       <dbl>        <dbl>       <dbl>      <chr>
          #> 1          7.0         3.2          4.7         1.4 versicolor
          #> 2          6.4         3.2          4.5         1.5 versicolor
          #> 3          6.9         3.1          4.9         1.5 versicolor
          #> 4          6.5         2.8          4.6         1.5 versicolor
          #> 5          6.3         3.3          4.7         1.6 versicolor
          #> # ... with 56 more rows
  20. Usage example 1

          library(httr)
          library(glue)
          library(purrr)
          library(dplyr)
          library(lubridate)

          org <- "tidyverse"
          # glue() acts as a formatted string literal: {org} is interpolated
          url <- glue::glue('https://api.github.com/orgs/{org}/repos')
          p <- httr::GET(url, query = list(per_page = 100)) %>%
            httr::content("parsed")
          p[[2]]
          #> $id
          #> [1] 148017
          #>
          #> $name
          #> [1] "lubridate"
          #>
          #> $full_name
          #> [1] "tidyverse/lubridate"
          #> …
  21. Usage example 1

          cols <- c("name", "stargazers_count", "created_at", "updated_at")
          dt <- dplyr::vars(dplyr::ends_with("_at"))

          pkgs <- p %>%
            # Select specific values from each list
            purrr::map(~ .[cols]) %>%
            # Convert to a tibble
            dplyr::bind_rows() %>%
            # Parse the date-time strings (Colwise function)
            dplyr::mutate_at(dt, lubridate::ymd_hms) %>%
            # Rename the columns (Colwise function)
            dplyr::rename_at(dt, dplyr::funs(sub("_at", "_time", .)))
          pkgs
          #> # A tibble: 28 x 4
          #>        name stargazers_count        created_time        updated_time
          #>       <chr>            <int>              <dttm>              <dttm>
          #> 1   ggplot2             2780 2008-05-25 01:21:32 2017-12-01 14:48:13
          #> 2 lubridate              333 2009-03-11 01:18:52 2017-11-26 21:49:53
          #> 3   stringr              226 2009-11-08 22:20:08 2017-11-30 09:00:39
          #> 4     dplyr             2087 2012-10-28 13:39:17 2017-12-01 14:02:30
          #> # ... with 24 more rows
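  22. Supplement: modelr
      • Provides features for sampling, cross-validation (CV), model manipulation, and more

          head(trees, n = 3)
          #>   Girth Height Volume
          #> 1   8.3     70   10.3
          #> 2   8.6     65   10.3
          #> 3   8.8     63   10.2

          # glm() takes the formula first; the data frame goes in `data`
          m <- glm(Volume ~ Girth + Height, data = trees)
          trees %>%
            modelr::add_predictions(m) %>%  # add the predictions as a column
            modelr::add_residuals(m)        # add the residuals as a column
          # (first 3 rows shown)
          #>   Girth Height Volume     pred      resid
          #> 1   8.3     70   10.3 4.837660 5.46234035
          #> 2   8.6     65   10.3 4.553852 5.74614837
          #> 3   8.8     63   10.2 4.816981 5.38301873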
  23. Usage example 2

          my_model <- function(df, tgt, var) {
            # Capture the arguments unevaluated (NSE) and build a formula from them
            qtgt <- rlang::enexpr(tgt)
            qvar <- rlang::enexpr(var)
            glm(rlang::new_formula(qtgt, qvar), data = df)
          }
          m <- my_model(trees, Volume, Girth + Height)
          m
          #> Call:  glm(formula = rlang::new_formula(qtgt, qvar), data = df)
          #>
          #> Coefficients:
          #> (Intercept)        Girth       Height
          #>    -57.9877       4.7082       0.3393
          #>
          #> Degrees of Freedom: 30 Total (i.e. Null);  28 Residual
          #> Null Deviance:     8106
          #> Residual Deviance: 421.9    AIC: 176.9
  24. Usage example 2

          get_besides <- function(df, model, tgt) {
            qtgt <- enquo(tgt)
            df %>%
              modelr::add_predictions(model) %>%  # add the predictions as a column
              modelr::add_residuals(model) %>%    # add the residuals as a column
              # Keep the rows that satisfy the condition (tidyeval)
              dplyr::filter(abs(resid) > (!! qtgt) * 0.5)
          }
          get_besides(trees, m, Volume)
          #>   Girth Height Volume     pred    resid
          #> 1   8.3     70   10.3 4.837660 5.462340
          #> 2   8.6     65   10.3 4.553852 5.746148
          #> 3   8.8     63   10.2 4.816981 5.383019
  25. r-lib
      • Utilities (httr, xml2…)
      • Package development (testthat, pkgdown, covr, usethis…)
      • Internals (R6, memoise…)
      • Console (cli, progress, crayon…)
  26. Usage example

          library(cli)
          library(crayon)
          library(progress)

          # cli: horizontal rule with a centered label
          rule(center = "Processing started", line_col = "red")

          # crayon: colored output (the tick symbol comes from cli)
          cat(red(symbol$tick, "check1 \n"))
          cat(blue(symbol$tick, "check2 \n"))
          cat(green(symbol$tick, "check3 \n"))

          # progress: progress bar
          pb <- progress_bar$new(total = 100)
          for (i in 1:100) {
            pb$tick()
            Sys.sleep(1 / 50)
          }

          rule(center = "Processing finished", line_col = "red")