Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Rによる大規模データの処理

Uryu Shinya
December 21, 2022

 Rによる大規模データの処理

2022年12月17日に開催された「R研究集会2022」の発表資料です。

当日の投影内容から、メモリの消費量の算出方法に誤りがあったため一部内容を変更しています。
リポジトリ: https://github.com/uribo/talk_221217_rjpusers

Uryu Shinya

December 21, 2022
Tweet

More Decks by Uryu Shinya

Other Decks in Programming

Transcript

  1. Ұԡ͠ϙΠϯτ େن໛σʔληοτͷޮ཰తͳॲཧͱసૹͷͨΊʹઃܭ "QBDIF "SSPX \BSSPX^ ύοέʔδ ΠϯϝϞϦͷΧϥϜܕσʔλΛऔΓѻ͏ʢγϦΞϥΠζෆཁʣ ଟݴޠπʔϧϘοΫεͱͯ͠ɺ͞·͟·ͳϓϩάϥϛϯάݴޠͰ࣮૷ $ 3

    1ZUIPO 3VTU +VMJB %VDL%#ͳͲ EQMZSόοΫΤϯυΛఏڙɻύΠϓԋࢉࢠʹΑΔॲཧΛద༻Մೳ QBSRVFUʹରԠɻύʔςΟγϣχϯάʹΑΔ؅ཧ MVCSJEBUFɺSMBOHɺTUSJOHSɺUJEZTFMFDUͳͲͷؔ਺ʹ΋ରԠ ະରԠͳؔ਺Λ\EVDLEC^ύοέʔδ͕Χόʔ ଎͍ খ͍͞ ಡΈࠐΈɾूܭՃ޻ॲཧ͕ ϑΝΠϧɾϝϞϦαΠζ͕ ޮ཰త DTWϑΝΠϧ͔ΒͷEQMZS΍EBUBUBCMFͰͷॲཧͱൺֱͯ͠ɺ͔ͳΓߴ଎ σʔλͷૢ࡞ɾ؅ཧ͕
  2. ࣗݾ঺հ ӝੜਅ໵ ͏ΓΎ͏͠Μ΍ ஍ཧۭؒσʔλղੳ ೥݄ʙಙౡ ౷ܭɺσʔλαΠΤϯε Ծ $  Ҋ

    Χόʔɿϓϩηε $ ࢖༻ $(ɿετοΫϑΥτͷ 3' ը૾ʢ  ԁʣ σʔλՄࢹԽ 3ݴޠ aདྷ೥݄ग़൛
  3. ΧϥϜܕσʔλσʔλ෼ੳʹಛԽ IUUQTBSSPXBQBDIFPSHPWFSWJFX ৯೑ྨ ྶ௕ྨ ྶ௕ྨ     

     Ϩοαʔύϯμ νϯύϯδʔ Ϛϯτώώ ৯೑ྨ ௗྨ ϥΠΦϯ ϑϯϘϧτϖϯΪϯ     ߦ୯ҐͰσʔλΛอ࣋ɾऔΓग़͠ $47΍.Z42-ͳͲͷσʔλϕʔε ߦࢦ޲ ಛఆྻͷ஋΁ͷૢ࡞ ˠશ෦ͷߦɾྻ΁ͷΞΫηεΛ൐͏ ྻ୯ҐͰσʔλΛอ࣋ɾऔΓग़͠ ྻࢦ޲ ৯೑ྨ ྶ௕ྨ ྶ௕ྨ       Ϩοαʔύϯμ νϯύϯδʔ Ϛϯτώώ ৯೑ྨ ௗྨ ϥΠΦϯ ϑϯϘϧτϖϯΪϯ     ˠࢦఆ͞ΕͨྻͷΈಡΈࠐΈ େྔͷߦʹର͢Δू໿ॲཧ͕ޮ཰త
  4. \BSSPX^Λ࢖ͬͨॲཧͷྲྀΕ σʔλͷಡΈࠐΈ QBSRVFU GFBUIFS UYU DTW KTPO UTW 📄 σʔλϑϨʔϜͱͯ͠ฦΓ஋ΛಘΔ

    \EQMZS^ͳͲͷؔ਺ʹΑΔॲཧ d <- read_parquet("data/zoo.parquet.parquet", as_data_frame = FALSE) result <- d |> select(name, taxon, body_length_cm) |> filter(taxon %in% c("霊長類", "齧歯類", "鳥類")) |> group_by(taxon) |> summarise(body_length_mean = mean(body_length_cm)) result |> collect() ू໿͞ΕͨσʔλΛ෼ੳɾՄࢹԽ σʔλΛϝϞϦ্ʹಡΈࠐΉ ஗ԆධՁ
  5. \BSSPX^͕ରԠ͠ͳ͍ؔ਺ʁ list_compute_functions() "SSPXRVFSZFOHJOF લड़ͷͱ͓Γɺ\BSSPX^Ͱ͸"SSPX$ ΁ͷΠϯλʔϑΣΠεΛఏڙ \EQMZS^ͷؔ਺΋ར༻Մೳ ?acero ˠ͢΂ͯͷ3ͷؔ਺͕\BSSPX^Ͱ࣮ߦՄೳͳΘ͚Ͱ͸ͳ͍ d <-

    read_parquet(here::here("data/typeB/36_tokushima/year=2022/month=10/part-0.parquet"), as_data_frame = FALSE) ࣍ͷεϥΠυͰྫΛࣔͨ͢ΊʹBSSPXΦϒδΣΫτΛ༻ҙ EBUBGSBNFʹม׵ͯ͠ॲཧPSΤϥʔ
  6. EBUBGSBNFʹม׵ͯ͠ॲཧ d |> select(datetime) |> mutate(is_jholiday = zipangu::is_jholiday(datetime)) #> Warning:

    Expression #> zipangu::is_jholiday(datetime) not supported in Arrow; #> pulling data into R #> datetime is_jholiday #> 1 2022-10-01 FALSE #> 2 2022-10-01 FALSE #> 3 2022-10-01 FALSE #> 4 2022-10-01 FALSE #> 5 2022-10-01 FALSE #> 6 2022-10-01 FALSE େ͖͍σʔλͷ··ͩͱॲཧʹ͕͔͔࣌ؒΔ͔΋
  7. ΤϥʔͰॲཧΛఀࢭ ˠ\EVDLEC^΁ͷม׵ͰରԠ d |> group_by(location_no) |> slice_min(order_by = traffic, n

    = 1) |> collect() #> Error: Slicing grouped data not supported in Arrow d |> group_by(location_no) |> to_duckdb() |> slice_min(order_by = traffic, n = 1) |> collect() #> # A tibble: 25,237 × 10 #> # Groups: location_no [274] #> datetime source_…¹ locat…² locat…³ meshc…⁴ link_…⁵ link_no traffic #> <dttm> <chr> <int> <chr> <chr> <int> <int> <int> #> 1 2022-10-29 17:50:00 3028 54 西ノ丸… 513404 2 662 2 #> 2 2022-10-29 19:50:00 3028 54 西ノ丸… 513404 2 662 2
  8. QBSRVSUͷಡΈࠐΈෳ਺ϑΝΠϧ ڞ௨ͷྻ഑ྻΛ΋ͭෳ਺ͷQBSRVFUϑΝΠϧΛಡΈࠐΉʢ͜ͷஈ֊Ͱ͸εΩʔϚͷΈʣ open_dataset(here::here("data/typeB/13_tokyo/year=2021/")) #> FileSystemDataset with 12 Parquet files #>

    datetime: timestamp[us, tz=Asia/Tokyo] #> source_code: string #> location_no: int32 #> location_name: string #> meshcode10km: string #> link_type: int32 #> link_no: int32 #> traffic: int32 #> to_link_end_10m: string #> link_ver: int32 #> month: int32 #> #> See $metadata for additional Schema metadata ݄ຖʹ෼͔ΕͨݸͷϑΝΠϧ DTWͰ΋0, open_dataset(..., format = "csv")
  9. ύʔςΟγϣχϯά Ұͭͷେ͖ͳσʔληοτΛෳ਺ͷࡉ͔ͳϑΝΠϧʹ෼ׂ͢Δઓུ )JWFܗࣜʜLFZWBMVF ඞཁͳσʔλʹૉૣ͘ΞΫηεɺॲཧͷෛ୲͕ܰݮ ෼ׂΛత֬ʹߦ͏͜ͱ͕ॏཁ ྫ͑͹ʜ೥ɺ݄ʢʣɺ౎ಓ෎ݝʢʣ d |> write_dataset(path =

    "data", partitioning = c("year", "month")) σʔληοτΛ ೥ɺ݄ʹ෼͚ͯอଘ EBUBZFBSNPOUIQBSUQBSRVFU NPOUIQBSUQBSRVFU NPOUIQBSUQBSRVFU ʜ EBUBZFBSNPOUIQBSUQBSRVFU NPOUIQBSUQBSRVFU NPOUIQBSUQBSRVFU
  10. εΩʔϚʢܕͷࢦఆʣ ྻࢦ޲ͷσʔλܗࣜʜྻ͝ͱʹܾ·ͬͨܕʢ·ͱ·ͬͨ഑ஔʣ͕͋Δ ๛෋ͳσʔλλΠϓͰྻ഑ྻͷ஋Λఆٛ͢Δ Ϗοτ੔਺ʢJOUʣͱϏοτ੔਺ JOU ͸۠ผ͞ΕΔ open_dataset("data", schema = schema(

    datetime = timestamp(unit = "ms", timezone = "Asia/Tokyo"), source_code = utf8(), location_no = int32(), …)) #> FileSystemDataset with 12 Parquet files #> datetime: timestamp[ms, tz=Asia/Tokyo] #> source_code: string #> location_no: int32 #> location_name: string #> ... IUUQTBSSPXBQBDIFPSHEPDTSSFGFSFODFEBUBUZQFIUNM
  11. Ұൠಓ࿏ͷஅ໘ަ௨ྔ৘ใσʔλ [JQϑΝΠϧల։ޙͷDTWΛEBUBUBCMFͱͯ͠ಡΈࠐΉؔ਺Λ༻ҙ IUUQTHJUIVCDPNVSJCPKBSUJDS remotes::install_github("uribo/jarticr") dplyr::glimpse( jarticr::read_jartic_traffic("data-raw/typeB_tokushima_2022_10/徳島県警_202210.csv") ) #> Rows: 2,384,773

    #> Columns: 10 #> $ datetime <dttm> 2022-10-01, 2022-10-01, 2022-10-01, 2022-10-01, 2022-… #> $ source_code <chr> "3028", "3028", "3028", "3028", "3028", "3028", "3028"… #> $ location_no <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,… #> $ location_name <chr> "徳島本町→県庁前", "県庁前→徳島本町", "新助任橋北詰→徳… #> $ meshcode10km <chr> "513404", "513404", "513404", "513404", "513404", "513… #> $ link_type <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, … #> $ link_no <int> 710, 705, 681, 676, 679, 678, 11, 375, 6, 5, 6, 5, 842… #> $ traffic <int> 21, 24, 25, 27, 25, 28, 30, 16, 42, 42, 27, 25, 16, 26… #> $ to_link_end_10m <chr> "6", "6", "11", "27", "21", "29", "26", "8", "7", "92"… #> $ link_ver <int> 202100, 202100, 202100, 202100, 202100, 202100, 202100…
  12. ݕূσʔλ αΠζͷҟͳΔσʔλΛछ༻ҙ -BSHF 4NBMM શࠃ .FEJVN )VHF શࠃ ೥ؒ ೥ؒ

    ౦ژ౎ ೥ؒ ౦ژ౎ ϲ݄ؒ ߦ਺ ྻ਺ σʔλαΠζ QBSRVFU ஍Ҭ ظؒ ϑΝΠϧαΠζ DTW ສ  ԯສ (# .#  (# (#   ԯສ (# ԯສ (# (# (# -BSHFɺ)VHFͷDTWϑΝΠϧαΠζ͸[JQѹॖ࣌ͷେ͖͞ ʢల։ޙ͸͞Βʹେ͖͍ʣ
  13. ݕূ؀ڥ "QQMF..BD 3 ϝϞϦ(#ʢݱߦͷ.#1Ͱ͸࠷େʣ library(arrow) # 10.0.1 library(readr) # 2.1.3

    library(data.table) # 1.14.6 library(dplyr) # 1.0.10 library(dtplyr) # 1.2.2 library(duckdb) # 0.6.1 ͍ͣΕͷύοέʔδ΋$3"/࠷৽൛
  14. ݕূͭͷॲཧΛൺֱ άϧʔϓԽͱूܭ ಡΈࠐΈ ݁߹ DTWϑΝΠϧͷಡΈࠐΈ଎౓ readr::read_csv() data.table::fread() arrow::read_csv_arrow() \BSSPX^ \EQMZS^

    \EBUBUBCMF^ \EVDLEC^ʜBSSPXΦϒδΣΫτ͔Βม׵ \EUQMZS^ʜEBUBUBCMFΦϒδΣΫτΛEQMZSͷؔ਺Ͱॲཧ όοΫΤϯυͰ\EBUBUBCMF^͕ػೳ
  15. ݁ՌϑΝΠϧͷಡΈࠐΈ -BSHF )VHFͷDTW༻ҙͰ͖ͣ ಉن໛ͷQBSRVSUϑΝΠϧΛ ͰಡΈࠐΉʜ0, arrow::open_dataset() # 全国1年分(2021年) traffic_large <-

    open_dataset(here::here("data/typeB/"), schema = jartic_typeB_schema) |> filter(year == 2021) dim(traffic_large) #> [1] 3923767686 12 lobstr::obj_size(traffic_large) #> 487.10 kB # 全国4年分 traffic_huge <- open_dataset(here::here("data/typeB/"), schema = jartic_typeB_schema) dim(traffic_huge) #> [1] 15677276121 12 lobstr::obj_size(traffic_huge) #> 261.59 kB -BSHF )VHF ॲཧͷ಺༰ʹΑͬͯ͸݁ՌΛಘΒΕΔঢ়ଶ
  16. ݁Ռ݁߹ -BSHF )VHFͱ΋ʹࣦഊ fi MUFS΍EJTUJODUΛͤͣʹ݁߹ͨͨ͠Ίʁ tictoc::tic(); traffic_large |> distinct(location_no, meshcode10km)

    |> left_join(arrow_mesh10km, by = c("meshcode10km" = "meshcode10km")) |> collect(); tictoc::toc() #> 16.746 sec elapsed tictoc::tic(); traffic_huge |> distinct(location_no, meshcode10km) |> left_join(arrow_mesh10km, by = c("meshcode10km" = "meshcode10km")) |> collect(); tictoc::toc() #> 67.593 sec elapsed EJTUJODUΛڬΊ͹0, -BSHF )VHF
  17. ࢀߟจݙɾ63- ͦΖͦΖ3Ϣʔβʔ΋"QBDIF"SSPXͰ1BSRVFUΛ࢖ͬͯΈ·ͤΜ͔ʁ 
 3GPS%BUB4DJFODF OEFEJUJPO  
 "QBDIF"SSPXϑΥʔϚοτ͸ͳͥ଎͍ͷ͔ 
 ʲ3ʳ"QBDIF"SSPXͱEVDLECΛࢼͯ͠ΈΔ2JJUB

    
 "QBDIF"SSPXَ͸͑͑ʂ͜ͷ··$47શ෦1BSRVFUʹม׵͍ͯ͜͠͏ͥʂ 
 "QBDIF"SSPX3$PPLCPPL 
 -BSHFS5IBO.FNPSZ%BUB8PSL fl PXTXJUI"QBDIF"SSPX 
 IUUQTOPUDIBJOFEIBUFOBCMPHDPNFOUSZ IUUQTSETIBEMFZO[BSSPXIUNM IUUQTTMJEFSBCCJUTIPDLFSPSHBVUIPSTLPVECUFDITIPXDBTFPOMJOF IUUQTFJUTVQJHJUIVCJPUPLZPSTMJEFUPLZPS@ IUUQTBSSPXVTFSOFUMJGZBQQ IUUQTBSSPXBQBDIFPSHDPPLCPPLSJOEFYIUNM IUUQTRJJUBDPNFJUTVQJJUFNTDFFCGCFFEF