Apache Arrow C++ Datasets
Kenta Murata
December 11, 2019
An introduction to Apache Arrow C++ Datasets.
Presented at Apache Arrow Tokyo Meetup 2019.
Transcript
Apache Arrow C++ Datasets
Kenta Murata, Speee, Inc.
2019.12.11 Apache Arrow Tokyo Meetup 2019

Kenta Murata
• Full-time OSS developer at Speee, Inc.
• CRuby committer (since 2010.02)
• Apache Arrow committer (since 2019.10)
  • 24th place (44 commits)
  • SparseTensor in Arrow C++
  • GLib and Ruby bindings, etc.

Structure of Apache Arrow C++
(diagram) Components: Base, Datasets, Query Engine, Data Frame
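
Apache Arrow C++ Datasets
• Provides an API for treating one or more data sources together as a single dataset
• Absorbs the differences between various kinds of data formats
• Can unify data sources with different schemas into one
• Supports data input from multiple kinds of storage
• Writing out to files is also planned for the future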

A single table can be built from multiple data sources
(diagram) Sources: a.parquet, b.parquet, Query 1, Query 2, c.csv, d.json, Record Batch 1, Record Batch 2
(diagram) Storage: Amazon S3, Amazon Redshift, Local File System, In-Memory
(diagram) Result: Arrow Table

Reading from files
(diagram) Discover → Scan → Filter & Project → Collect
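
Reading from files
• Scans files and builds Record Batches
• Multiple files can be scanned in parallel
• Discovers files under a file system directory according to specified rules
• Reassembles data that was split across multiple files
• Processes data according to the schema and partitioning rules used when it was split into multiple files
• Rows can be filtered with a condition expression
• Reads only the columns needed to produce the result
• Creates a cache on local storage
• Does not read a file until it is needed (lazy scan)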
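
File discovery
• Given the location of a base directory and a file format, it lists every target file under that directory
• Subdirectories can also be searched recursively
• Filename prefixes to ignore can be specified
• It will build the merged schema needed to read all of the target files (planned)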

File discovery example
    /data/.metadata
    /data/2018/12/JP/Tokyo/001.parquet
    /data/2018/12/JP/Tokyo/002.parquet
    /data/2018/12/JP/Osaka/001.parquet
    /data/2018/12/US/CA/001.parquet
    /data/2019/01/JP/Tokyo/001.parquet
    /data/2019/01/JP/Osaka/001.parquet
    /data/2019/01/US/CA/001.parquet
    /data/2019/01/US/NY/001.parquet
    /tmp/Tokyo.parquet
Goal: pick up only the Parquet files under /data.

File discovery example
    using namespace arrow;
    using namespace arrow::dataset;

    // Search recursively below /data.
    fs::Selector selector;
    selector.base_dir = "/data";
    selector.recursive = true;

    // Discover Parquet files on the file system `fs`.
    std::shared_ptr<FileSystemDataSourceDiscovery> discovery;
    ARROW_OK_AND_ASSIGN(
        discovery,
        FileSystemDataSourceDiscovery::Make(
            fs, selector,
            std::make_shared<dataset::ParquetFileFormat>(),
            FileSystemDiscoveryOptions()));
    ARROW_OK_AND_ASSIGN(auto datasource, discovery->Finish());
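
The selection rule from the discovery slides (base directory, file format, ignored filename prefixes) can be sketched without Arrow. The following is a minimal standard-library illustration, not the Arrow implementation: `DiscoverFiles`, `BaseName`, `StartsWith` and `EndsWith` are hypothetical helpers, the file system is replaced by a plain list of paths, and the search always behaves as if `recursive` were true.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not an Arrow API): final component of a '/'-separated path.
std::string BaseName(const std::string& path) {
  const auto pos = path.find_last_of('/');
  return pos == std::string::npos ? path : path.substr(pos + 1);
}

// Hypothetical helper: true if `s` begins with `prefix`.
bool StartsWith(const std::string& s, const std::string& prefix) {
  return s.compare(0, prefix.size(), prefix) == 0;
}

// Hypothetical helper: true if `s` ends with `suffix`.
bool EndsWith(const std::string& s, const std::string& suffix) {
  return s.size() >= suffix.size() &&
         s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Keeps the paths that lie below `base_dir`, carry `extension`, and whose
// file name does not start with any of `ignore_prefixes`.
std::vector<std::string> DiscoverFiles(
    const std::vector<std::string>& all_paths, const std::string& base_dir,
    const std::string& extension,
    const std::vector<std::string>& ignore_prefixes) {
  assert(!base_dir.empty() && base_dir.back() != '/');
  const std::string root = base_dir + "/";
  std::vector<std::string> found;
  for (const auto& path : all_paths) {
    if (!StartsWith(path, root) || !EndsWith(path, extension)) continue;
    const std::string name = BaseName(path);
    bool ignored = false;
    for (const auto& prefix : ignore_prefixes) {
      if (StartsWith(name, prefix)) {
        ignored = true;
        break;
      }
    }
    if (!ignored) found.push_back(path);
  }
  return found;
}
```

Called with the paths from the example slide, base directory `/data`, extension `.parquet` and ignored prefixes `.` and `_`, this keeps the eight Parquet files under `/data` and drops `/data/.metadata` and `/tmp/Tokyo.parquet`.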

Specifying the data partitioning rule
Path components map, in order, to: year, month, country, city
    /data/2018                          => {"year": 2018}
    /data/2018/12                       => {"year": 2018, "month": 12}
    /data/2018/12/JP                    => {"year": 2018, "month": 12, "country": "JP"}
    /data/2018/12/JP/Tokyo/001.parquet  => {"year": 2018, "month": 12, "country": "JP", "city": "Tokyo"}

    auto partition_scheme = schema({field("year", int32()),
                                    field("month", int32()),
                                    field("country", utf8()),
                                    field("city", utf8())});
    ASSERT_OK(discovery->SetPartitionScheme(partition_scheme));
    ARROW_OK_AND_ASSIGN(auto datasource, discovery->Finish());
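
The path-to-key mapping shown on this slide can be illustrated with a small standard-library sketch. It is not how Arrow implements partition schemes: `ParsePartition` and `SplitPath` are hypothetical names, values are kept as strings rather than typed as `int32` or `utf8`, and the sketch assumes every directory level below the base directory corresponds to one field.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not an Arrow API): splits on '/', dropping empty segments.
std::vector<std::string> SplitPath(const std::string& path) {
  std::vector<std::string> segments;
  std::stringstream stream(path);
  std::string segment;
  while (std::getline(stream, segment, '/')) {
    if (!segment.empty()) segments.push_back(segment);
  }
  return segments;
}

// Assigns the path components below `base_dir` to `field_names`, in order.
// Shorter paths yield fewer keys, as in the /data/2018 and /data/2018/12 rows.
std::map<std::string, std::string> ParsePartition(
    const std::string& base_dir, const std::string& path,
    const std::vector<std::string>& field_names) {
  const auto base = SplitPath(base_dir);
  const auto parts = SplitPath(path);
  assert(parts.size() >= base.size());
  std::map<std::string, std::string> values;
  for (std::size_t i = 0; i < field_names.size(); ++i) {
    const std::size_t index = base.size() + i;
    if (index >= parts.size()) break;
    values[field_names[i]] = parts[index];
  }
  return values;
}
```

One limitation of this sketch: a file placed directly under a partial directory (for example a hypothetical `/data/2018/001.parquet`) would have its file name taken as the `month` value, so a real implementation needs to separate directories from file names.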
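
Filtering
• Rows can be filtered using a condition expression
• To extract only the rows where year is 2019 and sales is greater than 100.0, pass the following expression to the scanner:
    "year"_ == 2019 && "sales"_ > 100.0
• Following the schema and partitioning rules, files that cannot match the condition are not read at all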
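
The file-skipping behaviour in the last bullet can be sketched as follows. This is an illustration under simplifying assumptions, not Arrow's expression machinery: the filter is reduced to a set of equality constraints, and `PartitionValues` and `MayMatch` are hypothetical names. A constraint on a column that is not a partition field (such as `sales`) cannot rule a file out and still has to be evaluated row by row during the scan.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical type: partition key/value pairs derived from a file's path.
using PartitionValues = std::map<std::string, std::string>;

// Returns false only when a partition value contradicts the filter, which
// means the file cannot contain matching rows and need not be read.
bool MayMatch(const PartitionValues& file_partition,
              const PartitionValues& equality_filter) {
  for (const auto& constraint : equality_filter) {
    const auto it = file_partition.find(constraint.first);
    if (it == file_partition.end()) continue;  // not a partition field
    if (it->second != constraint.second) return false;
  }
  return true;
}
```

With the filter `year == 2019`, a file under `/data/2018/...` is skipped while one under `/data/2019/...` is scanned.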
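
Specifying the columns to extract
• When not every column needs to be read, the projection feature can restrict which columns are extracted
• Restricting the columns this way skips deserialization and type conversion for the unneeded columns, so with some file formats reading the data becomes faster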
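
Why projection saves work can be shown with a toy row of text fields, where converting a field to `double` stands in for deserialization and type conversion. This is only a sketch of the idea: `DecodeProjected` is a hypothetical function, and real columnar formats such as Parquet skip whole column chunks rather than individual fields.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Converts only the fields at `projected_indices`; all other fields are
// never parsed, so their conversion cost (and any parse error) is avoided.
std::vector<double> DecodeProjected(
    const std::vector<std::string>& raw_fields,
    const std::vector<std::size_t>& projected_indices) {
  std::vector<double> decoded;
  decoded.reserve(projected_indices.size());
  for (const std::size_t index : projected_indices) {
    assert(index < raw_fields.size());
    decoded.push_back(std::stod(raw_fields[index]));
  }
  return decoded;
}
```

Because the unprojected fields are never touched, a row whose middle field is not numeric can still be decoded when only the first and last fields are projected.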

Example: creating a dataset, reading it, and building an Arrow Table
    // Create the dataset
    ASSERT_OK_AND_ASSIGN(auto dataset,
                         Dataset::Make({data_source}, discovery->Inspect()));

    // Scanner builder
    ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan());

    // Set the filter
    auto filter = ("year"_ == 2019 && "sales"_ > 100.0);
    ASSERT_OK(scanner_builder->Filter(filter));

    // Set the projection
    std::vector<std::string> columns{"item_id", "item_name", "sales"};
    ASSERT_OK(scanner_builder->Project(columns));

    // Create the scanner
    ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());

    // Read the data and build an Arrow Table (the files are actually read here)
    ASSERT_OK_AND_ASSIGN(auto table, scanner->ToTable());
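
Parallel reading of multiple files
• A read task is created for each file, and the tasks are executed in parallel on a thread pool
• With the Parquet format, a single file is read sequentially, one row group at a time
• One or more Arrow Record Batches are produced from each file, and at the end they are combined into an Arrow Table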
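
The task structure described on this slide (one task per file, sequential row groups within a file, batches collected at the end) can be sketched with the standard library. This is an illustration only: `RecordBatch`, `ScanFile` and `ScanAll` are hypothetical stand-ins, no file is actually read, and `std::async` starts one thread per task instead of using a thread pool as the slide describes.

```cpp
#include <cassert>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for an Arrow Record Batch: remembers its origin only.
struct RecordBatch {
  std::string source;
  int row_group;
};

// Per-file task: walks the row groups of one file sequentially and emits
// one batch per row group.
std::vector<RecordBatch> ScanFile(const std::string& path, int num_row_groups) {
  assert(num_row_groups >= 0);
  std::vector<RecordBatch> batches;
  for (int group = 0; group < num_row_groups; ++group) {
    batches.push_back({path, group});
  }
  return batches;
}

// Launches one task per file, then gathers the batches in file order.
std::vector<RecordBatch> ScanAll(const std::vector<std::string>& paths,
                                 int num_row_groups) {
  std::vector<std::future<std::vector<RecordBatch>>> tasks;
  for (const auto& path : paths) {
    tasks.push_back(
        std::async(std::launch::async, ScanFile, path, num_row_groups));
  }
  std::vector<RecordBatch> collected;
  for (auto& task : tasks) {
    const auto batches = task.get();
    collected.insert(collected.end(), batches.begin(), batches.end());
  }
  return collected;
}
```

The files are scanned concurrently, but because the futures are drained in the order the tasks were created, the collected batches keep a deterministic file order.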
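
Supporting various file formats
• Support for datasets split across multiple Parquet files is currently being put in place
• Common data storage formats such as AVRO, ORC, JSON and CSV will also be supported in the future
• I think pull requests adding support for formats other than Parquet are always welcome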

Support for various file systems
• Already supported
  • Local file system
  • HDFS
  • Amazon S3
  • Mock file system for testing
• To be supported in the future
  • Google Cloud Storage
  • Microsoft Azure BLOB Storage

Reading from RDBs
• There is also a plan to make RDB tables and query results usable as data sources
• The following systems are mentioned by name
  • SQLite3
  • PostgreSQL protocol (pgsql, Vertica, Redshift)
  • MySQL (and MemSQL)
  • Microsoft SQL Server (TDS)
  • HiveServer2 (Hive and Impala)
  • ClickHouse
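
Apache Arrow C++ Datasets
• With Apache Arrow C++ Datasets, data in various formats stored in various places can be read efficiently into a single Arrow Table
• What comes after building the Arrow Table?
  • We want to run further analytical queries
  • We want to do aggregation and statistical processing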

After building the Arrow Table
• We want to run analytical queries => Apache Arrow C++ Query Engine
• We want to do aggregation and statistical processing => Apache Arrow C++ Data Frame
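
Apache Arrow C++ Query Engine
• Provides the ability to run SQL-like queries, as well as the time-series and pivot operations commonly used in data analysis, against in-memory Arrow Record Batches
• It is not intended to replace a database; it is expected to be embedded in ordinary applications as a C++ shared library
• Development has not started yet, but it is under discussion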
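
Apache Arrow C++ Data Frame
• Provides, for in-memory Arrow Record Batches, the kind of data manipulation, analysis and aggregation features that so-called data frames offer
• Development has not started yet, but it is under discussion
• Will pandas2 perhaps be built with Arrow C++ Data Frame as its backend?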

Summary
• Datasets: accessing data stored in files and DBs becomes easy
• Query Engine: analytical queries against in-memory table data can be run easily
• Data Frame: in-memory table data can be used as a data frame