Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Apache Arrow C++ Datasets
Search
Sponsored
·
Your Podcast. Everywhere. Effortlessly.
Share. Educate. Inspire. Entertain. You do you. We'll handle the rest.
→
Kenta Murata
December 11, 2019
Technology
1.9k
4
Share
Embed
Copy iframe code
Copy JS code
Copy link
Start on current slide
Apache Arrow C++ Datasets
Introduce Apache Arrow C++ Datasets.
Presented Apache Arrow Tokyo Meetup 2019.
Kenta Murata
December 11, 2019
More Decks by Kenta Murata
See All by Kenta Murata
waitany と waitall を作った話
mrkn
0
320
HolidayJp.jl を作りました
mrkn
0
370
Calling Julia functions from Streamlit applications
mrkn
1
610
Red Data Tools で切り開く Ruby の未来
mrkn
3
1.3k
Method-based JIT compilation by transpiling to Julia
mrkn
0
9.1k
Reducing ActiveRecord memory consumption using Apache Arrow
mrkn
0
1.9k
RubyData and Rails
mrkn
0
3.4k
Tensor and Arrow
mrkn
0
1.1k
RubyData Current and Future
mrkn
1
3.8k
Other Decks in Technology
See All in Technology
LayerXにおけるセキュリティ管理の現在地と次の一手
tosho
0
240
AWS Security Agent といっしょに脅威モデリングをやってみよう
amarelo_n24
1
180
MUSUBI 田中裕一『AIと共に行う「しごとのリデザイン」- スモールバックオフィス編』AI Ops Lab #4
musubi
0
270
Kiroで書いた 設計書 が AI レビューの 採点基準 になる
ezaki
0
130
PostgreSQL 19 新機能概要 OSC Hokkaido 2026
nori_shinoda
0
140
IaC コードを資産へ:AWS CDK 社内ライブラリと横断展開 / aws-summit-japan-2026
gotok365
5
1.1k
秘密度ラベル初心者が第1歩でつまづかないための「設計・運用」ポイント
seafay
PRO
0
210
エラーバジェットのアラートのタイミングを考える.pdf
kairim0
0
170
Oracle Cloud Infrastructure:2026年6月度サービス・アップデート
oracle4engineer
PRO
0
130
徹底討論!ECS vs EKS!
daitak
0
230
20260619 私の日常業務での生成 AI 活用
masaruogura
1
230
AI駆動開発を通して感じた、 AI時代のデザイナーの役割変化
whisaiyo
4
2.3k
Featured
See All Featured
How Fast Is Fast Enough? [PerfNow 2025]
tammyeverts
3
610
The Spectacular Lies of Maps
axbom
PRO
1
820
Visualizing Your Data: Incorporating Mongo into Loggly Infrastructure
mongodb
49
10k
Applied NLP in the Age of Generative AI
inesmontani
PRO
4
2.3k
Tell your own story through comics
letsgokoyo
1
960
HU Berlin: Industrial-Strength Natural Language Processing with spaCy and Prodigy
inesmontani
PRO
0
410
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
141
35k
Are puppies a ranking factor?
jonoalderson
1
3.6k
What the history of the web can teach us about the future of AI
inesmontani
PRO
1
620
Leveraging Curiosity to Care for An Aging Population
cassininazir
1
270
The Art of Delivering Value - GDevCon NA Keynote
reverentgeek
16
2k
世界の人気アプリ100個を分析して見えたペイウォール設計の心得
akihiro_kokubo
PRO
71
40k
Transcript
Apache Arrow C++ Datasets Kenta Murata Speee, Inc. 2019.12.11 Apache
Arrow Tokyo Meetup 2019
Kenta Murata • Fulltime OSS developer at Speee, Inc. •
CRuby committer (as of 2010.02) • Apache Arrow committer (as of 2019.10) • The 24th place (44 commits) • SparseTensor in Arrow C++ • GLib and Ruby binding, etc.
Apache Arrow C++ ͷߏ Base Datasets Query Engine Data Frame
Apache Arrow C++ Datasets • 1ͭҎ্ͷσʔλιʔεΛ·ͱΊͯ1ͭͷσʔληοτͱ ͯ͠ѻ͏ͨΊͷ API Λఏڙ͢Δ •
༷ʑͳछྨͷσʔλϑΥʔϚοτͷҧ͍Λٵऩ͢Δ • ҟͳΔεΩʔϚͷσʔλιʔεΛ1ͭʹ౷߹Ͱ͖Δ • ෳछྨͷετϨʔδ͔ΒͷσʔλೖྗʹରԠͰ͖Δ • কདྷతʹϑΝΠϧͷॻ͖ग़͠ʹରԠ͢Δ༧ఆ
ෳͷσʔλιʔε͔Β1ͭͷςʔϒϧΛ࡞ΕΔ a.parquet b.parquet Query 1 Query 2 c.csv d.json Record
Batch 1 Record Batch 2 Amazon S3 Amazon Redshift Local File System In-Memory Arrow Table
ϑΝΠϧ͔ΒͷಡΈࠐΈ Discover Scan Filter & Project Collect
ϑΝΠϧ͔ΒͷಡΈࠐΈ • ϑΝΠϧΛεΩϟϯͯ͠ Record Batch Λ࡞Δ • ෳϑΝΠϧΛฒྻεΩϟϯͰ͖Δ • ϑΝΠϧγεςϜ্ͷσΟϨΫτϦ͔Βࢦఆͨ͠ϧʔϧʹج͍ͮͯϑΝΠϧΛൃݟ͢Δ
• ෳͷϑΝΠϧʹׂ͞ΕͨσʔλΛ࠶ߏ͢Δ • σʔλΛෳϑΝΠϧʹׂ͢Δͱ͖ͷεΩʔϚׂͷنଇʹैͬͯॲཧ͢Δ • ݅ࣜͰߦΛϑΟϧλϦϯάͰ͖Δ • ݁ՌΛ࡞ΔͨΊʹඞཁͳΧϥϜͷΈΛಡΈࠐΉ • ϩʔΧϧετϨʔδʹΩϟογϡΛ࡞Δ • ඞཁʹͳΔ·ͰϑΝΠϧΛಡΈࠐ·ͳ͍ (lazy scan)
ϑΝΠϧͷൃݟ • ϕʔεσΟϨΫτϦͷҐஔͱϑΝΠϧϑΥʔϚοτΛࢦఆ ͢ΔͱɺͦͷσΟϨΫτϦҎԼʹ͋ΔରϑΝΠϧΛ͢ ͯϦετΞοϓͯ͘͠ΕΔ • αϒσΟϨΫτϦΛ࠶ؼతʹ୳͢͜ͱՄೳ • ແࢹ͢ΔϑΝΠϧ໊ͷϓϨϑΟοΫεΛࢦఆͰ͖Δ •
ରϑΝΠϧΛͯ͢ಡΈࠐΉͨΊʹඞཁͳϚʔδࡁΈͷ εΩʔϚΛ࡞ͬͯ͘ΕΔ (༧ఆ)
ϑΝΠϧͷൃݟͷྫ /data/.metadata /data/2018/12/JP/Tokyo/001.parquet /data/2018/12/JP/Tokyo/002.parquet /data/2018/12/JP/Osaka/001.parquet /data/2018/12/US/CA/001.parquet /data/2019/01/JP/Tokyo/001.parquet /data/2019/01/JP/Osaka/001.parquet /data/2019/01/US/CA/001.parquet /data/2019/01/US/NY/001.parquet
/tmp/Tokyo.parquet ↓͜ΕΒͷϑΝΠϧ͚ͩϐοΫΞοϓ͍ͨ͠
ϑΝΠϧͷൃݟͷྫ using namespace arrow; using namespace arrow::dataset; fs::Selector selector; selector.base_dir
= “/data”; selector.recursive = true; std::shared_ptr<FileSystemDataSourceDiscovery> discovery; ARROW_OK_AND_ASSIGN( discovery, FileSystemDataSourceDiscovery::Make( fs, selector, std::make_shared<dataset::ParquetFileFormat>(), FileSystemDiscoveryOptions())); ARROW_OK_AND_ASSIGN(auto datasource, discovery->Finish());
σʔλׂͷنଇΛࢦఆ /data/2018 /data/2018/12 /data/2018/12/JP /data/2018/12/JP/Tokyo/001.parquet auto partition_scheme = schema({field(“year”, int32()),
field(“month”, int32()), field(“country”, utf8()), field(“city”, utf8())}); ASSERT_OK(discovery->SetPartitionScheme(partition_scheme)); ARROW_OK_AND_ASSIGN(auto datasource, discovery->Finish()); year month country city => {“year": 2018} => {“year”: 2018, “month”: 12} => {“year”: 2018, “month”: 12, “country”: “JP”} => {“year”: 2018, “month”: 12, “country”: “JP”, “city”: “Tokyo”}
ϑΟϧλϦϯά • ݅ࣜΛͬͯߦΛϑΟϧλϦϯάͰ͖Δ • year ͕ 2019 Ͱ sales ͕
100.0 ΑΓେ͖͍ߦ͚ͩΛऔΓ ग़͢߹࣍ͷࣜΛεΩϟφʹࢦఆ͢Δ “year”_ == 2019 && “sales”_ > 100.0 • εΩʔϚׂͷنଇʹैͬͯɺ݅ʹ߹க͠ͳ͍ϑΝΠϧ ͷಡΈࠐΈΛলུ͢Δ
औΓग़͢ΧϥϜͷࢦఆ • ͯ͢ͷΧϥϜΛಡΈࠐ·ͳͯ͘ྑ͍߹ɺϓϩδΣΫ γϣϯ (ࣹӨ) ػೳΛͬͯऔΓग़͢ΧϥϜΛ੍ݶͰ͖Δ • ͜ͷػೳͰಡΈࠐΉΧϥϜΛ੍ݶ͢ΔͱɺෆཁͳΧϥϜͷ σγϦΞϥΠζͱܕม͕লུ͞ΕͯɺϑΝΠϧϑΥʔ ϚοτʹΑͬͯσʔλͷಡΈग़͕͘͠ͳΔ
σʔληοτΛ࡞ͬͯಡΈࠐΜͰ Arrow Table Λ࡞Δ·Ͱͷྫ // σʔληοτͷ࡞ ASSERT_OK_AND_ASSIGN(auto dataset, Dataset::Make({data_source}, discovery->Inspect()));
// εΩϟφϏϧμ ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan()); // ϑΟϧλͷઃఆ auto filter = (“year”_ == 2019 && “sales”_ > 100.0); ASSERT_OK(scanner_builder->Filter(filter)); // ϓϩδΣΫγϣϯͷઃఆ std::vector<std::string> columns{“item_id”, “item_name”, “sales”}; ASSERT_OK(scanner_builder->Project(columns)); // εΩϟφੜ ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish(); // σʔλΛಡΈࠐΜͰ Arrow Table Λ࡞Δ (͜͜Ͱ࣮ࡍʹϑΝΠϧ͕ಡΈࠐ·ΕΔ) ASSERT_OK_AND_ASSIGN(auto table, scanner->ToTable());
ෳϑΝΠϧͷฒྻಡΈࠐΈ • ϑΝΠϧ୯ҐͰಡΈࠐΈλεΫ͕࡞ΒΕɺεϨουϓʔϧ ͰλεΫ͕ฒྻ࣮ߦ͞ΕΔ • Parquet ϑΥʔϚοτͰɺ1ͭͷϑΝΠϧߦάϧʔϓ ͝ͱʹγʔέϯγϟϧʹಡΈࠐ·ΕΔ • 1ͭͷϑΝΠϧ͔Β1ͭҎ্ͷ
Arrow Record Batch ͕ੜ ͞Εͯɺ࠷ޙʹ·ͱΊͯ Arrow Table ͕ੜ͞ΕΔ
༷ʑͳϑΝΠϧϑΥʔϚοτʹରԠ͢Δ • ݱࡏෳͷ Parquet ϑΝΠϧʹׂ͞Εͨσʔληο τͷରԠΛඋத • AVRO, ORC, JSON,
CSV ͳͲͷҰൠతͳσʔλอଘ༻ͷ ϑΥʔϚοτকདྷతʹରԠ͞ΕΔ • Parquet Ҏ֎ͷϑΥʔϚοτʹରԠ͢Δ Pull Request ৗʹ welcome ͩͱࢥ͏
༷ʑͳϑΝΠϧγεςϜͷରԠ • ରԠࡁΈͷͷ • ϩʔΧϧϑΝΠϧγεςϜ • HDFS • Amazon S3
• ςετ༻ͷϞοΫϑΝΠϧγεςϜ • কདྷతʹରԠ͍ͨ͠ͷ • Google Cloud Storage • Microsoft Azure BLOB Storage
RDB ͔ΒͷಡΈࠐΈ • RDB ͷςʔϒϧΫΤϦͷ݁ՌΛσʔλιʔεͱͯ͑͠ΔΑ͏ʹ͢Δ ܭը͋Δ • ࣍ͷγεςϜ໊ࢦ͠͞Ε͍ͯΔ • SQLite3
• PostgreSQL protocol (pgsql, Vertica, Redshift) • MySQL (and MemSQL) • Microsoft SQL Server (TDS) • HiveServer2 (Hive and Impala) • ClickHouse
Apache Arrow C++ Datasets • Apache Arrow C++ Datasets ͕͋Εɺ͍Ζ͍Ζͳॴ
ʹอଘ͞Ε͍ͯΔ͍Ζ͍ΖͳϑΥʔϚοτͷσʔλΛޮ Α͘ಡΈࠐΜͰ1ͭͷ Arrow Table ʹͰ͖Δ • Arrow Table Λ࡞ͬͨ͋ͱʁ • ͞Βʹੳ༻ͷΫΤϦΛ࣮ߦ͍ͨ͠ • ूܭ౷ܭॲཧΛ͍ͨ͠
Arrow Table Λ࡞ͬͨ͋ͱ • ੳ༻ͷΫΤϦΛ࣮ߦ͍ͨ͠ => Apache Arrow C++ Query
Engine • ूܭ౷ܭॲཧΛ͍ͨ͠ => Apache Arrow C++ Data Frame
Apache Arrow C++ Query Engine • ϝϞϦ্ͷ Arrow Record Batch
ʹରͯ͠SQL෩ͷΫΤ ϦɺσʔλੳͰΑ͘ར༻͞ΕΔ࣌ܥྻૢ࡞ pivot ૢ࡞ͳͲΛ࣮ߦ͢ΔػೳΛఏڙ͢Δ • σʔλϕʔεΛஔ͖͑Δ͜ͱҙਤͤͣɺC++ ͷڞ༗ϥ ΠϒϥϦͱͯ͠ҰൠͷΞϓϦέʔγϣϯʹຒΊࠐΜͰΘ ΕΔ͜ͱΛఆ͍ͯ͠Δ • ·ͩ։ൃ࢝·͍ͬͯͳ͍͕ٞ͞Ε͍ͯΔ
Apache Arrow C++ Data Frame • ϝϞϦ্ͷ Arrow Record Batch
ʹରͯ͠ɺ͍ΘΏΔ σʔλϑϨʔϜ͕උ͍͑ͯΔΑ͏ͳσʔλૢ࡞ɺੳɺू ܭͳͲͷػೳΛఏڙ͢Δ • ։ൃ·ͩ࢝·͍ͬͯͳ͍͕ٞ͞Ε͍ͯΔ • pandas2 Arrow C++ Data Frame ΛόοΫΤϯυͱ ͯ͠࡞ΕΒΕΔͷ͔ͳʁ
Datasets Query Engine Data Frame ϑΝΠϧDBʹอଘ͞Εͨσʔλ ͷΞΫηε͕؆୯ʹͳΔ ϝϞϦ্ͷςʔϒϧσʔλʹର͢Δ ੳΫΤϦ͕؆୯ʹ࣮ߦͰ͖Δ ϝϞϦ্ͷςʔϒϧσʔλΛσʔλ
ϑϨʔϜͱͯ͠ར༻Ͱ͖Δ