Apache Arrow C++ Datasets

Kenta Murata
December 11, 2019

An introduction to Apache Arrow C++ Datasets.

Presented at Apache Arrow Tokyo Meetup 2019.

Transcript

  1. Kenta Murata
    • Full-time OSS developer at Speee, Inc.
    • CRuby committer (since 2010.02)
    • Apache Arrow committer (since 2019.10)
    • 24th place (44 commits)
    • SparseTensor in Arrow C++
    • GLib and Ruby bindings, etc.
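  2. Apache Arrow C++ Datasets
    • Provides an API for treating one or more data sources as a single dataset
    • Absorbs the differences between various data formats
    • Can unify data sources with different schemas into one
    • Can also take input from multiple kinds of storage
    • Writing out to files is planned for the future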
  3. One table can be built from multiple data sources
    [Diagram: a.parquet, b.parquet, Query 1, Query 2, c.csv, d.json, Record Batch 1 and Record Batch 2, coming from Amazon S3, Amazon Redshift, the local file system and memory, combined into a single Arrow Table]
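  4. Reading from files
    • Scans files and produces Record Batches
    • Can scan multiple files in parallel
    • Discovers files under a file system directory according to specified rules
    • Reassembles data that has been split across multiple files
    • Follows the partition scheme that was used when the data was split into multiple files
    • Can filter rows with a condition expression
    • Reads only the columns needed to produce the result
    • Creates a cache on local storage
    • Does not read a file until it is needed (lazy scan)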
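The lazy scan point can be sketched in plain C++ without Arrow. This is only an illustration of the idea, not the Arrow implementation: `LazyScanTask`, `PlanScan` and the `Reader` callback are hypothetical names, and rows are represented as strings instead of Record Batches.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one unit of scan work (not an Arrow type).
// Constructing the task only records what to read; nothing is read
// until Execute() is called.
class LazyScanTask {
 public:
  using Reader = std::function<std::vector<std::string>(const std::string&)>;

  LazyScanTask(std::string path, Reader reader)
      : path_(std::move(path)), reader_(std::move(reader)) {
    assert(reader_ && "a reader callback is required");
  }

  // Performs the deferred read and returns the rows it produced.
  std::vector<std::string> Execute() const { return reader_(path_); }

  const std::string& path() const { return path_; }

 private:
  std::string path_;
  Reader reader_;
};

// Builds one task per file without touching any file contents.
std::vector<LazyScanTask> PlanScan(const std::vector<std::string>& paths,
                                   const LazyScanTask::Reader& reader) {
  std::vector<LazyScanTask> tasks;
  tasks.reserve(paths.size());
  for (const auto& path : paths) tasks.emplace_back(path, reader);
  return tasks;
}
```

Calling `PlanScan` costs nothing per file beyond storing the path. The reader callback runs only for the tasks whose `Execute()` is actually invoked, which is also what makes it possible to hand independent tasks to separate threads.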
  5. Example of file discovery
        using namespace arrow;
        using namespace arrow::dataset;

        // Select every file under /data, descending into subdirectories.
        fs::Selector selector;
        selector.base_dir = "/data";
        selector.recursive = true;

        // Discover Parquet files on the file system `fs`.
        std::shared_ptr<FileSystemDataSourceDiscovery> discovery;
        ARROW_OK_AND_ASSIGN(
            discovery,
            FileSystemDataSourceDiscovery::Make(
                fs, selector,
                std::make_shared<dataset::ParquetFileFormat>(),
                FileSystemDiscoveryOptions()));
        ARROW_OK_AND_ASSIGN(auto datasource, discovery->Finish());
  6. Specifying the data partitioning rule
    Each directory level under /data maps to one partition field, in the order year, month, country, city:
        /data/2018                          => {"year": 2018}
        /data/2018/12                       => {"year": 2018, "month": 12}
        /data/2018/12/JP                    => {"year": 2018, "month": 12, "country": "JP"}
        /data/2018/12/JP/Tokyo/001.parquet  => {"year": 2018, "month": 12, "country": "JP", "city": "Tokyo"}

        auto partition_scheme =
            schema({field("year", int32()),
                    field("month", int32()),
                    field("country", utf8()),
                    field("city", utf8())});
        ASSERT_OK(discovery->SetPartitionScheme(partition_scheme));
        ARROW_OK_AND_ASSIGN(auto datasource, discovery->Finish());
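The path-to-key mapping on this slide can be illustrated without Arrow. The following is a minimal standalone sketch, not how Arrow implements partition schemes: `ParsePartitionPath` and its parameters are hypothetical names, and values are kept as strings rather than typed as int32 or utf8 as in the slide's schema. It assumes files sit at the deepest partition level, as in the slide's example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of Arrow): assigns the directory segments
// that follow `base_dir` to the partition field names, in order.
// Segments beyond the field list (such as the file name) are ignored.
std::map<std::string, std::string> ParsePartitionPath(
    const std::string& base_dir, const std::string& path,
    const std::vector<std::string>& field_names) {
  assert(!base_dir.empty());
  std::map<std::string, std::string> keys;

  // Paths outside base_dir carry no partition information.
  if (path.compare(0, base_dir.size(), base_dir) != 0) return keys;
  // Reject look-alike prefixes such as "/database" for base_dir "/data".
  const bool at_boundary = base_dir.back() == '/' ||
                           path.size() == base_dir.size() ||
                           path[base_dir.size()] == '/';
  if (!at_boundary) return keys;

  std::istringstream rest(path.substr(base_dir.size()));
  std::string segment;
  std::size_t i = 0;
  while (i < field_names.size() && std::getline(rest, segment, '/')) {
    if (segment.empty()) continue;  // skip a leading or doubled '/'
    keys[field_names[i++]] = segment;
  }
  return keys;
}
```

With the field list `{"year", "month", "country", "city"}` and base directory `/data`, the path `/data/2018/12` yields two keys (year and month), and `/data/2018/12/JP/Tokyo/001.parquet` yields all four.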
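  7. Filtering
    • Rows can be filtered with a condition expression
    • To extract only the rows where year is 2019 and sales is greater than 100.0, pass the following expression to the scanner:
        "year"_ == 2019 && "sales"_ > 100.0
    • Following the partition scheme, files that cannot match the condition are not read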
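Skipping files based on the partition scheme can be sketched as follows. This is a simplified standalone illustration, not Arrow's expression machinery: `PartitionedFile` and `PruneByEquality` are hypothetical names, and only an equality test on one integer partition key is handled. A condition on a regular column, such as sales > 100.0, cannot be decided from the path and would still have to be evaluated on the rows that are read.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a discovered file together with the partition
// keys derived from its path (not an Arrow type).
struct PartitionedFile {
  std::string path;
  std::map<std::string, int> keys;  // e.g. {"year": 2018, "month": 12}
};

// Returns the paths of files that could contain rows where `field == value`.
// A file without that key is kept, because its path alone cannot rule it out.
std::vector<std::string> PruneByEquality(
    const std::vector<PartitionedFile>& files, const std::string& field,
    int value) {
  assert(!field.empty());
  std::vector<std::string> to_scan;
  for (const auto& file : files) {
    const auto it = file.keys.find(field);
    if (it != file.keys.end() && it->second != value) continue;  // pruned
    to_scan.push_back(file.path);
  }
  return to_scan;
}
```

For a filter on year == 2019, a file under a 2018 directory is dropped before it is ever opened, while files under 2019 directories and files with no year key remain in the scan list.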
  8. Example: creating a dataset, reading it, and building an Arrow Table
        // Create the dataset
        ASSERT_OK_AND_ASSIGN(auto dataset,
                             Dataset::Make({data_source}, discovery->Inspect()));

        // Scanner builder
        ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan());

        // Set the filter
        auto filter = ("year"_ == 2019 && "sales"_ > 100.0);
        ASSERT_OK(scanner_builder->Filter(filter));

        // Set the projection
        std::vector<std::string> columns{"item_id", "item_name", "sales"};
        ASSERT_OK(scanner_builder->Project(columns));

        // Create the scanner
        ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());

        // Read the data and build an Arrow Table (the files are actually read here)
        ASSERT_OK_AND_ASSIGN(auto table, scanner->ToTable());
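  9. Supporting various file formats
    • Support for datasets split across multiple Parquet files is currently being built out
    • Common data storage formats such as AVRO, ORC, JSON and CSV will be supported in the future
    • I think pull requests that add support for formats other than Parquet are always welcome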
  10. Supporting various file systems
    • Already supported
        • Local file system
        • HDFS
        • Amazon S3
        • Mock file system for testing
    • To be supported in the future
        • Google Cloud Storage
        • Microsoft Azure BLOB Storage
  11. Reading from RDBs
    • There are also plans to make RDB tables and query results usable as data sources
    • The following systems have been named explicitly
        • SQLite3
        • PostgreSQL protocol (pgsql, Vertica, Redshift)
        • MySQL (and MemSQL)
        • Microsoft SQL Server (TDS)
        • HiveServer2 (Hive and Impala)
        • ClickHouse
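  12. Apache Arrow C++ Datasets
    • With Apache Arrow C++ Datasets, data in various formats stored in various places can be read efficiently into a single Arrow Table
    • What comes after building the Arrow Table?
        • Running further analytical queries
        • Doing aggregation and statistical processing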
  13. After building the Arrow Table
    • Running analytical queries => Apache Arrow C++ Query Engine
    • Doing aggregation and statistical processing => Apache Arrow C++ Data Frame
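  14. Apache Arrow C++ Query Engine
    • Provides the ability to run SQL-like queries, as well as time series and pivot operations commonly used in data analysis, against in-memory Arrow Record Batches
    • Not intended to replace databases; it is expected to be embedded in ordinary applications as a C++ shared library
    • Development has not started yet, but it is under discussion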
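  15. Apache Arrow C++ Data Frame
    • Provides the data manipulation, analysis and aggregation features that typical data frames offer, operating on in-memory Arrow Record Batches
    • Development has not started yet, but it is under discussion
    • Might pandas2 be built with Arrow C++ Data Frame as its backend?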