Python and Apache Arrow
Sinhrks
December 08, 2018
@ Apache Arrow Tokyo Meetup 2018
https://speee.connpass.com/event/103514/
Transcript

Python and Apache Arrow
Masaaki Horikoshi @ ARISE analytics

Self-introduction
• Masaaki Horikoshi
• ARISE analytics, Inc.
• Data analysis and the like
• A member of core developers (currently easing back in)
• GitHub: https://github.com/sinhrks

Today's topics
• The Python data analysis ecosystem
• Challenges in data processing
• What we expect from Arrow
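
Current state of the Python data analysis ecosystem
[Figure: ecosystem map. Categories: Visualization, Big Data, IO, Machine Learning, Other Programming Languages, Data Handling. Packages: Bokeh, matplotlib, TensorFlow, PyTables, SQLAlchemy, Ibis, PySpark, pandas, rpy2, scikit-learn, NumPy, Dask]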
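
NumPy ndarray
• (As a rule) an array of a single type laid out in contiguous memory
• Metadata records how the physical representation should be interpreted
[Figure: the bytes 00000001 00000010 00000011 00000100 as the physical representation; "base" and "view" (Slice) as logical representations]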
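
The base/view relationship in the figure can be observed directly. This is a minimal sketch, assuming the four bytes from the slide are stored as a `uint8` array; the variable names are illustrative.

```python
import numpy as np

# The four bytes from the slide: 00000001 00000010 00000011 00000100
base = np.array([1, 2, 3, 4], dtype=np.uint8)

# Basic slicing builds new metadata (shape, strides) over the same buffer.
view = base[::2]

print(base.shape, base.strides)   # metadata of the original array
print(view.shape, view.strides)   # same bytes, different step
print(view.base is base)          # the view remembers which array owns the data

# Writing through the view changes the shared physical bytes.
view[0] = 9
print(base.tolist())
```

Only the metadata differs between `base` and `view`; no element is copied when the slice is taken.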
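
pandas DataFrame
• Developed by Wes McKinney, who is also an Arrow developer (from around 2010)
• Tabular data that can hold a different type in each column
• Internally, data is managed as column-oriented blocks
• Each block is backed by a NumPy ndarray
[Figure: a DataFrame with Index and Columns of mixed data types, stored as IntBlock, FloatBlock and ObjectBlock; columns may be consolidated per type]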
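
Dask DataFrame
• Runs pandas operations in parallel and distributed
• Schedules and executes the computation graph dynamically
[Figure: Blocked Algorithm. A Dask DataFrame is made of several pandas DataFrames; Sum runs on each one, and the partial results are combined with Concat and a final Sum]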
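
The blocked-algorithm idea can be sketched without Dask. The following is an illustration using only the standard library, not Dask's API: `blocked_sum` and `n_blocks` are hypothetical names, and with pure-Python work the threads show the task structure rather than a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def blocked_sum(values, n_blocks=4):
    """Sum `values` block by block, then combine the partial results."""
    size = max(1, -(-len(values) // n_blocks))  # ceiling division
    blocks = [values[i:i + size] for i in range(0, len(values), size)]
    # Each block is an independent task, so the partial sums can be scheduled freely.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(sum, blocks))
    # Combining the partial results is the final node of the task graph.
    return sum(partials)


data = list(range(1_000))
total = blocked_sum(data)
print(total)
```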
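
Challenges in data processing
• NumPy is a very high-quality numerical computing package
• It also works well as data storage
• Even so, use cases where it is not necessarily the best fit are starting to show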
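
Missing-value constraints
• NumPy:
  • No consistent missing value
  • Some data types (float, datetime, timedelta) have a value that stands in for missing (NaN, NaT)
  • Generally handled with a masked array
• pandas:
  • Inherits the limits of NumPy's missing-value support
  • Inserting a missing value triggers unintended type conversion (see the table)
  • Checking whether any value is missing requires scanning the values
NEP 12: Missing Data Functionality in NumPy https://www.numpy.org/neps/nep-0012-missing-data.html

Original       NA insertion result
int            float
float          float
bool           float
datetime       datetime
timedelta      timedelta
object         object
categorical    categorical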
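
The int-to-float row of the table, and the masked-array alternative, can be reproduced in a few lines. This is a small sketch; the data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])             # integer dtype, no missing values

# Reindexing to a label that does not exist inserts a missing value...
with_na = ints.reindex([0, 1, 2, 3])
# ...and the whole column is silently converted to float so it can hold NaN.
print(ints.dtype, "->", with_na.dtype)

# Finding out whether anything is missing means scanning the values.
print(with_na.isna().any())

# NumPy alternative: a masked array keeps the integer dtype and a separate mask.
masked = np.ma.masked_array([1, 2, 3], mask=[False, True, False])
print(masked.dtype, masked.sum())
```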
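
Copy and View
• It is hard to predict (completely) how NumPy and pandas will behave
• Defensive copies are needed to guard against unintended overwrites
There are basic rules, but the outcome can still be unexpected depending on the OS/CPU or the type
https://scipy-cookbook.readthedocs.io/items/ViewsVsCopies.html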
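
As one instance of the basic rules, NumPy returns a view for basic slicing and a copy for integer-array indexing. The sketch below checks this with `np.shares_memory` and shows a defensive copy; the array contents are arbitrary.

```python
import numpy as np

a = np.arange(6)
sliced = a[1:4]        # basic slicing: a view onto a's buffer
picked = a[[1, 2, 3]]  # integer-array (fancy) indexing: a fresh copy

print(np.shares_memory(a, sliced))
print(np.shares_memory(a, picked))

# Defensive copy: detach explicitly before modifying.
safe = a[1:4].copy()
safe[:] = -1
print(a.tolist())      # the original is untouched
```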

Object ndarray
• Slow
• The actual values may not sit in a contiguous physical region
• Computation runs on the CPython interpreter (GIL, method name resolution…)
• pandas handles strings as Object dtype (to support missing values)
[Figure: "ndarray Primitive" (PyObject_HEAD, data, nd, …) stores its values inline; "ndarray Object" (PyObject_HEAD, data, nd, …) stores PyObject pointers, each pointing to a separate PyObject (PyObject_HEAD, …)]
Why Python is Slow: Looking Under the Hood https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/
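
The difference between the two layouts in the figure is visible from Python. A minimal sketch, with arbitrary example values:

```python
import numpy as np

primitive = np.array([1, 2, 3], dtype=np.int64)
objects = np.array([1, "two", 3.0], dtype=object)

# A primitive array stores the values themselves, back to back.
print(primitive.itemsize)            # bytes per value
# An object array stores one pointer per element; the values live elsewhere.
print(objects.itemsize)              # bytes per pointer
print([type(x).__name__ for x in objects])

# Element-wise work on an object array dispatches to each element's own
# Python-level multiplication, so it runs inside the interpreter.
doubled = objects * 2
print(doubled.tolist())
```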
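
Expression evaluation
• Expressions are evaluated one operation at a time
• This adds overhead, such as allocating memory for intermediate results
Depending on the operation and the platform, NumPy optimizes this so that no temporary buffer has to be allocated
https://github.com/numpy/numpy/pull/7997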
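
The intermediate-result overhead can be avoided by hand with the `out=` argument of NumPy ufuncs. This is a sketch with made-up arrays, not a description of the optimization in the linked pull request.

```python
import numpy as np

a = np.arange(5, dtype=np.float64)
b = np.full(5, 2.0)
c = np.ones(5)

# Evaluated one operator at a time: `a * b` is computed first,
# then `+ c` produces the final array.
naive = a * b + c

# Reusing one output buffer avoids allocating a second array.
result = np.multiply(a, b)       # one allocation
np.add(result, c, out=result)    # in place, no new array

print(np.array_equal(naive, result))
```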
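
Expression evaluation
• Expressions are evaluated one operation at a time
• Writing efficient code is hard unless you can picture what the data and the operations look like inside
• Separate libraries exist for query optimization
  • NumExpr: Fast numerical array expression evaluator
  • Numba: NumPy aware dynamic Python compiler using LLVM
    df[df['Year'] >= 2018].groupby('Store').sum()['price']
Dask optimizes specific operations at the computation-graph level
http://docs.dask.org/en/latest/optimize.html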
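
Because pandas evaluates the slide's expression step by step, the order of operations is left to the author. The sketch below uses a made-up DataFrame (the `cost` column is an assumption added for illustration) to compare the slide's form with a hand-optimized one that selects the column before aggregating.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2017, 2018, 2018, 2019],
    "Store": ["A", "A", "B", "B"],
    "price": [100, 200, 300, 400],
    "cost": [10, 20, 30, 40],
})

# As on the slide: every remaining column is summed, then one is kept.
sum_then_select = df[df["Year"] >= 2018].groupby("Store").sum()["price"]

# Hand-optimized: select the column first so only `price` is aggregated.
select_then_sum = df[df["Year"] >= 2018].groupby("Store")["price"].sum()

print(select_then_sum.to_dict())
```

Both forms give the same result; a query optimizer could make this rewrite automatically, whereas eager evaluation leaves it to the user.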
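
Appending rows
• Slow
• For every column, memory is reallocated and the values are copied
• Appending one row at a time in a loop…
[Figure: two tables labeled Columns and Index, joined by +]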
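
The cost pattern above can be sketched as follows. The example uses `pd.concat` for the row-at-a-time version and made-up rows; collecting the rows first and building the frame once is the usual workaround.

```python
import pandas as pd

rows = [{"a": i, "b": i * 2} for i in range(5)]

# Row at a time: every concat allocates new column arrays and copies
# all rows accumulated so far.
slow = pd.DataFrame([rows[0]])
for row in rows[1:]:
    slow = pd.concat([slow, pd.DataFrame([row])], ignore_index=True)

# Collect first, build once: each column is allocated a single time.
fast = pd.DataFrame(rows)

print(slow.equals(fast))
```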
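
Applying a function to rows
    df.apply(lambda row: row['a'] + row['b'], axis=1)
• Slow
• Each row is sliced out and turned into a Series (a single column's worth of data)
• Every time a Series is created, memory is allocated and the type is inferred
• Getting a value out of the Series requires looking up the position from the label
• The type of the overall result is inferred from the per-row results
• The result is not deterministic
[Figure: a table labeled Columns and Index]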
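
For the expression on the slide, the same result is available column-wise, which avoids building one Series per row. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Row-wise: one temporary Series and two label lookups per row.
row_wise = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Column-wise: a single vectorized addition over whole columns.
column_wise = df["a"] + df["b"]

print(row_wise.equals(column_wise))
```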

Data IO
• External packages
• Slow when implemented in Python
• In-house implementations
• Painful to maintain

    magic = (b"\x00\x00\x00\x00\x00\x00\x00\x00" +
             b"\x00\x00\x00\x00\xc2\xea\x81\x60" +
             b"\xb3\x14\x11\xcf\xbd\x92\x08\x00" +
             b"\x09\xc7\x31\x8c\x18\x1f\x10\x11")

    class SASIndex(object):
        row_size_index = 0
        column_size_index = 1
        subheader_counts_index = 2
        column_text_index = 3
        column_name_index = 4
        …

    subheader_signature_to_index = {
        b"\xF7\xF7\xF7\xF7": SASIndex.row_size_index,
        b"\x00\x00\x00\x00\xF7\xF7\xF7\xF7": SASIndex.row_size_index,
        …

https://github.com/pandas-dev/pandas/tree/master/pandas/io/sas

Data IO
• The standard distribution (plus official subpackages) supports the following
https://pandas.pydata.org/pandas-docs/stable/io.html

Data IO (between packages)
• The Python data analysis ecosystem has grown up around NumPy
• Scikit-learn expects a NumPy ndarray as input
  • When a pandas DataFrame is passed in, it is converted to a NumPy ndarray internally
• NumPy defines the API (Array interface)
NumPy: The Array Interface https://docs.scipy.org/doc/numpy-1.15.1/reference/arrays.interface.html
[Figure: pandas DataFrame → converted to a NumPy ndarray through the Array Interface → NumPy ndarray → computation]
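
The conversion in the figure can be sketched with `np.asarray`, which is roughly what a NumPy-based library does with a DataFrame argument. The DataFrame contents are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})

# Convert the DataFrame the way a NumPy-based consumer would.
arr = np.asarray(df)
print(type(arr).__name__, arr.shape, arr.dtype)

# The array interface describes the buffer: shape, type string, data pointer.
info = arr.__array_interface__
print(info["shape"], info["typestr"])

# Column labels and the index are not part of the ndarray.
print(hasattr(arr, "columns"))
```

Anything the ndarray cannot express, such as labels, is lost at this boundary.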
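
Parallel processing
• Global Interpreter Lock (GIL)
• Multiple threads cannot run at the same time on the CPython interpreter
• Cython can release it explicitly
• Each package has to handle this operation by operation
• After the release, PyObject cannot be used
• Separate libraries exist for parallel and distributed processing
• Dask: A flexible library for parallel computing in Python
• Serialization switches between several mechanisms, such as pickle
Understanding the Python GIL http://www.dabeaz.com/GIL/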
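
Moving data between worker processes means serializing it. The sketch below shows a pickle round trip using only the standard library; `partition` is a made-up stand-in for a chunk of data, and this is not how Dask itself is invoked.

```python
import pickle

partition = {"Store": ["A", "B"], "price": [200, 700]}

# Sending data to another process means turning it into bytes...
payload = pickle.dumps(partition, protocol=pickle.HIGHEST_PROTOCOL)
# ...and rebuilding an independent copy on the other side.
received = pickle.loads(payload)

print(type(payload).__name__, len(payload))
print(received == partition, received is partition)
```

The receiver gets an equal but separate object, so every hand-off pays for a full copy of the data.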
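
Features we hope for from Arrow
• Consistent missing-value support
• Predictable behavior
• Query optimization (Gandiva?)
• Flexible data appending (Chunk?)
• An abstract representation of rows (RowSet?)
• Fast IO
• Data sharing in parallel processing (Plasma?)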
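
On the first item: Arrow's columnar format tracks missing entries with a validity bitmap kept next to the values, so the value type never has to change. The sketch below imitates that layout with plain NumPy arrays. It is not the PyArrow API, and the variable names are illustrative.

```python
import numpy as np

# Values and validity are stored separately; the slot for a missing
# entry holds an arbitrary placeholder (0 here).
values = np.array([1, 0, 3], dtype=np.int64)
valid = np.array([True, False, True])

# One bit per slot, least-significant bit first.
bitmap = np.packbits(valid, bitorder="little")

null_count = int((~valid).sum())        # no need to inspect `values`
total = int(values[valid].sum())        # aggregate over valid slots only

print(values.dtype, null_count, total, bin(bitmap[0]))
```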
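
The future Arrow brings
• Will data analysis packages become tools that process the representation Arrow defines?
• Will packages split into those that operate on Arrow data and those that handle IO with the outside world?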
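
What Python should work hard on
• That said, not all processing can be done in Arrow, so:
• Consistency with the Python standard library and de facto standard packages
• String processing, datetime processing and so on
• Object Array
• Depending on what you want to do, will Arrow's representation need to be processed in Python?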

ExtensionArray
• An interface for mapping an arbitrary class onto an internal representation that is easy to work with (such as Arrow)
• pandas is building out this API as ExtensionArray
Extension Arrays for pandas https://tomaugspurger.github.io/pandas-extension-arrays.html
[Figure: IP Address values ('192.168.1.1', '2001:0db8:85a3…') pass through Set/Get on an IP Address ExtensionArray, which stores them as hi: [0, …] and lo: [3232235777, …]]
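
The hi/lo storage in the figure can be sketched with the standard `ipaddress` module. This only illustrates the mapping, not the pandas ExtensionArray interface. `pack` and `unpack_one` are hypothetical helpers, and the full IPv6 address is an illustrative value because the slide truncates its own.

```python
import ipaddress

MASK64 = (1 << 64) - 1


def pack(addresses):
    """Split each address into high and low 64-bit integers."""
    hi, lo = [], []
    for text in addresses:
        value = int(ipaddress.ip_address(text))  # IPv4 fits in the low word
        hi.append(value >> 64)
        lo.append(value & MASK64)
    return hi, lo


def unpack_one(hi, lo):
    """Rebuild one address object from its two words.

    Simplification: values below 2**32 come back as IPv4, so an IPv6
    address that small would not round-trip.
    """
    return ipaddress.ip_address((hi << 64) | lo)


hi, lo = pack(["192.168.1.1", "2001:0db8:85a3::8a2e:0370:7334"])
print(hi[0], lo[0])
print(unpack_one(hi[0], lo[0]), unpack_one(hi[1], lo[1]))
```

Two integer columns are enough to hold any address, which is the kind of flat internal representation the interface is meant to allow.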

Arrow Integration
• pandas wraps PyArrow by way of ExtensionArray
• The test code already contains a Bool type implementation built on PyArrow
https://github.com/pandas-dev/pandas/blob/master/pandas/tests/extension/arrow/bool.py
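
GPU DataFrame
• As data sizes balloon, demand for analysis on GPUs is growing
• Processing needs to be dispatched to GPU or CPU depending on the characteristics of the data (column)
• Should each column be stored on the device that suits it best?
• Is bus transfer of metadata and indexers a problem?
[Figure: a DataFrame with Index and Columns whose blocks live on different devices: Int (CPU), Object (CPU), Int (GPU), Float (GPU)]
RAPIDS: GPU-Accelerated Data Analytics & Machine Learning https://developer.nvidia.com/rapids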
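
Summary
• Arrow looks likely to improve the challenges in data processing considerably
• Developing data processing packages is fun
• Beyond the core there is a wide range of work: strings, datetimes, IO, visualization and more
• Start with something easy
• Fixing documentation and error messages
• Areas that do not use internal APIs, such as visualization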