Python and Apache Arrow
Sinhrks
December 08, 2018
@ Apache Arrow Tokyo Meetup 2018
https://speee.connpass.com/event/103514/
Transcript
Python and Apache Arrow
Masaaki Horikoshi @ ARISE analytics
Self-introduction
• Masaaki Horikoshi
• ARISE analytics Inc.
• Data analysis and related work
• A member of core developers (currently getting back up to speed):
• GitHub: https://github.com/sinhrks
Today's topics
• The Python data analysis ecosystem
• Challenges in data processing
• What I expect from Arrow
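The current state of the Python data analysis ecosystem
[Figure: ecosystem map. Packages shown: Bokeh, matplotlib, TensorFlow, PyTables, SQLAlchemy, Ibis, PySpark, pandas, rpy2, Scikit-learn, NumPy, Dask. Category labels: Visualization, Big Data, IO, Machine Learning, Other Programming Languages, Data Handling.]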
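NumPy ndarray
• (As a rule) an array in contiguous memory holding a single type
• Metadata controls how the physical representation is interpreted
[Figure: the physical representation 00000001 00000010 00000011 00000100 and its logical representation; a "base" array and a "view" related by a slice.]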
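The base/view relationship on this slide can be sketched with a few lines of NumPy. This is only an illustration under the assumption that the figure shows four one-byte values; the variable names are made up for the example.

```python
import numpy as np

# Four one-byte values laid out contiguously: 00000001 00000010 00000011 00000100
base = np.array([1, 2, 3, 4], dtype=np.uint8)

# Slicing creates a view: new metadata (shape, strides), same buffer.
view = base[::2]

print(base.strides, view.strides)    # byte step between consecutive elements
print(view.base is base)             # the view keeps a reference to its base
print(np.shares_memory(base, view))  # no data was copied

# The same four bytes reinterpreted through different dtype metadata.
as_u16 = base.view(np.uint16)
print(as_u16.shape)

# Writing through the view changes the base array as well.
view[0] = 9
print(base.tolist())
```

Only the metadata differs between `base`, `view`, and `as_u16`; the underlying bytes are shared.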
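pandas DataFrame
• Developed by Wes McKinney, who is also an Arrow developer (since around 2010)
• Tabular data that can hold a different type in each column
• Internally, data is managed in column-oriented blocks
• Each block is backed by a NumPy ndarray
[Figure: a DataFrame with Columns and Index holding mixed data types, stored internally as IntBlock, FloatBlock, and ObjectBlock. Columns may be consolidated per type.]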
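Dask DataFrame
• Runs pandas operations in parallel and distributed fashion
• Dynamically schedules and executes a computation graph
Blocked Algorithm
[Figure: a Dask DataFrame made up of several pandas DataFrames; Sum is applied to each pandas DataFrame, and the partial results are combined with Concat and a final Sum.]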
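The blocked algorithm in the figure can be imitated with plain pandas. The sketch below is not Dask itself and does not use its scheduler; it only assumes that the figure means "sum each block, concatenate the partial sums, sum again". The block count and column names are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) * 2.0})

# Split the rows into blocks (Dask would call these partitions).
n_blocks = 3
bounds = np.linspace(0, len(df), n_blocks + 1).astype(int)
blocks = [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]

# Step 1: run the pandas operation on each block independently.
partial_sums = [block.sum() for block in blocks]

# Step 2: concatenate the partial results and reduce them once more.
blocked_total = pd.concat(partial_sums, axis=1).sum(axis=1)

print(blocked_total)
```

Because each block is processed independently in step 1, that step is the part a scheduler could run in parallel.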
Today's topics
• The Python data analysis ecosystem
• Challenges in data processing
• What I expect from Arrow
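Challenges in data processing
• NumPy is a very mature numerical computing package
• It works well as data storage
• However, use cases for which it is not necessarily the best fit are becoming visible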
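Limitations around missing values
• NumPy:
  • No consistent missing-value representation
  • Only some data types (float, datetime, timedelta) have a value that stands in for missing (NaN, NaT)
  • Generally handled with masked arrays
• pandas:
  • Inherits the limitations of NumPy's missing-value support
  • Inserting a missing value causes unintended type conversion (see table)
  • Checking whether anything is missing requires scanning the values
NEP 12: Missing Data Functionality in NumPy https://www.numpy.org/neps/nep-0012-missing-data.html
Original      NA insertion result
int           float
float         float
bool          float
datetime      datetime
timedelta     timedelta
object        object
categorical   categorical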
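A minimal sketch of the int and datetime rows of the table, assuming NumPy-backed pandas dtypes and using `reindex` as one convenient way to introduce a missing row (the slide does not say which operation was used):

```python
import numpy as np
import pandas as pd

# NumPy integers have no missing-value marker; a masked array adds one on the side.
masked = np.ma.masked_array([1, 2, 3], mask=[False, True, False])
print(masked.sum())

# pandas on NumPy-backed integers: introducing a missing row changes the dtype.
ints = pd.Series([1, 2, 3], dtype="int64")
with_na = ints.reindex([0, 1, 2, 3])   # label 3 does not exist, so it becomes NaN
print(ints.dtype, "->", with_na.dtype)

# datetime data has its own marker (NaT), so the dtype is preserved.
dates = pd.Series(pd.to_datetime(["2018-12-08", "2018-12-09"]))
dates_na = dates.reindex([0, 1, 2])
print(dates.dtype, "->", dates_na.dtype)

# Finding out whether anything is missing means scanning the values.
print(with_na.isna().any())
```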
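Copy and View
• It is hard to (fully) predict the behavior of NumPy and pandas
• Defensive copies are needed to guard against unintended overwrites of data
There are basic rules, but depending on the OS and CPU type the result can still be unexpected.
https://scipy-cookbook.readthedocs.io/items/ViewsVsCopies.html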
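As a small illustration of the basic NumPy rules (basic slicing gives a view, integer-array indexing gives a copy) and of a defensive copy; `scale_in_place` is a hypothetical helper invented for this example:

```python
import numpy as np

a = np.arange(6)

sliced = a[1:4]          # basic slicing: a view on the same buffer
picked = a[[1, 2, 3]]    # integer-array ("fancy") indexing: a new copy

print(np.shares_memory(a, sliced))
print(np.shares_memory(a, picked))

def scale_in_place(values):
    """Hypothetical helper that mutates its argument."""
    values *= 10
    return values

# Defensive copy: the caller's data survives the in-place operation.
safe = scale_in_place(a[1:4].copy())
untouched = a.tolist()
print(untouched)

# Without the copy, the write goes through the view into `a`.
scale_in_place(sliced)
print(a.tolist())
```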
Object ndarray
• Slow
• The actual values may not live in a contiguous physical region
• Computation runs on the CPython interpreter (GIL, method name resolution, …)
• pandas treats strings as Object dtype (to support missing values)
[Figure: ndarray (Primitive): PyObject_HEAD, data, nd, …. ndarray (Object): PyObject_HEAD, data, nd, …, whose data holds PyObject pointers, each pointing to a separate PyObject (PyObject_HEAD, …).]
Why Python is Slow: Looking Under the Hood https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/
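The difference between the two layouts in the figure can be observed from Python. This is a rough sketch; the array contents are arbitrary.

```python
import numpy as np

primitive = np.array([1, 2, 3], dtype=np.int64)
boxed = np.array([1, 2, 3], dtype=object)

# Primitive array: elements are raw 8-byte integers stored in one buffer.
print(primitive.itemsize, type(primitive[0]))

# Object array: the buffer holds pointers; each element is a full Python object.
print(boxed.itemsize, type(boxed[0]))

# Arithmetic on the object array is dispatched element by element to Python's int.
print(type((boxed + 1)[0]))

# An object array can mix strings with a missing-value marker such as None.
names = np.array(["a", None, "c"], dtype=object)
print([type(v).__name__ for v in names])
```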
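Expression evaluation
• Expressions are evaluated eagerly, one step at a time
• This incurs overhead such as allocating space for intermediate results
Depending on the operation and the platform, NumPy optimizes this so that no temporary allocation is needed.
https://github.com/numpy/numpy/pull/7997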
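To make the intermediate result visible, the sketch below compares the eager expression with a manual version that reuses one buffer through the `out=` argument. It illustrates the overhead described above and is not a reproduction of the optimization in the linked pull request.

```python
import numpy as np

a = np.ones(1_000)
b = np.full(1_000, 2.0)
c = np.full(1_000, 3.0)

# Eager evaluation: (a + b) is materialised first, then c is added to it.
naive = a + b + c

# Manual alternative: reuse one preallocated buffer for every step.
out = np.empty_like(a)
np.add(a, b, out=out)
np.add(out, c, out=out)

print(np.array_equal(naive, out))
```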
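Expression evaluation
• Expressions are evaluated eagerly, one step at a time
• Writing efficient code is hard unless you can picture what the data and the operations look like inside
• Separate libraries exist for query optimization
  • NumExpr: Fast numerical array expression evaluator
  • Numba: NumPy aware dynamic Python compiler using LLVM
df[df['Year'] >= 2018].groupby('Store').sum()['price']
Dask optimizes certain operations at the computation-graph level.
http://docs.dask.org/en/latest/optimize.html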
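One way to read the expression on this slide: because pandas evaluates eagerly, it sums every numeric column before `'price'` is selected. A hand-optimized version selects the column first. The data below, including the extra `cost` column, is invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Year":  [2017, 2018, 2018, 2019],
    "Store": ["A", "A", "B", "B"],
    "price": [100, 200, 300, 400],
    "cost":  [60, 120, 180, 240],
})

# As written on the slide: sum every numeric column, then keep only 'price'.
as_written = df[df["Year"] >= 2018].groupby("Store").sum()["price"]

# Hand-optimised: select 'price' before aggregating, so 'cost' and 'Year' are never summed.
optimised = df[df["Year"] >= 2018].groupby("Store")["price"].sum()

print(optimised)
```

A query optimizer could perform this reordering automatically; with eager evaluation it is left to the author of the code.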
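Appending rows
• Slow
• For every column, memory is reallocated and the values are copied
• If you append rows one at a time in a loop…
[Figure: a DataFrame (Columns, Index) plus a new row (Index, Columns).]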
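A sketch of the pattern described above and a common alternative: collect the rows as plain Python objects and build the DataFrame once. The row contents are arbitrary.

```python
import pandas as pd

rows = [{"a": i, "b": i * 0.5} for i in range(50)]

# Anti-pattern: every concat allocates new columns and copies all previous rows.
grown = pd.DataFrame(rows[:1])
for row in rows[1:]:
    grown = pd.concat([grown, pd.DataFrame([row])], ignore_index=True)

# Alternative: accumulate plain Python objects, then build the DataFrame once.
built_once = pd.DataFrame(rows)

print(grown.shape, built_once.shape)
```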
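Applying a function to rows
df.apply(lambda row: row['a'] + row['b'], axis=1)
• Slow
• Each row is sliced out to create a Series (one-dimensional data)
• Every Series creation involves memory allocation and type inference
• Getting a value out of the Series requires looking up its position from the label
• The dtype of the overall result is inferred from the per-row results
• The result is non-deterministic
[Figure: a DataFrame (Columns, Index) processed one row at a time.]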
Data IO
• External packages
  • Slow when implemented in Python
• In-house implementations
  • Painful to maintain
Example from https://github.com/pandas-dev/pandas/tree/master/pandas/io/sas
    magic = (b"\x00\x00\x00\x00\x00\x00\x00\x00" +
             b"\x00\x00\x00\x00\xc2\xea\x81\x60" +
             b"\xb3\x14\x11\xcf\xbd\x92\x08\x00" +
             b"\x09\xc7\x31\x8c\x18\x1f\x10\x11")

    class SASIndex(object):
        row_size_index = 0
        column_size_index = 1
        subheader_counts_index = 2
        column_text_index = 3
        column_name_index = 4
        …

    subheader_signature_to_index = {
        b"\xF7\xF7\xF7\xF7": SASIndex.row_size_index,
        b"\x00\x00\x00\x00\xF7\xF7\xF7\xF7": SASIndex.row_size_index,
        …
Data IO
• pandas itself (plus official subpackages) supports the formats listed at:
https://pandas.pydata.org/pandas-docs/stable/io.html
Data IO (between packages)
• The Python data analysis ecosystem has grown up around NumPy
• Scikit-learn expects a NumPy ndarray as input
  • When a pandas DataFrame is passed in, it is converted internally to a NumPy ndarray
• NumPy defines the API (Array interface)
NumPy: The Array Interface https://docs.scipy.org/doc/numpy-1.15.1/reference/arrays.interface.html
[Figure: pandas DataFrame → converted to a NumPy ndarray via the Array Interface → NumPy ndarray → computation.]
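A rough sketch of the hand-off between packages. `Wrapper` is a hypothetical container written for this example; it exposes its buffer through `__array_interface__`, which NumPy understands. As far as I know, a DataFrame is converted through the closely related `__array__` hook rather than `__array_interface__` itself, so the last lines only show the observable result of the conversion.

```python
import numpy as np
import pandas as pd

class Wrapper:
    """Hypothetical container that exposes its buffer via the array interface."""

    def __init__(self, values):
        self._values = np.asarray(values, dtype=np.float64)

    @property
    def __array_interface__(self):
        # Reuse the description NumPy generates for the underlying buffer.
        return self._values.__array_interface__

wrapped = Wrapper([1.0, 2.0, 3.0])
converted = np.asarray(wrapped)          # what a NumPy-based consumer would do
print(converted.dtype, converted.shape)
print(sorted(wrapped.__array_interface__))

# A DataFrame becomes a 2-D ndarray when it is handed to NumPy-based code.
df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
as_array = np.asarray(df)
print(type(as_array).__name__, as_array.shape)
```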
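Parallel processing
• Global Interpreter Lock (GIL)
  • Multiple threads cannot run simultaneously on the CPython interpreter
  • Cython can release it explicitly
    • Each package has to handle this operation by operation
    • PyObject cannot be touched after the release
• Separate libraries exist for parallel and distributed processing
  • Dask: A flexible library for parallel computing in Python
  • For serialization, several methods such as pickle are used depending on the case
Understanding the Python GIL http://www.dabeaz.com/GIL/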
Today's topics
• The Python data analysis ecosystem
• Challenges in data processing
• What I expect from Arrow
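Features I expect from Arrow
• Consistent support for missing values
• Predictable behavior
• Query optimization (Gandiva?)
• Flexible data appends (Chunk?)
• An abstract representation of rows (RowSet?)
• Fast IO
• Data sharing in parallel processing (Plasma?)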
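The first item can be illustrated without Arrow itself. The sketch below uses plain NumPy to mimic the general idea of keeping values and a separate validity mask, which is how Arrow-style formats avoid changing the value type when data is missing. It does not use PyArrow, and the array contents are invented.

```python
import numpy as np

# Integer values plus a separate validity mask (True = value present).
values = np.array([10, 20, 0, 40], dtype=np.int64)   # slot 2 holds a placeholder
valid = np.array([True, True, False, True])

# The dtype stays int64 even though one entry is missing.
print(values.dtype)

# Counting missing entries only touches the mask, not the values.
null_count = int((~valid).sum())
print(null_count)

# Reductions skip invalid slots by consulting the mask.
total = int(values[valid].sum())
print(total)

# The mask packed to one bit per value, least-significant bit first.
bitmap = np.packbits(valid, bitorder="little")
print(bitmap.tolist())
```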
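The future Arrow may bring
• Will data analysis packages become things that process the representation defined by Arrow?
• Will packages split into those that operate on Arrow data and those that handle IO with the outside world?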
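What still has to be done on the Python side
• Not everything can be done in Arrow, so:
  • Consistency with the Python standard library and de facto standard packages
    • String processing, date and time processing, and so on
  • Object Array
• Depending on what you want to do, will Arrow's representation need to be processed in Python?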
ExtensionArray
• An interface for mapping an arbitrary class onto an internal representation that is easy to work with (such as Arrow)
• pandas is building out this API as ExtensionArray
Extension Arrays for pandas https://tomaugspurger.github.io/pandas-extension-arrays.html
[Figure: IP Address values ('192.168.1.1', '2001:0db8:85a3…') pass through Set into an ExtensionArray stored as hi: [0, …], lo: [3232235777, …], and come back out through Get as IP Address values.]
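The Set/Get mapping in the figure can be sketched with the standard `ipaddress` module and two `uint64` arrays. This shows only the storage idea, not the pandas ExtensionArray interface. `to_hi_lo` and `from_hi_lo` are hypothetical names, the IPv6 address is an arbitrary example, and treating `hi == 0` as IPv4 is a simplification.

```python
import ipaddress
import numpy as np

MASK64 = (1 << 64) - 1

def to_hi_lo(addresses):
    """Set: split each address into the high and low 64 bits of a 128-bit integer."""
    ints = [int(ipaddress.ip_address(a)) for a in addresses]
    hi = np.array([v >> 64 for v in ints], dtype=np.uint64)
    lo = np.array([v & MASK64 for v in ints], dtype=np.uint64)
    return hi, lo

def from_hi_lo(hi, lo):
    """Get: rebuild address objects (simplification: small values are read as IPv4)."""
    out = []
    for h, l in zip(hi.tolist(), lo.tolist()):
        value = (h << 64) | l
        if h == 0 and value < 2**32:
            out.append(ipaddress.IPv4Address(value))
        else:
            out.append(ipaddress.IPv6Address(value))
    return out

hi, lo = to_hi_lo(["192.168.1.1", "2001:db8::1"])
print(hi[0], lo[0])
print([str(a) for a in from_hi_lo(hi, lo)])
```

Storing the addresses as two fixed-width integer columns keeps them in contiguous memory instead of an object array of Python instances.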
Arrow Integration
• pandas wraps PyArrow by way of ExtensionArray
• The test code already contains a Bool type implemented with PyArrow
https://github.com/pandas-dev/pandas/blob/master/pandas/tests/extension/arrow/bool.py
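GPU DataFrame
• As data sizes balloon, the need for analysis on GPUs is growing
• Processing needs to be dispatched to GPU or CPU according to the characteristics of the data (columns)
  • Should each column be stored on the device that suits it best?
  • Will bus transfer of metadata and indexers become a bottleneck?
[Figure: a DataFrame (Columns, Index) whose blocks live on different devices: Int (CPU), Object (CPU), Int (GPU), Float (GPU).]
RAPIDS: GPU-Accelerated Data Analytics & Machine Learning https://developer.nvidia.com/rapids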
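Summary
• Arrow looks likely to greatly improve the challenges in data processing
• Developing data processing packages is fun
  • Beyond the core, there is a wide range of work: strings, dates and times, IO, visualization, and more
• Start with something simple
  • Documentation and error message fixes
  • Areas such as visualization that do not use internal APIs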