Distributed Processing with Spark / 2015-01-16 PyData.Tokyo#3
shunsukeaihara
January 17, 2015
Transcript
Distributed Processing with Spark (and Distributed Processing in Python)
Gunosy Inc. Shunsuke Aihara
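About me
• Shunsuke Aihara (http://argmax.jp) @shunsukeaihara
• Manager at Gunosy
• Responsible for overall development of the ad delivery system and for R&D
• Specialty: computational linguistics
• Fond of Python and asynchronous distributed systems
• Has written various libraries for image processing and audio signal processing
• https://bitbucket.org/aihara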
Agenda
• Spark overview
• History of distributed processing (and Spark)
• Spark use cases at Gunosy
• The distributed processing ecosystem in Python
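About Spark (1)
• An in-memory distributed processing system that integrates with the Hadoop ecosystem (HDFS, Mesos, YARN)
• A distributed programming environment built on Resilient Distributed Datasets (RDDs), a fault-tolerant distributed data structure
• Parallel computations on an RDD are written as chains of higher-order functions and run from Scala or Python
• RDDs are an immutable data structure
• RDD elements are distributed and replicated across the memory of the cluster
• Corrupted or lost data is restored from the persisted source data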
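About Spark (2)
• The following are implemented on top of the RDD distributed processing foundation
• Data stream processing (Spark Streaming)
• Distributed SQL (Spark SQL)
• Distributed machine learning library (MLlib)
• Distributed graph processing library (GraphX)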
History of distributed processing (and Spark)
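Key points of large-scale distributed data processing
• Cluster management
• Automated distributed placement of data
• Speedup through data replication and parallel reads
• Computation that preserves data locality
• Fault tolerance / retransmission and recomputation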
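The road to Hadoop
• Implementing complex parallel processing yourself with message passing is hard work
• Skeletal parallel programming (Cole, 1989)
• A functional programming framework, with several implementations, that builds a wide range of parallel computations in a structured way by combining frequently occurring parallel computation patterns
• Data-parallel skeletons (map, fold/reduce, filter, zip…)
• Computation patterns that apply the same operation to different parts of the data at the same time
• Task-parallel skeletons (pipe, farm…)
• Patterns that take a data stream and return a data stream with each computation applied to it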
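The two families of skeletons can be sketched on a single machine in plain Python. This is only an illustration of the structure, not how Hadoop or Spark implement it: the helper names (`split`, `par_map`, `par_reduce`, `pipe`) are hypothetical, threads stand in for cluster workers, and `par_reduce` assumes the operator is associative with `zero` as its identity.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split(data, n):
    """Split a list into n roughly equal chunks (hypothetical helper)."""
    k, r = divmod(len(data), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def par_map(f, chunks):
    """Data-parallel map skeleton: apply f to every element of every chunk."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda chunk: [f(x) for x in chunk], chunks))


def par_reduce(op, chunks, zero):
    """Data-parallel reduce skeleton: reduce each chunk, then combine the partials."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda chunk: reduce(op, chunk, zero), chunks))
    return reduce(op, partials, zero)


def pipe(*stages):
    """Task-parallel pipe skeleton: chain stages that each map a stream to a stream."""
    def run(stream):
        for stage in stages:
            stream = stage(stream)
        return stream
    return run


chunks = split(list(range(1, 11)), 3)
squares = par_map(lambda x: x * x, chunks)
total = par_reduce(lambda a, b: a + b, squares, 0)

evens_doubled = pipe(
    lambda s: (x for x in s if x % 2 == 0),
    lambda s: (x * 2 for x in s),
)
result = list(evens_doubled(range(10)))
```

Here `total` is the sum of the squares of 1 to 10, computed chunk by chunk, and `result` is the stream of even numbers below 10, doubled. The pipe stages are lazy generators, so elements flow through the stages one at a time rather than stage by stage.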
Skeletal parallel programming
Zhenjiang Hu and Hideya Iwasaki, "Skeletal Parallel Programming," Joho Shori (IPSJ Magazine), Vol. …, No. …, pp. …
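Distributed processing before Hadoop
• Implemented with MPI or grid shells
• Data placement managed by hand
• Assumes you place the data yourself in shared memory or on a shared file system
• Placing huge data sets is very tedious
• Fault tolerance guaranteed by a custom implementation
• Handling data that does not fit in memory is difficult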
T-shirts message@WOMPAT2001 “Life is too short for MPI.”
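What Hadoop solved
• Specialized for large-scale data rather than scientific computing
• Automatically manages the placement of huge data sets and the execution of processing
• Automatic distributed placement on HDFS, with map processing run where the data is placed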
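New needs after Hadoop
• Hadoop / Hive are throughput-oriented batch systems
• Data scientists need interactive analysis and real-time processing
• Submitting a job and then waiting hours is painful
• In Hadoop and Hive, disk I/O for intermediate data becomes the bottleneck, as the price of high reliability
• Memory capacity per server has grown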
Products after Hadoop
• In-memory acceleration of Hive
• Real-time stream data processing
• Fast aggregation across multiple data sources / DBs
• Low latency through optimized task execution
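Spark
• A general-purpose distributed programming environment
• A skeletal parallel programming environment built on RDDs
• Achieves low-latency distributed computation by using in-memory RDDs
• Data that does not fit in memory is stored on disk
• Machine learning and stream data processing are built by composing operations on RDDs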
Basic operations on RDDs
• The higher-order functions of Scala's Seq, plus some extras, run in a distributed fashion
• map, flatMap, filter, sort, union, zip
• reduce, fold, reduceByKey, groupBy, groupByKey, count, cogroup, cross
• join, leftOuterJoin, rightOuterJoin
• sample, take, first, partitionBy, mapWith, pipe, save
• etc.
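To make the semantics of a few of these operations concrete, here is a single-machine sketch in plain Python. It mimics what flatMap, map, reduceByKey, and join compute on key/value data; it is not Spark code, and `reduce_by_key` and `join` are hypothetical helpers written for this illustration.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, op):
    """Group (key, value) pairs by key and fold each group's values with op."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(op, values) for key, values in groups.items()}


def join(left, right):
    """Inner join of two lists of (key, value) pairs."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(k, (lv, rv)) for k, lv in left for rv in index.get(k, [])]


lines = ["spark rdd spark", "rdd python"]
words = [w for line in lines for w in line.split()]        # flatMap
pairs = [(w, 1) for w in words]                            # map
counts = reduce_by_key(pairs, lambda a, b: a + b)          # reduceByKey
joined = join([("spark", "scala")], list(counts.items()))  # join
```

In Spark the same chain would run partition by partition across the cluster, with the grouping step implemented as a shuffle; the result of each step is the same as in this sequential version.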
RDD data locality
• Where and in what order tasks run is managed as an optimized DAG, based on where the data source is placed
[Diagram labels: HDFS, RDD, map (×4), Reduce]
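RDD fault tolerance
• Each RDD element records the path by which it was generated
• When it is needed again, it is regenerated from the data source
[Diagram labels: HDFS, map, map, ✕ corrupted; HDFS, map, map, map]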
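The idea of recording how data was derived and recomputing it on loss can be sketched with a toy class. `ToyRDD` is a hypothetical name, a Python list stands in for the persisted source (HDFS in the slide), and clearing the `cache` attribute simulates losing the in-memory copy. Spark tracks lineage per partition; this sketch tracks it for the whole dataset to keep the example short.

```python
class ToyRDD:
    """A toy dataset that remembers how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, func=None):
        self.source = source  # persisted source data (stands in for HDFS)
        self.parent = parent  # the dataset this one was derived from
        self.func = func      # the function applied to the parent
        self.cache = None     # in-memory copy, which may be lost

    def map(self, func):
        # Nothing is computed here; only the derivation is recorded.
        return ToyRDD(parent=self, func=func)

    def collect(self):
        if self.cache is None:  # missing or lost: rebuild from the lineage
            if self.parent is None:
                self.cache = list(self.source)
            else:
                self.cache = [self.func(x) for x in self.parent.collect()]
        return self.cache


base = ToyRDD(source=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
plus_one = doubled.map(lambda x: x + 1)

first = plus_one.collect()
# Simulate losing every in-memory copy, for example after a node failure.
base.cache = doubled.cache = plus_one.cache = None
recovered = plus_one.collect()
```

Because each dataset is immutable and keeps only its parent and the function applied to it, the second `collect()` walks back to the source and replays the two `map` steps, giving the same values as before the loss.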
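About Spark (2)
• The following are implemented on top of the RDD distributed processing foundation
• Data stream processing (Spark Streaming)
• Distributed SQL (Spark SQL)
• Distributed machine learning library (MLlib)
• Distributed graph processing library (GraphX)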
PySpark + IPython Notebook
• PySpark can be run from IPython
• On AWS, a cluster can be built with a single command
• Run Spark on EMR (with YARN support)
• http://qiita.com/shunsukeaihara/items/1524b66579e91d1cf7cf
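Spark use cases at Gunosy
• Periodic batch jobs are handled with fluentd -> Redshift
• Ad hoc log analysis goes through Fluentd -> S3 -> Spark
• Large numbers of files on S3 can be processed easily
[Diagram labels: API server, Spark on AWS EMR, Redshift Cluster]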
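Spark use cases at Gunosy (1)
• Finding the credentials that are in use from CloudTrail logs so they can be revoked, and similar tasks
• Load a large number of JSON files and run HiveQL over them

    import pyspark.sql

    data = sc.textFile("s3://bucket_name/path/*.gz")
    hive = pyspark.sql.HiveContext(sc)
    ht = hive.jsonRDD(data)
    ht.registerTempTable("traills")
    hive.cacheTable("traills")
    hive.sql("SELECT DISTINCT record.userIdentity.accessKeyId "
             "FROM traills LATERAL VIEW explode(Records) s AS record")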
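For readers unfamiliar with LATERAL VIEW explode, the following plain-Python sketch computes the same kind of result on a single machine: it flattens the `Records` array of each JSON document and collects the distinct `accessKeyId` values. The function name and the sample documents are made up for illustration, and unlike the SQL query it skips records that have no key instead of returning a NULL row.

```python
import json


def distinct_access_keys(json_lines):
    """Collect the distinct accessKeyId values under Records[].userIdentity."""
    keys = set()
    for line in json_lines:
        doc = json.loads(line)
        for record in doc.get("Records", []):  # explode(Records)
            key = record.get("userIdentity", {}).get("accessKeyId")
            if key is not None:
                keys.add(key)                  # DISTINCT
    return keys


# Made-up sample documents; the key IDs are placeholders, not real credentials.
sample = [
    json.dumps({"Records": [
        {"userIdentity": {"accessKeyId": "KEY_A"}},
        {"userIdentity": {"accessKeyId": "KEY_B"}},
    ]}),
    json.dumps({"Records": [
        {"userIdentity": {"accessKeyId": "KEY_A"}},
        {"userIdentity": {}},
    ]}),
]
keys = distinct_access_keys(sample)
```

The Spark version earns its keep when the input is thousands of compressed files on S3: the schema is inferred from the JSON and the explode and distinct steps run in parallel across the cluster.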
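Spark use cases at Gunosy (2)
• Gender classification from users' article logs
• For each user, the list of clicked article IDs is saved to S3 as CSV
• Weighted with TF-IDF

    from pyspark import SparkContext
    from pyspark.mllib.feature import HashingTF, IDF

    sc = SparkContext()
    male = sc.textFile("s3://bucket/path/male_*.gz")
    female = sc.textFile("s3://bucket/path/female_*.gz")
    tf = HashingTF(numFeatures=…)
    male = male.map(lambda x: tf.transform(x.split(",")))
    female = female.map(lambda x: tf.transform(x.split(",")))
    idf = IDF()
    idf_model = idf.fit(male.union(female))
    male = idf_model.transform(male)
    female = idf_model.transform(female)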
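What the hashing and weighting steps compute can be sketched in plain Python. This is a minimal illustration under assumptions, not MLlib's implementation: `NUM_FEATURES` is a made-up small value, article IDs are treated as integers so that hashing is deterministic, and the IDF uses a smoothed form, log((n + 1) / (df + 1)).

```python
import math
from collections import Counter

NUM_FEATURES = 16  # hypothetical size, kept small for the example


def hashing_tf(terms, num_features=NUM_FEATURES):
    """Map each term to a bucket by hashing and count occurrences per bucket."""
    return Counter(hash(t) % num_features for t in terms)


def idf_weights(tf_vectors):
    """Smoothed inverse document frequency per bucket: log((n + 1) / (df + 1))."""
    n = len(tf_vectors)
    df = Counter(bucket for vec in tf_vectors for bucket in vec)
    return {b: math.log((n + 1) / (count + 1)) for b, count in df.items()}


def tf_idf(vec, idf):
    """Scale each term frequency by the IDF weight of its bucket."""
    return {bucket: tf * idf[bucket] for bucket, tf in vec.items()}


docs = [[1, 2, 2], [2, 3]]  # clicked article IDs per user (made-up data)
tfs = [hashing_tf(d) for d in docs]
idf = idf_weights(tfs)
weighted = [tf_idf(v, idf) for v in tfs]
```

Article 2 is clicked by both users, so its weight drops to zero, while articles clicked by only one user keep a positive weight. That is the point of the weighting for this use case: articles everyone reads say little about gender.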
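Spark use cases at Gunosy (2)
• Gender classification from users' article logs
• Convert to LabeledPoint, then train and classify with logistic regression

    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithSGD

    male = male.map(lambda x: LabeledPoint(0, x))      # label values 0/1 are assumed
    female = female.map(lambda x: LabeledPoint(1, x))
    training = male.union(female)
    training.cache()
    model = LogisticRegressionWithSGD.train(training)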
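As a rough picture of what training a logistic regression model with SGD involves, here is a plain-Python sketch on sparse feature dictionaries. It is a simplified per-example SGD without a bias term or regularization, so it should not be read as MLlib's algorithm; the function names, hyperparameters, and training data are all made up for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_sgd(points, num_features, epochs=200, step=0.5, seed=0):
    """Train on a list of (label, {index: value}) pairs, with label 0 or 1."""
    rng = random.Random(seed)
    weights = [0.0] * num_features
    points = list(points)
    for _ in range(epochs):
        rng.shuffle(points)
        for label, features in points:
            z = sum(weights[i] * v for i, v in features.items())
            error = sigmoid(z) - label  # gradient of the log loss w.r.t. z
            for i, v in features.items():
                weights[i] -= step * error * v
    return weights


def predict(weights, features):
    """Return 1 if the linear score is positive, else 0."""
    z = sum(weights[i] * v for i, v in features.items())
    return 1 if z > 0 else 0


# Made-up data: features 0 and 1 occur only with label 0, features 2 and 3 only with label 1.
training = [
    (0, {0: 1.0, 1: 1.0}),
    (0, {0: 1.0}),
    (1, {2: 1.0, 3: 1.0}),
    (1, {3: 1.0}),
]
weights = train_logistic_sgd(training, num_features=4)
```

After training, `predict(weights, {0: 1.0})` gives 0 and `predict(weights, {3: 1.0})` gives 1, since the weights of features seen only with one label move steadily toward that label.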
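Spark use cases at Gunosy (2)
• Gender classification from users' article logs
• Predict from lists whose first element is the user ID, followed by the article IDs

    def parse(x):
        # the first field is the user ID, the rest are article IDs
        data = x.split(",")
        return data[0], data[1:]

    unknown = sc.textFile("s3://bucket/path/unknown_*.gz").map(parse)
    user_ids = unknown.map(lambda x: x[0])
    # IDFModel.transform needs an RDD; it cannot be called inside a map() lambda
    features = idf_model.transform(tf.transform(unknown.map(lambda x: x[1])))
    user_ids.zip(features).map(lambda x: (x[0], model.predict(x[1]))).collect()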
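PySpark is convenient, but…
• Python functions are pickled and then run in a distributed fashion, which causes all sorts of pain
• Using a Java library (kuromoji) requires a Scala wrapper plus a py4j wrapper
• From Scala it can be used as-is
• I tried hard but gave up; py4j is just painful
• For Spark-level use, the learning cost of Scala is low
• That said, sbt is a hassle…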
Distributed processing environments for Python
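Distributed processing libraries for Python
• For cluster computing
• PyRC, dispy, Pyro4 (used as the distributed backend for Gensim's LSI and LDA)
• Distributed task queues
• Celery: asynchronous distribution per function, just by adding a decorator
• IPython Cluster: for simple task distribution
• Spartan: distributes NumPy arrays over ZeroMQ (inspired by Spark's RDD)
• Disco: a MapReduce framework for Python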
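Gunosy's Python distributed processing environment
• When integrating machine learning with services, task parallelism (parallel stream processing) is what matters, and naive distributed processing is usually sufficient (e.g. Jubatus)
• On AWS, basically all data is accumulated in S3
• Task management and retries are left to Celery (AMQP)
• Worker deployment is fully automated with Chef + OpsWorks
• Online learning is distributed with iterative parameter mixing
• The EM algorithm is distributed by partitioning the data horizontally, computing parameters independently, and averaging them
• Article crawling and per-user recommendation are farmed out to workers

Gunosy's Python distributed processing environment
[Diagram labels: article crawler (celery worker), recommendation engine (celery worker), article click log, controller (django celery)]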
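Iterative parameter mixing can be sketched as follows: in each round every shard trains from the current shared weights on its own data, and the resulting weight vectors are averaged before the next round. The slides do not say which online learner is used, so a perceptron is assumed here purely for illustration; the function names and the data are hypothetical, and the shards are processed in a loop where a real setup would hand each one to a separate worker.

```python
def perceptron_epoch(weights, shard):
    """One pass of perceptron updates over a shard, starting from the given weights."""
    w = list(weights)
    for label, x in shard:  # label is +1 or -1
        score = sum(wi * xi for wi, xi in zip(w, x))
        if label * score <= 0:  # misclassified: move toward the example
            w = [wi + label * xi for wi, xi in zip(w, x)]
    return w


def mix(weight_vectors):
    """Average the per-shard weight vectors element-wise."""
    n = len(weight_vectors)
    return [sum(ws) / n for ws in zip(*weight_vectors)]


def iterative_parameter_mixing(shards, dim, epochs=10):
    weights = [0.0] * dim
    for _ in range(epochs):
        # Each shard would be handled by a separate worker in a real deployment.
        local = [perceptron_epoch(weights, shard) for shard in shards]
        weights = mix(local)
    return weights


# Made-up, linearly separable data split into two shards.
shards = [
    [(+1, [1.0, 0.0]), (-1, [0.0, 1.0])],
    [(+1, [2.0, 0.5]), (-1, [0.5, 2.0])],
]
weights = iterative_parameter_mixing(shards, dim=2)
```

Only the weight vectors travel between workers and the controller, never the training data, which is what makes the scheme a natural fit for a task queue: each round is a batch of independent tasks followed by one averaging step.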
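Summary
• The heart of Spark is the RDD data structure and a general-purpose parallel programming environment based on skeletal parallelism
• Distributed processing and distributed machine learning are conveniently available from Python
• But doing anything complex from Python is really painful, so write it in Scala
• If you really must use Python, consider the other parts of the Python distributed processing ecosystem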