Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
1日あたり数百万商品をクロールする 大規模クローラーの裏側 / How IQON crawle...
Search
Takehiro Shiozaki
August 01, 2017
Technology
4
1.6k
1日あたり数百万商品をクロールする 大規模クローラーの裏側 / How IQON crawler works
Takehiro Shiozaki
August 01, 2017
Tweet
Share
More Decks by Takehiro Shiozaki
See All by Takehiro Shiozaki
全部見せます! BigQueryのコスト削減の手法とその効果 / BigQuery Cost Reduction Methods
shiozaki
5
3.2k
タイムトラベルはじめました 〜時をかけるBigQuery〜 / Now serving Time Machine 〜BigQuery Which Leapt Through Time〜
shiozaki
0
5k
これからのZOZOを支える ログ収集基盤を設計した話 / Log collection infrastructure to support ZOZO in the future
shiozaki
6
14k
Amazon AuroraのデータをリアルタイムにGoogle BigQueryに連携してみた / Realtime data linkage from Amazon Aurora to Google BigQuery
shiozaki
10
15k
ZOZOTOWNの事業を支えるBigQueryの話 / BigQuery behind ZOZOTOWN
shiozaki
7
9.7k
ZOZOTOWNのDWHをRedshiftからBigQueryにお引越しした話 / Moving ZOZOTOWN DWH from Redshift to BigQuery
shiozaki
16
11k
ZOZOTOWNのバッチデータ転送基盤紹介 / ZOZOTOWN's data transfer batch
shiozaki
0
530
Digdagを仕事で使ってみて良かったこと、ハマったこと / Using Digdag in production environment
shiozaki
1
2k
ファッションIT業界あるある / fashion IT aruaru
shiozaki
1
790
Other Decks in Technology
See All in Technology
Why Platform Engineering? - マルチプロダクト・少人数 SRE の壁を越える挑戦 -
nulabinc
PRO
4
410
10分で学ぶ、RAGの仕組みと実践
supermarimobros
0
930
雑に疎通確認だけしたい...せや!CloudShell使ったろ!
alchemy1115
0
220
正式リリースされた Semantic Kernel の Agent Framework 全部紹介!
okazuki
1
1.1k
Coding Agentに値札を付けろ
watany
3
470
Google Cloud Next 2025 Recap 生成AIモデルとマーケティングでのコンテンツ生成 / Generative AI models and content creation in marketing
kyou3
0
170
Part1 GitHubってなんだろう?その2
tomokusaba
2
750
地に足の付いた現実的な技術選定から魔力のある体験を得る『AIレシート読み取り機能』のケーススタディ / From Grounded Tech Choices to Magical UX: A Case Study of AI Receipt Scanning
moznion
4
1.4k
クラウドネイティブ環境の脅威モデリング
kyohmizu
2
410
LLMの開発と社会実装の今と未来 / AI Builders' Community (ABC) vol.2
pfn
PRO
1
130
Azure × MCP 入門
ry0y4n
8
1.7k
試作とデモンストレーション / Prototyping and Demonstrations
ks91
PRO
0
120
Featured
See All Featured
The Invisible Side of Design
smashingmag
299
50k
Producing Creativity
orderedlist
PRO
344
40k
Optimizing for Happiness
mojombo
378
70k
Raft: Consensus for Rubyists
vanstee
137
6.9k
Understanding Cognitive Biases in Performance Measurement
bluesmoon
29
1.7k
CoffeeScript is Beautiful & I Never Want to Write Plain JavaScript Again
sstephenson
160
15k
Building a Modern Day E-commerce SEO Strategy
aleyda
40
7.2k
How GitHub (no longer) Works
holman
314
140k
Statistics for Hackers
jakevdp
799
220k
Large-scale JavaScript Application Architecture
addyosmani
512
110k
Principles of Awesome APIs and How to Build Them.
keavy
126
17k
For a Future-Friendly Web
brad_frost
177
9.7k
Transcript
© 2017 VASILY,Inc. ͋ͨΓඦສΛΫϩʔϧ͢Δ େنΫϩʔϥʔͷཪଆ 4QFFF$BGF.FFUVQ Ԙ㟒݈߂
© 2017 VASILY,Inc. ࣗݾհ ⾣ Ԙ㟒݈߂ ⾣ 7"4*-:৽ଔೖࣾ ⾣ όοΫΤϯυΤϯδχΞ
⾣ 3VCZ ⾣ (PPHMF#JH2VFSZ ⾣ "QBDIF4PMS ⾣ &NCVML ⾣ %JHEBH ▶ $ crontab -l 0 0 7 8 * /bin/increment_age ฐٕࣾज़ސ.BU[ࢯ
© 2017 VASILY,Inc. ໊ࣾגࣜձࣾ7"4*-: ϰΝγϦʔ 7"4*-: *OD ઃཱ݄ ॴࡏ౦ژ۠ޒాδχΞεϏϧ' ैۀһ໊
ࢿຊۚԯ දऔకۚࢁ༟थ औకࠓଜխઍ༿େี ओཁגओ άϩʔϏεϕϯνϟʔΩϟϐλϧ ҏ౻ςΫϊϩδʔϕϯνϟʔζ (.0ϕϯνϟʔύʔτφʔζ ,%%*גࣜձࣾ גࣜձࣾߨஊࣾ
© 2017 VASILY,Inc. Ҏ্ͷϑΝογϣϯ&$αΠτ͔ΒͷສΛ͑ΔΛܝࡌ ݄ؒສਓҎ্͕ར༻͢Δຊ࠷େڃͷϑΝογϣϯαΠτ
© 2017 VASILY,Inc. ࣍ ⾣ *20/ͷΫϩʔϥʔʹ͍ͭͯ ⾣ نɾऔಘ͍ͯ͠Δใ ⾣
ࢄΫϩʔϦϯάΛ࣮ݱ͢Δཁૉٕज़ ⾣ 424ͱ4IPSZVLFOΛ༻͍ͨඇಉظॲཧ ⾣ εέʔϥϒϧͳΠϯϑϥ ⾣ %PDLFS .FTPT .BSBUIPOʹΑΔΦʔτεέʔϧ ⾣ ·ͱΊ
© 2017 VASILY,Inc. ࣍ ⾣ *20/ͷΫϩʔϥʔʹ͍ͭͯ ⾣ نɾऔಘ͍ͯ͠Δใ ⾣
ࢄΫϩʔϦϯάΛ࣮ݱ͢Δཁૉٕज़ ⾣ 424ͱ4IPSZVLFOΛ༻͍ͨඇಉظॲཧ ⾣ εέʔϥϒϧͳΠϯϑϥ ⾣ %PDLFS .FTPT .BSBUIPOʹΑΔΦʔτεέʔϧ ⾣ ·ͱΊ
© 2017 VASILY,Inc. *20/ͷΫϩʔϥʔ ⾣ ఏܞαΠτҎ্ ⾣ ৗ࣌ߪങՄೳɿສҎ্
© 2017 VASILY,Inc. *20/ͷΫϩʔϥʔ ⾣ ߲ͷใΛऔಘ ࣸਅ Ձ֨ ࡏݿ
ϒϥϯυ ໊
© 2017 VASILY,Inc. ࣍ ⾣ *20/ͷΫϩʔϥʔʹ͍ͭͯ ⾣ نɾऔಘ͍ͯ͠Δใ ⾣
ࢄΫϩʔϦϯάΛ࣮ݱ͢Δཁૉٕज़ ⾣ 424ͱ4IPSZVLFOΛ༻͍ͨඇಉظॲཧ ⾣ εέʔϥϒϧͳΠϯϑϥ ⾣ %PDLFS .FTPT .BSBUIPOʹΑΔΦʔτεέʔϧ ⾣ ·ͱΊ
© 2017 VASILY,Inc. ࢄΫϩʔϦϯά ⾣ ରϖʔδ͕େͳͷͰࢄॲཧ ⾣ ࢄΫϩʔϧ༻ͷϑϨʔϜϫʔΫͳ͍ ⾣ 3VCZΛ༻͍ϑϧεΫϥονͰ࣮
4DSBQZEPFTO`UQSPWJEFBOZCVJMUJOGBDJMJUZGPSSVOOJOH DSBXMTJOBEJTUSJCVUF NVMUJTFSWFS NBOOFS IUUQTEPDTDSBQZPSHFOMBUFTUUPQJDTQSBDUJDFTIUNMEJTUSJCVUFEDSBXMT
© 2017 VASILY,Inc. UBTLRVFVFΛհͨ͠ࢄɾඇಉظॲཧ ⾣ 424ΛUBTLRVFVFͱͯ͠༻ ⾣ UBTLΛ࣮ߦ͢ΔϓϩηεʢXPSLFSʣಉ࢜ૄ݁߹ ⾣ ඇಉظॲཧϥΠϒϥϦͱͯ͠ɺ4IPSZVLFOΛ༻
XPSLFS XPSLFS FORVFVF EFRVFVF 424
© 2017 VASILY,Inc. 4IPSZVLFO ⾣ 3VCZͰॻ͔ΕͨඇಉظॲཧϑϨʔϜϫʔΫ ⾣ ෳΩϡʔͷཧػೳ ⾣
ϚϧνεϨου class HelloWorker include Shoryuken::Worker shoryuken_options queue: 'hello', auto_delete: true def perform(sqs_msg, name) puts "Hello, #{name}" end end HelloWorker.perform_async('joe') BTZODDBMM
© 2017 VASILY,Inc. શମߏਤ TUBSUDSBXMFS MJTUQBHFFORVFVF JUFNQBHFFORVFVF JUFNQBHFEPXOMPBE JUFNQBHFQBSTF
⾣ ΫϩʔϧॲཧΛෳݸͷλεΫʹࡉԽ $MPVE8BUDI-BNCEB
© 2017 VASILY,Inc. શମߏਤ TUBSUDSBXMFS MJTUQBHFFORVFVF JUFNQBHFFORVFVF JUFNQBHFEPXOMPBE JUFNQBHFQBSTF
⾣ ΫϩʔϧॲཧΛෳݸͷλεΫʹࡉԽ $MPVE8BUDI-BNCEB
© 2017 VASILY,Inc. $MPVE8BUDI-BNCEB ⾣ ఆظతʹ$MPVE8BUDI&WFOU͕ൃՐ͠ɺ-BNCEBΛىಈ ⾣ -BNCEB͕424ʹλεΫΛೖ͢Δ͜ͱͰɺΫϩʔϧ։࢝ $MPVE8BUDI 4/4
-BNCEB 424 JOWPLF FORVFVF
© 2017 VASILY,Inc. શମߏਤ TUBSUDSBXMFS MJTUQBHFFORVFVF JUFNQBHFFORVFVF JUFNQBHFEPXOMPBE JUFNQBHFQBSTF
⾣ ΫϩʔϧॲཧΛෳݸͷλεΫʹࡉԽ $MPVE8BUDI-BNCEB
© 2017 VASILY,Inc. TUBSUDSBXMFSXPSLFS ⾣ 3%4͔ΒͦͷͷΫϩʔϧରαΠτใΛऔಘ ⾣ ͦΕΒͷαΠτΛΫϩʔϧ͢ΔͨΊͷλεΫΛೖ TUBSUDSBXMFS 424
3%4 ΫϩʔϧରαΠτใऔಘ FORVFVF º&$αΠτͷ
© 2017 VASILY,Inc. શମߏਤ TUBSUDSBXMFS MJTUQBHFFORVFVF JUFNQBHFFORVFVF JUFNQBHFEPXOMPBE JUFNQBHFQBSTF
⾣ ΫϩʔϧॲཧΛෳݸͷλεΫʹࡉԽ $MPVE8BUDI-BNCEB
© 2017 VASILY,Inc. MJTUQBHFFORVFVFXPSLFS ⾣ &$αΠτͷϖʔδૹΓ෦ΛεΫϨΠϐϯά ⾣ શϖʔδͷ63-Λղੳ͢ΔͨΊͷλεΫΛೖ MJTUQBHFFORVFVF 424
ϖʔδૹΓ෦ղੳ FORVFVF ºϖʔδͷ &$TJUF IUUQTFYBNQMFDPNJUFNT QBHF IUUQTFYBNQMFDPNJUFNT QBHF IUUQTFYBNQMFDPNJUFNT QBHF
© 2017 VASILY,Inc. શମߏਤ TUBSUDSBXMFS MJTUQBHFFORVFVF JUFNQBHFFORVFVF JUFNQBHFEPXOMPBE JUFNQBHFQBSTF
⾣ ΫϩʔϧॲཧΛෳݸͷλεΫʹࡉԽ $MPVE8BUDI-BNCEB
© 2017 VASILY,Inc. JUFNQBHFFORVFVFXPSLFS ⾣ Ϧετϖʔδ͔Βৄࡉϖʔδͷ63-Λղੳ ⾣ શϖʔδͷ63-Λղੳ͢ΔͨΊͷλεΫΛೖ JUFNQBHFFORVFVF 424
Ϧετϖʔδˠৄࡉϖʔδͷ63-Λऔಘ FORVFVF ºϖʔδͷ &$TJUF IUUQTFYBNQMFDPNJUFNT IUUQTFYBNQMFDPNJUFNT IUUQTFYBNQMFDPNJUFNT
© 2017 VASILY,Inc. શମߏਤ TUBSUDSBXMFS MJTUQBHFFORVFVF JUFNQBHFFORVFVF JUFNQBHFEPXOMPBE JUFNQBHFQBSTF
⾣ ΫϩʔϧॲཧΛෳݸͷλεΫʹࡉԽ $MPVE8BUDI-BNCEB
© 2017 VASILY,Inc. JUFNQBHFEPXOMPBEXPSLFS ⾣ ৄࡉϖʔδ͔Β)5.-Λμϯϩʔυ ⾣ μϯϩʔυִؒͷௐͷͨΊʹ3FEJTʹࢄϩοΫʢޙड़ʣΛ࣮ݱ ⾣ )5.-Λղੳ͢ΔͨΊͷλεΫΛೖ
JUFNQBHFFORVFVF 424 ৄࡉϖʔδͷ)5.-Λऔಘ FORVFVF &$TJUF <!DOCTYPE> <HTML><HEAD><TITLE>トップス... ϩοΫऔಘ
© 2017 VASILY,Inc. શମߏਤ TUBSUDSBXMFS MJTUQBHFFORVFVF JUFNQBHFFORVFVF JUFNQBHFEPXOMPBE JUFNQBHFQBSTF
⾣ ΫϩʔϧॲཧΛෳݸͷλεΫʹࡉԽ $MPVE8BUDI-BNCEB
© 2017 VASILY,Inc. JUFNQBHFQBSTFXPSLFS ⾣ 91BUIɾਖ਼نදݱΛ͍)5.-Λύʔε ⾣ ύʔεઃఆʢ91BUIɾਖ਼نදݱXFCπʔϧͰೖߘ ⾣ ύʔε݁ՌΛ%#ʹॻ͖ࠐΉλεΫΛೖ
JUFNQBHFQBSTF 424 ύʔεઃఆΛऔಘ FORVFVF { "title": "トップス", "price": 9800, 3%4 ύʔεઃఆΛೖߘ
© 2017 VASILY,Inc. ͜ΕҎ߱ͷॲཧ ⾣ ࣌ؒͷ߹ͰࠓճׂѪ ⾣ Ϋϩʔϧ݁ՌΛ%#ʹॻ͖ࠐΈ ⾣ ը૾ॲཧ
⾣ ಁաॲཧ ⾣ ද৭நग़ ⾣ ΧςΰϦʔࣗಈྨ
© 2017 VASILY,Inc. μϯϩʔυִؒ ⾣ 3FEJTͰࢄϩοΫΛ࣮ݱ͠ɺμϯϩʔυִؒΛௐ IUUQTSFEJTJPUPQJDTEJTUMPDL EPXOMPBEXPSLFS" EPXOMPBEXPSLFS#
HFU@MPDLTVDDFTT MPDLFE HFU@MPDLGBJM EPXOMPBE HFU@MPDLGBJM FYQJSF HFU@MPDLTVDDFTT
© 2017 VASILY,Inc. จࣈίʔυ ⾣ NFUBDIBSTFU༻͍ͯ͠ͳ͍ ⾣ ,DPOWʢOLGϥούʔʣͷจࣈίʔυࣗಈਪଌػೳΛར༻ ▶
::Kconv.toutf8(str)͚ͩͰ0,
© 2017 VASILY,Inc. 41" 4JOHMF1BHF"QQMJDBUJPO ͷରԠ ⾣ 41")5.-ʹ΄ͱΜͲͷใ͕ͳ͍ ⾣
1IBOUPN+4Λͬͨ1SPYZΛհ͢Δ ⾣ PO-PBEΠϕϯτൃՐޙͷใΛऔಘ EPXOMPBEXPSLFS &$TJUF EPXOMPBE
© 2017 VASILY,Inc. 424ͷαΠζ੍ݶ ⾣ 424ʹ,#ҎԼͷςΩετσʔλ͔֨͠ೲͰ͖ͳ͍ ⾣ Ұ෦ͷϖʔδͷ)5.-͜ΕΛա ⾣
)UNM$PNQSFTTPS ;MJC #BTFͰ)5.-Λѹॖ ⾣ ʙఔʹѹॖ Base64.encode64( Zlib::Deflate.deflate( HtmlCompressor::Compressor.new.compress(html) ) )
© 2017 VASILY,Inc. ࣍ ⾣ *20/ͷΫϩʔϥʔʹ͍ͭͯ ⾣ نɾऔಘ͍ͯ͠Δใ ⾣
ࢄΫϩʔϦϯάΛ࣮ݱ͢Δཁૉٕज़ ⾣ 424ͱ4IPSZVLFOΛ༻͍ͨඇಉظॲཧ ⾣ εέʔϥϒϧͳΠϯϑϥ ⾣ %PDLFS .FTPT .BSBUIPOʹΑΔΦʔτεέʔϧ ⾣ ·ͱΊ
© 2017 VASILY,Inc. Πϯϑϥߏਤ &$TQPUqFFUJOTUBODFT XPSLFSTJODPOUBJOFS %FQMPZ$POUBJOFS 424 -BNCEB
$MPVE8BUDI 8BUDI.FUSJDT "VUP4MBDF FORVFVFEFRVFVF
© 2017 VASILY,Inc. "QBDIF.FTPT.BSBUIPO ⾣ "QBDIF.FTPT ⾣ "EJTUSJCVUFETZTUFNTLFSOFM ⾣
ෳϚγϯΛͭͷܭࢉػϓʔϧͱͯ͠நԽ ⾣ .BSBUIPO ⾣ .FTPT্Ͱಈ࡞͢ΔίϯςφΦʔέετϨʔγϣϯπʔϧ ⾣ .FTPTͷλεΫΛσʔϞϯԽ
© 2017 VASILY,Inc. EPXOMPBEXPSLFSͷՔಇ ίϯςφ ॲཧ ⾣ EPXOMPBEXPSLFSʹϩοΫ͕͋ΔͨΊɺ͕ඞཁ XJUI
XJUIPVU
© 2017 VASILY,Inc. EPXOMPBEΩϡʔͷଟॏԽ ⾣ EPXOMPBEΩϡʔɾࢄϩοΫ&$αΠτຖʹಠཱ EPXOMPBEXPSLFS EPXOMPBEXPSLFS ϥϯμϜʹEFRVFVF
αΠτʹରԠͨ͠ ࢄϩοΫΛऔಘ
© 2017 VASILY,Inc. ΦʔτεέʔϧͷͨΊͷϝτϦΫε 424 -BNCEB $MPVE8BUDI &WFOU ⾣
EPXOMPBEΩϡʔະॲཧͷλεΫͷΛࢹͯ͠ҙຯ͕ͳ͍ ⾣ ͦͷΘΓʹɺະॲཧλεΫͷΩϡʔͷݸΛࢹ ⾣ $MPVE8BUDIͰࢹͰ͖ͳ͍ ⾣ -BNCEBͰ424ͷ"1*Λୟ͘ JOWPLF HFUUIFOVNCFSPG OPOFNQUZEPXOMPBERVFVFT DIBOHFUIFOVNCFSPGDPOUBJOFST
© 2017 VASILY,Inc. 'PSNPSFJOGPSNBUJPO ⾣ %PDLFS"QBDIF.FTPT.BSBUIPOʹΑΔഒ͍*20/Ϋϩʔϥʔͷߏங ⾣ IUUQUFDIWBTJMZKQFOUSZJRPODSBXMFSCZEPDLFSBOENFTPTBOENBSBUIPO ⾣
"QBDIF.FTPT.BSBUIPOΛຊ൪Ͱӡ༻͢ΔͨΊͷͭͷ5JQT ⾣ IUUQUFDIWBTJMZKQFOUSZBQBDIFNFTPTBOENBSBUIPOUJQT ⾣ 1SPEVDUJPOEFQMPZNFOUPGUIF%PDLFSDPOUBJOFSXJUI.BSBUIPO ⾣ IUUQTTQFBLFSEFDLDPNLPUBUTVQSPEVDUJPOEFQMPZNFOUPGUIFEPDLFSDPOUBJOFSXJUINBSBUIPO ⾣ "QBDIF.FTPTXJUI"NB[PO&$4QPU'MFFU ⾣ IUUQTTQFBLFSEFDLDPNLPUBUTVBQBDIFNFTPTXJUIBNB[POFDTQPUqFFU
© 2017 VASILY,Inc. ࣍ ⾣ *20/ͷΫϩʔϥʔʹ͍ͭͯ ⾣ نɾऔಘ͍ͯ͠Δใ ⾣
ࢄΫϩʔϦϯάΛ࣮ݱ͢Δཁૉٕज़ ⾣ 424ͱ4IPSZVLFOΛ༻͍ͨඇಉظॲཧ ⾣ εέʔϥϒϧͳΠϯϑϥ ⾣ %PDLFS .FTPT .BSBUIPOʹΑΔΦʔτεέʔϧ ⾣ ·ͱΊ
© 2017 VASILY,Inc. ·ͱΊ ⾣ *20/ͷΫϩʔϥʔຖඦສΛΫϩʔϧ͍ͯ͠Δ ⾣ େنͳࢄΫϩʔϥʔΛ3VCZͰϑϧεΫϥονͰ࣮ ⾣
ඇಉظॲཧΛ׆༻ͨ͠ॊೈͳΞϓϦέʔγϣϯ ⾣ εέʔϥϒϧͳΠϯϑϥͷ্ͰɺεϐʔυΞοϓˍඅ༻ݮ
© 2017 VASILY,Inc. 8FSF)JSJOH IUUQTWBTJMZKQSFDSVJU