Inside a Large-Scale Crawler That Crawls Several Million Products a Day / How IQON crawler works
Takehiro Shiozaki
August 01, 2017
Technology
Transcript
© 2017 VASILY,Inc.
Inside a Large-Scale Crawler That Crawls Several Million Products a Day
Speee Cafe Meetup
Takehiro Shiozaki

About me
• Takehiro Shiozaki
• Joined VASILY as a new graduate
• Backend engineer
• Ruby
• Google BigQuery
• Apache Solr
• Embulk
• Digdag
▶ $ crontab -l
  0 0 7 8 * /bin/increment_age
[Photo caption: Matz, VASILY's technical advisor]
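
Company profile
Company name: VASILY, Inc. (株式会社VASILY, ヴァシリー)
Founded: […]
Location: Gotanda, Tokyo […] ジニアスビル […]F
Employees: […]
Capital: […]
CEO: Yuki Kanayama (金山裕樹)
Directors: Masayuki Imamura (今村雅幸), […]
Major shareholders: Globis Venture Capital, ITOCHU Technology Ventures, GMO VenturePartners, KDDI Corporation, Kodansha Ltd.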
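
IQON carries more than […] × 10,000 products from over […] fashion e-commerce (EC) sites
One of Japan's largest fashion sites, used by more than […] × 10,000 people a month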
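
Agenda
• About the IQON crawler
  • Scale and the information it collects
• Technologies that make distributed crawling possible
  • Asynchronous processing with SQS and Shoryuken
• Scalable infrastructure
  • Autoscaling with Docker, Mesos, and Marathon
• Summary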

The IQON crawler
• Partner sites: more than […]
• Products available for purchase at any given time: more than […] × 10,000

The IQON crawler
• Collects […] fields of information per product
[Figure labels: photo, price, stock, brand, product name]
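
Distributed crawling
• The number of target pages is huge, so the work is distributed
• There is no framework for distributed crawling
• Implemented from scratch in Ruby
"Scrapy doesn't provide any built-in facility for running crawls in a distribute (multi-server) manner."
https://doc.scrapy.org/en/latest/topics/practices.html#distributed-crawls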
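
Distributed, asynchronous processing via a task queue
• SQS is used as the task queue
• The processes that execute tasks (workers) are loosely coupled to one another
• Shoryuken is used as the asynchronous processing library
[Diagram: worker, enqueue, SQS, dequeue, worker]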
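
The enqueue/dequeue pattern on this slide can be sketched with Ruby's in-process `Thread::Queue` standing in for SQS. The task shape, the worker count, and the `run_workers` helper are assumptions made for illustration; the crawler described in the talk uses SQS and Shoryuken instead.

```ruby
# In-memory stand-in for SQS: Thread::Queue is thread-safe, so the producer
# and the workers share only the queue and know nothing about each other.

# Hypothetical helper: starts `count` workers that dequeue tasks until the
# queue is closed and drained, and collects what they processed.
def run_workers(queue, count)
  results = Queue.new
  threads = Array.new(count) do |id|
    Thread.new do
      # Queue#pop returns nil once the queue is closed and empty.
      while (task = queue.pop)
        results << "worker #{id} processed #{task[:url]}"
      end
    end
  end
  [threads, results]
end

queue = Queue.new
threads, results = run_workers(queue, 2)

# Enqueue side: tasks are added without knowing which worker handles them.
%w[https://example.com/a https://example.com/b https://example.com/c].each do |url|
  queue << { url: url }
end

queue.close          # signal that no more tasks will arrive
threads.each(&:join) # wait for the workers to drain the queue

processed = []
processed << results.pop until results.empty?
puts processed.sort
```

Because the workers depend only on the queue, more of them can be added or removed without touching the code that enqueues tasks.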
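
Shoryuken
• An asynchronous processing framework written in Ruby
• Manages multiple queues
• Multithreaded
    # Worker that consumes messages from the 'hello' queue
    class HelloWorker
      include Shoryuken::Worker
      # Delete each message from SQS once perform finishes without raising
      shoryuken_options queue: 'hello', auto_delete: true

      def perform(sqs_msg, name)
        puts "Hello, #{name}"
      end
    end

    # async call: enqueues the job and returns immediately
    HelloWorker.perform_async('joe')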

Overall architecture
[Diagram: CloudWatch and Lambda, start crawler, list page enqueue, item page enqueue, item page download, item page parse]
• The crawl is broken down into multiple fine-grained tasks

CloudWatch and Lambda
• A CloudWatch Event fires periodically and invokes Lambda
• Lambda puts a task into SQS, which starts the crawl
[Diagram: CloudWatch, SNS, Lambda, SQS; labels: invoke, enqueue]
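
A minimal sketch of the scheduled trigger described on this slide. The queue URL, the message format, and the `FakeSqsClient` class are hypothetical: the fake client only mimics the keyword arguments of `Aws::SQS::Client#send_message` so that the example runs without AWS credentials, and the talk does not say which language the actual Lambda function is written in.

```ruby
require "json"

# Hypothetical queue URL; the real value is not given in the talk.
START_QUEUE_URL = "https://sqs.example.com/start-crawler"

# Stand-in for an SQS client. It records messages instead of sending them.
class FakeSqsClient
  attr_reader :sent

  def initialize
    @sent = []
  end

  def send_message(queue_url:, message_body:)
    @sent << { queue_url: queue_url, message_body: message_body }
  end
end

# Lambda-style entry point, invoked on a schedule. It does no crawling itself;
# it only drops a single "start" task into the queue.
def handler(event:, context:, sqs: FakeSqsClient.new)
  body = JSON.generate({ task: "start_crawler", triggered_at: event["time"] })
  sqs.send_message(queue_url: START_QUEUE_URL, message_body: body)
  { enqueued: 1 }
end

sqs = FakeSqsClient.new
result = handler(event: { "time" => "2017-08-01T00:00:00Z" }, context: nil, sqs: sqs)
puts result
puts sqs.sent.first[:message_body]
```

Keeping the trigger this small means the schedule and the crawl logic can change independently: everything after the first message is driven by the workers.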
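
start crawler worker
• Fetches from RDS the information on the sites to crawl that day
• Enqueues a task for crawling each of those sites
[Diagram: start crawler, RDS, SQS; labels: fetch crawl-target site information, enqueue × number of EC sites]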

list page enqueue worker
• Scrapes the pagination section of the EC site
• Enqueues tasks for parsing the URLs on every page
[Diagram: list page enqueue, EC site, SQS; labels: parse the pagination section, enqueue × number of pages; sample URLs: https://example.com/items?page=…]
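
The pagination step can be sketched as follows, using only the standard library. The pager markup, the `page` query parameter, and the `list_page_urls` helper are assumptions for illustration; real sites differ in how they paginate.

```ruby
require "uri"

# Hypothetical pagination markup.
PAGINATION_HTML = <<~HTML
  <div class="pager">
    <a href="/items?page=1">1</a>
    <a href="/items?page=2">2</a>
    <a href="/items?page=3">3</a>
  </div>
HTML

# Finds the highest page number in the pager and returns one list-page URL
# per page. Each URL would become one task in the next queue.
def list_page_urls(base_url, pagination_html)
  pages = pagination_html.scan(/[?&]page=(\d+)/).flatten.map(&:to_i)
  last_page = pages.max || 1
  (1..last_page).map do |page|
    uri = URI.join(base_url, "/items")
    uri.query = URI.encode_www_form(page: page)
    uri.to_s
  end
end

urls = list_page_urls("https://example.com", PAGINATION_HTML)
puts urls
```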

item page enqueue worker
• Extracts the detail page URLs from each list page
• Enqueues tasks for parsing the URLs on every page
[Diagram: item page enqueue, EC site, SQS; labels: list page → get detail page URLs, enqueue × number of pages; sample URLs: https://example.com/items/…]
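
A sketch of the list page to detail page step under assumed markup. The `/items/<id>` URL pattern and the `detail_page_urls` helper are hypothetical, and a plain regular expression over `href` attributes is used here only to keep the example within the standard library.

```ruby
require "uri"

# Hypothetical list-page markup.
LIST_PAGE_HTML = <<~HTML
  <ul class="items">
    <li><a href="/items/101">Tops</a></li>
    <li><a href="/items/102">Skirt</a></li>
    <li><a href="/about">About</a></li>
    <li><a href="/items/101">Tops (duplicate link)</a></li>
  </ul>
HTML

# Keeps only links that look like detail pages, resolves them against the
# list-page URL, and removes duplicates so each page is enqueued once.
def detail_page_urls(list_page_url, html)
  html.scan(/href="([^"]+)"/).flatten
      .select { |href| href.match?(%r{\A/items/\d+\z}) }
      .map { |href| URI.join(list_page_url, href).to_s }
      .uniq
end

detail_urls = detail_page_urls("https://example.com/items?page=1", LIST_PAGE_HTML)
puts detail_urls
```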
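
item page download worker
• Downloads the HTML of the detail pages
• Uses a distributed lock in Redis (described later) to regulate the download interval
• Enqueues a task for parsing the HTML
[Diagram: item page enqueue, EC site, SQS; labels: acquire lock, fetch the detail page HTML (<!DOCTYPE> <HTML><HEAD><TITLE>トップス...), enqueue]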
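
item page parse worker
• Parses the HTML using XPath and regular expressions
• The parse settings (XPath and regular expressions) are entered through a web tool
• Enqueues a task that writes the parse result to the DB
[Diagram: item page parse, RDS, SQS; labels: submit parse settings, fetch parse settings, enqueue; sample output: { "title": "トップス", "price": 9800, …]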
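
The idea of driving the parser from stored settings can be sketched like this. The settings format, the field names, and the `parse_item` helper are assumptions; REXML is used only because it ships with Ruby and supports XPath, and it requires well-formed markup, so a production crawler would more likely rely on a tolerant HTML parser.

```ruby
require "json"
require "rexml/document"

# Hypothetical per-site parse settings, as they might be stored after being
# entered through the web tool: an XPath per field, plus an optional regex.
PARSE_CONFIG = {
  "title" => { "xpath" => "//h1[@class='name']" },
  "price" => { "xpath" => "//span[@class='price']", "regex" => "([\\d,]+)", "type" => "integer" }
}.freeze

# REXML needs well-formed markup, so this sample is XHTML-like.
ITEM_HTML = <<~HTML
  <html><body>
    <h1 class="name">トップス</h1>
    <span class="price">¥9,800 (税込)</span>
  </body></html>
HTML

# Applies each field's XPath, then narrows the text with the regex if present.
def parse_item(html, config)
  doc = REXML::Document.new(html)
  config.each_with_object({}) do |(field, rule), item|
    node = REXML::XPath.first(doc, rule["xpath"])
    text = node ? node.text.to_s.strip : ""
    text = text[Regexp.new(rule["regex"]), 1].to_s if rule["regex"]
    item[field] = rule["type"] == "integer" ? text.delete(",").to_i : text
  end
end

item = parse_item(ITEM_HTML, PARSE_CONFIG)
puts JSON.generate(item)
```

Keeping the selectors in data rather than code means a change in a site's markup can be handled by editing the settings, without deploying the workers again.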
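
Subsequent processing
• Omitted today for lack of time
• Writing the crawl results to the DB
• Image processing
• Transparency processing
• Representative color extraction
• Automatic category classification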
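
Download interval
• A distributed lock implemented in Redis regulates the download interval
https://redis.io/topics/distlock
[Diagram: download worker A and download worker B; timeline labels in order: get_lock success, locked, get_lock fail, download, get_lock fail, expire, get_lock success]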
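
The timeline in the diagram can be reproduced with a small in-memory model. `FakeRedis`, the key naming, and the `get_lock` helper are assumptions for illustration; the fake's `set` only imitates the semantics of Redis `SET key value NX PX <ms>` (which redis-rb exposes as `set(key, value, nx: true, px: ...)`) so that the example runs without a Redis server.

```ruby
# In-memory stand-in for Redis. `set` succeeds only if the key is absent or
# has expired, and then stores it with a time-to-live in milliseconds.
class FakeRedis
  def initialize(clock)
    @clock = clock # callable returning the current time in milliseconds
    @expires_at = {}
  end

  def set(key, _value, nx:, px:)
    now = @clock.call
    held = @expires_at.key?(key) && @expires_at[key] > now
    return false if nx && held

    @expires_at[key] = now + px
    true
  end
end

# Hypothetical helper: one lock per site, never released explicitly. Letting
# it expire is what enforces the minimum interval between downloads.
def get_lock(redis, site, interval_ms)
  redis.set("lock:#{site}", "1", nx: true, px: interval_ms)
end

now_ms = 0
redis = FakeRedis.new(-> { now_ms })

attempts = []
attempts << get_lock(redis, "example.com", 1000) # worker A acquires and may download
attempts << get_lock(redis, "example.com", 1000) # worker B fails: lock still held
now_ms = 1000                                    # the lock has expired
attempts << get_lock(redis, "example.com", 1000) # worker B acquires
p attempts
```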

Character encoding
• The meta charset is not trusted
• Uses the automatic encoding detection of Kconv (an nkf wrapper)
▶ ::Kconv.toutf8(str) is all it takes
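
Handling SPAs (Single Page Applications)
• With an SPA, the HTML contains almost none of the information
• Requests go through a proxy built on PhantomJS
• The content is captured after the onLoad event has fired
[Diagram: download worker, proxy, EC site; label: download]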
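
SQS size limit
• SQS can only store text data of 256 KB or less
• The HTML of some pages exceeds this
• The HTML is compressed with HtmlCompressor, Zlib, and Base64
• This compresses it to roughly […] to […]
    # Minify the HTML, deflate it, then Base64-encode it so it is plain text
    Base64.encode64(
      Zlib::Deflate.deflate(
        HtmlCompressor::Compressor.new.compress(html)
      )
    )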
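
A round-trip sketch of the compression step, assuming the 256 KB message limit that applied at the time of the talk. The `pack_html` and `unpack_html` helpers are hypothetical names, the HtmlCompressor minification step from the slide is left out to stay within the standard library, and `strict_encode64` is used instead of `encode64` to avoid embedded newlines.

```ruby
require "base64"
require "zlib"

SQS_LIMIT_BYTES = 256 * 1024

# Deflate the HTML, then Base64-encode it so the payload is plain text.
def pack_html(html)
  Base64.strict_encode64(Zlib::Deflate.deflate(html))
end

# Reverse of pack_html. Inflate returns binary data, so the encoding is
# restored to UTF-8 (assumed to be the encoding of the original HTML).
def unpack_html(payload)
  Zlib::Inflate.inflate(Base64.strict_decode64(payload)).force_encoding(Encoding::UTF_8)
end

# Synthetic, highly repetitive page that is larger than the limit.
html = "<html><body>" + ("<li class=\"item\">トップス</li>\n" * 20_000) + "</body></html>"
payload = pack_html(html)

puts "original: #{html.bytesize} bytes"
puts "packed:   #{payload.bytesize} bytes"
puts "fits in one message: #{payload.bytesize <= SQS_LIMIT_BYTES}"
puts "round trip ok: #{unpack_html(payload) == html}"
```

The synthetic page compresses far better than real HTML would, so the ratio printed here says nothing about the ratios reported in the talk.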

Infrastructure architecture
[Diagram labels: EC2 spot fleet instances, workers in container, Deploy Container, SQS, Lambda, CloudWatch, Watch Metrics, Auto Scale, enqueue/dequeue]

Apache Mesos and Marathon
• Apache Mesos
  • A distributed systems kernel
  • Abstracts multiple machines into a single pool of compute resources
• Marathon
  • A container orchestration tool that runs on Mesos
  • Runs Mesos tasks as daemons

Operating the download worker
[Graph: processing versus number of containers, with and without the lock]
• Because the download worker is subject to a lock, some care is needed

Multiplexing the download queue
• The download queue and the distributed lock are independent for each EC site
[Diagram: download workers; labels: dequeue at random, acquire the distributed lock for that site]
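
A minimal in-memory model of per-site queues with per-site locks. The data structures, the site names, and the `dequeue_for_download` helper are assumptions; in the system described in the talk the queues live in SQS and the locks in Redis.

```ruby
# Hypothetical in-memory model: one download queue and one lock per EC site.
queues = {
  "site-a.example" => ["https://site-a.example/items/1", "https://site-a.example/items/2"],
  "site-b.example" => ["https://site-b.example/items/1"],
  "site-c.example" => []
}
locked_sites = { "site-a.example" => true } # site A was downloaded from too recently

# Picks a non-empty queue at random and takes a task only if that site's lock
# is free. Returns nil when the chosen site is locked, so the caller can
# simply try again with another random pick.
def dequeue_for_download(queues, locked_sites, rng: Random.new)
  candidates = queues.reject { |_site, tasks| tasks.empty? }.keys
  return nil if candidates.empty?

  site = candidates.sample(random: rng)
  return nil if locked_sites[site]

  locked_sites[site] = true
  queues[site].shift
end

picked = []
10.times do
  task = dequeue_for_download(queues, locked_sites)
  picked << task if task
end
p picked
```

With a single shared queue, a long run of tasks for one locked site would stall every worker; separate queues let workers move on to sites whose locks are free.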
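
Metrics for autoscaling
• Monitoring the number of unprocessed tasks in the download queues is meaningless
• Instead, the number of queues that still hold unprocessed tasks is monitored
• CloudWatch cannot monitor this
• A Lambda function calls the SQS API
[Diagram: CloudWatch Event, Lambda, SQS; labels: invoke, get the number of non-empty download queues, change the number of containers]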
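
The metric on this slide can be sketched as below. `FakeSqsClient` mimics only the call shape of `Aws::SQS::Client#get_queue_attributes`, and the queue URLs, the `non_empty_queue_count` helper, and the one-container-per-queue rule in `desired_containers` are assumptions; the talk does not state the actual scaling rule.

```ruby
# Stand-in for an SQS client so the sketch runs without AWS. The response
# exposes an `attributes` hash of strings, like the real client's response.
FakeResponse = Struct.new(:attributes)

class FakeSqsClient
  def initialize(depths)
    @depths = depths
  end

  def get_queue_attributes(queue_url:, attribute_names:)
    FakeResponse.new({ "ApproximateNumberOfMessages" => @depths.fetch(queue_url).to_s })
  end
end

# Counts the queues that still hold at least one message.
def non_empty_queue_count(sqs, queue_urls)
  queue_urls.count do |url|
    resp = sqs.get_queue_attributes(queue_url: url,
                                    attribute_names: ["ApproximateNumberOfMessages"])
    resp.attributes["ApproximateNumberOfMessages"].to_i > 0
  end
end

# Hypothetical scaling rule: one container per non-empty queue, within bounds.
def desired_containers(non_empty, min: 1, max: 50)
  non_empty.clamp(min, max)
end

depths = {
  "https://sqs.example.com/download-site-a" => 120_000,
  "https://sqs.example.com/download-site-b" => 0,
  "https://sqs.example.com/download-site-c" => 3
}
sqs = FakeSqsClient.new(depths)
non_empty = non_empty_queue_count(sqs, depths.keys)
puts "non-empty queues: #{non_empty}, containers: #{desired_containers(non_empty)}"
```

One plausible reading of the slide is that the per-site lock caps throughput per site, so a deep backlog on a single site cannot be drained faster by adding containers, whereas the number of sites with pending work reflects the parallelism that is actually usable.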

For more information
• Building an IQON crawler that is […] times faster with Docker, Apache Mesos, and Marathon
  • http://tech.vasily.jp/entry/iqon-crawler-by-docker-and-mesos-and-marathon
• […] tips for running Apache Mesos and Marathon in production
  • http://tech.vasily.jp/entry/apache-mesos-and-marathon-tips
• Production deployment of the Docker container with Marathon
  • https://speakerdeck.com/kotatsu/production-deployment-of-the-docker-container-with-marathon
• Apache Mesos with Amazon EC2 Spot Fleet
  • https://speakerdeck.com/kotatsu/apache-mesos-with-amazon-ec2-spot-fleet
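
Summary
• The IQON crawler crawls several million products every day
• A large-scale distributed crawler implemented from scratch in Ruby
• A flexible application built on asynchronous processing
• Faster and cheaper to run on top of scalable infrastructure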

We're Hiring
https://vasily.jp/recruit