
web scraping with polite package

bk
December 05, 2020


Transcript

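  1. Audience and purpose

    Target audience
    • Has a rough familiarity with web scraping
    • Feels uneasy about the rules and regulations surrounding web scraping
    Purpose
    • Learn the basics of web scraping etiquette through the polite package
    Not covered
    • The scraping operations themselves, using rvest and similar tools
    • Restrictions on how the acquired information may be used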
  2. What is the polite package?

    Be Nice on the Web
    "The three pillars of a polite session are seeking permission, taking slowly and never asking twice."
    https://www.rdocumentation.org/packages/polite/versions/0.1.1
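  3. What is the polite package? The scraping etiquette that polite puts into practice

    ・Confirm that fetching is permitted (seeking permission)
    ・Leave an interval between requests (taking slowly)
    ・Do not repeat the confirmation (never asking twice)
    Using polite does not mean every legal issue is avoided. Personal information, terms of service, and copyright still need attention.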
  4. How to use the polite package

    1, bow: Introduce yourself to the host
    2, scrape: Scrape the content of authorized page/API
    ( 3, rvest: helps you scrape information from web pages )
    4, nod: Agree Modification Of Session Path With The Host
    https://www.rdocumentation.org/packages/polite/versions/0.1.1
    https://www.rdocumentation.org/packages/rvest/versions/0.3.6
  5. bow
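
As a rough illustration of the bow step followed by scrape, the sketch below opens a session and fetches one page. The host, user agent string and contact address are hypothetical placeholders, and running it requires network access.

```r
library(polite)

# Hypothetical host and contact details; replace them with your own.
session <- bow(
  url = "https://www.example.com/",
  user_agent = "example-scraper (contact: me@example.com)",
  delay = 5  # seconds to wait between requests
)

# Printing the session shows a summary of what was agreed with the host.
print(session)

# scrape() fetches the content of the path the session was opened for.
page <- scrape(session)
```

Putting a contact address in the user agent is one common courtesy, not something polite requires.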

  6. How to use the polite package: rvest

    Easily Harvest (Scrape) Web Pages
    Release date: r
    Developer: Hadley Wickham
    URL: https://github.com/tidyverse/rvest
    https://www.rdocumentation.org/packages/rvest/versions/0.3.6
  7. How to use the polite package: rvest

    Extract the information you need from the HTML Document obtained with scrape.
    References:
    Intro to {polite} Web Scraping of Soccer Data with R! (https://ryo-n7.github.io/2020-05-14-webscrape-soccer-data-with-R/)
    Introduction to Scraping with R (https://www.amazon.co.jp/dp/486354216X)
    A Practical Introduction to RStudio for R Users: The World of Modern Analysis Workflows with tidyverse (https://www.amazon.co.jp/dp/4774198536)
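
The extraction step can be tried without touching any website. In the sketch below, a small inline HTML string stands in for the document that scrape would return. The markup, the selectors and the variable names are made up for illustration.

```r
library(rvest)

# Hypothetical HTML standing in for the document returned by scrape().
page <- read_html('
  <html><body>
    <h1>Match results</h1>
    <table>
      <tr><th>Team</th><th>Goals</th></tr>
      <tr><td>A</td><td>2</td></tr>
      <tr><td>B</td><td>1</td></tr>
    </table>
  </body></html>
')

# Pull out the heading text with a CSS selector.
title <- html_text(html_node(page, "h1"))

# Convert the first table on the page into a data frame.
results <- html_table(html_node(page, "table"))
```

`html_node` and `html_table` are used because the deck refers to rvest 0.3.6. Newer rvest releases also offer `html_element`.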
  8. nod
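
A minimal sketch of nod, assuming a session has already been opened with bow. The host and the two paths are hypothetical, and running it requires network access.

```r
library(polite)

# Hypothetical host; the session is opened once with bow().
session <- bow("https://www.example.com/", delay = 5)

# Hypothetical paths on the same host.
paths <- c("news/", "about/")

pages <- lapply(paths, function(p) {
  # nod() keeps the existing session and only changes the target path,
  # so there is no need to bow to the host again for every page.
  scrape(nod(session, path = p))
})
```

Reusing one session this way is what lets the delay agreed in bow apply across all the pages.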

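  9. Summary

    The polite workflow
    1, bow: checks robots.txt, announces the UserAgent, and sets the delay.
    2, scrape: fetches the content
    ( 3, rvest: extracts the information you need )
    4, nod: changes the target path
    The three main functions of polite
    ・Leave an interval between requests (taking slowly) → set in bow. The default is 5 seconds.
    ・Confirm that fetching is permitted (seeking permission) → obtained from robots.txt by bow.
    ・Do not repeat the confirmation (never asking twice) → handled by nod.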
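
To get a feel for the kind of robots.txt check that bow carries out, the sketch below uses the robotstxt package from the deck's reference list on an inline, made-up robots.txt. It illustrates the idea only and makes no claim about how polite implements the check internally. The rules and paths are hypothetical.

```r
library(robotstxt)

# Hypothetical robots.txt content, inlined so no network access is needed.
rules_text <- paste(
  "User-agent: *",
  "Disallow: /private/",
  "Crawl-delay: 10",
  sep = "\n"
)

# Parse the text into a robotstxt object.
rules <- robotstxt(text = rules_text)

# Ask whether a generic bot may fetch each path.
allowed <- rules$check(paths = c("/", "/private/data.html"), bot = "*")

# Crawl-delay entries declared in the file.
rules$crawl_delay
```

A path that comes back as not allowed should simply be left alone.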
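  10. Other points to note

    Terms of service
    Scraping a site whose terms prohibit scraping may lead to claims for damages and similar action. (This applies only where the user has agreed to the terms, for example by registering as a member.)
    Personal information
    When acquiring personal information, the purpose of use must be made clear to the person concerned.
    Copyright
    Copying or saving a copyrighted work without the rights holder's consent constitutes copyright infringement. (However, reproduction and similar acts for the purpose of information analysis may be carried out without the rights holder's consent.)
    PigData, "Ask an IT lawyer: Is scraping by a company illegal?", https://services.sms-datatech.co.jp/pig-data/2019/07/03/scrapinglaw/
    Hidetoshi Nakano (lawyer specializing in IT law, AI and Fintech law), "[Scraping and the law] A lawyer explains what is legally OK and what is not", https://it-bengosi.com/blog/scraping/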
  11. Other points to note

    Please confirm each of these points for yourself.
  12. References

    R Documentation, "polite package", https://www.rdocumentation.org/packages/polite/versions/0.1.1
    R Documentation, "rvest package", https://www.rdocumentation.org/packages/rvest/versions/0.3.6
    Blog of the access analysis tool "AI Analyst", "What is robots.txt? A detailed explanation from its meaning to how to configure it", https://wacul-ai.com/blog/seo/internal-seo/seo-robots-txt/
    Octoparse, "What is web scraping? An explanation from definition to applications", https://www.octoparse.jp/blog/web-scraping/
    PigData, "Ask an IT lawyer: Is scraping by a company illegal?", https://services.sms-datatech.co.jp/pig-data/2019/07/03/scrapinglaw/
  13. References

    Agency for Cultural Affairs, "Cases in which copyrighted works can be used freely", https://www.bunka.go.jp/seisaku/chosakuken/seidokaisetsu/gaiyo/chosakubutsu_jiyu.html
    Stimulator, "Rules for web scraping and reading terms of service with Python", https://vaaaaaanquish.hatenablog.com/entry/2017/12/01/064227
    Hidetoshi Nakano (lawyer specializing in IT law, AI and Fintech law), "[Scraping and the law] A lawyer explains what is legally OK and what is not", https://it-bengosi.com/blog/scraping/
    PigData, "[Scraping] Five service patterns that do not become illegal", https://services.sms-datatech.co.jp/pig-data/2020/01/15/scrapinglaw3/
    R Documentation, "robotstxt package", https://www.rdocumentation.org/packages/robotstxt/versions/0.7.13
  14. References

    Motohiro Ishida, Daisuke Ichikawa, Shinya Uryu, Hiroaki Yutani, C&R Institute, "Introduction to Scraping with R", https://www.amazon.co.jp/dp/486354216X
    Yuya Matsumura, Hiroaki Yutani, Kazuhiro Maeda, Yasunori Kinosada, Gijutsu-Hyoron, "A Practical Introduction to RStudio for R Users: The World of Modern Analysis Workflows with tidyverse", https://www.amazon.co.jp/dp/4774198536
    "Intro to {polite} Web Scraping of Soccer Data with R!", https://ryo-n7.github.io/2020-05-14-webscrape-soccer-data-with-R/
  15. Appendix: What is a UserAgent?

    A user agent (User Agent) is the program used when accessing a website, or the string used to identify such a program. *1
    Example: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36 *2
    *1 https://www.irep.co.jp/knowledge/glossary/detail/id=10210/
    *2 https://qiita.com/nightyknite/items/b2590a69f2e0135756dc