
Running Scraping Reliably: What Was Hard and What Helped

shida
August 21, 2016


Bayside Tech Bridge 2016.08.21
Crawling specialists talk about what goes on behind the scenes of crawler operations!


Transcript

  1. Scraping with Poltergeist

        require 'capybara/poltergeist'

        Capybara.register_driver :poltergeist do |app|
          Capybara::Poltergeist::Driver.new(app)
        end
        Capybara.default_driver = :poltergeist

        agent = Capybara.current_session
        agent.visit('URL')
        number = agent.find('CSS selector').text.to_i
  2. Authentication via cookies

        require 'base64'
        require 'json'

        # Serialize the browser's cookies and store them on the user record.
        def save_cookie(agent, user)
          cookies_str = Base64.encode64(
            Marshal.dump(agent.driver.browser.cookies))
          user.update_attributes(cookies: cookies_str)
        end

        # Restore the stored cookies into the browser session.
        def load_cookie(agent, user)
          cookies = Marshal.load(Base64.decode64(user.cookies))
          cookies.values.each do |cookie|
            cookie_hash = JSON.parse(cookie.to_json)["attributes"]
            agent.driver.browser.set_cookie(cookie_hash)
          end
        end
  3. Change the User-Agent every so often

        # Appending the current time makes the User-Agent string differ on every run.
        agent.driver.headers = {
          "User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36 #{Time.now.to_f.to_s}"
        }
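  4. Avoid scraping from display-oriented markup where possible

    Information display screen: "Number of guests: X people"
    Form screen: "Number of guests: [X] people"
    Form markup is tied to the server-side program, so it rarely changes.

        <div data-bootstrap-data="{a: 'b', ... }" />

    Places where data is handed to JavaScript as a JSON string also rarely change.

        http://example.com/users/12345678

    URLs also rarely change.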
  5. Wait for loading: extend the timeouts

        Capybara.register_driver :poltergeist do |app|
          Capybara::Poltergeist::Driver.new(app, :timeout => 60)
        end
        Capybara.default_driver = :poltergeist
        Capybara.default_max_wait_time = 30

        agent = Capybara.current_session
        # Waits up to 60 seconds
        agent.visit('URL')
        # Waits up to 30 seconds for asynchronous JavaScript updates and the like to finish
        number = agent.find('CSS selector').text.to_i
  6. Run unit tests automatically on a schedule

        project='repository name'
        branch='master'
        api_token='API token'
        url="https://circleci.com/api/v1/project/${project}/tree/${branch}?circle-token=${api_token}"
        curl \
          --header "Accept: application/json" \
          --header "Content-Type: application/json" \
          --request POST "${url}"

    Trigger CircleCI builds periodically from cron via the API.
    If a build fails, CircleCI notifies Slack.
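
Slide 4 recommends reading from sources that rarely change (JSON handed to JavaScript, URLs) rather than display markup, but shows no code for it. Below is a minimal sketch of that idea using only the Ruby standard library. The helper names `parse_bootstrap_data` and `user_id_from_url` are hypothetical and do not come from the slides. The sketch also assumes the attribute holds valid JSON, which the abbreviated `{a: 'b', ... }` on the slide is not. With Capybara, the raw attribute string could presumably be read with something like `agent.find('div[data-bootstrap-data]')['data-bootstrap-data']`.

```ruby
require 'json'
require 'uri'

# Hypothetical helper: parse the JSON string taken from a data attribute
# such as <div data-bootstrap-data='{"a": "b"}'>.
# Assumes the page embeds valid JSON; returns nil when it does not,
# so one malformed page does not crash the whole crawl.
def parse_bootstrap_data(raw)
  JSON.parse(raw)
rescue JSON::ParserError
  nil
end

# Hypothetical helper: take the numeric ID from the last path segment of a
# URL like http://example.com/users/12345678. Returns nil when the last
# segment is not purely numeric.
def user_id_from_url(url)
  last_segment = URI.parse(url).path.split('/').last.to_s
  last_segment.match?(/\A\d+\z/) ? last_segment.to_i : nil
end

data = parse_bootstrap_data('{"a": "b"}')
puts data['a']
puts user_id_from_url('http://example.com/users/12345678')
```

Returning nil instead of raising keeps the decision about a missing value with the caller, which may want to log it and move on rather than abort the run.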