Introduction of Web scraping for PHP users

kazusuke sasezaki
December 30, 2013

slides for Japanese PHP Conference 2013
http://phpcon.php.gr.jp/w/2013/#program

Transcript

  1. Sorry, today I won't talk about the HTTP request side. (There is no time to cover HttpClient, spiders, and crawlers in 15 minutes.)
  2. There are some facts you need to act on:
    • CONTENT ENCODING • CHARSET ENCODING • NORMALIZE HTML • EXTRACTING FROM HTML • SOLVE CONTEXT
  3. CONTENT ENCODING • gzip • deflate • compress • identity
    I recommend using a good response handler before struggling with this yourself.
  4. CONTENT ENCODING • gzip • deflate
    I recommend using a good response handler before struggling with this yourself: • pear/HTTP_Request2 • zendframework/zend-http • guzzle/guzzle
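To show what such response handlers take care of, here is a minimal sketch of undoing a Content-Encoding by hand with PHP's zlib extension. The function name `decode_body` is hypothetical, and this is not how any of the libraries above are implemented. The `compress` encoding is not covered by zlib, so the sketch rejects it.

```php
<?php
// Hypothetical helper: decode a response body according to the value of
// its Content-Encoding header. Requires the zlib extension.
function decode_body(string $body, string $contentEncoding): string
{
    switch (strtolower(trim($contentEncoding))) {
        case '':
        case 'identity':
            return $body; // nothing to undo
        case 'gzip':
        case 'x-gzip':
            $decoded = @gzdecode($body); // gzip container (RFC 1952)
            break;
        case 'deflate':
            // HTTP "deflate" should be zlib-wrapped data (RFC 1950), but some
            // servers send raw deflate (RFC 1951), so try both in that order.
            $decoded = @gzuncompress($body);
            if ($decoded === false) {
                $decoded = @gzinflate($body);
            }
            break;
        default:
            throw new RuntimeException("Unsupported Content-Encoding: $contentEncoding");
    }
    if ($decoded === false) {
        throw new RuntimeException('Body could not be decoded');
    }
    return $decoded;
}
```

In practice a client library does this for you, along with the matching Accept-Encoding request header, which is why the slide recommends starting there.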
  5. CHARSET ENCODING
    mb_convert_encoding($html, "UTF-8", "auto")
    This is not the best way. You already have hints in the response headers and the HTML's meta elements.
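A minimal sketch of using those hints instead of "auto": look for a charset in the Content-Type header first, then in the document's meta elements, and only guess as a last resort. The function names `detect_charset` and `to_utf8` and the fallback candidate list are assumptions made for illustration.

```php
<?php
// Hypothetical helper: return the declared charset, preferring the
// Content-Type response header over the document's <meta> declaration.
// Returns null when neither carries a charset.
function detect_charset(?string $contentType, string $html): ?string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i', $contentType, $m)) {
        return $m[1];
    }
    // Matches both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i', $html, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: convert a fetched page to UTF-8 (requires mbstring).
function to_utf8(string $html, ?string $contentType = null): string
{
    $charset = detect_charset($contentType, $html);
    if ($charset === null) {
        // Last resort: guess from the bytes. The candidate list is an assumption.
        $guess = mb_detect_encoding($html, ['UTF-8', 'SJIS-win', 'EUC-JP', 'ISO-8859-1'], true);
        $charset = $guess !== false ? $guess : 'UTF-8';
    }
    return mb_convert_encoding($html, 'UTF-8', $charset);
}
```

A declared charset can still be wrong or unknown to mbstring, so real code would validate the name before passing it to `mb_convert_encoding`.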
  6. NORMALIZE HTML
    Before parsing it as HTML, you can fix it up.
    • php-ext/tidy, HTMLParser • other beautifiers • manually :-(
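As a rough illustration of the tidy route, the sketch below repairs broken markup with `tidy_repair_string` when the tidy extension is loaded and passes the input through unchanged otherwise. The function name `normalize_html` and the chosen options are assumptions.

```php
<?php
// Hypothetical helper: repair malformed HTML with the tidy extension.
// Falls back to returning the input unchanged when tidy is not installed.
function normalize_html(string $html): string
{
    if (!function_exists('tidy_repair_string')) {
        return $html; // tidy extension not available
    }
    $config = [
        'output-xhtml' => true, // emit well-formed XHTML
        'wrap'         => 0,    // do not re-wrap long lines
    ];
    $fixed = tidy_repair_string($html, $config, 'utf8');
    return $fixed === false ? $html : $fixed;
}
```

With tidy available, input such as `<p>unclosed<b>bold` comes back as a complete document with the open elements closed.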
  7. EXTRACTING FROM HTML
    Yes, there are several ways in PHP.
    • PCRE / String Functions • dom • SimpleXML • php-ext/html_parse
  8. EXTRACTING FROM HTML
    Mostly, it is tedious for an entire HTML document.
    • PCRE / String Functions • dom • SimpleXML • php-ext/html_parse
  9. EXTRACTING FROM HTML
    DOM is an API for HTML & XML.
    • PCRE / String Functions • dom • SimpleXML • php-ext/html_parse
  10. EXTRACTING FROM HTML
    XPath is your friend.
    • PCRE / String Functions • dom • SimpleXML • php-ext/html_parse
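A minimal sketch of DOM plus XPath: load the markup leniently and pull out the href of every link inside a given container. The function name `extract_links` and the class name `entry` are hypothetical.

```php
<?php
// Hypothetical helper: return the href of every <a> found inside
// <div class="entry"> elements.
function extract_links(string $html): array
{
    // Real-world HTML triggers libxml warnings; collect them silently.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query('//div[@class="entry"]//a/@href') as $attr) {
        $links[] = $attr->nodeValue;
    }
    return $links;
}
```

One XPath expression replaces what would otherwise be nested loops or a fragile regular expression. Note that `loadHTML` assumes ISO-8859-1 when the document declares no charset, so convert or declare the encoding first.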
  11. EXTRACTING FROM HTML
    Remember PHP's feature: DOMXPath::registerPhpFunctions
    • PCRE / String Functions • dom • SimpleXML • php-ext/html_parse
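A minimal sketch of `DOMXPath::registerPhpFunctions`: XPath 1.0 has no regular expressions, but a registered PHP function such as `preg_match` can be called from inside the expression. The function name `find_pdf_links` and the pattern are hypothetical.

```php
<?php
// Hypothetical helper: return hrefs that end in ".pdf" (case-insensitive),
// using preg_match from inside an XPath predicate.
function find_pdf_links(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // The php: prefix must be bound to this namespace before use.
    $xpath->registerNamespace('php', 'http://php.net/xpath');
    // Expose only preg_match rather than every PHP function.
    $xpath->registerPhpFunctions('preg_match');

    // functionString passes its arguments to PHP as strings.
    $query = '//a[php:functionString("preg_match", "/\.pdf$/i", @href) = 1]/@href';
    $links = [];
    foreach ($xpath->query($query) as $attr) {
        $links[] = $attr->nodeValue;
    }
    return $links;
}
```

Passing an explicit function name to `registerPhpFunctions` keeps the set of callable functions small, which matters if any part of the expression comes from outside your code.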
  12. SOLVE CONTEXT
    You will need to resolve context from the response you got.
    • Filter the extracted results for the domain.
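One possible reading of this slide, taking "domain" to mean the site's host name (an assumption), is to keep only the extracted URLs that belong to the host you are scraping. The function name `filter_by_host` is hypothetical.

```php
<?php
// Hypothetical helper: keep only absolute URLs whose host equals $host
// (compared case-insensitively). Relative URLs have no host and are dropped.
function filter_by_host(array $urls, string $host): array
{
    $host = strtolower($host);
    return array_values(array_filter($urls, function (string $url) use ($host): bool {
        $urlHost = parse_url($url, PHP_URL_HOST);
        return is_string($urlHost) && strtolower($urlHost) === $host;
    }));
}
```

Relative links would need to be resolved against the page URL before this filter is applied.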
  13. SOLVE CONTEXT
    Don't reinvent the wheel.
    • Resolve relative URIs (RFC 3986): pear/Net_URL2 and zendframework/zend-uri support it
  14. SOLVE CONTEXT
    Don't reinvent the wheel.
    • “Databases” that help you: wedata, OSS repositories (not only PHP)
  15. LIBRARIES
    • behat/mink • goutte, symfony/browserKit • zendframework/zend-dom • diggin-scraper • simple_html_dom • phpQuery • fluentDOM • php-jsonpointer • beberlei/phpricot
  16. Move forward, PHP.
    • HTML5 • HTTP 2.0 / SPDY • Browser binding (ardemiranda/WebKitGtk) • concurrent and asynchronous programming • collective intelligence • NLP (natural language processing)
  17. Deeper and Deeper
    • elazar/web-scraping-with-php https://github.com/elazar/web-scraping-with-php
    • Accessing Web Resources with PHP http://joind.in/3386
    • Spidering Hacks http://www.oreilly.co.jp/books/4873111870/
    • fuba: exthtml https://fuba.jottit.com/exthtml
    • kitamomonga http://d.hatena.ne.jp/kitamomonga/
    • The Architecture of Open Source Applications, Selenium: https://github.com/m-takagi/aosa-ja