Introduction of Web scraping for PHP users
kazusuke sasezaki
December 30, 2013
slides for Japanese PHP Conference 2013
http://phpcon.php.gr.jp/w/2013/#program
Transcript
Introduction of Web scraping for PHP users
15 years ago
The PHP 3.0.4 release includes get_meta_tags().
It works...
<?php
var_dump(get_meta_tags("http://example.com/"));
array(1) {
  'viewport' => string(35) "width=device-width, initial-scale=1"
}
It works, sometimes!
<?php
get_meta_tags("http://www.discogs.com/");
PHP Warning: get_meta_tags(http://www.discogs.com/): failed to open stream: HTTP request failed! 500 Client Refused
You are a file_get_contents() fanboy
<?php
get_meta_tags(
    "data://text/html," .
    file_get_contents(
        "http://www.discogs.com/",
        false,
        stream_context_create(
            ["http" => ["header" => "User-Agent: Mozilla/4.0"]]
        )
    )
);
It works, too
$ php -d user_agent="Mozilla/4.0" \
    -r 'get_meta_tags("http://www.discogs.com/");'
You may FEEL that something is wrong here.
"Separate the concerns! GET HTML & parse HTML."
Doubt! Doubt! Doubt!
Handling Request & Handling Response
HTTP Request
Sorry, today I won't talk about the HTTP Request side.
(There is no time to cover HttpClient, spiders and crawlers in 15 minutes.)
HTTP RESPONSE
Headers - Body - not only HTML ;-)
SCRAP SCRAPING FROM RESPONSE!
There are some facts you should act on:
• CONTENT ENCODING
• CHARSET ENCODING
• NORMALIZE HTML
• EXTRACTING FROM HTML
• SOLVE CONTEXT
CONTENT ENCODING
Today, we have already accepted it.
• gzip
• deflate
• compress
• identity
I recommend using a good Response handler before struggling with it yourself:
• pear/HTTP_Request2
• zendframework/zend-http
• guzzle/guzzle
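For readers who want to see what such a Response handler does with Content-Encoding, here is a minimal sketch. The function name decodeBody() is hypothetical and not taken from any of the libraries above. It assumes the zlib extension is loaded and that the header value has already been extracted from the response. PHP has no built-in decoder for "compress", so this sketch simply rejects it.

```php
<?php
// Hypothetical helper: decode a response body according to the value of
// its Content-Encoding header. Assumes ext-zlib is available.
function decodeBody(string $body, ?string $contentEncoding): string
{
    $encoding = strtolower(trim((string) $contentEncoding));

    switch ($encoding) {
        case '':
        case 'identity':
            // Nothing to undo.
            return $body;
        case 'gzip':
        case 'x-gzip':
            $decoded = @gzdecode($body);
            break;
        case 'deflate':
            // zlib_decode() accepts zlib-wrapped as well as raw deflate data,
            // which matters because servers are inconsistent here.
            $decoded = @zlib_decode($body);
            break;
        default:
            throw new RuntimeException("Unsupported Content-Encoding: $encoding");
    }

    if ($decoded === false) {
        throw new RuntimeException("Body is not valid $encoding data");
    }

    return $decoded;
}
```

In practice the listed HTTP clients take care of this step, so a helper like this is only worth writing when you work with raw streams.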
CHARSET ENCODING
mb_convert_encoding($html, "UTF-8", "auto")
This is not the best way.
You already have hints in the Response Headers & the HTML's meta nodes.
<?php header("Content-Type: text/html; charset=Shift-JIS"); ?> ①②③④⑤
But don't forget: most Japanese PHP users LIE.
• diggin/diggin-http-charset
• diggin/guzzle-plugin-AutoCharsetEncodingPlugi
I hope my components will help you.
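The idea of taking the hint from the headers and meta nodes, while not trusting it blindly, could be sketched roughly as follows. This is not how the diggin components are implemented. The function names detectCharset(), isUsableCharset() and toUtf8() are hypothetical, the regular expressions are deliberately loose, and the fallback candidate list is an assumption suited to Japanese pages. It requires the mbstring extension.

```php
<?php
// Hypothetical helper: read the declared charset, header first, then <meta>.
function detectCharset(string $html, ?string $contentType): ?string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $contentType, $m)) {
        return $m[1];
    }
    // Covers both <meta charset="..."> and the http-equiv form.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $html, $m)) {
        return $m[1];
    }
    return null;
}

// Declarations can "lie": check that the name is known to mbstring and
// that the bytes are actually valid in that encoding.
function isUsableCharset(string $html, string $charset): bool
{
    try {
        return @mb_check_encoding($html, $charset);
    } catch (\ValueError $e) {
        // PHP 8 throws for unknown encoding names.
        return false;
    }
}

function toUtf8(string $html, ?string $contentType = null): string
{
    $charset = detectCharset($html, $contentType);
    if ($charset === null || !isUsableCharset($html, $charset)) {
        // Assumed candidate list; adjust it to the sites you scrape.
        $candidates = ['UTF-8', 'SJIS', 'EUC-JP', 'ISO-2022-JP'];
        $charset = mb_detect_encoding($html, $candidates, true) ?: 'UTF-8';
    }
    return mb_convert_encoding($html, 'UTF-8', $charset);
}
```

Validating the declared charset before falling back to detection keeps the common case cheap and still survives a wrong header.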
NORMALIZE HTML
Before parsing it as HTML, you can fix it.
• php-ext/tidy, HTMLParser
• other beautifiers
• manually :-(
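A small sketch of the "fix it before parsing" step. It uses the tidy extension when it is loaded and otherwise falls back to letting DOMDocument (libxml) repair the markup, which is a swapped-in technique and not one of the tools named on the slide. The function name normalizeHtml() and the tidy options chosen here are assumptions, and the input is assumed to be non-empty UTF-8 or ASCII.

```php
<?php
// Hypothetical helper: return a repaired version of sloppy HTML.
function normalizeHtml(string $html): string
{
    if (function_exists('tidy_repair_string')) {
        // Assumed options: XHTML output, no line wrapping.
        $fixed = tidy_repair_string($html, ['output-xhtml' => true, 'wrap' => 0], 'utf8');
        if (is_string($fixed)) {
            return $fixed;
        }
    }

    // Fallback: libxml's HTML parser closes unclosed tags on its own.
    $previous = libxml_use_internal_errors(true); // silence markup warnings
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc->saveHTML();
}

$fixed = normalizeHtml('<p>one<p>two<b>bold');
```

Both branches return a full document with the unclosed p and b elements closed; the exact surrounding markup differs between tidy and libxml.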
EXTRACTING FROM HTML
Yes, there are several ways in PHP.
• PCRE / String Functions
• dom
• SimpleXML
• php-ext/html_parse
Mostly, it is tedious for an entire HTML document.
DOM is an API for HTML & XML.
XPath is your friend.
Remember PHP's feature: DOMXPath::registerPhpFunctions
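As an illustration of DOMXPath::registerPhpFunctions, the sketch below calls PHP's strtolower() from inside an XPath expression to match link text case-insensitively. The sample markup and variable names are made up for this example.

```php
<?php
// Made-up sample markup for the example.
$html = '<ul>'
      . '<li><a href="/a.html">  First </a></li>'
      . '<li><a href="/b.html">SECOND</a></li>'
      . '</ul>';

$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
// The php: prefix must be bound to this namespace before PHP functions
// can be called from XPath.
$xpath->registerNamespace('php', 'http://php.net/xpath');
// Expose only the functions the expression needs.
$xpath->registerPhpFunctions(['strtolower']);

// Plain XPath: every link target in document order.
$allHrefs = [];
foreach ($xpath->query('//a/@href') as $attr) {
    $allHrefs[] = $attr->nodeValue;
}

// XPath + PHP: links whose trimmed text equals "second", ignoring case.
$query = '//a[php:functionString("strtolower", normalize-space(.)) = "second"]';
$hrefs = [];
foreach ($xpath->query($query) as $a) {
    $hrefs[] = $a->getAttribute('href');
}
```

Passing an explicit list to registerPhpFunctions() limits which PHP functions an expression can reach, which is a sensible default when queries are built from configuration.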
SOLVE CONTEXT
You will need to resolve context from the response you got.
• Filter the extracted results for your domain.
Don't reinvent the wheel
• Resolve relative URIs / RFC 3986 - pear/Net_URL2 and zendframework/zend-uri support it
• "Databases" that help you - wedata - OSS repositories (not only PHP)
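To show what "resolve relative URI" involves, and why a library is the better choice, here is a deliberately simplified sketch of RFC 3986 reference resolution built on parse_url(). The function names resolveUri() and removeDotSegments() are hypothetical. It assumes an http(s)-style base URI with a host, ignores userinfo, and does not normalize case or percent-encoding, all of which a real implementation has to handle.

```php
<?php
// Simplified "." / ".." removal (RFC 3986, section 5.2.4).
// Assumes $path is empty or starts with "/".
function removeDotSegments(string $path): string
{
    $segments = explode('/', $path);
    $last = count($segments) - 1;
    $out = [];
    foreach ($segments as $i => $segment) {
        if ($segment === '.' || $segment === '..') {
            // ".." drops the previous segment, but never the leading
            // empty one that stands for the root "/".
            if ($segment === '..' && count($out) > 1) {
                array_pop($out);
            }
            // A trailing "." or ".." names a directory: keep the slash.
            if ($i === $last) {
                $out[] = '';
            }
            continue;
        }
        $out[] = $segment;
    }
    return implode('/', $out);
}

// Simplified reference resolution (RFC 3986, section 5.2.2).
function resolveUri(string $base, string $reference): string
{
    $b = parse_url($base);
    $r = parse_url($reference);
    if ($b === false || $r === false || !isset($b['scheme'], $b['host'])) {
        throw new InvalidArgumentException('Unsupported base or reference URI');
    }
    if (isset($r['scheme'])) {
        return $reference; // already absolute
    }

    $query = $r['query'] ?? null;
    if (isset($r['host'])) {
        // Network-path reference such as "//cdn.example/x".
        $authority = $r['host'] . (isset($r['port']) ? ':' . $r['port'] : '');
        $path = removeDotSegments($r['path'] ?? '');
    } else {
        $authority = $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
        $basePath = $b['path'] ?? '/';
        $refPath = $r['path'] ?? '';
        if ($refPath === '') {
            // Only a query and/or fragment was given.
            $path = $basePath;
            $query = $query ?? ($b['query'] ?? null);
        } elseif ($refPath[0] === '/') {
            $path = removeDotSegments($refPath);
        } else {
            // Merge with the base path's directory part.
            $dir = substr($basePath, 0, strrpos($basePath, '/') + 1);
            $path = removeDotSegments($dir . $refPath);
        }
    }

    return $b['scheme'] . '://' . $authority . $path
        . ($query !== null ? '?' . $query : '')
        . (isset($r['fragment']) ? '#' . $r['fragment'] : '');
}
```

The number of special cases in even this reduced version is the argument for the slide's advice: use pear/Net_URL2 or zendframework/zend-uri in real code.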
LIBRARIES
• behat/mink
• goutte, symfony/browserKit
• zendframework/zend-dom
• diggin-scraper
• simple_html_dom
• phpQuery
• fluentDOM
• php-jsonpointer
• beberlei/phpricot
Today, the web is controlled by JavaScript
JAVASCRIPT
We need a "REAL" BROWSER for AUTOMATION
• Selenium
• PhantomJS / CasperJS
• SlimerJS
You have a chance to survive with PHP.
Move forward, PHP.
• HTML5
• HTTP 2.0 / SPDY
• Browser binding: ardemiranda/WebKitGtk
• concurrent programming, asynchronous
• collective intelligence
• NLP (natural language processing)
Deeper and Deeper
• elazar/web-scraping-with-php https://github.com/elazar/web-scraping-with-php
• Accessing Web Resources with PHP http://joind.in/3386
• Spidering Hacks http://www.oreilly.co.jp/books/4873111870/
• fuba: exthtml https://fuba.jottit.com/exthtml
• kitamomonga http://d.hatena.ne.jp/kitamomonga/
• The Architecture of Open Source Applications (Selenium) https://github.com/m-takagi/aosa-ja
Thanks