Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Introduction of Web scraping for PHP users
Search
Sponsored
·
Your Podcast. Everywhere. Effortlessly.
Share. Educate. Inspire. Entertain. You do you. We'll handle the rest.
→
kazusuke sasezaki
December 30, 2013
130
0
Share
Embed
Copy iframe code
Copy JS code
Copy link
Start on current slide
Introduction of Web scraping for PHP users
slides for Japanese PHP Conference 2013
http://phpcon.php.gr.jp/w/2013/#program
kazusuke sasezaki
December 30, 2013
More Decks by kazusuke sasezaki
See All by kazusuke sasezaki
PHPでの .gitattributes
sasezaki
0
110
できる!!! Validation !!! - builderscon tokyo 2017
sasezaki
1
200
はじめてのミューテーション解析 / Mutation Testing
sasezaki
2
1.6k
こんなPHP開発者はイヤだ
sasezaki
2
3.9k
Featured
See All Featured
Technical Leadership for Architectural Decision Making
baasie
3
410
Navigating Algorithm Shifts & AI Overviews - #SMXNext
aleyda
1
1.3k
Exploring the relationship between traditional SERPs and Gen AI search
raygrieselhuber
PRO
2
4k
Lessons Learnt from Crawling 1000+ Websites
charlesmeaden
PRO
1
1.3k
AI Search: Where Are We & What Can We Do About It?
aleyda
0
7.6k
So, you think you're a good person
axbom
PRO
2
2.1k
Learning to Love Humans: Emotional Interface Design
aarron
275
41k
Groundhog Day: Seeking Process in Gaming for Health
codingconduct
0
210
Believing is Seeing
oripsolob
1
150
The Art of Delivering Value - GDevCon NA Keynote
reverentgeek
16
2k
Art, The Web, and Tiny UX
lynnandtonic
304
22k
The Language of Interfaces
destraynor
162
27k
Transcript
Introduction of Web scraping for PHP users
15 years ago
15 years ago PHP 3.0.4 release, includes get_meta_tags().
It works.. <?php get_meta_tags("http://example.com/"); array(1) { 'viewport' => string(35) "width=device-width,
initial-scale=1" }
It works, sometimes! <?php get_meta_tags("http://www.discogs.com/"); PHP Warning: get_meta_tags(http://www.disc ogs.com/): failed
to open stream: HTTP request failed! 500 Client Refused
You are file_get_contents() fanboy
<?php get_meta_tags("data://text/html,". file_get_contents( "http://www.discogs.com/", false, stream_context_create( ["http" => ["header" =>
"User-Agent: Mozilla/4.0"] ] ) ) );
It works, too $php -d user_agent="Mozilla/4.0" \ -r 'get_meta_tags("http://www.discogs.com/");'
You would FEEL there are some problem.
"do separation of concerns! GET HTML & parse HTML."
"do separation of concerns! GET HTML & parse HTML." Doubt!
Doubt! Doubt!
Handling Request & Handling Response
HTTP Request
Sorry, today I don't talk about HTTP Request side. (no
time talk about HttpClient, Spider, crawler in 15 minutes)
HTTP ReSPONSE
HTTP ReSPONSE HeaDERS - BODY - Not only HTML ;-)
SCRAP SCRAPING FROM RESPONSE!
There are some fact you should take a act
There are some fact you should take a act •
ConTENT ENCODING • ChaRSET ENCODING • NORMALIZE HTML • EXTRACTING FROM HTML • SOLVE CONTEXT
CONTENT ENCODING
CONTENT ENCODING TODAY, WE ALREADY ACCEPTED IT.
CONTENT ENCODING • gzip • deflate • compress • identity
I recommend using good Response handlers before struggling.
CONTENT ENCODING • gzip • deflate I recommend using good
Response handlers before struggling. pear/HTTP_Request2 zendframework/zend-http guzzle/guzzle
CHARSET ENCODING
CHARSET ENCODING mb_convert_encoding(“UTF-8”, “auto”, $html)
CHARSET ENCODING mb_convert_encoding(“UTF-8”, “auto”, $html) This is not best way.
You had already got hint in Response Headers & html's meta nodes.
CHARSET ENCODING <?php header("Content-Type: text/html; charset=Shift-JIS"); ?> ①②③④⑤ But, Don't
forget, Most of Japanese PHP users do LIE.
CHARSET ENCODING diggin/diggin-http-charset diggin/guzzle-plugin-AutoCharsetEncodingPlugi I hope my component will help
you.
NORMALIZE HTML
NORMALIZE HTML
NORMALIZE HTML before parse as HTML, you can fix it.
NORMALIZE HTML before parse as HTML, you can fix it.
• php-ext/tidy, HTMLParser • other beautifiers • manually :-(
EXTRACTING FROM HTML
EXTRACTING FROM HTML Yes, there are several way in PHP.
• PCRE / String Functions • dom • SimpleXML • php-ext/html_parse
EXTRACTING FROM HTML Mostly, boredom for entire HTML. • PCRE
/ String Functions • dom • SimpleXML • php-ext/html_parse
EXTRACTING FROM HTML DOM is a API FOR HTML &
XML • PCRE / String Functions • dom • SimpleXML • php-ext/html_parse
EXTRACTING FROM HTML Xpath is your friend • PCRE /
String Functions • dom • SimpleXML • php-ext/html_parse
EXTRACTING FROM HTML Remember PHP's Feature, DOMXPath:: registerPhpFunctions • PCRE
/ String Functions • dom • SimpleXML • php-ext/html_parse
Solve Context
Solve Context You will need solve context from got response
• Filtering extracted result for Domain.
Solve Context Don't reinvent the wheel • Resolve relative URI
/ RFC-3986 - pear/Net_URL2, zendframework/zend-uri supports it
Solve Context Don't reinvent the wheel • “Databases” that helps
you - wedata - OSS's repositories (not only PHP)
LIBRALIES • behat/mink • goutte, symfony/browserKit • zendframework/zend-dom • diggin-scraper
• simple_html_dom • phpQuery • fluentDOM • php-jsonpointer • beberlei/phpricot
Today, web is under control by JavaScript
JAVASCRIPT We need "REAL" BROWSER for AUTOMATION • Selenium •
PhantomJS / CasperJS • SlimerJS
You have a chance to survive with php.
Move forward, PHP. • HTML5 • HTTP 2.0 / SPDY
• Browser binding ardemiranda/WebKitGtk • concurrent programing, asynchronous • collective intelligence • NLP natural language processing
Deeper and Deeper • elazar/web-scraping-with-php https://github.com/elazar/web-scraping-with-php • Accessing Web Resources
with PHP http://joind.in/3386 • Spidering Hacks http://www.oreilly.co.jp/books/4873111870/ • fuba: exthtml https://fuba.jottit.com/exthtml • kitamomonga http://d.hatena.ne.jp/kitamomonga/ • The Architecture of Open Source Applications selenium - https://github.com/m-takagi/aosa-ja
Thanks