Web crawling is a hard problem, and the web is messy. There is no shortage of semantic web standards -- basically, everyone has one. How do you make sense of the noise across a web of billions of pages?
This talk presents two key technologies that help: Scrapy, a scalable open source web crawling framework, and Mr. Schemato, a new open source semantic web validator and distiller.
Talk given by Andrew Montalenti, CTO of Parse.ly. See http://parse.ly
Slides were built with reST and S5, and are therefore available in raw text form (quite pleasant to browse): https://raw.github.com/Parsely/python-crawling-slides/master/index.rst
You can also view the slides directly in your browser, using the arrow keys to navigate: http://bit.ly/crawling-slides