Improving Data Gathering And Research

Luca Matteis
November 26, 2011

How to improve data gathering using web scraping methodologies.

Transcript

  1. "In the broadest sense of the word, the definition of research includes any gathering of data, information and facts for the advancement of knowledge."
  2. "Research is a process of steps used to collect and analyze information to increase our understanding of a topic or issue"
  3. Where do we get data from? Einstein got his data from his own experiments and from other people's experiments. Information exchange took weeks, if not months.
  4. http://science.com/paper.... http://newton.com/research... http://national.com/goo... http://biology.com/science... http://newscientist.com/neutrinodiscovery... http://astronomynow.com/themoon http://space.com/november2001 http://science.com/paper.... http://newton.com/research... http://science.com/paper.... http://space.com/astro... http://space.com/astro... http://space.com/astro... http://science.com/paper....
  5. Information that can be extremely valuable lives somewhere online, and we don’t know it because we can’t find it.
  6. http://science.com/paper.... http://newton.com/research... http://national.com/goo... http://biology.com/science... http://newscientist.com/neutrinodiscovery... http://astronomynow.com/themoon http://space.com/november2001 http://science.com/paper.... http://newton.com/research... http://science.com/paper.... http://space.com/astro... http://space.com/astro... http://space.com/astro... http://science.com/paper....
  7. When our information is centralized by context, we can more easily find what we’re looking for.
  8. Each center sends us their data in the form of Excel or Access files, through FTP or email.
  9. What are the advantages of automating the data exchange process? • no human interference • fewer communication hassles • fewer human errors • more accurate data • more data
  10. How do we automate? Centers no longer have to send us anything. We get it directly from their website.
  11. There’s no secret. Google, hotel sites, flight search engines and many others do this. It is called web scraping.
  12. We automatically navigate to the centers’ websites and fetch the information that we need. This is done by little scripts called spiders or web crawlers.
  13. “A Web crawler (or spider) is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion.”
  14. The main requirement: each center must have a website that displays its information. Without a website we wouldn’t be able to automate this exchange.
  15. RECAP: Automation of the data exchange process is the only sustainable solution. With new technologies, web scraping has become a very reliable system.
  16. RECAP: Automation of the data exchange process is the only sustainable solution. With new technologies, web scraping has become a very reliable system. The process is modular and will allow us to plug in systems such as GRIN-Global.
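
The spider described in slides 12 and 13 can be sketched roughly as follows. The deck does not show the actual implementation, so this is only a minimal illustration using the Python standard library. The names `PageParser`, `fetch` and `crawl` are hypothetical, and it assumes a center publishes its data as plain HTML tables on pages linked from a single start URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects table rows and outgoing links from one HTML page."""

    def __init__(self):
        super().__init__()
        self.rows = []     # each row is a list of cell strings
        self.links = []    # raw href values found on the page
        self._row = None   # cells of the <tr> currently open, if any
        self._cell = None  # text fragments of the open <td>/<th>, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed


def fetch(url):
    """Download one page over HTTP and return its text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=10):
    """Visit pages breadth-first, following only links under start_url."""
    seen, queue, rows = set(), [start_url], []
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        parser = PageParser()
        parser.feed(fetch_page(url))
        rows.extend(parser.rows)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link.startswith(start_url) and link not in seen:
                queue.append(link)
    return rows
```

Calling `crawl("http://center.example/")` (a placeholder address) would download pages with `fetch`. Passing a different `fetch_page` function, for example one that reads from a dictionary of saved pages, lets the same spider be tried offline. A real spider would probably also need to respect each site's robots.txt and pause between requests.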