Lean GHTorrent: GitHub data on demand

Presentation given at the MSR 2014 data track

Georgios Gousios

June 03, 2014

Transcript

  1. MSR: 19 GB. VISSOFT: 0.5 GB. GHTorrent: 3.5 TB.

    Sun = 109x Earth. GHTorrent = 184x MSR.
  2. I need a fortune for H/W. I need an army of researchers. Replication?
  3. VS

  4. @gousiosg http://ghtorrent.org/lean.html

    Lean GHTorrent: GitHub data on demand

    Georgios Gousios, Bogdan Vasilescu, Alexander Serebrenik and Andy Zaidman
    {g.gousios, a.e.zaidman}@tudelft.nl {b.n.vasilescu, a.serebrenik}@tue.nl

    [Architecture diagram, steps numbered 1 to 9: Web server, Web form, GHTorrent server, Job db, Retrieval workers …, Requests queue, Responses queue, Dispatcher, GHTorrent db, GitHub API, Request listener, Response listener, Requests db]

    Software Engineering Research Group http://swerl.tudelft.nl/ Delft University of Technology

    Want to do research with GHTorrent data? It is now as easy as:

    1. Filling in the form at ghtorrent.org/lean.html
    2. Getting the data!

    No need to care about this (but ask if you do!)

    In the package, you will find, for all repos specified in step 1:

    • A MySQL dump (to query like a boss)
    • MongoDB collection dumps (all GitHub API data)