The GHTorrent dataset and toolsuite
Georgios Gousios
May 17, 2013
MSR2013 data paper presentation
Transcript
The GHTorrent Dataset and Tool Suite
Georgios Gousios
Software Engineering Research Group, TU Delft
All data from GitHub
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
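Mirroring an event stream by polling usually means consecutive polls overlap, so the same event can show up twice. The sketch below shows one way to keep only unseen events. It is an illustration under assumptions, not GHTorrent's implementation: the `mirror` function, the in-memory `seen_ids` set, and the sample events are all hypothetical.

```python
from collections import deque

seen_ids = set()   # ids of events already mirrored (hypothetical in-memory store)
queue = deque()    # events waiting to be processed further


def mirror(batch):
    """Enqueue events not seen before; return how many were new."""
    new = 0
    for event in batch:
        if event["id"] not in seen_ids:
            seen_ids.add(event["id"])
            queue.append(event)
            new += 1
    return new


# Two overlapping polls: event "2" appears in both and is queued only once.
first = mirror([{"id": "1", "type": "PushEvent"},
                {"id": "2", "type": "IssuesEvent"}])
second = mirror([{"id": "2", "type": "IssuesEvent"},
                 {"id": "3", "type": "PushEvent"}])
```

A real mirror would persist the seen ids and the queue so that a restart does not lose or replay events.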
Recursive dependency retrieval
<<event>> PushEvent
<<api>> /users/:user -> ensure_user
<<api>> /repos/:user/:repo/ -> ensure_repo
<<api>> /repos/:user/:repo/commits -> ensure_commits, ensure_user
<<api>> /:user/:repo/sha -> ensure_commit, ensure_user
<<api>> /users/:user/followers -> ensure_followers
<<api>> /repos/:user/:repo/commits/:sha/comments -> ensure_commit_comments
<<api>> /users/:user/orgs -> ensure_orgs
<<api>> /orgs/:org/teams -> ensure_teams
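The idea of recursive dependency retrieval can be sketched as a set of `ensure_*` functions, each of which first makes sure the entities it depends on exist and only calls the API for what is missing. This is a minimal sketch, not GHTorrent's code: the `FAKE_API` table stands in for GitHub, and the paths, the `store` layout, and `handle_push_event` are assumptions made for illustration.

```python
# Stand-in for the GitHub API: maps a URL path to the JSON it would return.
FAKE_API = {
    "/users/alice": {"login": "alice"},
    "/repos/alice/demo": {"name": "demo", "owner": "alice"},
    "/repos/alice/demo/commits/abc123": {"sha": "abc123", "author": "alice"},
}

store = {"users": {}, "repos": {}, "commits": {}}  # entities retrieved so far
api_calls = []  # every request made, to show each entity is fetched once


def fetch(path):
    api_calls.append(path)
    return FAKE_API[path]


def ensure_user(login):
    if login not in store["users"]:
        store["users"][login] = fetch(f"/users/{login}")
    return store["users"][login]


def ensure_repo(owner, name):
    key = (owner, name)
    if key not in store["repos"]:
        ensure_user(owner)  # a repo depends on its owner
        store["repos"][key] = fetch(f"/repos/{owner}/{name}")
    return store["repos"][key]


def ensure_commit(owner, name, sha):
    if sha not in store["commits"]:
        ensure_repo(owner, name)  # a commit depends on its repo
        commit = fetch(f"/repos/{owner}/{name}/commits/{sha}")
        ensure_user(commit["author"])  # ... and on its author
        store["commits"][sha] = commit
    return store["commits"][sha]


def handle_push_event(event):
    for sha in event["shas"]:
        ensure_commit(event["owner"], event["repo"], sha)


# Handling the same push twice triggers no additional API calls.
handle_push_event({"owner": "alice", "repo": "demo", "shas": ["abc123"]})
handle_push_event({"owner": "alice", "repo": "demo", "shas": ["abc123"]})
```

Because every function checks the store before fetching, the order in which events arrive does not matter: whichever event first mentions a user or repository pulls it in.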
Build relational database to query
CoSQL database as cache
repositories, users, organizations, issues
/users/:user
/user/repos
/repos/:user/:repo/issues
/orgs/:org
{
  "type": "User",
  "public_gists": 0,
  "login": "gousiosg",
  "followers": 8,
  "name": "Georgios Gousios",
  "public_repos": 4,
  "created_at": ...,
  "id": 386172,
  "following": 4,
}
{ . . .
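One way to picture the split between the raw cache and the relational database: API responses are stored verbatim, keyed by the path they came from, and a separate step extracts selected fields into tables. The sketch below assumes a plain dictionary as the cache and an in-memory SQLite database. The `users` table layout and the `load_user` helper are hypothetical, not GHTorrent's schema.

```python
import json
import sqlite3

# Stand-in for the raw-data cache: API path -> JSON text exactly as received.
raw_cache = {
    "/users/gousiosg": json.dumps({
        "type": "User",
        "login": "gousiosg",
        "name": "Georgios Gousios",
    }),
}

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (login TEXT PRIMARY KEY, name TEXT, type TEXT)")


def load_user(path):
    """Copy selected fields of a cached user document into the users table."""
    doc = json.loads(raw_cache[path])
    db.execute(
        "INSERT OR REPLACE INTO users (login, name, type) VALUES (?, ?, ?)",
        (doc["login"], doc.get("name"), doc["type"]),
    )


load_user("/users/gousiosg")
rows = db.execute("SELECT login, name FROM users WHERE type = 'User'").fetchall()
```

Keeping the untouched JSON means the relational tables can be rebuilt, or extended with new columns, without calling the API again.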
Periodic dumps of DBs online
Query relational DB online
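As an illustration of the kind of question a relational copy makes easy, the sketch below counts commits per author for one project. The three-table schema and the sample rows are simplified assumptions for this example and may not match the actual GHTorrent schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical, simplified schema with a few sample rows.
db.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, login TEXT);
CREATE TABLE projects (id INTEGER PRIMARY KEY, owner_id INTEGER, name TEXT);
CREATE TABLE commits (id INTEGER PRIMARY KEY, author_id INTEGER, project_id INTEGER);
INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
INSERT INTO projects VALUES (1, 1, 'demo');
INSERT INTO commits VALUES (1, 1, 1), (2, 1, 1), (3, 2, 1);
""")

# Commits per author in the 'demo' project, most active author first.
commits_per_author = db.execute("""
    SELECT u.login, COUNT(*) AS n
    FROM commits c
    JOIN users u ON u.id = c.author_id
    JOIN projects p ON p.id = c.project_id
    WHERE p.name = 'demo'
    GROUP BY u.login
    ORDER BY n DESC, u.login
""").fetchall()
```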
Roll your own tools
$ gem install sqlite3 ghtorrent
$ ght-retrieve-repo mojombo jekyll
$ (edit config.yaml)
Research!
Single developer identities
Source tracking
Network analysis
Distributed development
TUD-SERG-2013-10: An Exploratory Study of the Pull-based Software Development Model
ghtorrent.org
Octicons font: courtesy of GitHub