Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
The GHTorrent dataset and toolsuite
Search
Georgios Gousios
May 17, 2013
Technology
4
130k
The GHTorrent dataset and toolsuite
MSR2013 data paper presentation
Georgios Gousios
May 17, 2013
Tweet
Share
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
280
The troubles of modern dependency management and what to do about them
gousiosg
0
520
Mining Repositories with Apache Spark
gousiosg
0
640
My adventures with open everything
gousiosg
0
280
Structure and Evolution of Package Dependency Networks
gousiosg
0
760
Mining Github for fun and profit
gousiosg
9
63k
GitHub Insights: Understanding Open Source
gousiosg
0
360
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
910
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
280
Other Decks in Technology
See All in Technology
2025/6/21 日本学術会議公開シンポジウム発表資料
keisuke198619
0
220
Cloud Native Scalability for Internal Developer Platforms
hhiroshell
2
450
成立するElixirの再束縛(再代入)可という選択
kubell_hr
0
250
AIにどこまで任せる?実務で使える(かもしれない)AIエージェント設計の考え方
har1101
3
1k
「伝える」を加速させるCursor術
naomix
0
620
活きてなかったデータを活かしてみた話 / Shirokane Kougyou vol 19
sansan_randd
1
260
自分を理解するAI時代の準備 〜マイプロフィールMCPの実装〜
edo_m18
0
110
堅牢な認証基盤の実現 TypeScriptで代数的データ型を活用する
kakehashi
PRO
2
220
今からでも間に合う! 生成AI「RAG」再入門 / Re-introduction to RAG in Generative AI
hideakiaoyagi
1
170
CIでのgolangci-lintの実行を約90%削減した話
kazukihayase
0
210
ユーザーのプロフィールデータを活用した推薦精度向上の取り組み
yudai00
0
290
Introduction to Bill One Development Engineer
sansan33
PRO
0
250
Featured
See All Featured
Facilitating Awesome Meetings
lara
54
6.4k
Into the Great Unknown - MozCon
thekraken
39
1.8k
ピンチをチャンスに:未来をつくるプロダクトロードマップ #pmconf2020
aki_iinuma
123
52k
Performance Is Good for Brains [We Love Speed 2024]
tammyeverts
10
900
Balancing Empowerment & Direction
lara
1
290
The Cult of Friendly URLs
andyhume
79
6.4k
Done Done
chrislema
184
16k
[Rails World 2023 - Day 1 Closing Keynote] - The Magic of Rails
eileencodes
35
2.3k
4 Signs Your Business is Dying
shpigford
184
22k
Speed Design
sergeychernyshev
30
990
GraphQLの誤解/rethinking-graphql
sonatard
71
11k
Measuring & Analyzing Core Web Vitals
bluesmoon
7
480
Transcript
The GHTorrent Dataset and Tool Suite Georgios Gousios Software Engineering
Research Group TU Delft
All data from Github
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
<<event>> PushEvent <<api>> /users/:user ensure_user <<api>> /repos/:user/:repo/ ensure_repo <<api>> /repos/:user/:repo/commits
ensure_commits ensure_user <<api>> /:user/:repo/sha ensure_commit ensure_user <<api>> /users/:user/ followers ensure_followers <<api>> /repos/:user/:repo/ commits/:sha/comments ensure_commit_comments <<api>> /users/:user/orgs ensure_orgs <<api>> /orgs/:org/teams ensure_teams Recursive dependency retrieval
Build relational database to query
repositories users organizations issues /users/:user /user/repos /repos/:user/:repo/issues /orgs/:org { 88"type":8"User",
88"public_gists":80, 88"login":8"gousiosg", 88"followers":88, 88"name":8"Georgios8Gousios", 88"public_repos":84, 88"created_at":8..., 88"id":8386172, 88"following":84, } { . . . CoSQL database as cache
Periodic dumps of DBs online
Query relational DB online
$ gem install sqlite3 ghtorrent $ ght-retrieve-repo mojombo jekyll $
(edit config.yaml) Roll your own tools
Research !
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Network analysis
Distributed development Text Text TUD-SERG-2013-10 An Exploratory Study of the
Pull- based Software Development model
None
None
None
None
None
ghtorrent.org Octicons font: courtesy Github