Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
The GHTorrent dataset and toolsuite
Search
Georgios Gousios
May 17, 2013
Technology
4
130k
The GHTorrent dataset and toolsuite
MSR2013 data paper presentation
Georgios Gousios
May 17, 2013
Tweet
Share
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
320
The troubles of modern dependency management and what to do about them
gousiosg
0
640
Mining Repositories with Apache Spark
gousiosg
0
700
My adventures with open everything
gousiosg
0
340
Structure and Evolution of Package Dependency Networks
gousiosg
0
840
Mining Github for fun and profit
gousiosg
9
63k
GitHub Insights: Understanding Open Source
gousiosg
0
420
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
960
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
330
Other Decks in Technology
See All in Technology
Claude Code for NOT Programming
kawaguti
PRO
1
110
M&A 後の統合をどう進めるか ─ ナレッジワーク × Poetics が実践した組織とシステムの融合
kworkdev
PRO
1
530
Context Engineeringが企業で不可欠になる理由
hirosatogamo
PRO
3
690
顧客との商談議事録をみんなで読んで顧客解像度を上げよう
shibayu36
0
350
Kiro IDEのドキュメントを全部読んだので地味だけどちょっと嬉しい機能を紹介する
khmoryz
0
210
Oracle AI Database移行・アップグレード勉強会 - RAT活用編
oracle4engineer
PRO
0
110
顧客の言葉を、そのまま信じない勇気
yamatai1212
1
370
Why Organizations Fail: ノーベル経済学賞「国家はなぜ衰退するのか」から考えるアジャイル組織論
kawaguti
PRO
1
220
コスト削減から「セキュリティと利便性」を担うプラットフォームへ
sansantech
PRO
3
1.6k
Oracle Base Database Service 技術詳細
oracle4engineer
PRO
15
93k
AIエージェントを開発しよう!-AgentCore活用の勘所-
yukiogawa
0
200
30万人の同時アクセスに耐えたい!新サービスの盤石なリリースを支える負荷試験 / SRE Kaigi 2026
genda
4
1.4k
Featured
See All Featured
Speed Design
sergeychernyshev
33
1.5k
Fireside Chat
paigeccino
41
3.8k
CoffeeScript is Beautiful & I Never Want to Write Plain JavaScript Again
sstephenson
162
16k
Building Adaptive Systems
keathley
44
2.9k
Skip the Path - Find Your Career Trail
mkilby
0
60
Tell your own story through comics
letsgokoyo
1
810
Heart Work Chapter 1 - Part 1
lfama
PRO
5
35k
Building a A Zero-Code AI SEO Workflow
portentint
PRO
0
320
Why You Should Never Use an ORM
jnunemaker
PRO
61
9.7k
Visualizing Your Data: Incorporating Mongo into Loggly Infrastructure
mongodb
49
9.9k
Introduction to Domain-Driven Design and Collaborative software design
baasie
1
590
Effective software design: The role of men in debugging patriarchy in IT @ Voxxed Days AMS
baasie
0
230
Transcript
The GHTorrent Dataset and Tool Suite Georgios Gousios Software Engineering
Research Group TU Delft
All data from Github
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
<<event>> PushEvent <<api>> /users/:user ensure_user <<api>> /repos/:user/:repo/ ensure_repo <<api>> /repos/:user/:repo/commits
ensure_commits ensure_user <<api>> /:user/:repo/sha ensure_commit ensure_user <<api>> /users/:user/ followers ensure_followers <<api>> /repos/:user/:repo/ commits/:sha/comments ensure_commit_comments <<api>> /users/:user/orgs ensure_orgs <<api>> /orgs/:org/teams ensure_teams Recursive dependency retrieval
Build relational database to query
repositories users organizations issues /users/:user /user/repos /repos/:user/:repo/issues /orgs/:org { 88"type":8"User",
88"public_gists":80, 88"login":8"gousiosg", 88"followers":88, 88"name":8"Georgios8Gousios", 88"public_repos":84, 88"created_at":8..., 88"id":8386172, 88"following":84, } { . . . CoSQL database as cache
Periodic dumps of DBs online
Query relational DB online
$ gem install sqlite3 ghtorrent $ ght-retrieve-repo mojombo jekyll $
(edit config.yaml) Roll your own tools
Research !
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Network analysis
Distributed development Text Text TUD-SERG-2013-10 An Exploratory Study of the
Pull- based Software Development model
None
None
None
None
None
ghtorrent.org Octicons font: courtesy Github