Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
The GHTorrent dataset and toolsuite
Search
Georgios Gousios
May 17, 2013
Technology
130k
3
Share
Embed
Copy iframe code
Copy JS code
Copy link
Start on current slide
The GHTorrent dataset and toolsuite
MSR2013 data paper presentation
Georgios Gousios
May 17, 2013
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
380
The troubles of modern dependency management and what to do about them
gousiosg
0
700
Mining Repositories with Apache Spark
gousiosg
0
730
My adventures with open everything
gousiosg
0
360
Structure and Evolution of Package Dependency Networks
gousiosg
0
920
Mining Github for fun and profit
gousiosg
9
63k
GitHub Insights: Understanding Open Source
gousiosg
0
450
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
990
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
350
Other Decks in Technology
See All in Technology
データサイエンスを価値につなげるプロジェクト設計 〜 DS一年目が現場で得た気づき 〜
ysd113
1
240
機械学習を「社会実装」するということ 2026年夏版 / Social Implementation of Machine Learning June 2026 Version
moepy_stats
5
2.4k
失敗を経て、Harness Engineering で 大切にしたいことを考える / Learning from Failure: What Matters in Harness Engineering
bitkey
PRO
1
370
小さく始める AI 活用推進 ― 日経電子版 Web チームの事例/nikkei-tech-talk47
nikkei_engineer_recruiting
0
270
LayerX コーポレートエンジニアリング室におけるサプライチェーンセキュリティへの取り組み / Supply Chain Security at LayerX Corporate Engineering
yuyatakeyama
1
260
SONiCのLinuxベースを活かしたZabbix監視
sonic
0
160
AIのReact習熟度を測る
uhyo
2
550
エラーバジェットのアラートのタイミングを考える.pdf
kairim0
0
150
Oracle AI Database@AWS:サービス概要のご紹介
oracle4engineer
PRO
4
2.9k
スキルと MCP ツール、責務をどう分けるか? AI が迷わないインターフェース設計の戦略
cdataj
1
1.1k
【セミナー資料】Claude Code をセキュアに使うための考え方と設定の勘どころ / Claude Code Webinar 20260616
masahirokawahara
2
310
小さくはじめるSLI/SLO ~育てながら組織に定着させる実践知~ / Starting Small with SLI/SLOs: Building Adoption Through Continuous Growth
nari_ex
7
1.9k
Featured
See All Featured
Chasing Engaging Ingredients in Design
codingconduct
0
220
Efficient Content Optimization with Google Search Console & Apps Script
katarinadahlin
PRO
1
620
KATA
mclloyd
PRO
35
15k
Hiding What from Whom? A Critical Review of the History of Programming languages for Music
tomoyanonymous
2
850
A designer walks into a library…
pauljervisheath
211
24k
Templates, Plugins, & Blocks: Oh My! Creating the theme that thinks of everything
marktimemedia
31
2.8k
Rails Girls Zürich Keynote
gr2m
96
14k
Fireside Chat
paigeccino
42
3.9k
Building Adaptive Systems
keathley
44
3.1k
Understanding Cognitive Biases in Performance Measurement
bluesmoon
32
2.9k
Stewardship and Sustainability of Urban and Community Forests
pwiseman
0
230
Marketing to machines
jonoalderson
1
5.4k
Transcript
The GHTorrent Dataset and Tool Suite Georgios Gousios Software Engineering
Research Group TU Delft
All data from Github
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
<<event>> PushEvent <<api>> /users/:user ensure_user <<api>> /repos/:user/:repo/ ensure_repo <<api>> /repos/:user/:repo/commits
ensure_commits ensure_user <<api>> /:user/:repo/sha ensure_commit ensure_user <<api>> /users/:user/ followers ensure_followers <<api>> /repos/:user/:repo/ commits/:sha/comments ensure_commit_comments <<api>> /users/:user/orgs ensure_orgs <<api>> /orgs/:org/teams ensure_teams Recursive dependency retrieval
Build relational database to query
repositories users organizations issues /users/:user /user/repos /repos/:user/:repo/issues /orgs/:org { 88"type":8"User",
88"public_gists":80, 88"login":8"gousiosg", 88"followers":88, 88"name":8"Georgios8Gousios", 88"public_repos":84, 88"created_at":8..., 88"id":8386172, 88"following":84, } { . . . CoSQL database as cache
Periodic dumps of DBs online
Query relational DB online
$ gem install sqlite3 ghtorrent $ ght-retrieve-repo mojombo jekyll $
(edit config.yaml) Roll your own tools
Research !
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Network analysis
Distributed development Text Text TUD-SERG-2013-10 An Exploratory Study of the
Pull- based Software Development model
None
None
None
None
None
ghtorrent.org Octicons font: courtesy Github