Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
The GHTorrent dataset and toolsuite
Search
Georgios Gousios
May 17, 2013
Technology
4
130k
The GHTorrent dataset and toolsuite
MSR2013 data paper presentation
Georgios Gousios
May 17, 2013
Tweet
Share
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
300
The troubles of modern dependency management and what to do about them
gousiosg
0
560
Mining Repositories with Apache Spark
gousiosg
0
670
My adventures with open everything
gousiosg
0
310
Structure and Evolution of Package Dependency Networks
gousiosg
0
800
Mining Github for fun and profit
gousiosg
9
63k
GitHub Insights: Understanding Open Source
gousiosg
0
390
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
940
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
300
Other Decks in Technology
See All in Technology
Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
peisuke
0
160
ABEMAのCM配信を支えるスケーラブルな分散カウンタの実装
hono0130
4
1k
米軍Platform One / Black Pearlに学ぶ極限環境DevSecOps
jyoshise
2
520
大規模モノレポの秩序管理 失速しない多言語化フロントエンドの運用 / JSConf JP 2025
shoota
0
290
ある編集者のこれまでとこれから —— 開発者コミュニティと歩んだ四半世紀
inao
5
3.5k
QAを"自動化する"ことの本質
kshino
1
140
Service Monitoring Platformについて
lycorptech_jp
PRO
0
320
DDD x Microservice Architecture : Findy Architecture Conf 2025
syobochim
12
3.1k
Kubernetesと共にふりかえる! エンタープライズシステムのインフラ設計・テストの進め方大全
daitak
0
410
LINEギフト・LINEコマース領域の開発
lycorptech_jp
PRO
0
340
AI時代の戦略的アーキテクチャ 〜Adaptable AI をアーキテクチャで実現する〜 / Enabling Adaptable AI Through Strategic Architecture
bitkey
PRO
14
7k
Perlの生きのこり - YAPC::Fukuoka 2025
kfly8
0
580
Featured
See All Featured
The Success of Rails: Ensuring Growth for the Next 100 Years
eileencodes
46
7.8k
Music & Morning Musume
bryan
46
7k
Rails Girls Zürich Keynote
gr2m
95
14k
Context Engineering - Making Every Token Count
addyosmani
9
410
Building Applications with DynamoDB
mza
96
6.8k
Responsive Adventures: Dirty Tricks From The Dark Corners of Front-End
smashingmag
253
22k
Producing Creativity
orderedlist
PRO
348
40k
Visualizing Your Data: Incorporating Mongo into Loggly Infrastructure
mongodb
48
9.8k
Practical Orchestrator
shlominoach
190
11k
The Straight Up "How To Draw Better" Workshop
denniskardys
239
140k
Designing for humans not robots
tammielis
254
26k
No one is an island. Learnings from fostering a developers community.
thoeni
21
3.5k
Transcript
The GHTorrent Dataset and Tool Suite Georgios Gousios Software Engineering
Research Group TU Delft
All data from Github
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
<<event>> PushEvent <<api>> /users/:user ensure_user <<api>> /repos/:user/:repo/ ensure_repo <<api>> /repos/:user/:repo/commits
ensure_commits ensure_user <<api>> /:user/:repo/sha ensure_commit ensure_user <<api>> /users/:user/ followers ensure_followers <<api>> /repos/:user/:repo/ commits/:sha/comments ensure_commit_comments <<api>> /users/:user/orgs ensure_orgs <<api>> /orgs/:org/teams ensure_teams Recursive dependency retrieval
Build relational database to query
repositories users organizations issues /users/:user /user/repos /repos/:user/:repo/issues /orgs/:org { 88"type":8"User",
88"public_gists":80, 88"login":8"gousiosg", 88"followers":88, 88"name":8"Georgios8Gousios", 88"public_repos":84, 88"created_at":8..., 88"id":8386172, 88"following":84, } { . . . CoSQL database as cache
Periodic dumps of DBs online
Query relational DB online
$ gem install sqlite3 ghtorrent $ ght-retrieve-repo mojombo jekyll $
(edit config.yaml) Roll your own tools
Research !
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Network analysis
Distributed development Text Text TUD-SERG-2013-10 An Exploratory Study of the
Pull- based Software Development model
None
None
None
None
None
ghtorrent.org Octicons font: courtesy Github