Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
The GHTorrent dataset and toolsuite
Search
Georgios Gousios
May 17, 2013
Technology
4
130k
The GHTorrent dataset and toolsuite
MSR2013 data paper presentation
Georgios Gousios
May 17, 2013
Tweet
Share
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
280
The troubles of modern dependency management and what to do about them
gousiosg
0
530
Mining Repositories with Apache Spark
gousiosg
0
650
My adventures with open everything
gousiosg
0
290
Structure and Evolution of Package Dependency Networks
gousiosg
0
770
Mining Github for fun and profit
gousiosg
9
63k
GitHub Insights: Understanding Open Source
gousiosg
0
370
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
920
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
290
Other Decks in Technology
See All in Technology
解消したはずが…技術と人間のエラーが交錯する恐怖体験
lamaglama39
0
150
ecspressoの設計思想に至る道 / sekkeinight2025
fujiwara3
12
2.3k
With Devin -AIの自律とメンバーの自立
kotanin0
2
970
AIエージェントを支える設計
tkikuchi1002
12
2.8k
「手を動かした者だけが世界を変える」ソフトウェア開発だけではない開発者人生
onishi
15
8k
LLM開発を支えるエヌビディアの生成AIエコシステム
acceleratedmu3n
0
360
オブザーバビリティプラットフォーム開発におけるオブザーバビリティとの向き合い / Hatena Engineer Seminar #34 オブザーバビリティの実現と運用編
arthur1
0
230
みんなのSRE 〜チーム全員でのSRE活動にするための4つの取り組み〜
kakehashi
PRO
2
110
LLMでAI-OCR、実際どうなの? / llm_ai_ocr_layerx_bet_ai_day_lt
sbrf248
0
390
P2P ではじめる WebRTC のつまづきどころ
tnoho
1
290
2025新卒研修・HTML/CSS #弁護士ドットコム
bengo4com
3
4.3k
反脆弱性(アンチフラジャイル)とデータ基盤構築
cuebic9bic
2
130
Featured
See All Featured
Art, The Web, and Tiny UX
lynnandtonic
301
21k
Distributed Sagas: A Protocol for Coordinating Microservices
caitiem20
331
22k
How STYLIGHT went responsive
nonsquared
100
5.7k
How to Think Like a Performance Engineer
csswizardry
25
1.8k
GitHub's CSS Performance
jonrohan
1031
460k
Rails Girls Zürich Keynote
gr2m
95
14k
Raft: Consensus for Rubyists
vanstee
140
7k
ReactJS: Keep Simple. Everything can be a component!
pedronauck
667
120k
"I'm Feeling Lucky" - Building Great Search Experiences for Today's Users (#IAC19)
danielanewman
229
22k
I Don’t Have Time: Getting Over the Fear to Launch Your Podcast
jcasabona
33
2.4k
CoffeeScript is Beautiful & I Never Want to Write Plain JavaScript Again
sstephenson
161
15k
The MySQL Ecosystem @ GitHub 2015
samlambert
251
13k
Transcript
The GHTorrent Dataset and Tool Suite Georgios Gousios Software Engineering
Research Group TU Delft
All data from Github
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
<<event>> PushEvent <<api>> /users/:user ensure_user <<api>> /repos/:user/:repo/ ensure_repo <<api>> /repos/:user/:repo/commits
ensure_commits ensure_user <<api>> /:user/:repo/sha ensure_commit ensure_user <<api>> /users/:user/ followers ensure_followers <<api>> /repos/:user/:repo/ commits/:sha/comments ensure_commit_comments <<api>> /users/:user/orgs ensure_orgs <<api>> /orgs/:org/teams ensure_teams Recursive dependency retrieval
Build relational database to query
repositories users organizations issues /users/:user /user/repos /repos/:user/:repo/issues /orgs/:org { 88"type":8"User",
88"public_gists":80, 88"login":8"gousiosg", 88"followers":88, 88"name":8"Georgios8Gousios", 88"public_repos":84, 88"created_at":8..., 88"id":8386172, 88"following":84, } { . . . CoSQL database as cache
Periodic dumps of DBs online
Query relational DB online
$ gem install sqlite3 ghtorrent $ ght-retrieve-repo mojombo jekyll $
(edit config.yaml) Roll your own tools
Research !
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Network analysis
Distributed development Text Text TUD-SERG-2013-10 An Exploratory Study of the
Pull- based Software Development model
None
None
None
None
None
ghtorrent.org Octicons font: courtesy Github