Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
The GHTorrent dataset and toolsuite
Search
Sponsored
·
Ship Features Fearlessly
Turn features on and off without deploys. Used by thousands of Ruby developers.
→
Georgios Gousios
May 17, 2013
Technology
130k
3
Share
The GHTorrent dataset and toolsuite
MSR2013 data paper presentation
Georgios Gousios
May 17, 2013
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
370
The troubles of modern dependency management and what to do about them
gousiosg
0
690
Mining Repositories with Apache Spark
gousiosg
0
720
My adventures with open everything
gousiosg
0
360
Structure and Evolution of Package Dependency Networks
gousiosg
0
910
Mining Github for fun and profit
gousiosg
9
63k
GitHub Insights: Understanding Open Source
gousiosg
0
440
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
980
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
350
Other Decks in Technology
See All in Technology
情シスがMCP環境導入時に打ちのめされる認可の崖
oidfj
0
710
eBPF Can Do It! A 5-Minute Tour of 5 Real-World PHP Issues Solved with eBPF
egmc
0
300
Kaigi Effect Effect
ngtyuk
0
100
まだ道半ば、AI-DLCを歩み始めている話
news_it_enj
2
200
権限管理設計を完全に理解した
rsugi
2
220
JJUG CCC 2026 Spring AI時代の開発こそ標準化を武器に! ― 方式・プロセス・プラットフォームの標準化
s27watanabe
2
510
脅威をエンジニアリングの糧にして:恐怖を乗り越えた先にあったもの / Turn threats into fuel for engineering: what lay beyond overcoming fear
nrslib
1
330
CloudFront VPCオリジンとVPC Latticeサービスの内部ALBをマルチアカウントで一元利用しよう
duelist2020jp
5
250
AI時代から振り返るTerraform drift運用の歴史 / AI Age Reflections on the History of Terraform Drift Operations
aeonpeople
0
530
最低限これだけ押さえれ大丈夫_Claude Enterprise/Team企業展開ガバナンス入門
tkikuchi
1
300
大規模環境でどのように監視を実現する?
yuobayashi
1
260
開発を止めない CI/CD ~CI Visibilityによる継続的最適化~
pensuke628
0
120
Featured
See All Featured
Improving Core Web Vitals using Speculation Rules API
sergeychernyshev
21
1.5k
Building Applications with DynamoDB
mza
96
7k
The Myth of the Modular Monolith - Day 2 Keynote - Rails World 2024
eileencodes
28
3.5k
How to Create Impact in a Changing Tech Landscape [PerfNow 2023]
tammyeverts
55
3.3k
Fireside Chat
paigeccino
42
3.9k
ピンチをチャンスに:未来をつくるプロダクトロードマップ #pmconf2020
aki_iinuma
128
55k
End of SEO as We Know It (SMX Advanced Version)
ipullrank
3
4.2k
Build your cross-platform service in a week with App Engine
jlugia
234
18k
Everyday Curiosity
cassininazir
0
210
The Organizational Zoo: Understanding Human Behavior Agility Through Metaphoric Constructive Conversations (based on the works of Arthur Shelley, Ph.D)
kimpetersen
PRO
0
340
The Impact of AI in SEO - AI Overviews June 2024 Edition
aleyda
5
1.1k
State of Search Keynote: SEO is Dead Long Live SEO
ryanjones
0
200
Transcript
The GHTorrent Dataset and Tool Suite Georgios Gousios Software Engineering
Research Group TU Delft
All data from Github
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
<<event>> PushEvent <<api>> /users/:user ensure_user <<api>> /repos/:user/:repo/ ensure_repo <<api>> /repos/:user/:repo/commits
ensure_commits ensure_user <<api>> /:user/:repo/sha ensure_commit ensure_user <<api>> /users/:user/ followers ensure_followers <<api>> /repos/:user/:repo/ commits/:sha/comments ensure_commit_comments <<api>> /users/:user/orgs ensure_orgs <<api>> /orgs/:org/teams ensure_teams Recursive dependency retrieval
Build relational database to query
repositories users organizations issues /users/:user /user/repos /repos/:user/:repo/issues /orgs/:org { 88"type":8"User",
88"public_gists":80, 88"login":8"gousiosg", 88"followers":88, 88"name":8"Georgios8Gousios", 88"public_repos":84, 88"created_at":8..., 88"id":8386172, 88"following":84, } { . . . CoSQL database as cache
Periodic dumps of DBs online
Query relational DB online
$ gem install sqlite3 ghtorrent $ ght-retrieve-repo mojombo jekyll $
(edit config.yaml) Roll your own tools
Research !
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Network analysis
Distributed development Text Text TUD-SERG-2013-10 An Exploratory Study of the
Pull- based Software Development model
None
None
None
None
None
ghtorrent.org Octicons font: courtesy Github