The GHTorrent dataset and toolsuite
Georgios Gousios
May 17, 2013
MSR2013 data paper presentation
Transcript
The GHTorrent Dataset and Tool Suite
Georgios Gousios
Software Engineering Research Group, TU Delft
All data from GitHub
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
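A minimal sketch of what mirroring an event stream can involve, assuming the stream is polled repeatedly and successive polls overlap. The function name `new_events` and the sample events are hypothetical; only the `id` and `type` fields mirror what GitHub events carry. This is not the GHTorrent implementation.

```python
from typing import Dict, List, Set


def new_events(page: List[Dict[str, str]], seen_ids: Set[str]) -> List[Dict[str, str]]:
    """Return the events in `page` that were not seen before, and remember them."""
    fresh = []
    for event in page:
        if event["id"] not in seen_ids:
            seen_ids.add(event["id"])
            fresh.append(event)
    return fresh


# Two hypothetical polls of the public event stream; event "2" appears in both.
seen: Set[str] = set()
first_poll = [{"id": "1", "type": "PushEvent"}, {"id": "2", "type": "IssuesEvent"}]
second_poll = [{"id": "2", "type": "IssuesEvent"}, {"id": "3", "type": "PushEvent"}]

batch1 = new_events(first_poll, seen)
batch2 = new_events(second_poll, seen)
print([e["id"] for e in batch1])  # ['1', '2']
print([e["id"] for e in batch2])  # ['3']
```

Keeping the set of seen ids outside the function lets the same state be reused across polls, so an event is handed to downstream processing only once.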
Recursive dependency retrieval
<<event>> PushEvent
<<api>> /users/:user  ensure_user
<<api>> /repos/:user/:repo/  ensure_repo
<<api>> /repos/:user/:repo/commits  ensure_commits
ensure_user
<<api>> /:user/:repo/sha  ensure_commit
ensure_user
<<api>> /users/:user/followers  ensure_followers
<<api>> /repos/:user/:repo/commits/:sha/comments  ensure_commit_comments
<<api>> /users/:user/orgs  ensure_orgs
<<api>> /orgs/:org/teams  ensure_teams
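The recursive retrieval idea on this slide can be sketched roughly as follows: each `ensure_*` step first makes sure the entities it depends on exist, and only calls the API for entities that are not stored yet. The `Mirror` class, the `FAKE_API` table, and the event layout are hypothetical stand-ins so the example runs offline. Only the `ensure_user`, `ensure_repo`, and `ensure_commit` names come from the slide.

```python
from typing import Callable, Dict, List

# Stand-in for the GitHub API: maps an endpoint path to a canned response.
FAKE_API: Dict[str, dict] = {
    "/users/alice": {"login": "alice"},
    "/repos/alice/demo": {"name": "demo", "owner": "alice"},
    "/repos/alice/demo/commits/abc123": {"sha": "abc123", "author": "alice"},
}


class Mirror:
    def __init__(self, fetch: Callable[[str], dict]) -> None:
        self.fetch = fetch
        self.store: Dict[str, dict] = {}  # entities retrieved so far, keyed by path
        self.calls: List[str] = []        # log of API requests actually made

    def _ensure(self, path: str) -> dict:
        # Call the API only when the entity is not stored yet.
        if path not in self.store:
            self.calls.append(path)
            self.store[path] = self.fetch(path)
        return self.store[path]

    def ensure_user(self, login: str) -> dict:
        return self._ensure(f"/users/{login}")

    def ensure_repo(self, owner: str, repo: str) -> dict:
        self.ensure_user(owner)  # a repository depends on its owner
        return self._ensure(f"/repos/{owner}/{repo}")

    def ensure_commit(self, owner: str, repo: str, sha: str) -> dict:
        self.ensure_repo(owner, repo)  # a commit depends on its repository
        commit = self._ensure(f"/repos/{owner}/{repo}/commits/{sha}")
        self.ensure_user(commit["author"])  # and on its author
        return commit

    def on_push_event(self, event: dict) -> None:
        for sha in event["shas"]:
            self.ensure_commit(event["owner"], event["repo"], sha)


mirror = Mirror(FAKE_API.__getitem__)
mirror.on_push_event({"owner": "alice", "repo": "demo", "shas": ["abc123"]})
print(mirror.calls)
# ['/users/alice', '/repos/alice/demo', '/repos/alice/demo/commits/abc123']
```

Because every step checks the store first, replaying the same event triggers no further API requests, which matters when the API is rate limited.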
Build relational database to query
CoSQL database as cache
repositories, users, organizations, issues
/users/:user, /user/repos, /repos/:user/:repo/issues, /orgs/:org
{
  "type": "User",
  "public_gists": 0,
  "login": "gousiosg",
  "followers": 8,
  "name": "Georgios Gousios",
  "public_repos": 4,
  "created_at": ...,
  "id": 386172,
  "following": 4,
}
{ . . .
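One possible reading of the cache slide is that raw API responses are kept as JSON documents and the relational tables are derived from them. The sketch below follows that reading under stated assumptions: a plain dictionary stands in for the document store, SQLite stands in for the relational database, and the `users` table layout, `cache_response`, and `load_users` are hypothetical. The sample document reuses a few fields from the user record shown on the slide.

```python
import json
import sqlite3
from typing import Dict

# Stand-in for the document store: endpoint path -> raw JSON text.
raw_cache: Dict[str, str] = {}


def cache_response(path: str, body: str) -> None:
    """Keep the raw API response so it never has to be requested again."""
    raw_cache[path] = body


def load_users(conn: sqlite3.Connection) -> int:
    """Project the cached user documents into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(id INTEGER PRIMARY KEY, login TEXT, name TEXT, followers INTEGER)"
    )
    loaded = 0
    for path, body in raw_cache.items():
        if not path.startswith("/users/"):
            continue
        doc = json.loads(body)
        conn.execute(
            "INSERT OR REPLACE INTO users (id, login, name, followers) "
            "VALUES (?, ?, ?, ?)",
            (doc["id"], doc["login"], doc.get("name"), doc.get("followers", 0)),
        )
        loaded += 1
    return loaded


cache_response(
    "/users/gousiosg",
    '{"type": "User", "login": "gousiosg", "name": "Georgios Gousios", '
    '"followers": 8, "id": 386172}',
)
conn = sqlite3.connect(":memory:")
count = load_users(conn)
print(count)  # 1
print(conn.execute("SELECT login, followers FROM users").fetchall())
# [('gousiosg', 8)]
```

Keeping the raw documents means the relational tables can be rebuilt with a different layout later without going back to the API.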
Periodic dumps of DBs online
Query relational DB online
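As an illustration of the kind of question a relational copy makes easy to ask, the sketch below counts commits per author for one project. The three tables and their columns are a hypothetical toy schema built in an in-memory SQLite database, not the actual GHTorrent schema, and `commits_per_author` is an illustrative helper.

```python
import sqlite3
from typing import List, Tuple

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, login TEXT);
    CREATE TABLE projects (
        id INTEGER PRIMARY KEY,
        owner_id INTEGER REFERENCES users(id),
        name TEXT
    );
    CREATE TABLE commits (
        id INTEGER PRIMARY KEY,
        project_id INTEGER REFERENCES projects(id),
        author_id INTEGER REFERENCES users(id),
        sha TEXT
    );
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO projects VALUES (10, 1, 'demo');
    INSERT INTO commits VALUES (100, 10, 1, 'a1'), (101, 10, 2, 'b2'), (102, 10, 2, 'c3');
    """
)


def commits_per_author(db: sqlite3.Connection, project_name: str) -> List[Tuple[str, int]]:
    """Return (login, commit count) pairs for one project, busiest author first."""
    return db.execute(
        """
        SELECT u.login, COUNT(*) AS n
        FROM commits c
        JOIN projects p ON p.id = c.project_id
        JOIN users u ON u.id = c.author_id
        WHERE p.name = ?
        GROUP BY u.login
        ORDER BY n DESC, u.login
        """,
        (project_name,),
    ).fetchall()


print(commits_per_author(conn, "demo"))  # [('bob', 2), ('alice', 1)]
```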
Roll your own tools
$ gem install sqlite3 ghtorrent
$ ght-retrieve-repo mojombo jekyll
(edit config.yaml)
Research!
Single developer identities
Source tracking
Network analysis
Distributed development
TUD-SERG-2013-10: An Exploratory Study of the Pull-based Software Development Model
ghtorrent.org
Octicons font: courtesy of GitHub