Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
The GHTorrent dataset and toolsuite
Search
Georgios Gousios
May 17, 2013
Technology
4
130k
The GHTorrent dataset and toolsuite
MSR2013 data paper presentation
Georgios Gousios
May 17, 2013
Tweet
Share
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
260
The troubles of modern dependency management and what to do about them
gousiosg
0
480
Mining Repositories with Apache Spark
gousiosg
0
610
My adventures with open everything
gousiosg
0
260
Structure and Evolution of Package Dependency Networks
gousiosg
0
720
Mining Github for fun and profit
gousiosg
9
63k
GitHub Insights: Understanding Open Source
gousiosg
0
340
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
890
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
250
Other Decks in Technology
See All in Technology
あなたの興味は信頼性?それとも生産性? SREとしてのキャリアに悩むみなさまに伝えたい選択肢
jacopen
6
3.4k
日本語プログラミングとSpring Bootアプリケーション開発 #kanjava
yusuke
2
350
顧客の声を集めて活かすリクルートPdMのVoC活用事例を徹底解剖!〜プロデザ!〜
recruitengineers
PRO
0
210
Tech Blog執筆のモチベート向上作戦
imamura_ko_0314
0
750
srekaigi2025-hajimete-ippo-aws
masakichieng
0
250
Grid表示のレイアウトで Flow layoutsを使う
cffyoha
1
150
ObservabilityCON on the Road Tokyoの見どころ
hamadakoji
0
220
FastConnect の冗長性
ocise
1
9.3k
ココナラのセキュリティ組織の体制・役割・今後目指す世界
coconala_engineer
0
220
Bounded Context: Problem or Solution?
ewolff
1
120
Creative Pair
kawaguti
PRO
1
140
業務ツールをAIエージェントとつなぐ - Composio
knishioka
0
120
Featured
See All Featured
Measuring & Analyzing Core Web Vitals
bluesmoon
6
220
ReactJS: Keep Simple. Everything can be a component!
pedronauck
666
120k
Responsive Adventures: Dirty Tricks From The Dark Corners of Front-End
smashingmag
251
21k
Imperfection Machines: The Place of Print at Facebook
scottboms
267
13k
Code Review Best Practice
trishagee
65
17k
The Myth of the Modular Monolith - Day 2 Keynote - Rails World 2024
eileencodes
20
2.4k
Building a Modern Day E-commerce SEO Strategy
aleyda
38
7.1k
Practical Tips for Bootstrapping Information Extraction Pipelines
honnibal
PRO
11
900
Optimising Largest Contentful Paint
csswizardry
33
3k
Git: the NoSQL Database
bkeepers
PRO
427
64k
Music & Morning Musume
bryan
46
6.3k
We Have a Design System, Now What?
morganepeng
51
7.4k
Transcript
The GHTorrent Dataset and Tool Suite Georgios Gousios Software Engineering
Research Group TU Delft
All data from Github
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
<<event>> PushEvent <<api>> /users/:user ensure_user <<api>> /repos/:user/:repo/ ensure_repo <<api>> /repos/:user/:repo/commits
ensure_commits ensure_user <<api>> /:user/:repo/sha ensure_commit ensure_user <<api>> /users/:user/ followers ensure_followers <<api>> /repos/:user/:repo/ commits/:sha/comments ensure_commit_comments <<api>> /users/:user/orgs ensure_orgs <<api>> /orgs/:org/teams ensure_teams Recursive dependency retrieval
Build relational database to query
repositories users organizations issues /users/:user /user/repos /repos/:user/:repo/issues /orgs/:org { 88"type":8"User",
88"public_gists":80, 88"login":8"gousiosg", 88"followers":88, 88"name":8"Georgios8Gousios", 88"public_repos":84, 88"created_at":8..., 88"id":8386172, 88"following":84, } { . . . CoSQL database as cache
Periodic dumps of DBs online
Query relational DB online
$ gem install sqlite3 ghtorrent $ ght-retrieve-repo mojombo jekyll $
(edit config.yaml) Roll your own tools
Research !
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Network analysis
Distributed development Text Text TUD-SERG-2013-10 An Exploratory Study of the
Pull- based Software Development model
None
None
None
None
None
ghtorrent.org Octicons font: courtesy Github