The GHTorrent dataset and toolsuite
Georgios Gousios
May 17, 2013
MSR2013 data paper presentation
Transcript
The GHTorrent Dataset and Tool Suite
Georgios Gousios
Software Engineering Research Group, TU Delft
All data from GitHub
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
Recursive dependency retrieval
<<event>> PushEvent
<<api>> /users/:user (ensure_user)
<<api>> /repos/:user/:repo/ (ensure_repo)
<<api>> /repos/:user/:repo/commits (ensure_commits)
ensure_user
<<api>> /:user/:repo/sha (ensure_commit)
ensure_user
<<api>> /users/:user/followers (ensure_followers)
<<api>> /repos/:user/:repo/commits/:sha/comments (ensure_commit_comments)
<<api>> /users/:user/orgs (ensure_orgs)
<<api>> /orgs/:org/teams (ensure_teams)
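The recursive retrieval on this slide can be illustrated with a small sketch. This is not GHTorrent's actual code: the in-memory `FAKE_API` and `store`, the `fetch` and `ensure` helpers, the event shape, and the commit path used here are all assumptions made for illustration. Only the `ensure_*` naming follows the slide. Each `ensure_*` step first makes sure the entities it depends on exist, and requests from the API only what is not already stored.

```python
from typing import Dict, List

# Hypothetical stand-in for the GitHub API: path -> JSON body.
FAKE_API: Dict[str, dict] = {
    "/users/alice": {"login": "alice"},
    "/repos/alice/demo": {"name": "demo", "owner": "alice"},
    "/repos/alice/demo/commits/abc123": {"sha": "abc123", "author": "alice"},
}

store: Dict[str, dict] = {}  # entities already retrieved, keyed by API path
api_calls: List[str] = []    # log of the API paths actually requested


def fetch(path: str) -> dict:
    """Simulate one API request and record it."""
    api_calls.append(path)
    return FAKE_API[path]


def ensure(path: str) -> dict:
    """Return the stored entity, fetching it only if it is missing."""
    if path not in store:
        store[path] = fetch(path)
    return store[path]


def ensure_user(user: str) -> dict:
    return ensure(f"/users/{user}")


def ensure_repo(user: str, repo: str) -> dict:
    ensure_user(user)  # a repository depends on its owner
    return ensure(f"/repos/{user}/{repo}")


def ensure_commit(user: str, repo: str, sha: str) -> dict:
    ensure_repo(user, repo)  # a commit depends on its repository
    commit = ensure(f"/repos/{user}/{repo}/commits/{sha}")
    ensure_user(commit["author"])  # and on its author
    return commit


def handle_push_event(event: dict) -> None:
    """Resolve every commit named in a (simplified, hypothetical) push event."""
    for sha in event["shas"]:
        ensure_commit(event["user"], event["repo"], sha)


handle_push_event({"user": "alice", "repo": "demo", "shas": ["abc123"]})
print(api_calls)
```

In this sketch, handling the same event a second time triggers no further requests, because every dependency is already in `store`.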
Build relational database to query
CoSQL database as cache
Collections: repositories, users, organizations, issues
API endpoints: /users/:user, /user/repos, /repos/:user/:repo/issues, /orgs/:org
{
  "type": "User",
  "public_gists": 0,
  "login": "gousiosg",
  "followers": 8,
  "name": "Georgios Gousios",
  "public_repos": 4,
  "created_at": ...,
  "id": 386172,
  "following": 4,
}
{ . . .
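One way to picture the cache-then-relational flow is the sketch below. It keeps the raw API response as a document and projects a few of its fields into a relational row. The `cache` dict, the `cache_response` and `to_user_row` helpers, and the `users` table layout are assumptions for illustration, not GHTorrent's schema. The sample document reuses a subset of the fields shown on the slide.

```python
import json
import sqlite3

# Raw API response, reduced to a few of the fields shown on the slide.
raw = '{"type": "User", "login": "gousiosg", "name": "Georgios Gousios", "id": 386172}'

cache = {}  # hypothetical document cache: API path -> parsed JSON


def cache_response(path, body):
    """Keep the unmodified API response so it can be re-processed later."""
    cache[path] = json.loads(body)
    return cache[path]


def to_user_row(doc):
    """Project a cached user document onto the (assumed) users table."""
    return (doc["id"], doc["login"], doc.get("name"))


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, login TEXT UNIQUE, name TEXT)"
)

doc = cache_response("/users/gousiosg", raw)
db.execute("INSERT OR IGNORE INTO users VALUES (?, ?, ?)", to_user_row(doc))

row = db.execute(
    "SELECT login, name FROM users WHERE id = ?", (386172,)
).fetchone()
print(row)
```

Keeping the raw document alongside the extracted row means the relational side can be rebuilt from the cache without calling the API again.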
Periodic dumps of DBs online
Query relational DB online
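As a rough idea of the kind of question a relational form makes easy, the sketch below counts commits per author for one project. The three-table schema and the sample rows are invented for illustration and should not be read as the real GHTorrent schema or data.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical schema and made-up sample rows.
db.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, login TEXT);
CREATE TABLE projects (
    id INTEGER PRIMARY KEY,
    owner_id INTEGER REFERENCES users(id),
    name TEXT
);
CREATE TABLE commits (
    id INTEGER PRIMARY KEY,
    project_id INTEGER REFERENCES projects(id),
    author_id INTEGER REFERENCES users(id),
    sha TEXT
);
INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
INSERT INTO projects VALUES (10, 1, 'demo');
INSERT INTO commits VALUES (100, 10, 1, 'a1'), (101, 10, 2, 'b2'), (102, 10, 2, 'c3');
""")


def commits_per_author(conn, owner, project):
    """Return (login, commit count) pairs for one project, busiest author first."""
    return conn.execute(
        """
        SELECT u.login, COUNT(*) AS n
        FROM commits c
        JOIN projects p ON p.id = c.project_id
        JOIN users o ON o.id = p.owner_id
        JOIN users u ON u.id = c.author_id
        WHERE o.login = ? AND p.name = ?
        GROUP BY u.login
        ORDER BY n DESC, u.login
        """,
        (owner, project),
    ).fetchall()


result = commits_per_author(db, "alice", "demo")
print(result)
```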
Roll your own tools
$ gem install sqlite3 ghtorrent
$ ght-retrieve-repo mojombo jekyll
$ (edit config.yaml)
Research!
Single developer identities
Source tracking
Network analysis
Distributed development
TUD-SERG-2013-10: An Exploratory Study of the Pull-based Software Development Model
ghtorrent.org
Octicons font: courtesy of GitHub