The GHTorrent dataset and toolsuite
Georgios Gousios
May 17, 2013
MSR2013 data paper presentation
Transcript
The GHTorrent Dataset and Tool Suite
Georgios Gousios
Software Engineering Research Group, TU Delft
All data from GitHub
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
Recursive dependency retrieval
<<event>> PushEvent
<<api>> /users/:user  ensure_user
<<api>> /repos/:user/:repo/  ensure_repo
<<api>> /repos/:user/:repo/commits  ensure_commits
ensure_user
<<api>> /:user/:repo/sha  ensure_commit
ensure_user
<<api>> /users/:user/followers  ensure_followers
<<api>> /repos/:user/:repo/commits/:sha/comments  ensure_commit_comments
<<api>> /users/:user/orgs  ensure_orgs
<<api>> /orgs/:org/teams  ensure_teams
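The "ensure" pattern on this slide can be sketched roughly as follows. This is a minimal illustration, not GHTorrent's actual code: the `FAKE_API` table, the `store` dictionary, the `ensure` helper and the event shape are all hypothetical stand-ins so the example runs offline. The idea it shows is that each entity is fetched only if it is missing, and that fetching an entity first ensures the entities it depends on.

```python
# Hypothetical sketch of recursive dependency retrieval (not GHTorrent's code).
# FAKE_API stands in for GitHub API responses so the example runs offline.
FAKE_API = {
    "/users/alice": {"login": "alice"},
    "/users/bob": {"login": "bob"},
    "/repos/alice/demo": {"name": "demo", "owner": "alice"},
    "/repos/alice/demo/commits/abc123": {"sha": "abc123", "author": "bob"},
}

store = {}       # stands in for the database
api_calls = []   # records which endpoints were actually fetched


def fetch(path):
    """Simulate one API request and remember that it happened."""
    api_calls.append(path)
    return FAKE_API[path]


def ensure(path):
    """Return the stored entity for path, fetching it only if it is missing."""
    if path not in store:
        store[path] = fetch(path)
    return store[path]


def ensure_user(login):
    return ensure(f"/users/{login}")


def ensure_repo(owner, name):
    ensure_user(owner)  # a repository depends on its owner
    return ensure(f"/repos/{owner}/{name}")


def ensure_commit(owner, name, sha):
    ensure_repo(owner, name)  # a commit depends on its repository
    commit = ensure(f"/repos/{owner}/{name}/commits/{sha}")
    ensure_user(commit["author"])  # and on its author
    return commit


def handle_push_event(event):
    """Resolve every commit named in a (simplified, hypothetical) push event."""
    for sha in event["shas"]:
        ensure_commit(event["owner"], event["repo"], sha)


event = {"owner": "alice", "repo": "demo", "shas": ["abc123"]}
handle_push_event(event)
handle_push_event(event)  # second pass finds everything stored; no new fetches
```

Because every `ensure_*` call checks the store first, replaying the same event is harmless, which is convenient when mirroring a stream that may deliver overlapping pages of events.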
Build relational database to query
repositories  users  organizations  issues
/users/:user  /user/repos  /repos/:user/:repo/issues  /orgs/:org
{
  "type": "User",
  "public_gists": 0,
  "login": "gousiosg",
  "followers": 8,
  "name": "Georgios Gousios",
  "public_repos": 4,
  "created_at": ...,
  "id": 386172,
  "following": 4,
}
{ . . .
CoSQL database as cache
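One way to picture the two storage layers is a raw document cache feeding a relational table. The sketch below is only an assumption-laden illustration using Python's standard `sqlite3` and `json` modules: the table names `raw_cache` and `users`, their columns, and both helper functions are hypothetical and do not reflect GHTorrent's real schema. The user document is a subset of the fields shown on the slide.

```python
import json
import sqlite3

# Subset of the user document from the slide.
raw_user = {
    "type": "User",
    "login": "gousiosg",
    "name": "Georgios Gousios",
    "public_repos": 4,
    "id": 386172,
}

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one table caching raw API responses, one relational table.
conn.execute("CREATE TABLE raw_cache (url TEXT PRIMARY KEY, body TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, login TEXT UNIQUE NOT NULL, name TEXT, type TEXT)"
)


def cache_response(url, doc):
    """Store the unmodified API response, keyed by the URL it came from."""
    conn.execute(
        "INSERT OR REPLACE INTO raw_cache (url, body) VALUES (?, ?)",
        (url, json.dumps(doc)),
    )


def project_user(url):
    """Copy the queryable fields of a cached user document into the users table."""
    row = conn.execute(
        "SELECT body FROM raw_cache WHERE url = ?", (url,)
    ).fetchone()
    doc = json.loads(row[0])
    conn.execute(
        "INSERT OR IGNORE INTO users (id, login, name, type) VALUES (?, ?, ?, ?)",
        (doc["id"], doc["login"], doc.get("name"), doc["type"]),
    )


cache_response("/users/gousiosg", raw_user)
project_user("/users/gousiosg")
rows = conn.execute("SELECT login, name FROM users WHERE type = 'User'").fetchall()
```

Keeping the raw response alongside the relational row means the relational schema can be rebuilt or extended later without calling the API again.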
Periodic dumps of DBs online
Query relational DB online
$ gem install sqlite3 ghtorrent
$ ght-retrieve-repo mojombo jekyll
$ (edit config.yaml)
Roll your own tools
Research!
Single developer identities
Source tracking
Network analysis
Distributed development
TUD-SERG-2013-10
An Exploratory Study of the Pull-based Software Development Model
ghtorrent.org
Octicons font: courtesy of GitHub