Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
The GHTorrent dataset and toolsuite
Search
Georgios Gousios
May 17, 2013
Technology
4
130k
The GHTorrent dataset and toolsuite
MSR2013 data paper presentation
Georgios Gousios
May 17, 2013
Tweet
Share
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
290
The troubles of modern dependency management and what to do about them
gousiosg
0
550
Mining Repositories with Apache Spark
gousiosg
0
660
My adventures with open everything
gousiosg
0
310
Structure and Evolution of Package Dependency Networks
gousiosg
0
790
Mining Github for fun and profit
gousiosg
9
63k
GitHub Insights: Understanding Open Source
gousiosg
0
380
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
930
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
290
Other Decks in Technology
See All in Technology
LLM時代にデータエンジニアの役割はどう変わるか?
ikkimiyazaki
1
580
業務自動化プラットフォーム Google Agentspace に入門してみる #devio2025
maroon1st
0
190
社内報はAIにやらせよう / Let AI handle the company newsletter
saka2jp
4
300
成長自己責任時代のあるきかた/How to navigate the era of personal responsibility for growth
kwappa
3
280
【新卒研修資料】LLM・生成AI研修 / Large Language Model・Generative AI
brainpadpr
24
17k
JAZUG 15周年記念 × JAT「AI Agent開発者必見:"今"のOracle技術で拡張するAzure × OCIの共存アーキテクチャ」
shisyu_gaku
0
120
20201008_ファインディ_品質意識を育てる役目は人かAIか___2_.pdf
findy_eventslides
1
460
OCI Network Firewall 概要
oracle4engineer
PRO
1
7.8k
Function calling機能をPLaMo2に実装するには / PFN LLMセミナー
pfn
PRO
0
940
ACA でMAGI システムを社内で展開しようとした話
mappie_kochi
1
270
関係性が駆動するアジャイル──GPTに人格を与えたら、対話を通してふりかえりを習慣化できた話
mhlyc
0
130
BtoBプロダクト開発の深層
16bitidol
0
350
Featured
See All Featured
Balancing Empowerment & Direction
lara
4
680
For a Future-Friendly Web
brad_frost
180
9.9k
Docker and Python
trallard
46
3.6k
Building Better People: How to give real-time feedback that sticks.
wjessup
368
20k
"I'm Feeling Lucky" - Building Great Search Experiences for Today's Users (#IAC19)
danielanewman
229
22k
Statistics for Hackers
jakevdp
799
220k
Building an army of robots
kneath
306
46k
4 Signs Your Business is Dying
shpigford
185
22k
Agile that works and the tools we love
rasmusluckow
331
21k
Performance Is Good for Brains [We Love Speed 2024]
tammyeverts
12
1.2k
Learning to Love Humans: Emotional Interface Design
aarron
274
40k
Save Time (by Creating Custom Rails Generators)
garrettdimon
PRO
32
1.6k
Transcript
The GHTorrent Dataset and Tool Suite Georgios Gousios Software Engineering
Research Group TU Delft
All data from Github
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
<<event>> PushEvent <<api>> /users/:user ensure_user <<api>> /repos/:user/:repo/ ensure_repo <<api>> /repos/:user/:repo/commits
ensure_commits ensure_user <<api>> /:user/:repo/sha ensure_commit ensure_user <<api>> /users/:user/ followers ensure_followers <<api>> /repos/:user/:repo/ commits/:sha/comments ensure_commit_comments <<api>> /users/:user/orgs ensure_orgs <<api>> /orgs/:org/teams ensure_teams Recursive dependency retrieval
Build relational database to query
repositories users organizations issues /users/:user /user/repos /repos/:user/:repo/issues /orgs/:org { 88"type":8"User",
88"public_gists":80, 88"login":8"gousiosg", 88"followers":88, 88"name":8"Georgios8Gousios", 88"public_repos":84, 88"created_at":8..., 88"id":8386172, 88"following":84, } { . . . CoSQL database as cache
Periodic dumps of DBs online
Query relational DB online
$ gem install sqlite3 ghtorrent $ ght-retrieve-repo mojombo jekyll $
(edit config.yaml) Roll your own tools
Research !
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Network analysis
Distributed development Text Text TUD-SERG-2013-10 An Exploratory Study of the
Pull- based Software Development model
None
None
None
None
None
ghtorrent.org Octicons font: courtesy Github