Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
GitHub Insights: Understanding Open Source
Search
Georgios Gousios
May 19, 2016
Technology
0
340
GitHub Insights: Understanding Open Source
Talk given at OSCON 2016
Georgios Gousios
May 19, 2016
Tweet
Share
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
260
The troubles of modern dependency management and what to do about them
gousiosg
0
490
Mining Repositories with Apache Spark
gousiosg
0
610
My adventures with open everything
gousiosg
0
260
Structure and Evolution of Package Dependency Networks
gousiosg
0
720
Mining Github for fun and profit
gousiosg
9
63k
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
900
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
250
The #issue32 incident
gousiosg
2
16k
Other Decks in Technology
See All in Technology
ビジネスモデリング道場 目的と背景
masuda220
PRO
9
560
OpenID Connect for Identity Assurance の概要と翻訳版のご紹介 / 20250219-BizDay17-OIDC4IDA-Intro
oidfj
0
280
現場で役立つAPIデザイン
nagix
34
12k
【Developers Summit 2025】プロダクトエンジニアから学ぶ、 ユーザーにより高い価値を届ける技術
niwatakeru
2
1.4k
7日間でハッキングをはじめる本をはじめてみませんか?_ITエンジニア本大賞2025
nomizone
2
1.9k
現場の種を事業の芽にする - エンジニア主導のイノベーションを事業戦略に装着する方法 -
kzkmaeda
2
2.1k
表現を育てる
kiyou77
1
220
目の前の仕事と向き合うことで成長できる - 仕事とスキルを広げる / Every little bit counts
soudai
25
7.2k
Swiftの “private” を テストする / Testing Swift "private"
yutailang0119
0
130
利用終了したドメイン名の最強終活〜観測環境を育てて、分析・供養している件〜 / The Ultimate End-of-Life Preparation for Discontinued Domain Names
nttcom
2
200
ソフトウェアエンジニアと仕事するときに知っておいたほうが良いこと / Key points for working with software engineers
pinkumohikan
0
100
Classmethod AI Talks(CATs) #16 司会進行スライド(2025.02.12) / classmethod-ai-talks-aka-cats_moderator-slides_vol16_2025-02-12
shinyaa31
0
110
Featured
See All Featured
Statistics for Hackers
jakevdp
797
220k
Stop Working from a Prison Cell
hatefulcrawdad
267
20k
Being A Developer After 40
akosma
89
590k
The Psychology of Web Performance [Beyond Tellerrand 2023]
tammyeverts
46
2.3k
Designing for Performance
lara
604
68k
Understanding Cognitive Biases in Performance Measurement
bluesmoon
27
1.6k
A Tale of Four Properties
chriscoyier
158
23k
Exploring the Power of Turbo Streams & Action Cable | RailsConf2023
kevinliebholz
30
4.6k
The MySQL Ecosystem @ GitHub 2015
samlambert
250
12k
I Don’t Have Time: Getting Over the Fear to Launch Your Podcast
jcasabona
32
2.1k
Git: the NoSQL Database
bkeepers
PRO
427
64k
Product Roadmaps are Hard
iamctodd
PRO
50
11k
Transcript
GitHub Insights Understanding Open Source @jeffmcaffer–Microsoft Georgios Gousios –Delft University
of Technology (TU Delft) Kevin Lewis – Microsoft
Snapshot overview
Inspire confidence
How open is a project? http://ghtorrent.org/pullreq-perf/
Commits (core vs community)
Commits (origin)
Comments (core vs community)
PR lifelines
Are we using git in a distributed way?
How may devs are there per country?
Insights
Business insights
Research insights
Cross-domain insights
Operational insights
Approach Data for the masses
GitHub by the numbers (Mid 2016)
Approach http://ghtorrent.org
How does it work? http://api.github.com/events
Example event (condensed) https://api.github.com/users/Cephei https://api.github.com/repos/PowerDMS/Owin.Scim https://api.github.com/repos/PowerDMS/Owin.Scim/commits/c751014f634d73e0b72f78a53c8cf137888b3 https://api.github.com/orgs/PowerDMS
Entities
GHTorrent architecture Github API Event Retrieval Commits Queue Project Events
Queue Events Data Retrieval Projects Commits evt.commit evt.watch evt.fork Data Retrieval Data Retrieval Data Retrieval Mirroring Cluster
GHTorrent by the numbers
Using the data You can do it too!
Using the data: Hosted http://ghtorrent.org
Using the data: Download
Using the data: Self-service https://github.com/ghtorrent/ghtorrent-webhook
Using the data: Azure Data Lake
Resources http://ghtorrent.org https://github.com/Microsoft/ghinsights @gousiosg @jeffmcaffer @kelewis