How to Build a GitHub

Learn about the growth patterns and the architecture behind github.com.

Zach Holman

August 05, 2012

Transcript


  1. github
    HOW to BUILD a GITHUB

  2. 6.5MM REPOSITORIES
    LARGEST GIT HOST
    1.9MM USERS
    SINCE 2008

  3. 6.5MM REPOSITORIES
    LARGEST GIT HOST
    1.9MM USERS
    SINCE 2008
    SVN HOST

  4. [graphics only]

  5. going to SHOW YOU OUR CARDS

  6. there is no MAGIC BULLET

  7. FOUR STAGES OF GROWTH
    automate EVERYTHING
    the happiness

  8. @HOLMAN
    NO FORKING
    LOST
    YO QUIT READING THIS SHIT

  9. how DID WE GIT HERE

  10. 1809:
    PERL INVENTED

  11. 1814:
    COMPUTERS INVENTED

  12. 1814-2004:
    ANARCHY AND CHAOS AND
    ZOMG EVERYONE’S DYING

  13. 2005:
    git
    VERSION CONTROL INVENTED

  14. 2007:
    github
    GLOBAL PEACE AND
    HAPPINESS ACHIEVED

  15. ...or something like that

  16. TOM PRESTON-WERNER
    OCTOBER 9, 2007
    GRIT
    git via ruby

  17. GRIT
    git via ruby
    github’s interface to git
    object-oriented, read/write
    open source

  18. grit
    repo = Grit::Repo.new('/tmp/repository')
    repo.commits

  19. grit
    shelling out to git is expensive
    grit reimplements portions of git in ruby
    native packfile and git object support
    2x-100x speedup on low-level operations
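
To make "native git object support" concrete, here is a minimal sketch of how a loose git object can be written and read with nothing but the Ruby standard library, so no `git` subprocess is involved. The helper names are hypothetical and this is not Grit's code; it only relies on the documented object format (a `type size\0` header followed by the content, hashed with SHA-1 and stored zlib-deflated).

```ruby
require 'digest/sha1'
require 'zlib'

# Hypothetical helpers for illustration; this is not Grit's actual code.

# Raw form that git hashes and stores: "<type> <byte size>\0<content>".
def raw_git_object(type, content)
  "#{type} #{content.bytesize}\0#{content}"
end

# The object id is the SHA-1 of the raw form.
def git_object_id(type, content)
  Digest::SHA1.hexdigest(raw_git_object(type, content))
end

# A loose object on disk is the raw form, zlib-deflated.
def encode_loose_object(type, content)
  Zlib::Deflate.deflate(raw_git_object(type, content))
end

# Reading it back: inflate, then split the header from the content.
def decode_loose_object(bytes)
  header, content = Zlib::Inflate.inflate(bytes).split("\0", 2)
  type, size = header.split(' ')
  raise 'size mismatch' unless size.to_i == content.bytesize
  [type, content]
end

stored = encode_loose_object('blob', "hello\n")
type, content = decode_loose_object(stored)
puts git_object_id('blob', "hello\n")
puts "#{type}: #{content.inspect}"
```

Avoiding a process spawn per lookup is the kind of saving the slide attributes to Grit; packfiles need considerably more code than this and are left out.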

  20. grit
    slowly reimplement grit for speed
    allows for incremental improvements

  21. grit LED TO GITHUB
    OCTOBER 19, 2007

  22. TODAY
    ADDING 2TB A MONTH
    22 FILESERVER PAIRS
    23TB OF REPO DATA

  23. THE FOUR STAGES of GITHUB GROWTH

  24. GITHUB: FOUR STAGES OF GROWTH
    LOCAL NETWORKED NET-SHARD GITRPC

  25. GITHUB: FOUR STAGES OF GROWTH
    LOCAL NETWORKED NET-SHARD GITRPC
    2008 2009 2010 2012

  26. GITHUB: FOUR STAGES OF GROWTH
    LOCAL NETWORKED NET-SHARD GITRPC

  27. GITHUB: FOUR STAGES OF GROWTH
    JAN 2008 DEC 2008
    42,000 USERS

  28. GITHUB: FOUR STAGES OF GROWTH
    JAN 2008 DEC 2008
    80,000 REPOSITORIES

  29. LOCAL
    MULTI-VM
    SHARED GFS MOUNT

  30. LOCAL
    MULTI-VM
    WEB FRONTENDS
    BACKGROUND WORKERS

  31. LOCAL
    MULTI-VM
    SIMPLE ARCHITECTURE
    HORIZONTALLY SCALABLE-ish

  32. LOCAL
    SHARED GFS MOUNT
    SHARED MOUNT ON EACH VM
    SIMILAR PRODUCTION + DEVELOPMENT ACCESS
    ALLOWED LOCAL ACCESS VIA GRIT

  33. LOCAL
    SIMPLE APPROACH, COMMON GIT INTERFACE, QUICK TO BUILD AND SHIP

  34. GITHUB: FOUR STAGES OF GROWTH
    LOCAL NETWORKED NET-SHARD GITRPC

  35. GITHUB: FOUR STAGES OF GROWTH
    2008 2009 2010
    166,000 USERS

  36. GITHUB: FOUR STAGES OF GROWTH
    2008 2009 2010
    484,000 REPOSITORIES

  37. the problem: GFS is slow
    performance degraded as repos added

  38. the problem: we’re i/o-bound
    read/write to disk needs to be fast

  39. THE PLAN
    NETWORKED
    HARDWARE
    MOVE DATACENTERS

  40. NETWORKED
    HARDWARE
    bare metal servers
    16 machines
    6x RAM
    machine roles
    solid datacenter
    got dat cloud

  41. NETWORKED
    LAUNCH:
    FRONTENDS FILESERVERS AUX DB
    SERVER PAIRS

  42. NETWORKED
    GRIT IS LOCAL
    NEEDS TO BE NETWORKED

  43. NETWORKED
    smoke service is run on each fs; facilitates disk access
    chimney routes the smoke, stores routing table in redis
    stub local grit calls, retain API usage, but send over network
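
The split described on this slide can be sketched roughly as follows. Every class name and data value below is hypothetical: a plain Hash stands in for the Redis routing table, an in-process object stands in for the smoke service on a fileserver, and the "network" hop is just a method call. The point is only that the caller still writes `repo.commits`, as it did with local Grit.

```ruby
# Hypothetical sketch; names and data are illustrative, not GitHub's code.

# Stands in for the smoke service running on one fileserver.
class FakeSmokeService
  def initialize(repos)
    @repos = repos # repository path => array of commit ids
  end

  def call(path, op, *_args)
    case op
    when :commits then @repos.fetch(path)
    else raise NoMethodError, "unsupported remote call: #{op}"
    end
  end
end

# Stands in for chimney: finds the fileserver that holds a repository.
class Router
  def initialize(routes, services)
    @routes = routes     # repository path => host name
    @services = services # host name => service object
  end

  def service_for(path)
    @services.fetch(@routes.fetch(path))
  end
end

# Keeps a Grit-like surface but forwards each call to the routed service.
class RemoteRepo
  def initialize(path, router)
    @path = path
    @router = router
  end

  def commits
    @router.service_for(@path).call(@path, :commits)
  end
end

services = { 'fs5' => FakeSmokeService.new({ 'a/1.git' => %w[c1 c2] }) }
router = Router.new({ 'a/1.git' => 'fs5' }, services)
repo = RemoteRepo.new('a/1.git', router)
puts repo.commits.inspect
```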

  44. NETWORKED
    server pairs offer failover via DRBD
    real servers, real big RAM allocations

  45. NETWORKED
    LATENCY
    networked routing adds 2-10ms per request
    optimize for the roundtrip
    smoke contains smarter server-side logic

  46. NETWORKED
    LATENCY
    smoke has custom git extension commands
    git-distinct-commits
    returns commits only contained on a given branch
    calls to git-show-ref and git-rev-list
    run all calls server-side in one roundtrip
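
A rough idea of what a "distinct commits" query computes, using a hypothetical in-memory commit graph rather than a real repository: walk everything reachable from the branch tip, then subtract everything reachable from the other refs. This is approximately what `git rev-list <branch> --not <other refs>` returns; the slides do not show how git-distinct-commits is actually implemented.

```ruby
require 'set'

# Hypothetical in-memory model: commit id => parent ids, ref name => tip commit.
COMMITS = {
  'a' => [],
  'b' => ['a'],
  'c' => ['b'], # tip of master
  'd' => ['b'],
  'e' => ['d']  # tip of topic
}.freeze
REFS = { 'master' => 'c', 'topic' => 'e' }.freeze

# Every commit reachable from the given tips.
def reachable(tips, commits)
  seen = Set.new
  stack = tips.dup
  until stack.empty?
    id = stack.pop
    next unless seen.add?(id) # add? returns nil when id was already visited
    stack.concat(commits.fetch(id))
  end
  seen
end

# Commits on `branch` that no other ref can reach. Doing both walks in one
# call is what lets a client pay for a single roundtrip.
def distinct_commits(branch, refs, commits)
  others = refs.reject { |name, _tip| name == branch }.values
  (reachable([refs.fetch(branch)], commits) - reachable(others, commits)).to_a.sort
end

puts distinct_commits('topic', REFS, COMMITS).inspect
```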

  47. NETWORKED
    HORIZONTALLY-SCALABLE, LATENCY-CONSIDERATE, API-COMPATIBLE WITH GRIT

  48. GITHUB: FOUR STAGES OF GROWTH
    LOCAL NETWORKED NET-SHARD GITRPC

  49. GITHUB: FOUR STAGES OF GROWTH
    2008 2009 2010 2011
    510,000 USERS

  50. GITHUB: FOUR STAGES OF GROWTH
    2008 2009 2010 2011
    1.3MM REPOSITORIES

  51. the problem: data duplication
    each fork is a full project history

  52. data duplication
    i create a repo
    fs5:/data/repositories/6/nw/6b/de/92/1/1.git
    you fork my repo
    fs7:/data/repositories/4/na/3b/dr/72/2/2.git

  53. data duplication
    1,000 commits: 10MB
    1,001 commits: 10MB
    } 20MB total disk

  54. data duplication
    GOAL:
    1,000 commits: 10MB
    1 commit: 1KB
    } 10MB total disk

  55. data duplication
    75 MB repo x 3.5k forks
    ~250 GB
    x 2 fs pairs + offsite backups
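
The arithmetic on slides 53 to 55 can be checked with a few lines. The figures are the ones on the slides; decimal units (1 GB = 1000 MB, 1 MB = 1000 KB) are an assumption, and the method names are made up for this sketch.

```ruby
# Disk used when every fork is stored as a full copy of the repository.
def full_copies_gb(repo_mb, forks)
  repo_mb * forks / 1000.0
end

# Two full copies of the same history (upstream plus one fork), in KB.
def naive_fork_kb(upstream_mb)
  2 * upstream_mb * 1000
end

# Shared history stored once, plus only the fork's own new data, in KB.
def shared_fork_kb(upstream_mb, fork_only_kb)
  upstream_mb * 1000 + fork_only_kb
end

puts format('%.1f GB for 3,500 full copies of a 75 MB repo', full_copies_gb(75, 3_500))
puts format('%d KB duplicated vs %d KB shared', naive_fork_kb(10), shared_fork_kb(10, 1))
```

With these units the 3.5k forks come to 262.5 GB, in line with the slide's "~250 GB", and that is before multiplying by the fileserver pair and offsite backups.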

  56. NET-SHARD
    shard by repository network
    (“forks”)

  57. NET-SHARD
    network.git: CONTAINS ALL REFS
    1.git, 2.git, 3.git, 4.git: CONTAINS DELTA

  58. NET-SHARD
    network.git
    GIT ALTERNATES
    store git object data externally to repository
    we fetch refs into your fork, transparently
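
Git's alternates mechanism is a plain text file, `objects/info/alternates`, listing extra object directories that git consults when an object is missing locally. A minimal sketch of wiring a fork to a shared store, assuming bare repositories named as on the slides and a temporary directory in place of the real fileserver layout (the method name is hypothetical):

```ruby
require 'fileutils'
require 'tmpdir'

# Point a fork's object store at a shared one by writing
# objects/info/alternates. Paths and repository names are illustrative.
def link_to_network(fork_git_dir, network_git_dir)
  info_dir = File.join(fork_git_dir, 'objects', 'info')
  FileUtils.mkdir_p(info_dir)
  alternates = File.join(info_dir, 'alternates')
  shared_objects = File.join(File.expand_path(network_git_dir), 'objects')
  File.write(alternates, shared_objects + "\n")
  alternates
end

Dir.mktmpdir do |root|
  network_dir = File.join(root, 'network.git')
  fork_dir = File.join(root, '1.git')
  FileUtils.mkdir_p(File.join(network_dir, 'objects'))
  puts File.read(link_to_network(fork_dir, network_dir))
end
```

An absolute path is used here for simplicity; git also accepts paths relative to the fork's own objects directory.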

  59. NET-SHARD
    network.git
    PRIVACY
    potential leaking of refs cross-network
    net-shard enabled on all-public and all-private repository networks only
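
The eligibility rule on this slide amounts to a one-line check. The `Repo` struct and the method name are hypothetical; the slides do not show GitHub's models.

```ruby
# Hypothetical record type; GitHub's real models are not shown in the slides.
Repo = Struct.new(:name, :private)

# A network qualifies only when every repository in it has the same
# visibility, so objects from a private fork never sit next to public ones.
def net_shard_eligible?(network)
  network.map(&:private).uniq.size <= 1
end

all_public = [Repo.new('1.git', false), Repo.new('2.git', false)]
mixed = [Repo.new('1.git', false), Repo.new('3.git', true)]
puts net_shard_eligible?(all_public)
puts net_shard_eligible?(mixed)
```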

  60. NET-SHARD
    network.git
    DISK
    halves disk usage
    increase disk and kernel cache hits

  61. NET-SHARD
    network.git
    MIGRATION
    gradually transitioned repos to network.git
    effectively feature-flagged by repo

  62. NET-SHARD
    SAVE DISK, IMPROVE PERFORMANCE

  63. GITHUB: FOUR STAGES OF GROWTH
    LOCAL NETWORKED NET-SHARD GITRPC

  64. GITHUB: FOUR STAGES OF GROWTH
    2008 2009 2010 2011 2012
    1.2MM USERS

  65. GITHUB: FOUR STAGES OF GROWTH
    2008 2009 2010 2011 2012 AUGUST
    1.9MM USERS

  66. GITHUB: FOUR STAGES OF GROWTH
    2008 2009 2010 2011 2012
    3.4MM REPOSITORIES

  67. GITHUB: FOUR STAGES OF GROWTH
    2008 2009 2010 2011 2012 AUGUST
    6.5MM REPOSITORIES

  68. the problem:
    GRIT
    git via ruby

  69. the problem:
    local, ruby-based grit ended up
    in a high-traffic distributed system

  70. the problem:
    inelegant code spread out everywhere

  71. GITRPC
    GitRPC
    network-oriented library for git access

  72. GITRPC
    open source
    fastest git implementation (C)
    github-sponsored project
    bindings for all major languages
    used in our mac, windows clients

  73. GITRPC
    gitrpc (RUBY)
    rugged (RUBY)
    libgit2 (C)

  74. GITRPC
    LATENCY
    like smoke, gitrpc aims to
    reduce latency by reducing roundtrips

  75. GITRPC
    CACHING
    operations cached on library level
    yank out tons of app-level cache logic
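
One way to picture "cached on library level": the client wraps every remote call in a cache lookup, so application code never has to manage cache keys itself. This is a hypothetical sketch, not the GitRPC API; the class name, the `call(repo, op, *args)` shape and the Hash used as a cache store are all assumptions.

```ruby
# Hypothetical sketch of library-level caching; not the GitRPC API.
class CachingClient
  attr_reader :backend_calls

  def initialize(backend, cache = {})
    @backend = backend # anything responding to call(repo, op, *args)
    @cache = cache     # a Hash stands in for whatever cache store is used
    @backend_calls = 0
  end

  # Identical requests are answered from the cache; callers see no difference.
  def call(repo, op, *args)
    key = [repo, op, args]
    @cache.fetch(key) do
      @backend_calls += 1
      @cache[key] = @backend.call(repo, op, *args)
    end
  end
end

backend = ->(repo, op, *args) { "#{op} on #{repo} #{args.inspect}" }
client = CachingClient.new(backend)
2.times { client.call('1.git', :read_commit, 'abc123') }
puts client.backend_calls
```

Objects addressed by their hash never change, which makes them easy to cache this way; anything that reads mutable refs would need an invalidation rule that this sketch omits.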

  76. GITRPC
    MIGRATION
    the move to gitrpc started this summer and will take months
    gradually replace smoke and grit; avoids a risky deploy

  77. GITRPC
    FAST AND STABLE NETWORKED GIT ACCESS

  78. GITHUB: FOUR STAGES OF GROWTH
    LOCAL NETWORKED NET-SHARD GITRPC

  79. identify WHAT’S BROKEN

  80. small CHANGES, FAST DEVELOPMENT

  81. real CODE BEATS
    IMAGINARY CODE

  82. automate EVERYTHING

  83. m . manage
    LOL DEVELOPERS
    SOFTWARE
    DEVELOPMENT

  84. m . manage
    DEADLINES
    MEETINGS
    PRIORITIES
    ESTIMATES

  85. m . manage
    DEADLINES
    MEETINGS
    PRIORITIES
    ESTIMATES

  86. EVERYONE is A MANAGER

  87. AUTOMATE AWAY PAIN
    DEVELOPMENT DEPLOYMENT RECOVERY

  88. automate DEVELOPMENT

  89. DEVELOPMENT
    RUN THIS IN EACH PROJECT:
    > ./do-work
    ...AND YOU’RE DONE!
    loljk

  90. DEVELOPMENT
    YOU CAN AUTOMATE THE PAIN OF DEVELOPMENT

  91. DEVELOPMENT
    the SETUP

  92. DEVELOPMENT
    the SETUP
    ONE-LINER INSTALLS ALL
    GITHUB DEVELOPMENT
    DEPENDENCIES

  93. DEVELOPMENT
    the SETUP
    30 min
    CLEAN MACHINE TO
    FULL DEVELOPMENT
    ENVIRONMENT

  94. DEVELOPMENT
    the SETUP
    NEW EMPLOYEES
    SHIP
    THEIR FIRST WEEK

  95. DEVELOPMENT
    the SETUP
    PUPPET
    HANDLES ALL DEPENDENCIES

  96. automate DEPLOYMENT

  97. DEPLOYMENT
    REAL BROGRAMMERS
    DEPLOY WITH
    NO FEAR
    SO FUCK THAT

  98. DEPLOYMENT
    DEPLOYS SHOULD BE CAUTIOUS,
    COMMONPLACE, AND AUTOMATED

  99. DEPLOYMENT
    GITHUB DEPLOYS 20-40 TIMES A DAY

  100. DEPLOYMENT
    PUSH BRANCH
    USUALLY OPEN A PULL REQUEST
    HUBOT RUNS TESTS
    IN ABOUT 200 SECONDS
    DEPLOY BRANCH
    EVERYWHERE · MACHINE CLASS · SPECIFIC SERVERS

  101. DEPLOYMENT
    DEPLOY LOCKING
    CAN’T DEPLOY IF A BRANCH IS DEPLOYED
    AUTODEPLOYS
    PUSHED TO MASTER WITH GREEN TESTS? DEPLOY.
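
The two rules on this slide can be modelled in a few lines. This is a hypothetical sketch, not Hubot's code: the class name is invented, and two behaviours are assumptions the slides do not confirm, namely that the lock is released by putting master back and that autodeploys also respect the lock.

```ruby
# Hypothetical model of deploy locking and autodeploys; not Hubot's real code.
class DeployGate
  attr_reader :deployed

  def initialize
    @deployed = 'master'
  end

  # Deploy locking: a non-master branch in production holds the lock.
  def locked?
    @deployed != 'master'
  end

  # Refuse any branch other than the one currently holding the lock.
  def deploy(branch)
    return false if locked? && branch != @deployed
    @deployed = branch
    true
  end

  # Assumption: the lock is released by putting master back.
  def release
    @deployed = 'master'
  end

  # Autodeploy: a push to master goes out only when the tests are green
  # (and, by assumption, only when no branch holds the lock).
  def on_master_push(tests_green)
    tests_green ? deploy('master') : false
  end
end

gate = DeployGate.new
puts gate.deploy('topic-a')       # topic-a takes the lock
puts gate.deploy('topic-b')       # refused while topic-a is deployed
gate.release
puts gate.on_master_push(true)
```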

  102. DEPLOYMENT
    STAFF-ONLY FEATURE FLAGS
    LIMITS EXPOSURE · REAL-WORLD · AVOIDS MERGES
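
A staff-only feature flag can be as small as the sketch below: the code ships to production, but only staff accounts take the new path. The `User` struct, the class name and the flag name are made up for illustration; the slides do not show GitHub's implementation.

```ruby
# Hypothetical sketch; the flag name and User fields are illustrative only.
User = Struct.new(:login, :staff)

class FeatureFlags
  def initialize
    @staff_only = {}
  end

  def enable_for_staff(name)
    @staff_only[name] = true
  end

  # True only when the flag is on and the user is a staff member.
  def enabled?(name, user)
    @staff_only.key?(name) && user.staff == true
  end
end

flags = FeatureFlags.new
flags.enable_for_staff(:new_network_graph)
puts flags.enabled?(:new_network_graph, User.new('staffer', true))
puts flags.enabled?(:new_network_graph, User.new('visitor', false))
```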

  103. automate RECOVERY

  104. RECOVERY
    SOMETHING WILL ALWAYS BREAK

  105. RECOVERY
    HUBOT
    IS A SYSADMIN

  106. RECOVERY
    HUBOT LOAD: SERVER LOAD
    HUBOT QUERIES: RUNNING DB QUERIES
    HUBOT CONNS: ALL OPEN CONNECTIONS

  107. RECOVERY
    HUBOT RESTORE: RESTORE A REPO FROM BACKUPS
    HUBOT PUSH-LOG: SEE RECENT PUSH LOGS TO A REPO
    HUBOT GH-EACH: RUN COMMAND ON SPECIFIC HOSTS

  108. RECOVERY
    HIGH-LEVEL OVERVIEW IN MINUTES
    SPEND MORE TIME FIXING AND LESS TIME INVESTIGATING

  109. the happiness

  110. 5 YEARS
    108 EMPLOYEES
    ZERO EMPLOYEES HAVE QUIT

  111. HIRE: 1-2 MONTHS
    RAMP-UP: 1-3 MONTHS
    LEAVE: 2 WEEKS

  112. LOSING AN EMPLOYEE CAN
    SET YOU BACK HALF A YEAR

  113. remove ANY REASON TO LEAVE

  114. NONE OF THESE
    TDD ✓
    PAIR PROGRAMMING
    BDD
    TEST-FIRST
    DESIGN-FIRST
    EMACS x
    (just kidding)

  115. WE CARE ABOUT
    THE WORK
    YOU DO, NOT ABOUT
    HOW YOU DO IT

  116. LOCATION
    HOURS
    DIRECTION

  117. LOCATION
    HOURS
    DIRECTION
    GITHUB EMPLOYEES
    WORK REMOTELY

  118. LOCATION
    HOURS
    DIRECTION
    FAMILY RELOCATION,
    TRAVEL FREEDOM

  119. LOCATION
    HOURS
    DIRECTION
    CHOOSE YOUR SCHEDULE
    CHOOSE YOUR VACATIONS
    FRESH, CREATIVE EMPLOYEES

  120. LOCATION
    HOURS
    DIRECTION
    YOU
    HACK ON THINGS
    THAT INTEREST YOU
    REDUCES BURNOUT

  121. LOCATION
    HOURS
    DIRECTION
    BE flexible
    TOWARDS WORK/LIFE

  122. basically,
    MOVE FAST =
    SMALL CHANGES

  123. basically,
    BE STABLE =
    DEPLOY CONSTANTLY

  124. basically,
    HAPPY COMPANY =
    HAPPY EMPLOYEES

  125. @HOLMAN
    NO FORKING
    LOST
    YO QUIT READING THIS SHIT
    ZACHHOLMAN.COM/TALKS