The Code Archive - HOPE XI

The Code Archive Clone ALL the code.

@FiloSottile Filippo Valsorda @saljam_ Salman Aljammaz

GitHub alone has 34,000,000 repositories and 14,000,000 users

Go
import (
    "github.com/miekg/dns"
    "golang.org/x/exp/io/i2c"
)

Repositories get deleted.

Branches get rebased.

Histories get rewritten.

Services go down.

Services disappear.

We <3 GitHub!!

We want a Wayback Machine for code

And we built it!
There are 390,123 snapshots of 91,396 repositories, totalling 1.025 terabytes.
As of 2016-07-22

The prototype
• GitHub only
• Active repositories (fetched on push)
• Popular repositories (at least 10 ★)
• Reasonable repository size

Architecture
[Diagram: components Drinker, Fetcher, Pack blob storage, Frontend; labels GH API, Queue, git pull]

The Drinker Drink the GitHub Firehose!

The Drinker
• Monitor firehose for push, create, open source events
• Queue repositories
• Filter by number of stars
• GitHub API rate limit: 5K / hour
• Drink from https://www.githubarchive.org/
• Cache number of stars, update via events
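
The filtering described on this slide can be sketched roughly as follows. This is a minimal illustration, not the talk's actual code: the `Drinker` class and its methods are hypothetical names, events are assumed to be dicts shaped like GitHub's public event JSON (`type` and `repo.name`), and it assumes "open source events" corresponds to `PublicEvent` and that stars arrive as `WatchEvent`. The 10-star threshold comes from the prototype slide.

```python
from collections import deque

MIN_STARS = 10  # "at least 10 ★" from the prototype slide
# Assumption: push, create and "open source" map to these GitHub event types.
QUEUE_EVENTS = {"PushEvent", "CreateEvent", "PublicEvent"}


class Drinker:
    """Hypothetical sketch: keep a local star cache and queue popular repos."""

    def __init__(self, star_cache=None):
        self.stars = dict(star_cache or {})  # repo name -> cached star count
        self.queue = deque()                 # repositories waiting to be fetched
        self._queued = set()                 # avoids queueing a repo twice

    def handle(self, event):
        repo = event["repo"]["name"]
        kind = event["type"]
        if kind == "WatchEvent":
            # A star event updates the cache, so no API call is spent on it.
            self.stars[repo] = self.stars.get(repo, 0) + 1
            return
        if kind not in QUEUE_EVENTS:
            return
        if self.stars.get(repo, 0) >= MIN_STARS and repo not in self._queued:
            self._queued.add(repo)
            self.queue.append(repo)
```

Updating the cache from the event stream itself is what keeps the sketch inside a 5K requests/hour budget: the API would only be needed to seed counts for repositories not seen before.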

Cache size: 7 million

https://github.com/google/go-github/pull/317

The Fetcher Just fetch the repos! But fetch to what?

Server → client
00df867387cf1373910c60c78cab81085cb846fadfdb HEAD□[...]
003f867387cf1373910c60c78cab81085cb846fadfdb refs/heads/master
003f236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c refs/tags/v1.2.3
0000

Client → server
003cwant 867387cf1373910c60c78cab81085cb846fadfdb [...]
0032want 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c
0032have 085bb3bcb608e1e8451d4b2432f8ecbe6306e7e7
0009done
0000

Server → client: [packfile]
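
The four hex digits at the start of each line above are git's pkt-line length prefix: the total length of the line, counting the prefix itself, and `0000` is a flush packet. A small sketch of the framing, with helper names (`pkt_line`, `read_pkt_lines`) chosen here for illustration:

```python
FLUSH = b"0000"  # flush packet: marks the end of a section, carries no payload


def pkt_line(payload: bytes) -> bytes:
    """Frame a payload: 4 hex digits of total length (prefix included), then the payload."""
    return b"%04x" % (len(payload) + 4) + payload


def read_pkt_lines(data: bytes):
    """Yield each payload found in a buffer of pkt-lines; a flush packet yields None."""
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 4], 16)
        if length == 0:
            yield None
            pos += 4
            continue
        yield data[pos + 4:pos + length]
        pos += length
```

For example, `want ` plus a 40-character hash and a newline is 46 bytes, so the framed line starts with `0032`, and `done` plus a newline becomes `0009done`.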

2016-05-03 09:55:11 Z
HEAD → 867387cf1373910c60c78cab81085cb846fadfdb
master → 867387cf1373910c60c78cab81085cb846fadfdb
v1.2.3 → 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c

2016-05-03 09:55:11 Z
HEAD → 867387cf1373910c60c78cab81085cb846fadfdb
master → 867387cf1373910c60c78cab81085cb846fadfdb
v1.2.3 → 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c

2016-05-03 10:55:33 Z
HEAD → 5b53898f17dda3d2af6bc599b45b0d7b76f900f0
master → 5b53898f17dda3d2af6bc599b45b0d7b76f900f0
v1.2.3 → 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c
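
One way to read the two snapshots: the stored refs of the earlier snapshot say which commits the archive already has, and the refs the server advertises now say which ones it wants. A minimal sketch under that assumption (the function name `plan_fetch` is hypothetical):

```python
def plan_fetch(previous_refs, advertised_refs):
    """Return (wants, haves) for the next fetch.

    previous_refs:   ref name -> commit hash from the last stored snapshot
    advertised_refs: ref name -> commit hash the server advertises now
    """
    have = set(previous_refs.values())
    want = set(advertised_refs.values()) - have
    return sorted(want), sorted(have)
```

With the 09:55:11 snapshot as `previous_refs` and the 10:55:33 refs as `advertised_refs`, the only want is `5b53898f…`, while `867387cf…` and `236f9b3e…` are sent as haves, so the server only has to pack the new objects.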

Cold cool storage
• Upload is cheap, retrieval and download is expensive
• Keep the refs (branch/tag name to commit) in a live db
• Store the packfiles and never look at them
• The server will do the diffs
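
The split between a live refs database and opaque packfiles could look something like the sketch below. The schema, the function names, and the use of SQLite are all assumptions made for illustration; the slides do not say which database is used. `pack_key` stands for whatever key the packfile was uploaded under in blob storage.

```python
import sqlite3


def open_db():
    """In-memory database holding refs only; packfiles stay in blob storage."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE snapshots (
            id INTEGER PRIMARY KEY, repo TEXT, taken_at TEXT, pack_key TEXT);
        CREATE TABLE refs (
            snapshot_id INTEGER REFERENCES snapshots(id), name TEXT, sha TEXT);
    """)
    return db


def record_snapshot(db, repo, taken_at, refs, pack_key):
    """Store the refs of one fetch and the key of the packfile it produced."""
    cur = db.execute(
        "INSERT INTO snapshots (repo, taken_at, pack_key) VALUES (?, ?, ?)",
        (repo, taken_at, pack_key))
    db.executemany(
        "INSERT INTO refs VALUES (?, ?, ?)",
        [(cur.lastrowid, name, sha) for name, sha in refs.items()])
    return cur.lastrowid


def refs_at(db, repo, when):
    """Refs of the latest snapshot taken at or before `when` (ISO 8601 string)."""
    row = db.execute(
        "SELECT id FROM snapshots WHERE repo = ? AND taken_at <= ? "
        "ORDER BY taken_at DESC LIMIT 1", (repo, when)).fetchone()
    if row is None:
        return {}
    return dict(db.execute(
        "SELECT name, sha FROM refs WHERE snapshot_id = ?", row))
```

Because the haves for the next fetch come from the refs table, nothing ever has to be read back from the stored packfiles during normal operation.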

Forks
• Forks sync from parent
• Parent gets PR from forks
• Send “haves” for the entire network
• Build a packfile dependency tree
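
A rough sketch of what the last two bullets might involve, assuming each stored packfile records which earlier packfiles it was fetched against. The data shapes and both function names are hypothetical, not taken from the talk.

```python
def network_haves(network_refs):
    """Union of every commit already stored for any repository in a fork network.

    network_refs: repository name -> {ref name -> commit hash}
    """
    return sorted({sha for refs in network_refs.values() for sha in refs.values()})


def packs_needed(pack, depends_on):
    """Return `pack` plus everything it transitively depends on, dependencies first.

    depends_on: pack id -> list of pack ids it was fetched against
    """
    ordered, seen = [], set()

    def visit(p):
        if p in seen:
            return
        seen.add(p)
        for dep in depends_on.get(p, ()):
            visit(dep)
        ordered.append(p)

    visit(pack)
    return ordered
```

Sending the whole network's haves means a fork's packfile can be small, at the cost that restoring the fork requires the parent's packfiles as well; the dependency walk is what recovers that set.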

• Run at off-peak hours
• Use the raw git protocol
• Set user agents
• Only fetch diffs, packs have to be used together

Gigantic repositories 9GB repo. WTF.

Disappearing repositories 401 Unauthorized

Disappearing repositories
DMCA
$ git clone git://github.com/rtmpdump/rtmpdump-2.5.git
Cloning into 'rtmpdump-2.5'...
fatal: remote error: Repository unavailable due to DMCA takedown. See the takedown notice for more details: https://github.com/github/dmca/blob/master/2016-07-22-rtmpdump.md.

Last minute crashes
“Have you tried turning it off and on again?”

The Backpanel
Our admin UI! E.g. blacklist of excessively large repos, whitelist of exceptions to that, manually deleted repos
… but building UIs sucks (for us, anyway)

The Backpanel Lazy... But it works!

The Backpanel https://trello.com/b/04pbw4Gv/blacklist

The Frontend
• git clone interface
• Web interface
• Retrieval is expensive
• Outbound bandwidth even more so

Local cache and alternates
.
├── HEAD
├── config
├── objects/
│   └── info/
│       └── alternates
├── packed-refs
└── refs/
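
The layout above is a bare repository that holds no objects of its own: git's `objects/info/alternates` file lists other object directories to search, one path per line, so a snapshot can point at a shared local cache. A sketch of writing such a skeleton follows; the function name and arguments are hypothetical, and annotated tags (which need extra peeled lines in `packed-refs`) are ignored.

```python
import os


def write_skeleton(repo_dir, refs, head_branch, cache_objects_dir):
    """Create a bare repo whose objects all live in `cache_objects_dir`.

    refs: full ref name (e.g. "refs/heads/master") -> commit hash
    """
    os.makedirs(os.path.join(repo_dir, "objects", "info"))
    os.makedirs(os.path.join(repo_dir, "refs"))
    with open(os.path.join(repo_dir, "HEAD"), "w") as f:
        f.write("ref: refs/heads/%s\n" % head_branch)
    with open(os.path.join(repo_dir, "config"), "w") as f:
        f.write("[core]\n\trepositoryformatversion = 0\n\tbare = true\n")
    with open(os.path.join(repo_dir, "packed-refs"), "w") as f:
        # One "<hash> <refname>" line per ref, sorted by name.
        for name in sorted(refs):
            f.write("%s %s\n" % (refs[name], name))
    with open(os.path.join(repo_dir, "objects", "info", "alternates"), "w") as f:
        # git resolves missing objects through the directories listed here.
        f.write(cache_objects_dir + "\n")
```

Only the small text files differ per snapshot, so serving a different date for the same repository means rewriting `packed-refs` and `HEAD` rather than copying objects.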

Clone one or all snapshots
Exactly like it looked at a given time:
$ git clone https://codearchive.org/2016-07-01/github.com/FiloSottile/gvt
All the snapshots at once:
$ git clone https://codearchive.org/all/github.com/FiloSottile/gvt
$ git branch
2016-07-01Z11:44:11/master
2016-07-02Z22:44:00/master
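
The URL scheme and branch naming shown above could be handled along these lines. This is only a sketch: the function names are made up, and the rule that a date selects the latest snapshot taken on or before that day is an assumption, since the slides do not spell out how a date is matched to a snapshot.

```python
def parse_clone_path(path):
    """Split '/<date-or-all>/<host>/<owner>/<name>' into (selector, repository)."""
    selector, _, repo = path.strip("/").partition("/")
    if not selector or repo.count("/") != 2:
        raise ValueError("expected /<date|all>/<host>/<owner>/<name>: %r" % path)
    return selector, repo


def pick_snapshot(timestamps, date):
    """Latest snapshot timestamp on or before `date` (YYYY-MM-DD), or None.

    Timestamps use the form seen in the branch names, e.g. '2016-07-01Z11:44:11',
    which sorts correctly as plain text.
    """
    eligible = [t for t in timestamps if t[:10] <= date]
    return max(eligible) if eligible else None


def all_branches(snapshots):
    """Name every branch of every snapshot '<timestamp>/<branch>'.

    snapshots: snapshot timestamp -> list of branch names
    """
    return sorted("%s/%s" % (ts, branch)
                  for ts, branches in snapshots.items()
                  for branch in branches)
```

Prefixing branch names with the snapshot timestamp lets the `all` clone expose every snapshot in a single repository without name collisions.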

One more step
$ git clone https://codearchive.org/2016-07-01/github.com/FiloSottile/gvt
Welcome to the Code Archive!
Since download bandwidth is expensive, please click here to verify that you are human:
https://codearchive.org/captcha/72f878a9670ab664
The download will start automatically...

Web UI
• Work in progress
• Wayback Machine style slider at the top
• We suck at UIs. PRs welcome!

Things to come

Beyond git and GitHub

Hiding things :(
• Login with GitHub and hide your repositories
• Automated DMCA processing

Long term storage B2 object storage. Sponsored by

Thank you!
https://codearchive.org
Filippo Valsorda - @FiloSottile
Salman Aljammaz - @saljam_
