- sharding/routing

Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web

[Diagram: node 1, node 2, node 3, node 4]
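To make the sharding idea concrete, here is a minimal sketch of a consistent hash ring over the four nodes in the diagram. It is an illustration only, not Suuchi's implementation: the class name `ConsistentHashRing`, the `vnodes` parameter, and the use of MD5 as the hash are all assumptions made for this example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable 64-bit hash. Python's built-in hash() is salted per process,
    # so it cannot be used for placement that must agree across nodes.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    """Hypothetical ring: each node owns several points on a 64-bit circle."""

    def __init__(self, nodes, vnodes=64):
        self._vnodes = vnodes   # virtual points per node (assumed default)
        self._points = []       # sorted ring positions
        self._owner = {}        # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self._vnodes):
            point = _hash(f"{node}#{i}")
            if point in self._owner:
                continue        # skip the (very unlikely) position collision
            bisect.insort(self._points, point)
            self._owner[point] = node

    def remove(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def find(self, key):
        if not self._points:
            raise LookupError("ring is empty")
        # First point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


if __name__ == "__main__":
    ring = ConsistentHashRing(["node 1", "node 2", "node 3", "node 4"])
    print(ring.find("http://example.com/page"))
```

The property worth noticing is that removing a node only reassigns the keys that node owned; every other key keeps its owner, which is what keeps rebalancing cheap when the cluster changes.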
heavy system
- Stores ~120TB of URL- and timestamp-indexed HTML pages
- Stats (as Monoids) Storage System*
  - All we want is approximate aggregates in real time
- Real-time scheduler for our crawlers*
  - Finds out which 20 URLs to crawl now, out of 3+ billion URLs
  - Helps the crawlers crawl 20+ million URLs every day

suuchi @indix
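The "stats as monoids" idea can be sketched as follows: an aggregate with an identity value and an associative `combine` can be computed per shard and merged in any grouping. This is a hedged illustration, not the storage system's actual code. The `Stats` class and `merge_all` helper are hypothetical names, and the exact count/sum/min/max fields stand in for whatever approximate structures the real system keeps.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Stats:
    """Hypothetical aggregate; Stats() is the monoid identity."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    @staticmethod
    def of(value):
        # Lift a single observation into the aggregate.
        return Stats(1, value, value, value)

    def combine(self, other):
        # Associative merge: grouping of partial results does not matter.
        return Stats(
            self.count + other.count,
            self.total + other.total,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )


def merge_all(parts):
    # Fold any number of partial aggregates, starting from the identity.
    return reduce(Stats.combine, parts, Stats())


if __name__ == "__main__":
    shard_a = merge_all(Stats.of(v) for v in [3, 5])
    shard_b = merge_all(Stats.of(v) for v in [10])
    print(merge_all([shard_a, shard_b]))
```

Because the merge is associative, each node can keep a running partial aggregate and a reader can combine partials on demand, without a coordinated pass over the raw data.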
contact
- Each node handles or forwards requests transparently
- Uses pluggable partitioner scheme
- Can be customized as weighted distribution / Rendezvous Hash etc.

primitives
- request routing