MapReduce and Columnar DB's

samant
April 02, 2014

Transcript

  1. MapReduce - Definition
     • One of Google’s greatest contributions to computer science
     • MapReduce is an algorithmic framework for executing jobs in parallel over several nodes
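
The definition above can be illustrated with a minimal single-process sketch of the map, shuffle, and reduce steps, using word counting as the job. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this illustration; on a real cluster each map call and each reduce call would run on a different node.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # Each map_phase call is independent, which is what makes the job parallelizable.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["big data big cluster", "data node"])
print(counts)
```

Because no map call depends on another, the input can be split across as many nodes as are available; only the shuffle step needs to move data between them.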
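  2. MapReduce - Major Implementation
     • The most widely used open-source implementation is Apache Hadoop, a framework for the distributed storage and processing of large data sets, maintained by the Apache Software Foundation
     • Hadoop was itself inspired by Google’s MapReduce and Google File System papers (Google’s BigTable inspired HBase, the columnar database that runs on top of Hadoop)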
  3. Columnar DBs - Definition
     Columnar databases are so named because the defining aspect of their design is that data from a given column is stored together. (By contrast, a row-oriented database keeps all the information about a row together.) In column-oriented databases, adding columns is inexpensive.
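
As a rough illustration of the two layouts, the sketch below pivots a few row records into per-column lists. The sample data and the `to_columns` helper are invented for this example, and it assumes every row has the same fields; a real columnar database stores columns on disk, not in Python dictionaries.

```python
# Row-oriented layout: everything about one row is kept together.
rows = [
    {"id": 1, "name": "Ann", "age": 34},
    {"id": 2, "name": "Bob", "age": 27},
    {"id": 3, "name": "Cy", "age": 41},
]

def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

# Column-oriented layout: all values of one column are kept together.
columns = to_columns(rows)

# Aggregating one column reads only that column's values.
average_age = sum(columns["age"]) / len(columns["age"])

# Adding a column is one new entry; the existing columns are untouched.
columns["country"] = ["NL", "US", "IN"]

print(average_age)
print(sorted(columns))
```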
  4. Columnar DBs - Queries
     get 't1', 'r1', {COLUMN => 'c1'}
     get 't1', 'r1', {COLUMN => ['c1', 'c2', 'c3']}
     get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}
     get 't1', 'r1', {COLUMN => 'c1', TIMERANGE => [ts1, ts2], VERSIONS => 4}
     get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1, VERSIONS => 4}
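
To make the timestamp and version options in these queries concrete, here is a toy in-memory model of a versioned cell store. `ToyTable` and its methods are hypothetical names for this sketch and are not the API of any real database; treating the time range as half-open (start included, end excluded) is also an assumption of the sketch.

```python
class ToyTable:
    """In-memory stand-in for a versioned column store (illustration only)."""

    def __init__(self):
        # (row, column) -> {timestamp: value}
        self._cells = {}

    def put(self, row, column, value, timestamp):
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None, timerange=None, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        cell = self._cells.get((row, column), {})
        stamps = sorted(cell, reverse=True)
        if timestamp is not None:
            stamps = [ts for ts in stamps if ts == timestamp]
        if timerange is not None:
            start, end = timerange  # assumed half-open: start <= ts < end
            stamps = [ts for ts in stamps if start <= ts < end]
        return [(ts, cell[ts]) for ts in stamps[:versions]]

table = ToyTable()
for ts, value in [(10, "a"), (20, "b"), (30, "c")]:
    table.put("r1", "c1", value, timestamp=ts)

latest = table.get("r1", "c1")                                  # newest version only
at_20 = table.get("r1", "c1", timestamp=20)                     # one exact version
ranged = table.get("r1", "c1", timerange=(10, 30), versions=4)  # several versions
print(latest, at_20, ranged)
```

Every write keeps its timestamp instead of overwriting the previous value, so reading an older version is just a filter on the stored timestamps.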
  5. Columnar DBs - Supporting Companies
     • Facebook
     • Yahoo
     • eBay
     • Twitter
     • Amazon
     • Google
     • ...
  6. Columnar DBs - Pros
     • Horizontal scalability (replication and partitioning)
     • Versioning is trivial
     • No real storage cost for null values
     • Used mainly for Big Data / data mining / Business Intelligence analysis
  7. Columnar DBs - Cons
     • Complexity (installation, infrastructure and usage)
     • The schema must be designed around how you plan to query the data
     • Some operations are very time-consuming
  8. Facebook Messaging Index Table

                     Keyword #1             Keyword #2             Keyword #3             Keyword #...
     User ID #1      Timestamp, Message_id  Timestamp, Message_id  Timestamp, Message_id  Timestamp, Message_id
     User ID #2      Timestamp, Message_id  Timestamp, Message_id  Timestamp, Message_id  Timestamp, Message_id
     User ID #...    Timestamp, Message_id  Timestamp, Message_id  Timestamp, Message_id  Timestamp, Message_id
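
The index table above (one row per user, one column per keyword, each cell holding a timestamp and a message id) can be mimicked with nested dictionaries. This is only a sketch of the table's shape: `index_message` and `search` are hypothetical helpers, the whitespace tokenization is a simplification, and none of this reflects Facebook's actual code.

```python
from collections import defaultdict

# user_id -> keyword -> list of (timestamp, message_id)
index = defaultdict(lambda: defaultdict(list))

def index_message(user_id, message_id, timestamp, text):
    """Record the message under every distinct keyword it contains."""
    for keyword in set(text.lower().split()):
        index[user_id][keyword].append((timestamp, message_id))

def search(user_id, keyword):
    """Return the ids of one user's messages containing keyword, newest first."""
    # .get avoids creating empty entries for unknown users or keywords.
    hits = index.get(user_id, {}).get(keyword.lower(), [])
    return [message_id for _, message_id in sorted(hits, reverse=True)]

index_message("u1", "m1", 100, "lunch tomorrow")
index_message("u1", "m2", 200, "Lunch moved")

found = search("u1", "lunch")
nothing = search("u2", "lunch")
print(found, nothing)
```

The lookup goes straight from user to keyword to messages, which is the query the table was shaped for; a question the layout was not designed around, such as "which users mentioned this keyword", would require scanning every row.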
  9. References
     Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement, by Eric Redmond and Jim R. Wilson