system HDFS Velocity Volume Variety In memory NewSQL / domain specific VoltDB Traditional [SQL] RDBMS MongoDB Distributed DBMS Cassandra HBase CAP theorem: • Consistency • Availability • Partition tolerance - pick two Eventual consistency Transactional? • No Choose the DB
easy to use • ~zero routine ops • it works (!!) as promised: o real-time replication o node/site failure recovery o zero load writes o double of nodes = double of speed
a platform to run custom code (as MongoDB); o not an extension (as HBase); o highly optimized No-master, eventually consistent NoSQL Data model - Key-Value http://cassandra.apache.org Apache Cassandra
in 2008 In use: • Netflix - main non-content data store~500 Cassandra nodes (2012) • eBay - recommendation system"dozens of nodes", 200 TB storage (2012) • Twitter - tweet analysis100 + TB of data • More clients: (http://www.datastax.com/cassandrausers) History
January 2013 2.0 - expected this summer (2013) June 26 2013 - 158 bugs, 89 worth to notice Sperasoft Experience: • hit 1 bug in production (stability issue) • hit 1 bug in QA (in a crafted case) Mature & Agile
Apache Thrift Custom protocol Synchronous Asynchronous Schema-less Static schema Store & Forward Cursors promised in 2.0 API for any language Java; Python, C# coming Cryptic API JDBC-like API Supported yet Going forward
- shape data for queries • No joins - materialized views • Data duplication - OK • Remember eventual consistency • Queries are precious • Use right data types - timestamp, uuid Why? Because NoSQL is a low level tool for high optimization. Data Modeling for NoSQL
Y 1 Z 1 1 A Y Z 1 a b c d UPDATE ... SET b = 'Y' WHERE id = 1 INSERT INTO ... SET (id, c) values (1, 'Z') DELETE d FROM ... WHERE id = 1 SELECT * FROM ... WHERE id = 1 have to fetch 4 rows slo-o-ow Plan Data Immutable
count(*) FROM ... WHERE .... ; Full scan over the selection => Default 10'000 rows limit => wrong count Have an integer column and increment it CREATE TABLE count_table ( id uuid, value counter, PRIMARY KEY (id) ); ... UPDATE count_table SET value = value + 1 WHERE id = ... ; Counter column family http://www.datastax.com/documentation/cassandr a/1.2/cassandra/cql_using/use_counter_t.html slo-o-ow mess mess How Many?
(id) ); id chunk_no data CREATE TABLE blob ( id uuid, chunk_no int, data blob, PRIMARY KEY (id, chunk_no) ); id data http://wiki.apache.org/cassandra/FAQ#large_file_and_blob_storage http://wiki.apache.org/cassandra/CassandraLimitations OutOfMemory Blobs