Overview of storage: NoSQL
dynamic schema, can store unstructured and semi-structured data
Not distributed, hard to scale.
• Scalable: most NoSQL databases are distributed.
• High performance.
at Facebook by Avinash Lakshman (one of the authors of Amazon's Dynamo) and Prashant Malik to power the inbox search feature.
• Facebook released Cassandra as an open-source project on Google Code in July 2008.
• In March 2009 it became an Apache Incubator project.
• On February 17, 2010 it graduated to a top-level project.
• It has since been adopted by many big companies.
multiple data centers.
- Millions of simultaneous users.
- Billions of writes per day.
A user wants to search his or her inbox for messages using one of two strategies:
• Term search – search by keyword.
• Interaction search – search by username.
Detail about Cassandra: Architecture overview
stored.
• Data center − A collection of related nodes.
• Cluster − A component that contains one or more data centers.
• Commit log − The crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
• Mem-table − A memory-resident data structure. After the write is recorded in the commit log, the data is written to the mem-table. Sometimes there are multiple mem-tables for a single column family.
• SSTable − A disk file to which the data is flushed from the mem-table when its contents reach a threshold value.
There are two basic data partitioning strategies:
strategy. Partitioning data as evenly as possible across all nodes using an MD5 hash of every column family row key.
• Ordered partitioning – stores column family row keys in sorted order across the nodes in a database cluster.
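A minimal sketch of the hash-based partitioning described above, assuming each node owns an equal slice of the 128-bit MD5 token space. The names `Ring`, `md5_token`, and `owner` are hypothetical and are not Cassandra's API; the slide only states that an MD5 hash of the row key is used.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 2 ** 128  # an MD5 digest is 128 bits wide


def md5_token(row_key: str) -> int:
    """Map a row key to an integer token using its MD5 digest."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


class Ring:
    """Toy token ring: node i owns all tokens up to its own token (assumption)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        n = len(self.nodes)
        # Evenly spaced tokens so every node covers the same share of the ring.
        self.tokens = [(i + 1) * RING_SIZE // n for i in range(n)]

    def owner(self, row_key: str) -> str:
        """Return the node whose token range contains the key's token."""
        index = bisect_left(self.tokens, md5_token(row_key)) % len(self.nodes)
        return self.nodes[index]


ring = Ring(["node-a", "node-b", "node-c"])
counts = {node: 0 for node in ring.nodes}
for i in range(3000):
    counts[ring.owner(f"user-{i}")] += 1
```

Because the hash spreads keys uniformly over the token space, `counts` ends up close to one third of the keys per node. An ordered partitioner would instead compare the raw keys, which keeps range scans cheap but can concentrate load on a few nodes.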
Detail about Cassandra: Replication
on different nodes. A replication factor of 3 means that 3 copies of the data are maintained on 3 different nodes, so if 2 of those nodes go down, one copy of the data is still safe.
• Replication strategy: there are two replication strategies.
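The replication-factor arithmetic above can be sketched as follows. Placing the copies on the next nodes around the ring is an assumption made for illustration, since the slide does not say how replicas are placed, and the function names are hypothetical.

```python
def replicas(nodes, primary_index, replication_factor):
    """Pick the primary node plus the next nodes clockwise on the ring."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return [nodes[(primary_index + i) % len(nodes)]
            for i in range(replication_factor)]


def is_available(replica_nodes, down_nodes):
    """Data stays readable while at least one replica is still up."""
    return any(node not in down_nodes for node in replica_nodes)


nodes = ["n1", "n2", "n3", "n4"]
copies = replicas(nodes, primary_index=3, replication_factor=3)  # wraps around
survives_two_failures = is_available(copies, down_nodes={"n4", "n1"})
survives_three_failures = is_available(copies, down_nodes={"n4", "n1", "n2"})
```

With a replication factor of 3, losing two of the three replica nodes still leaves one copy, while losing all three makes the data unavailable.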
and reliable, is FREE!
• Cassandra can be integrated with other Apache open-source projects such as Hadoop, Apache Pig, and Apache Hive.
Peer-to-Peer architecture
• No single point of failure.
• Any number of servers/nodes can be added to any Cassandra cluster in any of the datacenters.
• Any server can serve requests from any client.
or scaled-down.
• Any number of nodes can be added to or removed from a Cassandra cluster without much disturbance.
• You don’t have to restart the cluster or change the application's queries while scaling up or down.
High Availability and Fault Tolerance
• Data replication makes Cassandra highly available and fault-tolerant (each piece of data is stored at more than one location).
• Data replication can also happen across multiple data centers.
was to harness the hidden capabilities of several multicore machines.
• Cassandra has proven itself to be highly reliable when it comes to large sets of data.
Column Oriented
• Unlike traditional databases, where column names consist only of metadata, in Cassandra column names can also carry actual data.
• Cassandra rows can consist of a very large number of columns.
⇒ Cassandra is endowed with a rich data model.
client is acknowledged as soon as the cluster accepts the write.
• Strong consistency means that any update is broadcast to all the nodes where the particular data is stored.
• A mixture of the two consistency models is also possible.
Schema Free
• In Cassandra, columns can be created at will within the rows.
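One way to reason about mixing the two consistency models described above is the standard quorum rule: a read is guaranteed to see the latest acknowledged write when the number of replicas that acknowledge the write plus the number consulted by the read exceeds the replication factor. The sketch below only checks that inequality; the function names are hypothetical, and the rule itself is not stated on the slide.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor: int,
                           write_acks: int,
                           read_acks: int) -> bool:
    """True when every read set must overlap every write set."""
    return write_acks + read_acks > replication_factor


rf = 3
eventual = reads_see_latest_write(rf, write_acks=1, read_acks=1)
strong = reads_see_latest_write(rf, write_acks=quorum(rf), read_acks=quorum(rf))
write_all = reads_see_latest_write(rf, write_acks=rf, read_acks=1)
```

Acknowledging after a single replica gives the fast, eventually consistent behavior; quorum writes with quorum reads, or writing to all replicas and reading from one, trade latency for stronger guarantees.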
Detail about Cassandra: Write Operation
logs in the Commit Log. Data is then captured and stored in the Mem-Table.
• When the mem-table is full, data is flushed to the SSTable data file.
Source: https://www.scnsoft.com/blog/cassandra-performance
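The write path above (commit log, then mem-table, then a flush to an SSTable) can be sketched with in-memory stand-ins. This is a toy model under stated assumptions: the class `WritePath`, the `memtable_limit` threshold measured in number of keys, and the dictionary-based "SSTables" are all hypothetical and far simpler than the real storage engine.

```python
class WritePath:
    """Toy model of the commit log -> mem-table -> SSTable sequence."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # memory-resident key -> value map
        self.sstables = []     # flushed, read-only tables (oldest first)
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # 1. Record the write for crash recovery.
        self.commit_log.append((key, value))
        # 2. Apply it to the in-memory structure.
        self.memtable[key] = value
        # 3. Flush once the mem-table reaches its threshold.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Persist the mem-table as a key-sorted table and start a fresh one.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        # Check the newest data first: mem-table, then SSTables newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None


store = WritePath(memtable_limit=3)
for k, v in [("b", 2), ("a", 1), ("c", 3), ("a", 10)]:
    store.write(k, v)
```

After the third write the mem-table is flushed to one SSTable; the fourth write lands in a new mem-table and shadows the older value of `"a"` on read.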
Comparison with MySQL
~300 ms
Reads average: ~350 ms
• Cassandra > 50 GB data
Writes average: 0.12 ms
Reads average: 15 ms
Source: Stats provided by the authors using Facebook data.
access paths. Example: lots of secondary indexes.
• The application depends on identifying rows with sequential values, such as MySQL autoincrement or Oracle sequences.
• Cassandra does not do ACID. LSD, Sulphuric or any other kind. If you think you need it, go elsewhere. Many times people think they need it when they don’t.
• Aggregates: Cassandra does not support aggregates; if you need to do a lot of them, consider another database.
• Joins: You may be able to data-model yourself out of this one, but take care.
• Locks: Honestly, Cassandra does not support locking. There is a good reason for this. Don’t try to implement them yourself. I have seen the end result of people trying to do locks using Cassandra and the results were not pretty.
• Updates: Cassandra is very good at writes, okay with reads. Updates and deletes are implemented as special cases of writes, and that has consequences that are not immediately obvious.
• Transactions: CQL has no begin/commit transaction syntax. If you think you need it, then Cassandra is a poor choice for you. Don’t try to simulate it. The results won’t be pretty.
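The point in the Updates bullet, that updates and deletes are special cases of writes, can be illustrated with an append-only toy table: an update is just a newer write, and a delete writes a marker (commonly called a tombstone) instead of removing anything. The class and method names are hypothetical, and the logical counter stands in for real write timestamps.

```python
TOMBSTONE = object()  # marker value meaning "this key was deleted"


class AppendOnlyTable:
    """Every operation appends a (timestamp, key, value) record; nothing is edited in place."""

    def __init__(self):
        self.records = []
        self._clock = 0  # logical timestamp, an assumption for this sketch

    def _append(self, key, value):
        self._clock += 1
        self.records.append((self._clock, key, value))

    def write(self, key, value):
        self._append(key, value)      # inserts and updates are the same operation

    def delete(self, key):
        self._append(key, TOMBSTONE)  # a delete is a write of a tombstone

    def read(self, key):
        latest = None
        for timestamp, record_key, value in self.records:
            if record_key == key and (latest is None or timestamp > latest[0]):
                latest = (timestamp, value)
        if latest is None or latest[1] is TOMBSTONE:
            return None
        return latest[1]


table = AppendOnlyTable()
table.write("user:1", "alice")
table.write("user:1", "alice-renamed")  # "update"
value_after_update = table.read("user:1")
table.delete("user:1")
value_after_delete = table.read("user:1")
```

One consequence that is not immediately obvious: after the delete, all three records are still stored, so heavy update or delete workloads accumulate old versions and tombstones that reads have to skip until they are cleaned up.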
than one server node.
• Scale linearly: By adding nodes, not more hardware on existing nodes.
• Work globally: A cluster may be geographically distributed.
• Favor writes over reads: Writes are an order of magnitude faster than reads.
• Democratic peer-to-peer architecture: No master/slave.
• Favor partition tolerance and availability over consistency: Eventually consistent (see the CAP theorem: https://en.wikipedia.org/wiki/CAP_theorem).
• Support fast targeted reads by primary key: Focus on primary key reads; alternative paths are very sub-optimal.
• Support data with a defined lifetime: Data in a Cassandra database can be given a defined lifetime; there is no need to delete it, because after the lifetime expires the data goes away.
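The last point, data with a defined lifetime, can be sketched as a store that records an expiry time with each value and hides the value once that time has passed. The class `TtlStore` and its explicit `now` parameter are hypothetical choices made to keep the example deterministic; this is not how Cassandra implements expiry internally.

```python
class TtlStore:
    """Toy key-value store where every value carries its own expiry time."""

    def __init__(self):
        self._cells = {}  # key -> (value, expires_at)

    def put(self, key, value, ttl_seconds, now):
        # Store the value together with the moment it stops being visible.
        self._cells[key] = (value, now + ttl_seconds)

    def get(self, key, now):
        cell = self._cells.get(key)
        if cell is None:
            return None
        value, expires_at = cell
        # Expired data simply disappears from reads; no explicit delete is issued.
        return value if now < expires_at else None


sessions = TtlStore()
sessions.put("session:42", "logged-in", ttl_seconds=60, now=1000)
before_expiry = sessions.get("session:42", now=1030)
after_expiry = sessions.get("session:42", now=1060)
```

The caller never deletes `session:42`; it stops being returned once 60 seconds have elapsed.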
Companies that use Cassandra
and places of interest data, and data mining over the entire user store.
• Mahalo uses it for its primary near-time data store.
• Facebook still uses it for inbox search, though they are using a proprietary fork.
• Digg uses it for its primary near-time data store.
• Rackspace uses it for its cloud service, monitoring, and logging.
• Reddit uses it as a persistent cache.
• Cloudkick uses it for monitoring statistics and analytics.
• Ooyala uses it to store and serve near real-time video analytics data.
• SimpleGeo uses it as the main data store for its real-time location infrastructure.
• Onespot uses it for a subset of its main data store.