Introduction to Big data and Cassandra

Group 02: • Nguyen Dang Anh Thi • Dang Phuong Nam • Le Minh Nghia
Presenter: Nguyen Dang Anh Thi

Overview of Big Data
• Data sources: social networks, IoT devices, stock markets.
• 35 zettabytes of data in 2020, according to Digital Universe.

Overview of storage: RDBMS
• Relational databases are a poor fit for storing big data.
• Fixed schema.
• Not distributed, hard to scale.
• Poor write performance under high-throughput workloads.

Overview of storage: NoSQL
• NoSQL is a better fit for big data applications.
• Flexibility: dynamic schema, can store unstructured and semi-structured data.
• Scalable: most NoSQL databases are distributed.
• High performance.

Introduce Cassandra: Cassandra History
• Cassandra was developed at Facebook in 2007 by Avinash Lakshman, one of the authors of Amazon's Dynamo, and Prashant Malik to power the inbox search feature.
• Facebook released Cassandra as an open-source project on Google Code in July 2008.
• In March 2009 it became an Apache Incubator project.
• On February 17, 2010 it graduated to a top-level project.
• It has since been adopted by many large companies.

Inbox search problem
Requirements:
- Scalability, with data distributed across multiple data centers.
- Millions of simultaneous users.
- Billions of writes per day.
A user wants to search his or her inbox for messages using one of two strategies:
• Term search: search by keyword.
• Interaction search: search by username.

Bigtable (2006) and the Dynamo paper (2007)
A data model that is:
• Reliable.
• Highly performant.
• Always available.
• Richer data model.
• One key, many values.
• Fast sequential access.

Cassandra (2009)
- Distributed features from Dynamo.
- Data model and storage from Bigtable.

Detail about Cassandra: Architecture overview
• Node: the place where data is stored.
• Data center: a collection of related nodes.
• Cluster: a component that contains one or more data centers.
• Commit log: the crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
• Mem-table: a memory-resident data structure. After the commit log, the data is written to the mem-table. Sometimes a single column family has multiple mem-tables.
• SSTable: a disk file to which data is flushed from the mem-table when its contents reach a threshold value.

• Peer-to-peer, masterless, ring architecture.
• Every node is the same: no master, no slave.
• Data is partitioned among all nodes in the cluster.
• Data is replicated to ensure fault tolerance.

Detail about Cassandra: Data modeling
• Keyspace is the outermost container for data in Cassandra.
• Columns are grouped into Column Families.
• Each Column has:
  ▪ Name
  ▪ Value

Detail about Cassandra: Partitioning
Source: https://www.scnsoft.com/blog/cassandra-performance

There are two basic data partitioning strategies:
• Random partitioning: the recommended strategy, and the default in early Cassandra versions (since version 1.2 the default partitioner uses a Murmur3 hash instead of MD5). It spreads data as evenly as possible across all nodes using an MD5 hash of every column family row key.
• Ordered partitioning: stores column family row keys in sorted order across the nodes in a database cluster.
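The random partitioning idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Cassandra's implementation: the names `token_for`, `build_ring`, and `owner_of` are hypothetical, each node gets a single evenly spaced token, and the MD5 digest is reduced modulo an assumed ring size.

```python
import bisect
import hashlib

# Assumed ring size for this sketch: tokens run from 0 to 2**127 - 1.
RING_SIZE = 2 ** 127


def token_for(row_key: str) -> int:
    """Map a row key to a token by hashing it with MD5 (simplified)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def build_ring(node_names):
    """Assign one evenly spaced token per node; returns a sorted list of (token, node)."""
    step = RING_SIZE // len(node_names)
    return [(i * step, name) for i, name in enumerate(node_names)]


def owner_of(ring, row_key):
    """Return the node holding the first token >= the key's token, wrapping around the ring."""
    tokens = [token for token, _ in ring]
    index = bisect.bisect_left(tokens, token_for(row_key)) % len(ring)
    return ring[index][1]
```

Because the hash output is close to uniform, keys such as `key-0`, `key-1`, and so on end up spread roughly evenly over the nodes, which is the property the slide attributes to random partitioning. An ordered partitioner would instead compare the raw keys, which keeps ranges together but can create hot spots.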

Detail about Cassandra: Replication
• Replication factor: the number of copies of the data maintained on different nodes. A replication factor of 3 means that 3 copies of the data are kept on 3 different nodes, so if 2 of those nodes go down, one copy of the data is still safe.
• Replication strategy: there are two replication strategies, simple strategy and network topology strategy.

Simple strategy: This strategy is used when there is only one data center. Replicas are placed on the next nodes clockwise around the ring.
Source: https://data-flair.training/blogs/cassandra-architecture/
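A minimal sketch of clockwise replica placement, assuming the ring is given as a list of node names in ring order and the index of the node that owns the key is already known. The function name `simple_strategy_replicas` is hypothetical and the code ignores racks and data centers entirely.

```python
def simple_strategy_replicas(ring_nodes, primary_index, replication_factor):
    """Return the owning node followed by the next nodes clockwise on the ring.

    The replica count is capped at the number of nodes so that no node
    appears twice in the result.
    """
    count = min(replication_factor, len(ring_nodes))
    return [
        ring_nodes[(primary_index + offset) % len(ring_nodes)]
        for offset in range(count)
    ]
```

For example, on a four-node ring with a replication factor of 3, a key owned by the last node is also stored on the first and second nodes, because the walk wraps around the ring.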

Network topology strategy: This strategy is highly recommended because it makes it easier to expand the cluster, for example to more data centers, as future needs require.
Source: https://data-flair.training/blogs/cassandra-architecture/

Detail about Cassandra: Key features
Source: https://data-flair.training/blogs/cassandra-architecture/

Open Source
• Cassandra, though very powerful and reliable, is free.
• Cassandra can be integrated with other Apache open-source projects such as Hadoop, Apache Pig, and Apache Hive.
Peer-to-Peer architecture
• No single point of failure.
• Any number of servers/nodes can be added to any Cassandra cluster in any of the data centers.
• Any server can serve a request from any client.

Elastic Scalability
• A Cassandra cluster can easily be scaled up or scaled down.
• Any number of nodes can be added to or removed from a Cassandra cluster without much disturbance.
• You don't have to restart the cluster or change the queries of the Cassandra application while scaling up or down.
High Availability and Fault Tolerance
• Data replication makes Cassandra highly available and fault-tolerant (each piece of data is stored at more than one location).
• Data replication can also happen across multiple data centers.

High Performance
• The basic idea behind developing Cassandra was to harness the hidden capabilities of several multicore machines.
• Cassandra has proven reliable when handling large data sets.
Column Oriented
• Unlike traditional databases, where column names consist only of metadata, in Cassandra column names can also carry actual data.
• Cassandra rows can consist of a very large number of columns.
⇒ Cassandra is endowed with a rich data model.

Tunable Consistency
• Eventual consistency: the client is acknowledged as soon as the cluster accepts the write.
• Strong consistency: every update is propagated to all the nodes where the data resides before the client is acknowledged.
• A mix of the two consistency levels is also possible.
Schema Free
• In Cassandra, columns can be created at will within rows.
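One common way to reason about mixing the two consistency levels is the replica overlap rule: if the number of replicas that must acknowledge a write plus the number consulted on a read exceeds the replication factor, every read touches at least one replica that saw the latest write. The sketch below only illustrates that arithmetic; the function names are hypothetical and nothing here calls Cassandra.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas, e.g. 2 out of 3."""
    return replication_factor // 2 + 1


def reads_see_latest_write(write_acks: int, read_acks: int, replication_factor: int) -> bool:
    """True when the write set and the read set must share at least one replica."""
    return write_acks + read_acks > replication_factor
```

With a replication factor of 3, writing and reading at quorum (2 + 2 > 3) gives overlapping replica sets, while writing and reading with a single acknowledgement (1 + 1) does not, which corresponds to the eventual consistency case.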

Detail about Cassandra: Read Operation
Source: https://www.scnsoft.com/blog/cassandra-performance

Detail about Cassandra: Write Operation
When a write request comes to a node:
• First, the write is logged in the commit log.
• The data is then captured and stored in the mem-table.
• When the mem-table is full, the data is flushed to an SSTable data file.
Source: https://www.scnsoft.com/blog/cassandra-performance
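The write steps above can be sketched as a toy in-memory model. It is a simplification under stated assumptions: the class `ToyNode` and its `flush_threshold` parameter are hypothetical, the commit log is a plain list rather than a file, and the `read` method (newest data first) is added only to show that flushed data stays readable.

```python
class ToyNode:
    """Toy model of the write path: commit log, then mem-table, then SSTable."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # append-only list standing in for the on-disk log
        self.memtable = {}     # in-memory key -> value map
        self.sstables = []     # each flush produces one immutable, sorted table
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        # 1. Record the write in the commit log so it could be replayed after a crash.
        self.commit_log.append((key, value))
        # 2. Store the data in the mem-table.
        self.memtable[key] = value
        # 3. Flush to a new SSTable once the mem-table reaches the threshold.
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # The flushed table is sorted by key and never modified afterwards.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        # Check the mem-table first, then the SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for stored_key, stored_value in table:
                if stored_key == key:
                    return stored_value
        return None
```

Note that an update is just another write: the newer value shadows the older one instead of modifying the flushed table in place, which is one reason writes are cheap in this design.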

Cassandra performance benchmark
Load process: Cassandra vs. MongoDB vs. HBase vs. Couchbase
Source: https://www.datastax.com/products/compare/nosql-performance-benchmarks

Load process: Cassandra vs. MongoDB vs. HBase vs. Couchbase
Source: https://www.datastax.com/products/compare/nosql-performance-benchmarks

Mixed operational and analytical workload: Cassandra vs. MongoDB vs. HBase vs. Couchbase
Source: https://www.datastax.com/products/compare/nosql-performance-benchmarks

Mixed operational and analytical workload: Cassandra vs. MongoDB vs. HBase
Source: https://www.datastax.com/products/compare/nosql-performance-benchmarks

Comparison with MySQL
• MySQL, > 50 GB of data: average write ~300 ms, average read ~350 ms.
• Cassandra, > 50 GB of data: average write 0.12 ms, average read 15 ms.
Source: statistics provided by the authors, using Facebook data.

Summary: Advantages and Disadvantages

When not to use Cassandra:
• Tables have multiple access paths. Example: lots of secondary indexes.
• The application depends on identifying rows with sequential values, such as MySQL autoincrement or Oracle sequences.
• ACID: Cassandra does not do ACID. LSD, sulphuric or any other kind. If you think you need it, go elsewhere. Many times people think they need it when they don't.
• Aggregates: Cassandra does not support aggregates. If you need to do a lot of them, consider another database.
• Joins: You may be able to data model yourself out of this one, but take care.
• Locks: Honestly, Cassandra does not support locking. There is a good reason for this. Don't try to implement them yourself. I have seen the end result of people trying to do locks using Cassandra and the results were not pretty.
• Updates: Cassandra is very good at writes, okay with reads. Updates and deletes are implemented as special cases of writes, and that has consequences that are not immediately obvious.
• Transactions: CQL has no begin/commit transaction syntax. If you think you need it, then Cassandra is a poor choice for you. Don't try to simulate it. The results won't be pretty.

When to use Cassandra:
• Distributed: Runs on more than one server node.
• Scale linearly: By adding nodes, not more hardware on existing nodes.
• Work globally: A cluster may be geographically distributed.
• Favor writes over reads: Writes are an order of magnitude faster than reads.
• Democratic peer-to-peer architecture: No master/slave.
• Favor partition tolerance and availability over consistency: Eventually consistent (see the CAP theorem: https://en.wikipedia.org/wiki/CAP_theorem).
• Support fast targeted reads by primary key: Focus on primary-key reads; alternative access paths are very sub-optimal.
• Support data with a defined lifetime: Data in a Cassandra database can be given a defined lifetime. There is no need to delete it; after the lifetime expires, the data goes away.

Companies that use Cassandra
• Twitter uses Cassandra for analytics: real-time analytics, geolocation and places-of-interest data, and data mining over the entire user store.
• Mahalo uses it for its primary near-time data store.
• Facebook still uses it for inbox search, though it is using a proprietary fork.
• Digg uses it for its primary near-time data store.
• Rackspace uses it for its cloud service, monitoring, and logging.
• Reddit uses it as a persistent cache.
• Cloudkick uses it for monitoring statistics and analytics.
• Ooyala uses it to store and serve near real-time video analytics data.
• SimpleGeo uses it as the main data store for its real-time location infrastructure.
• Onespot uses it for a subset of its main data store.

References
• https://www.slideshare.net/asismohanty/cassandra-basics-20
• https://data-flair.training/blogs/cassandra-architecture/
• https://www.slideshare.net/quangntta/introduction-to-cassandra-59962524?from_action=save
• https://data-flair.training/blogs/cassandra-data-model/
• https://technospirituality.com/2016/07/apache-cassandra-a-quick-look/
• https://www.slideshare.net/DataStax/understanding-data-partitioning-and-replication-in-apache-cassandra
• https://www.scnsoft.com/blog/cassandra-performance
• https://www.edureka.co/blog/interview-questions/cassandra-interview-questions/
• https://www.gocit.vn/files/Cassandra.The.Definitive.Guide-www.gocit.vn.pdf
• https://www.edureka.co/blog/apache-cassandra-advantages/

Demo: Cassandra on Docker

Thank you
