random read/write on top of HDFS • Provides Tables with a “Key:Column/Value” interface • Dynamic columns (qualifiers), no schema needed • “Fixed” column groups (families) • table[row:family:column] = value 2
and reliability (Write-Ahead Log) • Apache ZooKeeper for distributed coordination • Apache Hadoop MapReduce built-in support for running MapReduce jobs ZK HDFS App MR
Region Region Region Server Region Region Region Region Server Region Region Region Client ZooKeeper Master • Region Server • Server that contains a set of Regions • Responsible to handle reads and writes • Region • The basic unit of scalability in HBase • Subset of the table’s data • Contiguous, sorted range of rows stored together. • Master • Coordinates the HBase Cluster • Assignment/Balancing of the Regions • Handles admin operations • create/delete/modify table, …
of the table’s data • When there is too much data in a Region… • a split is triggered, creating 2 regions • The association “Region -> Server” is stored in a System Table • The Location of .META. Is stored in ZooKeeper 6 Table Start Key Region ID Region Server testTable Key-00 1 machine01.host testTable Key-31 2 machine03.host testTable Key-65 3 machine02.host testTable Key-83 4 machine01.host … … … … users Key-AB 1 machine03.host users Key-KG 2 machine02.host machine01 Region 1 - testTable Region 4 - testTable machine02 Region 3 - testTable Region 1 - users machine03 Region 2 - testTable Region 2 - users
The client asks to the master to create a new Table • hbase> create ‘myTable’, ‘cf’ • The Master • Store the Table information (“schema”) • Create Regions based on the key-splits provided • no splits provided, one single region by default • Assign the Regions to the Region Servers • The assignment Region -> Server is written to a system table called “.META.” Client Master Region Server Region Server createTable() Store Table “Metadata” Assign the Regions “enable” Region Region Region Region Region Server Region Region
• The client asks ZooKeeper the location of .META. • The client scans .META. searching for the Region Server responsible to handle the Key • The client asks the Region Server to insert/update/delete the specified key/value. • The Region Server process the request and dispatch it to the Region responsible to handle the Key • The operation is written to a Write-Ahead Log (WAL) • …and the KeyValues added to the Store: “MemStore” Client Where is .META.? ZooKeeper Region Server Region Scan .META. Region Region Server Region Region Region Insert KeyValue
• Files in HDFS are • Append-Only • Immutable once closed • HBase provides Random Writes? • …not really from a storage point of view • KeyValues are stored in memory and written to disk on pressure • Don’t worry your data is safe in the WAL! • (The Region Server can recover data from the WAL is case of crash) • But this allow to sort data by Key before writing on disk • Deletes are like Inserts but with a “remove me flag” RS Region Region Region WAL MemStore + Store Files (HFiles) Key0 – value 0 Key1 – value 1 Key2 – value 2 Key3 – value 3 Key4 – value 4 Key5 – value 5 Store Files
asks ZooKeeper the location of .META. • The client scans .META. searching for the Region Server responsible to handle the Key • The client asks the Region Server to get the specified key/value. • The Region Server process the request and dispatch it to the Region responsible to handle the Key • MemStore and Store Files are scanned to find the key Client Where is .META.? ZooKeeper Region Server Region Scan .META. Region Region Server Region Region Region Get Key
• Each flush a new file is created • Each file have KeyValues sorted by key • Two or more files can contains the same key (updates/deletes) • To find a Key you need to scan all the files • …with some optimizations • Filter Files Start/End Key • Having a bloom filter on each file Key0 – value 0.0 Key2 – value 2.0 Key3 – value 3.0 Key5 – value 5.0 Key8 – value 8.0 Key9 – value 9.0 Key0 – value 0.1 Key5 – value 5.0 Key1 – value 1.0 Key5 – [deleted] Key6 – value 6.0 Key7– value 7.0
• Large Sequential Reads are better • Why grouping records in blocks? • Easy to split • Easy to read • Easy to cache • Easy to index (if records are sorted) • Block Compression (snappy, lz4, gz, …) Record 0 Record 1 Header … Record N Record 0 Record 1 Header … Record N Index 0 … Index N Blocks Key/Value (record) Key Length : int Value Length : int Key : byte[] Value : byte[] Trailer
byte[] Family Length : byte Family : byte[] Qualifier : byte[] Timestamp : long Type : byte “on-disk” KeyValue • “Be aware of the data” • Block Encoding allows to compress the Key based on what we know • Keys are sorted… prefix may be similar in most cases • One file contains keys from one Family only • Timestamps are “similar”, we can store the diff • Type is “put” most of the time…
into during a scan • Removing duplicated keys (updated values) • Removing deleted keys • Creates a new file by merging the content of two or more files • Remove the old files Key0 – value 0.0 Key2 – value 2.0 Key3 – value 3.0 Key5 – value 5.0 Key8 – value 8.0 Key9 – value 9.0 Key0 – value 0.1 Key1 – value 1.0 Key4– value 4.0 Key5 – [deleted] Key6 – value 6.0 Key7– value 7.0 Key0 – value 0.1 Key1 – value 1.0 Key2 – value 2.0 Key4– value 4.0 Key6 – value 6.0 Key7– value 7.0 Key8– value 8.0 Key9– value 9.0
of the data • Time Series? I guess no updates from the 80s • Be aware of the requests • Compact based on statistics • which files are hot and which are not • which keys are hot and which are not Key0 – value 0.0 Key2 – value 2.0 Key3 – value 3.0 Key5 – value 5.0 Key8 – value 8.0 Key9 – value 9.0 Key0 – value 0.1 Key1 – value 1.0 Key4– value 4.0 Key5 – [deleted] Key6 – value 6.0 Key7– value 7.0 Key0 – value 0.1 Key1 – value 1.0 Key2 – value 2.0 Key4– value 4.0 Key6 – value 6.0 Key7– value 7.0 Key8– value 8.0 Key9– value 9.0
a copy of the table” • a Snapshot is a set of metadata information • The table “schema” (column families and attributes) • The Regions information (start key, end key, …) • The list of Store Files • The list of WALs active RS Region Region Region WAL Store Files (HFiles) RS Region Region Region WAL Store Files (HFiles) Master ZK ZK ZK
the RSs • the communication is done via ZooKeeper • using a “2-phase commit like” transaction (prepare/commit) • Each RS is responsible to take its “piece” of snapshot • For each Region store the metadata information needed • (list of Store Files, WALs, region start/end keys, …) RS Region Region Region WAL Store Files (HFiles) RS Region Region Region WAL Store Files (HFiles) Master ZK ZK ZK
‘snapshotName’, ‘tableName’ … • Creates a new table with the data “contained” in the snapshot • No data copies involved • HFiles are immutable, and shared between tables and snapshots • You can insert/update/remove data from the new table • No repercussions on the snapshot, original tables or other cloned tables
between tables and snapshots • On compaction or table deletion, files are removed from disk • If one of these files are referenced by a snapshot or a cloned table • The file is moved to an “archive” directory • And deleted later, when there’re no references to it
security • Multiple WALs per Region Server (MTTR) • Data Placement Awareness (MTTR) • Data Type Awareness • Compaction policies, based on the data needs • Managing blocks directly (instead of files)