HBase Storage Internals

1 Speaker Name or Subhead Goes Here Matteo Bertozzi |
@Cloudera March 2013 - Hadoop Summit Europe HBase Storage Internals, present and future!

What is HBase? • Open source Storage Manager that provides
random read/write on top of HDFS • Provides Tables with a “Key:Column/Value” interface • Dynamic columns (qualifiers), no schema needed • “Fixed” column groups (families) • table[row:family:column] = value 2

HBase ecosystem 3 • Apache Hadoop HDFS for data durability
and reliability (Write-Ahead Log) • Apache ZooKeeper for distributed coordination • Apache Hadoop MapReduce built-in support for running MapReduce jobs ZK HDFS App MR

4 How HBase Works “View from 10000ft”

Region Server Master, Region Servers and Regions 5 HDFS Region
Region Region Region Server Region Region Region Region Server Region Region Region Client ZooKeeper Master • Region Server • Server that contains a set of Regions • Responsible to handle reads and writes • Region • The basic unit of scalability in HBase • Subset of the table’s data • Contiguous, sorted range of rows stored together. • Master • Coordinates the HBase Cluster • Assignment/Balancing of the Regions • Handles admin operations • create/delete/modify table, …

Autosharding and .META. table • A Region is a Subset
of the table’s data • When there is too much data in a Region… • a split is triggered, creating 2 regions • The association “Region -> Server” is stored in a System Table • The Location of .META. Is stored in ZooKeeper 6 Table Start Key Region ID Region Server testTable Key-00 1 machine01.host testTable Key-31 2 machine03.host testTable Key-65 3 machine02.host testTable Key-83 4 machine01.host … … … … users Key-AB 1 machine03.host users Key-KG 2 machine02.host machine01 Region 1 - testTable Region 4 - testTable machine02 Region 3 - testTable Region 1 - users machine03 Region 2 - testTable Region 2 - users

The Write Path – Create a New Table 7 •
The client asks to the master to create a new Table • hbase> create ‘myTable’, ‘cf’ • The Master • Store the Table information (“schema”) • Create Regions based on the key-splits provided • no splits provided, one single region by default • Assign the Regions to the Region Servers • The assignment Region -> Server is written to a system table called “.META.” Client Master Region Server Region Server createTable() Store Table “Metadata” Assign the Regions “enable” Region Region Region Region Region Server Region Region

The Write Path – “Inserting” data 8 • table.put(row-key:family:column, value)
• The client asks ZooKeeper the location of .META. • The client scans .META. searching for the Region Server responsible to handle the Key • The client asks the Region Server to insert/update/delete the specified key/value. • The Region Server process the request and dispatch it to the Region responsible to handle the Key • The operation is written to a Write-Ahead Log (WAL) • …and the KeyValues added to the Store: “MemStore” Client Where is .META.? ZooKeeper Region Server Region Scan .META. Region Region Server Region Region Region Insert KeyValue

The Write Path – Append Only to Random R/W 9
• Files in HDFS are • Append-Only • Immutable once closed • HBase provides Random Writes? • …not really from a storage point of view • KeyValues are stored in memory and written to disk on pressure • Don’t worry your data is safe in the WAL! • (The Region Server can recover data from the WAL is case of crash) • But this allow to sort data by Key before writing on disk • Deletes are like Inserts but with a “remove me flag” RS Region Region Region WAL MemStore + Store Files (HFiles) Key0 – value 0 Key1 – value 1 Key2 – value 2 Key3 – value 3 Key4 – value 4 Key5 – value 5 Store Files

The Read Path – “reading” data 10 • The client
asks ZooKeeper the location of .META. • The client scans .META. searching for the Region Server responsible to handle the Key • The client asks the Region Server to get the specified key/value. • The Region Server process the request and dispatch it to the Region responsible to handle the Key • MemStore and Store Files are scanned to find the key Client Where is .META.? ZooKeeper Region Server Region Scan .META. Region Region Server Region Region Region Get Key

The Read Path – Append Only to Random R/W 11
• Each flush a new file is created • Each file have KeyValues sorted by key • Two or more files can contains the same key (updates/deletes) • To find a Key you need to scan all the files • …with some optimizations • Filter Files Start/End Key • Having a bloom filter on each file Key0 – value 0.0 Key2 – value 2.0 Key3 – value 3.0 Key5 – value 5.0 Key8 – value 8.0 Key9 – value 9.0 Key0 – value 0.1 Key5 – value 5.0 Key1 – value 1.0 Key5 – [deleted] Key6 – value 6.0 Key7– value 7.0

12 HBase Store File Format HFile

HFile Format 13 • Only Sequential Writes, just append(key, value)
• Large Sequential Reads are better • Why grouping records in blocks? • Easy to split • Easy to read • Easy to cache • Easy to index (if records are sorted) • Block Compression (snappy, lz4, gz, …) Record 0 Record 1 Header … Record N Record 0 Record 1 Header … Record N Index 0 … Index N Blocks Key/Value (record) Key Length : int Value Length : int Key : byte[] Value : byte[] Trailer

Data Block Encoding 14 Row Length : short Row :
byte[] Family Length : byte Family : byte[] Qualifier : byte[] Timestamp : long Type : byte “on-disk” KeyValue • “Be aware of the data” • Block Encoding allows to compress the Key based on what we know • Keys are sorted… prefix may be similar in most cases • One file contains keys from one Family only • Timestamps are “similar”, we can store the diff • Type is “put” most of the time…

15 Optimize the read-path Compactions

Compactions 16 • Reduce the number of files to look
into during a scan • Removing duplicated keys (updated values) • Removing deleted keys • Creates a new file by merging the content of two or more files • Remove the old files Key0 – value 0.0 Key2 – value 2.0 Key3 – value 3.0 Key5 – value 5.0 Key8 – value 8.0 Key9 – value 9.0 Key0 – value 0.1 Key1 – value 1.0 Key4– value 4.0 Key5 – [deleted] Key6 – value 6.0 Key7– value 7.0 Key0 – value 0.1 Key1 – value 1.0 Key2 – value 2.0 Key4– value 4.0 Key6 – value 6.0 Key7– value 7.0 Key8– value 8.0 Key9– value 9.0

Pluggable Compactions 17 • Try different algorithm • Be aware
of the data • Time Series? I guess no updates from the 80s • Be aware of the requests • Compact based on statistics • which files are hot and which are not • which keys are hot and which are not Key0 – value 0.0 Key2 – value 2.0 Key3 – value 3.0 Key5 – value 5.0 Key8 – value 8.0 Key9 – value 9.0 Key0 – value 0.1 Key1 – value 1.0 Key4– value 4.0 Key5 – [deleted] Key6 – value 6.0 Key7– value 7.0 Key0 – value 0.1 Key1 – value 1.0 Key2 – value 2.0 Key4– value 4.0 Key6 – value 6.0 Key7– value 7.0 Key8– value 8.0 Key9– value 9.0

18 Zero-Copy Snapshots and Table Clones Snapshots

What Is a Snapshot? 19 • “a Snapshot is not
a copy of the table” • a Snapshot is a set of metadata information • The table “schema” (column families and attributes) • The Regions information (start key, end key, …) • The list of Store Files • The list of WALs active RS Region Region Region WAL Store Files (HFiles) RS Region Region Region WAL Store Files (HFiles) Master ZK ZK ZK

How Taking a Snapshot Works? 20 • The master orchestrate
the RSs • the communication is done via ZooKeeper • using a “2-phase commit like” transaction (prepare/commit) • Each RS is responsible to take its “piece” of snapshot • For each Region store the metadata information needed • (list of Store Files, WALs, region start/end keys, …) RS Region Region Region WAL Store Files (HFiles) RS Region Region Region WAL Store Files (HFiles) Master ZK ZK ZK

Cloning a Table from a Snapshots 21 • hbase> clone_snapshot
‘snapshotName’, ‘tableName’ … • Creates a new table with the data “contained” in the snapshot • No data copies involved • HFiles are immutable, and shared between tables and snapshots • You can insert/update/remove data from the new table • No repercussions on the snapshot, original tables or other cloned tables

Compactions & Archiving 22 • HFiles are immutable, and shared
between tables and snapshots • On compaction or table deletion, files are removed from disk • If one of these files are referenced by a snapshot or a cloned table • The file is moved to an “archive” directory • And deleted later, when there’re no references to it

23 What can be improved? Future

0.96 is coming up • Moving RPC to Protobuf •
Allows rolling upgrades with no surprises • HBase Snapshots • Pluggable Compactions • Remove -ROOT- • Table Locks 24

0.98 and Beyond 25 • Transparent Table/Column-Family Encryption • Cell-level
security • Multiple WALs per Region Server (MTTR) • Data Placement Awareness (MTTR) • Data Type Awareness • Compaction policies, based on the data needs • Managing blocks directly (instead of files)

26 Headline Goes Here Speaker Name or Subhead Goes Here
DO NOT USE PUBLICLY PRIOR TO 10/23/12 Questions? Matteo Bertozzi | @Cloudera

HBase Storage Internals

HBase Storage Internals

Matteo Bertozzi

More Decks by Matteo Bertozzi

Other Decks in Technology

Featured

Transcript

1 Speaker Name or Subhead Goes Here Matteo Bertozzi |

What is HBase? • Open source Storage Manager that provides

HBase ecosystem 3 • Apache Hadoop HDFS for data durability

4 How HBase Works “View from 10000ft”

Region Server Master, Region Servers and Regions 5 HDFS Region

Autosharding and .META. table • A Region is a Subset

The Write Path – Create a New Table 7 •

The Write Path – “Inserting” data 8 • table.put(row-key:family:column, value)

The Write Path – Append Only to Random R/W 9

The Read Path – “reading” data 10 • The client

The Read Path – Append Only to Random R/W 11

12 HBase Store File Format HFile

HFile Format 13 • Only Sequential Writes, just append(key, value)

Data Block Encoding 14 Row Length : short Row :

15 Optimize the read-path Compactions

Compactions 16 • Reduce the number of files to look

Pluggable Compactions 17 • Try different algorithm • Be aware

18 Zero-Copy Snapshots and Table Clones Snapshots

What Is a Snapshot? 19 • “a Snapshot is not

How Taking a Snapshot Works? 20 • The master orchestrate

Cloning a Table from a Snapshots 21 • hbase> clone_snapshot

Compactions & Archiving 22 • HFiles are immutable, and shared

23 What can be improved? Future

0.96 is coming up • Moving RPC to Protobuf •

0.98 and Beyond 25 • Transparent Table/Column-Family Encryption • Cell-level

26 Headline Goes Here Speaker Name or Subhead Goes Here

27