HBase Architecture
The Hadoop DataNode stores the data that the Region Server is managing. All
HBase data is stored in HDFS files. Region Servers are collocated with the
HDFS DataNodes, which enables data locality (putting the data close to where it is
needed) for the data served by the Region Servers. HBase data is local when it is
written, but when a region is moved, it is not local until compaction.
The NameNode maintains metadata information for all the physical data blocks
that comprise the files.
Regions
HBase Tables are divided horizontally by row key range into “Regions.” A region
contains all rows in the table between the region’s start key and end key.
Regions are assigned to the nodes in the cluster, called “Region Servers,” and
these serve data for reads and writes. A region server can serve about 1,000
regions.
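To make the row-key-range idea concrete, here is a minimal sketch (using the HBase 2.x Java client; the table name, column family, and split keys are hypothetical) that creates a table pre-split into regions at given row key boundaries:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Hypothetical table with one column family "cf".
            TableDescriptor desc = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("users"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                .build();
            // Each split key becomes a region boundary, so this table starts
            // with four regions: (start,"g"), ["g","n"), ["n","t"), ["t",end).
            byte[][] splitKeys = {
                Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t")
            };
            admin.createTable(desc, splitKeys);
        }
    }
}
```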
HBase HMaster
Region assignment and DDL (create, delete tables) operations are handled by the
HBase Master.
Region servers and the active HMaster maintain a session with ZooKeeper by
sending heartbeats, and each holds an ephemeral node in ZooKeeper. If a region
server or the active HMaster fails to send a heartbeat, the session expires and
the corresponding ephemeral node is deleted. Listeners for updates will be
notified of the deleted nodes. The active HMaster listens for region servers, and
will recover region servers on failure. The inactive HMaster listens for active
HMaster failure, and if the active HMaster fails, the inactive HMaster becomes
active.
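The failure detection described above relies on ZooKeeper ephemeral nodes and watches. As an illustrative sketch (not HBase's internal code; the connect string and znode path are hypothetical), watching for an ephemeral node deletion with the plain ZooKeeper Java API looks like this:

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ServerWatcher {
    public static void main(String[] args) throws Exception {
        // Hypothetical connect string and znode path.
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, event -> {});
        String path = "/hbase/rs/regionserver-1";
        // Set a one-time watch; ZooKeeper fires NodeDeleted when the
        // ephemeral node vanishes because the owner's session expired.
        zk.exists(path, (WatchedEvent event) -> {
            if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                System.out.println("Server down, trigger recovery: " + event.getPath());
            }
        });
        Thread.sleep(Long.MAX_VALUE); // keep the process alive to receive events
    }
}
```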
This is what happens the first time a client reads or writes to HBase:
1. The client gets the Region Server that hosts the META table from
ZooKeeper.
2. The client will query the META server to get the Region Server
corresponding to the row key it wants to access. The client caches this
information along with the META table location.
3. The client gets the row from the corresponding Region Server.
For future reads, the client uses the cache to retrieve the META location and
previously read row keys. Over time, it does not need to query the META table,
unless there is a miss because a region has moved; then it will re-query and
update the cache.
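All of this lookup and caching is handled inside the client library; application code simply issues a Get. A minimal sketch with the HBase 2.x Java client (the table name, column family, and row key are hypothetical):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class FirstRead {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // The first Get triggers the ZooKeeper -> META -> Region Server
            // lookup; later Gets reuse the client's cached META information.
            Result result = table.get(new Get(Bytes.toBytes("row-42")));
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"));
            System.out.println(value == null ? "not found" : Bytes.toString(value));
        }
    }
}
```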
HBase Meta Table
The META table is an HBase table that keeps a list of all regions in the
system.
The META table is like a B-tree.
The META table structure is as follows:
- Key: region start key, region id
- Values: Region Server
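Because the META table keys are sorted (the B-tree analogy above), finding the region for a row key is a "greatest start key less than or equal to the row key" lookup. A simplified sketch of that lookup, illustrative only, using a plain Java TreeMap rather than HBase's internal structures:

```java
import java.util.TreeMap;

public class MetaLookup {
    public static void main(String[] args) {
        // Region start key -> hosting server (hypothetical values).
        TreeMap<String, String> meta = new TreeMap<>();
        meta.put("", "regionserver-1");   // first region starts at the empty key
        meta.put("g", "regionserver-2");
        meta.put("n", "regionserver-3");

        // The region that holds a row is the one with the greatest
        // start key <= the row key, i.e. a floor lookup in the sorted map.
        String rowKey = "melissa";
        System.out.println(meta.floorEntry(rowKey).getValue()); // regionserver-2
    }
}
```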
Region Server Components
A Region Server runs on an HDFS DataNode and has the following components:
- WAL: the Write Ahead Log is a file on the distributed file system. The WAL
is used to store new data that hasn't yet been persisted to permanent
storage; it is used for recovery in the case of failure.
- BlockCache: the read cache. It stores frequently read data in memory.
Least recently used data is evicted when full.
- MemStore: the write cache. It stores new data which has not yet been
written to disk. It is sorted before writing to disk. There is one MemStore
per column family per region.
- HFiles: store the rows as sorted KeyValues on disk.
HBase Write Steps (1)
When the client issues a Put request, the first step is to write the data to the
write-ahead log, the WAL:
- Edits are appended to the end of the WAL file that is stored on disk.
- The WAL is used to recover not-yet-persisted data in case a server crashes.
HBase Write Steps (2)
Once the data is written to the WAL, it is placed in the MemStore. Then, the put
request acknowledgement returns to the client.
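From the application's point of view, the WAL append and MemStore update happen inside a single Put call. A minimal sketch with the HBase 2.x Java client (the table name, column family, and values are hypothetical):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class SimplePut {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {
            Put put = new Put(Bytes.toBytes("row-42"))
                .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            // put() returns only after the edit is in the WAL and the MemStore.
            table.put(put);
        }
    }
}
```

Durability can be relaxed per mutation with Put#setDurability(Durability.SKIP_WAL), at the cost of losing those edits if the server crashes before a flush.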
HBase MemStore
The MemStore stores updates in memory as sorted KeyValues, the same as they
would be stored in an HFile. There is one MemStore per column family. The
updates are sorted per column family.
Note that this is one reason why there is a limit on the number of column families
in HBase: there is one MemStore per CF, and when one is full, they all flush. The
flush also saves the last written sequence number, so the system knows what was
persisted so far.
The highest sequence number is stored as a meta field in each HFile, to reflect
where persisting has ended and where to continue. On region startup, the
sequence number is read, and the highest is used as the sequence number for
new edits.
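Flushes are normally triggered automatically by MemStore size, but they can also be requested explicitly. A hedged sketch using the HBase 2.x Admin API (the table name is hypothetical):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;

public class ForceFlush {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Flushes the MemStores of every region of the table to new HFiles;
            // because flushing is per column family, all CFs are written out.
            admin.flush(TableName.valueOf("users"));
        }
    }
}
```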
HBase HFile
Data is stored in an HFile which contains sorted key/values. When the MemStore
accumulates enough data, the entire sorted KeyValue set is written to a new
HFile in HDFS. This is a sequential write. It is very fast, as it avoids moving the
disk drive head.
The trailer points to the meta blocks, and is written at the end of persisting the
data to the file. The trailer also has information like bloom filters and time range
info. Bloom filters help to skip files that do not contain a certain row key. The time
range info is useful for skipping the file if it is not in the time range the read is
looking for.
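Bloom filters are configured per column family. A minimal sketch (HBase 2.x Java client; the table and family names are hypothetical) that enables row-level bloom filters when creating a table:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class BloomFilterTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
                .newBuilder(Bytes.toBytes("cf"))
                .setBloomFilterType(BloomType.ROW) // one filter entry per row key
                .build();
            admin.createTable(TableDescriptorBuilder
                .newBuilder(TableName.valueOf("events"))
                .setColumnFamily(cf)
                .build());
        }
    }
}
```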
HFile Index
An HFile contains a multi-layered index, which allows HBase to seek to the data
without having to read the whole file. The index is loaded when the HFile is
opened and kept in memory. This allows lookups to be performed with a single
disk seek.
HBase Read Merge
We have seen that the KeyValue cells corresponding to one row can be in
multiple places: row cells already persisted are in HFiles, recently updated cells
are in the MemStore, and recently read cells are in the BlockCache. So when
you read a row, how does the system get the corresponding cells to return? A
read merges KeyValues from the BlockCache, MemStore, and HFiles in the
following steps:
1. First, the scanner looks for the row cells in the BlockCache, the read
cache. Recently read KeyValues are cached here, and least recently
used entries are evicted when memory is needed.
2. Next, the scanner looks in the MemStore, the write cache in memory
containing the most recent writes.
3. If the scanner does not find all of the row cells in the MemStore and
BlockCache, then HBase will use the BlockCache indexes and bloom filters
to load HFiles into memory, which may contain the target row cells.
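On the read side, giving HBase narrowing hints lets it skip whole HFiles using the bloom filter and time range metadata described earlier. A minimal sketch (HBase 2.x Java client; table name, row keys, and times are hypothetical) of a bounded scan with a time range:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeRangeScan {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("events"))) {
            long now = System.currentTimeMillis();
            Scan scan = new Scan()
                .withStartRow(Bytes.toBytes("user-100"))
                .withStopRow(Bytes.toBytes("user-200"))
                .setTimeRange(now - 3600_000L, now); // only cells from the last hour
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}
```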
HBase Read Merge
As discussed earlier, there may be many HFiles per MemStore (one per flush),
which means that for a read, multiple files may have to be examined, which can
affect performance. This is called read amplification.
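Compaction counteracts read amplification by merging many HFiles into fewer ones. It normally runs in the background, but a major compaction can also be requested explicitly; a hedged sketch with the HBase 2.x Admin API (the table name is hypothetical):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;

public class TriggerCompaction {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Asynchronously requests a major compaction, which rewrites the
            // HFiles of each region (per column family) into a single HFile.
            admin.majorCompact(TableName.valueOf("events"));
        }
    }
}
```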
HBase Crash Recovery
When the HMaster detects that a region server has crashed, the HMaster
reassigns the regions from the crashed server to active Region Servers. In order
to recover the crashed region server's MemStore edits that were not flushed to
disk, the HMaster splits the WAL belonging to the crashed region server into
separate files and stores these files in the new region servers' data nodes. Each
Region Server then replays the WAL from the respective split WAL, to rebuild the
MemStore for that region.
Data Recovery
WAL files contain a list of edits, with one edit representing a single put or delete.
Edits are written chronologically, so, for persistence, additions are appended to
the end of the WAL file that is stored on disk.
What happens if there is a failure when the data is still in memory and not
persisted to an HFile? The WAL is replayed. Replaying a WAL is done by
reading the WAL and adding and sorting the contained edits into the current
MemStore. At the end, the MemStore is flushed to write the changes to an HFile.
MapR Academy
If you need to learn more about the HBase data model, we have 4 lessons that
will help you. Module 2 of the free course, DEV 320 - Apache HBase Data
Model and Architecture, will walk you through everything you need to know.
MapR Database offers many benefits over HBase, while maintaining the virtues
of the HBase API and the idea of data being sorted according to primary key.
MapR Database provides operational benefits such as no compaction delays and
automated region splits that do not impact the performance of the database. The
tables in MapR Database can also be isolated to certain machines in a cluster by
utilizing the topology feature of MapR. The final differentiator is that MapR
Database is just plain fast, due primarily to the fact that it is tightly integrated into
the MapR Distributed File and Object Store itself, rather than being layered on
top of a distributed file system that is layered on top of a conventional file system.
You can take this free On Demand training to learn more about MapR XD and
MapR Database
In this blog post, you learned more about the HBase architecture and its main
benefits over other NoSQL data store solutions. If you have any questions about
HBase, please ask them in the comments section below.