HBase Architecture
The Hadoop DataNode stores the data that the Region Server is managing. All
HBase data is stored in HDFS files. Region Servers are collocated with the
HDFS DataNodes, which enables data locality (putting the data close to where it is
needed) for the data served by the Region Servers. HBase data is local when it is
written, but when a region is moved, it is not local until compaction.
The NameNode maintains metadata information for all the physical data blocks
that comprise the files.
Regions
HBase Tables are divided horizontally by row key range into “Regions.” A region
contains all rows in the table between the region’s start key and end key.
Regions are assigned to the nodes in the cluster, called “Region Servers,” and
these serve data for reads and writes. A region server can serve about 1,000
regions.
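To make the row-key-range idea concrete, here is a minimal sketch (using the HBase 2.x Java client; the table name, column family, and split keys are hypothetical) that creates a table pre-split into regions at given row key boundaries:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Hypothetical table with one column family "cf".
            TableDescriptor desc = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("users"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                .build();
            // Each split key becomes a region boundary, so this table starts
            // with four regions: (start,"g"), ["g","n"), ["n","t"), ["t",end).
            byte[][] splitKeys = {
                Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t")
            };
            admin.createTable(desc, splitKeys);
        }
    }
}
```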
HBase HMaster
Region assignment and DDL (create, delete tables) operations are handled by the
HBase Master.
Region servers and the active HMaster maintain a session with ZooKeeper by
sending heartbeats, and each holds an ephemeral node in ZooKeeper. If a region
server or the active HMaster fails to send a heartbeat, the session expires and
the corresponding ephemeral node is deleted. Listeners for updates will be
notified of the deleted nodes. The active HMaster listens for region servers, and
will recover region servers on failure. The inactive HMaster listens for active
HMaster failure, and if the active HMaster fails, the inactive HMaster becomes
active.
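The failure detection described above relies on ZooKeeper ephemeral nodes and watches. As an illustrative sketch (not HBase's internal code; the connect string and znode path are hypothetical), watching for an ephemeral node deletion with the plain ZooKeeper Java API looks like this:

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ServerWatcher {
    public static void main(String[] args) throws Exception {
        // Hypothetical connect string and znode path.
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, event -> {});
        String path = "/hbase/rs/regionserver-1";
        // Set a one-time watch; ZooKeeper fires NodeDeleted when the
        // ephemeral node vanishes because the owner's session expired.
        zk.exists(path, (WatchedEvent event) -> {
            if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                System.out.println("Server down, trigger recovery: " + event.getPath());
            }
        });
        Thread.sleep(Long.MAX_VALUE); // keep the process alive to receive events
    }
}
```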
This is what happens the first time a client reads or writes to HBase:
1. The client gets the Region Server that hosts the META table from
ZooKeeper.
2. The client will query the META server to get the Region Server
corresponding to the row key it wants to access. The client caches this
information along with the META table location.
3. The client gets the row from the corresponding Region Server.
For future reads, the client uses the cache to retrieve the META location and
previously read row keys. Over time, it does not need to query the META table,
unless there is a miss because a region has moved; then it will re-query and
update the cache.
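All of this lookup and caching is handled inside the client library; application code simply issues a Get. A minimal sketch with the HBase 2.x Java client (the table name, column family, and row key are hypothetical):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class FirstRead {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // The first Get triggers the ZooKeeper -> META -> Region Server
            // lookup; later Gets reuse the client's cached META information.
            Result result = table.get(new Get(Bytes.toBytes("row-42")));
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"));
            System.out.println(value == null ? "not found" : Bytes.toString(value));
        }
    }
}
```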
HBase Meta Table
The META table is an HBase table that keeps a list of all regions in the
system.
The META table is like a B-tree.
The META table structure is as follows:
- Key: region start key, region id
- Values: Region Server
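Because the META table keys are sorted (the B-tree analogy above), finding the region for a row key is a "greatest start key less than or equal to the row key" lookup. A simplified sketch of that lookup, illustrative only, using a plain Java TreeMap rather than HBase's internal structures:

```java
import java.util.TreeMap;

public class MetaLookup {
    public static void main(String[] args) {
        // Region start key -> hosting server (hypothetical values).
        TreeMap<String, String> meta = new TreeMap<>();
        meta.put("", "regionserver-1");   // first region starts at the empty key
        meta.put("g", "regionserver-2");
        meta.put("n", "regionserver-3");

        // The region that holds a row is the one with the greatest
        // start key <= the row key, i.e. a floor lookup in the sorted map.
        String rowKey = "melissa";
        System.out.println(meta.floorEntry(rowKey).getValue()); // regionserver-2
    }
}
```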
Region Server Components
A Region Server runs on an HDFS DataNode and has the following components:
- WAL: the Write Ahead Log is a file on the distributed file system. The WAL
is used to store new data that hasn't yet been persisted to permanent
storage; it is used for recovery in the case of failure.
- BlockCache: the read cache. It stores frequently read data in memory.
Least recently used data is evicted when full.
- MemStore: the write cache. It stores new data which has not yet been
written to disk. It is sorted before writing to disk. There is one MemStore
per column family per region.
- HFiles: store the rows as sorted KeyValues on disk.
HBase Write Steps (1)
When the client issues a Put request, the first step is to write the data to the
write-ahead log, the WAL:
- Edits are appended to the end of the WAL file that is stored on disk.
- The WAL is used to recover not-yet-persisted data in case a server crashes.
HBase Write Steps (2)
Once the data is written to the WAL, it is placed in the MemStore. Then, the put
request acknowledgement returns to the client.
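From the application's point of view, the WAL append and MemStore update happen inside a single Put call. A minimal sketch with the HBase 2.x Java client (the table name, column family, and values are hypothetical):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class SimplePut {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {
            Put put = new Put(Bytes.toBytes("row-42"))
                .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            // put() returns only after the edit is in the WAL and the MemStore.
            table.put(put);
        }
    }
}
```

Durability can be relaxed per mutation with Put#setDurability(Durability.SKIP_WAL), at the cost of losing those edits if the server crashes before a flush.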
HBase MemStore
The MemStore stores updates in memory as sorted KeyValues, the same as they
would be stored in an HFile. There is one MemStore per column family. The
updates are sorted per column family.
Note that this is one reason why there is a limit on the number of column families
in HBase: there is one MemStore per CF, and when one is full, they all flush. The
flush also saves the last written sequence number, so the system knows what was
persisted so far.
The highest sequence number is stored as a meta field in each HFile, to reflect
where persisting has ended and where to continue. On region startup, the
sequence number is read, and the highest is used as the sequence number for
new edits.
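Flushes are normally triggered automatically by MemStore size, but they can also be requested explicitly. A hedged sketch using the HBase 2.x Admin API (the table name is hypothetical):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;

public class ForceFlush {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Flushes the MemStores of every region of the table to new HFiles;
            // because flushing is per column family, all CFs are written out.
            admin.flush(TableName.valueOf("users"));
        }
    }
}
```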
HBase HFile
Data is stored in an HFile which contains sorted key/values. When the MemStore
accumulates enough data, the entire sorted KeyValue set is written to a new
HFile in HDFS. This is a sequential write. It is very fast, as it avoids moving the
disk drive head.
The trailer points to the meta blocks, and is written at the end of persisting the
data to the file. The trailer also has information like bloom filters and time range
info. Bloom filters help to skip files that do not contain a certain row key. The time
range info is useful for skipping the file if it is not in the time range the read is
looking for.
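Bloom filters are configured per column family. A minimal sketch (HBase 2.x Java client; the table and family names are hypothetical) that enables row-level bloom filters when creating a table:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class BloomFilterTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
                .newBuilder(Bytes.toBytes("cf"))
                .setBloomFilterType(BloomType.ROW) // one filter entry per row key
                .build();
            admin.createTable(TableDescriptorBuilder
                .newBuilder(TableName.valueOf("events"))
                .setColumnFamily(cf)
                .build());
        }
    }
}
```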
HFile Index
An HFile contains a multi-layered index, which allows HBase to seek to the data
without having to read the whole file. The index is loaded when the HFile is
opened and kept in memory. This allows lookups to be performed with a single
disk seek.
HBase Read Merge
We have seen that the KeyValue cells corresponding to one row can be in
multiple places: row cells already persisted are in HFiles, recently updated cells
are in the MemStore, and recently read cells are in the BlockCache. So when
you read a row, how does the system get the corresponding cells to return? A
read merges KeyValues from the BlockCache, MemStore, and HFiles in the
following steps:
1. First, the scanner looks for the row cells in the BlockCache, the read
cache. Recently read KeyValues are cached here, and least recently
used entries are evicted when memory is needed.
2. Next, the scanner looks in the MemStore, the write cache in memory
containing the most recent writes.
3. If the scanner does not find all of the row cells in the MemStore and
BlockCache, then HBase will use the BlockCache indexes and bloom filters
to load HFiles into memory, which may contain the target row cells.
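On the read side, giving HBase narrowing hints lets it skip whole HFiles using the bloom filter and time range metadata described earlier. A minimal sketch (HBase 2.x Java client; table name, row keys, and times are hypothetical) of a bounded scan with a time range:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeRangeScan {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("events"))) {
            long now = System.currentTimeMillis();
            Scan scan = new Scan()
                .withStartRow(Bytes.toBytes("user-100"))
                .withStopRow(Bytes.toBytes("user-200"))
                .setTimeRange(now - 3600_000L, now); // only cells from the last hour
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}
```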
HBase Read Merge
As discussed earlier, there may be many HFiles per MemStore (one per flush),
which means that for a read, multiple files may have to be examined, which can
affect performance. This is called read amplification.
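Compaction counteracts read amplification by merging many HFiles into fewer ones. It normally runs in the background, but a major compaction can also be requested explicitly; a hedged sketch with the HBase 2.x Admin API (the table name is hypothetical):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;

public class TriggerCompaction {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Asynchronously requests a major compaction, which rewrites the
            // HFiles of each region (per column family) into a single HFile.
            admin.majorCompact(TableName.valueOf("events"));
        }
    }
}
```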
HBase Crash Recovery
When the HMaster detects that a region server has crashed, the HMaster
reassigns the regions from the crashed server to active Region Servers. In order
to recover the crashed region server's MemStore edits that were not flushed to
disk, the HMaster splits the WAL belonging to the crashed region server into
separate files and stores these files in the new region servers' data nodes. Each
Region Server then replays the WAL from the respective split WAL, to rebuild the
MemStore for that region.
Data Recovery
WAL files contain a list of edits, with one edit representing a single put or delete.
Edits are written chronologically, so, for persistence, additions are appended to
the end of the WAL file that is stored on disk.
What happens if there is a failure when the data is still in memory and not
persisted to an HFile? The WAL is replayed. Replaying a WAL is done by
reading the WAL and adding and sorting the contained edits into the current
MemStore. At the end, the MemStore is flushed to write the changes to an HFile.
MapR Academy
If you need to learn more about the HBase data model, we have 4 lessons that
will help you. Module 2 of the free course, DEV 320 - Apache HBase Data
Model and Architecture, will walk you through everything you need to know.
MapR Database offers many benefits over HBase, while maintaining the virtues
of the HBase API and the idea of data being sorted according to primary key.
MapR Database provides operational benefits such as no compaction delays and
automated region splits that do not impact the performance of the database. The
tables in MapR Database can also be isolated to certain machines in a cluster by
utilizing the topology feature of MapR. The final differentiator is that MapR
Database is just plain fast, due primarily to the fact that it is tightly integrated into
the MapR Distributed File and Object Store itself, rather than being layered on
top of a distributed file system that is layered on top of a conventional file system.
You can take this free On Demand training to learn more about MapR XD and
MapR Database
In this blog post, you learned more about the HBase architecture and its main
benefits over other NoSQL data store solutions. If you have any questions about
HBase, please ask them in the comments section below.