9 HBase
Vinod Kumar S,
ESDM Team,
CDAC, Hyderabad
HBase
◼ HBase (Hadoop Database) is a NoSQL database that is
widely used for storage in Big Data systems.
◼ It can run on top of Hadoop (HDFS) in distributed mode
or in standalone mode.
◼ It is meant to host large tables with billions of rows and
potentially millions of columns, running across a cluster
of commodity hardware.
◼ HBase is a powerful database that provides real-time queries
with the speed of a key/value store, as well as
offline or batch processing via MapReduce.
◼ HBase allows you to query for individual records as well
as derive aggregate analytic reports across a massive
amount of data.
◼ To serve search results over the internet in a timely way, Google
introduced the following technologies
❑ Google File System: a scalable distributed file system
❑ MapReduce: a model for processing large data sets in parallel
◼ Hadoop provides the open-source counterparts: HDFS (for the
Google File System) and MapReduce
❑ Hadoop enforces no constraints on the format of the data it
stores (e.g. JSON)
◼ Columnar Store
◼ Denormalized Storage
◼ Only CRUD operations
◼ ACID at row level
◼ A unique row identifier and the same set of columns for every row
◼ A fixed schema for each row
◼ If an attribute does not exist for a record, that particular cell in the data is left empty.
Every column is an attribute of a particular record.
◼ A traditional database is a two-dimensional model.
◼ You have to specify two dimensions, the unique identifier of the row and the
specific column, in order to identify a single cell.
Columnar Storage
Advantages of columnar store
◼ Columnar Storage
❑ New attributes can be added
dynamically as rows in this table
(see the example below)
❑ No space is wasted on
empty cells!
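As an illustrative sketch (the 'census' table, 'personal' column family and values are assumed names, not part of any schema above), a new attribute can be added to one row from the HBase shell without changing any schema, and no other row spends storage on it:

put 'census', '2', 'personal:hobby', 'chess'   # new column created on the fly, only for row '2'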
RDBMS: Minimize Redundancy
◼ Employee Details
◼ Employee Subordinates
◼ Employee Address
◼ Normalization
❑ Optimizes Storage
Denormalized Storage
◼ Traditional RDBMS
❑ Data arranged in rows and columns
❑ Supports SQL
❑ Complex queries such as grouping, aggregates, joins, etc.
❑ Normalized storage to minimize redundancy and optimize space
❑ ACID compliant
◼ HBase
❑ Data arranged in a column-wise manner
❑ NoSQL database
❑ Only basic operations such as create, read, update and delete
❑ Denormalized storage to minimize disk seeks
❑ ACID compliant at the row level
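A minimal sketch of the denormalized approach, using a hypothetical 'employee' table (names and values invented for illustration): instead of joining separate details, address and subordinates tables, everything about one employee lives in a single row under different column families, so one read returns the whole record:

create 'employee', 'details', 'address', 'subordinates'
put 'employee', 'emp1', 'details:name', 'Asha'
put 'employee', 'emp1', 'address:city', 'Hyderabad'
put 'employee', 'emp1', 'subordinates:s1', 'emp7'
get 'employee', 'emp1'      # one lookup returns the whole denormalized record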
HBase has a four-dimensional data model
◼ Row key
◼ Column Family
◼ Column
◼ Timestamp
◼ Rowkey
❑ Uniquely identifies a row
❑ Can be primitives, structures, or arrays
❑ Represented internally as a byte array
❑ Sorted in ascending order
◼ Column Family
❑ All rows have the same set of column families
❑ Each column family is stored in a separate data file
❑ Set up at schema definition time
❑ Can have different columns for each row
◼ Column
❑ Columns are units within a column family
❑ New columns can be added on the fly
❑ ColumnFamily:ColumnName, e.g. Work:Department
◼ Timestamp
❑ Used as the version number for the values stored
in a column
❑ The value for any version can be accessed
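A short HBase shell sketch of the four dimensions, assuming a 'census' table with 'personal' and 'Work' column families (names chosen only to match the examples in this chapter): the row key, column family, column and timestamp together identify one versioned value.

create 'census', {NAME => 'personal', VERSIONS => 3}, {NAME => 'Work'}
put 'census', '1', 'Work:Department', 'Sales'
put 'census', '1', 'Work:Department', 'Marketing'   # newer timestamp; older value kept as a version
get 'census', '1', {COLUMN => 'Work:Department', VERSIONS => 3}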
Insert and update data using the put
command
◼ Rowkey
◼ Insert data one cell at a time
◼ The column family prefix is required for every column
qualifier
put 'census', '1', 'personal:name', 'Mike Jones'
scan 'census'
put '<HBase_table_name>',
'<row_key>', '<colfamily:colname>', '<value>'
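To read back what was inserted (table and values as assumed above), the corresponding get commands are:

get 'census', '1'                                # the whole row
get 'census', '1', {COLUMN => 'personal:name'}   # a single cell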
SQL> select * from tablename
hbase> scan 'tablename'
Syntax: KeyOnlyFilter ()
Ex: scan 'tablename', {FILTER => "KeyOnlyFilter()"}
◼ KeyOnlyFilter –
❑ This filter doesn't take any arguments. It returns
solely the key part of every key-value.
◼ FirstKeyOnlyFilter –
❑ This filter doesn't take any arguments. It returns only
the first key-value of each row.
Syntax: FirstKeyOnlyFilter ()
Ex: scan 'tablename', {FILTER => "FirstKeyOnlyFilter()"}
◼ ColumnPrefixFilter-
❑ This filter takes one argument a column prefix. It returns only
those key-values present in a column that starts with the
specified column prefix. The column prefix must be of the form
qualifier
ColumnPrefixFilter ('<column_prefix>')
Example: ColumnPrefixFilter ('Col')
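Used inside a scan (the table name here is assumed), this looks like:

scan 'tablename', {FILTER => "ColumnPrefixFilter('Col')"}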
◼ MultipleColumnPrefixFilter
❑ This filter takes a list of column prefixes. It returns key-values that
are present in a column that starts with any of the specified
column prefixes. Each of the column prefixes must be of the form
qualifier
MultipleColumnPrefixFilter ('<column_prefix>',
'<column_prefix>', …, '<column_prefix>')
Example: MultipleColumnPrefixFilter ('Col1', 'Col2')
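Used inside a scan (table name assumed), this looks like:

scan 'tablename', {FILTER => "MultipleColumnPrefixFilter('Col1', 'Col2')"}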
◼ ValueFilter-
❑ This filter takes a compare operator and a comparator. It compares each
value with the comparator using the compare operator and, if the
comparison returns true, returns that key-value.
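For example (a sketch with an assumed value; the 'binary:' prefix selects a binary comparator):

Syntax: ValueFilter (<compare_op>, '<comparator>')
Ex: scan 'tablename', {FILTER => "ValueFilter(=, 'binary:Mike Jones')"}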
◼ PrefixFilter-
❑ This filter takes one argument, a row key prefix. It returns
only those key-values present in rows that start with
the specified row prefix
PrefixFilter ('<row_prefix>')
Example: PrefixFilter ('Row')
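Used inside a scan (table name assumed), this looks like:

scan 'tablename', {FILTER => "PrefixFilter('Row')"}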
HBase Architecture
◼ HBase has three major components:
❑ The Client library,
❑ A Master server, and
❑ Region servers (Region servers can be added or
removed as per requirement)
HBase Architecture
◼ The master server -
❑ Assigns regions to the region servers with the
help of Apache ZooKeeper.
❑ Handles load balancing of the regions across region
servers. It unloads the busy servers and shifts the
regions to less occupied servers.
❑ Maintains the state of the cluster by negotiating the
load balancing.
❑ Is responsible for schema changes and other metadata
operations such as creation of tables and column
families.
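For illustration, the schema and metadata operations that go through the master correspond to shell commands such as the following (the 'census' table and its column families are assumed names):

create 'census', 'personal', 'Work'                    # create a table with two column families
alter 'census', {NAME => 'personal', VERSIONS => 5}    # change a column family property
disable 'census'
drop 'census'                                          # remove the table entirely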
HBase - Architecture
◼ Regions
❑ Regions are nothing but tables that are split up and spread across
the region servers
ZooKeeper
◼ HMaster and HRegionServers
register themselves with
ZooKeeper
Big Picture
◼ HBase has three types of
servers in a master-slave type of
architecture.
◼ The HBase Master process (HMaster)
handles region assignment and DDL
(create, delete tables) operations
◼ Region servers serve data for
reads and writes. When
accessing data, clients
communicate with HBase
RegionServers directly.
◼ ZooKeeper, a distributed
coordination service, maintains live
cluster state and provides server failure
notification.
◼ Hadoop DataNode stores the
data that the RegionServer is
managing
◼ HMaster and HRegionServers
register themselves with
ZooKeeper
◼ HBase Tables are divided horizontally by row key range into “Regions.”
A region contains all rows in the table between the region's start key and
end key. Regions are assigned to the nodes in the cluster, called
“Region Servers,” and these serve data for reads and writes. A region
server can serve about 1,000 regions.
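The current region-to-server assignment can be inspected from the shell, for example (the output naturally depends on the cluster):

status 'detailed'    # lists each region server and the regions it is serving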
A master:
◼ Coordinating the region servers
❑ Assigning regions on startup, re-assigning regions for recovery
or load balancing
❑ Monitoring all RegionServer instances in the
cluster (listens for notifications from ZooKeeper)
◼ Admin functions
❑ Interface for creating, deleting, updating tables
HBase uses ZooKeeper as a distributed coordination service to
maintain server state in the cluster. ZooKeeper maintains which
servers are alive and available, and provides server failure
notification.
The HBase catalog table, called the META
table, holds the location of the
regions in the cluster. ZooKeeper
stores the location of the META table.
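The META table can be inspected like any other table; in current HBase versions it is named 'hbase:meta' (a sketch, output omitted):

scan 'hbase:meta', {LIMIT => 5}   # each row maps a region (table, start key) to its region server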
HBase Write Steps (1)
◼ When the client issues a put
request, the data is first written
to the Write-Ahead Log (WAL).
The WAL is used to recover
not-yet-persisted data in
case a server crashes.
HBase Write Steps (2)
◼ Once the data is written
to the WAL, it is placed in
the MemStore. Then, the
put request
acknowledgement
returns to the client.
◼ Regions in practice
❑ Initially, there is one region
❑ System monitors region size: if a threshold is attained, SPLIT
❑ Regions are split in two at the middle key
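To avoid starting with a single region, a table can also be pre-split at creation time; a sketch with an assumed table name and split keys:

create 'census', 'personal', SPLITS => ['10', '20', '30']   # four regions from the start, split at the given row keys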