Unit - IV_Notes
As the world becomes more digital, the amount of data to process increases day by day. Conventional database management systems became unable to handle and query such data efficiently, and these limitations led to the development of new solutions such as HBase, the main topic of this unit.
In this unit we cover the basics of HBase and the functionality of its major components.
Overview
History
The first HBase prototype was created as a Hadoop contribution in February 2007.
The first usable HBase was released in October 2007, along with Hadoop 0.15.0.
Definition
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an
open-source project and is horizontally scalable.
Apache HBase is a column-oriented key/value data store built to run on top of the Hadoop
Distributed File System (HDFS).
By storing data in columns rather than rows, the database can more precisely access the data it needs to answer a query, rather than scanning and discarding unwanted data in rows. Query performance is therefore increased for certain workloads.
For example, a common method of storing a table is to serialize each row of data, one row after another. A column-oriented database instead serializes all of the values of a column together, then the values of the next column, and so on.
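As an illustrative sketch (the original example figures are not reproduced here), consider a small employee table:

ID   Name    Salary
1    Smith   40000
2    Jones   50000

Row-oriented layout (each row stored contiguously):
1,Smith,40000 ; 2,Jones,50000
Column-oriented layout (each column stored contiguously):
1,2 ; Smith,Jones ; 40000,50000

A query such as the average salary only has to read the Salary block in the column-oriented layout, instead of scanning and discarding the other fields of every row.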
Features
Scalability
HBase is horizontally scalable: capacity is increased by adding more machines (nodes) to the cluster, which is also called scaling out. Vertical scalability, on the other hand, increases capacity by adding more resources, such as more memory or an additional CPU, to a single machine. Scaling vertically, also called scaling up, usually requires downtime while the new resources are being added and has limits that are defined by the hardware.
What you need to consider while choosing horizontal scalability over vertical scalability: “Scaling horizontally has both advantages and disadvantages. For example, adding inexpensive commodity computers to a cluster might seem to be a cost-effective solution at first glance, but it’s important for the administrator to know whether the licensing costs for those additional servers, the additional operations cost of powering and cooling, and the larger footprint they will occupy in the data center truly make scaling horizontally a better choice than scaling vertically.”
MemStore: a write cache that stores new data that has not yet been written to disk. There is one MemStore per column family (per region).
An HBase Store hosts a MemStore and zero or more StoreFiles (HFiles). A Store corresponds to a column family of a table for a given region.
The Write-Ahead Log (WAL) records all changes to data in HBase to file-based storage. If a RegionServer crashes or becomes unavailable before the MemStore is flushed, the WAL ensures that the changes to the data can be replayed.
Consistency
Consistency in database systems refers to the requirement that any given database transaction
must change affected data only in allowed ways. Any data written to the database must be valid
according to all defined rules, including constraints, cascades, triggers, and any combination
thereof.
Write transactions in HBase are always performed under a strong consistency model, which guarantees that transactions are ordered and replayed in the same order by all copies of the data. Under timeline consistency, get and scan requests can be answered from data that may be stale.
The java client API for HBase is used to perform CRUD operations on HBase tables. HBase is
written in Java and has a Java Native API. Therefore it provides programmatic access to Data
Manipulation Language (DML).
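As an illustrative sketch of this API (the table name customer and column family info are assumptions, not part of HBase itself), a basic create/read/delete sequence looks roughly like this:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCrudExample {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("customer"))) {

            // Create/update: write one cell into the (assumed) 'info' column family
            Put put = new Put(Bytes.toBytes("mfowler"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Boston"));
            table.put(put);

            // Read: fetch the cell back by row key
            Get get = new Get(Bytes.toBytes("mfowler"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))));

            // Delete: remove the whole row
            table.delete(new Delete(Bytes.toBytes("mfowler")));
        }
    }
}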
Block cache
HBase supports a block cache to improve read performance. When performing a scan, if the block cache is enabled and there is room remaining, data blocks read from StoreFiles on HDFS are cached in the region server’s Java heap space, so that subsequent accesses to data in the same block can be served from the cached copy. The block cache helps reduce disk I/O when retrieving data.
Block cache is configurable at table’s column family level. Different column families can have
different cache priorities or even disable the block cache. Applications leverage this cache
mechanism to fit different data sizes and access patterns.
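As a rough sketch of this configurability, assuming the HBase 2.x Java admin API (the table events and column family raw are hypothetical names), the block cache can be disabled for a column family, and an individual scan can also be told not to use the cache:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class BlockCacheExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin();
             Table table = conn.getTable(TableName.valueOf("events"))) {

            // Disable the block cache for a column family whose data is rarely re-read.
            ColumnFamilyDescriptor rawLogs = ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("raw"))
                    .setBlockCacheEnabled(false)
                    .build();
            admin.modifyColumnFamily(TableName.valueOf("events"), rawLogs);

            // A one-off full scan can skip the cache so it does not evict hot blocks.
            Scan scan = new Scan();
            scan.setCacheBlocks(false);
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}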
Bloom filter
A Bloom filter, named for its creator, Burton Howard Bloom, is a data structure which is designed
to predict whether a given element is a member of a set of data. A positive result from a Bloom
filter is not always accurate, but a negative result is guaranteed to be accurate. Bloom filters are
designed to be “accurate enough” for sets of data which are so large that conventional hashing
mechanisms would be impractical.
In terms of HBase, Bloom filters provide a lightweight in-memory structure to reduce the number
of disk reads for a given Get operation to only the StoreFiles likely to contain the desired Row.
The potential performance gain increases with the number of parallel reads.
Bloom filters help reduce the number of I/O operations in the following situation: the StoreFiles are all from one column family and have a similar spread in row keys, although only a few actually hold an update to a specific row. The block index has a spread across the entire row key range and therefore always reports positive for containing a random row. Without a Bloom filter, the region server would need to load every block to check whether the block actually contains a cell of the row.
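A minimal sketch of enabling a row-level Bloom filter on a column family via the HBase 2.x Java admin API (the table sales and column family orders are hypothetical; in recent HBase versions ROW Bloom filters are typically enabled by default):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class BloomFilterConfigExample {
    static void enableRowBloomFilter(Admin admin) throws java.io.IOException {
        ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
                .newBuilder(Bytes.toBytes("orders"))
                .setBloomFilterType(BloomType.ROW)   // BloomType.ROWCOL also hashes the column qualifier
                .build();
        admin.modifyColumnFamily(TableName.valueOf("sales"), cf);
    }
}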
HBase vs RDBMS
HBase Architecture
Row key: the reference of the row; it is used to make the search for a record faster.
Column families: combinations of sets of columns. Data belonging to the same column family can be accessed together in a single seek, allowing faster processing.
Cell: the storage area of the data. Each cell is identified by a row key and a column qualifier.
Region Server: serves data for reads and writes. Region servers run on the different machines of the Hadoop cluster. Each region server hosts one or more regions, an HLog (the write-ahead log), and the stores (MemStore and HFiles) for those regions.
To manage this system, ZooKeeper and the HMaster work together. The active HMaster sends heartbeats to ZooKeeper, which tracks its status, and to guarantee fault tolerance there is an inactive (standby) HMaster that acts as a backup.
Region servers also send heartbeat signals to ZooKeeper to report their status (ready for read and write operations).
In this section we discuss what happens when a client reads or writes data in HBase.
There is a special HBase catalog table called the META table, which holds the locations of the regions in the cluster. A read proceeds as follows:
1. The client asks ZooKeeper for the region server that hosts the META table.
2. The client queries that META server to get the region server responsible for the row key it wants to access.
3. The client caches this information along with the META table location.
4. Finally, the client contacts the region server holding the row key, and that region server returns the requested row or rows.
(Figure: the META table)
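The lookup described above is handled transparently by the HBase client library. As a rough sketch (the table name customer is an assumption), the RegionLocator API can show which region server currently serves a given row key:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionLookupExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("customer"))) {
            // The client library performs the META lookup and caches the result.
            HRegionLocation location = locator.getRegionLocation(Bytes.toBytes("mfowler"));
            System.out.println("Row 'mfowler' is served by " + location.getServerName());
        }
    }
}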
The following steps occur when a client issues a write command:
1. Write the data to the write-ahead log (WAL). HBase always has the WAL to look into if any error occurs while writing data.
2. Once the data is written to the WAL, it is copied to the MemStore.
3. Once the data is placed in the MemStore, the client receives an acknowledgement (ACK).
4. When the MemStore reaches its threshold, it flushes (commits) the data into an HFile.
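As a small sketch of how the WAL participates in this write path from the client's point of view (the table handle, the column family cf, and the row key are assumptions), a Put's durability setting controls step 1:

import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WalWriteExample {
    static void writeWithWal(Table table) throws java.io.IOException {
        Put put = new Put(Bytes.toBytes("row-1"));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
        // Step 1 above: the edit is recorded in the WAL before the MemStore.
        // SKIP_WAL would trade durability for speed: the edit is lost if the
        // region server crashes before the MemStore is flushed.
        put.setDurability(Durability.SYNC_WAL);
        table.put(put);   // the client receives its ACK once the WAL and MemStore hold the edit
    }
}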
Applications of HBase
Medical
In the medical field, HBase is used to store genome sequences and run MapReduce on them, and to store the disease histories of people or of an area.
Sports
E-Commerce
HBase is used to record and store logs about customers' search histories, and to perform analytics and then target advertisements for better business.
There are many popular companies using HBase, some of them are:
1. Mozilla
2. Facebook
3. Infolinks
Infolinks is an in-text ad provider company. It uses HBase to process advertisement selection and user events for its In-Text ad network. Moreover, to optimize ad selection, it uses the reports that HBase generates as feedback for its production system.
4. Twitter
A company like Twitter also runs HBase across its entire Hadoop cluster. For them, HBase offers a distributed, read/write backup of all MySQL tables in their production backend. That helps engineers run MapReduce jobs over the data while maintaining the ability to apply periodic row updates.
5. Yahoo!
One of the most famous companies, Yahoo!, also uses HBase. There, HBase helps store document fingerprints in order to detect near-duplicates.
Column-family stores, such as Cassandra [Cassandra], HBase [Hbase], Hypertable [Hypertable], and
Amazon SimpleDB [Amazon SimpleDB], allow you to store data with keys mapped to values and the
values grouped into multiple column families, each column family being a map of data.
A column has a key (such as firstName) and a value (such as Martin), and has a timestamp attached to it. A row is a collection of columns attached or linked to a key; a collection of similar rows makes a column family. When the columns in a column family are simple columns, the column family is known as a standard column family.
//column family
{
  //row
  "pramod-sadalage" : {
    firstName: "Pramod",
    lastName: "Sadalage",
    lastVisit: "2012/12/12"
  }
  //row
  "martin-fowler" : {
    firstName: "Martin",
    lastName: "Fowler",
    location: "Boston"
  }
}
Each column family can be compared to a container of rows in an RDBMS table, where the key identifies the row and the row consists of multiple columns. The difference is that various rows do not have to have the same columns, and columns can be added to any row at any time without having to add them to other rows. We have the pramod-sadalage row and the martin-fowler row with different columns; both rows are part of the same column family.
When a column consists of a map of columns, then we have a super column. A super column
consists of a name and a value which is a map of columns. Think of a super column as a container of
columns.
{
  name: "book:978-0767905923",
  value: {
    author: "Mitch Albon",
    title: "Tuesdays with Morrie",
    isbn: "978-0767905923"
  }
}
When we use super columns to create a column family, we get a super column family.
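A rough illustrative sketch of a super column family (not the book's original figure; the address super column and its values are made up for illustration):

//super column family
{
  //row
  "martin-fowler" : {
    //super column
    "book:978-0767905923" : {
      author: "Mitch Albon",
      title: "Tuesdays with Morrie",
      isbn: "978-0767905923"
    },
    //super column
    "address" : {
      city: "Boston",
      country: "USA"
    }
  }
}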
Super column families are good for keeping related data together, but when some of the columns are not needed most of the time, those columns are still fetched and deserialized by Cassandra, which may not be optimal.
Cassandra puts the standard and super column families into keyspaces. A keyspace is similar to a
database in RDBMS where all column families related to the application are stored. Keyspaces have
to be created so that column families can be assigned to them:
create keyspace ecommerce
10.2.1. Consistency
When a write is received by Cassandra, the data is first recorded in a commit log, then written to an
in-memory structure known as memtable. A write operation is considered successful once it’s
written to the commit log and the memtable. Writes are batched in memory and periodically written
out to structures known as SSTable. SSTables are not written to again after they are flushed; if there
are changes to the data, a new SSTable is written. Unused SSTables are reclaimed by compaction.
Let’s look at the read operation to see how consistency settings affect it. If we have a consistency
setting of ONE as the default for all read operations, then when a read request is made, Cassandra
returns the data from the first replica, even if the data is stale. If the data is stale, subsequent reads
will get the latest (newest) data; this process is known as read repair. The low consistency level is
good to use when you do not care if you get stale data and/or if you have high read performance
requirements.
Similarly, if you are doing writes, Cassandra would write to one node’s commit log and return a
response to the client. The consistency of ONE is good if you have very high write performance
requirements and also do not mind if some writes are lost, which may happen if the node goes down
before the write is replicated to other nodes.
With the Hector client library, we can instead set the default consistency level to QUORUM for both reads and writes:
ConfigurableConsistencyLevel quorum = new ConfigurableConsistencyLevel();
quorum.setDefaultReadConsistencyLevel(HConsistencyLevel.QUORUM);
quorum.setDefaultWriteConsistencyLevel(HConsistencyLevel.QUORUM);
Using the QUORUM consistency setting for both read and write operations ensures that a majority of the nodes respond to the read and that the column with the newest timestamp is returned to the client, while the replicas that do not have the newest data are repaired via the read repair operation. During write operations, the QUORUM consistency setting means that the write has to propagate to a majority of the nodes before it is considered successful and the client is notified.
Using ALL as the consistency level means that all nodes have to respond to reads or writes, which makes the cluster intolerant of faults: even when one node is down, the write or read is blocked and reported as a failure. It is therefore up to the system designers to tune the consistency levels as the application requirements change. Within the same application, there may be different requirements of consistency; they can also change per operation. For example, showing review comments for a product has different consistency requirements than reading the status of the last order placed by the customer.
During keyspace creation, we can configure how many replicas of the data we need to store. This number determines the replication factor of the data. If you have a replication factor of 3, the data is copied to three nodes. When writing and reading data with Cassandra, if you specify a consistency value of 2 for both reads and writes, then R + W is greater than the replication factor (2 + 2 > 3), which gives you better consistency during writes and reads.
We can run the node repair command for the keyspace and force Cassandra to compare every key
it’s responsible for with the rest of the replicas. As this operation is expensive, we can also just
repair a specific column family or a list of column families:
repair ecommerce
While a node is down, the data that was supposed to be stored by that node is handed off to other
nodes. As the node comes back online, the changes made to the data are handed back to the node. This
technique is known as hinted handoff. Hinted handoff allows for faster restore of failed nodes.
10.2.2. Transactions
Cassandra does not have transactions in the traditional sense—where we could start multiple writes
and then decide if we want to commit the changes or not. In Cassandra, a write is atomic at the row
level, which means inserting or updating columns for a given row key will be treated as a single
write and will either succeed or fail. Writes are first written to commit logs and memtables, and are
only considered good when the write to commit log and memtable was successful. If a node goes
down, the commit log is used to apply changes to the node, just like the redo log in Oracle.
You can use external transaction libraries, such as ZooKeeper [ZooKeeper], to synchronize your
writes and reads. There are also libraries such as Cages [Cages] that allow you to wrap your
transactions over ZooKeeper.
10.2.3. Availability
Cassandra is by design highly available, since there is no master in the cluster and every node is a
peer in the cluster. The availability of a cluster can be increased by reducing the consistency level of
the requests. Availability is governed by the (R + W) > N formula (“Quorums,” p. 57) where W is the
minimum number of nodes where the write must be successfully written, R is the minimum number of
nodes that must respond successfully to a read, and N is the number of nodes participating in the
replication of data. You can tune the availability by changing the R and W values for a fixed value of N.
In a 10-node Cassandra cluster with a replication factor for the keyspace set to 3 (N = 3), if we set
R = 2 and W = 2, then we have (2 + 2) > 3. In this scenario, when one node goes down,
availability is not affected much, as the data can be retrieved from the other two nodes. If W = 2 and
R = 1, when two nodes are down the cluster is not available for write but we can still read.
Similarly, if R = 2 and W = 1, we can write but the cluster is not available for read. With the R + W
> N equation, you are making conscious decisions about consistency tradeoffs.
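As a minimal illustration of this tuning (a hypothetical helper, not part of any Cassandra API), the quorum condition can be checked directly:

public class QuorumCheck {
    // Reads are guaranteed to see the latest successful write when the read set
    // and the write set must overlap in at least one replica, i.e. when R + W > N.
    static boolean isStronglyConsistent(int r, int w, int n) {
        return r + w > n;
    }

    public static void main(String[] args) {
        int n = 3;                                           // replication factor
        System.out.println(isStronglyConsistent(2, 2, n));   // true  -> consistent reads
        System.out.println(isStronglyConsistent(1, 1, n));   // false -> possibly stale reads
    }
}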
You should set up your keyspaces and read/write operations based on your needs—higher
availability for write or higher availability for read.
10.2.4. Query Features
When designing the data model in Cassandra, it is advised to make the columns and column families
optimized for reading the data, as it does not have a rich query language; as data is inserted in the
column families, data in each row is sorted by column names. If we have a column that is retrieved
much more often than other columns, it’s better performance-wise to use that value for the row key
instead.
10.2.4.1. Basic Queries
Basic queries that can be run using a Cassandra client include the GET, SET, and DEL. Before starting
to query for data, we have to issue the keyspace command use ecommerce;. This ensures that all of
our queries are run against the keyspace that we put our data into. Before starting to use the column
family in the keyspace, we have to define the column family.
CREATE COLUMN FAMILY Customer
WITH comparator = UTF8Type
AND key_validation_class = UTF8Type
AND column_metadata = [
  {column_name: city, validation_class: UTF8Type},
  {column_name: name, validation_class: UTF8Type},
  {column_name: web, validation_class: UTF8Type}
];
We have a column family named Customer with name, city, and web columns, and we are
inserting data in the column family with a Cassandra client.
SET Customer['mfowler']['city']='Boston';
SET Customer['mfowler']['name']='Martin Fowler';
SET Customer['mfowler']['web']='www.martinfowler.com';
Using the Hector [Hector] Java client, we can insert the same data in the column family.
// 'cassandra' (a template factory), 'key' (the row key), and 'values' (a map of
// column names to values) are assumed to be set up elsewhere.
ColumnFamilyTemplate<String, String> template =
    cassandra.getColumnFamilyTemplate();
ColumnFamilyUpdater<String, String> updater = template.createUpdater(key);
for (String name : values.keySet()) {
  updater.setString(name, values.get(name));
}
try {
  template.update(updater);   // writes all the queued column updates for the row
} catch (HectorException e) {
  handleException(e);
}
We can read the data back using the GET command. There are multiple ways to get the data; we can
get the whole column family.
GET Customer['mfowler'];
We can even get just the column we are interested in from the column family.
GET Customer['mfowler']['web'];
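The same read can be done with the Hector Java client; a rough sketch, reusing the template object from the Hector insert example shown earlier (the template is assumed to target the Customer column family):

ColumnFamilyResult<String, String> result = template.queryColumns("mfowler");
String web = result.getString("web");   // value of the 'web' column for row key 'mfowler'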
Getting the specific column we need is more efficient, as only the data we care about is returned—
which saves lots of data movement, especially when the column family has a large number of
columns. Updating the data is the same as using the SET command for the column that needs to be set
to the new value. Using the DEL command, we can delete either a column or the entire column family.
DEL Customer['mfowler']['city'];
DEL Customer['mfowler'];
Cassandra also supports secondary indexes on column values. These indexes are implemented as bit-mapped indexes and perform well for low-cardinality column values.
10.2.4.3. Cassandra Query Language (CQL)
Cassandra has a query language that supports SQL-like commands, known as Cassandra Query
Language (CQL). We can use the CQL commands to create a column family.
CREATE COLUMNFAMILY Customer (
  KEY varchar PRIMARY KEY,
  name varchar,
  city varchar,
  web varchar);
We can read data using the SELECT command. Here we read all the columns:
SELECT * FROM Customer
Indexes on columns are created using the CREATE INDEX command, and can then be used to query the data:
SELECT name,web FROM Customer WHERE city='Boston'
CQL has many more features for querying data, but it does not have all the features that SQL has.
CQL does not allow joins or subqueries, and its where clauses are typically simple.
10.2.5. Scaling
Scaling an existing Cassandra cluster is a matter of adding more nodes. As no single node is a master,
when we add nodes to the cluster we are improving the capacity of the cluster to support more writes
and reads. This type of horizontal scaling allows you to have maximum uptime, as the cluster keeps
serving requests from the clients while new nodes are being added to the cluster.
10.3. Suitable Use Cases
Let’s discuss some of the problems where column-family databases are a good fit.
10.3.1. Event Logging
Column-family databases with their ability to store any data structures are a great choice to store
event information, such as application state or errors encountered by the application. Within the
enterprise, all applications can write their events to Cassandra with their own columns and the
rowkey of the form appname:timestamp. Since we can scale writes, Cassandra would work ideally
for an event logging system (Figure 10.2).
Column-family databases are also a good fit for counters, for example counting visits to the pages of a web application. Once a counter column family is created, you can have arbitrary columns for each page visited within the web application for every user; the visit counters are then incremented with the INCR commands shown after the sketch below.
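A minimal sketch of how such a counter column family might be defined, following the cassandra-cli style used earlier (the definition is an assumption; the name visit_counter matches the commands below):

CREATE COLUMN FAMILY visit_counter
WITH default_validation_class = CounterColumnType
AND key_validation_class = UTF8Type
AND comparator = UTF8Type;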
INCR visit_counter['mfowler'][home] BY 1;
INCR visit_counter['mfowler'][products] BY 1;
INCR visit_counter['mfowler'][contactus] BY 1;