Chapter 12: HBase

HBase

1
HBase - Overview
• Since 1970, the RDBMS has been the standard solution for data storage and maintenance problems.
• With the advent of big data, companies realized the benefit of processing big data and started opting for solutions like Hadoop.
• Hadoop uses a distributed file system to store big data, and MapReduce to process it. Hadoop excels at storing and processing huge volumes of data in arbitrary, semi-structured, or unstructured formats.

2
Limitations of Hadoop

• Hadoop can perform only batch processing, and data will be accessed only in a sequential manner.

3
Hadoop Random Access Databases

• Applications such as HBase, Cassandra, CouchDB, Dynamo, and MongoDB are some of the databases that store huge amounts of data and access the data in a random manner.

4
What is HBase?
• HBase is a distributed, column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable.
• Its data model is similar to Google's BigTable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop File System (HDFS).
• It is a part of the Hadoop ecosystem that provides random, real-time read/write access to data in the Hadoop File System.

5
HBase
• HBase (Hadoop Database) is a Java implementation of Google’s
BigTable.
• HBase is the “Hadoop Database” — hey, it’s built into the
name, for goodness sake.
• Google defines BigTable as a “sparse, distributed, persistent,
multidimensional sorted map”

6
HBase is Persistent

• Persistent simply means that the data you store in BigTable (and HBase, for that matter) will persist or remain after your program or session ends.
• This is no different in concept from any other kind of persistent storage, such as a file on a filesystem.
• HBase leverages HDFS to persist its data to disk storage.

7
HBase is Distributed

• HBase and BigTable are built upon distributed filesystems, so that the underlying file storage can be spread out among an array of independent machines.
• HBase sits atop either Hadoop's Distributed File System (HDFS) or Amazon's Simple Storage Service (S3), while BigTable makes use of the Google File System (GFS).
• Data is replicated across a number of participating nodes.

8
Map
• At its core, HBase/BigTable is a map, like a dictionary in Python.
• A map is "an abstract data type composed of a collection of keys and a collection of values, where each key is associated with one value."
Example -
{
"zzzzz" : "woot",
"xyz" : "hello",
"aaaab" : "world",
"aaaaa" : "y"
}
9
Sorted
• In HBase/BigTable, the key/value pairs are kept strictly sorted. In other words, the row for the key "aaa" should be right next to the row with key "aab" and very far from the row with key "zzz".
• This sorting feature is actually very important, since these systems tend to be so huge and distributed. The sorting ensures that when you must scan the table, the items that are closely related are near each other.
• Note that the term "sorted" when applied to HBase/BigTable does not mean that "values" are sorted. The sorted ones are the keys.

10
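To make the sorted-map idea concrete, here is a minimal Java sketch using a TreeMap, which keeps its keys in sorted order just as HBase keeps row keys sorted. The class name and values are ours, purely for illustration; no HBase code is involved.

import java.util.TreeMap;

public class SortedMapDemo {
    public static void main(String[] args) {
        // A TreeMap keeps keys sorted, like row keys in HBase/BigTable.
        TreeMap<String, String> table = new TreeMap<>();
        table.put("zzzzz", "woot");
        table.put("xyz", "hello");
        table.put("aaaab", "world");
        table.put("aaaaa", "y");

        // Iteration follows key order, regardless of insertion order:
        // aaaaa=y, aaaab=world, xyz=hello, zzzzz=woot
        table.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}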
Example for Sorted
• Consider a table whose keys are domain names. It makes
the most sense to list them in reverse notation (so
"com.jimbojw.www" rather than "www.jimbojw.com") so
that rows about a subdomain will be near the parent
domain row.

11
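A quick Java sketch of why reverse notation helps: with keys kept in sorted order, reversed domain names place a parent domain and its subdomains next to each other. The extra domains below are made up for illustration.

import java.util.TreeMap;

public class ReverseDomainDemo {
    public static void main(String[] args) {
        TreeMap<String, String> rows = new TreeMap<>();
        // Reversed notation: subdomains sort next to their parent domain.
        rows.put("com.jimbojw.www", "...");
        rows.put("com.jimbojw.mail", "...");
        rows.put("com.example.www", "...");

        // Prints com.example.www, com.jimbojw.mail, com.jimbojw.www --
        // both jimbojw.com rows are neighbors in scan order.
        rows.keySet().forEach(System.out::println);
    }
}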
Multidimensional
• Multidimensional map - a map of maps
Example -
{
"1" : {
"A" : "x",
"B" : "z"
},
"aaaab" : {
"A" : "world",
"B" : "ocean"
}
}

12
Multidimensional
• Time is another dimension.
• All data is versioned, either using an integer timestamp (such as seconds since the epoch) or another integer of your choice. The client may specify the timestamp when inserting data.
• Each column family may have its own rules regarding how many versions of a given cell to keep.
• In most cases, applications simply ask for a given cell's data without specifying a timestamp. In that common case, HBase returns the most recent version (the one with the highest timestamp), since it stores versions in reverse chronological order.

13
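The sketch below models this "map of maps" view in plain Java: row key -> column qualifier -> (timestamp -> value), with timestamps kept in descending order so the most recent version comes first. It is a conceptual model of the data model only, not the HBase client API; the row key, qualifier, and values are invented.

import java.util.Comparator;
import java.util.TreeMap;

public class MultidimensionalDemo {
    public static void main(String[] args) {
        // row key -> column qualifier -> (timestamp -> value), newest first
        TreeMap<String, TreeMap<String, TreeMap<Long, String>>> table = new TreeMap<>();

        TreeMap<Long, String> versions = new TreeMap<>(Comparator.reverseOrder());
        versions.put(1383859182496L, "John");   // older version
        versions.put(1383859183000L, "Johnny"); // newer version

        TreeMap<String, TreeMap<Long, String>> row = new TreeMap<>();
        row.put("FN", versions);
        table.put("00001", row);

        // With no timestamp specified, take the first (most recent) entry:
        String latest = table.get("00001").get("FN").firstEntry().getValue();
        System.out.println(latest); // Johnny
    }
}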
HBase is a Column-oriented database
• Tables are sorted by row.
• The table schema defines only column families, which are the key-value pairs.
• A table can have multiple column families, and each column family can have any number of columns.
• Subsequent column values are stored contiguously on disk.
• In short, in HBase:
• A table is a collection of rows.
• A row is a collection of column families.
• A column family is a collection of columns.
• A column is a collection of key-value pairs.

14
HBase has a Multidimensional Sorted Map

• A map (also known as an associative array) is an abstract collection of key-value pairs, where each key is unique.
• Each value can have multiple versions, which makes the data model multidimensional.
• By default, data versions are implemented with a timestamp.
• The keys are stored sorted in byte-lexicographical order.

15
Lexicographical order
Lexicographical order is dictionary order: two strings are compared position by position from the left. That is to say, a string a is lexicographically smaller than a string b

• if, at the first position where a and b differ, a has the smaller element, or

• if no such position exists because a is a proper prefix of b (a runs out of elements first).

For example, "abc" < "b" even though "abc" is longer. HBase compares row keys this way, byte by byte.

16
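A small Java check of this ordering, using java.util.Arrays.compare (Java 9+), which compares arrays lexicographically. Note that it treats bytes as signed values; for the plain ASCII keys used here that matches HBase's unsigned byte comparison.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LexOrderDemo {
    static int cmp(String a, String b) {
        return Arrays.compare(a.getBytes(StandardCharsets.UTF_8),
                              b.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(cmp("aab", "ab"));  // negative: 'a' < 'b' at position 2
        System.out.println(cmp("aa", "aab"));  // negative: a prefix sorts first
        System.out.println(cmp("zzz", "aaa")); // positive: "zzz" > "aaa"
    }
}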
Storage Mechanism:
A table in HBase is a collection of rows (the table is sorted by row key).
-> A row is a collection of column families.
-> A column family is a collection of columns.
-> A column is a collection of key-value pairs.
-> A cell is identified by {row, column, version}.

17
How to Access Values

Row key - gives the complete row
Row key . Column family - gives all columns in that family
Row key . Column family . Column qualifier - gives a specific column's values

18
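As a sketch of these three access patterns with the HBase Java client (assuming an HBase 2.x client on the classpath, a running cluster, and a hypothetical 'customers' table whose family and qualifier names are invented for illustration):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class AccessPatterns {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customers"))) {

            // 1) Row key only: returns every cell in the row.
            Result wholeRow = table.get(new Get(Bytes.toBytes("00001")));
            System.out.println("cells in row: " + wholeRow.size());

            // 2) Row key + column family: every column in that family.
            Result oneFamily = table.get(new Get(Bytes.toBytes("00001"))
                    .addFamily(Bytes.toBytes("CustomerName")));
            System.out.println("cells in family: " + oneFamily.size());

            // 3) Row key + family + qualifier: one specific cell.
            Result oneCell = table.get(new Get(Bytes.toBytes("00001"))
                    .addColumn(Bytes.toBytes("CustomerName"), Bytes.toBytes("FN")));
            System.out.println(Bytes.toString(oneCell.getValue(
                    Bytes.toBytes("CustomerName"), Bytes.toBytes("FN"))));
        }
    }
}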
Hbase Data Model
• Data Store consists of one or more tables indexed by row keys
• Data is stored in rows with columns, and rows can have multiple versions.
• By default, data versioning for rows is implemented with time stamps.
• Columns are grouped into column families, which must be defined up front
during table creation.
• Column families are stored together on disk, which is why HBase is
referred to as a column-oriented data store

19
Logical View of Customer Contact Information in HBase

Row Key   Column Family : {Column Qualifier : Version : Value}

00001     CustomerName: {'FN': 1383859182496: 'John',
                         'LN': 1383859182858: 'Smith',
                         'MN': 1383859183001: 'Timothy',
                         'MN': 1383859182915: 'T'}
          ContactInfo:  {'EA': 1383859183030: 'John.Smith@xyz.com',
                         'SA': 1383859183073: '1 Hadoop Lane, NY 11111'}

00002     CustomerName: {'FN': 1383859183103: 'Jane',
                         'LN': 1383859183163: 'Doe'}
          ContactInfo:  {'SA': 1383859185577: '7 HBase Ave, CA 22222'}

20
Row keys
• Row keys are implemented as byte arrays and are sorted in byte-lexicographical order.
• The amount of data you can store in HBase can be huge, and the data you retrieve via your queries should be near each other.
• Row key design is one of the most important aspects of HBase data modelling.
• The row key should be defined in a way that allows related rows to be stored near each other. These related rows will be retrieved together by queries, and as long as they are stored near each other you should experience good performance; otherwise the performance of your system will suffer. A scan sketch follows below.
21
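A sketch of how this locality pays off at read time: one Scan over a contiguous key range picks up all the related rows in a single pass. It assumes the HBase 2.x client and an invented 'students' table whose row keys start with a branch code.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PrefixScan {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("students"))) {

            // All "IS..." row keys are stored contiguously, so one scan over
            // [IS, IT) reads them without touching unrelated rows.
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("IS"))
                    .withStopRow(Bytes.toBytes("IT")); // stop row is exclusive

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}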
ASCII
• 1 byte for each character
'A' - 41h; '0' - 30h

Exercise: find the ASCII codes for "MCE1960"

Fixed-length and variable-length codings

22
Unicode
• Unicode is a universal character encoding standard. It defines the way
individual characters are represented in text files, web pages, and other types
of documents.

• Unlike ASCII, which was designed to represent only basic English characters, Unicode
was designed to support characters from all languages around the world.

• The standard ASCII character set only supports 128 characters, while Unicode can
support roughly 1,000,000 characters.

• ASCII uses one byte to represent each character, while Unicode supports up to 4 bytes
for each character.
23
Unicode
• There are several different Unicode encodings –
UTF-8 (1-4 bytes per character)
UTF-16 (2 or 4 bytes per character)
UTF-32 (4 bytes per character)
• While UTF-8 supports up to 4 bytes per character, it would be inefficient to use four bytes to represent frequently used characters.
• Therefore, UTF-8 uses:
1 byte - to represent common English (ASCII) characters
2 bytes - to represent European (Latin), Hebrew, and Arabic characters
3 bytes - for Chinese, Japanese, Korean, and other Asian characters
4 bytes - for additional characters

24
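A quick Java check of these byte counts; the sample characters are ours, one from each UTF-8 length class.

import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        // One example character from each UTF-8 length class.
        String[] samples = { "A", "é", "中", "😀" };
        for (String s : samples) {
            int bytes = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.println(s + " -> " + bytes + " byte(s)");
        }
        // Prints: A -> 1, é -> 2, 中 -> 3, 😀 -> 4
    }
}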
Column Families
• Generally, column families remain fixed throughout the lifetime of an HBase table.
• But new column families can be added by using administrative commands.
• The official recommendation for the number of column families per table is three or fewer.
• You should store data with similar access patterns in the same column family.
Ex - You wouldn't want a customer's middle name stored in a separate column family from the first or last name, because you generally access all name data at the same time.
25
Column Families
Note :
All members of a column family are stored together on disk, so grouping data with similar access patterns reduces overall disk access and increases performance.

26
Column Qualifiers
• Column qualifiers are specific names assigned to your data values, in order to make sure you're able to accurately identify them.
• Unlike column families, column qualifiers can be virtually unlimited in content, length, and number, so new data can be added to column families on the fly.
• HBase stores the column qualifier with your value, and since HBase doesn't limit the number of column qualifiers, creating long column qualifiers can be quite costly in terms of storage.
Example: "LN:" is used instead of "LastName".
27
Versions
• HBase stores a version number for each value in the table.
• The version number is a timestamp by default.
• Versioned data is stored in decreasing order, so that the most recent value is returned by default.
Note –
1) Unix epoch time is used: HBase timestamps are the number of milliseconds since midnight, January 1, 1970 UTC.
2) A TTL (time to live) can be set.
3) The number of versions to keep per value can also be set. See the sketch below.

28
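Here is a sketch of working with versions through the HBase 2.x Java client: writing two versions of a cell with explicit timestamps, then reading both back, newest first. The table, family, and qualifier names are hypothetical.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionsDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        byte[] cf = Bytes.toBytes("CustomerName");
        byte[] qual = Bytes.toBytes("MN");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customers"))) {

            // Write two versions of the same cell with explicit timestamps.
            table.put(new Put(Bytes.toBytes("00001"))
                    .addColumn(cf, qual, 1383859182915L, Bytes.toBytes("T")));
            table.put(new Put(Bytes.toBytes("00001"))
                    .addColumn(cf, qual, 1383859183001L, Bytes.toBytes("Timothy")));

            // Ask for up to 2 versions; the newest comes back first.
            Get get = new Get(Bytes.toBytes("00001"))
                    .addColumn(cf, qual)
                    .readVersions(2);
            for (Cell cell : table.get(get).getColumnCells(cf, qual)) {
                System.out.println(cell.getTimestamp());
            }
        }
    }
}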
Key Value Pairs
• HBase stores data as key-value pairs, with the row key acting as the primary key.
• Specifying only the row key returns every column qualifier, version, and value related to that row key.
• You can make a more specific query:
RowKey:(Column Family:Column Qualifier:Version) => Value
Asking for everything under a row key is convenient, but the system spends more time fetching data than it would for a precisely specified cell.

Note - there are no data types in HBase; values are just one or more bytes. Again, simple but powerful, because you can store anything!

29
How to store the books of a 4th-year BE student
1. BE 4th-year books - cache
2. BE first-3-years books - main memory
3. PUC and school books - hard disk

30
Why HBase
Provides random access to huge amounts of data

31
How to Achieve This?
• Random access - use the cache concept
• Fast random access - store data in sorted order
• Fast writes - write into a buffer first

32
Region Server Components

33
HBase Architecture
Main components:
• Region server - serves data for reads and writes
• Master server - handles region assignment and DDL (create, delete tables) operations
• ZooKeeper - maintains live cluster state and performs distributed synchronization

34
Region Server

35
Regions
Three regions, each holding a contiguous, sorted range of row keys, spread across three region servers (Amar, Akbar, Anthony):

Amar             Akbar            Anthony
Row key  Value   Row key  Value   Row key  Value
IS25     34      CS05     44      ME05     44
IS36     45      CS36     45      ME36     05
…                …                …
IS78     67      CS108    77      ME203    87

36
HBase Architectural Components

37
HMaster Functions
• Coordinating the region servers
- Assigning regions on startup; re-assigning regions for recovery or load balancing
- Monitoring all RegionServer instances in the cluster (listens for notifications from ZooKeeper)
• Admin functions
- Interface for creating, deleting, and updating tables

38
Regions

39
Region Server Components
• WAL: the Write-Ahead Log is a file on the distributed file system. The WAL is used to store new data that hasn't yet been persisted to permanent storage; it is used for recovery in the case of failure.
• BlockCache: the read cache. It stores frequently read data in memory. Least-recently-used data is evicted when it is full.
• MemStore: the write cache. It stores new data that has not yet been written to disk. It is sorted before being written to disk. There is one MemStore per column family per region.
• HFiles store the rows as sorted KeyValues on disk.

40
Exercise

(The slide shows a sample customers/products table.)

• Identify the column families, column qualifiers, and column values
• Write a statement to access the details of the customer whose id is '1'
• Access the price of the product whose id is '3'
• Access the product details of the product whose id is '4'
41
Region Server Components

42
Region Split
• Initially there is one region per table. When a region grows too large, it splits into two
child regions. Both child regions, representing one-half of the original region, are
opened in parallel on the same Region server, and then the split is reported to the
HMaster.
• For load balancing reasons, the HMaster may schedule for new regions to be moved off
to other servers.

43
HBase MemStore
• The MemStore stores updates in memory as sorted KeyValues, the same
as it would be stored in an HFile.
• There is one MemStore per column family.
• The updates are sorted per column family.

44
HBase Region Flush
• When the MemStore accumulates enough data, the entire sorted set is written to a new
HFile in HDFS.
• HBase uses multiple HFiles per column family, which contain the actual cells, or
KeyValue instances.
• There is one MemStore per CF; when one is full, they all flush.
• This is one reason why there is a limit to the number of column families in HBase.

45
HBase Minor Compaction
• HBase will automatically pick some smaller HFiles and rewrite them into fewer, bigger HFiles. This process is called minor compaction.
• Minor compaction reduces the number of storage files by rewriting smaller files into
fewer but larger ones, performing a merge sort.

46
Compaction
Three HFiles with overlapping key ranges (labelled Amar, Akbar, Anthony) are merge-sorted into one larger HFile:

Amar             Akbar            Anthony
Row key  Value   Row key  Value   Row key  Value
IS25     34      IS15     34      IS35     34
IS36     45      IS36     55      IS46     45
…                …                …
IS78     67      IS98     67      IS78     69

47
HBase Major Compaction
• Major compaction merges and rewrites all the HFiles in a region to one HFile per column family, and in the process drops deleted or expired cells. This improves read performance.
• Since major compaction rewrites all of the files, lots of disk I/O and network traffic might occur during the process. This is called write amplification.

48
Data Recovery
• What happens if there is a failure when the data is still in memory and
not persisted to an HFile? The WAL is replayed.

49
Team Assignment – Making a Video

[Diagram: Farzan, Rakesh, and Puneet all working from one shared video file - an analogy for a shared lock]

50
Vote Counting

[Diagram: vote counts from Texas, Florida, Alaska, and New York being combined - an analogy for distributed synchronization]

51
ZooKeeper?
• Why? Managing a distributed application on a large cluster is a daunting task.
• What? In short, it provides primitives that enable applications to run as distributed systems. At its best, ZooKeeper allows developers to focus on the core application without worrying about the distributed nature of the application.

ZooKeeper is a distributed, open-source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher-level services for synchronization, configuration maintenance, and groups and naming.
52
How ZooKeeper Works Internally
• ZooKeeper follows a simple client-server model, where clients are nodes that make use of the service and servers are nodes that provide the service.
• A collection of ZooKeeper servers forms a ZooKeeper ensemble.
• Once a ZooKeeper ensemble starts and the leader election process completes, it will wait for clients to connect.
• At any given time, one ZooKeeper client is connected to one ZooKeeper server.
• Each ZooKeeper server can handle a large number of client connections at the same time.
• Each client periodically sends pings to the ZooKeeper server it is connected to, to let it know that it is alive and connected.
• The ZooKeeper server in question responds with an acknowledgment of the ping, indicating that the server is alive as well.
• When the client doesn't receive an acknowledgment from the server within the specified time, the client connects to another server in the ensemble.
53
ZooKeeper

54
Znode
• ZooKeeper has a file system-like data model composed of znodes.
• Znodes are ZooKeeper data nodes.
• Think of znodes as files in a traditional UNIX-like system, except that they can have child nodes.
55
ZooKeeper Client Read

• When a client requests to read the contents of a particular znode, the read takes place at the server that the client is connected to.
• Since only one server from the ensemble is involved, reads are quick and scalable.

56
ZooKeeper Client Write
• All writes in ZooKeeper go through the leader node, so it is guaranteed that all writes will be sequential.
• If a client wants to store data in the ZooKeeper ensemble, it sends the znode path and the data to the server.
• However, for writes to complete successfully, a strict majority of the nodes of the ZooKeeper ensemble must be available.
• When a client issues a write request, the connected server passes the request on to the leader. The leader then issues the same write request to all the nodes of the ensemble.
• If a strict majority of the nodes (also known as a quorum) respond successfully to this write request, the write request is considered to have succeeded.
• A successful return code is then returned to the client who initiated the write request.
57
Znode Write – the sync command
• This znode write approach can cause followers to fall behind the leader for short periods.
• ZooKeeper solves this potential problem by providing a sync command. Clients that cannot tolerate this temporary lack of synchronization within the ZooKeeper cluster may decide to issue a sync command before reading znodes.

Note - ZooKeeper is especially fast for "read-dominant" workloads.

58
ZooKeeper and HBase reliability
• How many ZooKeeper servers will you need? Five is the minimum recommended for production use.

• When you plan your ZooKeeper ensemble, follow this simple formula: N = 2F + 1, where F is the number of failures you can accept in your ZooKeeper cluster and N is the total number of ZooKeeper servers you must deploy. For example, to tolerate F = 2 failed servers you need N = 2(2) + 1 = 5 servers, which is where the five-server recommendation comes from.

• ZooKeeper provides superior reliability through redundant services. A service is replicated over a set of machines, and each maintains an in-memory image of the data tree and transaction logs.
59
HBase vs RDBMS

60
Knowing when HBase makes sense for you
• A big data requirement: terabytes to petabytes of data; otherwise you'll have a lot of idle servers in your racks.
• Sufficient hardware resources: five servers is a good starting point.
• Other requirements, such as transaction support, rich data types, indexes, and query language support, though these factors are not as black and white as the preceding two.
• Rich data types, indexes, and query language support can be added via other technologies, such as Hive or commercial products.
• Another consideration is consistency.

61
ACID Properties in HBase
ACID is an acronym for Atomicity, Consistency, Isolation, and Durability.

HBase supports ACID in limited ways: writes to the same row provide all ACID guarantees, but writes across multiple rows or tables do not.

62
Transitioning from an RDBMS
model to HBase
Three key principles to follow:
• De-normalization,
• Duplication
• Intelligent keys (DDI).

63
When to denormalize a database

• The essence of normalization is to put each piece of data in its appropriate place; this ensures data integrity and facilitates updating.
• However, retrieving data from a normalized database can be slower, as queries need to address many different tables where different pieces of data are stored.
• Updating, on the contrary, gets faster, as each piece of data is stored in a single place.

64
When to denormalize a database

• The majority of modern applications need to be able to retrieve data in the shortest time possible. That's when you can consider denormalizing a relational database.
• Database denormalization means you deliberately put the same data in several places, thus increasing redundancy.
• The main purpose of denormalization is to significantly speed up data retrieval.

65
SQL JOIN - combines rows from two or more tables, based on a related column between them.

Orders
OrderID  CustomerID  OrderDate
10308    2           1996-09-18
10309    37          1996-09-19
10310    77          1996-09-20

Customers
CustomerID  CustomerName                        ContactName     Country
1           Alfreds Futterkiste                 Maria Anders    Germany
2           Ana Trujillo Emparedados y helados  Ana Trujillo    Mexico
3           Antonio Moreno Taquería             Antonio Moreno  Mexico

Result of joining Orders with Customers on CustomerID:
OrderID  CustomerName                        OrderDate
10308    Ana Trujillo Emparedados y helados  9/18/1996
10365    Antonio Moreno Taquería             11/27/1996
10383    Around the Horn                     12/16/1996
10355    Around the Horn                     11/15/1996
10278    Berglunds snabbköp                  8/12/1996
66
Query - to access messages by category.

• With a normalized schema, this needs a join operation between two tables.
• Keeping the name of the category right in the User_messages table can save time and reduce the number of necessary joins.
67
Example

68
Example

69
Duplication
• As you de-normalize your database schema, you will likely end up duplicating data, because doing so can help you avoid costly read operations across multiple tables.

• Don't be concerned about the extra storage; you can use the automatic scalability of HBase to your advantage.

• Be aware, though, that extra work will be required by your client application to duplicate the data, and remember that natively HBase provides only row-level atomic operations, not cross-row or cross-table ones.

70
Intelligent Keys
• Because the data stored in HBase is ordered by row key, and the row key is the only native index provided by the system, careful, intelligent design of the row key can make a huge difference.

• For example, your row key could be a combination of a service order number and the ID of the customer that placed the service order.

• This row key design would allow you to look up data related to the service order, or data related to the customer, using the same row key in the same table.

• This technique will be faster for some queries and avoids costly table joins. A sketch follows below.
71
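Below is a minimal sketch of building such a composite row key with the client's Bytes utility. The layout (order number, separator, customer ID) is one possible design invented for illustration, not a prescribed format.

import org.apache.hadoop.hbase.util.Bytes;

public class CompositeKey {
    // Builds a row key of the form <orderNo>|<customerId>.
    // Fixed-width or salted layouts may be preferable in practice.
    static byte[] rowKey(String orderNo, String customerId) {
        return Bytes.add(Bytes.toBytes(orderNo + "|"),
                         Bytes.toBytes(customerId));
    }

    public static void main(String[] args) {
        byte[] key = rowKey("SO-10308", "CUST-0002");
        // All rows for order SO-10308 sort together, so a prefix scan on
        // "SO-10308|" retrieves them without a join.
        System.out.println(Bytes.toString(key)); // SO-10308|CUST-0002
    }
}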
De-normalization
The relational database model depends on:
a) a normalized database schema
b) joins between tables to respond to SQL operations.

Database normalization is a technique which guards against data loss, redundancy, and other anomalies as data is updated and retrieved.
Essentially, normalization involves dividing larger tables into smaller tables and defining relationships between them.
72
De-normalization
De-normalization is the opposite of normalization, where smaller, more specific tables are joined into larger, more general tables.

This is a common pattern when transitioning to HBase, because joins are not provided across tables, and joins can be slow since they involve costly disk operations.

Guarding against the update and retrieval anomalies is now the job of your HBase client application, since the protections afforded to you by normalization are null and void.

73
END

74
How the Components Work Together

• ZooKeeper is used to coordinate shared state information for members of distributed systems.
• Region servers and the active HMaster connect to ZooKeeper with a session.
• ZooKeeper maintains ephemeral nodes for active sessions via heartbeats.
• Each region server creates an ephemeral node. The HMaster monitors these nodes to discover available region servers.

75
How the Components Work Together

• If a region server or the active HMaster fails to send a heartbeat, the session expires and the corresponding ephemeral node is deleted.
• Listeners for updates will be notified of the deleted nodes.
• An inactive HMaster listens for active HMaster failure; if the active HMaster fails, an inactive HMaster becomes active.

76
How the Components Work Together

77
How does ZooKeeper work?
• The data within ZooKeeper is replicated across a collection of nodes, and this is how it achieves its high availability and consistency.
• In case a node fails, ZooKeeper can perform instant failover migration; e.g., if the leader node fails, a new one is selected in real time by polling within the ensemble.
• A client connecting to the server can query a different node if the first one fails to respond.

78
ZooKeeper: The Coordinator
• HBase uses ZooKeeper as a distributed coordination service to maintain server state in the cluster for a distributed application.
• ZooKeeper maintains which servers are alive and available, and provides server failure notification.
• ZooKeeper uses consensus to guarantee common shared state. Note that an odd number of machines, typically three or five, should be used for consensus.
• At its best, ZooKeeper allows developers to focus on core application logic without worrying about the distributed nature of the application.

79
ZooKeeper and HBase reliability
The idea here is that ZooKeeper stores znodes in memory, and these memory-based znodes provide fast client access for coordination, status, and other vital functions required by distributed applications like HBase.

80
ZooKeeper – Read & Write
Read
• Any ZooKeeper server can handle reads from a client, including the leader.
Write
• Only the leader issues atomic znode writes: writes that either completely succeed or completely fail.
• When a znode write request arrives at the leader node, the leader broadcasts the write request to the follower nodes and then waits for a majority of followers to acknowledge that the znode write is complete.
• After the acknowledgement, the leader issues the znode write itself and then reports the successful completion status to the client. A quorum sketch follows below.
81
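A toy Java sketch of the majority-quorum rule the leader applies. This is purely conceptual; ZooKeeper's actual replication protocol (Zab) involves far more than this check.

public class QuorumDemo {
    // A write commits once a strict majority of the ensemble has acked it.
    static boolean quorumReached(int acks, int ensembleSize) {
        return acks > ensembleSize / 2;
    }

    public static void main(String[] args) {
        int ensembleSize = 5; // N = 2F + 1, tolerating F = 2 failures
        System.out.println(quorumReached(2, ensembleSize)); // false
        System.out.println(quorumReached(3, ensembleSize)); // true: 3 of 5
    }
}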
ZooKeeper – Write

82
Extra Slides

83
Facebook Messenger Case Study
• Facebook Messenger combines messages, email, chat, and SMS into a real-time conversation.
• The chat service supports over 300 million users who send over 120 billion messages per month.
• Facebook was trying to build a scalable and robust infrastructure to handle this set of services.

84
Challenges faced by Facebook Messenger

85
The major problems faced by Facebook:

• Storing the large sets of continuously growing data from various Facebook services.
• A database that can sustain high processing loads on that data.
• High performance, needed to serve millions of requests.
• Maintaining consistency in storage and performance.

86
The Solution
• Facebook spent a few weeks testing different frameworks, evaluating clusters of MySQL, Apache Cassandra, Apache HBase, and other systems. They ultimately selected HBase.

• HBase comes with very good scalability and performance for this workload, with a simpler consistency model than Cassandra.

• The Facebook messaging platform shifted from Apache Cassandra to HBase in November 2010.

• HDFS is the underlying file system used by HBase.


87
HBase History

88
HBase History
• Apache HBase is modelled after Google's BigTable, which is used to collect data and serve requests for various Google services like Maps, Finance, Earth, etc.

• Apache HBase began as a project by the company Powerset for natural-language search, which was handling massive and sparse data sets.

• Apache HBase was first released in February 2007. Later, in January 2008, HBase became a subproject of Apache Hadoop.

• In 2010, HBase became an Apache top-level project.


89
Adding a LinkedIn link to the sign-up page

90
With RDBMS

• Any changes to the schema must be sent to the DBA team
• They need time to make the update
• The database content must be migrated to the new schema
• This takes time and also requires server downtime

91
With HBase
• Specify column families when the table is constructed

• Fields within a column family can be altered on the fly
• Thus providing complete agility
• Hive cannot support delete & update operations, but the DB team needed them
• HBase can support all of these
• HBase is the choice for applications which require fast, random access to large amounts of data.

92
Scenario – MCE Students Database
• A database engineer is invited
• Data – USN, Name, Address, Mobile, CGPA, Placed-Company, Date-of-placement
• 30,000 students' data
• Each book holds 10,000 students' data
• So 3 books are needed

93
Exercise
• Data analyst – finding a strategy to improve placement
• Group 3 - Placed-Company, Date-of-placement

94
95
Region Servers
• Region servers are the software processes (often called daemons) you activate to store and retrieve data in HBase.
• Each region server is deployed on its own dedicated compute node.
• You create a table and then begin storing and retrieving your data. However, at some point (perhaps quite quickly in big data use cases) the table grows beyond a configurable limit.
• At this point, the HBase system automatically splits the table and distributes the load to another region server. This is known as auto-sharding.
• This is a huge benefit compared to most database management systems, which require manual intervention to scale the overall system beyond a single server.
• A region server can serve about 1,000 regions.

96
Regions
• In HBase, a table is both spread across a number of region servers and made up of individual regions.
• As tables are split, the splits become regions. Regions store a range of key-value pairs.
• Each region server manages a configurable number of regions.

97
HBase Regions

98
HBase Regions
• Regions separate data into column families and store the data in the HDFS using HFile objects.
• When clients put key-value pairs into the system, the keys are processed so that data is stored based on the column family the pair belongs to.
• Each column family store object has a read cache called the BlockCache and a write cache called the MemStore.
• The BlockCache helps with random read performance. Data is read in blocks from the HDFS and stored in the BlockCache. Subsequent reads of that data (or data stored in close proximity) will be served from RAM instead of disk, improving overall performance.
• The Write-Ahead Log (WAL, for short) ensures that your HBase writes are reliable. There is one WAL per RegionServer.

99
HBase Regions
• When you write or modify data in HBase, the data is first persisted to the WAL, which is stored in the HDFS, and then the data is written to the MemStore cache.
• At configurable intervals, key-value pairs stored in the MemStore are written to HFiles in the HDFS, and afterwards the WAL entries are erased.
• If a failure occurs after the initial WAL write but before the final MemStore write to disk, the WAL can be replayed to avoid any data loss.
• HBase is designed to flush the column family data stored in the MemStore to one HFile per flush.
• Then, at configurable intervals, HFiles are combined into larger HFiles.
100
Compactions Major and Minor
• Compaction, the process by which HBase cleans up after itself, comes in
two flavors: major and minor.
• Minor compactions combine a configurable number of smaller HFiles into
one larger HFile. You can tune the number of HFiles to compact and the
frequency of a minor compaction.
• A major compaction seeks to combine all HFiles into one large HFile.
• In addition, a major compaction does the cleanup work after a user
deletes a record.

101
ZooKeeper
• ZooKeeper is a coordination service for distributed applications with the
motto "ZooKeeper: Because Coordinating Distributed Systems is a Zoo."

• The ZooKeeper framework was originally built at Yahoo. It runs on JVM


(Java virtual machine).

• A few of the distributed applications that use Zookeeper are Apache


Hadoop, Apache Kafka, and Apache Storm.

102
Master Server
• Monitor the Region Servers in the HBase cluster
• Handle metadata operations
• Assign regions
• Manage Region Server failover
• Oversee load balancing of regions across all available Region
Servers
• Manage (and clean) catalog tables
• Clear the WAL
• Provide a coprocessor framework for observing master operations
103
ZooKeeper and HBase reliability

• ZooKeeper is a distributed cluster of servers that collectively provides reliable coordination and synchronization services for clustered applications.
• When you're building and debugging distributed applications, "it's a zoo out there," so you should put ZooKeeper on your team.
• ZooKeeper clusters typically run on low-cost commodity x86 servers, with one ZooKeeper server elected by the ensemble as the leader and the rest of the servers acting as followers.
• ZooKeeper ensembles are governed by the principle of a majority quorum.

104
HBase Architecture

105
HBase First Read
• The client gets the region server that hosts the META table from ZooKeeper.
• The client then queries that .META. server to get the region server corresponding to the row key it wants to access. The client caches this information along with the META table location.
• It then gets the row from the corresponding region server.

106
HBase Write Steps - 1
• Edits are appended to the end of WAL file that is stored on disk.
• WAL is used to recover not-yet-persisted data in case a server crashes.

107
HBase Write Steps - 2
• Once the data is written to the WAL, it is placed in the MemStore.
• Then, the put request acknowledgement returns to the client.

108
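These two write steps happen inside the region server; from the client's side the whole sequence is a single Put, as in the sketch below (HBase 2.x client, with a hypothetical table and names). The acknowledgement returns only after the server has appended the edit to the WAL and placed it in the MemStore.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customers"))) {

            Put put = new Put(Bytes.toBytes("00003"))
                    .addColumn(Bytes.toBytes("CustomerName"),
                               Bytes.toBytes("FN"),
                               Bytes.toBytes("Alice"));

            // Blocks until the region server has logged the edit to the WAL
            // and written it into the MemStore.
            table.put(put);
        }
    }
}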
