GETTING STARTED WITH APACHE HADOOP
TABLE OF CONTENTS

Preface
Introduction
Installing Apache Hadoop
    Single-Node Installation
    Multi-Node Installation
Hadoop Distributed File System (HDFS)
    HDFS Architecture
    Interacting with HDFS
MapReduce
    MapReduce Workflow
    Writing a MapReduce Job
Apache Hadoop Ecosystem
    Apache Hive
    Apache Pig
    Apache HBase
    Apache Spark
    Apache Sqoop
Additional Resources


PREFACE

The Getting Started with Apache Hadoop cheatsheet serves as your quick reference guide to understanding the fundamental concepts, components, and essential commands of Hadoop. Whether you are a data engineer, a data scientist, or simply curious about big data technologies, this cheatsheet will provide you with a solid foundation to embark on your Hadoop journey.

INTRODUCTION

Apache Hadoop is a powerful ecosystem for handling big data. It allows you to store, process, and analyze vast amounts of data across distributed clusters of computers. Hadoop is based on the MapReduce programming model, which enables parallel processing of data. This section will cover the key components of Hadoop, its architecture, and how it works.

INSTALLING APACHE HADOOP

In this section, we'll guide you through the installation process for Apache Hadoop. We'll cover both single-node and multi-node cluster setups to suit your development and testing needs.

SINGLE-NODE INSTALLATION

To get started quickly, you can set up Hadoop in a single-node configuration on your local machine. Follow these steps:

Download Hadoop: Visit the Apache Hadoop website (https://hadoop.apache.org/) and download the latest stable release.

Extract the tarball: After downloading, extract the tarball to your preferred installation directory.

Set up environment variables: Configure HADOOP_HOME and add the Hadoop binary path to the PATH variable.

Configure Hadoop: Adjust the configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml) to suit your setup.

Example code for setting the environment variables (Linux):

# Set HADOOP_HOME
export HADOOP_HOME=/path/to/hadoop

# Add Hadoop binary path to PATH
export PATH=$PATH:$HADOOP_HOME/bin

MULTI-NODE INSTALLATION

For production or more realistic testing scenarios, you'll need to set up a multi-node Hadoop cluster. Here's a high-level overview of the steps involved:

Prepare the machines: Set up multiple machines (physical or virtual) with the same version of Hadoop installed on each of them.

Configure SSH: Ensure passwordless SSH login between all the machines in the cluster.

Adjust the configuration: Modify the Hadoop configuration files to reflect the cluster setup, including specifying the NameNode and DataNode details.

HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

HDFS is the distributed file system used by Hadoop to store large datasets across multiple nodes. It provides fault tolerance and high availability by replicating data blocks across different nodes in the cluster. This section will cover the basics of HDFS and how to interact with it.

HDFS ARCHITECTURE

HDFS follows a master-slave architecture with two main components: the NameNode and the DataNodes.

NameNode

The NameNode is a critical component in the Hadoop Distributed File System (HDFS) architecture. It serves as the master node and plays a crucial role in managing the file system namespace and metadata. Let's explore the significance of the NameNode and its responsibilities in more detail.

NameNode Responsibilities


Metadata Management: The NameNode maintains crucial metadata about HDFS, including information about files and directories. It keeps track of the data block locations, the replication factor, and other essential details required for efficient data storage and retrieval.

Namespace Management: HDFS follows a hierarchical directory structure, similar to a traditional file system. The NameNode manages this namespace, ensuring that each file and directory is correctly represented and organized.

Data Block Mapping: When a file is stored in HDFS, it is divided into fixed-size data blocks. The NameNode maintains the mapping of these data blocks to the corresponding DataNodes where the actual data is stored.

Heartbeat and Health Monitoring: The NameNode receives periodic heartbeat signals from DataNodes, which indicate their health and availability. If a DataNode fails to send a heartbeat, the NameNode marks it as unavailable and replicates its data to other healthy nodes to maintain data redundancy and fault tolerance.

Replication Management: The NameNode ensures that the configured replication factor for each file is maintained across the cluster. It monitors the number of replicas for each data block and triggers replication of blocks if necessary.

High Availability and Secondary NameNode

As the NameNode is a critical component, its failure could result in the unavailability of the entire HDFS. To address this concern, Hadoop introduced the concept of High Availability (HA) with the Hadoop 2.x versions.

In an HA setup, there are two NameNodes: the Active NameNode and the Standby NameNode. The Active NameNode handles all client requests and metadata operations, while the Standby NameNode remains in sync with the Active NameNode. If the Active NameNode fails, the Standby NameNode takes over as the new Active NameNode, ensuring seamless HDFS availability.

Additionally, the "Secondary NameNode" is a misnomer and should not be confused with the Standby NameNode. The Secondary NameNode is not a failover mechanism; it assists the primary NameNode with periodic checkpoints to optimize its performance. The Secondary NameNode periodically merges the edit logs with the fsimage (file system image) and creates a new, updated fsimage, reducing the startup time of the primary NameNode.

NameNode Federation

Starting from Hadoop 2.x, NameNode Federation allows multiple independent HDFS namespaces to be hosted on a single Hadoop cluster. Each namespace is served by a separate Active NameNode, providing better isolation and resource utilization in a multi-tenant environment.

NameNode Hardware Considerations

The NameNode's role in HDFS is resource-intensive, as it manages metadata and handles a large number of small files. When setting up a Hadoop cluster, it's essential to consider the following factors for the NameNode hardware:

Memory: Sufficient RAM to hold the metadata and file system namespace. More memory enables faster metadata operations.

Storage: Fast and reliable storage for maintaining the file system metadata.

CPU: A CPU capable of handling the processing load of metadata management and client request handling.

Networking: A good network connection for communication with DataNodes and prompt responses to client requests.

By optimizing the NameNode hardware, you can ensure smooth HDFS operations and reliable data management in your Hadoop cluster.


DataNodes

DataNodes are integral components in the Hadoop Distributed File System (HDFS) architecture. They serve as the worker nodes responsible for storing and managing the actual data blocks that make up the files in HDFS. Let's explore the role of DataNodes and their responsibilities in more detail.

DataNode Responsibilities

Data Storage: DataNodes are responsible for storing the actual data blocks of files. When a file is uploaded to HDFS, it is split into fixed-size blocks, and each block is stored on one or more DataNodes. The DataNodes efficiently manage the data blocks and ensure their availability.

Data Block Replication: HDFS replicates data blocks to provide fault tolerance and data redundancy. The DataNodes are responsible for creating and maintaining replicas of data blocks as directed by the NameNode. By default, each data block is replicated three times across different DataNodes in the cluster.

Heartbeat and Block Reports: DataNodes regularly send heartbeat signals to the NameNode to indicate their health and availability. Additionally, they provide block reports, informing the NameNode about the list of blocks they are storing. The NameNode uses this information to track the availability of data blocks and manage their replication.

Data Block Operations: DataNodes perform read and write operations on the data blocks they store. When a client wants to read data from a file, the NameNode provides the locations of the relevant data blocks, and the client can directly retrieve the data from the corresponding DataNodes. Similarly, when a client wants to write data to a file, the data is written to multiple DataNodes based on the replication factor.
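As an illustration of this read path, the following minimal Java sketch asks the NameNode for the block locations of a file through the Hadoop FileSystem API. It is only a sketch: the file path is a placeholder, and the configuration is assumed to be picked up from core-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Ask the NameNode which DataNodes hold each block of the file
        Path file = new Path("/hdfs/path/to/file");
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " hosted on " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}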
DataNode Health and Decommissioning

DataNodes are crucial for the availability and reliability of HDFS. To ensure the overall health of the Hadoop cluster, the following factors related to DataNodes are critical.

Heartbeat and Health Monitoring: The NameNode expects periodic heartbeat signals from DataNodes. If a DataNode fails to send a heartbeat within a specific time frame, the NameNode marks it as unavailable and starts the process of replicating its data blocks to other healthy nodes. This mechanism helps in quickly detecting and recovering from DataNode failures.

Decommissioning: When a DataNode needs to be taken out of service for maintenance or other reasons, it goes through a decommissioning process. During decommissioning, the DataNode informs the NameNode about its intent to leave the cluster gracefully. The NameNode then starts replicating its data blocks to other nodes to maintain the desired replication factor. Once the replication is complete, the DataNode can be safely removed from the cluster.

DataNode Hardware Considerations

DataNodes are responsible for handling a large amount of data and performing read and write operations on data blocks. When setting up a Hadoop cluster, consider the following factors for the DataNode hardware:

Storage: Significant storage capacity for storing data blocks. Use reliable and high-capacity storage drives to accommodate large datasets.

CPU: Sufficient processing power to handle data read and write operations efficiently.

Memory: Adequate RAM for smooth data block operations and better caching of frequently accessed data.

Networking: Good network connectivity for efficient data transfer between DataNodes and communication with the NameNode.

By optimizing the hardware for DataNodes, you can ensure smooth data operations, fault tolerance, and high availability within your Hadoop cluster.

INTERACTING WITH HDFS

You can interact with HDFS using either the command-line interface (CLI) or the Hadoop Java API. Here are some common HDFS operations:

Uploading files to HDFS:

hadoop fs -put /local/path/to/file /hdfs/destination/path

Downloading files from HDFS:

hadoop fs -get /hdfs/path/to/file /local/destination/path

Listing files in a directory:

hadoop fs -ls /hdfs/path/to/directory

Creating a new directory in HDFS:

hadoop fs -mkdir /hdfs/new/directory
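The same operations are also available through the Hadoop Java API mentioned above. The following is a minimal sketch of a few equivalents; the paths are placeholders and the configuration is assumed to come from the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Upload a local file (equivalent to: hadoop fs -put)
        fs.copyFromLocalFile(new Path("/local/path/to/file"),
                new Path("/hdfs/destination/path"));

        // Create a new directory (equivalent to: hadoop fs -mkdir)
        fs.mkdirs(new Path("/hdfs/new/directory"));

        // List files in a directory (equivalent to: hadoop fs -ls)
        for (FileStatus status : fs.listStatus(new Path("/hdfs/path/to/directory"))) {
            System.out.println(status.getPath());
        }

        fs.close();
    }
}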


MAPREDUCE

MapReduce is the core programming model of Hadoop, designed to process and analyze vast datasets in parallel across the Hadoop cluster. It breaks down the processing into two phases: the Map phase and the Reduce phase. Let's dive into the details of MapReduce.

MAPREDUCE WORKFLOW

The MapReduce workflow consists of the following steps: Input, Map, Shuffle and Sort, and Reduce.

Input: The input data is divided into fixed-size splits, and each split is assigned to a mapper for processing.

Map: The mapper processes the input splits and produces key-value pairs as intermediate outputs.

Shuffle and Sort: The intermediate key-value pairs are shuffled and sorted based on their keys, grouping them for the reduce phase.

Reduce: The reducer processes the sorted intermediate data and produces the final output.

WRITING A MAPREDUCE JOB

To write a MapReduce job, you'll need to create two main classes: a Mapper and a Reducer. The following is a simple example of counting word occurrences in a text file.

Mapper class:

import java.io.IOException;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] words = value.toString().split("\\s+");
        for (String w : words) {
            word.set(w);
            context.write(word, one);
        }
    }
}

Reducer class:

import java.io.IOException;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Main class:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

APACHE HADOOP ECOSYSTEM

Apache Hadoop has a rich ecosystem of related projects that extend its capabilities. In this section, we'll explore some of the most popular components of the Hadoop ecosystem.

APACHE HIVE

Using Apache Hive involves several steps, from creating tables to querying and analyzing data. Let's walk through a basic workflow for using Hive.

Launching Hive and Creating Tables

Start the Hive CLI (Command Line Interface) or use HiveServer2 for a JDBC/ODBC connection.
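For the JDBC route, a connection can be opened from Java. The following is a minimal sketch, assuming HiveServer2 is running on its default port 10000 and the Hive JDBC driver is on the classpath; the host, credentials, and query are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Older driver versions may require: Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hiveserver2-host:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hiveuser", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW DATABASES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}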
Create a database (if it doesn't exist) to organize your tables:

CREATE DATABASE mydatabase;

Switch to the newly created database:

USE mydatabase;

Define and create a table in Hive, specifying the schema and the storage format. For example, let's create a table to store employee information:

CREATE TABLE employees (
  emp_id INT,
  emp_name STRING,
  emp_salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

Loading Data into Hive Tables

Upload data files to HDFS or make sure the data is available in a compatible storage format (e.g., CSV, JSON) accessible by Hive.

Load the data into the Hive table using the LOAD DATA command. For example, if the data is in a CSV file located in HDFS:

LOAD DATA INPATH '/path/to/employees.csv' INTO TABLE employees;

Querying Data with Hive

Now that the data is loaded into the Hive table, you can perform SQL-like queries on it using the Hive Query Language (HQL). Here are some example queries:

Retrieve all employee records:

SELECT * FROM employees;

Calculate the average salary of employees:

SELECT AVG(emp_salary) AS avg_salary FROM employees;

Filter employees earning more than $50,000:

SELECT * FROM employees WHERE emp_salary > 50000;

Creating Views in Hive

Hive allows you to create views, which are virtual tables representing the results of queries. Views can simplify complex queries and provide a more user-friendly interface. Here's how you can create a view:

CREATE VIEW high_salary_employees AS
SELECT * FROM employees WHERE emp_salary > 75000;

Using User-Defined Functions (UDFs)

Hive allows you to create custom User-Defined Functions (UDFs) in Java, Python, or other supported languages to perform complex computations or data transformations. After creating a UDF, you can use it in your HQL queries. For example, let's create a simple UDF to convert employee salaries from USD to EUR:

package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class USDtoEUR extends UDF {
    public Text evaluate(double usd) {
        double eur = usd * 0.85; // Conversion rate (as an example)
        return new Text(String.valueOf(eur));
    }
}

Compile the UDF, add the JAR to the Hive session, and register the function so it can be referenced in queries:

ADD JAR /path/to/usd_to_eur_udf.jar;
CREATE TEMPORARY FUNCTION USDtoEUR AS 'com.example.hive.udf.USDtoEUR';

Then, use the UDF in a query:

SELECT emp_id, emp_name, emp_salary, USDtoEUR(emp_salary) AS emp_salary_eur FROM employees;

Storing Query Results

You can store the results of Hive queries into new tables or external files. For example, let's create a new table to store high-earning employees:

CREATE TABLE high_earning_employees AS
SELECT * FROM employees WHERE emp_salary > 75000;

Exiting Hive

Once you have completed your Hive operations, you can exit the Hive CLI or close your JDBC/ODBC connection.

This is just a basic overview of using Hive. Hive is a powerful tool with many advanced features, optimization techniques, and integration options with other components of the Hadoop ecosystem. As you explore and gain more experience with Hive, you'll discover its full potential for big data analysis and processing tasks.

APACHE PIG

Apache Pig is a high-level data flow language and execution framework built on top of Apache Hadoop. It provides a simple and expressive scripting language called Pig Latin for data manipulation and analysis. Pig abstracts the complexity of writing low-level Java MapReduce code and enables users to process large datasets with ease. Pig is particularly useful for users who are not familiar with Java or MapReduce but still need to perform data processing tasks on Hadoop.

Pig Latin

Pig Latin is the scripting language used in Apache Pig. It consists of a series of data flow operations, where each operation takes input data, performs a transformation, and generates output data. Pig Latin scripts are translated into a series of MapReduce jobs by the Pig execution engine.

Pig Latin scripts typically follow this structure:

-- Load data from a data source (e.g., HDFS)
data = LOAD '/path/to/data' USING PigStorage(',') AS (col1:datatype, col2:datatype, ...);

-- Data transformation and processing
transformed_data = FOREACH data GENERATE col1, col2, ...;

-- Filtering and grouping
filtered_data = FILTER transformed_data BY condition;
grouped_data = GROUP filtered_data BY group_column;

-- Aggregation and calculations
aggregated_data = FOREACH grouped_data GENERATE group_column, SUM(filtered_data.col1) AS total;

-- Storing the results
STORE aggregated_data INTO '/path/to/output' USING PigStorage(',');

Pig Execution Modes

Pig supports two execution modes.

Local Mode: In Local Mode, Pig runs on a single machine and uses the local file system for input and output. It is suitable for testing and debugging small datasets without the need for a Hadoop cluster.

MapReduce Mode: In MapReduce Mode, Pig runs on a Hadoop cluster and generates MapReduce jobs for data processing. It leverages the full power of Hadoop's distributed computing capabilities to process large datasets.

Pig Features

Abstraction: Pig abstracts the complexities of MapReduce code, allowing users to focus on data manipulation and analysis.

Extensibility: Pig supports user-defined functions (UDFs) in Java, Python, or other languages, enabling custom data transformations and calculations.

Optimization: Pig optimizes data processing through logical and physical optimizations, reducing data movement and improving performance.

Schema Flexibility: Pig follows a schema-on-read approach, allowing data to be stored in a flexible and schema-less manner, accommodating evolving data structures.

Integration with Hadoop Ecosystem: Pig integrates seamlessly with various Hadoop ecosystem components, including HDFS, Hive, HBase, etc., enhancing data processing capabilities.

Using Pig

To use Apache Pig, follow these general steps:

Install Apache Pig on your Hadoop cluster or a standalone machine.

Write Pig Latin scripts to load, transform, and process your data. Save the scripts in .pig files.

Run Pig in either Local Mode or MapReduce Mode, depending on your data size and requirements.

Here's an example of a simple Pig Latin script that loads data, filters records, and stores the results:

-- Load data from HDFS
data = LOAD '/path/to/input' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);

-- Filter records where age is greater than 25
filtered_data = FILTER data BY age > 25;

-- Store the filtered results to HDFS
STORE filtered_data INTO '/path/to/output' USING PigStorage(',');

As you become more familiar with Pig, you can explore its advanced features, including UDFs, joins, groupings, and more complex data processing operations. Apache Pig is a valuable tool in the Hadoop ecosystem, enabling users to perform data processing tasks efficiently without the need for extensive programming knowledge.

APACHE HBASE

Apache HBase is a distributed, scalable NoSQL database built on top of Apache Hadoop. It provides real-time read and write access to large amounts of structured data. HBase is designed to handle massive amounts of data and is well-suited for use cases that require random access to data, such as real-time analytics, online transaction processing (OLTP), and serving as a data store for web applications.

HBase Features

Column-Family Data Model: Data is organized into column families within a table. Each column family can have multiple columns. New columns can be added dynamically without affecting existing rows.

Schema Flexibility: HBase is schema-less, allowing each row in a table to have different columns. This flexibility accommodates data with varying attributes without predefined schemas.

Horizontal Scalability: HBase can scale horizontally by adding more nodes to the cluster. It automatically distributes data across regions and nodes, ensuring even data distribution and load balancing.

High Availability: HBase supports automatic failover and recovery, ensuring data availability even if some nodes experience failures.

Real-Time Read/Write: HBase provides fast and low-latency read and write access to data, making it suitable for real-time applications.

Data Compression: HBase supports data compression techniques like Snappy and LZO, reducing storage requirements and improving query performance.

Integration with Hadoop Ecosystem: HBase seamlessly integrates with various Hadoop ecosystem components, such as HDFS, MapReduce, and Apache Hive, enhancing data processing capabilities.

HBase Architecture

HBase follows a master-slave architecture with the following key components:

HBase Master: Responsible for administrative tasks, including region assignment, load balancing, and failover management. It doesn't directly serve data to clients.

HBase RegionServer: Stores and manages data. Each RegionServer manages multiple regions, and each region corresponds to a portion of an HBase table.

ZooKeeper: HBase relies on Apache ZooKeeper for coordination and distributed synchronization among the HBase Master and RegionServers.

HBase Client: Interacts with the HBase cluster to read and write data. Clients use the HBase API or the HBase shell to perform operations on HBase tables.

Using HBase

To use Apache HBase, follow these general steps:

Install Apache HBase on your Hadoop cluster or a standalone machine.

Start the HBase Master and RegionServers.

Create HBase tables and specify the column families.

Use the HBase API or HBase shell to perform read and write operations on HBase tables.

Here's an example of using the HBase shell to create a table and insert data:


$ hbase shell
hbase(main):001:0> create 'my_table', 'cf1', 'cf2'
hbase(main):002:0> put 'my_table', 'row1', 'cf1:col1', 'value1'
hbase(main):003:0> put 'my_table', 'row1', 'cf2:col2', 'value2'
hbase(main):004:0> scan 'my_table'

This example creates a table named my_table with two column families (cf1 and cf2), inserts data into row row1, and scans the table to retrieve the inserted data.
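The same operations can also be performed programmatically through the HBase client API. The following is a minimal Java sketch, assuming a recent HBase client, the my_table table created above, and hbase-site.xml available on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for cluster/ZooKeeper details
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("my_table"))) {

            // Equivalent to: put 'my_table', 'row1', 'cf1:col1', 'value1'
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("col1"), Bytes.toBytes("value1"));
            table.put(put);

            // Read the value back
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("col1"));
            System.out.println(Bytes.toString(value));
        }
    }
}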

Apache HBase is an excellent choice for storing and accessing massive amounts of structured data with low-latency requirements. Its integration with the Hadoop ecosystem makes it a powerful tool for real-time data processing and analytics.

APACHE SPARK

Apache Spark is an open-source distributed data processing framework designed for speed, ease of use, and sophisticated analytics. It provides an in-memory computing engine that enables fast data processing and iterative algorithms, making it well-suited for big data analytics and machine learning applications. Spark supports various data sources, including the Hadoop Distributed File System (HDFS), Apache HBase, Apache Hive, and more.

Spark Features

In-Memory Computing: Spark keeps intermediate data in memory, reducing the need to read and write to disk and significantly speeding up data processing.

Resilient Distributed Dataset (RDD): Spark's fundamental data structure, the RDD, allows for distributed data processing and fault tolerance. RDDs are immutable and can be regenerated in case of failures.

Data Transformation and Actions: Spark provides a wide range of transformations (e.g., map, filter, reduce) and actions (e.g., count, collect, save) for processing and analyzing data.

Spark SQL: Spark SQL enables SQL-like querying on structured data and seamless integration with data sources like Hive and JDBC (see the sketch after this feature list).

MLlib: Spark's machine learning library, MLlib, offers a rich set of algorithms and utilities for building and evaluating machine learning models.

GraphX: GraphX is Spark's library for graph processing, enabling graph analytics and computations on large-scale graphs.

Spark Streaming: Spark Streaming allows real-time processing of data streams, making Spark suitable for real-time analytics.
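To illustrate the Spark SQL feature, here is a minimal Java sketch, assuming Spark 2.x or later with the spark-sql module on the classpath; the application name, CSV path, and column names are placeholders:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlExample {
    public static void main(String[] args) {
        // Run locally; on a cluster the master is normally supplied by spark-submit
        SparkSession spark = SparkSession.builder()
                .appName("SparkSqlExample")
                .master("local[*]")
                .getOrCreate();

        // Load a CSV file into a DataFrame and register it as a temporary view
        Dataset<Row> employees = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("path/to/employees.csv");
        employees.createOrReplaceTempView("employees");

        // Query the view with SQL
        Dataset<Row> highEarners = spark.sql(
                "SELECT emp_name, emp_salary FROM employees WHERE emp_salary > 50000");
        highEarners.show();

        spark.stop();
    }
}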

Spark Architecture

Spark follows a master-slave architecture with the following key components:

Driver: The Spark Driver program runs on the master node and is responsible for coordinating the Spark application. It splits the application's work into smaller sets of tasks, called stages, and schedules their execution.

Executor: Executors run on the worker nodes and perform the actual data processing tasks. They store the RDD partitions in memory and cache intermediate data for faster processing.

Cluster Manager: The cluster manager allocates resources to the Spark application and manages the allocation of executors across the cluster. Popular cluster managers include Apache Mesos, Hadoop YARN, and Spark's standalone manager.

Using Apache Spark

To use Apache Spark, follow these general steps:


Install Apache Spark on your Hadoop cluster or a standalone machine.

Create a SparkContext, which is the entry point to Spark functionalities.

Load data from various data sources into RDDs or DataFrames (Spark SQL).

Perform transformations and actions on the RDDs or DataFrames to process and analyze the data.

Use Spark MLlib for machine learning tasks if needed.

Save the results or write the data back to external data sources if required.

Here's an example of using Spark in Python to count the occurrences of each word in a text file:

from pyspark import SparkContext

# Create a SparkContext
sc = SparkContext("local", "Word Count")

# Load data from a text file into an RDD
text_file = sc.textFile("path/to/text_file.txt")

# Split the lines into words and count the occurrences of each word
word_counts = text_file.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# Print the word counts
for word, count in word_counts.collect():
    print(f"{word}: {count}")

# Stop the SparkContext
sc.stop()

Apache Spark's performance, ease of use, and broad range of functionalities have made it a popular choice for big data processing, analytics, and machine learning applications. Its ability to leverage in-memory computing and its seamless integration with various data sources and machine learning libraries make it a versatile tool in the big data ecosystem.

APACHE SQOOP

Apache Sqoop is an open-source tool designed for efficiently transferring data between Apache Hadoop and structured data stores, such as relational databases. Sqoop simplifies the process of importing data from relational databases into Hadoop's distributed file system (HDFS) and exporting data from HDFS to relational databases. It supports various databases, including MySQL, Oracle, PostgreSQL, and more.

Sqoop Features

Data Import and Export: Sqoop allows users to import data from relational databases into HDFS and export data from HDFS back to relational databases.

Parallel Data Transfer: Sqoop uses multiple mappers in Hadoop to import and export data in parallel, achieving faster data transfer.

Full and Incremental Data Imports: Sqoop supports both full and incremental data imports. Incremental imports enable transferring only new or updated data since the last import.

Data Compression: Sqoop can compress data during import and decompress it during export, reducing storage requirements and speeding up data transfer.

Schema Inference: Sqoop can automatically infer the database schema during import, reducing the need for manual schema specification.

Integration with Hadoop Ecosystem: Sqoop integrates seamlessly with other Hadoop ecosystem components, such as Hive and HBase, enabling data integration and analysis.

Sqoop Architecture

Sqoop consists of the following key components:

Sqoop Client: The Sqoop Client is the command-line tool used to interact with Sqoop. Users execute Sqoop commands from the command line to import or export data.

Sqoop Server: The Sqoop Server provides REST APIs for the Sqoop Client to communicate with the underlying Hadoop ecosystem. It manages the data transfer tasks and interacts with HDFS and relational databases.

Using Apache Sqoop

To use Apache Sqoop, follow these general steps:

Install Apache Sqoop on your Hadoop cluster or a standalone machine.

Configure the Sqoop Client by specifying the database connection details and other required parameters.

Use the Sqoop Client to import data from the relational database into HDFS or export data from HDFS to the relational database.

Here's an example of using Sqoop to import data from a MySQL database into HDFS:

# Import data from MySQL to HDFS
sqoop import \
  --connect jdbc:mysql://mysql_server:3306/mydatabase \
  --username myuser \
  --password mypassword \
  --table mytable \
  --target-dir /user/hadoop/mydata

This example imports data from the mytable table in the MySQL database into the HDFS directory /user/hadoop/mydata.

Apache Sqoop simplifies the process of transferring data between Hadoop and relational databases, making it a valuable tool for integrating big data with existing data stores and enabling seamless data analysis in Hadoop.

ADDITIONAL RESOURCES

Here are some additional resources to learn more about the topics mentioned:

Apache Hadoop Official Website: The official website of Apache Hadoop, providing extensive documentation, tutorials, and downloads for getting started with Hadoop.

Apache Hive Official Website: The official website of Apache Hive, offering documentation, examples, and downloads, providing all the essential information to get started with Apache Hive.

Apache Pig Official Website: The official website of Apache Pig, offering documentation, examples, and downloads, providing all the essential information to get started with Apache Pig.

Apache HBase Official Website: The official website of Apache HBase, offering documentation, tutorials, and downloads, providing all the essential information to get started with Apache HBase.

Apache Spark Official Website: The official website of Apache Spark, offering documentation, examples, and downloads, providing all the essential information to get started with Apache Spark.

Apache Sqoop Official Website: The official website of Apache Sqoop, offering documentation, examples, and downloads, providing all the essential information to get started with Apache Sqoop.

Additionally, you can find many tutorials, blog posts, and online courses on platforms like Udemy, Coursera, and LinkedIn Learning that offer in-depth knowledge on these Apache projects. Happy learning!

JCG delivers over 1 million pages each month to more than 700K software developers, architects and decision makers. JCG offers something for everyone, including news, tutorials, cheat sheets, research guides, feature articles, source code and more.

CHEATSHEET FEEDBACK WELCOME
support@javacodegeeks.com

SPONSORSHIP OPPORTUNITIES
sales@javacodegeeks.com

Copyright © 2014 Exelixis Media P.C. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
