APACHE HADOOP
Preface
Introduction
Installing Apache Hadoop
  Single-Node Installation
  Multi-Node Installation
Hadoop Distributed File System (HDFS)
  HDFS Architecture
  Interacting with HDFS
MapReduce
  MapReduce Workflow
  Writing a MapReduce Job
Apache Hadoop Ecosystem
  Apache Hive
  Apache Pig
  Apache HBase
  Apache Spark
  Apache Sqoop
Additional Resources
PREFACE

The Getting Started with Apache Hadoop cheatsheet serves as your quick reference guide to understanding the fundamental concepts, components, and essential commands of Hadoop. Whether you are a data engineer, a data scientist, or simply curious about big data technologies, this cheatsheet will provide you with a solid foundation to embark on your Hadoop journey.
INTRODUCTION

Apache Hadoop is a powerful ecosystem for handling big data. It allows you to store, process, and analyze vast amounts of data across distributed clusters of computers. Hadoop is based on the MapReduce programming model, which enables parallel processing of data. This section will cover the key components of Hadoop, its architecture, and how it works.
INSTALLING APACHE HADOOP

In this section, we'll guide you through the installation process for Apache Hadoop. We'll cover both single-node and multi-node cluster setups to suit your development and testing needs.

SINGLE-NODE INSTALLATION

# Set HADOOP_HOME
export HADOOP_HOME=/path/to/hadoop

# Add Hadoop binary path to PATH
export PATH=$PATH:$HADOOP_HOME/bin

MULTI-NODE INSTALLATION

For production or more realistic testing scenarios, you'll need to set up a multi-node Hadoop cluster. Here's a high-level overview of the steps involved:

Prepare the machines: Set up multiple machines (physical or virtual) with the same version of Hadoop installed on each of them.

Configure SSH: Ensure passwordless SSH login between all the machines in the cluster.

Adjust the configuration: Modify the Hadoop configuration files to reflect the cluster setup, including specifying the NameNode and DataNode details (one way to check what a client picks up is sketched below).
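One quick way to confirm that a client machine picks up the new cluster configuration is to print the NameNode address Hadoop resolves from core-site.xml. This is a minimal sketch; the class name and the example fs.defaultFS value in the comment are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;

public class ShowClusterConfig {
    public static void main(String[] args) {
        // Loads core-default.xml and core-site.xml from the classpath
        // (typically $HADOOP_HOME/etc/hadoop on the client machine)
        Configuration conf = new Configuration();

        // fs.defaultFS should point at the NameNode of the cluster,
        // e.g. hdfs://namenode-host:9000 (host and port are illustrative)
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
    }
}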
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

HDFS ARCHITECTURE

Metadata Management: The NameNode maintains crucial metadata about the HDFS, including information about files and directories. It keeps track of the data block locations, replication factor, and other essential details required for efficient data storage and retrieval.

High Availability and Secondary NameNode

As the NameNode is a critical component, its failure could result in the unavailability of the entire HDFS. To address this concern, Hadoop introduced the concept of High Availability (HA) with Hadoop 2.x versions.

In an HA setup, there are two NameNodes: the Active NameNode and the Standby NameNode. The Active NameNode handles all client requests and metadata operations, while the Standby NameNode remains in sync with the Active NameNode. If the Active NameNode fails, the Standby NameNode takes over as the new Active NameNode, ensuring seamless HDFS availability.

Additionally, the Secondary NameNode is a misnomer and should not be confused with the Standby NameNode. The Secondary NameNode is not a failover mechanism; it assists the primary NameNode with periodic checkpoints to optimize its performance. The Secondary NameNode periodically merges the edit logs with the fsimage (file system image) and creates a new, updated fsimage, reducing the startup time of the primary NameNode.

Recommended NameNode hardware:

Storage: Fast and reliable storage for maintaining file system metadata.

CPU: A CPU capable of handling the processing load of metadata management and client request handling.

Networking: A good network connection for communication with DataNodes and prompt response to client requests.

By optimizing the NameNode hardware, you can ensure smooth HDFS operations and reliable data management in your Hadoop cluster.
Together, these mechanisms ensure smooth data operations, fault tolerance, and high availability within your Hadoop cluster.

INTERACTING WITH HDFS

You can interact with HDFS using either the command-line interface (CLI) or the Hadoop Java API. Here are some common HDFS operations:
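As one illustration of the Java API side, here is a minimal sketch of a few common operations (creating a directory, copying a local file into HDFS, and listing a directory); the class name and the paths used are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsOperations {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Create a directory (illustrative path)
        fs.mkdirs(new Path("/user/hadoop/demo"));

        // Copy a local file into HDFS (illustrative paths)
        fs.copyFromLocalFile(new Path("/tmp/local-file.txt"),
                             new Path("/user/hadoop/demo/file.txt"));

        // List the contents of the directory
        for (FileStatus status : fs.listStatus(new Path("/user/hadoop/demo"))) {
            System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
        }

        fs.close();
    }
}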
MAPREDUCE

Map: The mapper processes the input data and produces key-value pairs as intermediate outputs.

Shuffle and Sort: The intermediate key-value pairs are shuffled and sorted based on their keys, grouping them for the reduce phase.

Reduce: The reducer processes the sorted intermediate data and produces the final output.

Main class:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

job.setMapperClass(WordCountMapper.class);
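The driver fragment above references a WordCountMapper class. Below is a minimal sketch of such a mapper, together with a matching reducer, for the classic word-count job; the reducer class name and the implementation details are illustrative assumptions.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every token in the input line
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts emitted for each word
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}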
APACHE HADOOP ECOSYSTEM

Apache Hadoop has a rich ecosystem of related projects that extend its capabilities. In this section, we'll explore some of the most popular components of the Hadoop ecosystem.

APACHE HIVE

Using Apache Hive involves several steps, from creating tables to querying and analyzing data. Let's walk through a basic workflow for using Hive.

Launching Hive and Creating Tables

USE mydatabase;

CREATE TABLE employees (
  ...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

Loading Data into Hive Tables

Upload data files to HDFS or make sure the data is available in a compatible storage format (e.g., CSV, JSON) accessible by Hive.

Load the data into the Hive table using the LOAD DATA command. For example, if the data is in a CSV file located in HDFS:

LOAD DATA INPATH '/path/to/employees.csv' INTO TABLE employees;

Querying Data with Hive

Now that the data is loaded into the Hive table, you can perform SQL-like queries on it using Hive Query Language (HQL). Here are some example queries:

Retrieve all employee records:

SELECT * FROM employees;

Create a view of employees with a salary above 75000:

CREATE VIEW high_salary_employees AS
SELECT * FROM employees WHERE emp_salary > 75000;

Using User-Defined Functions (UDFs)

Hive allows you to create custom User-Defined Functions (UDFs) in Java, Python, or other supported languages to perform complex computations or data transformations. After creating a UDF, you can use it in your HQL queries. For example, let's create a simple UDF to convert employee salaries from USD to EUR:

package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class USDtoEUR extends UDF {
    public Text evaluate(double usd) {
        double eur = usd * 0.85; // Conversion rate (as an example)
        return new Text(String.valueOf(eur));
    }
}
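HQL statements like the queries above can also be issued from Java through the Hive JDBC driver. Below is a minimal sketch; the HiveServer2 URL, credentials, and the presence of the hive-jdbc dependency on the classpath are assumptions about your environment.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcQuery {
    public static void main(String[] args) throws Exception {
        // Illustrative HiveServer2 endpoint and database
        String url = "jdbc:hive2://localhost:10000/mydatabase";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT * FROM employees WHERE emp_salary > 75000")) {
            while (rs.next()) {
                // Print the first column of each matching row
                System.out.println(rs.getString(1));
            }
        }
    }
}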
APACHE PIG

Abstraction: Pig abstracts the complexities of writing MapReduce code, allowing users to focus on data manipulation and analysis.

Extensibility: Pig supports user-defined functions (UDFs) in Java, Python, or other languages, enabling custom data transformations and calculations.

Optimization: Pig optimizes data processing through logical and physical optimizations, reducing data movement and improving performance.

Schema Flexibility: Pig follows a schema-on-read approach, allowing data to be stored in a flexible and schema-less manner, accommodating evolving data structures.

Integration with Hadoop Ecosystem: Pig integrates seamlessly with various Hadoop ecosystem components, including HDFS, Hive, HBase, etc., enhancing data processing capabilities.

Using Pig

To use Apache Pig, follow these general steps:

Install Apache Pig on your Hadoop cluster or a standalone machine.

Write Pig Latin scripts to load, transform, and process your data. Save the scripts in .pig files.

Run Pig in either Local Mode or MapReduce Mode, depending on your data size and requirements.

Here's an example of a simple Pig Latin script that loads data, filters records, and stores the results:

-- Load data from HDFS
data = LOAD '/path/to/input' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);

-- Filter records where age is greater than 25
filtered_data = FILTER data BY age > 25;

-- Store the filtered results to HDFS
STORE filtered_data INTO '/path/to/output' USING PigStorage(',');

As you become more familiar with Pig, you can explore its advanced features, including UDFs, joins, groupings, and more complex data processing operations. Apache Pig is a valuable tool in the Hadoop ecosystem, enabling users to perform data processing tasks efficiently without the need for extensive programming knowledge.

APACHE HBASE

Apache HBase is a distributed, scalable, NoSQL database built on top of Apache Hadoop. It provides real-time read and write access to large amounts of structured data. HBase is designed to handle massive amounts of data and is well-suited for use cases that require random access to data, such as real-time analytics, online transaction processing (OLTP), and serving as a data store for web applications.

HBase Features

Column-Family Data Model: Data is organized into column families within a table. Each column family can have multiple columns. New columns can be added dynamically without affecting existing rows.
Horizontal Scalability: HBase can scale horizontally by adding more nodes to the cluster. It automatically distributes data across regions and nodes, ensuring even data distribution and load balancing.

High Availability: HBase supports automatic failover and recovery, ensuring data availability even if some nodes experience failures.

Real-Time Read/Write: HBase provides fast and low-latency read and write access to data, making it suitable for real-time applications.

Integration with Hadoop Ecosystem: HBase seamlessly integrates with various Hadoop ecosystem components, such as HDFS, MapReduce, and Apache Hive, enhancing data processing capabilities.

HBase Architecture

HBase RegionServer: Stores and manages data. Each RegionServer manages multiple regions, and each region corresponds to a portion of an HBase table.

ZooKeeper: HBase relies on Apache ZooKeeper for coordination and distributed synchronization among the HBase Master and RegionServers.

HBase Client: Interacts with the HBase cluster to read and write data. Clients use the HBase API or HBase shell to perform operations on HBase tables.

Using HBase

Start the HBase Master and RegionServers.

Create HBase tables and specify the column families.

Use the HBase API or HBase shell to perform read and write operations on HBase tables.

Here's an example of using the HBase shell to create a table and insert data:
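For programmatic access, an equivalent write-and-read flow can be sketched with the HBase Java client API. This sketch assumes a table named employees with a column family info already exists; the row key, column, and value are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath (ZooKeeper quorum, etc.)
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employees"))) {

            // Insert a cell: row key "emp1", column family "info", column "name"
            Put put = new Put(Bytes.toBytes("emp1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the cell back
            Result result = table.get(new Get(Bytes.toBytes("emp1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}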
APACHE SPARK

Spark SQL: Spark SQL enables SQL-like querying on structured data and seamless integration with data sources like Hive and JDBC.

Using Apache Spark

To use Apache Spark, follow these general steps:

Install Apache Spark on your Hadoop cluster or a standalone machine.

Create a SparkContext, which is the entry point to Spark functionalities.

Load data from various data sources into RDDs or DataFrames (Spark SQL).

Perform transformations and actions on the RDDs or DataFrames to process and analyze the data.

Use Spark MLlib for machine learning tasks if needed.

Save the results or write the data back to external data sources if required.

Here's an example of using Spark in Python to count the occurrences of each word in a text file:
from pyspark import SparkContext

# Create a SparkContext
sc = SparkContext("local", "Word Count")

# Load data from a text file into an RDD
text_file = sc.textFile("path/to/text_file.txt")

# Split the lines into words and count the occurrences of each word
word_counts = (text_file
               .flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# Print the word counts
for word, count in word_counts.collect():
    print(word, count)

Apache Spark has become a popular choice for big data processing, analytics, and machine learning applications. Its ability to leverage in-memory computing and seamless integration with various data sources and machine learning libraries make it a versatile tool in the big data ecosystem.

APACHE SQOOP

Apache Sqoop is an open-source tool designed for efficiently transferring data between Apache Hadoop and structured data stores, such as relational databases. Sqoop simplifies the process of importing data from relational databases into Hadoop's distributed file system (HDFS) and exporting data from HDFS to relational databases. It supports various databases, including MySQL, Oracle, PostgreSQL, and more.

Sqoop Features

Data Import and Export: Sqoop allows users to import data from relational databases into HDFS and export data from HDFS back to relational databases.

Parallel Data Transfer: Sqoop uses multiple mappers in Hadoop to import and export data in parallel, achieving faster data transfer.

Full and Incremental Data Imports: Sqoop supports both full and incremental data imports. Incremental imports enable transferring only new or updated data since the last import.
ADDITIONAL RESOURCES

Java Code Geeks (JCG): JCG delivers over 1 million pages each month to more than 700K software developers, architects, and decision makers. JCG offers something for everyone, including news, tutorials, cheat sheets, research guides, feature articles, source code and more.