APACHE HADOOP
Preface
Introduction
Installing Apache Hadoop
  Single-Node Installation
  Multi-Node Installation
Hadoop Distributed File System (HDFS)
  HDFS Architecture
  Interacting with HDFS
MapReduce
  MapReduce Workflow
  Writing a MapReduce Job
Apache Hadoop Ecosystem
  Apache Hive
  Apache Pig
  Apache HBase
  Apache Spark
  Apache Sqoop
Additional Resources
PREFACE

The Getting Started with Apache Hadoop cheatsheet serves as your quick reference guide to understanding the fundamental concepts, components, and essential commands of Hadoop. Whether you are a data engineer, a data scientist, or simply curious about big data technologies, this cheatsheet will provide you with a solid foundation to embark on your Hadoop journey.
INTRODUCTION

Apache Hadoop is a powerful ecosystem for handling big data. It allows you to store, process, and analyze vast amounts of data across distributed clusters of computers. Hadoop is based on the MapReduce programming model, which enables parallel processing of data. This section will cover the key components of Hadoop, its architecture, and how it works.
INSTALLING APACHE HADOOP

In this section, we'll guide you through the installation process for Apache Hadoop. We'll cover both single-node and multi-node cluster setups to suit your development and testing needs.

SINGLE-NODE INSTALLATION

# Set HADOOP_HOME
export HADOOP_HOME=/path/to/hadoop

# Add Hadoop binary path to PATH
export PATH=$PATH:$HADOOP_HOME/bin

MULTI-NODE INSTALLATION

For production or more realistic testing scenarios, you'll need to set up a multi-node Hadoop cluster. Here's a high-level overview of the steps involved:

Prepare the machines: Set up multiple machines (physical or virtual) with the same version of Hadoop installed on each of them.

Configure SSH: Ensure passwordless SSH login between all the machines in the cluster.

Adjust the configuration: Modify the Hadoop configuration files to reflect the cluster setup, including specifying the NameNode and DataNode details (one way to check what a client picks up is sketched below).
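One quick way to confirm that a client machine picks up the new cluster configuration is to print the NameNode address Hadoop resolves from core-site.xml. This is a minimal sketch; the class name and the example fs.defaultFS value in the comment are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;

public class ShowClusterConfig {
    public static void main(String[] args) {
        // Loads core-default.xml and core-site.xml from the classpath
        // (typically $HADOOP_HOME/etc/hadoop on the client machine)
        Configuration conf = new Configuration();

        // fs.defaultFS should point at the NameNode of the cluster,
        // e.g. hdfs://namenode-host:9000 (host and port are illustrative)
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
    }
}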
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

HDFS ARCHITECTURE

Metadata Management: The NameNode maintains crucial metadata about the HDFS, including information about files and directories. It keeps track of the data block locations, replication factor, and other essential details required for efficient data storage and retrieval.

High Availability and Secondary NameNode

As the NameNode is a critical component, its failure could result in the unavailability of the entire HDFS. To address this concern, Hadoop introduced the concept of High Availability (HA) with Hadoop 2.x versions.

In an HA setup, there are two NameNodes: the Active NameNode and the Standby NameNode. The Active NameNode handles all client requests and metadata operations, while the Standby NameNode remains in sync with the Active NameNode. If the Active NameNode fails, the Standby NameNode takes over as the new Active NameNode, ensuring seamless HDFS availability.

Additionally, the Secondary NameNode is a misnomer and should not be confused with the Standby NameNode. The Secondary NameNode is not a failover mechanism; it assists the primary NameNode with periodic checkpoints to optimize its performance. The Secondary NameNode periodically merges the edit logs with the fsimage (file system image) and creates a new, updated fsimage, reducing the startup time of the primary NameNode.

Recommended NameNode hardware:

Storage: Fast and reliable storage for maintaining file system metadata.

CPU: A CPU capable of handling the processing load of metadata management and client request handling.

Networking: A good network connection for communication with DataNodes and prompt response to client requests.

By optimizing the NameNode hardware, you can ensure smooth HDFS operations and reliable data management in your Hadoop cluster.
Together, these mechanisms ensure smooth data operations, fault tolerance, and high availability within your Hadoop cluster.

INTERACTING WITH HDFS

You can interact with HDFS using either the command-line interface (CLI) or the Hadoop Java API. Here are some common HDFS operations:
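As one illustration of the Java API side, here is a minimal sketch of a few common operations (creating a directory, copying a local file into HDFS, and listing a directory); the class name and the paths used are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsOperations {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Create a directory (illustrative path)
        fs.mkdirs(new Path("/user/hadoop/demo"));

        // Copy a local file into HDFS (illustrative paths)
        fs.copyFromLocalFile(new Path("/tmp/local-file.txt"),
                             new Path("/user/hadoop/demo/file.txt"));

        // List the contents of the directory
        for (FileStatus status : fs.listStatus(new Path("/user/hadoop/demo"))) {
            System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
        }

        fs.close();
    }
}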
MAPREDUCE

Map: The mapper processes the input data and produces key-value pairs as intermediate outputs.

Shuffle and Sort: The intermediate key-value pairs are shuffled and sorted based on their keys, grouping them for the reduce phase.

Reduce: The reducer processes the sorted intermediate data and produces the final output.

Main class:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

job.setMapperClass(WordCountMapper.class);
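The driver fragment above references a WordCountMapper class. Below is a minimal sketch of such a mapper, together with a matching reducer, for the classic word-count job; the reducer class name and the implementation details are illustrative assumptions.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every token in the input line
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts emitted for each word
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}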
APACHE HADOOP ECOSYSTEM

Apache Hadoop has a rich ecosystem of related projects that extend its capabilities. In this section, we'll explore some of the most popular components of the Hadoop ecosystem.

APACHE HIVE

Using Apache Hive involves several steps, from creating tables to querying and analyzing data. Let's walk through a basic workflow for using Hive.

Launching Hive and Creating Tables

USE mydatabase;

CREATE TABLE employees (
  ...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

Loading Data into Hive Tables

Upload data files to HDFS or make sure the data is available in a compatible storage format (e.g., CSV, JSON) accessible by Hive.

Load the data into the Hive table using the LOAD DATA command. For example, if the data is in a CSV file located in HDFS:

LOAD DATA INPATH '/path/to/employees.csv' INTO TABLE employees;

Querying Data with Hive

Now that the data is loaded into the Hive table, you can perform SQL-like queries on it using Hive Query Language (HQL). Here are some example queries:

Retrieve all employee records:

SELECT * FROM employees;

Create a view of employees with a salary above 75000:

CREATE VIEW high_salary_employees AS
SELECT * FROM employees WHERE emp_salary > 75000;

Using User-Defined Functions (UDFs)

Hive allows you to create custom User-Defined Functions (UDFs) in Java, Python, or other supported languages to perform complex computations or data transformations. After creating a UDF, you can use it in your HQL queries. For example, let's create a simple UDF to convert employee salaries from USD to EUR:

package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class USDtoEUR extends UDF {
    public Text evaluate(double usd) {
        double eur = usd * 0.85; // Conversion rate (as an example)
        return new Text(String.valueOf(eur));
    }
}
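HQL statements like the queries above can also be issued from Java through the Hive JDBC driver. Below is a minimal sketch; the HiveServer2 URL, credentials, and the presence of the hive-jdbc dependency on the classpath are assumptions about your environment.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcQuery {
    public static void main(String[] args) throws Exception {
        // Illustrative HiveServer2 endpoint and database
        String url = "jdbc:hive2://localhost:10000/mydatabase";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT * FROM employees WHERE emp_salary > 75000")) {
            while (rs.next()) {
                // Print the first column of each matching row
                System.out.println(rs.getString(1));
            }
        }
    }
}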
APACHE PIG

Abstraction: Pig abstracts the complexities of writing MapReduce code, allowing users to focus on data manipulation and analysis.

Extensibility: Pig supports user-defined functions (UDFs) in Java, Python, or other languages, enabling custom data transformations and calculations.

Optimization: Pig optimizes data processing through logical and physical optimizations, reducing data movement and improving performance.

Schema Flexibility: Pig follows a schema-on-read approach, allowing data to be stored in a flexible and schema-less manner, accommodating evolving data structures.

Integration with Hadoop Ecosystem: Pig integrates seamlessly with various Hadoop ecosystem components, including HDFS, Hive, HBase, etc., enhancing data processing capabilities.

Using Pig

To use Apache Pig, follow these general steps:

Install Apache Pig on your Hadoop cluster or a standalone machine.

Write Pig Latin scripts to load, transform, and process your data. Save the scripts in .pig files.

Run Pig in either Local Mode or MapReduce Mode, depending on your data size and requirements.

Here's an example of a simple Pig Latin script that loads data, filters records, and stores the results:

-- Load data from HDFS
data = LOAD '/path/to/input' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);

-- Filter records where age is greater than 25
filtered_data = FILTER data BY age > 25;

-- Store the filtered results to HDFS
STORE filtered_data INTO '/path/to/output' USING PigStorage(',');

As you become more familiar with Pig, you can explore its advanced features, including UDFs, joins, groupings, and more complex data processing operations. Apache Pig is a valuable tool in the Hadoop ecosystem, enabling users to perform data processing tasks efficiently without the need for extensive programming knowledge.

APACHE HBASE

Apache HBase is a distributed, scalable, NoSQL database built on top of Apache Hadoop. It provides real-time read and write access to large amounts of structured data. HBase is designed to handle massive amounts of data and is well-suited for use cases that require random access to data, such as real-time analytics, online transaction processing (OLTP), and serving as a data store for web applications.

HBase Features

Column-Family Data Model: Data is organized into column families within a table. Each column family can have multiple columns. New columns can be added dynamically without affecting existing rows.
Horizontal Scalability: HBase can scale horizontally by adding more nodes to the cluster. It automatically distributes data across regions and nodes, ensuring even data distribution and load balancing.

High Availability: HBase supports automatic failover and recovery, ensuring data availability even if some nodes experience failures.

Real-Time Read/Write: HBase provides fast and low-latency read and write access to data, making it suitable for real-time applications.

Integration with Hadoop Ecosystem: HBase seamlessly integrates with various Hadoop ecosystem components, such as HDFS, MapReduce, and Apache Hive, enhancing data processing capabilities.

HBase Architecture

HBase RegionServer: Stores and manages data. Each RegionServer manages multiple regions, and each region corresponds to a portion of an HBase table.

ZooKeeper: HBase relies on Apache ZooKeeper for coordination and distributed synchronization among the HBase Master and RegionServers.

HBase Client: Interacts with the HBase cluster to read and write data. Clients use the HBase API or HBase shell to perform operations on HBase tables.

Using HBase

Start the HBase Master and RegionServers.

Create HBase tables and specify the column families.

Use the HBase API or HBase shell to perform read and write operations on HBase tables.

Here's an example of using the HBase shell to create a table and insert data:
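For programmatic access, an equivalent write-and-read flow can be sketched with the HBase Java client API. This sketch assumes a table named employees with a column family info already exists; the row key, column, and value are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath (ZooKeeper quorum, etc.)
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employees"))) {

            // Insert a cell: row key "emp1", column family "info", column "name"
            Put put = new Put(Bytes.toBytes("emp1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the cell back
            Result result = table.get(new Get(Bytes.toBytes("emp1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}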
APACHE SPARK

Spark SQL: Spark SQL enables SQL-like querying on structured data and seamless integration with data sources like Hive and JDBC.

Using Apache Spark

To use Apache Spark, follow these general steps:

Install Apache Spark on your Hadoop cluster or a standalone machine.

Create a SparkContext, which is the entry point to Spark functionalities.

Load data from various data sources into RDDs or DataFrames (Spark SQL).

Perform transformations and actions on the RDDs or DataFrames to process and analyze the data.

Use Spark MLlib for machine learning tasks if needed.

Save the results or write the data back to external data sources if required.

Here's an example of using Spark in Python to count the occurrences of each word in a text file:
from pyspark import SparkContext

# Create a SparkContext
sc = SparkContext("local", "Word Count")

# Load data from a text file into an RDD
text_file = sc.textFile("path/to/text_file.txt")

# Split the lines into words and count the occurrences of each word
word_counts = (text_file
               .flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# Print the word counts
for word, count in word_counts.collect():
    print(word, count)

Apache Spark has become a popular choice for big data processing, analytics, and machine learning applications. Its ability to leverage in-memory computing and seamless integration with various data sources and machine learning libraries make it a versatile tool in the big data ecosystem.

APACHE SQOOP

Apache Sqoop is an open-source tool designed for efficiently transferring data between Apache Hadoop and structured data stores, such as relational databases. Sqoop simplifies the process of importing data from relational databases into Hadoop's distributed file system (HDFS) and exporting data from HDFS to relational databases. It supports various databases, including MySQL, Oracle, PostgreSQL, and more.

Sqoop Features

Data Import and Export: Sqoop allows users to import data from relational databases into HDFS and export data from HDFS back to relational databases.

Parallel Data Transfer: Sqoop uses multiple mappers in Hadoop to import and export data in parallel, achieving faster data transfer.

Full and Incremental Data Imports: Sqoop supports both full and incremental data imports. Incremental imports enable transferring only new or updated data since the last import.
ADDITIONAL RESOURCES

Java Code Geeks (JCG): JCG delivers over 1 million pages each month to more than 700K software developers, architects, and decision makers. JCG offers something for everyone, including news, tutorials, cheat sheets, research guides, feature articles, source code and more.