
Building a 3-Node Cluster for Big Data

Analytics in Cloudera

Molinge Lyonga Jr SC24P107


Ebai Jenniline Agbor SC24P118
Jik Alvin Comforter SC24P111

1
Presentation Outline

Introduction
Objectives
Methods
Results
Conclusion

2
INTRODUCTION
What is big data analytics?
Big Data Analytics is the practice of examining massive amounts of information to uncover hidden trends, patterns, and relationships. It is like sifting through a giant mountain of data to find the gold nuggets of insight.

Why is big data analytics important?


● Informed Decisions: Retailers like Walmart use Big Data Analytics to make smart choices about which
products to stock. This not only reduces waste but also keeps customers happy and profits high.
● Enhanced Customer Experiences: Think about Amazon. Big Data Analytics is what makes those
product suggestions so accurate. It's like having a personal shopper who knows your taste and helps
you find what you want.
● Fraud Detection: Credit card companies, like MasterCard, use Big Data Analytics to catch and stop
fraudulent transactions. It's like having a guardian that watches over your money and keeps it safe.
● Optimized Logistics: FedEx, for example, uses Big Data Analytics to deliver your packages faster and
with less impact on the environment. It's like taking the fastest route to your destination while also
being kind to the planet.
3
Challenges of the Cloudera QuickStart VM
The Cloudera QuickStart VM provides all the needed tools preconfigured, but it is intended for local development and learning rather than production use. Configuring a cluster and adding nodes to it proved challenging. Key points noted:
● The Cloudera QuickStart VM is old: it supports only CDH 5.x and below and is no longer officially supported by
Cloudera.
● It has compatibility issues with other operating systems. We attempted to add nodes running Ubuntu 24,
Ubuntu 14, and another Cloudera QuickStart VM, but every attempt failed during the Cloudera agent
installation step of adding the node to the cluster.

4
OBJECTIVES

● Describe the procedure for building a 3-node cluster for big data
analytics on the Cloudera platform.
● Build a 3-node cluster using the Cloudera QuickStart VM as the master
and 2 Ubuntu machines as slaves.
● Present a case study.

5
Why use Cloudera for big data processing?

● Cloudera is often used for big data processing because it offers a comprehensive platform
for managing and analyzing large datasets, including features like scalable storage, data
processing engines, and data security.

● It facilitates faster analysis and offers a flexible environment for building complex
applications.

6
METHODS

Using a Hadoop Cluster

● A Hadoop cluster is simply a group of computers connected together over a LAN.
● We use it for storing and processing large data sets. A Hadoop cluster consists of a number of
commodity machines connected together.
● These machines communicate with a higher-end machine that acts as the master.
● The master and slave nodes implement distributed computing over distributed data storage.
● The cluster runs open-source software that provides this distributed functionality.
7
Architecture of a Hadoop Cluster

8
Prerequisites for Cluster Setup

● Hardware Requirements
○ 3 physical/virtual machines (minimum specs: 8GB RAM, 4 CPU cores, 100GB storage per
node)
● Software Requirements
○ Linux OS (CentOS/RHEL 7/8, Ubuntu).
○ Cloudera Manager (for cluster management).
○ JDK (Java Development Kit).
● Network Requirements
○ Static IP addresses for all nodes.
○ SSH key-based authentication.
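A minimal sketch of the SSH key setup between the master and the workers, assuming a generic user account and the placeholder hostnames ubuntu1 and ubuntu2 (adjust to your own environment):

    # On the master node: generate a key pair if one does not already exist
    ssh-keygen -t rsa -b 4096

    # Copy the public key to each worker so the master can log in without a password
    ssh-copy-id user@ubuntu1
    ssh-copy-id user@ubuntu2

    # Verify that passwordless login works
    ssh user@ubuntu1 hostname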
9
System Architecture Overview
● Cluster Roles
○ Master Node (1): Runs NameNode,
ResourceManager, Cloudera
Manager.
○ Worker Nodes (2): Run DataNode,
NodeManager.

10
Procedure for creating a cluster

● Set up the Cloudera QuickStart VM as the master machine with the NameNode.
● Set up 2 Ubuntu VMs as slave machines with DataNodes.
● Set up SSH between the slave and master nodes and enable static IPs.
● Update /etc/hosts on the master and each slave so every node can resolve the others.
● Install Java on the slave nodes.
● Download Hadoop and set it up on ubuntu1, then copy the tarball to ubuntu2 over SSH and set it up
there (a sketch of these steps follows below).
● Add the slave nodes to the cluster and let the master node install all the necessary components,
such as the Cloudera agent, on the slave nodes.
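A minimal sketch of the host, Java, and Hadoop steps above, assuming example IP addresses, a generic user account, and the Hadoop 2.6.0 tarball as one version compatible with CDH 5 (names and versions should be adjusted to your setup):

    # On every node: map hostnames to the static IPs (example addresses only),
    # appended to /etc/hosts
    192.168.1.10  master
    192.168.1.11  ubuntu1
    192.168.1.12  ubuntu2

    # On each Ubuntu slave: install a JDK (openjdk-8-jdk is one example package)
    sudo apt-get update && sudo apt-get install -y openjdk-8-jdk

    # On ubuntu1: download and unpack Hadoop, then copy the tarball to ubuntu2 over SSH
    wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
    tar -xzf hadoop-2.6.0.tar.gz
    scp hadoop-2.6.0.tar.gz user@ubuntu2:~/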

11
Adding Worker/Slave Nodes via Cloudera Manager
1. Access the Cloudera Manager Web UI (http://<master-node>:7180); see the reachability check below.
2. Navigate to Hosts > Add New Hosts.
3. Enter the worker node IPs/hostnames.
4. Install CDH (Cloudera Distribution for Hadoop).
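Before adding hosts, a quick check from a worker that the Cloudera Manager server is reachable can save time; a sketch assuming the placeholder hostname master:

    # Expect an HTTP status code (e.g., 200 or a redirect) if Cloudera Manager
    # is listening on port 7180
    curl -s -o /dev/null -w "%{http_code}\n" http://master:7180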

12
Configuring Hadoop Services

● Role Assignment
○ Master Node: NameNode, ResourceManager, Cloudera Manager.
○ Worker Nodes: DataNode, NodeManager
● Key Services to Install:
○ HDFS (Storage).
○ YARN (Resource Management).
○ ZooKeeper (Coordination).
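Before assigning roles, it helps to confirm that the Cloudera Manager agent is running on each worker; a sketch assuming the standard cloudera-scm-agent service installed by Cloudera Manager:

    # On each worker node: the agent must be running for role assignment to succeed
    sudo service cloudera-scm-agent status

    # Restart it if it is stopped
    sudo service cloudera-scm-agent restart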

13
Configuring Hadoop Services

● Configure Cloudera Manager using the Proof-of-Concept Installation Guide
(https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/poc_installation.html).
● Use Vagrant to provision CentOS VMs automatically with Cloudera CDH.
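A minimal sketch of provisioning one CentOS VM with Vagrant; the centos/7 box name is one common example, and the Cloudera Manager/CDH installation on the VM still follows the guide above:

    # Create a Vagrant project backed by a CentOS 7 box and boot the VM
    vagrant init centos/7
    vagrant up

    # Log in to the VM to continue with the Cloudera Manager installation
    vagrant ssh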

14
Screenshot of adding new hosts to a cluster

Fig: Adding new slave nodes to the cluster
15


Validating the Cluster

● Commands to Verify:
○ hdfs dfsadmin -report (Check HDFS health)
○ yarn node -list (Check YARN nodes)
○ Run a sample MapReduce job (e.g., WordCount); see the sketch below.

● Expected Output
○ All nodes should show as "Live" in HDFS/YARN.
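A sketch of the validation run; the examples jar path shown is the usual location in CDH package installs, but it can differ between versions, and the HDFS paths are placeholders:

    # HDFS and YARN health
    hdfs dfsadmin -report
    yarn node -list

    # Sample WordCount job against a small input directory in HDFS
    hdfs dfs -mkdir -p /user/cloudera/wc-in
    hdfs dfs -put /etc/hosts /user/cloudera/wc-in
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
        wordcount /user/cloudera/wc-in /user/cloudera/wc-out
    hdfs dfs -cat /user/cloudera/wc-out/part-r-00000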

16
Best Practices
● Cluster Maintenance:
○ Regularly back up NameNode metadata (see the sketch below).
○ Monitor via Cloudera Manager alerts.

● Security:
○ Enable Kerberos for production clusters.
○ Use firewall rules for network security.
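One way to back up NameNode metadata is to pull the latest fsimage with dfsadmin; a sketch assuming the hdfs superuser and a placeholder backup directory:

    # Download the most recent fsimage from the NameNode to a local backup directory
    sudo -u hdfs hdfs dfsadmin -fetchImage /backup/namenode/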

17
Troubleshooting Common Issues
● SSH Failures:
○ Fix: Verify ~/.ssh/authorized_keys permissions (see the sketch below).
● Service Startup Errors:
○ Fix: Check /var/log/cloudera-scm-server/ logs.
● Resource Constraints:
○ Fix: Increase RAM/CPU allocation.
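A sketch of the usual permission fix, run as the login user on the node that rejects key-based logins:

    # sshd ignores authorized_keys if the directory or file is group/world writable
    chmod 700 ~/.ssh
    chmod 600 ~/.ssh/authorized_keys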

18
Case Study: Real-World Implementation

Example: A retail company using a 3-node Cloudera cluster for:
● Customer behavior analytics.
● Inventory optimization.
● Sales forecasting.

Results:
● 30% faster data processing.
● Cost savings compared to cloud-based solutions.

19
Conclusion
➔ Summary
◆ A 3-node Cloudera cluster is ideal for learning and
small-scale analytics.
◆ Cloudera Manager simplifies deployment and
management.
➔ Next Steps

◆ Scale to more nodes for production workloads.

◆ Integrate Spark, Kafka, or Hive.

20
The End
Thank you

21
