Three Node Cluster in Hadoop
Analytics in Cloudera
Presentation Outline
Introduction
Objectives
Methods
Results
Conclusion
INTRODUCTION
What is big data analytics?
Big data analytics is the process of examining massive amounts of information to uncover hidden trends, patterns, and relationships. It is like sifting through a giant mountain of data to find the gold nuggets of insight.
OBJECTIVES
Why use Cloudera for big data processing?
● Cloudera is often used for big data processing because it offers a comprehensive platform
for managing and analyzing large datasets, including features like scalable storage, data
processing engines, and data security.
● It facilitates faster analysis and offers a flexible environment for building complex
applications.
METHODS
● A Hadoop cluster is simply a group of computers connected together via a LAN.
● We use it for storing and processing large data sets. A Hadoop cluster consists of a number of
commodity machines connected together.
● These machines communicate with a high-end machine that acts as the master.
● The master and worker nodes implement distributed computing over distributed data storage.
● The cluster runs open-source software that provides this distributed functionality.
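The distributed processing described above follows MapReduce's map, shuffle, and reduce phases. As a rough local sketch of that flow (the file paths and sample input are hypothetical; on a real cluster each phase runs across different worker nodes), a word count can be imitated with a shell pipeline:

```shell
# Each stage below stands in for a MapReduce phase:
#   tr    = map     (emit one word per line)
#   sort  = shuffle (bring identical keys together)
#   uniq  = reduce  (count occurrences per key)
printf 'big data\nbig cluster\n' > /tmp/wc_input.txt   # hypothetical sample input

tr -s ' ' '\n' < /tmp/wc_input.txt | sort | uniq -c | sort -rn > /tmp/wc_output.txt
cat /tmp/wc_output.txt
```

On a Hadoop cluster the same logic would run as a distributed job (e.g. the bundled WordCount example), with the shuffle phase moving intermediate keys between worker nodes over the network.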
Architecture of a Hadoop Cluster
Prerequisites for Cluster Setup
● Hardware Requirements
○ 3 physical/virtual machines (minimum specs: 8GB RAM, 4 CPU cores, 100GB storage per
node)
● Software Requirements
○ Linux OS (CentOS/RHEL 7/8, Ubuntu).
○ Cloudera Manager (for cluster management).
○ JDK (Java Development Kit).
● Network Requirements
○ Static IP addresses for all nodes.
○ SSH key-based authentication.
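The network prerequisites above can be staged with a short script. A minimal sketch, assuming the hostnames and 192.168.1.x addresses are placeholders to be replaced with your own static IPs:

```shell
# Sketch: stage /etc/hosts entries for the three nodes.
# The addresses and hostnames below are placeholders, not real assignments.
NODES="192.168.1.10 master
192.168.1.11 worker1
192.168.1.12 worker2"

# Write the mapping to a staging file for review, then append it to
# /etc/hosts on every node so all hosts resolve each other by name.
printf '%s\n' "$NODES" > /tmp/cluster_hosts.txt
cat /tmp/cluster_hosts.txt

# Passwordless SSH from the master to each worker (run once on the master):
#   ssh-keygen -t rsa -b 4096 -N '' -f ~/.ssh/id_rsa
#   ssh-copy-id worker1 && ssh-copy-id worker2
```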
System Architecture Overview
● Cluster Roles
○ Master Node (1): Runs NameNode, ResourceManager, Cloudera Manager.
○ Worker Nodes (2): Run DataNode, NodeManager.
Procedure for creating a cluster
Adding Worker Nodes via Cloudera Manager
1. Access Cloudera Manager Web UI (http://<master-node>:7180)
2. Navigate to Hosts > Add New Hosts
3. Enter worker node IPs/hostnames.
4. Install CDH (Cloudera's Distribution including Apache Hadoop).
Configuring Hadoop Services
● Role Assignment
○ Master Node: NameNode, ResourceManager, Cloudera Manager.
○ Worker Nodes: DataNode, NodeManager.
● Key Services to Install:
○ HDFS (Storage).
○ YARN (Resource Management).
○ ZooKeeper (Coordination).
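One HDFS setting deserves attention on a cluster this small: with only two DataNodes, HDFS's default replication factor of 3 can never be satisfied, so every block stays under-replicated. A minimal sketch of the relevant override (Cloudera Manager normally manages this setting for you; the property name is standard Hadoop, the value here reflects this two-worker layout):

```xml
<!-- hdfs-site.xml -->
<configuration>
  <!-- Only two DataNodes exist, so cap block replication at 2. -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```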
Verifying the Cluster Setup
● Commands to Verify:
○ hdfs dfsadmin -report (Check HDFS health)
○ yarn node -list (Check YARN nodes)
○ Run a sample MapReduce job (e.g., WordCount).
● Expected Output
○ All nodes should show as "Live" in HDFS/YARN.
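The health check above can be scripted. A hedged sketch: since no live cluster is available here, the `REPORT` variable holds a hypothetical sample of `hdfs dfsadmin -report` output; on a real cluster, capture the command's actual output instead, as shown in the comment.

```shell
# Sketch: confirm both workers appear as live DataNodes.
# On a real cluster, replace the sample below with:
#   REPORT="$(hdfs dfsadmin -report)"
REPORT='Configured Capacity: 214748364800 (200 GB)
Live datanodes (2):
Name: 192.168.1.11:9866
Name: 192.168.1.12:9866'   # hypothetical sample output

# Extract the live-DataNode count from the report header line.
LIVE=$(printf '%s\n' "$REPORT" | sed -n 's/^Live datanodes (\([0-9]*\)).*/\1/p')

if [ "$LIVE" -eq 2 ]; then
  echo "HDFS healthy: $LIVE live DataNodes"
else
  echo "WARNING: only $LIVE live DataNodes" >&2
fi
```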
Best Practices
● Cluster Maintenance:
○ Regularly back up NameNode metadata.
○ Monitor via Cloudera Manager alerts.
● Security:
○ Enable Kerberos for production clusters.
○ Use firewall rules for network security.
Troubleshooting Common Issues
● SSH Failures:
○ Fix: Verify ~/.ssh and ~/.ssh/authorized_keys permissions.
● Service Startup Errors:
○ Fix: Check /var/log/cloudera-scm-server/ logs.
● Resource Constraints:
○ Fix: Increase RAM/CPU allocation.
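The SSH permission fix above can be applied with two `chmod` commands. A sketch demonstrated on a scratch directory so it is safe to run anywhere; point `SSH_DIR` at `~/.ssh` on a real node:

```shell
# Sketch: repair SSH key permissions, a common cause of key-auth failures.
SSH_DIR=/tmp/demo_ssh                 # stand-in for ~/.ssh on a real node
mkdir -p "$SSH_DIR"
touch "$SSH_DIR/authorized_keys"

chmod 700 "$SSH_DIR"                  # directory: owner-only access
chmod 600 "$SSH_DIR/authorized_keys"  # key file: owner read/write only

# Show the resulting octal modes for confirmation.
stat -c '%a %n' "$SSH_DIR" "$SSH_DIR/authorized_keys"
```

sshd refuses key authentication when these files are group- or world-readable, which is why overly permissive modes surface as unexplained password prompts.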
Case Study: Real-World Implementation
Conclusion
➔ Summary
◆ A 3-node Cloudera cluster is ideal for learning and small-scale analytics.
◆ Cloudera Manager simplifies deployment and management.
➔ Next Steps
◆ Scale out with additional worker nodes and enable Kerberos before moving to production.
The End
Thank you