Big Data complete Notes
UNIT Ⅰ
Nature of Data
• Static Data: Historical, does not change frequently (e.g., archived records).
• Dynamic Data: Continuously updated (e.g., stock market data, social media).
Analysis vs Reporting
• Analysis: Examines data to find trends, relationships, and predictions.
• Reporting: Presents past data using dashboards, summaries, and visualizations.
Apache Hadoop
Apache Hadoop is an open-source framework designed to store and process massive amounts
of data using a distributed computing model. It allows multiple machines (nodes) to work
together, making data processing efficient and scalable.
Hadoop is built on three main principles:
• Scalability – Can handle petabytes of data.
• Fault Tolerance – If one node fails, the system continues working.
• Cost Efficiency – Uses commodity hardware (low-cost servers).
Components of Hadoop
Hadoop has four main components:
1. HDFS (Storage Layer) – Stores large data in a distributed way.
2. MapReduce (Processing Layer) – Processes data in parallel.
3. YARN (Resource Management Layer) – Allocates system resources for tasks.
4. Hadoop Common – Provides libraries and utilities for other Hadoop components.
Scaling Out
Hadoop scales out by adding more nodes (machines) instead of increasing the power of a
single machine (scaling up). This ensures better performance and fault tolerance.
Hadoop Ecosystem
Hadoop has an ecosystem of tools that make it more powerful:
• HDFS – Storage
• MapReduce – Processing
• YARN – Resource Management
• Hive – SQL for Big Data
• Pig – Data transformation scripting
• HBase – NoSQL database for Hadoop
• Sqoop – Transfers data between Hadoop & relational databases
• Flume – Collects & stores log data
• Spark – Fast data processing & analytics
• Oozie – Workflow scheduler
MapReduce
What is MapReduce?
MapReduce is a data processing model that breaks down large datasets into smaller parts,
processes them in parallel, and then combines the results. It follows a divide-and-conquer
approach.
MapReduce consists of two main phases:
1. Map Phase: Processes input data and converts it into key-value pairs.
2. Reduce Phase: Aggregates and summarizes the output from the Map phase.
Example: Word Count in a large text file
• Map: Splits each line of text into words and emits (word, 1) pairs.
• Reduce: Sums the counts for each word (see the sketch below).
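A minimal sketch of this word-count job using the Hadoop MapReduce Java API (the class name and input/output paths are illustrative):
```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```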
Failures in MapReduce
Failures can occur in:
1. Map Task Failure – Hadoop retries the task on another node.
2. Reduce Task Failure – The task is restarted on another node.
3. Namenode Failure – The namespace metadata can be restored from the Secondary Namenode's checkpoint (the Secondary Namenode is not an automatic hot standby).
MapReduce Types
1. Simple MapReduce: Basic key-value processing.
2. Chain MapReduce: Output of one job is input to another.
3. Iterative MapReduce: Used in machine learning (e.g., k-means clustering).
MapReduce Features
• Parallel Processing: Handles large data efficiently.
• Fault Tolerance: Automatically recovers from failures.
• Scalability: Easily scales across multiple machines.
2. HDFS Concepts
HDFS splits each file into large blocks (default 128 MB) and replicates every block across multiple DataNodes (default replication factor 3).
✅ Benefits:
• Large blocks reduce metadata and seek overhead for big files.
• Replication provides fault tolerance – data survives DataNode failures.
• Blocks can be processed in parallel on different nodes.
Example:
A 500 MB file is stored in 4 blocks (128 MB × 3 + 116 MB).
Storing Data:
• Files are split into blocks; each block is stored on several DataNodes, while the NameNode keeps only the metadata (the file-to-block mapping).
Reading Data:
1. Client asks the NameNode for the file's block locations.
2. NameNode returns the list of DataNodes holding each block.
3. Client reads the blocks directly from those DataNodes.
Writing Data:
1. Client writes data to HDFS.
2. NameNode assigns DataNodes for storing blocks.
3. Blocks are replicated and stored on different nodes.
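As an illustration of those steps, a minimal Java sketch that writes a file to HDFS (the path and content are illustrative; block placement and replication are handled by the cluster):
```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // loads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // connects to the cluster's default filesystem

        Path path = new Path("/user/data/sample.txt");   // illustrative HDFS path
        try (FSDataOutputStream out = fs.create(path);   // NameNode assigns DataNodes for the blocks
             BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out))) {
            writer.write("Hello HDFS");                  // data is streamed to the DataNodes and replicated
        }
        fs.close();
    }
}
```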
The Hadoop Distributed File System (HDFS) provides a Java API that allows developers to
interact with the file system programmatically. This is useful for developers who want to create,
read, update, or delete files in HDFS using Java code.
• The FileSystem class is part of the Hadoop API and provides methods to perform file
operations on HDFS.
• The Configuration class holds the Hadoop configuration details such as file paths,
block sizes, and replication factors.
• The Path class is used to define the HDFS file path.
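The explanation below walks through a small read program along these lines; a minimal sketch, assuming the default filesystem is HDFS and using an illustrative file path:
```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();                 // loads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                     // connects to HDFS

        Path path = new Path("/user/data/sample.txt");            // illustrative HDFS path
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);                         // print each line of the file
            }
        }
        fs.close();                                               // release resources
    }
}
```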
Explanation of the Code
✅ Configuration conf = new Configuration();
• This creates a Hadoop configuration object that loads the cluster's configuration files
(like core-site.xml and hdfs-site.xml).
✅ FileSystem fs = FileSystem.get(conf);
• Returns a FileSystem object connected to HDFS based on that configuration; all file operations go through it.
✅ System.out.println(line);
• Prints each line read from the HDFS file to the console.
✅ Closing Resources:
• Always close the file streams and the FileSystem object to free resources.
8. Command Line Interface (CLI)
HDFS provides a powerful command-line interface (CLI) to manage files and directories in the Hadoop
cluster. These commands simplify file handling without needing Java code.
✅ List Files
```bash
hdfs dfs -ls /
```
• Lists all files and directories in the root directory (/) of HDFS.
• Example output:
```text
drwxr-xr-x   - user supergroup   4096 2025-02-21 /user/data
-rw-r--r--   3 user supergroup  10240 2025-02-21 /user/data.txt
```
✅ Upload a File
```bash
hdfs dfs -put localfile.txt /hdfs_path/
```
• Example:
```bash
hdfs dfs -put /home/user/data.txt /user/
```
This uploads data.txt from the local file system to the /user/ directory in HDFS.
✅ Download a File
```bash
hdfs dfs -get /hdfs_path/file.txt localfile.txt
```
• Example:
```bash
hdfs dfs -get /user/data.txt /home/user/localdata.txt
```
This downloads data.txt from HDFS to /home/user/localdata.txt on the local file system.
✅ Delete a File
```bash
hdfs dfs -rm /hdfs_path/file.txt
```
• Example:
```bash
hdfs dfs -rm /user/data.txt
```
This deletes the data.txt file from the /user/ directory in HDFS.
Command                       Description
hdfs dfs -ls /                List files in HDFS
hdfs dfs -put localfile /     Upload file to HDFS
hdfs dfs -get /hdfsfile .     Download file from HDFS
hdfs dfs -rm /file.txt        Delete file from HDFS
• Flume: Collects log data from sources (e.g., web servers) and stores it in HDFS.
• Sqoop: Transfers data between HDFS and relational databases (MySQL, PostgreSQL).
Hadoop archives (HAR) help manage a large number of small files efficiently by combining
them into one large file.
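As an illustration, an archive can be created with the hadoop archive tool (the paths below are illustrative):
```bash
# Combine the contents of /user/data into a single archive stored in /user/archives
hadoop archive -archiveName data.har -p /user data /user/archives

# Files inside the archive can still be listed through the har:// scheme
hdfs dfs -ls har:///user/archives/data.har
```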
Hadoop Environment
1. Setting Up a Hadoop Cluster
2. Cluster Specification
3. Starting the Cluster
```sh
start-dfs.sh
start-yarn.sh
```
• start-dfs.sh starts the HDFS daemons (NameNode and DataNodes); start-yarn.sh starts the YARN daemons (ResourceManager and NodeManagers).
4. Hadoop Configuration
5. Security in Hadoop
6. Administering Hadoop
• Monitoring performance.
• Managing storage and replication.
• Handling failures and backups.
Key Components:
1. HDFS (Hadoop Distributed File System): Stores large amounts of data across multiple
machines.
2. YARN (Yet Another Resource Negotiator): Manages and schedules resources for
processing.
3. MapReduce: A programming model for processing large datasets in parallel.
4. HBase: A NoSQL database for storing structured data.
5. Hive: A data warehouse tool that uses SQL-like queries to process big data.
6. Pig: A scripting language for analyzing large datasets.
7. Sqoop: Transfers data between Hadoop and relational databases (MySQL, PostgreSQL).
8. Flume: Collects and transfers log data to Hadoop.
9. Oozie: A workflow scheduler for managing Hadoop jobs.
10. Zookeeper: A coordination service that helps manage distributed applications.
Types of Schedulers:
1. FIFO Scheduler – Runs jobs in the order they are submitted.
2. Capacity Scheduler – Divides cluster capacity into queues so multiple groups can share the cluster.
3. Fair Scheduler – Dynamically balances resources so all running jobs get a fair share over time.
2. HDFS Federation:
• Allows multiple independent NameNodes, each managing a part of the filesystem namespace, so the cluster is no longer limited by the memory of a single NameNode.
3. MapReduce 2 (YARN):
• The new version of MapReduce works within YARN, improving resource management
and job scheduling.
• Older MapReduce applications (MRv1) can still run in Hadoop 2.0 using YARN
compatibility mode.
2. NoSQL Databases
2.1 Introduction to NoSQL
NoSQL (Not Only SQL) databases store and process unstructured or semi-structured data,
unlike traditional relational databases (MySQL, PostgreSQL).
Advantages of NoSQL:
• Flexible schema – documents in the same collection can have different fields.
• Horizontal scalability – data is distributed (sharded) across many servers.
• Handles large volumes of unstructured and semi-structured data.
• High availability through replication.
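For context, the update and delete examples that follow assume a MongoDB users collection; a minimal sketch of inserting and querying such a document (field values are illustrative):
```javascript
// Insert a document into the "users" collection
db.users.insertOne({ "name": "Alice", "age": 25 })

// Find documents matching a condition
db.users.find({ "age": { $gt: 20 } })
```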
Updating a Document:
```javascript
db.users.updateOne({ "name": "Alice" }, { $set: { "age": 26 } })
```
Deleting a Document:
```javascript
db.users.deleteOne({ "name": "Alice" })
```
Creating an Index:
```javascript
db.users.createIndex({ "name": 1 })
```
Creating a Capped Collection:
```javascript
db.createCollection("logs", { capped: true, size: 100000 })
```
• A capped collection has a fixed size and automatically overwrites its oldest documents when the size limit is reached.
4. Apache Spark
4.1 Installing Spark
• Install Java & Scala.
• Download Spark from the official site.
• Start Spark using spark-shell.
```scala
class Animal {
  def sound(): String = "Some sound"
}

class Dog extends Animal {
  override def sound(): String = "Bark"
}
```
✅ The Animal class has a method sound() that returns "Some sound".
✅ The Dog class inherits from Animal and overrides the sound() method to return "Bark".
UNIT Ⅴ
Hadoop Ecosystem Frameworks:
The Hadoop ecosystem consists of several tools that help manage, process, and analyze big data
effectively. Key frameworks include Pig, Hive, HBase, Zookeeper, and IBM Big Data
solutions. Let’s explore each in detail.
1. Pig
1.1 Introduction to Pig:
Apache Pig is a platform for processing large datasets. It simplifies complex MapReduce jobs
using its scripting language called Pig Latin. Pig is ideal for tasks like ETL (Extract, Transform,
Load), data cleansing, and analytics.
• Pig is a high-level platform built on top of Hadoop that simplifies the process of working with
large datasets.
• It uses Pig Latin, a data flow language, to execute complex data transformations.
• Pig abstracts the complexity of writing low-level MapReduce code and makes it easier to
process data.
• Pig is better suited for processing large, unstructured data, while traditional databases are
more for structured data.
• Pig is not a database itself but an abstraction for Hadoop to process big data using a
scripting language, while databases store data in tables and use SQL.
1.4 Grunt:
Grunt is the interactive shell in Pig, where you can run commands and execute Pig Latin scripts directly.
1.5 Pig Latin:
• Pig Latin is a simple scripting language used to express data transformations in Pig.
• It supports data transformations, such as LOAD, FILTER, GROUP, JOIN, STORE, etc.
• Example (the file name and schema are illustrative):
```pig
students = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, age:int);
adults = FILTER students BY age > 20;
STORE adults INTO 'output';
```
1.6 User-Defined Functions (UDFs):
• Pig allows you to create custom functions (UDFs) to extend its capabilities. These are written in
Java, Python, or JavaScript.
2. Hive
2.1 Apache Hive Architecture
Hive is a data warehouse system that facilitates querying and managing large datasets using
HiveQL (a SQL-like language). Hive's architecture includes:
• Metastore – stores metadata about tables, columns, and partitions.
• Driver – manages the lifecycle of a HiveQL query.
• Compiler – translates HiveQL into a plan of MapReduce (or Tez/Spark) jobs.
• Execution Engine – runs the compiled plan on the Hadoop cluster.
• Clients (CLI, JDBC/ODBC via HiveServer2) – interfaces for submitting queries.
• Example:
```bash
hive> SELECT * FROM employees;
```
• Hive Metastore stores metadata about the structure of data, like tables, columns, partitions, etc. It
is necessary for Hive to perform queries efficiently.
• Hive uses Hadoop for distributed processing, which makes it better for handling large amounts
of data compared to traditional databases.
• Hive is not suitable for transactional processing (like RDBMS), but it’s great for batch
processing and analytics.
Example:
```sql
SELECT * FROM students WHERE age > 20;
```
2.7 Tables, Querying, and UDFs:
• In Hive, you create tables using CREATE TABLE, LOAD DATA to load data, and use
HiveQL to query data.
o Example (the table name and file path are illustrative):
```sql
CREATE TABLE employees (id INT, name STRING, salary FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/user/employees.csv' INTO TABLE employees;

SELECT name, salary FROM employees WHERE salary > 50000;
```
Hive supports sorting, aggregating, joins, and subqueries to analyze large datasets.
• Sorting: ORDER BY
• Aggregation: GROUP BY
• Joins: Inner, Outer, and Cross joins
• Subqueries: Nested queries for advanced data retrieval.
Example:
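A minimal HiveQL sketch that combines a join, aggregation, and sorting (the table and column names are illustrative):
```sql
SELECT d.dept_name, AVG(e.salary) AS avg_salary
FROM employees e
JOIN departments d ON (e.dept_id = d.dept_id)
GROUP BY d.dept_name
ORDER BY avg_salary DESC;
```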
3. HBase:
3.1 HBase Concepts
Apache HBase is a NoSQL database that stores structured data in column-oriented format. It
excels in fast read/write operations and handles massive amounts of data efficiently.
• HBase is a NoSQL database that runs on top of Hadoop and stores large amounts of
structured data in the form of key-value pairs.
• HBase is designed for random access and allows fast read/write operations on very large
datasets.
3.2 HBase Clients:
HBase provides Java client APIs to interact with tables and perform CRUD (Create, Read,
Update, Delete) operations such as put, get, delete, and scan.
Example:
```java
// Assumes an open org.apache.hadoop.hbase.client.Connection named "connection"
Table table = connection.getTable(TableName.valueOf("employees"));
Put put = new Put(Bytes.toBytes("row1"));
put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
table.put(put);
table.close();
```
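For completeness, a minimal sketch of reading the value back with a Get, using the same table, row key, and column family as the Put example above (cluster settings are assumed to come from hbase-site.xml on the classpath):
```java
// Open a connection and read one cell back
Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
Table table = connection.getTable(TableName.valueOf("employees"));
Get get = new Get(Bytes.toBytes("row1"));
Result result = table.get(get);
byte[] value = result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("name"));
System.out.println(Bytes.toString(value));   // prints "Alice" if the Put above succeeded
table.close();
connection.close();
```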
• HBase is suitable for large, unstructured data and offers real-time random access.
• RDBMS are better suited for structured data with complex queries and relationships.
4. Zookeeper:
4.1 How Zookeeper Helps in Cluster Monitoring
Zookeeper is used to build distributed applications by providing primitives such as leader election,
distributed locks, and barriers to coordinate distributed processes. Common use cases include:
• Distributed locks
• Queues for task distribution
• Cluster coordination for failover and recovery
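As an illustration, a minimal Java sketch using the Zookeeper client API to create an ephemeral znode, the building block behind locks and leader election (the ensemble address and znode path are illustrative):
```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkLockSketch {
    public static void main(String[] args) throws Exception {
        // Connect to a Zookeeper ensemble (address is illustrative)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

        // Create an ephemeral, sequential znode; it is removed automatically if this
        // client's session ends, which is what makes locks and leader election work
        String path = zk.create("/lock-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("Created znode: " + path);

        zk.close();
    }
}
```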
5. IBM Big Data Solutions:
• InfoSphere is IBM's suite of data integration tools for working with large-scale data.
• BigInsights is IBM's Hadoop-based platform for storing and analyzing big data. It combines
Hadoop, MapReduce, and analytics tools in a unified environment.
• BigSheets is a browser-based, spreadsheet-style tool for exploring and analyzing big data, and it is
part of the BigInsights suite.
6. Big SQL:
6.1 Introduction to Big SQL
• Big SQL is an SQL engine built on top of Hadoop and IBM BigInsights that provides
high-performance queries for big data.
• It allows users to write SQL queries against data stored in Hadoop and NoSQL
databases.
IBM's Big SQL is a powerful SQL engine designed for querying large-scale data stored in
Hadoop and other big data platforms. It supports: