
BIG DATA ANALYTICS

UNIT Ⅰ

Introduction to Big Data


Big Data refers to massive amounts of data that cannot be handled efficiently using traditional
data processing methods. It includes structured, unstructured, and semi-structured data, which
are generated from various sources like social media, business transactions, IoT devices, and
sensors.

Types of Digital Data


1. Structured Data: Organized in a predefined format, typically stored in databases (e.g.,
relational databases like MySQL, Oracle).
2. Unstructured Data: Does not have a fixed format, making it more complex to analyze
(e.g., emails, videos, social media posts).
3. Semi-structured Data: A mix of both, having some structure but not completely
organized (e.g., JSON, XML, CSV files).

History of Big Data Innovation


• 1960s-70s: Early database systems such as IBM’s hierarchical IMS database were developed.
• 1980s: Relational databases (RDBMS) became commercially widespread, making data storage and
retrieval more efficient.
• 1990s: Data Warehousing and Business Intelligence (BI) became popular, helping
organizations analyze past data.
• 2000s: Google introduced MapReduce, which allowed processing of large-scale data in
distributed environments, leading to the rise of Hadoop and Big Data technologies.
• 2010s-Present: Cloud computing, artificial intelligence, and machine learning
revolutionized Big Data analytics. Real-time analytics tools like Apache Spark became
popular.

Introduction to Big Data Platform


A Big Data Platform is a collection of tools, frameworks, and technologies designed to manage
large datasets efficiently. It includes:
• Data Storage: HDFS, NoSQL databases (MongoDB, Cassandra).
• Data Processing: Hadoop, Spark.
• Data Analysis & Visualization: Python, R, Tableau, Power BI.

Drivers for Big Data Growth


The main reasons why Big Data has become important:
• Explosion of Data Sources: social media, IoT devices, and online activities generate
vast amounts of data.
• Advanced Computing Power: Cloud computing and distributed systems allow
processing of large-scale data.
• AI & Machine Learning Integration: Big Data fuels AI-driven decision-making.
• Business Demand: Companies want to extract valuable insights for better decision-
making.

Big Data Architecture and Characteristics


Big Data Architecture Components:
1. Data Sources: social media, IoT devices, logs, sensors.
2. Data Storage: Hadoop Distributed File System (HDFS), NoSQL databases.
3. Data Processing: Batch processing (Hadoop), real-time processing (Spark, Flink).
4. Data Analysis & Visualization: Tools like Python, R, Power BI, Tableau.
Big Data Characteristics (5 Vs):
1. Volume: Extremely large amounts of data (terabytes, petabytes).
2. Velocity: Data is generated at high speed (real-time data streams).
3. Variety: Different types of data (structured, unstructured, semi-structured).
4. Veracity: Ensuring data accuracy and reliability.
5. Value: Extracting useful insights from data.

Big Data Technology Components


• Storage: HDFS, Amazon S3, Google Cloud Storage.
• Processing: Hadoop, Spark, Flink.
• Databases: NoSQL databases (MongoDB, Cassandra, HBase).
• Analytics & Visualization: Python, R, Power BI, Tableau.

Big Data Importance and Applications


Big Data is crucial in various industries:
• Healthcare: Disease prediction, personalized treatments.
• Finance: Fraud detection, risk management.
• Retail: Customer behavior analysis, product recommendations.
• Social Media: Sentiment analysis, targeted ads.

Big Data Features – Security, Compliance, Auditing, Protection


• Security: Protecting data from breaches and cyberattacks.
• Compliance: Adhering to laws like GDPR (Europe) and HIPAA (Healthcare).
• Auditing: Monitoring data access and changes.
• Protection: Preventing unauthorized access using encryption, authentication.

Big Data Privacy and Ethics


• Privacy Concerns: Companies must ensure users’ personal data is not misused.
• Ethical Use of AI & Data: Avoiding biased decision-making in AI algorithms.
• Transparency: Organizations must be clear about how they collect and use data.

Big Data Analytics


Big Data Analytics is the process of examining large datasets to uncover patterns, trends, and
insights.
• Descriptive Analytics: Analyzes past data for reporting.
• Predictive Analytics: Uses machine learning to predict future trends.
• Prescriptive Analytics: Suggests actions based on data insights.
Challenges of Conventional Systems
Traditional databases struggle with:
• Handling extremely large data volumes.
• Processing real-time data efficiently.
• Managing diverse data types (structured, unstructured).

Intelligent Data Analysis


Uses AI and machine learning to automate data analysis. Examples:
• Chatbots analyzing customer queries.
• Recommendation engines suggesting products.

Nature of Data
• Static Data: Historical, does not change frequently (e.g., archived records).
• Dynamic Data: Continuously updated (e.g., stock market data, social media).

Analytic Processes & Tools


• Data Collection → Cleaning → Processing → Visualization → Decision-Making.
• Popular Tools: Python, R, SQL, Tableau, Power BI, Hadoop, Spark.

Analysis vs Reporting
• Analysis: Examines data to find trends, relationships, and predictions.
• Reporting: Presents past data using dashboards, summaries, and visualizations.

Modern Data Analytics Tools


1. Hadoop & Spark – For distributed Big Data processing.
2. Tableau & Power BI – For interactive data visualization.
3. Python & R – For machine learning and AI-driven analytics.
UNIT Ⅱ
Hadoop
History of Hadoop
Hadoop was created by Doug Cutting and Mike Cafarella in 2006. It was inspired by
Google’s MapReduce and Google File System (GFS) papers, which described how Google handled
large-scale data processing.
Hadoop began as part of the Nutch search engine project; Yahoo! later invested heavily in its
development, and it became an open-source framework under the Apache Software Foundation.
Today, Hadoop is widely used for processing big data.

Apache Hadoop
Apache Hadoop is an open-source framework designed to store and process massive amounts
of data using a distributed computing model. It allows multiple machines (nodes) to work
together, making data processing efficient and scalable.
Hadoop is built on three main principles:
• Scalability – Can handle petabytes of data.
• Fault Tolerance – If one node fails, the system continues working.
• Cost Efficiency – Uses commodity hardware (low-cost servers).

Hadoop Distributed File System (HDFS)


HDFS is a storage system in Hadoop that splits large files into smaller chunks and distributes
them across multiple machines (nodes).
Key Features of HDFS:
1. Distributed Storage – Data is stored across multiple machines.
2. Fault Tolerance – Data is replicated (default: 3 copies) to prevent data loss.
3. High Throughput – Allows fast read/write operations.
4. Write Once, Read Many – Data is usually written once but read multiple times.
HDFS Architecture:
• Namenode: Manages metadata (file locations, replicas).
• Datanodes: Store actual data blocks.
• Secondary Namenode: Helps in maintaining Namenode checkpoints.

Components of Hadoop
Hadoop has four main components:
1. HDFS (Storage Layer) – Stores large data in a distributed way.
2. MapReduce (Processing Layer) – Processes data in parallel.
3. YARN (Resource Management Layer) – Allocates system resources for tasks.
4. Hadoop Common – Provides libraries and utilities for other Hadoop components.

Data Format in Hadoop


Data in Hadoop can be stored in various formats:
• Text Format: Simple and easy (CSV, JSON).
• Sequence File: Binary file format used for fast processing.
• Avro & Parquet: Used for efficient data storage and schema evolution.

Analyzing Data with Hadoop


To analyze data with Hadoop, we use MapReduce, Hive, Pig, and Spark.
• MapReduce – Batch processing of large datasets.
• Hive – SQL-like queries for Big Data.
• Pig – A scripting language for data transformation.
• Spark – Faster alternative to MapReduce for real-time analytics.

Scaling Out
Hadoop scales out by adding more nodes (machines) instead of increasing the power of a
single machine (scaling up). This ensures better performance and fault tolerance.

Hadoop Streaming & Pipes


• Hadoop Streaming: Allows developers to write MapReduce jobs in any language
(Python, Perl) instead of Java.
• Hadoop Pipes: Enables writing MapReduce programs in C++ using APIs.

Hadoop Ecosystem
Hadoop has an ecosystem of tools that make it more powerful:
• HDFS – Storage
• MapReduce – Processing
• YARN – Resource Management
• Hive – SQL for Big Data
• Pig – Data transformation scripting
• HBase – NoSQL database for Hadoop
• Sqoop – Transfers data between Hadoop & relational databases
• Flume – Collects & stores log data
• Spark – Fast data processing & analytics
• Oozie – Workflow scheduler

MapReduce
What is MapReduce?
MapReduce is a data processing model that breaks down large datasets into smaller parts,
processes them in parallel, and then combines the results. It follows a divide-and-conquer
approach.
MapReduce consists of two main phases:
1. Map Phase: Processes input data and converts it into key-value pairs.
2. Reduce Phase: Aggregates and summarizes the output from the Map phase.
Example: Word Count in a large text file (see the Java sketch below)
• Map: Splits each line into words and emits a (word, 1) pair for every word.
• Reduce: Aggregates the counts for each word.
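
To make the two phases concrete, here is a minimal word-count sketch using the standard Hadoop MapReduce Java API. The class names (WordCountMapper, WordCountReducer) are illustrative choices, not part of the original notes.

java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every word in a line of input.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // key-value pair: (word, 1)
        }
    }
}

// Reduce phase: sum the counts that arrive for each word after shuffle & sort.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum)); // final (word, total) pair
    }
}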

How MapReduce Works


1. Input Split: The data is divided into chunks (blocks).
2. Map Function: Processes each block and generates key-value pairs.
3. Shuffle & Sort: Groups similar keys together.
4. Reduce Function: Aggregates and produces the final result.

Developing a MapReduce Application


A typical MapReduce application involves (see the driver sketch below):
• Writing Mapper and Reducer classes in Java.
• Specifying the input format (e.g., Text, SequenceFile).
• Running the job on Hadoop from a driver class.
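
A minimal driver sketch is shown below, assuming the WordCountMapper and WordCountReducer classes sketched earlier; input and output paths are passed as command-line arguments.

java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // args[0] = input path in HDFS, args[1] = output path (must not already exist)
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait; exit code 0 on success
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The packaged job would then typically be submitted with something like hadoop jar wordcount.jar WordCountDriver /input /output (the jar name and paths are illustrative).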

Unit Tests with MRUnit


• MRUnit is a testing framework for MapReduce applications.
• It allows developers to test Map and Reduce functions independently before deploying to
Hadoop (a small test sketch follows below).
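
As an illustration, a small MRUnit test of the word-count Mapper sketched earlier might look like this (it assumes the MRUnit and JUnit libraries are on the classpath):

java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // Wrap the mapper under test in an MRUnit MapDriver
        mapDriver = MapDriver.newMapDriver(new WordCountMapper());
    }

    @Test
    public void testMapperEmitsOnePairPerWord() throws Exception {
        mapDriver.withInput(new LongWritable(0), new Text("big data big"));
        // Expected output pairs, in the order the mapper emits them
        mapDriver.withOutput(new Text("big"), new IntWritable(1));
        mapDriver.withOutput(new Text("data"), new IntWritable(1));
        mapDriver.withOutput(new Text("big"), new IntWritable(1));
        mapDriver.runTest();
    }
}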

Test Data & Local Tests


• Developers use small datasets to test their MapReduce jobs locally before running on a
Hadoop cluster.

Anatomy of a MapReduce Job Run


A MapReduce job consists of:
• Job Submission – Client submits the job.
• Job Initialization – YARN schedules resources.
• Map Execution – Data is split and processed.
• Shuffle & Sort – Intermediate data is sorted.
• Reduce Execution – Final results are produced.
• Job Completion – The job finishes and results are stored.

Failures in MapReduce
Failures can occur in:
1. Map Task Failure – Hadoop retries the task on another node.
2. Reduce Task Failure – The task is restarted on another node.
3. Namenode Failure – The filesystem metadata can be restored from the Secondary Namenode’s checkpoints (this is not an automatic failover).

Job Scheduling in MapReduce


Hadoop has multiple job schedulers:
• FIFO Scheduler: Jobs are processed in order.
• Fair Scheduler: Resources are shared fairly among users.
• Capacity Scheduler: Assigns priority to jobs based on demand.

Shuffle and Sort


After the Map phase, data is shuffled and sorted to group similar keys before being processed
by the Reduce phase.

Task Execution in MapReduce


Each job consists of multiple tasks:
• Map Tasks: Process data and produce key-value pairs.
• Reduce Tasks: Aggregate and produce final results.

MapReduce Types
1. Simple MapReduce: Basic key-value processing.
2. Chain MapReduce: Output of one job is input to another.
3. Iterative MapReduce: Used in machine learning (e.g., k-means clustering).

Input & Output Formats in MapReduce


• Input Formats:
o TextInputFormat (default, line-based input)
o KeyValueTextInputFormat (key-value input)
o SequenceFileInputFormat (binary file input)
• Output Formats (input and output formats are set on the Job object, as shown in the snippet after this list):
o TextOutputFormat (default, text output)
o SequenceFileOutputFormat (binary output)
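
For illustration, here is a small sketch of switching a job away from the default formats; the class and method names in this helper are illustrative, not from the notes.

java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class FormatConfig {

    // Read tab-separated key/value lines instead of plain lines,
    // and write the results as a binary SequenceFile.
    public static void configureFormats(Job job) {
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
    }
}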

MapReduce Features
• Parallel Processing: Handles large data efficiently.
• Fault Tolerance: Automatically recovers from failures.
• Scalability: Easily scales across multiple machines.

Real-World Applications of MapReduce


• Log Analysis: Analyzing server logs.
• Recommendation Systems: Used by Netflix, Amazon, and YouTube.
• Social Media Analysis: Platforms such as Facebook and Twitter use MapReduce for user insights.
• Fraud Detection: Banks analyze transactions to detect fraud.
UNIT Ⅲ

HDFS (Hadoop Distributed File System)


1. Design of HDFS
HDFS is a distributed file system specifically designed to store large-scale datasets and run on
commodity hardware. It follows the Write Once, Read Many model, meaning that once data
is written, it cannot be modified but can be read multiple times.

The main goals of HDFS are:

• Scalability – Can handle petabytes of data.


• Fault Tolerance – Data is replicated to prevent loss.
• High Throughput – Supports parallel processing.
• Cost Efficiency – Uses commodity hardware (low-cost servers).

HDFS follows a master-slave architecture, with:

• NameNode (Master): Manages metadata (file locations, structure).


• DataNodes (Slaves): Store actual data blocks.
• Secondary NameNode: Takes periodic snapshots of NameNode metadata.

2. HDFS Concepts

• Blocks: Data is divided into blocks (default: 128 MB).


• Replication: Each block is replicated (default: 3 copies) to prevent data loss.
• Write Once, Read Many: Data in HDFS is not modified after writing.
• Rack Awareness: Ensures copies of data are distributed across different racks (groups of
nodes) to improve fault tolerance.

3. Benefits and Challenges of HDFS

✅ Benefits:

• Stores and processes huge datasets efficiently.


• Supports parallel processing using MapReduce.
• Provides fault tolerance via data replication.
• Works well with commodity hardware (low-cost servers).
❌ Challenges:

• High Latency: Not ideal for real-time data processing.


• Limited File Modification: Files cannot be updated once written.
• Requires Manual Tuning: Optimizing performance needs configuration.

4. File Sizes, Block Sizes, and Block Abstraction in HDFS

• File Size: HDFS is optimized for large files (GB to PB scale).


• Block Size: Default is 128 MB, but can be configured.
• Block Abstraction: Large files are split into blocks, and each block is stored across
multiple nodes.

Example:
A 500 MB file is stored in 4 blocks (128 MB × 3 + 116 MB).

5. Data Replication in HDFS

• Default replication factor is 3 (each block is stored on 3 different nodes).


• Improves fault tolerance (if a node fails, data is available from another copy).
• Replication strategy (default rack awareness policy):
o 1st copy → a node on the writer’s rack
o 2nd copy → a node on a different rack
o 3rd copy → a different node on the same rack as the second copy

6. How HDFS Stores, Reads, and Writes Files

Storing Data:

1. A client submits a file to HDFS.


2. The file is split into blocks and stored in different DataNodes.
3. The NameNode stores metadata about block locations.

Reading Data:

1. Client requests a file.


2. NameNode provides the block locations.
3. Data is retrieved in parallel from multiple DataNodes.

Writing Data:
1. Client writes data to HDFS.
2. NameNode assigns DataNodes for storing blocks.
3. Blocks are replicated and stored on different nodes.

7. Java Interfaces to HDFS

The Hadoop Distributed File System (HDFS) provides a Java API that allows developers to
interact with the file system programmatically. This is useful for developers who want to create,
read, update, or delete files in HDFS using Java code.

Key Java Class: FileSystem

• The FileSystem class is part of the Hadoop API and provides methods to perform file
operations on HDFS.
• The Configuration class holds the Hadoop configuration details such as file paths,
block sizes, and replication factors.
• The Path class is used to define the HDFS file path.

Java Code Example - Reading a File from HDFS


java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataInputStream;
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class HDFSReadExample {

    public static void main(String[] args) {
        try {
            // Step 1: Create Configuration Object
            Configuration conf = new Configuration();

            // Step 2: Access the HDFS FileSystem
            FileSystem fs = FileSystem.get(conf);

            // Step 3: Define the File Path in HDFS
            Path filePath = new Path("/user/data.txt");

            // Step 4: Open the File
            FSDataInputStream inputStream = fs.open(filePath);

            // Step 5: Read and Display File Content
            BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // Print each line from the file
            }

            // Step 6: Close Resources
            reader.close();
            inputStream.close();
            fs.close();
        } catch (Exception e) {
            e.printStackTrace(); // Handle exceptions
        }
    }
}

Explanation of the Code

✅ Configuration conf = new Configuration();

• This creates a Hadoop configuration object that loads the cluster’s configuration files
(like core-site.xml and hdfs-site.xml).

✅ FileSystem fs = FileSystem.get(conf);

• Establishes a connection to the Hadoop filesystem.

✅ Path filePath = new Path("/user/data.txt");

• Defines the path to the file located in HDFS.

✅ FSDataInputStream inputStream = fs.open(filePath);

• Opens the file for reading.

✅ BufferedReader & InputStreamReader:

• Used to read the content of the file line by line.

✅ System.out.println(line);

• Prints the file’s content to the console.

✅ Closing Resources:

• Always close the file streams and the FileSystem object to free resources.
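
A matching sketch for writing a file with the same FileSystem API is shown below; the path /user/output.txt is just an illustrative example, not from the notes.

java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HDFSWriteExample {
    public static void main(String[] args) {
        try {
            // Load the Hadoop configuration (core-site.xml, hdfs-site.xml)
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Illustrative target path in HDFS
            Path filePath = new Path("/user/output.txt");

            // Create the file (overwrites it if it already exists) and write a line
            FSDataOutputStream outputStream = fs.create(filePath);
            outputStream.writeBytes("Hello HDFS\n");

            // Close resources
            outputStream.close();
            fs.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}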
8. Command Line Interface (CLI)

HDFS provides a powerful command-line interface (CLI) to manage files and directories in the Hadoop
cluster. These commands simplify file handling without needing Java code.

HDFS can be accessed via the command line:

• List files: hdfs dfs -ls /


• Upload file: hdfs dfs -put localfile.txt /hdfs_path/
• Download file: hdfs dfs -get /hdfs_path/file.txt localfile.txt
• Delete file: hdfs dfs -rm /hdfs_path/file.txt

Common HDFS CLI Commands

✅ List Files

bash
hdfs dfs -ls /

• Lists all files and directories in the root directory (/) of HDFS.
• Example output:

drwxr-xr-x   - user supergroup    4096  2025-02-21  /user/data
-rw-r--r--   3 user supergroup   10240  2025-02-21  /user/data.txt

✅ Upload File to HDFS

bash
hdfs dfs -put localfile.txt /hdfs_path/

• localfile.txt (file on your local machine) will be uploaded to /hdfs_path/ in HDFS.


• Example:

bash
hdfs dfs -put /home/user/data.txt /user/

This uploads data.txt to /user/ in HDFS.

✅ Download File from HDFS

bash
hdfs dfs -get /hdfs_path/file.txt localfile.txt

• Copies a file from HDFS to your local system.


• Example:

bash
hdfs dfs -get /user/data.txt /home/user/localdata.txt

This downloads data.txt from HDFS to /home/user/localdata.txt.

✅ Delete a File in HDFS

bash
hdfs dfs -rm /hdfs_path/file.txt

• Deletes a file in HDFS.


• Example:

bash
hdfs dfs -rm /user/data.txt

This deletes the data.txt file from the /user/ directory in HDFS.

Summary of Commands Table

Command                         Description
hdfs dfs -ls /                  List files in HDFS
hdfs dfs -put localfile /       Upload a file to HDFS
hdfs dfs -get /hdfsfile .       Download a file from HDFS
hdfs dfs -rm /file.txt          Delete a file from HDFS

9. Hadoop File System Interfaces


Hadoop supports multiple storage backends:

• HDFS – Default Hadoop File System.


• Local File System – Used for testing.
• Amazon S3 – Cloud-based storage.
• Azure Blob Storage – Microsoft cloud storage.

10. Data Flow in HDFS

• Input: Data is ingested using Flume/Sqoop.


• Storage: Data is stored in HDFS.
• Processing: Data is processed using MapReduce/Spark.
• Output: Results are saved to HDFS or databases.
11. Data Ingest with Flume and Sqoop

• Flume: Collects log data from sources (e.g., web servers) and stores it in HDFS.
• Sqoop: Transfers data between HDFS and relational databases (MySQL, PostgreSQL).

12. Hadoop Archives

Hadoop archives (HAR) help manage a large number of small files efficiently by combining
them into one large file.

13. Hadoop I/O

Hadoop I/O optimizes data storage and retrieval using:

• Compression: Reduces file size (Gzip, Snappy, LZO).
• Serialization: Converts objects into byte streams (see the Writable sketch below).
• Avro: A format for storing structured data with schemas.
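
As a small illustration of Hadoop serialization, here is a sketch of a custom Writable record; the PageView type and its fields are invented for this example.

java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// A custom record that Hadoop can serialize to a compact byte stream.
public class PageView implements Writable {
    private final Text url = new Text();
    private final IntWritable hits = new IntWritable();

    public void set(String u, int h) {
        url.set(u);
        hits.set(h);
    }

    @Override
    public void write(DataOutput out) throws IOException { // serialize
        url.write(out);
        hits.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialize
        url.readFields(in);
        hits.readFields(in);
    }
}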

Hadoop Environment
1. Setting Up a Hadoop Cluster

A Hadoop cluster consists of:

• Master Node: Runs NameNode and ResourceManager.


• Worker Nodes: Run DataNodes and NodeManagers (TaskTrackers in the older MRv1).

2. Cluster Specification

• Hardware: Typically at least 16 GB RAM and an 8-core CPU per node, with an SSD recommended for the NameNode.
• Network: High-speed networking for better performance.
• Storage: Disk capacity sized to the dataset and replication factor; large clusters can reach hundreds of terabytes.
3. Cluster Setup and Installation

1. Install Java and Hadoop on all nodes.


2. Configure HDFS and YARN.
3. Set up password-less SSH between nodes.
4. Start Hadoop services using:

sh

start-dfs.sh
start-yarn.sh

4. Hadoop Configuration

• core-site.xml: General Hadoop settings.


• hdfs-site.xml: Configures NameNode, DataNode, replication.
• yarn-site.xml: Configures YARN (job scheduling).
• mapred-site.xml: Configures MapReduce settings.

5. Security in Hadoop

• Kerberos Authentication: Ensures secure access control.


• Data Encryption: Protects sensitive data.
• Access Control Lists (ACLs): Restrict file access permissions.

6. Administering Hadoop

Hadoop administrators manage the cluster by:

• Monitoring performance.
• Managing storage and replication.
• Handling failures and backups.

7. HDFS Monitoring & Maintenance

• Use hdfs fsck / to check file system health.
• Check running daemons with jps and cluster status with hdfs dfsadmin -report.
• Use Ambari or Cloudera Manager for GUI-based monitoring.
8. Hadoop Benchmarks

Benchmarking tools for testing Hadoop performance:

• TestDFSIO: Measures HDFS read/write speed.


• TeraSort: Tests MapReduce performance.

9. Hadoop in the Cloud

Hadoop can be deployed in cloud platforms like:

• AWS EMR (Elastic MapReduce)


• Google Cloud Dataproc
• Azure HDInsight
UNIT Ⅳ

1. Hadoop Ecosystem and YARN


1.1 Hadoop Ecosystem Components
The Hadoop Ecosystem consists of several tools that work together for storing, processing, and
analyzing big data.

Key Components:

1. HDFS (Hadoop Distributed File System): Stores large amounts of data across multiple
machines.
2. YARN (Yet Another Resource Negotiator): Manages and schedules resources for
processing.
3. MapReduce: A programming model for processing large datasets in parallel.
4. HBase: A NoSQL database for storing structured data.
5. Hive: A data warehouse tool that uses SQL-like queries to process big data.
6. Pig: A scripting language for analyzing large datasets.
7. Sqoop: Transfers data between Hadoop and relational databases (MySQL, PostgreSQL).
8. Flume: Collects and transfers log data to Hadoop.
9. Oozie: A workflow scheduler for managing Hadoop jobs.
10. Zookeeper: A coordination service that helps manage distributed applications.

1.2 Schedulers in Hadoop


Schedulers in Hadoop determine how computing resources are allocated among different jobs.

Types of Schedulers:

1. FIFO (First In First Out) Scheduler:


o Executes jobs in the order they arrive.
o Not suitable for multi-user environments.
2. Fair Scheduler:
o Ensures fair distribution of resources among users.
o If one user has fewer jobs, other users can utilize the extra resources.
3. Capacity Scheduler:
o Divides resources into queues for multiple users.
o Ensures efficient resource utilization in large organizations.
1.3 Hadoop 2.0 New Features
1. NameNode High Availability:

• Hadoop 2.0 introduces a Standby NameNode to avoid a single point of failure.
• If the active NameNode fails, the standby takes over automatically.

2. HDFS Federation:

• Instead of a single NameNode managing all metadata, multiple NameNodes handle


different parts of the file system.
• Improves performance and scalability.

3. MRv2 (MapReduce Version 2):

• The new version of MapReduce works within YARN, improving resource management
and job scheduling.

4. YARN (Yet Another Resource Negotiator):

• Manages cluster resources separately from application processing.


• Allows different frameworks (MapReduce, Spark, Tez) to run on the same Hadoop
cluster.

5. Running MRv1 in YARN:

• Older MapReduce applications (MRv1) can still run in Hadoop 2.0 using YARN
compatibility mode.

2. NoSQL Databases
2.1 Introduction to NoSQL
NoSQL (Not Only SQL) databases store and process unstructured or semi-structured data,
unlike traditional relational databases (MySQL, PostgreSQL).

Advantages of NoSQL:

✅ Scalability: Easily handles massive datasets.


✅ Flexibility: Can store different types of data (JSON, XML, key-value pairs).
✅ Performance: Often faster read/write operations than relational databases for simple, large-scale workloads.
✅ Schema-less: No need for predefined table structures.

Types of NoSQL Databases:

1. Key-Value Stores (e.g., Redis, DynamoDB) – Stores data as key-value pairs.


2. Document Stores (e.g., MongoDB, CouchDB) – Stores data as JSON/BSON
documents.
3. Column-Family Stores (e.g., HBase, Cassandra) – Stores data in column-based format.
4. Graph Databases (e.g., Neo4j, ArangoDB) – Stores relationships between data nodes.

3. MongoDB (NoSQL Database)


3.1 Introduction to MongoDB
MongoDB is a popular document-oriented NoSQL database that stores data in JSON-like
format (BSON).

3.2 Data Types in MongoDB


• String: "Hello World"
• Integer: 12345
• Boolean: true / false
• Array: ["Apple", "Banana", "Cherry"]
• Embedded Documents: { "name": "John", "address": { "city": "NY", "zip":
"10001" } }

3.3 Creating, Updating, and Deleting Documents


Creating a Document:
db.users.insertOne({ "name": "Alice", "age": 25, "city": "New York" })

✅ insertOne() is used to insert a single document into the users collection.


✅ The document {"name": "Alice", "age": 25, "city": "New York"} is stored in the
database.

Updating a Document:
javascript
db.users.updateOne({ "name": "Alice" }, { $set: { "age": 26 } })

✅ updateOne() updates the first matching document.


✅ The query { "name": "Alice" } finds the document with "name" as "Alice".
✅ The $set operator modifies the "age" field to 26.

Deleting a Document:
javascript

db.users.deleteOne({ "name": "Alice" })

✅ deleteOne() removes the first document that matches the condition.


✅ Here, the document with "name" as "Alice" is deleted.

3.4 Querying in MongoDB


• Find all users: db.users.find()
• Find a specific user: db.users.find({ "name": "Alice" })

3.5 Indexing in MongoDB


Indexes speed up queries. Example:

javascript

db.users.createIndex({ "name": 1 })

✅ createIndex() improves query performance by indexing the "name" field in ascending


order (1).
✅ Indexing is crucial for faster data retrieval in large datasets.

3.6 Capped Collections


Capped collections are fixed-size collections that automatically overwrite the oldest records when
new records are added.
For example:

javascript
db.createCollection("logs", { capped: true, size: 100000 })

✅ The logs collection is created with a maximum size of 100 KB.


✅ Older entries will automatically be deleted as new entries are added.

4. Apache Spark
4.1 Installing Spark
• Install Java & Scala.
• Download Spark from the official site.
• Start Spark using spark-shell.

4.2 Spark Components


• Driver Program: Manages Spark application.
• Cluster Manager: Manages resources.
• Executors: Perform tasks.

4.3 Spark Applications, Jobs, Stages, and Tasks


• Application: User-defined program (e.g., data processing).
• Job: Execution of an action (e.g., count()).
• Stage: A set of tasks executed in parallel.
• Task: A unit of execution in Spark.

4.4 Resilient Distributed Dataset (RDD)


RDD is the core data structure in Spark that supports fault tolerance and parallel processing.
Example:

scala

val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
val squaredRDD = rdd.map(x => x * x)

✅ sc.parallelize() creates an RDD from a sequence of numbers [1, 2, 3, 4, 5].


✅ The .map() function applies the provided logic (x * x) to each element in the RDD.
✅ The result will be [1, 4, 9, 16, 25].

4.5 Anatomy of a Spark Job Run


1. The user submits a Spark job.
2. The Driver Program creates an RDD.
3. Tasks are distributed across Executors.
4. Results are collected.

4.6 Spark on YARN


• Spark can run on YARN (Hadoop's resource manager) to leverage cluster resources.
• Allows Spark to co-exist with Hadoop jobs.

5. Scala (Programming Language for Spark)


5.1 Introduction to Scala
Scala is a functional and object-oriented language commonly used with Apache Spark.

5.2 Classes and Objects


scala

class Person(val name: String, val age: Int)
val p = new Person("Alice", 25)

✅ The class keyword defines a class called Person with two parameters (name and age).
✅ The val keyword ensures these values are immutable (cannot be changed).
✅ The line val p = new Person("Alice", 25) creates an object p of the Person class.

5.3 Basic Types and Operators


val x: Int = 10
val y: Double = 20.5
val z: Boolean = true

5.4 Control Structures


if (x > 5) println("x is greater than 5")

✅ The if statement checks if x is greater than 5.


✅ If true, "x is greater than 5" is printed.

5.5 Functions and Closures


def square(x: Int): Int = x * x
println(square(5))

✅ The def keyword defines a function named square.


✅ It takes one parameter x and returns its square.
✅ println(square(5)) prints 25.

5.6 Inheritance in Scala


class Animal { def sound() = "Some sound" }
class Dog extends Animal { override def sound() = "Bark" }

✅ The Animal class has a method sound() that returns "Some sound".
✅ The Dog class inherits from Animal and overrides the sound() method to return "Bark".
UNIT Ⅴ
Hadoop Ecosystem Frameworks:
The Hadoop ecosystem consists of several tools that help manage, process, and analyze big data
effectively. Key frameworks include Pig, Hive, HBase, Zookeeper, and IBM Big Data
solutions. Let’s explore each in detail.

1. Pig
1.1 Introduction to Pig:

Apache Pig is a platform for processing large datasets. It simplifies complex MapReduce jobs
using its scripting language called Pig Latin. Pig is ideal for tasks like ETL (Extract, Transform,
Load), data cleansing, and analytics.

• Pig is a high-level platform built on top of Hadoop that simplifies the process of working with
large datasets.
• It uses Pig Latin, a data flow language, to execute complex data transformations.

• Pig abstracts the complexity of writing low-level MapReduce code and makes it easier to
process data.

1.2 Execution Modes of Pig:

Pig operates in two modes:

• Local Mode: Runs Pig on a single machine without requiring Hadoop.


• MapReduce Mode: Executes Pig scripts over a Hadoop cluster using MapReduce
framework.

1.3 Comparison of Pig with Databases:

• Pig is better suited for processing large, unstructured data, while traditional databases are
more for structured data.
• Pig is not a database itself but an abstraction for Hadoop to process big data using a
scripting language, while databases store data in tables and use SQL.

1.4 Grunt:

Grunt is the interactive shell in Pig, where you can run commands and execute Pig Latin scripts directly.
1.5 Pig Latin:

• Pig Latin is a simple scripting language used to express data transformations in Pig.
• It supports data transformations, such as LOAD, FILTER, GROUP, JOIN, STORE, etc.

• Example:

pig

A = LOAD 'data.txt' USING PigStorage(',');
B = FILTER A BY $0 > 1000;
DUMP B;

1.6 User Defined Functions (UDFs)

• Pig allows you to create custom functions to extend its capabilities. These are written in
Java, Python, or JavaScript.

Example: A UDF could be written to filter data in a unique way.

1.7 Data Processing Operators

Pig provides various operators for processing data:

1. LOAD: Reads data from a file.


2. FILTER: Filters data based on conditions.
3. GROUP: Groups data by a certain key.
4. JOIN: Combines data from multiple datasets.
5. STORE: Writes data to a file or database.

2. Hive
2.1 Apache Hive Architecture

Hive is a data warehouse system that facilitates querying and managing large datasets using
HiveQL (a SQL-like language). Hive's architecture includes:

• Hive Metastore: Stores table schema and metadata.


• Hive Driver: Manages client connections and query execution.
• Execution Engine: Converts HiveQL queries into MapReduce or Spark jobs.
2.2 Hive Shell:

Hive Shell is an interactive command-line interface for executing HiveQL

• Example:

bash
hive> SELECT * FROM employees;

2.3 Hive Services

Hive services include:

• HiveServer2: Allows external applications to interact with Hive.


• WebUI: Provides a web-based interface for running queries.

2.4 Hive Metastore

• Hive Metastore stores metadata about the structure of data, like tables, columns, partitions, etc. It
is necessary for Hive to perform queries efficiently.

2.5 Hive vs Traditional Databases

• Hive uses Hadoop for distributed processing, which makes it better for handling large amounts
of data compared to traditional databases.

• Hive is not suitable for transactional processing (like RDBMS), but it’s great for batch
processing and analytics.

Feature            Hive                   Traditional Database
Data Processing    Batch Processing       Transactional Processing
Language           HiveQL (SQL-like)      SQL
Data Size          Petabyte-scale         Terabyte-scale

2.6 HiveQL Example

• HiveQL is a query language used to query data in Hive, similar to SQL.

Example:

sql
SELECT * FROM students WHERE age > 20;
2.7 Tables, Querying, and UDFs:

• In Hive, you create tables using CREATE TABLE, LOAD DATA to load data, and use
HiveQL to query data.
o Example:

sql

CREATE TABLE employee (id INT, name STRING);
LOAD DATA INPATH '/user/data.txt' INTO TABLE employee;
SELECT * FROM employee;

2.8 Sorting, Aggregation, Joins, and Subqueries:

Hive supports complex queries for analyzing large datasets, with features like:

• Sorting: ORDER BY
• Aggregation: GROUP BY
• Joins: Inner, Outer, and Cross joins
• Subqueries: Nested queries for advanced data retrieval.

Example:

SELECT department, COUNT(*) FROM employees GROUP BY department;

3. HBase:
3.1 HBase Concepts

Apache HBase is a NoSQL database that runs on top of Hadoop and stores large amounts of
structured data in a column-oriented, key-value format.

• HBase is designed for random access and allows fast read/write operations on very large
datasets.
3.2 HBase Clients:

HBase provides Java client APIs to interact with tables and perform CRUD (Create, Read, Update,
Delete) operations such as put, get, delete, and scan.

Example:

java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// "connection" is an already-open org.apache.hadoop.hbase.client.Connection
Table table = connection.getTable(TableName.valueOf("employees"));
Put put = new Put(Bytes.toBytes("row1"));
put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
table.put(put);
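
For completeness, here is a minimal sketch of reading the value back with a Get, under the same assumptions (an open connection and the employees table from the snippet above):

java
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;

Get get = new Get(Bytes.toBytes("row1"));                        // same row key as above
Result result = table.get(get);
byte[] value = result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("name"));
System.out.println(Bytes.toString(value));                       // prints "Alice"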

3.3 HBase vs RDBMS

• HBase is suitable for large, unstructured data and offers real-time random access.

• RDBMS are better suited for structured data with complex queries and relationships.

Feature       HBase                   RDBMS
Data Model    Key-Value Store         Relational Tables
Schema        Flexible Schema         Fixed Schema
Speed         Fast Random Access      Optimized for Transactions

3.4 Schema Design & Advanced Indexing:

• HBase follows a column-family structure for efficient storage.


• Advanced indexing helps improve query performance by reducing data scans.

4. Zookeeper:
4.1 How Zookeeper Helps in Cluster Monitoring

Zookeeper is a centralized coordination service that ensures consistency and synchronization in
distributed systems. It helps monitor and manage Hadoop clusters by providing:

• Configuration information and naming services
• Synchronization and group services
• Leader election for failover recovery

4.2 Building Applications with Zookeeper:

Zookeeper provides primitives such as leader election, distributed locks, and barriers for
coordinating distributed processes (a minimal Java sketch follows the list below). Applications use
Zookeeper to manage:

• Distributed locks
• Queues for task distribution
• Cluster coordination for failover and recovery
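
Below is a minimal Java sketch of the lock/leader-election building block using the standard ZooKeeper client API. The connection string localhost:2181 and the /locks parent znode are assumptions for this example, and the parent znode must already exist.

java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkLockExample {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (address is an assumption for this sketch)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                // Connection and znode events arrive here
            }
        });

        // Ephemeral sequential znodes are the usual building block for distributed
        // locks and leader election: the client holding the lowest sequence number wins.
        // Assumes the parent znode /locks already exists.
        String node = zk.create("/locks/job-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        System.out.println("Created lock node: " + node);
        zk.close();
    }
}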

5. IBM Big Data Strategy


5.1 Introduction to InfoSphere, BigInsights, and BigSheets

• IBM InfoSphere: A suite of data integration, governance, and data quality tools for working with
large-scale data.
• IBM BigInsights: IBM’s Hadoop-based platform for storing and analyzing big data. It combines
Hadoop, MapReduce, and analytics in a unified environment.
• IBM BigSheets: A spreadsheet-like interface, part of the BigInsights suite, for analyzing large
datasets.

6. Big SQL:
6.1 Introduction to Big SQL

Big SQL is IBM’s SQL engine, built on top of Hadoop and BigInsights, that provides
high-performance queries over big data. It lets users write SQL queries against data stored in
Hadoop and NoSQL databases, and it supports:

• ANSI SQL compliance
• Data federation to access data from multiple sources
• High performance through query optimization techniques

Example Big SQL Query:

SELECT customer_id, SUM(purchase_amount)
FROM transactions
WHERE purchase_date >= '2024-01-01'
GROUP BY customer_id;
