Unit 1 Notes
Big data is characterized by several key features, often referred to as the "3Vs" -
Volume, Velocity, and Variety. Additionally, two more Vs - Veracity and Value -
are sometimes included to provide a more comprehensive understanding. Here
are the characteristics of big data:
1. Volume: Big data refers to datasets that are extremely large in size, far beyond
the capacity of traditional data processing systems to manage, store, and analyze
efficiently. The volume of data can range from terabytes to petabytes and even
exabytes.
2. Velocity: Big data is generated and collected at an unprecedented speed. Data
streams in continuously from various sources such as social media, sensors, web
logs, and transactions. The velocity of data refers to the rate at which data is
generated, captured, and processed in real-time or near real-time.
3. Variety: Big data comes in various formats and types, including structured, semi-
structured, and unstructured data. Structured data, such as relational databases,
follows a predefined schema. Semi-structured data, like JSON or XML files, has
some organization but lacks a fixed schema. Unstructured data, such as text,
images, audio, and video, lacks any predefined structure.
4. Veracity: Veracity refers to the quality, accuracy, and reliability of the data. Big
data sources may include noisy, incomplete, inconsistent, or erroneous data.
Ensuring data veracity involves assessing data quality, detecting and correcting
errors, and maintaining data integrity throughout the data lifecycle.
5. Value: The ultimate goal of big data analysis is to derive meaningful insights,
actionable intelligence, and business value from the vast amounts of data
collected. Extracting value from big data involves applying advanced analytics
techniques, such as data mining, machine learning, and predictive modeling, to
uncover patterns, trends, correlations, and hidden knowledge that can inform
decision-making, drive innovation, and optimize processes.
The Hadoop Distributed File System (HDFS) follows a master/worker architecture with the following components:
1. NameNode:
The NameNode is the master node in the HDFS architecture.
It manages the metadata of the file system, including the namespace
hierarchy, file permissions, and file-to-block mappings.
The NameNode stores metadata in memory for faster access and
periodically persists it to the disk in the form of the fsimage and edits
log files.
The failure of the NameNode can lead to the unavailability of the entire
file system, making it a single point of failure. To mitigate this, Hadoop
provides NameNode High Availability (HA), in which an active NameNode is
paired with a hot standby, and HDFS Federation, which splits the namespace
across multiple independent NameNodes.
2. DataNode:
DataNodes are worker nodes in the HDFS architecture.
They store the actual data blocks that make up the files in HDFS.
DataNodes communicate with the NameNode to report the list of
blocks they are storing and to replicate or delete blocks based on
instructions from the NameNode.
DataNodes are responsible for serving read and write requests from
clients and other Hadoop components; a short client-side sketch of this
interaction follows this list.
3. Secondary NameNode:
Despite its name, the Secondary NameNode does not act as a standby
or backup NameNode.
Its primary role is to periodically merge the fsimage and edits log files
produced by the NameNode to prevent them from growing
indefinitely.
The Secondary NameNode generates a new combined image of the file
system, which is then sent back to the NameNode to replace the
current fsimage file.
This checkpointing keeps the edits log small, which shortens NameNode
startup and recovery time and limits how much metadata could be lost if
the NameNode fails.
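To make the division of labour concrete, below is a minimal client-side sketch using the Hadoop Java FileSystem API, assuming a cluster configured through core-site.xml/hdfs-site.xml and a hypothetical file path. Metadata calls (file status, block-to-DataNode mappings) are answered by the NameNode, while the actual bytes are streamed from the DataNodes holding the blocks.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // client handle backed by the NameNode

        Path file = new Path("/user/mydirectory/hdfsfile.txt");  // hypothetical path

        // Metadata operations go to the NameNode: file status and block-to-DataNode mappings.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " stored on DataNodes: " + String.join(", ", block.getHosts()));
        }

        // Actual byte reads are streamed from the DataNodes that hold the blocks.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buffer = new byte[4096];
            int read = in.read(buffer);
            System.out.println("Read " + read + " bytes from the start of the file");
        }
        fs.close();
    }
}
```

If the NameNode is unreachable, even the metadata calls fail, which is exactly the single-point-of-failure risk described above.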
YARN (Yet Another Resource Negotiator), Hadoop's resource management layer, consists of the following components:
1. ResourceManager (RM):
The ResourceManager is the master daemon in the YARN architecture.
It is responsible for managing and allocating cluster resources among
different applications.
The ResourceManager consists of two main components:
Scheduler: Allocates resources to various applications based on
their resource requirements, scheduling policies, and constraints.
ApplicationsManager: Accepts application submissions, negotiates
the first container for each application's ApplicationMaster, and
restarts the ApplicationMaster if it fails.
2. NodeManager (NM):
NodeManagers are worker nodes in the YARN architecture.
They run on each node in the Hadoop cluster and are responsible for
managing resources such as CPU, memory, and disk on that node.
NodeManagers report resource availability and health status to the
ResourceManager through periodic heartbeats and launch containers on
their node when allocations are granted; the sketch after this list
shows a client querying these reports from the ResourceManager.
NodeManagers monitor the resource usage of containers running on
the node and report back to the ResourceManager for resource
accounting and monitoring.
3. ApplicationMaster (AM):
The ApplicationMaster is a per-application component responsible for
coordinating and managing the execution of a specific application on
the cluster.
When a client submits an application to run on the cluster, YARN
launches an ApplicationMaster instance for that application.
The ApplicationMaster negotiates containers from the ResourceManager,
asks the corresponding NodeManagers to launch them, monitors the
progress of tasks, and handles failures and retries.
Each application running on the cluster has its own ApplicationMaster
instance, ensuring isolation and resource management at the
application level.
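The ResourceManager/NodeManager relationship can be seen from a client's point of view with the YarnClient API. The sketch below is only an illustration, assuming a reachable ResourceManager configured through yarn-site.xml; it does not submit an application, it simply asks the ResourceManager for the node reports that NodeManagers deliver in their heartbeats.

```java
import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClusterSketch {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();   // reads yarn-site.xml

        // The client talks to the ResourceManager, which aggregates the
        // heartbeats and health reports sent by every NodeManager.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        System.out.println("NodeManagers registered with the ResourceManager: "
                + yarnClient.getYarnClusterMetrics().getNumNodeManagers());

        // Per-node capacity and usage, as reported to the ResourceManager.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + "  capacity=" + node.getCapability()
                    + "  used=" + node.getUsed()
                    + "  health=" + node.getHealthReport());
        }
        yarnClient.stop();
    }
}
```

Submitting an application uses the same client, after which the ApplicationsManager launches an ApplicationMaster that takes over container negotiation as described above.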
The following shell commands are commonly used to work with files in HDFS:
1. hadoop fs:
This is the main command used to interact with HDFS. It has various
subcommands to perform different operations; programmatic equivalents
using the Java FileSystem API are sketched after this list.
2. hadoop fs -ls:
Lists the contents of a directory in HDFS.
Example: hadoop fs -ls /user
3. hadoop fs -mkdir:
Creates a directory in HDFS.
Example: hadoop fs -mkdir /user/mydirectory
4. hadoop fs -put:
Copies files or directories from the local file system to HDFS.
Example: hadoop fs -put localfile.txt /user/mydirectory
5. hadoop fs -get:
Copies files or directories from HDFS to the local file system.
Example: hadoop fs -get /user/mydirectory/hdfsfile.txt localfile.txt
6. hadoop fs -rm:
Deletes files in HDFS; add -r to delete directories recursively.
Example: hadoop fs -rm /user/mydirectory/hdfsfile.txt
7. hadoop fs -cat:
Displays the contents of a file in HDFS.
Example: hadoop fs -cat /user/mydirectory/hdfsfile.txt
8. hadoop fs -copyToLocal:
Copies files or directories from HDFS to the local file system.
Example: hadoop fs -copyToLocal /user/mydirectory/hdfsfile.txt localfile.txt
9. hadoop fs -copyFromLocal:
Copies files or directories from the local file system to HDFS.
Example: hadoop fs -copyFromLocal localfile.txt /user/mydirectory/hdfsfile.txt
10. hadoop fs -du:
Displays the disk usage of files and directories in HDFS.
Example: hadoop fs -du /user/mydirectory
11. hadoop fs -chmod:
Changes the permissions of files or directories in HDFS.
Example: hadoop fs -chmod 777 /user/mydirectory/hdfsfile.txt
12. hadoop fs -chown:
Changes the owner of files or directories in HDFS.
Example: hadoop fs -chown username /user/mydirectory/hdfsfile.txt
13. hadoop fs -chgrp:
Changes the group of files or directories in HDFS.
Example: hadoop fs -chgrp groupname /user/mydirectory/hdfsfile.txt
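The shell commands above have programmatic equivalents in the Java FileSystem API. The sketch below reuses the same illustrative paths; it assumes the default configuration on the classpath, and the setOwner call would normally require superuser privileges.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class HdfsShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path dir = new Path("/user/mydirectory");              // same illustrative paths as above
        Path local = new Path("localfile.txt");
        Path remote = new Path("/user/mydirectory/hdfsfile.txt");

        fs.mkdirs(dir);                                        // hadoop fs -mkdir
        fs.copyFromLocalFile(local, remote);                   // hadoop fs -put / -copyFromLocal

        for (FileStatus s : fs.listStatus(dir)) {              // hadoop fs -ls
            System.out.println(s.getPermission() + " " + s.getOwner()
                    + " " + s.getLen() + " " + s.getPath());
        }

        System.out.println("Disk usage: "                      // hadoop fs -du (summary)
                + fs.getContentSummary(dir).getLength() + " bytes");

        fs.setPermission(remote, new FsPermission((short) 0644)); // hadoop fs -chmod 644
        fs.setOwner(remote, "username", "groupname");           // hadoop fs -chown / -chgrp (superuser)

        fs.delete(remote, false);                               // hadoop fs -rm (non-recursive)
        fs.close();
    }
}
```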
Case Study: Big Data Analytics in Retail
• Challenge:
A leading retail chain faced challenges in optimizing its inventory management
and enhancing customer satisfaction. The company struggled with stockouts and
excess inventory, and it lacked insight into customer preferences, leading to
suboptimal stocking decisions.
• Solution:
The retail chain implemented a comprehensive big data analytics solution to
address these challenges.
• Steps Taken:
Data Collection
Customer Segmentation
Demand Forecasting
Inventory Optimization
Personalized Marketing
• Results:
Reduced stockouts and excess inventory.
Improved customer satisfaction, with increased customer loyalty and
repeat business.
Increased revenue.
Improved operational efficiency.
• Conclusion:
This case study demonstrates how big data analytics can transform retail
operations by providing actionable insights. The implemented solution not only
optimized inventory management but also enhanced the overall customer
experience, leading to increased revenue and operational efficiency.
Now, let's walk through how a Hadoop cluster built from these components runs
a MapReduce job; a complete WordCount example follows the walkthrough:
1. Job Submission:
A user submits a MapReduce job to the Hadoop cluster, specifying the
input data location, map and reduce functions, and any other job
configurations.
The job is submitted to the ResourceManager, which assigns it an
application ID and schedules it for execution.
2. Job Initialization:
The MapReduce client computes input splits, consulting the NameNode
for the locations of the input data blocks, and copies the job
resources to HDFS.
The ResourceManager launches an ApplicationMaster for the job,
which is responsible for managing the job's execution.
Map and reduce tasks are later placed on NodeManagers based on
resource availability, data locality, and scheduling policies.
3. Map Phase:
The ApplicationMaster negotiates with the ResourceManager to
allocate resources for map tasks.
NodeManagers execute map tasks in parallel across the cluster, reading
input data blocks from DataNodes and applying the user-defined map
function.
Intermediate key-value pairs are generated by the map tasks and
partitioned based on keys.
Map output is buffered in memory, spilled to local disk, and held there
until the shuffle and sort phase picks it up.
4. Shuffle and Sort:
Intermediate key-value pairs generated by map tasks are shuffled and
sorted based on keys.
The shuffle and sort process involves transferring data over the network
from map tasks to reduce tasks and grouping data by key.
This phase ensures that all values associated with the same key are sent
to the same reducer for processing.
5. Reduce Phase:
The ApplicationMaster negotiates with the ResourceManager to
allocate resources for reduce tasks.
NodeManagers execute reduce tasks in parallel across the cluster,
reading intermediate data from map tasks and applying the user-
defined reduce function.
The reduce tasks aggregate and process the intermediate key-value
pairs to generate the final output.
6. Output:
The final output of the MapReduce job is written to HDFS or another
distributed file system.
Each reducer produces its output file, which contains the final results of
the computation.
The output files can be accessed by the user for further analysis or
processing.
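To tie the walkthrough together, here is the classic WordCount job, essentially the example that ships with Hadoop. The driver in main corresponds to job submission (step 1), TokenizerMapper to the map phase (step 3), the framework's shuffle and sort sits between map and reduce (step 4), IntSumReducer performs the reduce phase (step 5), and the results land in HDFS (step 6). Input and output paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each mapper reads one input split and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // intermediate key-value pair, partitioned by key
            }
        }
    }

    // Reduce phase: after shuffle and sort, each reducer sees all counts for a word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);     // final (word, total) written to the job output
        }
    }

    // Job submission: the driver configures the job and hands it to YARN.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a JAR, it could be launched with something like hadoop jar wordcount.jar WordCount <input> <output>; each reducer then writes its results to a part-r-NNNNN file in the output directory.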