Unit 1 Notes

These notes discuss the key characteristics of big data, including volume, velocity, variety, veracity, and value. They also cover traditional business intelligence versus big data, what big data analytics is, the types of big data analytics, the difference between big data analytics and big data engineering, and the architecture of Hadoop (HDFS, YARN, and MapReduce).


1. What are the characteristics of Big Data:

Big data is characterized by several key features, often referred to as the "3Vs" -
Volume, Velocity, and Variety. Additionally, two more Vs - Veracity and Value -
are sometimes included to provide a more comprehensive understanding. Here
are the characteristics of big data:

1. Volume: Big data refers to datasets that are extremely large in size, far beyond
the capacity of traditional data processing systems to manage, store, and analyze
efficiently. The volume of data can range from terabytes to petabytes and even
exabytes.
2. Velocity: Big data is generated and collected at an unprecedented speed. Data
streams in continuously from various sources such as social media, sensors, web
logs, and transactions. The velocity of data refers to the rate at which data is
generated, captured, and processed in real-time or near real-time.
3. Variety: Big data comes in various formats and types, including structured, semi-
structured, and unstructured data. Structured data, such as relational databases,
follows a predefined schema. Semi-structured data, like JSON or XML files, has
some organization but lacks a fixed schema. Unstructured data, such as text,
images, audio, and video, lacks any predefined structure.
4. Veracity: Veracity refers to the quality, accuracy, and reliability of the data. Big
data sources may include noisy, incomplete, inconsistent, or erroneous data.
Ensuring data veracity involves assessing data quality, detecting and correcting
errors, and maintaining data integrity throughout the data lifecycle.
5. Value: The ultimate goal of big data analysis is to derive meaningful insights,
actionable intelligence, and business value from the vast amounts of data
collected. Extracting value from big data involves applying advanced analytics
techniques, such as data mining, machine learning, and predictive modeling, to
uncover patterns, trends, correlations, and hidden knowledge that can inform
decision-making, drive innovation, and optimize processes.

2. Compare Traditional Business Intelligence with Big Data?


3. What is Big Data Analytics:
4. Types of Big Data Analytics:

5. Difference between big data analytics and big data engineering:


• In a big data analytics case study, you might explore how a company utilized large datasets to gain insights, make data-driven decisions, or improve business processes.
• For instance, a retail company could analyze customer purchasing patterns to optimize inventory and marketing strategies.
• On the other hand, a big data engineering case study would focus on the technical aspects of handling massive datasets.
• It could detail how a company redesigned its data architecture, implemented data pipelines, or scaled its infrastructure to efficiently process and store large volumes of data.
• An example might involve a technology firm enhancing its data storage and processing capabilities to accommodate growing data volumes.
• In summary, big data analytics case studies highlight the strategic use of data for business insights, while big data engineering case studies showcase the technical solutions and infrastructure developed to handle large-scale data processing.

6. Explain the architecture of Hadoop:

Hadoop is an open-source framework for distributed storage and processing of large-scale datasets across clusters of commodity hardware. The architecture of Hadoop consists of several key components, each playing a specific role in the storage, processing, and management of data. Here's an overview of the architecture of Hadoop:

1. Hadoop Distributed File System (HDFS):
• HDFS is the primary storage layer of Hadoop, designed to store large datasets reliably across a cluster of machines.
• It follows a master-slave architecture with two main components: NameNode and DataNode.
• NameNode: Manages the metadata of the file system, including the namespace, file-to-block mapping, and access control.
• DataNode: Stores the actual data blocks and manages read and write operations on the data.
2. Yet Another Resource Negotiator (YARN):
• YARN is the resource management and job scheduling component of Hadoop.
• It allows multiple data processing engines to run on top of Hadoop, enabling diverse workloads such as MapReduce, Apache Spark, Apache Flink, and Apache Hive.
• YARN consists of the ResourceManager and NodeManager.
• ResourceManager: Manages cluster resources, allocates containers, and schedules application tasks.
• NodeManager: Runs on each worker node and manages resources such as CPU, memory, and disk on that node.
3. MapReduce:
• MapReduce is a programming model and processing engine for distributed data processing in Hadoop.
• It divides data processing tasks into two phases: Map and Reduce.
• Map: Processes input data and produces intermediate key-value pairs.
• Reduce: Aggregates and combines intermediate key-value pairs to generate the final output.
• MapReduce jobs are submitted to the YARN ResourceManager for execution (a word-count sketch of the two phases appears after this list).
4. Hadoop Common:
• Hadoop Common contains libraries and utilities shared by the other Hadoop modules.
• It provides common functionalities such as authentication, configuration, logging, and networking.
5. Hadoop Ecosystem:
• The Hadoop ecosystem consists of various projects and tools built on top of the Hadoop core components to extend its capabilities.
• Examples include Apache Hive for SQL-like querying, Apache Pig for data flow scripting, Apache HBase as a NoSQL database, Apache Spark for in-memory processing, Apache Kafka for real-time data streaming, and many others.
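To make the Map and Reduce phases described in item 3 concrete, here is a minimal word-count sketch written against Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce). The class names follow the conventional word-count example and are reused in the driver sketch under question 14; this is an illustrative sketch, not code taken from these notes.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: reads one line at a time and emits (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);          // intermediate key-value pair
            }
        }
    }

    // Reduce phase: after shuffle and sort, sums all counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);            // final output record
        }
    }
}

Each map task emits (word, 1) pairs; the framework shuffles and sorts them by key, and each reduce call then receives one word together with all of its counts, mirroring the two phases described above.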

The architecture of Hadoop is designed to be scalable, fault-tolerant, and cost-effective, making it suitable for processing and analyzing large volumes of data across distributed clusters. It enables organizations to store, process, and derive insights from big data, driving innovation, decision-making, and business value across various industries.

7. What are the different components of HDFS:

The Hadoop Distributed File System (HDFS) comprises several components that work together to store and manage large datasets across a cluster of machines. These components include:

1. NameNode:
• The NameNode is the master node in the HDFS architecture.
• It manages the metadata of the file system, including the namespace hierarchy, file permissions, and file-to-block mappings.
• The NameNode stores metadata in memory for faster access and periodically persists it to disk in the form of the fsimage and edits log files.
• The failure of the NameNode can make the entire file system unavailable, so it is a potential single point of failure. To mitigate this, Hadoop provides NameNode High Availability (an active NameNode with a hot standby) and HDFS Federation, which splits the namespace across multiple NameNodes.
2. DataNode:
• DataNodes are the worker nodes in the HDFS architecture.
• They store the actual data blocks that make up the files in HDFS.
• DataNodes communicate with the NameNode to report the list of blocks they are storing and to replicate or delete blocks based on instructions from the NameNode.
• DataNodes are responsible for serving read and write requests from clients and other Hadoop components.
3. Secondary NameNode:
• Despite its name, the Secondary NameNode does not act as a standby or backup NameNode.
• Its primary role is to periodically merge the fsimage and edits log files produced by the NameNode to prevent them from growing indefinitely.
• The Secondary NameNode generates a new combined image of the file system, which is then sent back to the NameNode to replace the current fsimage file.
• This checkpointing reduces the startup time of the NameNode after a restart and limits how much of the edits log must be replayed if the NameNode fails.
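As a rough illustration of how a client interacts with these components, the sketch below uses Hadoop's Java FileSystem API: creating a file goes through the NameNode for metadata and block allocation, the bytes themselves are streamed to DataNodes, and getFileBlockLocations asks the NameNode which DataNodes hold each block. The paths and the namenode address are placeholders, not values from these notes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS normally comes from core-site.xml; shown here as a placeholder.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt");

        // The NameNode records the new file's metadata; the data itself goes to DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Ask the NameNode which DataNodes hold each block of the file.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}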

8. What are the different components of YARN:

YARN (Yet Another Resource Negotiator) is the resource management and job scheduling component of Hadoop. It enables multiple data processing engines to run on top of Hadoop, allowing for diverse workloads such as MapReduce, Apache Spark, Apache Flink, and Apache Hive. YARN consists of several key components that work together to manage resources and schedule tasks efficiently across a Hadoop cluster. These components include:

1. ResourceManager (RM):
• The ResourceManager is the master daemon in the YARN architecture.
• It is responsible for managing and allocating cluster resources among different applications.
• The ResourceManager consists of two main components:
• Scheduler: Allocates resources to the various applications based on their resource requirements, scheduling policies, and constraints.
• ApplicationsManager: Accepts job submissions and manages the lifecycle of applications running on the cluster, including submission, monitoring, and termination.
2. NodeManager (NM):
• NodeManagers are the worker-node daemons in the YARN architecture.
• They run on each node in the Hadoop cluster and are responsible for managing resources such as CPU, memory, and disk on that node.
• NodeManagers report resource availability and health status to the ResourceManager and launch the containers that the ResourceManager allocates on their node.
• NodeManagers monitor the resource usage of containers running on the node and report back to the ResourceManager for resource accounting and monitoring.
3. ApplicationMaster (AM):
• The ApplicationMaster is a per-application component responsible for coordinating and managing the execution of a specific application on the cluster.
• When a client submits an application to run on the cluster, YARN launches an ApplicationMaster instance for that application.
• The ApplicationMaster negotiates with the ResourceManager for containers, asks the NodeManagers to launch them, monitors the progress of tasks, and handles failures and retries.
• Each application running on the cluster has its own ApplicationMaster instance, ensuring isolation and resource management at the application level.
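As a small illustration of the ResourceManager's role as the cluster's bookkeeper, the sketch below uses the YarnClient API (org.apache.hadoop.yarn.client.api) to connect to the ResourceManager and list the applications it is tracking. It assumes the ResourceManager address and other settings are available from yarn-site.xml on the classpath.

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        // Reads the ResourceManager address and related settings from yarn-site.xml.
        YarnConfiguration conf = new YarnConfiguration();

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // The ResourceManager tracks every application (and its ApplicationMaster) on the cluster.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId()
                    + " " + app.getName()
                    + " " + app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}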

9. Explain commands of HDFS:


In HDFS (Hadoop Distributed File System), you interact with the file system
using command-line tools or APIs provided by Hadoop. Below are some
commonly used commands for interacting with HDFS:

1. hadoop fs:
• This is the main command used to interact with HDFS. It has various subcommands to perform different operations.
2. hadoop fs -ls:
• Lists the contents of a directory in HDFS.
• Example: hadoop fs -ls /user
3. hadoop fs -mkdir:
• Creates a directory in HDFS.
• Example: hadoop fs -mkdir /user/mydirectory
4. hadoop fs -put:
• Copies files or directories from the local file system to HDFS.
• Example: hadoop fs -put localfile.txt /user/mydirectory
5. hadoop fs -get:
• Copies files or directories from HDFS to the local file system.
• Example: hadoop fs -get /user/mydirectory/hdfsfile.txt localfile.txt
6. hadoop fs -rm:
• Deletes files in HDFS (add -r to delete directories recursively).
• Example: hadoop fs -rm /user/mydirectory/hdfsfile.txt
7. hadoop fs -cat:
• Displays the contents of a file in HDFS.
• Example: hadoop fs -cat /user/mydirectory/hdfsfile.txt
8. hadoop fs -copyToLocal:
• Copies files or directories from HDFS to the local file system.
• Example: hadoop fs -copyToLocal /user/mydirectory/hdfsfile.txt localfile.txt
9. hadoop fs -copyFromLocal:
• Copies files or directories from the local file system to HDFS.
• Example: hadoop fs -copyFromLocal localfile.txt /user/mydirectory/hdfsfile.txt
10. hadoop fs -du:
• Displays the disk usage of files and directories in HDFS.
• Example: hadoop fs -du /user/mydirectory
11. hadoop fs -chmod:
• Changes the permissions of files or directories in HDFS.
• Example: hadoop fs -chmod 777 /user/mydirectory/hdfsfile.txt
12. hadoop fs -chown:
• Changes the owner of files or directories in HDFS.
• Example: hadoop fs -chown username /user/mydirectory/hdfsfile.txt
13. hadoop fs -chgrp:
• Changes the group of files or directories in HDFS.
• Example: hadoop fs -chgrp groupname /user/mydirectory/hdfsfile.txt
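The same operations are also available programmatically. Below is a rough Java FileSystem sketch of equivalents for a few of the shell commands above (-mkdir, -put, -ls, -cat, -rm); the class name is illustrative and the paths simply reuse the /user/mydirectory examples.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration()); // uses core-site.xml settings

        Path dir = new Path("/user/mydirectory");
        fs.mkdirs(dir);                                              // hadoop fs -mkdir

        fs.copyFromLocalFile(new Path("localfile.txt"),
                new Path("/user/mydirectory/hdfsfile.txt"));         // hadoop fs -put

        for (FileStatus st : fs.listStatus(dir)) {                   // hadoop fs -ls
            System.out.println(st.getPath() + "  " + st.getLen() + " bytes");
        }

        try (BufferedReader reader = new BufferedReader(             // hadoop fs -cat
                new InputStreamReader(fs.open(new Path("/user/mydirectory/hdfsfile.txt"))))) {
            reader.lines().forEach(System.out::println);
        }

        fs.delete(new Path("/user/mydirectory/hdfsfile.txt"), false); // hadoop fs -rm
        fs.close();
    }
}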

10. Explain the working of MapReduce:

11. Case study on big data analytics:

• Challenge:
A leading retail chain faced challenges in optimizing its inventory management
and enhancing customer satisfaction. The company struggled with stockouts,
excess inventory, and lacked insights into customer preferences, leading to
suboptimal stocking decisions.
• Solution:
The retail chain implemented a comprehensive big data analytics solution to
address these challenges.
• Steps Taken:
Data Collection
Customer Segmentation
Demand Forecasting
Inventory Optimization
Personalized Marketing
• Results:
Reduced Stockouts and Excess Inventory
Improved Customer Satisfaction: increased customer loyalty and repeat business
Increased Revenue
Operational Efficiency
• Conclusion:
This case study demonstrates how big data analytics can transform retail
operations by providing actionable insights. The implemented solution not only
optimized inventory management but also enhanced the overall customer
experience, leading to increased revenue and operational efficiency.

12. Case study on big data engineering:


• Steps Taken:
• Data Infrastructure Overhaul:
Upgraded the data infrastructure to a distributed and scalable architecture.
Adopted big data technologies such as Apache Hadoop and Apache Spark for
distributed processing.
• Real-time Data Ingestion:
Implemented a real-time data ingestion pipeline to capture sales transactions, customer interactions, and inventory updates in real time.
Utilized Apache Kafka for seamless and scalable event streaming (a minimal producer sketch follows this case study).
• Data Storage Optimization:
Employed distributed storage solutions like Hadoop Distributed File System (HDFS)
for efficient and cost-effective storage of large datasets.
Utilized data compression techniques to optimize storage space.
• Data Processing and Transformation:
Developed data processing pipelines using Apache Spark for efficient and parallelized
data transformation.
Applied data cleaning and enrichment processes to enhance the quality of incoming
data.
• Integration with Inventory Systems:
Integrated the big data infrastructure with the inventory management system for
real-time updates.
Enabled automated triggers for inventory replenishment based on demand
forecasts.
• Results:
Real-time Insights
Scalability and Performance
Cost Savings
Improved Inventory Management
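To illustrate the real-time ingestion step mentioned above, here is a minimal sketch of a Kafka producer in Java that publishes sales-transaction events. The topic name, broker address, and message format are assumptions made for illustration, not details from the case study.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SalesEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // placeholder broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // One event per sale: key = store id, value = a simple JSON payload (illustrative).
            String key = "store-42";
            String value = "{\"sku\":\"A100\",\"qty\":2,\"ts\":1700000000}";
            producer.send(new ProducerRecord<>("sales-transactions", key, value));
            producer.flush();
        }
    }
}

Downstream, such events would be consumed by the Spark processing pipelines and written into HDFS, as described in the storage and processing steps of the case study.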

13. How do we submit a MapReduce job to YARN? (See the driver sketch under question 14.)

14. Explain a Hadoop cluster with an example:

Consider a Hadoop cluster comprising several physical or virtual machines, each with its own processing power, memory, and storage capacity. Let's say our cluster consists of the following nodes:

1. NameNode (Master Node): Responsible for storing metadata and coordinating file system operations.
2. Secondary NameNode (Optional): Assists the NameNode by performing periodic checkpoints and merging edit logs.
3. ResourceManager (Master Node): Manages resources and schedules jobs across the cluster.
4. DataNodes (Worker Nodes): Store data blocks and perform data processing tasks.
5. NodeManagers (Worker Nodes): Manage resources and execute tasks on behalf of the ResourceManager.

Now, let's walk through an example of how this Hadoop cluster would work
with a MapReduce job:

1. Job Submission:
• A user submits a MapReduce job to the Hadoop cluster, specifying the input data location, map and reduce functions, and any other job configurations (a driver sketch follows this walkthrough).
• The job is submitted to the ResourceManager, which assigns it an application ID and schedules it for execution.
2. Job Initialization:
• The job client computes input splits by consulting the NameNode for the locations of the input data blocks and copies the job resources (jar and configuration) to HDFS.
• The ResourceManager launches an ApplicationMaster for the job, which is responsible for managing the job's execution.
• The ApplicationMaster creates one map task per input split plus the configured number of reduce tasks; the ResourceManager grants containers for these tasks based on resource availability, scheduling policies, and data locality.
3. Map Phase:
• The ApplicationMaster negotiates with the ResourceManager to allocate resources for map tasks.
• NodeManagers execute map tasks in parallel across the cluster, reading input data blocks from DataNodes and applying the user-defined map function.
• Intermediate key-value pairs are generated by the map tasks and partitioned based on keys.
• The output of the map tasks is written to local disk and buffered until it is ready for the shuffle and sort phase.
4. Shuffle and Sort:
• Intermediate key-value pairs generated by map tasks are shuffled and sorted based on keys.
• The shuffle and sort process involves transferring data over the network from map tasks to reduce tasks and grouping data by key.
• This phase ensures that all values associated with the same key are sent to the same reducer for processing.
5. Reduce Phase:
• The ApplicationMaster negotiates with the ResourceManager to allocate resources for reduce tasks.
• NodeManagers execute reduce tasks in parallel across the cluster, reading intermediate data from map tasks and applying the user-defined reduce function.
• The reduce tasks aggregate and process the intermediate key-value pairs to generate the final output.
6. Output:
• The final output of the MapReduce job is written to HDFS or another distributed file system.
• Each reducer produces its own output file, which contains the final results of the computation.
• The output files can be accessed by the user for further analysis or processing.
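As referenced in step 1 (and asked in question 13), a small driver sketch shows how a job is handed to YARN: the driver configures the job and calls waitForCompletion(), which submits it to the ResourceManager. This sketch assumes the WordCount mapper and reducer sketched under question 6; the input and output paths are passed on the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);   // mapper from the earlier sketch
        job.setCombinerClass(WordCount.IntSumReducer.class);   // optional local aggregation
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (must not exist)

        // Submits the job to the YARN ResourceManager and waits for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, such a driver is typically launched with the hadoop jar command (for example, hadoop jar wordcount.jar WordCountDriver /user/mydirectory /user/output), after which YARN starts an ApplicationMaster for the job as described in steps 2 to 6.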

Throughout this process, Hadoop provides fault tolerance by automatically handling failures and rerunning tasks as needed. It also optimizes resource utilization by dynamically allocating resources based on job requirements and cluster availability. Overall, the Hadoop cluster efficiently processes large-scale data workloads in a distributed and fault-tolerant manner, enabling organizations to derive insights and value from their data.
