BDA Unit 1 Notes
Systems that process and store big data have become a common component of data
management architectures in organizations, combined with tools that support big
data analytics uses. Big data is often characterized by a set of defining "V" characteristics, described in the characteristics section below.
Big data is often stored in a data lake. While data warehouses are commonly built
on relational databases and contain structured data only, data lakes can support
various data types and typically are based on Hadoop clusters, cloud object storage
services, NoSQL databases or other big data platforms.
Characteristics of Big Data:
Big Data refers to volumes of data too large to be handled by traditional data storage and processing systems. It is used by many multinational companies to run their data processing and business operations; the global data flow is estimated to exceed 150 exabytes per day before replication.
Five V's are commonly used to describe the characteristics of Big Data.
Volume
The name Big Data itself refers to an enormous size. Big Data is a vast volume of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and more.
Facebook, for example, generates approximately a billion messages, records about 4.5 billion clicks of the "Like" button, and receives more than 350 million new posts every day. Big data technologies are built to handle such large amounts of data.
Variety
Big Data can be structured, unstructured, or semi-structured, and is collected from many different sources. In the past, data was collected only from databases and spreadsheets; today it arrives in a wide array of forms such as PDFs, emails, audio files, social media posts, photos, videos, and more.
The data is categorized as below (a small illustration follows these categories):
Structured data: Data that follows a fixed schema with all required columns, stored in tabular form, typically in a relational database management system.
Unstructured data: Data with no predefined structure, such as free-form text files, log files, audio files, and image files. Many organizations have large amounts of such data available but do not know how to derive value from it because it is raw.
Example: Web server logs, i.e., log files created and maintained by a server that record a list of its activities.
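To make these categories concrete, here is a small, hedged illustration in Python; the records and values are hypothetical and only meant to show the difference in structure.

```python
# Hypothetical records illustrating the categories of data.

# Structured: fixed schema with required columns, fits a row in a relational table.
structured_row = {"customer_id": 101, "name": "Asha", "city": "Pune", "amount": 2499.0}

# Semi-structured: self-describing keys (e.g. JSON/XML) but no rigid, uniform schema.
semi_structured = '{"customer_id": 101, "orders": [{"sku": "A12", "qty": 2}], "notes": null}'

# Unstructured: raw content with no schema, e.g. a web server log line or an image file.
unstructured_log = '192.168.1.5 - - [10/Oct/2023:13:55:36 +0530] "GET /index.html HTTP/1.1" 200 2326'
```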
Veracity
Veracity refers to how reliable and trustworthy the data is, since there are many ways in which data can be filtered or translated along the way. It is about being able to handle and manage data of uncertain quality efficiently, which is also essential for business development.
For example, Facebook posts with hashtags are noisy, user-generated data whose accuracy must be assessed before it is used.
Value
Value is an essential characteristic of big data. What matters is not simply the data that we process or store, but the valuable and reliable data that we store, process, and analyze.
Velocity
Velocity plays an important role compared to the other characteristics. Velocity refers to the speed at which data is created, often in real time. It covers the rate at which incoming data sets arrive, their rate of change, and bursts of activity; a primary requirement of Big Data systems is to make in-demand data available rapidly.
Big data velocity deals with the speed at which data flows from sources such as application logs, business processes, networks, social media sites, sensors, and mobile devices.
To get valid and relevant results from big data analytics applications, data
scientists and other data analysts must have a detailed understanding of the
available data and a sense of what they're looking for in it. That makes data
preparation, which includes profiling, cleansing, validation and transformation of
data sets, a crucial first step in the analytics process.
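As a rough, hedged sketch of that preparation step, the snippet below profiles, cleanses, validates, and transforms a hypothetical customer file with pandas; the file name and column names are assumptions for illustration only.

```python
import pandas as pd

# A hedged data-preparation sketch; "customers_raw.csv" and its columns are
# hypothetical stand-ins for whatever raw source you actually have.
df = pd.read_csv("customers_raw.csv")

# Profiling: a quick summary of every column to understand the available data.
print(df.describe(include="all"))

# Cleansing: drop duplicate rows and coerce bad numeric values to NaN.
df = df.drop_duplicates()
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Validation: require the key fields to be present.
df = df.dropna(subset=["customer_id", "age"])

# Transformation: parse dates and derive an age band for analysis.
df["signup_date"] = pd.to_datetime(df["signup_date"])
df["age_band"] = pd.cut(df["age"], bins=[0, 25, 45, 65, 120],
                        labels=["<25", "25-44", "45-64", "65+"])

# Hand the prepared data set on to the analytics step.
df.to_csv("customers_clean.csv", index=False)
```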
Once the data has been gathered and prepared for analysis, various data
science and advanced analytics disciplines can be applied to run different
applications, using tools that provide big data analytics features and capabilities.
Those disciplines include machine learning and its deep learning offshoot,
predictive modeling, data mining, statistical analysis, streaming analytics, text
mining and more.
Using customer data as an example, the different branches of analytics that can be
done with sets of big data include the following:
Predictive Analytics
Predictive analytics turns data into valuable, actionable information. It uses data to determine the probable outcome of an event or the likelihood of a situation occurring. Predictive analytics draws on a variety of statistical techniques from modeling, machine learning, data mining, and game theory that analyze current and historical facts to make predictions about future events. Techniques used for predictive analytics include the following (a short sketch follows this list):
Linear Regression
Time Series Analysis and Forecasting
Data Mining
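The sketch below is a minimal, hedged example of the first technique, linear regression, using scikit-learn on made-up advertising data; the numbers are synthetic and the model is only illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, made-up data: monthly ad spend (in thousands) vs. sales.
ad_spend = np.array([[10], [20], [30], [40], [50]])   # 2-D feature matrix
sales = np.array([120, 190, 310, 405, 480])

# Fit a linear regression on the historical facts.
model = LinearRegression().fit(ad_spend, sales)
print("slope:", model.coef_[0], "intercept:", model.intercept_)

# Predict the probable outcome for a future spend level of 60.
print("forecast for spend=60:", model.predict(np.array([[60]]))[0])
```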
Descriptive Analytics
Descriptive analytics looks at data and analyzes past events for insight into how to approach future events. It looks at past performance and seeks to understand it by mining historical data for the causes of success or failure. Almost all management reporting, such as sales, marketing, operations, and finance, uses this type of analysis.
A descriptive model quantifies relationships in data in a way that is often used to classify customers or prospects into groups. Unlike a predictive model, which focuses on predicting the behavior of a single customer, descriptive analytics identifies many different relationships between customers and products.
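A minimal, hedged sketch of descriptive analytics with pandas is shown below: it mines a tiny, hypothetical set of historical sales records and produces the kind of grouped summary used in management reporting.

```python
import pandas as pd

# Hypothetical historical sales records.
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "West"],
    "product": ["A", "B", "A", "B", "A"],
    "revenue": [1200, 800, 950, 400, 1500],
    "units":   [12, 10, 9, 5, 15],
})

# Mine the past data: total revenue and average units per region/product group,
# the kind of summary found in sales and marketing management reports.
report = (sales.groupby(["region", "product"])
               .agg(total_revenue=("revenue", "sum"),
                    avg_units=("units", "mean"))
               .reset_index())
print(report)
```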
Prescriptive Analytics
Prescriptive Analytics automatically synthesizes big data, mathematical sciences, business rules, and machine learning to make a prediction and then suggests decision options to take advantage of that prediction.
Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions that benefit from the predictions and showing the decision maker the implications of each decision option. Prescriptive analytics anticipates not only what will happen and when it will happen, but also why it will happen. Further, prescriptive analytics can suggest decision options on how to take advantage of a future opportunity or mitigate a future risk, and illustrate the implication of each decision option.
For example, prescriptive analytics can benefit healthcare strategic planning by leveraging operational and usage data combined with data on external factors such as economic conditions and population demographics.
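The toy sketch below, with entirely hypothetical numbers and thresholds, shows the basic idea: a forecast from a predictive model is combined with a business rule to suggest a decision option, in the spirit of the healthcare example above.

```python
# Hypothetical numbers and thresholds, purely for illustration.
def recommend_staffing(predicted_patients: int, beds_available: int) -> str:
    """Combine a demand forecast with business rules to suggest a decision option."""
    utilisation = predicted_patients / beds_available
    if utilisation > 0.9:
        return "Add temporary staff and defer elective admissions"
    if utilisation > 0.7:
        return "Schedule extra on-call staff"
    return "Operate with the regular roster"

# Suppose a predictive model forecast 180 patients next week against 200 beds.
print(recommend_staffing(predicted_patients=180, beds_available=200))
```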
Diagnostic Analytics
In this type of analysis, we rely mainly on historical data to answer a question or solve a problem, looking for dependencies and patterns in the historical data related to that particular problem.
Companies favor this analysis because it gives great insight into a problem, provided they keep detailed information at their disposal; otherwise, data would have to be collected separately for every problem, which would be very time-consuming. Common techniques used for diagnostic analytics include the following (a small correlation sketch follows the list):
Data discovery
Data mining
Correlations
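As a small, hedged illustration of the correlation technique, the snippet below builds a hypothetical weekly data set and computes a correlation matrix to look for the cause of a sales dip; the column names and values are invented for the example.

```python
import pandas as pd

# Hypothetical weekly history for a problem under investigation: why did sales dip?
history = pd.DataFrame({
    "sales":        [500, 480, 300, 290, 510, 250],
    "site_outages": [0,   1,   4,   5,   0,   6],
    "discount_pct": [10,  10,  5,   5,   12,  4],
})

# The correlation matrix: a strong negative correlation between outages and sales
# suggests a dependency worth drilling into with further data discovery/mining.
print(history.corr())
```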
Analytics architecture refers to the infrastructure and systems that are used to
support the collection, storage, and analysis of data. There are several key
components that are typically included in an analytics architecture:
1. Data collection: This refers to the process of gathering data from various
sources, such as sensors, devices, social media, websites, and more.
2. Transformation: Once the data has been collected, it is cleaned and transformed before being stored.
3. Data storage: This refers to the systems and technologies used to store and
manage data, such as databases, data lakes, and data warehouses.
4. Analytics: This refers to the tools and techniques used to analyze and interpret
data, such as statistical analysis, machine learning, and visualization.
Together, these components enable organizations to collect, store, and analyze data in order to make informed decisions and drive business outcomes; a toy end-to-end sketch of these four steps follows.
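The sketch below walks through the four components with pandas; the file names and columns are placeholders, and a real architecture would of course use production-grade collectors, data lakes or warehouses, and analytics engines.

```python
import pandas as pd

# 1. Data collection: pull raw events from a source (here, a hypothetical CSV export).
raw = pd.read_csv("events_export.csv")

# 2. Transformation: clean and reshape the data before storing it.
clean = raw.dropna(subset=["user_id"]).assign(
    event_time=lambda d: pd.to_datetime(d["event_time"])
)

# 3. Data storage: persist to an analytics-friendly format (a stand-in for a
#    data lake or warehouse; to_parquet needs pyarrow or fastparquet installed).
clean.to_parquet("events_clean.parquet")

# 4. Analytics: a simple aggregation that could feed a dashboard or a model.
daily_active_users = clean.groupby(clean["event_time"].dt.date)["user_id"].nunique()
print(daily_active_users.head())
```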
The analytics architecture is the framework that enables organizations to collect,
store, process, analyze, and visualize data in order to support data-driven decision-
making and drive business value.
Benefits:
There are several ways in which you can use analytics architecture to benefit your
organization:
1. Support data-driven decision-making: Analytics architecture can be used to
collect, store, and analyze data from a variety of sources, such as transactions,
social media, web analytics, and sensor data. This can help you make more
informed decisions by providing you with insights and patterns that you may
not have been able to detect otherwise.
2. Improve efficiency and effectiveness: By using analytics architecture to
automate tasks such as data integration and data preparation, you can reduce the
time and resources required to analyze data, and focus on more value-added
activities.
3. Enhance customer experiences: Analytics architecture can be used to gather
and analyze customer data, such as demographics, preferences, and behaviors,
to better understand and meet the needs of your customers. This can help you
improve customer satisfaction and loyalty.
4. Optimize business processes: Analytics architecture can be used to analyze
data from business processes, such as supply chain management, to identify
bottlenecks, inefficiencies, and opportunities for improvement. This can help
you optimize your processes and increase efficiency.
5. Identify new opportunities: Analytics architecture can help you discover new
opportunities, such as identifying untapped markets or finding ways to improve
product or service offerings.
Analytics architecture can help you make better use of data to drive business value
and improve your organization’s performance.
Applications of Analytics Architecture
Analytics architecture can be applied in a variety of contexts and industries to
support data-driven decision-making and drive business value. Here are a few
examples of how analytics architecture can be used:
1. Financial services: Analytics architecture can be used to analyze data from
financial transactions, customer data, and market data to identify patterns and
trends, detect fraud, and optimize risk management.
2. Healthcare: Analytics architecture can be used to analyze data from electronic
health records, patient data, and clinical trial data to improve patient outcomes,
reduce costs, and support research.
3. Retail: Analytics architecture can be used to analyze data from customer
transactions, web analytics, and social media to improve customer experiences,
optimize pricing and inventory, and identify new opportunities.
4. Manufacturing: Analytics architecture can be used to analyze data from
production processes, supply chain management, and quality control to
optimize operations, reduce waste, and improve efficiency.
5. Government: Analytics architecture can be used to analyze data from a variety
of sources, such as census data, tax data, and social media data, to support
policy-making, improve public services, and promote transparency.
Analytics architecture can be applied in a wide range of contexts and industries to
support data-driven decision-making and drive business value.
Limitations of Analytics Architecture
There are several limitations to consider when designing and implementing an
analytical architecture:
1. Complexity: Analytical architectures can be complex and require a high level
of technical expertise to design and maintain.
2. Data quality: The quality of the data used in the analytical system can
significantly impact the accuracy and usefulness of the results.
3. Data security: Ensuring the security and privacy of the data used in the
analytical system is critical, especially when working with sensitive or personal
information.
4. Scalability: As the volume and complexity of the data increase, the analytical
system may need to be scaled to handle the increased load. This can be a
challenging and costly task.
5. Integration: Integrating the various components of the analytical system can be
a challenge, especially when working with a diverse set of data sources and
technologies.
6. Cost: Building and maintaining an analytical system can be expensive, due to
the cost of hardware, software, and personnel.
7. Data governance: Ensuring that the data used in the analytical system is
properly governed and compliant with relevant laws and regulations can be a
complex and time-consuming task.
8. Performance: The performance of the analytical system can be impacted by
factors such as the volume and complexity of the data, the quality of the
hardware and software used, and the efficiency of the algorithms and processes
employed.
(Partially recovered summary, originally a table/figure on big data analytics: analytics extracts information from data sources into a data warehouse and dashboards to support strategic plans; widely used tools include Hadoop, Spark, Cassandra, Tableau, Sisense, and Microsoft Power BI; typical outputs are interactive and ranking reports; benefits include improved strategies, cost savings, and increased revenues, across sectors such as wholesale.)
Hadoop:
Hadoop is a framework that uses distributed storage and parallel processing to store
and manage big data. It is the software most used by data analysts to handle big data,
and its market size continues to grow. There are three components of Hadoop:
Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit.
Hadoop MapReduce - Hadoop MapReduce is the processing unit.
Hadoop YARN - Yet Another Resource Negotiator (YARN) is a resource
management unit.
Features of Hadoop
Apache Hadoop is the most popular and powerful big data tool. Hadoop provides the world's most reliable storage layer (HDFS), a batch processing engine (MapReduce), and a resource management layer (YARN).
Important features of Hadoop are given below:
1. Open Source
Apache Hadoop is an open source project, which means its code can be modified according to business requirements.
2. Distributed Processing
As data is stored in a distributed manner in HDFS across the cluster, data is
processed in parallel on a cluster of nodes.
3. Fault Tolerance
This is one of the most important features of Hadoop. By default, 3 replicas of each block are stored across the cluster, and this can be changed as per requirement. So if any node goes down, the data on that node can easily be recovered from other nodes thanks to this characteristic. Failures of nodes or tasks are recovered from automatically by the framework. This is how Hadoop is fault tolerant.
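For reference, the replication factor behind this fault tolerance can be changed per path with the standard `hdfs dfs -setrep` command; the sketch below simply invokes it from Python, and the path and target factor are placeholders.

```python
import subprocess

# Change the replication factor of one HDFS path to 3 copies; the path is a
# placeholder, and -w makes the command wait until the replication is reached.
subprocess.run(
    ["hdfs", "dfs", "-setrep", "-w", "3", "/data/sales/2023.csv"],
    check=True,
)
```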
4. Reliability
Due to the replication of data in the cluster, data is stored reliably on the cluster of machines despite machine failures. Even if a machine goes down, your data remains safely stored because of this characteristic of Hadoop.
5. High Availability
Data is highly available and accessible despite hardware failure, because multiple copies of the data exist. If a machine or some hardware crashes, the data can be accessed from another path.
6. Scalability
Hadoop is highly scalable in that new hardware can easily be added to the cluster. It also provides horizontal scalability, which means new nodes can be added on the fly without any downtime.
7. Economic
Apache Hadoop is not very expensive, as it runs on a cluster of commodity hardware; no specialized machines are needed. Hadoop also provides huge cost savings because it is very easy to add more nodes on the fly, so if requirements increase, you can add nodes without any downtime and without much pre-planning.
8. Easy to use
The client does not need to deal with distributed computing; the framework takes care of all of it, which is why Hadoop is easy to use.
9. Data Locality
This is a unique feature of Hadoop that lets it handle Big Data easily. Hadoop works on the data locality principle, which states: move the computation to the data instead of the data to the computation. When a client submits a MapReduce job, the algorithm is moved to the data in the cluster rather than bringing the data to the location where the job was submitted and processing it there.
Hadoop Assumptions
Hadoop is written with large clusters of computers in mind and is built around the following assumptions:
Hardware may fail (since commodity hardware can be used).
Applications that run on HDFS have large data sets. A typical file in HDFS is
gigabytes to terabytes in size.
Hadoop Components:
MapReduce
HDFS(Hadoop Distributed File System)
YARN(Yet Another Resource Negotiator)
Common Utilities or Hadoop Common
1. MapReduce
MapReduce is essentially a programming model, running on top of the YARN framework, whose major feature is performing distributed processing in parallel across a Hadoop cluster; this is what makes Hadoop work so fast, because when you are dealing with Big Data, serial processing is no longer of any use. MapReduce has mainly two tasks, divided phase-wise: in the first phase Map is used, and in the next phase Reduce is used.
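A minimal, hedged word-count sketch in the Hadoop Streaming style is shown below: the map phase emits (word, 1) pairs and the reduce phase sums them, with Hadoop running many copies of each phase in parallel and sorting the intermediate pairs by key. The script layout and flag names are illustrative; in practice the mapper and reducer are often kept in separate files and submitted with the Hadoop Streaming jar shipped with your distribution.

```python
#!/usr/bin/env python3
"""Hedged word-count sketch in the Hadoop Streaming style: both phases read
from stdin and write tab-separated key/value pairs to stdout."""
import sys


def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in the input split.
    for line in lines:
        for word in line.strip().split():
            print(f"{word}\t1")


def reducer(lines):
    # Reduce phase: input arrives sorted by key, so counts can be summed per word.
    current, count = None, 0
    for line in lines:
        word, value = line.rstrip("\n").split("\t", 1)
        if word != current and current is not None:
            print(f"{current}\t{count}")
            count = 0
        current = word
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")


if __name__ == "__main__":
    # Select the phase with a command-line flag, e.g. `wordcount.py map`.
    (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)
```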
2. HDFS
HDFS is the storage unit, with two kinds of nodes:
NameNode (Master)
DataNode (Slave)
3. YARN
YARN is the resource management unit; its notable features include:
Multi-Tenancy
Scalability
Cluster-Utilization
Compatibility
HADOOP DAEMONS:
Daemon means process. Hadoop daemons are a set of processes that run on Hadoop. Since Hadoop is a framework written in Java, all these processes are Java processes.
NameNode
DataNode
Secondary NameNode
Resource Manager
Node Manager
1. NameNode
Features:
As the NameNode runs on the master system, the master should have good processing power and more RAM than the slaves.
It stores information about the DataNodes, such as their block IDs and number of blocks.
2. DataNode
DataNode runs on the slave systems. The NameNode always instructs the DataNodes to store the data, and the DataNode process running on each slave serves read/write requests from clients. Since the actual data is stored on the DataNodes, they should have high storage capacity to hold more data.
3. Secondary NameNode
The Secondary NameNode takes periodic (for example, hourly) checkpoints of the metadata. If the Hadoop cluster fails or crashes, the checkpointed metadata, stored in a file named fsimage, can be transferred to a new system; a new Master is created with this metadata, and the cluster is brought back up correctly.
This is the benefit of the Secondary NameNode. In Hadoop 2, the High Availability and Federation features reduce the importance of the Secondary NameNode. It continuously reads the metadata from the RAM of the NameNode and writes it to the hard disk.
4. Resource Manager
The Resource Manager, also known as the global master daemon, runs on the master system. It manages the resources for the applications that are running in a Hadoop cluster, and it mainly consists of two components:
1. ApplicationsManager
2. Scheduler
The ApplicationsManager is responsible for accepting a client's job submission and for negotiating a container on a slave node in the Hadoop cluster to host that application's ApplicationMaster. The Scheduler is responsible for allocating resources to the applications running in the Hadoop cluster.
5. Node Manager
The Node Manager runs on the slave systems and manages the memory and disk resources within its node. Each slave node in a Hadoop cluster has a single NodeManager daemon running on it, which monitors resource usage and reports this information to the Resource Manager.
Comparing SQL databases and Hadoop
Fault tolerance: Hadoop is highly fault tolerant; SQL databases have good fault tolerance.
Data update: Hadoop follows a write-once, read-many-times model; SQL databases read and write data multiple times.
Design of HDFS:
HDFS (Hadoop Distributed File System) is used for storage in Hadoop. It is mainly designed to work on commodity hardware (inexpensive devices), using a distributed file system design. HDFS is designed in such a way that it prefers storing data in large blocks rather than in many small blocks.
HDFS provides fault tolerance and high availability to the storage layer and the other devices present in the Hadoop cluster. The data storage nodes in HDFS are:
NameNode(Master)
DataNode(Slave)
NameNode: The NameNode works as the Master in a Hadoop cluster and guides the DataNodes (slaves). The NameNode mainly stores the metadata, i.e., the data about the data. Metadata can be the transaction logs that keep track of user activity in the Hadoop cluster.
Metadata can also include the name of a file, its size, and information about the location (block number, block IDs) of its blocks on the DataNodes, which the NameNode uses to find the closest DataNode for faster communication. The NameNode instructs the DataNodes with operations such as delete, create, replicate, etc.
DataNode: DataNodes work as slaves. DataNodes are mainly used for storing the data in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. The more DataNodes there are, the more data the Hadoop cluster can store, so it is advised that DataNodes have high storage capacity to store a large number of file blocks.
Features of HDFS
It is suitable for distributed storage and processing.
Hadoop provides a command-line interface to interact with HDFS.
The built-in servers of the NameNode and DataNode help users easily check the status of the cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.
File Block in HDFS: Data in HDFS is always stored in terms of blocks: a single file is divided into multiple blocks of 128 MB, which is the default size and can be changed manually.
Let's understand this concept of breaking a file into blocks with an example. Suppose you upload a 400 MB file to HDFS; it is divided into blocks of 128 MB + 128 MB + 128 MB + 16 MB = 400 MB, meaning 4 blocks are created, each of 128 MB except the last one. Hadoop doesn't know or care what data is stored in these blocks, so it simply keeps the final, smaller block as a partial block. In the Linux file system, the block size is about 4 KB, which is far smaller than the default block size in the Hadoop file system.
Hadoop is mainly configured for storing very large data sets, up to petabytes in size; this scalability is what makes the Hadoop file system different from other file systems. Nowadays, block sizes of 128 MB to 256 MB are commonly used in Hadoop.
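The arithmetic for that example is easy to check; the short snippet below recomputes the block layout for a 400 MB file under the default 128 MB block size.

```python
# 400 MB file, 128 MB default block size.
file_size_mb, block_size_mb = 400, 128

full_blocks, remainder = divmod(file_size_mb, block_size_mb)
block_sizes = [block_size_mb] * full_blocks + ([remainder] if remainder else [])

print(len(block_sizes), block_sizes)   # -> 4 [128, 128, 128, 16]
```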
File Read in HDFS
Step 1: The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (DFS).
Step 2: DFS calls the name node, using remote procedure calls (RPCs), to determine the locations of the first few blocks in the file. For each block, the name node returns the addresses of the data nodes that have a copy of that block. The DFS returns an FSDataInputStream to the client for it to read data from; FSDataInputStream in turn wraps a DFSInputStream, which manages the data node and name node I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the data node addresses for the first few blocks in the file, connects to the first (closest) data node holding the first block in the file.
Step 4: Data is streamed from the data node back to the client, which calls read() repeatedly on the stream.
Step 5: When the end of the block is reached, DFSInputStream closes the connection to that data node and then finds the best data node for the next block.
Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.
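As a hedged illustration of this flow from a client's point of view, the snippet below reads a file over the WebHDFS REST interface: the NameNode answers the OPEN request with a redirect to a DataNode that holds the block, and the client then streams the bytes from that DataNode, mirroring steps 2 to 4. The host, port (9870 is the usual Hadoop 3 NameNode web port), user, and path are assumptions for your cluster.

```python
import requests

# Assumed cluster details: adjust the NameNode web address, user and file path.
NAMENODE = "http://namenode.example.com:9870"   # Hadoop 3 NameNode web UI port

# The NameNode does not serve the file bytes itself; it answers OPEN with a
# redirect to a DataNode that holds the block, and requests follows that
# redirect and streams the data from the DataNode.
resp = requests.get(
    f"{NAMENODE}/webhdfs/v1/data/logs/app.log",
    params={"op": "OPEN", "user.name": "hdfs"},
)
resp.raise_for_status()
print(resp.content[:200])   # first bytes of the file
```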
File Write in HDFS
Step 1: The client creates the file by calling create() on the DistributedFileSystem (DFS).
Step 2: DFS makes an RPC call to the name node to create a new file in the file system's namespace, with no blocks associated with it. The name node performs various checks to make sure the file doesn't already exist and that the client has the right permissions to create the file. If these checks pass, the name node makes a record of the new file; otherwise, the file can't be created and the client is returned an error. The DFS returns an FSDataOutputStream for the client to start writing data to, which in turn wraps a DFSOutputStream.
Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the name node to allocate new blocks by picking a list of suitable data nodes to store the replicas. These data nodes form a pipeline, and the DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to the second data node.
Step 4: Similarly, the second data node stores the packet and forwards it to the third (and last) data node in the pipeline.
Step 5: The DFSOutputStream also maintains an internal queue of packets waiting to be acknowledged, called the ack queue; a packet is removed from the ack queue only once it has been acknowledged by all the data nodes in the pipeline.
Step 6: When the client has finished writing, it calls close() on the stream. This action flushes all the remaining packets to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is complete.
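A simple, hedged way to trigger this write pipeline without touching the Java API is the standard `hdfs dfs -put` command, which copies a local file into HDFS and lets the framework handle block allocation, the DataNode pipeline, and the acknowledgments; the paths below are placeholders.

```python
import subprocess

# Copy a local file into HDFS; paths are placeholders. Hadoop splits the file
# into blocks, picks DataNodes, and pushes packets through the pipeline for us.
subprocess.run(
    ["hdfs", "dfs", "-put", "local_sales.csv", "/data/sales/sales.csv"],
    check=True,
)

# Confirm the file is now in HDFS.
subprocess.run(["hdfs", "dfs", "-ls", "/data/sales"], check=True)
```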