Big Data Analytics (Unit-II)
Big data application: Extracts insights like hidden patterns, market trends, and
customer preferences
Data ingestion: Moves data, especially unstructured data, to a system where it can be
stored and analyzed
Computer data storage: Centralizes and consolidates data from various sources for
analytical purposes
Data warehouse: A centralized storage container that consolidates company data
Data analytics: Helps organizations gain insights, optimize operations, and predict
future outcomes
ETL tools: Prepare a new data source to be stored
Automated generation of insights: Provides an easier and faster way to obtain
important findings
Business Analytics: Uses data to enable data-driven decisions
Data lakes: Store large amounts of raw data
2. Data Source Layer.
Ans = Data Sources Layer
Organizations generate a huge amount of data on a daily basis. The basic function of the data
sources layer is to absorb and integrate the data coming from various sources, at varying
velocity and in different formats. Before this data is considered for the Big Data stack, we have to
differentiate between the noise and the relevant information.
The data source layer in big data is capable of processing large amounts of data from
different sources in batch and real-time. These sources include:
Data warehouses
RDBMS
SaaS apps
Internet of Things sensors
The data available for analysis can vary in origin and format. The format may be
structured, unstructured, or semi-structured. The speed of data arrival and delivery will vary
according to the source. The data collection mode may be direct or through data providers, in
batch mode or in real-time.
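As a rough illustration of how the collection step has to cope with differing formats, the
following Python sketch reads a structured CSV export alongside a semi-structured JSON dump and
simply counts what was collected. The file names orders.csv and sensor.json are hypothetical
examples, not part of any specific platform.

import csv
import json

def read_structured(csv_path):
    # Structured, tabular records, e.g. a batch export from an RDBMS.
    with open(csv_path, newline="") as f:
        return list(csv.DictReader(f))

def read_semi_structured(json_path):
    # Semi-structured records, e.g. events pushed by a SaaS app or an IoT gateway;
    # the keys may vary from record to record.
    with open(json_path) as f:
        return json.load(f)

if __name__ == "__main__":
    rows = read_structured("orders.csv")          # hypothetical batch source
    events = read_semi_structured("sensor.json")  # hypothetical near-real-time dump
    print(f"Collected {len(rows)} structured and {len(events)} semi-structured records")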
3. Ingestion Layer.
Ans = Ingestion Layer : The role of the ingestion layer is to absorb the huge inflow of data
and sort it out in different categories. This layer separates noise from relevant information. It
can handle huge volume, high velocity, and a variety of data. The ingestion layer validates,
cleanses, transforms, reduces, and integrates the unstructured data into the Big Data stack for
further processing.
The data ingestion layer is the first layer in the big data architecture. It is responsible for
collecting data from various sources, such as IoT devices, data lakes, databases, and SaaS
applications.
The data ingestion layer prioritizes and categorizes the data. It also provides:
Encryption
Support for protocols such as Secure Sockets Layer (SSL) and HTTP over SSL (HTTPS)
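To make the validate-cleanse-categorize idea concrete, here is a minimal Python sketch of a
single ingestion step. The required fields, the cleansing rules, and the categories are
illustrative assumptions rather than the behaviour of any particular ingestion tool.

def ingest(record):
    # Validation: drop records that are missing required fields (treat them as noise).
    required = ("id", "timestamp", "payload")
    if not all(field in record and record[field] not in (None, "") for field in required):
        return None
    # Cleansing: normalize simple formatting issues in the payload.
    record["payload"] = str(record["payload"]).strip().lower()
    # Categorization: tag the record so later layers can route it.
    record["category"] = "sensor" if "temp" in record["payload"] else "other"
    return record

batch = [
    {"id": 1, "timestamp": "2024-01-01T00:00:00", "payload": " Temp=21C "},
    {"id": 2, "timestamp": "", "payload": "broken"},  # rejected as noise
]
clean = [r for r in (ingest(rec) for rec in batch) if r is not None]
print(clean)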
4. Storage Layer
Ans = Storage Layer
Hadoop is an open source framework used to store large volumes of data in a distributed
manner across multiple machines. The Hadoop storage layer supports fault tolerance and
parallelization, which enable high-speed distributed processing algorithms to execute over
large-scale data. There are two major components of Hadoop: a scalable Hadoop Distributed
File System (HDFS) that can support petabytes of data, and a MapReduce engine that
computes results in batches.
HDFS is a file system that is used to store huge volumes of data across a large number of
commodity machines in a cluster. The data can be in terabytes or petabytes. HDFS stores data
in the form of blocks of files and follows the write-once-read-many model for accessing data
from these blocks of files. The files stored in HDFS are operated upon by many complex
programs, as per the requirement.
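The classic example of a MapReduce batch job is word count. The sketch below simulates the
map, shuffle, and reduce phases in plain Python over an in-memory list of lines; an actual
Hadoop job would run the same mapper and reducer logic in parallel over HDFS blocks.

from collections import defaultdict

# Map phase: emit (word, 1) pairs from each line of input.
def mapper(line):
    for word in line.split():
        yield word.lower(), 1

# Reduce phase: sum the counts emitted for each word.
def reducer(word, counts):
    return word, sum(counts)

def map_reduce(lines):
    groups = defaultdict(list)
    for line in lines:                 # in Hadoop, map tasks run in parallel on HDFS blocks
        for key, value in mapper(line):
            groups[key].append(value)  # the framework's shuffle/sort step groups by key
    return dict(reducer(k, v) for k, v in groups.items())

print(map_reduce(["big data big insights", "big clusters"]))
# {'big': 3, 'data': 1, 'insights': 1, 'clusters': 1}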
Big data storage requirements can also be addressed by a concept known as Not Only SQL
(NoSQL) databases. Some examples of NoSQL databases include HBase, MongoDB,
AllegroGraph, and InfiniteGraph.
5. RDMS and Big Data.
Ans = Storing Data in Databases and Data Warehouses:
RDBMS and Big Data
An RDBMS uses a relational model where all the data is stored using preset schemas. These
schemas are linked using the values in specific columns of each table. The data is structured,
which means that for data to be stored or transacted it needs to adhere to the ACID
standards, namely:
Atomicity-Ensures full completion of a database operation.
Consistency-Ensures that data abides by the schema (table) standards, such as correct data
type entry, constraints, and keys.
Isolation-Refers to the encapsulation of information. Makes only necessary information
visible.
Durability-Ensures that transactions stay valid even after a power failure or errors.
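As a small illustration of atomicity and consistency, the following Python sketch uses the
standard-library sqlite3 module; the accounts table and the failing transfer are invented for
the example. Because the overdraft violates the CHECK constraint, neither UPDATE takes effect,
which is the all-or-nothing behaviour that atomicity promises.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, "
             "balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # one transaction: both updates commit together or not at all (atomicity)
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'bob'")
except sqlite3.IntegrityError:
    # The CHECK constraint (consistency) rejects the overdraft, so the whole transfer rolls back.
    print("transfer rejected, balances unchanged")

print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 100), ('bob', 50)]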
In traditional database systems, every time data is accessed or modified, it needs to be
moved (indexed) to a central location for processing. Therein lies a major limitation of
hardware upgrades: you can upgrade your hardware to improve performance; however,
depending on the hardware platform, there is a limit on the number of processors and the
amount of system memory that can be used to concurrently perform database operations.
Besides the processing-power constraint, network latency can also occur during data transfer
to the central node.
6. Issues with relational model.
1 – Maintenance Problem
The maintenance of the relational database becomes difficult over time due to the increase in
the data. Developers and programmers have to spend a lot of time maintaining the database.
2 – Cost
The relational database system is costly to set up and maintain. The initial cost of the
software alone can be quite high for smaller businesses, and it increases further when you
factor in hiring a professional technician who also needs expertise with that specific kind of
system.
3 – Physical Storage
A relational database is comprised of rows and columns, which can require a lot of physical
storage because each table and its indexes are stored separately. The physical storage
requirements grow along with the increase in data.
4 – Lack of Scalability
While using a relational database over multiple servers, its structure changes and becomes
difficult to handle, especially when the quantity of data is large. Due to this, the data does
not scale well across different physical storage servers, and performance suffers, i.e.,
reduced data availability and longer load times. As the database becomes larger or more
distributed across a greater number of servers, latency and availability issues affect overall
performance.
5 – Complexity in Structure
Relational databases can only store data in tabular form which makes it difficult to represent
complex relationships between objects. This is an issue because many applications require
more than one table to store all the necessary data required by their application logic.
Ans = A relational database is a collection of information that organizes data points with
defined relationships for easy access. In the relational database model, the data structures --
including data tables, indexes and views -- remain separate from the physical storage
structures, enabling database administrators to edit the physical data storage without affecting
the logical data structure.
In the enterprise, relational databases are used to organize data and identify relationships
between key data points. They make it easy to sort and find information, which helps
organizations make business decisions more efficiently and minimize costs. They work well
with structured data.
The data tables used in a relational database store information about related objects. Each row
holds a record with a unique identifier -- known as a key -- and each column contains the
attributes of the data. Each record assigns a value to each feature, making relationships
between data points easy to identify.
The standard user and application program interface (API) of a relational database is the
Structured Query Language. SQL code statements are used both for interactive queries for
information from a relational database and for gathering data for reports. Defined data
integrity rules must be followed to ensure the relational database is accurate and accessible.
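As a hedged example of how SQL relates tables through their key columns, the sketch below
builds a tiny in-memory SQLite database in Python and runs a join with an aggregate; the
customers and orders tables and their contents are made up for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         amount REAL);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 90.0), (12, 2, 40.0);
""")

# SQL uses the key columns to relate the two tables and aggregate per customer.
query = """
    SELECT c.name, SUM(o.amount) AS total_spent
    FROM customers AS c JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name ORDER BY total_spent DESC
"""
for name, total in conn.execute(query):
    print(name, total)
# Asha 340.0
# Ravi 40.0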
The main advantages of relational databases include the following:
1. Categorizing data. Database administrators can easily categorize and store data in a
relational database that can then be queried and filtered to extract information for reports.
Relational databases are also easy to extend and aren't reliant on physical organization. After
the original database creation, a new data category can be added without having to modify the
existing applications.
2. Accuracy. Data is stored just once, eliminating data duplication in storage procedures.
3. Ease of use. Complex queries are easy for users to carry out with SQL, the main query
language used with relational databases.
4. Security. Direct access to data in tables within an RDBMS can be limited to specific users.
The ETL (extract, transform, load) process moves data from source systems into a data
warehouse. The first step, extraction, involves pulling data from various sources. These sources can be
anything from databases, cloud data storage, data lakes, to big data platforms. SQL
(Structured Query Language) is often used in this step to query and retrieve data from these
sources, including disparate sources like Amazon Redshift and Google BigQuery.
Once the data is extracted, it undergoes the transformation process. This step involves
cleaning, validating, and converting the data into a consistent format that can be used in the
data warehouse. This might involve tasks such as removing duplicates, validating data for
consistency and accuracy, and converting data types to match the data warehouse schema.
The final step is loading the data into the data warehouse. This involves writing the
transformed data into the data warehouse's storage system. Depending on the requirements,
this could be a full load, where all the data is written into the warehouse, or an incremental
load, where only new or updated data is written.
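The sketch below walks through the same three steps on toy data, using an in-memory SQLite
database to stand in for the warehouse; the fact_orders table, the source rows, and the
validation rules are assumptions made for the example.

import sqlite3

def extract(rows):
    # Extract: in practice this would query the source systems; here the rows are given.
    return rows

def transform(rows):
    # Transform: remove duplicates, validate, and coerce types to match the warehouse schema.
    seen, clean = set(), []
    for row in rows:
        key = row["order_id"]
        if key in seen or row["amount"] in (None, ""):   # dedupe and validate
            continue
        seen.add(key)
        clean.append((key, float(row["amount"])))        # convert data types
    return clean

def load(conn, rows):
    # Load: write the transformed rows into the warehouse table (a full load in this sketch).
    conn.execute("CREATE TABLE IF NOT EXISTS fact_orders (order_id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany("INSERT OR REPLACE INTO fact_orders VALUES (?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")                  # stands in for the data warehouse
source = [{"order_id": 1, "amount": "19.99"},
          {"order_id": 1, "amount": "19.99"},            # duplicate, dropped
          {"order_id": 2, "amount": ""}]                 # invalid, dropped
load(warehouse, transform(extract(source)))
print(warehouse.execute("SELECT * FROM fact_orders").fetchall())   # [(1, 19.99)]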
This process has evolved with the advent of cloud data warehouses and big data, leading to
new techniques and tools for data integration. For instance, the ingestion of data into
platforms like Amazon Redshift and Google BigQuery has become more streamlined and
efficient.
10.Data Visualization.
Ans = Data visualization is the fourth layer and is responsible for creating visualizations
of the data that humans can easily understand. This layer is important for making the data
accessible.
The data visualization layer in a big data architecture is where the success of a project is
measured, because it allows users to perceive the value of the data. Tools such as Microsoft
Power BI are commonly used in this layer. The visualization layer works together with the
other layers of the stack:
Ingestion layer: loads data from data sources into the data platform
Analytics layer: produces the business insight that the visualization layer presents
Management layer: separates noise from relevant information in a huge data set
Some common visualization types include charts, graphs, dashboards, and heat maps.
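As a simple illustration of this layer, the following Python sketch uses matplotlib to turn a
small aggregated result into a bar chart; in practice a BI tool such as Power BI would play
this role, and the regions and sales figures shown are invented.

import matplotlib.pyplot as plt

# Hypothetical aggregated output from the analytics layer.
regions = ["North", "South", "East", "West"]
sales = [120, 95, 143, 80]

plt.bar(regions, sales)                     # a simple bar chart of the derived metric
plt.title("Sales by region")
plt.xlabel("Region")
plt.ylabel("Sales (units)")
plt.savefig("sales_by_region.png")          # the exported image is what end users see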
Ans = Big data security is a collection of measures and tools that protect data and analytics
methods from attacks, theft, and other malicious activities. Big data security is made up of
three layers: incoming, stored, and outgoing data.
Big data security tools and measures include encryption, user access control, intrusion
detection and prevention, centralized key management, and auditing.
Ans = Big data virtualization is a process that creates virtual structures for big data
systems. It enables organizations to use all the data they collect to achieve various goals and
objectives.
Big data virtualization offers a modernized approach to data integration. It serves as a logical
data layer that combines all enterprise data to produce real-time information for business
users.
Big data virtualization guarantees that data is adequately connected with other systems so that
organizations may harness big data for analytics and operations.
Big data virtualization minimizes persistent data stores and associated costs. It integrates data
from multiple sources of different types into a holistic, logical view without moving it
physically.
The Hadoop infrastructure layer takes care of the hardware and network requirements. It can
provide a virtualized cloud environment or a distributed grid of commodity servers over a fast
gigabit network. Following are the main components of a Hadoop infrastructure:
N commodity servers (8-core CPUs, 24 GB RAM, 4 to 12 TB of disk, Gigabit Ethernet)
Two-level network (20 to 40 nodes per rack)
14. Platform Management Layer in big data.
Ans = The management system in big data focuses on data access and data mining. The
management system is made up of six modules:
Interface acquisition, Program scheduling, Data aggregation, Platform alerting,
Marketing analysis, Visualization.
The platform management layer includes an edge application service platform for virtualized
resource management, which allocates resources in the network to different services and
provides the operation and management of edge services.
Ans = NoSQL databases, which stand for "not only SQL," are a popular alternative to
traditional relational databases. They are designed to handle large amounts of unstructured or
semi-structured data, and are often used for big data and real-time web applications.
However, like any technology, NoSQL databases come with their own set of challenges.
Challenges of NoSQL:
1) Data modeling and schema design: One of the biggest challenges with NoSQL databases
is data modeling and schema design. Unlike relational databases, which have a well-defined
schema and a fixed set of tables, NoSQL databases often do not have a fixed schema. This
can make it difficult to model and organize data in a way that is efficient and easy to query.
Additionally, the lack of a fixed schema can make it difficult to ensure data consistency and
integrity.
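The sketch below illustrates this flexibility (and its downside) with the pymongo driver,
assuming a MongoDB instance is reachable at localhost:27017; the shop database, products
collection, and documents are hypothetical.

from pymongo import MongoClient  # assumes the pymongo driver and a local MongoDB instance

client = MongoClient("mongodb://localhost:27017")
products = client["shop"]["products"]        # database and collection names are made up

# Documents in the same collection may have different fields: flexible, but the
# application itself must now enforce whatever structure it relies on.
products.insert_many([
    {"sku": "A1", "name": "kettle", "price": 25},
    {"sku": "B2", "name": "ebook", "price": 9, "download_url": "https://example.com/b2"},
])

for doc in products.find({"price": {"$lt": 20}}):
    print(doc.get("name"), doc.get("download_url"))   # fields must be read defensively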
2) Scalability: NoSQL databases are often used for big data and real-time web applications,
which means that they need to be able to scale horizontally. However, scaling a NoSQL
database can be complex and requires careful planning. You may need to consider issues
such as sharding, partitioning, and replication, as well as the impact of these decisions on
query performance and data consistency.
3) Data security: Ensuring the security of sensitive data is a critical concern for any
organization. NoSQL databases, however, may not have the same level of built-in security
features as relational databases. This means that additional measures may need to be put in
place to secure data at rest and in transit, such as encryption and authentication.
Virtualization offers several benefits for big data platforms:
Improves efficiency
Allows for fewer physical servers in a data center
Helps platforms scale to handle large volumes of data
Improves application processing performance
Allows you to run different operating systems on the same hardware
Big data is a collection of structured, unstructured, and semi-structured data that continues to
grow exponentially. It's characterized by: Volume, Variety, Velocity, Variability.
Virtualization is not strictly required for big data analysis, but software frameworks are
more efficient in a virtualized environment. For example, MapReduce algorithms generally
perform better in a virtualized environment.
Ans = Big data monitoring tracks metrics like: Response times, Resource utilization, Error
rates, Transaction performance.
Monitoring can alert users to issues or anomalies so they can take action.
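A minimal sketch of such a check is shown below: it computes an average response time and an
error rate over a few recent requests and raises alerts when example thresholds are exceeded.
The metric names and thresholds are assumptions for illustration only.

# Toy monitoring check over recent request metrics; the thresholds are arbitrary examples.
requests = [
    {"response_ms": 120, "error": False},
    {"response_ms": 950, "error": False},
    {"response_ms": 300, "error": True},
]

avg_response = sum(r["response_ms"] for r in requests) / len(requests)
error_rate = sum(r["error"] for r in requests) / len(requests)

alerts = []
if avg_response > 500:
    alerts.append(f"high average response time: {avg_response:.0f} ms")
if error_rate > 0.05:
    alerts.append(f"error rate above threshold: {error_rate:.0%}")

for alert in alerts:
    print("ALERT:", alert)   # in a real platform this would notify an operator or dashboard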
The security and governance layer of big data architecture includes: Access control,
Encryption, Network security, Usage monitoring, Auditing mechanisms.
The security layer also tracks the operations of other layers.
According to the CAP theorem, a distributed data store can provide at most two of the
following three guarantees at the same time:
Consistency –
Consistency means that the nodes will have the same copies of a replicated data
item visible for various transactions. A guarantee that every node in a distributed
cluster returns the same, most recent and a successful write. Consistency refers
to every client having the same view of the data. There are various types of
consistency models. Consistency in CAP refers to sequential consistency, a very
strong form of consistency.
Availability –
Availability means that each read or write request for a data item will either be
processed successfully or will receive a message that the operation cannot be
completed. Every non-failing node returns a response for all the read and write
requests in a reasonable amount of time. The key word here is “every”. In simple
terms, every node (on either side of a network partition) must be able to respond
in a reasonable amount of time.
Partition Tolerance –
Partition tolerance means that the system can continue operating even if the
network connecting the nodes has a fault that results in two or more partitions,
where the nodes in each partition can only communicate among each other. That
means, the system continues to function and upholds its consistency guarantees
in spite of network partitions. Network partitions are a fact of life. Distributed
systems guaranteeing partition tolerance can gracefully recover from partitions
once the partition heals.
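To see the trade-off concretely, the toy Python model below simulates a write to one
replicated value during a partition in which only one of three replicas is reachable: a
CP-style system rejects the write to stay consistent, while an AP-style system accepts it and
lets the replicas diverge until the partition heals. This is a thought experiment, not how any
real database is implemented.

# Toy model of one replicated key during a network partition.
class Replica:
    def __init__(self):
        self.value = 0

def write(replicas, reachable, value, mode):
    # Attempt a write when only the `reachable` replicas can be contacted.
    if mode == "CP" and len(reachable) <= len(replicas) // 2:
        return "rejected (no majority: stay consistent, give up availability)"
    for r in reachable:                      # AP mode accepts the write anyway
        r.value = value
    return "accepted"

a, b, c = Replica(), Replica(), Replica()
# Partition: the client can only reach replica `a`.
print("CP:", write([a, b, c], [a], 42, mode="CP"))    # rejected
print("AP:", write([a, b, c], [a], 42, mode="AP"))    # accepted
print("values now:", a.value, b.value, c.value)       # 42 0 0 -> replicas diverge until healed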
(Figure: database systems classified by which CAP properties they prioritize at a given time.)
20. List some major functions of the big data architecture model.
Ans = A big data architecture is a system that manages, stores, processes, and analyzes
large amounts of data. It's designed to handle data that's too large or complex for traditional
database systems.
Big data architectures typically involve one or more of the following types of workload:
1. Batch processing of big data sources at rest.
2. Real-time processing of big data in motion.
3. Interactive exploration of big data.
4. Predictive analytics and machine learning.