
ADBMS & Big Data(22MCS23)

Advanced DBMS and Big data


Module 1
Understanding Big data
What is big data?

Big data primarily refers to data sets that are too large or complex to be dealt with
by traditional data-processing application software. Data with many entries offer
greater statistical power, while data with higher complexity may lead to a higher
false discovery rate.

Difference Between Big Data and Data Mining


1. Data Mining: It is one of the methods in the pipeline of Big Data.
   Big Data: It is a technique to collect, maintain, and process huge information; it explains the relationships in the data.

2. Data Mining: It is a part of Knowledge Discovery in Data. It is a close-up view of the data.
   Big Data: It is about extracting the vital and valuable information from a huge amount of data. It is a technique for tracking and discovering trends in complex data sets. It is a large, overall view of the data.

3. Data Mining: The goal is the same as that of Big Data, as it is one of the tools of Big Data.
   Big Data: The goal is to make data more vital and usable, i.e., to extract only the important information from the huge data within existing traditional aspects.

4. Data Mining: It is manual as well as automated in nature.
   Big Data: It is only automated, as processing huge data manually is difficult.

5. Data Mining: It focuses on only one form of data, i.e., structured.
   Big Data: It focuses on and works with all forms of data, i.e., structured, unstructured, or semi-structured.

6. Data Mining: It is used to create certain business insights. Data mining is a manager of the mine.
   Big Data: It is mainly used for business purposes and customer satisfaction. Big Data is the mine.

7. Data Mining: It is a subset of Big Data, i.e., one of its tools.
   Big Data: It is a superset of Data Mining.

8. Data Mining: It is a tool to dig up the vital information from large data. The data can be large as well as small.
   Big Data: It is more involved with the processes of handling voluminous data. The data can only be large.

Difference Between Big Data and Data Warehouse

1. Big Data: Data in an enormous form on which technologies can be applied.
   Data Warehouse: A collection of historical data from different operations in an enterprise.

2. Big Data: A technology to store and manage large amounts of data.
   Data Warehouse: An architecture used to organize the data.

3. Big Data: Takes structured, unstructured, or semi-structured data as input.
   Data Warehouse: Takes only structured data as input.

4. Big Data: Does its processing by using a distributed file system.
   Data Warehouse: Does not use a distributed file system for processing.

5. Big Data: Does not use SQL queries to fetch data from the database.
   Data Warehouse: Uses SQL queries to fetch data from relational databases.

6. Big Data: Apache Hadoop can be used to handle enormous amounts of data.
   Data Warehouse: Cannot be used to handle enormous amounts of data.

7. Big Data: When new data is added, the changes in data are stored in the form of a file, which is represented by a table.
   Data Warehouse: When new data is added, the changes in data do not directly impact the data warehouse.

8. Big Data: Does not require management techniques as efficient as those of a data warehouse.
   Data Warehouse: Requires more efficient management techniques, as the data is collected from different departments of the enterprise.

Why Big data


Importance of Big data

Big data is the next generation of data warehousing and business analytics, and is poised to deliver top-line revenue cost-effectively for enterprises. The greatest part about this phenomenon is the rapid pace of innovation and change.

There are three standard answers to the question of why this is happening now:

1. Computing perfect storm. Big Data analytics are the natural result of four
major global trends: Moore's Law (which basically says that technology always
gets cheaper), mobile computing (that smartphone or mobile tablet in your hand),
social networking (Facebook, Foursquare, Pinterest, etc.), and cloud computing
(you don't even have to own hardware or software anymore; you can rent or lease
someone else's).

2. Data perfect storm. Volumes of transactional data have been around for
decades for most big firms, but the floodgates have now opened with more
volume, velocity, and variety (the three Vs) of data arriving in unprecedented
ways. This perfect storm of the three Vs makes it extremely complex and
cumbersome to cope with using current data management and analytics
technology and practices.

3. Convergence perfect storm. Another perfect storm is happening, too.
Traditional data management and analytics software and hardware technologies,
open-source technology, and commodity hardware are merging to create new
alternatives for IT and business executives to address Big Data analytics.

People are able to store more data now than ever before. We have reached a
tipping point where organizations no longer have to decide which half of the
data to keep or how much history to retain. It is now economically feasible to
keep all of your history and all of your variables, and to go back later, when
you have a new question, and start looking for an answer.

Big data in many sectors today ranges from a few dozen terabytes to multiple
petabytes (thousands of terabytes). The real challenge is identifying or
developing the most cost-effective and reliable methods for extracting value
from all the terabytes and petabytes of data now available. That is where Big
Data analytics becomes necessary.

Comparing traditional analytics to Big Data analytics, the differences in speed, scale, and complexity are tremendous.

Companies use big data in their systems to improve operational efficiency, provide
better customer service, create personalized marketing campaigns and take other
actions that can increase revenue and profits. Businesses that use big data
effectively hold a potential competitive advantage over those that don't because
they're able to make faster and more informed business decisions.

For example, big data provides valuable insights into customers that companies
can use to refine their marketing, advertising and promotions to increase customer
engagement and conversion rates. Both historical and real-time data can be
analyzed to assess the evolving preferences of consumers or corporate buyers,
enabling businesses to become more responsive to customer wants and needs.

Medical researchers use big data to identify disease signs and risk factors.
Doctors use it to help diagnose illnesses and medical conditions in patients. In
addition, a combination of data from electronic health records, social media sites,
the web and other sources gives healthcare organizations and government agencies
up-to-date information on infectious disease threats and outbreaks.

There are multiple benefits organizations can get by using big data.

Here are some more examples of how organizations in various industries use big
data:

 Big data helps oil and gas companies identify potential drilling locations and
monitor pipeline operations. Likewise, utilities use it to track electrical grids.

 Financial services firms use big data systems for risk management and real-time
analysis of market data.
 Manufacturers and transportation companies rely on big data to manage their
supply chains and optimize delivery routes.
 Government agencies use big data for emergency response, crime prevention
and smart city initiatives.

Types of big data

The information contained in big data repositories can be classified into
different types. Around 2.5 quintillion bytes of data are generated every day by
users, and predictions by Statista suggest that by the end of 2021, 74 zettabytes
(74 trillion GB) of data would be generated on the internet. Managing such a
vast and perennial outpouring of data is increasingly difficult. So, to manage
such huge, complex data, Big Data was introduced; it is concerned with
extracting meaningful information from large and complex data sets that cannot
be handled or analyzed by traditional methods.

All data cannot be stored in the same way. The methods for data storage
can be accurately evaluated only after the type of data has been identified. A
cloud service, like Microsoft Azure, is a one-stop destination for storing all
kinds of data: blobs, queues, files, tables, disks, and application data.
However, even within the cloud, there are special services to deal with specific
sub-categories of data.
For example, Azure cloud services like Azure SQL and Azure Cosmos DB help
in handling and managing these varied kinds of data.
Application data is the data that is created, read, updated, deleted, or
processed by applications. This data could be generated via web apps, Android
apps, iOS apps, or any other applications. Due to the wide diversity in the
kinds of data being used, determining the storage approach is a little nuanced.
Big data can be classified into 3 types:

Structured Data

 Structured data can be crudely defined as the data that resides in a fixed field
within a record.
 It is the type of data most familiar from everyday life, for example: birthdays
and addresses.
 A certain schema binds it, so all the data has the same set of properties.
Structured data is also called relational data. It is split into multiple tables to
enhance the integrity of the data by creating a single record to depict an entity.
Relationships are enforced by the application of table constraints.
 The business value of structured data lies in how well an organization can
utilize its existing systems and processes for analysis purposes.

Sources of structured data


Structured Query Language (SQL) is used to bring the data together.
Structured data is easy to enter, query, and analyze, because all of the data
follows the same format. However, forcing a consistent structure also means that
any alteration of that structure is difficult, as each record has to be updated to
adhere to the new schema. Examples of structured data include numbers, dates,
strings, etc. The business data of an e-commerce website can be considered
structured data.
Name    Class   Section   Roll No   Grade
Geek1   11      A         1         A
Geek2   11      A         2         B
Geek3   11      A         3         A
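
As a sketch of how such a table is used by a program, here is the same data
expressed with Python's built-in sqlite3 module; the table and column names
simply mirror the example above and are illustrative.

import sqlite3

# A fixed schema: every record has the same set of properties.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE students (
        name    TEXT PRIMARY KEY,
        class   INTEGER NOT NULL,
        section TEXT NOT NULL,
        roll_no INTEGER NOT NULL,
        grade   TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?, ?)",
    [("Geek1", 11, "A", 1, "A"),
     ("Geek2", 11, "A", 2, "B"),
     ("Geek3", 11, "A", 3, "A")],
)

# Because the structure is fixed, querying is straightforward.
for (name,) in conn.execute("SELECT name FROM students WHERE grade = 'A'"):
    print(name)   # prints Geek1, then Geek3

Because every row adheres to the same schema, SQL can filter, join, and
aggregate the rows directly; this is exactly the property that makes structured
data easy to enter, query, and analyze.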

Cons of Structured Data


1. Structured data can only be leveraged in cases of predefined functionalities.
This means that structured data has limited flexibility and is suitable for
certain specific use cases only.
2. Structured data is stored in a data warehouse with rigid constraints and a
definite schema. Any change in requirements would mean updating all of that
structured data to meet the new needs. This is a massive drawback in terms of
resource and time management.

Semi-Structured Data

 Semi-structured data is not bound by any rigid schema for data storage and
handling. The data is not in the relational format and is not neatly organized
into rows and columns like that in a spreadsheet. However, there are some
features like key-value pairs that help in discerning the different entities from
each other.
 Since semi-structured data doesn’t need a structured query language, it is
commonly called NoSQL data.
 A data serialization language is used to exchange semi-structured data across
systems that may even have varied underlying infrastructure.

 Semi-structured content is often used to store metadata about a business
process, but it can also include files containing machine instructions for
computer programs.
 This type of information typically comes from external sources such as social
media platforms or other web-based data feeds.

Data is created in plain text so that different text-editing tools can be used to draw
valuable insights. Due to a simple format, data serialization readers can be
implemented on hardware with limited processing resources and bandwidth.
Data Serialization Languages
Software developers use serialization languages to write in-memory data into
files that can be transmitted, stored, and parsed. The sender and the receiver
don't need to know anything about the other system; as long as the same
serialization language is used, the data can be understood by both systems
comfortably. There are three predominantly used serialization languages.
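
Before looking at those three languages, here is a minimal sketch of the
round-trip idea in Python, using its standard json module; the record's fields
reuse the programmer-details example that follows.

import json

# Sender side: serialize an in-memory object to text for storage or transit.
record = {"firstName": "Jane", "lastName": "Doe"}
wire_text = json.dumps(record)

# Receiver side: any system that can parse the same serialization language
# reconstructs the data, regardless of its underlying infrastructure.
received = json.loads(wire_text)
assert received == record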

1. XML – XML stands for eXtensible Markup Language. It is a text-based
markup language designed to store and transport data. XML parsers can be found
in almost all popular development platforms. It is human- and machine-readable,
it has definite standards for schema, transformation, and display, and it is
self-descriptive. Below is an example of a programmer's details in XML.
 XML

<ProgrammerDetails>
  <FirstName>Jane</FirstName>
  <LastName>Doe</LastName>
  <CodingPlatforms>
    <CodingPlatform Type="Fav">GeeksforGeeks</CodingPlatform>
    <CodingPlatform Type="2ndFav">Code4Eva!</CodingPlatform>
    <CodingPlatform Type="3rdFav">CodeisLife</CodingPlatform>
  </CodingPlatforms>
</ProgrammerDetails>

<!-- The 2ndFav and 3rdFav coding platforms are imaginative
     because GeeksforGeeks is the best! -->

XML expresses the data using tags (text within angular brackets) to shape the
data (for example, FirstName) and attributes (for example, Type) to describe
features of the data. However, XML is a verbose and voluminous language, so
other formats have gained more popularity.
2. JSON – JSON (JavaScript Object Notation) is a lightweight, open-standard
file format for data interchange. JSON is easy to use and uses human- and
machine-readable text to store and transmit data objects.
 JSON

{
  "firstName": "Jane",
  "lastName": "Doe",
  "codingPlatforms": [
    { "type": "Fav", "value": "Geeksforgeeks" },
    { "type": "2ndFav", "value": "Code4Eva!" },
    { "type": "3rdFav", "value": "CodeisLife" }
  ]
}

This format isn't as formal as XML; it's more of a key/value-pair model than a
formal data depiction. JavaScript has inbuilt support for JSON. Although JSON
is very popular amongst web developers, non-technical personnel find it tedious
to work with due to its heavy dependence on JavaScript and structural
characters (braces, commas, etc.).
3. YAML – YAML is a user-friendly data serialization language. Recursively, it
stands for YAML Ain't Markup Language. It has been adopted by technical and
non-technical handlers all across the globe owing to its simplicity. The data
structure is defined by line separation and indentation, which reduces the
dependency on structural characters. YAML is easy to comprehend, and its
popularity is a result of its human and machine readability.

YAML example
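Below is a sketch of the same programmer's details from the XML and JSON
examples above, rewritten in YAML; note how line separation and indentation
replace the braces and commas.

firstName: Jane
lastName: Doe
codingPlatforms:
  - type: Fav
    value: GeeksforGeeks
  - type: 2ndFav
    value: Code4Eva!
  - type: 3rdFav
    value: CodeisLife
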
A product catalog organized by tags is an example of semi-structured data.

Unstructured Data

 Unstructured data is the kind of data that doesn’t adhere to any definite
schema or set of rules. Its arrangement is unplanned and haphazard.
 Photos, videos, text documents, and log files can be generally considered
unstructured data. Even though the metadata accompanying an image or a
video may be semi-structured, the actual data being dealt with is unstructured.
 Additionally, unstructured data is also known as "dark data" because it cannot
be analyzed without the proper software tools.

Characteristics or dimensions of big data

1. Volume

The volume of your data is how much of it there is, measured in units ranging
from gigabytes to zettabytes (ZB) and yottabytes (YB). Industry trends predict a
significant increase in data volume over the next few years. Earlier, there were
issues with storing and processing this enormous volume of data, but nowadays
data gathered from all these sources is organized using distributed systems such
as Hadoop. Understanding the usefulness of the data requires knowledge of its
magnitude. Additionally, one may use volume to identify whether a data set is
big data or not.

2. Velocity

Velocity describes how quickly data is processed. Any significant data operation
has to operate at a high rate. This phenomenon is made up of the linkage of
incoming data sets, activity bursts, and the pace of change. Sensors, social
media platforms, and application logs all continuously generate enormous volumes
of data. There is no use in spending time or effort on the data if its flow is
not constant.

3. Variety

The many types of big data are referred to as variety. As it impacts performance, it
is one of the main problems the big data sector is now dealing with. It’s crucial to
organize your data so that you can manage its diversity effectively. Variety is the
wide range of information you collect from numerous sources.

4. Veracity

Veracity refers to the correctness of your data. Poor veracity can severely harm
the accuracy of your findings, making it one of the most crucial big data
qualities. It specifies the level of data reliability. Because most of the data
you encounter is unstructured, it is vital to filter out the information that is
not essential and use the remaining data for processing.

5. Value

Value is the advantage that the data provides to your company. Does it reflect
the objectives of your company? Does it aid in the growth of your company? It is
one of the most crucial fundamentals of big data. Data scientists first transform
raw data into knowledge: once the data has been cleaned, the best data is
extracted from the collection, and analysis and pattern recognition are performed
on it. The results of this process can be used to determine the value of the data.

Convergence of key trends

Refer text book

Grid Computing

The High-Performance Computing (HPC) and Grid Computing communities have
been doing large-scale data processing for years, using Application Program
Interfaces (APIs) such as the Message Passing Interface (MPI). The approach in
HPC is to distribute the work across a cluster of machines, which access a shared
file system hosted by a Storage Area Network (SAN). This works well for
predominantly compute-intensive jobs, but it becomes a problem when nodes need
to access larger data volumes, since the network bandwidth is the bottleneck and
compute nodes become idle.

MapReduce tries to collocate the data with the compute node, so data access is fast
because it is local. This feature, known as data locality, is at the heart of
MapReduce and is the reason for its good performance. Recognizing that network
bandwidth is the most precious resource in a data center environment (it is easy to
saturate network links by copying data around), MapReduce implementations
conserve it by explicitly modelling network topology. MapReduce operates only at
the higher level: the programmer thinks in terms of functions of key and value
pairs, and the data flow is implicit.

Coordinating the processes in a large-scale distributed computation is a
challenge. The hardest aspect is gracefully handling partial failure, when you
don't know whether or not a remote process has failed, while still making
progress with the overall computation. MapReduce spares the programmer from
having to think about failure, since the implementation detects failed map or
reduce tasks and reschedules replacements on machines that are healthy.
MapReduce is able to do this because it is a shared-nothing architecture, meaning
that tasks have no dependence on one another. So, from the programmer's point of
view, the order in which the tasks run doesn't matter. By contrast, MPI programs
have to explicitly manage their own checkpointing and recovery, which gives more
control to the programmer but makes the programs more difficult to write.

MapReduce might sound like a restrictive programming model: we are limited to
key and value types that are related in specified ways, and mappers and reducers
run with very limited coordination between one another (the mappers simply pass
keys and values to reducers). Nevertheless, a large range of algorithms can be
expressed in MapReduce, from image analysis, to graph-based problems, to machine
learning algorithms; a minimal word-count sketch is given below.
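
The sketch below uses plain Python to mimic the shape of a word-count
computation; it is not the Hadoop API, and a real framework would distribute the
map, shuffle, and reduce phases across a cluster.

from collections import defaultdict

def mapper(line):
    # Map phase: emit a (key, value) pair for every word.
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: combine all values that share the same key.
    return word, sum(counts)

documents = ["big data needs big clusters",
             "data locality makes data access fast"]

# Shuffle phase: group intermediate values by key (in a real framework,
# this grouping happens between the map and reduce phases).
groups = defaultdict(list)
for line in documents:
    for word, count in mapper(line):
        groups[word].append(count)

results = dict(reducer(word, counts) for word, counts in groups.items())
print(results["data"])   # prints 3

Because each (key, list-of-values) group is reduced independently, the tasks
share nothing and can be rescheduled freely on healthy machines, which is what
makes MapReduce's failure handling possible.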

Volunteer Computing

SETI, the Search for Extra-Terrestrial Intelligence, runs a project called
SETI@home in which volunteers donate CPU time from their otherwise idle
computers to analyze radio telescope data for signs of intelligent life outside
Earth. SETI@home is the most well-known of many volunteer computing projects;
others include the Great Internet Mersenne Prime Search (to search for large
prime numbers) and Folding@home (to understand protein folding and how it
relates to disease). Volunteer computing projects work by breaking the problem
they are trying to solve into chunks called work units, which are sent to
computers around the world to be analyzed. For example, a SETI@home work unit is
about 0.35 MB of radio telescope data, and takes hours or days to analyze on a
typical home computer. When the analysis is completed, the results are sent back
to the server, and the client gets another work unit. As a precaution to combat
cheating, each work unit is sent to three different machines, and at least two
results need to agree before the result is accepted (see the sketch below).
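
This two-out-of-three agreement rule might look like the following hypothetical
Python check; the function name and quorum parameter are illustrative, and real
projects (built on frameworks such as BOINC) use far more elaborate validation.

from collections import Counter

def accept_result(results, quorum=2):
    # Accept a work unit's result only if at least `quorum` of the
    # independently computed results agree with one another.
    value, count = Counter(results).most_common(1)[0]
    return value if count >= quorum else None

print(accept_result(["signal", "signal", "noise"]))   # 'signal' is accepted
print(accept_result(["a", "b", "c"]))                 # None: no agreement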

Although SETI@home may be similar to MapReduce (breaking a problem into
independent pieces to be worked on in parallel), there are some significant
differences. The SETI@home problem is very CPU-intensive, which makes it
suitable for running on hundreds of thousands of computers across the world,
because the time to transfer the work unit is dwarfed by the time to run the
computation on it. Volunteers are donating CPU cycles, not bandwidth.

MapReduce is designed to run jobs that last minutes or hours on trusted,
dedicated hardware running in a single data center with very high
aggregate-bandwidth interconnects. By contrast, SETI@home runs a perpetual
computation on untrusted machines on the Internet with highly variable
connection speeds and no data locality.

Unstructured Data
Unstructured data tends to grow exponentially, unlike structured data, which
tends to grow in a more linear fashion. Unstructured data is basically
information that either does not have a predefined data model and/or does not
fit well into a relational database. Unstructured information is typically text
heavy, but may contain data such as the unstructured social data that
organizations use to monitor their own systems. Although there is worth in Big
Data analytics, there are still some business and technology hurdles to clear.

From a business perspective, you'll need to learn how to:

■ Use Big Data analytics to drive value for your enterprise that aligns with
your core competencies and creates a competitive advantage for your enterprise
■ Capitalize on new technology capabilities and leverage your existing
technology assets
■ Enable the appropriate organizational change to move towards fact-based
decisions, adoption of new technologies, and uniting people from multiple
disciplines into a single multidisciplinary team
■ Deliver faster and superior results by embracing and capitalizing on the
ever-increasing rate of change that is occurring in the global marketplace

Big Data analytics uses a wide variety of advanced analytics, as listed below:
SQL Analytics: Count, Mean, OLAP
Descriptive Analytics: Univariate distribution, Central tendency, Dispersion
Data Mining: Association rules, Clustering, Feature extraction, Text analytics
Predictive Analytics: Classification, Regression, Forecasting, Spatial, Machine learning
Simulation: Monte Carlo, Agent-based modeling, Discrete event modeling
Optimization: Linear optimization, Non-linear optimization

There are potential Big Data business models for enterprises seeking to exploit
Big Data analytics. Big Data analytics certainly represents an enormous
opportunity for businesses to exploit their data assets and realize substantial
bottom-line results for their enterprise. The keys to success for organizations
seeking to take advantage of this opportunity are:
■ Leverage all your current data and enrich it with new data sources
■ Enforce data quality policies and leverage today's best technology and
people to support the policies
■ Relentlessly seek opportunities to imbue your enterprise with fact-based
decision making
■ Embed your analytic insights throughout your organization
