ADBMS-Module 1 Notes
ADBMS & Big Data(22MCS23)
Big data primarily refers to data sets that are too large or complex to be dealt with
by traditional data-processing application software. Data with many entries offer
greater statistical power, while data with higher complexity may lead to a higher
false discovery rate.
| Data Warehouse | Big Data |
| --- | --- |
| It is one of the methods in the pipeline of Big Data. | Big Data is a technique to collect, maintain and process huge amounts of information. It explains the data relationship. |
| The goal is the same as that of Big Data, as it is one of the tools of Big Data. | The goal is to make data more vital and usable, i.e. by extracting only the important information from the huge data within existing traditional aspects. |
| It focuses on only one form of data, i.e. structured. | It focuses on and works with all forms of data, i.e. structured, unstructured or semi-structured. |
| S.No | Big Data | Data Warehouse |
| --- | --- | --- |
| 1. | Big data is the data which is in enormous form, on which technologies can be applied. | Data warehouse is the collection of historical data from different operations in an enterprise. |
| 4. | Big data does processing by using a distributed file system. | A data warehouse does not use a distributed file system for processing. |
| 5. | Big data does not use SQL queries to fetch data from the database. | In a data warehouse, SQL queries are used to fetch data from relational databases. |
| 6. | Apache Hadoop can be used to handle an enormous amount of data. | A data warehouse cannot be used to handle an enormous amount of data. |
| 7. | When new data is added, the changes in data are stored in the form of a file, which is represented by a table. | When new data is added, the changes in data do not directly impact the data warehouse. |
Big data is the next generation of data warehousing and business analytics and is poised to deliver top-line revenue cost-effectively for enterprises. The greatest part about this phenomenon is the rapid pace of innovation and change.
1. Computing perfect storm. Big Data analytics is the natural result of four major global trends: Moore's Law (which basically says that technology always gets cheaper), mobile computing (that smartphone or mobile tablet in your hand), social networking (Facebook, Foursquare, Pinterest, etc.), and cloud computing (you don't even have to own hardware or software anymore; you can rent or lease someone else's).
2. Data perfect storm. Volumes of transactional data have been around for decades at most big firms, but the floodgates have now opened: more volume, velocity, and variety (the three Vs) of data is arriving in unprecedented ways. This perfect storm of the three Vs is extremely complex and cumbersome to handle with current data management and analytics technology and practices.
Big data in many sectors today ranges from a few dozen terabytes to multiple petabytes (thousands of terabytes). The real challenge is identifying or developing the most cost-effective and reliable methods for extracting value from all the terabytes and petabytes of data now available. That is where Big Data analytics becomes necessary.
Companies use big data in their systems to improve operational efficiency, provide
better customer service, create personalized marketing campaigns and take other
actions that can increase revenue and profits. Businesses that use big data
effectively hold a potential competitive advantage over those that don't because
they're able to make faster and more informed business decisions.
For example, big data provides valuable insights into customers that companies
can use to refine their marketing, advertising and promotions to increase customer
engagement and conversion rates. Both historical and real-time data can be
analyzed to assess the evolving preferences of consumers or corporate buyers,
enabling businesses to become more responsive to customer wants and needs.
Medical researchers use big data to identify disease signs and risk factors.
Doctors use it to help diagnose illnesses and medical conditions in patients. In
addition, a combination of data from electronic health records, social media sites,
the web and other sources gives healthcare organizations and government agencies
up-to-date information on infectious disease threats and outbreaks.
There are multiple benefits organizations can get by using big data.
Here are some more examples of how organizations in various industries use big
data:
Big data helps oil and gas companies identify potential drilling locations and
monitor pipeline operations. Likewise, utilities use it to track electrical grids.
Financial services firms use big data systems for risk management and real-time
analysis of market data.
Manufacturers and transportation companies rely on big data to manage their
supply chains and optimize delivery routes.
Government agencies use big data for emergency response, crime prevention
and smart city initiatives.
Not all data can be stored in the same way. The methods for data storage can be accurately evaluated only after the type of data has been identified. A cloud service like Microsoft Azure is a one-stop destination for storing all kinds of data: blobs, queues, files, tables, disks, and application data. However, even within the cloud, there are special services to deal with specific sub-categories of data.
For example, Azure cloud services like Azure SQL and Azure Cosmos DB help in handling and managing these widely varied kinds of data.
Application data is the data that is created, read, updated, deleted, or processed by applications. This data could be generated via web apps, Android apps, iOS apps, or any other applications. Due to the diversity in the kinds of data being used, determining the storage approach is a little nuanced.
Big data can be classified into 3 types:
Structured Data
Structured data can be crudely defined as the data that resides in a fixed field
within a record.
It is the type of data most familiar from everyday life, for example birthdays and addresses.
A certain schema binds it, so all the data has the same set of properties.
Structured data is also called relational data. It is split into multiple tables to
enhance the integrity of the data by creating a single record to depict an entity.
Relationships are enforced by the application of table constraints.
The business value of structured data lies within how well an organization can
utilize its existing systems and processes for analysis purposes.
However, altering the structure of the data is difficult, as each record has to be updated to adhere to the new structure. Examples of structured data include numbers, dates, strings, etc.
The business data of an e-commerce website can be considered to be structured
data.
| Name | Class | Section | Roll No | Grade |
| --- | --- | --- | --- | --- |
| Geek1 | 11 | A | 1 | A |
| Geek2 | 11 | A | 2 | B |
| Geek3 | 11 | A | 3 | A |
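Because structured data is bound by a schema, it can be stored and queried relationally. A minimal sketch using Python's built-in sqlite3 module, with a hypothetical students table holding the records above:

```python
import sqlite3

# In-memory database; the fixed schema binds every record to the same fields.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (name TEXT, class INTEGER, section TEXT, roll_no INTEGER, grade TEXT)"
)
rows = [
    ("Geek1", 11, "A", 1, "A"),
    ("Geek2", 11, "A", 2, "B"),
    ("Geek3", 11, "A", 3, "A"),
]
conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?)", rows)

# SQL queries work precisely because the structure is known in advance.
top = conn.execute("SELECT name FROM students WHERE grade = 'A'").fetchall()
print(top)  # [('Geek1',), ('Geek3',)]
```

This is what the notes mean by relationships being enforced through the table: every record must supply the same set of fields before it is accepted.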
Semi-Structured Data
Semi-structured data is not bound by any rigid schema for data storage and
handling. The data is not in the relational format and is not neatly organized
into rows and columns like that in a spreadsheet. However, there are some
features like key-value pairs that help in discerning the different entities from
each other.
Since semi-structured data doesn’t need a structured query language, it is
commonly called NoSQL data.
A data serialization language is used to exchange semi-structured data across
systems that may even have varied underlying infrastructure.
Data is created in plain text so that different text-editing tools can be used to draw
valuable insights. Due to a simple format, data serialization readers can be
implemented on hardware with limited processing resources and bandwidth.
Data Serialization Languages
Software developers use serialization languages to write in-memory data to files, and to transmit, store, and parse it. The sender and the receiver don't need to know anything about each other's systems: as long as the same serialization language is used, the data can be understood by both comfortably. There are three predominantly used serialization languages:
1. XML– XML (eXtensible Markup Language) example:

<ProgrammerDetails>
    <FirstName>Jane</FirstName>
    <LastName>Doe</LastName>
    <CodingPlatforms>
        <CodingPlatform Type="Fav">GeeksforGeeks</CodingPlatform>
        <CodingPlatform Type="2ndFav">Code4Eva!</CodingPlatform>
        <CodingPlatform Type="3rdFav">CodeisLife</CodingPlatform>
    </CodingPlatforms>
</ProgrammerDetails>
XML expresses the data using tags (text within angular brackets) to shape the
data (for ex: FirstName) and attributes (For ex: Type) to feature the data.
However, being a verbose and voluminous language, other formats have gained
more popularity.
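To show how a receiver can consume such a document without knowing anything about the sender's system, here is a small sketch using Python's standard-library xml.etree.ElementTree parser on the snippet above:

```python
import xml.etree.ElementTree as ET

xml_text = """
<ProgrammerDetails>
    <FirstName>Jane</FirstName>
    <LastName>Doe</LastName>
    <CodingPlatforms>
        <CodingPlatform Type="Fav">GeeksforGeeks</CodingPlatform>
        <CodingPlatform Type="2ndFav">Code4Eva!</CodingPlatform>
    </CodingPlatforms>
</ProgrammerDetails>
"""

root = ET.fromstring(xml_text)
first_name = root.findtext("FirstName")                 # read tag content
fav = root.find(".//CodingPlatform[@Type='Fav']").text  # look up by attribute
print(first_name, fav)  # Jane GeeksforGeeks
```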
2. JSON– JSON (JavaScript Object Notation) is a lightweight open-standard file
format for data interchange. JSON is easy to use and uses human/machine-
readable text to store and transmit data objects.
{
    "firstName": "Jane",
    "lastName": "Doe",
    "codingPlatforms": [
        { "type": "Fav", "name": "GeeksforGeeks" },
        { "type": "2ndFav", "name": "Code4Eva!" },
        { "type": "3rdFav", "name": "CodeisLife" }
    ]
}
This format isn't as formal as XML; it's more of a key/value pair model than a formal data depiction. JavaScript has built-in support for JSON. Although JSON is very popular amongst web developers, non-technical personnel find it tedious to work with due to its heavy dependence on JavaScript and structural characters (braces, commas, etc.).
3. YAML– YAML is a user-friendly data serialization language. Its name is a recursive acronym: YAML Ain't Markup Language. It is adopted by technical and non-technical handlers all across the globe owing to its simplicity. The data structure is defined by line separation and indentation, which reduces the dependency on structural characters. YAML is extremely comprehensive, and its popularity is a result of its readability to both humans and machines.
YAML example
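The same programmer record from the XML and JSON snippets, sketched in YAML, where indentation replaces braces and tags:

```yaml
firstName: Jane
lastName: Doe
codingPlatforms:
  - type: Fav
    name: GeeksforGeeks
  - type: 2ndFav
    name: Code4Eva!
  - type: 3rdFav
    name: CodeisLife
```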
A product catalog organized by tags is an example of semi-structured data.
Unstructured Data
Unstructured data is the kind of data that doesn’t adhere to any definite
schema or set of rules. Its arrangement is unplanned and haphazard.
Photos, videos, text documents, and log files can be generally considered
unstructured data. Even though the metadata accompanying an image or a
video may be semi-structured, the actual data being dealt with is unstructured.
Additionally, unstructured data is also known as "dark data" because it cannot be analyzed without the proper software tools.
ADBMS & Big Data(22MCS23)
1. Volume
Volume refers to the sheer amount of data generated and stored, today ranging from a few dozen terabytes to multiple petabytes. Whether data counts as big data at all depends first on its volume, and handling that scale is the first hurdle any big data system must clear.
2. Velocity
Velocity describes how quickly data is processed. Any significant data operation
has to operate at a high rate. The linkage of incoming data sets, activity bursts, and
the pace of change make up this phenomenon. Sensors, social media platforms, and
application logs all continuously generate enormous volumes of data. There is no use in spending time or effort on data that cannot be captured and processed at the speed at which it arrives.
3. Variety
The many types of big data are referred to as variety. As it impacts performance, it
is one of the main problems the big data sector is now dealing with. It’s crucial to
organize your data so that you can manage its diversity effectively. Variety is the
wide range of information you collect from numerous sources.
4. Veracity
Veracity refers to the quality, accuracy, and trustworthiness of the data. Data arriving from many sources is often inconsistent or incomplete, so it must be cleaned and validated before any analysis of it can be trusted.
5. Value
Value is the advantage that the data provides to your company. Does it reflect the
objectives of your company? Does it aid in the growth of your company? It’s one
of the most crucial fundamentals of big data. Data scientists first transform
unprocessed data into knowledge. The best data from this data collection is then
extracted once it has been cleaned. On this data set, analysis and pattern
recognition are performed. The results of the method may be used to determine the
value of the data.
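The clean-then-extract-then-analyze sequence described above can be sketched in a few lines of Python; the sensor readings and the outlier threshold here are hypothetical:

```python
import statistics

# Hypothetical raw feed: some entries are malformed and must be cleaned out.
raw = ["12", "15", "bad", "14", "", "90"]

cleaned = [int(x) for x in raw if x.isdigit()]      # transform and clean raw data
typical = statistics.median(cleaned)                 # analyze the cleaned data set
outliers = [x for x in cleaned if x > 3 * typical]   # simple pattern recognition
print(cleaned, typical, outliers)  # [12, 15, 14, 90] 14.5 [90]
```

The value of the data set is only visible after the last step: the raw strings by themselves say nothing, while the outlier list is something a business can act on.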
Grid Computing
The High Performance Computing (HPC) and Grid Computing communities have
been doing large-scale data processing for years, using such Application
Program Interfaces (APIs) as Message Passing Interface (MPI). The approach in
HPC is to distribute the work across a cluster of machines, which access a shared file system hosted by a Storage Area Network (SAN). This works well for predominantly compute-intensive jobs, but it becomes a problem when nodes need to access larger data volumes, since the network bandwidth is the bottleneck and compute nodes become idle.
MapReduce tries to collocate the data with the compute node, so data access is fast
because it is local. This feature, known as data locality, is at the heart of
MapReduce and is the reason for its good performance. Recognizing that network
bandwidth is the most precious resource in a data center environment (it is easy to
saturate network links by copying data around), MapReduce implementations
conserve it by explicitly modelling network topology. MapReduce operates only at
the higher level: the programmer thinks in terms of functions of key and value
pairs, and the data flow is implicit.
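The key/value programming model can be sketched in plain Python with the classic word-count example; this is a single-machine illustration of the model, not the distributed Hadoop implementation:

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit a (key, value) pair for every word in an input record."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    """Reduce: combine all values that were emitted under one key."""
    return (key, sum(values))

def map_reduce(records):
    groups = defaultdict(list)
    for record in records:                 # map + shuffle phases
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())  # reduce phase

counts = map_reduce(["big data big ideas", "data locality"])
print(counts)  # {'big': 2, 'data': 2, 'ideas': 1, 'locality': 1}
```

The programmer supplies only the two functions; the framework owns the data flow, which is why a real implementation is free to run the map tasks on whichever nodes already hold the data.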
Volunteer Computing
MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects. By contrast, SETI@home runs a perpetual computation on untrusted machines on the Internet with highly variable connection speeds and no data locality.
Unstructured Data
Unstructured data tends to grow exponentially, unlike structured data, which tends to grow in a more linear fashion. Unstructured data is basically information that either does not have a predefined data model and/or does not fit well into a relational database. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. Although there is worth in big data analytics, there are still some business and technology hurdles to clear:
■ Use Big Data analytics to drive value for your enterprise that aligns with
your core competencies and creates a competitive advantage for your
enterprise
■ Capitalize on new technology capabilities and leverage your existing
technology assets
■ Enable the appropriate organizational change to move towards fact- based
decisions, adoption of new technologies, and uniting people from
multiple disciplines into a single multidisciplinary team
■ Deliver faster and superior results by embracing and capitalizing on the ever-increasing rate of change that is occurring in the global marketplace
Big Data analytics uses a wide variety of advanced analytics, as listed in Figure. There are potential Big Data business models for enterprises seeking to exploit Big Data analytics:
■ Enforce data quality policies and leverage today’s best technology and
decision making
■ Embed your analytic insights throughout your organization