
MODULE - 1

By,

Dr. Pushpa S. K
Professor & HoD,
Dept. of Information Science & Engg.
BMS Institute of Technology & Mgmt.
Bengaluru.

Introduction to Big Data

Need for Big Data

The rise in technology has led to the production and storage of voluminous amounts of data. Earlier, megabytes (10^6 B) were handled; nowadays petabytes (10^15 B) are processed and analyzed for discovering new facts and generating new knowledge.

Conventional systems for storage, processing and analysis face challenges from the large growth in data volume, the variety of data in various forms and formats, increasing complexity, the faster generation of data, and the need to process, analyze and use the data quickly.

Figure 1.1 shows data usage and growth. As size and complexity increase, the proportion of unstructured data types also increases.

An example of a traditional tool for structured data storage and querying is an RDBMS. The volume, velocity and variety (3Vs) of data require a number of programs and tools for analyzing and processing it at very high speed.
1.2 BIG DATA

• Data is information, usually in the form of facts or statistics, that one can analyze or use for further calculations.
• Data is information that can be stored and used by a computer program.
• Data is information presented in numbers, letters or other forms.
• Data is information from a series of observations, measurements or facts.
• Data is information from a series of behavioral observations, measurements or facts.

Web Data
• Web data is the data present on web servers (or enterprise servers) in the form of text, images, videos, audios and multimedia files for web users. A user (client software) interacts with this data. A client can access (pull) data as responses from a server. A server can also publish (push) data, or post data to clients that have registered a subscription. Internet applications, including web sites, web services, web portals, online business applications, e-mails, chats, tweets and social networks, provide and consume web data.
• Examples of web data: Wikipedia, Google Maps, YouTube, Facebook.

Classification of Data

Data can be classified as:
• Structured
• Semi-structured
• Multi-structured
• Unstructured

Structured Data

Structured data conform to and associate with data schemas and data models. Structured data are found in tables (rows and columns). Nearly 15-20% of data is structured.

Structured data enables the following:
• Data insert, delete, update and append
• Indexing, which enables faster data retrieval
• Scalability, which enables increasing or decreasing capacities for data storage and for processing operations such as storing, processing and analytics
• Transaction processing, which follows ACID rules (Atomicity, Consistency, Isolation and Durability)
• Encryption and decryption for data security
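A minimal sketch of these operations in Python, using the built-in sqlite3 module as a stand-in RDBMS; the table, columns and values are invented for illustration, not taken from the notes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")          # in-memory relational store
cur = conn.cursor()

# Schema: structured data conforms to a fixed table model.
cur.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# Insert / update / delete (append happens via further inserts).
cur.executemany("INSERT INTO sales (region, amount) VALUES (?, ?)",
                [("north", 120.0), ("south", 75.5), ("north", 42.0)])
cur.execute("UPDATE sales SET amount = amount * 1.1 WHERE region = 'north'")
cur.execute("DELETE FROM sales WHERE amount < 50")

# Indexing enables faster retrieval on the indexed column.
cur.execute("CREATE INDEX idx_region ON sales (region)")

# Transaction processing follows ACID rules: either all statements commit or none do.
try:
    cur.execute("INSERT INTO sales (region, amount) VALUES ('east', 99.0)")
    conn.commit()
except sqlite3.Error:
    conn.rollback()

print(cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall())
conn.close()
```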

Semi-Structured Data
• Examples of semi-structured data are XML and JSON documents.
• Semi-structured data contain tags or other markers, which separate semantic elements and enforce hierarchies of records and fields within the data.
• The semi-structured form of data does not conform to or associate with formal data model structures; the data do not associate with data models such as the relational database and table models.
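A minimal sketch of semi-structured data handling in Python: a small JSON document (invented for illustration) whose tags separate semantic elements and impose a hierarchy of records and fields, without any fixed table schema:

```python
import json

doc = """
{
  "student": {
    "name": "A. Kumar",
    "courses": [
      {"code": "BDA", "module": 1},
      {"code": "ML"}
    ]
  }
}
"""

record = json.loads(doc)                       # parse tags into nested dicts/lists
print(record["student"]["name"])               # navigate the hierarchy by key
for course in record["student"]["courses"]:
    # fields may be present or absent per record -- no table model enforces them
    print(course["code"], course.get("module", "n/a"))
```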
Multi-Structured Data
• Multi-structured data refers to data consisting of multiple formats of data, viz. structured, semi-structured and/or unstructured data.
• Multi-structured data sets can have many formats.
• They are found in non-transactional systems.
• For example, streaming data on customer interactions, data of multiple sensors, data at a web or enterprise server, or data-warehouse data in multiple formats.
Unstructured Data
• The data do not possess data features such as a table or a database.
• Unstructured data are found in file types such as .TXT and .CSV. Data may be as key-value pairs, such as hash key-value pairs.
• Data may have internal structures, such as in e-mails.
• The data do not reveal relationships, hierarchies or object-oriented relationships.
• The relationships, schema and features need to be separately established.
Examples of unstructured data
• Mobile data: text messages, chat messages, tweets, blogs and comments
• Website content data: YouTube videos, browsing data, e-payments, web store data, user-generated maps
• Social media data: for exchanging data in various forms
• Texts and documents
• Personal documents and e-mails
• Text internal to an organization: text within documents, logs, survey results
• Satellite images, atmospheric data, surveillance and traffic videos, images from Instagram and Flickr (upload, access, organize, edit and share photos from any device, from anywhere in the world)
Big Data Definitions
• Big Data is high-volume, high-velocity and/or high-variety information that requires new forms of processing for enhanced decision making, insight discovery and process optimization.
• "A collection of data sets so large or complex that traditional data processing applications are inadequate." - Wikipedia
• "Data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges." - Oxford English Dictionary
• Big Data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.

Big Data Characteristics

• Volume: relates to the size of the data.
• Velocity: refers to the speed of generation of data.
• Variety: refers to the variety of forms and types of data.
• Veracity: refers to the quality of data captured, which can vary greatly, affecting its accurate analysis.

Big Data Types
• Social networks and web data, such as Facebook, Twitter, e-mails, blogs and YouTube.
• Transactions data and Business Processes (BPs) data, such as credit card transactions, flight bookings, etc., and public agencies data, such as medical records, insurance business data, etc.
• Customer master data, such as data for facial recognition and for the name, date of birth, marriage anniversary, gender, location and income category.
• Machine-generated data, such as machine-to-machine or Internet of Things data, and data from sensors, trackers, web logs and computer-system logs. Computer-generated data is also considered machine-generated data.
• Human-generated data, such as biometrics data, human-machine interaction data, e-mail records with a mail server and a MySQL database of student grades. Humans also record their experiences in ways such as writing them in notebooks or diaries, or taking photographs or audio and video clips. Human-sourced information is now almost entirely digitized and stored everywhere, from personal computers to social networks.

Examples of Big Data
• A chocolate marketing company with a large number of installed Automatic Chocolate Vending Machines (ACVMs).
• Automotive Components and Predictive Automotive Maintenance Services (ACPAMS), rendering customer services for maintenance and servicing of (Internet) connected cars and their components.
• A Weather data Recording, Monitoring and Prediction (WRMP) organization.

Big Data Classification
Big Data can be classified on the basis of the characteristics that are used for designing data architecture for processing and analytics.

Big Data Handling Techniques

Following are the techniques deployed for Big Data storage, applications, data management, and mining and analytics:
• Huge data volume storage, data distribution, high-speed networks and high-performance computing
• Application scheduling using open-source, reliable, scalable, distributed file systems, distributed databases, and parallel and distributed computing systems, such as Hadoop or Spark

Big Data Handling Techniques (contd.)
• Open-source tools which are scalable and elastic, and which provide a virtualized environment, clusters of data nodes, and task and thread management
• Data management using NoSQL, document databases, column-oriented databases, graph databases and other forms of databases as per the needs of the applications, and in-memory data management using columnar or Parquet formats during program execution
• Data mining and analytics, data retrieval, data reporting, data visualization and machine-learning Big Data tools

Scalability and Parallel Processing
• Big Data needs processing of large data volumes, and therefore needs intensive computations.
• Processing complex applications with large datasets (terabyte to petabyte datasets) needs hundreds of computing nodes.
• Processing this much distributed data within a short time and at minimum cost is problematic.
• Scalability is the capability of a system to handle the workload as per the magnitude of the work.
• System capability needs to increase with increased workloads.
• When the workload and complexity exceed the system capacity, scale it up and scale it out.
• Scalability enables increasing or decreasing the capacity of data storage, processing and analytics.

Analytical Scalability
• Vertical scalability means scaling up the given system's resources and increasing the system's analytics, reporting and visualization capabilities. This is an additional way to solve problems of greater complexity. Scaling up means designing the algorithm according to the architecture that uses resources efficiently. If a terabyte of data takes time t for processing and the code size (complexity) increases by a factor n, then scaling up means that processing takes equal to, less than, or much less than n * t.
• Horizontal scalability means increasing the number of systems working in coherence and scaling out the workload. Processing different datasets of a large dataset deploys horizontal scalability. Scaling out means using more resources and distributing the processing and storage tasks in parallel. The easiest way to scale up and scale out execution of analytics software is to implement it on a bigger machine with more CPUs for greater volume, velocity, variety and complexity of data. The software will generally perform better on a bigger machine.
Massive Parallel Processing Platforms

Parallelization of tasks can be done at several levels:
• Distributing separate tasks onto separate threads on the same CPU
• Distributing separate tasks onto separate CPUs on the same computer
• Distributing separate tasks onto separate computers

Distributed Computing Model
• A distributed computing model uses cloud, grid or clusters, which process and analyze big and large datasets on distributed computing nodes connected by high-speed networks.
• Big Data processing uses a parallel, scalable, no-sharing (shared-nothing) program model, such as MapReduce, for computations on it.
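A minimal sketch of this shared-nothing MapReduce style in pure Python: each worker maps its own split of the input with no shared state, and a reduce step merges the partial results. The input lines and the three-way split are invented for illustration.

```python
from collections import Counter
from multiprocessing import Pool

def map_split(lines):
    """Map phase: count words in one split, with no shared state between workers."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(partials):
    """Reduce phase: merge the per-split counts into the final result."""
    total = Counter()
    for c in partials:
        total.update(c)
    return total

if __name__ == "__main__":
    data = ["big data needs parallel processing",
            "mapreduce is a parallel programming model",
            "big data processing uses clusters"]
    splits = [data[0:1], data[1:2], data[2:3]]      # distribute splits to workers
    with Pool(processes=3) as pool:
        partials = pool.map(map_split, splits)       # map runs in separate processes
    print(reduce_counts(partials).most_common(3))
```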

Cloud Computing
• "Cloud computing is a type of Internet-based computing that provides shared processing resources and data to computers and other devices on demand."
• One of the best approaches for data processing is to perform parallel and distributed computing in a cloud-computing environment.
• Cloud resources can be Amazon Web Services (AWS) Elastic Compute Cloud (EC2), Microsoft Azure or Apache CloudStack.
Features of Cloud Computing
• On-demand service
• Resource pooling
• Scalability
• Accountability
• Broad network access
• Cloud services can be accessed from anywhere and at any time through the Internet.
Cloud Services
There are three types of cloud services:
• Infrastructure as a Service (IaaS)
• Platform as a Service (PaaS)
• Software as a Service (SaaS)
Infrastructure as a Service (IaaS)
• Providing access to resources, such as hard disks, network connections, database storage, data centers and virtual server spaces, is Infrastructure as a Service (IaaS).
• Some examples are Tata Communications, and Amazon data centers and virtual servers.
• Apache CloudStack is open-source software for deploying and managing a large network of virtual machines; it offers public cloud services which provide highly scalable Infrastructure as a Service (IaaS).
Platform as a Service (PaaS)
• PaaS means providing the runtime environment that allows developers to build applications and services in the cloud.

• Software at the cloud supports and manages the services, storage, networking, deploying, testing, collaborating, hosting and maintaining of applications.

• Examples are Hadoop cloud services such as IBM BigInsights, Microsoft Azure HDInsight and Oracle Big Data Cloud Service.

Software as a Service (SaaS)
• Providing software applications as a service to end-users is known as Software as a Service.

• Software applications are hosted by a service provider and made available to customers over the Internet.

• Some examples are Google SQL, IBM BigSQL, Microsoft PolyBase and Oracle Big Data SQL.

Grid Computing
• Grid computing refers to distributed computing, in which a group of computers from several locations are connected with each other to achieve a common task. The connection between the computers is non-interactive.
• The computer resources are heterogeneous and geographically dispersed.
• A group of computers spread over remote locations comprises a grid.
• A single grid, of course, is dedicated at an instance to a particular application only.
• Grid computing, similar to cloud computing, is scalable.
• Cloud computing depends on sharing of resources (for example, networks, servers, storage, applications and services) to attain coordination and coherence among resources, similar to grid computing.
• Similarly, a grid also forms a distributed network for resource integration.

Cluster Computing

• A cluster is a group of computers connected by a network. The group works together to accomplish the same task. Clusters are used mainly for load balancing; they shift processes between nodes to keep an even load on the group of connected computers.
• Example: Hadoop architecture.

Volunteer Computing
• Volunteers provide computing resources to projects of importance, which use the resources to do distributed computing and/or storage. Volunteer computing is a distributed computing paradigm which uses the computing resources of volunteers. Volunteers are organizations or members who own personal computers. Example projects are science-related projects executed by universities or academia in general.
• Some issues with volunteer computing systems are:
• Heterogeneity of the volunteered computers
• Drop-outs from the network over time
• Their sporadic availability
• Incorrect results from volunteers are unaccountable, as they essentially come from anonymous volunteers.

Designing the Data Architecture
• Big Data architecture is the logical and/or physical layout/structure of how Big Data will be stored, accessed and managed within a Big Data or IT environment.

• The architecture logically defines how a Big Data solution will work, the core components used (hardware, database, software, storage), the flow of information, security and more.

• Data analytics needs a number of sequential steps. The Big Data architecture design task simplifies when using the logical-layers approach. Figure 1.2 shows the logical layers and the functions which are considered in a Big Data architecture.

Designing the Data Architecture

The data processing architecture consists of five layers:
• Identification of data sources
• Acquisition, ingestion, extraction, pre-processing and transformation of data
• Data storage in files, servers, clusters or the cloud
• Data processing
• Data consumption by a number of programs and tools

Figure 1.2 Design of logical layers in a data processing architecture, and functions in the layers
Data Architecture
• Logical layer 1 (L1) is for identifying data sources, which are external, internal or both.
• Layer 2 (L2) is for data ingestion. Data ingestion means a process of absorbing information, just like the process of absorbing nutrients and medications into the body by eating or drinking them. Ingestion is the process of obtaining and importing data for immediate use or transfer. Ingestion may be in batches or in real time, using pre-processing or semantics.
Layer 1
L1 considers the following aspects in a design:
• Amount of data needed at the ingestion layer 2 (L2)
• Push from L1 or pull by L2, as per the mechanism for the usages
• Source data types: database, files, web or service
• Source formats, i.e., semi-structured, unstructured or structured
Layer 2
• Ingestion and ETL processes, either in real time, which means storing and using the data as generated, or in batches.
• Batch processing uses discrete datasets at scheduled or periodic intervals of time.

Layer 3
• Data storage type (historical or incremental), format, compression, incoming data frequency, querying patterns and consumption requirements for L4 or L5
• Data storage using the Hadoop distributed file system or NoSQL data stores, such as HBase, Cassandra or MongoDB

Layer 4
• Data-processing software such as MapReduce, Hive, Pig, Spark, Mahout and Spark Streaming
• Processing in scheduled batches, in real time or hybrid
• Processing as per the synchronous or asynchronous processing requirements at L5
Layer 5
• Data integration
• Dataset usage for reporting and visualization
• Analytics (real time, near real time, scheduled batches), BPs, BIs, knowledge discovery
• Export of datasets to the cloud, the web or other systems

Managing Data for Analysis
• Data managing means enabling, controlling, protecting, delivering and enhancing the value of data and information assets. Reports, analysis and visualizations need well-defined data.
Data management functions include:
1. Data asset creation, maintenance and protection
2. Data governance, which includes establishing the processes for ensuring the availability, usability, integrity, security and high quality of data. The processes enable trustworthy data availability for analytics, followed by decision making at the enterprise.
3. Data architecture creation, modelling and analysis
4. Database maintenance, administration and management systems, for example, an RDBMS (relational database management system) or NoSQL
5. Managing data security, data access control, deletion, privacy and security
6. Managing the data quality

Managing Data for Analysis
• Data collection using the ETL process
• Managing documents, records and contents
• Creation of reference and master data, and data control and supervision
• Data and application integration
• Integrated data management, enterprise-ready data creation, fast access and analysis, automation and simplification of operations on the data
• Data warehouse management
• Maintenance of business intelligence
• Data mining and analytics algorithms.

Data Source

• A data source is the first point where data is born or where it is first digitized. Ultimately, it remains a data source so long as a process or system accesses and utilizes it. The source could be through physical means, like surveys or interviews, or through digital means, like readings from sensors.

Data Source
• Applications, programs and tools use data. Sources can be external, such as sensors, trackers, web logs, computer-system logs and feeds. Sources can be machines, which source data from data-creating programs.
• A source can be internal. Sources can be data repositories, such as a database, relational database, flat file, spreadsheet, mail server, web server or directory service, or even text or files such as comma-separated values (CSV) files. A source may be a data store for applications.
Data sources may be:
• Structured
• Semi-structured
• Multi-structured
• Unstructured

Structured Data Sources
• A data source for ingestion, storage and processing can be a file, a database or streaming data.
• The source may be on the same computer running the program, or on a networked computer.
• Structured data sources are SQL Server, MySQL, Microsoft Access database, Oracle DBMS, IBM DB2, Informix, Amazon SimpleDB or a file-collection directory at a server.
• A data source name needs to be a meaningful name.
• A data dictionary enables references for access to data. The dictionary consists of a set of master lookup tables. The dictionary is stored at a central location. The central location enables easier access as well as administration of changes in sources. The name of the dictionary can be, for example, UniversityStudent_DataPlusGrades. The master-directory server can also be called the NameNode.

Structured Data Sources
Microsoft applications consider two types of sources for processing:
1. Machine source
2. File source
Machine sources are present on computer nodes, such as servers. A machine identifies a source by the user-defined name, the driver-manager name and the source-driver name.
File source and stored files: An application accessing the data first connects to the driver manager of the source. A user, client or application does not register with the source, but connects to the manager when required.
The process of connection is simple when using a file data source, because the file contains a connection string that would otherwise have to be built by a call to a connect function of the driver.
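A hedged sketch of the two connection styles, using the pyodbc driver-manager interface; the driver name, server, database, credentials and .dsn file path are hypothetical, and an actual ODBC driver and database would have to exist for this to run:

```python
import pyodbc

# Machine (user/system) data source: the application builds a connection
# string itself and hands it to the driver manager.
conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=example-server;DATABASE=StudentDB;"   # hypothetical names
    "UID=reader;PWD=secret"
)
conn = pyodbc.connect(conn_str)

# File data source: the .dsn file already stores the connection string,
# so the application only points the driver manager at the file.
conn_file = pyodbc.connect("FILEDSN=C:\\datasources\\studentdb.dsn")

for c in (conn, conn_file):
    print(c.cursor().execute("SELECT 1").fetchone())
    c.close()
```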

Structured Data Sources
• Oracle applications consider two types of data sources:
(i) Database, which identifies the database information that the software needs in order to connect to the database.
(ii) Logic machine, which identifies the machine that runs batches of applications and master business functions.
For example, a source definition identifies the machine. The source can be on a network. The definition in that case also includes network information, such as the name of the server which hosts the machine functions.
The applications consider data sources as the ones where the database tables reside and where the software runs logic objects for an enterprise. Data sources can point to:
1. A database in a specific location or in a data library of the OS
2. A specific machine in the enterprise that processes logic
3. A data-source master table, which stores data-source definitions. The table may be at a centralized source or at a server map for the source.
Unstructured Data Sources
• Unstructured data sources are distributed over high-speed networks.
• The data need high-velocity processing. Sources are from distributed file systems.
• The sources are of file types such as .txt (text file) and .csv (comma-separated values file). Data may be as key-value pairs, such as hash key-value pairs. Data may have internal structures, such as in e-mails, Facebook pages, Twitter messages, etc. The data do not model or reveal relationships, hierarchies or object-oriented features, such as extensibility.

Unstructured Data Sources: Sensors, Signals and GPS
• The data sources can be sensors, sensor networks, signals from machines, devices, controllers and intelligent edge nodes of different types in industry, M2M communication and GPS systems.
• Sensors are electronic devices that sense the physical environment. Sensors are devices used for measuring temperature, pressure, humidity, light intensity, traffic in proximity, acceleration, locations, object proximity, orientation, magnetic intensity, and other physical states and parameters. Sensors play an active role in the automotive industry.
• RFIDs and their sensors play an active role in RFID-based supply chain management, and in tracking parcels, goods and deliveries.
• Sensors embedded in processors, which include machine-learning instructions and wireless communication capabilities, are innovations. They are sources in IoT applications.

Data Quality
• High quality means data which enables all the required operations, analysis, decisions, planning and knowledge discovery correctly.
• The five R's of data quality are:
1. Relevancy
2. Recency
3. Range
4. Robustness
5. Reliability
Data Integrity
Data integrity refers to the maintenance of consistency and accuracy in data over its usable life. Software which stores, processes or retrieves the data should maintain the integrity of the data. Data should be incorruptible.
Factors Affecting Data Quality

The factors affecting data quality are:
• Data noise
• Outliers
• Missing values
• Duplicate values
Data Noise
One of the factors affecting data quality is noise. Noise in data refers to data giving additional meaningless information besides the true (actual/required) information. Noise is random in character, which means the frequency with which it occurs is variable over time.
Outlier
An outlier in data refers to data which appears not to belong to the dataset, for example, data that is outside an expected range. Actual outliers need to be removed from the dataset, else the result will be affected by a small or large amount.
Missing Values and Duplicate Values
• Missing values: Another factor affecting data quality is missing values. A missing value implies data not appearing in the data set.
• Duplicate values: Another factor affecting data quality is duplicate values. A duplicate value implies the same data appearing two or more times in a dataset.

Data Pre-processing
• Data pre-processing is an important step at the ingestion layer. Pre-processing is a must before data mining and analytics. Pre-processing is also a must before running a Machine Learning (ML) algorithm.
Pre-processing needs are:
• Dropping out-of-range, inconsistent and outlier values
• Filtering unreliable, irrelevant and redundant information
• Data cleaning, editing, reduction and/or wrangling
• Data validation, transformation or transcoding
• ELT processing
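A minimal sketch of these pre-processing steps using pandas; the toy sensor readings and the valid temperature range are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "sensor_id": [1, 1, 2, 2, 3, 3],
    "temp_c":    [21.5, 21.5, 23.0, None, 250.0, 22.1],   # None = missing, 250.0 = outlier
})

df = df.drop_duplicates()                                      # remove duplicate records
df = df[df["temp_c"].between(-40, 60) | df["temp_c"].isna()]   # drop out-of-range / outlier values
df["temp_c"] = df["temp_c"].fillna(df["temp_c"].mean())        # impute missing values
print(df)
```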

Data Cleaning

• Data cleaning refers to the process of removing or correcting incomplete, incorrect, inaccurate or irrelevant parts of the data after detecting them.
• Data cleaning is done before mining of the data. Incomplete or irrelevant data may result in misleading decisions.
• Data-cleaning tools help in refining and structuring data into usable data. Examples of such tools are OpenRefine and DataCleaner.

Data Enrichment

• Data enrichment refers to operations or processes which refine, enhance or improve the raw data.
• Data editing refers to the process of reviewing and adjusting the acquired datasets. Editing controls the data quality. Editing methods are (i) interactive, (ii) selective, (iii) automatic, (iv) aggregating and (v) distribution.
• Data reduction enables the transformation of acquired information into an ordered, correct and simplified form. The basic concept is the reduction of a multitudinous amount of data and the use of its meaningful parts. The reduction uses editing, scaling, coding, sorting, collating, smoothening, interpolating and preparing tabular summaries.

Data Wrangling
• Data wrangling refers to the process of transforming and mapping the data, so that the results from analytics are appropriate and valuable.
• Mapping transforms data into another format, which makes it valuable for analytics and data visualizations.
Data formats used during pre-processing:
• Comma-separated values (CSV)
• JavaScript Object Notation (JSON), as batches of object arrays or resource arrays
• Tag Length Value (TLV)
• Key-value pairs
• Hash key-value pairs
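A minimal sketch in Python of moving between these formats: CSV rows become key-value records, a batch of records is serialized as JSON, and a hash-keyed variant is built from the same records (the CSV content is invented):

```python
import csv, io, json

csv_text = "city,aqi\nBengaluru,92\nMysuru,54\n"

records = list(csv.DictReader(io.StringIO(csv_text)))   # each CSV row -> key-value pairs
print(records[0]["city"], records[0]["aqi"])

batch = json.dumps(records)                              # batch of objects serialized as JSON
print(batch)

# Hash-key-value variant: key each record by a hash of its city name.
hashed = {hash(r["city"]): r for r in records}
print(list(hashed.items())[:1])
```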

Data Store Export to Cloud
Data pre-processing, analysis, visualization, data store export

Data Store and Analytics

This section describes data storage and analysis, and a comparison of Big Data management and analysis with traditional database management systems.
Data Storage and Management: Traditional Systems
• Traditional systems use structured or semi-structured data.
• The sources of structured data stores are:
• Traditional relational database management system (RDBMS) data, such as MySQL and DB2, enterprise servers and data warehouses

Data Store and Analytics
SQL
• An RDBMS uses SQL (Structured Query Language). SQL is a language for viewing or changing (update, insert, append or delete) databases. SQL provides for the following:
1. Create schema: a schema is a structure which contains the description of objects (base tables, views, constraints) created by a user. The user can describe the data and define the data in the database.
2. Create catalog: a catalog consists of a set of schemas which describe the database.
3. Data Definition Language (DDL): commands which define a database, including creating, altering and dropping tables and establishing constraints. A user can create and drop databases and tables, establish foreign keys, and create views, stored procedures, functions, etc. in the database.
4. Data Manipulation Language (DML): commands that maintain and query the database. A user can manipulate (INSERT/UPDATE) and access (SELECT) the data.
Data Store and Analytics
5. Data Control Language (DCL): commands that control a database, including administering privileges and committing. A user can set (grant, add or revoke) permissions on tables, procedures and views.
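A minimal sketch of DDL and DML statements issued through Python's sqlite3 module (SQLite does not support the DCL GRANT/REVOKE commands, so DCL is omitted); the table, view and data are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the database objects -- tables, constraints and a view.
cur.executescript("""
CREATE TABLE student (usn TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE grade (usn TEXT REFERENCES student(usn), course TEXT, marks INTEGER);
CREATE VIEW topper AS SELECT usn, MAX(marks) AS best FROM grade GROUP BY usn;
""")

# DML: maintain and query the data.
cur.execute("INSERT INTO student VALUES ('1BY23IS001', 'Asha')")
cur.execute("INSERT INTO grade VALUES ('1BY23IS001', 'BDA', 86)")
cur.execute("UPDATE grade SET marks = 88 WHERE usn = '1BY23IS001' AND course = 'BDA'")
print(cur.execute("SELECT * FROM topper").fetchall())
conn.close()
```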
Distributed Database Management System
• A distributed DBMS (DDBMS) is a collection of logically interrelated databases at multiple systems over a computer network. A DDBMS is:
• A collection of logically related databases
• A cooperation between databases in a transparent manner
• 'Location independent', which means the user is unaware of where the data is located, and it is possible to move the data from one physical location to another without affecting the user.

In-Memory Column-Format Data

• A columnar in-memory format allows faster data retrieval when only a few columns in a table need to be selected during query processing or aggregation.
• Online Analytical Processing (OLAP) on real-time transaction data is fast when using in-memory column-format tables.
• In columnar in-memory data storage, the CPU accesses all the values of a column in a single instance of access to the memory.

In-Memory Row-Format Databases
• A row format in-memory allows much faster data processing during OLTP (Online Transaction Processing).
• Each row record has corresponding values in multiple columns, and those values are stored at consecutive memory addresses in row format.
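A minimal sketch contrasting the two layouts in plain Python: a row store keeps the fields of one record adjacent (convenient for OLTP-style record access), while a column store keeps the values of one column adjacent (convenient for OLAP-style aggregation). The records are invented.

```python
row_store = [                                  # one record per entry, fields adjacent
    {"id": 1, "region": "north", "amount": 120.0},
    {"id": 2, "region": "south", "amount": 75.5},
    {"id": 3, "region": "north", "amount": 42.0},
]

col_store = {                                  # one column per entry, values adjacent
    "id":     [1, 2, 3],
    "region": ["north", "south", "north"],
    "amount": [120.0, 75.5, 42.0],
}

# OLTP-style: fetch a full record -- natural in the row store.
print(row_store[1])

# OLAP-style: aggregate one column -- a single contiguous scan in the column store.
print(sum(col_store["amount"]))
```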
Enterprise Data-Store Server and Data Warehouse
• An enterprise data server uses data from several distributed sources which store data using various technologies.
• All data merge using an integration tool.
• Integration enables collective viewing of the datasets at the data warehouse.
• Enterprise data integration may also include integration with application(s), such as analytics, visualization, reporting, business intelligence and knowledge discovery.

Enterprise Data Integration
Following are some standardised business processes, as defined in the Oracle application-integration architecture:
• Integrating and enhancing the existing systems and processes
• Business intelligence
• Data security and integrity
• New business services/products (Web services)
• Collaboration/knowledge management
• Enterprise architecture/SOA
• e-commerce
• External customer services
• Supply chain automation/visualization
• Data centre optimization

• https://www.youtube.com/watch?v=02n-fzzbKyo

Steps 1 to 5 in enterprise data integration and management with Big Data for high-performance computing, using local and cloud resources for the analytics, applications and services.
Big Data Storage
• NoSQL databases are considered semi-structured data stores. Big Data stores use NoSQL. NoSQL stands for No SQL or Not Only SQL.
• The stores do not integrate with applications using SQL. NoSQL is also used in cloud data stores.
Features of NoSQL are as follows:
• It is a class of non-relational data storage systems, with flexible data models and multiple schemas:
• A class consisting of an uninterpreted key/value store or big hash table
• A class consisting of unordered keys and using JSON (PNUTS)
• A class consisting of ordered keys and semi-structured data storage systems (BigTable, Cassandra (used in Facebook/Apache) and HBase)
• The stores do not use JOINs
• Data written at one node can be replicated to multiple nodes; therefore, data storage is fault-tolerant
• The stores may relax the ACID rules during data store transactions.
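A hedged sketch of a NoSQL document store, assuming the pymongo driver and a MongoDB server running locally (neither is named in the notes); the database, collection and documents are invented. Documents need no fixed schema and are looked up by key without JOINs:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")       # assumes a local MongoDB server
posts = client["bda_demo"]["posts"]                      # database and collection, created lazily

posts.insert_one({"user": "asha", "text": "hello big data", "tags": ["bda"]})
posts.insert_one({"user": "ravi", "likes": 5})            # different fields, same collection

for doc in posts.find({"user": "asha"}):                  # key-based lookup, no JOIN
    print(doc["text"])

client.close()
```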
Big Data Platform
A Big Data platform supports large datasets and volumes of data. The data generate at a higher velocity, in more varieties or with higher veracity. Managing Big Data requires large resources of MPPs, cloud, parallel processing and specialized tools. A Big Data platform should provision tools and services for:
1. storage, processing and analytics,
2. developing, deploying, operating and managing the Big Data environment,
3. reducing the complexity of multiple data sources and integrating applications into one cohesive solution,
4. custom development, querying and integration with other systems, and
5. the traditional as well as Big Data techniques.

Data management, storage and analytics of Big Data captured at companies and services require the following:
1. New innovative non-traditional methods of storage, processing and analytics
2. Distributed Data Stores
3. Creating scalable as well as elastic virtualized platform (cloud computing)
4. Huge volume of Data Stores
5. Massive parallelism
6. High speed networks
7. High performance processing, optimization and tuning
8. Data management model based on Not Only SQL or NoSQL.

9. In-memory column-format transaction processing, or dual in-memory column as well as row formats, for OLAP and OLTP
10. Data retrieval, mining, reporting, visualization and analytics
11. Graph databases to enable analytics with social network messages, pages and data analytics
12. Machine learning or other approaches
13. Big Data sources: data storages, data warehouses, Oracle Big Data, MongoDB NoSQL, Cassandra NoSQL
14. Data sources: sensors, audit trails of financial transactions data, external data such as the web, social media, weather data and health records data.

Hadoop
A Big Data platform consists of Big Data storage(s), server(s), and data management and business intelligence software. Storage can deploy the Hadoop Distributed File System (HDFS) or NoSQL data stores, such as HBase, MongoDB and Cassandra. HDFS is an open-source storage system; it is a scalable, self-managing and self-healing file system.

The Hadoop system packages the application-programming model. Hadoop is a scalable and reliable parallel computing platform. Hadoop manages Big Data distributed databases. (In the accompanying figure, small cylinders represent MapReduce and large cylinders represent Hadoop.)
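A hedged sketch of how word-count logic is typically packaged for Hadoop Streaming: a mapper and a reducer that read standard input and emit tab-separated key/value pairs, simulated locally here; the actual job-submission command and file paths depend on the cluster and are not taken from the notes:

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit 'word<TAB>1' for every word in the input split."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_pairs):
    """Sum the counts for each word; Hadoop sorts pairs by key between the phases."""
    keyed = (pair.split("\t") for pair in sorted_pairs)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

if __name__ == "__main__":
    # Local simulation of map -> shuffle/sort -> reduce over lines piped on stdin.
    mapped = sorted(mapper(sys.stdin))
    for line in reducer(mapped):
        print(line)
```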

Big Data Stack
• A stack consists of a set of software components and data store units. Applications, machine-learning algorithms, analytics and visualization tools use the Big Data Stack (BDS) at a cloud service, such as Amazon EC2, Azure or a private cloud. The stack uses a cluster of high-performance machines.

Big Data Analytics
• Data Analytics can be formally defined as the statistical and mathematical data
analysis that clusters, segments, ranks and predicts future possibilities. An
important feature of data analytics is its predictive, forecasting and
prescriptive capability. Analytics uses historical data and forecasts new values
or results. Analytics suggests techniques which will provide the most efficient
and beneficial results for an enterprise.
• Analysis of data is a process of inspecting, cleaning, transforming and modeling
data with the goal of discovering useful information, suggesting conclusions
and supporting decision making.
Phases in analytics
• Analytics has the following phases before deriving the new facts, providing
business intelligence and generating new knowledge.
• 1. Descriptive analytics enables deriving the additional value from
visualizations and reports

Big Data Analytics
• 2. Predictive analytics is advanced analytics which enables extraction of new facts and knowledge, and then predicts/forecasts.
• 3. Prescriptive analytics enables derivation of additional value and undertaking better decisions for new option(s) to maximize the profits.
• 4. Cognitive analytics enables derivation of additional value and undertaking better decisions.

Figure 1.9 shows an overview of a reference model for analytics architecture. The figure also shows, on the right-hand side, the Big Data file systems, machine-learning algorithms, query languages and usage of the Hadoop ecosystem.
Berkeley Data Analytics Stack (BDAS)

The Berkeley Data Analytics Stack (BDAS) consists of data processing, data management and resource management layers. The following lists these:
• 1. Applications, such as AMP-Genomics and Carat, run at the BDAS. The data-processing software component provides in-memory processing, which processes the data efficiently across the frameworks. AMP stands for Berkeley's Algorithms, Machines and People Laboratory.
• 2. Data processing combines batch, streaming and interactive computations.
• 3. The resource-management software component provides for sharing the infrastructure across various frameworks.
• Figure 1.10 shows a four-layer architecture for the Big Data Stack that consists of Hadoop, MapReduce, Spark core and Spark SQL, Spark Streaming, R, GraphX, MLlib, Mahout, Arrow and Kafka.
Big Data Applications
Big Data in Marketing and Sales
• Data are important for most aspects of marketing, sales and advertising.
• Customer Value (CV) depends on three factors: quality, service and price.
• A definition of marketing is the creation, communication and delivery of value to customers.
• Customer (desired) value means what a customer desires from a product.
• Customer (perceived) value means what the customer believes to have received from a product after purchase of the product.
• Customer value analytics (CVA) means analyzing what a customer really needs. CVA makes it possible for leading marketers, such as Amazon, to deliver consistent customer experiences.

Big Data Analytics in Detection of Marketing Frauds
Big Data analytics enable fraud detection. Big Data usage has the following features for enabling detection and prevention of frauds:
• Fusing of existing data at an enterprise data warehouse with data from sources such as social media, websites, blogs and e-mails, and thus enriching existing data
• Using multiple sources of data and connecting with many applications
• Providing greater insights using querying of the multiple-source data
• Analyzing data which enables structured reports and visualization
• Providing high-volume data mining and new innovative applications, and thus leading to new business intelligence and knowledge discovery
• Making detection of threats less difficult and faster, and predicting likely frauds by using various data and information publicly available.

Big Data Risks

• The large volume and velocity of Big Data provide greater insights but also associate risks with the data used. Data included may be erroneous, less accurate or far from reality. Analytics introduces new errors due to such data.
• The five data risks, described by Bernard Marr, are:
• data security,
• data privacy breach,
• costs affecting profits,
• bad analytics and
• bad data.

Big Data Credit Card Risk Management

Financial institutions, such as banks, extend loans to industrial and household sectors. These institutions in many countries face credit risks, mainly the risks of (i) loan defaults and (ii) untimely return of interest and the principal amount. Financing institutions are keen to get insights into the following:
• 1. Identifying high-credit-rating business groups and individuals
• 2. Identifying the risk involved before lending money
• 3. Identifying industrial sectors with greater risks
• 4. Identifying types of employees (such as daily wage earners at construction sites) and businesses (such as oil exploration) with greater risks
• 5. Anticipating liquidity issues (availability of money for further issue of credit and rescheduling of credit installments) over the years.

Big Data in Healthcare
Big Data analytics in healthcare use the following data sources: (i) clinical records, (ii) pharmacy records, (iii) electronic medical records, (iv) diagnosis logs and notes, and (v) additional data, such as deviations from a person's usual activities, medical leaves from a job and social interactions. Healthcare analytics using Big Data can facilitate the following:
1. Provisioning of value-based and customer-centric healthcare
2. Utilizing the 'Internet of Things' for healthcare
3. Preventing fraud, waste and abuse in the healthcare industry, and reducing healthcare costs (examples of fraud are excessive or duplicate claims for clinical and hospital treatments; an example of waste is unnecessary tests; abuse means unnecessary use of medicines, such as tonics, and of testing facilities)
4. Improving outcomes
5. Monitoring patients in real time.

Big Data in Medicine
• Big Data analytics deploys large volumes of data to identify and derive intelligence using predictive models about individuals. Big Data driven approaches help research in medicine, which can help patients.
• Following are some of these approaches:
• Building the health profiles of individual patients and predictive models for diagnosing better and offering better treatment;
• Aggregating a large volume and variety of information from multiple sources, from DNA, proteins and metabolites to cells, tissues, organs, organisms and ecosystems, which can enhance the understanding of the biology of diseases. Big Data creates patterns and models by data mining and helps in better understanding and research;
• Deploying wearable-device data, which the devices record during active as well as inactive periods, to provide a better understanding of patient health and better risk-profiling of the user for certain diseases.

Big Data in Advertising
• The impact of Big Data on the digital advertising industry is tremendous. The digital advertising industry sends advertisements using SMS, e-mails, WhatsApp, LinkedIn, Facebook, Twitter and other media.
• Big Data captures data from multiple sources in large volume, velocity and variety, much of it unstructured, and enriches the structured data at the enterprise data warehouse. Big Data real-time analytics provide emerging trends and patterns, and gain actionable insights for facing competition from similar products. The data help digital advertisers to discover new relationships, and less competitive regions and areas.
• Success of advertisements depends on the collection, analysis and mining of data. The new insights enable personalization and targeting of online, social media and mobile advertisements, called hyper-localized advertising.
• Advertising on digital media needs optimization. Too much usage can also have a negative effect. Phone calls, SMSs and e-mail-based advertisements can be a nuisance if sent without appropriate research on the potential targets. Analytics help in this direction. The usage of Big Data after appropriate filtering and elimination is a crucial enabler of Big Data analytics with appropriate data, data forms and data handling in the right manner.
IT Companies Generate Huge Data

https://www.youtube.com/watch?v=LkigwsL4qz8
Market and Sales team

Big Data in Medicine

• https://www.youtube.com/watch?v=-TE_CD3vG90

• Simplified Big Data Opportunity Example: Point-of-Care Intervention Opportunities
• Similarly, the insights drawn from Big Data provide significant opportunities to improve healthcare outcomes, increasingly at the point of care, as shown in four simplified examples in the accompanying figure.

