
BIG DATA AND BLOCKCHAIN BASICS

Presenter:
Dr. Poonam Saini
poonamsaini@pec.edu.in
Data platform landscape map
• Complex array of current data platform providers
• Compares platform capabilities
• Shows where providers intersect and diverge
• Helps identify a shortlist of providers that fit enterprise needs
Big Data Everywhere!
• Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at grocery stores
– Bank/Credit Card transactions
– Social Network
– Healthcare Network
– Machines/Automobiles
What is “big data”?
• "Big Data are high-volume, high-velocity, high-variety
information assets that require new forms of processing to
enable enhanced decision making, insight discovery and
process optimization” (Gartner 2012)
• Complicated (intelligent) analysis of data may make small
data “appear” to be “big”
• Bottom line: Any data that exceeds our current capability of
processing can be regarded as “big”
Why is “big data” a “big deal”?
• Government
– Obama administration announced “big data” initiative
– Many different big data programs launched
• Private Sector
– Walmart handles more than 1 million customer transactions every hour, which are
imported into databases estimated to contain more than 2.5 petabytes of data
– Facebook handles 40 billion photos from its user base.
– Falcon Credit Card Fraud Detection System protects 2.1 billion active accounts
world-wide
• Science
– Large Synoptic Survey Telescope will generate 140 terabytes of data every 5 days.
– Biomedical computation such as decoding the human genome & personalized medicine
– Social science revolution
Lifecycle of Data: 4 “A”s
• Acquisition
• Aggregation
• Analysis
• Application
[Slide diagram: a cycle Acquisition → Aggregation → Analysis → Application, in which scattered log data becomes integrated data and, finally, knowledge]
Computational View of Big Data
[Slide diagram of the computational pipeline: data storage, data access, data integration (formatting, cleaning), data analysis, data understanding, and data visualization]
Big Data: from 3 V’s to 4 V’s
Big Data’s Properties
• Variety - the stored data is not all of the same type or category
– Structured data - data that is organized in a structure so that it is
identifiable e.g. SQL data
– Semi-structured data - a form of structured data that has a self-
describing structure yet does not conform with the formal structure
of a relational database e.g. XML
– Unstructured data - data with no identifiable structure e.g. image
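The three categories can be made concrete with Python’s standard library; the records below are made-up illustrations, not data from the slides:

```python
# Structured vs. semi-structured vs. unstructured data, in miniature.
import sqlite3
import xml.etree.ElementTree as ET

# Structured: rows conform to a fixed relational schema (e.g. SQL data).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, name TEXT)")
db.execute("INSERT INTO users VALUES (1, 'Alice')")

# Semi-structured: self-describing tags, but no fixed relational schema (e.g. XML).
doc = ET.fromstring("<user><name>Alice</name><tags><tag>vip</tag></tags></user>")
print(doc.find("name").text)        # -> Alice

# Unstructured: raw bytes with no identifiable structure (e.g. image data).
image_bytes = b"\x89PNG\r\n..."     # opaque without a decoder
```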
Big Data’s Properties…
• Volume - The “Big” in Big data and represents the large volume
or size of the data
– At present, existing data is measured in petabytes and is expected to
grow to zettabytes in the near future
– For example, big social networking sites produce data on the order of
terabytes every day, an amount that is difficult to handle with
traditional systems
Big Data’s Properties…

• Velocity - represents not only the speed at which the data is
incoming, but also the speed at which the data is outgoing
– Traditional systems are not capable of performing analytics on data that
is constantly in motion
• Variability - represents the inconsistency of the data flow
– The flow of data can be highly inconsistent, leading to periodic peaks and
lows
– Daily, seasonal and event-triggered peak data loads can be challenging to
manage, especially for unstructured data
– For example, a large natural disaster would spike page visits for cnn.com
Big Data’s Properties…
• Complexity
– Represents the difficulty of linking, matching, cleansing, and
transforming data from multiple sources
• Value
– Systems must not only be designed to handle Big data efficiently and
effectively, but also be able to filter the most important data from all
of the data collected
– This filtered data is what helps add value to a business
Structure
What is unstructured data?
• A generic label for describing any corporate information that is not in a database
• Information that either does not have a pre-defined data model or is not
organized in a pre-defined manner
• Data that does not reside in fixed locations
• Any data that has no identifiable structure
Why is unstructured data important?
• Unstructured data doubles every three months
• 7 million web pages are added every day
• 80% of business is conducted on unstructured information
• 85% of all data stored is held in an unstructured format
Big Data Challenges and Issues
• Privacy and Security
– The most important issue with Big Data, carrying conceptual,
technical as well as legal significance
– A person’s personal information, when combined with external
large data sets, can lead to the inference of new private facts
about that person
– Big Data used by law enforcement increases the chances that certain
tagged people suffer adverse consequences, without the ability to
fight back or even the knowledge that they are being discriminated
against
Big Data Challenges and Issues
• Data Access and Sharing of Information
– If data is to be used to make accurate decisions in time, it must be
available in an accurate, complete and timely manner
• Storage and Processing Issues
– Many companies are struggling to store the large amount of data they
are producing
• Outsourcing storage to the cloud may seem like an option but long upload times
and constant updates to the data preclude this option
– Processing a large amount of data also takes a lot of time
Big Data Technology
Type of Data
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Image data (Scanned images)
• Semi-structured Data (XML)
• Graph Data
– Social Network, Semantic Web (RDF) etc.
• Streaming Data
– You can only scan the data once
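The one-scan constraint on streaming data can be illustrated with a single-pass aggregate in Python (a generic sketch, not from the slides):

```python
# One-pass (single-scan) aggregation over a data stream: each item is
# seen exactly once and never stored, so memory use stays constant.
def running_mean(stream):
    count, total = 0, 0.0
    for value in stream:          # the only pass over the data
        count += 1
        total += value
    return total / count if count else 0.0

readings = iter([4.0, 6.0, 8.0])  # a generator: consumed once, then gone
print(running_mean(readings))     # -> 6.0
```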
What to do with the data?
• Aggregation and Statistics
– Data warehouse and OLAP
• Indexing, Searching, and Querying
– Keyword based search
– Pattern matching (XML/RDF)
• Knowledge discovery
– Data Mining and Statistical Modeling
– F(Big Data)=knowledge
• Hypothesis Validation
• Hypothesis Creation (EDA)
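As a toy illustration of the aggregation-and-statistics step, here is a warehouse-style roll-up in plain Python over hypothetical transaction rows (a sketch, not a real OLAP engine):

```python
# Warehouse-style aggregation in miniature: group transactions by store
# and roll them up, the kind of summary an OLAP cube would precompute.
from collections import defaultdict

transactions = [                      # hypothetical sample rows
    {"store": "A", "amount": 20.0},
    {"store": "B", "amount": 15.0},
    {"store": "A", "amount": 5.0},
]

totals = defaultdict(float)
for row in transactions:
    totals[row["store"]] += row["amount"]

print(dict(totals))                   # -> {'A': 25.0, 'B': 15.0}
```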
Trends in Big Data Analytics

YEAR          NAME          DEVELOPMENTS
1994-2004     Big Data 1.0  - e-commerce
                            - web mining techniques:
                              Ø web usage mining
                              Ø web structure mining
                              Ø web content mining
2005-2014     Big Data 2.0  - Social media content mining, usage mining, structure mining
                            - Sentiment analytics, natural language processing, computational linguistics
                            - Social network analysis to measure social network structure
2015 onwards  Big Data 3.0  - IoT applications generating huge image, audio and video data
                            - Streaming analytics
Why R?

• Highest paid IT skill (LinkedIn Skills and O’Reilly Survey, 2016)
• 75% of data professionals use R (Rexer Survey, Oct 2015)
• Supports close to 10,000 free packages (CRAN figure as of December 2016)
• Most-used data science programming language (O’Reilly Survey, Jan 2014)
• Second best language after SQL for data science (O’Reilly Survey, 2016)
• R is the #1 Google search for advanced analytics software (Google Trends, April 2016)
• R is #13 of all programming languages (Redmonk Language Ratings, June 2015)
• Demand for R language skills is on the rise

Companies already onboard R: Facebook, Google, Twitter, Uber, BCG, Lloyds of London, McKinsey, ANZ Bank & many more…

www.imarticus.org
What is Hadoop?

Hadoop is transforming businesses across industries: 1 in 4 organizations
use Hadoop to manage their data today (up from 1 in 10 in 2012).

BIG DATA STORAGE AND FASTER PROCESSING
Hadoop is an open source software framework created in 2005 that stores and
processes big data in a distributed manner on large collections of hardware.

BUSINESS SOLUTIONS ACROSS DOMAINS AND INDUSTRIES
A low cost solution with high fault tolerance to access and create value from
data.

“The growing use of Apache Hadoop, increasing data warehouse volume sizes and the accumulation of legacy systems in
organizations are fostering structured data growth. These factors are leading enterprises to understand how to reuse,
repurpose and gain critical insight from this data.” — Gartner
Why Python?

Python is a powerful, flexible, open-source language that is easy to learn, easy to use, and has powerful libraries for data
manipulation and analysis.

What are the reasons for its sudden popularity?
• Cost of ownership – Python is open source software that is free to download
• Versatility – a multi-purpose language that can be used to build an entire application
• Big data compatibility – Python has become one of the big go-to languages for big data processing due to its wide selection of libraries

A data scientist’s dream:
• Python offers extensive analytics capabilities for text & predictive analytics
• It is particularly useful in data analytics because it has a rich library for reading and writing data, running calculations on the information and creating graphical representations of data sets
• The IDLE & Spyder IDEs are widely used for data mining
• MapReduce programs can be written in Python using PyDoop. Here is where Python scores over R: while R uses in-memory processing, Python using PyDoop can process petabytes of data
• Big data analytics made possible by PyDoop and SciPy

Integration:
• In industry, the data science trend shows increasing popularity of Python. A Python-based application stack can more easily integrate a data scientist who writes Python code, since that eliminates a key hurdle in productionizing a data scientist’s work
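Since PyDoop itself requires a Hadoop installation, here is the underlying map/shuffle/reduce pattern in plain Python; the mapper/reducer names follow the general MapReduce convention, not a specific PyDoop API:

```python
# The map/reduce pattern behind PyDoop-style jobs, shrunk to plain Python:
# map emits (key, 1) pairs, shuffle groups pairs by key, reduce sums each group.
from collections import defaultdict

def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return word, sum(counts)

lines = ["big data big deal", "big data"]   # stand-in for HDFS input splits

# Shuffle phase: group all mapped pairs by key.
groups = defaultdict(list)
for line in lines:
    for word, one in mapper(line):
        groups[word].append(one)

counts = dict(reducer(w, c) for w, c in groups.items())
print(counts)   # -> {'big': 3, 'data': 2, 'deal': 1}
```

On a real cluster the shuffle happens across machines; the logic per phase is the same.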
Why Python?

• 46% of job ads mention Python (after SQL) (KDNuggets, Dec 2014)
• 2nd most popular data science language (KDNuggets, 2013)
• Among top in-demand data science skills (KDNuggets, Dec 2014)
• Ranked #1 of all programming languages (CodeEval rankings, Feb 2015)
• Official language of Google

Companies already onboard Python: Google, Yahoo, Quora, Nokia, IBM, National Weather Service, ABN AMRO Bank & many more…
What is Data Visualization?

Data visualization is the presentation of data in a pictorial or graphical format. For centuries, people have depended on
visual representations such as charts and maps to understand information more easily and quickly.

Why Tableau for Data Visualization?

Tableau is a powerful, flexible data visualization tool that is easy to learn, easy to use, and has powerful capabilities for data
visualization and presentation.

• Cost of ownership – Tableau is a competitively priced software package that is available for a trial download
• Versatility – a multi-purpose package that can be used to build an entire application
• Big data compatibility – Tableau has become one of the big go-to software programs for data visualization due to the wide variety of tools it provides and its compatibility with Big Data platforms such as Hadoop
Why Tableau for Data Visualization?

A BUSINESS ANALYST’S DREAM
• Tableau offers powerful visualization capabilities without a single line of code
• It is easy to learn, easy to use, and significantly faster than existing solutions: one can easily see patterns, identify trends and discover visual insights in seconds, with no wizards and no scripts
• Experiment with trend analyses, regressions and correlations
• Tableau facilitates live, up-to-date data analysis that taps into the power of the firm’s data warehouse
• Extract data into Tableau’s data engine and take advantage of its breakthrough in-memory architecture

INTEGRATION
• Tableau integrates exceptionally well with R and Hadoop, making it a powerful visualization tool for analytics and big data use cases
• Developers creating web applications can integrate fully interactive Tableau content into their applications via the JavaScript API
• Scalable, secure and reliable cloud and mobile connectivity
Profiling and monitoring tools
Technologies to handle big data – the layers
Yup! He sent the money

Blockchain
Overview and Fundamentals
Decentralization

 Shifting power and authority away from one central entity
 Making power available to the members themselves
 Enabling community members to be self-sovereign (P2P)
 Examples: BitTorrent, Bitcoin

Decentralization Benefits

 Systems are less likely to fail
 if they rely on separate, redundant components
 Harder to attack
 malicious entities cannot exploit the system’s users, since the users are not all in one place
 Example: a blockchain with 100 nodes is highly resistant to attack

Distributed Ledger
Blockchain provides one virtual ledger:
‣ One common trusted ledger
‣ Today often implemented by a centralized arbitrator
‣ Blockchain creates one single ledger for all parties
‣ Replicated and produced collaboratively
‣ Trust in the ledger comes from
– Cryptographic protection
– Distributed validation
[Slide diagram: Alice, Bob, Charlie, Dave, Eve and Frank all sharing “One Ledger”]
Cryptographic Concepts

 Public Key Cryptography
 Hash Functions
 Digital Signing
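These primitives can be illustrated with Python’s standard hashlib; the snippet below shows only the hash-function piece (a generic SHA-256 example, not tied to any particular blockchain), not public-key cryptography or signing:

```python
# A cryptographic hash maps input of any size to a fixed-size digest;
# changing one character of the input changes the digest unpredictably.
import hashlib

h1 = hashlib.sha256(b"Alice pays Bob 5").hexdigest()
h2 = hashlib.sha256(b"Alice pays Bob 6").hexdigest()

print(len(h1))      # -> 64 hex characters (256 bits), regardless of input size
print(h1 != h2)     # -> True: a tiny change gives a completely different digest
```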
Anatomy of a Block

 Block - the literal building blocks of the blockchain
 What is inside a block exactly?
 The primary purpose: record transactions
 Major fields of a Block:
 Height
 Timestamp
 Nonce
 Hash of previous block
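The fields above can be modeled as a rough sketch in Python; the Block class and its hashing scheme are illustrative assumptions, not any real blockchain’s format:

```python
# A toy block with the fields listed above; real blockchains add many
# more fields (merkle root, difficulty target, transaction list, ...).
import hashlib
import json
import time
from dataclasses import dataclass, field

@dataclass
class Block:
    height: int            # position in the chain
    prev_hash: str         # hash of the previous block: the "chain" link
    transactions: list
    nonce: int = 0
    timestamp: float = field(default_factory=time.time)

    def hash(self) -> str:
        # Hash a canonical serialization of all fields.
        payload = json.dumps(self.__dict__, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

genesis = Block(height=0, prev_hash="0" * 64, transactions=[])
nxt = Block(height=1, prev_hash=genesis.hash(), transactions=["A->B: 5"])
```

Because each block embeds the previous block’s hash, tampering with any block invalidates every block after it.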
Nodes and Network

 Nodes are the computers that make up a blockchain network: the gateway to, and the agents of, the blockchain
 Systems that communicate with one another, ensure the validity of the blockchain and store its local copies
 Full Nodes - store the entire blockchain and verify everything, every single transaction
 Light Nodes - store just a portion of the blockchain
Miners
 Miners are separate from nodes and do not store the blockchain
 These are network participants who create blocks and send them to nodes
 to be verified and then accepted or rejected
 A node could also be a miner, but it need not mine

 Broader steps:
 A full node receives a valid block from a miner
 It adds the block to its local copy of the blockchain
 It broadcasts that block to a few other connected nodes
 Those nodes check the block and broadcast it onward
 The process repeats and the block spreads across the entire network
 The process starts again on the next block
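The verify-then-relay step above can be sketched as the acceptance check a full node might run; block_hash, accept, and the field layout are illustrative assumptions, not a real protocol (real nodes also validate signatures, proof-of-work, and more):

```python
# Minimal check a node performs before accepting a block: does its
# prev_hash match the hash of the node's current chain tip?
import hashlib

def block_hash(block: dict) -> str:
    data = f"{block['height']}|{block['prev_hash']}|{block['nonce']}"
    return hashlib.sha256(data.encode()).hexdigest()

def accept(chain: list, candidate: dict) -> bool:
    tip = chain[-1]
    if candidate["prev_hash"] != block_hash(tip):
        return False            # reject: does not extend our chain
    chain.append(candidate)     # accept: extend local copy, then broadcast
    return True

genesis = {"height": 0, "prev_hash": "0" * 64, "nonce": 7}
good = {"height": 1, "prev_hash": block_hash(genesis), "nonce": 42}
bad = {"height": 1, "prev_hash": "f" * 64, "nonce": 42}

chain = [genesis]
print(accept(chain, good))   # -> True
print(accept(chain, bad))    # -> False
```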
Blockchain is interdisciplinary
