DSECL ZG 522: Big Data Systems

Session 1.1: Introduction to Big Data

BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Janardhanan PS
janardhanan.ps@wilp.bits-pilani.ac.in
What to expect from this course

• The course introduces systems for data analytics with particular emphasis on storage and
processing of Big Data.
• It introduces computing models for distributed processing for scalability and fault-tolerance.
• It covers frameworks and tools for ingestion and batch processing of data stored on distributed
file systems, in-memory distributed processing and stream processing for real time analytics.
NoSQL databases are covered in detail.
• The Big Data Systems course is designed to impart foundation knowledge of Big Data
processing using NoSQL databases, Hadoop and Spark.
• This course helps students master essential skills in NoSQL databases, Hadoop ecosystem
products, and the Apache Spark framework, including Spark SQL, Spark Streaming and machine
learning programming.
• Amazon’s storage and database services are used as exemplar platforms on Cloud.
Big Data Systems BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How Data Scientists find this course beneficial

• Data scientists use machine learning and predictive analytics to gain insights
from large amounts of data (big data).
• To prepare the data, you should build your expertise in big data platforms and
tools, including NoSQL databases, Hadoop, Pig, Hive, Spark, and MapReduce.
• It would be helpful if you are fluent in at least two programming languages,
such as structured query language (SQL), Python, Scala, and Java.
• Acquire working-level knowledge of the Linux operating system.
• You would also benefit from non-technical workplace skills, like
communication, collaboration, intellectual curiosity, and business acumen.

Topics for Experiential learning
Topic No. Selected Topics in Syllabus for experiential learning
1 • Exercises on Distributed Systems – Hadoop configuration and HDFS
• Exercises using Map-reduce model: Standard patterns in map reduce models.
2 • Exercises on NoSQL – Installation and configuration
• Exercises with NoSQL database – Simple CRUD operations and Failure / Consistency tests;
• Cassandra - Consistency levels, CRUD operations, Schema on read, queries
• MongoDB – installation, data ingestion, queries
• Neo4j Graph database - Relationships and queries
• HBase queries
3 • Exercises with Pig queries to perform Map-reduce job and understand how to build queries and underlying principles;
• Exercises on creating Hive databases and HiveQL query operations, exploring built in functions, partitioning, data analysis
4 • Exercises on Spark to demonstrate RDD, and operations such as Map, FlatMap, Filter, PairRDD;
• Typical Spark Programming idioms such as : Selecting Top N, Sorting, and Joins;
• Exercises on DataFrames, Datasets, and Spark SQL
• Spark Streaming - Sample Streams, Structured Streaming, Windowed Streaming
5 • Exercises using Spark MLlib: Regression, Classification, Collaborative Filtering, Clustering
6 • Exercises on Analytics on the Cloud – using AWS S3, AWS EMR, AWS data stores / databases, Querying with DynamoDB.

Text Books, References

T1 Seema Acharya and Subhashini Chellappan. Big Data and Analytics. Wiley India Pvt. Ltd. Second Edition
T2 Raj Kamal and Preeti Saxena. Big Data Analytics. McGraw Hill Education (India) Pvt. Ltd.
R1 DT Editorial Services. Big Data - Black Book. Dreamtech Press. 2016
R2 Kai Hwang, Jack Dongarra, and Geoffrey C. Fox. Distributed and Cloud Computing: From Parallel Processing to the Internet of Things. Morgan Kaufmann. 2011
R3 Martin Kleppmann. Designing Data-Intensive Applications. O'Reilly. 2017
AR Additional reading (As per Topic)

Topics - Sessions

• S1: Introduction to Big Data and data locality

• S2: Parallel and Distributed Processing
• S3: Big Data Analytics and Big Data System characteristics
• S4: Consistency, Availability, Partition tolerance and Data Lifecycle
• S5: Typical NoSQL Databases
• S6: Big Data Lifecycle and distributed computing
• S7: Hadoop Architecture and Programming
• S8: Hadoop subsystems for Storage and Processing
Mid-Sem Exam
• S9-S10: Hadoop ecosystem technologies
• S11-S14: In-memory computing and streaming - Spark
• S15: Big Data and Cloud Computing
• S16: Big Data Storage on cloud (AWS)

Topics for today

• Motivation
✓ Why do modern Enterprises need to work with large volumes of data
✓ What is Big Data and data classification
✓ Scaling RDBMS
• What is a Big Data System
✓ Desirable characteristics
✓ Design challenges
• Architecture
✓ High level architecture of Big Data solutions
✓ Technology ecosystem
✓ Case studies

Example of a data-driven Enterprise: A large online retailer (1)

• What data is collected

✓ Millions of transactions and browsing clicks per day across products, users
✓ Delivery tracking
✓ Reviews on multiple channels - website, social media, customer support
✓ Support emails, logged calls
✓ Ad click and browsing data
✓…
• Data is a mix of metrics, natural language text, logs, events, videos, images etc.

Example of a data-driven Enterprise: A large online retailer (2)

• What is this data used for

✓ User profiling for better shopping experience
✓ Operations efficiency metrics
✓ Improve customer support experience, support training
✓ Demand forecasting
✓ Product marketing
✓…
• Data is the only way to create competitive differentiators, retain customers and
ensure growth

Evolution of Bigdata and their characteristics

Kilobyte (KB) – 10^3 bytes
Megabyte (MB) – 10^6 bytes
Gigabyte (GB) – 10^9 bytes
Terabyte (TB) – 10^12 bytes
Petabyte (PB) – 10^15 bytes
Exabyte (EB) – 10^18 bytes
Zettabyte (ZB) – 10^21 bytes
Yottabyte (YB) – 10^24 bytes

Data volume growth

• Facebook: 500+ TB/day of comments, images, videos etc.
• NYSE: 1 TB/day of trading data
• A jet engine: 20 TB/hour of sensor / log data

Source : What is big data?
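
To put these rates in perspective, the following minimal sketch converts them to longer periods using the decimal byte units. The rates are taken from the slide as given; the 365-day year and the 10-hour flight are assumptions made only for illustration.

```python
# Decimal (SI) byte units.
TB = 10**12
PB = 10**15

# Rates quoted on the slide (taken as given, not verified here).
facebook_per_day = 500 * TB
nyse_per_day = 1 * TB
jet_engine_per_hour = 20 * TB


def per_year_in_pb(bytes_per_day: int) -> float:
    """Yearly volume in petabytes for a constant daily rate (365-day year assumed)."""
    return bytes_per_day * 365 / PB


facebook_pb_per_year = per_year_in_pb(facebook_per_day)
nyse_pb_per_year = per_year_in_pb(nyse_per_day)

# Assumed 10-hour flight, chosen only to make the arithmetic concrete.
jet_tb_per_flight = jet_engine_per_hour * 10 / TB

print(facebook_pb_per_year, nyse_pb_per_year, jet_tb_per_flight)
```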

Variety of data sources

Source : What is Big Data?

Big Data Characteristics

• How big is the Big Data?

• What is big data today may not be so tomorrow
• One's Big Data may be small Data for another

Any data that can challenge our current technology in some manner can be considered as Big Data:
▪ Storage
▪ Communication
▪ Speed of Generating
▪ Meaningful Analysis

Big Data Challenge

⚫ In the past, the most difficult problem for businesses was how to store
all the data.
⚫ The challenge now is no longer to store large amounts of information,
but to understand and analyze this data.
⚫ By making sense out of this data through sophisticated analytics, and by
presenting the key findings in an easily discernible fashion, we can
derive value out of Big Data
⚫ Big data creates a new Digital divide – Those who can process and those
who cannot
⚫ Data is only as useful as the decisions it enables

Changes in Data used in Web applications

• Web applications generate a lot of temporary data that does not really belong in the main
structured data store, e.g.:
✓ shopping carts
✓ retained searches
✓ site personalization
✓ incomplete user questionnaires
• Data sets consist of large quantities of unstructured data in the form of text, images, videos etc.
• Binary large objects in RDBMS (BLOB, CLOB) cannot handle this properly.
• Local data transactions do not have to be very durable, e.g. "liking" items on a website
• Need to run queries against data that do not involve simple hierarchical relations,
e.g. "all people in a social network who have not purchased a book this year"
Data classification

• Structured data is metrics and events that can be put in an RDBMS with a fixed schema
• Semi-structured data are XML and JSON structures, which traditional RDBMS support with
varying efficiency but which need new kinds of NoSQL databases
• New applications produce unstructured data, which could be natural language
text and multi-media content

[Figure: Structured, Semi-structured and Unstructured data; example labels: Databases, XML, Web pages, Images]

Data usage pattern

• There is now higher demand for analyzing unstructured data to glean insights

• Examples:
✓ Analysis of social media content for sentiment analysis
✓ Analysis of unstructured text content by search engines on the web as well as
within enterprise
✓ Analysis of support emails, calls for estimating customer satisfaction
✓ NLP / Conversational AI for answering user questions from backend data

Structured Data

• Data is transformed and stored as per pre-defined schema
• Traditionally stored in RDBMS
• CRUD operations on records
• ACID semantics (Atomicity, Consistency, Isolation, Durability)
• Fine grain security and authorisation
• Known techniques on scaling RDBMS - more on this later
• Typically used by Systems of Record, e.g. OLTP systems, with strong consistency
requirements and read / write workloads

TABLE Employee (
    emp_id int PRIMARY KEY,
    name varchar (50),
    designation varchar(25),
    salary int,
    dept_code int FOREIGN KEY
)
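
The points on fixed schema, CRUD and ACID can be tried out with a minimal sketch using Python's built-in `sqlite3` module. The `Department` table, the sample rows and the failing insert are hypothetical, added only so that the foreign key and the rollback have something to act on; this is an illustration, not a production schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Hypothetical parent table so that dept_code has something to reference.
conn.execute("CREATE TABLE Department (dept_code INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE Employee (
        emp_id      INTEGER PRIMARY KEY,
        name        VARCHAR(50),
        designation VARCHAR(25),
        salary      INTEGER,
        dept_code   INTEGER REFERENCES Department(dept_code)
    )
""")
conn.execute("INSERT INTO Department VALUES (10, 'Engineering')")
conn.execute("INSERT INTO Employee VALUES (1, 'Asha', 'Engineer', 50000, 10)")  # Create
conn.commit()

# Atomicity: the update and the insert succeed together or not at all.
try:
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("UPDATE Employee SET salary = 60000 WHERE emp_id = 1")  # Update
        # dept_code 99 does not exist, so this violates the foreign key.
        conn.execute("INSERT INTO Employee VALUES (2, 'Ravi', 'Analyst', 40000, 99)")
except sqlite3.IntegrityError:
    pass

# Read: the salary update was rolled back along with the failed insert.
salary_after_rollback = conn.execute(
    "SELECT salary FROM Employee WHERE emp_id = 1"
).fetchone()[0]
employee_count = conn.execute("SELECT COUNT(*) FROM Employee").fetchone()[0]

with conn:
    conn.execute("DELETE FROM Employee WHERE emp_id = 1")  # Delete
remaining = conn.execute("SELECT COUNT(*) FROM Employee").fetchone()[0]
```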

Semi-Structured Data

• No explicit data and schema separation
• Models real life situations better because attributes for every record could be
different
• Easy to add new attributes
• XML, JSON structures
• Databases typically support flexible ACID properties, esp consistency of replicas
• Typically used by Systems of Engagement, e.g. social media

{
  "title": "Sweet fresh strawberry",
  "type": "fruit",
  "description": "Sweet fresh strawberry",
  "image": "1.jpg",
  "weight": 250,
  "expiry": "30/5/2021",
  "price": 29.45,
  "avg_rating": 4,
  "reviews": [
    { "user": "p1", "rating": 2, "review": " ….. " } …
  ]
}
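
A minimal sketch of what "attributes for every record could be different" means in practice, using Python's standard `json` module. The second record and its `pages` attribute are hypothetical, added only to show two records with different attribute sets.

```python
import json

# Two product records in one collection; the second has attributes the first lacks.
records_json = """
[
  {"title": "Sweet fresh strawberry", "type": "fruit", "weight": 250, "price": 29.45},
  {"title": "Notebook", "type": "stationery", "price": 3.5, "pages": 200,
   "reviews": [{"user": "p1", "rating": 2}]}
]
"""
records = json.loads(records_json)

# The schema travels with each record: attributes are discovered by reading the data.
all_attributes = sorted({key for record in records for key in record})

# Readers must tolerate missing attributes, for example by accepting None.
weights = [record.get("weight") for record in records]

print(all_attributes)
print(weights)
```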
Unstructured Data (1)

• More real-life data

✓ video, voice, text, emails, chats, comments, reviews,
blogs …
• There is some structure that is typically extracted from
the data depending on the use case
✓ image adjustments at pixel structure level
✓ face recognition from video
✓ tagging of features extracted in image / video
✓ annotation of text

Unstructured Data (2)

• What can we do with it ?

✓ Data mining
• Association rule mining, e.g. market basket or affinity analysis
• Regression, e.g. predict dependent variable from independent variables
• Collaborative filtering, e.g. predict a user preference from group preferences
✓ NLP - e.g. Human to Machine interaction, conversational systems
✓ Text Analytics - e.g. sentiment analysis, search
✓ Noisy text analytics - e.g. spell correction, speech to text

Define Big Data – 3Vs and beyond

From patient records to social media
From modelling jobs to fraud detection
From clickstream analysis to sentiment analysis

Some make it 4Vs

• Volume (Data at Rest): Terabytes to exabytes of existing data to be processed
• Velocity (Data in Motion): Streaming data, milliseconds to seconds to respond
• Variety (Data in many forms): Structured, unstructured, text, multimedia
• Veracity (Data in Doubt): Uncertainty due to data inconsistency, incompleteness,
ambiguity, latency, deception etc.

Big Data – More Vs

Velocity – Data streams

• What are Data Streams?

– Continuous streams
– Huge, Fast, and Changing
– Scan the data only once
• Why Data Streams?
– The arriving speed of streams and the huge amount of data are beyond
our capability to store them.
– “Real-time” processing
• Window Models
– Landmark window (Entire Data Stream)
– Sliding Window
– Hopping Window
• Mining Data Stream
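
The three window models can be sketched as single-pass generators, in line with "scan the data only once". This is a minimal illustration: the function names, the use of a sum as the aggregate, and the count-based (rather than time-based) windows are assumptions made for the example.

```python
from collections import deque
from typing import Iterable, Iterator


def entire_stream_sums(stream: Iterable[int]) -> Iterator[int]:
    """Running aggregate over everything seen since the start of the stream."""
    total = 0
    for value in stream:
        total += value
        yield total


def sliding_sums(stream: Iterable[int], size: int) -> Iterator[int]:
    """Sum of the most recent `size` values, emitted once per new value."""
    window: deque = deque(maxlen=size)
    for value in stream:
        window.append(value)
        if len(window) == size:
            yield sum(window)


def hopping_sums(stream: Iterable[int], size: int, hop: int) -> Iterator[int]:
    """Sum of `size` values, with the window start advancing `hop` values at a time."""
    window: deque = deque(maxlen=size)
    for index, value in enumerate(stream, start=1):
        window.append(value)
        if index >= size and (index - size) % hop == 0:
            yield sum(window)


readings = [1, 2, 3, 4, 5, 6]
print(list(entire_stream_sums(readings)))
print(list(sliding_sums(readings, size=3)))
print(list(hopping_sums(readings, size=4, hop=2)))
```

Each generator keeps at most `size` values in memory, which is the point of windowing when the full stream cannot be stored.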
RDBMS and Web applications

• Most enterprise solutions have RDBMS back-end

• Very little change to RDBMS between 1980 and 2000
• Many application servers, one database:
✓ Response slows down when the DB is concurrently accessed by applications and analytics tools
✓ Response time of SQL queries depends on the size of the database
✓ Easy to parallelize application servers to 100s of servers, but
✓ Harder to parallelize databases to the same scale
• Most data originates from devices and the volume of data grows at a very high rate
• Web-based applications show spikes in usage
❑ Especially true for public-facing e-Commerce sites
• RDBMS becomes a point of contention

Isn’t a traditional RDBMS good enough ?
Example Web Analytics Application
• Designing an application to monitor the page hits for a portal
• Every time a user visits a portal page in a browser, the server side keeps track of that visit
• Maintains a simple database table that holds information about each page hit
• If user visits the same page again, the page hit count is increased by one
• Uses this information for doing analysis of popular pages among the users

Source : Adapted from Big Data by Nathan Marz
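
A minimal sketch of such a page-hit table, using Python's built-in `sqlite3` module. The table name `page_hits`, its columns and the sample visits are assumptions made for illustration; the slide does not specify a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE page_hits (
        user_id TEXT NOT NULL,
        url     TEXT NOT NULL,
        hits    INTEGER NOT NULL,
        PRIMARY KEY (user_id, url)
    )
""")


def record_hit(user_id: str, url: str) -> None:
    """Increment the counter for (user, page), creating the row on the first visit."""
    with conn:
        cur = conn.execute(
            "UPDATE page_hits SET hits = hits + 1 WHERE user_id = ? AND url = ?",
            (user_id, url),
        )
        if cur.rowcount == 0:
            conn.execute("INSERT INTO page_hits VALUES (?, ?, 1)", (user_id, url))


def popular_pages() -> list:
    """Total hits per page, most visited first."""
    return conn.execute(
        "SELECT url, SUM(hits) FROM page_hits GROUP BY url ORDER BY SUM(hits) DESC, url"
    ).fetchall()


for user, page in [("u1", "/home"), ("u2", "/home"), ("u1", "/home"), ("u1", "/cart")]:
    record_hit(user, page)

print(popular_pages())
```

Every page view here costs one database write, which is the behaviour the later scaling slides set out to relieve.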

RDBMS is optimized for space- Not for speed

• Normalization removes data duplication and ensures data consistency

• RDBMS schemas are highly normalized to minimize the data storage and to
speed up inserts, update and deletes
• High degree of normalization is a disadvantage when it comes to retrieving
data, as multiple tables may have to be joined to get all the desired
information
• Creating these joins and reading from multiple tables can have a severe
impact on performance, as multiple reads to disk may be required
• Analytical queries need to access large portions of the whole database,
resulting in long run times

Fixed Table - No flexibility

• RDBMS tables are designed and fixed once and for all

• Number of columns in a table cannot be changed without stopping a running
application
• Schema definition of tables cannot be changed on the fly
• Any change to the table definition involves recreation of the table lasting for
several hours/days
• Growth of number of rows in a table is not unlimited.
• Time taken for processing queries varies with size of table
• Application performance degrades as the number of rows in a table
increases (even with indexing)
• For example, aggregation of values in a table of 1 billion entries may take
hours

Issues with RDBMS (1)
• Not all Big Data use cases need strong ACID semantics,
esp Systems of Engagement
✓ Becomes a bottleneck with many replicas and many
attributes - need to optimize fast writes and reads
with fewer updates
• Fixed schema is not sufficient … as application becomes
popular more attributes need to be captured and DB
modelling becomes an issue.
✓ Which attributes are used depends on the use case.

[Figure: replicas of shards in a social site DB, with write and read paths]
[Figure: want to add field applicable to some products]
Issues with RDBMS (2)

• Cannot handle very wide denormalized attribute sets
✓ what if a JSON had 1000+ attributes, e.g. demographic records
for millions of users
• Data layout formats - column or row major - depends on
use case
✓ What if we query only few columns ? Do I need to
touch the entire row in storage layer ?
• Expensive to retain and query long term data - need low
cost solution
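
The row-major versus column-major question can be illustrated with a minimal in-memory sketch. The field names and values are made up for the example, and plain Python lists and dicts stand in for an actual storage layer.

```python
# The same three records in two layouts; field names are illustrative.
rows = [
    {"user_id": 1, "age": 34, "city": "Pune", "income": 52000},
    {"user_id": 2, "age": 41, "city": "Goa", "income": 61000},
    {"user_id": 3, "age": 29, "city": "Dubai", "income": 48000},
]

# Column-major: one list per attribute, with positions aligned across lists.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def average_age_row_major(table: list) -> float:
    """Visits every full record even though only `age` is needed."""
    return sum(record["age"] for record in table) / len(table)


def average_age_column_major(table: dict) -> float:
    """Reads only the `age` column; the other attributes are never touched."""
    ages = table["age"]
    return sum(ages) / len(ages)


print(average_age_row_major(rows), average_age_column_major(columns))
```

Both functions return the same answer; the difference is how much data must be read to get it, which matters when a record has hundreds of attributes and a query needs only a few.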

RDBMS Bottleneck in scalability

• RDBMS engines are monolithic software systems

• Originally designed to run on single CPU systems
• Not core aware – not fully multi-threaded to take advantage of all cores
• For more performance, upgrade servers - Enjoy free performance lunch
• Upgrading a server is an exercise that requires application downtime
• CPU clock speed has hit the thermal barrier
• Processors moving to multi-core
• Multi-core adaptation needs software redesign (No free lunch)
• Given the relatively unpredictable user growth rate of modern software
systems, there is either over or under provisioning of resources
• Evolution of In Memory Data Grids and NoSQL helps in overcoming the
limitations of traditional RDBMS used in modern web applications

Road Blocks to RDBMS Scaling

❑ RDBMS technology is a forced fit for modern interactive software systems

❑ RDBMS is incredibly complex internally, and changes are difficult
❑ Vendors of RDBMS technology have little incentive to disrupt a technology generating
billions of dollars for them annually
❑ It requires huge investments to address the RDBMS scaling issue and find out a viable
solution
❑ Techniques used to extend the useful scope of RDBMS technology fight symptoms but
not the disease itself

Provocative statement:
The relational database will be a footnote in history, because of fundamental flaws in the
RDBMS approach to managing relational data of the present era

Scaling of RDBMS

• RDBMS can scale up (Vertical scaling) on a bigger server – Enjoys free performance lunch
provided by Moore's law
• When the capacity of single server is reached, solution is to scale out and distribute the load
across multiple servers
• RDBMS were designed for single CPU systems – scale out bottleneck
• This is when the complexity of relational databases starts to rub against their potential to scale
• Began to look at multi-node database solutions known as ‘scaling out’ or ‘horizontal scaling’

• Different approaches for RDBMS scaling include:

✓ Queuing
✓ Master-Slave
✓ Sharding

Scaling with intermediate layer
Using a queue
• Portal is very popular, lot of users visiting it
✓ Many users are concurrently visiting the pages of portal
✓ Every time a page is visited, database needs to be updated to keep track of this visit
✓ Database write is heavy operation
✓ Database write is now a bottleneck !
• Solution
✓ Use an intermediate queue between the web server and database
✓ Queue will hold messages
✓ Message will not be lost
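
A minimal sketch of the queue idea: the web server only enqueues events, and a worker applies them to the database in batches, so many page views cost one transaction. The names `on_page_view` and `flush_batch` are hypothetical. Python's in-process `queue.Queue` is used only to keep the example self-contained; it does not survive a crash, so the "message will not be lost" property would need a durable message broker, which is outside this sketch.

```python
import queue
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_hits (url TEXT PRIMARY KEY, hits INTEGER NOT NULL)")

hit_queue: "queue.Queue[str]" = queue.Queue()


def on_page_view(url: str) -> None:
    """Web server side: enqueue the event instead of writing to the database."""
    hit_queue.put(url)


def flush_batch(max_events: int = 100) -> int:
    """Worker side: drain up to `max_events` events and apply them in one transaction."""
    batch = Counter()
    drained = 0
    while drained < max_events:
        try:
            batch[hit_queue.get_nowait()] += 1
        except queue.Empty:
            break
        drained += 1
    with conn:
        for url, count in batch.items():
            cur = conn.execute(
                "UPDATE page_hits SET hits = hits + ? WHERE url = ?", (count, url)
            )
            if cur.rowcount == 0:
                conn.execute("INSERT INTO page_hits VALUES (?, ?)", (url, count))
    return drained


for url in ["/home", "/home", "/cart", "/home"]:
    on_page_view(url)
events_written = flush_batch()
```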

Scaling out RDBMS – Master/Slave

• All writes are written to the master.

• All reads performed against the replicated slave databases
• Good for mostly read, very few update applications
• Critical reads may be incorrect as writes may not have been propagated down
• Large data sets can pose problems as master needs to duplicate data to slaves
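
The "critical reads may be incorrect" point can be seen in a toy model where replication is an explicit step. The class names are hypothetical, `Replica` plays the role of a slave database, and the `replicate()` call stands in for what would be asynchronous log shipping in a real system.

```python
class Replica:
    """Read-only copy of the data (a 'slave' in the slide's terminology)."""

    def __init__(self) -> None:
        self.data: dict = {}


class Master:
    def __init__(self, replicas: list) -> None:
        self.data: dict = {}
        self.replicas = replicas
        self.pending: list = []  # writes not yet shipped to the replicas

    def write(self, key: str, value: int) -> None:
        """All writes go to the master and are queued for replication."""
        self.data[key] = value
        self.pending.append((key, value))

    def replicate(self) -> None:
        """Ship queued writes to every replica."""
        for key, value in self.pending:
            for replica in self.replicas:
                replica.data[key] = value
        self.pending.clear()


replica = Replica()
master = Master([replica])

master.write("balance", 100)
master.replicate()
master.write("balance", 40)

stale_read = replica.data["balance"]  # the second write has not propagated yet
master.replicate()
fresh_read = replica.data["balance"]
```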

Scaling out RDBMS- Sharding

• Sharding is a type of database partitioning that separates large databases into
smaller, faster, more easily managed parts.
• These smaller parts are called data shards and are hosted on multiple machines
on same or different types of databases (MySQL/PostgreSQL)
• The word shard means "a small part of a whole."
• Need to manage parallel access in the application
• Scales well for both reads and writes
• Not transparent, application needs to be partition-aware
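
"Partition-aware" means the application itself decides which shard holds a key. A minimal sketch, assuming simple hash-modulo routing; the shard names are hypothetical and each shard is stood in for by a dict rather than a separate database server.

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]  # hypothetical shard names


def shard_for(key: str, shards: list = SHARDS) -> str:
    """Map a key to a shard with a stable hash (built-in hash() is salted per process)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


# In practice each shard would be a separate database server.
storage = {name: {} for name in SHARDS}


def put(key: str, value: int) -> None:
    storage[shard_for(key)][key] = value


def get(key: str):
    return storage[shard_for(key)].get(key)


for i in range(1000):
    put(f"user-{i}", i)

print({name: len(rows) for name, rows in storage.items()})
```

With modulo routing, changing the number of shards changes the target shard for most keys, which is one reason re-sharding is costly.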

Scaling with Database Partitions (Sharding)

• Application is too popular

✓ Users are using it very heavily, increasing the
load on application
✓ Maintaining the page view count is becoming
difficult even with queue
• Solution
✓ Use database partitions
✓ Data is divided into partitions (shards) which are
hosted on multiple machines
✓ Database writes are parallelized
✓ Scalability increasing
✓ Also complexity increasing!

Difference between Sharding and Partitioning

▪ Sharding and partitioning break up a large database into
smaller databases
▪ But, there is a difference between the two methods.
▪ After a database is sharded, the data in the new tables
is spread across multiple systems
▪ But, partitioning groups data subsets within a single
database instance.

Issues with RDBMS sharding

• With too many shards some disk is bound to fail

• Fault tolerance needs shard replicas - so more things to manage
• Complex logic to read / write because need to locate the right
shard - human errors can be devastating
• Keep re-sharding and balancing as data grows or load increases
• What is the consistency semantics of updating replicas ? Should a
read on a replica be allowed before it is updated ?
• Is it optimised when data is written once and read many times or
vice versa ?

[Figure: replicas of shards, with write and read paths]

Topics for today

• Motivation
✓ Why do modern Enterprises need to work with data
✓ What is Big Data and data classification
✓ Scaling RDBMS
• What is a Big Data System
✓ Characteristics
✓ Design challenges
• Architecture
✓ High level architecture of Big Data solutions
✓ Technology ecosystem
✓ Case studies

Desirable Characteristics of Big Data Systems (1)

• Application does not need to bother about common issues
like sharding, replication
✓ Developers more focused on application logic rather
than data management
• Easier to model data with flexible schema
✓ Not necessary that every record has same set of
attributes
• If possible, treat data as immutable
✓ Keep adding timestamped versions of data values
✓ Avoid human errors by not destroying a good copy

[Figure labels: replicated / partitioned storage; key-value, document, graph; t1, k  t2, k  t3, k …]
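
Treating data as immutable can be sketched as an append-only store in which every write adds a timestamped version and nothing is overwritten. The class `VersionedStore` and its methods are hypothetical names for this illustration, and integer timestamps are an assumption.

```python
from typing import Any, List, Optional, Tuple


class VersionedStore:
    """Append-only key-value store: writes add (timestamp, value) versions."""

    def __init__(self) -> None:
        self._versions: dict = {}

    def put(self, key: str, timestamp: int, value: Any) -> None:
        self._versions.setdefault(key, []).append((timestamp, value))

    def get(self, key: str, as_of: Optional[int] = None) -> Any:
        """Return the latest value, or the latest value at or before `as_of`."""
        candidates = [
            (ts, value)
            for ts, value in self._versions.get(key, [])
            if as_of is None or ts <= as_of
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda version: version[0])[1]

    def history(self, key: str) -> List[Tuple[int, Any]]:
        return sorted(self._versions.get(key, []), key=lambda version: version[0])


store = VersionedStore()
store.put("k", 1, "draft")
store.put("k", 2, "published")
store.put("k", 3, "publishd")  # a mistaken write does not destroy the earlier good copy

print(store.get("k"), store.get("k", as_of=2))
```

Because the bad write at t3 sits beside the earlier versions rather than replacing them, recovering is a matter of reading as of t2.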

Desirable Characteristics of Big Data Systems (2)

• Application specific consistency models

✓ a reader may read a replica that has not been updated yet,
as in “read preference” options in MongoDB
✓ e.g. comments on social media
• Treat data as immutable. Handle high data volume, at very
fast rate coming from variety of sources because immutable
writes are faster with flexible consistency models
✓ Keep adding data versions with timestamp
✓ Replica updates can keep happening in the background

Cassandra is a NoSQL database for write-heavy workload and eventual consistency

Desirable Characteristics of Big Data Systems (3)

• Built as distributed and incrementally scalable systems

✓ add new nodes to scale as in a Hadoop cluster
• Options to have cheaper long term data retention
✓ long term data reads can have more latency and can be less expensive
to store on commodity hardware, e.g. Hadoop file system (HDFS)
• Generalized programming models that work close to the data
✓ e.g. Hadoop map-reduce that runs tasks on data nodes
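
The map-reduce programming model can be sketched in plain Python with the classic word count. This runs in a single process and only shows the shape of the model (map, shuffle, reduce); it does not use Hadoop, and the function names are illustrative.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document: str) -> list:
    """Mapper: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs) -> dict:
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list) -> tuple:
    """Reducer: combine all values for one key."""
    return key, sum(values)


splits = ["big data systems", "Big Data needs big systems"]
mapped = chain.from_iterable(map_phase(split) for split in splits)
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

print(word_counts)
```

In a cluster, each split would be mapped on the node that already stores it, which is what "work close to the data" refers to.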

Challenges in Big Data Systems (1)

• Latency issues in algorithms and data storage working with large data sets
• Basic design considerations of Distributed and Parallel systems - reliability,
availability, consistency
• What data to keep and for how long - depends on analysis use case
• Cleaning / Curation of data
• Overall orchestration involving large volumes of data
• Choose the right technologies from many options, including open source, to build
the Big Data System for the use cases

Needs of Big Data Systems

• Processing of large data volume

• Intensive computations

• Scalability enables increase or decrease in the capacity of data storage,
processing and analytics, as per the complexity of computations and volume of
data
• Types of Scalability
✓ Vertical
✓ Horizontal
✓ Elastic

Vertical Scalability (Scaling Up)

• Scaling up the given system’s resources and increasing the system’s analytics, reporting and
visualization capabilities
• Solve problems of greater complexities by scaling up
• For example, if x TB of data takes time t to process and the code size grows by a
factor n with increasing complexity, then scaling up means that processing takes
equal to, less than or much less than (n×t) for x TB.
• Server changes
▪ More powerful CPU
▪ More memory
▪ Product companies
▪ Enjoys Free Performance Lunch facilitated by Moore’s law
▪ Exploited by RDBMS aka SQL Databases by new releases

Horizontal Scalability (Scaling Out)

• Horizontal scalability means increasing the number of systems working in
coherence and scaling out the workload
• Processing different subsets of a large dataset by increasing the number of systems
running in parallel.
• Scaling out means using more resources and distributing the processing and
storage tasks in parallel
• If r resources in a system process x TB of data in time t, then (p×x) TB can be
processed on p parallel distributed nodes such that the time taken remains t or is
slightly more than t
• Parallelization of jobs at several levels:
(i) Distributing separate tasks onto separate threads on the same CPU,
(ii) Distributing separate tasks onto separate CPUs on the same computer and
(iii) Distributing separate tasks onto separate computers
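
The partition, process and combine pattern behind scaling out can be sketched with a thread pool, where worker threads stand in for nodes. The sum-of-squares workload and the function names are assumptions made for illustration. In standard CPython, threads do not speed up CPU-bound work like this, so the sketch shows the structure of the computation only, not a performance gain.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data: list, parts: int) -> list:
    """Split `data` into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def process_chunk(chunk: list) -> int:
    """Per-node work: here, a partial sum of squares."""
    return sum(x * x for x in chunk)


data = list(range(1, 101))
chunks = partition(data, parts=4)

# Each chunk is processed independently; results are combined afterwards.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(process_chunk, chunks))
total = sum(partial_results)

print(total)
```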

Elastic Scaling – Cloud computing

Cloud computing
• on-demand service
• resource pooling,
• scalability,
• accountability, and
• broad network access.

Elastic scaling (scaling up and scaling down) by dynamic provisioning based
on computational need

Types of Big Data solutions
1. Batch processing of big data sources at rest
✓ Building ML models, statistical aggregates
✓ “What percentage of users in US last year watched shows starring Kevin Spacey and
completed a season within 4 weeks or a movie within 4 hours”
✓ “Predict number of US family users in age 30-40 who will buy a Kelloggs cereal if
they purchase milk”

2. Real-time processing of big data in motion

✓ Fraud detection from real-time financial transaction data
✓ Detect fake news on social media platforms

3. Interactive exploration with ad-hoc queries

✓ Which region and product has least sales growth in last quarter

Big data architecture style

• Designed to handle the ingestion, processing, and analysis of data that is too large or complex for
traditional database systems.

Source : Microsoft Big Data Architecture

Lambda Architecture
• Lambda architecture is a way of processing massive quantities of data (i.e. “Big Data”) that provides
access to batch-processing and stream-processing methods with a hybrid approach .
• Lambda architecture is used to solve the problem of computing arbitrary functions in real time.
• The lambda architecture is composed of 3 layers:
1. Batch Layer
2. Serving Layer
3. Speed Layer (Stream Layer)

What Is Lambda Architecture? (databricks.com)

Source: http://vda-lab.github.io/2019/10/lambda-architecture
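
How the three layers cooperate can be sketched with a toy page-view counter. The class `LambdaCounter` and its method names are hypothetical, and everything runs in one process; in a real deployment the batch layer would be a long-running job over a distributed file store and the speed layer a stream processor.

```python
from collections import Counter


class LambdaCounter:
    """Toy page-view counter with batch, speed and serving layers."""

    def __init__(self) -> None:
        self.master_dataset: list = []  # immutable, append-only raw events
        self.batch_view = Counter()     # recomputed from scratch by the batch layer
        self.speed_view = Counter()     # counts for events since the last batch run

    def ingest(self, url: str) -> None:
        """New events go to both the master dataset and the speed layer."""
        self.master_dataset.append(url)
        self.speed_view[url] += 1

    def run_batch(self) -> None:
        """Batch layer: recompute the view over all data, then reset the speed view."""
        self.batch_view = Counter(self.master_dataset)
        self.speed_view.clear()

    def query(self, url: str) -> int:
        """Serving layer: merge the batch view with the real-time view."""
        return self.batch_view[url] + self.speed_view[url]


counter = LambdaCounter()
counter.ingest("/home")
counter.ingest("/home")
counter.run_batch()
counter.ingest("/home")  # arrives after the batch run; only the speed layer has it

print(counter.query("/home"))
```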
Logical Layers in Big Data Processing Architecture

Big Data Systems Components (1)

1. Data sources
✓ One or more data sources like
databases, docs, files, IoT devices,
images, video etc.

2. Data Storage
✓ Data for batch processing operations
is typically stored in a distributed file
store that can hold high volumes of
large files in various formats.
✓ Data can also be stored in key-value
stores.

e.g. social data e.g. medical images

Big Data Systems Components (2)

3. Batch processing
✓ Process data files using long-running
parallel batch jobs to filter, sort,
aggregate or prepare the data for
analysis.
✓ Usually these jobs involve reading source
files, processing them, and writing the
output to new files.

e.g. search scans on unindexed docs

4. Real-time message ingestion

✓ Capture data from real-time sources and
integrate with stream processing.
Typically these are in-memory systems
with optional storage backup for
resiliency.

Big Data Systems Components (3)
5. Stream processing
Filter, aggregate or otherwise prepare the data in real time
for further analysis. These are mainly in-memory systems.
The processed stream data is then written to an output sink:
files, a database, or an API.
e.g. fraud detection logic
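
To make the fraud detection example concrete, here is a small sketch of one possible stream rule: flag a card when it makes more than a given number of transactions inside a sliding time window. The rule, the class name `VelocityRule` and the thresholds are illustrative assumptions; the `alerts` list stands in for the output sink.

```python
from collections import defaultdict, deque

class VelocityRule:
    """Flags a card that transacts too often within a sliding time window."""

    def __init__(self, max_txns: int, window_seconds: float):
        self.max_txns = max_txns
        self.window = window_seconds
        self.recent = defaultdict(deque)   # card id -> timestamps in the window

    def process(self, card: str, timestamp: float) -> bool:
        """Return True if this transaction should be flagged."""
        times = self.recent[card]
        times.append(timestamp)
        # Evict timestamps that have fallen out of the sliding window.
        while times and times[0] <= timestamp - self.window:
            times.popleft()
        return len(times) > self.max_txns

rule = VelocityRule(max_txns=3, window_seconds=60)
events = [("c1", 0), ("c1", 10), ("c2", 15), ("c1", 20), ("c1", 30), ("c1", 120)]
# Each event is processed once, in arrival order, using only in-memory state.
alerts = [(card, ts) for card, ts in events if rule.process(card, ts)]
```

Only the fourth transaction of card `c1` within one minute is flagged; by the time of the event at t=120 the earlier timestamps have been evicted.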

6. Analytical data store

Real-time or batch processing can be used to
prepare the data for further analysis. The
processed data is stored in a structured format to
be queried using analytical tools. The analytical
data store used to serve these queries can be a
Kimball-style relational data warehouse or a
Big Data warehouse like Hive. There may also be
NoSQL stores such as MongoDB or HBase.
e.g. financial transaction history across clients
for spend analysis
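
A spend-analysis query over such a store might look like the sketch below. An in-memory SQLite database from the Python standard library is used purely as a stand-in for a warehouse such as Hive; the table name, columns and rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (client TEXT, category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("acme", "travel", 120.0),
        ("acme", "software", 300.0),
        ("globex", "travel", 80.0),
        ("acme", "travel", 60.0),
    ],
)
# Structured data lets analytical tools aggregate with plain SQL.
spend = conn.execute(
    "SELECT client, category, SUM(amount) AS total "
    "FROM transactions GROUP BY client, category ORDER BY client, category"
).fetchall()
```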
Big Data Systems Components (4)

7. Analysis and reporting

The goal of most big data solutions is to provide
insights into the data through analysis and
reporting. These can be various OLAP, search and
reporting tools.
e.g. weekly management report
8. Orchestration and ETL
Most big data solutions consist of repeated data
processing operations, encapsulated in workflows,
that transform source data, move data between
multiple sources and sinks, load the processed data
into an analytical data store, or push the results
straight to a report or dashboard. To automate these
workflows, you can use an orchestration technology
such as Azure Data Factory, Apache Oozie or Sqoop.
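
The core job of an orchestrator, running steps in dependency order, can be sketched with the standard-library `graphlib` module (Python 3.9+). The step names and the `run_workflow` helper are hypothetical; real orchestrators add scheduling, retries and monitoring on top of this.

```python
from graphlib import TopologicalSorter

def run_workflow(tasks, dependencies):
    """Run each task after all tasks it depends on; return the execution order."""
    order = []
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()          # execute the step
        order.append(name)
    return order

state = {}
tasks = {
    "extract":   lambda: state.update(raw=[3, 1, 2]),
    "transform": lambda: state.update(clean=sorted(state["raw"])),
    "load":      lambda: state.update(store=list(state["clean"])),
    "report":    lambda: state.update(report=f"max={max(state['store'])}"),
}
# Each key maps to the set of steps that must finish first.
dependencies = {
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}
order = run_workflow(tasks, dependencies)
```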

Technology Ecosystem (showing mostly Apache projects)

• Coordination: Zookeeper (keeps meta-data for distributed frameworks)
• In-memory processing: Spark
• Stream processing: Storm, Kafka, Spark-S (in-memory processing of streaming data)
• SQL over Hadoop: Hive (DWH)
• NoSQL: HBASE, MongoDB (key-value stores)
• Search: Solr (indexed data)
• ETL: Flume, Sqoop (data ingest)
• Machine Learning: SparkMLlib (complex processing)
• Scripting: Pig
• Scheduler: Oozie (manage workflows)
• Resource management and basic map-reduce: Yarn for Hadoop* nodes (manage map-reduce)
• Storage: HDFS

* nodes run map-reduce jobs (more on this later)
** we’ll cover all technologies in detail
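
As a preview of the map-reduce model mentioned in the footnote, here is a single-process word count sketch. The function names are illustrative, and a real framework would run the map and reduce phases in parallel across nodes.

```python
from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) for every word in one input split.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine all values for one key into a single result.
    return key, sum(values)

splits = ["big data systems", "Big Data storage", "data locality"]
mapped = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```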
Case Study: IT Ops
Using Big Data tools and architecture for managing IT
IT Operations Analytics

• IT systems generate large volumes of monitoring, logging and event data

• Can we use this data to proactively look for anomalous patterns and predict an issue?
• Can we localise possible symptoms and possible causes?
• Can we help an engineer quickly explore the data to confirm the specific root cause?

Real-time streaming analysis of metrics

- in-memory fast compute
- real-time model updates and model lookups

Interactive time-sensitive search and exploration of log and metric data

- older data may take time (e.g. has this happened earlier in the last month?)
- but new data should be fast (e.g. what happened a few minutes back?)
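
One simple way to combine "real-time model updates and model lookups" is an online mean and standard deviation per metric, with a point flagged when it falls far outside the current normal range. The sketch below uses Welford's online algorithm; the threshold, warm-up length and sample values are illustrative assumptions, not part of the case study.

```python
import math

class RunningStats:
    """Online mean and variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

def detect(stream, threshold=3.0, warmup=5):
    stats, anomalies = RunningStats(), []
    for i, x in enumerate(stream):
        sd = stats.std()
        # Model lookup: compare the new point with the current normal range.
        if stats.n >= warmup and sd > 0 and abs(x - stats.mean) > threshold * sd:
            anomalies.append((i, x))
        else:
            stats.update(x)   # model update: learn only from normal points
    return anomalies

cpu = [50, 52, 49, 51, 50, 48, 95, 51, 50]   # hypothetical CPU utilisation (%)
anomalies = detect(cpu)
```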

Big Data Platform

• Coordination: Zookeeper
• In-memory processing: Spark
• Stream processing: Spark streaming (detect metric anomalies)
• SQL over Hadoop: Hive (search on long-term logs)
• NoSQL: Cassandra (store metrics as key-value pairs)
• Search: Solr (search on short-term logs)
• ETL: Kafka, Logstash (integrate with metric and log sources)
• Machine Learning: SparkMLlib (build models on metrics)
• Scripting: Pig (modelling logic for metric dependencies and normal ranges)
• Scheduler: Custom
• Resource management and basic map-reduce: Yarn for Hadoop nodes
• Storage: HDFS

Where will you apply this architecture style?

• Consider this architecture style when you need to:

✓ Store and process data in volumes too large for a traditional database
✓ Handle semi-structured data or data with evolving structure - e.g. demographic
data with hundreds of attributes
✓ Transform unstructured data for analysis and reporting
✓ Capture, process, and analyze unbounded streams of data with low latency
✓ Capture, process, and analyze unbounded historical data cost effectively

Big data architecture benefits
• Technology choices
✓ A variety of technology options is available, both open source and from vendors
• Performance through parallelism
✓ Big data solutions take advantage of data or task parallelism, enabling high-performance solutions that
scale to large volumes of data.
• Elastic scale
✓ All of the components in the big data architecture support scale-out provisioning, so that you can adjust your
solution to small or large workloads and pay only for the resources that you use.
• Flexibility with consistency semantics (more in CAP theorem)
✓ E.g. Cassandra or MongoDB can allow inconsistent (stale) reads in exchange for better scale and fault tolerance
• Good cost performance ratio
✓ Ability to reduce cost at the expense of performance. E.g. long term data storage in commodity HDFS
nodes.
• Interoperability with existing solutions
✓ The components of the big data architecture are also used for IoT processing and enterprise BI solutions,
enabling you to create an integrated solution across data workloads. e.g. Hadoop can work with data in
Amazon S3.
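
The trade-off behind flexible consistency semantics can be illustrated with a toy quorum model: with N replicas, W write acknowledgements and R replicas contacted per read, a read is guaranteed to see the latest write only when R + W > N. The class below is a deliberately simplified sketch, not how Cassandra or MongoDB are implemented; it always reads from the replicas least likely to be up to date in order to show the worst case.

```python
class ReplicatedStore:
    """Toy quorum store: N replicas, each holding key -> (version, value)."""

    def __init__(self, n):
        self.replicas = [dict() for _ in range(n)]
        self.version = 0

    def write(self, key, value, w):
        # Only the first w replicas acknowledge; the others lag behind.
        self.version += 1
        for replica in self.replicas[:w]:
            replica[key] = (self.version, value)

    def read(self, key, r):
        # Worst case: contact the last r replicas, then pick the newest version.
        answers = [rep[key] for rep in self.replicas[-r:] if key in rep]
        return max(answers)[1] if answers else None

store = ReplicatedStore(n=3)
store.write("balance", 100, w=3)     # fully replicated initial value
store.write("balance", 80, w=2)      # the third replica misses this update
stale = store.read("balance", r=1)   # r + w = 3 is not > n: read may be stale
fresh = store.read("balance", r=2)   # r + w = 4 > n: overlaps a fresh replica
```

Smaller R and W mean fewer replicas to wait for, which is where the gain in scale and fault tolerance comes from, at the price of possibly stale reads.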

Big data architecture challenges

• Complexity
✓ Big data solutions can be extremely complex, with numerous components to handle data
ingestion from multiple data sources. It can be challenging to build, test, and troubleshoot big
data processes.
• Skillset
✓ Many big data technologies are highly specialized, and use frameworks and languages that
are not typical of more general application architectures. On the other hand, big data
technologies are evolving new APIs that build on more established languages.
• Technology maturity
✓ Many of the technologies used in big data are evolving. While core Hadoop technologies
such as Hive and Pig have stabilized, emerging technologies such as Spark introduce
extensive changes and enhancements with each new release.

Topics for today

• Motivation
✓ Why do modern Enterprises need to work with data
✓ What is Big Data and data classification
✓ Scaling RDBMS
• What is a Big Data System
✓ Characteristics
✓ Design challenges
• Architecture
✓ High level architecture of Big Data solutions
✓ Technology ecosystem
✓ Case studies

Summary

• Why modern Enterprises and new age applications are data-centric

• Challenges with existing data systems
• Advantages and challenges with Big Data systems
• High level architecture and technology ecosystem
• Some real applications using the tech stack

Next Session:
Locality of Reference (LOR)
