BDS-Session-1.1
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Janardhanan PS
janardhanan.ps@wilp.bits-pilani.ac.in
What to expect from this course
• The course introduces systems for data analytics with particular emphasis on storage and
processing of Big Data.
• It introduces computing models for distributed processing for scalability and fault-tolerance.
• It covers frameworks and tools for ingestion and batch processing of data stored on distributed
file systems, in-memory distributed processing and stream processing for real time analytics.
NoSQL databases are covered in detail.
• The Big Data Systems course is designed to impart foundation knowledge of Big Data
processing using NoSQL databases, Hadoop and Spark.
• This course will help students master essential skills in NoSQL databases, Hadoop ecosystem
products, and the Apache Spark framework, including Spark SQL, Spark Streaming and machine
learning programming.
• Amazon’s storage and database services are used as exemplar platforms on Cloud.
Big Data Systems BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How Data Scientists find this course beneficial
• Data scientists use machine learning and predictive analytics to gain insights
from large amounts of data (big data).
• To prepare the data, you should build your expertise in big data platforms and
tools, including NoSQL databases, Hadoop, Pig, Hive, Spark, and MapReduce.
• It would be helpful if you are fluent in at least two of the following languages:
structured query language (SQL), Python, Scala, and Java.
• Acquire working-level knowledge of the Linux operating system.
• You would also benefit from non-technical workplace skills, like
communication, collaboration, intellectual curiosity, and business acumen.
Topics for Experiential learning
Topic No. Selected Topics in Syllabus for experiential learning
1 • Exercises on Distributed Systems – Hadoop configuration and HDFS
• Exercises using Map-reduce model: Standard patterns in map reduce models.
2 • Exercises on NoSQL – Installation and configuration
• Exercises with NoSQL database – Simple CRUD operations and Failure / Consistency tests;
• Cassandra - Consistency levels, CRUD operations, Schema on read, queries
• MongoDB – installation, data ingestion, queries
• Neo4j Graph database - Relationships and queries
• HBase queries
3 • Exercises with Pig queries to perform Map-reduce job and understand how to build queries and underlying principles;
• Exercises on creating Hive databases and HiveQL query operations, exploring built in functions, partitioning, data analysis
4 • Exercises on Spark to demonstrate RDD, and operations such as Map, FlatMap, Filter, PairRDD;
• Typical Spark Programming idioms such as : Selecting Top N, Sorting, and Joins;
• Exercises on DataFrames, Datasets, and Spark SQL
• Spark Streaming - Sample Streams, Structured Streaming, Windowed Streaming
5 • Exercises using Spark MLlib: Regression, Classification, Collaborative Filtering, Clustering
6 • Exercises on Analytics on the Cloud – using AWS S3, AWS EMR, AWS data stores / databases, Querying with DynamoDB.
Text Books, References
T1 Seema Acharya and Subhashini Chellappan. Big Data and Analytics. Wiley
India Pvt. Ltd. Second Edition
T2 Raj Kamal and Preeti Saxena. Big Data Analytics. McGraw Hill Education
(India) Pvt. Ltd.
R1 DT Editorial Services. Big Data - Black Book. Dreamtech Press. 2016
R2 Kai Hwang, Jack Dongarra, and Geoffrey C. Fox. Distributed and Cloud
Computing: From Parallel Processing to the Internet of Things. Morgan
Kaufmann. 2011
R3 Martin Kleppmann. Designing Data-Intensive Applications. O'Reilly. 2017
AR Additional reading (As per Topic)
Topics - Sessions
Topics for today
• Motivation
✓ Why do modern Enterprises need to work with volume data
✓ What is Big Data and data classification
✓ Scaling RDBMS
• What is a Big Data System
✓ Desirable characteristics
✓ Design challenges
• Architecture
✓ High level architecture of Big Data solutions
✓ Technology ecosystem
✓ Case studies
Example of a data-driven Enterprise: A large online retailer (1)
Example of a data-driven Enterprise: A large online retailer (2)
Evolution of Big Data and its characteristics
Data volume growth
Big Data Challenge
⚫ In the past, the most difficult problem for businesses was how to store
all the data.
⚫ The challenge now is no longer to store large amounts of information,
but to understand and analyze this data.
⚫ By making sense of this data through sophisticated analytics, and by
presenting the key findings in an easily discernible fashion, we can
derive value from Big Data.
⚫ Big Data creates a new digital divide: those who can process it and those
who cannot.
⚫ Data is only as useful as the decisions it enables
Changes in Data used in Web applications
• Web applications generate a lot of temporary data that does not really belong in the main
structured data store, e.g.:
✓ shopping carts
✓ retained searches
✓ site personalization
✓ incomplete user questionnaires
• Data sets contain large quantities of unstructured data in the form of text, images, videos, etc.
• Large-object column types in an RDBMS (BLOB, CLOB) cannot handle this properly.
• Local data transactions do not have to be very durable, e.g. "liking" items on a website.
• Queries need to run against data that does not involve simple hierarchical relations,
e.g. "all people in a social network who have not purchased a book this year".
Data classification
Data usage pattern
Structured Data
Semi-Structured Data
Unstructured Data (1)
Unstructured Data (2)
Define Big Data – 3Vs and beyond
Some make it 4Vs
Big Data – More Vs
Velocity – Data streams
Isn’t a traditional RDBMS good enough?
Example Web Analytics Application
• Design an application to monitor the page hits for a portal
• Every time a user visits a portal page in a browser, the server side keeps track of that visit
• The server maintains a simple database table that holds information about each page hit
• If the user visits the same page again, the page hit count is increased by one
• This information is used to analyze which pages are popular among the users
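The page-hit table described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the course's reference design: the table name page_hits, its columns, and the helper function names are all assumptions made for the example.

```python
import sqlite3


def open_hits_db(path=":memory:"):
    """Create the (hypothetical) page_hits table if needed and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_hits ("
        " user_id TEXT NOT NULL,"
        " page_url TEXT NOT NULL,"
        " hits INTEGER NOT NULL,"
        " PRIMARY KEY (user_id, page_url))"
    )
    return conn


def record_hit(conn, user_id, page_url):
    """Increase the counter for (user, page) by one, creating the row on the first visit."""
    cur = conn.execute(
        "UPDATE page_hits SET hits = hits + 1 WHERE user_id = ? AND page_url = ?",
        (user_id, page_url),
    )
    if cur.rowcount == 0:  # no existing row: this is the first visit
        conn.execute(
            "INSERT INTO page_hits (user_id, page_url, hits) VALUES (?, ?, 1)",
            (user_id, page_url),
        )
    conn.commit()


def popular_pages(conn, limit=3):
    """Return (page_url, total_hits) pairs, most visited first."""
    return conn.execute(
        "SELECT page_url, SUM(hits) AS total FROM page_hits"
        " GROUP BY page_url ORDER BY total DESC, page_url ASC LIMIT ?",
        (limit,),
    ).fetchall()
```

Note that every page view costs one synchronous database write and commit. That cost is what becomes the bottleneck once the portal gets popular.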
RDBMS is optimized for space, not for speed
Fixed Table - No flexibility
Issues with RDBMS (1)
• Not all Big Data use cases need strong ACID semantics,
especially Systems of Engagement
✓ Strong ACID guarantees become a bottleneck with many replicas and many
attributes; such systems need to optimize for fast writes and reads
with few updates
[Figure: writes and reads against replicas of shards in a social site DB]
• A fixed schema is not sufficient: as an application becomes
popular, more attributes need to be captured and DB
modelling becomes an issue.
✓ Which attributes are used depends on the use case.
RDBMS Bottleneck in scalability
Road Blocks to RDBMS Scaling
Provocative statement:
The relational database will be a footnote in history, because of fundamental flaws in the
RDBMS approach to managing relational data of the present era
Scaling of RDBMS
• An RDBMS can scale up (vertical scaling) on a bigger server and enjoys the free performance lunch
provided by Moore's law
• When the capacity of a single server is reached, the solution is to scale out and distribute the load
across multiple servers
• RDBMSs were designed for single-CPU systems, which makes scaling out a bottleneck
• This is when the complexity of relational databases starts to rub against their potential to scale
• Practitioners began to look at multi-node database solutions, known as ‘scaling out’ or ‘horizontal scaling’
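A minimal sketch of what "distributing the load across multiple servers" can look like, assuming hash-based routing on the page URL. The servers are stand-ins (plain dictionaries), and shard_for and ShardedCounter are hypothetical names, not part of any database product.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard index using a stable hash (Python's hash() is salted per run)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedCounter:
    """Page-hit counters spread over several in-memory 'servers' (one dict per shard)."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def record_hit(self, page_url):
        # A write touches exactly one shard, so writes spread across servers.
        shard = self.shards[shard_for(page_url, len(self.shards))]
        shard[page_url] = shard.get(page_url, 0) + 1

    def hits(self, page_url):
        # A single-key read also touches exactly one shard.
        return self.shards[shard_for(page_url, len(self.shards))].get(page_url, 0)

    def total_hits(self):
        # A global aggregate has to visit every shard (scatter-gather).
        return sum(sum(shard.values()) for shard in self.shards)
```

The sketch also hints at where the complexity comes from: queries that span keys must contact all shards, and changing num_shards changes where most keys live.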
Scaling with intermediate layer
Using a queue
• The portal is very popular, with a lot of users visiting it
✓ Many users visit the pages of the portal concurrently
✓ Every time a page is visited, the database needs to be updated to keep track of the visit
✓ A database write is a heavy operation
✓ Database writes are now a bottleneck!
• Solution
✓ Use an intermediate queue between the web server and the database
✓ The queue holds the messages
✓ Messages will not be lost
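The queue idea above can be sketched as follows. This is an illustration under simplifying assumptions: HitStore is a hypothetical stand-in for the database, and Python's in-process queue.Queue stands in for the message queue. An in-process queue is not durable, so the "messages will not be lost" property would need a persistent queue in a real deployment.

```python
import queue
from collections import Counter


class HitStore:
    """Stand-in for the database: keeps hit counts and counts how many writes were issued."""

    def __init__(self):
        self.hits = Counter()
        self.write_calls = 0

    def write_batch(self, pages):
        # One call models one (expensive) database write covering many page hits.
        self.hits.update(pages)
        self.write_calls += 1


def drain(hit_queue, store, batch_size):
    """Move queued page-hit messages into the store, issuing one write per batch."""
    batch = []
    while True:
        try:
            batch.append(hit_queue.get_nowait())
        except queue.Empty:
            break
        if len(batch) == batch_size:
            store.write_batch(batch)
            batch = []
    if batch:  # flush the final, partially filled batch
        store.write_batch(batch)
```

With 250 queued hits and batch_size=100, the store receives 3 writes instead of 250, while the web server only pays the cost of a cheap enqueue per page view.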
Scaling out RDBMS – Master/Slave
Scaling out RDBMS – Sharding
Scaling with Database Partitions (Sharding)
Difference between Sharding and Partitioning
Issues with RDBMS sharding
Desirable Characteristics of Big Data Systems (1)
Desirable Characteristics of Big Data Systems (2)
Desirable Characteristics of Big Data Systems (3)
Challenges in Big Data Systems (1)
• Latency issues in algorithms and data storage working with large data sets
• Basic design considerations of Distributed and Parallel systems - reliability,
availability, consistency
• What data to keep and for how long - depends on analysis use case
• Cleaning / Curation of data
• Overall orchestration involving large volumes of data
• Choose the right technologies from many options, including open source, to build
the Big Data System for the use cases
Needs of Big Data Systems
Vertical Scalability (Scaling Up)
• Scaling up means increasing the given system’s resources and, with them, its analytics, reporting and
visualization capabilities
• Problems of greater complexity can be solved by scaling up
• For example, suppose x TB of data takes time t to process and the code then grows in complexity by a
factor n. Scaling up means that processing the same x TB takes no more than n×t, and ideally
much less (with t = 1 hour and n = 3, at most 3 hours).
• Server changes
▪ More powerful CPU
▪ More memory
▪ Product companies
▪ Enjoys Free Performance Lunch facilitated by Moore’s law
▪ Exploited by RDBMS aka SQL Databases by new releases
Horizontal Scalability (Scaling Out)
Elastic Scaling – Cloud computing
Cloud computing
• on-demand service
• resource pooling
• scalability
• accountability
• broad network access
Types of Big Data solutions
1. Batch processing of big data sources at rest
✓ Building ML models, statistical aggregates
✓ “What percentage of users in the US last year watched shows starring Kevin Spacey and
completed a season within 4 weeks or a movie within 4 hours?”
✓ “Predict the number of US family users aged 30-40 who will buy a Kellogg's cereal if
they purchase milk”
Big data architecture style
• Designed to handle the ingestion, processing, and analysis of data that is too large or complex for
traditional database systems.
Lambda Architecture
• Lambda architecture is a way of processing massive quantities of data (i.e. “Big Data”) that provides
access to both batch-processing and stream-processing methods in a hybrid approach.
• Lambda architecture is used to solve the problem of computing arbitrary functions in real time.
• The lambda architecture is composed of 3 layers:
1. Batch Layer
2. Serving Layer
3. Speed Layer (Stream Layer)
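A toy sketch of how the three layers can cooperate, following the usual description of the pattern: the batch layer recomputes a view from the full event log, the speed layer counts events that arrived since the last batch run, and the serving layer merges both to answer a query. The function and class names, and the event format (a dict with a "page" key), are assumptions made for the example.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute page-hit totals from the full, immutable event log."""
    return Counter(event["page"] for event in master_dataset)


class SpeedLayer:
    """Speed layer: incrementally count events that arrived after the last batch run."""

    def __init__(self):
        self.realtime_view = Counter()

    def on_event(self, event):
        self.realtime_view[event["page"]] += 1


def serve(page, batch, realtime):
    """Serving layer: answer a query by merging the batch view with the real-time view."""
    return batch[page] + realtime[page]
```

For example, if the event log holds two hits on "/home" and one more hit arrives through the speed layer, serve("/home", ...) reports 3 without waiting for the next batch run.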
Big Data Systems Components (1)
1. Data sources
✓ One or more data sources like
databases, docs, files, IoT devices,
images, video etc.
2. Data Storage
✓ Data for batch processing operations
is typically stored in a distributed file
store that can hold high volumes of
large files in various formats.
✓ Data can also be stored in key-value
stores.
3. Batch processing
✓ Process data files using long-running
parallel batch jobs to filter, sort,
aggregate or prepare the data for
analysis.
✓ Usually these jobs involve reading source
files, processing them, and writing the
output to new files.
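The batch-processing pattern above (read source files, process them in parallel, write the output to a new file) can be sketched on a single machine. This is only an illustration: the file names and the one-URL-per-line format are hypothetical, and a thread pool stands in for the cluster of nodes that a real batch job would run on.

```python
import tempfile
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def count_file(path):
    """Process one source file: count page hits (one page URL per line)."""
    with open(path, encoding="utf-8") as f:
        return Counter(line.strip() for line in f if line.strip())


def run_batch_job(input_paths, output_path):
    """Count the files in parallel, merge the partial counts, write a new output file."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(count_file, input_paths))
    totals = Counter()
    for partial in partials:
        totals.update(partial)
    # Sort by descending hit count, then by URL, so the output is deterministic.
    ordered = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    with open(output_path, "w", encoding="utf-8") as out:
        for page, hits in ordered:
            out.write(f"{page}\t{hits}\n")
    return ordered


def demo():
    """Run the job on two small, made-up log files in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        part1 = Path(tmp) / "hits-1.log"
        part2 = Path(tmp) / "hits-2.log"
        part1.write_text("/home\n/cart\n/home\n", encoding="utf-8")
        part2.write_text("/home\n/search\n", encoding="utf-8")
        out = Path(tmp) / "page-counts.tsv"
        run_batch_job([part1, part2], out)
        return out.read_text(encoding="utf-8")
```

Each input file is processed independently and only the small partial counts are merged, which is the same idea the map-reduce model applies across many nodes.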
Big Data Systems Components (3)
5. Stream processing
Real-time, in-memory filtering, aggregating or preparing of the data for further
analysis, e.g. fraud detection logic. These are mainly in-memory systems. The
processed stream data is then written to an output sink: files, a database, or an
API integration.
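A minimal sketch of in-memory stream processing with a sliding window, using fraud detection as the example. The rule (more than max_txns transactions per card within window_seconds), the class name and the event format are all assumptions for illustration; the sink here is just a Python list.

```python
from collections import defaultdict, deque


class FraudDetector:
    """Flag a card that makes more than max_txns transactions within window_seconds."""

    def __init__(self, window_seconds=60, max_txns=3):
        self.window_seconds = window_seconds
        self.max_txns = max_txns
        self.recent = defaultdict(deque)  # card_id -> timestamps still inside the window

    def process(self, card_id, timestamp):
        window = self.recent[card_id]
        window.append(timestamp)
        # Evict timestamps that have fallen out of the sliding window.
        while window and window[0] <= timestamp - self.window_seconds:
            window.popleft()
        return len(window) > self.max_txns


def run_stream(events, detector, sink):
    """Feed (card_id, timestamp) events through the detector and write alerts to the sink."""
    for card_id, timestamp in events:
        if detector.process(card_id, timestamp):
            sink.append((card_id, timestamp))
```

Only the timestamps inside the current window are kept in memory, so state stays small even when the stream itself is unbounded.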
Technology Ecosystem (showing mostly Apache projects)
• ETL / data ingest: Flume, Sqoop
• NoSQL (key-value stores): HBase, MongoDB
• Search (indexed data): Solr
• Data warehouse (DWH): Hive
• Scripting: Pig
• In-memory processing: Spark
• Machine learning: Spark MLlib
• Coordination (keeps metadata for distributed frameworks): Zookeeper
• Scheduler (manages workflows): Oozie
• Resource management and basic map-reduce: YARN for Hadoop nodes*
• Storage: HDFS
[Other figure annotations: in-memory processing of streaming data; complex processing; manage map-reduce]
* Nodes run map-reduce jobs (more on this later)
** We’ll cover all technologies in detail
Case Study: IT Ops
Using Big Data tools and architecture for managing IT
IT Operations Analytics
IT Operations Analytics
Big Data Platform
• ETL: Kafka, Logstash
• Stream processing: Spark Streaming
• In-memory processing: Spark
• Machine learning: Spark MLlib
• NoSQL: Cassandra
• Search: Solr
• Data warehouse: Hive
• Scripting: Pig
• Coordination: Zookeeper
• Scheduler: Custom
• Resource management and basic map-reduce: YARN for Hadoop nodes
• Storage: HDFS
[Figure annotations: search on long-term logs; search on short-term logs; store metrics as key-value pairs;
integrate with metric and log sources; build models on metrics; detect metric anomalies;
modelling logic for metric dependencies and normal ranges]
Where will you apply this architecture style
Big data architecture benefits
• Technology choices
✓ A variety of technology options is available, both open source and from vendors
• Performance through parallelism
✓ Big data solutions take advantage of data or task parallelism, enabling high-performance solutions that
scale to large volumes of data.
• Elastic scale
✓ All of the components in the big data architecture support scale-out provisioning, so that you can adjust your
solution to small or large workloads and pay only for the resources that you use.
• Flexibility with consistency semantics (more under the CAP theorem)
✓ E.g. Cassandra or MongoDB can allow inconsistent (stale) reads in exchange for better scalability and fault tolerance
• Good cost performance ratio
✓ Ability to reduce cost at the expense of performance. E.g. long term data storage in commodity HDFS
nodes.
• Interoperability with existing solutions
✓ The components of the big data architecture are also used for IoT processing and enterprise BI solutions,
enabling you to create an integrated solution across data workloads. e.g. Hadoop can work with data in
Amazon S3.
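The consistency trade-off listed among the benefits above can be illustrated with a toy quorum model. With N replicas, a write acknowledged by W replicas and a read that consults R replicas are guaranteed to overlap when R + W > N; with smaller values a read can return stale data. This is a simplified simulation with made-up names, not the API of Cassandra or MongoDB. It deliberately picks the worst case, where writes reach the first W replicas and reads consult the last R (with R at least 1).

```python
class Replica:
    """One copy of a single data item, tagged with a version number."""

    def __init__(self):
        self.value = None
        self.version = 0


def write(replicas, w, value, version):
    """Apply the write to the first w replicas only; the others lag behind."""
    for replica in replicas[:w]:
        replica.value = value
        replica.version = version


def read(replicas, r):
    """Consult the last r replicas (r >= 1) and return the value with the highest version."""
    newest = max(replicas[-r:], key=lambda replica: replica.version)
    return newest.value
```

With N = 3, writing "v2" at W = 2 and reading at R = 2 returns "v2" because the two sets share a replica. Reading at R = 1 can still return the older value, which is the price paid for a faster, more available read.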
Big data architecture challenges
• Complexity
✓ Big data solutions can be extremely complex, with numerous components to handle data
ingestion from multiple data sources. It can be challenging to build, test, and troubleshoot big
data processes.
• Skillset
✓ Many big data technologies are highly specialized, and use frameworks and languages that
are not typical of more general application architectures. On the other hand, big data
technologies are evolving new APIs that build on more established languages.
• Technology maturity
✓ Many of the technologies used in big data are evolving. While core Hadoop technologies
such as Hive and Pig have stabilized, emerging technologies such as Spark introduce
extensive changes and enhancements with each new release.
Summary
Next Session:
Locality of Reference (LOR)