Capstone Project

The document discusses the use of Big Data analytics and recommendation systems in retail to optimize product sales and enhance decision-making. It outlines the architecture of Data Warehouses, differentiating between operational and informational data, and describes the multi-dimensional data model used for sales analysis. Additionally, it covers various OLAP technologies, including ROLAP, MOLAP, and HOLAP, and their applications in different business scenarios.

Uploaded by

Sayandeep Mondal
Market Data Analytics and recommended system for retail vertical using Big Data Analysis
Atanu Manna, Rupsa Jana, Manirul Mallick, Sayandeep Mandal

Product Sales Analytics, Sales Prediction and Optimization for e-commerce business environment using Data Warehouse Architecture
Swagata Nanda, Shadan Alam, Nandana Mukherjee, MD Rahit Azim

There are so many choices that people often feel trapped,
whether they're trying to choose a movie to watch, the
right product to buy, or new music to listen to. Recommendation
systems address this problem: they help people find their way
through all of these choices by giving them personalized
suggestions based on their likes and dislikes.

Predicting product sales involves analyzing historical sales data
and considering market trends, competitor activity, and potential
changes in consumer behavior to estimate future sales volume.
Depending on the product and the available data, this can be done
with forecasting methods such as moving averages, linear regression,
or opportunity stage forecasting.
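
As a rough illustration of two of the methods named above, the sketch below forecasts the next period with a simple moving average and with a least-squares linear trend. The function names, the window size, and the monthly figures are all hypothetical and are not taken from the project's data.

```python
from statistics import mean


def moving_average_forecast(sales, window=3):
    """Forecast the next period as the mean of the last `window` observations."""
    if len(sales) < window:
        raise ValueError("need at least `window` observations")
    return mean(sales[-window:])


def linear_trend_forecast(sales):
    """Fit sales = a + b*t by ordinary least squares and extrapolate one step."""
    n = len(sales)
    t = list(range(n))
    t_bar, y_bar = mean(t), mean(sales)
    slope = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, sales)) / sum(
        (ti - t_bar) ** 2 for ti in t
    )
    intercept = y_bar - slope * t_bar
    return intercept + slope * n  # value predicted for the next time index


# Hypothetical monthly unit sales, used only to exercise the two functions.
monthly_units = [100, 110, 120, 130, 140, 150]
ma_forecast = moving_average_forecast(monthly_units)
trend_forecast = linear_trend_forecast(monthly_units)
print(ma_forecast, trend_forecast)
```

On steadily growing data the moving average lags behind the trend line, which is one reason the choice of method depends on the product and the data.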
Data Warehouse
The Data Warehouse is an architecture, not a technology.
Data Warehouse Architecture
A Data Warehouse Architecture (DWA) is a way of representing the
overall structure of data, communication, processing and presentation
that exists for end-user computing within the enterprise. The
architecture is made up of a number of interconnected parts:

➢ Operational Database / External Database Layer
➢ Information Access Layer
➢ Data Access Layer
➢ Data Directory (Metadata) Layer
➢ Process Management Layer
➢ Data Warehouse Layer

Data warehousing concepts
Operational / informational data:
❑ Operational data is the data you use to run your business. This data is what is
typically stored, retrieved, and updated by your Online Transactional
Processing (OLTP) system. An OLTP system may be, for example, a
reservations system, an accounting application, or an order entry application.
❑ Informational data is created from the wealth of operational data that exists in
your business and some external data useful to analyze your business.
Informational data is what makes up a data warehouse. Informational data is
typically:
➢ Summarized operational data
➢ De-normalized and replicated data
➢ Infrequently updated from the operational systems
➢ Optimized for decision support applications
➢ Possibly "read only" (no updates allowed)
➢ Stored on separate systems to lessen impact on operational systems

Operational vs. Informational Data

Operational                               Informational
----------------------------------------  ----------------------------------------
Primarily primitive, highly detailed      Primarily derived, somewhat summarized
Current; guaranteed accurate now          Historical; accuracy maintained over time
Constantly updated                        Infrequently updated
Minimal redundancy                        Managed redundancy
Static structure, dynamic content         Dynamic structure, static content
Referential integrity                     Historical integrity
Supports day-to-day business functions    Supports long-term informational requirements
The Complete DWH System
[Figure: three-tier data warehouse system]
• Information Sources: operational DBs, semistructured sources
• Extract, transform, load, refresh into the Data Warehouse Server (Tier 1): data warehouse and data marts
• The warehouse serves the OLAP Servers (Tier 2), e.g., MOLAP, ROLAP
• The OLAP servers serve the Clients (Tier 3): OLAP, Query/Reporting, Data Mining, etc.
Three-Tier DWH System
• Warehouse database server
– Almost always a relational DBMS, Analytical DBMS, rarely
flat files
• OLAP servers
– Relational OLAP (ROLAP): extended relational DBMS that
maps operations on multidimensional data to standard
relational operators
– Multidimensional OLAP (MOLAP): special-purpose server
that directly implements multidimensional data and operations
• Clients
– Query and reporting tools
– Analysis tools
– Data mining tools
Sample Unified BI Architecture: Investment Banking Domain
[Figure: architecture diagram. Labels recoverable from the slide, roughly in flow order:]
• Source Layer: heterogeneous operational systems (flat files, Oracle, SQL Server), banking feeds, and reference master data, delivered as data feeds through a common interface and FTP
• Staging Area (source image) on the ETL server: extraction, loading and transformation
• DWH Database Server: normalized schema (atomic level data), star schema with Model 1 to Model 5 (including Risk and Profitability), aggregated data
• OLAP Server (optional): cube
• Report build and publish layer (bank-specific tool implementation): operational MIS and ad hoc report builder, customized report publish application
• Presentation Layer: MIS/OP reports, analytics, web portal, web server
• Security Layer and data validation components

Notes from the slide:
1, 2, 3: These are various paths in the report generation process. If any of these paths is not needed for a bank, then the components on that path are not needed in the setup.
A, B: Data validation. Option A: sample data from the staging area. Option B: sample data from a report for validation. These are compared to check the validity of the data, and the result is stored in a log file.
Sample: Data Marts Blocks
Flat File Systems - 2 Dimensional (character positions × records)
Relational Data - 2 Dimensional (columns × rows)

Columns                    Row 1     Row 2     Row 3     Row 4     Row 5
-------------------------  --------  --------  --------  --------  --------
Customer ID                AZ12345   AZ12345   AZ12345   AZ12345   AZ12345
Cust Record Change Date    01/15/94  05/01/95  06/01/95  03/15/96  04/01/96
Customer Record End Date   04/30/95  05/31/95  03/14/96  03/31/96  null
Customer Status            Active    Suspend   Active    Active    Active
Customer Status Date       01/15/94  05/01/95  06/01/95  06/01/95  06/01/95
Customer Address State     UT        UT        UT        UT        UT
Customer Zip Code          84094     84094     84094     84094     84094
Customer Type              Corp      Corp      Corp      Corp      Corp
Discount Plan              A         A         A         A         B
Account Manager            Jones     Jones     Jones     Smith     Smith
The Multi-Dimensional Data Model
“Sales by product line over the past six months”
“Sales and Quantity by store between 1990 and 1995”

Fact table for measures:
  Prod Id | Time Id | Store Id | Sales | Qty
  • Prod Id, Time Id, Store Id: key columns joining the fact table to the dimension tables
  • Sales, Qty: numerical measures
Dimension tables: Product Info, Time Info, Store Info, ...
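
A minimal sketch of how such a fact table and its dimension tables might be queried, using Python's built-in sqlite3 module. The table names, column names, and every data value here are hypothetical; the slide only names the key and measure columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables (hypothetical attributes)
    CREATE TABLE product  (prod_id  TEXT PRIMARY KEY, product_line TEXT);
    CREATE TABLE store    (store_id TEXT PRIMARY KEY, city TEXT);
    CREATE TABLE time_dim (time_id  INTEGER PRIMARY KEY, year INTEGER);

    -- Fact table: three key columns plus two numerical measures
    CREATE TABLE sales_fact (
        prod_id  TEXT    REFERENCES product(prod_id),
        time_id  INTEGER REFERENCES time_dim(time_id),
        store_id TEXT    REFERENCES store(store_id),
        sales    REAL,
        qty      INTEGER
    );

    INSERT INTO product  VALUES ('p1', 'Video'), ('p2', 'Audio');
    INSERT INTO store    VALUES ('s1', 'Kolkata'), ('s2', 'Salt Lake');
    INSERT INTO time_dim VALUES (1, 1990), (2, 1996);
    INSERT INTO sales_fact VALUES
        ('p1', 1, 's1', 100.0, 2),
        ('p2', 1, 's1',  40.0, 1),
        ('p1', 1, 's2',  50.0, 1),
        ('p1', 2, 's2',  70.0, 3);
    """
)

# "Sales and Quantity by store between 1990 and 1995"
rows = conn.execute(
    """
    SELECT s.store_id, SUM(f.sales), SUM(f.qty)
    FROM sales_fact AS f
    JOIN store    AS s ON s.store_id = f.store_id
    JOIN time_dim AS t ON t.time_id  = f.time_id
    WHERE t.year BETWEEN 1990 AND 1995
    GROUP BY s.store_id
    ORDER BY s.store_id
    """
).fetchall()
print(rows)
```

The query touches the fact table once and joins out to only the dimensions it filters or groups by, which is the usual shape of a star-schema query.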
MOLAP: Dimensional Modeling Using the Multi Dimensional Model
• MDDB: a special-purpose data model
• Facts stored in multi-dimensional arrays
• Dimensions used to index array
• Sometimes on top of relational DB
The MOLAP Cube
Fact table view:

sale   prodId  storeId  amt
       p1      s1       12
       p2      s1       11
       p1      s3       50
       p2      s2        8

Multi-dimensional cube (- = empty cell):

       s1   s2   s3
p1     12    -   50
p2     11    8    -

dimensions = 2
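
The step from the fact table view to the cube view can be sketched as follows: the two dimensions index a 2-D array and the measure fills the cells. This is only an illustration in plain Python lists; the variable names are assumptions and no particular MOLAP product stores data this way.

```python
# Fact table rows: (prodId, storeId, amt), as in the slide.
facts = [("p1", "s1", 12), ("p2", "s1", 11), ("p1", "s3", 50), ("p2", "s2", 8)]

# Each dimension becomes an axis; the sorted member lists give the array indexes.
products = sorted({prod for prod, _, _ in facts})
stores = sorted({store for _, store, _ in facts})

# None marks an empty cell (no sale recorded for that product/store pair).
cube = [[None] * len(stores) for _ in products]
for prod, store, amt in facts:
    cube[products.index(prod)][stores.index(store)] = amt

for prod, row in zip(products, cube):
    print(prod, row)
```

Two of the six cells stay empty, which hints at why sparse cubes can waste storage.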
3-D Cube
Fact table view:

sale   prodId  storeId  date  amt
       p1      s1       1     12
       p2      s1       1     11
       p1      s3       1     50
       p2      s2       1      8
       p1      s1       2     44
       p1      s2       2      4

Multi-dimensional cube (- = empty cell):

day 1  s1   s2   s3
p1     12    -   50
p2     11    8    -

day 2  s1   s2   s3
p1     44    4    -
p2      -    -    -

dimensions = 3
Multi-Dimensional Cube Visualization
[Figure: sales cube with axes Product (Juice, Milk, Coke, Cream, Soap, Bread), Store (New Town, Salt Lake, JU) and Time (M T W Th F S S)]
Dimensions: Time, Product, Store
Attributes: Product (upc, price, …), Store …
Hierarchies:
  Product → Brand → …
  Day → Week → Quarter
  Store → Region → Country
Roll-ups shown: roll-up to region, roll-up to brand, roll-up to week
Example cell: 56 units of bread sold in JU/Kolkata on M (Monday)
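Cube Aggregation: Roll-up
Example: computing sums

Starting cube (- = empty cell):

day 1  s1   s2   s3
p1     12    -   50
p2     11    8    -

day 2  s1   s2   s3
p1     44    4    -
p2      -    -    -

Roll-up over days:

       s1   s2   s3
p1     56    4   50
p2     11    8    -

Roll-up over days and products:

       s1   s2   s3
sum    67   12   50

Roll-up over days and stores:

       sum
p1     110
p2      19

Roll-up over all dimensions: 129

Roll-up moves from detail toward these sums; drill-down moves back toward the detail.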
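
The sums in this roll-up example can be reproduced with a small helper that keeps some dimensions and sums the measure over the rest. The helper name `roll_up` and its interface are assumptions made for this sketch.

```python
from collections import defaultdict

# (prodId, storeId, date, amt) rows from the 3-D fact table.
facts = [
    ("p1", "s1", 1, 12),
    ("p2", "s1", 1, 11),
    ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),
    ("p1", "s1", 2, 44),
    ("p1", "s2", 2, 4),
]
COLUMN = {"prodId": 0, "storeId": 1, "date": 2}


def roll_up(rows, keep):
    """Sum amt over every dimension that is not listed in `keep`."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[COLUMN[dim]] for dim in keep)
        totals[key] += row[3]
    return dict(totals)


by_product_store = roll_up(facts, ["prodId", "storeId"])  # days rolled up
by_store = roll_up(facts, ["storeId"])                    # days and products rolled up
by_product = roll_up(facts, ["prodId"])                   # days and stores rolled up
grand_total = roll_up(facts, [])                          # everything rolled up
print(by_store, by_product, grand_total)
```

Drill-down is the reverse direction: calling `roll_up` with more dimensions in `keep` returns the finer-grained cells again.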
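Cube Operators for Roll-up
A cell of the rolled-up cube is addressed as sale(store, product, day), where * means "summed over this dimension". On the example cube:

sale(s1,*,*) = 67    (store s1, all products, all days)
sale(s2,p2,*) = 8    (store s2, product p2, all days)
sale(*,*,*) = 129    (grand total)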
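
One way to read this operator notation is as a function that scans the fact table and treats * as a wildcard. The sketch below assumes the argument order (store, product, day), which is what the values on the slide imply; the function itself is hypothetical.

```python
# (prodId, storeId, date, amt) rows from the 3-D fact table.
FACTS = [
    ("p1", "s1", 1, 12),
    ("p2", "s1", 1, 11),
    ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),
    ("p1", "s1", 2, 44),
    ("p1", "s2", 2, 4),
]


def sale(store="*", product="*", day="*"):
    """Sum amt over all facts matching the arguments; "*" matches anything."""
    total = 0
    for prod, st, d, amt in FACTS:
        if store in ("*", st) and product in ("*", prod) and day in ("*", d):
            total += amt
    return total


print(sale("s1", "*", "*"), sale("s2", "p2", "*"), sale("*", "*", "*"))
```

This version computes each aggregate on demand by scanning the detail rows, which is closer to how a relational engine would answer the question than to a precomputed cube.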
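Extended Cube
The cube is extended with a * member on every dimension that holds the sums (- = empty cell):

day 1  s1   s2   s3    *
p1     12    -   50   62
p2     11    8    -   19
*      23    8   50   81

day 2  s1   s2   s3    *
p1     44    4    -   48
p2      -    -    -    -
*      44    4    -   48

*      s1   s2   s3    *
p1     56    4   50  110
p2     11    8    -   19
*      67   12   50  129

Example: sale(*,p2,*) = 19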
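
The extended cube can be precomputed in one pass by adding every fact to each of the 2^3 cells it contributes to (its own value or * on each dimension). This is a sketch of the idea only; the dictionary layout and key order (store, product, day) are assumptions.

```python
from collections import defaultdict
from itertools import product as cartesian

# (prodId, storeId, date, amt) rows from the 3-D fact table.
facts = [
    ("p1", "s1", 1, 12),
    ("p2", "s1", 1, 11),
    ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),
    ("p1", "s1", 2, 44),
    ("p1", "s2", 2, 4),
]

extended = defaultdict(int)
for prod, store, day, amt in facts:
    # Every combination of "actual member" or "*" on the three dimensions.
    for key in cartesian((store, "*"), (prod, "*"), (day, "*")):
        extended[key] += amt

print(extended[("*", "p2", "*")], extended[("*", "*", 1)], extended[("*", "*", "*")])
```

After this pass any aggregate is a single dictionary lookup, at the cost of storing many more cells than there are facts.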
Aggregates
• Add up amounts for day 1
• In SQL: SELECT sum(amt) FROM SALE
  WHERE date = 1

sale   prodId  storeId  date  amt
       p1      s1       1     12
       p2      s1       1     11
       p1      s3       1     50
       p2      s2       1      8
       p1      s1       2     44
       p1      s2       2      4

Result: 81
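
To check the day 1 total, the query can be run against an in-memory SQLite database from Python. Using SQLite and these column types is an assumption made for the sketch; the slides do not name a database engine.

```python
import sqlite3

rows = [
    ("p1", "s1", 1, 12),
    ("p2", "s1", 1, 11),
    ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),
    ("p1", "s1", 2, 44),
    ("p1", "s2", 2, 4),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)"
)
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", rows)

# The query from the slide, unchanged apart from letter case.
(day1_total,) = conn.execute("SELECT sum(amt) FROM sale WHERE date = 1").fetchone()
print(day1_total)
```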
Aggregates
• Add up amounts by day
• In SQL: SELECT date, sum(amt) FROM SALE
  GROUP BY date

sale   prodId  storeId  date  amt
       p1      s1       1     12
       p2      s1       1     11
       p1      s3       1     50
       p2      s2       1      8
       p1      s1       2     44
       p1      s2       2      4

ans    date  sum
       1     81
       2     48
Another Example
• Add up amounts by day, product
• In SQL: SELECT prodId, date, sum(amt) FROM SALE
  GROUP BY date, prodId

sale   prodId  storeId  date  amt
       p1      s1       1     12
       p2      s1       1     11
       p1      s3       1     50
       p2      s2       1      8
       p1      s1       2     44
       p1      s2       2      4

Result:

sale   prodId  date  amt
       p1      1     62
       p2      1     19
       p1      2     48

Going from the fact table to this result is a rollup; going back is a drill-down.
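
The grouped aggregates (by day, and by day and product) can be verified the same way with SQLite from Python. In this sketch prodId is included in the select list of the second query so that each output row shows which product it belongs to, matching the result table; the engine and column types are assumptions.

```python
import sqlite3

rows = [
    ("p1", "s1", 1, 12),
    ("p2", "s1", 1, 11),
    ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),
    ("p1", "s1", 2, 44),
    ("p1", "s2", 2, 4),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)"
)
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", rows)

# Amounts by day.
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

# Amounts by day and product.
by_day_product = conn.execute(
    "SELECT prodId, date, sum(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()
print(by_day, by_day_product)
```

The ORDER BY clauses are added only to make the output order deterministic.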
Aggregation Using Hierarchies
Hierarchy: store → region → country

Starting cube (- = empty cell):

day 1  s1   s2   s3
p1     12    -   50
p2     11    8    -

day 2  s1   s2   s3
p1     44    4    -
p2      -    -    -

Rolled up to region (over all days):

       region A  region B
p1     56        54
p2     11         8

(store s1 in Region A;
stores s2, s3 in Region B)
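
Rolling up along a hierarchy amounts to replacing each store by its parent region before summing. The sketch below uses the store-to-region mapping stated on the slide; the variable names are assumptions.

```python
from collections import defaultdict

# (prodId, storeId, date, amt) rows from the 3-D fact table.
facts = [
    ("p1", "s1", 1, 12),
    ("p2", "s1", 1, 11),
    ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),
    ("p1", "s1", 2, 44),
    ("p1", "s2", 2, 4),
]

# One level of the store -> region -> country hierarchy, as given on the slide.
STORE_TO_REGION = {"s1": "region A", "s2": "region B", "s3": "region B"}

by_product_region = defaultdict(int)
for prod, store, _day, amt in facts:
    by_product_region[(prod, STORE_TO_REGION[store])] += amt

print(dict(by_product_region))
```

A further roll-up to country would apply a second mapping from region to country in the same way.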
Points to be noticed about MOLAP
• Pre-calculating or pre-consolidating transactional data improves
  speed.
  BUT
  To fully pre-consolidate incoming data, multidimensional databases
  (MDDs) require an enormous amount of overhead, both in processing
  time and in storage. An input file of 200MB can easily expand to 5GB.
  MDDs are great candidates for the <50GB department data marts.
• Rolling up and drilling down through aggregate data.
• With MDDs, application design is essentially the definition of
  dimensions and calculation rules, while the RDBMS requires that
  the database schema be a star or snowflake.
Hybrid OLAP (HOLAP)
• HOLAP = Hybrid OLAP:

– Best of both worlds

– Storing detailed data in RDBMS

– Storing aggregated data in MDBMS

– User access via MOLAP tools


Data Flow in HOLAP
[Figure: data flow between three components]
• RDBMS Server: user data, meta data, derived data
• MDBMS Server: multidimensional data
• Client: Multidimensional Viewer, Relational Viewer
• Flows shown: SQL-Read, SQL-Reach Through, multidimensional access
When deciding which technology to go for, consider:

1) Performance:
• How fast will the system appear to the end-user?
• MDD server vendors believe this is a key point in their favor.

2) Data volume and scalability:
• While MDD servers can handle up to 50GB of storage, RDBMS
  servers can handle hundreds of gigabytes and terabytes.
Examples
• ROLAP
– Telecommunication startup: call data records (CDRs)
– ECommerce Site
– Credit Card Company
– Share/Capital Market
• MOLAP
– Analysis and budgeting in a financial department
– Sales analysis
• HOLAP
– Sales department of a multi-national company
– Banks and Financial Service Providers
Dimensional Model - Business View
Business Requirement
• Business Questions:
– “What kind of customers buy which of our Video
products, and where?
– Are there geographic buying patterns for particular
products?
– Are there certain customer demographics that affect
purchasing patterns?
– And do these demographics have an impact on store
location, or the other way around?”
• Metric:
Video Sales by Customer Type by Product Type by
Store Location
(over some period of Time)
Multidimensional Analysis
[Figures: a sequence of cube views of the sales data. Labels recoverable from the slides:]
• Product axis: Household, Telecomm, Video, Audio
• Sales Channel axis: Retail, Direct, Special
• Time axis: '97, '98, '99 (other views show 1995, 1996 and 1997, 1998, 1999)
• Customer Type axis: Government, Commercial, Individual
• Customer Information: Geographic Location, Customer Type, Income Bracket
• Geographic Location: Europe (EC), Asia Pacific, North America
[The views rotate and slice the cube so that different dimensions face the viewer, for example Product by Sales Channel by Time, Product by Sales Channel by Customer Type, Product by Customer Type by Sales Channel, and Product by Sales Channel by Geographic Location. One view shows Video and Audio by year for Government and Commercial customers.]
Multidimensional Analysis

                  Retail   Direct   Special
Household         $300     $200     $100
  Government      $100     $50      $10
  Commercial      $100     $75      $70
  Individual      $100     $75      $30
Telecomm          $6000    $3000    $400
  Government      $2000    $500     $100
  Commercial      $1000    $1500    $100
  Individual      $3000    $1000    $200
Video             $4444    $2222    $777
  Government      $1030    $150     $222
  Commercial      $1311    $1175    $111
  Individual      $2103    $897     $444
Audio             $50      $75      $25
  Government      $10      $25      $20
  Commercial      $10      $25      $0
  Individual      $30      $25      $5
Tools available
• ROLAP:
– ORACLE 8i
– ORACLE Reports; ORACLE Discoverer
– ORACLE Warehouse Builder
– MicroStrategy’s DSS Server
– Platinum Technology’s InfoBeacon

• MOLAP:
– ORACLE Express Server
– ORACLE Express Clients (C/S and Web)
– Arbor Software’s Essbase

• HOLAP:
– ORACLE 8i
– ORACLE Express Server
– ORACLE Relational Access Manager
– ORACLE Express Clients (C/S and Web)
Conclusion
• ROLAP: RDBMS -> star/snowflake schema
• MOLAP: MDD -> cube structures
• ROLAP or MOLAP: the data models used play a major role in performance differences
• MOLAP: for summarized and relatively smaller volumes of data (10-50GB)
• ROLAP: for detailed and larger volumes of data
• Both storage methods have strengths and weaknesses
• The choice is requirement-specific, though currently data warehouses are
  predominantly built using RDBMSs/ROLAP.
Simple Example of Parallel storage and parallel processing

• A farmer wants to harvest grapes and sell all the fruit
  in the nearby town.
• After harvesting, he stores the produce in a storage room.

Challenge: The single storage room became the bottleneck for storing and
accessing all the fruit.

So the farmer decided to distribute the storage and give each
worker a separate storage area.

Business decision (high-yield processing approach): to complete the order
on time, all of them work in parallel, each with their own storage space.

This solution helps them complete the order on time without any hassle.
This is how they continued to grow, delivering more and more fruit
baskets, and formed a large farm.

HOW IS THIS STORY RELATED TO BIG DATA?
• Nowadays we have many varieties of data that need to be processed.
• A single processor and a single storage unit cannot process such a large
  variety of on-demand data in real time.
• Processing this data on one machine is very time-consuming.
Challenges of Single Storage Unit

• A single storage unit became the bottleneck, which generated network
  overhead.
• The solution was to give each processor its own distributed storage.
• This made it easy to store and retrieve data.
• This method worked: processing data where it is stored avoids most of
  the network overhead.
• Data pulls and pushes no longer hit a bottleneck or suffer long lead times.
• This parallel processing over distributed storage is the idea behind
  MapReduce.
What is in Hadoop
1. Storage Unit of Hadoop
   - HDFS (Hadoop Distributed File System)
   - It is specially designed for storing huge data sets on commodity
     hardware.
2. Processing Unit of Hadoop
   - MapReduce is used to retrieve big data from HDFS.

The Hadoop Distributed File System (HDFS) is a distributed file system
designed to run on hardware based on open standards, or what is called
commodity hardware. This means the system is capable of running on
different operating systems (OSes) such as Windows or Linux without
requiring special drivers.

MapReduce is a programming model or pattern within the Hadoop framework
that is used to access big data stored in the Hadoop File System (HDFS).
It is a core component, integral to the functioning of the framework.
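
The MapReduce pattern can be illustrated without Hadoop by simulating its three phases in plain Python: map each input split to key/value pairs, shuffle the pairs by key, and reduce each group. This is a single-process sketch of the pattern only, not the Hadoop API; the function names and the way the sales rows are split are assumptions.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit one (storeId, amt) pair per sales record in this split."""
    return [(store, amt) for _prod, store, amt in split]


def shuffle(mapped_outputs):
    """Shuffle: group all emitted values by key across every split."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


# Two input splits, as if the sales file were stored in two blocks.
splits = [
    [("p1", "s1", 12), ("p2", "s1", 11)],
    [("p1", "s3", 50), ("p2", "s2", 8)],
]

# In a real cluster each split would be mapped on the node that stores it.
totals = reduce_phase(shuffle(map_phase(split) for split in splits))
print(totals)
```

Because each map call only sees its own split, the map phase can run in parallel on separate machines, which is the point of the farmer analogy.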
There is only one Name Node, but there can be multiple Data Nodes.
Together, these master and slave nodes typically form the HDFS cluster.

The Name Node maintains and manages the Data Nodes. It also stores the
metadata.

Data Nodes store the actual data; they handle reading, writing, and
processing, and perform replication as well.
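
The division of labor between the two node types can be sketched with a toy model: the Name Node keeps only metadata (which Data Nodes hold which block), while the Data Nodes hold the block contents and their replicas. Everything here is hypothetical and greatly simplified, including the class names, the round-robin placement, the tiny block size, and the replication factor of 2; it does not reflect the real HDFS implementation or its placement policy.

```python
class DataNode:
    """Stores actual block contents, keyed by (filename, block index)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}


class NameNode:
    """Stores metadata only: where each block of each file lives."""

    def __init__(self, data_nodes, replication=2):
        self.data_nodes = data_nodes
        self.replication = replication
        self.block_map = {}  # (filename, block index) -> list of DataNode names

    def write(self, filename, data, block_size):
        chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
        for index, chunk in enumerate(chunks):
            # Toy placement: round-robin over the Data Nodes, one copy per replica.
            targets = [
                self.data_nodes[(index + r) % len(self.data_nodes)]
                for r in range(self.replication)
            ]
            for node in targets:
                node.blocks[(filename, index)] = chunk
            self.block_map[(filename, index)] = [node.name for node in targets]


def read_file(name_node, filename):
    """Client read: ask the Name Node for locations, fetch from Data Nodes."""
    by_name = {node.name: node for node in name_node.data_nodes}
    parts, index = [], 0
    while (filename, index) in name_node.block_map:
        first_holder = by_name[name_node.block_map[(filename, index)][0]]
        parts.append(first_holder.blocks[(filename, index)])
        index += 1
    return "".join(parts)


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
name_node = NameNode(nodes)
name_node.write("sales.csv", "abcdefghij", block_size=4)
print(name_node.block_map)
print(read_file(name_node, "sales.csv"))
```

Because every block exists on two Data Nodes in this sketch, the file would remain readable if any single Data Node were lost, provided the reader fell back to the second replica.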
