Analytics Roadmap


3 Pillars of Big Data

• Exponentially growing massive data

• New tech ecosystems that provide the capacity to store and process varied structured and unstructured data

• Advanced Analytics: AI, Machine Learning, Deep Learning


%10
Business Issue Understanding

• What decisions need to be made?
• What information is needed to inform those decisions?
• What type of analysis can provide the information needed to inform those decisions?
Business Issue Understanding

ABC is a retail chain with hundreds of shops all over the country. According to its annual business plan, it has two main objectives, which focus on Sales Performance and Stock Management.
• The marketing team believes that sales performance might increase if cross-sell products are offered at discounted rates.
• At the same time, the Operations Office aims to reduce the cost of expired goods.
Business Issue Understanding

• What decisions need to be made?
  - Which products should we offer to our customers?
  - What should the discount rates be?
  - How can the company sell goods that are close to their expiration dates more effectively?
Business Issue Understanding

• What information is needed to inform those decisions?
  - Which products are sold together
  - The effect of discount rates on customer behaviour
Business Issue Understanding

• What type of analysis can provide the information needed to inform those decisions?
  - Analysis of past product sales to capture possible sales patterns
  - Linking sales patterns with product data
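
As an illustration of "which products are sold together", the sketch below counts how often each product pair appears on the same receipt. The transactions, product names, and the `pair_support` helper are hypothetical and only show the idea; a real analysis would read ABC's sales transaction data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set holds the products on one receipt.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "milk"},
    {"butter", "bread", "cereal"},
]

def pair_support(baskets):
    """Return the share of baskets that contain each product pair."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives every pair one canonical order, e.g. ("bread", "butter").
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

support = pair_support(transactions)
top_pair = max(support, key=support.get)
print(top_pair, support[top_pair])
```

Pairs with high support are candidates for a cross-sell offer; linking them to stock and expiration data is a separate step.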
Data Understanding

• What data is needed?
  - Product data
  - Sales history by products
  - Stock data
  - Expiration dates of products

• What data is available?
• What are the important characteristics of the data?
Data Preparation
• Data Integration
• Data Manipulation
• Missing Values Handling
• Feature Selection
• Feature Generation
• Dimensionality Reduction
• Outlier Removal
• Normalization
Data Integration
• Integrating the necessary data from various sources

Example:
  - Product data >> ERP Database Definitions
  - Sales history by products >> Sales Transaction Data
  - Stock data >> Stock Management Data
  - Expiration dates of products >> ERP Database
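
A minimal sketch of this integration step with pandas, assuming each source has already been exported to a table that shares a `product_id` key. All table names, column names, and values here are hypothetical.

```python
import pandas as pd

# Hypothetical extracts from each source system.
products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["milk", "bread", "yogurt"],
    "expiration_date": pd.to_datetime(["2024-01-10", "2024-01-05", "2024-01-20"]),
})
sales = pd.DataFrame({
    "product_id": [1, 1, 2, 3],
    "units_sold": [10, 5, 7, 2],
})
stock = pd.DataFrame({
    "product_id": [1, 2, 3],
    "units_in_stock": [40, 12, 30],
})

# Aggregate transactions to one row per product before joining.
sales_by_product = sales.groupby("product_id", as_index=False)["units_sold"].sum()

# Left joins keep every product even if it has no sales or stock record.
integrated = (
    products
    .merge(sales_by_product, on="product_id", how="left")
    .merge(stock, on="product_id", how="left")
)
print(integrated)
```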
• Data Manipulation
  - Restructuring data so that it is ready for model building

Examples:
  - True / False values > 1 / 0
  - Time / date handling
  - Product types > factor

Tools:
  - R packages: dplyr, reshape2, tidyr, data.table, …
  - Python: numpy, pandas
  - Other: SQL, Spark, Alteryx, Knime, Azure
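
The three example transformations can be sketched in pandas as follows. The column names and values are hypothetical, and the categorical dtype is used here as the pandas analogue of an R factor.

```python
import pandas as pd

# Hypothetical raw extract in which every column arrives as text.
raw = pd.DataFrame({
    "on_discount": ["True", "False", "True"],
    "sale_date": ["2018-03-01", "2018-03-02", "2018-03-05"],
    "product_type": ["dairy", "bakery", "dairy"],
})

clean = pd.DataFrame({
    # Map True / False text to 1 / 0.
    "on_discount": raw["on_discount"].map({"True": 1, "False": 0}),
    # Parse date strings into datetime values.
    "sale_date": pd.to_datetime(raw["sale_date"]),
    # Store the product type as a categorical column.
    "product_type": raw["product_type"].astype("category"),
})

# Once dates are parsed, calendar features are easy to derive.
clean["weekday"] = clean["sale_date"].dt.day_name()
print(clean.dtypes)
```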
• Missing Value Handling
  - Finding the missing values
  - Deciding how to treat them:
    - Delete the record
    - Fill manually
    - Fill with the mean, the median, the last value before, or the first value after
    - Fill with a model (regression, decision tree, etc.)
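
A short pandas sketch of the simpler options listed above, applied to a hypothetical series of daily sales. Model-based filling is left out because it depends on the chosen model.

```python
import pandas as pd

# Hypothetical daily sales with gaps (None becomes NaN).
sales = pd.Series([10.0, None, 14.0, None, 20.0], name="units_sold")

n_missing = int(sales.isna().sum())           # find the missing values
dropped = sales.dropna()                      # option: delete the record
mean_filled = sales.fillna(sales.mean())      # option: fill with the mean
median_filled = sales.fillna(sales.median())  # option: fill with the median
prev_filled = sales.ffill()                   # option: last value before
next_filled = sales.bfill()                   # option: first value after

print(n_missing, prev_filled.tolist(), next_filled.tolist())
```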

• Feature Selection
  - Which features should be included in the model?
  - Eliminating features with a high ratio of missing values
    - If more than 40% of the values are missing, the feature could be excluded.
  - Deselecting features that are highly correlated or represent the same phenomenon
    - Ex: Temperature: one column in degrees Celsius, another in Fahrenheit
    - Ex: Date of birth and age
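
Both rules can be sketched in pandas as below. The 40% missing-value limit comes from the slide; the 0.95 correlation cutoff, the column names, and the data are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical feature table.
df = pd.DataFrame({
    "temp_celsius": [10.0, 15.0, 20.0, 25.0, 30.0],
    "temp_fahrenheit": [50.0, 59.0, 68.0, 77.0, 86.0],
    "mostly_missing": [None, None, None, 1.0, 2.0],
    "units_sold": [3.0, 8.0, 4.0, 9.0, 5.0],
})

MAX_MISSING_RATIO = 0.40  # from the slide
MAX_ABS_CORR = 0.95       # assumed cutoff, tune per project

# Step 1: drop features whose share of missing values exceeds the limit.
missing_ratio = df.isna().mean()
kept = df.loc[:, missing_ratio <= MAX_MISSING_RATIO]

# Step 2: for each highly correlated pair, drop the later column.
corr = kept.corr().abs()
cols = list(kept.columns)
to_drop = set()
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if a not in to_drop and corr.loc[a, b] > MAX_ABS_CORR:
            to_drop.add(b)

selected = kept.drop(columns=sorted(to_drop))
print(list(selected.columns))
```

Which column of a correlated pair to keep is a judgment call; here the first one in column order survives.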
• Feature Generation
  - Deriving features from other features
  - Ex: Total Sales Per District

District   2011  2012  2013  2014  2015  2016  2017  2018
Kadıköy     540   650   800   900   910  1105  1200  1400
Optimum      NA   200   400   340   500   590   899   560
Natilus     440   420   450   465   502   520   510   560
Beşiktaş    820   890   905   910   902   900   920   880
Üsküdar      50   200   400   600   800   421   430   500
Kartal       30    50    90   150   250   320   220   150

Which districts have the highest sales on average?
Which districts' sales have changed most / least over the years?
Which districts have common characteristics according to their sales volume?
• Feature Generation
  - Features derived from the yearly total sales per district (2011-2018)

District    Mean   Std Dev   Min   Max  Range
Kadıköy    938.1    267.53   540  1400    860
Optimum    498.4    205.80   200   899    699
Natilus    483.4     44.18   420   560    140
Beşiktaş   890.9     29.08   820   920    100
Üsküdar    425.1    214.71    50   800    750
Kartal     157.5     94.44    30   320    290

Which districts have the highest sales on average? Kadıköy
Which districts' sales have changed most / least over the years? Most: Kadıköy / Least: Beşiktaş
Which districts have common characteristics according to their sales volume? > Cluster analysis
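
A sketch of how such per-district features could be derived with pandas from the yearly sales figures shown earlier. Using the population standard deviation (`ddof=0`) as the spread measure is an assumption; missing years are skipped, which is the pandas default.

```python
import pandas as pd

nan = float("nan")  # Optimum has no figure for 2011
years = list(range(2011, 2019))
sales = pd.DataFrame(
    [
        [540, 650, 800, 900, 910, 1105, 1200, 1400],
        [nan, 200, 400, 340, 500, 590, 899, 560],
        [440, 420, 450, 465, 502, 520, 510, 560],
        [820, 890, 905, 910, 902, 900, 920, 880],
        [50, 200, 400, 600, 800, 421, 430, 500],
        [30, 50, 90, 150, 250, 320, 220, 150],
    ],
    index=["Kadıköy", "Optimum", "Natilus", "Beşiktaş", "Üsküdar", "Kartal"],
    columns=years,
)

# One generated feature per column, computed across the years of each district.
features = pd.DataFrame({
    "mean": sales.mean(axis=1),
    "std": sales.std(axis=1, ddof=0),
    "min": sales.min(axis=1),
    "max": sales.max(axis=1),
})
features["range"] = features["max"] - features["min"]

print(features.round(2))
print("highest mean:", features["mean"].idxmax())
print("least variation:", features["std"].idxmin())
```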
• Dimensionality Reduction

• Reducing the number of features in a data model by grouping them or eliminating them

• Missing Value Ratio

• Low Variance Filter

• High Correlation Filter

• Decision Tree / Random Forest Importance Matrix

• Principal Component Analysis (PCA)

• Backward Feature Elimination

• Forward Feature Construction
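
Two of the listed techniques, the low variance filter and PCA, can be sketched with numpy alone. The synthetic data, the variance threshold, and the choice of two components are assumptions made for illustration.

```python
import numpy as np

# Synthetic data: one informative feature, a near copy of it,
# an independent feature, and a constant column.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
X = np.column_stack([
    x1,
    2.0 * x1 + rng.normal(scale=0.1, size=n),
    rng.normal(size=n),
    np.full(n, 5.0),
])

# Low variance filter: drop columns whose variance is (near) zero.
keep = X.var(axis=0) > 1e-8
X_kept = X[:, keep]

# PCA: centre the data, then project onto the top-k eigenvectors
# of the covariance matrix.
centered = X_kept - X_kept.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]       # largest first
k = 2
components = eigvecs[:, order[:k]]
X_reduced = centered @ components

# Share of total variance retained by the k components.
explained = eigvals[order[:k]].sum() / eigvals.sum()
print(X_kept.shape, X_reduced.shape, round(float(explained), 3))
```

In practice features are usually scaled before PCA so that no single feature dominates because of its units.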


• Outlier Removal
  - Removing observation points that lie far from the rest of the observations
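
The slide does not say how "distant" is measured. One common convention, assumed here, is the interquartile range (IQR) rule with a multiplier of 1.5; the helper name and the data are hypothetical.

```python
import statistics

def remove_outliers_iqr(values, k=1.5):
    """Keep values within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Hypothetical daily sales with one extreme day.
daily_sales = [52, 48, 50, 47, 53, 49, 51, 400]
cleaned = remove_outliers_iqr(daily_sales)
print(cleaned)
```

Whether an extreme value should be removed at all depends on the business context: a genuine sales spike may be exactly what the analysis needs to explain.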
• Normalization
  - Normalization means adjusting values measured on different scales to a notionally common scale, so that they can be compared with each other and affect the model on a similar scale.

Day  Feature1_People  Number_of_Complaints  Feature_2_Temperature
1    2200             3                     14
2    800              0                     14
3    1200             12                    15
4    4100             0                     17
5    5200             0                     14
6    220              18                    12
7    20               33                    13
…

Common Types of Normalization:
- Min-Max Normalization
- Decimal Scaling: multiplying or dividing by pow(10, k)
- Standard Deviation: [x - mean(x)] / sd(x)
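
The three types can be sketched in plain Python using the Feature1_People column from the table. The function names are hypothetical, min-max is taken as (x - min) / (max - min), and the use of the population standard deviation is an assumption; a sample standard deviation would work the same way.

```python
import statistics

people = [2200, 800, 1200, 4100, 5200, 220, 20]

def min_max(values):
    """Rescale to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def decimal_scaling(values):
    """Divide by 10**k, with k the smallest integer so that all |x| < 1."""
    max_abs = max(abs(v) for v in values)
    k = 0
    while max_abs / 10 ** k >= 1:
        k += 1
    return [v / 10 ** k for v in values]

def z_score(values):
    """Standard deviation normalization: (x - mean) / sd."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population sd, an assumption
    return [(v - mean) / sd for v in values]

mm = min_max(people)
ds = decimal_scaling(people)
zs = z_score(people)
print([round(v, 3) for v in mm])
print(ds)
print([round(v, 3) for v in zs])
```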


Modeling

• Determine the methodology
• Determine the important factors or variables
• Build a model
• Run the model

"…various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed." - Wikipedia
