Unit 2 Notes

The document discusses the data analytics lifecycle and key phases in approaching analytics problems. It covers 6 phases: business understanding, analytic understanding, data requirements, data collection, data understanding, and data preparation. It also discusses key roles in successful analytics projects including business users, project sponsors, project managers, business intelligence analysts, database administrators, data engineers, and data scientists. Finally, it covers the discovery phase including learning the business domain, identifying resources, framing the problem, and identifying stakeholders.

Uploaded by

ramya ravindran
Copyright: © All Rights Reserved

18CSE396T – DATA SCIENCE

Unit II : Session 1 : SLO-1

SRM Institute of Science and Technology 1


APPROACHING ANALYTICS PROBLEMS
• The Data Analytics Lifecycle is designed specifically for Big Data problems and data science projects.
• The lifecycle has six phases, and project work can occur in several phases at once.
• For most phases in the lifecycle, movement can be either forward or backward.

Business Understanding:
• Before any problem in the business domain can be solved, it needs to be understood properly.
• Business understanding forms a concrete base, which in turn makes it easier to resolve queries.


Analytic Understanding:
The approaches can be of 4 types:
• Descriptive approach (the current status and the information provided).
• Diagnostic approach (a.k.a. statistical analysis: what is happening and why it is happening).
• Predictive approach (forecasts trends or the probability of future events).
• Prescriptive approach (how the problem should actually be solved).


Data Requirements:
• The analytical method chosen above indicates the necessary data content, formats, and sources to be gathered.
• While defining the data requirements, one should find answers to questions such as 'what', 'where', 'when', 'why', 'how', and 'who'.


Data Collection:
• Collected data can arrive in any format, so it should be validated against the approach chosen and the output to be obtained.


Data Understanding:
• Data understanding answers the question "Is the data collected representative of the problem to be solved?"
• Descriptive statistics are calculated over the data to assess its content and quality.
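As a rough illustration of the descriptive statistics mentioned under Data Understanding, the sketch below summarizes one numeric column using only Python's standard library. The function name `describe`, the sample values, and the use of `None` to mark a missing entry are assumptions made for this example, not part of the original notes.

```python
import statistics

def describe(values):
    """Summarize a numeric column; None marks a missing entry."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),
        "min": min(present),
        "max": max(present),
    }

# Hypothetical sample: monthly order counts with one missing entry.
orders = [12, 15, None, 14, 90, 13]
summary = describe(orders)
print(summary)
```

In this sample the mean sits far above the median, which hints that the value 90 may be an outlier worth checking before the data is treated as representative.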


Data Preparation:
• This whole process includes transformation, normalization, etc.

Modelling:
• Modelling decides whether the data prepared for processing is appropriate or requires more finishing and seasoning.
• This phase focuses on building predictive/descriptive models.


Evaluation:
• Model evaluation is done during model development. It assesses the quality of the model and checks whether it meets the business requirements.

Deployment:
• The deployment phase checks how well the model holds up in the external environment and whether it performs better than the alternatives.
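The Evaluation step above can be sketched as a simple check of model quality against a business requirement. Accuracy is used here only as one possible quality measure; the labels, the predictions, and the 0.75 threshold are hypothetical values chosen for illustration.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the observed labels."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

def meets_requirement(predicted, actual, required_accuracy):
    """True when the model's accuracy reaches the business threshold."""
    return accuracy(predicted, actual) >= required_accuracy

# Hypothetical held-out labels and model predictions.
actual = ["churn", "stay", "stay", "churn", "stay"]
predicted = ["churn", "stay", "churn", "churn", "stay"]
print(accuracy(predicted, actual), meets_requirement(predicted, actual, 0.75))
```

In practice the measure and the threshold would come from the success criteria agreed with the project sponsor rather than from the model builder alone.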


Feedback:
• Feedback is necessary for refining the model and assessing its performance and impact.


KEY ROLES FOR A SUCCESSFUL ANALYTICS PROJECT
• Business User
• Project Sponsor
• Project Manager
• Business Intelligence Analyst
• Database Administrator (DBA)
• Data Engineer
• Data Scientist


Business User:
• Someone who understands the domain area and usually benefits from the results.
• This person can consult and advise the project team on the context of the project, the value of the results, and how the outputs will be operationalized.


Project Manager:
• Ensures that key milestones and objectives are met on time and at the expected quality.

Business Intelligence Analyst:
• Provides business domain expertise based on a deep understanding of the data, key performance indicators (KPIs), key metrics, and business intelligence from a reporting perspective.
• Business Intelligence Analysts generally create dashboards and reports and have knowledge of the data feeds and sources.

Database Administrator (DBA):
• Responsibilities include providing access to key databases or tables and ensuring that the appropriate security levels are in place for the data repositories.


Data Engineer:
• Leverages deep technical skills to assist with tuning SQL queries for data management and data extraction, and provides support for data ingestion into the analytic sandbox.
• Whereas the DBA sets up and configures the databases to be used, the data engineer executes the actual data extractions and performs substantial data manipulation to facilitate the analytics.


Data Scientist:
• Provides subject matter expertise for analytical techniques, data modeling, and applying valid analytical techniques to given business problems.
• Ensures overall analytics objectives are met.


PHASE 1: DISCOVERY
• The data science team must learn and investigate the problem, develop context and understanding, and learn about the data sources needed and available for the project.

• Learning the Business Domain
• Resources
• Framing the Problem
• Identifying Key Stakeholders
• Interviewing the Analytics Sponsor
• Developing Initial Hypotheses
• Identifying Potential Data Sources


Learning the Business Domain:
• Understanding the domain area of the problem is essential.
• Data scientists have deep knowledge of the methods, techniques, and ways of applying heuristics to a variety of business and conceptual problems.


Resources:
• As part of the discovery phase, the team needs to assess the resources available to support the project.
• In this context, resources include technology, tools, systems, data, and people.
• Does the requisite level of expertise exist within the organization today, or will it need to be cultivated?
• The team will need to determine whether it must collect additional data, purchase it from outside sources, or transform existing data.
• Ensure the project team has the right mix of domain experts, customers, analytic talent, and project management to be effective.


Framing the Problem:
• Framing is the process of stating the analytics problem to be solved.
• It is crucial to state the analytics problem, as well as why and to whom it is important.
• It is important to identify the main objectives of the project, identify what needs to be achieved in business terms, and identify what needs to be done to meet those needs.
• It is best practice to share the statement of goals and success criteria with the team and confirm alignment with the project sponsor's expectations.
• Establishing criteria for both success and failure helps the participants avoid unproductive effort and remain aligned with the project sponsors.


Identifying Key Stakeholders:
• An important step is to identify the key stakeholders and their interests in the project.
• During these discussions, the team can identify the success criteria, key risks, and stakeholders.
• When interviewing stakeholders, learn about the domain area and any relevant history from similar analytics projects.
• Depending on the number of stakeholders and participants, the team may consider outlining the type of activity and participation expected from each stakeholder and participant.
• This will set clear expectations with the participants and avoid delays later.


Interviewing the Analytics Sponsor:
• The team should plan to collaborate with the stakeholders to clarify and frame the analytics problem.
• Sponsors may have a predetermined solution that may not necessarily realize the desired outcome.
• In these cases, the team must use its knowledge and expertise to identify the true underlying problem and appropriate solution.
• The data science team typically has a more objective understanding of the problem set than the stakeholders, who may be suggesting solutions.

Some tips for interviewing project sponsors:
• Prepare for the interview; draft questions, and review with colleagues.
• Use open-ended questions; avoid asking leading questions.
• Document what the team heard, and review it with the sponsors.

Developing Initial Hypotheses:
• This step involves forming ideas that the team can test with data.
• In this way, the team can compare its answers with the outcome of an experiment or test to generate additional possible solutions to problems.
• Another part of this process involves gathering and assessing hypotheses from stakeholders and domain experts who may have their own perspective on what the problem is, what the solution should be, and how to arrive at a solution.

Identifying Potential Data Sources:
• The team should perform five main activities during this step of the discovery phase:
• Identify data sources
• Capture aggregated data sources
• Review the raw data
• Evaluate the data structures and tools needed
• Scope the sort of data infrastructure needed for this type of problem




PHASE 2: DATA PREPARATION
• The second phase of the Data Analytics Lifecycle involves data preparation, which includes the steps to explore, preprocess, and condition data prior to modeling and analysis.
• To get the data into the sandbox, the team needs to perform ETLT, a combination of extracting, transforming, and loading data into the sandbox. Once the data is in the sandbox, the team needs to learn about the data and become familiar with it.
• The team may perform data visualizations to help team members understand the data, including its trends, outliers, and relationships among data variables.

This phase involves:
• Preparing the Analytic Sandbox
• Performing ETLT
• Learning About the Data
• Data Conditioning
• Survey and Visualize
• Common Tools for the Data Preparation Phase

Preparing the Analytic Sandbox:
• When developing the analytic sandbox, it is a best practice to collect all kinds of data there, as team members need access to high volumes and varieties of data for a Big Data analytics project.
• This can include everything from summary-level aggregated data, structured data, raw data feeds, and unstructured text data from call logs or web logs, depending on the kind of analysis the team plans to undertake.
• Expect the sandbox to be large. It may contain raw data, aggregated data, and other data types that are less commonly used in organizations.
• Sandbox size can vary greatly depending on the project. A good rule is to plan for the sandbox to be at least 5-10 times the size of the original data sets, partly because copies of the data may be created that serve as specific tables or data stores for specific kinds of analysis in the project.
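The 5-10 times sizing rule above is simple arithmetic, sketched below. The function name `sandbox_size_range` and the 200 GB figure are hypothetical; only the factors 5 and 10 come from the notes.

```python
def sandbox_size_range(original_gb, low_factor=5, high_factor=10):
    """Return the (low, high) sandbox size in GB for a given raw data size."""
    if original_gb < 0:
        raise ValueError("original_gb must be non-negative")
    return original_gb * low_factor, original_gb * high_factor

low, high = sandbox_size_range(200)  # hypothetical 200 GB of source data
print(f"Plan for at least {low} to {high} GB of sandbox storage")
```

For 200 GB of original data the rule suggests planning for roughly 1,000 to 2,000 GB, since the analysis tends to create several working copies of the same data.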


Performing ETLT:
• In ETL, users perform extract, transform, load processes to extract data from a datastore, perform data transformations, and load the data back into the datastore.
• With ELT (extract, load, then transform), the data is extracted in its raw form and loaded into the datastore, where analysts can choose to transform the data into a new state or leave it in its original, raw condition.
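A minimal sketch of the load-first, transform-later idea described above, using Python's built-in `sqlite3` module with an in-memory database standing in for the analytic sandbox. The table names, column names, and sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical raw records as they might arrive from a source system.
raw_rows = [("2024-01-05", " 19.90 "), ("2024-01-06", "5"), ("2024-01-06", "n/a")]

conn = sqlite3.connect(":memory:")  # stands in for the analytic sandbox

# Extract + Load: store the data untouched, as text.
conn.execute("CREATE TABLE raw_sales (day TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", raw_rows)

# Transform inside the sandbox: build a cleaned table from the raw one,
# keeping only rows whose trimmed amount starts with a digit.
conn.execute("""
    CREATE TABLE clean_sales AS
    SELECT day, CAST(TRIM(amount) AS REAL) AS amount
    FROM raw_sales
    WHERE TRIM(amount) GLOB '[0-9]*'
""")

raw_count = conn.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0]
clean = conn.execute(
    "SELECT day, amount FROM clean_sales ORDER BY day, amount"
).fetchall()
print(raw_count, clean)
```

Because the raw table is left in place, analysts can still inspect the rejected `n/a` row or apply a different transformation later without extracting the data again.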


• As part of the ETLT step, it is advisable to make an inventory of the data and compare the data currently available with the datasets the team needs (gap analysis).
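The gap analysis mentioned above amounts to comparing two inventories. The sketch below does this with Python sets; the function name `gap_analysis` and the dataset names are hypothetical.

```python
def gap_analysis(available, needed):
    """Compare the data inventory with the datasets the team needs."""
    available, needed = set(available), set(needed)
    return {
        "ready": sorted(available & needed),    # needed and already on hand
        "missing": sorted(needed - available),  # needed but not yet available
        "unused": sorted(available - needed),   # on hand but not required
    }

# Hypothetical dataset names.
inventory = ["crm_accounts", "web_logs", "call_logs"]
required = ["crm_accounts", "web_logs", "billing_history"]
report = gap_analysis(inventory, required)
print(report)
```

The `missing` list is the part that drives follow-up work, such as requesting access or sourcing the data from outside the organization.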


Learning About the Data:
• A critical aspect of a data science project is becoming familiar with the data itself. This activity:
• Clarifies the data that the data science team has access to at the start of the project
• Highlights gaps by identifying datasets within an organization that the team may find useful
• Identifies datasets outside the organization that may be useful to obtain, through open APIs, data sharing, or purchasing data to supplement already existing datasets

Data Conditioning:
• Data conditioning refers to the process of cleaning data, normalizing datasets, and performing transformations on the data.
• Data conditioning can involve many complex steps to join or merge datasets or otherwise get datasets into a state that enables analysis in further phases.

Questions and considerations for this step:
• What are the data sources? What are the target fields (for example, columns of the tables)?
• How clean is the data?
• How consistent are the contents and files?
• Review the content of data columns or other inputs.
• Look for any evidence of systematic error.
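As one small example of the cleaning and normalizing steps described above, the sketch below drops missing entries from a column and rescales the rest to the range [0, 1] (min-max normalization). The function name `condition`, the treatment of `None` and empty strings as missing, and the sample column are assumptions for illustration.

```python
def condition(values):
    """Drop missing entries, then min-max normalize to the range [0, 1]."""
    cleaned = [float(v) for v in values if v not in (None, "")]
    lo, hi = min(cleaned), max(cleaned)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in cleaned]
    return [(v - lo) / (hi - lo) for v in cleaned]

# Hypothetical column with missing values and one value stored as text.
ages = [20, None, "30", 40, ""]
print(condition(ages))
```

Dropping rows is only one way to handle missing values; whether to drop, impute, or flag them depends on the analysis planned, and a column with no usable values would need separate handling.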


Survey and Visualize:
• After the team has collected and obtained at least some of the datasets needed for the subsequent analysis, a useful step is to leverage data visualization tools to gain an overview of the data.
• Seeing high-level patterns in the data enables one to understand characteristics of the data very quickly.
• Review data to ensure that calculations remained consistent within columns or across tables for a given data field.
• Does the data distribution stay consistent over all the data? If not, what kinds of actions should be taken to address this problem?
• Assess the granularity of the data, the range of values, and the level of aggregation of the data.
• For time-related variables, are the measurements daily, weekly, or monthly?
• Is the data standardized/normalized? Are the scales consistent?
• For geospatial datasets, are state or country abbreviations consistent across the data?
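The last check in the list above, consistency of state or country abbreviations, can be sketched as follows. The function name `inconsistent_labels`, the normalization rule (ignore case, periods, and surrounding spaces), and the sample column are assumptions for this example.

```python
from collections import Counter

def inconsistent_labels(values):
    """Group labels that differ only by case, spaces, or periods."""
    groups = {}
    for value in values:
        key = value.replace(".", "").strip().upper()
        groups.setdefault(key, Counter())[value] += 1
    # Report only the keys that appear under more than one spelling.
    return {key: dict(variants) for key, variants in groups.items() if len(variants) > 1}

# Hypothetical state column from a geospatial dataset.
states = ["CA", "ca", "C.A.", "NY", "NY", "tx "]
print(inconsistent_labels(states))
```

This sketch flags only labels that occur under mixed spellings, so a single nonstandard entry such as `"tx "` passes unnoticed; comparing against a reference list of valid abbreviations would catch that case.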


Common Tools for the Data Preparation Phase:
• Hadoop: can perform massively parallel and custom analysis for web traffic parsing, GPS location analytics, and genomic analysis.
• Alpine Miner: provides a graphical user interface (GUI) for creating analytic workflows, including data manipulations and a series of analytic events such as data-mining techniques.
• OpenRefine (formerly called Google Refine): "a free, open source, powerful tool for working with messy data." It is a popular GUI-based tool.
• Data Wrangler: an interactive tool for data cleaning and transformation. Wrangler was developed at Stanford University and can be used to perform many transformations on a given dataset.




MODEL PLANNING
• The data science team identifies candidate models to apply to the data for clustering, classifying, or finding relationships in the data, depending on the goal of the project.
• Assess the structure of the datasets.
• Ensure that the analytical techniques enable the team to meet the business objectives and accept or reject the working hypotheses.
• Determine if the situation warrants a single model or a series of techniques as part of a larger analytic workflow.

• Data Exploration and Variable Selection
• Model Selection
• Common Tools for the Model Planning Phase


Data Exploration and Variable Selection:
• Although some data exploration takes place in the data preparation phase, those activities focus mainly on data hygiene and on assessing the quality of the data itself.
• In model planning, the objective of the data exploration is to understand the relationships among the variables to inform selection of the variables and methods, and to understand the problem domain.
• The key to this approach is to aim for capturing the most essential predictors and variables rather than considering every possible variable that people think may influence the outcome.
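One simple way to explore relationships among variables, as described above, is to compute the Pearson correlation between each candidate predictor and the outcome and keep only the strongly related ones. This is just one possible screening technique, not a method prescribed by the notes; the function names, the 0.5 threshold, and the data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def screen(candidates, outcome, threshold=0.5):
    """Keep candidates whose |correlation| with the outcome meets the threshold."""
    scores = {name: pearson(values, outcome) for name, values in candidates.items()}
    return {name: r for name, r in scores.items() if abs(r) >= threshold}

# Hypothetical data: two candidate predictors and one outcome.
outcome = [10, 20, 30, 40, 50]
candidates = {
    "ad_spend": [1, 2, 3, 4, 5],       # moves with the outcome
    "day_of_month": [3, 1, 3, 1, 3],   # unrelated pattern
}
selected = screen(candidates, outcome)
print(selected)
```

Correlation captures only linear relationships and fails for a constant column (zero variance), so a screen like this supports, but does not replace, domain knowledge when choosing variables.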


Model Selection:
• The team's main goal is to choose an analytical technique, or a short list of candidate techniques, based on the end goal of the project.
• In machine learning and data mining, the rules and conditions that models apply are grouped into several general sets of techniques, such as classification, association rules, and clustering.
• Teams create the initial models using a statistical software package such as R, SAS, or Matlab.
• Although these tools are designed for data mining and machine learning algorithms, they may have limitations when applying the models to very large datasets, as is common with Big Data.


MODEL PLANNING
Data science team identifies candidate models to apply to the data for
clustering, classifying, or finding relationships in the data depending on the goal
of the project
Assess the structure of the datasets.
Ensure that the analytical techniques enable the team to meet the business
objectives and accept or reject the working hypotheses.
Determine if the situation warrants a single model or a series of techniques as
part of a larger analytic workflow.


MODEL PLANNING

• Data Exploration and Variable Selection
• Model Selection
• Common Tools for the Model Planning Phase


MODEL PLANNING
Data Exploration and Variable Selection:

• Although some data exploration takes place in the data
preparation phase, those activities focus mainly on data
hygiene and on assessing the quality of the data itself.
• In the model planning phase, the objective of the data
exploration is to understand the relationships among the
variables to inform selection of the variables and methods
and to understand the problem domain.


MODEL BUILDING PHASE

• The data science team needs to develop data sets for
training, testing, and production purposes.
• These data sets enable the data scientist to develop the
analytical model and train it while holding aside some of
the data for testing the model.
• During this phase, users run models from analytical
software packages, such as R or SAS, on file extracts
and small data sets for testing purposes. On a small
scale, assess the validity of the model and its results.
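
Holding data aside for testing can be sketched as follows. This is a minimal illustration under stated assumptions: the data is synthetic (a hypothetical relationship y = 3x + 2 with small noise), the 75/25 split ratio is arbitrary, and a one-variable least-squares line stands in for whatever model the team actually builds.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and hold aside a fraction for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(pairs):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_error(model, pairs):
    """Average absolute prediction error of the fitted line on the given pairs."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in pairs) / len(pairs)

# Synthetic data (assumption): y = 3x + 2 plus uniform noise in [-0.5, 0.5].
noise = random.Random(0)
data = [(x, 3 * x + 2 + noise.uniform(-0.5, 0.5)) for x in range(40)]

train, test = train_test_split(data)
model = fit_line(train)                 # train only on the training set
test_error = mean_abs_error(model, test)  # validate on data the model never saw
print(len(train), len(test), round(test_error, 3))
```

Because the test rows play no part in fitting, the error measured on them is a fairer small-scale check of the model's validity than the error on the training rows.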

COMMON TOOLS FOR THE MODEL BUILDING PHASE
• SAS Enterprise Miner allows users to run predictive and
descriptive models based on large volumes of data from
across the enterprise.
• SPSS Modeler offers methods to explore and analyze
data through a GUI.
• Matlab provides a high-level language for performing a
variety of data analytics, algorithms, and data exploration.

• Alpine Miner provides a GUI front end for users to
develop analytic workflows and interact with Big Data
tools and platforms on the back end.
• STATISTICA and Mathematica are popular and well-regarded
data mining and analytics tools.

Open Source tools:

• R and PL/R: PL/R is a procedural language for PostgreSQL
with R, so R commands can be executed in the database.
• Octave: a free software programming language for
computational modeling that has some of the functionality
of Matlab.

• WEKA is a free data mining software package with an
analytic workbench.
• Python is a programming language that provides toolkits
for machine learning and analysis, such as scikit-learn,
numpy, scipy, and pandas, with related data visualization
using matplotlib.


COMMUNICATE RESULTS

• After executing the model, the team needs to compare
the outcomes of the modeling to the criteria established
for success and failure.
• When conducting this assessment, determine if the
results are statistically significant and valid.
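
One way to check whether an observed improvement is statistically significant is a permutation test on the difference in means. The sketch below is illustrative only: the two groups are hypothetical measurements, and the 0.05 threshold is a conventional assumption rather than a success criterion given in these notes.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided p-value for the difference in means under random relabeling."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign values to the two groups
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical outcomes (assumption): a metric before and after using the model.
baseline = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
with_model = [13.0, 12.9, 13.4, 12.8, 13.1, 13.3, 12.7, 13.2]

p = permutation_p_value(baseline, with_model)
print("significant at 0.05:", p < 0.05)
```

A small p-value says the difference is unlikely to be due to chance alone. Whether the result is also valid, meaning measured on the right data and relevant to the business question, still needs the team's judgment.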


COMMUNICATE RESULTS

• The best practice in this phase is to record all the
findings and then select the three most significant ones
that can be shared with the stakeholders.
• By the end of this phase, the team will have documented
the key findings and major insights derived from the
analysis.


ANALYSIS OVER DIFFERENT MODELS

• Better performance
• Longer lifetime
• Easier retraining
• Speedy production


OPERATIONALIZE

• The team communicates the benefits of the project more
broadly and sets up a pilot project to deploy the work in a
controlled way before broadening the work to a full
enterprise or ecosystem of users.
• This allows the team to learn from the deployment and make
any needed adjustments before launching the model across
the enterprise.
• The presentation needs to include supporting information
about analytical methodology and data sources.


OPERATIONALIZE

• Part of operationalizing is creating a mechanism for
performing ongoing monitoring of model accuracy and, if
accuracy degrades, finding ways to retrain the model.
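
A monitoring mechanism of this kind can be sketched as a rolling accuracy check over the most recent predictions. Everything specific here is an assumption for illustration: the `AccuracyMonitor` class, the window size, the 0.8 threshold, and the "churn"/"stay" labels are all hypothetical.

```python
from collections import deque

class AccuracyMonitor:
    """Tracks accuracy over the most recent predictions and flags degradation."""

    def __init__(self, window=100, threshold=0.8):
        self.outcomes = deque(maxlen=window)  # oldest results drop off automatically
        self.threshold = threshold

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        # Only judge once the window is full, to avoid reacting to a few early errors.
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy() < self.threshold

monitor = AccuracyMonitor(window=10, threshold=0.8)

for _ in range(10):
    monitor.record("churn", "churn")   # ten correct predictions
healthy = monitor.needs_retraining()

for _ in range(5):
    monitor.record("churn", "stay")    # five wrong predictions push accuracy down
degraded = monitor.needs_retraining()

print(healthy, degraded)
```

In a real deployment the actual outcomes often arrive with a delay, so the check would run whenever ground truth becomes available rather than at prediction time.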


OPERATIONALIZE
• The Business Intelligence Analyst needs to know if the
reports and dashboards they manage will be impacted
and need to change.
• The Data Engineer and Database Administrator (DBA)
typically need to share their code from the analytics
project and create a technical document on how to
implement it.
• The Data Scientist needs to share the code and explain the
model to managers and other stakeholders.


MOVING MODEL TO DEPLOYMENT ENVIRONMENT
• Developing Core Material for Multiple Audiences
    - Project Goals
    - Main Findings
    - Approach
    - Model Description
    - Model Details
• Providing Technical Specifications and Code


ANALYTICS PLAN

• Discovery: business problem framed
• Initial hypotheses
• Data and scope
• Model planning: analytic techniques
• Results and key findings
• Business impact


KEY DELIVERABLES OF ANALYTICS PROJECT

• Developing Core Material for Multiple Audiences
    - Project Goals
    - Main Findings
    - Approach
    - Model Description
    - Model Details
• Providing Technical Specifications and Code


PRESENTING YOUR RESULTS TO THE PROJECT SPONSOR
• The project sponsor is the person who wants the data
science result, generally for the business need that it
will fill.

1. Summarize the motivation behind the project and its goals.
2. State the project's results.
3. Back up the results with details (code), as needed.
4. Discuss recommendations, outstanding issues, and
possible future work.


Project sponsor presentation takeaways

• Keep it short.
• Keep it focused on the business issues, not the
technical ones.
• Your project sponsor might use your presentation to
help sell the project or its results to the rest of the
organization.

PROVIDING TECHNICAL SPECIFICATIONS AND CODE
• The team should anticipate questions from IT related to
how computationally expensive it will be to run the model
in the production environment.
• Teams should write technical documentation for their
code and specifications.
• Introduce your results early in the presentation, rather
than building up to them.
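
To prepare for IT's questions about computational cost, the team can measure how long scoring takes per record and extrapolate to the expected production volume. The sketch below is a rough illustration: `score` is a hypothetical stand-in for a trained model, and the 5-million-record nightly batch is an assumed workload, not a figure from these notes.

```python
import time

def score(record):
    """Hypothetical stand-in for a trained model's scoring function."""
    return 0.4 * record["tenure"] + 0.6 * record["spend"]

def seconds_per_prediction(predict, records, repeats=5):
    """Best-of-N average wall-clock seconds per scored record."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for record in records:
            predict(record)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed / len(records))
    return best

records = [{"tenure": i % 60, "spend": i * 1.5} for i in range(10_000)]
per_record = seconds_per_prediction(score, records)

# Extrapolate to an assumed nightly batch of 5 million records.
batch_minutes = per_record * 5_000_000 / 60
print(f"{per_record * 1e6:.2f} microseconds per record")
print(f"about {batch_minutes:.2f} minutes for the assumed nightly batch")
```

Timings taken on a development machine only approximate production behavior, so the numbers are best presented to IT as an estimate alongside the hardware they were measured on.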
