Predictive Analytics Complete Notes
PREDICTIVE ANALYTICS
LECTURE NOTES
DEPARTMENT OF CSE(ARTIFICIAL INTELLIGENCE & MACHINE LEARNING)
• Predictive analytics problems are typically solved using models such as multiple
linear regression, logistic regression, auto-regressive integrated moving
average (ARIMA), decision trees, and neural networks. We may use
regression models to identify the connections between variables and
how to take advantage of those connections when making judgements.
Automotive industry:
Predictive analytics and other forms of AI pave the way for self-driving vehicles
by predicting what will happen in the immediate future while a car is driving down the
road. This process needs to happen continuously when a vehicle is in motion,
drawing information from multiple sensors and making judgment calls about which
potential actions would pose a safety risk.
Beyond autonomous vehicles, manufacturers and retailers can also use predictive
analytics to their benefit. For example, predictive analytics helps factories create
vehicles faster using fewer resources.
For example:
Tesla uses predictive analytics in the form of neural network accelerators for their
self-driving vehicles. A neural network model simulates how human brains use
information to make decisions.
When you receive an alert to suspicious activity in your bank account, you can thank
predictive analytics for determining that something doesn’t seem right based on
deviations from your routine, such as a transaction in a different city. Financial
institutions and other companies use predictive analytics to reduce credit risk,
combat fraud, predict future cash flow, analyze insurance coverage, and look for new opportunities.
For example:
Ocrolus is a program that businesses can use when determining someone’s credit
eligibility. Ocrolus uses AI and ML to offer a more reliable solution for examining
documents and avoiding fraud.
For example:
Subway used predictive analytics to decide whether to raise the price of their $5
Footlong sandwich. Their data showed that the low price point wasn’t causing them
to sell enough sandwiches to make up for a bump in price. Using a predictive
analytics program offered by Mastercard, Subway learned that customers purchasing
Footlong sandwiches added additional items to their orders, such as a side of chips
or a drink.
Similar to the manufacturing industry, utility companies can use predictive analytics
to watch out for equipment failures and safety concerns. Due to the potentially
catastrophic nature of equipment failure and malfunction in the utility industry, it’s
vital for companies to invest in predictive analytics to keep things running as
smoothly as possible.
1. Defining the objective: The first step in any data analysis process is to define your objective. In data
analytics jargon, this is sometimes called the ‘problem statement’.
Defining your objective means coming up with a hypothesis and figuring how to test
it. Start by asking: What business problem am I trying to solve? While this might
sound straightforward, it can be trickier than it seems. For instance, your
organization’s senior management might pose an issue, such as: “Why are we losing
customers?” It’s possible, though, that this doesn’t get to the core of the problem.
2. Collecting the data: Once you’ve established your objective, you’ll need to create
a strategy for collecting and aggregating the appropriate data. A key part of this is
determining which data you need. This might be quantitative (numeric) data, e.g.
sales figures, or qualitative (descriptive) data, such as customer reviews. All data fit
into one of three categories: first-party, second-party, and third-party data. Let’s
explore each one.
3. Cleaning the data: Once you’ve collected your data, the next step is to get it
ready for analysis. This means cleaning, or ‘scrubbing’ it, and is crucial in making
sure that you’re working with high-quality data. Key data cleaning tasks include:
4. Analyzing the data: Finally, you’ve cleaned your data. Now comes the fun bit:
analyzing it! The type of data analysis you carry out largely depends on what your
goal is, but there are many techniques available; time-series analysis and regression
analysis are just a few you might have heard of. Descriptive analysis identifies
what has already happened, diagnostic analytics focuses on understanding why
something has happened, predictive analysis allows you to identify future trends
based on historical data, and prescriptive analysis allows you to make
recommendations for the future. More important than the different types, though, is
how you apply them. This depends on what insights you’re hoping to gain.
5. Sharing your results: You’ve finished carrying out your analyses. You have your
insights. The final step of the data analytics process is to share these insights with
the wider world (or at least with your organization’s stakeholders!). This is more
complex than simply sharing the raw results of your work; it involves interpreting
the outcomes and presenting them in a manner that’s digestible for all types of
audiences. Since you’ll often present information to decision-makers, it’s very
important that the insights you present are 100% clear and unambiguous. For this
reason, data analysts commonly use reports, dashboards, and interactive
visualizations to support their findings.
Various Analytics techniques are:
There are four different types of analytics techniques:
5. Manufacturing
Business analysts work with data to help stakeholders understand the things that
affect operations and the bottom line. Identifying things like equipment downtime,
inventory levels, and maintenance costs helps companies streamline inventory
management, risk management, and supply-chain management to create maximum efficiency.
6. Marketing
Business analysts help answer these questions and so many more, by measuring
marketing and advertising metrics, identifying consumer behavior and the target
audience, and analyzing market trends.
➢ A statistical model embodies (represents) a set of assumptions concerning the
generation of the observed data, and similar data from a larger population.
➢ A model represents, often in considerably idealized form, the data-generating
process.
➢ Example: Speech/signal processing is an enabling technology that
encompasses the fundamental theory, applications, algorithms, and
implementations of processing or transferring information contained in many
different physical, symbolic, or abstract formats broadly designated as signals.
➢ It uses mathematical, statistical, computational, heuristic, and linguistic
representations, formalisms, and techniques for representation, modeling,
analysis, synthesis, discovery, recovery, sensing, acquisition, extraction,
learning, security, or forensics.
➢ In manufacturing, statistical models are used to define warranty policies,
solve various conveyor-related issues, perform Statistical Process Control, etc.
Databases & Type of data and variables:
A Database Management System (DBMS) is a software system that is designed to
manage and organize data in a structured manner. It allows users to create, modify,
and query a database, as well as manage the security and access controls for that
database.
Data dictionary: an integral component of a Database Management System (DBMS) that is
required to determine its structure, or a piece of middleware that extends or supplants
the native data dictionary of a DBMS.
Relational Database Management System: (RDBMS) is a software system used
to maintain relational databases. Many relational database systems have the option
of using SQL.
Example: format of tables & files, number of users, size.
NoSQL Database:
is a non-relational Data Management System that does not require a fixed schema
(logical, physical, or view). It avoids joins and is easy to scale. The major purpose of
using a NoSQL database is for distributed data stores with humongous data storage
needs. NoSQL is used for Big data and real-time web apps. For example, companies
like Twitter, Facebook and Google collect terabytes of user data every single day.
NoSQL database stands for “Not Only SQL” or “Not SQL.”
Types of missing data: Missing data can be classified into one of three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).
1. Regression
Simple Regression :
Used to predict a continuous dependent variable based on a single independent
variable. Simple linear regression should be used when there is only a single
independent variable.
Multiple Regression: Used to predict a continuous dependent variable based on
multiple independent variables.
Multiple linear regression should be used when there are multiple independent
variables.
Linear Regression: Linear regression is one of the simplest and most widely
used statistical models. This assumes that there is a linear relationship between the
independent and dependent variables. This means that the change in the dependent
variable is proportional to the change in the independent variables.
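As an illustration, here is a minimal sketch of simple linear regression in Python using scikit-learn; the advertising-spend and sales numbers are made-up example data, not taken from these notes:

# Simple linear regression: predict a continuous y from a single x.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: advertising spend (x) vs. sales (y)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # independent variable
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])             # dependent variable

model = LinearRegression()
model.fit(X, y)                       # estimates slope and intercept

print("slope:", model.coef_[0])       # change in y per unit change in x
print("intercept:", model.intercept_)
print("prediction for x=6:", model.predict([[6.0]])[0])

The fitted slope and intercept describe the proportional relationship discussed above.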
K-means clustering: assigns data points to one of K clusters depending on their
distance from the centers of the clusters. It starts by randomly placing the cluster
centroids in the space. Then each data point is assigned to one of the clusters based
on its distance from the centroid of that cluster.
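A minimal K-means sketch with scikit-learn, using hypothetical 2-D points (the data and the choice of K=2 are purely for illustration):

# K-means: assign each point to the nearest of K cluster centroids.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data points
points = np.array([[1, 2], [1, 4], [1, 0],
                   [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)       # cluster index for each point

print("cluster labels:", labels)          # e.g. [1 1 1 0 0 0]
print("centroids:", kmeans.cluster_centers_)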
Hierarchical clustering : is a connectivity-based clustering model that groups the
data points together that are close to each other based on the measure of similarity
or distance. The assumption is that data points that are close to each other are more
similar or related than data points that are farther apart.
A dendrogram, a tree-like figure produced by hierarchical clustering, depicts the
hierarchical relationships between groups. Individual data points are located at the
bottom of the dendrogram, while the largest clusters, which include all the data
points, are located at the top. In order to generate different numbers of clusters, the
dendrogram can be sliced at various heights.
The dendrogram is created by iteratively merging or splitting clusters based on a
measure of similarity or distance between data points. Clusters are divided or merged
repeatedly until all data points are contained within a single cluster, or until the
predetermined number of clusters is attained.
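A small sketch of hierarchical (agglomerative) clustering with SciPy; the points are hypothetical, and cutting the linkage at 2 clusters mirrors the idea of slicing the dendrogram at a chosen height:

# Agglomerative (hierarchical) clustering and its dendrogram with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

points = np.array([[1, 2], [2, 2], [8, 8], [9, 8], [5, 5]])  # hypothetical data

Z = linkage(points, method="ward")                # iteratively merge the closest clusters
labels = fcluster(Z, t=2, criterion="maxclust")   # "cut" the dendrogram into 2 clusters
print("cluster labels:", labels)

# dendrogram(Z) would draw the tree if matplotlib is available:
# import matplotlib.pyplot as plt; dendrogram(Z); plt.show()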
• Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further
after getting a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing the unwanted branches from the tree.
• Parent/Child node: The root node of the tree is called the parent node, and the other nodes are
called the child nodes.
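A brief decision-tree sketch with scikit-learn to make the terms above concrete (the age/income data are invented, and max_depth stands in for pruning):

# Decision tree: the root node splits the data, branches lead to leaf nodes.
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: [age, income] -> buys product (1) or not (0)
X = [[25, 30000], [35, 60000], [45, 80000], [20, 20000], [50, 90000]]
y = [0, 1, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)   # limit depth (a simple form of pruning)
tree.fit(X, y)

print(export_text(tree, feature_names=["age", "income"]))    # shows splits and leaf nodes
print("prediction for [30, 50000]:", tree.predict([[30, 50000]])[0])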
Multi-layer Perceptron
Multi-layer perceptron is also known as MLP. It consists of fully connected dense layers,
which transform any input dimension to the desired dimension. A multi-layer
perceptron is a neural network that has multiple layers. To create a neural network
we combine neurons together so that the outputs of some neurons are the inputs of other
neurons.
A multi-layer perceptron has one input layer and for each input, there is one
neuron(or node), it has one output layer with a single node for each output and it can
have any number of hidden layers and each hidden layer can have any number of
nodes. A schematic diagram of a Multi-Layer Perceptron (MLP) is depicted below.
Every node in the multi-layer perceptron uses a sigmoid activation function. The
sigmoid activation function takes real values as input and converts them to numbers
between 0 and 1 using the sigmoid formula.
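For reference, the sigmoid function mentioned above has the standard form σ(x) = 1 / (1 + e^(−x)); any real input x is mapped to a value between 0 and 1, which is what lets each node's output be read as a soft, bounded signal.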
5. Classification models: are used to classify data into one or more categories based
on one or more input variables. Classification models identify the relationship
between the input variables and the output variable, and use that relationship to
accurately classify new data into the appropriate category. Classification models are
commonly used in fields like marketing, healthcare, and computer vision, to classify
data such as spam emails, medical diagnoses, and image recognition.
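A minimal classification sketch using logistic regression in scikit-learn; the "spam" features and labels are invented purely to illustrate fitting and predicting a category:

# Classification: learn a mapping from input features to a category label.
from sklearn.linear_model import LogisticRegression

# Hypothetical "spam" data: [number of links, number of ALL-CAPS words] -> spam (1) / not spam (0)
X = [[0, 1], [1, 0], [8, 12], [7, 9], [0, 0], [9, 15]]
y = [0, 0, 1, 1, 0, 1]

clf = LogisticRegression()
clf.fit(X, y)
print("predicted class for [6, 10]:", clf.predict([[6, 10]])[0])   # likely 1 (spam)
print("class probabilities:", clf.predict_proba([[6, 10]])[0])     # [P(not spam), P(spam)]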
• Structure
FFNNs consist of input and output layers with optional hidden layers
in between. Input data travels through the network from input nodes,
passing through hidden layers (if present), and culminating in output
nodes.
• Activation and Propagation
These networks operate via forward propagation, where data moves in
one direction without feedback loops. Activation functions like step
functions determine whether neurons fire based on weighted inputs.
For instance, a neuron may output 1 if its input exceeds a threshold
(usually 0), and -1 if it falls below.
FFNNs are efficient for handling noisy data and are relatively straightforward to
implement, making them versatile tools in various AI applications.
Disadvantages of Feed Forward Neural Network:
4. Cannot be used for deep learning [due to absence of dense layers and
back propagation]
8. Multilayer Perceptron
The Multi-Layer Perceptron (MLP) represents an entry point into complex neural
networks, designed to handle sophisticated tasks in various domains such as:
➢ Speech recognition
➢ Machine translation
➢ Complex classification tasks
MLPs are characterized by their multilayered structure, where input data traverses
through interconnected layers of artificial neurons.
This architecture includes input and output layers alongside multiple hidden layers,
typically three or more, forming a fully connected neural network.
• Bidirectional Propagation:
Utilizes forward propagation (for computing outputs) and backward
propagation (for adjusting weights based on error).
• Weight Adjustment:
During backpropagation, weights are optimized to minimize prediction
errors by comparing predicted outputs against actual training inputs.
• Activation Functions :
Nonlinear functions are applied to the weighted inputs of neurons,
enhancing the network’s capacity to model complex relationships. The
output layer often uses softmax activation for multi-class classification
tasks.
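A compact sketch of an MLP trained by backpropagation, using scikit-learn's MLPClassifier; the data, layer sizes, and choice of sigmoid ("logistic") hidden activation are illustrative assumptions (for multi-class output the classifier applies a softmax over classes):

# Multi-layer perceptron: hidden layers + backpropagation.
from sklearn.neural_network import MLPClassifier

# Hypothetical data: two features -> one of three classes
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3], [3, 2], [3, 3]]
y = [0, 0, 1, 1, 2, 2, 2, 2]

mlp = MLPClassifier(hidden_layer_sizes=(8, 8),   # two hidden layers
                    activation="logistic",       # sigmoid activation on hidden units
                    max_iter=2000, random_state=0)
mlp.fit(X, y)                                    # weights adjusted by backpropagation
print(mlp.predict([[2, 2]]))                     # multi-class prediction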
Advantages of Multi-Layer Perceptron
1. Used for deep learning [due to the presence of dense fully connected
layers and back propagation]
Advantages of Modular Neural Network
1. Efficient
2. Independent training
3. Robustness
Disadvantages of Modular Neural Network
Business process modeling can also help you group similar processes together and
anticipate how they should operate. The primary objective of business process
modeling tools is to analyze how things are right now and simulate how they
should be carried out to achieve better results.
Least Square Method: In statistics, when we have data in the form of data points that
can be represented on a cartesian plane by taking one of the variables as the
independent variable represented as the x-coordinate and the other one as the
dependent variable represented as the y-coordinate, it is called scatter data. This data
might not be useful in making interpretations or predicting the values of the dependent
variable for the independent variable where it is initially unknown. So, we try to get
an equation of a line that fits best to the given data points with the help of the Least
Square Method.
In this section, we will learn the least square method, its formula, graph, and
solved examples on it.
The red points in the above plot represent the data points for the sample data available.
Independent variables are plotted as x-coordinates and dependent ones are plotted as
y-coordinates. The equation of the line of best fit obtained from the least squares
method is plotted as the red line in the graph.
We can conclude from the above graph how the least squares method helps us to
find a line that best fits the given data points, which can then be used to make further
predictions about the value of the dependent variable where it is not known initially.
Limitations of the Least Square Method
The least squares method assumes that the data is evenly distributed and doesn’t
contain any outliers for deriving a line of best fit. But, this method doesn’t provide
accurate results for unevenly distributed data or for data containing outliers.
Least Square Method Formula
The Least Square Method formula is used to find the best-fitting line through a set of
data points by minimizing the sum of the squares of the vertical distances (residuals)
of the points from the line. For a simple linear regression, which is a line of the
form y=ax+b, where y is the dependent variable, x is the independent variable, a is the
slope of the line, and b is the y-intercept, the formulas to calculate the slope (a) and
intercept (b) of the line are derived from the following equations:
1. Slope (a) Formula: a = [n(∑xy) − (∑x)(∑y)] / [n(∑x²) − (∑x)²]
2. Intercept (b) Formula: b = [(∑y) − a(∑x)] / n
Where:
• n is the number of data points,
• ∑xy is the sum of the product of each pair of x and y values,
• ∑x is the sum of all x values,
• ∑y is the sum of all y values,
• ∑x² is the sum of the squares of x values.
These formulas are used to calculate the parameters of the line that best fits the data
according to the criterion of the least squares, minimizing the sum of the squared
differences between the observed values and the values predicted by the linear model.
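A small worked example in Python that plugs hypothetical (x, y) pairs into the slope and intercept formulas above:

# Least squares slope (a) and intercept (b) computed directly from the sums above.
xs = [1, 2, 3, 4, 5]            # hypothetical x values
ys = [2, 4, 5, 4, 5]            # hypothetical y values
n = len(xs)

sum_x  = sum(xs)
sum_y  = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # slope
b = (sum_y - a * sum_x) / n                                    # intercept

print(f"best-fit line: y = {a:.2f}x + {b:.2f}")

For these numbers the printed line is y = 0.60x + 2.20.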
Variable rationalization:
Variable rationalization in predictive analytics involves optimizing the selection
and transformation of features (variables) used in predictive models. The goal is to
enhance model performance, reduce complexity, and ensure that the variables
contribute meaningfully to the predictive power of the model. Here’s a detailed
breakdown of the process:
• Missing Values: Analyze the extent and pattern of missing values. Decide
whether to impute, remove, or ignore variables with missing data.
3. Feature Selection
4. Feature Engineering
• Create New Features: Generate new variables that might provide additional
predictive power. For example, combining multiple features into a single
feature that captures interactions.
• Transform Variables: Apply transformations (e.g., logarithmic, polynomial) to
variables to better capture relationships and improve model performance.
• Encoding Categorical Variables: Convert categorical variables into numerical
formats using techniques like one-hot encoding or label encoding, as
required by the modeling algorithm.
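A minimal feature-engineering sketch with pandas and NumPy, showing a log transform and one-hot encoding; the income/city data are invented for illustration:

# Feature engineering sketch: transform a variable and one-hot encode a categorical one.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [40000, 55000, 62000],
    "city":   ["Delhi", "Mumbai", "Delhi"],   # categorical feature
})

df["log_income"] = np.log(df["income"])          # log transform of a skewed variable
encoded = pd.get_dummies(df, columns=["city"])   # one-hot encoding: one 0/1 column per city
print(encoded)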
5. Dimensionality Reduction
7. Iterative Refinement
• Simplify Models: Aim for a model that uses the least number of features
necessary to achieve high performance. Simpler models are often easier to
interpret and maintain.
• Understandability: Ensure that the selected variables make sense and that the
model’s decisions can be explained in terms of the features.
9. Documentation
Summary
Model Building:
Model building in predictive analytics involves creating and training models to
forecast future events or outcomes based on historical data. This process typically
follows a structured approach, including data preparation, model selection, training,
evaluation, and deployment. Here’s a step-by-step guide to model building in
predictive analytics:
• Objective: Clearly define the problem you’re trying to solve and the outcome
you want to predict.
• Success Metrics: Identify metrics to measure the success of the model (e.g.,
accuracy, precision, recall, F1 score).
• Data Collection: Gather relevant data from various sources. This might include
historical records, transaction logs, surveys, etc.
• Data Cleaning: Handle missing values, outliers, and inconsistencies in the data.
This may involve imputing missing values, removing duplicates, or correcting
errors.
• Feature Engineering: Create new features or modify existing ones to improve
model performance. This can include normalization, scaling, encoding
categorical variables, and generating interaction terms.
4. Select a Model
• Training and Testing Sets: Divide the data into training and testing sets
(e.g., 80/20 split) to train the model and evaluate its performance on unseen
data.
• Validation Set: Optionally, use a validation set to fine-tune model parameters
and select the best model.
• Fit the Model: Use the training data to train the model. This involves
adjusting model parameters to minimize the error between predictions and
actual values.
• Hyperparameter Tuning: Optimize model hyperparameters using techniques
such as grid search or random search to improve performance (a short code sketch follows this list).
• Regular Updates: Periodically retrain and update the model with new data to
ensure it remains accurate and relevant.
• Model Drift: Watch for changes in model performance over time due to shifts
in data distribution or other external factors.
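Putting the splitting, fitting, and tuning steps together, here is a minimal sketch with scikit-learn; the synthetic dataset, random forest model, and small hyperparameter grid are illustrative choices, not prescribed by these notes:

# Train/test split, model fitting, and hyperparameter tuning with grid search.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)  # synthetic data

# 80/20 split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Grid search over a small hyperparameter grid, using cross-validation on the training set
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
                    cv=3)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("test accuracy:", accuracy_score(y_test, grid.predict(X_test)))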
By following these steps, you can build predictive models that provide valuable
insights and support decision-making processes based on data.
When you're working with data, sometimes you'll find that some values are
missing. This could be due to various reasons—maybe someone forgot to fill in the
data, or maybe the data was lost. In predictive analytics, having missing data can
be a problem because many machine learning models need complete data to make
accurate predictions.
Why Is It Important?
If you have a dataset with missing values and you try to build a predictive model,
the model might not work as well, or it might not work at all. By imputing, or
filling in, the missing values, you can make sure your model has the complete
information it needs to make accurate predictions.
How Do We Do It?
1. Mean/Median/Mode Imputation:
o Mean: Replace the missing value with the average of the other values.
o Median: Replace the missing value with the middle value (when all
values are ordered).
o Mode: Replace the missing value with the most frequent value.
Example: If you have ages like [25, 30, _, 40], where “_” is missing, you could
replace it with the mean (about 32) or the median (30). (A code sketch of these imputation methods appears after this list.)
2. Forward/Backward Fill:
o Forward Fill: Replace the missing value with the previous value in
the dataset.
o Backward Fill: Replace the missing value with the next value in the
dataset.
Example: If you have temperatures like [70, 72, _, 75], you could replace the
missing one with 72 (forward fill) or 75 (backward fill).
3. K-Nearest Neighbors (KNN) Imputation:
Example: If you’re missing a value in a row of data, KNN will look at other
rows that are similar to it and use their values to fill in the gap.
4. Predictive Imputation:
Example: If you have a dataset with age, income, and education level, and
income is missing, you could build a model that predicts income based on
age and education.
5. Multiple Imputation:
o Instead of filling in a single value, multiple imputation creates several
different datasets, each with different imputed values. You then run
your analysis on all of these datasets and combine the results.
This method gives a more robust and realistic estimate because it considers
the uncertainty of the missing data.
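A small sketch of several of the imputation methods above, using pandas and scikit-learn on invented values (mean/median fill, forward/backward fill, and KNN imputation):

# Missing-value imputation on hypothetical data.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

ages = pd.Series([25, 30, np.nan, 40])
print(ages.fillna(ages.mean()))      # mean imputation: missing value becomes ~31.7
print(ages.fillna(ages.median()))    # median imputation: missing value becomes 30

temps = pd.Series([70, 72, np.nan, 75])
print(temps.ffill())                 # forward fill -> 72
print(temps.bfill())                 # backward fill -> 75

# KNN imputation: fill the gap using the most similar rows
data = pd.DataFrame({"age": [25, 30, np.nan, 40], "income": [30, 35, 36, 50]})
print(KNNImputer(n_neighbors=2).fit_transform(data))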
Final Thoughts
Missing imputation is a crucial step in predictive analytics. The method you choose
depends on your data and the specific problem you’re trying to solve. Properly
handling missing data ensures that your predictive models are accurate and
reliable.
3. Healthcare
4. Retail
7. Human Resources
8. Energy
9. Telecommunications
10. Insurance
UNIT-3
Compare segmentation and regression using simple examples.
Segmentation Example
Imagine you work at a pet store, and you want to group your customers based on the types of
pets they own.
Segmentation helps you understand that different customers have different needs.
You might then send dog food coupons to the "Dog Owners" group and bird food
coupons to the "Bird Owners" group.
Regression Example
Now, imagine you want to predict how much dog food a customer will buy based
on how many dogs they have.
• Segmentation: Groups data into categories (e.g., Dog Owners vs. Cat Owners).
• Regression: Predicts a continuous outcome (e.g., predicting the amount of dog food a
customer will buy).
In summary, segmentation is about grouping data into categories, while regression is about
predicting numerical values.
Supervised learning is like learning with a teacher who gives you the
correct answers. Imagine you’re learning to recognize animals in
pictures, and someone shows you a picture of a cat and tells you, “This
is a cat.” Then, they show you another picture and say, “This is a dog.”
Over time, you learn to identify cats and dogs on your own because
you’ve been given examples with the correct labels.
Unsupervised Learning
• Example: Let’s say you have a bunch of customer data, like their
age, income, and spending habits, but no labels or categories. You
use unsupervised learning to find patterns and group similar
customers together. You might discover that there are natural
segments like “young adults with high income” and “middle-aged
with moderate spending.”
• Key Point: The key here is that the learning process is not guided
by labeled data. The model tries to find patterns, groupings, or
structures within the data without knowing the correct answers
upfront.
Summary of Differences
• Supervised Learning:
o Uses labeled data (with correct answers provided).
o Learns to predict outcomes based on these labels.
o Example: Predicting whether an email is spam based on labeled examples.
• Unsupervised Learning:
o Uses unlabeled data (no correct answers provided).
o Learns to find patterns or groupings in the data.
o Example: Grouping customers into segments based on their behavior without
predefined labels.
In simple terms, supervised learning is like learning with a teacher who tells you the correct
answers, while unsupervised learning is like figuring things out on your own by finding patterns.
Comparison of supervised and unsupervised learning:
• Model: In supervised learning it is not possible to learn larger and more complex
models than in unsupervised learning; in unsupervised learning it is possible to learn
larger and more complex models than in supervised learning.
• Training data: In supervised learning, training data is used to infer the model; in
unsupervised learning, training data is not used.
• Another name: Supervised learning is also called classification; unsupervised
learning is also called clustering.
• Example: Optical Character Recognition (supervised); finding a face in an image
(unsupervised).
1. Binary Classification:
o Binary classification involves predicting one of two possible
outcomes or classes.
o It is commonly used in situations where the result can be one of two
values, like "yes" or "no," "true" or "false."
Summary of classification types:
• Binary Classification: two possible outcomes. Examples: fraud detection, customer
churn, loan default.
• Multiclass Classification: more than two outcomes, and each instance belongs to
exactly one class. Examples: product recommendation, disease classification.
• Multilabel Classification: multiple labels can apply to one instance. Examples:
customer segmentation, email tagging.
• Imbalanced Classification: a majority vs. minority class imbalance. Examples:
anomaly detection, rare disease prediction.
Descriptive
Descriptive analytics focuses on the collection and analysis of historical data to
identify trends and patterns. For instance, tracking course completion rates
and assessment scores falls under descriptive analytics.
Predictive
Predictive analytics forecasts future outcomes based on historical data.
Statistical algorithms and machine learning techniques are able to predict
learner behavior, performance, and engagement. You might leverage this type
to predict which employees will excel in specific training modules or identify
who might be at-risk.
Diagnostic
Diagnostic analytics looks under the hood—why did something happen? This
analysis digs deeper into the data to uncover the root causes of specific
outcomes. If a particular training program has low completion rates,
diagnostic analytics can help identify underlying issues, such as content
complexity or low learner engagement.
Prescriptive analytics
Prescriptive analytics provides actionable recommendations based on other
types of insights. Learning and development leaders can then decide on a
new and better course of action to achieve desired goals and outcomes.
• The key is to stay focused and specific. For example, figure out which
employees are most likely to apply skills and which ones aren’t. To
accomplish this, you’ll need to look at how learning performance
correlates to other outcomes. You might use indicators such as quiz
scores, participation in group discussions, or completion times.
In summary, knowledge is what you know, skills are what you can do,
and competence is how well you can apply both together in real-world
situations.
Policies are the rules, guidelines, and best practices that dictate how predictive
analytics should be conducted. They are important to ensure that the analytics
process is ethical, fair, and legally compliant.
Policies are the rules or guidelines that tell us how to do things the right way
when using predictive models.
1. Data Privacy Policy
• Example: Imagine you are collecting student test scores to predict how well
students will do next year. A data privacy policy ensures you don’t share any
student’s personal information (like their names or addresses) with anyone
else without permission. It’s like locking away their private information to
keep it safe.
2. Fairness Policy
• Example: If you’re using data to predict which students might need extra help
in math, a fairness policy makes sure that the model doesn’t give different
results just because someone is a boy or girl. The model should be fair to
everyone!
3. Security Policy
• Example: You’re using a computer to store all the data about students. A
security policy makes sure you lock it with a password so that no one can
steal or misuse the data.
4. Transparency Policy
• Example: If a student or teacher asks, “How did you predict that this student will
need help?” a transparency policy ensures you can explain how the model made
that prediction, not just saying, “The computer told me so!”
• Example: You’re collecting students’ test scores from different schools. You
should write down where each score came from, when you collected it, and
any changes you made to the data (like removing errors). It’s like noting
down the details of where you got the information.
3. Performance Records
• Example: After using your model to predict student success, you should track
how accurate it was. Did it get most predictions right or wrong? Keeping a
log of the model’s performance helps you know if you need to improve it.
4. Audit Trails
• Example: Imagine the principal wants to see if anyone changed the predictions
for certain students. An audit trail is like a history log that shows who used
the model, what they did, and when. It’s like keeping a record of who touched
your notebook and what they wrote in it.
Simple Summary:
• Policies = Rules to make sure everything is done fairly, safely, and correctly
(e.g., not sharing private info, treating everyone fairly).
For example, if you are predicting sales for today, the values of sales from
yesterday (t-1), two days ago (t-2), etc. become features.
• To make the series stationary (no trend), you can difference it: y′_t = y_t − y_(t−1),
i.e. subtract the previous value from the current one.
• This differenced series serves as a feature to model trends more accurately.
Suppose you want to predict daily sales. Some possible features in your ARIMA
model might be lagged sales values (t−1, t−2, ...), the differenced sales series that
removes the trend, and past forecast errors (the moving-average terms).
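A minimal pandas sketch of building such lag and differenced features; the sales values and dates are invented:

# Building lag and differenced features for a sales series with pandas.
import pandas as pd

sales = pd.Series([100, 102, 105, 103, 108, 112],
                  index=pd.date_range("2024-01-01", periods=6, freq="D"))

features = pd.DataFrame({
    "sales":  sales,
    "lag_1":  sales.shift(1),   # yesterday's sales (t-1)
    "lag_2":  sales.shift(2),   # sales two days ago (t-2)
    "diff_1": sales.diff(1),    # y_t - y_(t-1), removes the trend
})
print(features)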
Imagine you have a graph of daily ice cream sales over a year. Ice cream
sales might go up in summer (seasonal), increase slightly every year due
to popularity (trend), and have random ups and downs on certain days
(noise). The STL approach separates these components so that we can
study each one on its own:
1. Seasonal component: It
looks for the repeating pattern, like more ice
cream sales in summer.
2. Trend component: It shows whether sales are going up or down in the
long run.
3. Remainder (Noise): This is the leftover randomness that doesn’t fit into
seasonality or trend, like one-day spikes in sales on really hot days.
What is Seasonal Decomposition?
Seasonal decomposition is a statistical technique for breaking down a time
series into its essential components, which often include the trend, seasonal
patterns, and residual (or error) components. The goal is to separate the
different sources of variation within the data to understand better and
analyze each component independently. The fundamental components are
discussed below:
STL helps us see the big picture in data with trends and seasonal patterns
and smooths out the random parts, making it easier to forecast what might
happen next.
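A short STL sketch using statsmodels on synthetic monthly "ice cream sales"; the trend, seasonal amplitude, and noise level are made up to mimic the example above:

# STL: seasonal-trend decomposition using Loess, on synthetic monthly data.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

# Four years of fake monthly sales: upward trend + summer peak + noise
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
t = np.arange(48)
rng = np.random.default_rng(0)
sales = 100 + 0.5 * t + 20 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 3, 48)
series = pd.Series(sales, index=idx)

result = STL(series, period=12).fit()
print(result.trend.head())      # long-run direction
print(result.seasonal.head())   # repeating yearly pattern
print(result.resid.head())      # leftover randomness (noise)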
1. Mean Absolute Error (MAE)
• Formula: MAE = (1/n) ∑ |y_t − ŷ_t|
• Explanation:
This is the average of the absolute differences between actual values
(y_t) and predicted values (ŷ_t).
o Pros: Easy to interpret.
o Cons: Treats all errors equally, ignoring their direction (over- or
under-prediction).
2. Mean Forecast Error (MFE)
• Formula: MFE = (1/n) ∑ (ŷ_t − y_t)
• Explanation:
MFE shows whether a model is consistently over-predicting (positive bias)
or under-predicting (negative bias).
3. Mean Squared Error (MSE)
• Formula: MSE = (1/n) ∑ (y_t − ŷ_t)²
• Explanation:
MSE calculates the average of the squared differences between actual
and forecasted values.
o Pros: Penalizes larger errors more than smaller ones (because of
squaring).
o Cons: Sensitive to outliers.
4. Root Mean Squared Error (RMSE)
• Formula: RMSE = √[ (1/n) ∑ (y_t − ŷ_t)² ]
• Explanation:
RMSE is the square root of MSE, giving the error in the same unit as the
data.
o Pros: Easier to interpret compared to MSE.
o Cons: Sensitive to outliers, like MSE.
5. Mean Absolute Percentage Error (MAPE)
• Formula: MAPE = (100/n) ∑ |(y_t − ŷ_t) / y_t|
• Explanation:
MAPE measures the percentage difference between forecasted and
actual values.
o Pros: Expresses accuracy as a percentage, making it easy to
understand.
o Cons: Fails when actual values y_t are close to zero (divides
by small numbers).
6. Symmetric Mean Absolute Percentage Error (sMAPE)
• Formula: sMAPE = (100/n) ∑ |y_t − ŷ_t| / ((|y_t| + |ŷ_t|) / 2)
• Explanation:
sMAPE corrects for MAPE's tendency to blow up when actual values
are small.
o Pros: Handles low or zero actual values better.
o Cons: Still treats large and small errors equally.
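A small NumPy sketch computing these metrics on invented actual/forecast values, following the formulas above:

# Forecast-error metrics on a tiny hypothetical example.
import numpy as np

y_true = np.array([100.0, 110.0, 120.0, 130.0])   # actual values
y_pred = np.array([102.0, 108.0, 125.0, 128.0])   # forecasts

err = y_true - y_pred
mae   = np.mean(np.abs(err))
mfe   = np.mean(y_pred - y_true)                  # positive -> over-prediction on average
mse   = np.mean(err ** 2)
rmse  = np.sqrt(mse)
mape  = np.mean(np.abs(err / y_true)) * 100
smape = np.mean(np.abs(err) / ((np.abs(y_true) + np.abs(y_pred)) / 2)) * 100

print(mae, mfe, rmse, round(mape, 2), round(smape, 2))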
• Height: Could represent the maximum value of something, like the peak
number of sales on a busy day.
• Average: The average value over time, like average sales per day, to
understand typical patterns.
• Energy: Could represent intensity or usage over time, like energy
consumption or activity level.
Each feature you choose gives your model clues to help it make accurate
predictions.
• Height: The highest sales day in each week (helps understand peak times).
• Average: Average daily sales over the past month (helps with understanding
overall demand).
• Energy/Pattern: Could represent how active sales are on weekends versus
weekdays, showing seasonality or trends.
You would organize these features in a table to make it easier for your model to
read.
Week     Peak Sales (Height)   Average Daily Sales   Weekend Boost (Energy)
Week 1   1200                  850                   High
Week 2   1100                  830                   Medium
Week 3   1300                  870                   High
This table gives your model structured data with important clues.
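A minimal pandas sketch of deriving week-level features like those in the table from raw daily sales; the numbers, column names, and the weekend-share proxy for "Energy" are illustrative assumptions:

# Turning raw daily sales into week-level features (hypothetical data).
import pandas as pd

daily = pd.DataFrame({
    "date":  pd.date_range("2024-01-01", periods=21, freq="D"),
    "sales": [900, 950, 1000, 980, 1020, 1150, 1200,    # week 1
              870, 900, 940, 910, 960, 1050, 1100,      # week 2
              920, 980, 1010, 990, 1040, 1250, 1300],   # week 3
})
daily["week"] = daily["date"].dt.isocalendar().week
daily["is_weekend"] = daily["date"].dt.dayofweek >= 5

features = daily.groupby("week").agg(
    peak_sales=("sales", "max"),            # "Height": busiest day of the week
    avg_sales=("sales", "mean"),            # "Average": typical daily demand
    weekend_share=("is_weekend", "mean"),   # rough "Energy"/weekend-activity signal
)
print(features)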
Once you have your features ready, the next step is to analyze them and find
patterns or relationships. This analysis helps the model understand which features
are most important for making predictions.
1. Identify Influences: Features help you understand what impacts the outcome.
By selecting key features (like Height, Average, Energy) and analyzing their effects,
you create a strong foundation for accurate predictions in predictive analytics. This
process turns raw data into meaningful insights and actionable predictions.
UNIT-5
1. Project Initiation:
o Define the objective of the analysis (e.g., predicting sales, customer
churn, etc.).
o Identify data sources and stakeholders.
2. Data Management:
o Data Collection: Clear steps for gathering relevant data from different
sources.
o Data Cleaning: Guidelines on how to handle missing values, outliers, or
data inconsistencies.
o Data Storage: Instructions on where and how to store datasets securely
(e.g., in databases or cloud storage).
3. Model Development:
o Model Selection: Procedures for choosing the appropriate algorithm
(e.g., regression, decision trees, etc.).
o Feature Engineering: Instructions on how to create new variables or
transform existing ones to improve model performance.
o Model Training and Testing: Guidelines on splitting data into training
and testing sets and evaluating model accuracy.
o Documentation: Keep records of why certain methods were chosen,
parameter settings, and performance metrics.
4. Version Control:
o Use tools like Git to track changes in code and models.
o Provide guidelines on how and when to update documentation to
reflect changes in the project.
5. Model Deployment:
o Instructions on how to deploy models to production environments.
o Steps to monitor model performance and ensure it remains accurate
over time.
6. Knowledge Sharing:
o Reports and Summaries: Create clear, non-technical summaries of
findings and share them with stakeholders.
o Code Repositories: Guidelines for organizing and sharing code through
platforms like GitHub or Bitbucket.
o Team Collaboration Tools: Use platforms like Confluence, Microsoft
Teams, or Slack for sharing insights and progress updates.
o Training and Onboarding: Procedures for training new team members
on the project and its documentation.
In short, SOPs in predictive analytics are like a rulebook that guides teams on how
to document and share knowledge effectively to ensure high-quality results and
smooth collaboration.
The purpose of a document explains why the document is being created. It states
the main goal or reason for writing it. For example, a document might aim to
The scope of a document outlines what will and won’t be covered. It describes
the limits or boundaries of the content. This can include details about specific
topics that will be discussed, the level of detail expected, or the aspects that won’t
be included. Clearly defining the scope prevents confusion and ensures the
document stays relevant and manageable.
In summary, the purpose is the “why,” and the scope is the “what” and “how far”
of the document.
1. Case Studies
3. White Papers
4. Technical Reports
Introduction
Urban air pollution has become a significant public health and
environmental challenge in cities worldwide. Due to rapid
industrialization, increased vehicular emissions, and other factors, urban
areas are experiencing higher concentrations of pollutants that pose risks
to both the environment and human well-being.
Problem Statement
"Our city has consistently recorded high levels of air pollution, primarily
due to vehicle emissions and industrial activities. This has resulted in an
increase in respiratory illnesses and a reduction in the overall quality of
life for residents. Our goal is to reduce air pollution levels by 40% over
the next two years through the implementation of stricter emission
regulations and the promotion of clean energy alternatives."
Objectives
The primary objective of this project is to achieve a 40% reduction in air
pollution levels in the city within two years. To reach this goal, the
project aims to:
Proposed Solutions
1. Implementing Stricter Emission Regulations
Implementation Plan
1. Timeline
Expected Outcomes
• Health Benefits: Reduction in cases of respiratory and
cardiovascular diseases.
• Environmental Improvements: Improved air quality and
healthier ecosystems.
• Economic Impact: Long-term cost savings from reduced
healthcare expenses and increased productivity.
• Community Engagement: Increased community awareness and
participation in air quality initiatives.
Conclusion
Reducing air pollution in urban areas requires a multifaceted approach
that involves policy enforcement, technological innovation, and
community collaboration. By committing to these strategies, we can
create a healthier, cleaner environment for current and future generations.