
B. V. RAJU INSTITUTE OF TECHNOLOGY


(UGC-Autonomous) Approved by AICTE & Affiliated to JNTUH, Hyderabad NAAC & NBA Accredited
Vishnupur, Narsapur, Medak Dist. – 502 313, Telangana, India.

PREDICTIVE ANALYTICS
LECTURE NOTES
DEPARTMENT OF CSE(ARTIFICIAL INTELLIGENCE & MACHINE LEARNING)

B. Lavanya


UNIT-1&2
Introduction to Predictive Analytics & Linear Regression
What is Analytics?
Analytics is the process of identifying, explaining, and sharing important trends in
data. Simply said, analytics enables us to notice information and insights that would
otherwise go undetected.
Analytics is a journey that involves a combination of skills, advanced technologies, applications, and processes used by firms to gain business insights from data and statistics. This is done to support business planning.

What is Predictive Analytics: Predictive analytics is the use of statistics and modeling techniques to determine future performance based on current and historical data.

Why Predictive Analytics?: Predictive analytics helps businesses detect and prevent fraud and avoid financial losses. For example, many credit card companies have fraud detection built into their machine learning algorithms. With machine learning, the algorithm learns your buying patterns and can identify suspicious activity. Predictive analytics is mainly used in situations such as:
• When making decisions, people frequently deal with issues like: What should
be the proper pricing for a product? Which client is most likely to miss a loan
payment? Which goods ought to be suggested to a repeat customer? Finding
the correct answers to these questions may be difficult yet satisfying.

• In many corporate areas, predictive analytics is becoming a competitive strategy that may distinguish high-performing businesses. In order to operate
a business effectively, it seeks to anticipate the likelihood of a future event,
such as customer attrition, loan defaults, and stock market swings.

• Predictive analytics problems are typically solved using models such as multiple linear regression, logistic regression, auto-regressive integrated moving average (ARIMA), decision trees, and neural networks. We can use regression models to identify the connections between these variables and take advantage of those connections when making decisions.

Places where Analytics is used: Predictive analytics can offer insight to companies across industries and even for public safety. For example, local weather
forecasts run on predictive analytic technology. Let’s examine how big data and
machine learning are changing the landscape of industries like the automotive
industry, financial services, manufacturing, health care, marketing and retail, and the
oil, gas, and utilities industries.

Automotive industry:
Predictive analytics and other forms of AI pave the way for self-driving vehicles
by predicting what will happen in the immediate future while driving a car down the
road. This process needs to happen continuously when a vehicle is in motion,
drawing information from multiple sensors and making judgment calls about which
potential actions would be a safety risk.

Beyond autonomous vehicles, manufacturers and retailers can also use predictive
analytics to their benefit. For example, predictive analytics helps factories create
vehicles faster using fewer resources.

For example:

Tesla uses predictive analytics in the form of neural network accelerators for their
self-driving vehicles. A neural network model simulates how human brains use
information to make decisions.

Financial services and risk reduction:

When you receive an alert to suspicious activity in your bank account, you can thank
predictive analytics for determining that something doesn’t seem right based on
deviations from your routine, such as a transaction in a different city. Financial
institutions and other companies use predictive analytics to reduce credit risk,
combat fraud, predict future cash flow, analyze insurance coverage, and look for new business opportunities. Companies use predictive analytics to determine how likely a person or company is to pay their debts or default on their obligations.

For example:

Ocrolus is a platform that businesses can use when determining someone’s credit eligibility. Ocrolus uses AI and ML to offer a more reliable solution for examining documents and avoiding fraud.

Manufacturing and industrial automation:


In a manufacturing setting, predictive analytics can anticipate significant equipment
failures, which can be expensive and potentially dangerous to employees. By
analyzing past equipment failures, this form of AI can determine imminent failure
and notify an employee when conditions start to look dangerous. Similar predictive
analytics methods can watch out for situations that pose a risk to employee health
and safety, reducing workplace injury and potentially boosting employee morale as
well.

For example:

In 2020, Ford Motor Company used predictive analytics to anticipate maintenance needs in its Valencia, Spain factory. By repairing equipment before it broke down and caused unplanned downtime, Ford saved more than $1 million.

Health care industry:


Predictive analytics benefits the health care industry by predicting when chronic or dangerous conditions may occur. Patients with asthma or COPD can use a wearable
predictive analytics device to spot changes in their breathing patterns that could
signal a problem. Similarly, a wearable device could detect allergic reactions as they
occur and automatically give the patient epinephrine in response.

For example:

Northern Light Health hospital system in Maine implemented predictive analytics in
the face of the COVID-19 pandemic when it became more critical to anticipate future
needs and maintain situational awareness. They built a data analytic system that
could forecast their census, or patient population, in four-, eight-, and 12-hour blocks
of time, in addition to a range of other functionalities. The result is that patients have better outcomes and receive care faster.

Marketing and retail:


Marketing professionals use predictive analytics in many different ways: to tailor
marketing to specific segments of their target audience, for seasonal sales
forecasting, to improve customer relations, and to further engage customers. For
example, a company might use predictive analytics to power a recommendation
engine that suggests new products to customers based on products they’ve already
viewed or purchased. Previous customer behavior can also help predict how
customers progress through the sales funnel. This insight can help you place targeted
touchpoints to engage proactively with customers.

For example:

Subway used predictive analytics to decide whether to raise the price of their $5
Footlong sandwich. Their data showed that the low price point wasn’t causing them
to sell enough sandwiches to make up for a bump in price. Using a predictive
analytics program offered by Mastercard, Subway learned that customers purchasing
Footlong sandwiches added additional items to their orders, such as a side of chips
or a drink.

Oil, gas, and utilities:


When it comes to oil, gas, and utilities, we can use predictive analytics to forecast
demand for energy based on historical use and seasonal events like weather patterns.
Likewise, utility companies can predict how prices will likely fluctuate over time.

Similar to the manufacturing industry, utility companies can use predictive analytics
to watch out for equipment failures and safety concerns. Due to the potentially
catastrophic nature of equipment failure and malfunction in the utility industry, it’s
vital for companies to invest in predictive analytics to keep things running as
smoothly as possible.

This technology can also create more reliable and safe conditions for workers in
potentially dangerous energy production facilities.

For example:

Exxon Mobil uses predictive analytics to power autonomous drilling stations in Guyana. Using AI and machine learning, Exxon predicts the ideal conditions for
underwater drilling and enables a closed-loop automation system to minimize the
need for personnel to intervene.

Various Steps Involved in Predictive Analytics:

1.Defining the question:

The first step in any data analysis process is to define your objective. In data
analytics jargon, this is sometimes called the ‘problem statement’.

Defining your objective means coming up with a hypothesis and figuring out how to test
it. Start by asking: What business problem am I trying to solve? While this might
sound straightforward, it can be trickier than it seems. For instance, your
organization’s senior management might pose an issue, such as: “Why are we losing
customers?” It’s possible, though, that this doesn’t get to the core of the problem. A data analyst’s job is to understand the business and its goals in enough depth that they can frame the problem the right way.

2.Collecting the data : Once you’ve established your objective, you’ll need to create
a strategy for collecting and aggregating the appropriate data. A key part of this is
determining which data you need. This might be quantitative (numeric) data, e.g.
sales figures, or qualitative (descriptive) data, such as customer reviews. All data fit into one of three categories: first-party, second-party, and third-party data.

3. Cleaning the data : Once you’ve collected your data, the next step is to get it
ready for analysis. This means cleaning, or ‘scrubbing’ it, and is crucial in making
sure that you’re working with high-quality data. Key data cleaning tasks include:

• Removing major errors, duplicates, and outliers—all of which are inevitable problems when aggregating data from numerous sources.
• Removing unwanted data points—extracting irrelevant observations that
have no bearing on your intended analysis.
• Bringing structure to your data—general ‘housekeeping’, i.e. fixing typos or
layout issues, which will help you map and manipulate your data more
easily.
• Filling in major gaps—as you’re tidying up, you might notice that important
data are missing. Once you’ve identified gaps, you can go about filling them.
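As a rough illustration of these cleaning tasks, here is a minimal Python sketch using pandas; the column names (customer_id, age, city) and the sample values are hypothetical and only serve to show the typical calls.

```python
import pandas as pd
import numpy as np

# Hypothetical raw data containing a duplicate row, an implausible outlier,
# inconsistent text formatting, and a missing value
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age":         [25, 31, 31, 240, np.nan],          # 240 is an outlier
    "city":        ["Hyderabad", "hyderabad ", "hyderabad ", "Chennai", "Delhi"],
})

df = raw.drop_duplicates()                              # remove duplicate rows
df = df[df["age"].between(0, 120) | df["age"].isna()]   # drop impossible outliers
df["city"] = df["city"].str.strip().str.title()         # fix typos/layout issues
df["age"] = df["age"].fillna(df["age"].median())        # fill the remaining gap
print(df)
```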

4. Analyzing the data: Finally, you’ve cleaned your data. Now comes the fun bit: analyzing it! The type of data analysis you carry out largely depends on what your goal is, but there are many techniques available; time-series analysis and regression analysis are just a few you might have heard of. Descriptive analysis identifies what has already happened, diagnostic analytics focuses on understanding why something has happened, predictive analysis allows you to identify future trends based on historical data, and prescriptive analysis allows you to make recommendations for the future. More important than the different types, though, is how you apply them. This depends on what insights you’re hoping to gain.

5. Sharing your results: You’ve finished carrying out your analyses. You have your
insights. The final step of the data analytics process is to share these insights with
the wider world (or at least with your organization’s stakeholders!). This is more complex than simply sharing the raw results of your work; it involves interpreting the outcomes and presenting them in a manner that’s digestible for all types of
audiences. Since you’ll often present information to decision-makers, it’s very
important that the insights you present are 100% clear and unambiguous. For this
reason, data analysts commonly use reports, dashboards, and interactive
visualizations to support their findings.
Various Analytics techniques are:
There are four different types of analytics techniques:

1. Descriptive analytics: Focuses on understanding historical data and what happened.
2. Diagnostic analytics: Aims to identify why something happened by
analyzing patterns and relationships.
3. Predictive analytics: Uses statistical models to predict future outcomes
based on historical data.
4. Prescriptive analytics: Recommends actions to optimize outcomes based
on predictions and business objectives.

Application of Modelling in Business:
• The application of predictive modeling in business can be termed business analytics.
• Business analytics involves the collating, sorting, processing, and studying of
business-related data using statistical models and iterative methodologies. The
goal of BA is to narrow down which datasets are useful and which can increase
revenue, productivity, and efficiency.
• Business analytics (BA) is the combination of skills, technologies, and
practices used to examine an organization's data and performance as a way to
gain insights and make data-driven decisions in the future using statistical
analysis.
• Although business analytics is being leveraged in most commercial sectors and industries, the following applications are the most common.

1. Credit Card Companies


Credit and debit cards are an everyday part of consumer spending, and they are an
ideal way of gathering information about a purchaser’s spending habits, financial
situation, behavior trends, demographics, and lifestyle preferences.
2. Customer Relationship Management (CRM)
Excellent customer relations is critical for any company that wants to retain customer
loyalty to stay in business for the long haul. CRM systems analyze important
performance indicators such as demographics, buying patterns, socio-economic
information, and lifestyle.
3. Finance
The financial world is a volatile place, and business analytics helps to extract insights
that help organizations maneuver their way through tricky terrain. Corporations turn
to business analysts to optimize budgeting, banking, financial planning, forecasting,
and portfolio management.
4. Human Resources
Business analysts help the process by poring over data that characterizes high-performing candidates, such as educational background, attrition rate, the average
length of employment, etc. By working with this information, business analysts help
HR by forecasting the best fits between the company and candidates.

5. Manufacturing
Business analysts work with data to help stakeholders understand the things that
affect operations and the bottom line. Identifying things like equipment downtime, inventory levels, and maintenance costs helps companies streamline inventory management, risk management, and supply-chain management to create maximum efficiency.
6. Marketing
Business analysts help answer these and many other marketing questions by measuring marketing and advertising metrics, identifying consumer behavior and the target
audience, and analyzing market trends.
➢ A statistical model embodies (represents) a set of assumptions concerning the
generation of the observed data, and similar data from a larger population.
➢ A model represents, often in considerably idealized form, the data-generating
process.
➢ Example: Speech and signal processing is an enabling technology that
encompasses the fundamental theory, applications, algorithms, and
implementations of processing or transferring information contained in many
different physical, symbolic, or abstract formats broadly designated as signals.
➢ It uses mathematical, statistical, computational, heuristic, and linguistic
representations, formalisms, and techniques for representation, modeling,
analysis, synthesis, discovery, recovery, sensing, acquisition, extraction,
learning, security, or forensics.
➢ In manufacturing, statistical models are used to define warranty policies, solve various conveyor-related issues, perform statistical process control, etc.
Databases & Type of data and variables:
A Database Management System (DBMS) is a software system that is designed to
manage and organize data in a structured manner. It allows users to create, modify,
and query a database, as well as manage the security and access controls for that
database.
A data dictionary is an integral component of a Database Management System (DBMS) that is required to determine its structure; it may also take the form of a piece of middleware that extends or supplants the native data dictionary of a DBMS.
Relational Database Management System (RDBMS): a software system used to maintain relational databases. Many relational database systems have the option of using SQL to define and query data. RDBMSs differ in aspects such as the format of tables and files, the number of users supported, and size.
NoSQL Database:
A NoSQL database is a non-relational data management system that does not require a fixed schema (logical, physical, view). It avoids joins and is easy to scale. The major purpose of using a NoSQL database is for distributed data stores with humongous data storage needs. NoSQL is used for big data and real-time web apps. For example, companies like Twitter, Facebook and Google collect terabytes of user data every single day. NoSQL stands for “Not Only SQL” or “Not SQL.”

Traditional RDBMSs use SQL syntax to store and retrieve data for further insights. In contrast, a NoSQL database system encompasses a wide range of database technologies that can store structured, semi-structured, unstructured and polymorphic data.
Why NoSQL?
The concept of NoSQL databases became popular with Internet giants like Google,
Facebook, Amazon, etc. who deal with huge volumes of data. The system response
time becomes slow when you use RDBMS for massive volumes of data. To resolve
this problem, we could “scale up” our systems by upgrading our existing hardware.
This process is expensive. The alternative for this issue is to distribute database load
on multiple hosts whenever the load increases. This method is known as “scaling out.”
• Examples
– Relational Databases (SQL) : Oracle, MySQL, SQL Server
– Non-relational Databases (NoSQL): MongoDB, CouchDB, BigTable
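To make the contrast concrete, here is a small, hedged sketch in Python: the relational side uses the built-in sqlite3 module, while the equivalent document-store query is shown only as a comment in MongoDB's query syntax, since running it would need a MongoDB server and driver. The table and field names are illustrative assumptions.

```python
import sqlite3

# Relational (SQL) style: fixed schema, declarative query
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("Asha", 28), ("Ravi", 35), ("Meena", 41)])
rows = conn.execute("SELECT name FROM users WHERE age > 30").fetchall()
print(rows)   # [('Ravi',), ('Meena',)]

# NoSQL (document) style, for comparison only -- requires a running MongoDB:
#   db.users.insert_many([{"name": "Asha", "age": 28}, ...])
#   db.users.find({"age": {"$gt": 30}})   # flexible JSON-like documents, no fixed schema
```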

Type of data and variables:


Data consist of individuals and variables that give us information about those individuals. An individual can be an object or a person. A dataset refers to a set of values, usually organized by variables and observational units. A variable is an attribute, such as a measurement or a label.
Data can be categorized on various parameters. Broadly, data is of two types – numeric and character. Numeric data can be further divided into two subgroups – discrete and continuous.

Data can also be divided into two categories – nominal and ordinal. Based on usage, data is divided into two categories – quantitative (discrete and continuous) and qualitative (nominal and ordinal).
Manufacturing industries also have their data divided into the groups discussed above. For example, production quantity is a discrete quantity, while production rate is continuous data. Similarly, quality parameters can be given ratings, which are ordinal data (each value holds a rank), whereas nominal data (e.g., a name or a position label) carries no rank. Binary data can be either nominal or ordinal; it has only two options, e.g., 0/1, fail/pass, yes/no.

Missing imputations: Imputation is the process of replacing missing data with substituted values.
• Missing data can lead to bias and/or a lack of precision in statistical analysis.
• Values may be missing because an observed value was lost, the value was never observed, or there was a complete system fault.
There are two broad approaches to handling gaps in data:
1. Missing data imputation
2. Model-based techniques

Types of missing data: Missing data can be classified into one of three categories.

1.MCAR: Data which is Missing Completely at Random has nothing systematic
about which observations are missing values. There is no relationship between
missingness and either observed or unobserved covariates.
2. MAR: Missing at Random is weaker than MCAR. The missingness is still random,
but due entirely to observed variables. For example, those from a lower
socioeconomic status may be less willing to provide salary information (but we
know their SES status). The key is that the missingness is not due to the values which
are not observed. MCAR implies MAR but not vice-versa.
3. MNAR : If the data are Missing Not At Random, then the missingness depends on
the values of the missing data. Censored data falls into this category. For example,
individuals who are heavier are less likely to report their weight. Another example,
the device measuring some response can only measure values above .5. Anything
below that is missing.
Predictive modeling: also known as predictive analytics, is a mathematical
technique that combines AI and machine learning with historical data
to accurately predict future outcomes. By building mathematical models that take
relevant input variables and generate predicted output variables, businesses can
make real-time decisions based on these almost instantaneous calculations. The
primary purpose of predictive analytics is to make predictions about outcomes,
trends, or events based on patterns and insights from historical data. Organizations
use predictive analytics to gain a competitive advantage by understanding past
behavior and predicting future behavior.

Treatment of Missing Values:


➢ 1. Ignore the tuple: This is usually done when the class label is missing
(assuming the mining task involves classification). This method is not very
effective, unless the tuple contains several attributes with missing values. It is
especially poor when the percentage of missing values per attribute varies
considerably.
➢ 2. Fill in the missing value manually: In general, this approach is time-
consuming and may not be feasible given a large data set with many missing
values.

➢ 3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like “Unknown” or -∞. If missing values are replaced by, say, “Unknown,” then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common – that of “Unknown.” Hence, although this method is simple, it is not foolproof.
➢ 4. Use the attribute mean to fill in the missing value: Compute the average value of that particular attribute and use this value to replace the missing values in that attribute column.
➢ 5. Use the attribute mean for all samples belonging to the same class as the
given tuple:
➢ For example: if classifying customers according to credit risk, replace the
missing value with the average income value for customers in the same credit
risk category as that of the given tuple.
➢ 6. Use the most probable value to fill in the missing value: This may be
determined with regression, inference-based tools using a Bayesian
formalism, or decision tree induction. For example, using the other customer
attributes in your data set, you may construct a decision tree to predict the
missing values for income.
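As a minimal pandas sketch of options 3, 4 and 5 above, assuming a hypothetical dataset with an income attribute and a credit_risk class label:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "high"],
    "income":      [50000, np.nan, 30000, np.nan, 28000],
})

# Option 3: fill with a global constant
df["income_const"] = df["income"].fillna(-1)

# Option 4: fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Option 5: fill with the attribute mean of samples in the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)
print(df)
```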

Data Modelling Techniques:


Some common predictive modeling techniques include:

1. Regression

Regression tasks help to predict outcomes based on continuous values. It’s a supervised ML approach that uses one or more independent variables to predict
target values – assuming that there is some sort of relationship that can be inferred
between data inputs and outputs.

A common example of a regression task is predicting housing prices based on factors like location, number of rooms, square footage, and the year a home was
built. But regression tasks are also helpful for score estimates, risk assessments,
weather forecasting, and market predictions.

Simple Regression :
Used to predict a continuous dependent variable based on a single independent
variable. Simple linear regression should be used when there is only a single
independent variable.
Multiple Regression: Used to predict a continuous dependent variable based on
multiple independent variables.
Multiple linear regression should be used when there are multiple independent
variables.

Linear Regression: Linear regression is one of the simplest and most widely
used statistical models. This assumes that there is a linear relationship between the
independent and dependent variables. This means that the change in the dependent
variable is proportional to the change in the independent variables.
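As a minimal sketch, a linear regression can be fitted in Python with scikit-learn (assuming the library is available); the tiny square-footage-versus-price dataset below is made up purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: square footage (X) vs. house price (y)
X = np.array([[800], [1000], [1200], [1500], [1800]])
y = np.array([120000, 150000, 178000, 220000, 265000])

model = LinearRegression().fit(X, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("predicted price for 1300 sq ft:", model.predict([[1300]])[0])
```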

Nonlinear Regression:
Used when the relationship between the dependent variable and independent variable(s) follows a nonlinear pattern. It provides flexibility in modeling a wide range of functional forms.

What are dependent and independent variables?


In predictive modeling and statistics, dependent and independent variables are key
concepts.
Dependent Variable: The dependent variable is the main factor or outcome that
you’re interested in predicting or understanding. It’s often denoted as “Y” in
mathematical equations. In a study or experiment, the dependent variable is the
variable that is measured or observed. For example, in a study looking at the effect
of studying time on test scores, the test scores would be the dependent variable
because they depend on the amount of time spent studying.
Independent Variable: Independent variables are the factors or variables that are
manipulated or controlled in a study. They are used to predict or explain changes in
the dependent variable. Independent variables are often denoted as “X” in
mathematical equations. In the study mentioned earlier, the independent variable
would be the amount of time spent studying, as this is the variable that is being
manipulated to see its effect on test scores.

Polynomial Regression: is a form of linear regression in which the relationship between the independent variable x and dependent variable y is modelled as an nth-
degree polynomial. Polynomial regression fits a nonlinear relationship between the
value of x and the corresponding conditional mean of y, denoted E(y | x).
Polynomial regression is a type of regression analysis used in statistics and
machine learning when the relationship between the independent variable (input)
and the dependent variable (output) is not linear. While simple linear regression
models the relationship as a straight line, polynomial regression allows for more
flexibility by fitting a polynomial equation to the data.
When the relationship between the variables is better represented by a curve rather
than a straight line, polynomial regression can capture the non-linear patterns in the
data.
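One common way to implement polynomial regression is to expand the input into polynomial features and then fit an ordinary linear model on them. A minimal scikit-learn sketch on synthetic, roughly quadratic data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic data following a roughly quadratic curve plus noise
X = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 2 * X.ravel() ** 2 - X.ravel() + 1 + np.random.normal(0, 1, 30)

# Degree-2 polynomial regression = polynomial feature expansion + linear fit
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[2.0]]))   # estimate of E(y | x = 2)
```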

logistic regression: Logistic regression is a supervised machine learning
algorithm used for classification tasks where the goal is to predict the probability
that an instance belongs to a given class or not. Logistic regression is a statistical algorithm which analyzes the relationship between two data factors.
• Logistic regression predicts the output of a categorical dependent
variable. Therefore, the outcome must be a categorical or discrete value.
• It can be either Yes or No, 0 or 1, true or False, etc. but instead of giving
the exact value as 0 and 1, it gives the probabilistic values which lie
between 0 and 1.
• In Logistic regression, instead of fitting a regression line, we fit an “S”
shaped logistic function, which predicts two maximum values (0 or 1).
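A minimal scikit-learn sketch of logistic regression for a yes/no outcome; the hours-studied versus pass/fail data is invented for illustration, and predict_proba returns the probabilistic values between 0 and 1 mentioned above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied -> pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict([[4.5]]))          # predicted class (0 or 1)
print(clf.predict_proba([[4.5]]))    # probabilities between 0 and 1
```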

2. Clustering Technique: Clustering models group data points based on similarities, providing insights into the inherent structure within the dataset.

Example: Customer segmentation in marketing uses clustering to identify groups with similar purchasing behaviours for targeted campaigns. Types of clustering techniques are:

K-means clustering, hierarchical clustering, and density-based clustering.

K-means clustering: assigns data points to one of K clusters depending on their distance from the centers of the clusters. It starts by randomly placing the cluster centroids in the space; then each data point is assigned to one of the clusters based on its distance from that cluster's centroid.
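A minimal K-means sketch with scikit-learn, using a handful of made-up 2-D points that form two loose groups:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two loose groups
X = np.array([[1, 2], [1.5, 1.8], [1, 0.6],
              [8, 8], [9, 9], [8.5, 9.5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # final centroid positions
```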

Hierarchical clustering : is a connectivity-based clustering model that groups the
data points together that are close to each other based on the measure of similarity
or distance. The assumption is that data points that are close to each other are more
similar or related than data points that are farther apart.
A dendrogram, a tree-like figure produced by hierarchical clustering, depicts the
hierarchical relationships between groups. Individual data points are located at the
bottom of the dendrogram, while the largest clusters, which include all the data
points, are located at the top. In order to generate different numbers of clusters, the
dendrogram can be sliced at various heights.
The dendrogram is created by iteratively merging or splitting clusters based on a
measure of similarity or distance between data points. Clusters are divided or merged
repeatedly until all data points are contained within a single cluster, or until the
predetermined number of clusters is attained.
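A minimal hierarchical clustering sketch using SciPy, where linkage builds the merge tree and fcluster cuts it into a chosen number of clusters; the points are made up, and drawing the dendrogram itself would additionally require matplotlib.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [1.5, 1.8], [1, 0.6],
              [8, 8], [9, 9], [8.5, 9.5]])

Z = linkage(X, method="ward")                    # iteratively merge the closest clusters
labels = fcluster(Z, t=2, criterion="maxclust")  # slice the tree into 2 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree if matplotlib is installed
```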

Density-Based Clustering: refers to one of the most popular unsupervised learning
methodologies used in model building and machine learning algorithms. The data
points in the region separated by two clusters of low point density are considered as
noise. The surroundings with a radius ε of a given object are known as the ε
neighborhood of the object. If the ε neighborhood of the object comprises at least a
minimum number, MinPts of objects, then it is called a core object.
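A minimal sketch of density-based clustering with scikit-learn's DBSCAN, in which eps plays the role of the radius ε and min_samples plays the role of MinPts; the points are made up, and the isolated point is labelled as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [1.2, 1.9], [0.8, 2.1],
              [8, 8], [8.2, 8.1], [25, 80]])   # the last point is isolated

# eps is the neighbourhood radius, min_samples corresponds to MinPts
db = DBSCAN(eps=1.0, min_samples=2).fit(X)
print(db.labels_)   # points labelled -1 are treated as noise
```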

3. Decision tree: a type of supervised learning algorithm that is commonly used in machine learning to model and predict outcomes based on input data. It is a tree-like structure where each internal node tests an attribute, each branch corresponds to an attribute value, and each leaf node represents the final decision or prediction. Decision trees can be used to solve both regression and classification problems.

• Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further
after getting a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing the unwanted branches from the tree.

• Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes
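A minimal decision tree sketch with scikit-learn on the built-in Iris dataset; export_text prints the learned splits so the root node, internal attribute tests, and leaves described above become visible.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Shallow tree: each internal node tests an attribute, each leaf gives a class
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))
print(tree.predict(X[:3]))
```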

4.Neural network : is a machine learning program, or model, that makes
decisions in a manner similar to the human brain, by using processes that mimic
the way biological neurons work together to identify phenomena, weigh options
and arrive at conclusions.

Types of neural networks:

Multi-layer Perceptron
A multi-layer perceptron is also known as an MLP. It consists of fully connected dense layers, which transform any input dimension to the desired dimension. A multi-layer perceptron is a neural network that has multiple layers. To create a neural network we combine neurons together so that the outputs of some neurons are inputs of other neurons.
A multi-layer perceptron has one input layer with one neuron (or node) for each input, one output layer with a single node for each output, and any number of hidden layers, where each hidden layer can have any number of nodes. A schematic diagram of a Multi-Layer Perceptron (MLP) is depicted below.

In the multi-layer perceptron diagram above, there are three inputs and thus three input nodes, and the hidden layer has three nodes. The output layer gives two outputs, therefore there are two output nodes. The nodes in the input layer take input and forward it for further processing: in the diagram above, the nodes in the input layer forward their output to each of the three nodes in the hidden layer, and in the same way, the hidden layer processes the information and passes it to the output layer.

Every node in the multi-layer perceptron uses a sigmoid activation function. The sigmoid activation function takes real values as input and converts them to numbers between 0 and 1 using the sigmoid formula.
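A minimal sketch of a multi-layer perceptron with scikit-learn, using one hidden layer of three nodes and the logistic (sigmoid) activation described above; the classification data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# One hidden layer with 3 nodes, sigmoid ("logistic") activation
mlp = MLPClassifier(hidden_layer_sizes=(3,), activation="logistic",
                    max_iter=2000, random_state=0).fit(X, y)
print(mlp.predict(X[:5]))
```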

Convolutional Neural Network (CNN): is a type of Deep Learning neural network architecture commonly used in Computer Vision. Computer vision is a field of
Artificial Intelligence that enables a computer to understand and interpret the image
or visual data.
A Convolutional Neural Network (CNN) is an extended version of the artificial neural network (ANN) that is predominantly used to extract features from grid-like matrix datasets, for example visual datasets like images or videos where data patterns play an extensive role.

The human brain processes a huge amount of information the second we see an
image. Each neuron works in its own receptive field and is connected to other
neurons in a way that they cover the entire visual field. Just as each neuron responds
to stimuli only in the restricted region of the visual field called the receptive field in
the biological vision system, each neuron in a CNN processes data only in its
receptive field as well. The layers are arranged in such a way so that they detect
simpler patterns first (lines, curves, etc.) and more complex patterns (faces, objects,
etc.) further along. By using a CNN, one can enable sight to computers.
Recurrent Neural Network (RNN): a type of neural network where the output from the previous step is fed as input to the current step. In traditional neural networks, all the inputs and outputs are independent of each other. Still, in cases
when it is required to predict the next word of a sentence, the previous words are
required and hence there is a need to remember the previous words. Thus, RNN
came into existence, which solved this issue with the help of a Hidden Layer. The
main and most important feature of RNN is its Hidden state, which remembers
some information about a sequence. The state is also referred to as Memory
State since it remembers the previous input to the network. It uses the same
parameters for each input as it performs the same task on all the inputs or hidden
layers to produce the output. This reduces the complexity of parameters, unlike other
neural networks.

5. Classification models: are used to classify data into one or more categories based
on one or more input variables. Classification models identify the relationship
between the input variables and the output variable, and use that relationship to
accurately classify new data into the appropriate category. Classification models are
commonly used in fields like marketing, healthcare, and computer vision, to classify
data such as spam emails, medical diagnoses, and image recognition.

6.Time series models: are used to analyze and forecast data that varies over time.
Time series models help you identify patterns and trends in the data and use that
information to make predictions about future values. Time series models are used
in a wide variety of fields, including financial analytics, economics, and weather
forecasting, to predict outcomes such as stock prices, GDP growth, and
temperatures.
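A minimal time series forecasting sketch using an ARIMA model from statsmodels (assuming the library is installed); the monthly sales series is synthetic and the (p, d, q) order is chosen only for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic monthly sales series with a gentle upward trend
idx = pd.date_range("2022-01-01", periods=36, freq="MS")
sales = pd.Series(100 + np.arange(36) * 2 + np.random.normal(0, 3, 36), index=idx)

model = ARIMA(sales, order=(1, 1, 1))   # (p, d, q): AR, differencing, MA terms
fit = model.fit()
print(fit.forecast(steps=3))            # forecast the next three months
```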

7.Feed Forward Neural Networks

Feed Forward Neural Networks (FFNNs) are foundational in neural network architecture, particularly in applications where traditional machine learning
algorithms face limitations.

They facilitate tasks such as simple classification, face recognition, computer
vision, and speech recognition through their uni-directional flow of data.


Structure
FFNNs consist of input and output layers with optional hidden layers
in between. Input data travels through the network from input nodes,
passing through hidden layers (if present), and culminating in output
nodes.
• Activation and Propagation
These networks operate via forward propagation, where data moves in
one direction without feedback loops. Activation functions like step
functions determine whether neurons fire based on weighted inputs.
For instance, a neuron may output 1 if its input exceeds a threshold
(usually 0), and -1 if it falls below.
FFNNs are efficient for handling noisy data and are relatively straightforward to
implement, making them versatile tools in various AI applications.

Advantages of Feed Forward Neural Networks

1. Less complex, easy to design & maintain


2. Fast and speedy [One-way propagation]
3. Highly responsive to noisy data
Disadvantages of Feed Forward Neural Networks:

1. Cannot be used for deep learning [due to absence of dense layers and back propagation]

8.Multilayer Perceptron

The Multi-Layer Perceptron (MLP) represents an entry point into complex neural
networks, designed to handle sophisticated tasks in various domains such as:
➢ Speech recognition
➢ Machine translation
➢ Complex classification tasks
MLPs are characterized by their multilayered structure, where input data traverses
through interconnected layers of artificial neurons.

This architecture includes input and output layers alongside multiple hidden layers,
typically three or more, forming a fully connected neural network.

Operation:


Bidirectional Propagation:
Utilizes forward propagation (for computing outputs) and backward
propagation (for adjusting weights based on error).
• Weight Adjustment:
During backpropagation, weights are optimized to minimize prediction
errors by comparing predicted outputs against actual training inputs.
• Activation Functions :
Nonlinear functions are applied to the weighted inputs of neurons,
enhancing the network’s capacity to model complex relationships. The
output layer often uses softmax activation for multi-class classification
tasks.
Advantages of Multi-Layer Perceptron

1. Used for deep learning [due to the presence of dense fully connected
layers and back propagation]

Disadvantages of Multi-Layer Perceptron:

1.Comparatively complex to design and maintain

9.Modular Neural Network

Applications of Modular Neural Network

1. Stock market prediction systems
2. Adaptive MNN for character recognition
3. Compression of high-level input data
A modular neural network has a number of different networks that function
independently and perform sub-tasks. The different networks do not really interact
with or signal each other during the computation process. They work independently
towards achieving the output.

As a result, a large and complex computational process can be completed significantly faster by breaking it down into independent components. The computation speed increases because the networks are not interacting with or even connected to each other.
Advantages of Modular Neural Network

1. Efficient
2. Independent training
3. Robustness
Disadvantages of Modular Neural Network

1. Moving target problem


10. Sequence to sequence models:

A sequence-to-sequence model consists of two Recurrent Neural Networks: an encoder that processes the input and a decoder that produces the output. The encoder and decoder work in tandem – either sharing the same parameters or using different ones. In contrast to a plain RNN, this model is particularly applicable in cases where the length of the input sequence differs from the length of the output sequence. While these models share the benefits and limitations of RNNs, they are applied mainly in chatbots, machine translation, and question answering systems.

NEED FOR BUSINESS MODELLING:

Business process modeling (BPM) is the analytical or graphical representation and illustration of the business processes or workflow of an organization. This is
mostly in the form of a flowchart developed to visualize the various business
approaches and information dissemination.
The Business Process Management Initiative (BPMI) developed the business
process modeling technique to depict business processes simply and easily for
stakeholders, business partners, developers, managers, and others to understand.
Process modeling defines the methods by which business is carried out for the sole purpose of achieving the set goals of an organization. Business modeling tools assist organizations in improving business operations and managing workflow.
You can implement BPM in different areas such as marketing, accounting, sales,
technical support, manufacturing, and others.
Why we need:
Business process modeling (BPM) is essential for understanding your business
processes and maximizing positive outcomes from them.
Organizations or enterprises that use business process modeling enjoy objective
business intelligence which they can use to make data-driven decisions to improve
business processes, allocate resources, and create workable business strategies.
BPM provides a clear view of business processes which helps teams to properly
track their workflows and ensure they achieve the desired results. This helps to
reduce operating costs, and secure stronger business outcomes.
There are many benefits to business process modeling:

• Gives everyone a clear understanding of how the process works


• Provides consistency and controls the process
• Identifies and eliminates redundancies and inefficiencies
• Sets a clear starting and ending to the process

Business process modeling can also help you group similar processes together and
anticipate how they should operate. The primary objective of business process modeling tools is to analyze how things are right now and simulate how they should be carried out to achieve better results.

BLUE Property Assumptions:

In linear regression, BLUE stands for Best Linear Unbiased Estimator: under the Gauss-Markov assumptions, the ordinary least squares estimator is the best (minimum-variance) linear unbiased estimator of the regression coefficients. Related properties and assumptions in data analysis include:

1. Assumptions in Statistical Methods: Many statistical methods have assumptions about the data. For example, linear regression assumes linearity,
independence, homoscedasticity (constant variance of errors), and normality
of residuals. These assumptions are crucial for the validity of the results.
2. Data Quality Properties: This could refer to various characteristics of data,
such as accuracy, completeness, consistency, and reliability. Ensuring data
quality is essential for meaningful analysis.
3. Assumption of Data Distribution: In some analyses, assumptions are made
about the distribution of the data (e.g., assuming data follows a normal
distribution in parametric tests).
4. Data Privacy and Security: In some contexts, properties related to data
privacy, security, and ethical considerations might be important.

Least Square Method: In statistics, when we have data in the form of data points that
can be represented on a cartesian plane by taking one of the variables as the
independent variable represented as the x-coordinate and the other one as the
dependent variable represented as the y-coordinate, it is called scatter data. This data
might not be useful in making interpretations or predicting the values of the dependent
variable for the independent variable where it is initially unknown. So, we try to get
an equation of a line that fits best to the given data points with the help of the Least
Square Method.
In this section, we will learn the least squares method, its formula, and its graph.

What is the Least Square Method?


The Least Squares Method is used to derive a generalized linear equation between two
variables, one of which is independent and the other dependent on the former. The
value of the independent variable is represented as the x-coordinate and that of the
dependent variable is represented as the y-coordinate in a 2D cartesian coordinate
system. Initially, known values are marked on a plot. The plot obtained at this point
is called a scatter plot. Then, we try to represent all the marked points as a straight
line or a linear equation. The equation of such a line is obtained with the help of the
least squares method. This is done to get the value of the dependent variable for an
independent variable for which the value was initially unknown. This helps us to fill
in the missing points in a data table or forecast the data. The method is discussed in
detail as follows.
Least Square Method Definition
The least-squares method can be defined as a statistical method that is used to find the
equation of the line of best fit related to the given data. This method is called so as it
aims at reducing the sum of squares of deviations as much as possible. The line
obtained from such a method is called a regression line.
Formula for Least Square Method
The formula used in the least squares method and the steps used in deriving the line
of best fit from this method are discussed as follows:
• Step 1: Denote the independent variable values as xi and the dependent
ones as yi.
• Step 2: Calculate the average values of xi and yi as X and Y.
• Step 3: Presume the equation of the line of best fit as y = mx + c, where m
is the slope of the line and c represents the intercept of the line on the Y-
axis.
• Step 4: The slope m can be calculated from the following formula:
m = [Σ (X – xi) × (Y – yi)] / [Σ (X – xi)²]
• Step 5: The intercept c is calculated from the following formula:
c = Y – mX
Thus, we obtain the line of best fit as y = mx + c, where values of m and c can be
calculated from the formulae defined above.

Least Square Method Graph
Let us have a look at how the data points and the line of best fit obtained from the
least squares method look when plotted on a graph.

The red points in the above plot represent the data points for the sample data available.
Independent variables are plotted as x-coordinates and dependent ones are plotted as
y-coordinates. The equation of the line of best fit obtained from the least squares
method is plotted as the red line in the graph.
We can conclude from the above graph how the least squares method helps us to find a line that best fits the given data points, which can then be used to make further predictions about the value of the dependent variable where it is not known initially.
Limitations of the Least Square Method
The least squares method assumes that the data is evenly distributed and doesn’t
contain any outliers for deriving a line of best fit. But, this method doesn’t provide
accurate results for unevenly distributed data or for data containing outliers.
Least Square Method Formula
The Least Square Method formula is used to find the best-fitting line through a set of
data points by minimizing the sum of the squares of the vertical distances (residuals)
of the points from the line. For a simple linear regression, which is a line of the
form y=ax+b, where y is the dependent variable, x is the independent variable, a is the
slope of the line, and b is the y-intercept, the formulas to calculate the slope (a) and
intercept (b) of the line are derived from the following equations:
1. Slope (a) Formula: a = [n(∑xy) − (∑x)(∑y)] / [n(∑x²) − (∑x)²]
2. Intercept (b) Formula: b = [(∑y) − a(∑x)] / n
Where:
• n is the number of data points,
• ∑xy is the sum of the product of each pair of x and y values,
• ∑x is the sum of all x values,
• ∑y is the sum of all y values,
• ∑x² is the sum of the squares of the x values.
These formulas are used to calculate the parameters of the line that best fits the data
according to the criterion of the least squares, minimizing the sum of the squared
differences between the observed values and the values predicted by the linear model.
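These formulas can be applied directly. A minimal NumPy sketch on a small made-up dataset:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])
n = len(x)

# Slope and intercept from the least squares formulas above
a = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
b = (np.sum(y) - a * np.sum(x)) / n

print("slope a:", a, "intercept b:", b)
print("prediction at x = 6:", a * 6 + b)
```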

Variable rationalization:
Variable rationalization in predictive analytics involves optimizing the selection
and transformation of features (variables) used in predictive models. The goal is to
enhance model performance, reduce complexity, and ensure that the variables
contribute meaningfully to the predictive power of the model. Here’s a detailed
breakdown of the process:

1. Understand the Variables

• Variable Assessment: Review the characteristics and relevance of each variable in the context of the predictive problem.
• Domain Knowledge: Leverage domain expertise to determine which variables
are theoretically important for the prediction task.

2. Evaluate Variable Quality

• Missing Values: Analyze the extent and pattern of missing values. Decide
whether to impute, remove, or ignore variables with missing data.

• Outliers: Identify and handle outliers that may skew the model or lead to
misleading results.
• Consistency: Ensure that variables are consistently formatted and measured.

3. Feature Selection

• Correlation Analysis: Compute correlation coefficients to identify and remove variables that are highly correlated with each other (multicollinearity). This
can simplify the model and improve interpretability.
• Statistical Tests: Use tests like Chi-square (for categorical variables), ANOVA
(for continuous variables), or other relevant statistical tests to assess the
significance of variables.
• Feature Importance: Utilize algorithms that provide feature importance scores
(e.g., tree-based methods like Random Forests or Gradient Boosting) to
select the most influential variables.

4. Feature Engineering

• Create New Features: Generate new variables that might provide additional
predictive power. For example, combining multiple features into a single
feature that captures interactions.
• Transform Variables: Apply transformations (e.g., logarithmic, polynomial) to
variables to better capture relationships and improve model performance.
• Encoding Categorical Variables: Convert categorical variables into numerical
formats using techniques like one-hot encoding or label encoding, as
required by the modeling algorithm.

5. Dimensionality Reduction

• Principal Component Analysis (PCA): Reduce the number of variables by creating principal components that capture the most variance in the data
while retaining essential information.
• Feature Selection Algorithms: Use algorithms like Recursive Feature
Elimination (RFE) or Lasso regression to systematically select important
features while removing less significant ones.
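A minimal sketch combining two of the ideas above, inspecting pairwise correlations and running Recursive Feature Elimination with scikit-learn, on synthetic data; the feature names x0 to x7 are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)
df = pd.DataFrame(X, columns=[f"x{i}" for i in range(8)])

# Correlation analysis: inspect pairwise correlations to spot multicollinearity
print(df.corr().abs().round(2))

# Recursive Feature Elimination keeps the most influential features
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(df, y)
print("selected features:", list(df.columns[rfe.support_]))
```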

6. Model Performance and Validation

• Cross-Validation: Perform cross-validation to evaluate how different sets of variables impact model performance. This helps ensure that the chosen
variables lead to robust and generalizable models.
• Performance Metrics: Measure the impact of variable changes on performance
metrics such as accuracy, precision, recall, F1 score, or RMSE (depending
on the type of model).

7. Iterative Refinement

• Iterative Process: Variable rationalization is an iterative process. Continuously refine the set of variables based on model performance and new insights.
• Feedback Loop: Use feedback from model evaluations to make further
adjustments to the features and transformations.

8. Simplicity and Interpretability

• Simplify Models: Aim for a model that uses the least number of features
necessary to achieve high performance. Simpler models are often easier to
interpret and maintain.
• Understandability: Ensure that the selected variables make sense and that the
model’s decisions can be explained in terms of the features.

9. Documentation

• Record Decisions: Document the rationale behind feature selection and engineering decisions, including the criteria used and the impact on model
performance.

Summary

Variable rationalization in predictive analytics is about optimizing the variables used in your models to ensure they are relevant, impactful, and manageable. It involves
assessing variable quality, selecting the most meaningful features, transforming and
engineering features to improve model performance, and iteratively refining the
feature set based on performance metrics and domain knowledge. This process helps
create more accurate, efficient, and interpretable predictive models.

Model Building:
Model building in predictive analytics involves creating and training models to
forecast future events or outcomes based on historical data. This process typically
follows a structured approach, including data preparation, model selection, training,
evaluation, and deployment. Here’s a step-by-step guide to model building in
predictive analytics:

1. Define the Problem

• Objective: Clearly define the problem you’re trying to solve and the outcome
you want to predict.
• Success Metrics: Identify metrics to measure the success of the model (e.g.,
accuracy, precision, recall, F1 score).

2. Collect and Prepare Data

• Data Collection: Gather relevant data from various sources. This might include
historical records, transaction logs, surveys, etc.
• Data Cleaning: Handle missing values, outliers, and inconsistencies in the data.
This may involve imputing missing values, removing duplicates, or correcting
errors.
• Feature Engineering: Create new features or modify existing ones to improve
model performance. This can include normalization, scaling, encoding
categorical variables, and generating interaction terms.

3. Exploratory Data Analysis (EDA)

• Visualization: Use plots and charts to understand data distributions, relationships between variables, and patterns.
• Statistical Analysis: Perform statistical tests and summaries to explore the
data’s characteristics and identify potential issues or insights.

4. Select a Model

o Choose a Model Type: Depending on the problem, select an appropriate model. Common types include:
▪ Regression Models: Linear regression, polynomial regression,
etc. (for continuous outcomes)
▪ Classification Models: Logistic regression, decision trees,
random forests, support vector machines (SVM), etc. (for
categorical outcomes)

▪ Time Series Models: ARIMA, SARIMA, Prophet, etc. (for
time-dependent data)
▪ Ensemble Methods: Techniques like boosting, bagging, and
stacking to combine multiple models for better performance.
o Algorithm Selection: Choose specific algorithms and methods based
on the data and problem requirements.

5. Split the Data

• Training and Testing Sets: Divide the data into training and testing sets
(e.g., 80/20 split) to train the model and evaluate its performance on unseen
data.
• Validation Set: Optionally, use a validation set to fine-tune model parameters
and select the best model.

6. Train the Model

• Fit the Model: Use the training data to train the model. This involves
adjusting model parameters to minimize the error between predictions and
actual values.
• Hyperparameter Tuning: Optimize model hyperparameters using techniques
such as grid search or random search to improve performance.

7. Evaluate the Model

• Performance Metrics: Assess model performance using appropriate metrics


(e.g., accuracy, precision, recall, ROC AUC for classification; RMSE, MAE
for regression).
• Cross-Validation: Use cross-validation techniques to ensure the model’s
robustness and generalizability across different subsets of data.
• Error Analysis: Analyze errors and misclassifications to understand where
the model may be failing and make improvements.
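To make steps 5–7 concrete, here is a hedged Python sketch using scikit-learn; the synthetic dataset, the 80/20 split, and the choice of logistic regression are illustrative assumptions rather than a prescribed recipe.

```python
# Minimal sketch of split -> train -> tune -> evaluate, assuming scikit-learn.
# The synthetic data stands in for whatever historical records you collected.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Step 5: hold out 20% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Step 6: fit the model and tune a hyperparameter with grid search (5-fold CV).
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.01, 0.1, 1, 10]},
                    cv=5, scoring="f1")
grid.fit(X_train, y_train)

# Step 7: evaluate on the unseen test set.
y_pred = grid.best_estimator_.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1 score :", f1_score(y_test, y_pred))
```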

8. Refine the Model

• Model Improvement: Based on evaluation results, refine the model by


adjusting features, adding new ones, or trying different algorithms.



• Feature Selection: Remove irrelevant or redundant features that do not
contribute to the model’s performance.

9. Deploy the Model

• Integration: Integrate the model into the production environment where it


will be used for making predictions on new data.
• Monitoring: Continuously monitor the model’s performance and make
updates as needed to handle changes in data patterns or business
requirements.

10. Maintain and Update the Model

• Regular Updates: Periodically retrain and update the model with new data to
ensure it remains accurate and relevant.
• Model Drift: Watch for changes in model performance over time due to shifts
in data distribution or other external factors.

11. Documentation and Reporting

• Document the Process: Keep thorough documentation of the model-building


process, including decisions made, parameters used, and performance metrics.
• Report Results: Communicate findings and model performance to
stakeholders in a clear and understandable manner.

By following these steps, you can build predictive models that provide valuable
insights and support decision-making processes based on data.

What Are Missing Imputations?

When you're working with data, sometimes you'll find that some values are
missing. This could be due to various reasons—maybe someone forgot to fill in the
data, or maybe the data was lost. In predictive analytics, having missing data can
be a problem because many machine learning models need complete data to make
accurate predictions.



Missing imputation is the process of filling in these missing values so that your
data is complete and ready for analysis.

Why Is It Important?

If you have a dataset with missing values and you try to build a predictive model,
the model might not work as well, or it might not work at all. By imputing, or
filling in, the missing values, you can make sure your model has the complete
information it needs to make accurate predictions.

How Do We Do It?

There are several common methods to handle missing data:

1. Mean/Median/Mode Imputation:
o Mean: Replace the missing value with the average of the other values.
o Median: Replace the missing value with the middle value (when all
values are ordered).
o Mode: Replace the missing value with the most frequent value.

Example: If you have ages like [25, 30, _, 40], where “_” is missing, you could
replace it with the mean (about 32) or the median (30).

2. Forward/Backward Fill:
o Forward Fill: Replace the missing value with the previous value in
the dataset.
o Backward Fill: Replace the missing value with the next value in the
dataset.

Example: If you have temperatures like [70, 72, _, 75], you could replace the
missing one with 72 (forward fill) or 75 (backward fill).

3. K-Nearest Neighbors (KNN) Imputation:


o This method looks at the closest (most similar) data points and fills in
the missing value based on these neighbors.

Example: If you’re missing a value in a row of data, KNN will look at other
rows that are similar to it and use their values to fill in the gap.

4. Predictive Imputation:



o In this method, you use a model to predict the missing value. You
treat the missing value as a target variable and use the other variables
to predict what it might be.

Example: If you have a dataset with age, income, and education level, and
income is missing, you could build a model that predicts income based on
age and education.

5. Multiple Imputation:
o Instead of filling in a single value, multiple imputation creates several
different datasets, each with different imputed values. You then run
your analysis on all of these datasets and combine the results.

This method gives a more robust and realistic estimate because it considers
the uncertainty of the missing data.
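As a hedged illustration of the first three methods, the Python sketch below fills missing values with pandas and scikit-learn's KNNImputer; the small DataFrame and its column names are invented for the example.

```python
# A minimal sketch of common imputation methods, assuming pandas and scikit-learn.
# The toy data below is hypothetical.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age":         [25, 30, np.nan, 40, 35],
    "temperature": [70, 72, np.nan, 75, 74],
    "income":      [40000, np.nan, 52000, 61000, 45000],
})

# 1. Mean / median imputation for a numeric column.
df["age_mean_filled"]   = df["age"].fillna(df["age"].mean())
df["age_median_filled"] = df["age"].fillna(df["age"].median())

# 2. Forward / backward fill, typically for ordered (time series) data.
df["temp_ffill"] = df["temperature"].ffill()
df["temp_bfill"] = df["temperature"].bfill()

# 3. KNN imputation: fill gaps using the most similar rows.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(df[["age", "temperature", "income"]])
print(pd.DataFrame(knn_filled, columns=["age", "temperature", "income"]))
```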

When to Use Which Method?

• Mean/Median/Mode Imputation: Simple and quick, good for numeric


data.
• Forward/Backward Fill: Useful in time series data, where the data points
are sequential.
• KNN Imputation: Better when there are patterns in your data, and you want
a more accurate imputation.
• Predictive Imputation: Ideal when you have strong predictors for the
missing data.
• Multiple Imputation: Best for complex datasets where you want to capture
the uncertainty in the imputations.

Final Thoughts

Missing imputation is a crucial step in predictive analytics. The method you choose
depends on your data and the specific problem you’re trying to solve. Properly
handling missing data ensures that your predictive models are accurate and
reliable.



Analytics applications to various Business Domains in
predictive analytics
1. Marketing and Sales

• Customer Segmentation: Predictive analytics helps identify different


customer groups based on their behaviors and preferences, allowing for
targeted marketing campaigns.
• Customer Lifetime Value (CLV): Models can predict the long-term value
of a customer, helping businesses focus on retaining high-value customers.
• Churn Prediction: Predictive models can identify customers who are likely
to stop using a service, enabling companies to take proactive measures to
retain them.
• Demand Forecasting: Predicting future product demand helps in inventory
management and optimizing supply chains.

2. Finance and Banking

• Credit Scoring: Predictive analytics is used to assess the risk of lending to


individuals or businesses by predicting the likelihood of loan default.



• Fraud Detection: Models can detect unusual patterns that might indicate
fraudulent activity in real-time.
• Risk Management: Predictive models help in forecasting financial risks and
creating strategies to mitigate them.
• Investment Strategies: Predicting market trends and stock movements to
make informed investment decisions.

3. Healthcare

• Patient Diagnosis: Predictive models can assist in diagnosing diseases early


by analyzing patient data and identifying risk factors.
• Hospital Readmissions: Predicting which patients are likely to be
readmitted allows for better post-discharge care and resource allocation.
• Personalized Medicine: Predictive analytics helps in creating tailored
treatment plans based on individual patient profiles.
• Operational Efficiency: Forecasting patient inflow, managing staff
schedules, and optimizing resource use.

4. Retail

• Inventory Management: Predicting which products will be in demand


allows retailers to stock efficiently, reducing waste and out-of-stock
situations.
• Pricing Optimization: Analyzing market trends, customer behavior, and
competition to set optimal prices.
• Personalized Recommendations: Predictive analytics powers
recommendation engines, suggesting products to customers based on their
browsing and purchase history.
• Customer Experience: Anticipating customer needs and preferences to
improve service quality.

5. Supply Chain and Logistics

• Demand Forecasting: Predicting demand for products or materials to


ensure timely procurement and reduce costs.
• Route Optimization: Predictive models help in finding the most efficient
delivery routes, saving time and fuel.
• Inventory Optimization: Balancing stock levels to meet demand without
overstocking.
• Risk Mitigation: Predicting potential disruptions in the supply chain, such
as delays or shortages, to develop contingency plans.
6. Manufacturing

• Predictive Maintenance: Forecasting equipment failures or maintenance


needs before they occur, reducing downtime and repair costs.
• Quality Control: Predicting defects in manufacturing processes to improve
product quality.
• Supply Chain Optimization: Ensuring that materials are available when
needed, reducing production delays.
• Production Planning: Forecasting production requirements to optimize
resource allocation.

7. Human Resources

• Employee Retention: Predicting which employees are at risk of leaving and


developing strategies to retain them.
• Recruitment: Identifying the best candidates for a job based on predictive
analysis of their profiles and past hiring data.
• Performance Management: Predicting employee performance and
potential, aiding in career development and succession planning.

8. Energy

• Demand Forecasting: Predicting energy demand to optimize production


and distribution.
• Equipment Monitoring: Using predictive analytics to forecast equipment
failures and optimize maintenance schedules.
• Energy Consumption Patterns: Analyzing data to predict and manage
energy consumption more effectively.

9. Telecommunications

• Churn Prediction: Identifying customers likely to switch to competitors,


allowing companies to take preemptive actions.
• Network Optimization: Predicting data usage patterns to optimize network
performance and avoid congestion.
• Customer Segmentation: Tailoring services and promotions to different
customer segments based on predictive insights.

10. Insurance



• Claims Prediction: Predicting the likelihood of claims to manage risk and
set premiums accurately.
• Fraud Detection: Identifying potentially fraudulent claims before they are
processed.
• Customer Risk Profiling: Assessing the risk profile of customers to tailor
insurance products and pricing.

In all these domains, predictive analytics helps businesses stay competitive by


allowing them to anticipate future trends, optimize operations, and better serve
their customers.

UNIT-3
Comparing segmentation and regression using simple examples.
Segmentation Example

Imagine you work at a pet store, and you want to group your customers based on the types of
pets they own.

• Task: Group customers by the type of pet they have.


• Process: You collect information on whether customers own dogs, cats,
birds, or fish.
• Outcome: You create segments such as "Dog Owners," "Cat Owners," "Bird
Owners," and "Fish Owners."

Segmentation helps you understand that different customers have different needs.
You might then send dog food coupons to the "Dog Owners" group and bird food
coupons to the "Bird Owners" group.

Regression Example

Now, imagine you want to predict how much dog food a customer will buy based
on how many dogs they have.

• Task: Predict the amount of dog food a customer will buy.


• Process: You collect data on the number of dogs each customer has and how
much dog food they bought in the past.
• Outcome: You find that customers with more dogs tend to buy more dog
food. You create a formula that predicts dog food purchases based on the
number of dogs.

Regression gives you a specific number, like predicting that a customer


with three dogs will buy 30 pounds of dog food this month.
Key Differences

• Segmentation: Groups data into categories (e.g., Dog Owners vs. Cat Owners).
• Regression: Predicts a continuous outcome (e.g., predicting the amount of dog food a
customer will buy).

In summary, segmentation is about grouping data into categories, while regression is about
predicting numerical values.
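The hedged Python sketch below mirrors the pet-store example: KMeans groups customers (segmentation) while LinearRegression predicts dog-food purchases from the number of dogs (regression); all numbers are made up for illustration.

```python
# Segmentation vs. regression on toy pet-store data (scikit-learn assumed).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Segmentation: group customers by what they spend on dog vs. cat products.
spend = np.array([[90, 5], [85, 10], [10, 80], [5, 95], [50, 50], [45, 55]])
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(spend)
print("Customer segments:", segments)          # e.g. dog-heavy vs. cat-heavy groups

# Regression: predict pounds of dog food bought from the number of dogs owned.
num_dogs = np.array([[1], [2], [3], [4], [5]])
food_lbs = np.array([10, 19, 31, 42, 50])
reg = LinearRegression().fit(num_dogs, food_lbs)
print("Predicted food for 3 dogs:", reg.predict([[3]])[0])  # a continuous number
```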

Difference between supervised and unsupervised learning with simple examples.
Supervised Learning

Supervised learning is like learning with a teacher who gives you the
correct answers. Imagine you’re learning to recognize animals in
pictures, and someone shows you a picture of a cat and tells you, “This
is a cat.” Then, they show you another picture and say, “This is a dog.”
Over time, you learn to identify cats and dogs on your own because
you’ve been given examples with the correct labels.

• Example: Suppose you want to teach a computer to recognize


whether an email is spam or not. You collect a bunch of emails that
have already been labeled as "spam" or "not spam." The computer
learns from these labeled examples and tries to figure out the rules
to classify future emails.
• Key Point: The key here is that the learning process is guided by
labeled data—you know the correct answers (labels) and use them
to train the model.

Unsupervised Learning

Unsupervised learning is like learning without a teacher. Imagine


you’re given a bunch of pictures of animals, but no one tells you what
they are. You might start grouping similar-looking animals together
based on their features, like putting all the animals with fur in one group
and all the animals with feathers in another. You’re learning patterns and
similarities without knowing the exact labels.

• Example: Let’s say you have a bunch of customer data, like their
age, income, and spending habits, but no labels or categories. You
use unsupervised learning to find patterns and group similar
customers together. You might discover that there are natural
segments like “young adults with high income” and “middle-aged
with moderate spending.”
• Key Point: The key here is that the learning process is not guided
by labeled data. The model tries to find patterns, groupings, or
structures within the data without knowing the correct answers
upfront.
Summary of Differences

• Supervised Learning:
o Uses labeled data (with correct answers provided).
o Learns to predict outcomes based on these labels.
o Example: Predicting whether an email is spam based on labeled examples.
• Unsupervised Learning:
o Uses unlabeled data (no correct answers provided).
o Learns to find patterns or groupings in the data.
o Example: Grouping customers into segments based on their behavior without
predefined labels.

In simple terms, supervised learning is like learning with a teacher who tells you the correct
answers, while unsupervised learning is like figuring things out on your own by finding patterns.

Supervised Learning vs. Unsupervised Learning

• Input Data: Supervised learning uses known, labeled data as input; unsupervised
learning uses unknown, unlabeled data as input.
• Computational Complexity: Supervised learning is less computationally complex;
unsupervised learning is more computationally complex.
• Real-Time: Supervised learning uses off-line analysis; unsupervised learning uses
real-time analysis of data.
• Number of Classes: In supervised learning the number of classes is known; in
unsupervised learning it is not known.
• Accuracy of Results: Supervised learning gives accurate and reliable results;
unsupervised learning gives moderately accurate and reliable results.
• Output Data: In supervised learning the desired output is given; in unsupervised
learning the desired output is not given.
• Model: In supervised learning it is not possible to learn larger and more complex
models than in unsupervised learning; in unsupervised learning it is possible to
learn larger and more complex models than in supervised learning.
• Training Data: In supervised learning training data is used to infer the model; in
unsupervised learning training data is not used.
• Another Name: Supervised learning is also called classification; unsupervised
learning is also called clustering.
• Test of Model: We can test a supervised model; we cannot test an unsupervised
model.
• Example: Optical Character Recognition (supervised); finding a face in an image
(unsupervised).

In predictive analytics, classification is a common technique used to categorize


or label data into different classes based on historical data and patterns. Predictive
models are trained on past data and applied to new data to predict outcomes.
Classification is particularly useful when the outcome is categorical.

Types of Classification in Predictive Analytics

1. Binary Classification:
o Binary classification involves predicting one of two possible
outcomes or classes.
o It is commonly used in situations where the result can be one of two
values, like "yes" or "no," "true" or "false."

Examples:

o Fraud detection: Predicting whether a transaction is fraudulent (fraud


or no fraud).
o Customer churn prediction: Predicting whether a customer will
leave a service (churn or not churn).
o Loan default prediction: Predicting whether a customer will default
on a loan (default or not default).
2. Multiclass Classification:
o In multiclass classification, there are more than two possible
categories or classes, and the model predicts which category the
instance belongs to.
o The output class is mutually exclusive, meaning each instance belongs
to only one category.

Examples:

o Product recommendation: Predicting which product category a


customer is likely to purchase next (electronics, clothing, home goods,
etc.).
o Disease classification: Predicting which type of disease a patient may
have (diabetes, heart disease, cancer).
o Credit rating classification: Predicting a person’s credit rating
(excellent, good, fair, or poor).
3. Multilabel Classification:
o Multilabel classification allows an instance to belong to more than
one class simultaneously. Each instance can be associated with
multiple outcomes.
o This is useful when each observation can be assigned to multiple
categories.

Examples:

o Customer segmentation: A customer could be classified into


multiple segments at the same time (frequent buyer, high-spender,
online shopper).
o Email tagging: An email could be tagged as both "important" and
"work-related."
o Product classification: A product might be labeled with multiple
attributes (eco-friendly, low-cost, durable).
4. Imbalanced Classification:
o Imbalanced classification occurs when the number of instances in
one class greatly outweighs the number in other classes. This is a
challenge in predictive analytics because models tend to be biased
toward the majority class.
o Specialized techniques like oversampling or undersampling are
often used to handle imbalanced data.

Examples:



o Anomaly detection: Predicting rare events, such as equipment failure
in manufacturing (where failures are rare compared to normal
operation).
o Rare disease prediction: Predicting the occurrence of a rare disease
in healthcare, where healthy patients far outnumber those with the
disease.
o Insurance claims fraud: Identifying fraudulent claims, where
fraudulent cases are much rarer than legitimate ones.
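As a hedged illustration of binary and imbalanced classification, the Python sketch below trains a classifier on synthetic fraud-style data and shows one simple way to handle class imbalance with scikit-learn's class_weight option; the data and parameter choices are assumptions for the example.

```python
# Binary classification with an imbalanced target (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Synthetic "fraud" data: only ~5% of the samples belong to the positive class.
X, y = make_classification(n_samples=2000, n_features=8,
                           weights=[0.95, 0.05], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=1)

# class_weight="balanced" re-weights the rare class so the model is not
# dominated by the majority class.
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```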

Summary of Classification Types in Predictive Analytics:

• Binary Classification: Two possible outcomes. Examples: fraud detection,
customer churn, loan default.
• Multiclass Classification: More than two outcomes; each instance belongs to
one class. Examples: product recommendation, disease classification.
• Multilabel Classification: Multiple labels for one instance. Examples: customer
segmentation, email tagging.
• Imbalanced Classification: Majority vs. minority class imbalance. Examples:
anomaly detection, rare disease prediction.

Each type of classification plays a crucial role in solving different predictive


analytics problems, enabling businesses and organizations to make informed, data-
driven decisions based on future predictions.

Training & Development / Learning & Development:


Predictive learning analytics is a tool that leverages data to enhance training
outcomes. But before diving into the specifics, it’s important to understand a
few broader categories. There are four primary types:

Descriptive
Descriptive analytics focuses on a collection and analysis of historical data to
identify trends and patterns. For instance, tracking course completion rates
and assessment scores falls under descriptive analytics. This type of analysis



helps organizations understand past performance and identify areas for
improvement.

Predictive
Predictive analytics forecasts future outcomes based on historical data.
Statistical algorithms and machine learning techniques are able to predict
learner behavior, performance, and engagement. You might leverage this type
to predict which employees will excel in specific training modules or identify
who might be at-risk.

Diagnostic
Diagnostic analytics looks under the hood—why did something happen? This
analysis digs deeper into the data to uncover the root causes of specific
outcomes. If a particular training program has low completion rates,
diagnostic analytics can help identify underlying issues, such as content
complexity or low learner engagement.

Prescriptive analytics
Prescriptive analytics provides actionable recommendations based on other
types of insights. Learning and development leaders can then decide on a
new and better course of action to achieve desired goals and outcomes.

Predictive analytics in an LMS


When a learning management system (LMS) is equipped with predictive
analytics capabilities, it can completely revolutionize employee training and
skills measurement. Such a transformation is the direct result of things like:

• Personalization – Build personalized learning paths to optimize


each employee’s development journey, full of relevant and effective
training content.



• Enhanced course design – Identify which elements need
improvement, and enable continuous optimization of training
materials for impactful learning.
• Better resource allotment – Pinpoint high-impact training modules
in order to better allocate training budgets and invest in initiatives
that deliver the highest ROI.
• Improved compliance training – Simplify compliance training by
recognizing employees who need refresher courses or additional
support in order to stay compliant with industry regulations.
• Data-driven leadership decisions – Continuously refine training
strategies to make sure that they remain aligned with business
goals and evolving workforce needs.

Using predictive learning analytics at your organization

• Learning specialists constantly search for new tools to boost results


related to knowledge retention and skills achievement. Although
predictive learning analytics delivers in these areas, success requires
thorough planning, ongoing support, and buy-in from leadership.

• Business leaders across several industries have already rallied around


the benefits of predictive learning analytics for their organizations,
thanks to the use of emerging digital transformation technologies.
These include data mining, predictive modeling, and machine learning.

• When used in combination, these tools identify and measure patterns in


learning data and theorize learners’ future behaviors—which has direct
applications for e-learning.



Building a predictive learning analytics algorithm

• Predictive learning analytics work best if you have a specific goal in


mind when you build the algorithm. Your company's LMS is already
gathering and storing valuable data, making it a great place to start for
delivering comprehensive reports.

• Think about predictive learning analytics as a micro-tool, allowing you to


glimpse learner behavior in a more granular way.

• The key is to stay focused and specific. For example, figure out which
employees are most likely to apply skills and which ones aren’t. To
accomplish this, you’ll need to look at how learning performance
correlates to other outcomes. You might use indicators such as quiz
scores, participation in group discussions, or completion times.

• Key takeaway: Predictive learning analytics focus on individuals and


small pieces of learning content. It’s not a “one-size-fits-all approach.”
Instead, learn as you go, and stay mindful that each situation is
different. You may be able to recycle part of an analytics algorithm you
already built with some modifications.

How They Work Together:

1. You gain knowledge through learning and studying.


2. You practice skills to apply that knowledge in real life.
3. You develop competence when you can use your knowledge and
skills confidently and effectively.

In summary, knowledge is what you know, skills are what you can do,
and competence is how well you can apply both together in real-world
situations.



In predictive analytics, policies and record-keeping play critical roles in ensuring
that predictions are made accurately, ethically, and in compliance with regulations.
Here's a breakdown of both:

Policies in Predictive Analytics

Policies are the rules, guidelines, and best practices that dictate how predictive
analytics should be conducted. They are important to ensure that the analytics
process is ethical, fair, and legally compliant.

1. Data Privacy Policies:


o Ensure that the data used for predictions is collected and handled in
ways that protect individuals' privacy.
o Laws like GDPR (in Europe) or CCPA (in California) outline strict
guidelines for how personal data can be used.
2. Bias and Fairness Policies:
o Predictive models should not discriminate against any group (e.g.,
based on race, gender, age).
o Policies ensure fairness by guiding how models are built, trained, and
tested to reduce bias.
3. Model Transparency and Accountability:
o The inner workings of predictive models should be understandable to
stakeholders, especially for important decisions (e.g., in healthcare,
finance).
o Policies ensure that there’s clear documentation of the model-building
process and why certain decisions are made.
4. Data Governance:
o Clear rules on who has access to data, how it's stored, and how it can
be used.
o Governance policies also guide the sharing of data between
departments or organizations.
5. Security Policies:
o Data security policies ensure that sensitive data used in predictive
analytics is protected from unauthorized access and cyberattacks.

Record-Keeping in Predictive Analytics

Record-keeping is about documenting everything that happens in the predictive


analytics process. It ensures transparency, traceability, and accountability.
1. Data Collection Records:
o Keeping detailed logs about where data comes from, how it’s
collected, and any transformations applied to it.
o This helps in validating the data quality and addressing any future
concerns about the integrity of data.
2. Model Development Logs:
o Recording each step in building a predictive model, including the
choice of algorithm, features used, and hyperparameters.
o These records help track how the model was developed, making it
easier to reproduce or audit the process.
3. Training and Testing Data Records:
o Documentation of the datasets used to train and test models.
o This helps in assessing the accuracy and reliability of predictions over
time.
4. Model Performance Records:
o Logs of how well the predictive model performs, including metrics
such as accuracy, precision, recall, etc.
o These records help monitor the model’s performance over time and
allow for regular updates when necessary.
5. Audit Trails:
o A detailed history of who accessed the data and models, and what
changes were made.
o Important for regulatory compliance, especially in industries like
healthcare and finance.
6. Model Updates and Versioning:
o Keeping records of every update or version change in the model.
o This helps in tracking improvements, spotting errors, or reverting to
previous models if necessary.

Why These Are Important:

• Compliance: Policies ensure compliance with legal and regulatory


standards.
• Accountability: Record-keeping allows for an audit trail, making it easier to
trace decisions and actions in the predictive modeling process.
• Transparency: Both policies and record-keeping increase the transparency
of predictive analytics, making it easier for organizations to explain how
decisions were made.
• Model Maintenance: Good records help with maintaining and improving
models over time. It ensures that models can be updated or retrained when
necessary, without losing track of their performance history.

In summary, policies guide how predictive analytics is done, while record-keeping


tracks what happens during the process. Together, they help ensure that predictions
are ethical, accurate, and reliable.

Policies in Predictive Analytics with Examples:-

Policies are the rules or guidelines that tell us how to do things the right way
when using predictive models.

1. Data Privacy Policy

• Example: Imagine you are collecting student test scores to predict how well
students will do next year. A data privacy policy ensures you don’t share any
student’s personal information (like their names or addresses) with anyone
else without permission. It’s like locking away their private information to
keep it safe.

2. Bias and Fairness Policy

• Example: If you’re using data to predict which students might need extra help
in math, a fairness policy makes sure that the model doesn’t give different
results just because someone is a boy or girl. The model should be fair to
everyone!

3. Security Policy

• Example: You’re using a computer to store all the data about students. A
security policy makes sure you lock it with a password so that no one can
steal or misuse the data.

4. Transparency Policy

Example: If a student or teacher asks, “How did you predict that this student will
need help?” a transparency policy ensures you can explain how the model made
that prediction, not just saying, “The computer told me so!”



Record-Keeping in Predictive Analytics with Examples:-

Record-keeping means writing down or storing important details about what


happens during the predictive process, like keeping a diary of what you do.

1. Data Collection Records

• Example: You’re collecting students’ test scores from different schools. You
should write down where each score came from, when you collected it, and
any changes you made to the data (like removing errors). It’s like noting
down the details of where you got the information.

2. Model Development Records

• Example: When building a model to predict which students will do well in


math, you should keep a record of which formulas (algorithms) you used and
why. This way, if you need to improve the model later, you’ll remember
exactly how you created it.

3. Performance Records

• Example: After using your model to predict student success, you should track
how accurate it was. Did it get most predictions right or wrong? Keeping a
log of the model’s performance helps you know if you need to improve it.

4. Audit Trails

• Example: Imagine the principal wants to see if anyone changed the predictions
for certain students. An audit trail is like a history log that shows who used
the model, what they did, and when. It’s like keeping a record of who touched
your notebook and what they wrote in it.

Simple Summary:

• Policies = Rules to make sure everything is done fairly, safely, and correctly
(e.g., not sharing private info, treating everyone fairly).

• Record-Keeping = Writing down important details about what you do (e.g.,


where the data came from, how you made predictions).



• Together, they help keep predictive analytics organized, ethical, and
accountable, just like keeping class rules and a record of homework helps
keep things fair and clear for everyone.

Feature extraction in ARIMA models is a key step in predictive analytics for
improving forecasting performance. Let's walk through how features are extracted
and used in ARIMA (AutoRegressive Integrated Moving Average) models for time
series forecasting.

1. ARIMA Components Overview

ARIMA models predict future values in a time series by combining:

• AR (Auto-Regressive): Uses past values to predict the next one.


• I (Integrated): Accounts for trends by differencing the data.
• MA (Moving Average): Uses past errors to improve the forecast.

2. Feature Extraction in ARIMA

In the context of ARIMA, feature extraction focuses on identifying patterns,


trends, seasonality, and residuals in the time series. Here are some key steps and
features that are extracted for predictive analytics:
a) Lagged Values (AR features)

Use past values of the time series as features.

For example, if you are predicting sales for today, the values of sales from
yesterday (t-1), two days ago (t-2), etc. become features.



b) Differencing (Integrated feature)

• To make the series stationary (no trend), you can difference it:
  y'_t = y_t - y_{t-1}
• This differenced series serves as a feature to model trends more accurately.

c) Residuals (Error terms, MA features)

• The moving average component uses past forecast errors to improve
predictions.
• Features include the errors from previous forecasts:
  e_{t-1} = y_{t-1} - \hat{y}_{t-1}, e_{t-2} = y_{t-2} - \hat{y}_{t-2}, and so on.

d) Trend and Seasonality Features

• Seasonality can be extracted using tools like Fourier transformations or


decomposition methods (like STL decomposition).
• Trend terms can be captured using polynomial or linear regressions applied
to the data.

3. Application Example: Sales Forecasting

Suppose you want to predict daily sales. Some possible features in your ARIMA
model might be:

• Lagged sales: Sales from the previous day or week.


• Seasonality features: Sales from the same day last week.
• External variables (ARIMAX): Promotions, holidays, or weather conditions.
• Residual features: Errors from previous sales forecasts.

In summary, feature extraction in ARIMA models revolves around lagged values,


differencing, residuals, trends, and exogenous variables. These features help
create a more robust and accurate forecasting model, especially for time series data
with complex patterns.
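A hedged sketch of these ideas in Python: pandas builds the lag and differencing features, and statsmodels fits a small ARIMA model; the synthetic daily-sales series and the (1,1,1) order are illustrative assumptions.

```python
# Lag/difference feature extraction plus a small ARIMA fit.
# Assumes pandas, numpy, and statsmodels are installed; data is synthetic.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", periods=120, freq="D")
sales = pd.Series(100 + 0.5 * np.arange(120) + rng.normal(0, 5, 120), index=dates)

features = pd.DataFrame({
    "lag_1":  sales.shift(1),   # AR-style feature: yesterday's sales
    "lag_7":  sales.shift(7),   # same weekday last week (seasonality)
    "diff_1": sales.diff(1),    # differencing removes the trend
})
print(features.dropna().head())

# Fit ARIMA(1,1,1) on the raw series and forecast the next week.
model = ARIMA(sales, order=(1, 1, 1))
result = model.fit()
print(result.forecast(steps=7))
```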

What does "STL" mean?

STL stands for Seasonal-Trend decomposition using Loess. Here’s what


each part means:



1. Seasonal – This is like looking for repeating patterns, like how it gets
colder every winter and warmer every summer. In data, seasonal
patterns repeat at regular intervals, like weekly, monthly, or yearly
trends.
2. Trend – This part focuses on the long-term direction the data is
moving in. Think of it like watching someone slowly walk uphill
or downhill over time. In data, a trend might show a gradual
increase or decrease, like rising prices over the years.
3. Loess – This is a statistical method to smooth out data and make it
easier to see the trend and seasonality clearly. It’s like putting a
"smoothing" filter on your data so it looks more stable and easier
to understand.
How does STL work?

Imagine you have a graph of daily ice cream sales over a year. Ice cream
sales might go up in summer (seasonal), increase slightly every year due
to popularity (trend), and have random ups and downs on certain days
(noise). The STL approach separates these components so that we can
study each one on its own:

1. Seasonal component: It
looks for the repeating pattern, like more ice
cream sales in summer.
2. Trend component: It shows whether sales are going up or down in the
long run.
3. Remainder (Noise): This is the leftover randomness that doesn’t fit into
seasonality or trend, like one-day spikes in sales on really hot days.
What is Seasonal Decomposition?
Seasonal decomposition is a statistical technique for breaking down a time
series into its essential components, which often include the trend, seasonal
patterns, and residual (or error) components. The goal is to separate the
different sources of variation within the data to understand better and
analyze each component independently. The fundamental components are
discussed below:



• Trend: The underlying long-term progression or direction in the
data.
• Seasonal: The repeating patterns or cycles that occur at fixed
intervals like daily, monthly or yearly.
• Residual: The random fluctuations or noise in the data that cannot
be attributed to the trend or seasonal patterns.
What is Loess (STL)?
Locally Weighted Scatterplot Smoothing or Loess is a non-parametric
regression method used for smoothing data. In the context of time series
analysis, Seasonal-Trend decomposition using Loess (STL) is a specific
decomposition method that employs the Loess technique to separate a time
series into its trend, seasonal, and residual components. STL is particularly
effective in handling time series data with complex and non-linear patterns.
In STL, the decomposition is performed by iteratively applying Loess
smoothing to the time series. This process helps capture both short-term and
long-term variations in the data, making it a robust method for decomposing
time series with irregular or changing patterns.
Why perform seasonal decomposition?
There are various reasons for performing seasonal decomposition of time-series
data, which are discussed below:
1. Pattern Identification: Seasonal decomposition allows analysts to
identify and understand the underlying patterns within a time
series. This is crucial for recognizing recurring trends, seasonal
effects and overall data behavior.
2. Forecasting: Separating a time series into its components
facilitates more accurate forecasting. By modeling the trend,
seasonal patterns and residuals separately, it becomes possible to
make predictions and projections based on the individual
components.
3. Anomaly Detection: Detecting anomalies or unusual events in a
time series is more effective when the data is decomposed.
Anomalies are easier to identify when they stand out against the
background of the trend and seasonal patterns.
4. Statistical Analysis: Seasonal decomposition aids in statistical
analysis by providing a clearer picture of the structure of the time
series. This, in turn, enables the application of various statistical
methods and models.

STL helps us see the big picture in data with trends and seasonal patterns
and smooths out the random parts, making it easier to forecast what might
happen next.
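The hedged Python sketch below applies statsmodels' STL to a synthetic series with weekly seasonality and separates the trend, seasonal, and residual components; the data and the period of 7 are assumptions for illustration.

```python
# Seasonal-Trend decomposition using Loess (STL) with statsmodels; data is synthetic.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

rng = np.random.default_rng(42)
dates = pd.date_range("2024-01-01", periods=180, freq="D")
series = pd.Series(
    50                                             # base level
    + 0.1 * np.arange(180)                         # slow upward trend
    + 10 * np.sin(2 * np.pi * np.arange(180) / 7)  # weekly seasonality
    + rng.normal(0, 2, 180),                       # noise
    index=dates,
)

result = STL(series, period=7).fit()
print(result.trend.head())      # long-term direction
print(result.seasonal.head())   # repeating weekly pattern
print(result.resid.head())      # leftover randomness (noise)
```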

Measures of Forecast Accuracy:

1. Mean Absolute Error (MAE)

• Formula:
  MAE = \frac{1}{n}\sum_{t=1}^{n} |y_t - \hat{y}_t|

• Explanation:
  This is the average of the absolute differences between actual values
  (y_t) and predicted values (\hat{y}_t).
  o Pros: Easy to interpret.
  o Cons: Treats all errors equally, ignoring their direction (over- or
  under-prediction).

2. Mean Forecast Error (MFE) or Bias

• Formula:
  MFE = \frac{1}{n}\sum_{t=1}^{n} (\hat{y}_t - y_t)

• Explanation:
  MFE shows whether a model is consistently over-predicting (positive bias)
  or under-predicting (negative bias).
  o Pros: Useful to check if forecasts are biased.
  o Cons: Positive and negative errors can cancel each other out, making it
  misleading.


3. Mean Squared Error (MSE)

• Formula:
  MSE = \frac{1}{n}\sum_{t=1}^{n} (y_t - \hat{y}_t)^2

• Explanation:
  MSE calculates the average of squared differences between actual
  and forecasted values.
  o Pros: Penalizes larger errors more than smaller ones (because of
  squaring).
  o Cons: Sensitive to outliers.

4. Root Mean Squared Error (RMSE)

• Formula:
  RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n} (y_t - \hat{y}_t)^2}

• Explanation:
  RMSE is the square root of MSE, giving error in the same unit as the
  data.
  o Pros: Easier to interpret compared to MSE.
  o Cons: Sensitive to outliers, like MSE.



5. Mean Absolute Percentage Error (MAPE)

• Formula:
  MAPE = \frac{100}{n}\sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|

• Explanation:
  MAPE measures the percentage difference between forecasted and
  actual values.
  o Pros: Expresses accuracy as a percentage, making it easy to
  understand.
  o Cons: Fails when actual values y_t are close to zero (divides
  by small numbers).

6. Symmetric Mean Absolute Percentage Error (sMAPE)

• Formula:
  sMAPE = \frac{100}{n}\sum_{t=1}^{n} \frac{|y_t - \hat{y}_t|}{(|y_t| + |\hat{y}_t|)/2}

• Explanation:
  sMAPE corrects for MAPE's tendency to blow up when actual values
  are small.
  o Pros: Handles low or zero actual values better.
  o Cons: Still treats large and small errors equally.

Choosing the Right Accuracy Measure

• MAPE works well if you need percentage-based accuracy.


• MAE is ideal for understanding the average magnitude of error.
• MSE/RMSE is better when larger errors are critical (e.g., financial forecasting).
• Tracking Signal helps detect systematic bias in forecasts.



Using multiple measures ensures a comprehensive evaluation of forecast quality.
Each measure gives different insights, so it’s often helpful to compare them
together when tuning and evaluating time series models.
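For concreteness, here is a hedged NumPy sketch that computes the measures above on a small invented set of actual and forecasted values.

```python
# Forecast accuracy measures computed directly with NumPy; numbers are made up.
import numpy as np

actual   = np.array([100.0, 120.0, 90.0, 110.0, 130.0])
forecast = np.array([ 98.0, 125.0, 85.0, 115.0, 128.0])
err = actual - forecast

mae   = np.mean(np.abs(err))
mfe   = np.mean(forecast - actual)   # positive => consistent over-prediction
mse   = np.mean(err ** 2)
rmse  = np.sqrt(mse)
mape  = 100 * np.mean(np.abs(err / actual))
smape = 100 * np.mean(np.abs(err) / ((np.abs(actual) + np.abs(forecast)) / 2))

print(f"MAE={mae:.2f}  MFE={mfe:.2f}  MSE={mse:.2f}  "
      f"RMSE={rmse:.2f}  MAPE={mape:.2f}%  sMAPE={smape:.2f}%")
```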

Extracting features like Height, Average, Energy, etc., and analyzing them for
prediction

This is all about selecting the most important pieces of data to help a model make
accurate guesses about the future.

Let's go through this step-by-step in a simple way.

Step 1: What Are Features?

In predictive analytics, features are like key pieces of information or characteristics


about your data that help predict an outcome. Features can be anything that
might help your model make better predictions, such as:

• Height: Could represent the maximum value of something, like the peak
number of sales on a busy day.
• Average: The average value over time, like average sales per day, to
understand typical patterns.
• Energy: Could represent intensity or usage over time, like energy
consumption or activity level.

Each feature you choose gives your model clues to help it make accurate
predictions.

Step 2: Extracting Features from Data

Extracting features means choosing and calculating the important characteristics


from your data. The goal is to capture things that will make the model smarter at
predicting future values.

Example: Predicting Sales for a Store



Imagine you want to predict next month’s sales for a store. You collect daily sales
data and identify features that might influence future sales.

Useful Features to Extract:

• Height: The highest sales day in each week (helps understand peak times).
• Average: Average daily sales over the past month (helps with understanding
overall demand).
• Energy/Pattern: Could represent how active sales are on weekends versus
weekdays, showing seasonality or trends.

You would organize these features in a table to make it easier for your model to
read.

Date   | Peak Sales (Height) | Average Daily Sales | Weekend Boost (Energy)
Week 1 | 1200                | 850                 | High
Week 2 | 1100                | 830                 | Medium
Week 3 | 1300                | 870                 | High

This table gives your model structured data with important clues.

Step 3: Analyzing Features for Prediction

Once you have your features ready, the next step is to analyze them and find
patterns or relationships. This analysis helps the model understand which features
are most important for making predictions.

1. Identify Patterns and Trends:


o Look for patterns in each feature to see if it has a strong effect on the
outcome.
o Example: Does sales height increase every week or does weekend
activity have a big impact on total sales?
2. Choose a Prediction Model:
o Pick a model that works well with your data and the features you’ve
selected.
o Common models in predictive analytics:
▪ Linear Regression (for straightforward relationships)
▪ Time Series Models like ARIMA (for time-based trends)



▪ Machine Learning Models like Decision Trees, Random Forests, or
Neural Networks (for complex relationships)
3. Train the Model:
o Give your model the feature data and teach it the relationship
between the features and what you’re predicting.
o Example: You show the model how peak sales, average sales, and
weekend boosts from past weeks relate to actual weekly sales.
4. Use the Model to Make Predictions:
o Once the model is trained, it can use the features to predict future
sales.
o Example: If it sees that weekends usually boost sales by 30% and the
average sales have been rising, it might predict a higher sales volume
next month.

Example of Feature-Based Prediction

Let’s say you want to predict sales for next month.

1. Extract Features: You calculate the following for past data:


o Peak Sales (Height): Highest sales days in the past weeks.
o Average Sales: Average sales per day in the past weeks.
o Weekend Boost (Energy): How much sales increase on weekends.
2. Analyze and Train:
o You notice that higher peak days and average sales both lead to
higher sales next month.
o You use these features in a predictive model like linear regression or a
more advanced neural network.
3. Predict Sales for Next Month:
o The model might say, “Based on peak sales and weekend boosts, I
predict a 10% increase next month.”
4. Check Accuracy Over Time:
o After seeing real sales for the month, you check how close the
prediction was to reality and adjust if needed.

Why Extracting and Analyzing Features Matters

1. Identify Influences: Features help you understand what impacts the outcome.



2. Improve Prediction Accuracy: Relevant features make your model smarter,
leading to better predictions.
3. Get Insightful Patterns: Analyzing features can show trends you didn’t see
before, like seasonal effects on sales.

By selecting key features (like Height, Average, Energy) and analyzing their effects,
you create a strong foundation for accurate predictions in predictive analytics. This
process turns raw data into meaningful insights and actionable predictions.
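To tie the steps together, the hedged Python sketch below extracts weekly Height (peak), Average, and a weekend-boost "Energy" feature from a synthetic daily-sales series with pandas, then fits a simple linear model on them; all names, numbers, and parameter choices are assumptions for illustration.

```python
# Extracting Height / Average / Energy features from daily sales and using them
# in a simple predictive model. Assumes pandas, numpy, scikit-learn; data is synthetic.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
dates = pd.date_range("2024-01-01", periods=12 * 7, freq="D")
base = 800 + 2 * np.arange(len(dates))                 # slowly rising demand
weekend = np.where(dates.dayofweek >= 5, 250, 0)       # weekend boost
daily_sales = pd.Series(base + weekend + rng.normal(0, 40, len(dates)), index=dates)

weekly = pd.DataFrame({
    "height":  daily_sales.resample("W").max(),        # peak sales day each week
    "average": daily_sales.resample("W").mean(),       # typical daily demand
    "energy":  daily_sales[dates.dayofweek >= 5].resample("W").mean()
               / daily_sales[dates.dayofweek < 5].resample("W").mean(),  # weekend boost
})
weekly["next_week_total"] = daily_sales.resample("W").sum().shift(-1)
weekly = weekly.dropna()

model = LinearRegression().fit(weekly[["height", "average", "energy"]],
                               weekly["next_week_total"])
latest = weekly[["height", "average", "energy"]].iloc[[-1]]
print("Predicted sales for next week:", model.predict(latest)[0])
```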

UNIT-5

Standard Operating Procedures (SOPs) for documentation and knowledge sharing
in predictive analytics are structured guidelines that describe how tasks related to
creating and maintaining analytics projects should be carried out to ensure
consistency, efficiency, and quality. Here's a simple breakdown:

1. Purpose of SOPs in Predictive Analytics

• Consistency: To make sure everyone in a team or organization follows the


same methods for building and sharing predictive models.
• Efficiency: To save time and avoid redoing work because everyone knows
what steps to follow.
• Quality: To maintain high standards for how models are built, tested, and
shared.

2. Key Components of SOPs for Documentation and Knowledge Sharing

1. Project Initiation:
o Define the objective of the analysis (e.g., predicting sales, customer
churn, etc.).
o Identify data sources and stakeholders.
2. Data Management:
o Data Collection: Clear steps for gathering relevant data from different
sources.
o Data Cleaning: Guidelines on how to handle missing values, outliers, or
data inconsistencies.
o Data Storage: Instructions on where and how to store datasets securely
(e.g., in databases or cloud storage).
3. Model Development:
o Model Selection: Procedures for choosing the appropriate algorithm
(e.g., regression, decision trees, etc.).
o Feature Engineering: Instructions on how to create new variables or
transform existing ones to improve model performance.
o Model Training and Testing: Guidelines on splitting data into training
and testing sets and evaluating model accuracy.
o Documentation: Keep records of why certain methods were chosen,
parameter settings, and performance metrics.
4. Version Control:
o Use tools like Git to track changes in code and models.
o Provide guidelines on how and when to update documentation to
reflect changes in the project.
5. Model Deployment:
o Instructions on how to deploy models to production environments.
o Steps to monitor model performance and ensure it remains accurate
over time.
6. Knowledge Sharing:
o Reports and Summaries: Create clear, non-technical summaries of
findings and share them with stakeholders.
o Code Repositories: Guidelines for organizing and sharing code through
platforms like GitHub or Bitbucket.
o Team Collaboration Tools: Use platforms like Confluence, Microsoft
Teams, or Slack for sharing insights and progress updates.
o Training and Onboarding: Procedures for training new team members
on the project and its documentation.

Why SOPs Are Important

• They ensure everyone is on the same page.


• Make it easier for new team members to understand ongoing projects.
• Reduce errors and save time by providing clear instructions.
• Help maintain data security and compliance with regulations.

In short, SOPs in predictive analytics are like a rulebook that guides teams on how
to document and share knowledge effectively to ensure high-quality results and
smooth collaboration.

The purpose of a document explains why the document is being created. It states
the main goal or reason for writing it. For example, a document might aim to



inform, guide, explain, or make a proposal. Clearly defining the purpose helps the
writer stay focused and helps the reader understand what to expect.

The scope of a document outlines what will and won’t be covered. It describes
the limits or boundaries of the content. This can include details about specific
topics that will be discussed, the level of detail expected, or the aspects that won’t
be included. Clearly defining the scope prevents confusion and ensures the
document stays relevant and manageable.

In summary, the purpose is the “why,” and the scope is the “what” and “how far”
of the document.

Understanding structure of documents – case studies, articles, white


papers, technical reports:
The structure of a document is how it’s organized or put together to make the
information clear and easy to follow. Different types of documents have their own
structures because they are written for different reasons. Let’s look at the structure
of some common documents:

1. Case Studies

• Purpose: To describe a real-world situation or problem, explain what was


done to address it, and show the results.
• Structure:
o Introduction: Explains the background of the problem and why it’s
important.
o Problem or Challenge: Describes the issue that needed to be solved.
o Solution or Action Taken: Details what was done to fix the problem or
improve the situation.
o Results and Analysis: Shows what happened after the solution was
implemented, including any improvements or lessons learned.
o Conclusion: Summarizes the key points and any recommendations.
• Example: A case study about a company that used a new technology to
increase sales might explain the challenge (low sales), the actions taken (like
a new marketing strategy), and the outcome (sales went up by 50%).



2. Articles

• Purpose: To inform, entertain, or persuade readers about a specific topic.


• Structure:
o Headline/Title: Captures attention and tells what the article is about.
o Introduction: Provides an overview or hook to engage the reader.
o Body: Contains the main content, divided into sections with
subheadings or paragraphs that present the key ideas.
o Conclusion: Wraps up the article, often summarizing the points or
offering a final thought.
• Example: An article about climate change might start with a catchy title,
explain the effects of climate change in the body, and end with a call to
action, like asking readers to reduce waste.

3. White Papers

• Purpose: To provide in-depth information about a complex topic, usually to


explain or promote a solution to a problem.
• Structure:
o Title Page: Includes the title and author information.
o Executive Summary: Briefly explains what the paper is about and the
key takeaways.
o Introduction: Introduces the topic and explains why it matters.
o Problem Description: Discusses the issue or challenge in detail.
o Solution or Approach: Describes a solution or a way to address the
problem, often backed up by research or evidence.
o Conclusion: Summarizes the main points and may include a call to
action.
o References: Lists any sources used.
• Example: A white paper on renewable energy might outline the challenges of
traditional energy sources, explain how solar power can help, and present
data showing the benefits.

4. Technical Reports

• Purpose: To document and share technical or scientific research and findings


in detail.
• Structure:
o Title Page: Contains the title, author, and date.
o Abstract: A summary of the report, including the purpose, methods,
and key results.
o Introduction: Explains the purpose of the report and provides
background information.
o Methods: Describes how the research or work was conducted.
o Results: Presents the data or findings from the research.
o Discussion: Analyzes the results and explains their significance.
o Conclusion: Summarizes the findings and suggests further research or
applications.
o Appendices (if needed): Contains extra details like graphs or raw data.
• Example: A technical report on testing a new material’s strength might explain
the testing process, show the results in graphs, and discuss how the material
could be used in construction.

These structures help readers understand the information in an organized way,


whether they are reading about real-life examples (case studies), getting general
information (articles), diving deep into a topic (white papers), or understanding
scientific work (technical reports).



Document on Addressing High Levels of Air
Pollution in Urban Areas

Introduction
Urban air pollution has become a significant public health and
environmental challenge in cities worldwide. Due to rapid
industrialization, increased vehicular emissions, and other factors, urban
areas are experiencing higher concentrations of pollutants that pose risks
to both the environment and human well-being.

Problem Statement
"Our city has consistently recorded high levels of air pollution, primarily
due to vehicle emissions and industrial activities. This has resulted in an
increase in respiratory illnesses and a reduction in the overall quality of
life for residents. Our goal is to reduce air pollution levels by 40% over
the next two years through the implementation of stricter emission
regulations and the promotion of clean energy alternatives."

Background and Context


According to recent studies, pollutants such as nitrogen oxides (NOx),
particulate matter and sulfur dioxide (SO2) have been found at high
levels in urban environments. The primary sources of these pollutants
include:

Vehicular Emissions: Automobiles contribute significantly to air


pollution, especially in areas with high traffic density.
Industrial Activities: Factories and power plants emit a large quantity of
pollutants as byproducts of production processes.

Construction Dust: Urban construction contributes to airborne particles


that further degrade air quality.

Objectives
The primary objective of this project is to achieve a 40% reduction in air
pollution levels in the city within two years. To reach this goal, the
project aims to:

1. Implement and enforce stricter vehicle emission standards.


2. Promote the use of electric and hybrid vehicles.
3. Increase green spaces to act as natural air filters.
4. Encourage the adoption of clean energy technologies in industrial
processes.
5. Raise public awareness about air pollution and promote
community-driven initiatives for clean air.

Proposed Solutions
1. Implementing Stricter Emission Regulations

• Introduce policies to limit emissions from cars, trucks, and


industrial sources.
• Mandate regular emissions testing and maintenance for vehicles.
• Enforce penalties for industries that fail to meet emission
standards.
2. Promoting Clean and Renewable Energy

• Provide incentives for businesses and households to adopt solar,


wind, or other renewable energy sources.
• Phase out the use of coal and promote natural gas as a cleaner
alternative.
3. Expanding Green Spaces

• Develop urban parks and green belts to absorb pollutants and


improve air quality.
• Encourage tree-planting initiatives, focusing on areas with high
pollution levels.
• Partner with community organizations to maintain and expand
green spaces.

4. Public Awareness and Community Engagement

• Launch educational campaigns to inform residents about the


dangers of air pollution and ways to minimize exposure.
• Engage schools, businesses, and community groups in promoting
clean air practices.
• Establish a real-time air quality monitoring system accessible to
the public.

Implementation Plan
1. Timeline

• Month 1: Establish a project team and conduct a comprehensive


study on the city’s current air quality status.
• Month 2: Secure funding for green initiatives and renewable
energy projects (solar, wind, hydropower, biomass).
• Month 3: Launch public awareness campaigns and initiate small-
scale pilot projects for clean energy use.
2. Stakeholders Involved

• Local Government: Policy-making and enforcement.


• Environmental Agencies: Monitoring and reporting on air quality.
• Community Organizations: Educating the public and driving local
initiatives.
• Industries and Businesses: Adopting cleaner technologies.
• Residents: Participating in awareness programs and tree-planting
activities.

Expected Outcomes
• Health Benefits: Reduction in cases of respiratory and
cardiovascular diseases.
• Environmental Improvements: Improved air quality and
healthier ecosystems.
• Economic Impact: Long-term cost savings from reduced
healthcare expenses and increased productivity.
• Community Engagement: Increased community awareness and
participation in air quality initiatives.

Conclusion
Reducing air pollution in urban areas requires a multifaceted approach
that involves policy enforcement, technological innovation, and
community collaboration. By committing to these strategies, we can
create a healthier, cleaner environment for current and future generations.
