Linear Regression Merged


EXPLORATORY DATA ANALYSIS:

DEFINITION OF EDA:
Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize
their main characteristics, often with visual methods. The primary goal of EDA is to develop
an understanding of the data, uncover underlying patterns, spot anomalies, test assumptions,
and check for relationships between variables.
What is Exploratory Data Analysis (EDA)?
Exploratory Data Analysis (EDA) is a crucial initial step in data science projects. It
involves studying and exploring data sets to understand their key characteristics, uncover
patterns, locate outliers, and identify relationships between variables. EDA is normally carried
out as a preliminary step before undertaking more formal statistical analyses or modeling.

Steps Involved in Exploratory Data Analysis


1. Understand the Data
Familiarize yourself with the data set, understand the domain, and identify the objectives of
the analysis.
2. Data Collection
Collect the required data from various sources such as databases, web scraping, or APIs.
3. Data Cleaning
 Handle missing values: Impute or remove missing data.
 Remove duplicates: Ensure there are no duplicate records.
 Correct data types: Convert data types to appropriate formats.
 Fix errors: Address any inconsistencies or errors in the data.
4. Data Transformation
 Normalize or standardize the data if necessary.
 Create new features through feature engineering.
 Aggregate or disaggregate data based on analysis needs.
5. Data Integration
Integrate data from various sources to create a complete data set.
6. Data Exploration
 Univariate Analysis: Analyze individual variables using summary statistics and
visualizations (e.g., histograms, box plots).
 Bivariate Analysis: Analyze the relationship between two variables with scatter plots,
correlation coefficients, and cross-tabulations.
 Multivariate Analysis: Investigate interactions between multiple variables using pair
plots and correlation matrices.
7. Data Visualization
Visualize data distributions and relationships using visual tools such as bar charts, line charts,
scatter plots, heatmaps, and box plots (a short code sketch of steps 6-8 follows this list).
8. Descriptive Statistics
Calculate central tendency measures (mean, median, mode) and dispersion measures (range,
variance, standard deviation).
9. Identify Patterns and Outliers
Detect patterns, trends, and outliers in the data using visualizations and statistical methods.
10. Hypothesis Testing
Formulate and test hypotheses using statistical tests (e.g., t-tests, chi-square tests) to validate
assumptions or relationships in the data.
11. Data Summarization
Summarize findings with descriptive statistics, visualizations, and key insights.
12. Documentation and Reporting
 Document the EDA process, findings, and insights clearly and structured.
 Create reports and presentations to convey results to stakeholders.
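A minimal code sketch of steps 6-8 above, assuming a hypothetical pandas DataFrame loaded from a file named data.csv with illustrative column names age and income:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data set (file name and column names are assumed for illustration)
df = pd.read_csv('data.csv')

# Descriptive statistics and structure (step 8)
print(df.info())
print(df.describe())

# Univariate analysis: distribution of one numeric column (step 6)
df['age'].hist(bins=30)
plt.title('Distribution of age')
plt.show()

# Bivariate analysis: relationship between two variables (step 6)
sns.scatterplot(data=df, x='age', y='income')
plt.show()

# Multivariate analysis / visualization: correlation heatmap (steps 6-7)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()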

Types of Exploratory Data Analysis (EDA)


1. Univariate Analysis
 Definition: Focuses on analyzing a single variable at a time.
 Purpose: To understand the variable's distribution, central tendency, and spread.
 Techniques:
 Descriptive statistics (mean, median, mode, variance, standard deviation).
 Visualizations (histograms, box plots, bar charts, pie charts).
2. Bivariate Analysis
 Definition: Examines the relationship between two variables.
 Purpose: To understand how one variable affects or is associated with another.
 Techniques:
 Scatter plots.
 Correlation coefficients (Pearson, Spearman).
 Cross-tabulations and contingency tables.
 Visualizations (line plots, scatter plots, pair plots).
3. Multivariate Analysis
 Definition: Investigates interactions between three or more variables.
 Purpose: To understand the complex relationships and interactions in the data.
 Techniques:
 Multivariate plots (pair plots, parallel coordinates plots).
 Dimensionality reduction techniques (PCA, t-SNE).
 Cluster analysis.
 Heatmaps and correlation matrices.
4. Descriptive Statistics
 Definition: Summarizes the main features of a data set.
 Purpose: To provide a quick overview of the data.
 Techniques:
 Measures of central tendency (mean, median, mode).
 Measures of dispersion (range, variance, standard deviation).
 Frequency distributions.
5. Graphical Analysis
 Definition: Uses visual tools to explore data.
 Purpose: To identify patterns, trends, and data anomalies through visualization.
 Techniques:
 Charts (bar charts, histograms, pie charts).
 Plots (scatter plots, line plots, box plots).
 Advanced visualizations (heatmaps, violin plots, pair plots).
6. Dimensionality Reduction
 Definition: Reduces the number of variables under consideration.
 Purpose: To simplify models, reduce computation time, and mitigate the curse of
dimensionality (a short PCA sketch follows this list).
 Techniques:
 Principal Component Analysis (PCA).
 t-Distributed Stochastic Neighbor Embedding (t-SNE).
 Linear Discriminant Analysis (LDA).
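A minimal sketch of dimensionality reduction with scikit-learn's PCA; the random feature matrix below is an assumption purely for illustration:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative feature matrix: 100 samples, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Standardize first, since PCA is sensitive to feature scales
X_scaled = StandardScaler().fit_transform(X)

# Keep the two components that explain the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_)  # proportion of variance per component
print(X_reduced.shape)                # (100, 2)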

Exploratory Data Analysis Tools


Using the following tools for exploratory data analysis, data scientists can effectively gain
deeper insights and prepare data for advanced analytics and modeling.
1. Python Libraries
 Pandas: Provides data structures and functions needed to manipulate structured data
seamlessly.
 Use: Data cleaning, manipulation, and summary statistics.
 NumPy: Supports large, multi-dimensional arrays and matrices and a collection of
mathematical functions.
 Use: Numerical computations and data manipulation.
 Matplotlib: A plotting library that produces static, animated, and interactive
visualizations.
 Use: Basic plots like line charts, scatter plots, and bar charts.
 Seaborn: Built on Matplotlib, it provides a high-level interface for drawing attractive
statistical graphics.
 Use: Advanced visualizations like heatmaps, violin plots, and pair plots.
 SciPy: Builds on NumPy and provides many higher-level scientific algorithms.
 Use: Statistical analysis and additional mathematical functions.
 Plotly: A graphing library that makes interactive, publication-quality graphs online.
 Use: Interactive and dynamic visualizations.
2. R Libraries
 ggplot2: A framework for creating graphics using the principles of the Grammar of
Graphics.
 Use: Complex and multi-layered visualizations.
 dplyr: A set of tools for data manipulation, offering consistent verbs to address
common data manipulation tasks.
 Use: Data wrangling and manipulation.
 tidyr: Provides functions to help you organize your data in a tidy way.
 Use: Data cleaning and tidying.
 shiny: An R package that makes building interactive web apps straight from R easy.
 Use: Interactive data analysis applications.
 plotly: Also available in R for creating interactive visualizations.
 Use: Interactive visualizations.
3. Integrated Development Environments (IDEs)
 Jupyter Notebook: An open-source web application that allows you to create and
share documents that contain live code, equations, visualizations, and narrative text.
 Use: Combining code execution, rich text, and visualizations.
 RStudio: An integrated development environment for R that offers tools for writing
and debugging code, building software, and analyzing data.
 Use: R development and analysis.
4. Data Visualization Tools
 Tableau: A top data visualization tool that facilitates the creation of diverse charts and
dashboards.
 Use: Interactive and shareable dashboards.
 Power BI: A Microsoft business analytics service offering interactive visualizations
and business intelligence features.
 Use: Interactive reports and dashboards.
5. Statistical Analysis Tools
 SPSS: A comprehensive statistics package from IBM.
 Use: Complex statistical data analysis.
 SAS: A software suite developed by SAS Institute for advanced analytics, business
intelligence, data management, and predictive analytics.
 Use: Statistical analysis and data management.
6. Data Cleaning Tools
 OpenRefine: A powerful tool for cleaning messy data, transforming formats, and
enhancing it with web services and external data.
 Use: Data cleaning and transformation.
 SQL Databases: Tools like MySQL, PostgreSQL, and SQLite are used to manage and
query relational databases.
 Use: Data extraction, transformation, and basic analysis.
Linear Regression
Linear regression is a statistical method that is used in various machine learning models to predict
the value of unknown data using other related data values. Linear regression is used to study the relationship
between a dependent variable and an independent variable.
What is Linear Regression?
Linear regression analysis is used to predict the value of a variable based on the value of another
variable. The variable you want to predict is called the dependent variable. The variable you are using to
predict the other variable's value is called the independent variable.
Linear regression is a very common technique used in various models that perform predictive
analysis. In linear regression we have two variables, considered as the independent variable and the
dependent variable. Linear regression assumes a linear relationship between the variables,
which means that changes in the independent variables are associated with proportional changes in the
dependent variable.

Various linear regression that are commonly used are,


1. Simple Linear Regression: This is the simplest form, where we have one thing we’re trying to
predict and one thing we think might influence it. For example, we might perform a predictive
analysis that tries to predict someone’s weight based on their height.
2. Multiple Linear Regression: Here, things get a bit more complex. We’re still predicting one thing,
but now we’re considering multiple factors that might influence it. For instance, we might predict a
person’s weight based on their height, age, and maybe even their diet habits.
3. Logistic Regression: This one comes into play when we’re dealing with binary outcomes, like
whether someone will click on an ad or not. We’re still looking at multiple factors that might play a
role.
4. Ordinal Regression: Sometimes, what we’re trying to predict isn’t exactly numerical, but it has an
order. Think of rating something from 1 to 5 stars. This kind of regression helps us predict such
ordinal outcomes.
5. Multinomial Regression: When our outcome has several categories but no inherent order, like
predicting someone’s favorite color among several options, we turn to multinomial regression.
6. Discriminant Analysis: Similar to multinomial regression, this helps us when we have multiple
categories for our outcome variable, but here, we’re specifically focused on classifying cases into
those categories based on the predictor variables.
Each of these methods has its own strengths and best-use scenarios.
Linear Regression Equation
Linear regression line equation is written in the form:
y = a + bx
where,
 x is Independent Variable, Plotted along X-axis
 y is Dependent Variable, Plotted along Y-axis
The slope of the regression line is “b”, and the intercept value of regression line is “a”(the value of y when
x = 0).
Linear Regression Formula
The formula used for linear regression is, y = a + bx
The intercept value, a, and the slope of the line, b, are evaluated using the formulas given below
(a short numerical sketch follows the definitions):
a = [∑y (∑x²) – ∑x (∑xy)] / [n (∑x²) – (∑x)²]
b = [n (∑xy) – ∑x ∑y] / [n (∑x²) – (∑x)²]
where,
 y is Dependent Variable that Lies along Y-axis
 a is Y-Intercept
 b is Slope of Regression Line
 x is Independent Variable that Lies along X-axis
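A minimal numerical sketch of these formulas with a small made-up data set (the values are illustrative only):

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
n = len(x)

# Intercept a and slope b from the summation formulas above
denom = n * (x**2).sum() - x.sum()**2
a = (y.sum() * (x**2).sum() - x.sum() * (x*y).sum()) / denom
b = (n * (x*y).sum() - x.sum() * y.sum()) / denom

print(f"y = {a:.2f} + {b:.2f}x")   # y = 2.20 + 0.60x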
Properties of Linear Regression
In the linear regression line, if the regression parameters a (intercept) and b (slope) are defined, the
properties are given as below:
 The linear regression line minimizes the sum of squared differences between observed values and
predicted values.
 The linear regression line always passes through the mean of the X and Y variable values.
 The linear regression constant a is equal to the y-intercept of the linear regression line.
 The linear regression coefficient b is the slope of the regression line.
Linear Regression Line
The least squares method is the most common method used to fit a regression line in the X-Y graph. In
this process we determine the line of best fit by minimizing the sum of the squares of the vertical deviations
from each data point to the line. For any point that lies exactly on the fitted line, its deviation is zero.

Regression Coefficient
Linear regression line, equation:
Y = B0 + B1X
where,
 B0 is a Constant
 B1 is Regression Coefficient
Here, B1 is the regression coefficient and its formula is,
B1 = b1 = Σ [ (xi – x̄)(yi – ȳ) ] / Σ [ (xi – x̄)² ]
where,
 xi and yi are the observed data values
 x̄ and ȳ are the mean values of X and Y
What is Linear Regression Used for?
Various uses of Linear Regression are,
 It is used in market research and study of customer survey results.
 It is used for studying performance of engine of automobiles.
 It is used in deciding the effective price of any goods.
 It is used in astronomy.
Error in Linear Regression Formula
The standard error about the regression line measures the average distance by which the observed
values deviate from the regression line. The standard error in this case is denoted by ‘SE‘. The higher the
coefficient of determination, the lower the standard error and hence the more accurate the result.
Solved Example Questions on Linear Regression
Question 1: Find the linear regression equation for the given data:

X Y

3 8

9 6

5 4

3 2

Solution:
Calculating intercept and slope value.

X y x2 xy

3 8 9 24

9 6 81 54

5 4 25 20

3 2 9 6

∑x = 20 ∑y = 20 ∑x2 = 124 ∑xy = 104

Using formula,
a = {20 (124) – 20 (104)} / {4 (124) – 400}
a = 400/96 = 4.17
b = {4 (104) – 20 (20)} / {4 (124) – 400}
b = 16/96 = 0.167
So, linear regression equation is, y = a + bx => y = 4.17 + 0.167x
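A quick cross-check of Question 1 using NumPy's least-squares polynomial fit (np.polyfit returns the slope first, then the intercept, for degree 1):

import numpy as np

x = np.array([3, 9, 5, 3])
y = np.array([8, 6, 4, 2])

slope, intercept = np.polyfit(x, y, deg=1)
print(f"y = {intercept:.2f} + {slope:.3f}x")   # y = 4.17 + 0.167x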
Question 2: Find the linear regression equation for the given data:
X y

4 6

7 5

3 8

1 3

Solution:
Calculating intercept and slope value.

X y x2 xy

4 6 16 24

7 5 49 35

3 8 9 24

1 3 1 3

∑x = 15 ∑y = 22 ∑x2 = 75 ∑xy = 86

Using formula,
a = (22 (75) – 15 (86)) / (4 (75) – 225)
a = 360/75
a = 4.8
b = (4 (86) – 15 (22)) / (4 (75) – 225)
b = 14/75
b = 0.1867
So, the linear regression equation is, y = 4.8 + 0.1867x.
Question 3: Find the intercept of linear regression line if ∑x = 25, ∑y = 20, ∑x2 = 90, ∑xy = 150 and
n = 5.
Solution:
Using formula,
a = (20 (90) – 25 (150)) / (5 (90) – 625)
a = -1950/-175
a = 11.14
Question 4: Find the intercept of linear regression line if ∑x = 30, ∑y = 27, ∑x2 = 110, ∑xy = 190 and
n = 4.
Solution:
Using formula,
a = (27 (110) – 30 (190)) / (4 (110) – 900)
a = -2730/-460
a = 5.93
Question 5: Find slope of linear regression line if ∑x = 10, ∑y = 16, ∑x2 = 60, ∑xy = 120 and n = 4.
Solution:
Using formula,
b = (4 (120) – 10 (16)) / (4 (60) – 100)
b = 320/140
b = 2.29
Question 6: Find slope of linear regression line if ∑x = 40, ∑y = 32, ∑x2 = 130, ∑xy = 210 and n = 4.
Solution:
Using formula,
b = (4 (210) – 40 (32)) / (4 (130) – 1600)
b = -440/-1080
b = 0.407
Question 7: Find slope of linear regression line if ∑x = 50, ∑y = 44, ∑x2 = 150, ∑xy = 230 and n = 4.
Solution:
Using formula,
a = (44 (150) – 50 (230)) / (4 (150) – 2500)
a = -4900/-1900
a = 2.58
b = (4 (230) – 50 (44)) / (4 (150) – 2500)
b = -1280/-1900
b = 0.674
Non-linear regression
Nonlinear regression is a mathematical function that uses a generated line – typically a curve – to
fit an equation to some data. The sum of squares is used to determine the fitness of a regression model,
which is computed by calculating the difference between the mean and every point of data.
What is Nonlinear Regression?
Nonlinear regression is a mathematical model that fits an equation to certain data using a generated
line. Whereas linear regression uses a straight-line equation (such as Y = c + mx), nonlinear regression
shows the association using a curve, making it nonlinear in the parameters.
A simple nonlinear regression model is expressed as follows:
Y = f (X, β) + ϵ
Where:
 X is a vector of p predictors
 β is a vector of k parameters
 f(·) is the known regression function
 ϵ is the error term
Alternatively, the model can also be written as follows:
Yi = h[xi(1), xi(2), …, xi(m); θ1, θ2, …, θp] + Ei
Where:
 Yi is the response variable
 h is the function
 xi(1), …, xi(m) are the inputs
 θ1, …, θp are the parameters to be estimated

Nonlinear regression refers to a broader category of regression models where the relationship between
the dependent variable and the independent variables is not assumed to be linear. If the underlying pattern
in the data exhibits a curve, whether it’s exponential growth, decay, logarithmic, or any other non-linear
form, fitting a nonlinear regression model can provide a more accurate representation of the relationship.
This is because in linear regression it is pre-assumed that the data is linear.
Many different nonlinear regressions exist and can be used to fit the dataset, such as quadratic regression,
cubic regression, and so on, up to whatever degree the data requires.
Assumptions in NonLinear Regression
These assumptions are similar to those in linear regression but may have nuanced interpretations due to the
nonlinearity of the model. Here are the key assumptions in nonlinear regression:
1. Functional Form: The chosen nonlinear model correctly represents the true relationship between
the dependent and independent variables.
2. Independence: Observations are assumed to be independent of each other.
3. Homoscedasticity: The variance of the residuals (the differences between observed and predicted
values) is constant across all levels of the independent variable.
4. Normality: Residuals are assumed to be normally distributed.
5. Multicollinearity: Independent variables are not perfectly correlated.
Types of Non-Linear Regression
There are two main types of Non-Linear regression in Machine Learning:
1. Parametric non-linear regression assumes that the relationship between the dependent and
independent variables can be modelled using a specific mathematical function. For example, the
relationship between the population of a country and time can be modeled using an exponential
function. Some common parametric non-linear regression models include: Polynomial regression,
Logistic regression, Exponential regression, Power regression etc.
2. Non-parametric non-linear regression does not assume that the relationship between the
dependent and independent variables can be modelled using a specific mathematical function.
Instead, it uses machine learning algorithms to learn the relationship from the data. Some common
non-parametric non-linear regression algorithms include: Kernel smoothing, Local polynomial
regression, Nearest neighbor regression etc.
Non-Linear Regression Algorithms
Nonlinear regression encompasses various types of models that capture relationships between
variables in a nonlinear manner. Here are some common types:
Polynomial Regression
Polynomial regression is a type of nonlinear regression that fits a polynomial function to the data.
The general form of a polynomial regression model is:
y = β0 + β1x + β2x² + … + βnxⁿ + ε
where,
 y : dependent variable
 x : independent variable
 β0, β1, …, βn : parameters of the model
 n : degree of the polynomial
 ε : error term
Exponential Regression
Exponential regression is a type of nonlinear regression that fits an exponential function to the data.
The general form of an exponential regression model is:
y = a·e^(bx)
where,
 y – dependent variable
 x – independent variable
 a, b – parameters of the model
Logarithmic Regression
Logarithmic regression is a type of nonlinear regression that fits a logarithmic function to the data.
The general form of a logarithmic regression model is:
y = a + b·ln(x)
where,
 y – dependent variable
 x – independent variable
 a, b – parameters of the model
Power Regression
Power regression is a type of nonlinear regression that fits a power function to the data. The general form
of a power regression model is:
y = a·x^b
where,
 y – dependent variable
 x – independent variable
 a, b – parameters of the model
Generalized Additive Models (GAMs)
Generalized additive models (GAMs) are a type of nonlinear regression that combines multiple
linear models to model a complex relationship between variables. The general form of a GAM is:
Y = f1(x1) + f2(x2) + … + fn(xn) + ε
where,
 Y – dependent variable
 x1, x2, …, xn – independent variables
 f1(x1), f2(x2), …, fn(xn) – smooth functions of the independent variables
 ε – error term
Gauss-Newton Algorithm:
The Gauss-Newton algorithm is an iterative optimization method designed for minimizing the
sum of squared differences between observed and predicted values in nonlinear least squares regression. It
iteratively updates parameter estimates by moving in the direction of the gradient of the objective function,
leveraging the Jacobian matrix and the residuals.
Gradient Descent algorithm
The Gradient Descent algorithm is a widely used iterative optimization technique for finding the
minimum of a function. In the context of nonlinear regression, it updates parameter estimates by iteratively
moving towards the direction of the steepest decrease in the objective function, with the learning rate
controlling the step size.
Levenberg-Marquardt algorithm
The Levenberg-Marquardt algorithm is a modification of the Gauss-Newton algorithm that
introduces a damping parameter to enhance robustness. It dynamically adjusts the step size during iterations
by combining the advantages of Gauss-Newton and gradient descent methods, providing a versatile
approach for solving nonlinear least squares problems.
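A minimal sketch of nonlinear least squares in Python using scipy.optimize.curve_fit, which by default (with no bounds) uses a Levenberg-Marquardt-type routine; the exponential model and synthetic data below are assumptions purely for illustration:

import numpy as np
from scipy.optimize import curve_fit

# Hypothetical exponential model y = a * exp(b * x)
def model(x, a, b):
    return a * np.exp(b * x)

rng = np.random.default_rng(42)
x = np.linspace(0, 2, 50)
y = 2.0 * np.exp(1.3 * x) + rng.normal(0, 0.5, x.size)  # synthetic noisy data

# curve_fit minimizes the sum of squared residuals starting from the guess p0
params, cov = curve_fit(model, x, y, p0=[1.0, 1.0])
print("estimated a, b:", params)   # should be close to 2.0 and 1.3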
Evaluating Non-Linear Regression Models
Evaluating the performance of a nonlinear regression model is crucial to ensure it accurately
represents the underlying relationship between the independent and dependent variables.
There are a number of different metrics that can be used to evaluate non-linear regression models, but some
common metrics are:
1. R-squared – R-squared (Coefficient of Determination) measures the proportion of variance in the
dependent variable that is explained by the independent variables in the model. It ranges from 0 to
1, where 0 indicates no explanation of variance and 1 indicates perfect explanation. A higher R-
squared value suggests a better model fit.
2. Adjusted R-squared – Adjusted R-squared is a modified version of R-squared that accounts for
the number of independent variables in the model. It penalizes models with more variables, making
it a more appropriate measure of goodness of fit when comparing models with different numbers of
independent variables. A higher adjusted R-squared value indicates a better model fit.
3. Root Mean Squared Error (RMSE) – Root Mean Squared Error (RMSE) is the square root of
MSE, providing a more intuitive measure of the average error in predictions. It represents the
average distance between the predicted and actual values of the dependent variable, scaled to the
same units as the dependent variable. A lower RMSE signifies a better model fit.
How does a Non-Linear Regression work?
Non-linear regression algorithms work by iteratively adjusting the parameters of a non-linear function
to minimize the error between the predicted values of the dependent variable and the actual values. The
specific function used depends on the nature of the relationship between the variables, and there are many
different types of non-linear functions that can be used.
 If we observe closely, we will realize that to evolve from linear regression to non-linear
regression we simply add higher-order terms of the features to the feature space. This is
sometimes loosely called feature engineering, though not exactly.
 The addition of non-linear terms is what allows us to fit a curvilinear model to the data at hand.
Even though non-linear regression is similar to linear regression, different kinds of challenges
are faced by the Machine Learning practitioner while training such a model, and hence several
established methods, such as Levenberg-Marquardt and Gauss-Newton, are used to develop
nonlinear models.
Here we are implementing Non-Linear Regression using Python:
Step-1: Importing libraries
Importing all the necessary libraries:
 Python3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

Step-2: Import Dataset


Importing and reading the dataset: Dataset Link
 Python3

# Read the CSV file


df = pd.read_csv('/content/gdp.csv')

# Display the first few rows of the dataframe


print(df.head())

Output:
Year Value
0 1960 5.918412e+10
1 1961 4.955705e+10
2 1962 4.668518e+10
3 1963 5.009730e+10
4 1964 5.906225e+10
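The original walkthrough is truncated here; the following is a hedged sketch of one plausible next step, using the libraries imported in Step-1 to fit a polynomial curve (degree 3 is an arbitrary choice) to the GDP series, assuming the df from Step-2 is available:

# Step-3 (illustrative sketch): fit a polynomial regression to Year vs. Value
X = df[['Year']].values - df['Year'].min()   # center years to keep powers small
y = df['Value'].values

# Expand the single feature into polynomial terms (1, x, x^2, x^3)
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)

model = LinearRegression()
model.fit(X_poly, y)
y_pred = model.predict(X_poly)

# Evaluate the fit with the metrics imported in Step-1
print('R2  :', r2_score(y, y_pred))
print('MAE :', mean_absolute_error(y, y_pred))
print('RMSE:', mean_squared_error(y, y_pred) ** 0.5)

# Visualize the fitted curve against the raw data
plt.scatter(df['Year'], y, label='actual')
plt.plot(df['Year'], y_pred, color='red', label='polynomial fit')
plt.legend()
plt.show()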
Linear VS Non-Linear Regression

Feature-by-feature comparison:
 Relationship between variables: Linear regression assumes a linear relationship between the
independent and dependent variables; non-linear regression allows for non-linear relationships
between them.
 Model complexity: Linear regression is a simpler model with fewer parameters; non-linear
regression is a more complex model with more parameters.
 Interpretability: Linear regression is highly interpretable due to the linear relationship; non-linear
regression is less interpretable due to the non-linear relationship.
 Overfitting susceptibility: Linear regression is less susceptible to overfitting due to its simplicity;
non-linear regression is more susceptible to overfitting due to its ability to capture complex
relationships.
 Flexibility: Linear regression requires large datasets to accurately estimate the linear relationship;
non-linear regression can work with smaller datasets due to its flexibility.
 Applications: Linear regression is suitable for predicting continuous target variables when the
relationship is linear; non-linear regression is suitable when the relationship is non-linear.
 Examples: Predicting house prices based on size and location (linear); predicting customer churn
based on behavioral patterns (non-linear).

Applications of Non-Linear Regression


Some real-world applications of Non-Linear Regression techniques include:
1. The insurance industry makes use of it. Its application is seen, for instance, in the IBNR reserve
computation.
2. In the field of agricultural research, it is crucial. Considering that nonlinear models more accurately
represent numerous crops and soil dynamics than linear ones.
3. There are uses for nonlinear models in forestry research because the majority of biological processes
are inherently nonlinear. An example would be a straightforward power function that relates a tree’s
weight or volume to its height or diameter.
4. It is employed in the framing of the problem and the derivation of statistical solutions to the
calibration problem in research and development.
5. One example from the world of chemistry is the development of a wide-range colorless gas, HCFC-
22 formulation, using a nonlinear model.
Advantages & Disadvantages of Non-Linear Regression
Advantages of Non-Linear Regression
1. Non-linear regression can model relationships that are not linear in nature.
2. Non-linear regression can be used to make predictions about the dependent variable based on the
values of the independent variables.
3. Non-linear regression can be used to identify the factors that influence the dependent variable.
Disadvantages of Non-Linear Regression
1. Non-linear regression models can be more complex to implement than linear regression.
2. Non-linear regression models can be more sensitive to outliers than linear regression models.
3. Non-linear regression models can be more computationally expensive to train than linear regression
models.
Questions on Non-Linear Regression
1. What is non-linear regression?
Non-linear regression in Machine Learning is a statistical method used to model the relationship
between a dependent variable and one or more independent variables when that relationship is not linear.
This means that the relationship between the variables cannot be represented by a straight line.
2. Is nonlinear regression better than linear regression?
 Non-linear regression is more flexible than linear regression and can model a wider range of
relationships between variables.
 Non-linear regression can be more accurate than linear regression for nonlinear relationships.
 Non-linear regression is more complex and computationally expensive than linear regression.
 The choice between linear and non-linear regression depends on the specific problem and data.
3. How do you calculate nonlinear regression?
Calculating Non-Linear Regression using Python involves fitting a nonlinear model to the data to
capture the relationship between the dependent and independent variables.
It can be calculated using the general model form:
Y = f(X, β) + ϵ
Where,
 f – regression function
 X, β – the vector of predictors and the vector of parameters
 ϵ – error term
4. When should I use non-linear regression?
Use Non-Linear Regression in Machine Learning when the relationship between the dependent and
independent variables is not linear. This can be determined by plotting the data and inspecting the
scatterplot.
Key Differences in Correlation and Covariance for Linear vs. Nonlinear Regression:

Aspect: Measure
 Covariance (Linear Regression): Measures linear co-movement between variables X and Y.
 Covariance (Nonlinear Regression): Measures linear co-movement but doesn't account for nonlinearity.
 Correlation (Linear Regression): Measures the strength and direction of a linear relationship.
 Correlation (Nonlinear Regression): Measures a linear relationship; does not capture nonlinear
relationships well.
Aspect: Interpretation
 Covariance (Linear Regression): Helps compute the slope in linear regression.
 Covariance (Nonlinear Regression): Provides limited insight because it does not reflect curved
relationships.
 Correlation (Linear Regression): Correlation close to +1/-1 indicates a strong linear relationship.
 Correlation (Nonlinear Regression): Traditional correlation (Pearson) might fail; Spearman can capture
monotonic nonlinear relationships.
Aspect: Sensitivity to Linearity
 Covariance (Linear Regression): Appropriate for linear models, directly linked to the slope.
 Covariance (Nonlinear Regression): Misleading for nonlinear models as it assumes linearity.
 Correlation (Linear Regression): Accurate and useful for linear models.
 Correlation (Nonlinear Regression): Not suitable for nonlinear models unless using specialized
correlations (e.g., Spearman).
Aspect: Impact on Regression Coefficients
 Covariance (Linear Regression): Directly influences the slope in a linear regression model.
 Covariance (Nonlinear Regression): No direct impact on nonlinear regression coefficients.
 Correlation (Linear Regression): Correlation helps assess how well X predicts Y.
 Correlation (Nonlinear Regression): Pearson correlation won't help with nonlinear model fitting, but
Spearman or other rank-based correlations can.

What is logistic regression?


Logistic regression is a data analysis technique that uses mathematics to find the relationships
between two data factors. It then uses this relationship to predict the value of one of those factors based on
the other.
Logistic regression function:-
Logistic regression is a statistical model that uses the logistic function, or logit function, in
mathematics as the equation between x and y. The logistic function maps y as a sigmoid (S-shaped)
function of x:
y = 1 / (1 + e^(-x))
If you plot this logistic regression equation, you will get an S-curve.
The logistic function returns only values between 0 and 1 for the dependent variable, irrespective of
the values of the independent variable. This is how logistic regression estimates the value of the dependent
variable. Logistic regression methods also model equations between multiple independent variables and
one dependent variable.
Logistic regression analysis with multiple independent variables
In many cases, multiple explanatory variables affect the value of the dependent variable. To model
such input datasets, logistic regression formulas assume a linear relationship between the different
independent variables. You can modify the sigmoid function and compute the final output variable as
y = f(β0 + β1x1 + β2x2+… βnxn)
The symbol β represents the regression coefficient. The logit model can reverse calculate these
coefficient values when you give it a sufficiently large experimental dataset with known values of both
dependent and independent variables.
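A minimal sketch of this computation, with two hypothetical input variables and arbitrarily chosen coefficients β (both are assumptions for illustration):

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary illustrative coefficients and inputs
beta0, beta1, beta2 = -1.5, 0.8, 0.3
x1, x2 = 2.0, 1.0

# y = f(beta0 + beta1*x1 + beta2*x2): estimated probability of the outcome being 1
y = sigmoid(beta0 + beta1 * x1 + beta2 * x2)
print(round(y, 3))   # about 0.599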
Log odds
The logit model can also determine the ratio of success to failure or log odds.
Mathematically, your odds in terms of probability are p/(1 - p), and your log odds are log (p/(1 - p)).
You can represent the logistic model in terms of log odds as shown below:
log (p / (1 - p)) = β0 + β1x1 + β2x2 + … + βnxn
For example, if you were playing poker with your friends and you won four matches out of 10, your
odds of winning are four sixths, or four out of six, which is the ratio of your success to failure. The
probability of winning, on the other hand, is four out of 10.
What are the types of logistic regression analysis?
There are three approaches to logistic regression analysis based on the outcomes of the dependent
variable.

 Binary logistic regression


Binary logistic regression works well for binary classification problems that have only two possible
outcomes. The dependent variable can have only two values, such as yes and no or 0 and 1.
Even though the logistic function calculates a range of values between 0 and 1, the binary regression
model rounds the answer to the closest value. Generally, answers below 0.5 are rounded to 0, and answers
above 0.5 are rounded to 1, so that the logistic function returns a binary outcome (a short code sketch
follows this list).
 Multinomial logistic regression
Multinomial regression can analyze problems that have several possible outcomes as long as the
number of outcomes is finite. For example, it can predict if house prices will increase by 25%, 50%, 75%,
or 100% based on population data, but it cannot predict the exact value of a house.
Multinomial logistic regression works by mapping outcome values to different values between 0
and 1. Since the logistic function can return a range of continuous data, like 0.1, 0.11, 0.12, and so on,
multinomial regression also groups the output to the closest possible values.

 Ordinal logistic regression


Ordinal logistic regression, or the ordered logit model, is a special type of multinomial regression
for problems in which numbers represent ranks rather than actual values. For example, you would use
ordinal regression to predict the answer to a survey question that asks customers to rank your service as
poor, fair, good, or excellent based on a numerical value, such as the number of items they purchase from
you over the year.
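A minimal sketch of binary logistic regression with scikit-learn; the tiny data set (hours spent on a site vs. whether an ad was clicked) is a made-up assumption for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one feature, binary labels
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns P(class 0) and P(class 1); predict applies the 0.5 cutoff
print(clf.predict_proba([[2.2]]))
print(clf.predict([[2.2]]))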
What are the applications of logistic regression?
Logistic regression has several real-world applications in many different industries.

 Manufacturing
Manufacturing companies use logistic regression analysis to estimate the probability of part failure
in machinery. They then plan maintenance schedules based on this estimate to minimize future failures.

 Healthcare
Medical researchers plan preventive care and treatment by predicting the likelihood of disease in
patients. They use logistic regression models to compare the impact of family history or genes on diseases.

 Finance
Financial companies have to analyze financial transactions for fraud and assess loan applications
and insurance applications for risk. These problems are suitable for a logistic regression model because
they have discrete outcomes, like high risk or low risk and fraudulent or not fraudulent.

 Marketing
Online advertising tools use the logistic regression model to predict if users will click on an
advertisement. As a result, marketers can analyze user responses to different words and images and create
high-performing advertisements with which customers will engage.

Advantages and disadvantages of logistic regression:

Advantages:
 Logistic regression is easier to implement, interpret, and very efficient to train.
 It makes no assumptions about distributions of classes in feature space.
 It can easily extend to multiple classes (multinomial regression) and gives a natural probabilistic
view of class predictions.
 It not only provides a measure of how appropriate a predictor is (coefficient size), but also its
direction of association (positive or negative).
 It is very fast at classifying unknown records.
 It gives good accuracy for many simple data sets and performs well when the dataset is linearly
separable.
 Its model coefficients can be interpreted as indicators of feature importance.
 It is less inclined to over-fitting, but it can overfit in high-dimensional datasets. One may consider
regularization (L1 and L2) techniques to avoid over-fitting in these scenarios.

Disadvantages:
 If the number of observations is smaller than the number of features, Logistic Regression should
not be used; otherwise, it may lead to overfitting.
 It constructs linear boundaries.
 The major limitation of Logistic Regression is the assumption of linearity between the dependent
variable and the independent variables.
 It can only be used to predict discrete functions; hence, the dependent variable of Logistic
Regression is bound to the discrete number set.
 Non-linear problems can’t be solved with logistic regression because it has a linear decision
surface. Linearly separable data is rarely found in real-world scenarios.
 Logistic Regression requires little or no multicollinearity between independent variables.
 It is tough to obtain complex relationships using logistic regression. More powerful and compact
algorithms such as Neural Networks can easily outperform it.
 While linear regression assumes the independent and dependent variables are related linearly,
Logistic Regression requires the independent variables to be linearly related to the log odds
(log(p/(1-p))).

Variable Transformation:-
What is Variable Transformation?
Variable transformation refers to the process of changing the scale or distribution of a variable in a
dataset to improve the accuracy, interpretability, or performance of statistical models, machine learning
algorithms, or data analysis techniques.
This technique is particularly important in the fields of statistics, data analysis, and data science,
where the assumptions of normality and homoscedasticity (constant variance) are often critical for the
validity of inferential statistics. By transforming variables, analysts can make their data more suitable for
various analytical methods, leading to more accurate and reliable results.

Types of Variable Transformation


1. Continuous variable
2. Categorical variable
1. Continuous Variables or Numeric Variables:
Purposes:
 To change the scale of the variables
 To transform skewed data distribution to normal distribution
Standardization
x′i = (xi – x̄) / s
Useful when we have a few very large values.
Min-max scaling (Normalization)
x′i = (xi – xmin) / (xmax – xmin)
Dependent on the min and max values, which makes it sensitive to outliers.
Best to use when you have values in a fixed interval.
Square Root/Cube Root
 When variables have positive skewness or residuals show positive heteroskedasticity.
 Frequency counts variable
 Data have many 0 or extremely small values.
Logarithmic
 Variables have positively skewed distribution

Formula – In case
 x′i = log(xi) – cannot handle zero, because log(0) = -Inf
 x′i = log(xi + 1) – variables with 0
 x′i = log(xi + c) – general shift by a constant c
 x′i = (xi / |xi|) · log|xi| – variables with negative values
 x′i = log(xi + √(xi² + λ)) – generalized log transformation

For the general case of log(xi + c), choosing the constant c is rather tricky.


Exponential
 Negatively skewed data
 Underlying logarithmic trend (e.g., survival, decay)
Power
 Variables have negatively skewed distribution
Inverse/Reciprocal
 Variables have platykurtic distribution
 Data are positively skewed
 Ratio data
Hyperbolic arcsine
 Variables with positively skewed distribution
Categorical Variables
Purposes
 To transform to continuous variable (for machine learning models) (e.g., encoding/ embedding in
text mining)
Approaches (a short code sketch of a few of these encoders follows this list):
1. One Hot Encoding
 In this technique, a new column/feature is created for each category in the categorical variable and
filled with either 1 (presence of the category) or 0 (absence of the category). The number of
columns/features depends on the number of categories in the categorical variable. This method slows
down the learning process significantly if the number of categories is very high.
 For regression, we can use N-1 columns (drop the first or last column of the one-hot coded new features).
 For classification, the recommendation is to use all N columns, as most tree-based algorithms
build a tree based on all available variables.
 Disadvantages:
 Tree algorithms do not work well with one-hot encoded data since it creates a sparse matrix.
 When the feature contains too many unique values, that many features are created, which may result
in overfitting.
2. Label Encoding
 In this encoding, a unique value is assigned to each label/category.
 One major issue with sklearn.LabelEncoder is that it assigns the values to the labels based on the
alphabetical order of the labels.
 Ex: Cold < Hot < Very Hot < Warm …. 0 < 1 < 2 < 3
 Disadvantages:
 It can mislead the model by assigning values based on alphabetical order instead of the actual label
order.
3. Ordinal Encoding
 Encode categorical features as an integer array.
 The input to this transformer should be an array-like of integers or strings, denoting the values taken
on by categorical (discrete) features. The features are converted to ordinal integers. This results in
a single column of integers (0 to n_categories - 1) per feature.
4. Frequency or Count Encoder
 In frequency encoding, each of the categories in the feature is replaced with the frequencies of
categories.
 Here, the frequency of the categories is somewhat related to the target variable; this helps the model to
understand and assign weights in direct or inverse proportion, depending on the nature of the
data.
 Category refers to each of the unique values in a feature.
 Frequency(category) = Number of values in that category
 Size(data) = Size of the entire dataset.
Disadvantage:
 If two categories have the same frequency then it is hard to distinguish between them.
5. Binary Encoding
 It is similar to one-hot encoding, but stores the categories as binary bitstrings, i.e., each digit of the
binary bitstring becomes one feature column.
 Compared to One Hot Encoding, this will require fewer feature columns (for 100 categories, One
Hot Encoding will have 100 features, while Binary Encoding will need just seven features).
 Feature -> ordinal encoding -> binary code -> digits of the binary code go to separate columns.
6. Base-N encoder
 Base-N encoder encodes the categories into arrays of their base-N representation.
 A base of 1 is equivalent to one-hot encoding (not really base-1, but useful), a base of 2 is equivalent
to binary encoding. N=number of actual categories is equivalent to vanilla ordinal encoding.
7. Helmert Encoding
 Helmert coding is a third commonly used type of categorical encoding for regression along with
OHE and Sum Encoding.
 It compares each level of a categorical variable to the mean of the subsequent levels.
 The version in category_encoders is sometimes referred to as Reverse Helmert Coding.
 It is useful in certain situations where levels of the categorical variable are ordered, say, from lowest
to highest, or from smallest to largest.
8. Mean Encoding or Target Encoding
 It has become the most popular encoding type because of Kaggle competitions.
 It takes information about the target to encode categories, which makes it extremely powerful.
 In Target Encoding, labels are correlated directly with the target, i.e., each category in the feature is
replaced with the mean value of the target variable for that category, computed on the training data.
Advantage :
 It does not affect the volume of the data and helps in faster learning.
Disadvantage :
 Target leakage: it uses information about the target. Because of the target leakage, the model overfits
the training data, which results in unreliable validation and lower test scores.
 To reduce the effect of target leakage:
 Increase regularization
 Add random noise to the representation of the category in train dataset (some sort of augmentation)
 Use Double Validation (using other validation)
9. Weight of Evidence Encoding
 This method was developed primarily to build a predictive model to evaluate the risk of loan default
in the credit and financial industry.
 It is a measure of the “strength” of a grouping for separating good and bad risk (default).
 Weight of evidence (WOE) is a measure of how much the evidence supports or undermines a
hypothesis.
 The weight of evidence is typically computed as WoE = ln(Distr Goods / Distr Bads), where:
 Distr Goods -> Distribution of Good Credit Outcomes
 Distr Bads -> Distribution of Bad Credit Outcomes
 However, the above formula might lead to target leakage and overfitting.
 To avoid that, regularization parameter a is induced and WoE is calculated in the following way:
10. Sum Encoder (Deviation Encoding or Effect Encoding)
 Compares the mean of the dependent variable (target) for a given level of a categorical column to
the overall mean of the target.
 Sum Encoding is very similar to OHE and both of them are commonly used in Linear Regression
(LR) types of models.
 However, the difference between them is the interpretation of LR coefficients: in OHE model the
intercept represents the mean for the baseline condition and coefficients represents simple effects
(the difference between one particular condition and the baseline), whereas in Sum Encoder model
the intercept represents the grand mean (across all conditions) and the coefficients can be interpreted
directly as the main effects.
11. Leave-one-out Encoder (LOO or LOOE)
 It is another example of target-based encoders.
 This encoder calculates the mean target of category k for observation j with observation j removed
from the dataset:
 While encoding the test dataset, a category is replaced with the mean target of the category k in the
train dataset:
Disadvantage :
 As with all other target-based encoders, the problem with LOO is target leakage.
12. CatBoost Encoder
 Catboost is a recently created target-based categorical encoder.
 It is intended to overcome target leakage problems inherent in LOO.
 To prevent overfitting, the process of target encoding for train dataset is repeated several times on
shuffled versions of the dataset and results are averaged.
13. James-Stein Encoding
 James-Stein Encoder is a target-based encoder.
 The idea behind James-Stein Encoder is simple. Estimation of the mean target for category k could
be calculated according to the following formula:
 One way to select B is to tune it like a hyperparameter via cross-validation, but Charles Stein came
up with another solution to the problem:
Disadvantage :
 It is defined only for normal distribution (which is not the case in real time).
 To avoid that, we can either convert binary targets with a log-odds ratio as it was done in WoE
Encoder (which is used by default because it is simple) or use beta distribution.
14. M-estimator Encoding
 M-Estimate Encoder is a simplified version of Target Encoder.
 It has only one hyperparameter — m, which represents the power of regularization.
 The higher value of m results into stronger shrinking.
 Recommended values for m is in the range of 1 to 100.
15. Hashing Encoding
Hashing converts categorical variables to a higher-dimensional space of integers, where the distance
between two vectors of categorical variables is approximately maintained in the transformed numerical
space.
With hashing, the number of dimensions will be far less than with an encoding like One Hot
Encoding.
Advantage:
This method is advantageous when the cardinality of the categorical variable is very high; the output
dimensionality is controlled with the parameter n_components.
Disadvantage:
It is slow compared to other encoders.
16. Backward Difference Encoding
In backward difference coding, the mean of the dependent variable for a level is compared with the
mean of the dependent variable for the prior level.
This technique falls under the contrast coding system for categorical features. A feature of K
categories, or levels, usually enters a regression as a sequence of K-1 dummy variables.
17. Polynomial Encoding
Polynomial contrast coding for the encoding of categorical features.
18. MultiLabelBinarizer
MultiLabel Binarizer is used when any column has multiple labels.
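A minimal sketch of a few of the encoders above on a tiny made-up column (pandas for one-hot and frequency encoding, scikit-learn for ordinal encoding, a plain groupby for mean/target encoding; the data are assumptions for illustration):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: one categorical feature and a binary target
df = pd.DataFrame({
    'city':   ['A', 'B', 'A', 'C', 'B', 'C'],
    'target': [1,   0,   1,   0,   1,   0],
})

# 1. One Hot Encoding: one 0/1 column per category
one_hot = pd.get_dummies(df['city'], prefix='city')

# 3. Ordinal Encoding: each category mapped to an integer 0..n_categories-1
ordinal = OrdinalEncoder().fit_transform(df[['city']])

# 4. Frequency/Count Encoding: each category replaced by how often it occurs
freq = df['city'].map(df['city'].value_counts())

# 8. Mean (Target) Encoding: each category replaced by the mean target on training data
target_mean = df['city'].map(df.groupby('city')['target'].mean())

print(one_hot, ordinal, freq.values, target_mean.values, sep='\n\n')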

Variable Transformation:
Variable transformation refers to the process of changing the scale, distribution, or format of a
variable in a dataset to make it more suitable for analysis. This technique is often used to improve the
performance of statistical models, enhance interpretability, and meet the assumptions of various analytical
methods.
Why Transform Variables?
1. Normalization/Standardization: Adjusting data to a common scale to make it easier to compare.
2. Handling Skewness: Transformations can help normalize distributions, making statistical analysis
more robust.
3. Improving Relationships: Some transformations can linearize relationships, making regression
analysis more effective.
4. Dealing with Outliers: Transformations can mitigate the influence of outliers on the analysis.
Common Types of Transformations
1. Log Transformation:
Log transformation is a mathematical technique used to convert data into a logarithmic scale.

Formula – In case
 x′i = log(xi) – cannot handle zero, because log(0) = -Inf
 x′i = log(xi + 1) – variables with 0
 x′i = log(xi + c) – general shift by a constant c
 x′i = (xi / |xi|) · log|xi| – variables with negative values
 x′i = log(xi + √(xi² + λ)) – generalized log transformation

o Usage: Reduces skewness, especially for right-skewed data.


o Example: If you have income data that varies widely, using the natural log of income can
make the distribution more normal.
2. Square Root Transformation:

The square root transformation is a data transformation technique used primarily to reduce
skewness and stabilize the variance in data. It's particularly useful when dealing with data that
have a right-skewed (positively skewed) distribution, where values are clustered on the lower end
but extend far to the right.
How to Perform a Square Root Transformation
To apply a square root transformation to a variable X, you calculate:
Y = √X

If X contains zero or negative values, you may need to adjust the data before applying the
transformation. A common adjustment is to add a small constant (e.g., 1) to each value to avoid
taking the square root of zero or a negative number:
Y = √(X + k)

where k is a small constant, usually 1.

o Usage: Useful for count data, particularly when the data is right-skewed.
o Example 1: Number of occurrences of an event (like website visits).
o Example 2: suppose you have data: [4,9,16,25].
Applying the square root transformation:
Y = [√4,√9,√16,√25]=[2,3,4,5]
3. Box-Cox Transformation:
The Box-Cox transformation is a family of power transformations designed to stabilize variance,
make the data more normally distributed, and improve the performance of statistical models. It's
particularly useful when data are skewed and do not meet the assumptions of linear regression or
ANOVA.
How the Box-Cox Transformation Works
The Box-Cox transformation applies a power transformation to a variable X using a parameter λ.
The transformation is defined as:
Y = (X^λ – 1) / λ, if λ ≠ 0
Y = ln(X), if λ = 0
Where:
 X must be strictly positive; if not, a constant is often added to make all values positive.
 λ is a parameter estimated from the data to determine the best transformation, which makes the
transformed data as close to normal as possible.
o Usage: A family of power transformations that can stabilize variance and make the data more
normal.
o Example: It can be applied to various types of data depending on the optimal λ parameter.
4. Z-Score Standardization:
Z-score standardization, also known as standard score normalization or Z-score normalization, is a
statistical technique used to rescale data so that it has a mean of 0 and a standard deviation of 1. This
transformation is commonly used in data preprocessing for machine learning and statistical analysis,
especially when the scale of the features varies significantly.
How Z-Score Standardization Works
The Z-score of a data point indicates how many standard deviations it is from the mean of the
distribution. The formula for Z-score standardization is:
Z = (X−μ)/σ
Where:
 X = the original value of the data point
 μ = the mean of the data
 σ = the standard deviation of the data
Example: Useful in machine learning algorithms like k-means clustering.
5. Min-Max Scaling:
Min-Max Scaling, also known as Min-Max Normalization, is a feature scaling technique that
transforms data values into a fixed range, typically between 0 and 1. This method rescales the data by
adjusting the values proportionally within a specified range, making it easier to compare features on
different scales.
How Min-Max Scaling Works
Min-Max scaling transforms each data point according to the formula:
X′ = (X−Xmin) / (Xmax−Xmin)
Where:
 X = the original value of the data point
 Xmin = the minimum value in the dataset
 Xmax = the maximum value in the dataset
 X′ = the scaled value, typically between 0 and 1
o Usage: Rescales features to a fixed range, typically [0, 1].
o Example: Helpful in algorithms that require bounded input values.
Suppose you have a dataset: X=[10,20,30,40,50]
1. Find Xmin and Xmax:
o Xmin=10
o Xmax=50
2. Apply Min-Max Scaling to each value:
For X=10:
X′ = (10−10) / (50−10) = 0 / 40 = 0
For X=20:
X′ = (20−10) / (50−10) = 10 / 40 = 0.25
For X=30:
X′ = (30−10) / (50−10) = 20 / 40 = 0.5
For X=40:
X′ = (40−10) / (50−10) = 30 / 40 = 0.75
For X=50:
X′ = (50−10) / (50−10) = 40 / 40 =1
The scaled data: [0,0.25,0.5,0.75,1]
6. One-Hot Encoding:
o Usage: Converts categorical variables into binary format.
o Example: Transforming a "Color" variable with values "Red," "Green," and "Blue" into
three binary variables (a combined code sketch of these transformations follows this list).
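A minimal combined sketch of the transformations above on a small made-up array (numpy for log and square root, scipy for Box-Cox, scikit-learn for Z-score and min-max scaling, pandas for one-hot; all values are illustrative):

import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import StandardScaler, MinMaxScaler

x = np.array([1.0, 2.0, 4.0, 8.0, 50.0])   # right-skewed toy data

log_x = np.log(x)                            # 1. log transformation
sqrt_x = np.sqrt(x)                          # 2. square root transformation
boxcox_x, lam = stats.boxcox(x)              # 3. Box-Cox (lambda estimated from the data)
z_x = StandardScaler().fit_transform(x.reshape(-1, 1))      # 4. Z-score standardization
minmax_x = MinMaxScaler().fit_transform(x.reshape(-1, 1))   # 5. min-max scaling to [0, 1]
one_hot = pd.get_dummies(pd.Series(['Red', 'Green', 'Blue']))  # 6. one-hot encoding

print('estimated lambda:', round(lam, 3))
print('min-max scaled :', minmax_x.ravel())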
SPINNING of VARIABLES:
Spinning of variables, also known as feature engineering or variable transformation, involves
creating new variables from existing ones to improve model performance.
The term "spinning of variables" can refer to the rotation or transformation of variables to
simplify relationships between them or to meet specific statistical assumptions.
Types of spinning of variables
1. Orthogonal Rotation (Varimax)
In Principal Component Analysis (PCA) or Factor Analysis, an orthogonal rotation like Varimax
maximizes the variance of squared loadings of each factor. This simplifies the interpretation but keeps the
factors uncorrelated.
The goal is to maximize (one common form of the varimax criterion):
V = Σ_j [ (1/p) Σ_i λij⁴ – ((1/p) Σ_i λij²)² ]
where:
 λij represents the loading of the i-th variable on the j-th factor,
 p is the number of variables,
 k is the number of factors.
2. Box-Cox Transformation
The Box-Cox transformation is used to stabilize variance and make data more normal. The
transformation is defined as:
y(λ) = (y^λ – 1) / λ, if λ ≠ 0
y(λ) = ln(y), if λ = 0
where:
 y is the variable to be transformed,
 λ is the transformation parameter.
3. Logarithmic Transformation
The logarithmic transformation is used to deal with skewed data, often applied to compress large
ranges. The formula for transforming a variable y is:
y′ = log(y)
This transformation is useful when the variable has a multiplicative relationship or wide ranges.
4. Fourier Transform
The Fourier Transform converts a signal or time series from the time domain to the frequency
domain. The transformation is:
X(f) = ∫−∞∞ x(t) e^(−i2πft) dt
where:
 X(f) is the Fourier-transformed function,
 x(t) is the original signal as a function of time,
 f is frequency,
 e^(−i2πft) is the complex exponential that "spins" the signal.
5. Wavelet Transform
The Wavelet Transform is used for time-frequency analysis. A continuous wavelet transform
(CWT) is represented as:
Wx(a, b) = (1/√a) ∫−∞∞ x(t) ψ*((t − b) / a) dt
where:
 x(t) is the signal,
 ψ is the wavelet function,
 a is the scale parameter (frequency),
 b is the translation parameter (time),
 ψ* denotes the complex conjugate of the wavelet.
6. Polar to Cartesian Coordinate Transformation
To transform from polar coordinates (r,θ) to Cartesian coordinates (x,y) the formulas are:
X = r cos(θ)
Y = r sin(θ)
This transformation essentially "spins" the variables from a radial system to the usual Cartesian
plane.
7. Logit Transformation
In logistic regression, the logit transformation is used to model the probability of a binary outcome.
It is defined as:
logit(p) = log (p / (1−p))
where:
 p is the probability of success (or 1 in a binary outcome).
8. Principal Component Analysis (PCA)
The transformation matrix W in PCA is obtained by solving the eigenvalue problem:
W = eig(XᵀX)
where:
 X is the data matrix,
 W contains the eigenvectors (principal components).
The transformed data Z is then:
Z=XW
where Z contains the transformed (rotated) variables.
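A minimal sketch of PCA with scikit-learn; the random data is purely illustrative:

# PCA: project the data onto its principal components (Z = X W)
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 observations, 5 variables

pca = PCA(n_components=2)
Z = pca.fit_transform(X)               # transformed (rotated) variables
print(pca.components_.shape)           # (2, 5): principal directions
print(Z.shape)                         # (100, 2)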
Population Stability Index (PSI):-
Population Stability Index (PSI) is a statistical measure used primarily in risk management, credit scoring,
and other business domains to assess the stability of a predictive model over time by comparing the
distribution of a characteristic or score between two different time periods or datasets. PSI helps monitor
whether a model's performance might degrade over time due to shifts in the population (e.g., customer base,
risk profiles) it was originally built for.
When analyzing data or model characteristics, PSI ensures that significant changes in these characteristics
over time are identified, allowing organizations to react accordingly, such as retraining models or updating
strategies.
How Population Stability Index Works
PSI compares the distributions of a characteristic or variable (such as a credit score, income, or any model
output) between two populations:
 Reference Population: Often the population used to build or validate the original model (could be
historical data).
 Comparison Population: A newer dataset (typically more recent data or from a different time period)
to be compared with the reference population.
Steps to Calculate Population Stability Index
1. Divide the Data into Bins:
o The characteristic you're analyzing (e.g., credit score, income) is divided into several
intervals or bins, either equal-width or equal-frequency. Usually, 10 bins (deciles) are used, but this
can vary based on the distribution.
2. Calculate the Percentage of Observations in Each Bin for Both Populations:
o For each bin, calculate the proportion of observations in both the reference and comparison
populations. This provides the distribution of the variable for each population.
3. Compute the PSI for Each Bin:
o For each bin, the PSI is calculated using the formula:
PSI (per bin) = (Pref − Pcomp) × ln(Pref / Pcomp)
Where:
 Pref = Percentage of observations in the bin from the reference population.
 Pcomp = Percentage of observations in the bin from the comparison population.
 ln is the natural logarithm.
4. Sum the PSI Across All Bins:
o The total PSI is the sum of the PSI values across all bins.
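A minimal sketch of these steps in Python, assuming NumPy and two 1-D numeric samples; the decile binning and the simulated credit-score data are illustrative:

# Population Stability Index: bin, compute proportions, apply the formula, sum
import numpy as np

def population_stability_index(reference, comparison, n_bins=10):
    # Bin edges taken from reference-population quantiles (deciles by default)
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # capture out-of-range values

    ref_counts, _ = np.histogram(reference, bins=edges)
    cmp_counts, _ = np.histogram(comparison, bins=edges)

    ref_pct = ref_counts / ref_counts.sum()
    cmp_pct = cmp_counts / cmp_counts.sum()

    # Small floor avoids division by zero / log(0) for empty bins
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cmp_pct = np.clip(cmp_pct, 1e-6, None)

    return np.sum((ref_pct - cmp_pct) * np.log(ref_pct / cmp_pct))

rng = np.random.default_rng(0)
reference = rng.normal(600, 50, 10_000)      # e.g. historical credit scores
comparison = rng.normal(580, 60, 10_000)     # newer, slightly shifted scores
print(round(population_stability_index(reference, comparison), 3))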
Interpreting PSI Values
 PSI < 0.1: The population is considered stable, meaning there is no significant change in the
distribution between the two datasets.
 0.1 ≤ PSI < 0.25: The population shows moderate changes, which might indicate some shift that
could warrant further investigation.
 PSI ≥ 0.25: Significant change in the population distribution, suggesting the model may no longer
be valid for the current population and could require updates or retraining.
Importance of PSI in Characteristics Analysis
Population Stability Index directly ties into characteristics analysis because it evaluates how the
distribution of key features (characteristics) used in a model or analysis has changed over time. If certain
characteristics shift, the PSI flags this, signaling that the assumptions used to develop predictive models
might no longer hold true for the current population.
Example of PSI Calculation in Practice
Let’s consider an example where a credit scoring model was built based on historical data (reference
population), and now you want to see if the model still performs well with the new customer base
(comparison population). By calculating the PSI for the credit score across both populations, you can detect
if the customer risk profiles have shifted (e.g., a larger portion of customers with lower or higher credit
scores).
Steps:
1. Bin the Credit Scores: Divide credit scores into bins (e.g., 300-500, 500-600, 600-700, etc.).
2. Calculate Proportions: For each bin, calculate the percentage of customers in that range for both the
reference population (historical data) and the current population.
3. Apply PSI Formula: Use the PSI formula for each bin and sum the results to get the total PSI score.
4. Interpret: If the PSI is high, it indicates that the distribution of credit scores has shifted significantly,
suggesting that the model might not perform as expected.
Characteristics Analysis and PSI
Characteristics analysis involves understanding the specific features or attributes of data, and PSI helps
assess whether these features are changing over time. Key aspects of characteristics analysis that link to
PSI include:
1. Feature Distribution:
o PSI tracks shifts in the distribution of key features (e.g., age, income, credit score) that the
model relies on. If the distribution changes significantly, it may affect model performance.
2. Population Segmentation:
o In characteristics analysis, segmenting populations based on features (e.g., high-income vs.
low-income groups) is common. PSI can be used to monitor whether the distribution within
these segments has shifted, indicating potential model drift.
3. Model Monitoring:
o PSI provides a direct measure of how stable the key characteristics of the population remain
over time. This is crucial for models that rely on these characteristics, ensuring that they
continue to apply to the current population.
4. Risk Assessment:
o Changes in customer characteristics like spending patterns, creditworthiness, or behavior
can alter risk assessments. PSI helps in monitoring these shifts to recalibrate risk models if
necessary.
Example Use Cases of PSI in Characteristics Analysis
1. Credit Scoring:
o PSI can be used to monitor shifts in credit score distributions over time. If the score
distribution shifts significantly (e.g., more customers falling into lower-score bins), this
might indicate a change in the risk profile of the customer base.
2. Marketing and Customer Segmentation:
o A marketing team might monitor the PSI of customer demographics (e.g., age, location) to
see if the target audience is changing over time. For instance, if the PSI on age distribution
shows significant change, it might suggest the need to adjust marketing strategies.
3. Loan Default Prediction Models:
o Loan prediction models rely on several characteristics, such as income, debt-to-income ratio,
and employment status. A shift in these characteristics, as measured by PSI, could indicate
a need to recalibrate the model to better predict loan defaults.
Characteristics Analysis:
Characteristics analysis refers to examining specific attributes or traits of data points to uncover
insights, identify trends, or make predictions. This process involves identifying the key features or variables
in the data that are most relevant to the questions or objectives at hand. The following steps describe how
characteristics analysis is carried out in data analytics and reporting:
1. Identifying Key Features (Variables):
 Feature Selection: Choose the most important characteristics from the dataset that will provide
meaningful insights. These features could be categorical (e.g., gender, country) or numerical (e.g.,
age, income).
 Domain-Specific Characteristics: In different fields, such as healthcare, finance, or marketing, the
important features will vary. For instance:
o In healthcare, features might include age, diagnosis, or test results.
o In finance, it might be transaction amounts, credit scores, or account balances.
2. Data Profiling:
 Descriptive Statistics: Calculate key statistics (e.g., mean, median, standard deviation) for each
characteristic to summarize the data.
 Distribution Analysis: Analyze the distribution of different features. For example, are certain
characteristics skewed or normally distributed? Are there any outliers?
 Correlations: Identify relationships between different features. For instance, is there a correlation
between income level and spending habits?
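The profiling steps above can be sketched in a few lines of pandas; the DataFrame and column names are purely illustrative:

# Descriptive statistics, distribution shape, and correlations per characteristic
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 32, 47, 51, 38, 29],
    "income": [30_000, 42_000, 88_000, 95_000, 61_000, 39_000],
})

print(df.describe())     # mean, std, quartiles for each feature
print(df.skew())         # skewness of each distribution
print(df.corr())         # pairwise correlations (e.g. age vs. income)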
3. Segmentation and Grouping:
 Cluster Analysis: Group data points based on similar characteristics. For instance, customer
segmentation can be done by grouping customers with similar buying behaviors.
 Cohort Analysis: Analyze characteristics across different groups or periods. For example, a cohort
of users who signed up during a particular month can be tracked over time to assess retention.
 Classification: Use classification algorithms to categorize data points based on their characteristics
(e.g., classifying customers as "high risk" or "low risk").
4. Anomaly Detection:
 Outlier Identification: Detect data points that have unusual characteristics compared to the rest of
the dataset. These could indicate potential fraud, errors, or unique opportunities.
 Pattern Recognition: Identify patterns in data that deviate from the norm. For example, in time series
analysis, a sudden spike in sales may signal an anomaly worth investigating.
5. Trend Analysis:
 Temporal Characteristics: Analyze how characteristics evolve over time. This could involve looking
at changes in customer behavior, sales figures, or website traffic over months or years.
 Seasonality and Cyclic Patterns: Identify repeating patterns or seasonal trends in characteristics,
which can guide future business decisions (e.g., predicting high sales during holiday periods).
6. Predictive Analytics:
 Model Building: Characteristics from historical data can be used to build predictive models. These
models use machine learning algorithms to predict future outcomes based on specific features. For
instance, predicting customer churn based on engagement characteristics.
 Feature Importance: When building predictive models, it’s crucial to assess the importance of each
characteristic to the prediction. This helps in refining models and improving their accuracy.
7. Reporting Insights:
 Data Visualization: Present the analyzed characteristics using charts, graphs, heatmaps, and other
visual tools. This allows stakeholders to quickly understand trends and insights.
 Summary Reports: Provide key takeaways from the characteristics analysis, focusing on actionable
insights. For instance, if a retail chain notices that certain customer demographics prefer a specific
product, this can lead to targeted marketing campaigns.
 Interactive Dashboards: Tools like Tableau, Power BI, or Google Data Studio allow users to interact
with the characteristics data dynamically, changing filters or focusing on specific features.
Examples of Characteristics Analysis in Different Fields:
 Marketing: Identifying customer demographics, purchasing behaviors, and preferences to tailor
campaigns and improve customer engagement.
 Healthcare: Analyzing patient characteristics (age, medical history, lab results) to predict outcomes,
recommend treatments, or detect diseases early.
 Finance: Assessing transaction characteristics, such as frequency, amount, and location, to detect
fraud or predict default risk.
 Human Resources: Analyzing employee characteristics (tenure, performance metrics, salary) to
predict turnover or identify leadership potential.