
Data Preprocessing

Data Cleansing and Evaluations

Subha Fernando,
Dr.Eng, M.Eng, B.Sc(Hons)
University of Moratuwa

subha.f@iit.ac.lk
Outline
• Data Cleansing
• Dealing with Missing Values
• Handling Outliers

• Data Transformations
• Data Coding
• Feature Scaling
• Feature Discretization

• Data Generation and Class Imbalance


• SMOTE
• DBSMOTE (Density Based Synthetic Oversampling)

• Accuracy Measures
• Significance Test
Handling Missing Values
Why data goes missing:

• Missing at Random (MAR)
  • The propensity for a data point to be missing is not related to the missing data itself, but it is
    related to some of the observed data.
  • Whether or not someone answered #13 on your survey has nothing to do with the missing values,
    but it does have to do with the values of some other variables.

• Missing Completely at Random (MCAR)
  • The fact that a certain value is missing has nothing to do with its hypothetical value or with the
    values of other variables.
  • Little's MCAR test (with a p-value) can be used to check this assumption.

• Missing Not at Random (MNAR)
  • Two possible reasons: the missing value depends on its hypothetical value
    (e.g. people with high salaries generally do not want to reveal their incomes in surveys), or
  • the missing value depends on some other variable's value
    (e.g. females generally don't want to reveal their ages; here the missing value in the age
    variable is affected by the gender variable).

https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4
Handling Missing Values
• Deletion
• Listwise: Listwise deletion (complete-case analysis) removes all data for an observation that
has one or more missing values.
• Listwise deletion can produce biased parameters and estimates, because the missingness
might not be completely at random (MCAR).
data.dropna(inplace=True)
• Pairwise
• Pairwise deletion occurs when the statistical procedure uses cases that contain some missing
data. The procedure cannot include a particular variable when it has a missing value, but it can
still use the case when analyzing other variables with non-missing values.
#Pairwise Deletion
ncovMatrix <- cov(data, use="pairwise.complete.obs")
#Listwise Deletion
ncovMatrix <- cov(data, use="complete.obs")

• Dropping Variables
  • Drop a variable if its data is missing for more than 60% of the observations, but only if that
    variable is insignificant.
del data.column_name
data.drop('column_name', axis=1, inplace=True)
Handling Missing Values
Time Series Specific Methods

• Last Observation Carried Forward (LOCF) & Next Observation Carried Backward
(NOCB)
• This is a statistical approach to the analysis of longitudinal repeated measures data.
• Longitudinal data track the same sample at different points in time.
• Both these methods can introduce bias in analysis when data has a visible trend.
• Linear Interpolation
  • For time series data with a trend and no seasonality.

• Seasonal Adjustment + Linear Interpolation
  • For time series data with a trend and seasonality.
Handling Missing Values
Time Series Specific Methods –

• Mean, Median and Mode (with no trends and no seasonality)


• Mean, median or mode is a very basic imputation method
• It takes no advantage of the time series characteristics or relationship between the variables.
• It is very fast but it reduces variance in the dataset.

import numpy as np
from sklearn.impute import SimpleImputer

values = data.values
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
transformed_values = imputer.fit_transform(values)
# strategy can be changed to "median" or "most_frequent"
Handling Missing Values
Time Series Specific Methods – with trends and no seasonality

• Linear Regression (with trends and no seasonality)


• Using a correlation matrix, identify the variables related to the variable with missing values.
• The other variables are used as independent variables in a regression equation, and the variable with
  missing data is used as the dependent variable.
• The regression equation is used to predict missing values for incomplete cases.
• In an iterative process, values for the missing variable are inserted and then all cases are
  used to predict the dependent variable. These steps are repeated until the predicted
  values converge.
• This assumes that a linear relationship exists between the variables and tries to fit the data to that line.
• Interpolation & Linear Interpolation
• Interpolation is a mathematical method that fits a function to your data and uses this
  function to estimate the missing data.
• The simplest type is linear interpolation, which averages the value before the missing data
  and the value after it.
Handling Missing Values
Time Series Specific Methods – with trends and no seasonality
• Interpolation

https://leportella.com/missing-data/
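As a minimal sketch (assuming the series is held in a pandas Series; the values are illustrative), LOCF, NOCB and linear interpolation can be applied directly with pandas:

import numpy as np
import pandas as pd

# illustrative series with a trend and two missing points
s = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])

locf = s.ffill()                          # Last Observation Carried Forward
nocb = s.bfill()                          # Next Observation Carried Backward
linear = s.interpolate(method='linear')   # linear interpolation between neighbouring values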
Handling Missing Values
Categorical Data

from fancyimpute import KNN
# Use 5 nearest rows which have a feature to fill in each row's missing features
knnOutput = KNN(k=5).complete(data)
• Imputation
• Mode imputation is one method, but it introduces bias. Missing values can be treated as a
separate category.
• K neighbors are chosen based on some distance measure and their average is used as an
imputation estimate.
• The method requires the selection of the number of nearest neighbors, and a distance metric.
KNN can predict both discrete attributes (the most frequent value among the k nearest
neighbors) and continuous attributes (the mean among the k nearest neighbors).
• Continuous Data:
• Distance metrics for continuous data are Euclidean, Manhattan and Cosine
• Categorical Data:
• Hamming distance is used in this case.
• It takes all the categorical attributes and, for each one, counts one if the value is not the same
  between the two points.
• The Hamming distance is then equal to the number of attributes for which the values differ.
Handling Missing Values
Continuous Data
• Multiple Imputation
  • Imputation: Impute the missing entries of the incomplete data set m times (m=3 in the figure).
    Note that imputed values are drawn from a distribution.
    • Simply simulating random draws does not include uncertainty in the model parameters;
      one approach that does is Markov Chain Monte Carlo (MCMC) simulation.
    • This step results in m complete data sets.
  • Analysis: Analyze each of the m completed data sets.
  • Pooling: Integrate the m analysis results into a single result. (A sketch follows below.)
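As a rough sketch of this workflow in Python (not the exact MCMC procedure above), scikit-learn's experimental IterativeImputer can draw each imputation from a fitted distribution; X is an illustrative incomplete feature matrix:

from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

m = 3  # number of completed data sets
# sample_posterior=True draws imputed values from a distribution, so different
# random states yield m different completed data sets to analyze and then pool
completed_sets = [
    IterativeImputer(sample_posterior=True, random_state=i).fit_transform(X)
    for i in range(m)
]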
Outline
• Data Cleansing
• Dealing with Missing Values
• Handling Outliers

• Data Transformations
• Data Coding
• Feature Scaling
• Feature Discretization

• Data Generation and Class Imbalance


• SMOTE
• DBSMOTE (Density Based Synthetic Oversampling)

• Accuracy Measures
• Significance Test
Handling Outliers
Outliers
• Outliers are unusual data points that differ significantly from the rest of the samples.
• Types of Outliers
• Global Outliers: a data point is considered a global outlier if its value is far outside the
entirety of the dataset.
• Contextual (Conditional) Outliers: if an individual data instance is anomalous in a specific
context or condition, it is termed a contextual outlier.
• Collective Outliers: a collection of data points is anomalous with respect to the entire data
set, even though the individual values themselves may not be anomalous.
Handling Outliers
Method to Find Outliers
Two Basic Methods: Percentile and Box Plot
• Percentile :
• Define a minimum percentile and maximum percentile.
• Usually, the minimum percentile is 5%, and the maximum percentile is 95%.
• All the data points outside the percentile range are considered outliers.

• Box Plot
• A box plot is a graphical display for describing the distribution of data. Box plots use the
median and the lower and upper quartiles.
import numpy as np
import pandas as pd

lst = [np.random.randint(0,100) for i in range(0,100)]
## Adding a manual outlier
global_outlier = [300]
df = pd.DataFrame(lst + global_outlier, columns=['number'])
df.boxplot(column=['number'])
## Minimum Percentile Value
min_val = df.quantile(0.05)
## Maximum Percentile Value
max_val = df.quantile(0.95)
## Finding All the Outliers
df = df[(df['number'] < min_val['number']) | (df['number'] > max_val['number'])]
df
Handling Outliers
Three basic methods:
• Remove all the outliers.
• Replace outlier values with a suitable value
  • e.g. replace them with the minimum or maximum quantile value.
• Using the IQR
  • IQR, or interquartile range, is a measure of variability based on dividing the dataset into
    quartiles.
  • The quartiles are Q1, Q2, and Q3:
    • Q1 is the middle value of the first half of the dataset,
    • Q2 is the median value, and
    • Q3 is the middle value of the second half of the dataset.
  • IQR = Q3 - Q1
  • We calculate a lower limit (Q1 - 1.5 * IQR) and an upper limit (Q3 + 1.5 * IQR), then either discard
    all the values below or above these limits or replace them with the lower and upper limit,
    respectively. (A sketch follows below.)
  • This also works for data that is left-skewed or right-skewed.

https://medium.com/towards-artificial-intelligence/handling-outliers-in-machine-learning-f842d8f4c1dc
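A minimal sketch of the IQR approach with pandas, reusing the df['number'] column from the percentile example above (the 1.5 multiplier is the conventional choice):

Q1 = df['number'].quantile(0.25)
Q3 = df['number'].quantile(0.75)
IQR = Q3 - Q1
lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR

# cap values outside the limits at the lower/upper limit instead of dropping them
df['number'] = df['number'].clip(lower=lower_limit, upper=upper_limit)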
Outline
• Data Cleansing
• Dealing with Missing Values
• Handling Outliers

• Data Transformations
• Data Coding
• Feature Scaling
• Feature Discretization

• Data Generation and Class Imbalance


• SMOTE
• DBSMOTE (Density Based Synthetic Oversampling)

• Accuracy Measures
• Significance Test
Data Transformations
We prefer data that comes from a normal distribution, but variables in real datasets tend to follow a
skewed distribution.
By applying transformations to these variables, we map a skewed distribution to a normal distribution.

How to check whether our variables follow a normal distribution: histogram and Q-Q plot.

In a Q-Q plot, if the variable follows a normal distribution, the variable's values should fall along a
45-degree line when plotted against the theoretical quantiles.
Data Transformations
Methods to transform variables
(FunctionTransformer is imported from sklearn.preprocessing; np is numpy.)

• Logarithmic transformation – only for positive numbers; used when the distribution is right-skewed, ln(x).
  transformer = FunctionTransformer(np.log, validate=True)

• Square root transformation – used for reducing right-skewed distributions, √x.
  transformer = FunctionTransformer(np.sqrt, validate=True)

• Reciprocal transformation – not defined for zero values; the reciprocal reverses the order among values
  of the same sign, so large values become smaller. The negative reciprocal preserves the order among
  values of the same sign.
  transformer = FunctionTransformer(np.reciprocal, validate=True)

• Exponential or power transformation – used to reduce left skewness: x², x³, …, exp(x).
  transformer = FunctionTransformer(lambda x: x ** 3, validate=True)
Data Transformations
Method to transform variables
• Box-Cox transformation
  • We examine all values of λ in (-5, 5) and choose the optimal value (the one resulting in the best
    approximation to a normal distribution) for our variable. It only works for positive numbers.
    transformer = PowerTransformer(method='box-cox', standardize=False)

• Yeo-Johnson transformation – can also be applied to negative numbers.
    transformer = PowerTransformer(method='yeo-johnson', standardize=False)
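A short usage sketch (PowerTransformer lives in sklearn.preprocessing and picks the optimal λ by maximum likelihood; the toy data below is illustrative):

import numpy as np
from sklearn.preprocessing import PowerTransformer

X = np.random.exponential(size=(100, 1))  # right-skewed, strictly positive toy data

bc = PowerTransformer(method='box-cox', standardize=False)      # positive values only
X_bc = bc.fit_transform(X)
print(bc.lambdas_)                                              # fitted lambda per feature

yj = PowerTransformer(method='yeo-johnson', standardize=False)  # also handles negative values
X_yj = yj.fit_transform(X)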


Outline
• Data Cleansing
• Dealing with Missing Values
• Handling Outliers

• Data Transformations
• Data Coding
• Feature Scaling
• Feature Discretization

• Data Generation and Class Imbalance


• SMOTE
• DBSMOTE (Density Based Synthetic Oversampling)

• Accuracy Measures
• Significance Test
Data Coding
https://heartbeat.fritz.ai/hands-on-with-feature-engineering-techniques-encoding-categorical-variables-be4bc0715394

Categorical Variables:

• Traditional techniques
• One-hot Encoding
• Integer (Label) Encoding
• Monotonic Relationship
• Ordered label encoding

Advantages of one-hot encoding


• Does not assume the distribution of categories of the
categorical variable.
• Keeps all the information of the categorical variable.
• Suitable for linear models.

Limitations of one-hot encoding

• Expands the feature space.
• Does not add extra information while encoding.
• Many dummy variables may be identical, and this can introduce redundant information.

data_with_k = pd.get_dummies(data.Sex)
data_with_k.head(10)
Data Coding
https://heartbeat.fritz.ai/hands-on-with-feature-engineering-techniques-encoding-categorical-variables-be4bc0715394

Categorical Variables:

• Traditional techniques
• One-hot Encoding
• Integer (Label) Encoding
• Monotonic Relationship
• Ordered label encoding
Advantages of integer encoding
• Straightforward to implement.
• Does not expand the feature space.
• Can work well enough with tree-based algorithms.
• Allows agile benchmarking of machine learning models.

Limitations of integer encoding


• Does not add extra information while encoding.
• Not suitable for linear models.
• Does not handle new categories in the test set automatically.
• Creates an order relationship between the categories.
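A minimal sketch of integer (label) encoding, assuming the same data.Sex column used in the one-hot example (scikit-learn's LabelEncoder is one option; pandas category codes work as well):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
data['Sex_encoded'] = le.fit_transform(data['Sex'])
# mapping from category to integer code
print(dict(zip(le.classes_, le.transform(le.classes_))))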
Data Coding
https://heartbeat.fritz.ai/hands-on-with-feature-engineering-techniques-encoding-categorical-variables-be4bc0715394

Categorical Variables:

• Traditional techniques
• One-hot Encoding
• Integer (Label) Encoding
• Monotonic Relationship
• Ordered label encoding

Advantages of Ordered Label encoding


• Straightforward to implement.
• Does not expand the feature space.
• Can work well enough with tree-based algorithms.
• Allows agile benchmarking of machine learning models.

Limitations of Ordered Label encoding


• Does not add extra information while encoding.
• Not suitable for linear models.
• Does not handle new categories in the test set automatically.
• Creates an order relationship between the categories.
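A sketch of ordered label encoding, assuming an illustrative categorical column 'Category' and a binary target 'Target' (both names are hypothetical): categories are ranked by their mean target value so the integer codes have a monotonic relationship with the target.

# order the categories by their mean target value, then map them to integers
ordered = data.groupby('Category')['Target'].mean().sort_values().index
mapping = {cat: i for i, cat in enumerate(ordered)}
data['Category_ordered'] = data['Category'].map(mapping)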
Outline
• Data Cleansing
• Dealing with Missing Values
• Handling Outliers

• Data Transformations
• Data Coding
• Feature Scaling
• Feature Discretization

• Data Generation and Class Imbalance


• SMOTE
• DBSMOTE (Density Based Synthetic Oversampling)

• Accuracy Measures
• Significance Test
Feature Scaling
• Feature scaling refers to the methods used to normalize the range of values of independent variables.
• Feature magnitude matters for several reasons:
• The scale of the variable directly influences the regression coefficient.
• Variables with a more significant magnitude dominate over the ones with a smaller magnitude
range.
• Gradient descent converges faster when features are on similar scales.
• Feature scaling helps decrease the time to find support vectors for SVMs.
• Euclidean distances are sensitive to feature magnitude.

• Scaling Methods
  • Standardization
  • Scale to maximum and minimum (min-max scaling)
    • This method is very sensitive to outliers. (See the sketch below.)
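A minimal sketch of both methods with scikit-learn (X is an illustrative numeric feature matrix):

from sklearn.preprocessing import StandardScaler, MinMaxScaler

X_std = StandardScaler().fit_transform(X)      # standardization: (x - mean) / std
X_minmax = MinMaxScaler().fit_transform(X)     # scale to min/max: (x - min) / (max - min)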
Outline
• Data Cleansing
• Dealing with Missing Values
• Handling Outliers

• Data Transformations
• Data Coding
• Feature Scaling
• Feature Discretization

• Data Generation and Class Imbalance


• SMOTE
• DBSMOTE (Density Based Synthetic Oversampling)

• Accuracy Measures
• Significance Test
Variable Discretization

• Transforming a continuous variable into a discrete one.


• It essentially creates a set of contiguous intervals that span the variable’s value range.
• Binning is another name for discretization, where the bin is an alternative name
for the interval.
• Discretization Approach
• Supervised Approach
• Discretization with decision trees - Discretization with decision trees consists of using a
decision tree to identify the optimal bins.
• Unsupervised Approach
• Equal-width discretization
• K-means discretization
• Custom Discretization
Variable Discretization
• Unsupervised Approach
• Equal-width discretization
• It divides the range of possible values into N bins of the same width (N is chosen manually).
• The width of the intervals is determined by width = (max - min) / N.

# import the libraries
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# load your data
data = pd.read_csv('data.csv')

# create the discretizer object with strategy uniform and 8 bins
discretizer = KBinsDiscretizer(n_bins=8, encode='ordinal', strategy='uniform')

# apply the discretizer to the data
data_binned = discretizer.fit_transform(data)
Variable Discretization
• Unsupervised Approach
• K-means discretization
• Apply k-means clustering to the continuous variable—then each cluster is considered as a bin.
# import the libraries
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer
# load your data
data = pd.read_csv('data.csv')
# create the discretizer object with strategy kmeans and 8 bins
discretizer = KBinsDiscretizer(n_bins=8, encode='ordinal', strategy='kmeans')

• Custom Discretization
  • Variables like Age can be divided into custom intervals, e.g. [0–10] as kids, [10–25] as teenagers.
import pandas as pd

# bins interval
bins = [0, 10, 25, 65, 250]
# bins labels
labels = ['0-10', '10-25', '25-65', '>65']
# discretization with pandas
df['Age'] = pd.cut(df.Age, bins=bins, labels=labels, include_lowest=True)
df.head(10)
Outline
• Data Cleansing
• Dealing with Missing Values
• Handling Outliers

• Data Transformations
• Data Coding
• Feature Scaling
• Feature Discretization

• Data Generation and Class Imbalance


• SMOTE
• DBSMOTE (Density Based Synthetic Oversampling)

• Accuracy Measures
Class Imbalance
• Imbalanced data is where the classes of a classification dataset have skewed proportions.
• Class imbalance creates a bias where the learning model tends to predict the majority class.
• Class imbalance: 90% − 10% (anomaly detection) vs. 70% − 30% (class imbalance)
• Two techniques to overcome class imbalance:
  • Undersampling: decrease the number of majority-class samples until it is similar to that of the
    minority class.
  • Oversampling: resample the minority class until its proportion approaches that of the majority
    class.

https://towardsdatascience.com/5-smote-techniques-for-oversampling-your-imbalance-data-b8155bdbe2b5
https://github.com/analyticalmindsltd/smote_variants

Class Imbalance - Oversampling


• SMOTE: Synthetic Minority Oversampling Technique
  • SMOTE – continuous data
    • SMOTE works by utilizing a k-nearest neighbour algorithm to create synthetic data.
    • SMOTE starts by choosing a random sample from the minority class, then finds its k nearest
      neighbours in that class.
    • Synthetic data is then generated between the random sample and a randomly selected one of
      its k nearest neighbours.
    • In other words, SMOTE selects examples that are close in the feature space, draws a line
      between them, and draws a new sample at a point along that line. (A sketch with
      imbalanced-learn follows below.)

  • SMOTE-NC: Synthetic Minority Oversampling Technique for Nominal and Continuous data
    • You denote which features are categorical, and SMOTE-NC resamples those categorical features
      (taking the most common category among the neighbours) instead of interpolating synthetic
      values for them.
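A minimal sketch using the imbalanced-learn (imblearn) package, assuming a feature matrix X and labels y; the categorical column indices passed to SMOTENC are illustrative:

from imblearn.over_sampling import SMOTE, SMOTENC

# continuous features only
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)

# mixed nominal and continuous data: pass the indices of the categorical columns
X_nc, y_nc = SMOTENC(categorical_features=[0, 3], random_state=42).fit_resample(X, y)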
Class Imbalance - Oversampling
• Borderline-SMOTE: Borderline-SMOTE only makes synthetic data along the decision boundary
  between the two classes.
  • Types of Borderline-SMOTE:
    • Borderline-SMOTE1 and Borderline-SMOTE2.
    • Borderline-SMOTE1 also oversamples the majority class where majority samples are causing
      misclassification near the decision boundary.
    • Borderline-SMOTE2 only oversamples the minority class.

• Borderline-SMOTE SVM
  • The borderline area is approximated by the support vectors obtained after training an SVM
    classifier on the original training set.
  • Synthetic data is randomly created along the lines joining each minority-class support vector
    with a few of its nearest neighbours.

• Adaptive Synthetic Sampling (ADASYN)
  • ADASYN creates synthetic data according to the data density.
  • The amount of synthetic data generated is inversely proportional to the density of the minority
    class, i.e. more synthetic data is created in regions of the feature space where the density of
    minority examples is low, and fewer or none where the density is high.
Class Imbalance
https://machinelearningmastery.com/undersampling-algorithms-for-imbalanced-classification/#:~:text=The%20simplest%20undersampling%20technique%20involves,referred%20to%20as%20random%20undersampling.

Undersampling: decrease the number of majority-class samples until it is similar to that of the
minority class.
• Random Undersampling:
  • This technique involves randomly selecting examples from the majority class and deleting them from the
    training dataset.
  • Although it is simple and effective, examples are removed without any concern for how useful or important
    they might be in determining the decision boundary between the classes.
• NearMiss Undersampling:
  • NearMiss-1 selects examples from the majority class that have the minimum average distance to the three
    closest examples from the minority class.
  • NearMiss-2 selects examples from the majority class that have the minimum average distance to the three
    furthest examples from the minority class.
  • NearMiss-3 involves selecting a given number of majority-class examples with minimum distance to each
    minority-class example.
• Tomek Links for Undersampling
• Edited Nearest Neighbors Rule for Undersampling
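A minimal sketch of random and NearMiss undersampling with imbalanced-learn (X, y as before; the version argument selects NearMiss-1/2/3):

from imblearn.under_sampling import RandomUnderSampler, NearMiss

X_rus, y_rus = RandomUnderSampler(random_state=42).fit_resample(X, y)
X_nm, y_nm = NearMiss(version=1, n_neighbors=3).fit_resample(X, y)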
Outline
• Data Cleansing
• Dealing with Missing Values
• Handling Outliers

• Data Transformations
• Data Coding
• Feature Scaling
• Feature Discretization

• Data Generation and Class Imbalance


• SMOTE
• DBSMOTE (Density Based Synthetic Oversampling)

• Accuracy Measures
• Significance Test
Classification Measures in Supervised Learning
• When values are predicted (Regression), with ŷᵢ = f(xᵢ):
  • MSE (Mean Squared Error) = (1/n) Σᵢ (yᵢ − ŷᵢ)²
  • RMSE (Root Mean Squared Error) = √( (1/n) Σᵢ (yᵢ − ŷᵢ)² )
  • MAE (Mean Absolute Error) = (1/n) Σᵢ |yᵢ − ŷᵢ|

• When the classes are balanced (Classification):
  • Classification Error = (FP + FN) / (TP + TN + FP + FN)
  • Accuracy = 1 − Error = (TP + TN) / (TP + TN + FP + FN)
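A quick sketch of these measures with scikit-learn (the arrays are illustrative):

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, accuracy_score

y_true = np.array([3.0, 2.5, 4.0])
y_pred = np.array([2.8, 2.7, 4.2])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)

acc = accuracy_score([1, 0, 1, 1], [1, 0, 0, 1])  # classification: accuracy = 1 - error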
Classification Measures in Supervised Learning
y = 0: non-spam, 1: spam;   ŷ = 0: non-spam, 1: spam

                ŷ = Yes   ŷ = No
  Y = Yes         TP        FN      P(ŷ | spam class)
  Y = No          FP        TN      P(ŷ | non-spam class)

We want the diagonal (green: TP, TN) to be maximized and the off-diagonal (red: FP, FN) to be
minimized for better classification.

Example: TP = 98, FN = 2, FP = 2, TN = 98 (100 samples per class) gives an accuracy of 98%.
Can we measure the accuracy only by using the diagonal elements?
Classification Measures in Supervised Learning
y = 0: non-spam, 1: spam;   ŷ = 0: non-spam, 1: spam

[Slide figures: confusion matrices with increasingly imbalanced class counts, with the
corresponding scores shown per matrix (66%, 86%, 88%).]

As the class imbalance increases, the green diagonal does not reflect the model accuracy properly,
so we need measures that also reflect the misclassification.
Classification Measures – F-Score
• When the classes are imbalanced:
  • False Alarm Rate = False Positive Rate = FP / (FP + TN)
    % of negatives we misclassified as positives.
  • Miss Alarm Rate = False Negative Rate = FN / (FN + TP)
    % of positives we misclassified as negatives.
  • Recall = True Positive Rate = TP / (TP + FN)
    % of positives we classified correctly (1 − Miss Rate).
  • Precision = TP / (TP + FP)
    % of positives out of what we predicted as positive.
  • F-measure = 2 · (Precision · Recall) / (Precision + Recall)
    Harmonic mean of recall and precision (see the sketch below).
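These quantities can be computed directly with scikit-learn (the labels below are illustrative):

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall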


F-Score (Utility and Cost)
• Sometimes different types of errors cost differently (for example, earthquake predictions):
  • False positive: cost of preventive measures (evacuation, lost profit).
  • False negative: cost of recovery (reconstruction, liability).

• Detection cost: a weighted combination of the FP and FN rates:
  • Cost = C_FP · FP + C_FN · FN

• Sometimes the situation prefers maintaining high recall over precision, or the other way around.
  • Measurement: AUC-ROC curve (Area Under the Curve – Receiver Operating Characteristic).
  • The AUC-ROC curve is a performance measurement for classification problems at various
    threshold settings.
ROC Curve
https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5

• The threshold t gives the decision: spam if f(x) > t, non-spam if f(x) ≤ t. Assume t ∈ [0, 1].
• So t determines the error rates:
  • FPR = P(f(x) > t | non-spam) and TPR = P(f(x) > t | spam)
• Now plot TPR vs. FPR as t varies from 0 to 1.

TPR = Recall = True Positive Rate = TP / (TP + FN) = P(f(x) > t | spam)
FPR = False Positive Rate = FP / (FP + TN) = P(f(x) > t | non-spam)

We want to reduce both α and β, but theoretically we can't reduce both of them together.
  α = Type I error = FN / (FN + TP) and
  β = Type II error = FP / (FP + TN).
Therefore, for a fixed β, we reduce α, i.e. increase TPR = (1 − α).
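A minimal sketch of computing the ROC curve and AUC with scikit-learn (y_true and the scores f(x) are illustrative):

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # f(x), e.g. predicted probability of spam

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # TPR/FPR as the threshold t varies
auc = roc_auc_score(y_true, y_score)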


ROC Curve
• AUC near 1 means the model has a good measure of separability.
• AUC near 0 means it has the worst measure of separability.
• AUC of 0.5 means the model has no class-separation capacity.

Distribution of classes: 1 → 1 and 0 → 0

TPR = TP / (TP + FN) = P(f(x) > t | spam)
FPR = FP / (FP + TN) = P(f(x) > t | non-spam)

The red distribution curve is the positive class (spam) and the green distribution curve is the
negative class (non-spam). The model is perfectly able to distinguish between the positive class
and the negative class.
ROC Curve
• AUC near 0 means the model has the worst measure of separability.
• AUC of 0.5 means the model has no class-separation capacity.

Distribution of classes: 0 → 1 and 1 → 0

TPR = TP / (TP + FN) = P(f(x) > t | spam)
FPR = FP / (FP + TN) = P(f(x) > t | non-spam)

The red distribution curve is the positive class (spam) and the green distribution curve is the
negative class (non-spam). The model is predicting the negative class as the positive class and
vice versa.
ROC Curve
• AUC of 0.5 means the model has no class-separation capacity.

Distribution of classes (overlapped)

TPR = TP / (TP + FN) = P(f(x) > t | spam)
FPR = FP / (FP + TN) = P(f(x) > t | non-spam)

The red distribution curve is the positive class (spam) and the green distribution curve is the
negative class (non-spam). The model has no discrimination capacity to distinguish between the
positive class and the negative class.
ROC Curve
• When AUC is 0.7, the two distributions overlap and we introduce Type I and Type II errors.
• Depending upon the threshold, we can minimize or maximize them. When AUC is 0.7, it
  means there is a 70% chance that the model will be able to distinguish between the positive
  class and the negative class.

Distribution of classes (somewhat overlapped)

TPR = TP / (TP + FN) = P(f(x) > t | spam)
FPR = FP / (FP + TN) = P(f(x) > t | non-spam)

The red distribution curve is the positive class (spam) and the green distribution curve is the
negative class (non-spam). The model distinguishes the positive class from the negative class
with 70% probability.
R² Value
• After fitting a linear regression model, you need to determine how well the model fits the data.
• R² is one of the key goodness-of-fit statistics for regression analysis.

• Statistically, a regression model fits the data well if the differences between the observations and the
  predicted values are small and unbiased. Unbiased means that the fitted values are not systematically too
  high or too low anywhere in the observation space.

• Therefore, before assessing numeric measures of goodness-of-fit like R², we should evaluate the residual
  plots.
  • Residual plots can expose a biased model far more effectively than the numeric output by displaying
    problematic patterns in the residuals.

• R² evaluates the scatter of the data points around the fitted regression line. It is also called the coefficient
  of determination, or the coefficient of multiple determination for multiple regression.
• For the same data set, higher R² values represent smaller differences between the observed data and the
  fitted values.
R² Value
• R-squared is always between 0 and 100%:
  • 0% represents a model that does not explain any of the variation in the response variable around its mean.
    • The mean of the dependent variable predicts the dependent variable as well as the regression model does.
  • 100% represents a model that explains all the variation in the response variable around its mean.
• Usually, the larger the R², the better the regression model fits your observations.

The R-squared for the regression model on the left is 15%, and for the model on the right it is 85%. When a regression model
accounts for more of the variance, the data points are closer to the regression line.
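R² can be computed from the observed and fitted values with scikit-learn (the arrays are illustrative):

from sklearn.metrics import r2_score

y_obs = [3.0, 2.5, 4.0, 5.1]
y_fit = [2.8, 2.7, 4.2, 5.0]

r2 = r2_score(y_obs, y_fit)  # 1 - SS_residual / SS_total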
R² Value

Evaluation Techniques
https://machinelearningmastery.com/framework-for-imbalanced-classification-projects/
Outline
• Data Cleansing
• Dealing with Missing Values
• Handling Outliers

• Data Transformations
• Data Coding
• Feature Scaling
• Feature Discretization

• Data Generation and Class Imbalance


• SMOTE
• DBSMOTE (Density Based Synthetic Oversampling)

• Accuracy Measures
• Significance Test
Significance Test
https://towardsdatascience.com/hypothesis-testing-in-machine-learning-using-python-a0dc89e169ce

One-sample t-test: the one-sample t-test determines whether the sample mean is statistically
different from a known or hypothesized population mean. The one-sample t-test is a parametric test.

• For a given model A:
  • Repeat the experiment n times under model A.
  • Calculate the errors e₁ … eₙ.
  • Calculate the one-sample t statistic t = (ē − θ) / (s / √n), where ē is the sample mean and
    s = √( Σᵢ (eᵢ − ē)² / (n − 1) ) is the sample standard deviation, with hypotheses
    H₀: μ = θ vs. H₁: μ ≠ θ.
  • Then t ~ t(n − 1) under the null hypothesis, tested at the α% significance level.
  • Alternatively, express the result as a confidence interval at level α such that
    P(q_{α/2} < t < q_{1−α/2}) = 1 − α, or report μ ± σ as the range.

from scipy.stats import ttest_1samp
tstat, pval = ttest_1samp(errors, theta)   # errors holds e1..en, theta is the hypothesized mean
print("p-value", pval)
if pval < 0.05:
    print("we reject the null hypothesis")
else:
    print("we fail to reject the null hypothesis")
Significance Test
Two-sample t-test: the independent-samples t-test (2-sample t-test) compares the means of two
independent groups in order to determine whether there is statistical evidence that the associated
population means are significantly different.

• Compare two models A and B:
  • Repeat the experiment n times under model A and under model B.
  • Calculate the errors e_A1 … e_An and e_B1 … e_Bn.
  • Calculate the independent two-sample t statistic t = (ē_A − ē_B) / (s_p · √(1/n_A + 1/n_B)),
    where s_p = √( ((n_A − 1)s_A² + (n_B − 1)s_B²) / (n_A + n_B − 2) ) is the pooled standard
    deviation, with hypotheses H₀: μ_A = μ_B vs. H₁: μ_A ≠ μ_B.
  • Then t ~ t(n_A + n_B − 2) under the null hypothesis, tested at the α% significance level.
  • Does this show that model A is better than model B? How?

from scipy.stats import ttest_ind
ttest, pval = ttest_ind(ErrorsA, ErrorsB)
print("p-value", pval)
if pval < 0.05:
    print("we reject the null hypothesis")
else:
    print("we fail to reject the null hypothesis")
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from statistics import mean, median, mode, stdev

cv1 = RepeatedStratifiedKFold(n_splits = 2, n_repeats = 2, random_state = 1)


scores1 = cross_val_score(model1, X, y, scoring = 'accuracy', cv = cv1, n_jobs = -1)
print('Logistic Regression Model Mean Accuracy: %.1f%% +/- (%.3f)' % (mean(scores1*100), stdev(scores1)))
# evaluate model 2

cv2 = RepeatedStratifiedKFold(n_splits = 2, n_repeats = 2, random_state = 1)


scores2 = cross_val_score(model2, X, y, scoring = 'accuracy', cv = cv2, n_jobs = -1)
print('SVM Mean Accuracy: %.1f%% +/-(%.3f)' % (mean(scores2*100), stdev(scores2)))
from mlxtend.evaluate import paired_ttest_5x2cv
# check if difference between algorithms is real
t, p = paired_ttest_5x2cv(estimator1=model1,
estimator2=model2,
X=X,
y=y,
scoring='accuracy',
random_seed=1)
# summarize
print(f'The P-value is = {p:.3f}')
print(f'The t-statistics is = {t:.3f}')
# interpret the result
if p <= 0.05:
    print('Since p<0.05, we can reject the null hypothesis that both models perform equally well on this dataset. We may conclude that the two algorithms are significantly different.')
else:
    print('Since p>0.05, we cannot reject the null hypothesis and may conclude that the performance of the two algorithms is not significantly different.')
