
Data Preprocessing

Data Cleansing and Evaluations

Subha Fernando,
Dr.Eng, M.Eng, B.Sc(Hons)
University of Moratuwa

subha.f@iit.ac.lk
Outline
• Data Cleansing
• Dealing with Missing Values
• Handling Outliers

• Data Transformations
• Data Coding
• Feature Scaling
• Feature Discretization

• Data Generation and Class Imbalance


• SMOTE
• DBSMOTE (Density Based Synthetic Oversampling)

• Accuracy Measures
• Significance Test
Handling Missing Values
Why data goes missing:

• Missing at Random (MAR)
  • The propensity for a data point to be missing is not related to the missing data itself, but it is
    related to some of the observed data.
  • Whether or not someone answered #13 on your survey has nothing to do with the missing values,
    but it does have to do with the values of some other variables.

• Missing Completely at Random (MCAR)
  • The fact that a certain value is missing has nothing to do with its hypothetical value or with the
    values of other variables.
  • Little's MCAR test (with a p-value) can be used to check this assumption.

• Missing Not at Random (MNAR)
  • Two possible reasons: the missing value depends on its hypothetical value
    (e.g. people with high salaries generally do not want to reveal their incomes in surveys), or
  • the missing value depends on some other variable's value
    (e.g. females generally don't want to reveal their ages; here the missing value in the age
    variable is affected by the gender variable).

https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4
Handling Missing Values
• Deletion
• Listwise: Listwise deletion (complete-case analysis) removes all data for an observation that
has one or more missing values.
• Listwise deletion can produce biased parameters and estimates, because the missingness
might not be completely at random (MCAR).
data.dropna(inplace=True)
• Pairwise
• Pairwise deletion occurs when the statistical procedure uses cases that contain some missing
data. The procedure cannot include a particular variable when it has a missing value, but it can
still use the case when analyzing other variables with non-missing values.
#Pairwise Deletion
ncovMatrix <- cov(data, use="pairwise.complete.obs")
#Listwise Deletion
ncovMatrix <- cov(data, use="complete.obs")

• Dropping Variables
  • Drop a variable if its data is missing for more than 60% of the observations, but only if that
    variable is insignificant.
del data.column_name
data.drop('column_name', axis=1, inplace=True)
Handling Missing Values
Time Series Specific Methods

• Last Observation Carried Forward (LOCF) & Next Observation Carried Backward
(NOCB)
• This is a statistical approach to the analysis of longitudinal repeated measures data.
• Longitudinal data track the same sample at different points in time.
• Both these methods can introduce bias in analysis when data has a visible trend.
• Linear Interpolation
  • For time series data with a trend and no seasonality.

• Seasonal Adjustment + Linear Interpolation
  • For time series data with a trend and seasonality.
Handling Missing Values
Time Series Specific Methods –

• Mean, Median and Mode (with no trends and no seasonality)


• Mean, median or mode is a very basic imputation method
• It takes no advantage of the time series characteristics or relationship between the variables.
• It is very fast but it reduces variance in the dataset.

import numpy as np
from sklearn.impute import SimpleImputer

values = data.values
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
transformed_values = imputer.fit_transform(values)
# strategy can be changed to "median" or "most_frequent"
Handling Missing Values
Time Series Specific Methods – with trends and no seasonality

• Linear Regression (with trends and no seasonality)


• Using a correlation matrix, identify the variables related to the variable with missing values.
• The other variables are used as independent variables in a regression equation, and the variable with
  missing data is used as the dependent variable.
• The regression equation is used to predict missing values for incomplete cases.
• In an iterative process, values for the missing variable are inserted and then all cases are
  used to predict the dependent variable. These steps are repeated until the predicted
  values converge.
• This assumes that a linear relationship exists between the variables and tries to fit the data to that line.
• Interpolation & Linear Interpolation
• Interpolation is a mathematical method that fits a function to your data and uses this
  function to estimate the missing data.
• The simplest type is linear interpolation, which averages the value before the missing data
  and the value after it.
Handling Missing Values
Time Series Specific Methods – with trends and no seasonality
• Interpolation

https://leportella.com/missing-data/
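As a minimal sketch (assuming the series is held in a pandas Series; the values are illustrative), LOCF, NOCB and linear interpolation can be applied directly with pandas:

import numpy as np
import pandas as pd

# illustrative series with a trend and two missing points
s = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])

locf = s.ffill()                          # Last Observation Carried Forward
nocb = s.bfill()                          # Next Observation Carried Backward
linear = s.interpolate(method='linear')   # linear interpolation between neighbouring values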
Handling Missing Values
Categorical Data

from fancyimpute import KNN
# Use 5 nearest rows which have a feature to fill in each row's missing features
knnOutput = KNN(k=5).complete(data)
• Imputation
• Mode imputation is one method, but it introduces bias. Missing values can be treated as a
separate category.
• K neighbors are chosen based on some distance measure and their average is used as an
imputation estimate.
• The method requires the selection of the number of nearest neighbors, and a distance metric.
KNN can predict both discrete attributes (the most frequent value among the k nearest
neighbors) and continuous attributes (the mean among the k nearest neighbors).
• Continuous Data:
• Distance metrics for continuous data are Euclidean, Manhattan and Cosine
• Categorical Data:
• Hamming distance is used in this case.
• It takes all the categorical attributes and, for each one, counts one if the value is not the same
  between the two points.
• The Hamming distance is then equal to the number of attributes for which the values differ.
Handling Missing Values
Continuous Data
• Multiple Imputation
  • Imputation: Impute the missing entries of the incomplete data set m times (m=3 in the figure).
    Note that imputed values are drawn from a distribution.
    • Simply simulating random draws does not include uncertainty in the model parameters;
      one approach that does is Markov Chain Monte Carlo (MCMC) simulation.
    • This step results in m complete data sets.
  • Analysis: Analyze each of the m completed data sets.
  • Pooling: Integrate the m analysis results into a single result. (A sketch follows below.)
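As a rough sketch of this workflow in Python (not the exact MCMC procedure above), scikit-learn's experimental IterativeImputer can draw each imputation from a fitted distribution; X is an illustrative incomplete feature matrix:

from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

m = 3  # number of completed data sets
# sample_posterior=True draws imputed values from a distribution, so different
# random states yield m different completed data sets to analyze and then pool
completed_sets = [
    IterativeImputer(sample_posterior=True, random_state=i).fit_transform(X)
    for i in range(m)
]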
Outline
• Data Cleansing
• Dealing with Missing Values
• Handling Outliers

• Data Transformations
• Data Coding
• Feature Scaling
• Feature Discretization

• Data Generation and Class Imbalance


• SMOTE
• DBSMOTE (Density Based Synthetic Oversampling)

• Accuracy Measures
• Significance Test
Handling Outliers
Outliers
• Outliers are unusual data points that differ significantly from the rest of the samples.
• Types of Outliers
• Global Outliers: a data point is considered a global outlier if its value is far outside the
entirety of the dataset.
• Contextual (Conditional) Outliers: if an individual data instance is anomalous in a specific
context or condition, it is termed a contextual outlier.
• Collective Outliers: a collection of data points is anomalous with respect to the entire data
set, even though the individual values themselves may not be anomalous.
Handling Outliers
Method to Find Outliers
Two Basic Methods: Percentile and Box Plot
• Percentile :
• Define a minimum percentile and maximum percentile.
• Usually, the minimum percentile is 5%, and the maximum percentile is 95%.
• All the data points outside the percentile range are considered outliers.

• Box Plot
• A box plot is a graphical display for describing the distribution of data. Box plots use the
median and the lower and upper quartiles.
import numpy as np
import pandas as pd

lst = [np.random.randint(0,100) for i in range(0,100)]
## Adding a manual outlier
global_outlier = [300]
df = pd.DataFrame(lst + global_outlier, columns=['number'])
df.boxplot(column=['number'])
## Minimum Percentile Value
min_val = df.quantile(0.05)
## Maximum Percentile Value
max_val = df.quantile(0.95)
## Finding All the Outliers
df = df[(df['number'] < min_val['number']) | (df['number'] > max_val['number'])]
df
Handling Outliers
Three basic methods:
• Remove all the outliers.
• Replace outlier values with a suitable value
  • e.g. replace them with the minimum or maximum quantile value.
• Using the IQR
  • IQR, or interquartile range, is a measure of variability based on dividing the dataset into
    quartiles.
  • The quartiles are Q1, Q2, and Q3:
    • Q1 is the middle value of the first half of the dataset,
    • Q2 is the median value, and
    • Q3 is the middle value of the second half of the dataset.
  • IQR = Q3 - Q1
  • We calculate a lower limit (Q1 - 1.5 * IQR) and an upper limit (Q3 + 1.5 * IQR), then either discard
    all the values below or above these limits or replace them with the lower and upper limit,
    respectively. (A sketch follows below.)
  • This also works for data that is left-skewed or right-skewed.

https://medium.com/towards-artificial-intelligence/handling-outliers-in-machine-learning-f842d8f4c1dc
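A minimal sketch of the IQR approach with pandas, reusing the df['number'] column from the percentile example above (the 1.5 multiplier is the conventional choice):

Q1 = df['number'].quantile(0.25)
Q3 = df['number'].quantile(0.75)
IQR = Q3 - Q1
lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR

# cap values outside the limits at the lower/upper limit instead of dropping them
df['number'] = df['number'].clip(lower=lower_limit, upper=upper_limit)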
Outline
• Data Cleansing
• Dealing with Missing Values
• Handling Outliers

• Data Transformations
• Data Coding
• Feature Scaling
• Feature Discretization

• Data Generation and Class Imbalance


• SMOTE
• DBSMOTE (Density Based Synthetic Oversampling)

• Accuracy Measures
• Significance Test
Data Transformations
We prefer data that comes from a normal distribution, but variables in real datasets tend to follow a
skewed distribution.
By applying transformations to these variables, we map a skewed distribution to a normal distribution.

How to check whether our variables follow a normal distribution: histogram and Q-Q plot.

In a Q-Q plot, if the variable follows a normal distribution, the variable's values should fall along a
45-degree line when plotted against the theoretical quantiles.
Data Transformations
Methods to transform variables
(FunctionTransformer is imported from sklearn.preprocessing; np is numpy.)

• Logarithmic transformation – only for positive numbers; used when the distribution is right-skewed, ln(x).
  transformer = FunctionTransformer(np.log, validate=True)

• Square root transformation – used for reducing right-skewed distributions, √x.
  transformer = FunctionTransformer(np.sqrt, validate=True)

• Reciprocal transformation – not defined for zero values; the reciprocal reverses the order among values
  of the same sign, so large values become smaller. The negative reciprocal preserves the order among
  values of the same sign.
  transformer = FunctionTransformer(np.reciprocal, validate=True)

• Exponential or power transformation – used to reduce left skewness: x², x³, …, exp(x).
  transformer = FunctionTransformer(lambda x: x ** 3, validate=True)
Data Transformations
Method to transform variables
• Box-Cox transformation
  • We examine all values of λ in (-5, 5) and choose the optimal value (the one resulting in the best
    approximation to a normal distribution) for our variable. It only works for positive numbers.
    transformer = PowerTransformer(method='box-cox', standardize=False)

• Yeo-Johnson transformation – can also be applied to negative numbers.
    transformer = PowerTransformer(method='yeo-johnson', standardize=False)
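A short usage sketch (PowerTransformer lives in sklearn.preprocessing and picks the optimal λ by maximum likelihood; the toy data below is illustrative):

import numpy as np
from sklearn.preprocessing import PowerTransformer

X = np.random.exponential(size=(100, 1))  # right-skewed, strictly positive toy data

bc = PowerTransformer(method='box-cox', standardize=False)      # positive values only
X_bc = bc.fit_transform(X)
print(bc.lambdas_)                                              # fitted lambda per feature

yj = PowerTransformer(method='yeo-johnson', standardize=False)  # also handles negative values
X_yj = yj.fit_transform(X)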


Outline
• Data Cleansing
• Dealing with Missing Values
• Handling Outliers

• Data Transformations
• Data Coding
• Feature Scaling
• Feature Discretization

• Data Generation and Class Imbalance


• SMOTE
• DBSMOTE (Density Based Synthetic Oversampling)

• Accuracy Measures
• Significance Test
Data Coding
https://heartbeat.fritz.ai/hands-on-with-feature-engineering-techniques-encoding-categorical-variables-be4bc0715394

Categorical Variables:

• Traditional techniques
• One-hot Encoding
• Integer (Label) Encoding
• Monotonic Relationship
• Ordered label encoding

Advantages of one-hot encoding


• Does not assume the distribution of categories of the
categorical variable.
• Keeps all the information of the categorical variable.
• Suitable for linear models.

Limitations of one-hot encoding

• Expands the feature space.
• Does not add extra information while encoding.
• Many dummy variables may be identical, and this can introduce redundant information.

data_with_k = pd.get_dummies(data.Sex)
data_with_k.head(10)
Data Coding
https://heartbeat.fritz.ai/hands-on-with-feature-engineering-techniques-encoding-categorical-variables-be4bc0715394

Categorical Variables:

• Traditional techniques
• One-hot Encoding
• Integer (Label) Encoding
• Monotonic Relationship
• Ordered label encoding
Advantages of integer encoding
• Straightforward to implement.
• Does not expand the feature space.
• Can work well enough with tree-based algorithms.
• Allows agile benchmarking of machine learning models.

Limitations of integer encoding


• Does not add extra information while encoding.
• Not suitable for linear models.
• Does not handle new categories in the test set automatically.
• Creates an order relationship between the categories.
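A minimal sketch of integer (label) encoding, assuming the same data.Sex column used in the one-hot example (scikit-learn's LabelEncoder is one option; pandas category codes work as well):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
data['Sex_encoded'] = le.fit_transform(data['Sex'])
# mapping from category to integer code
print(dict(zip(le.classes_, le.transform(le.classes_))))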
Data Coding
https://heartbeat.fritz.ai/hands-on-with-feature-engineering-techniques-encoding-categorical-variables-be4bc0715394

Categorical Variables:

• Traditional techniques
• One-hot Encoding
• Integer (Label) Encoding
• Monotonic Relationship
• Ordered label encoding

Advantages of Ordered Label encoding


• Straightforward to implement.
• Does not expand the feature space.
• Can work well enough with tree-based algorithms.
• Allows agile benchmarking of machine learning models.

Limitations of Ordered Label encoding


• Does not add extra information while encoding.
• Not suitable for linear models.
• Does not handle new categories in the test set automatically.
• Creates an order relationship between the categories.
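A sketch of ordered label encoding, assuming an illustrative categorical column 'Category' and a binary target 'Target' (both names are hypothetical): categories are ranked by their mean target value so the integer codes have a monotonic relationship with the target.

# order the categories by their mean target value, then map them to integers
ordered = data.groupby('Category')['Target'].mean().sort_values().index
mapping = {cat: i for i, cat in enumerate(ordered)}
data['Category_ordered'] = data['Category'].map(mapping)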
Outline
• Data Cleansing
• Dealing with Missing Values
• Handling Outliers

• Data Transformations
• Data Coding
• Feature Scaling
• Feature Discretization

• Data Generation and Class Imbalance


• SMOTE
• DBSMOTE (Density Based Synthetic Oversampling)

• Accuracy Measures
• Significance Test
Feature Scaling
• Feature scaling refers to the methods used to normalize the range of values of independent variables.
• Feature magnitude matters for several reasons:
• The scale of the variable directly influences the regression coefficient.
• Variables with a more significant magnitude dominate over the ones with a smaller magnitude
range.
• Gradient descent converges faster when features are on similar scales.
• Feature scaling helps decrease the time to find support vectors for SVMs.
• Euclidean distances are sensitive to feature magnitude.

• Scaling Methods
  • Standardization
  • Scale to maximum and minimum (min-max scaling)
    • This method is very sensitive to outliers. (See the sketch below.)
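A minimal sketch of both methods with scikit-learn (X is an illustrative numeric feature matrix):

from sklearn.preprocessing import StandardScaler, MinMaxScaler

X_std = StandardScaler().fit_transform(X)      # standardization: (x - mean) / std
X_minmax = MinMaxScaler().fit_transform(X)     # scale to min/max: (x - min) / (max - min)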
Outline
• Data Cleansing
• Dealing with Missing Values
• Handling Outliers

• Data Transformations
• Data Coding
• Feature Scaling
• Feature Discretization

• Data Generation and Class Imbalance


• SMOTE
• DBSMOTE (Density Based Synthetic Oversampling)

• Accuracy Measures
• Significance Test
Variable Discretization

• Transforming a continuous variable into a discrete one.


• It essentially creates a set of contiguous intervals that span the variable’s value range.
• Binning is another name for discretization, where the bin is an alternative name
for the interval.
• Discretization Approach
• Supervised Approach
• Discretization with decision trees - Discretization with decision trees consists of using a
decision tree to identify the optimal bins.
• Unsupervised Approach
• Equal-width discretization
• K-means discretization
• Custom Discretization
Variable Discretization
• Unsupervised Approach
• Equal-width discretization
• It divides the range of possible values into N bins of the same width (N is chosen manually).
• The width of the intervals is determined by width = (max - min) / N.

# import the libraries
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# load your data
data = pd.read_csv('data.csv')

# create the discretizer object with strategy uniform and 8 bins
discretizer = KBinsDiscretizer(n_bins=8, encode='ordinal', strategy='uniform')

# apply the discretizer to the data
data_binned = discretizer.fit_transform(data)
Variable Discretization
• Unsupervised Approach
• K-means discretization
• Apply k-means clustering to the continuous variable—then each cluster is considered as a bin.
# import the libraries
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer
# load your data
data = pd.read_csv('data.csv')
# create the discretizer object with strategy kmeans and 8 bins
discretizer = KBinsDiscretizer(n_bins=8, encode='ordinal', strategy='kmeans')

• Custom Discretization
  • Variables like Age can be divided into custom intervals, e.g. [0–10] as kids, [10–25] as teenagers.
import pandas as pd

# bins interval
bins = [0, 10, 25, 65, 250]
# bins labels
labels = ['0-10', '10-25', '25-65', '>65']
# discretization with pandas
df['Age'] = pd.cut(df.Age, bins=bins, labels=labels, include_lowest=True)
df.head(10)
Outline
• Data Cleansing
• Dealing with Missing Values
• Handling Outliers

• Data Transformations
• Data Coding
• Feature Scaling
• Feature Discretization

• Data Generation and Class Imbalance


• SMOTE
• DBSMOTE (Density Based Synthetic Oversampling)

• Accuracy Measures
Class Imbalance
• Imbalanced data is where the classes of a classification dataset have skewed proportions.
• Class imbalance creates a bias where the learning model tends to predict the majority class.
• Class imbalance: 90% − 10% (anomaly detection) vs. 70% − 30% (class imbalance)
• Two techniques to overcome class imbalance:
  • Undersampling: decrease the number of majority-class samples until it is similar to that of the
    minority class.
  • Oversampling: resample the minority class until its proportion approaches that of the majority
    class.

https://towardsdatascience.com/5-smote-techniques-for-oversampling-your-imbalance-data-b8155bdbe2b5
https://github.com/analyticalmindsltd/smote_variants

Class Imbalance - Oversampling


• SMOTE: Synthetic Minority Oversampling Technique
  • SMOTE – continuous data
    • SMOTE works by utilizing a k-nearest neighbour algorithm to create synthetic data.
    • SMOTE starts by choosing a random sample from the minority class, then finds its k nearest
      neighbours in that class.
    • Synthetic data is then generated between the random sample and a randomly selected one of
      its k nearest neighbours.
    • In other words, SMOTE selects examples that are close in the feature space, draws a line
      between them, and draws a new sample at a point along that line. (A sketch with
      imbalanced-learn follows below.)

  • SMOTE-NC: Synthetic Minority Oversampling Technique for Nominal and Continuous data
    • You denote which features are categorical, and SMOTE-NC resamples those categorical features
      (taking the most common category among the neighbours) instead of interpolating synthetic
      values for them.
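A minimal sketch using the imbalanced-learn (imblearn) package, assuming a feature matrix X and labels y; the categorical column indices passed to SMOTENC are illustrative:

from imblearn.over_sampling import SMOTE, SMOTENC

# continuous features only
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)

# mixed nominal and continuous data: pass the indices of the categorical columns
X_nc, y_nc = SMOTENC(categorical_features=[0, 3], random_state=42).fit_resample(X, y)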
Class Imbalance - Oversampling
• Borderline-SMOTE: Borderline-SMOTE only makes synthetic data along the decision boundary
  between the two classes.
  • Types of Borderline-SMOTE:
    • Borderline-SMOTE1 and Borderline-SMOTE2.
    • Borderline-SMOTE1 also oversamples the majority class where majority samples are causing
      misclassification near the decision boundary.
    • Borderline-SMOTE2 only oversamples the minority class.

• Borderline-SMOTE SVM
  • The borderline area is approximated by the support vectors obtained after training an SVM
    classifier on the original training set.
  • Synthetic data is randomly created along the lines joining each minority-class support vector
    with a few of its nearest neighbours.

• Adaptive Synthetic Sampling (ADASYN)
  • ADASYN creates synthetic data according to the data density.
  • The amount of synthetic data generated is inversely proportional to the density of the minority
    class, i.e. more synthetic data is created in regions of the feature space where the density of
    minority examples is low, and fewer or none where the density is high.
Class Imbalance
https://machinelearningmastery.com/undersampling-algorithms-for-imbalanced-classification/#:~:text=The%20simplest%20undersampling%20technique%20involves,referred%20to%20as%20random%20undersampling.

Undersampling: decrease the number of majority-class samples until it is similar to that of the
minority class.
• Random Undersampling:
  • This technique involves randomly selecting examples from the majority class and deleting them from the
    training dataset.
  • Although it is simple and effective, examples are removed without any concern for how useful or important
    they might be in determining the decision boundary between the classes.
• NearMiss Undersampling:
  • NearMiss-1 selects examples from the majority class that have the minimum average distance to the three
    closest examples from the minority class.
  • NearMiss-2 selects examples from the majority class that have the minimum average distance to the three
    furthest examples from the minority class.
  • NearMiss-3 involves selecting a given number of majority-class examples with minimum distance to each
    minority-class example.
• Tomek Links for Undersampling
• Edited Nearest Neighbors Rule for Undersampling
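A minimal sketch of random and NearMiss undersampling with imbalanced-learn (X, y as before; the version argument selects NearMiss-1/2/3):

from imblearn.under_sampling import RandomUnderSampler, NearMiss

X_rus, y_rus = RandomUnderSampler(random_state=42).fit_resample(X, y)
X_nm, y_nm = NearMiss(version=1, n_neighbors=3).fit_resample(X, y)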
Outline
• Data Cleansing
• Dealing with Missing Values
• Handling Outliers

• Data Transformations
• Data Coding
• Feature Scaling
• Feature Discretization

• Data Generation and Class Imbalance


• SMOTE
• DBSMOTE (Density Based Synthetic Oversampling)

• Accuracy Measures
• Significance Test
Classification Measures in Supervised Learning
• When values are predicted (Regression), with ŷᵢ = f(xᵢ):
  • MSE (Mean Squared Error) = (1/n) Σᵢ (yᵢ − ŷᵢ)²
  • RMSE (Root Mean Squared Error) = √( (1/n) Σᵢ (yᵢ − ŷᵢ)² )
  • MAE (Mean Absolute Error) = (1/n) Σᵢ |yᵢ − ŷᵢ|

• When the classes are balanced (Classification):
  • Classification Error = (FP + FN) / (TP + TN + FP + FN)
  • Accuracy = 1 − Error = (TP + TN) / (TP + TN + FP + FN)
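A quick sketch of these measures with scikit-learn (the arrays are illustrative):

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, accuracy_score

y_true = np.array([3.0, 2.5, 4.0])
y_pred = np.array([2.8, 2.7, 4.2])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)

acc = accuracy_score([1, 0, 1, 1], [1, 0, 0, 1])  # classification: accuracy = 1 - error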
Classification Measures in Supervised Learning
y = 0: non-spam, 1: spam;   ŷ = 0: non-spam, 1: spam

                ŷ = Yes   ŷ = No
  Y = Yes         TP        FN      P(ŷ | spam class)
  Y = No          FP        TN      P(ŷ | non-spam class)

We want the diagonal (green: TP, TN) to be maximized and the off-diagonal (red: FP, FN) to be
minimized for better classification.

Example: TP = 98, FN = 2, FP = 2, TN = 98 (100 samples per class) gives an accuracy of 98%.
Can we measure the accuracy only by using the diagonal elements?
Classification Measures in Supervised Learning
y = 0: non-spam, 1: spam;   ŷ = 0: non-spam, 1: spam

[Slide figures: confusion matrices with increasingly imbalanced class counts, with the
corresponding scores shown per matrix (66%, 86%, 88%).]

As the class imbalance increases, the green diagonal does not reflect the model accuracy properly,
so we need measures that also reflect the misclassification.
Classification Measures – F-Score
• When the classes are imbalanced:
  • False Alarm Rate = False Positive Rate = FP / (FP + TN)
    % of negatives we misclassified as positives.
  • Miss Alarm Rate = False Negative Rate = FN / (FN + TP)
    % of positives we misclassified as negatives.
  • Recall = True Positive Rate = TP / (TP + FN)
    % of positives we classified correctly (1 − Miss Rate).
  • Precision = TP / (TP + FP)
    % of positives out of what we predicted as positive.
  • F-measure = 2 · (Precision · Recall) / (Precision + Recall)
    Harmonic mean of recall and precision (see the sketch below).
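These quantities can be computed directly with scikit-learn (the labels below are illustrative):

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall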


F-Score (Utility and Cost)
• Sometimes different types of errors cost differently (for example, earthquake predictions):
  • False positive: cost of preventive measures (evacuation, lost profit).
  • False negative: cost of recovery (reconstruction, liability).

• Detection cost: a weighted combination of the FP and FN rates:
  • Cost = C_FP · FP + C_FN · FN

• Sometimes the situation prefers maintaining high recall over precision, or the other way around.
  • Measurement: AUC-ROC curve (Area Under the Curve – Receiver Operating Characteristic).
  • The AUC-ROC curve is a performance measurement for classification problems at various
    threshold settings.
ROC Curve
https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5

• The threshold t gives the decision: spam if f(x) > t, non-spam if f(x) ≤ t. Assume t ∈ [0, 1].
• So t determines the error rates:
  • FPR = P(f(x) > t | non-spam) and TPR = P(f(x) > t | spam)
• Now plot TPR vs. FPR as t varies from 0 to 1.

TPR = Recall = True Positive Rate = TP / (TP + FN) = P(f(x) > t | spam)
FPR = False Positive Rate = FP / (FP + TN) = P(f(x) > t | non-spam)

We want to reduce both α and β, but theoretically we can't reduce both of them together.
  α = Type I error = FN / (FN + TP) and
  β = Type II error = FP / (FP + TN).
Therefore, for a fixed β, we reduce α, i.e. increase TPR = (1 − α).
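A minimal sketch of computing the ROC curve and AUC with scikit-learn (y_true and the scores f(x) are illustrative):

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # f(x), e.g. predicted probability of spam

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # TPR/FPR as the threshold t varies
auc = roc_auc_score(y_true, y_score)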


ROC Curve
• AUC near 1 means the model has a good measure of separability.
• AUC near 0 means it has the worst measure of separability.
• AUC of 0.5 means the model has no class-separation capacity.

Distribution of classes: 1 → 1 and 0 → 0

TPR = TP / (TP + FN) = P(f(x) > t | spam)
FPR = FP / (FP + TN) = P(f(x) > t | non-spam)

The red distribution curve is the positive class (spam) and the green distribution curve is the
negative class (non-spam). The model is perfectly able to distinguish between the positive class
and the negative class.
ROC Curve
• AUC near 0 means the model has the worst measure of separability.
• AUC of 0.5 means the model has no class-separation capacity.

Distribution of classes: 0 → 1 and 1 → 0

TPR = TP / (TP + FN) = P(f(x) > t | spam)
FPR = FP / (FP + TN) = P(f(x) > t | non-spam)

The red distribution curve is the positive class (spam) and the green distribution curve is the
negative class (non-spam). The model is predicting the negative class as the positive class and
vice versa.
ROC Curve
• AUC of 0.5 means the model has no class-separation capacity.

Distribution of classes (overlapped)

TPR = TP / (TP + FN) = P(f(x) > t | spam)
FPR = FP / (FP + TN) = P(f(x) > t | non-spam)

The red distribution curve is the positive class (spam) and the green distribution curve is the
negative class (non-spam). The model has no discrimination capacity to distinguish between the
positive class and the negative class.
ROC Curve
• When AUC is 0.7, the two distributions overlap and we introduce Type I and Type II errors.
• Depending upon the threshold, we can minimize or maximize them. When AUC is 0.7, it
  means there is a 70% chance that the model will be able to distinguish between the positive
  class and the negative class.

Distribution of classes (somewhat overlapped)

TPR = TP / (TP + FN) = P(f(x) > t | spam)
FPR = FP / (FP + TN) = P(f(x) > t | non-spam)

The red distribution curve is the positive class (spam) and the green distribution curve is the
negative class (non-spam). The model distinguishes the positive class from the negative class
with 70% probability.
R² Value
• After fitting a linear regression model, you need to determine how well the model fits the data.
• R² is one of the key goodness-of-fit statistics for regression analysis.

• Statistically, a regression model fits the data well if the differences between the observations and the
  predicted values are small and unbiased. Unbiased means that the fitted values are not systematically too
  high or too low anywhere in the observation space.

• Therefore, before assessing numeric measures of goodness-of-fit like R², we should evaluate the residual
  plots.
  • Residual plots can expose a biased model far more effectively than the numeric output by displaying
    problematic patterns in the residuals.

• R² evaluates the scatter of the data points around the fitted regression line. It is also called the coefficient
  of determination, or the coefficient of multiple determination for multiple regression.
• For the same data set, higher R² values represent smaller differences between the observed data and the
  fitted values.
R² Value
• R-squared is always between 0 and 100%:
  • 0% represents a model that does not explain any of the variation in the response variable around its mean.
    • The mean of the dependent variable predicts the dependent variable as well as the regression model does.
  • 100% represents a model that explains all the variation in the response variable around its mean.
• Usually, the larger the R², the better the regression model fits your observations.

The R-squared for the regression model on the left is 15%, and for the model on the right it is 85%. When a regression model
accounts for more of the variance, the data points are closer to the regression line.
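R² can be computed from the observed and fitted values with scikit-learn (the arrays are illustrative):

from sklearn.metrics import r2_score

y_obs = [3.0, 2.5, 4.0, 5.1]
y_fit = [2.8, 2.7, 4.2, 5.0]

r2 = r2_score(y_obs, y_fit)  # 1 - SS_residual / SS_total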
R² Value

Evaluation Techniques
https://machinelearningmastery.com/framework-for-imbalanced-classification-projects/
Outline
• Data Cleansing
• Dealing with Missing Values
• Handling Outliers

• Data Transformations
• Data Coding
• Feature Scaling
• Feature Discretization

• Data Generation and Class Imbalance


• SMOTE
• DBSMOTE (Density Based Synthetic Oversampling)

• Accuracy Measures
• Significance Test
Significance Test
https://towardsdatascience.com/hypothesis-testing-in-machine-learning-using-python-a0dc89e169ce

One-sample t-test: the one-sample t-test determines whether the sample mean is statistically
different from a known or hypothesized population mean. The one-sample t-test is a parametric test.

• For a given model A:
  • Repeat the experiment n times under model A.
  • Calculate the errors e₁ … eₙ.
  • Calculate the one-sample t statistic t = (ē − θ) / (s / √n), where ē is the sample mean and
    s = √( Σᵢ (eᵢ − ē)² / (n − 1) ) is the sample standard deviation, with hypotheses
    H₀: μ = θ vs. H₁: μ ≠ θ.
  • Then t ~ t(n − 1) under the null hypothesis, tested at the α% significance level.
  • Alternatively, express the result as a confidence interval at level α such that
    P(q_{α/2} < t < q_{1−α/2}) = 1 − α, or report μ ± σ as the range.

from scipy.stats import ttest_1samp
tstat, pval = ttest_1samp(errors, theta)   # errors holds e1..en, theta is the hypothesized mean
print("p-value", pval)
if pval < 0.05:
    print("we reject the null hypothesis")
else:
    print("we fail to reject the null hypothesis")
Significance Test
Two-sample t-test: the independent-samples t-test (2-sample t-test) compares the means of two
independent groups in order to determine whether there is statistical evidence that the associated
population means are significantly different.

• Compare two models A and B:
  • Repeat the experiment n times under model A and under model B.
  • Calculate the errors e_A1 … e_An and e_B1 … e_Bn.
  • Calculate the independent two-sample t statistic t = (ē_A − ē_B) / (s_p · √(1/n_A + 1/n_B)),
    where s_p = √( ((n_A − 1)s_A² + (n_B − 1)s_B²) / (n_A + n_B − 2) ) is the pooled standard
    deviation, with hypotheses H₀: μ_A = μ_B vs. H₁: μ_A ≠ μ_B.
  • Then t ~ t(n_A + n_B − 2) under the null hypothesis, tested at the α% significance level.
  • Does this show that model A is better than model B? How?

from scipy.stats import ttest_ind
ttest, pval = ttest_ind(ErrorsA, ErrorsB)
print("p-value", pval)
if pval < 0.05:
    print("we reject the null hypothesis")
else:
    print("we fail to reject the null hypothesis")
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from statistics import mean, median, mode, stdev

cv1 = RepeatedStratifiedKFold(n_splits = 2, n_repeats = 2, random_state = 1)


scores1 = cross_val_score(model1, X, y, scoring = 'accuracy', cv = cv1, n_jobs = -1)
print('Logistic Regression Model Mean Accuracy: %.1f%% +/- (%.3f)' % (mean(scores1*100), stdev(scores1)))
# evaluate model 2

cv2 = RepeatedStratifiedKFold(n_splits = 2, n_repeats = 2, random_state = 1)


scores2 = cross_val_score(model2, X, y, scoring = 'accuracy', cv = cv2, n_jobs = -1)
print('SVM Mean Accuracy: %.1f%% +/-(%.3f)' % (mean(scores2*100), stdev(scores2)))
from mlxtend.evaluate import paired_ttest_5x2cv
# check if difference between algorithms is real
t, p = paired_ttest_5x2cv(estimator1=model1,
estimator2=model2,
X=X,
y=y,
scoring='accuracy',
random_seed=1)
# summarize
print(f'The P-value is = {p:.3f}')
print(f'The t-statistics is = {t:.3f}')
# interpret the result
if p <= 0.05:
    print('Since p<0.05, we can reject the null hypothesis that both models perform equally well on this dataset. We may conclude that the two algorithms are significantly different.')
else:
    print('Since p>0.05, we cannot reject the null hypothesis and may conclude that the performance of the two algorithms is not significantly different.')
