Feature Selection


What is Feature Selection?


Feature selection is a process in machine learning and data analysis where a subset of relevant features or variables is selected from a larger set of
available features. The goal of feature selection is to identify the most informative and relevant features that contribute the most to the predictive
performance of a model while reducing complexity, improving accuracy, and mitigating the risk of overfitting.

The following are the primary reasons for performing feature selection:

Simplifying the model: By reducing the number of features, the complexity of the model can be reduced, leading to simpler and more interpretable
models.

Improving prediction performance: Irrelevant or redundant features can introduce noise and lead to overfitting. Feature selection helps to focus on
the most informative features, leading to improved generalization and prediction accuracy.

Reducing computational complexity: By selecting a smaller subset of features, the computational resources required for model training and
prediction can be reduced, resulting in faster processing times.

Enhancing interpretability: Feature selection can help identify the most influential features, enabling better understanding and interpretation of the
relationships between the features and the target variable.

What problems does Feature Selection solve?

Feature selection can help solve various problems in machine learning and data analysis.

1. Curse of Dimensionality: When dealing with high-dimensional data, where the number of features is large compared to the number of observations,
feature selection can alleviate the curse of dimensionality. By selecting a subset of relevant features, the data representation becomes more focused,
improving the efficiency and performance of learning algorithms.


2. Overfitting: Including irrelevant or redundant features in a model can lead to overfitting, where the model becomes overly complex and performs
poorly on unseen data. Feature selection helps to identify and remove such features, reducing the risk of overfitting and improving the generalization
ability of the model.

3. Computational Efficiency: In situations where computational resources are limited, such as working with large datasets, feature selection can
significantly reduce the computational complexity. By selecting a smaller subset of features, the training and inference times of the model can be
reduced, enabling faster processing.


4. Interpretability: Feature selection can improve the interpretability of models by identifying the most relevant features. This is particularly important in
fields where model transparency and explainability are crucial, such as healthcare or finance. By selecting a subset of interpretable features, it
becomes easier to understand the factors influencing the model's predictions.



Types of Feature Selection Methods


1. Filter Method

2. Wrapper Method

3. Embedded Method


1. Filter-Based Feature Selection


Filter-based feature selection techniques are methods that use statistical measures to score each feature independently, and then select a subset of
features based on these scores. These methods are called "filter" methods because they essentially filter out the features that do not meet some
criterion.

Here are a few commonly used filter-based feature selection methods:

1. Correlation-based Feature Selection: This method measures the correlation between each feature and the target variable. Features with higher
correlation values are considered more relevant and are selected for inclusion. Common correlation measures include the Pearson correlation coefficient
for continuous variables and the point-biserial correlation when one of the variables is binary.

2. Information Gain: Information gain (mutual information) is a measure from information theory that quantifies the amount of information a feature
provides about the target variable. Features with higher information gain are considered more informative and are selected. This measure is commonly
used for feature selection in decision tree-based algorithms (a small sketch follows this list).

3. Chi-Square Test: The chi-square test is used for feature selection when dealing with categorical target variables. It measures the independence
between each feature and the target variable by comparing the observed and expected frequencies in a contingency table. Features with significant
chi-square test statistics are selected.

4. ANOVA: Analysis of Variance (ANOVA) is used when one of the two variables is categorical and the other is continuous. It compares the means of the
continuous variable across the groups defined by the categorical variable and determines whether the differences are statistically significant. In feature
selection for classification (as used later in this notebook via f_classif), the features are continuous and the target is categorical, and features whose
means differ significantly across the target classes are selected.
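Of these four methods, information gain is the only one not demonstrated later in these notes. As a rough sketch (using scikit-learn's mutual_info_classif on the iris dataset purely for illustration, not the dataset used below), it could be computed like this:

from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif
import pandas as pd

# Mutual information (information gain) between each feature and the class label:
# higher scores mean the feature tells us more about the target.
X_iris, y_iris = load_iris(return_X_y=True, as_frame=True)
mi_scores = pd.Series(mutual_info_classif(X_iris, y_iris, random_state=0), index=X_iris.columns)
print(mi_scores.sort_values(ascending=False))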


Practical

In [1]: import pandas as pd


import numpy as np

In [2]: df = pd.read_csv("D:\\datascience\\Nitish sir\\data Preprocessing\\train.csv").drop(columns='subject')


df.head()

# Data is Already Scaled , it has 562 Columns

Out[2]: (wide preview truncated; the first columns shown are tBodyAcc-mean()-X/Y/Z, tBodyAcc-std()-X/Y/Z, tBodyAcc-mad()-X/Y/Z, tBodyAcc-max()-X, ..., fBodyBodyGyroJerkMag-skewness(), ...)

5 rows × 562 columns

In [3]: # Display the shape of the DataFrame



df.shape # 7352 - rows , 562 - Columns

Out[3]: (7352, 562)

In [4]: # Get the value counts of the 'Activity' column in the DataFrame

df['Activity'].value_counts()

Out[4]: LAYING 1407
STANDING 1374
SITTING 1286
WALKING 1226
WALKING_UPSTAIRS 1073
WALKING_DOWNSTAIRS 986
Name: Activity, dtype: int64


Duplicate Features

In [5]: from sklearn.preprocessing import LabelEncoder


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Separate features and target
X = df.drop('Activity', axis=1)
y = df['Activity']

# Encode target labels
le = LabelEncoder()
y = le.fit_transform(y)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [6]: # Print the shape of X_train



print(X_train.shape)

# Print the shape of X_test
print(X_test.shape)

(5881, 561)
(1471, 561)

Applying Logistic Regression

In [7]: # Initialize and train logistic regression model


log_reg = LogisticRegression(max_iter=1000) # Increase max_iter if it doesn't converge
log_reg.fit(X_train, y_train)


# Make predictions on the test set
y_pred = log_reg.predict(X_test)


# Calculate and print accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Test accuracy:", accuracy)

Test accuracy: 0.9809653297076818

C:\Users\user\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(


1. Removing Duplicate Columns

In [9]: def get_duplicate_columns(df):
    """
    Returns a dictionary of duplicate columns in a DataFrame.

    Args:
        df (pandas.DataFrame): The input DataFrame.

    Returns:
        dict: A dictionary where each key is the first occurrence of a duplicated column
        and the value is the list of columns that duplicate it.
    """
    duplicate_columns = {}
    seen_columns = {}

    for column in df.columns:
        current_column = df[column]

        # Hash the column contents so identical columns map to the same key
        try:
            current_column_hash = current_column.values.tobytes()
        except AttributeError:
            current_column_hash = current_column.to_string().encode()

        if current_column_hash in seen_columns:
            if seen_columns[current_column_hash] in duplicate_columns:
                duplicate_columns[seen_columns[current_column_hash]].append(column)
            else:
                duplicate_columns[seen_columns[current_column_hash]] = [column]
        else:
            seen_columns[current_column_hash] = column

    return duplicate_columns

In [10]: # Get duplicate columns in the dataset



duplicate_columns = get_duplicate_columns(X_train)

In [11]: duplicate_columns

# Duplicated in key pair values

Out[11]: {'tBodyAccMag-mean()': ['tBodyAccMag-sma()',


'tGravityAccMag-mean()',
'tGravityAccMag-sma()'],
'tBodyAccMag-std()': ['tGravityAccMag-std()'],
'tBodyAccMag-mad()': ['tGravityAccMag-mad()'],
'tBodyAccMag-max()': ['tGravityAccMag-max()'],
'tBodyAccMag-min()': ['tGravityAccMag-min()'],
'tBodyAccMag-energy()': ['tGravityAccMag-energy()'],
'tBodyAccMag-iqr()': ['tGravityAccMag-iqr()'],
'tBodyAccMag-entropy()': ['tGravityAccMag-entropy()'],
'tBodyAccMag-arCoeff()1': ['tGravityAccMag-arCoeff()1'],
'tBodyAccMag-arCoeff()2': ['tGravityAccMag-arCoeff()2'],
'tBodyAccMag-arCoeff()3': ['tGravityAccMag-arCoeff()3'],
'tBodyAccMag-arCoeff()4': ['tGravityAccMag-arCoeff()4'],
'tBodyAccJerkMag-mean()': ['tBodyAccJerkMag-sma()'],
'tBodyGyroMag-mean()': ['tBodyGyroMag-sma()'],
'tBodyGyroJerkMag-mean()': ['tBodyGyroJerkMag-sma()'],
'fBodyAccMag-mean()': ['fBodyAccMag-sma()'],
'fBodyBodyAccJerkMag-mean()': ['fBodyBodyAccJerkMag-sma()'],
'fBodyBodyGyroMag-mean()': ['fBodyBodyGyroMag-sma()'],
'fBodyBodyGyroJerkMag-mean()': ['fBodyBodyGyroJerkMag-sma()']}
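As an aside, pandas can flag the same duplicate columns in essentially one line by transposing the frame so that columns become rows; this is only a sketch of an alternative to the function above, and it trades extra memory (the full transpose) for brevity:

# Transpose so columns become rows, then mark repeats.
# duplicated() keeps the first occurrence and flags every later identical column.
dup_mask = X_train.T.duplicated()
print(X_train.columns[dup_mask].tolist())   # the columns that would be dropped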


In [12]: X_train[['tBodyAccMag-mean()','tBodyAccMag-sma()','tGravityAccMag-mean()','tGravityAccMag-sma()']]

Out[12]:
tBodyAccMag-mean() tBodyAccMag-sma() tGravityAccMag-mean() tGravityAccMag-sma()

57 -0.997420 -0.997420 -0.997420 -0.997420

4154 -0.133797 -0.133797 -0.133797 -0.133797

6945 0.130528 0.130528 0.130528 0.130528

527 -0.955222 -0.955222 -0.955222 -0.955222

4196 -0.992499 -0.992499 -0.992499 -0.992499

... ... ... ... ...

5191 -0.372758 -0.372758 -0.372758 -0.372758

5226 -0.256033 -0.256033 -0.256033 -0.256033

5390 -0.434138 -0.434138 -0.434138 -0.434138

860 -0.983969 -0.983969 -0.983969 -0.983969

7270 -0.200163 -0.200163 -0.200163 -0.200163

5881 rows × 4 columns

In [13]: # Remove duplicate columns from X_train and X_test

for one_list in duplicate_columns.values():
    X_train.drop(columns=one_list, inplace=True)
    X_test.drop(columns=one_list, inplace=True)

In [14]: print(X_train.shape)

(5881, 540)

In [15]: print(X_test.shape)

(1471, 540)

2. Variance Threshold

Variance thresholding is a simple filter-based feature selection method that selects features based on their variance. The idea behind this method is that
features with low variance carry less information and may be less informative for predictive modeling tasks. Therefore, by setting a threshold for the
variance, features with variance below that threshold are removed from consideration.

Constant Features: Constant features are those that have the same value for all data points in the dataset. Since these features do not change or
vary across the dataset, they carry no useful information for modeling or analysis. Examples of constant features include an ID column, a feature that
was mistakenly duplicated, or a feature that was created with incorrect data.

Removing constant features is typically straightforward and can be done by calculating the variance of each feature. If the variance is zero, it indicates that
the feature is constant and can be safely eliminated from the dataset.
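As a minimal illustration on a tiny made-up frame (not the HAR data used below), VarianceThreshold with its default threshold of 0 flags exactly the strictly constant columns:

from sklearn.feature_selection import VarianceThreshold
import pandas as pd

# Hypothetical toy data: 'const' never changes, 'quasi' barely changes, 'varied' changes a lot.
demo = pd.DataFrame({'const': [1, 1, 1, 1],
                     'quasi': [0, 0, 0, 1],
                     'varied': [3, 1, 4, 2]})

vt = VarianceThreshold(threshold=0)   # default threshold: remove only zero-variance (constant) features
vt.fit(demo)
print(demo.columns[~vt.get_support()].tolist())   # ['const']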

Quasi-constant Features: Quasi-constant features, also known as near-constant or low-variance features, are those that have very low variability
with values that are almost constant. Although these features exhibit some degree of variation, their low variance suggests that they do not provide
much discriminatory or informative power.

Determining the threshold to classify a feature as quasi-constant depends on the specific dataset and problem. It is common to set a threshold based on
empirical analysis or domain knowledge.


In [16]: # Code


from sklearn.feature_selection import VarianceThreshold


# Create a VarianceThreshold object with a threshold of 0.05
sel = VarianceThreshold(threshold=0.05)

In [17]: # Fit the model to the training data



sel.fit(X_train)

Out[17]: VarianceThreshold(threshold=0.05)

In [18]: # Inspect which features passed the threshold

# sel.get_support()

""" get_support() returns a boolean array: True for the features whose variance is above the
threshold, False for the features that will be dropped."""

Out[18]: ' get_support() returns a boolean array: True for the features whose variance is above the\nthreshold, False for the features that will be dropped.'

In [19]: # Count how many features passed the threshold

sum(sel.get_support())  # 349 features have variance above the threshold

Out[19]: 349


In [20]: # Retrieve the selected columns from X_train



columns = X_train.columns[sel.get_support()]

In [21]: columns

# the selected columns (variance above the threshold)

Out[21]: Index(['tBodyAcc-std()-X', 'tBodyAcc-std()-Y', 'tBodyAcc-std()-Z',


'tBodyAcc-mad()-X', 'tBodyAcc-mad()-Y', 'tBodyAcc-mad()-Z',
'tBodyAcc-max()-X', 'tBodyAcc-max()-Y', 'tBodyAcc-max()-Z',
'tBodyAcc-min()-X',
...
'fBodyBodyGyroJerkMag-meanFreq()', 'fBodyBodyGyroJerkMag-skewness()',
'fBodyBodyGyroJerkMag-kurtosis()', 'angle(tBodyAccMean,gravity)',
'angle(tBodyAccJerkMean),gravityMean)',
'angle(tBodyGyroMean,gravityMean)',
'angle(tBodyGyroJerkMean,gravityMean)', 'angle(X,gravityMean)',
'angle(Y,gravityMean)', 'angle(Z,gravityMean)'],
dtype='object', length=349)

In [22]: # Transform the training set: keep only the 349 selected columns and drop every column below the threshold
X_train = sel.transform(X_train)

# Transform the test set the same way
X_test = sel.transform(X_test)

# transform() returns NumPy arrays, so re-wrap them as DataFrames with the selected column names
X_train = pd.DataFrame(X_train, columns=columns)
X_test = pd.DataFrame(X_test, columns=columns)

In [23]: print(X_train.shape)

print(X_test.shape)

# We now have 349 columns; before this step there were 540

(5881, 349)
(1471, 349)

In [24]: # Print the first few rows of the X_train dataframe



X_train.head()

Out[24]: (wide preview truncated; the selected columns start with tBodyAcc-std()-X/Y/Z, tBodyAcc-mad()-X/Y/Z, tBodyAcc-max()-X/Y/Z, tBodyAcc-min()-X, ..., fBodyBodyGyroJerkMag-meanFreq(), ...)

5 rows × 349 columns

Points to Consider

1. Ignores Target Variable: Variance Threshold is a univariate method, meaning it evaluates each feature independently and doesn't consider the
relationship between each feature and the target variable. This means it may keep irrelevant features that have a high variance but no relationship
with the target, or discard potentially useful features that have a low variance but a strong relationship with the target.

2. Ignores Feature Interactions: Variance Threshold doesn't account for interactions between features. A feature with a low variance may become very
informative when combined with another feature.

3. Sensitive to Data Scaling: Variance Threshold is sensitive to the scale of the data. If features are not on the same scale, the variance will naturally
be higher for the features with larger values. Therefore, it is important to bring the features onto a comparable scale before applying Variance Threshold.
4. Arbitrary Threshold Value: It's up to the user to define what constitutes a "low" variance. The threshold is not always easy to define and the optimal
value can vary between datasets.

3. Correlation

Correlation-based Feature Selection: This method measures the correlation between each feature and the target variable; features with higher correlation
to the target are considered more relevant. Common measures include the Pearson correlation coefficient for continuous variables and the point-biserial
correlation when one of the variables is binary. Correlation is also useful for detecting redundancy among the features themselves: if two features are
very highly correlated with each other, one of them can be dropped with little loss of information, which is what the code below does.


In [26]: # code

import seaborn as sns


In [27]: # Create a heatmap of the correlation matrix of X_train



sns.heatmap(X_train.corr())

Out[27]: <AxesSubplot:>

In [29]: X_train.corr()

Out[29]:
tBodyAcc-std()-X  tBodyAcc-std()-Y  tBodyAcc-std()-Z  tBodyAcc-mad()-X  tBodyAcc-mad()-Y  tBodyAcc-mad()-Z  tBodyAcc-max()-X  tBodyAcc-max()-Y  tBodyAcc-max()-Z  tBodyAcc-min()-X

tBodyAcc-std()-X 1.000000 0.927247 0.850268 0.998631 0.920936 0.845200 0.981284 0.893743 0.843918 -0.966714

tBodyAcc-std()-Y 0.927247 1.000000 0.895065 0.922627 0.997384 0.894128 0.917831 0.953852 0.882782 -0.937472

tBodyAcc-std()-Z 0.850268 0.895065 1.000000 0.842986 0.890973 0.997414 0.852711 0.864716 0.936311 -0.861033

tBodyAcc-mad()-X 0.998631 0.922627 0.842986 1.000000 0.916201 0.838010 0.973704 0.888702 0.838024 -0.962447

tBodyAcc-mad()-Y 0.920936 0.997384 0.890973 0.916201 1.000000 0.890707 0.911283 0.950131 0.877793 -0.932521

... ... ... ... ... ... ... ... ... ... ...

angle(tBodyGyroMean,gravityMean) 0.023914 -0.002241 -0.010535 0.024098 -0.005865 -0.014838 0.029230 -0.000207 -0.023622 -0.008700

angle(tBodyGyroJerkMean,gravityMean) -0.035176 -0.028881 -0.016002 -0.035629 -0.026679 -0.016949 -0.038935 -0.013144 -0.011510 0.030630

angle(X,gravityMean) -0.374114 -0.383095 -0.344114 -0.370629 -0.379578 -0.346350 -0.386159 -0.373556 -0.345776 0.365571

angle(Y,gravityMean) 0.472605 0.524945 0.475241 0.467965 0.526803 0.476498 0.482312 0.489971 0.462052 -0.471464

angle(Z,gravityMean) 0.393209 0.432180 0.480824 0.389139 0.430548 0.477627 0.404088 0.424181 0.416249 -0.392253

349 rows × 349 columns

In [30]: corr_matrix = X_train.corr()

In [31]: # Get the column names of the DataFrame
columns = corr_matrix.columns

# Create an empty list to keep track of columns to drop
columns_to_drop = []

# Loop over every pair of columns (upper triangle only)
for i in range(len(columns)):
    for j in range(i + 1, len(columns)):
        # If the pairwise correlation exceeds 0.95, mark the second column of the pair for dropping
        if corr_matrix.loc[columns[i], columns[j]] > 0.95:
            columns_to_drop.append(columns[j])

print(len(columns_to_drop))

1508
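As an aside, the same pruning can be written more compactly with the upper triangle of the correlation matrix. The sketch below uses the absolute correlation, which would also catch strongly negative correlations that the simple "> 0.95" check above ignores, so it may flag a few more columns:

import numpy as np

# Keep only the upper triangle (each pair counted once), then flag any column
# whose absolute correlation with an earlier column exceeds 0.95.
upper = corr_matrix.abs().where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
columns_to_drop_alt = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(len(columns_to_drop_alt))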


In [33]: set(columns_to_drop)

"""The line set(columns_to_drop) creates a set from the columns_to_drop list.
A set is an unordered collection of unique elements.
In this case, it creates a set of column names that need to be dropped based on the correlation condition.

"""

Out[33]: {'fBodyAcc-bandsEnergy()-1,16',
'fBodyAcc-bandsEnergy()-1,16.1',
'fBodyAcc-bandsEnergy()-1,16.2',
'fBodyAcc-bandsEnergy()-1,24',
'fBodyAcc-bandsEnergy()-1,24.1',
'fBodyAcc-bandsEnergy()-1,24.2',
'fBodyAcc-bandsEnergy()-1,8',
'fBodyAcc-bandsEnergy()-17,32',
'fBodyAcc-energy()-X',
'fBodyAcc-energy()-Y',
'fBodyAcc-entropy()-X',
'fBodyAcc-entropy()-Y',
'fBodyAcc-entropy()-Z',
'fBodyAcc-iqr()-X',
'fBodyAcc-iqr()-Y',
'fBodyAcc-iqr()-Z',
'fBodyAcc-kurtosis()-X',
'fBodyAcc-kurtosis()-Y',
'fBodyAcc-kurtosis()-Z',
 ...}
In [34]: columns_to_drop = set(columns_to_drop)

In [35]: # Count how many distinct columns will be dropped

len(columns_to_drop)

# Using correlation, 197 of the 349 columns are marked for dropping, leaving 152 columns

Out[35]: 197

In [36]: # Remove specified columns from X_train and X_test



X_train.drop(columns = columns_to_drop, axis = 1, inplace=True)

X_test.drop(columns = columns_to_drop, axis = 1, inplace=True)

In [39]: # Print the shape of the training and testing data



print(X_train.shape)
print(X_test.shape)

(5881, 152)
(1471, 152)


In [41]: X_train

Out[41]: (wide preview truncated; the remaining columns include tBodyAcc-std()-X/Y/Z, tBodyAcc-max()-Z, tBodyAcc-min()-X/Y/Z, tBodyAcc-entropy()-X/Y/Z, ..., fBodyBodyGyroMag-skewness(), ...)

5881 rows × 152 columns

Disadvantages of Correlation

1. Linearity Assumption: Correlation measures the linear relationship between two variables. It does not capture non-linear relationships well. If a
relationship is nonlinear, the correlation coefficient can be misleading.

2. Doesn't Capture Complex Relationships: Correlation only measures the relationship between two variables at a time. It may not capture complex
relationships involving more than two variables.

3. Threshold Determination: Just like variance threshold, defining what level of correlation is considered "high" can be subjective and may vary
depending on the specific problem or dataset.

4. Sensitive to Outliers: Correlation is sensitive to outliers. A few extreme values can significantly skew the correlation coefficient.

4. ANOVA

ANOVA can be used to assess the relevance of a feature whenever one of the two variables (feature or target) is categorical and the other is continuous.
It compares the means of the continuous variable across the groups defined by the categorical variable; significant differences in the group means indicate
potential predictive power. In this notebook the target ('Activity') is categorical and the features are continuous, so the F-test (scikit-learn's f_classif)
compares the mean of each feature across the activity classes.
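Before handing the data to SelectKBest below, it may help to see what the ANOVA F-test actually produces. A minimal sketch, reusing the X_train and y_train already defined above, of the per-feature F-statistics and p-values that SelectKBest(f_classif, ...) ranks internally:

from sklearn.feature_selection import f_classif

# One-way ANOVA for every feature: how strongly do the feature's means differ across the activity classes?
f_scores, p_values = f_classif(X_train, y_train)

# A large F-statistic (small p-value) suggests the feature separates the classes well.
anova_ranking = pd.Series(f_scores, index=X_train.columns).sort_values(ascending=False)
print(anova_ranking.head(10))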


In [42]: # code

from sklearn.feature_selection import f_classif # Anova

from sklearn.feature_selection import SelectKBest # select the K best features

# Perform feature selection

sel = SelectKBest(f_classif, k=100).fit(X_train, y_train)

# display selected feature names
X_train.columns[sel.get_support()]


# These are 100 best columns

Out[42]: Index(['tBodyAcc-std()-X', 'tBodyAcc-std()-Y', 'tBodyAcc-std()-Z',


'tBodyAcc-max()-Z', 'tBodyAcc-min()-X', 'tBodyAcc-min()-Y',
'tBodyAcc-min()-Z', 'tBodyAcc-entropy()-X', 'tBodyAcc-entropy()-Y',
'tBodyAcc-entropy()-Z', 'tBodyAcc-arCoeff()-X,1',
'tBodyAcc-arCoeff()-X,2', 'tBodyAcc-arCoeff()-X,3',
'tBodyAcc-arCoeff()-Y,1', 'tBodyAcc-arCoeff()-Z,1',
'tBodyAcc-correlation()-X,Y', 'tBodyAcc-correlation()-Y,Z',
'tGravityAcc-mean()-X', 'tGravityAcc-mean()-Y', 'tGravityAcc-mean()-Z',
'tGravityAcc-sma()', 'tGravityAcc-energy()-Y', 'tGravityAcc-energy()-Z',
'tGravityAcc-entropy()-X', 'tGravityAcc-entropy()-Y',
'tGravityAcc-arCoeff()-Y,1', 'tGravityAcc-arCoeff()-Y,2',
'tGravityAcc-arCoeff()-Z,1', 'tGravityAcc-arCoeff()-Z,2',
'tGravityAcc-correlation()-Y,Z', 'tBodyAccJerk-std()-Z',
'tBodyAccJerk-min()-X', 'tBodyAccJerk-min()-Y', 'tBodyAccJerk-min()-Z',
'tBodyAccJerk-entropy()-X', 'tBodyAccJerk-arCoeff()-X,3',
'tBodyAccJerk-correlation()-X,Y', 'tBodyGyro-std()-X',
'tBodyGyro-std()-Y', 'tBodyGyro-std()-Z', 'tBodyGyro-max()-X',
'tBodyGyro-max()-Z', 'tBodyGyro-min()-X', 'tBodyGyro-min()-Z',
'tBodyGyro-entropy()-X', 'tBodyGyro-entropy()-Y',
'tBodyGyro-entropy()-Z', 'tBodyGyro-arCoeff()-X,1',
'tBodyGyro-arCoeff()-X,2', 'tBodyGyro-arCoeff()-Z,1',
'tBodyGyro-arCoeff()-Z,2', 'tBodyGyro-arCoeff()-Z,3',
'tBodyGyro-correlation()-Y,Z', 'tBodyGyroJerk-std()-Y',
'tBodyGyroJerk-min()-X', 'tBodyGyroJerk-min()-Y',
'tBodyGyroJerk-min()-Z', 'tBodyGyroJerk-arCoeff()-Z,2',
'tBodyGyroJerk-arCoeff()-Z,3', 'tBodyAccMag-arCoeff()1',
'tBodyAccMag-arCoeff()2', 'tBodyAccMag-arCoeff()3',
'tBodyAccJerkMag-min()', 'tBodyAccJerkMag-arCoeff()1',
'tBodyAccJerkMag-arCoeff()2', 'tBodyGyroMag-min()',
'tBodyGyroMag-energy()', 'tBodyGyroMag-entropy()',
'tBodyGyroJerkMag-min()', 'fBodyAcc-min()-X', 'fBodyAcc-energy()-Z',
'fBodyAcc-meanFreq()-X', 'fBodyAcc-meanFreq()-Y',
'fBodyAcc-meanFreq()-Z', 'fBodyAcc-skewness()-X',
'fBodyAcc-bandsEnergy()-17,24', 'fBodyAcc-bandsEnergy()-1,8.1',
'fBodyAcc-bandsEnergy()-9,16.1', 'fBodyAcc-bandsEnergy()-17,32.1',
'fBodyAcc-bandsEnergy()-1,8.2', 'fBodyAccJerk-maxInds-X',
'fBodyAccJerk-maxInds-Z', 'fBodyAccJerk-meanFreq()-X',
'fBodyAccJerk-meanFreq()-Y', 'fBodyAccJerk-meanFreq()-Z',
'fBodyAccJerk-skewness()-X', 'fBodyAccJerk-bandsEnergy()-1,8.1',
'fBodyGyro-maxInds-Y', 'fBodyGyro-maxInds-Z', 'fBodyGyro-meanFreq()-X',
'fBodyGyro-meanFreq()-Z', 'fBodyGyro-bandsEnergy()-1,8.1',
'fBodyAccMag-maxInds', 'fBodyBodyAccJerkMag-min()',
'fBodyBodyAccJerkMag-meanFreq()', 'fBodyBodyAccJerkMag-skewness()',
'fBodyBodyGyroMag-meanFreq()', 'angle(X,gravityMean)',
'angle(Y,gravityMean)', 'angle(Z,gravityMean)'],
dtype='object')

In [43]: ​
# stored in columns

columns = X_train.columns[sel.get_support()]

In [44]: ​
# Transform X_train and X_test using sel
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

# Convert the transformed arrays into DataFrames with column names
X_train = pd.DataFrame(X_train, columns=columns)
X_test = pd.DataFrame(X_test, columns=columns)


In [45]: print(X_train.shape)
print(X_test.shape)

# From 152 columns down to 100 using the ANOVA method

(5881, 100)
(1471, 100)

In [46]: X_train.head()

Out[46]: (wide preview truncated; the 100 selected columns include tBodyAcc-std()-X/Y/Z, tBodyAcc-max()-Z, tBodyAcc-min()-X/Y/Z, tBodyAcc-entropy()-X/Y/Z, ..., fBodyGyro-meanFreq()-Z, fBodyGyro-bandsEnergy()-1,8.1, ...)

5 rows × 100 columns

Disadvantages of ANOVA:

1. Assumption of Normality: ANOVA assumes that the data for each group follow a normal distribution. This assumption may not hold true for all
datasets, especially those with skewed distributions.

2. Assumption of Homogeneity of Variance: ANOVA assumes that the variances of the different groups are equal. This is the assumption of
homogeneity of variance (also known as homoscedasticity). If this assumption is violated, it may lead to incorrect results.

3. Independence of Observations: ANOVA assumes that the observations are independent of each other. This might not be the case in datasets where
observations are related (e.g., time series data, nested data).

4. Effect of Outliers: ANOVA is sensitive to outliers. A single outlier can significantly affect the F-statistic leading to a potentially erroneous conclusion.

5. Doesn't Account for Interactions: Just like other univariate feature selection methods, ANOVA does not consider interactions between features.

Moment of Truth

In [51]: # Initialize and train logistic regression model


log_reg = LogisticRegression(max_iter=1000) # Increase max_iter if it doesn't converge
log_reg.fit(X_train, y_train)

# Make predictions on the test set
y_pred = log_reg.predict(X_test)

# Calculate and print accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Test accuracy:", accuracy)

Test accuracy: 0.9700883752549286

We achieved nearly the same accuracy (0.970 vs 0.981) even after reducing the number of feature columns from 561 to 100

5. Chi-Square

Chi-square is a statistical test commonly used in feature selection to determine the independence between categorical variables. It assesses the
relationship between two variables by comparing the observed frequencies in a contingency table with the frequencies that would be expected if the
variables were independent.

In the context of feature selection, the chi-square test can be used to evaluate the significance of the association between each feature and the target
variable. The basic idea is to calculate the chi-square statistic for each feature and the target, and then assess the statistical significance of the
relationship.


In [54]: titanic = pd.read_csv("D:\\datascience\\Nitish sir\\Inferential Statistics\\titanic_train.csv")[['Pclass','Sex','SibSp','Parch','Embarked','Survived']]


titanic.head()

Out[54]:
Pclass Sex SibSp Parch Embarked Survived

0 3 male 1 0 S 0

1 1 female 1 0 C 1

2 3 female 0 0 S 1

3 1 female 1 0 S 1

4 3 male 0 0 S 0

In [55]: ct = pd.crosstab(titanic['Survived'],titanic['Sex'],margins=True)
ct

Out[55]:
Sex female male All

Survived

0 81 468 549

1 233 109 342

All 314 577 891

localhost:8888/notebooks/ Feature Selection ( Prudhvi Vardhan Notes).ipynb# 20/23


6/23/23, 7:38 PM Feature Selection ( Prudhvi Vardhan Notes) - Jupyter Notebook

In [56]: from scipy.stats import chi2_contingency



# Run the chi-square test of independence on the Sex vs Survived table

chi2_contingency(ct)  # returns (chi-square statistic, p-value, degrees of freedom, expected frequencies)

# The smaller the p-value, the stronger the evidence that the feature is associated with the target
# Note: ct was built with margins=True, so the 'All' row/column is included here; that inflates the
# reported degrees of freedom (a crosstab without margins would be cleaner)

Out[56]: (263.05057407065567,
1.0036732821369117e-55,
4,
array([[193.47474747, 355.52525253, 549. ],
[120.52525253, 221.47474747, 342. ],
[314. , 577. , 891. ]]))
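As a quick sanity check on the "expected frequency" idea (row total × column total / grand total under independence), the first entry of the expected array above can be reproduced by hand:

# Expected count for (Survived=0, Sex=female) if Sex and Survived were independent:
print(549 * 314 / 891)   # ≈ 193.47, matching the first entry of the expected array above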

In [57]: # Compute a chi-square p-value for every feature against 'Survived'

score = []

for feature in titanic.columns[:-1]:
    # create the contingency table of the feature vs the target
    ct = pd.crosstab(titanic['Survived'], titanic[feature])

    # chi-square test of independence; keep only the p-value
    p_value = chi2_contingency(ct)[1]
    score.append(p_value)

In [58]: pd.Series(score, index=titanic.columns[:-1]).sort_values(ascending=True).plot(kind='bar')




# Wherever the p-value is small, the feature is considered important

# Here Sex, Pclass and SibSp come out as the most important features

Out[58]: <AxesSubplot:>


In [59]: # Code using scikit-learn



from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import chi2
import matplotlib.pyplot as plt

# assuming titanic is your DataFrame and 'Survived' is the target column

# Encode categorical variables
le = LabelEncoder()
titanic_encoded = titanic.apply(le.fit_transform)

X = titanic_encoded.drop('Survived', axis=1)
y = titanic_encoded['Survived']

# Calculate chi-squared stats
chi_scores = chi2(X, y)

# chi_scores[1] are the p-values of each feature.
p_values = pd.Series(chi_scores[1], index = X.columns)
p_values.sort_values(inplace = True)

# Plotting the p-values
p_values.plot.bar()

plt.title('Chi-square test - P-values')
plt.xlabel('Feature')
plt.ylabel('P-value')

plt.show()
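To go from the ranked p-values to an actual selection, the same chi2 score function can be plugged into SelectKBest. This is only a sketch; k=3 is an arbitrary illustrative choice, and X and y are the encoded Titanic frames from the cell above:

from sklearn.feature_selection import SelectKBest, chi2

# Keep the k features with the highest chi-square scores against the target.
selector = SelectKBest(chi2, k=3)
X_selected = selector.fit_transform(X, y)

print(X.columns[selector.get_support()])   # names of the selected features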

Disadvantages of Chi-Square

1. Categorical Data Only: The chi-square test can only be used with categorical variables. It is not suitable for continuous variables unless they have
been discretized into categories, which can lead to loss of information.

2. Independence of Observations: The chi-square test assumes that the observations are independent of each other. This might not be the case in
datasets where observations are related (e.g., time series data, nested data).

3. Sufficient Sample Size: Chi-square test requires a sufficiently large sample size. The results may not be reliable if the sample size is too small or if
the frequency count in any category is too low (typically less than 5).

4. No Variable Interactions: Chi-square test, like other univariate feature selection methods, does not consider interactions between features. It might
miss out on identifying important features that are significant in combination with other features.

Advantages and Disadvantages of Filter Methods

Advantages

1. Simplicity: Filter methods are generally straightforward and easy to understand. They involve calculating a statistic that measures the relevance of
each feature, and selecting the top features based on this statistic.

2. Speed: These methods are usually computationally efficient. Because they evaluate each feature independently, they can be much faster than
wrapper methods or embedded methods, which need to train a model to evaluate feature importance.

3. Scalability: Filter methods can handle a large number of features effectively because they don't involve any learning methods. This makes them
suitable for high-dimensional datasets.

4. Pre-processing Step: They can serve as a pre-processing step for other feature selection methods. For instance, you could use a filter method to
remove irrelevant features before applying a more computationally expensive method such as a wrapper method (a small pipeline sketch follows this list).
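As a sketch of this last point (assuming the df, le and train_test_split objects from earlier in these notes are still in scope), cheap filters can be chained in front of the estimator inside a scikit-learn Pipeline; the thresholds and k=100 simply mirror the values used above and would normally be tuned for the problem at hand:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Rebuild the original split (before any manual selection), then chain the filters.
X_full = df.drop('Activity', axis=1)
y_full = le.transform(df['Activity'])
X_tr, X_te, y_tr, y_te = train_test_split(X_full, y_full, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("drop_low_variance", VarianceThreshold(threshold=0.05)),  # cheap filter first
    ("anova_top_k", SelectKBest(f_classif, k=100)),            # then a univariate filter
    ("clf", LogisticRegression(max_iter=1000)),                # finally the estimator
])

pipe.fit(X_tr, y_tr)
print("Test accuracy:", pipe.score(X_te, y_te))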

Disadvantages

1. Lack of Feature Interaction: Filter methods treat each feature individually and hence do not consider the interactions between features. They might
miss out on identifying important features that don't appear significant individually but are significant in combination with other features.

2. Model Agnostic: Filter methods are agnostic to the machine learning model that will be used for the prediction. This means that the selected features
might not necessarily contribute to the accuracy of the specific model you want to use.

3. Statistical Measures Limitation: The statistical measures used in these methods have their own limitations. For example, correlation is a measure of
linear relationship and might not capture non-linear relationships effectively. Similarly, variance-based methods might keep features with high variance
but low predictive power.

4. Threshold Determination: For some methods, determining the threshold to select features can be a bit subjective. For example, what constitutes
"low" variance or "high" correlation might differ depending on the context or the specific dataset.
