Outliers, Hypothesis and Natural Language Processing
[27]: iris.columns
from sklearn.model_selection import train_test_split
[30]: from sklearn.preprocessing import LabelEncoder
# Encode the class labels as integers
le = LabelEncoder()
y_encoded = le.fit_transform(y)
iris[target_column] = y_encoded
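As a minimal sketch of what the encoding step does, the example below runs `LabelEncoder` on a tiny hand-made frame. The column name `species` and the four rows are assumptions made for illustration; the notebook's own `y` and `target_column` are defined in cells not shown here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame standing in for the iris target column
df = pd.DataFrame({"species": ["setosa", "versicolor", "virginica", "setosa"]})

le = LabelEncoder()
# Classes are sorted alphabetically and mapped to 0, 1, 2, ...
df["species"] = le.fit_transform(df["species"])

print(df["species"].tolist())            # [0, 1, 2, 0]
print(list(le.classes_))                 # ['setosa', 'versicolor', 'virginica']
# inverse_transform recovers the original string labels
print(list(le.inverse_transform([0, 1, 2])))
```

Once the target is numeric, it can take part in `iris.corr()` alongside the measurement columns.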
[31]: sns.heatmap(iris.corr(method='pearson'), annot=True)
plt.show()
#Treating Outliers
var = iris['sepal_width']
[34]: var
[34]: 0 3.5
1 3.0
2 3.2
3 3.1
4 3.6
…
145 3.0
146 2.5
147 3.0
148 3.4
149 3.0
Name: sepal_width, Length: 150, dtype: float64
[36]: 3.0
Data with Outliers Replaced by Median:
0 3.5
1 3.0
2 3.2
3 3.1
4 3.6
…
145 3.0
146 2.5
147 3.0
148 3.4
149 3.0
Name: sepal_width, Length: 150, dtype: float64
if len(outliers) == 0:
    print("No outliers.")
[]
No outliers.
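The cells that compute `outliers` and perform the median replacement are not shown in this export. One common way to do both is the 1.5 × IQR rule, sketched below. The helper name `replace_outliers_with_median`, the factor `k`, and the toy values (including the deliberately extreme 9.0) are assumptions for illustration, not the notebook's actual method or data.

```python
import pandas as pd

def replace_outliers_with_median(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] and replace them with the median."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    is_outlier = (series < lower) | (series > upper)
    # mask() swaps flagged entries for the median of the original series
    cleaned = series.mask(is_outlier, series.median())
    return cleaned, series[is_outlier]

# Hypothetical toy data; 9.0 is planted as an obvious outlier
var = pd.Series([3.5, 3.0, 3.2, 3.1, 3.6, 3.0, 2.5, 3.0, 3.4, 9.0], name="sepal_width")
cleaned, outliers = replace_outliers_with_median(var)

if len(outliers) == 0:
    print("No outliers.")
else:
    print("Outliers found:")
    print(outliers)
print("Data with Outliers Replaced by Median:")
print(cleaned)
```

The median is used as the replacement value because, unlike the mean, it is not pulled toward the extreme values being replaced.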
Hypothesis
[39]: import numpy as np
from scipy.stats import kstest, norm
else:
    print(f"The data follows a normal distribution (p-value = {p_value})")
else:
    print(f"The sample follows a normal distribution (p-value = {p_value})")
else:
    print(f"The sample follows a normal distribution (p-value = {p_value})")
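Only the `else` branches of the normality checks survive in this export, so the full test is not visible. A minimal sketch of a Kolmogorov-Smirnov normality check with `scipy.stats.kstest` could look like the following. The helper name `ks_normality`, the significance level `alpha = 0.05`, and the synthetic sample are assumptions. Because the mean and standard deviation are estimated from the same sample, the resulting p-value is only approximate.

```python
import numpy as np
from scipy.stats import kstest

def ks_normality(sample, alpha=0.05):
    """Compare a sample with a normal distribution fitted to its mean and std."""
    sample = np.asarray(sample, dtype=float)
    mean, std = sample.mean(), sample.std(ddof=1)
    statistic, p_value = kstest(sample, "norm", args=(mean, std))
    # H0: the sample comes from the fitted normal distribution
    return statistic, p_value, bool(p_value >= alpha)

# Synthetic sample used only for illustration
rng = np.random.default_rng(0)
sample = rng.normal(loc=3.0, scale=0.4, size=150)

statistic, p_value, looks_normal = ks_normality(sample)
if not looks_normal:
    print(f"The sample does not follow a normal distribution (p-value = {p_value})")
else:
    print(f"The sample follows a normal distribution (p-value = {p_value})")
```

A p-value above `alpha` means the test found no evidence against normality; it does not prove that the data are normally distributed.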
[ ]: # Sample documents
doc1 = 'Game of Thrones is an amazing tv series!, Game of Thrones is the best tv series! and Game of Thrones is so great'
['game', 'of', 'thrones', 'is', 'an', 'amazing', 'tv', 'series', 'game', 'of',
'thrones', 'is', 'the', 'best', 'tv', 'series', 'and', 'game', 'of', 'thrones',
'is', 'so', 'great']
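The tokenization cell that produced the list above is not shown. One simple way to obtain the same lowercase, punctuation-free tokens is a regular expression, sketched here. The use of `re.findall` is an assumption (the notebook may have used an NLTK tokenizer instead); the name `w_doc1` matches the variable used in the stop-word cell.

```python
import re

doc1 = ("Game of Thrones is an amazing tv series!, Game of Thrones is the best "
        "tv series! and Game of Thrones is so great")

# Lowercase the text, then keep only runs of letters (drops '!' and ',')
w_doc1 = re.findall(r"[a-z]+", doc1.lower())
print(w_doc1)
print(len(w_doc1))  # 23 tokens
```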
[ ]: import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
[ ]: stop_words = set(stopwords.words('english'))
filtered_words = [word for word in w_doc1 if word.lower() not in stop_words]
game thrones amazing tv series game thrones best tv series game thrones great
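from sklearn.feature_extraction.text import CountVectorizer
# Create the Bag of Words vectorizer (its definition is not shown here; default settings assumed)
vectorizer = CountVectorizer()
# fit_transform expects an iterable of documents, so wrap the single string in a list
X = vectorizer.fit_transform([doc1])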
# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()
# Convert the Bag of Words representation to a dense matrix and print it
print(X.toarray())
print("Feature names (words):", feature_names)
[[1 1 1 1 3 1 3 3 2 1 1 3 2]]
Feature names (words): ['amazing' 'an' 'and' 'best' 'game' 'great' 'is' 'of'
'series' 'so' 'the'
'thrones' 'tv']