Train


Implement Find-S Algorithm

Load Libraries
In [1]:
import pandas as pd

Load Dataset
In [2]:
df = pd.read_csv('./tennis.csv')

Explore Dataset
In [3]:
df.head()
Out[3]:

    outlook  temp humidity  windy play
0     sunny   hot     high  False   no
1     sunny   hot     high   True   no
2  overcast   hot     high  False  yes
3     rainy  mild     high  False  yes
4     rainy  cool   normal  False  yes

In [4]:
df.shape
Out[4]:
(14, 5)
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 outlook 14 non-null object
1 temp 14 non-null object
2 humidity 14 non-null object
3 windy 14 non-null bool
4 play 14 non-null object
dtypes: bool(1), object(4)
memory usage: 590.0+ bytes
In [16]:
for i in df.columns:
    print(f'{i} : {df[i].unique()}')
outlook : ['sunny' 'overcast' 'rainy']
temp : ['hot' 'mild' 'cool']
humidity : ['high' 'normal']
windy : [False True]
play : ['no' 'yes']

Split Dataset Into Attributes And Target


In [6]:
result=df['play'].values
In [7]:
attributes=df.drop('play',axis=1).values

Initialization Of Specific Hypothesis


In [8]:
H=['0']*attributes.shape[1]
In [9]:
print(f'Initial Hypothesis is : {H}')
Initial Hypothesis is : ['0', '0', '0', '0']

Implement The Logic Of Find-S Algorithm


In [10]:
for i in range(attributes.shape[0]):
    # Find-S only learns from positive examples
    if result[i]=='yes':
        for j in range(attributes.shape[1]):
            if H[j]=='0':
                # First positive example: copy its value
                H[j]=attributes[i][j]
            elif H[j]!=attributes[i][j]:
                # Conflicting value: generalize this attribute
                H[j]='?'
    print(f'For Step-{i} : {H}')
For Step-0 : ['0', '0', '0', '0']
For Step-1 : ['0', '0', '0', '0']
For Step-2 : ['overcast', 'hot', 'high', False]
For Step-3 : ['?', '?', 'high', False]
For Step-4 : ['?', '?', '?', False]
For Step-5 : ['?', '?', '?', False]
For Step-6 : ['?', '?', '?', '?']
For Step-7 : ['?', '?', '?', '?']
For Step-8 : ['?', '?', '?', '?']
For Step-9 : ['?', '?', '?', '?']
For Step-10 : ['?', '?', '?', '?']
For Step-11 : ['?', '?', '?', '?']
For Step-12 : ['?', '?', '?', '?']
For Step-13 : ['?', '?', '?', '?']

Final Hypothesis


In [11]:
print(f'Final Hypothesis is : {H}')
Final Hypothesis is : ['?', '?', '?', '?']
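
The loop above can also be packaged as a small function. The following is a minimal sketch rather than the notebook's own code: the name `find_s` and the use of `None` in place of the all-'0' starting hypothesis are assumptions made for illustration, and the data is limited to the five rows shown by df.head().

```python
def find_s(rows, labels, positive="yes"):
    """Return the most specific hypothesis that covers every positive row."""
    hypothesis = None  # None plays the role of the all-'0' starting hypothesis
    for row, label in zip(rows, labels):
        if label != positive:
            continue  # Find-S ignores negative examples
        if hypothesis is None:
            hypothesis = list(row)  # the first positive row is copied as-is
        else:
            # keep values that agree, generalize values that differ to '?'
            hypothesis = [h if h == v else "?" for h, v in zip(hypothesis, row)]
    return hypothesis


# The five rows printed by df.head() in the notebook above
rows = [
    ("sunny", "hot", "high", False),
    ("sunny", "hot", "high", True),
    ("overcast", "hot", "high", False),
    ("rainy", "mild", "high", False),
    ("rainy", "cool", "normal", False),
]
labels = ["no", "no", "yes", "yes", "yes"]

print(find_s(rows, labels))  # ['?', '?', '?', False]
```

On these five rows the result agrees with the Step-4 line of the notebook output.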

Q2)

You are given a dataset Housing.csv, which contains information about various features of houses and
their corresponding prices. The goal is to predict the house prices based on the available features
using linear regression.

Code :

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset
df = pd.read_csv('Housing.csv')

# Check for missing values
print(df.isnull().sum())

# Fill missing values: numerical columns with the mean, categorical columns with the mode
# numeric_only=True keeps df.mean() from failing on text columns
df.fillna(df.mean(numeric_only=True), inplace=True)
# 'Column_name' is a placeholder; replace it with an actual categorical column
df['Column_name'] = df['Column_name'].fillna(df['Column_name'].mode()[0])

# One-Hot Encoding for categorical columns
df_encoded = pd.get_dummies(df, drop_first=True)

# Define features (X) and target (y)
X = df_encoded.drop(columns=['Price'])
y = df_encoded['Price']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Output the evaluation metrics
print("Mean Squared Error (MSE):", mse)
print("R-squared (R²):", r2)

# Plot the actual vs predicted values
plt.scatter(y_test, y_pred)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual vs Predicted House Prices")
plt.show()

# Residuals plot
residuals = y_test - y_pred
sns.scatterplot(x=y_pred, y=residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel("Predicted Prices")
plt.ylabel("Residuals")
plt.title("Residuals vs Predicted Prices")
plt.show()

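
To make the two reported metrics concrete, here is a minimal sketch that computes them by hand without scikit-learn. The function names and the three sample values are made up for illustration; they are not taken from Housing.csv.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_manual(y_true, y_pred):
    """1 - SS_res / SS_tot: share of the variance in y_true explained by y_pred."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


actual = [3.0, 5.0, 7.0]      # made-up actual values
predicted = [2.0, 5.0, 8.0]   # made-up predictions

print(mean_squared_error_manual(actual, predicted))  # (1 + 0 + 1) / 3
print(r2_score_manual(actual, predicted))            # 0.75
```

An R² of 1 means the predictions match exactly, while 0 means they do no better than always predicting the mean.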
Q: Design a task where you acquire two distinct types of datasets: one comprising numerical data and
the other categorical data. Subsequently, you will perform Linear Regression on the dataset containing
numerical values, and Logistic Regression on the dataset containing categorical values.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, confusion_matrix
from sklearn.preprocessing import OneHotEncoder

# Dataset 1: Numerical Data (House Prices Prediction)
# Create a small hand-written numerical dataset
data_numerical = {
    'Size': [1200, 1500, 1800, 2000, 1600, 1100, 2500, 2200, 2300, 1400],
    'Bedrooms': [3, 4, 3, 5, 4, 2, 5, 4, 4, 3],
    'Age': [5, 10, 15, 7, 3, 12, 8, 14, 5, 20],
    'Price': [400000, 500000, 450000, 350000, 475000, 320000, 600000, 550000, 580000, 330000]
}
df_numerical = pd.DataFrame(data_numerical)

# Features and target variable for numerical dataset
X_numerical = df_numerical.drop(columns=['Price'])
y_numerical = df_numerical['Price']

# Split the data into training and testing sets for linear regression
X_train_num, X_test_num, y_train_num, y_test_num = train_test_split(
    X_numerical, y_numerical, test_size=0.2, random_state=42)

# Linear Regression Model
linear_model = LinearRegression()
linear_model.fit(X_train_num, y_train_num)

# Predict on test data
y_pred_num = linear_model.predict(X_test_num)

# Evaluate Linear Regression Model
mse_num = mean_squared_error(y_test_num, y_pred_num)
r2_num = r2_score(y_test_num, y_pred_num)

# Dataset 2: Categorical Data (Customer Purchase Prediction)
# Create a small hand-written categorical dataset
data_categorical = {
    'Age_Group': ['18-25', '26-35', '36-45', '46-60', '18-25', '26-35', '36-45', '46-60', '18-25', '26-35'],
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Male', 'Female', 'Female', 'Male'],
    'Product_Category': ['A', 'B', 'A', 'B', 'B', 'A', 'B', 'A', 'B', 'A'],
    'Purchased': [1, 0, 1, 0, 1, 1, 0, 1, 0, 1]  # 1 = Purchased, 0 = Not Purchased
}
df_categorical = pd.DataFrame(data_categorical)

# One-Hot Encoding for categorical features
df_categorical_encoded = pd.get_dummies(df_categorical, drop_first=True)

# Features and target variable for categorical dataset
X_categorical = df_categorical_encoded.drop(columns=['Purchased'])
y_categorical = df_categorical_encoded['Purchased']

# Split the data into training and testing sets for logistic regression
X_train_cat, X_test_cat, y_train_cat, y_test_cat = train_test_split(
    X_categorical, y_categorical, test_size=0.2, random_state=42)

# Logistic Regression Model
logistic_model = LogisticRegression()
logistic_model.fit(X_train_cat, y_train_cat)

# Predict on test data
y_pred_cat = logistic_model.predict(X_test_cat)

# Evaluate Logistic Regression Model
accuracy_cat = accuracy_score(y_test_cat, y_pred_cat)
conf_matrix_cat = confusion_matrix(y_test_cat, y_pred_cat)

# Output the evaluation metrics
print("Linear Regression - Numerical Dataset Evaluation:")
print(f"Mean Squared Error: {mse_num}")
print(f"R-squared: {r2_num}")
print("\nLogistic Regression - Categorical Dataset Evaluation:")
print(f"Accuracy: {accuracy_cat}")
print(f"Confusion Matrix:\n{conf_matrix_cat}")
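
Logistic regression turns a weighted sum of the encoded features into a probability with the sigmoid function and then applies a threshold to pick a class. The sketch below illustrates only that prediction step. The weights and bias are made-up values, not the coefficients learned by logistic_model, and the helper names are hypothetical.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of class 1 for one row of 0/1 encoded features."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


def predict(weights, bias, features, threshold=0.5):
    """Class label obtained by thresholding the probability."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0


weights = [1.5, -2.0]  # hypothetical coefficients for two one-hot columns
bias = 0.5             # hypothetical intercept

print(predict(weights, bias, [1, 0]))  # score 2.0  -> class 1
print(predict(weights, bias, [0, 1]))  # score -1.5 -> class 0
```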

Question 4

You decide to use the KNN algorithm for classification. The dataset is split into 80%
training data and 20% test data.
1. Load the dataset and preprocess it (handle missing values, normalize the
data if necessary).
2. Implement the KNN algorithm to classify.
3. Evaluate the model's performance by calculating the accuracy and displaying the confusion matrix.

# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Step 1: Load and preprocess the dataset
# Load the dataset (replace 'your_dataset.csv' with the actual path to your dataset)
# For demonstration, let's create a synthetic dataset
data = {
    'Feature1': [1, 2, 3, 4, 5, 6, np.nan, 8, 9, 10],
    'Feature2': [5, 4, 3, 2, 1, np.nan, 3, 2, 1, 5],
    'Feature3': [np.nan, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Target': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]  # Binary classification (0 or 1)
}
df = pd.DataFrame(data)

# Show the dataset
print("Original Dataset:")
print(df)

# Handling missing values
imputer = SimpleImputer(strategy='mean')  # Fill missing values with the mean of the column
features = df.drop(columns=['Target'])
# Pass columns= so the feature names survive the imputation step
df_imputed = pd.DataFrame(imputer.fit_transform(features), columns=features.columns)
df_imputed['Target'] = df['Target']  # Add target column back

# Normalize/standardize the data (important for KNN)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_imputed.drop(columns=['Target']))
y = df_imputed['Target']

# Step 2: Split the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Step 3: Implement the KNN algorithm
# Initialize the KNN classifier (let's use k=3 for this example)
knn = KNeighborsClassifier(n_neighbors=3)

# Train the KNN model on the training data
knn.fit(X_train, y_train)

# Make predictions on the test data
y_pred = knn.predict(X_test)

# Step 4: Evaluate the model's performance
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.2f}")

# Classification report (precision, recall, f1-score)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
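
For intuition about what KNeighborsClassifier does at prediction time, here is a minimal from-scratch sketch: measure the Euclidean distance from the query to every training point, keep the k closest, and take a majority vote. The function name `knn_predict` and the six training points are made up for illustration, and the sketch leaves out the imputation and scaling steps used above.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    # Pair each training point's distance to the query with its label
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Two well-separated, made-up clusters
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (1.2, 1.4)))  # 0
print(knn_predict(train_points, train_labels, (6.4, 6.6)))  # 1
```

Because the vote depends on raw distances, a feature with a larger numeric range would dominate the result, which is why the listing above standardizes the features first.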

Question 2

Your task is to take an unclean dataset and drop the unnecessary columns from it. Then, check
the remaining columns to see if there are any NaN (missing) values, and if there are, fill those
values. After that, apply One-hot encoding to the categorical values. Finally, calculate the mean
of the columns you are using.

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Step 1: Load the dataset
# For this demonstration, we'll create a synthetic dataset with some NaN values and categorical data.
data = {
    'Feature1': [1, 2, np.nan, 4, 5],
    'Feature2': [5, np.nan, 3, 2, 1],
    'Feature3': ['A', 'B', 'A', 'B', 'A'],
    'Feature4': [10, 20, 30, 40, 50],
    'UnnecessaryFeature': ['X', 'Y', 'Z', 'X', 'Y']  # This will be dropped
}

# Create DataFrame
df = pd.DataFrame(data)

# Show the original dataset
print("Original Dataset:")
print(df)

# Step 2: Drop unnecessary columns (in this case, 'UnnecessaryFeature')
df_cleaned = df.drop(columns=['UnnecessaryFeature'])

# Step 3: Check for missing (NaN) values
print("\nMissing Values Before Filling:")
print(df_cleaned.isna().sum())

# Fill missing values with the mean for numerical columns
# numeric_only=True skips the text column 'Feature3', which has no mean
df_cleaned = df_cleaned.fillna(df_cleaned.mean(numeric_only=True))

# Check again for missing values after filling
print("\nMissing Values After Filling:")
print(df_cleaned.isna().sum())

# Step 4: Apply One-hot encoding to categorical variables
# For this example, 'Feature3' is categorical and will be encoded
df_encoded = pd.get_dummies(df_cleaned, columns=['Feature3'], drop_first=True)

# Show the DataFrame after encoding
print("\nDataset After One-hot Encoding:")
print(df_encoded)

# Step 5: Calculate the mean of the numerical columns
mean_values = df_encoded.mean()

# Show the means of the columns
print("\nMean of Columns:")
print(mean_values)

output :

Original Dataset:
   Feature1  Feature2 Feature3  Feature4 UnnecessaryFeature
0       1.0       5.0        A        10                  X
1       2.0       NaN        B        20                  Y
2       NaN       3.0        A        30                  Z
3       4.0       2.0        B        40                  X
4       5.0       1.0        A        50                  Y

Missing Values Before Filling:
Feature1    1
Feature2    1
Feature3    0
Feature4    0
dtype: int64

Missing Values After Filling:
Feature1    0
Feature2    0
Feature3    0
Feature4    0
dtype: int64

Dataset After One-hot Encoding:
   Feature1  Feature2  Feature4  Feature3_B
0       1.0      5.00        10           0
1       2.0      2.75        20           1
2       3.0      3.00        30           0
3       4.0      2.00        40           1
4       5.0      1.00        50           0

Mean of Columns:
Feature1       3.000000
Feature2       2.750000
Feature4      30.000000
Feature3_B     0.400000
dtype: float64
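
The two key transformations in this exercise, mean imputation and one-hot encoding with the first category dropped, can be checked by hand. The following is a minimal pure-Python sketch; the helper names `fill_with_mean` and `one_hot` are hypothetical, and they only mimic what fillna and pd.get_dummies do on the Feature2 and Feature3 columns above.

```python
import math


def fill_with_mean(values):
    """Replace NaN entries with the mean of the non-missing entries."""
    present = [v for v in values if not math.isnan(v)]
    mean = sum(present) / len(present)
    return [mean if math.isnan(v) else v for v in values]


def one_hot(values, drop_first=True):
    """Return {category: 0/1 indicator list}, optionally dropping the first category."""
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]  # the dropped category is implied by all zeros
    return {c: [1 if v == c else 0 for v in values] for c in categories}


feature2 = [5.0, float("nan"), 3.0, 2.0, 1.0]  # Feature2 from the listing above
feature3 = ["A", "B", "A", "B", "A"]           # Feature3 from the listing above

print(fill_with_mean(feature2))  # [5.0, 2.75, 3.0, 2.0, 1.0]
print(one_hot(feature3))         # {'B': [0, 1, 0, 1, 0]}
```

The missing Feature2 entry becomes 2.75, the mean of the four observed values, so the column mean stays at 2.75 after filling.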
OR

import pandas as pd

# Load the dataset
file_path = '/mnt/data/laptopData.csv'
data = pd.read_csv(file_path)

# Step 1: Drop unnecessary columns
data_cleaned = data.drop(columns=['Unnamed: 0'])

# Step 2: Check for missing values
# Fill missing values: Numerical columns will be filled with their mean; categorical with the mode.
# Assign the result back instead of using inplace=True on a column selection,
# which recent pandas versions warn about and may not apply to the DataFrame
for column in data_cleaned.columns:
    if data_cleaned[column].dtype == 'object':  # Categorical data
        data_cleaned[column] = data_cleaned[column].fillna(data_cleaned[column].mode()[0])
    else:  # Numerical data
        data_cleaned[column] = data_cleaned[column].fillna(data_cleaned[column].mean())

# Verify no missing values remain
assert data_cleaned.isnull().sum().sum() == 0, "There are still missing values!"

# Step 3: Apply One-Hot Encoding to categorical columns
categorical_columns = data_cleaned.select_dtypes(include=['object']).columns
data_encoded = pd.get_dummies(data_cleaned, columns=categorical_columns, drop_first=True)

# Step 4: Calculate the mean of all numerical columns
column_means = data_encoded.mean()

# Display the results
print("Column Means:")
print(column_means)
