Data Science Lab Manual

This manual provides steps and code snippets for various machine learning techniques such as regression, classification, clustering, and dimensionality reduction. It includes practical exercises on Excel functions, pivot tables, VLOOKUP, conditional formatting, reading data from files, data pre-processing tasks, feature scaling, dummy variables, hypothesis testing using the t-test and chi-square test, ANOVA, regression and its types, logistic regression, and decision trees.

PRACTICAL 1

Introduction to Excel
A. Perform conditional formatting on a dataset using various criteria.

Steps
Step 1: Select the data range, then go to Home > Conditional Formatting > Highlight Cells Rules > Greater Than.
Step 2: Enter the threshold value, for example 2000, and click OK.
Step 3: Go to Conditional Formatting > Data Bars > Solid Fill to add data bars to the range.

B. Create a pivot table to analyse and summarize data.


Steps
Step 1: Select the entire table and go to Insert > PivotChart > PivotChart.
Step 2: Select "New Worksheet" in the Create PivotChart dialog and click OK.
Step 3: Drag the required fields into the Axis, Legend, and Values boxes below the field list.

C. Use VLOOKUP function to retrieve information from a different worksheet or table.


Steps:
Step 1: Click on an empty cell and type a VLOOKUP formula, for example:
=VLOOKUP(B3, B3:D10, 2, FALSE)
Here B3 is the lookup value, B3:D10 is the lookup table (the range can also refer to another worksheet, e.g. Sheet2!B3:D10), 2 is the column of the table to return, and FALSE requests an exact match.
D. Perform what-if analysis using Goal Seek to determine input values for a desired output.
Steps:
Step 1: In the Data tab, go to What-If Analysis > Goal Seek.

Step 2: Fill in the Set cell, To value, and By changing cell fields in the Goal Seek window and click OK.


PRACTICAL 2

Data Frames and Basic Data Pre-processing


A. Read data from CSV and JSON files into a data frame.
(1)
# Read data from a csv file
import pandas as pd
df = pd.read_csv('Student_Marks.csv')
print("Our dataset ")
print(df)

(2)
# Reading data from a JSON file
import pandas as pd
data = pd.read_json('dataset.json')
print(data)

B. Perform basic data pre-processing tasks such as handling missing values and outliers.
Code:
(1)
# Replacing NA values using fillna()
import pandas as pd
df = pd.read_csv('titanic.csv')
print(df)
df.head(10)
print("Dataset after filling NA values with 0 : ")
df2=df.fillna(value=0)
print(df2)

(2)
# Dropping NA values using dropna()
import pandas as pd
df = pd.read_csv('titanic.csv')
print(df)
df.head(10)

print("Dataset after dropping NA values: ")


df.dropna(inplace = True)
print(df)
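
The code above handles missing values only; outliers can be treated in a similar pre-processing step. A minimal sketch using the IQR rule is shown below (the numeric 'Age' column of titanic.csv is an assumption):

# Removing outliers with the IQR rule (sketch; assumes a numeric 'Age' column)
import pandas as pd
df = pd.read_csv('titanic.csv')
q1, q3 = df['Age'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
# Keep only rows whose 'Age' lies within the IQR fences (rows with missing 'Age' are dropped too)
df_no_outliers = df[(df['Age'] >= lower) & (df['Age'] <= upper)]
print("Rows before:", len(df), "| Rows after removing 'Age' outliers:", len(df_no_outliers))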
C. Manipulate and transform data using functions like filtering, sorting, and grouping
Code:
import pandas as pd

# Load the Iris dataset
iris = pd.read_csv('Iris.csv')

# Filtering data based on a condition (species labels in Iris.csv carry the 'Iris-' prefix)
setosa = iris[iris['Species'] == 'Iris-setosa']
print("Setosa samples:")
print(setosa.head())

# Sorting data
sorted_iris = iris.sort_values(by='SepalLengthCm', ascending=False)
print("\nSorted iris dataset:")
print(sorted_iris.head())

# Grouping data
grouped_species = iris.groupby('Species').mean()
print("\nMean measurements for each species:")
print(grouped_species)
PRACTICAL 3

Feature Scaling and Dummification

A. Apply feature-scaling techniques like standardization and normalization to numerical features.

Code:

# Standardization and normalization
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.read_csv('wine.csv', header=None, usecols=[0, 1, 2], skiprows=1)
df.columns = ['classlabel', 'Alcohol', 'Malic Acid']
print("Original DataFrame:")
print(df)

# Min-max normalization: rescale each feature to the [0, 1] range
df_minmax = df.copy()
df_minmax[['Alcohol', 'Malic Acid']] = MinMaxScaler().fit_transform(df[['Alcohol', 'Malic Acid']])
print("\nDataFrame after MinMax Scaling")
print(df_minmax)

# Standardization: zero mean and unit variance, applied to the original values
df_standard = df.copy()
df_standard[['Alcohol', 'Malic Acid']] = StandardScaler().fit_transform(df[['Alcohol', 'Malic Acid']])
print("\nDataFrame after Standard Scaling")
print(df_standard)
B. Perform feature dummification to convert categorical variables into numerical representations.

Code:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

iris = pd.read_csv("Iris.csv")
print(iris)

# Label-encode the Species column into integer codes
le = LabelEncoder()
iris['code'] = le.fit_transform(iris.Species)
print(iris)
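
LabelEncoder assigns a single integer code to each species, which is label encoding rather than true dummification. A minimal sketch of one-hot dummification with pd.get_dummies (reusing the same Iris.csv file) follows:

# One-hot encoding (dummification): one binary column per species
import pandas as pd
iris = pd.read_csv("Iris.csv")
iris_dummies = pd.get_dummies(iris, columns=['Species'], prefix='Species')
print(iris_dummies.head())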
Practical 4
Hypothesis Testing
Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test)
# t-test
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Generate two samples for demonstration purposes
np.random.seed(42)
sample1 = np.random.normal(loc=10, scale=2, size=30)
sample2 = np.random.normal(loc=12, scale=2, size=30)

# Perform a two-sample t-test
t_statistic, p_value = stats.ttest_ind(sample1, sample2)

# Set the significance level
alpha = 0.05

print("Results of Two-Sample t-test:")
print(f'T-statistic: {t_statistic}')
print(f'P-value: {p_value}')
print(f"Degrees of Freedom: {len(sample1) + len(sample2) - 2}")

# Plot the distributions
plt.figure(figsize=(10, 6))
plt.hist(sample1, alpha=0.5, label='Sample 1', color='blue')
plt.hist(sample2, alpha=0.5, label='Sample 2', color='orange')
plt.axvline(np.mean(sample1), color='blue', linestyle='dashed', linewidth=2)
plt.axvline(np.mean(sample2), color='orange', linestyle='dashed', linewidth=2)
plt.title('Distributions of Sample 1 and Sample 2')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.legend()

# Highlight the critical region if the null hypothesis is rejected
if p_value < alpha:
    critical_region = np.linspace(min(sample1.min(), sample2.min()),
                                  max(sample1.max(), sample2.max()), 1000)
    plt.fill_between(critical_region, 0, 5, color='red', alpha=0.3, label='Critical Region')
    plt.text(11, 5, f'T-statistic: {t_statistic:.2f}', ha='center', va='center',
             color='black', backgroundcolor='white')

# Show the plot
plt.show()

# Draw conclusions
if p_value < alpha:
    if np.mean(sample1) > np.mean(sample2):
        print("Conclusion: There is significant evidence to reject the null hypothesis.")
        print("Interpretation: The mean of Sample 1 is significantly higher than that of Sample 2.")
    else:
        print("Conclusion: There is significant evidence to reject the null hypothesis.")
        print("Interpretation: The mean of Sample 2 is significantly higher than that of Sample 1.")
else:
    print("Conclusion: Fail to reject the null hypothesis.")
    print("Interpretation: There is not enough evidence to claim a significant difference between the means.")

Output:

# Chi-square test
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import warnings
from scipy import stats
warnings.filterwarnings('ignore')
df=sb.load_dataset('mpg')
print(df)
print(df['horsepower'].describe())
print(df['model_year'].describe())
bins=[0,75,150,240]
df['horsepower_new']=pd.cut(df['horsepower'],bins=bins,labels=['l','m','h'])
c=df['horsepower_new']
print(c)
ybins=[69,72,74,84]
label=['t1','t2','t3']
df['modelyear_new']=pd.cut(df['model_year'],bins=ybins,labels=label)
newyear=df['modelyear_new']
print(newyear)
df_chi=pd.crosstab(df['horsepower_new'],df['modelyear_new'])
print(df_chi)
print(stats.chi2_contingency(df_chi))
Output:
Conclusion: There is sufficient evidence to reject the null hypothesis, indicating that
there is a significant association between 'horsepower_new' and 'modelyear_new'
categories.
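
The conclusion above comes from comparing the p-value returned by chi2_contingency with the chosen significance level. A minimal sketch that makes this decision explicit (reusing df_chi and stats from the code above, with alpha = 0.05 assumed):

# Make the chi-square decision explicit (alpha = 0.05 is an assumed significance level)
chi2, p, dof, expected = stats.chi2_contingency(df_chi)
alpha = 0.05
print(f"Chi-square statistic: {chi2:.3f}, p-value: {p:.4f}, degrees of freedom: {dof}")
if p < alpha:
    print("Reject the null hypothesis: the two categorical variables are associated.")
else:
    print("Fail to reject the null hypothesis: no significant association detected.")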
Practical 5
ANOVA (Analysis of Variance)
Perform one-way ANOVA to compare means across multiple groups.
Conduct post-hoc tests to identify significant differences between group means.

import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

group1 = [23, 25, 29, 34, 30]
group2 = [19, 20, 22, 24, 25]
group3 = [15, 18, 20, 21, 17]
group4 = [28, 24, 26, 30, 29]

all_data = group1 + group2 + group3 + group4
group_labels = (['Group1'] * len(group1) + ['Group2'] * len(group2) +
                ['Group3'] * len(group3) + ['Group4'] * len(group4))

# One-way ANOVA across the four groups
f_statistic, p_value = stats.f_oneway(group1, group2, group3, group4)
print("One-way ANOVA:")
print("F-statistic:", f_statistic)
print("p-value:", p_value)

# Tukey-Kramer post-hoc test to locate pairwise differences between group means
tukey_results = pairwise_tukeyhsd(all_data, group_labels)
print("\nTukey-Kramer post-hoc test:")
print(tukey_results)

Output:-
Practical 6
Regression and its Types.

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

housing = fetch_california_housing()
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)
print(housing_df)

housing_df['PRICE'] = housing.target

X = housing_df[['AveRooms']]
y = housing_df['PRICE']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()

model.fit(X_train, y_train)

# Evaluate the simple linear regression model on the test set
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)
print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_)

#########################################

# Multiple Linear Regression

X = housing_df.drop('PRICE',axis=1)
y = housing_df['PRICE']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

model = LinearRegression()

model.fit(X_train,y_train)

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test,y_pred)
r2 = r2_score(y_test,y_pred)

print("Mean Squared Error:",mse)


print("R-squared:",r2)
print("Intercept:",model.intercept_)
print("Coefficient:",model.coef_)
Output:
Practical 7

Logistic Regression and Decision Tree

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report
# Load the Iris dataset and create a binary classification problem
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])
binary_df = iris_df[iris_df['target'] != 2]
X = binary_df.drop('target', axis=1)
y = binary_df['target']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a logistic regression model and evaluate its performance
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
y_pred_logistic = logistic_model.predict(X_test)

print("Logistic Regression Metrics")


print("Accuracy: ", accuracy_score(y_test, y_pred_logistic))
print("Precision:", precision_score(y_test, y_pred_logistic))
print("Recall: ", recall_score(y_test, y_pred_logistic))
print("\nClassification Report")
print(classification_report(y_test, y_pred_logistic))
# Train a decision tree model and evaluate its performance
decision_tree_model = DecisionTreeClassifier()
decision_tree_model.fit(X_train, y_train)
y_pred_tree = decision_tree_model.predict(X_test)
print("\nDecision Tree Metrics")
print("Accuracy: ", accuracy_score(y_test, y_pred_tree))
print("Precision:", precision_score(y_test, y_pred_tree))
print("Recall: ", recall_score(y_test, y_pred_tree))
print("\nClassification Report")
print(classification_report(y_test, y_pred_tree))
Output:-
Practical 8
K-Means clustering
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

data = pd.read_csv(r"C:\Users\Reape\Downloads\wholesale\wholesale.csv")
data.head()

categorical_features = ['Channel', 'Region']
continuous_features = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen']
data[continuous_features].describe()

for col in categorical_features:
    dummies = pd.get_dummies(data[col], prefix=col)
    data = pd.concat([data, dummies], axis=1)
    data.drop(col, axis=1, inplace=True)
data.head()

mms = MinMaxScaler()
mms.fit(data)
data_transformed = mms.transform(data)

sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(data_transformed)
    sum_of_squared_distances.append(km.inertia_)

plt.plot(K, sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum of squared distances')
plt.title('Elbow Method for optimal k')
plt.show()

Output:
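
The elbow plot above only suggests a value of k; the clustering itself still has to be fitted. A minimal sketch of fitting K-Means with a chosen k and attaching the cluster labels (k = 5 is an assumed value read off the elbow; data and data_transformed come from the code above):

# Fit K-Means with the chosen number of clusters (k = 5 is an assumption)
km_final = KMeans(n_clusters=5, n_init=10, random_state=42)
cluster_labels = km_final.fit_predict(data_transformed)
# Attach the cluster labels to the encoded data for inspection
data['Cluster'] = cluster_labels
print(data['Cluster'].value_counts())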
Practical 9

Principal Component Analysis (PCA)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])
X = iris_df.drop('target', axis=1)
y = iris_df['target']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA()
X_pca = pca.fit_transform(X_scaled)
explained_variance_ratio = pca.explained_variance_ratio_

plt.figure(figsize=(8, 6))
plt.plot(np.cumsum(explained_variance_ratio), marker='o', linestyle='--')
plt.title('Explained Variance Ratio')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.grid(True)
plt.show()

cumulative_variance_ratio = np.cumsum(explained_variance_ratio)
n_components = np.argmax(cumulative_variance_ratio >= 0.95) + 1
print(f"Number of principal components to explain 95% variance: {n_components}")

pca = PCA(n_components=n_components)
X_reduced = pca.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='viridis', s=50, alpha=0.5)
plt.title('Data in Reduced-dimensional Space')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(label='Target')
plt.show()
Output:
