
Python Project 2 Colab

The document outlines a data analysis project on the Wine Quality dataset, detailing the importation of necessary libraries, data preprocessing, and exploratory data analysis. Key findings include the most frequent wine quality, correlations between various features and wine quality, and the training of Decision Tree and Random Forest models to predict wine quality with accuracy scores of 0.675 and 0.725, respectively. The Random Forest model outperformed the Decision Tree model in terms of accuracy.

Uploaded by

Gaurav Rajula

1/22/25, 8:14 PM Finlatics project 2 .ipynb - Colab

FINLATICS Project 2

In this project we analyse the Wine Quality dataset.

# importing necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# importing the data set

df = pd.read_csv('/content/wine_data.csv')

# data preprocessing

df.head()

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4      ...
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8      ...
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8      ...
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8      ...
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4      ...


# Checking info about the dataset

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 fixed acidity 1599 non-null float64
1 volatile acidity 1599 non-null float64
2 citric acid 1599 non-null float64
3 residual sugar 1599 non-null float64
4 chlorides 1599 non-null float64
5 free sulfur dioxide 1599 non-null float64
6 total sulfur dioxide 1599 non-null float64
7 density 1599 non-null float64
8 pH 1599 non-null float64
9 sulphates 1599 non-null float64
10 alcohol 1599 non-null float64
11 quality 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

https://colab.research.google.com/drive/1LFtPIQYXP4WcNZ6Aik3xLZMMsvVO1HvQ#scrollTo=mG7oet9mwoPi&printMode=true 1/6
# Checking for missing values and duplicates
print(df.isnull().sum())

print("checking duplicate rows")

print(df.duplicated().sum())

print("describing data")

print(df.describe())

fixed acidity 0
volatile acidity 0
citric acid 0
residual sugar 0
chlorides 0
free sulfur dioxide 0
total sulfur dioxide 0
density 0
pH 0
sulphates 0
alcohol 0
quality 0
dtype: int64
checking duplicate rows
240
describing data
       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    1599.000000       1599.000000  1599.000000     1599.000000
mean        8.319637          0.527821     0.270976        2.538806
std         1.741096          0.179060     0.194801        1.409928
min         4.600000          0.120000     0.000000        0.900000
25%         7.100000          0.390000     0.090000        1.900000
50%         7.900000          0.520000     0.260000        2.200000
75%         9.200000          0.640000     0.420000        2.600000
max        15.900000          1.580000     1.000000       15.500000

         chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  1599.000000          1599.000000           1599.000000  1599.000000
mean      0.087467            15.874922             46.467792     0.996747
std       0.047065            10.460157             32.895324     0.001887
min       0.012000             1.000000              6.000000     0.990070
25%       0.070000             7.000000             22.000000     0.995600
50%       0.079000            14.000000             38.000000     0.996750
75%       0.090000            21.000000             62.000000     0.997835
max       0.611000            72.000000            289.000000     1.003690

                pH    sulphates      alcohol      quality
count  1599.000000  1599.000000  1599.000000  1599.000000
mean      3.311113     0.658149    10.422983     5.636023
std       0.154386     0.169507     1.065668     0.807569
min       2.740000     0.330000     8.400000     3.000000
25%       3.210000     0.550000     9.500000     5.000000
50%       3.310000     0.620000    10.200000     6.000000
75%       3.400000     0.730000    11.100000     6.000000
max       4.010000     2.000000    14.900000     8.000000
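
The duplicate check above reports 240 repeated rows, and the notebook keeps them for the rest of the analysis. As a minimal sketch of how such rows could be inspected and dropped, here is the same check on a small made-up frame (`toy` and its values are hypothetical, not taken from the wine data):

```python
import pandas as pd

# Hypothetical toy frame standing in for the wine data; the values are made up.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.4, 10.5, 9.8],
    "quality": [5, 5, 5, 6, 5],
})

# duplicated() flags every repeat of an earlier row; the first copy is not flagged.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first copy of each row by default.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print("duplicate rows:", n_duplicates)
print("rows before:", len(toy), "rows after:", len(deduplicated))
```

Whether to drop the repeats is a judgment call: identical measurements may come from distinct wines, so removing them is an assumption about the data rather than a required step.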

1. What is the most frequently occurring wine quality? What are the highest and the lowest values in the
quality column?

# Most frequently occurring wine quality


most_frequent_quality = df['quality'].mode()[0]
quality_count = df['quality'].value_counts()

# Highest and lowest values in the 'quality' column


highest_quality = df['quality'].max()
lowest_quality = df['quality'].min()


print("Most frequent wine quality : ",most_frequent_quality)


print("Frequency of each wine quality : ", quality_count)
print("Highest wine quality : " ,highest_quality)
print("Lowest wine quality : " ,lowest_quality)

Most frequent wine quality : 5


Frequency of each wine quality : quality
5 681
6 638
7 199
4 53
8 18
3 10
Name: count, dtype: int64
Highest wine quality : 8
Lowest wine quality : 3
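
The counts above show that qualities 5 and 6 dominate the dataset. A short sketch that turns the printed counts into shares follows; the counts are copied from the output above, while the name `shares` is illustrative. On the real frame, `df['quality'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above (index = quality score).
counts = pd.Series({5: 681, 6: 638, 7: 199, 4: 53, 8: 18, 3: 10})

# Share of each quality score in the whole dataset.
shares = counts / counts.sum()

# Combined share of the two most common scores.
share_5_and_6 = shares.loc[[5, 6]].sum()

print(shares.round(3))
print("share of quality 5 and 6 combined:", round(share_5_and_6, 3))
```

This imbalance matters later: a model that mostly predicts 5 or 6 can already reach a fairly high accuracy.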

2. How is fixed acidity correlated to the quality of the wine? How does the alcohol content affect the quality?
How is the free Sulphur dioxide content correlated to the quality of the wine?

# finding correlations between given features

corr_fixed_acidity = df['fixed acidity'].corr(df['quality'])


corr_alcohol = df['alcohol'].corr(df['quality'])
corr_free_sulfur_dioxide = df['free sulfur dioxide'].corr(df['quality'])

print("correlation between fixed acidity and quality of wine : ",corr_fixed_acidity)
print("correlation between alcohol and quality of wine : ",corr_alcohol)
print("correlation between free sulfur dioxide and quality of wine : ",corr_free_sulfur_dioxide)

correlation between fixed acidity and quality of wine : 0.12405164911322428
correlation between alcohol and quality of wine : 0.4761663239995365
correlation between free sulfur dioxide and quality of wine : -0.0506560572442763
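
`Series.corr` uses the Pearson coefficient by default, which is the covariance of the two columns divided by the product of their standard deviations. The sketch below checks the manual formula against pandas on made-up numbers (`x` and `y` are illustrative, not wine measurements):

```python
import pandas as pd

# Made-up values for illustration only.
x = pd.Series([9.4, 9.8, 10.5, 11.2, 12.0])
y = pd.Series([5, 5, 6, 6, 7])

# Pearson r = cov(x, y) / (std(x) * std(y)); pandas uses sample statistics (ddof=1).
manual_r = x.cov(y) / (x.std() * y.std())

# The built-in call used in the notebook.
pandas_r = x.corr(y)

print("manual :", manual_r)
print("pandas :", pandas_r)
```

Read this way, the values reported above suggest a moderate positive linear association for alcohol (about 0.48), a weak positive one for fixed acidity (about 0.12), and essentially none for free sulfur dioxide (about -0.05). Correlation alone does not show that any of these features causes a change in quality.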

# visualizing the given correlations


import seaborn as sns

plt.figure(figsize=(8,8))

# Fixed acidity vs Quality


plt.subplot(1, 3, 1)
sns.scatterplot(x='fixed acidity', y='quality', data=df, alpha=0.5)
plt.title('Fixed Acidity vs Quality')
plt.xlabel('Fixed Acidity')
plt.ylabel('Quality')

# Alcohol vs Quality
plt.subplot(1, 3, 2)
sns.scatterplot(x='alcohol', y='quality', data=df, alpha=0.5, color='orange')
plt.title('Alcohol vs Quality')
plt.xlabel('Alcohol')
plt.ylabel('Quality')

# Free Sulfur Dioxide vs Quality


plt.subplot(1, 3, 3)
sns.scatterplot(x='free sulfur dioxide', y='quality', data=df, alpha=0.5, color='green')
plt.title('Free Sulfur Dioxide vs Quality')

plt.xlabel('Free Sulfur Dioxide')
plt.ylabel('Quality')

plt.tight_layout()
plt.show()

3. What is the average residual sugar for the best quality wine and the lowest quality wine in the dataset?

# average residual sugar for the best quality wine and the lowest quality wine

residual_sugar_best_quality = df[df['quality'] == df['quality'].max()]['residual sugar'].mean()


residual_sugar_lowest_quality = df[df['quality'] == df['quality'].min()]['residual sugar'].mean()

print("Average residual sugar for the best quality wine : ",residual_sugar_best_quality)


print("Average residual sugar for the lowest quality wine : ",residual_sugar_lowest_quality)

Average residual sugar for the best quality wine : 2.5777777777777775


Average residual sugar for the lowest quality wine : 2.6350000000000002
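
The two filtered means above can also be read from a single grouped summary, which shows every quality level at once. A sketch on a made-up frame (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "quality": [3, 3, 5, 5, 8, 8],
    "residual sugar": [2.0, 3.0, 1.9, 2.1, 2.5, 2.7],
})

# Mean residual sugar for each quality score.
sugar_by_quality = toy.groupby("quality")["residual sugar"].mean()

# Look up the best and the lowest quality levels in the grouped result.
best = sugar_by_quality.loc[toy["quality"].max()]
worst = sugar_by_quality.loc[toy["quality"].min()]

print(sugar_by_quality)
print("best quality mean:", best, "lowest quality mean:", worst)
```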


4. Does volatile acidity have an effect on the quality of the wine samples in the dataset?

# correlation of volatile acidity and wine quality

corr_volatile_acidity = df['volatile acidity'].corr(df['quality'])

print("correlation between volatile acidity and wine quality : ",corr_volatile_acidity)

# Scatter plot to visualize the relationship

plt.figure(figsize=(8, 5))
sns.scatterplot(x='volatile acidity', y='quality', data=df, alpha=0.5, color='green')
plt.title('Volatile Acidity vs Wine Quality')
plt.xlabel('Volatile Acidity')
plt.ylabel('Wine Quality')
plt.show()

correlation between volatile acidity and wine quality : -0.390557780264007

5. Train a Decision Tree model and Random Forest Model separately to predict the Quality of the given samples
of wine. Compare the Accuracy scores for both models.

# for this we need to train two models and compare the accuracy score of both models
# for this we need to import needed models and split the data into training and testing

from sklearn.model_selection import train_test_split


from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Splitting data into features (X) and target (y)


X = df.drop(columns=['quality'])
y = df['quality']

# Train-test split (80% train, 20% test)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)
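
The quality classes are heavily imbalanced (only 10 samples of quality 3 and 18 of quality 8), so a purely random split can leave a rare class under-represented in the test set. One option, which the notebook does not use, is the `stratify` argument of `train_test_split`. A sketch on synthetic labels (`X_demo`, `y_demo`, and the 80/20 class mix are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 3 features, an 80/20 class imbalance.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 80 + [1] * 20)

# stratify=y_demo keeps the class proportions the same in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=3, stratify=y_demo
)

minority_in_test = int((y_te == 1).sum())
print("test size:", len(y_te), "minority samples in test:", minority_in_test)
```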


# Decision Tree Model

dt_model = DecisionTreeClassifier(random_state=3)
# fitting and training model
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)
# accuracy score of decision tree model
dt_accuracy = accuracy_score(y_test, y_pred_dt)

print("accuracy score of decision tree model for the wine data : ",dt_accuracy)

# Random Forest Model

rf_model = RandomForestClassifier(random_state=3)
# fitting and training model
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
# accuracy score of random forest model
rf_accuracy = accuracy_score(y_test, y_pred_rf)

print("accuracy score of random forest model for the wine data : ",rf_accuracy)

# comparing accuracy score of both models

print("for the given wine data")


print("accuracy score of decision tree model : ",dt_accuracy)
print("accuracy score of random forest model : ",rf_accuracy)

if dt_accuracy > rf_accuracy:
    print("Decision Tree model performs better.")
elif dt_accuracy < rf_accuracy:
    print("Random Forest model performs better.")
else:
    print("Both models have the same accuracy.")

accuracy score of decision tree model for the wine data : 0.675
accuracy score of random forest model for the wine data : 0.725
for the given wine data
accuracy score of decision tree model : 0.675
accuracy score of random forest model : 0.725
Random Forest model performs better.
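
The comparison above rests on a single 80/20 split, so the 0.05 gap in accuracy reflects only a small number of test samples and may shift with a different `random_state`. Cross-validation gives a less split-dependent estimate. The sketch below uses synthetic data from `make_classification`; the data, the fold count, and `n_estimators=50` are assumptions for illustration, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine features: 300 samples, 11 numeric columns.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=5, random_state=3
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=3),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=3),
}

# 5-fold cross-validated accuracy for each model.
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

On the wine data, the same loop could be run with `X` and `y` from the cells above; comparing the mean and spread across folds is a more cautious basis for saying one model performs better.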
