Python Project 2 Colab
FINLATICS Project 2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # used for the scatter plots below
df = pd.read_csv('/content/wine_data.csv')
# data preprocessing
df.head()
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 fixed acidity 1599 non-null float64
1 volatile acidity 1599 non-null float64
2 citric acid 1599 non-null float64
3 residual sugar 1599 non-null float64
4 chlorides 1599 non-null float64
5 free sulfur dioxide 1599 non-null float64
6 total sulfur dioxide 1599 non-null float64
7 density 1599 non-null float64
8 pH 1599 non-null float64
9 sulphates 1599 non-null float64
10 alcohol 1599 non-null float64
11 quality 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
https://colab.research.google.com/drive/1LFtPIQYXP4WcNZ6Aik3xLZMMsvVO1HvQ#scrollTo=mG7oet9mwoPi&printMode=true 1/6
1/22/25, 8:14 PM Finlatics project 2 .ipynb - Colab
# Checking for missing values and duplicates
print(df.isnull().sum())
print("checking duplicate rows")
print(df.duplicated().sum())
print("describing data")
print(df.describe())
fixed acidity 0
volatile acidity 0
citric acid 0
residual sugar 0
chlorides 0
free sulfur dioxide 0
total sulfur dioxide 0
density 0
pH 0
sulphates 0
alcohol 0
quality 0
dtype: int64
checking duplicate rows
240
describing data
fixed acidity volatile acidity citric acid residual sugar \
count 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806
std 1.741096 0.179060 0.194801 1.409928
min 4.600000 0.120000 0.000000 0.900000
25% 7.100000 0.390000 0.090000 1.900000
50% 7.900000 0.520000 0.260000 2.200000
75% 9.200000 0.640000 0.420000 2.600000
max 15.900000 1.580000 1.000000 15.500000
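The output above reports 240 duplicate rows. Whether to drop them is a judgment call; a minimal sketch of how that could be done is shown here. `drop_exact_duplicates` is a hypothetical helper and the toy frame is illustrative, not the wine data.

```python
import pandas as pd

def drop_exact_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without fully identical rows, with a fresh index."""
    return df.drop_duplicates().reset_index(drop=True)

# Illustrative toy frame (assumption): rows 0 and 2 are identical.
toy = pd.DataFrame({"alcohol": [9.4, 9.8, 9.4], "quality": [5, 5, 5]})
deduped = drop_exact_duplicates(toy)
print(len(toy), "->", len(deduped))
```

With the notebook's data, the call would be `drop_exact_duplicates(df)`; later results would then be based on the unique rows only.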
1. What is the most frequently occurring wine quality? What are the highest and the lowest values in the
quality column?
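One way to answer this question is sketched below, assuming the scores are in a column named `quality` as listed by `df.info()`. `quality_summary` is a hypothetical helper and the toy frame is illustrative, not the wine data.

```python
import pandas as pd

def quality_summary(df: pd.DataFrame) -> dict:
    """Return the most frequent, highest, and lowest value of the 'quality' column."""
    quality = df["quality"]
    return {
        # mode() returns sorted values, so ties resolve to the smallest score
        "most_frequent": int(quality.mode().iloc[0]),
        "highest": int(quality.max()),
        "lowest": int(quality.min()),
    }

# Illustrative toy frame (assumption); with the notebook's data, pass df instead.
toy = pd.DataFrame({"quality": [5, 6, 5, 7, 3, 5, 6]})
print(quality_summary(toy))
```

`df["quality"].value_counts()` would additionally show how many samples fall under each score.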
2. How is fixed acidity correlated to the quality of the wine? How does the alcohol content affect the quality?
How is the free Sulphur dioxide content correlated to the quality of the wine?
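plt.figure(figsize=(8,8))
# Fixed Acidity vs Quality (panel missing from the source; reconstructed, color assumed)
plt.subplot(1, 3, 1)
sns.scatterplot(x='fixed acidity', y='quality', data=df, alpha=0.5, color='blue')
plt.title('Fixed Acidity vs Quality')
plt.xlabel('Fixed Acidity')
plt.ylabel('Quality')
# Alcohol vs Quality
plt.subplot(1, 3, 2)
sns.scatterplot(x='alcohol', y='quality', data=df, alpha=0.5, color='orange')
plt.title('Alcohol vs Quality')
plt.xlabel('Alcohol')
plt.ylabel('Quality')
# Free Sulfur Dioxide vs Quality (start of panel missing from the source; reconstructed, color assumed)
plt.subplot(1, 3, 3)
sns.scatterplot(x='free sulfur dioxide', y='quality', data=df, alpha=0.5, color='green')
plt.title('Free Sulfur Dioxide vs Quality')
plt.xlabel('Free Sulfur Dioxide')
plt.ylabel('Quality')
plt.tight_layout()
plt.show()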
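Scatter plots of a discrete score can be hard to read, so a correlation coefficient may help quantify each relationship. The sketch below assumes Pearson correlation is an acceptable measure here; `correlation_with_quality` is a hypothetical helper and the toy values are illustrative, not taken from the wine data.

```python
import pandas as pd

def correlation_with_quality(df: pd.DataFrame, columns: list) -> pd.Series:
    """Pearson correlation of each listed column with the 'quality' column."""
    return df[columns].corrwith(df["quality"])

# Illustrative toy frame (assumption), using the notebook's column names.
toy = pd.DataFrame({
    "fixed acidity": [7.0, 8.0, 9.0, 10.0],
    "alcohol": [9.0, 10.0, 11.0, 12.0],
    "free sulfur dioxide": [30.0, 20.0, 15.0, 5.0],
    "quality": [4, 5, 6, 7],
})
features = ["fixed acidity", "alcohol", "free sulfur dioxide"]
result = correlation_with_quality(toy, features)
print(result)
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none. With the notebook's data, the call would be `correlation_with_quality(df, features)`.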
3. What is the average residual sugar for the best quality wine and the lowest quality wine in the dataset?
# average residual sugar for the best quality wine and the lowest quality wine
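One way this could be computed is sketched below, taking "best" and "lowest" to mean the highest and lowest scores present in the `quality` column. `residual_sugar_at_extremes` is a hypothetical helper and the toy frame is illustrative, not the wine data.

```python
import pandas as pd

def residual_sugar_at_extremes(df: pd.DataFrame) -> dict:
    """Mean residual sugar for the highest and the lowest quality score present."""
    means = df.groupby("quality")["residual sugar"].mean()
    best, worst = means.index.max(), means.index.min()
    return {
        "best": (int(best), float(means.loc[best])),
        "worst": (int(worst), float(means.loc[worst])),
    }

# Illustrative toy frame (assumption); with the notebook's data, pass df instead.
toy = pd.DataFrame({
    "quality": [3, 3, 5, 8, 8],
    "residual sugar": [2.0, 4.0, 2.5, 1.0, 3.0],
})
print(residual_sugar_at_extremes(toy))
```

Each entry pairs the quality score with the average residual sugar of the samples that received it.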
4. Does volatile acidity have an effect on the quality of the wine samples in the dataset?
plt.figure(figsize=(8, 5))
sns.scatterplot(x='volatile acidity', y='quality', data=df, alpha=0.5, color='green')
plt.title('Volatile Acidity vs Wine Quality')
plt.xlabel('Volatile Acidity')
plt.ylabel('Wine Quality')
plt.show()
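To back the plot with numbers, the average volatile acidity per quality score and the correlation between the two columns can be computed. This is a minimal sketch under the assumption that a per-score mean and Pearson correlation are adequate summaries; `volatile_acidity_effect` is a hypothetical helper and the toy values are illustrative.

```python
import pandas as pd

def volatile_acidity_effect(df: pd.DataFrame):
    """Return (mean volatile acidity per quality score, Pearson correlation with quality)."""
    by_quality = df.groupby("quality")["volatile acidity"].mean()
    r = float(df["volatile acidity"].corr(df["quality"]))
    return by_quality, r

# Illustrative toy frame (assumption); with the notebook's data, pass df instead.
toy = pd.DataFrame({
    "volatile acidity": [0.9, 0.7, 0.6, 0.4, 0.3],
    "quality": [3, 4, 5, 6, 7],
})
by_quality, r = volatile_acidity_effect(toy)
print(by_quality)
print("correlation:", round(r, 3))
```

A negative correlation, together with means that fall as the score rises, would suggest that higher volatile acidity goes with lower quality.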
5. Train a Decision Tree model and a Random Forest model separately to predict the quality of the given samples
of wine. Compare the accuracy scores of both models.
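# To compare the two models, import the classifiers and the metric, then split the data into training and test sets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# The split is not shown in the source; the features, target, and split parameters below are assumptions
X = df.drop('quality', axis=1)
y = df['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)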
dt_model = DecisionTreeClassifier(random_state=3)
# fitting and training model
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)
# accuracy score of decision tree model
dt_accuracy = accuracy_score(y_test, y_pred_dt)
print("accuracy score of decision tree model for the wine data : ",dt_accuracy)
rf_model = RandomForestClassifier(random_state=3)
# fitting and training model
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
# accuracy score of random forest model
rf_accuracy = accuracy_score(y_test, y_pred_rf)
print("accuracy score of random forest model for the wine data : ",rf_accuracy)
accuracy score of decision tree model for the wine data : 0.675
accuracy score of random forest model for the wine data : 0.725
For the given wine data:
accuracy score of decision tree model: 0.675
accuracy score of random forest model: 0.725
The Random Forest model performs better on this test split.
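A single train/test split gives one accuracy figure per model, which can shift with the split. As a complementary check, the two models could be compared with cross-validation. This is a sketch, not the notebook's method: `compare_models_cv` is a hypothetical helper, 5 folds is an arbitrary choice, and the data below is synthetic stand-in data rather than the wine data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def compare_models_cv(X, y, cv=5, random_state=3):
    """Mean cross-validated accuracy for a decision tree and a random forest."""
    models = {
        "decision_tree": DecisionTreeClassifier(random_state=random_state),
        "random_forest": RandomForestClassifier(random_state=random_state),
    }
    return {
        name: float(cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean())
        for name, model in models.items()
    }

# Synthetic stand-in data (assumption): 200 samples, 4 features, binary label.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)
scores = compare_models_cv(X_toy, y_toy)
print(scores)
```

With the notebook's data, the call would be `compare_models_cv(df.drop('quality', axis=1), df['quality'])`.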