Report

Contents
Problem 01
Problem 02
Problem 03
Problem 4
Problem 01
import pandas as pd
import matplotlib.pyplot as plt

# Define the column names based on the dataset description
columns = [
    "mpg", "cylinders", "displacement", "horsepower",
    "weight", "acceleration", "model_year", "origin", "car_name"
]

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
df = pd.read_csv(
    url,
    sep=r"\s+",      # whitespace-delimited; delim_whitespace=True is deprecated in recent pandas
    names=columns,
    na_values='?'    # treat '?' entries as missing values
)

# Display the structure of the dataset
print("Dataset structure:")
df.info()  # info() prints its summary directly and returns None
print("\nFirst few rows of the dataset:")
print(df.head())
Code explanation:
• The pd.read_csv() function loaded the dataset directly from the
URL, using whitespace as the delimiter.
• Missing values were identified using the na_values parameter.
• The info() method summarized the dataset structure, including
data types and non-null counts for each column.
• The head() function displayed the first few rows to show the
dataset content.
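As a small offline illustration of the loading options described above, the sketch below parses a few whitespace-separated rows from an in-memory string rather than from the URL. The sample rows and the reduced three-column layout are made up for illustration; they are not records from the real file.

```python
import io
import pandas as pd

# Made-up sample rows (assumption, not real records): mpg, horsepower, weight
sample = "18.0 130.0 3504\n25.0 ? 2046\n15.0 165.0 3693\n"

sample_df = pd.read_csv(
    io.StringIO(sample),
    sep=r"\s+",                              # any run of whitespace separates fields
    names=["mpg", "horsepower", "weight"],   # the sample has no header row
    na_values="?",                           # '?' is read as a missing value
)

print(sample_df)
print("Missing horsepower values:", sample_df["horsepower"].isna().sum())
```

Because the '?' entry becomes NaN, the horsepower column is parsed as numeric rather than as text.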
print(df.describe())

# Calculate the median of MPG and Weight
mpg_median = df['mpg'].median()
weight_median = df['weight'].median()
print(f"Median of MPG: {mpg_median}")
print(f"Median of Weight: {weight_median}")
Code explanation:
• The median() function was applied to the mpg and weight
columns to compute their respective medians.
• The median MPG and weight describe the typical fuel efficiency
and weight of vehicles in the dataset and, unlike the mean, are
not pulled toward extreme values.
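The robustness of the median mentioned above can be seen in a minimal sketch. The five weights below are invented for illustration, with one deliberately extreme value.

```python
import pandas as pd

# Invented weights (assumption): four ordinary values and one extreme value
weights = pd.Series([2000, 2200, 2400, 2600, 9000])

mean_weight = weights.mean()      # pulled upward by the 9000 entry
median_weight = weights.median()  # stays at the middle value

print(f"Mean:   {mean_weight}")
print(f"Median: {median_weight}")
```

Here the single large value moves the mean well above most of the data, while the median remains at the middle observation.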
# Create a scatter plot with MPG on the y-axis and Weight on the x-axis
plt.figure(figsize=(8, 6))
plt.scatter(df['weight'], df['mpg'], alpha=0.7, color='blue')
plt.title("Relationship between Car Weight and Fuel Efficiency (MPG)", fontsize=14)
plt.xlabel("Car Weight", fontsize=12)
plt.ylabel("Miles per Gallon (MPG)", fontsize=12)
plt.grid(alpha=0.3)
plt.show()
Explanation of scatter plot:

The scatter plot shows a clear negative relationship between car
weight and fuel efficiency (MPG): as weight increases, MPG generally
decreases. Cars above roughly 3000 pounds tend to have MPG values
between 10 and 20, while lighter cars typically achieve higher values,
often greater than 30. This inverse relationship is consistent with
heavier vehicles requiring more energy to move, which raises fuel
consumption and lowers MPG, whereas lighter vehicles consume less fuel
and tend to be more fuel-efficient.
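One way to put a number on the downward trend described above is to fit a straight line to the points. The sketch below uses np.polyfit on five invented (weight, MPG) pairs; the values are assumptions chosen only to mimic the pattern and are not taken from the dataset.

```python
import numpy as np

# Invented illustrative pairs (assumption): heavier cars get lower MPG
weight = np.array([2000, 2500, 3000, 3500, 4000], dtype=float)
mpg = np.array([34, 29, 23, 18, 14], dtype=float)

# Degree-1 least-squares fit: mpg is approximated by slope * weight + intercept
slope, intercept = np.polyfit(weight, mpg, deg=1)

print(f"Estimated change in MPG per 1000 lb: {slope * 1000:.1f}")
```

A negative slope corresponds to the inverse relationship seen in the plot; on the real data the same call could be applied to df['weight'] and df['mpg'].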
# Calculate the correlation between MPG and Displacement, as well as
# between MPG and Weight
mpg_displacement_corr = df['mpg'].corr(df['displacement'])
mpg_weight_corr = df['mpg'].corr(df['weight'])
print(f"Correlation between MPG and Displacement: {mpg_displacement_corr}")
print(f"Correlation between MPG and Weight: {mpg_weight_corr}")
Explanation:
The correlation results indicate a strong negative relationship between
MPG and Displacement (-0.804) and between MPG and Weight (-0.832).
As engine displacement (size) and car weight increase, fuel efficiency
(MPG) tends to decrease. The strongly negative coefficients are
consistent with larger engines consuming more fuel and heavier cars
requiring more energy to move, both of which lower MPG. This aligns
with automotive principles, where heavier and more powerful vehicles
are generally less fuel-efficient due to higher fuel consumption demands.
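For reference, Series.corr() computes the Pearson coefficient by default. The sketch below recomputes it by hand on a few invented (displacement, MPG) pairs and compares the result with pandas; the data values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Invented illustrative values (assumption), not rows from the dataset
demo = pd.DataFrame({
    "displacement": [100.0, 150.0, 250.0, 350.0, 400.0],
    "mpg": [33.0, 28.0, 21.0, 16.0, 13.0],
})

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), with dx, dy the deviations from the mean
dx = demo["displacement"] - demo["displacement"].mean()
dy = demo["mpg"] - demo["mpg"].mean()
manual_r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

pandas_r = demo["mpg"].corr(demo["displacement"])

print(f"Manual: {manual_r:.4f}, pandas: {pandas_r:.4f}")
```

The two values agree, and a coefficient close to -1 indicates a strong negative linear relationship. It does not by itself show that one variable causes the other.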
Problem 02
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
file_path = 'user_reviews.csv'  # Make sure to specify the correct path if needed
DF = pd.read_csv(file_path)

# Display the first 5 rows of the DataFrame
print("First 5 rows of the DataFrame:")
print(DF.head())

# List all the columns
print("\nList of columns:")
print(DF.columns)

# Description of each column: print the name and data type
for column in DF.columns:
    print(f"\n{column}: {DF[column].dtype}")
Code explanation:
• Used pd.read_csv() to load the dataset into a DataFrame called DF.
• Displayed the first five rows using DF.head().
• Used DF.columns to list all column names and iterated through them
to print each column's data type.
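The per-column loop described above can also be written without a loop, since a DataFrame exposes all column types at once through its dtypes attribute. The two-column frame below is a hypothetical stand-in for the review data; the rating column in particular is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the review data (the 'rating' column is an assumption)
demo = pd.DataFrame({
    "text": ["great app", "crashes often"],
    "rating": [5, 2],
})

# One Series holding every column's data type, indexed by column name
column_types = demo.dtypes
print(column_types)
```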
# Create a new column 'word_count' that stores the number of words in each
# review (text column)
DF['word_count'] = DF['text'].apply(lambda x: len(x.split()))

# Display the first 5 rows to check the new column
print(DF[['text', 'word_count']].head())
Code explanation:
• Used the apply() method with a lambda function to split the text in the
text column into words using .split(), and counted the resulting words
with len().
• Added the word counts to a new column, word_count.
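The apply() approach above assumes every entry in the text column is a string. As a hedged alternative, the sketch below uses the pandas string accessor and fills missing reviews with an empty string first, so they count as zero words. The sample reviews are invented for illustration.

```python
import pandas as pd

# Invented sample reviews (assumption), including one missing entry
reviews = pd.Series(["great app", "works well most days", None])

# Fill missing text, split on whitespace, then take the length of each word list
word_counts = reviews.fillna("").str.split().str.len()

print(word_counts.tolist())
```

Whether a missing review should count as zero words or be dropped is a design choice that depends on the analysis.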
import numpy as np  # needed for np.select

# Create a new column 'length_category' based on the word count
conditions = [
    (DF['word_count'] > 200),
    (DF['word_count'] >= 50) & (DF['word_count'] <= 200),
    (DF['word_count'] < 50)
]

# Define the corresponding category labels (same order as the conditions)
categories = ['long', 'medium', 'short']

# Apply the conditions to create the 'length_category' column;
# np.select pairs each condition with the label at the same position
DF['length_category'] = np.select(conditions, categories, default='unknown')
print(DF[['text', 'word_count', 'length_category']].head())
Code explanation:
• The for-loop version below iterates over the DataFrame rows with
iterrows().
• It checks each word count against the thresholds (>200, 50–200, <50)
and appends the corresponding category (long, medium, short) to a list.
• The list is then added as a new column.
# Using a for loop to classify length_category
length_categories = []
for index, row in DF.iterrows():
    if row['word_count'] > 200:
        length_categories.append('long')
    elif 50 <= row['word_count'] <= 200:
        length_categories.append('medium')
    else:
        length_categories.append('short')

# Assign the list to the 'length_category_for_loop' column
DF['length_category_for_loop'] = length_categories
print(DF[['text', 'word_count', 'length_category_for_loop']].head())
# Using a while loop to classify length_category
length_categories = []
index = 0
while index < len(DF):
    # Positional access, so this works even if the index is not 0..n-1
    word_count = DF['word_count'].iloc[index]
    if word_count > 200:
        length_categories.append('long')
    elif 50 <= word_count <= 200:
        length_categories.append('medium')
    else:
        length_categories.append('short')
    index += 1

# Assign the list to the 'length_category_while_loop' column
DF['length_category_while_loop'] = length_categories
print(DF[['text', 'word_count', 'length_category_while_loop']].head())
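# Using vectorization to classify length_category.
# pd.cut intervals are closed on the right by default, so the edges
# -1, 49, 200 give: 0-49 -> short, 50-200 -> medium, above 200 -> long,
# matching the thresholds used in the loops
DF['length_category_vectorized'] = pd.cut(
    DF['word_count'],
    bins=[-1, 49, 200, float('inf')],
    labels=['short', 'medium', 'long']
)
print(DF[['text', 'word_count', 'length_category_vectorized']].head())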
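Because pd.cut treats each bin as closed on the right by default, the choice of bin edges decides which category the boundary values 0, 50 and 200 fall into. The sketch below compares two sets of edges on a few hand-picked word counts; the counts are chosen purely to probe the boundaries.

```python
import pandas as pd

# Hand-picked word counts that sit on or next to the category boundaries
counts = pd.Series([0, 49, 50, 200, 201])
labels = ["short", "medium", "long"]

# Edges 0, 50, 200: intervals are (0, 50], (50, 200], (200, inf]
right_closed = pd.cut(counts, bins=[0, 50, 200, float("inf")], labels=labels)

# Edges -1, 49, 200: intervals are (-1, 49], (49, 200], (200, inf]
shifted = pd.cut(counts, bins=[-1, 49, 200, float("inf")], labels=labels)

print(pd.DataFrame({"count": counts, "edges_0_50_200": right_closed, "edges_-1_49_200": shifted}))
```

With edges 0, 50, 200 a count of 0 falls outside every bin and becomes missing, and a count of 50 is labelled short. With edges -1, 49, 200 the result follows the rule "under 50 is short, 50 to 200 is medium, over 200 is long".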
# Plotting a pie chart showing the proportions of each length category
length_category_counts = DF['length_category_vectorized'].value_counts()

# Plot the pie chart
length_category_counts.plot(
    kind='pie', autopct='%1.1f%%', startangle=90,
    colors=['#66b3ff', '#99ff99', '#ffcc99']
)
plt.title('Proportion of Each Length Category')
plt.ylabel('')  # Hide y-label for aesthetic reasons
plt.show()
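The percentages drawn on the pie chart can also be obtained as numbers. A minimal sketch, using an invented list of categories, is to pass normalize=True to value_counts().

```python
import pandas as pd

# Invented category labels (assumption) standing in for the real column
categories = pd.Series(["short", "short", "medium", "long"])

# normalize=True returns each category's share of the total instead of raw counts
proportions = categories.value_counts(normalize=True)

print(proportions)
```

On the real data the same call could be applied to DF['length_category_vectorized'].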
Problem 03
import pandas as pd
import matplotlib.pyplot as plt

# URL for the dataset
url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv"

# Load the dataset into a DataFrame
df = pd.read_csv(url)

# Display the structure of the dataset
print("Dataset structure:")
df.info()  # info() prints its summary directly and returns None

# Display the first few rows of the dataset
print("\nFirst few rows of the dataset:")
print(df.head())
# Calculate the correlation between Median salary and Unemployment rate
correlation = df["Median"].corr(df["Unemployment_rate"])
print(f"Correlation between Median salary and Unemployment rate: {correlation}")

# Create a scatter plot to visualize the relationship
plt.scatter(df["Median"], df["Unemployment_rate"], color='blue', alpha=0.5)
plt.title('Relationship between Median Salary and Unemployment Rate')
plt.xlabel('Median Salary ($)')
plt.ylabel('Unemployment Rate')
plt.grid(True)
plt.show()
# Filter the dataset for majors with a Median salary above $60,000 and
# Unemployment_rate below 5%
filtered_data = df[(df["Median"] > 60000) & (df["Unemployment_rate"] < 0.05)]

# Display the results
print("Majors with Median Salary above $60,000 and Unemployment Rate below 5%:")
print(filtered_data[['Major', 'Median', 'Unemployment_rate']])
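The filter above combines two boolean masks with &. Each comparison needs its own parentheses because & binds more tightly than > and <. The sketch below applies the same thresholds to three invented rows and shows an equivalent query() form; the major names and values are assumptions for illustration.

```python
import pandas as pd

# Invented rows (assumption); only "Major A" satisfies both conditions
demo = pd.DataFrame({
    "Major": ["Major A", "Major B", "Major C"],
    "Median": [65000, 70000, 50000],
    "Unemployment_rate": [0.03, 0.08, 0.02],
})

# Boolean-mask form: parentheses around each comparison are required
mask_result = demo[(demo["Median"] > 60000) & (demo["Unemployment_rate"] < 0.05)]

# Equivalent query() form, which reads closer to the written condition
query_result = demo.query("Median > 60000 and Unemployment_rate < 0.05")

print(mask_result["Major"].tolist())
```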
# Group by Major_category and calculate the average Median salary for each
# category
grouped_data = df.groupby("Major_category")["Median"].mean()

# Display the results
print("Average Median Salary for each Major Category:")
print(grouped_data)

# Create a bar chart to visualize the average Median salaries by
# Major_category
grouped_data.plot(kind='bar', figsize=(12, 6), color='skyblue')

# Add titles and labels to the plot
plt.title('Average Median Salary by Major Category')
plt.xlabel('Major Category')
plt.ylabel('Average Median Salary ($)')
plt.xticks(rotation=90)
plt.show()
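groupby() returns the categories in alphabetical order, so the bars are not ordered by salary. As a hedged variation, the sketch below sorts the group means before they would be plotted. The three rows are invented for illustration.

```python
import pandas as pd

# Invented rows (assumption) with two categories
demo = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Arts"],
    "Median": [60000, 80000, 30000],
})

# Average per category, then order from highest to lowest
average_by_category = (
    demo.groupby("Major_category")["Median"]
    .mean()
    .sort_values(ascending=False)
)

print(average_by_category)
```

Plotting the sorted Series with .plot(kind='bar') would then place the highest-paying category first.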
Problem 4
import sqlite3

# Create a connection to SQLite database (it will create a new file if it
# doesn't exist)
conn = sqlite3.connect('company.db')
cursor = conn.cursor()

# Create branches table
cursor.execute('''
CREATE TABLE IF NOT EXISTS branches (
    branch_id INTEGER PRIMARY KEY,
    address TEXT UNIQUE NOT NULL,
    zip_code TEXT NOT NULL,
    city TEXT NOT NULL,
    region TEXT NOT NULL,
    country TEXT NOT NULL
);
''')

# Create teams table
cursor.execute('''
CREATE TABLE IF NOT EXISTS teams (
    team_id INTEGER PRIMARY KEY,
    team_name TEXT UNIQUE NOT NULL,
    branch_id INTEGER,
    FOREIGN KEY (branch_id) REFERENCES branches(branch_id)
);
''')

# Commit changes and close the connection
conn.commit()
conn.close()
print("Tables 'branches' and 'teams' created successfully!")
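SQLite parses FOREIGN KEY clauses but only enforces them on connections where the foreign_keys pragma is switched on. The sketch below demonstrates this on a throwaway in-memory database with simplified versions of the two tables; the reduced column lists and the name demo_conn are assumptions for illustration.

```python
import sqlite3

# Throwaway in-memory database (assumption: simplified tables, not company.db)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("PRAGMA foreign_keys = ON")  # enable enforcement for this connection

demo_conn.execute("CREATE TABLE branches (branch_id INTEGER PRIMARY KEY)")
demo_conn.execute("""
    CREATE TABLE teams (
        team_id INTEGER PRIMARY KEY,
        branch_id INTEGER,
        FOREIGN KEY (branch_id) REFERENCES branches(branch_id)
    )
""")

# Branch 999 does not exist, so this insert should be rejected
try:
    demo_conn.execute("INSERT INTO teams (team_id, branch_id) VALUES (1, 999)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print("Insert rejected:", rejected)
demo_conn.close()
```

Without the pragma, the same insert would be accepted and the teams table could reference a branch that does not exist.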
# Reconnect to the database
conn = sqlite3.connect('company.db')
cursor = conn.cursor()

# Insert data into branches table
cursor.executemany('''
INSERT INTO branches (branch_id, address, zip_code, city, region, country)
VALUES (?, ?, ?, ?, ?, ?)
''', [
    (101, '50 Green Street', 'SW1 2AA', 'London', 'London', 'UK'),
    (102, '12 Hill Avenue', 'M1 5GH', 'Manchester', 'Greater Manchester', 'UK'),
    (103, '34 River Road', 'L2 3DR', 'Liverpool', 'Merseyside', 'UK'),
    (104, '19 Lake View', 'B4 2DD', 'Birmingham', 'West Midlands', 'UK')
])

# Insert data into teams table
cursor.executemany('''
INSERT INTO teams (team_id, team_name, branch_id)
VALUES (?, ?, ?)
''', [
    (1010, 'Engineering', 101),
    (1020, 'HR', 102),
    (1030, 'Sales', 103),
    (1040, 'Finance', 104),
    (1050, 'Support', None)  # Support team has no branch assigned
])

# Commit changes and close the connection
conn.commit()
conn.close()
print("Data inserted into 'branches' and 'teams' tables successfully!")
# Reconnect to the database
conn = sqlite3.connect('company.db')
cursor = conn.cursor()

# Query to display all teams along with their branch locations
cursor.execute('''
SELECT teams.team_name, branches.address, branches.city,
       branches.region, branches.country
FROM teams
LEFT JOIN branches ON teams.branch_id = branches.branch_id
''')

# Fetch and display results
teams_with_locations = cursor.fetchall()
print("Teams and their Branch Locations:")
for team in teams_with_locations:
    print(team)

# Query to display all branches without any teams.
# NULL branch_ids are excluded from the subquery: a NULL inside NOT IN
# makes the comparison unknown for every row, so no branch would be returned
cursor.execute('''
SELECT branch_id, address, city, region, country
FROM branches
WHERE branch_id NOT IN (
    SELECT DISTINCT branch_id FROM teams WHERE branch_id IS NOT NULL
)
''')

# Fetch and display results
branches_without_teams = cursor.fetchall()
print("\nBranches without any teams:")
for branch in branches_without_teams:
    print(branch)

# Close the connection
conn.close()
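The "branches without any teams" query depends on how SQL treats NULL inside NOT IN, which matters here because the Support team has no branch. The sketch below reproduces the situation in a throwaway in-memory database; the simplified tables, the sample rows and the name demo_conn are assumptions for illustration.

```python
import sqlite3

# Throwaway in-memory database (assumption: simplified tables and rows)
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE branches (branch_id INTEGER PRIMARY KEY);
    CREATE TABLE teams (team_id INTEGER PRIMARY KEY, branch_id INTEGER);
    INSERT INTO branches VALUES (101), (102);
    INSERT INTO teams VALUES (1, 101), (2, NULL);
""")

# Subquery contains a NULL: NOT IN can never evaluate to true
with_null = demo_conn.execute("""
    SELECT branch_id FROM branches
    WHERE branch_id NOT IN (SELECT branch_id FROM teams)
""").fetchall()

# Subquery filters out NULLs: branch 102 is correctly reported
without_null = demo_conn.execute("""
    SELECT branch_id FROM branches
    WHERE branch_id NOT IN (
        SELECT branch_id FROM teams WHERE branch_id IS NOT NULL
    )
""").fetchall()

print("NULL left in subquery:", with_null)
print("NULL filtered out:   ", without_null)
demo_conn.close()
```

Branch 102 has no team, yet it only appears once the NULL is filtered out. A NOT EXISTS subquery or a LEFT JOIN with an IS NULL check would be alternative ways to avoid the same pitfall.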
