Report


Contents
Problem 01
Problem 02
Problem 03
Problem 4

Problem 01
import pandas as pd
import matplotlib.pyplot as plt

# Define the column names based on the dataset description
columns = [
    "mpg", "cylinders", "displacement", "horsepower",
    "weight", "acceleration", "model_year", "origin", "car_name"
]

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
df = pd.read_csv(
    url,
    sep=r"\s+",  # whitespace-delimited; replaces delim_whitespace=True, deprecated in pandas 2.2
    names=columns,
    na_values='?'
)

# Display the structure of the dataset
print("Dataset structure:")
df.info()  # info() prints its summary directly and returns None
print("\nFirst few rows of the dataset:")
print(df.head())

Code explanation:
• The pd.read_csv() function loaded the dataset directly from the URL, using whitespace as the delimiter.
• Missing values, marked with '?' in the file, were converted to NaN through the na_values parameter.
• The info() method summarized the dataset structure, including data types and non-null counts for each column.
• The head() method displayed the first few rows to show what the data looks like.
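To see what na_values='?' does without downloading the file, the sketch below parses a tiny whitespace-delimited sample held in a string. The three sample rows and the shortened column list are made up for illustration; they are not quoted from the dataset.

```python
import io

import pandas as pd

# Hypothetical sample in the same whitespace-delimited layout (illustrative values only)
sample = """18.0 8 307.0 130.0 3504
25.0 4 98.0 ? 2046
22.0 6 198.0 95.0 2833
"""

cols = ["mpg", "cylinders", "displacement", "horsepower", "weight"]
sample_df = pd.read_csv(io.StringIO(sample), sep=r"\s+", names=cols, na_values="?")

# Count missing values per column; the '?' is read as NaN in 'horsepower'
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# Because '?' was treated as missing, the column stays numeric instead of becoming text
print(sample_df["horsepower"].dtype)
```

Running the same isna().sum() call on the full DataFrame is a quick way to confirm which columns contain missing values before computing statistics.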

print(df.describe())

# Calculate the median of MPG and Weight
mpg_median = df['mpg'].median()
weight_median = df['weight'].median()

print(f"Median of MPG: {mpg_median}")
print(f"Median of Weight: {weight_median}")

Code explanation:
• The median() method was applied to the mpg and weight columns to compute their respective medians.
• The median MPG and weight describe the typical fuel efficiency and weight of vehicles in the dataset, and they are not distorted by extreme values.
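A small sketch of why the median is described as unaffected by extreme values: replacing one value with an outlier shifts the mean but leaves the median where it was. The numbers are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Illustrative weights only, not taken from the dataset
weights = pd.Series([2000, 2200, 2400, 2600, 2800])
weights_with_outlier = pd.Series([2000, 2200, 2400, 2600, 10000])

# Without the outlier, mean and median agree
print(weights.mean(), weights.median())

# With the outlier, the mean is pulled upward while the median is unchanged
print(weights_with_outlier.mean(), weights_with_outlier.median())
```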

# Create a scatter plot with MPG on the y-axis and Weight on the x-axis
plt.figure(figsize=(8, 6))
plt.scatter(df['weight'], df['mpg'], alpha=0.7, color='blue')
plt.title("Relationship between Car Weight and Fuel Efficiency (MPG)", fontsize=14)
plt.xlabel("Car Weight", fontsize=12)
plt.ylabel("Miles per Gallon (MPG)", fontsize=12)
plt.grid(alpha=0.3)
plt.show()

Explanation of scatter plot:

The scatter plot shows a clear negative relationship between car weight and fuel efficiency (MPG): as weight increases, MPG generally decreases. Cars above roughly 3000 pounds mostly fall between 10 and 20 MPG, while lighter cars typically reach higher values, often above 30 MPG. This is consistent with heavier vehicles needing more energy to move, which raises fuel consumption and lowers MPG. Lighter vehicles consume less fuel and so tend to be more fuel-efficient.
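One way to put a number on the downward trend is to fit a straight line through the points. The sketch below does this with numpy.polyfit on a handful of invented (weight, MPG) pairs; the report itself does not fit a line, so this is an optional extension under that assumption.

```python
import numpy as np

# Invented example points (weight in pounds, MPG), not rows from the dataset
weight = np.array([2000, 2500, 3000, 3500, 4000], dtype=float)
mpg = np.array([33, 30, 22, 20, 14], dtype=float)

# Degree-1 least-squares fit: mpg is approximated by slope * weight + intercept
slope, intercept = np.polyfit(weight, mpg, 1)

# A negative slope means MPG falls as weight rises;
# slope * 1000 is the estimated MPG change per additional 1000 pounds
print(f"slope: {slope:.4f} MPG per pound")
print(f"change per 1000 lb: {slope * 1000:.1f} MPG")
```

On the real data, rows with missing values would need to be dropped before fitting, since polyfit does not skip NaN.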

# Calculate the correlation between MPG and Displacement, as well as between MPG and Weight
mpg_displacement_corr = df['mpg'].corr(df['displacement'])
mpg_weight_corr = df['mpg'].corr(df['weight'])

print(f"Correlation between MPG and Displacement: {mpg_displacement_corr}")
print(f"Correlation between MPG and Weight: {mpg_weight_corr}")

Explanation:
The correlation coefficients show a strong negative relationship between MPG and Displacement (-0.804) and between MPG and Weight (-0.832). As engine displacement (size) and car weight increase, fuel efficiency (MPG) tends to decrease. Larger engines tend to consume more fuel, and heavier cars require more energy to move, both of which lower MPG. This aligns with automotive principles: heavier and more powerful vehicles are generally less fuel-efficient because of their higher fuel consumption.
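Series.corr() computes the Pearson coefficient by default, r = sum((x - mean(x)) * (y - mean(y))) / sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2)). The sketch below works through that formula by hand on a small invented table and compares the result with pandas. The toy values are assumptions for illustration, not dataset rows.

```python
import numpy as np
import pandas as pd

# Invented values for illustration only
toy = pd.DataFrame({
    "weight": [2000, 2500, 3000, 3500, 4000],
    "mpg": [33, 30, 22, 20, 14],
})

x = toy["weight"].to_numpy(dtype=float)
y = toy["mpg"].to_numpy(dtype=float)

# Deviations of each value from its column mean
x_dev = x - x.mean()
y_dev = y - y.mean()

# Pearson r: co-variation of the deviations, scaled by their magnitudes
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# pandas computes the same quantity
r_pandas = toy["weight"].corr(toy["mpg"])

print(r_manual, r_pandas)
```

A value close to -1 means the points lie near a downward-sloping line, which is how the coefficients reported above should be read.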

Problem 02
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
file_path = 'user_reviews.csv'  # Make sure to specify the correct path if needed
DF = pd.read_csv(file_path)

# Display the first 5 rows of the DataFrame
print("First 5 rows of the DataFrame:")
print(DF.head())

# List all the columns
print("\nList of columns:")
print(DF.columns)

# Description of each column
for column in DF.columns:
    print(f"\n{column}: {DF[column].dtype}")

Code explanation:

• Used pd.read_csv() to load the dataset into a DataFrame called DF.
• Displayed the first five rows using DF.head().
• Used DF.columns to list all column names, then looped over them to print each column's data type.

# Create a new column 'word_count' that stores the number of words in each review (text column)
DF['word_count'] = DF['text'].apply(lambda x: len(x.split()))

# Display the first 5 rows to check the new column
print(DF[['text', 'word_count']].head())

Code explanation:

• Used the apply() method with a lambda function to split the text in the text column into words using .split(), and counted the resulting words with len().
• Added the word counts to a new column, word_count.
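The lambda calls .split() on every value, so it raises an error if a review is missing (NaN or None). A possible alternative, sketched here on a small invented DataFrame, uses the pandas .str accessor, which passes missing values through so they can be counted as zero words. The reviews frame and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical reviews, including one missing entry
reviews = pd.DataFrame({"text": ["great product", "would not buy this again", None]})

# .str.split() yields a list of words per row (missing stays missing),
# .str.len() counts the list items, and fillna(0) treats missing reviews as empty
reviews["word_count"] = (
    reviews["text"].str.split().str.len().fillna(0).astype(int)
)

print(reviews)
```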

# Create a new column 'length_category' based on the word count
import numpy as np

conditions = [
    DF['word_count'] > 200,
    (DF['word_count'] >= 50) & (DF['word_count'] <= 200),
    DF['word_count'] < 50,
]

# Define the corresponding category labels (same order as the conditions)
categories = ['long', 'medium', 'short']

# Apply the conditions to create the 'length_category' column
DF['length_category'] = np.select(conditions, categories, default='unknown')

# Display the first 5 rows to check the new column
print(DF[['text', 'word_count', 'length_category']].head())

Code explanation:

• The for-loop version below iterates over the DataFrame rows with iterrows().
• It checks each word count against the thresholds (more than 200, 50 to 200, fewer than 50) and appends the corresponding category (long, medium, short) to a list.
• The list is then added as a new column.
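The thresholds are easy to get wrong at the boundaries (exactly 50 or exactly 200 words). The sketch below compares a loop-style classifier with pd.cut on a few boundary values. The classify helper and the sample counts are illustrative assumptions, not part of the report's code.

```python
import pandas as pd


def classify(word_count):
    """Hypothetical helper applying the thresholds: >200 long, 50 to 200 medium, <50 short."""
    if word_count > 200:
        return "long"
    elif word_count >= 50:
        return "medium"
    return "short"


# Sample word counts chosen to hit the boundaries
counts = pd.Series([0, 49, 50, 200, 201, 1000])

# Loop-style classification
loop_labels = [classify(n) for n in counts]

# right=False closes each bin on the left: [0, 50), [50, 201), [201, inf)
cut_labels = pd.cut(
    counts,
    bins=[0, 50, 201, float("inf")],
    right=False,
    labels=["short", "medium", "long"],
).astype(str).tolist()

print(loop_labels)
print(cut_labels)
```

With pd.cut's default right=True and bins [0, 50, 200, inf], a review of exactly 50 words would fall in the first bin and a word count of 0 would fall outside all bins, so the bin edges and the right argument need to be chosen to match the stated rules.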

# Using a for loop to classify length_category
length_categories = []

for index, row in DF.iterrows():
    if row['word_count'] > 200:
        length_categories.append('long')
    elif 50 <= row['word_count'] <= 200:
        length_categories.append('medium')
    else:
        length_categories.append('short')

# Assign the length_categories to the 'length_category_for_loop' column
DF['length_category_for_loop'] = length_categories

# Display the first 5 rows to check the new column
print(DF[['text', 'word_count', 'length_category_for_loop']].head())

# Using a while loop to classify length_category
length_categories = []
index = 0

while index < len(DF):
    # Positional access, so the loop does not depend on the index labels
    word_count = DF['word_count'].iloc[index]
    if word_count > 200:
        length_categories.append('long')
    elif 50 <= word_count <= 200:
        length_categories.append('medium')
    else:
        length_categories.append('short')
    index += 1

# Assign the length_categories to the 'length_category_while_loop' column
DF['length_category_while_loop'] = length_categories

# Display the first 5 rows to check the new column
print(DF[['text', 'word_count', 'length_category_while_loop']].head())

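# Using vectorization to classify length_category
# right=False closes each bin on the left: [0, 50) short, [50, 201) medium, [201, inf) long,
# which matches the thresholds used in the loop versions
DF['length_category_vectorized'] = pd.cut(
    DF['word_count'],
    bins=[0, 50, 201, float('inf')],
    right=False,
    labels=['short', 'medium', 'long']
)

# Display the first 5 rows to check the new column
print(DF[['text', 'word_count', 'length_category_vectorized']].head())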

# Plotting a pie chart showing the proportions of each length category
length_category_counts = DF['length_category_vectorized'].value_counts()

# Plot the pie chart
plt.figure(figsize=(7, 7))
length_category_counts.plot(
    kind='pie', autopct='%1.1f%%', startangle=90,
    colors=['#66b3ff', '#99ff99', '#ffcc99']
)
plt.title('Proportion of Each Length Category')
plt.ylabel('')  # Hide y-label for aesthetic reasons
plt.show()

Problem 03
import pandas as pd
import matplotlib.pyplot as plt

# URL for the dataset
url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv"

# Load the dataset into a DataFrame
df = pd.read_csv(url)

# Display the structure of the dataset
print("Dataset structure:")
df.info()  # info() prints its summary directly and returns None

# Display the first few rows of the dataset
print("\nFirst few rows of the dataset:")
print(df.head())

# Calculate the correlation between Median salary and Unemployment rate
correlation = df["Median"].corr(df["Unemployment_rate"])
print(f"Correlation between Median salary and Unemployment rate: {correlation}")

# Create a scatter plot to visualize the relationship
plt.figure(figsize=(10, 6))
plt.scatter(df["Median"], df["Unemployment_rate"], color='blue', alpha=0.5)
plt.title('Relationship between Median Salary and Unemployment Rate')
plt.xlabel('Median Salary ($)')
plt.ylabel('Unemployment Rate')
plt.grid(True)
plt.show()

# Filter the dataset for majors with a Median salary above $60,000 and Unemployment_rate below 5%
filtered_data = df[(df["Median"] > 60000) & (df["Unemployment_rate"] < 0.05)]

# Display the results
print("Majors with Median Salary above $60,000 and Unemployment Rate below 5%:")
print(filtered_data[['Major', 'Median', 'Unemployment_rate']])

# Group by Major_category and calculate the average Median salary for each category
grouped_data = df.groupby("Major_category")["Median"].mean()

# Display the results
print("Average Median Salary for each Major Category:")
print(grouped_data)

# Create a bar chart to visualize the average Median salaries by Major_category
grouped_data.plot(kind='bar', figsize=(12, 6), color='skyblue')

# Add titles and labels to the plot
plt.title('Average Median Salary by Major Category')
plt.xlabel('Major Category')
plt.ylabel('Average Median Salary ($)')
plt.xticks(rotation=90)
plt.show()

Problem 4
import sqlite3

# Create a connection to the SQLite database (a new file is created if it doesn't exist)
conn = sqlite3.connect('company.db')
cursor = conn.cursor()

# Create branches table
cursor.execute('''
    CREATE TABLE IF NOT EXISTS branches (
        branch_id INTEGER PRIMARY KEY,
        address TEXT UNIQUE NOT NULL,
        zip_code TEXT NOT NULL,
        city TEXT NOT NULL,
        region TEXT NOT NULL,
        country TEXT NOT NULL
    );
''')

# Create teams table
cursor.execute('''
    CREATE TABLE IF NOT EXISTS teams (
        team_id INTEGER PRIMARY KEY,
        team_name TEXT UNIQUE NOT NULL,
        branch_id INTEGER,
        FOREIGN KEY (branch_id) REFERENCES branches(branch_id)
    );
''')

# Commit changes and close the connection
conn.commit()
conn.close()

print("Tables 'branches' and 'teams' created successfully!")

# Reconnect to the database
conn = sqlite3.connect('company.db')
cursor = conn.cursor()

# Insert data into branches table
cursor.executemany('''
    INSERT INTO branches (branch_id, address, zip_code, city, region, country)
    VALUES (?, ?, ?, ?, ?, ?)
''', [
    (101, '50 Green Street', 'SW1 2AA', 'London', 'London', 'UK'),
    (102, '12 Hill Avenue', 'M1 5GH', 'Manchester', 'Greater Manchester', 'UK'),
    (103, '34 River Road', 'L2 3DR', 'Liverpool', 'Merseyside', 'UK'),
    (104, '19 Lake View', 'B4 2DD', 'Birmingham', 'West Midlands', 'UK')
])

# Insert data into teams table
cursor.executemany('''
    INSERT INTO teams (team_id, team_name, branch_id)
    VALUES (?, ?, ?)
''', [
    (1010, 'Engineering', 101),
    (1020, 'HR', 102),
    (1030, 'Sales', 103),
    (1040, 'Finance', 104),
    (1050, 'Support', None)  # Support team has no branch assigned
])

# Commit changes and close the connection
conn.commit()
conn.close()

print("Data inserted into 'branches' and 'teams' tables successfully!")

# Reconnect to the database
conn = sqlite3.connect('company.db')
cursor = conn.cursor()

# Query to display all teams along with their branch locations
cursor.execute('''
    SELECT teams.team_name, branches.address, branches.city,
           branches.region, branches.country
    FROM teams
    LEFT JOIN branches ON teams.branch_id = branches.branch_id
''')

# Fetch and display results
teams_with_locations = cursor.fetchall()
print("Teams and their Branch Locations:")
for team in teams_with_locations:
    print(team)

# Query to display all branches without any teams
# NULL branch_id values are excluded from the subquery; otherwise NOT IN never matches any row
cursor.execute('''
    SELECT branch_id, address, city, region, country
    FROM branches
    WHERE branch_id NOT IN (
        SELECT DISTINCT branch_id FROM teams WHERE branch_id IS NOT NULL
    )
''')

# Fetch and display results
branches_without_teams = cursor.fetchall()
print("\nBranches without any teams:")
for branch in branches_without_teams:
    print(branch)

# Close the connection
conn.close()
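A NOT IN subquery over a nullable column such as teams.branch_id needs care: if the subquery returns any NULL, the comparison can never evaluate to true and the outer query returns no rows. The sketch below demonstrates this with a throwaway in-memory database. The simplified two-column tables and their rows are invented for the demonstration and are separate from company.db.

```python
import sqlite3

# In-memory database so nothing is written to disk
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Simplified, hypothetical versions of the two tables
cur.execute("CREATE TABLE branches (branch_id INTEGER PRIMARY KEY)")
cur.execute("CREATE TABLE teams (team_id INTEGER PRIMARY KEY, branch_id INTEGER)")
cur.executemany("INSERT INTO branches VALUES (?)", [(1,), (2,)])
# Team 20 has no branch (NULL), and branch 2 has no team
cur.executemany("INSERT INTO teams VALUES (?, ?)", [(10, 1), (20, None)])

# NOT IN against a subquery that contains NULL: no branch is ever returned
naive = cur.execute(
    "SELECT branch_id FROM branches "
    "WHERE branch_id NOT IN (SELECT branch_id FROM teams)"
).fetchall()

# NOT EXISTS is not affected by NULLs and finds the branch without a team
null_safe = cur.execute(
    "SELECT b.branch_id FROM branches AS b "
    "WHERE NOT EXISTS (SELECT 1 FROM teams AS t WHERE t.branch_id = b.branch_id)"
).fetchall()

conn.close()

print(naive)
print(null_safe)
```

Filtering NULLs out of the subquery with WHERE branch_id IS NOT NULL is another way to get the same result as the NOT EXISTS form.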
