
WEEK-1,2

1) Create NumPy arrays from Python Data Structures, Intrinsic NumPy objects and Random Functions.
A:- NumPy is a Python library used for working with arrays.
It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.
NumPy was created in 2005 by Travis Oliphant. It is an open-source project and you can use it freely. NumPy stands for Numerical Python.
The array object in NumPy is called ndarray; it provides many supporting functions that make working with ndarray easy.
Installation of NumPy:
Install it using this command:
C:\Users\Your Name>pip install numpy
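To confirm the installation, you can import the package and print its version (a minimal check; np.__version__ is NumPy's standard version attribute):
Code:-
import numpy as np
print(np.__version__)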
Creating NumPy Arrays from Python Data Structures
You can create arrays from Python lists, tuples, or even other NumPy arrays.
Code:-
import numpy as np

# From a list
list_data = [1, 2, 3, 4, 5]
array_from_list = np.array(list_data)
print("Array from list:", array_from_list)

# From a tuple
tuple_data = (6, 7, 8, 9, 10)
array_from_tuple = np.array(tuple_data)
print("Array from tuple:", array_from_tuple)

# Multi-dimensional list
multi_list_data = [[1, 2, 3], [4, 5, 6]]
multi_array = np.array(multi_list_data)
print("Multi-dimensional array:", multi_array)

OUTPUT:-
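np.array() also accepts an optional dtype argument, and every ndarray exposes attributes such as dtype, shape, and ndim. A small sketch (the values in comments assume these inputs):
Code:-
import numpy as np

# Specify the element type explicitly
float_array = np.array([1, 2, 3], dtype=np.float64)
print("Array:", float_array)        # [1. 2. 3.]
print("dtype:", float_array.dtype)  # float64
print("shape:", float_array.shape)  # (3,)
print("ndim:", float_array.ndim)    # 1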

Creating Arrays with Intrinsic NumPy Objects


NumPy provides built-in functions to create arrays with particular properties, such as arrays
filled with zeros, ones, or a range of numbers.
Code:-
import numpy as np

# Array of zeros
zeros_array = np.zeros((3, 3))
print("Array of zeros:\n", zeros_array)

# Array of ones
ones_array = np.ones((2, 4))
print("Array of ones:\n", ones_array)

# Array with a range of values
range_array = np.arange(0, 10, 2)  # start, stop, step
print("Array with range:", range_array)

# Linearly spaced array
linspace_array = np.linspace(0, 1, 5)  # start, stop, number of samples
print("Linearly spaced array:", linspace_array)

# Identity matrix
identity_matrix = np.eye(3)
print("Identity matrix:\n", identity_matrix)

OUTPUT:-

Creating Arrays with Random Functions


NumPy’s random module allows you to generate arrays of random numbers, which can be
useful for simulations and testing.
Code:-
import numpy as np

# Array of random floats between 0 and 1
random_array = np.random.rand(3, 3)  # shape (3, 3)
print("Random array of floats:\n", random_array)

# Array of random integers
random_integers = np.random.randint(1, 10, size=(3, 3))  # values from 1 to 9 (the upper bound 10 is exclusive)
print("Random array of integers:\n", random_integers)

# Normal distribution (mean=0, std=1)
normal_dist_array = np.random.randn(3, 3)
print("Array from normal distribution:\n", normal_dist_array)

# Random sample from a given set of values
random_choice = np.random.choice([10, 20, 30, 40], size=(2, 2))
print("Random choice array:\n", random_choice)

OUTPUT:-
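For reproducible simulations you can seed a random generator; a minimal sketch using the Generator API available since NumPy 1.17 (the seed value 42 is arbitrary):
Code:-
import numpy as np

# A seeded generator returns the same "random" values on every run
rng = np.random.default_rng(42)
print("Reproducible floats:\n", rng.random((2, 2)))
print("Reproducible integers:\n", rng.integers(1, 10, size=(2, 2)))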

2) Manipulation of NumPy arrays: Indexing, Slicing, Reshaping, Joining and Splitting.
A:- NumPy Array Indexing
Array indexing is the same as accessing an array element.
You can access an array element by referring to its index number.
Indexes in NumPy arrays start with 0, meaning that the first element has index 0, the second has index 1, and so on.
Code:-
import numpy as np

# 1D array indexing
array_1d = np.array([10, 20, 30, 40, 50])
print("Element at index 2:", array_1d[2])

# 2D array indexing
array_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Element at row 1, column 2:", array_2d[1, 2])

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr[0, 1, 2])

arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('Last element from 2nd dim: ', arr[1, -1])

OUTPUT:-

Slicing
Slicing allows you to access a subset of an array using the syntax start:stop:step.
If we don't pass start, it is taken as 0.
If we don't pass stop, it is taken as the length of the array in that dimension.
If we don't pass step, it is taken as 1.
Code:-
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5])
print(arr[:4])
print(arr[-3:-1])
print(arr[1:5:2])
print(arr[::2])
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[1, 1:4])

OUTPUT:-

NumPy Array Reshaping


Reshaping means changing the shape of an array.
The shape of an array is the number of elements in each dimension.
By reshaping we can add or remove dimensions or change the number of elements in each dimension.
Code:-
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(4, 3)
print(newarr)
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(2, 3, 2)

print(newarr)
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
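# .base returns the original array when the result is a view (reshape returns a view where possible), and None for a copy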
print(arr.reshape(2, 4).base)
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
newarr = arr.reshape(2, 2, -1)
print(newarr)
arr = np.array([[1, 2, 3], [4, 5, 6]])
newarr = arr.reshape(-1)
print(newarr)
OUTPUT:-

NumPy Joining Array


Joining means putting the contents of two or more arrays into a single array.
In SQL we join tables based on a key, whereas in NumPy we join arrays by axes.
We pass a sequence of arrays that we want to join to the concatenate() function, along with the
axis. If axis is not explicitly passed, it is taken as 0.
Code:-
import numpy as np

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.concatenate((arr1, arr2))
print(arr)
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])

arr = np.concatenate((arr1, arr2), axis=1)
print(arr)
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
# stack() joins along a new axis
arr = np.stack((arr1, arr2), axis=1)
print(arr)
# hstack() stacks along rows
arr = np.hstack((arr1, arr2))
print(arr)
# vstack() stacks along columns
arr = np.vstack((arr1, arr2))
print(arr)
# dstack() stacks along depth (the third axis)
arr = np.dstack((arr1, arr2))
print(arr)
OUTPUT:-

NumPy Splitting Array


Splitting is the reverse operation of joining.
Joining merges multiple arrays into one, and splitting breaks one array into multiple.
We use array_split() for splitting arrays; we pass it the array we want to split and the number of splits.
Code:-
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 4)
print(newarr)
arr = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
newarr = np.array_split(arr, 3)
print(newarr)
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
newarr = np.array_split(arr, 3, axis=1)
print(newarr)

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
newarr = np.hsplit(arr, 3)
print(newarr)
OUTPUT:-
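Note that np.split() only accepts a number of splits that divides the array evenly, while array_split() adjusts the part sizes; a small sketch:
Code:-
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])
# Even division: both functions work
print(np.split(arr, 3))
# Uneven division: np.split(arr, 4) would raise ValueError,
# but array_split() returns parts of unequal size
print(np.array_split(arr, 4))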

3) Computation on NumPy arrays using Universal Functions and Mathematical methods.
A:- NumPy provides a range of mathematical operations and universal functions (ufuncs) that
allow efficient computation across arrays. Here's how you can use these functions and
methods:
1. Universal Functions (UFuncs)
Universal functions (ufuncs) in NumPy are vectorized functions that apply element-wise
operations on arrays, making computations fast and concise.
Code:-
import numpy as np

array = np.array([1, 2, 3, 4])


# Addition
print("Array + 2:", array + 2)
# Multiplication
print("Array * 2:", array * 2)
# Exponentiation
print("Array ** 2:", array ** 2)
# Sine of each element
angles = np.array([0, np.pi / 2, np.pi])
print("Sine:", np.sin(angles))

# Cosine of each element
print("Cosine:", np.cos(angles))
# Tangent of each element
print("Tangent:", np.tan(angles))
# Exponential (e^x) of each element
print("Exponential:", np.exp(array))
# Natural logarithm (log base e)
print("Natural Logarithm:",np.log(array))
# Logarithm base 10
print("Logarithm base 10:",np.log10(array))
OUTPUT:-
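The arithmetic operators above are shorthand for named ufuncs, which can also be called directly; a minimal sketch of the equivalence:
Code:-
import numpy as np

array = np.array([1, 2, 3, 4])
print("np.add:", np.add(array, 2))            # same as array + 2
print("np.multiply:", np.multiply(array, 2))  # same as array * 2
print("np.power:", np.power(array, 2))        # same as array ** 2
print("np.sqrt:", np.sqrt(array))             # element-wise square root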

2. Aggregation and Mathematical Methods


NumPy also provides functions to compute aggregations, such as sum, mean, median, min, and
max.
Code:-
import numpy as np
array = np.array([1, 2, 3, 4])

# Sum of elements
print("Sum:", np.sum(array))
# Mean of elements
print("Mean:", np.mean(array))
# Median of elements
print("Median:", np.median(array))
# Standard deviation of elements
print("Standard Deviation:", np.std(array))
# Minimum and Maximum
print("Min:", np.min(array))
print("Max:", np.max(array))

OUTPUT:-

3. Matrix Operations
For 2D arrays (matrices), NumPy provides matrix-specific operations such as dot products,
matrix multiplication, and transposing.
Code:-
import numpy as np
# Matrix multiplication
matrix_a = np.array([[1, 2], [3, 4]])
matrix_b = np.array([[5, 6], [7, 8]])
product = np.dot(matrix_a, matrix_b)
print("Matrix multiplication:\n", product)
# Element-wise multiplication
elementwise_product = matrix_a * matrix_b
print("Element-wise multiplication:\n", elementwise_product)
# Transpose of a matrix
transpose = matrix_a.T
print("Transpose:\n", transpose)
OUTPUT:-
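Since Python 3.5, the @ operator (np.matmul) is the usual way to write matrix multiplication; for 2D arrays it gives the same result as np.dot. A short sketch:
Code:-
import numpy as np

matrix_a = np.array([[1, 2], [3, 4]])
matrix_b = np.array([[5, 6], [7, 8]])
# @ and np.matmul are equivalent to np.dot for 2D arrays
print("A @ B:\n", matrix_a @ matrix_b)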

4. Conditional Operations
You can also apply conditional functions on arrays to filter data or perform conditional
computations.
Code:-
import numpy as np
# Conditional selection
array = np.array([1, 2, 3, 4, 5])
print("Elements greater than 2:", array[array > 2])
result = np.where(array % 2 == 0, "Even", "Odd")
print("Even or Odd:", result)
OUTPUT:-

5. Broadcasting
NumPy's broadcasting allows operations on arrays of different shapes, which is efficient for
element-wise operations.
Code:-
import numpy as np
# Broadcasting with a scalar
array = np.array([[1, 2, 3], [4, 5, 6]])
print("Add 10 to each element:\n", array + 10)
# Broadcasting with another array
array_b = np.array([10, 20, 30])
print("Element-wise addition with broadcasting:\n", array + array_b)
OUTPUT:-
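Broadcasting also works between two arrays of different dimensionality, as long as each dimension is either equal or 1. A sketch where a (3, 1) column and a (3,) row broadcast to a (3, 3) grid:
Code:-
import numpy as np

column = np.array([[0], [10], [20]])  # shape (3, 1)
row = np.array([1, 2, 3])             # shape (3,)
# Result has shape (3, 3): 'row' is added to each row of 'column'
print(column + row)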

4) Import a CSV file and perform various Statistical and Comparison operations on rows/columns.
A:-
Step 1: Import the CSV File
Let's assume we have a CSV file named "data.csv" with numerical data:
# Sample data.csv file:
A,B,C
10,20,30
40,50,60
70,80,90
100,110,120
We can use NumPy's genfromtxt function to import the file.
Code:-
import numpy as np
# Import the CSV file (skip the header row and specify the delimiter)
data = np.genfromtxt('data.csv', delimiter=',', skip_header=1)
print("Data from CSV:\n", data)

OUTPUT:-

Step 2: Perform Statistical Operations on Rows and Columns


NumPy provides a range of statistical functions that can be applied along specific axes (axis=0
for columns, axis=1 for rows).
Mean
Code:-
# Mean of each column
column_mean = np.mean(data, axis=0)
print("Mean of each column:", column_mean)
# Mean of each row
row_mean = np.mean(data, axis=1)
print("Mean of each row:", row_mean)
OUTPUT:-

Sum
Code:-
# Sum of each column
column_sum = np.sum(data, axis=0)
print("Sum of each column:", column_sum)
# Sum of each row
row_sum = np.sum(data, axis=1)
print("Sum of each row:", row_sum)
OUTPUT:-

Standard Deviation
Code:-
# Standard deviation of each column
column_std = np.std(data, axis=0)
print("Standard deviation of each column:", column_std)
# Standard deviation of each row
row_std = np.std(data, axis=1)
print("Standard deviation of each row:", row_std)
OUTPUT:-

Min and Max


Code:-
# Minimum of each column
column_min = np.min(data, axis=0)
print("Min of each column:", column_min)

# Maximum of each row
row_max = np.max(data, axis=1)
print("Max of each row:", row_max)
OUTPUT:-

Step 3: Perform Comparison Operations


You can apply conditions to filter or compare specific rows and columns.
1. Filter Rows Based on Condition
For instance, let’s filter rows where the values in column A are greater than 50.
Code:-
filtered_rows = data[data[:, 0] > 50]
print("Rows where column A > 50:\n", filtered_rows)
OUTPUT:-

2. Comparison Across Rows and Columns


Check if each element in column B is greater than the corresponding element in column A.
Code:-
comparison = data[:, 1] > data[:, 0]
print("Is each element in column B greater than column A?", comparison)
OUTPUT:-

3. Applying Conditions to Elements


Let’s replace all values less than 50 with zero.
Code:-
data[data < 50] = 0
print("Data with values less than 50 replaced by 0:\n", data)
OUTPUT:-
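To save the modified array back to disk, np.savetxt can write it as CSV; a minimal sketch (the output file name data_out.csv is just an example):
Code:-
import numpy as np

# Write the modified data back out as CSV with a header row
np.savetxt('data_out.csv', data, delimiter=',', fmt='%g', header='A,B,C', comments='')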

5) Load an image file and do crop and flip operations using NumPy indexing.
A:-
Code:-

from PIL import Image
import numpy as np
import matplotlib.pyplot as plt

# Load the image
image = Image.open("/content/lotus1.jpg")  # Ensure this path is correct

# Convert the image to a NumPy array
image_arr = np.array(image)

# Check the dimensions of the image
height, width, channels = image_arr.shape
print(f"Image dimensions: {height}x{width}, Channels: {channels}")

# Define cropping coordinates (adjusted to fit within 894x726)
crop_start_row = 100  # Adjust as needed
crop_end_row = min(700, height)  # Ensure the end row does not exceed height
crop_start_col = 100  # Adjust as needed
crop_end_col = min(400, width)  # Ensure the end column does not exceed width

# Crop the image
cropped_image_arr = image_arr[crop_start_row:crop_end_row, crop_start_col:crop_end_col]

# Convert the NumPy array back to an image
cropped_image = Image.fromarray(cropped_image_arr)

# Display the cropped image inline using matplotlib
plt.imshow(cropped_image)
plt.axis('off')  # Hide axes
plt.show()
OUTPUT:-
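The question also asks for a flip; slicing with a negative step reverses an axis, which flips the image using pure NumPy indexing. A minimal sketch, assuming the same image file as above (np.fliplr and np.flipud are equivalent helpers):
Code:-
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt

image_arr = np.array(Image.open("/content/lotus1.jpg"))

# Reverse the column axis to mirror left-right, the row axis to mirror top-bottom
flipped_horizontal = image_arr[:, ::-1]
flipped_vertical = image_arr[::-1]

plt.imshow(flipped_horizontal)
plt.axis('off')
plt.show()
plt.imshow(flipped_vertical)
plt.axis('off')
plt.show()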

WEEK-3,4,5
1) Create Pandas Series and DataFrames from various inputs.

A:- Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.
The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". Pandas was created by Wes McKinney in 2008.
Pandas Series
A Pandas Series is like a column in a table.
It is a one-dimensional array holding data of any type.
Code:-
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
print(myvar[0])
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)
OUTPUT:-

Pandas DataFrames
Data sets in Pandas are usually multi-dimensional tables, called DataFrames.
A Series is like a column; a DataFrame is the whole table.
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array, or a table with rows and columns.
Code:-
import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
print(df.loc[0])
print(df.loc[[0, 1]])
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
print(df.loc["day2"])
OUTPUT:-

2) Import any CSV file to a Pandas DataFrame and perform the following:
A:- Let’s assume you have a CSV file named "data.csv".
Code:-
import pandas as pd
# Import CSV file
df = pd.read_csv("/content/data.csv")
(a) Visualize the first and last 10 records
Code:-
# Display first 10 records
print("First 10 records:\n", df.head(10))

# Display last 10 records
print("Last 10 records:\n", df.tail(10))
OUTPUT:-

(b) Get the shape, index and column details.


Code:-
# Shape of DataFrame
print("Shape:", df.shape)

# Index of DataFrame
print("Index:", df.index)

# Column names
print("Columns:", df.columns)
OUTPUT:-

(c) Select/Delete the records (rows)/columns based on conditions.


1) Selecting Rows Based on Condition:
Select rows where a column value meets a condition.
Code:-
# Select rows where CO2 > 100
selected_rows = df[df["CO2"] > 100]
selected_rows
OUTPUT:-

2) Deleting Columns
Let's say we want to delete the Weight column.
Code:-
# Drop the "Weight" column
df_without_weight = df.drop(columns=["Weight"])
df_without_weight.head()

OUTPUT:-

3) Deleting Rows Based on Condition


For example, deleting rows where Volume is less than 1200.
Code:-
# Drop rows where Volume < 1200
df_filtered_volume = df[df["Volume"] >= 1200]
df_filtered_volume.head()
OUTPUT:-
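Rows can also be removed with drop() by passing the index labels of the rows that match a condition; a small sketch equivalent to the filter above:
Code:-
# Drop rows by label: select the offending rows, then drop their index
df_dropped = df.drop(df[df["Volume"] < 1200].index)
df_dropped.head()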

(d) Perform ranking and sorting operations.


*Ranking: Let’s add a column to rank the cars based on their CO2 emissions.
Code:-
# Rank the cars based on CO2 emissions (lower CO2 has a higher rank)
df["CO2_Rank"] = df["CO2"].rank(method="min", ascending=True)
df[["Car", "Model", "CO2", "CO2_Rank"]].head()
OUTPUT:-

* Sorting: Let’s sort the dataset by Volume in descending order.


Code:-
# Sort by Volume in descending order
df_sorted_by_volume = df.sort_values(by="Volume", ascending=False)
df_sorted_by_volume.head()

OUTPUT:-

(e) Do required statistical operations on the given columns.


Let’s perform some statistical operations on the Volume and CO2 columns.
Code:-
# Mean, sum, standard deviation, min, and max for "Volume" and "CO2"
volume_stats = {
    "Mean Volume": df["Volume"].mean(),
    "Sum Volume": df["Volume"].sum(),
    "Std Dev Volume": df["Volume"].std(),
    "Min Volume": df["Volume"].min(),
    "Max Volume": df["Volume"].max(),
}

co2_stats = {
    "Mean CO2": df["CO2"].mean(),
    "Sum CO2": df["CO2"].sum(),
    "Std Dev CO2": df["CO2"].std(),
    "Min CO2": df["CO2"].min(),
    "Max CO2": df["CO2"].max(),
}
volume_stats, co2_stats
OUTPUT:-

(f) Find the count and uniqueness of the given categorical values.
Assuming Car is a categorical column, we can find the count and uniqueness of car brands.
Code:-
# Count of unique car brands
unique_car_counts = df["Car"].value_counts()
unique_car_counts

# Number of unique car brands
unique_car_count = df["Car"].nunique()
unique_car_count
OUTPUT:-

(g) Rename single/multiple columns.


Let’s rename CO2 to CO2_Emissions and Volume to Engine_Volume.
Code:-
# Rename columns
df_renamed = df.rename(columns={"CO2": "CO2_Emissions", "Volume": "Engine_Volume"})
df_renamed.head()
OUTPUT:-

WEEK 6, 7, 8
1. Develop a model on residual analysis of simple linear regression.
Residual analysis is an important step in evaluating the fit of a regression model. It involves examining the residuals, which are the differences between the observed values and the predicted values from the regression model. Here's a guide on how to perform residual analysis using a built-in dataset, the famous Iris dataset, with a focus on a simple linear regression model.

Source code:
import pandas as pd
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
# Load the Iris dataset
iris = load_iris()
data = pd.DataFrame(data=iris.data, columns=iris.feature_names)
data['species'] = iris.target
# Fit the model (predicting sepal length from sepal width)
X = data['sepal width (cm)']
Y = data['sepal length (cm)']
X = sm.add_constant(X) # Adds an intercept term
model = sm.OLS(Y, X).fit()
# Display the model summary
print(model.summary())
# Calculate predicted values and residuals
data['predicted'] = model.predict(X)
data['residuals'] = data['sepal length (cm)'] - data['predicted']
# Residual plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x=data['predicted'], y=data['residuals'])

plt.axhline(0, color='red', linestyle='--')
plt.title('Residuals vs Predicted Values')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.show()
# Check for normality: Histogram of residuals
plt.figure(figsize=(10, 6))
sns.histplot(data['residuals'], kde=True)
plt.title('Residuals Distribution')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()
# Q-Q plot for normality of residuals
sm.qqplot(data['residuals'], line='s')
plt.title('Q-Q Plot of Residuals')
plt.show()
# Calculate leverage
leverage = model.get_influence().hat_matrix_diag
# Leverage vs Residuals plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x=leverage, y=data['residuals'])
plt.axhline(0, color='red', linestyle='--')
plt.title('Leverage vs Residuals')
plt.xlabel('Leverage')
plt.ylabel('Residuals')
plt.show()

Output:

2. Residual plots of linear regression
Residual Plots: Definition
Residual Plots are graphical representations used to diagnose the fit of a regression model by
displaying the residuals (the differences between observed and predicted values) against
predicted values or another variable. They help assess whether the assumptions of regression
analysis (linearity, homoscedasticity, independence, and normality of residuals) are met.
Key Components of Residual Plots
Residuals: The vertical distances between the observed data points and the regression line, calculated as e_i = y_i − ŷ_i (observed value minus predicted value).
Horizontal Line at Zero: Indicates where the residuals are equal to zero, which is the ideal scenario for a perfect model fit.
Axes:
X-axis: Often shows predicted values or one of the independent variables.
Y-axis: Displays residuals.
Purpose of Residual Plots
Identify Patterns: To detect non-linearity, which shows up as a systematic pattern in the residuals rather than random scatter.
Check Homoscedasticity: To assess if residuals have constant variance. A funnel shape indicates
heteroscedasticity.
Detect Outliers and Influential Points: Outliers can significantly affect the regression model.
Source code:
import pandas as pd
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
# Load the California housing dataset
california_housing = fetch_california_housing()
data = pd.DataFrame(data=california_housing.data, columns=california_housing.feature_names)
data['median_house_value'] = california_housing.target

# Select features and target variable
X = data['MedInc'] # Median income
Y = data['median_house_value']
# Add a constant term for the intercept
X = sm.add_constant(X)
# Fit the linear regression model
model = sm.OLS(Y, X).fit()
# Display the model summary
print(model.summary())
# Calculate predicted values and residuals
data['predicted'] = model.predict(X)
data['residuals'] = data['median_house_value'] - data['predicted']
# Residual plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x=data['predicted'], y=data['residuals'])
plt.axhline(0, color='red', linestyle='--')
plt.title('Residuals vs Predicted Values')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.show()
# Check for normality: Histogram of residuals
plt.figure(figsize=(10, 6))
sns.histplot(data['residuals'], kde=True)
plt.title('Residuals Distribution')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()
# Q-Q plot for normality of residuals

sm.qqplot(data['residuals'], line='s')
plt.title('Q-Q Plot of Residuals')
plt.show()
# Calculate leverage
leverage = model.get_influence().hat_matrix_diag
# Leverage vs Residuals plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x=leverage, y=data['residuals'])
plt.axhline(0, color='red', linestyle='--')
plt.title('Leverage vs Residuals')
plt.xlabel('Leverage')
plt.ylabel('Residuals')
plt.show()
Output:

3. Normal probability plots.
A normal probability plot (also known as a Q-Q plot) is a graphical tool used to assess whether a
set of data follows a normal distribution. In the context of residual analysis in regression, it's
particularly useful for checking the normality of residuals.
Source code:
import pandas as pd
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
# Load the California housing dataset
california_housing = fetch_california_housing()
data = pd.DataFrame(data=california_housing.data, columns=california_housing.feature_names)
data['median_house_value'] = california_housing.target
# Select features and target variable
X = data['MedInc'] # Median income
Y = data['median_house_value']
# Add a constant term for the intercept
X = sm.add_constant(X)
# Fit the linear regression model
model = sm.OLS(Y, X).fit()
# Calculate residuals

data['residuals'] = Y - model.predict(X)
# Q-Q plot for normality of residuals
plt.figure(figsize=(10, 6))
sm.qqplot(data['residuals'], line='s')
plt.title('Q-Q Plot of Residuals')
plt.show()
Output:

4. Empirical model of linear regression analysis.


An empirical model in linear regression is a statistical approach that estimates the relationship
between a dependent variable and one or more independent variables based on observed data. It
uses a linear equation to describe how changes in the independent variables affect the dependent
variable.
Source code:
import pandas as pd
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
# Load the California housing dataset
california_housing = fetch_california_housing()

data = pd.DataFrame(data=california_housing.data, columns=california_housing.feature_names)
data['median_house_value'] = california_housing.target
# Select features and target variable
X = data[['MedInc', 'HouseAge', 'AveRooms']] # Multiple independent variables
Y = data['median_house_value']
# Add a constant term for the intercept
X = sm.add_constant(X)
# Fit the linear regression model
model = sm.OLS(Y, X).fit()
# Calculate predicted values and residuals
data['predicted'] = model.predict(X)
data['residuals'] = Y - data['predicted']
# Residual plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x=data['predicted'], y=data['residuals'])
plt.axhline(0, color='red', linestyle='--')
plt.title('Residuals vs Predicted Values')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.show()
# Q-Q plot for normality of residuals
plt.figure(figsize=(10, 6))
sm.qqplot(data['residuals'], line='s')
plt.title('Q-Q Plot of Residuals')
plt.show()

Output:

WEEK 9, 10, 11
1. Import any CSV file to a Pandas DataFrame and perform the following:
(a) Handle missing data by detecting and dropping/filling missing values.
Source code:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
# Load the California housing dataset
california_housing = fetch_california_housing()
data = pd.DataFrame(data=california_housing.data, columns=california_housing.feature_names)
data['median_house_value'] = california_housing.target
# Simulate missing data for demonstration purposes
data.loc[0:10, 'MedInc'] = np.nan # Introduce NaN values
print("Initial DataFrame:")
print(data.head())
Output:

# Check for missing values
print("Missing values before handling:")
print(data.isnull().sum())

# Fill missing values with the median
data['MedInc'] = data['MedInc'].fillna(data['MedInc'].median())
print("\nMissing values after handling:")
print(data.isnull().sum())
Output:
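The heading also mentions dropping; dropna() removes rows that contain missing values instead of filling them. A minimal sketch (run it before the fill above, since after filling there is nothing left to drop):
Source code:
# Alternative: drop every row that contains at least one NaN
data_dropped = data.dropna()
print("Shape before:", data.shape)
print("Shape after dropping rows with NaN:", data_dropped.shape)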

(b) Transform data using the apply() and map() methods.


Source code:
# Transform data using apply()
# Double the values in 'HouseAge'
data['Double_HouseAge'] = data['HouseAge'].apply(lambda x: x * 2)
# Using map() to categorize 'HouseAge'
# Create a mapping function for house age categories
age_mapping = {
    (0, 10): '0-10',
    (11, 20): '11-20',
    (21, 30): '21-30',
    (31, 40): '31-40',
    (41, 50): '41-50',
    (51, np.inf): '51+'
}

def categorize_age(age):
    for age_range, label in age_mapping.items():
        if age_range[0] <= age <= age_range[1]:
            return label
    return 'Unknown'

# Use map() to apply the categorization
data['HouseAge_Category'] = data['HouseAge'].map(categorize_age)
# Result
print(data[['HouseAge', 'Double_HouseAge', 'HouseAge_Category']].head())


Output:

(c) Detect and filter outliers.


Source code:
# Detect outliers using IQR method for 'median_house_value'
Q1 = data['median_house_value'].quantile(0.25)
Q3 = data['median_house_value'].quantile(0.75)
IQR = Q3 - Q1
# Filter out the outliers: keep rows within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
data_filtered = data[(data['median_house_value'] >= (Q1 - 1.5 * IQR)) &
                     (data['median_house_value'] <= (Q3 + 1.5 * IQR))]
print(f"\nOriginal dataset shape: {data.shape}")
print(f"Filtered dataset shape: {data_filtered.shape}")

Output:
Original dataset shape: (20640, 10)
Filtered dataset shape: (19569, 10)
Source code:
import matplotlib.pyplot as plt
import seaborn as sns
# Box plot to visualize the median house value
plt.figure(figsize=(10, 6))
sns.boxplot(x=data['median_house_value'])
plt.title('Box plot of Median House Value')
plt.show()
Output:

(d) Perform Vectorized String operations on Pandas Series


Source code:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
# Load the California housing dataset
california_housing = fetch_california_housing()
data = pd.DataFrame(data=california_housing.data, columns=california_housing.feature_names)
data['median_house_value'] = california_housing.target

# Simulate missing data for demonstration purposes
data.loc[0:10, 'MedInc'] = np.nan # Introduce NaN values
# Create a sample string column with a shorter name 's'
descriptions = [
'Affordable housing in California',
'Luxury homes with ocean view',
'Cozy cottage near the mountains',
'Modern apartments in urban area',
'Spacious villas in suburbs',
'Renovated houses with large gardens',
'New constructions with smart features',
'Historic homes with character',
'Townhouses close to amenities',
'Single-family homes with yards',
'Condos with stunning views',
'Eco-friendly houses'
]
# Repeat the descriptions to match the length of the DataFrame
data['s'] = (descriptions * (len(data) // len(descriptions) + 1))[:len(data)]
print("Initial DataFrame:")
print(data.head())
# Vectorized String Operations
data['s_lower'] = data['s'].str.lower() # Convert to lowercase
data['has_affordable'] = data['s'].str.contains('affordable', case=False) # Check for substring
data['s_length'] = data['s'].str.len() # Get length of each description
data['s_replaced'] = data['s'].str.replace('California', 'CA', regex=False) # Replace substring
data['count_e'] = data['s'].str.count('e') # Count occurrences of 'e'

# Display the modified DataFrame
print("\nModified DataFrame with String Operations:")
print(data[['s', 's_lower', 'has_affordable', 's_length', 's_replaced', 'count_e']].head())
Output:

2. Implement regularized linear regression.


Source code:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns

# Load the California housing dataset
california_housing = fetch_california_housing()
data = pd.DataFrame(data=california_housing.data, columns=california_housing.feature_names)
data['median_house_value'] = california_housing.target
# Simulate missing data for demonstration purposes
data.loc[0:10, 'MedInc'] = np.nan # Introduce NaN values
# Fill missing values (simple strategy)
data['MedInc'] = data['MedInc'].fillna(data['MedInc'].mean())
# Features and target variable
X = data.drop('median_house_value', axis=1)
y = data['median_house_value']
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train Ridge Regression Model
ridge_model = Ridge(alpha=1.0) # Regularization strength
ridge_model.fit(X_train_scaled, y_train)
# Predict and evaluate Ridge model
ridge_predictions = ridge_model.predict(X_test_scaled)
ridge_mse = mean_squared_error(y_test, ridge_predictions)
print(f'Ridge Regression Mean Squared Error: {ridge_mse:.2f}')
# Train Lasso Regression Model
lasso_model = Lasso(alpha=1.0) # Regularization strength
lasso_model.fit(X_train_scaled, y_train)

# Predict and evaluate Lasso model
lasso_predictions = lasso_model.predict(X_test_scaled)
lasso_mse = mean_squared_error(y_test, lasso_predictions)
print(f'Lasso Regression Mean Squared Error: {lasso_mse:.2f}')
# Setting the style for the plots
sns.set(style="whitegrid")
# 1. Scatter Plot of Predictions vs. Actual Values
plt.figure(figsize=(12, 6))
# Ridge Regression Predictions
plt.subplot(1, 2, 1)
plt.scatter(y_test, ridge_predictions, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.title('Ridge Regression: Predictions vs Actual')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
# Lasso Regression Predictions
plt.subplot(1, 2, 2)
plt.scatter(y_test, lasso_predictions, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.title('Lasso Regression: Predictions vs Actual')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.tight_layout()
plt.show()
# 2. Bar Plot of Coefficients
coefficients = pd.DataFrame({
    'Feature': X.columns,
    'Ridge Coefficients': ridge_model.coef_,
    'Lasso Coefficients': lasso_model.coef_
})
# Melt the DataFrame for better plotting
coefficients_melted = coefficients.melt(id_vars='Feature', var_name='Model',
value_name='Coefficient')
# Bar Plot
plt.figure(figsize=(12, 6))
sns.barplot(x='Feature', y='Coefficient', hue='Model', data=coefficients_melted)
plt.title('Ridge and Lasso Regression Coefficients')
plt.xticks(rotation=45)
plt.axhline(0, color='grey', linestyle='--')
plt.show()
Output:
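The fixed alpha=1.0 above is arbitrary; scikit-learn's RidgeCV and LassoCV can pick the regularization strength by cross-validation. A minimal sketch, reusing X_train_scaled and y_train from the code above (the alpha grids are just examples):
Source code:
from sklearn.linear_model import RidgeCV, LassoCV

# Search a small grid of regularization strengths by cross-validation
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X_train_scaled, y_train)
print("Best Ridge alpha:", ridge_cv.alpha_)

lasso_cv = LassoCV(alphas=[0.001, 0.01, 0.1, 1.0], cv=5).fit(X_train_scaled, y_train)
print("Best Lasso alpha:", lasso_cv.alpha_)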

3. Develop a model using logistic regression on any data set for prediction.
Source code:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt
# Load the California housing dataset
california_housing = fetch_california_housing()
data = pd.DataFrame(data=california_housing.data, columns=california_housing.feature_names)
data['median_house_value'] = california_housing.target
# Create a binary target variable (1 if house value > threshold, else 0)
threshold = 2.5 # Target is in units of $100,000; adjust this threshold based on your needs
data['high_value'] = (data['median_house_value'] > threshold).astype(int)
# Features and target variable
X = data.drop(['median_house_value', 'high_value'], axis=1)
y = data['high_value']
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the Logistic Regression Model
logistic_model = LogisticRegression(max_iter=200)
logistic_model.fit(X_train, y_train)
# Predict on the test set
y_pred = logistic_model.predict(X_test)
# Evaluate the model

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)
# Visualize the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
xticklabels=['Low Value', 'High Value'],
yticklabels=['Low Value', 'High Value'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
Output:
