Data Science Lab Manual
MASTER RECORD
B.Tech. Artificial Intelligence and Data Science
II year/IV semester
R2021
Prepared by,
Dr. M. Kaliappan HOD/AD
Dr. S. Vimal ASP/AD
Tamil Nadu
VISION AND MISSION
Vision of the Institute
To evolve as an Institute of international repute in offering high-quality technical
education, research and extension programmes, in order to create knowledgeable,
professionally competent and skilled engineers and technologists capable of
working in a multi-disciplinary environment to cater to societal needs.
Mission of the Institute
To accomplish its unique vision, the Institute has a far-reaching mission that aims:
• To offer higher education in Engineering and Technology with the highest level of quality,
professionalism and ethical standards
• To equip the students with up-to-date knowledge in cutting-edge technologies, wisdom,
creativity and passion for innovation, and life-long learning skills
• To constantly motivate and involve the students and faculty members in the education process
for continuously improving their performance to achieve excellence.
Vision of the Department
To impart international quality education, promote collaborative research and
graduate industry-ready engineers in the domain of Artificial Intelligence and Data
Science to serve the society.
Program Educational Outcomes (PEOs)
After successful completion of the degree, the students will be able to:
PEO 1. Apply Artificial Intelligence and Data Science techniques with industrial
standards and pioneering research to solve social and environment-related
problems, creating a sustainable ecosystem.
Program Outcomes (POs)
Engineering Graduates will be able to:
o Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering
problems.
o Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of mathematics,
natural sciences, and engineering sciences.
o Design/development of solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and environmental
considerations.
o Conduct investigations of complex problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and synthesis of the
information to provide valid conclusions.
o Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities
with an understanding of the limitations.
o The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to
the professional engineering practice.
o Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need
for sustainable development.
o Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
o Individual and team work: Function effectively as an individual, and as a member or leader in
diverse teams, and in multidisciplinary settings.
o Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and receive
clear instructions.
o Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one's own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.
o Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.
INSTRUCTIONS TO THE STUDENTS
Students should wear uniforms and lab coats neatly during the lab session.
Students should maintain silence during lab hours; roaming around the lab
during the lab session is not permitted.
Programs should be written in the manual and well prepared for the current
exercise before coming to the session.
Before every session, the previous session's lab exercise and record should be
completed and verified by the faculty.
In the record note, the flow chart and outputs should be written on the left
side, while the aim, algorithm and result should be written on the right side.
Performance 25 Marks
Viva 10 Marks
Record 15 Marks
Total 50 Marks
PREFACE
The current year's manual (2022-2023) differs from the previous year's manual
in numerous ways. Course objectives and outcomes, and the mapping of the lab
exercises to those outcomes, are included in this manual. All the lab exercises
have been revised and updated in many places. New exercises are added beyond
the university syllabus.
Viva questions related to each exercise are included at the end of every lab
exercise, including the exercises beyond the university syllabus.
CONTENTS
Ex. no  Experiment
        Syllabus
        CO Mapping
1       Working with Numpy and Pandas arrays
2       Basic plots using Matplotlib
3       Frequency distributions, variability and variance
4       Normal curves, correlation and scatter plots and correlation coefficient
5       Regression
6       Z-test
7       T-test
8       ANOVA
9       Building and validating linear models
10      Building and validating logistic models
11      Time series analysis
12      Regression – Least squares
SYLLABUS
AD3411 DATA SCIENCE AND ANALYTICS LABORATORY    L T P C
                                                0 0 4 2
OBJECTIVES:
• To develop data analytic code in Python
• To be able to use Python libraries for handling data
• To develop analytical applications using Python
• To perform data visualization using plots
OUTCOMES: Upon successful completion of this course, students will be able to:
CO1. Write python programs to handle data using Numpy and Pandas
CO2. Perform descriptive analytics
CO3. Perform data exploration using Matplotlib
CO4. Perform inferential data analytics
CO5. Build models of predictive analytics
CO Mapping
Ex. no  Experiment                   CO
6       Z-Test                       CO2
7       T-Test                       CO2
8       ANOVA                        CO2
12      Regression – Least squares   CO5
Exercise 1
Working with Numpy and Pandas DataFrames
1 a) Simple Pandas programs
1. Write a Pandas program to get the powers of an array's values element-wise.
Program:
import pandas as pd
data=pd.DataFrame({'x':[78,90,89,78],'y':[67,67,87,79],'z':[56,89,57,90]})
print(data)
Output:
x y z
0 78 67 56
1 90 67 89
2 89 87 57
3 78 79 90
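Note that the code above only builds and prints the frame; the element-wise power step named in the problem is missing here. A minimal sketch, assuming squaring as the power:
print(data ** 2)  # element-wise square of every value in the DataFrame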
2. Write a Pandas program to create and display a DataFrame from a specified dictionary
data which has the index labels
import pandas as pd
details = {
'Name' : ['Ankit', 'Aishwarya', 'Shaurya', 'Shivangi'],
'Age' : [23, 21, 22, 21],
'University' : ['BHU', 'JNU', 'DU', 'BHU'],
}
df = pd.DataFrame(details)
df
Name Age University
0 Ankit 23 BHU
1 Aishwarya 21 JNU
2 Shaurya 22 DU
3 Shivangi 21 BHU
3.Write a Pandas program to display a summary of the basic information about a specified
DataFrame and its data. Sample Python dictionary data and list labels:
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data, index=labels)
print("Summary of the basic information about this DataFrame and its data:")
print(df.info())
Summary of the basic information about this DataFrame and its data:
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
# Column Non-Null Count Dtype
4. Write a Pandas program to get the first 3 rows of a given DataFrame.
df = pd.DataFrame(exam_data, index=labels)
print("First three rows of the data frame:")
print(df.iloc[:3])
First three rows of the data frame:
name score attempts qualify
a Anastasia 12.5 1 yes
b Dima 9.0 3 no
c Katherine 16.5 2 yes
5.Write a Pandas program to select the 'name' and 'score' columns from the following DataFrame
import pandas as pd
import numpy as np
# exam_data and labels as defined in problem 3
df = pd.DataFrame(exam_data, index=labels)
print("Select specific columns:")
print(df[['name', 'score']])
Select specific columns:
name score
a Anastasia 12.5
b Dima 9.0
c Katherine 16.5
d James NaN
e Emily 9.0
f Michael 20.0
g Matthew 14.5
h Laura NaN
i Kevin 8.0
j Jonas 19.0
6. Write a Pandas program to select the specified columns and rows from a given data
frame.
import pandas as pd
import numpy as np
# exam_data and labels as defined in problem 3
df = pd.DataFrame(exam_data, index=labels)
print("Select specific columns and rows:")
print(df.iloc[[1, 3, 5, 6], [1, 3]])
Select specific columns and rows:
score qualify
b 9.0 no
d NaN no
f 20.0 yes
g 14.5 yes
7. Write a Pandas program to select the rows where the number of attempts in the
examination is greater than 2
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print("Number of attempts in the examination is less than 2 and score greater than 15 :")
print(df[(df['attempts'] <2) & (df['score'] >15)])
Number of attempts in the examination is less than 2 and score greater than 15 :
name score attempts qualify
j Jonas 19.0 1 yes
8. Write a Pandas program to count the number of rows and columns of a DataFrame
import pandas as pd
dict= {'Name' : ['Martha', 'Tim', 'Rob', 'Georgia'],
'Marks' : [87, 91, 97, 95]}
df = pd.DataFrame(dict)
display(df)
rows = df.shape[0]
cols = df.shape[1]
print("Rows: "+str(rows))
print("Columns: "+str(cols))
Name Marks
0 Martha 87
1 Tim 91
2 Rob 97
3 Georgia 95
Rows: 4
Columns: 2
9. Write a Pandas program to select the rows where the score is missing, i.e. is NaN.
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print("Rows where score is missing:")
print(df[df['score'].isnull()])
Rows where score is missing:
name score attempts qualify
d James NaN 3 no
h Laura NaN 1 no
10. Write a Pandas program to select the rows where the score is between 15 and 20 (inclusive).
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print("Rows where score between 15 and 20 (inclusive):")
print(df[df['score'].between(15, 20)])
Rows where score between 15 and 20 (inclusive):
name score attempts qualify
c Katherine 16.5 2 yes
f Michael 20.0 3 yes
j Jonas 19.0 1 yes
11.Write a Pandas program to select the rows where number of attempts in the examination is
less than 2 and score greater than 15.
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print("Number of attempts in the examination is less than 2 and score greater than 15 :")
print(df[(df['attempts'] <2) & (df['score'] >15)])
Number of attempts in the examination is less than 2 and score greater than 15 :
name score attempts qualify
j Jonas 19.0 1 yes
12.Write a Pandas program to change the score in row 'd' to 11.5.
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print("\nOriginal data frame:")
print(df)
print("\nChange the score in row 'd' to 11.5:")
df.loc['d', 'score'] =11.5
print(df)
i Kevin 8.0 2 no
j Jonas 19.0 1 yes
13.Write a Pandas program to calculate the sum of the examination attempts by the students
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print("\nSum of the examination attempts by the students:")
print(df['attempts'].sum())
14. Write a Pandas program to calculate the mean score for each different student in the data frame.
df = pd.DataFrame(exam_data, index=labels)
print("\nMean score for each different student in data frame:")
print(df['score'].mean())
16. Write a Pandas program to create a DataFrame from a list of lists.
import pandas as pd
age_list = [['Afghanistan', 1952, 8425333, 'Asia'],
['Australia', 1957, 9712569, 'Oceania'],
['Brazil', 1962, 76039390, 'Americas'],
['China', 1957, 637408000, 'Asia'],
['France', 1957, 44310863, 'Europe'],
['India', 1952, 3.72e+08, 'Asia'],
['United States', 1957, 171984000, 'Americas']]
df = pd.DataFrame(age_list, columns=['Country', 'Year',
'Population', 'Continent'])
df
Country Year Population Continent
0 Afghanistan 1952 8425333.0 Asia
1 Australia 1957 9712569.0 Oceania
2 Brazil 1962 76039390.0 Americas
3 China 1957 637408000.0 Asia
4 France 1957 44310863.0 Europe
5 India 1952 372000000.0 Asia
6 United States 1957 171984000.0 Americas
17. Write a Pandas program to replace the 'qualify' column's 'yes' and 'no' values with True and False.
import numpy as np
import pandas as pd
# exam_data as defined in problem 3
df = pd.DataFrame(exam_data)
r1 = df.replace('yes', 'true')
r2 = df.replace('no', 'false')
r1
name score attempts qualify
0 Anastasia 12.5 1 true
1 Dima 9.0 3 no
2 Katherine 16.5 2 true
3 James NaN 3 no
4 Emily 9.0 2 no
5 Michael 20.0 3 true
6 Matthew 14.5 1 true
7 Laura NaN 1 no
8 Kevin 8.0 2 no
9 Jonas 19.0 1 true
18. Write a Pandas program to change the name 'James' to 'Suresh' in the name column of the DataFrame.
import pandas as pd
# exam_data and labels as defined in problem 3
df = pd.DataFrame(exam_data, index=labels)
df['name'] = df['name'].replace('James', 'Suresh')
print(df)

20. Write a Pandas program to delete the 'attempts' column from the DataFrame.
df = pd.DataFrame(exam_data, index=labels)
print("Original rows:")
print(df)
df.pop('attempts')
print(df)
Original rows:
name score attempts qualify
a Anastasia 12.5 1 yes
b Dima 9.0 3 no
c Katherine 16.5 2 yes
d James NaN 3 no
e Emily 9.0 2 no
f Michael 20.0 3 yes
g Matthew 14.5 1 yes
h Laura NaN 1 no
i Kevin 8.0 2 no
j Jonas 19.0 1 yes
21. Write a Pandas program to iterate over rows in a DataFrame.
import pandas as pd
import numpy as np
exam_data = [{'name':'Anastasia', 'score':12.5}, {'name':'Dima','score':9},
{'name':'Katherine','score':16.5}]
df = pd.DataFrame(exam_data)
for index, row in df.iterrows():
print(row['name'], row['score'])
Anastasia 12.5
Dima 9.0
Katherine 16.5
22. Write a Pandas program to get a list of the DataFrame column headers.
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print(list(df.columns.values))
['name', 'score', 'attempts', 'qualify']
23. Write a Pandas program to select rows from a given DataFrame based on values in some
columns.
import pandas as pd
import numpy as np
d = {'col1': [1, 4, 3, 4, 5], 'col2': [4, 5, 6, 7, 8], 'col3': [7, 8, 9, 0, 1]}
df = pd.DataFrame(data=d)
print('Rows for column1 value == 4')
print(df.loc[df['col1'] == 4])
Rows for column1 value == 4
col1 col2 col3
1 4 5 8
3 4 7 0
25. Write a Pandas program to add one row to an existing DataFrame.
import pandas as pd
d = {'col1': [1, 4, 3, 4, 5], 'col2': [4, 5, 6, 7, 8], 'col3': [7, 8, 9, 0, 1]}
df = pd.DataFrame(data=d)
print('After add one row:')
row = {'col1': 10, 'col2': 11, 'col3': 12}
# df.append() was removed in pandas 2.x; pd.concat is the current idiom
df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
print(df)
After add one row:
col1 col2 col3
0 1 4 7
1 4 5 8
2 3 6 9
3 4 7 0
4 5 8 1
5 10 11 12
26. Write a Pandas program to write a DataFrame to CSV file using tab separator.
import pandas as pd
import numpy as np
d = {'col1': [1, 4, 3, 4, 5], 'col2': [4, 5, 6, 7, 8], 'col3': [7, 8, 9, 0, 1]}
df = pd.DataFrame(data=d)
print('Data from new_file.csv file:')
df.to_csv('new_file.csv', sep='\t', index=False)
new_df = pd.read_csv('new_file.csv')
print(new_df)
Data from new_file.csv file:
col1\tcol2\tcol3
0 1\t4\t7
1 4\t5\t8
2 3\t6\t9
3 4\t7\t0
4 5\t8\t1
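Note that the garbled columns in the output above come from reading the tab-separated file back without specifying the separator, so each line stays in a single column. A sketch of the matching read:
new_df = pd.read_csv('new_file.csv', sep='\t')  # pass the same separator used when writing
print(new_df)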
27. Write a Pandas program to count the city-wise number of people from a given data set (city,
name of the person).
import pandas as pd
df1 = pd.DataFrame({'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael',
'Matthew', 'Laura', 'Kevin', 'Jonas'],
'city': ['California', 'Los Angeles', 'California', 'California', 'California', 'Los Angeles', 'Los
Angeles', 'Georgia', 'Georgia', 'Los Angeles']})
g1 = df1.groupby(["city"]).size().reset_index(name='Number of people')
print(g1)
city  Number of people
0 California 4
1 Georgia 2
2 Los Angeles 4
28. Write a Pandas program to delete DataFrame row(s) based on given column value.
import pandas as pd
import numpy as np
d = {'col1': [1, 4, 3, 4, 5], 'col2': [4, 5, 6, 7, 8], 'col3': [7, 8, 9, 0, 1]}
df = pd.DataFrame(data=d)
df = df[df.col2 !=5]
print("New DataFrame")
print(df)
New DataFrame
col1 col2 col3
0 1 4 7
2 3 6 9
3 4 7 0
4 5 8 1
29. Write a Pandas program to widen output display to see more columns.
import pandas as pd
import numpy as np
d = {'col1': [1, 4, 3, 4, 5], 'col2': [4, 5, 6, 7, 8], 'col3': [7, 8, 9, 0, 1]}
df = pd.DataFrame(data=d)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print("Original DataFrame")
print(df)
Original DataFrame
col1 col2 col3
0 1 4 7
1 4 5 8
2 3 6 9
3 4 7 0
4 5 8 1
30. Write a Pandas program to replace all NaN values with zeros in a column of a DataFrame.
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
df = pd.DataFrame(exam_data)
df = df.fillna(0)
print("\nNew DataFrame replacing all NaN with 0:")
print(df)
1 b) Cardio dataset
import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/content/drive')
col_names = ['id', 'age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo', 'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio']
df = pd.read_csv('/content/drive/MyDrive/dataset/cardio_train.csv', delimiter=';')
df.head(5)
df.columns
df.info()
df.isnull().sum()
new_df = df[['ap_hi','cholesterol','gluc','smoke']]
new_df.tail(5)
max(new_df['gluc'])
3
max(new_df['cholesterol'])
3
# let's check the range
print(new_df[new_df['ap_hi'].between(1000,1620)])
1 c) Game Dataset
import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/dataset/metacritic_games.csv')
df.head(4)
df.columns
df.info()
df.isnull().sum()
# new_df: subset with the game name and scores (columns assumed from the dataset)
new_df = df[['name', 'meta_score', 'user_score']]
new_df.columns
new_df.groupby(['name']).mean()
new_data = df[['meta_score','user_score']]
new_data.head(5)
PlayStation= new_data.loc[new_data['user_score']==6.8]
PlayStation.tail(4)
Exercise 2
Basic plots using Matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv(r'C:\Users\21ad016\Desktop\Datasets\forestfires.csv')
df.head(5)
Output:
new_df = df[['X','Y','month']]
new_df.head(5)
Output:
data = new_df.groupby(['month']).value_counts()
data
Output:
data.sort_values(ascending = True)
Bar plots
#Bar plot
plt.bar(df['month'],df['temp'],color='y')
plt.xlabel('Month')
plt.ylabel('Temperature')
plt.title("Bar plot")
Output:
plt.bar(df['month'],df['wind'],color = 'r')
plt.xlabel('Month')
plt.ylabel('Wind')
plt.title("Bar plot")
Output:
plt.bar(df['month'],df['rain'])
plt.xlabel('Month')
plt.ylabel('Rain')
plt.title("Bar plot")
Output:
plt.bar(df['month'],df['RH'],color='g')
plt.xlabel('Month')
plt.ylabel('Temperature')
plt.title("Bar plot")
Output:
Observation: It is observed that factors like wind, temperature and RH contribute to the forest
fires; rain does not contribute much. Most forest fires have occurred in the months of March and
July, which fall near the summer season when the temperature is high. Forest fires occur in the
month of August due to wind and relative humidity.
Scatter plot
#Scatter Plot
plt.scatter(df['month'],df['area'])
plt.xlabel('Month')
plt.ylabel('Area in hectares')
plt.title('Scatter plot')
Output:
plt.scatter(df['month'],df['day'],color='#800000')
plt.xlabel('Month')
plt.ylabel('Day')
plt.title('Scatter Plot')
Output:
plt.scatter(df['month'],df['DC'])
plt.xlabel('Month')
plt.ylabel('DC')
plt.title('Scatter Plot')
Output:
Area plot
months = {"jan":1, "feb":2, "mar":3, "apr":4, "may":5, "jun":6, "jul":7, "aug":8, "sep":9,
"oct":10, "nov":11, "dec":12}
df["month_num"] = df["month"].map(months)
df = df.sort_values(by=["month_num", "day"])
plt.fill_between(range(len(df)), df["area"], color="orange")
plt.xlabel("Time")
plt.ylabel("Area")
plt.title("Distribution of Area over Time")
plt.show()
# group by month and get the mean FFMC value for each month
ffmc_month = df[['month', 'FFMC']]
ffmc_month_mean = ffmc_month.groupby('month')['FFMC'].mean()
# create the area plot
fig, ax = plt.subplots()
ax.fill_between(ffmc_month_mean.index, ffmc_month_mean.values, alpha=0.5)
ax.set(title='Average FFMC by Month', xlabel='Month', ylabel='FFMC')
plt.show()
Histogram
plt.hist(df['X'])
plt.hist(df['Y'])
Exercise 3
Frequency distributions, variability and variance
View(diabetes_data_upload)
Mean
#mean of the age
x <- diabetes_data_upload$Age
cat("Mean age of person having diabetes: ",mean(x));
Output:
Observation:
On average, people affected by diabetes are about 48 years old.
library(dplyr)
grouped = diabetes_data_upload %>% group_by(Gender) %>%
  summarise(Age = mean(Age))
grouped
Output:
Observation:
The average age of women getting affected by diabetes is 47 and 48 for men.
library(dplyr)
df_grouped = diabetes_data_upload %>% group_by(sudden.weight.loss) %>%
summarise(Age = mean(Age))
df_grouped
Output:
Variability
1. Range
range_age<- range(diabetes_data_upload$Age)
cat("Range with age grouped: ",range_age);
Output:
Observation:
It shows that people are affected by diabetes starting from the age of 16.
2. Interquartile Range
age <- diabetes_data_upload$Age
q25 <- quantile(age, 0.25)
q50 <- quantile(age, 0.5)
q75 <- quantile(age, 0.75)
q100 <- quantile(age, 1)
cat("Interquartile range of age: ", IQR(age))
3. Variance
age <- diabetes_data_upload$Age
cat("The variance of age: ",var(age))
Output:
4. Standard Deviation
age <- diabetes_data_upload$Age
cat("The standard deviation of age: ",sd(age))
Output:
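The frequency distribution itself, although named in the exercise title, is not shown above; a minimal sketch in R on the same frame:
# frequency table of Gender and a frequency histogram of Age
table(diabetes_data_upload$Gender)
hist(diabetes_data_upload$Age, main = "Frequency distribution of Age", xlab = "Age")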
Exercise 4
Normal Curves, Correlation and scatter plots and correlation
coefficient
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('/content/drive/MyDrive/datasets/accident.csv')
df.isnull().sum()
from bokeh.plotting import figure,output_file,show
output_file('person.html')
graph = figure(title="Scatter plot")
x = df['PERSONS']
y = df['VE_TOTAL']
graph.scatter(x,y)
show(graph)
from bokeh.io import show
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from bokeh.transform import linear_cmap
df.columns
Normal Curve:
import numpy as np
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure,output_file
output_file('normal.html')
accidents = df['CITY']
hist, edges = np.histogram(accidents, density=True)
mean = np.mean(accidents)
std_dev = np.std(accidents)
# figure construction reconstructed; the original plotting lines were lost
p = figure(title="Normal curve")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], fill_alpha=0.5)
x = np.linspace(accidents.min(), accidents.max(), 200)
pdf = np.exp(-(x - mean)**2 / (2 * std_dev**2)) / (std_dev * np.sqrt(2 * np.pi))
p.line(x, pdf, line_width=2, line_color='red')
show(p)
import numpy as np
from bokeh.io import show
from bokeh.models import LinearColorMapper, ColorBar
from bokeh.palettes import Viridis256
from bokeh.plotting import figure
from bokeh.transform import transform
from bokeh.plotting import figure,output_file
output_file("heatmap_persons.html")
40
color_bar = ColorBar(color_mapper=mapper, location=(0, 0))
p.add_layout(color_bar, 'right')
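The correlation coefficient named in the exercise title is not computed above; a one-line sketch on the two columns plotted earlier:
# Pearson correlation between persons involved and total vehicles
print(df['PERSONS'].corr(df['VE_TOTAL']))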
Exercise 5
Regression
5a. Regression Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
df = pd.read_csv('/content/drive/MyDrive/datasets/winequality-red.csv')
df.head(3)
df.isnull().sum()
df.columns
import seaborn as sns
sns.residplot(x="fixed acidity", y="quality", data=df)
sns.heatmap(df.corr(), annot=True, cmap='YlGnBu', linecolor='r', linewidths=0.5)
fig,[ax0,ax1] = plt.subplots(1,2)
fig.set_size_inches([12,6])
sns.regplot(x='citric acid',y='quality',data=df,color='r',ax=ax0)
sns.residplot(x='citric acid',y='quality',data=df,color='r',ax=ax1)
plt.show()
fig,[ax0,ax1] = plt.subplots(1,2)
fig.set_size_inches([12,6])
sns.regplot(x='sulphates',y='quality',data=df,color='g',ax=ax0)
sns.residplot(x='sulphates',y='quality',data=df,color='g',ax=ax1)
plt.show()
fig,[ax0,ax1] = plt.subplots(1,2)
fig.set_size_inches([12,6])
sns.regplot(x='alcohol',y='quality',data=df,color='g',ax=ax0)
sns.residplot(x='alcohol',y='quality',data=df,color='g',ax=ax1)
plt.show()
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
X = df.iloc[:,:-1].values
y = df.iloc[:,-1].values
X= sm.add_constant(X)
regression = sm.OLS(y,X).fit()
print(regression.summary())
from sklearn.metrics import mean_squared_error, mean_absolute_error
# predictions from the OLS model fitted above
y_pred = regression.predict(X)
# calculate the mean squared error and root mean squared error
mse = mean_squared_error(y, y_pred)
rmse = mean_squared_error(y, y_pred, squared=False)
print("MSE:", mse)
print("RMSE:", rmse)
5b. Logistic Regression in R
library(caret)
# load the wine quality data (path assumed; caret does not ship this dataset)
winequality <- read.csv("winequality-red.csv", sep = ";")
head(winequality)
summary(winequality)
winequality$quality_binary<- ifelse(winequality$quality> 6,1,0)
ggplot(winequality,aes(x=quality,fill=quality_binary))+geom_bar()
set.seed(123)
train_index <- sample(seq_len(nrow(winequality)), size = 0.7 * nrow(winequality))
train_data <- winequality[train_index, ]
test_data <- winequality[-train_index, ]
# fit the logistic model and score the test set (reconstructed; this step was lost)
model <- glm(quality_binary ~ . - quality, data = train_data, family = binomial)
pred <- predict(model, newdata = test_data, type = "response")
library(pROC)
roc <- roc(test_data$quality_binary, pred)
plot(roc)
auc(roc)
Exercise 6
Z – Test
6a. Z – test for one sample
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
df = pd.read_csv('/content/drive/MyDrive/datasets/student_clustering.csv')
df.head(3)
df.isnull().sum()
import numpy as np
import statsmodels.stats.api as sms
df['iq'].mean()
101.995
H0 : mean = 101    H1 : mean != 101
test_stat,p_value = sms.ztest(df['iq'],value=101)
print("Test Statistic: ",test_stat)
print("p-value: ",p_value)
print("\n ")
print("************Conclusion*************")
alpha = 0.05
if (p_value < alpha):
    print("\nReject the null hypothesis\nThe mean IQ of the sample differs from 101")
else:
    print("\nFail to reject the null hypothesis\nThe new students have the same mean IQ as the population")
6b . Z – test for two independent samples
#b) Two sample Z-test
df1 = pd.read_csv('/content/drive/MyDrive/datasets/weight-height.csv')
df1.head(3)
df2 = df1.drop('Weight',axis=1)
male_df = df2[df2['Gender'] == 'Male']
female_df = df2[df2['Gender'] == 'Female']
from statsmodels.stats.weightstats import ztest
sample1 = male_df['Height']
sample2 = female_df['Height']
mean1,mean2 = np.mean(sample1),np.mean(sample2)
std1,std2 = np.std(sample1), np.std(sample2)
z_score,p_value = ztest(x1=sample1,x2=sample2,value=0)
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis\nThe difference between the population means is not zero")
else:
    print("Fail to reject the null hypothesis\nThe difference between the population means is zero")
6c. Z – test for paired samples
from scipy.stats import norm
cgpa_1_sem = np.array([7.8,7.7,6.5,9.8,7.0,7.9,8.0,8.5,7.4,8.8])
cgpa_2_sem = np.array([8.4,8.0,6.0,7.9,8.9,7.8,8.0,7.5,6.6,8.0])
diff = np.array(cgpa_2_sem) - np.array(cgpa_1_sem)
mean_diff = np.mean(diff)
std_diff = np.std(diff,ddof=1)
n = len(diff)
se_diff = std_diff/np.sqrt(n)
null_hypothesis = 0
z_score = mean_diff/se_diff
p_value = 2 * norm.cdf(-np.abs(z_score))
print("Mean difference:", mean_diff)
print("Standard deviation of difference:", std_diff)
print("Standard error of the mean difference:", se_diff)
print("Z-score:", z_score)
print("P-value:", p_value)
print("\n*********************************")
print("************Conclusion*************")
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis\nThis indicates that the difference between the population means is not zero")
else:
    print("Fail to reject the null hypothesis\nThis indicates the difference between the population means is zero")
Exercise 7
T – test
7a. t – test for one sample
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
df = pd.read_csv('/content/drive/MyDrive/datasets/winequality-red.csv')
df.head(3)
df['fixed acidity'].mean()
8.31963727329581
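The t-test call itself is missing on this page; a minimal sketch, assuming a hypothesized population mean of 8.0 (an illustrative value, not from the original):
import scipy.stats as stats
# one-sample t-test of 'fixed acidity' against an assumed hypothesized mean
t_stat, p_value = stats.ttest_1samp(df['fixed acidity'], popmean=8.0)
print("t-statistic:", t_stat)
print("p-value:", p_value)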
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis\nThe population mean differs from the hypothesized value")
else:
    print("Fail to reject the null hypothesis\nThere is no evidence that the population mean differs from the hypothesized value")
7b t-test for two independent samples
import pandas as pd
import numpy as np
import scipy.stats as stats
data = pd.read_csv('/content/drive/MyDrive/datasets/weight-height.csv')
data.head(3)
male_weight = data[data['Gender']=='Male']
female_weight = data[data['Gender']=='Female']
sample1 = male_weight['Weight']
sample2 = female_weight['Weight']
t_stat , p_value = stats.ttest_ind(sample1,sample2)
print('t-statistic: ',t_stat)
print("p_value: ",p_value)
alpha = 0.05
print("\n*****************************")
print("\n*********Conclusion**********")
if p_value < alpha:
    print('\nReject the null hypothesis\nThere is a difference between the means of the populations')
else:
    print("\nFail to reject the null hypothesis\nThere is no difference between the means of the populations")
print("\n**************************")
7c t-test for paired samples
marks_1_sem = [8.9,8.8,7.9,6.7,8.7,7.8,9.0,6.7,8.9,8.0]
marks_2_sem = [9.0,7.8,8.0,6.7,9.0,6.8,8.9,7.4,8.0,6.7]
import scipy.stats as stats
t_stat,p_value = stats.ttest_rel(marks_1_sem,marks_2_sem)
print("t-statistic: ",t_stat)
print("p_value: ",p_value)
alpha = 0.05
print("\n*****************************")
print("\n*********Conclusion**********")
if p_value < alpha:
    print('\nReject the null hypothesis\nThere is a difference between the means of the populations')
else:
    print("\nFail to reject the null hypothesis\nThere is no difference between the means of the populations")
print("\n**************************")
Exercise 8
Anova
8a. One Way Anova Test
View(DATA_RELEASE)
sum(is.na(DATA_RELEASE))
library(ggplot2)
ggplot(DATA_RELEASE, aes(x=dnn_prediction, y=dnn_confidence)) +
geom_boxplot() +
labs(x="DNN Prediction", y="DNN Confidence") +
ggtitle("Boxplot of DNN Confidence by DNN Prediction")
Conclusion:
There is a significant difference between the means of the dnn_confidence scores across the
levels of the dnn_prediction factor (F(5, 9594) = 216, p < 0.001). The null hypothesis that there
is no difference between the means is rejected.
8b. Pairwise comparisons
# perform pairwise t-tests for the tested_occupation factor
pw1 <- pairwise.t.test(DATA_RELEASE$dnn_confidence,
DATA_RELEASE$tested_occupation, p.adjust.method = "bonferroni")
print(pw1)
# pairwise t-tests for the dnn_prediction factor (reconstructed; the definition of pw3 was lost)
pw3 <- pairwise.t.test(DATA_RELEASE$dnn_confidence,
DATA_RELEASE$dnn_prediction, p.adjust.method = "bonferroni")
print(pw3)
library(ggpubr)
ggboxplot(DATA_RELEASE, x = "tested_occupation", y = "dnn_confidence",
color = "tested_occupation", palette = "jco") +
stat_compare_means(comparisons = list(c("Accountant", "Data Analyst"),
c("Accountant", "Software Developer"),
c("Data Analyst", "Marketing Manager"),
c("Marketing Manager", "Software Developer")),
method = "t.test", label = "p.signif") +
labs(title = "Comparison plot for dnn_confidence by tested_occupation",
x = "Tested occupation", y = "DNN confidence")
Conclusion:
The conclusion is that both tested_occupation and dnn_prediction have a significant impact on
the dnn_confidence, and the interaction between them also has a significant effect. Additionally,
there are significant pairwise differences in dnn_confidence between different levels of
tested_occupation and dnn_prediction.
8c. Two-way ANOVA test
model1 <- aov(dnn_confidence ~ tested_occupation
* dnn_prediction, data=DATA_RELEASE)
summary(model1)
library(ggplot2)
ggplot(DATA_RELEASE, aes(x = tested_occupation, y = dnn_confidence, color =
dnn_prediction)) +
geom_point() +
stat_smooth(method = "lm", se = FALSE) +
labs(title = "Interaction plot for dnn_confidence by tested_occupation and dnn_prediction",
x = "Tested occupation", y = "DNN confidence")
Conclusion:
From the two-way ANOVA, we can see that both the tested_occupation factor and the dnn_prediction
factor have a significant effect on the dnn_confidence variable, as indicated by the extremely
small p-values (< 2e-16) and the significant F values. The interaction term between the
tested_occupation and dnn_prediction factors is also significant, indicating that the effect of one
factor on the dnn_confidence variable is dependent on the level of the other factor.
Exercise 9
Building and validating linear models
View(diamond)
Linear Regression:
#The built in Linear Regression model lm() is used
#to train the data
model <- lm(price ~ carat+depth+table,data=diamond)
summary(model)
66
# Make predictions using the model
new_data<- data.frame(carat = c(0.5, 0.75, 1.0), depth = c(61.0, 62.0, 63.0), table = c(58, 60,
62))
new_data$predicted_price<- predict(model,new_data)
new_data$predicted_price
summary(model)$r.squared
Polynomial Regression:
#Polynomial Regression
model1 <- lm(price ~poly(carat,2),data=diamond)
summary(model1)
new_data$prediction_poly<- predict(model1,new_data)
new_data$prediction_poly
summary(model1)$r.squared
# new data for the Bayesian model (the opening of this data.frame was lost;
# the cut values below are assumed for illustration)
new_data <- data.frame(carat = c(0.5, 0.75, 1.0),
cut = c("Ideal", "Premium", "Good"),
color = c("D", "E", "F"),
clarity = c("VS1", "VS2", "SI1"),
depth = c(61.5, 62.0, 63.5),
table = c(55, 59, 63),
x = c(4.29, 4.96, 5.67),
y = c(4.32, 4.97, 5.70),
z = c(2.65, 3.08, 3.62))
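The Bayesian fit referenced below is missing from this copy; a minimal sketch, assuming an rstanarm fit on the same three predictors used for lm():
# Bayesian linear regression (reconstructed under assumptions; not the original fit)
library(rstanarm)
model.bayes <- stan_glm(price ~ carat + depth + table, data = diamond)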
# Predict the prices for the new data using the Bayesian linear regression model
new_data$predicted_bayes<- predict(model.bayes, newdata = new_data)
library(caret)
# Predict the prices for the test data using the Bayesian linear regression model
test_data$predicted_price<- predict(model.bayes, newdata = test_data)
library(ggplot2)
# Get the predictions for each of the three models
preds1 <- predict(model, newdata = diamond)
preds2 <- predict(model1, newdata = diamond)
preds3 <- predict(model.bayes, newdata = diamond)
# Create a data frame with the actual values and the predicted values for each model
df <- data.frame(actual = diamond$price,
model = preds1,
model1 = preds2,
model.bayes = preds3)
# Reshape to long format so all three models can be plotted together
library(tidyr)
df_long <- pivot_longer(df, cols = -actual, names_to = "model", values_to = "predicted")
# Create a scatter plot of the actual values versus the predicted values for each model
ggplot(df_long, aes(x = actual, y = predicted, color = model)) +
geom_point() +
scale_color_manual(values = c("red", "green", "blue")) +
labs(x = "Actual Price", y = "Predicted Price", color = "Model")
9b) Linear Regression to predict top 10 players
View(Chess)
# Perform a linear regression of ELO on Age
model <- lm(ELO ~ Age, data = Chess)
summary(model)
library(ggplot2)
library(dplyr)
# Predict ELO rating for each player using the linear regression model
players <- Chess %>%
mutate(Predicted_ELO = predict(lm(ELO ~ Age, data = Chess)))
Exercise 10
Building and Validating Logistic models
10a. Logistic Regression on Titanic DataSet
View(titanic)
library(dplyr)
library(caret)
library(ggplot2)
titanic$Gender_Survived<- ifelse(titanic$Sex=='male',1,0)
ggplot(titanic,aes(x=Survived,fill=Gender_Survived))+geom_bar()
# titanic_updated: cleaned copy of titanic (reconstructed; its creation was lost)
titanic_updated <- na.omit(titanic)
train_index <- sample(seq_len(nrow(titanic_updated)), size = 0.7*nrow(titanic_updated))
train_data <- titanic_updated[train_index,]
test_data <- titanic_updated[-train_index,]
# fit the logistic model (reconstructed; predictors assumed from the standard Titanic columns)
model <- glm(Survived ~ Pclass + Sex + Age, data = train_data, family = binomial)
pred <- predict(model, newdata = test_data, type = "response")
pred <- head(pred, nrow(test_data))
predictions <- data.frame(actual = test_data$Survived, predicted = ifelse(pred >= 0.5, 1, 0))
predictions
library(pROC)
roc <- roc(test_data$Survived,pred)
plot(roc)
auc(roc)
10b. Logistic Regression on the Breast Cancer dataset
library(corrplot)
corr <- cor(Breast_cancer_data)
corrplot(corr, method = 'circle')
library(dplyr)
library(ggplot2)
library(caret)
set.seed(1234)
train_index<- sample(seq_len(nrow(Breast_cancer_data)),size=0.7*nrow(Breast_cancer_data))
train_data<- Breast_cancer_data[train_index,]
test_data<- Breast_cancer_data[-train_index,]
head(train_data)
head(test_data)
model <- glm(diagnosis ~., data=train_data,family=binomial)
summary(model)
# score the test set before computing the ROC curve (this step was lost)
pred <- predict(model, newdata = test_data, type = "response")
library(pROC)
roc <- roc(test_data$diagnosis, pred)
plot(roc)
auc(roc)
10c) Finding the odds ratio
df<- read.csv('C:/Users/21ad016/Desktop/Datasets/framingham.csv')
sum(is.na(df))
645
df<- na.omit(df)
head(df)
set.seed(123)
trainIndex<- sample(1:nrow(df), 0.8*nrow(df))
trainData<- df[trainIndex, ]
testData<- df[-trainIndex, ]
# fit the logistic model (reconstructed; TenYearCHD is the Framingham outcome column)
model <- glm(TenYearCHD ~ ., data = trainData, family = binomial)
summary(model)
exp(coef(model))
10d) Logistic Regression on the Iris dataset using scikit-learn
df = pd.read_csv(r'C:\Users\21ad016\Desktop\Datasets\archive_iris\Iris.csv')
df.head(3)
df.isna().sum()
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# split features and target (column names as in the Kaggle Iris.csv)
X = df.drop(columns=['Id', 'Species'])
y = df['Species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
lr_model = LogisticRegression(max_iter=200)
# Fit the model to the training data
lr_model.fit(X_train, y_train)
y_pred = lr_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Exercise 11
Time series analysis
11 Time series Analysis using R
df<- read.csv('C:/Users/21ad016/Desktop/Datasets/test.csv')
head(df)
df$Date<- as.Date(df$Date)
sales_ts<- ts(df$number_sold, start = min(df$Date), frequency = 7)
plot(sales_ts, main = "Weekly Sales Data", xlab = "Date", ylab = "Number Sold")
sales_decomp<- decompose(sales_ts)
plot(sales_decomp)
library(forecast)
sales_model<- auto.arima(sales_ts)
sales_forecast<- forecast(sales_model, h = 7)
plot(sales_forecast, main = "Sales Forecast", xlab = "Date", ylab = "Number Sold")
Exercise 12
Regression – Least Squares
from os.path import basename, exists
def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)
download(r"C:\Users\21ad016\Desktop\data")
download(r"C:\Users\21ad016\Desktop\data1")
import numpy as np
import random
import thinkstats2
import thinkplot
download("C:\Users\21ad016\Desktop\nsfg.py")
download("C:\Users\21ad016\Desktop\first.py")
download("C:\Users\21ad016\Desktop\2002FemPreg.dct")
download(
C:\Users\21ad016\Desktop\2002FemPreg.dat.gz"
)
import first
live, firsts, others = first.MakeFrames()
live = live.dropna(subset=['agepreg', 'totalwgt_lb'])
ages = live.agepreg
weights = live.totalwgt_lb
live.head(4)
def LeastSquares(xs, ys):
    # body reconstructed from ThinkStats2; only the return line survived here
    meanx, varx = thinkstats2.MeanVar(xs)
    meany = np.mean(ys)
    slope = thinkstats2.Cov(xs, ys, meanx, meany) / varx
    inter = meany - slope * meanx
    return inter, slope
inter, slope = LeastSquares(ages, weights)
inter, slope
inter + slope * 25
slope * 10
def FitLine(xs, inter, slope):
    fit_xs = np.sort(xs)
    fit_ys = inter + slope * fit_xs
    return fit_xs, fit_ys
fit_xs, fit_ys = FitLine(ages, inter, slope)
thinkplot.Scatter(ages, weights, color='red', alpha=0.1, s=10)
thinkplot.Plot(fit_xs, fit_ys, color='green', linewidth=3)
thinkplot.Plot(fit_xs, fit_ys, color='black', linewidth=2)
thinkplot.Config(xlabel="Mother's age (years)",
ylabel='Birth weight (lbs)',
axis=[10, 45, 0, 15],
legend=False)
live.info()
import seaborn as sns
# distplot is deprecated in recent seaborn; histplot with kde=True is the current equivalent
sns.histplot(live['agepreg'], kde=True)
sns.histplot(live['totalwgt_lb'], kde=True)
bins = np.arange(10, 48, 3)
indices = np.digitize(live.agepreg, bins)
groups = live.groupby(indices)
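The page that used these bins is missing here; in ThinkStats2 the usual next step is to compute the mean age and mean birth weight within each bin (a sketch; the [1:-1] slice dropping the sparse edge bins is an assumption carried over from the book):
# mean age and mean birth weight per age bin, dropping the first and last bins
age_means = [group.agepreg.mean() for _, group in groups][1:-1]
wgt_means = [group.totalwgt_lb.mean() for _, group in groups][1:-1]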
def SampleRows(df, nrows, replace=False):
    indices = np.random.choice(df.index, nrows, replace=replace)
    sample = df.loc[indices]
    return sample

def ResampleRows(df):
    return SampleRows(df, len(df), replace=True)

def SamplingDistributions(live, iters=101):
    t = []
    for _ in range(iters):
        sample = ResampleRows(live)
        ages = sample.agepreg
        weights = sample.totalwgt_lb
        estimates = LeastSquares(ages, weights)
        t.append(estimates)
    inters, slopes = zip(*t)
    return inters, slopes
def PlotConfidenceIntervals(xs, inters, slopes, percent=90, **options):
    fys_seq = []
    for inter, slope in zip(inters, slopes):
        fxs, fys = FitLine(xs, inter, slope)
        fys_seq.append(fys)
    p = (100 - percent) / 2
    percents = p, 100 - p
    low, high = thinkstats2.PercentileRows(fys_seq, percents)
    thinkplot.FillBetween(fxs, low, high, **options)

# sampling distributions of the fitted parameters (this setup line was lost)
inters, slopes = SamplingDistributions(live)
PlotConfidenceIntervals(age_means, inters, slopes, percent=90,
                        color='gray', alpha=0.3, label='90% CI')
PlotConfidenceIntervals(age_means, inters, slopes, percent=50,
                        color='gray', alpha=0.5, label='50% CI')
def ResampleRowsWeighted(df, column='finalwgt'):
    weights = df[column]
    cdf = thinkstats2.Cdf(dict(weights))
    indices = cdf.Sample(len(weights))
    sample = df.loc[indices]
    return sample
# Summarize was not defined on this page; ThinkStats2's version reports the
# mean, standard error and a 90% confidence interval of the estimates
def Summarize(estimates):
    mean = np.mean(estimates)
    stderr = np.std(estimates)
    cdf = thinkstats2.Cdf(estimates)
    ci = cdf.ConfidenceInterval(90)
    print('mean, SE, CI', mean, stderr, ci)

iters = 100
estimates = [ResampleRowsWeighted(live).totalwgt_lb.mean()
             for _ in range(iters)]
Summarize(estimates)
estimates = [thinkstats2.ResampleRows(live).totalwgt_lb.mean()
             for _ in range(iters)]
Summarize(estimates)