
AD3411 DATA SCIENCE AND ANALYTICS LABORATORY

MASTER RECORD
B.Tech. Artificial Intelligence and Data Science
II year/IV semester
R2021
Prepared by,
Dr. M. Kaliappan HOD/AD
Dr. S. Vimal ASP/AD

Department of Artificial Intelligence and Data Science


RAMCO INSTITUTE OF TECHNOLOGY
Rajapalayam – 626117

Tamil Nadu

Prepared by, Approved by,

Dr. M. Kaliappan HOD/AD Dr. M. Kaliappan HOD/AD

Dr. S. Vimal ASP/AD

VISION AND MISSION
Vision of the Institute
To evolve as an Institute of international repute in offering high-quality technical
education, Research and extension programmes in order to create knowledgeable,
professionally competent and skilled Engineers and Technologists capable of
working in multi-disciplinary environment to cater to the societal needs.

Mission of the Institute
To accomplish its unique vision, the Institute has a far-reaching mission that aims:
• To offer higher education in Engineering and Technology with highest level of quality,
Professionalism and ethical standards
• To equip the students with up-to-date knowledge in cutting-edge technologies, wisdom,
creativity and passion for innovation, and life-long learning skills
• To constantly motivate and involve the students and faculty members in the education process
for continuously improving their performance to achieve excellence.

Vision of Department
To impart international quality education, promote collaborative research and
graduate industry-ready engineers in the domain of Artificial Intelligence and Data
Science to serve the society.

Mission of the Department

• Excel in the Teaching-Learning process and collaborative research by the use
of modern infrastructure and innovative components.

• Establish an Artificial Intelligence and Data Science based centre of
excellence to prepare professional technocrats for solving interdisciplinary
industry problems in various applications.

• Motivate students to emerge as entrepreneurs with leadership qualities in a
societal-centric program to fulfil industry and community needs with ethical
standards.

Program Educational Outcomes (PEOs)
After successful completion of the degree, the students will be able to

PEO 1. Apply Artificial Intelligence and Data Science techniques with industrial
standards and pioneering research to solve social and environment-related
problems for making a sustainable ecosystem.

PEO 2. Excel with professional skills, fundamental knowledge, and advanced
futuristic technologies to become Data Scientists, Data Analyst Managers, Data
Science leaders, AI Research Scientists, or Entrepreneurs.

Program Outcomes (POs)
Engineering Graduates will be able to:
o Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering
problems.
o Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of mathematics,
natural sciences, and engineering sciences.
o Design/development of solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and environmental
considerations.
o Conduct investigations of complex problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and synthesis of the
information to provide valid conclusions.
o Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities
with an understanding of the limitations.
o The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to
the professional engineering practice.
o Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need
for sustainable development.
o Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
o Individual and team work: Function effectively as an individual, and as a member or leader in
diverse teams, and in multidisciplinary settings.
o Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and receive
clear instructions.
o Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one's own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.
o Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.

Program Specific Outcomes (PSOs)

After successful completion of the degree, the students will be able to:
PSO 1: Apply analytic technologies to arrive at actionable foresight, insight, and hindsight from
data for solving business and engineering problems.
PSO 2: Create and apply the techniques of AI and Data Science to forecast future events in
the domains of Healthcare, Education, Agriculture, Manufacturing, Automation, Robotics,
Transport, etc.
PSO 3: Enrich critical thinking skills in emerging technologies such as hybrid mobile
application development, the cloud technology stack, and cyber-physical systems, with mathematical
aid to foresee research findings and provide solutions.

INSTRUCTIONS TO THE STUDENTS

• Students should wear uniforms and lab coats neatly during the lab session.

• Students should maintain silence during lab hours; roaming around the lab
during the lab session is not permitted.

• Programs should be written in the manual and well prepared for the current
exercise before coming to the session.

• Experiments should be completed within the specified lab session.

• Before every session, the previous session's lab exercise and record should be
completed and verified by the faculty.

• In the record note, flow charts and outputs should be written on the left
side, while the aim, algorithm, and result should be written on the right side.

• Printed programs should be placed on the right side.

• Marks for each lab exercise are awarded as follows:

Performance    25 Marks
Viva           10 Marks
Record         15 Marks
Total          50 Marks

PREFACE
The current year's manual (2022-2023) differs from the previous
year's manual in numerous ways. Course objectives and outcomes, and the
mapping of the lab exercises to the outcomes, are included in this manual.

All the lab exercises are revised and updated in many places. New
exercises are added beyond the university syllabus.

Viva questions related to each exercise are included at the end of all the
lab exercises, including the exercises beyond the university syllabus.

A number of people helped me with this revision. I would like to thank
the Head of the Department and all friendly faculty members for
providing valuable suggestions in preparing this student-centric
manual. I would also like to thank the lab technicians and other people who
helped me in many ways.

CONTENTS

Syllabus
CO Mapping

Ex. no  Experiment
1       Working with Numpy and Pandas arrays
2       Basic plots using Matplotlib
3       Frequency distributions, variability and variance
4       Normal curves, correlation and scatter plots, and correlation coefficient
5       Regression
6       Z-test
7       T-test
8       ANOVA
9       Building and validating linear models
10      Building and validating logistic models
11      Time series analysis
12      Regression – least squares
SYLLABUS
AD3411    DATA SCIENCE AND ANALYTICS LABORATORY    L T P C
                                                   0 0 4 2
OBJECTIVES:
To develop data analytic code in Python
To be able to use Python libraries for handling data
To develop analytical applications using Python
To perform data visualization using plots

LIST OF EXPERIMENTS
Tools: Python, Numpy, Scipy, Matplotlib, Pandas, statsmodels, seaborn, plotly, bokeh
1. Working with Numpy arrays
2. Working with Pandas data frames
3. Basic plots using Matplotlib
4. Frequency distributions, Averages, Variability
5. Normal curves, Correlation and scatter plots, Correlation coefficient
6. Regression
7. Z-test
8. T-test
9. ANOVA
10. Building and validating linear models
11. Building and validating logistic models
12. Time series analysis
PRACTICALS: 60 PERIODS

OUTCOMES: Upon successful completion of this course, students will be able to:
CO1. Write python programs to handle data using Numpy and Pandas
CO2. Perform descriptive analytics
CO3. Perform data exploration using Matplotlib
CO4. Perform inferential data analytics
CO5. Build models of predictive analytics

CO Mapping

Ex. no  Experiment                                          CO
1       Working with Numpy and Pandas arrays                CO1
2       Basic plots using Matplotlib                        CO3
3       Frequency distributions, variance and variability   CO4
4       Normal curves, correlation, and scatter plots       CO4
5       Regression                                          CO5
6       Z-test                                              CO2
7       T-test                                              CO2
8       ANOVA                                               CO2
9       Building and validating linear models               CO5
10      Building and validating logistic models             CO5
11      Time series analysis                                CO2
12      Regression – least squares                          CO5

Exercise 1
Working with Numpy and Pandas DataFrames
1 A): Simple Pandas programs
1. Write a Pandas program to get the powers of an array's values element-wise.
Program:
import pandas as pd
data=pd.DataFrame({'x':[78,90,89,78],'y':[67,67,87,79],'z':[56,89,57,90]})
print(data)
# element-wise power: square every value in the DataFrame
print(data**2)
Output:
   x   y   z
0  78  67  56
1  90  67  89
2  89  87  57
3  78  79  90
      x     y     z
0  6084  4489  3136
1  8100  4489  7921
2  7921  7569  3249
3  6084  6241  8100
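The same element-wise power can also be taken with NumPy's ufunc or the DataFrame's own pow method; a minimal sketch (cubing instead of squaring, purely illustrative):

import numpy as np
print(np.power(data, 3))   # NumPy ufunc, applied element-wise
print(data.pow(3))         # equivalent pandas DataFrame method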
2. Write a Pandas program to create and display a DataFrame from a specified dictionary
data which has the index labels
import pandas as pd

details = {
'Name' : ['Ankit', 'Aishwarya', 'Shaurya', 'Shivangi'],
'Age' : [23, 21, 22, 21],
'University' : ['BHU', 'JNU', 'DU', 'BHU'],
}

df = pd.DataFrame(details)

df
Name Age University
0 Ankit 23 BHU
1 Aishwarya 21 JNU
2 Shaurya 22 DU
3 Shivangi 21 BHU
3.Write a Pandas program to display a summary of the basic information about a specified
DataFrame and its data. Sample Python dictionary data and list labels:
import pandas as pd
import numpy as np

exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',

'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df = pd.DataFrame(exam_data , index=labels)
print("Summary of the basic information about this DataFrame and its data:")
print(df.info())
Summary of the basic information about this DataFrame and its data:
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   name      10 non-null     object
 1   score     8 non-null      float64
 2   attempts  10 non-null     int64
 3   qualify   10 non-null     object
dtypes: float64(1), int64(1), object(2)
memory usage: 400.0+ bytes
None
4.Write a Pandas program to get the first 3 rows of a given DataFrame. Go to the editor Sample
Python dictionary data and list labels:
import pandas as pd
import numpy as np

exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',


'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df = pd.DataFrame(exam_data , index=labels)
print("First three rows of the data frame:")
print(df.iloc[:3])
First three rows of the data frame:
name score attempts qualify
a Anastasia 12.5 1 yes
b Dima 9.0 3 no
c Katherine 16.5 2 yes
5.Write a Pandas program to select the 'name' and 'score' columns from the following DataFrame

import pandas as pd
import numpy as np

exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',


'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df = pd.DataFrame(exam_data , index=labels)
print("Select specific columns:")
print(df[['name', 'score']])
Select specific columns:
name score
a Anastasia 12.5
b Dima 9.0
c Katherine 16.5
d James NaN
e Emily 9.0
f Michael 20.0
g Matthew 14.5
h Laura NaN
i Kevin 8.0
j Jonas 19.0
6. Write a Pandas program to select the specified columns and rows from a given data
frame.
import pandas as pd
import numpy as np

exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',


'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df = pd.DataFrame(exam_data , index=labels)
print("Select specific columns and rows:")
print(df.iloc[[1, 3, 5, 6], [1, 3]])
Select specific columns and rows:
score qualify
b 9.0 no
d NaN no

f 20.0 yes
g 14.5 yes
7. Write a Pandas program to select the rows where the number of attempts in the
examination is greater than 2
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print("Number of attempts in the examination is less than 2 and score greater than 15 :")
print(df[(df['attempts'] <2) & (df['score'] >15)])
Number of attempts in the examination is less than 2 and score greater than 15 :
name score attempts qualify
j Jonas 19.0 1 yes
8. Write a Pandas program to count the number of rows and columns of a DataFrame
import pandas as pd
dict= {'Name' : ['Martha', 'Tim', 'Rob', 'Georgia'],
'Marks' : [87, 91, 97, 95]}
df = pd.DataFrame(dict)
display(df)
rows = df.shape[0]
cols = df.shape[1]
print("Rows: "+str(rows))
print("Columns: "+str(cols))
Name Marks
0 Martha 87
1 Tim 91
2 Rob 97
3 Georgia 95
Rows: 4
Columns: 2
9. Write a Pandas program to select the rows where the score is missing, i.e. is NaN.
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],

'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df = pd.DataFrame(exam_data , index=labels)
print("Rows where score is missing:")
print(df[df['score'].isnull()])
Rows where score is missing:
name score attempts qualify
d James NaN 3 no
h Laura NaN 1 no
10.Write a Pandas program to select the rows the score is between 15 and 20 (inclusive)
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df = pd.DataFrame(exam_data , index=labels)
print("Rows where score between 15 and 20 (inclusive):")
print(df[df['score'].between(15, 20)])
Rows where score between 15 and 20 (inclusive):
name score attempts qualify
c Katherine 16.5 2 yes
f Michael 20.0 3 yes
j Jonas 19.0 1 yes
11.Write a Pandas program to select the rows where number of attempts in the examination is
less than 2 and score greater than 15.
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print("Number of attempts in the examination is less than 2 and score greater than 15 :")
print(df[(df['attempts'] <2) & (df['score'] >15)])

Number of attempts in the examination is less than 2 and score greater than 15 :
name score attempts qualify
j Jonas 19.0 1 yes
12.Write a Pandas program to change the score in row 'd' to 11.5.
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df = pd.DataFrame(exam_data , index=labels)
print("\nOriginal data frame:")
print(df)
print("\nChange the score in row 'd' to 11.5:")
df.loc['d', 'score'] =11.5
print(df)

Original data frame:


name score attempts qualify
a Anastasia 12.5 1 yes
b Dima 9.0 3 no
c Katherine 16.5 2 yes
d James NaN 3 no
e Emily 9.0 2 no
f Michael 20.0 3 yes
g Matthew 14.5 1 yes
h Laura NaN 1 no
i Kevin 8.0 2 no
j Jonas 19.0 1 yes

Change the score in row 'd' to 11.5:


name score attempts qualify
a Anastasia 12.5 1 yes
b Dima 9.0 3 no
c Katherine 16.5 2 yes
d James 11.5 3 no
e Emily 9.0 2 no
f Michael 20.0 3 yes
g Matthew 14.5 1 yes
h Laura NaN 1 no

i Kevin 8.0 2 no
j Jonas 19.0 1 yes

13.Write a Pandas program to calculate the sum of the examination attempts by the students
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print("\nSum of the examination attempts by the students:")
print(df['attempts'].sum())

Sum of the examination attempts by the students:


19
14. Write a Pandas program to calculate the mean score of the students in the
DataFrame
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df = pd.DataFrame(exam_data , index=labels)
print("\nMean score for each different student in data frame:")
print(df['score'].mean())

Mean score for each different student in data frame:


13.5625
15. Write a Pandas program to append a new row 'k' to a data frame with given values for each
column
import pandas as pd
df = pd.DataFrame.from_dict({
'Name': ['Nik', 'Kate', 'Evan', 'Kyra'],
'Age': [31, 30, 40, 33],
'Location': ['Toronto', 'London', 'Kingston', 'Hamilton']
})
# append a new row labelled 'k' (the values here are illustrative)
df.loc['k'] = ['Suresh', 35, 'Ottawa']
print(df)

     Name  Age  Location
0     Nik   31   Toronto
1    Kate   30    London
2    Evan   40  Kingston
3    Kyra   33  Hamilton
k  Suresh   35    Ottawa

16. Write a Pandas program to sort a DataFrame first by one column in descending order, then
by another in ascending order.
import pandas as pd
age_list = [['Afghanistan', 1952, 8425333, 'Asia'],
['Australia', 1957, 9712569, 'Oceania'],
['Brazil', 1962, 76039390, 'Americas'],
['China', 1957, 637408000, 'Asia'],
['France', 1957, 44310863, 'Europe'],
['India', 1952, 3.72e+08, 'Asia'],
['United States', 1957, 171984000, 'Americas']]
df = pd.DataFrame(age_list, columns=['Country', 'Year',
'Population', 'Continent'])

# sort by 'Country' (descending), then by 'Year' (ascending)
df.sort_values(by=['Country', 'Year'], ascending=[False, True])
         Country  Year   Population Continent
6  United States  1957  171984000.0  Americas
5          India  1952  372000000.0      Asia
4         France  1957   44310863.0    Europe
3          China  1957  637408000.0      Asia
2         Brazil  1962   76039390.0  Americas
1      Australia  1957    9712569.0   Oceania
0    Afghanistan  1952    8425333.0      Asia
17. Write a Pandas program to replace the values 'yes' and 'no' in the 'qualify' column with
True and False.
import numpy as np
import pandas as pd

exam_data = pd.DataFrame({'name': ['Anastasia', 'Dima', 'Katherine',
'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no',
'yes']}, index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
exam_data['qualify'] = exam_data['qualify'].replace({'yes': True, 'no': False})
exam_data
        name  score  attempts  qualify
a  Anastasia   12.5         1     True
b       Dima    9.0         3    False
c  Katherine   16.5         2     True
d      James    NaN         3    False
e      Emily    9.0         2    False
f    Michael   20.0         3     True
g    Matthew   14.5         1     True
h      Laura    NaN         1    False
i      Kevin    8.0         2    False
j      Jonas   19.0         1     True

18. Write a Pandas program to change the name 'James' to 'Suresh' in the name column of the
DataFrame.
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data, index=labels)
# replace 'James' with 'Suresh' in the 'name' column
df['name'] = df['name'].replace('James', 'Suresh')
print(df['name'])
a    Anastasia
b         Dima
c    Katherine
d       Suresh
e        Emily
f      Michael
g      Matthew
h        Laura
i        Kevin
j        Jonas
Name: name, dtype: object
19. Write a Pandas program to delete the 'attempts' column from the DataFrame
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print("Original rows:")
print(df)
print("\nDelete the 'attempts' column from the data frame:")

df.pop('attempts')
print(df)
Original rows:
name score attempts qualify
a Anastasia 12.5 1 yes
b Dima 9.0 3 no
c Katherine 16.5 2 yes
d James NaN 3 no
e Emily 9.0 2 no
f Michael 20.0 3 yes
g Matthew 14.5 1 yes
h Laura NaN 1 no
i Kevin 8.0 2 no
j Jonas 19.0 1 yes

Delete the 'attempts' column from the data frame:


name score qualify
a Anastasia 12.5 yes
b Dima 9.0 no
c Katherine 16.5 yes
d James NaN no
e Emily 9.0 no
f Michael 20.0 yes
g Matthew 14.5 yes
h Laura NaN no
i Kevin 8.0 no
j Jonas 19.0 yes
21. Write a Pandas program to iterate over rows in a DataFrame.

import pandas as pd
import numpy as np
exam_data = [{'name':'Anastasia', 'score':12.5}, {'name':'Dima','score':9},
{'name':'Katherine','score':16.5}]
df = pd.DataFrame(exam_data)
for index, row in df.iterrows():
    print(row['name'], row['score'])
Anastasia 12.5
Dima 9.0
Katherine 16.5
22. Write a Pandas program to get list from DataFrame column headers

import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',

'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print(list(df.columns.values))
['name', 'score', 'attempts', 'qualify']
23. Write a Pandas program to select rows from a given DataFrame based on values in some
columns.
import pandas as pd
import numpy as np
d = {'col1': [1, 4, 3, 4, 5], 'col2': [4, 5, 6, 7, 8], 'col3': [7, 8, 9, 0, 1]}
df = pd.DataFrame(data=d)
print('Rows where col1 value == 4')
print(df.loc[df['col1'] ==4])
Rows where col1 value == 4
col1 col2 col3
1 4 5 8
3 4 7 0

24. Write a Pandas program to change the order of a DataFrame columns


import pandas as pd
import numpy as np
d = {'col1': [1, 4, 3, 4, 5], 'col2': [4, 5, 6, 7, 8], 'col3': [7, 8, 9, 0, 1]}
df = pd.DataFrame(data=d)
print('After altering col1 and col3')
df = df[['col3', 'col2', 'col1']]
print(df)
After altering col1 and col3
col3 col2 col1
0 7 4 1
1 8 5 4
2 9 6 3
3 0 7 4
4 1 8 5

25. Write a Pandas program to add one row in an existing DataFrame.


import pandas as pd
import numpy as np
d = {'col1': [1, 4, 3, 4, 5], 'col2': [4, 5, 6, 7, 8], 'col3': [7, 8, 9, 0, 1]}
df = pd.DataFrame(data=d)

print('After add one row:')
new_row = pd.DataFrame([{'col1': 10, 'col2': 11, 'col3': 12}])
df = pd.concat([df, new_row], ignore_index=True)  # df.append was removed in pandas 2.0
print(df)
After add one row:
col1 col2 col3
0 1 4 7
1 4 5 8
2 3 6 9
3 4 7 0
4 5 8 1
5 10 11 12

26. Write a Pandas program to write a DataFrame to CSV file using tab separator.

import pandas as pd
import numpy as np
d = {'col1': [1, 4, 3, 4, 5], 'col2': [4, 5, 6, 7, 8], 'col3': [7, 8, 9, 0, 1]}
df = pd.DataFrame(data=d)
print('Data from new_file.csv file:')
df.to_csv('new_file.csv', sep='\t', index=False)
new_df = pd.read_csv('new_file.csv', sep='\t')  # read back with the same tab separator
print(new_df)
Data from new_file.csv file:
   col1  col2  col3
0     1     4     7
1     4     5     8
2     3     6     9
3     4     7     0
4     5     8     1
27. Write a Pandas program to count city wise number of people from a given of data set (city,
name of the person)
import pandas as pd
df1 = pd.DataFrame({'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael',
'Matthew', 'Laura', 'Kevin', 'Jonas'],
'city': ['California', 'Los Angeles', 'California', 'California', 'California', 'Los Angeles', 'Los
Angeles', 'Georgia', 'Georgia', 'Los Angeles']})
g1 = df1.groupby(["city"]).size().reset_index(name='Name of the person')
print(g1)
city Name of the person
0 California 4
1 Georgia 2
2 Los Angeles 4

28. Write a Pandas program to delete DataFrame row(s) based on given column value.
import pandas as pd
import numpy as np
d = {'col1': [1, 4, 3, 4, 5], 'col2': [4, 5, 6, 7, 8], 'col3': [7, 8, 9, 0, 1]}
df = pd.DataFrame(data=d)
df = df[df.col2 !=5]
print("New DataFrame")
print(df)
New DataFrame
col1 col2 col3
0 1 4 7
2 3 6 9
3 4 7 0
4 5 8 1
29. Write a Pandas program to widen output display to see more columns.

import pandas as pd
import numpy as np
d = {'col1': [1, 4, 3, 4, 5], 'col2': [4, 5, 6, 7, 8], 'col3': [7, 8, 9, 0, 1]}
df = pd.DataFrame(data=d)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print("Original DataFrame")
print(df)
Original DataFrame
col1 col2 col3
0 1 4 7
1 4 5 8
2 3 6 9
3 4 7 0
4 5 8 1

30. Write a Pandas program to replace all the NaN values with Zero's in a column of a dataframe.

import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
df = pd.DataFrame(exam_data)
df = df.fillna(0)

print("\nNew DataFrame replacing all NaN with 0:")
print(df)

New DataFrame replacing all NaN with 0:


name score attempts qualify
0 Anastasia 12.5 1 yes
1 Dima 9.0 3 no
2 Katherine 16.5 2 yes
3 James 0.0 3 no
4 Emily 9.0 2 no
5 Michael 20.0 3 yes
6 Matthew 14.5 1 yes
7 Laura 0.0 1 no
8 Kevin 8.0 2 no
9 Jonas 19.0 1 yes

1b) Cardio dataset:


Program:

import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/content/drive')
col_names = ['id', 'age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo', 'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio']
df = pd.read_csv('/content/drive/MyDrive/dataset/cardio_train.csv', delimiter=';')
df.head(5)

df.columns

df.info()

df.isnull().sum()

new_df = df[['ap_hi','cholesterol','gluc','smoke']]
new_df.tail(5)

max(new_df['gluc'])
3
max(new_df['cholesterol'])
3
# let's check the range
print(new_df[new_df['ap_hi'].between(1000,1620)])

1 c) Game Dataset

import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/dataset/metacritic_games.csv')
df.head(4)

df.columns

df.info()

df.isnull().sum()

new_df = df[['name', 'platform', 'release_date', 'summary','meta_score']]


new_df.tail(5)

new_df.columns

new_df.groupby(['name'])['meta_score'].mean()  # per-game mean of the numeric score column

new_data = df[['meta_score','user_score']]
new_data.head(5)

PlayStation= new_data.loc[new_data['user_score']==6.8]
PlayStation.tail(4)

Exercise 2
Basic plots using Matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv(r'C:\Users\21ad016\Desktop\Datasets\forestfires.csv')
df.head(5)
Output:

new_df = df[['X','Y','month']]
new_df.head(5)
Output:

data = new_df.groupby(['month']).value_counts()
data
Output:

data.sort_values(ascending = True)

Bar plots
#Bar plot

plt.bar(df['month'],df['temp'],color='y')
plt.xlabel('Month')
plt.ylabel('Temperature')
plt.title("Bar plot")
Output:

plt.bar(df['month'],df['wind'],color = 'r')
plt.xlabel('Month')
plt.ylabel('Wind')
plt.title("Bar plot")
Output:

plt.bar(df['month'],df['rain'])
plt.xlabel('Month')
plt.ylabel('Rain')
plt.title("Bar plot")

Output:

plt.bar(df['month'],df['RH'],color='g')
plt.xlabel('Month')
plt.ylabel('Temperature')
plt.title("Bar plot")
Output:

Observation: It is observed that factors like wind, temperature and RH contribute to the forest fires,
while rain does not contribute much.
Most forest fires have occurred in the months of March and July, which is nearly the summer
season, when the temperature is high.
Forest fires occur in the month of August due to wind and relative humidity.

Scatter plot
#Scatter Plot
plt.scatter(df['month'],df['area'])
plt.xlabel('Month')
plt.ylabel('Area in hectares')
plt.title('Scatter plot')
Output:

plt.scatter(df['month'],df['day'],color='#800000')
plt.xlabel('Month')
plt.ylabel('Day')
plt.title('Scatter Plot')
Output:

plt.scatter(df['month'],df['DC'])
plt.xlabel('Month')
plt.ylabel('DC')
plt.title('Scatter Plot')
Output:

Area plot
months = {"jan":1, "feb":2, "mar":3, "apr":4, "may":5, "jun":6, "jul":7, "aug":8, "sep":9,
"oct":10, "nov":11, "dec":12}
df["month_num"] = df["month"].map(months)
df = df.sort_values(by=["month_num", "day"])
plt.fill_between(range(len(df)), df["area"], color="orange")
plt.xlabel("Time")
plt.ylabel("Area")
plt.title("Distribution of Area over Time")
plt.show()

ffmc_month = df[['FFMC', 'month']]

# group by month and get the mean FFMC value for each month

ffmc_month_mean = ffmc_month.groupby('month')['FFMC'].mean()
# create the area plot
fig, ax = plt.subplots()
ax.fill_between(ffmc_month_mean.index, ffmc_month_mean.values, alpha=0.5)
ax.set(title='Average FFMC by Month', xlabel='Month', ylabel='FFMC')
plt.show()

ffmc_month = df[['DC', 'month']]


# group by month and get the mean FFMC value for each month
ffmc_month_mean = ffmc_month.groupby('month')['DC'].mean()
# create the area plot
fig, ax = plt.subplots()
ax.fill_between(ffmc_month_mean.index, ffmc_month_mean.values,
alpha=0.5,color='#FFC0CB')
ax.set(title='Average DC by Month', xlabel='Month', ylabel='DC')
plt.show()

Histogram
plt.hist(df['X'])

plt.hist(df['Y'])

Exercise 3
Frequency distributions, averages and variability
View(diabetes_data_upload)

#checking for null values


sum(is.na(diabetes_data_upload))
Output:

#There are no null values in the dataset

Mean
#mean of the age
x <- diabetes_data_upload$Age
cat("Mean age of person having diabetes: ",mean(x));
Output:

Observation:
People with an average age of 48 years are affected by diabetes.

library(dplyr)
grouped = diabetes_data_upload %>% group_by(Gender) %>%

summarise(Age = mean(Age))
grouped
Output:

Observation:
The average age of women getting affected by diabetes is 47 and 48 for men.

library(dplyr)
df_grouped = diabetes_data_upload %>% group_by(sudden.weight.loss) %>%
summarise(Age = mean(Age))
df_grouped
Output:

Variability
1. Range
range_age<- range(diabetes_data_upload$Age)
cat("Range with age grouped: ",range_age);
Output:

Observation:
It shows that people start being affected by diabetes as early as the age of 16.

2. Interquartile Range
age <- diabetes_data_upload$Age
q25 <- quantile(age,0.25)
q50 <- quantile(age,0.5)
q75 <- quantile(age,0.75)
q100 <- quantile(age,1)

cat("25% quartile: ",q25,"\n")
cat("50% quartile: ",q50,"\n")
cat("75% quartile: ",q75,"\n")
cat("100% quartile: ",q100,"\n")
cat("Interquartile range: ",IQR(age),"\n")
Output:

3. Variance
age <- diabetes_data_upload$Age
cat("The variance of age: ",var(age))
Output:

4. Standard Deviation
age <- diabetes_data_upload$Age
cat("The standard deviation of age: ",sd(age))
Output:
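For comparison, the same descriptive statistics can be computed in Python with pandas; a minimal sketch, assuming the same dataset has been exported to a CSV file (the file name here is a hypothetical export of the R data):

import pandas as pd

df = pd.read_csv('diabetes_data_upload.csv')  # hypothetical export of the R data
age = df['Age']
print("Range:", age.min(), "-", age.max())
print("Quartiles:\n", age.quantile([0.25, 0.5, 0.75]))
print("IQR:", age.quantile(0.75) - age.quantile(0.25))
print("Variance:", age.var())            # sample variance, same as R's var()
print("Standard deviation:", age.std())  # sample SD, same as R's sd()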

Exercise 4
Normal curves, correlation and scatter plots, and the correlation
coefficient
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from google.colab import drive


drive.mount('/content/drive')
Mounted at /content/drive

df = pd.read_csv('/content/drive/MyDrive/datasets/accident.csv')
df.isnull().sum()

from bokeh.plotting import figure,output_file,show


output_file('scatter_plot.html')
graph = figure(title = "Scatter plot using BOkeh")
x = df['PERSONS']
y = df['COUNTY']
graph.scatter(x,y)
show(graph)

from bokeh.plotting import figure,output_file,show
output_file('person.html')
graph = figure(title="Scatter plot")
x = df['PERSONS']
y = df['VE_TOTAL']
graph.scatter(x,y)
show(graph)

from bokeh.io import show
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from bokeh.transform import linear_cmap
df.columns

Normal Curve:
import numpy as np
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure,output_file

output_file('normal.html')
accidents = df['CITY']
hist,edges = np.histogram(accidents,density = True)

mean = np.mean(accidents)
std_dev = np.std(accidents)

x = np.linspace(mean - 3*std_dev, mean + 3*std_dev, 1000)


y = 1/(np.sqrt(2*np.pi)*std_dev)*np.exp(-(x-mean)**2/(2*std_dev**2))

source = ColumnDataSource({'hist': hist, 'left': edges[:-1], 'right': edges[1:]})


p = figure(title='Accident Histogram with Normal Distribution', tools='')
p.quad(source=source, bottom=0, top='hist', left='left', right='right', fill_color='navy', line_color=
'white', alpha=0.5)

p.line(x, y, line_color='red', line_width=2)

show(p)
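The hand-coded Gaussian density above can equally be obtained from SciPy (already in the tools list); a one-line equivalent sketch:

from scipy.stats import norm
y = norm.pdf(x, loc=mean, scale=std_dev)  # same curve as the explicit formula above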

import numpy as np
import pandas as pd
from bokeh.io import show
from bokeh.models import LinearColorMapper, ColorBar
from bokeh.palettes import Viridis256
from bokeh.plotting import figure, output_file
from bokeh.transform import transform

output_file("heatmap_persons.html")

# Create some sample data: a 10x10 grid of cells with one value per cell
# (random here, purely illustrative -- replace with real aggregates from df)
data = pd.DataFrame({'x': np.repeat(np.arange(10), 10) + 0.5,
                     'y': np.tile(np.arange(10), 10) + 0.5})
data['value'] = np.random.rand(len(data))

# Define the color mapper over the range of the plotted values
mapper = LinearColorMapper(palette=Viridis256, low=data['value'].min(),
                           high=data['value'].max())

# Create the figure and plot the rectangles
p = figure(x_range=(0, 10), y_range=(0, 10))
p.rect(x='x', y='y', width=1, height=1, source=data, line_color=None,
       fill_color=transform('value', mapper))

# Add the color bar
color_bar = ColorBar(color_mapper=mapper, location=(0, 0))
p.add_layout(color_bar, 'right')

# Show the plot
show(p)

Exercise 5
Regression
5a. Regression Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
df = pd.read_csv('/content/drive/MyDrive/datasets/winequality-red.csv')
df.head(3)

df.isnull().sum()

df.columns

import seaborn as sns


sns.regplot(x="fixed acidity",y="quality",data = df)

sns.residplot(x="fixed acidity",y="quality",data=df)

sns.heatmap(df.corr(),annot=True,cmap='YlGnBu',linecolor='r',linewidths=0.5)

fig,[ax0,ax1] = plt.subplots(1,2)
fig.set_size_inches([12,6])
sns.regplot(x='citric acid',y='quality',data=df,color='r',ax=ax0)
sns.residplot(x='citric acid',y='quality',data=df,color='r',ax=ax1)
plt.show()

fig,[ax0,ax1] = plt.subplots(1,2)
fig.set_size_inches([12,6])
sns.regplot(x='sulphates',y='quality',data=df,color='g',ax=ax0)
sns.residplot(x='sulphates',y='quality',data=df,color='g',ax=ax1)
plt.show()

fig,[ax0,ax1] = plt.subplots(1,2)
fig.set_size_inches([12,6])
sns.regplot(x='alcohol',y='quality',data=df,color='g',ax=ax0)
sns.residplot(x='alcohol',y='quality',data=df,color='g',ax=ax1)
plt.show()

import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
X = df.iloc[:,:-1].values
y = df.iloc[:,-1].values
X= sm.add_constant(X)
regression = sm.OLS(y,X).fit()
print(regression.summary())

from sklearn.metrics import mean_squared_error, mean_absolute_error

# get the predicted values


y_pred = regression.predict(X)

# calculate the mean squared error and root mean squared error
mse = mean_squared_error(y, y_pred)
rmse = mean_squared_error(y, y_pred, squared=False)

print("Mean Squared Error (MSE):", mse)


print("Root Mean Squared Error (RMSE):", rmse)

Mean Squared Error (MSE): 0.41676716722140794


Root Mean Squared Error (RMSE): 0.6455750670692045
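train_test_split and r2_score are imported above but never used; a minimal hold-out validation sketch with them (the random_state value is arbitrary), refitting the OLS model on the training portion and scoring it on unseen data:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = sm.OLS(y_train, X_train).fit()    # X already contains the constant column
y_test_pred = model.predict(X_test)
print("Test R-squared:", r2_score(y_test, y_test_pred))
print("Test RMSE:", mean_squared_error(y_test, y_test_pred, squared=False))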

5b. Logistic Regression using R


library(dplyr)
library(ggplot2)

library(caret)
# winequality is assumed to be loaded beforehand, e.g. winequality <- read.csv("winequality-red.csv")
head(winequality)

summary(winequality)

winequality$quality_binary<- ifelse(winequality$quality> 6,1,0)
ggplot(winequality,aes(x=quality,fill=factor(quality_binary)))+geom_bar()

set.seed(123)

train_index<- sample(seq_len(nrow(winequality)),size = 0.7 * nrow(winequality))
train_data<- winequality[train_index, ]
test_data<- winequality[-train_index, ]

model <- glm(quality_binary ~ .,data=train_data,family=binomial)


summary(model)

pred<- predict(model, newdata = test_data, type = "response")
# create a new dataframe to store the actual and predicted classes
predictions <- data.frame(actual = test_data$quality_binary,
predicted = ifelse(pred>= 0.5, 1, 0))
predictions

library(pROC)
roc <- roc(test_data$quality_binary,pred)
plot(roc)
auc(roc)

Exercise 6
Z – Test
6a. Z – test for one sample
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
df = pd.read_csv('/content/drive/MyDrive/datasets/student_clustering.csv')
df.head(3)

df.isnull().sum()

import numpy as np
import statsmodels.stats.api as sms
df['iq'].mean()
101.995
H0: mean = 101    H1: mean != 101 (two-sided test)
test_stat,p_value = sms.ztest(df['iq'],value=101)
print("Test Statistic: ",test_stat)
print("p-value: ",p_value)

print("\n ")
print("************Conclusion*************")
alpha = 0.05
if (p_value < alpha):
print("\nReject the null hypothesis\n The IQ value of the sample is not equal to the population
mean\nThe new students IQ is below 101")
else:
print("\nFail to reject the null hypothesis\n The IQ value of the sample is less than or equal to th
e population mean\nThe new students also have the IQ same as the population")
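By default ztest is two-sided; if the question is really one-sided (e.g. H1: mean > 101), statsmodels accepts an alternative argument, sketched below:

test_stat, p_value = sms.ztest(df['iq'], value=101, alternative='larger')  # H1: mean > 101
print("One-sided p-value: ", p_value)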

6b . Z – test for two independent samples
#b) Two sample Z-test

df1 = pd.read_csv('/content/drive/MyDrive/datasets/weight-height.csv')
df1.head(3)

df2 = df1.drop('Weight',axis=1)
male_df = df2[df2['Gender'] == 'Male']
female_df = df2[df2['Gender'] == 'Female']
from statsmodels.stats.weightstats import ztest

sample1 = male_df['Height']
sample2 = female_df['Height']
mean1,mean2 = np.mean(sample1),np.mean(sample2)
std1,std2 = np.std(sample1), np.std(sample2)

z_score,p_value = ztest(x1=sample1,x2=sample2,value=0)

print("The mean of Sample1: Height of Male: ",mean1)


print("The mean of Sample2: Height of Female: ",mean2)
print("The Z score obtained: ",z_score)
print("The P-value: ",p_value)

alpha = 0.05
if p_value<alpha:
    print("Reject the null hypothesis\nThe difference between the population means is not zero")
else:
    print("Fail to reject the null hypothesis\nThe difference between the population means is zero")

6c. Z – test on paired sample


#c) Paired Z-test
import numpy as np
from scipy.stats import norm

cgpa_1_sem = np.array([7.8,7.7,6.5,9.8,7.0,7.9,8.0,8.5,7.4,8.8])
cgpa_2_sem = np.array([8.4,8.0,6.0,7.9,8.9,7.8,8.0,7.5,6.6,8.0])
diff = np.array(cgpa_2_sem) - np.array(cgpa_1_sem)
mean_diff = np.mean(diff)
std_diff = np.std(diff,ddof=1)
n = len(diff)
se_diff = std_diff/np.sqrt(n)
null_hypothesis = 0
z_score = mean_diff/se_diff
p_value = 2 * norm.cdf(-np.abs(z_score))
print("Mean difference:", mean_diff)
print("Standard deviation of difference:", std_diff)
print("Standard error of the mean difference:", se_diff)
print("Z-score:", z_score)
print("P-value:", p_value)

print("\n*********************************")
print("************Conclusion*************")

alpha = 0.05
if p_value<alpha:
    print("Reject the null hypothesis\nThis indicates that the difference between the population means is not zero")
else:
    print("Fail to reject the null hypothesis\nThis indicates that the difference between the population means is zero")

Exercise 7
T – test
7a. t – test for one sample
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

df = pd.read_csv('/content/drive/MyDrive/datasets/winequality-red.csv')
df.head(3)

df['fixed acidity'].mean()
8.31963727329581

from scipy.stats import ttest_1samp


t_statistic,p_value = ttest_1samp(df['fixed acidity'],8.3)
print("T-Statistic: {:.2f}, P-Value: {:.2f}".format(t_statistic,p_value))

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis\nThe population mean differs significantly from 8.3")
else:
    print("Fail to reject the null hypothesis\nThere is no evidence that the population mean differs from 8.3")

7b t-test for two independent samples
import pandas as pd
import numpy as np
import scipy.stats as stats
data = pd.read_csv('/content/drive/MyDrive/datasets/weight-height.csv')
data.head(3)

male_weight = data[data['Gender']=='Male']
female_weight = data[data['Gender']=='Female']
sample1 = male_weight['Weight']
sample2 = female_weight['Weight']
t_stat , p_value = stats.ttest_ind(sample1,sample2)
print('t-statistic: ',t_stat)
print("p_value: ",p_value)

alpha = 0.05
print("\n*****************************")
print("\n*********Conclusion**********")
if p_value < alpha:
    print('\nReject the null hypothesis\nThere is a difference between the means of the populations')
else:
    print("\nFail to reject the null hypothesis\nThere is no difference between the means of the populations")
print("\n**************************")

7c t-test for paired samples
marks_1_sem = [8.9,8.8,7.9,6.7,8.7,7.8,9.0,6.7,8.9,8.0]
marks_2_sem = [9.0,7.8,8.0,6.7,9.0,6.8,8.9,7.4,8.0,6.7]
import scipy.stats as stats
t_stat,p_value = stats.ttest_rel(marks_1_sem,marks_2_sem)
print("t-statistic: ",t_stat)
print("p_value: ",p_value)

alpha = 0.05
print("\n*****************************")
print("\n*********Conclusion**********")
if p_value < alpha:
    print('\nReject the null hypothesis\nThere is a difference between the means of the populations')
else:
    print("\nFail to reject the null hypothesis\nThere is no difference between the means of the populations")
print("\n**************************")

Exercise 8
ANOVA
8a. One-way ANOVA test
View(DATA_RELEASE)

sum(is.na(DATA_RELEASE))

model <- aov(dnn_confidence ~ dnn_prediction,


data=DATA_RELEASE)
summary(model)

library(ggplot2)
ggplot(DATA_RELEASE, aes(x=dnn_prediction, y=dnn_confidence)) +
geom_boxplot() +
labs(x="DNN Prediction", y="DNN Confidence") +
ggtitle("Boxplot of DNN Confidence by DNN Prediction")

Conclusion:
There is a significant difference between the means of the dnn_confidence scores across the
levels of the dnn_prediction factor (F(5, 9594) = 216, p < 0.001). The null hypothesis that there
is no difference between the means is rejected
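The same one-way ANOVA can be reproduced in Python with scipy.stats.f_oneway; a minimal sketch, assuming the DATA_RELEASE table has been exported to a CSV (the file name is hypothetical):

import pandas as pd
from scipy.stats import f_oneway

df = pd.read_csv('DATA_RELEASE.csv')  # hypothetical export of the R data
groups = [g['dnn_confidence'].values for _, g in df.groupby('dnn_prediction')]
f_stat, p_value = f_oneway(*groups)   # one F statistic across all prediction levels
print("F =", f_stat, ", p =", p_value)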

8b. Pairwise ANOVA test:

# Pairwise ANOVA
model2 <- aov(dnn_confidence ~ tested_occupation + dnn_prediction +
tested_occupation:dnn_prediction, data = DATA_RELEASE)
summary(model2)

# perform pairwise ANOVA for tested_occupation factor
pw1<-pairwise.t.test(DATA_RELEASE$dnn_confidence,
DATA_RELEASE$tested_occupation, p.adjust.method = "bonferroni")
print(pw1)

# perform pairwise ANOVA for dnn_prediction factor


pw2<-pairwise.t.test(DATA_RELEASE$dnn_confidence, DATA_RELEASE$dnn_prediction,
p.adjust.method = "bonferroni")
print(pw2)

# perform pairwise ANOVA for interaction of tested_occupation and dnn_prediction factors


pw3<-pairwise.t.test(DATA_RELEASE$dnn_confidence,
interaction(DATA_RELEASE$tested_occupation, DATA_RELEASE$dnn_prediction),
p.adjust.method = "bonferroni")

print(pw3)

library(ggpubr)
ggboxplot(DATA_RELEASE, x = "tested_occupation", y = "dnn_confidence",
color = "tested_occupation", palette = "jco") +
stat_compare_means(comparisons = list(c("Accountant", "Data Analyst"),
c("Accountant", "Software Developer"),
c("Data Analyst", "Marketing Manager"),
c("Marketing Manager", "Software Developer")),
method = "t.test", label = "p.signif") +
labs(title = "Comparison plot for dnn_confidence by tested_occupation",
x = "Tested occupation", y = "DNN confidence")

Conclusion:
The conclusion is that both tested_occupation and dnn_prediction have a significant impact on
the dnn_confidence, and the interaction between them also has a significant effect. Additionally,
there are significant pairwise differences in dnn_confidence between different levels of
tested_occupation and dnn_prediction.

8c. Two-way ANOVA test:
model1 <- aov(dnn_confidence ~ tested_occupation
* dnn_prediction, data=DATA_RELEASE)
summary(model1)

library(ggplot2)
ggplot(DATA_RELEASE, aes(x = tested_occupation, y = dnn_confidence, color =
dnn_prediction)) +
geom_point() +
stat_smooth(method = "lm", se = FALSE) +
labs(title = "Interaction plot for dnn_confidence by tested_occupation and dnn_prediction",
x = "Tested occupation", y = "DNN confidence")

Conclusion:
From the two-way ANOVA, we can see that both the tested_occupation factor and the dnn_prediction
factor have a significant effect on the dnn_confidence variable, as indicated by the extremely
small p-values (< 2e-16) and the significant F values. The interaction term between the
tested_occupation and dnn_prediction factors is also significant, indicating that the effect of one
factor on the dnn_confidence variable is dependent on the level of the other factor.

Exercise 9
Building and validating linear models
View(diamond)

#Check for null values


sum(is.na(diamond))

#Display the head of the data


head(diamond)

#The information of the data can be obtained by


summary(diamond)

Linear Regression:
#The built in Linear Regression model lm() is used
#to train the data
model <- lm(price ~ carat+depth+table,data=diamond)
summary(model)

# Make predictions using the model
new_data<- data.frame(carat = c(0.5, 0.75, 1.0), depth = c(61.0, 62.0, 63.0), table = c(58, 60,
62))
new_data$predicted_price<- predict(model,new_data)

new_data$predicted_price

summary(model)$r.squared

Polynomial Regression:
#Polynomial Regression
model1 <- lm(price ~poly(carat,2),data=diamond)
summary(model1)

new_data$prediction_poly<- predict(model1,new_data)
new_data$prediction_poly

summary(model1)$r.squared

Bayesian Linear Regression:


#Bayesian Linear Regression
library(arm)

# Fit a Bayesian linear regression model


model.bayes<- bayesglm(price ~ carat + cut + color + clarity + depth + table + x + y + z, data =
diamond)

# Print the model summary


summary(model.bayes)

# Create new data to predict on


new_data<- data.frame(carat = c(0.3, 0.5, 0.7),
cut = c("Ideal", "Premium", "Good"),

color = c("D", "E", "F"),
clarity = c("VS1", "VS2", "SI1"),
depth = c(61.5, 62.0, 63.5),
table = c(55, 59, 63),
x = c(4.29, 4.96, 5.67),
y = c(4.32, 4.97, 5.70),
z = c(2.65, 3.08, 3.62))

# Predict the prices for the new data using the Bayesian linear regression model
new_data$predicted_bayes<- predict(model.bayes, newdata = new_data)

# Print the predicted prices


print(new_data$predicted_bayes)

library(caret)

# Split the data into training and testing sets


set.seed(123)
train_index<- createDataPartition(diamond$price, p = 0.7, list = FALSE)
train_data<- diamond[train_index, ]
test_data<- diamond[-train_index, ]

# Fit the Bayesian linear regression model on the training data


model.bayes<- bayesglm(price ~ carat + cut + color + clarity + depth + table + x + y + z, data =
train_data)

# Predict the prices for the test data using the Bayesian linear regression model
test_data$predicted_price<- predict(model.bayes, newdata = test_data)

# Calculate the MSE, RMSE, MAE, and R-squared


mse<- mean((test_data$price - test_data$predicted_price)^2)
rmse<- sqrt(mse)
mae<- mean(abs(test_data$price - test_data$predicted_price))
rsq<- cor(test_data$price, test_data$predicted_price)^2

# Print the metrics


print(paste0("MSE: ", mse))
print(paste0("RMSE: ", rmse))
print(paste0("MAE: ", mae))
print(paste0("R-squared: ", rsq))

library(ggplot2)
# Get the predictions for each of the three models
preds1 <- predict(model, newdata = diamond)
preds2 <- predict(model1, newdata = diamond)
preds3 <- predict(model.bayes, newdata = diamond)

# Create a data frame with the actual values and the predicted values for each model
df <- data.frame(actual = diamond$price,
model = preds1,
model1 = preds2,
model.bayes = preds3)

# Convert the data frame to a long format


df_long<- tidyr::gather(df, model, predicted, -actual)

# Create a scatter plot of the actual values versus the predicted values for each model
ggplot(df_long, aes(x = actual, y = predicted, color = model)) +
geom_point() +
scale_color_manual(values = c("red", "green", "blue")) +
labs(x = "Actual Price", y = "Predicted Price", color = "Model")

9b) Linear Regression to predict top 10 players
View(Chess)

#Check for null values


sum(is.na(Chess))

# Perform a linear regression of ELO on Age
model <- lm(ELO ~ Age, data = Chess)
summary(model)

library(ggplot2)

# Create a scatter plot of ELO vs Age with regression line


ggplot(Chess, aes(x = Age, y = ELO)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Age", y = "ELO", title = "Chess Player Rankings in Jan 2021")

library(dplyr)

# The Chess data frame is assumed to be loaded beforehand (see View(Chess) above)

# Predict ELO rating for each player using the linear regression model
players <- Chess %>%
mutate(Predicted_ELO = predict(lm(ELO ~ Age, data = Chess)))

# Rank players by predicted ELO rating and select top 10


top_players<- players %>%
arrange(desc(Predicted_ELO)) %>%
slice_head(n = 10)

# Print top 10 players


top_players

Exercise 10
Building and Validating Logistic models
10a. Logistic Regression on Titanic DataSet
View(titanic)

library(dplyr)
library(caret)
library(ggplot2)

titanic <- na.omit(titanic)

titanic_updated<- select(titanic,-PassengerId,-Sex,-Name,-Ticket,-Cabin,-Embarked)
head(titanic_updated)

titanic$Gender_Survived<- ifelse(titanic$Sex=='male',1,0)
ggplot(titanic,aes(x=Survived,fill=factor(Gender_Survived)))+geom_bar()

train_index<-sample(seq_len(nrow(titanic_updated)),size = 0.7*nrow(titanic_updated))
train_data<- titanic_updated[train_index,]
test_data<- titanic_updated[-train_index,]

model <- glm(Survived ~.,data=train_data,family=binomial)


summary(model)

pred<- predict(model,newdata=test_data,type="response")
predictions<- data.frame(actual=test_data$Survived,predicted = ifelse(pred>=0.5,1,0))
predictions

library(pROC)
roc <- roc(test_data$Survived,pred)
plot(roc)

auc(roc)

10b. Logistic Regression on Breast Cancer Dataset


head(Breast_cancer_data)

#check for null values


sum(is.na(Breast_cancer_data))

library(corrplot)
corr<- cor(Breast_cancer_data)
corrplot(corr,method='circle')

library(dplyr)
library(ggplot2)
library(caret)
set.seed(1234)
train_index<- sample(seq_len(nrow(Breast_cancer_data)),size=0.7*nrow(Breast_cancer_data))
train_data<- Breast_cancer_data[train_index,]
test_data<- Breast_cancer_data[-train_index,]
head(train_data)

head(test_data)

model <- glm(diagnosis ~., data=train_data,family=binomial)
summary(model)

# compute predicted probabilities on the test set (missing in the original listing)
pred <- predict(model, newdata = test_data, type = "response")
library(pROC)
roc <- roc(test_data$diagnosis,pred)
plot(roc)
auc(roc)

10c) Finding the odds ratio
df<- read.csv('C:/Users/21ad016/Desktop/Datasets/framingham.csv')
sum(is.na(df))
645
df<- na.omit(df)
head(df)

library(corrplot)
correlation <- cor(df)

corrplot(correlation)

set.seed(123)
trainIndex<- sample(1:nrow(df), 0.8*nrow(df))
trainData<- df[trainIndex, ]
testData<- df[-trainIndex, ]

model <- glm(male ~.,data=trainData,family=binomial)

summary(model)

exp(coef(model))

10d) Logistic Regression using Scikit-Learn


import pandas as pd
import numpy as np

df = pd.read_csv(r'C:\Users\21ad016\Desktop\Datasets\archive_iris\Iris.csv')
df.head(3)

df.isna().sum()

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(df.drop('Species', axis=1), df['Species'],


test_size=0.3)

lr_model = LogisticRegression()
# Fit the model to the training data
lr_model.fit(X_train, y_train)

LogisticRegression()

y_pred = lr_model.predict(X_test)

# Evaluate the accuracy of the model


accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)

Exercise 11
Time series analysis
11 Time series Analysis using R
df<- read.csv('C:/Users/21ad016/Desktop/Datasets/test.csv')
head(df)

df$Date<- as.Date(df$Date)
sales_ts<- ts(df$number_sold, start = min(df$Date), frequency = 7)
plot(sales_ts, main = "Weekly Sales Data", xlab = "Date", ylab = "Number Sold")

sales_decomp<- decompose(sales_ts)
plot(sales_decomp)

library(forecast)
sales_model<- auto.arima(sales_ts)
sales_forecast<- forecast(sales_model, h = 7)
plot(sales_forecast, main = "Sales Forecast", xlab = "Date", ylab = "Number Sold")
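The same decomposition-and-plot workflow can also be done in Python with statsmodels; a minimal sketch, assuming the same test.csv with Date and number_sold columns:

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

df = pd.read_csv('test.csv', parse_dates=['Date'])
series = df.set_index('Date')['number_sold']
result = seasonal_decompose(series, model='additive', period=7)  # weekly seasonality
result.plot()
plt.show()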

Exercise 12
Regression – Least Squares
from os.path import basename, exists

def download(url):
    # fetch the file only if it is not already present locally
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)

download("C:\Users\21ad016\Desktop\data")
download("C:\Users\21ad016\Desktop\data1")
import numpy as np

import random

import thinkstats2
import thinkplot
download("C:\Users\21ad016\Desktop\nsfg.py")
download("C:\Users\21ad016\Desktop\first.py")
download("C:\Users\21ad016\Desktop\2002FemPreg.dct")
download(
C:\Users\21ad016\Desktop\2002FemPreg.dat.gz"
)
import first
live, firsts, others = first.MakeFrames()
live = live.dropna(subset=['agepreg', 'totalwgt_lb'])
ages = live.agepreg
weights = live.totalwgt_lb
live.head(4)

from thinkstats2 import Mean, MeanVar, Var, Std, Cov

def LeastSquares(xs, ys):
    meanx, varx = MeanVar(xs)
    meany = Mean(ys)

    slope = Cov(xs, ys, meanx, meany) / varx
    inter = meany - slope * meanx

    return inter, slope
inter, slope = LeastSquares(ages, weights)
inter, slope

inter + slope * 25

slope * 10
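As a sanity check, NumPy's polyfit with degree 1 should reproduce the same slope and intercept:

# np.polyfit returns coefficients highest degree first: (slope, intercept)
slope_np, inter_np = np.polyfit(ages, weights, 1)
print(inter_np, slope_np)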

def FitLine(xs, inter, slope):
    fit_xs = np.sort(xs)
    fit_ys = inter + slope * fit_xs
    return fit_xs, fit_ys

fit_xs, fit_ys = FitLine(ages, inter, slope)
fit_xs, fit_ys = FitLine(ages, inter, slope)
thinkplot.Scatter(ages, weights, color='red', alpha=0.1, s=10)
thinkplot.Plot(fit_xs, fit_ys, color='green', linewidth=3)
thinkplot.Plot(fit_xs, fit_ys, color='black', linewidth=2)
thinkplot.Config(xlabel="Mother's age (years)",
ylabel='Birth weight (lbs)',
axis=[10, 45, 0, 15],
legend=False)

live.info()

import seaborn as sns
sns.histplot(live['agepreg'], kde=True)       # distplot is deprecated in recent seaborn

sns.histplot(live['totalwgt_lb'], kde=True)

def Residuals(xs, ys, inter, slope):
    xs = np.asarray(xs)
    ys = np.asarray(ys)
    res = ys - (inter + slope * xs)
    return res

live['residual'] = Residuals(ages, weights, inter, slope)

bins = np.arange(10, 48, 3)
indices = np.digitize(live.agepreg, bins)
groups = live.groupby(indices)

age_means = [group.agepreg.mean() for _, group in groups][1:-1]


age_means

cdfs = [thinkstats2.Cdf(group.residual) for _, group in groups][1:-1]


def PlotPercentiles(age_means, cdfs):
    thinkplot.PrePlot(3)
    for percent in [75, 50, 25]:
        weight_percentiles = [cdf.Percentile(percent) for cdf in cdfs]
        label = '%dth' % percent
        thinkplot.Plot(age_means, weight_percentiles, label=label)

PlotPercentiles(age_means, cdfs)

thinkplot.Config(xlabel="Mother's age (years)",
                 ylabel='Residual (lbs)',
                 xlim=[10, 45])

def SampleRows(df, nrows, replace=False):
    indices = np.random.choice(df.index, nrows, replace=replace)
    sample = df.loc[indices]
    return sample

def ResampleRows(df):
    return SampleRows(df, len(df), replace=True)

def SamplingDistributions(live, iters=101):
    t = []
    for _ in range(iters):
        sample = ResampleRows(live)
        ages = sample.agepreg
        weights = sample.totalwgt_lb
        estimates = LeastSquares(ages, weights)
        t.append(estimates)

    inters, slopes = zip(*t)
    return inters, slopes

inters, slopes = SamplingDistributions(live, iters=1001)

def Summarize(estimates, actual=None):
    mean = Mean(estimates)
    stderr = Std(estimates, mu=actual)
    cdf = thinkstats2.Cdf(estimates)
    ci = cdf.ConfidenceInterval(90)
    print('mean, SE, CI', mean, stderr, ci)

Summarize(inters)
Summarize(slopes)

for slope, inter in zip(slopes, inters):
    fxs, fys = FitLine(age_means, inter, slope)
    thinkplot.Plot(fxs, fys, color='gray', alpha=0.01)

thinkplot.Config(xlabel="Mother's age (years)",
                 ylabel='Residual (lbs)',
                 xlim=[10, 45])

def PlotConfidenceIntervals(xs, inters, slopes, percent=90, **options):
    fys_seq = []
    for inter, slope in zip(inters, slopes):
        fxs, fys = FitLine(xs, inter, slope)
        fys_seq.append(fys)

    p = (100 - percent) / 2
    percents = p, 100 - p
    low, high = thinkstats2.PercentileRows(fys_seq, percents)
    thinkplot.FillBetween(fxs, low, high, **options)

PlotConfidenceIntervals(age_means, inters, slopes, percent=90,
                        color='gray', alpha=0.3, label='90% CI')
PlotConfidenceIntervals(age_means, inters, slopes, percent=50,
                        color='gray', alpha=0.5, label='50% CI')

thinkplot.Config(xlabel="Mother's age (years)",
                 ylabel='Residual (lbs)',
                 xlim=[10, 45])

def ResampleRowsWeighted(df, column='finalwgt'):
    weights = df[column]
    cdf = thinkstats2.Cdf(dict(weights))
    indices = cdf.Sample(len(weights))
    sample = df.loc[indices]
    return sample

iters = 100
estimates = [ResampleRowsWeighted(live).totalwgt_lb.mean()
             for _ in range(iters)]
Summarize(estimates)

estimates = [thinkstats2.ResampleRows(live).totalwgt_lb.mean()
             for _ in range(iters)]
Summarize(estimates)

