Data Science Lab Manual
MASTER RECORD
B.Tech. Artificial Intelligence and Data Science
II year/IV semester
R2021
Prepared by,
Dr. M. Kaliappan HOD/AD
Dr. S. Vimal ASP/AD
Tamil Nadu
VISION AND MISSION
Vision of the Institute
To evolve as an Institute of international repute in offering high-quality technical
education, research and extension programmes, in order to create knowledgeable,
professionally competent and skilled engineers and technologists capable of
working in a multi-disciplinary environment to cater to societal needs.
Mission of the Institute
To accomplish its unique vision, the Institute has a far-reaching mission that aims:
• To offer higher education in Engineering and Technology with the highest level of quality,
professionalism and ethical standards
• To equip the students with up-to-date knowledge in cutting-edge technologies, wisdom,
creativity and passion for innovation, and life-long learning skills
• To constantly motivate and involve the students and faculty members in the education process
for continuously improving their performance to achieve excellence.
Vision of the Department
To impart international quality education, promote collaborative research and
graduate industry-ready engineers in the domain of Artificial Intelligence and Data
Science to serve the society.
Program Educational Outcomes (PEOs)
After successful completion of the degree, the students will be able to:
PEO 1. Apply Artificial Intelligence and Data Science techniques with industrial
standards and pioneering research to solve social and environment-related
problems, creating a sustainable ecosystem.
Program Outcomes (POs)
Engineering Graduates will be able to:
o Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering
problems.
o Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of mathematics,
natural sciences, and engineering sciences.
o Design/development of solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and environmental
considerations.
o Conduct investigations of complex problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and synthesis of the
information to provide valid conclusions.
o Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities
with an understanding of the limitations.
o The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to
the professional engineering practice.
o Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need
for sustainable development.
o Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
o Individual and team work: Function effectively as an individual, and as a member or leader in
diverse teams, and in multidisciplinary settings.
o Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and receive
clear instructions.
o Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one's own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.
o Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.
INSTRUCTIONS TO THE STUDENTS
Students should wear uniforms and lab coats neatly during the lab session.
Students should maintain silence during lab hours; roaming around the lab
during the lab session is not permitted.
Programs should be written in the manual and well prepared for the current
exercise before coming to the session.
Before every session, the previous session's lab exercise and record should be
completed and verified by the faculty.
In the record note, the flow chart and outputs should be written on the left
side, while the aim, algorithm and result should be written on the right side.
Performance 25 Marks
Viva 10 Marks
Record 15 Marks
Total 50 Marks
PREFACE
The current year's manual (2022-2023) differs from the previous year's manual
in numerous ways. Course objectives and outcomes, and the mapping of the lab
exercises to those outcomes, are included in this manual. All the lab exercises
have been revised and updated in many places. New exercises are added beyond
the university syllabus.
Viva questions related to each exercise are included at the end of every lab
exercise, including the exercises beyond the university syllabus.
CONTENTS
Ex. no  Experiment
        Syllabus
        CO Mapping
1       Working with Numpy and Pandas arrays
2       Basic plots using Matplotlib
3       Frequency distributions, variability and variance
4       Normal curves, correlation and scatter plots and correlation coefficient
5       Regression
6       Z-test
7       T-test
8       ANOVA
9       Building and validating linear models
10      Building and validating logistic models
11      Time series analysis
12      Regression – Least squares
SYLLABUS
AD3411 DATA SCIENCE AND ANALYTICS LABORATORY    L T P C
                                                0 0 4 2
OBJECTIVES:
• To develop data analytic code in Python
• To be able to use Python libraries for handling data
• To develop analytical applications using Python
• To perform data visualization using plots
OUTCOMES: Upon successful completion of this course, students will be able to:
CO1. Write python programs to handle data using Numpy and Pandas
CO2. Perform descriptive analytics
CO3. Perform data exploration using Matplotlib
CO4. Perform inferential data analytics
CO5. Build models of predictive analytics
CO Mapping
Ex. no  Experiment                   CO
6       Z-Test                       CO2
7       T-Test                       CO2
8       ANOVA                        CO2
12      Regression – Least squares   CO5
Exercise 1
Working with Numpy and Pandas DataFrames
1 a) Simple Pandas programs
1. Write a Pandas program to get the powers of an array's values element-wise.
Program:
import pandas as pd
data=pd.DataFrame({'x':[78,90,89,78],'y':[67,67,87,79],'z':[56,89,57,90]})
print(data)
Output:
x y z
0 78 67 56
1 90 67 89
2 89 87 57
3 78 79 90
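Note that the code above only builds and prints the frame; the element-wise power step named in the problem is missing here. A minimal sketch, assuming squaring as the power:
print(data ** 2)  # element-wise square of every value in the DataFrame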
2. Write a Pandas program to create and display a DataFrame from a specified dictionary
data which has the index labels
import pandas as pd
details = {
'Name' : ['Ankit', 'Aishwarya', 'Shaurya', 'Shivangi'],
'Age' : [23, 21, 22, 21],
'University' : ['BHU', 'JNU', 'DU', 'BHU'],
}
df = pd.DataFrame(details)
df
Name Age University
0 Ankit 23 BHU
1 Aishwarya 21 JNU
2 Shaurya 22 DU
3 Shivangi 21 BHU
3.Write a Pandas program to display a summary of the basic information about a specified
DataFrame and its data. Sample Python dictionary data and list labels:
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data, index=labels)
print("Summary of the basic information about this DataFrame and its data:")
print(df.info())
Summary of the basic information about this DataFrame and its data:
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
# Column Non-Null Count Dtype
4. Write a Pandas program to get the first 3 rows of a given DataFrame.
df = pd.DataFrame(exam_data, index=labels)
print("First three rows of the data frame:")
print(df.iloc[:3])
First three rows of the data frame:
name score attempts qualify
a Anastasia 12.5 1 yes
b Dima 9.0 3 no
c Katherine 16.5 2 yes
5.Write a Pandas program to select the 'name' and 'score' columns from the following DataFrame
import pandas as pd
import numpy as np
# exam_data and labels as defined in problem 3
df = pd.DataFrame(exam_data, index=labels)
print("Select specific columns:")
print(df[['name', 'score']])
Select specific columns:
name score
a Anastasia 12.5
b Dima 9.0
c Katherine 16.5
d James NaN
e Emily 9.0
f Michael 20.0
g Matthew 14.5
h Laura NaN
i Kevin 8.0
j Jonas 19.0
6. Write a Pandas program to select the specified columns and rows from a given data
frame.
import pandas as pd
import numpy as np
# exam_data and labels as defined in problem 3
df = pd.DataFrame(exam_data, index=labels)
print("Select specific columns and rows:")
print(df.iloc[[1, 3, 5, 6], [1, 3]])
Select specific columns and rows:
score qualify
b 9.0 no
d NaN no
f 20.0 yes
g 14.5 yes
7. Write a Pandas program to select the rows where the number of attempts in the
examination is greater than 2
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print("Number of attempts in the examination is less than 2 and score greater than 15 :")
print(df[(df['attempts'] <2) & (df['score'] >15)])
Number of attempts in the examination is less than 2 and score greater than 15 :
name score attempts qualify
j Jonas 19.0 1 yes
8. Write a Pandas program to count the number of rows and columns of a DataFrame
import pandas as pd
dict= {'Name' : ['Martha', 'Tim', 'Rob', 'Georgia'],
'Marks' : [87, 91, 97, 95]}
df = pd.DataFrame(dict)
display(df)
rows = df.shape[0]
cols = df.shape[1]
print("Rows: "+str(rows))
print("Columns: "+str(cols))
Name Marks
0 Martha 87
1 Tim 91
2 Rob 97
3 Georgia 95
Rows: 4
Columns: 2
9. Write a Pandas program to select the rows where the score is missing, i.e. is NaN.
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print("Rows where score is missing:")
print(df[df['score'].isnull()])
Rows where score is missing:
name score attempts qualify
d James NaN 3 no
h Laura NaN 1 no
10. Write a Pandas program to select the rows where the score is between 15 and 20 (inclusive).
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print("Rows where score between 15 and 20 (inclusive):")
print(df[df['score'].between(15, 20)])
Rows where score between 15 and 20 (inclusive):
name score attempts qualify
c Katherine 16.5 2 yes
f Michael 20.0 3 yes
j Jonas 19.0 1 yes
11.Write a Pandas program to select the rows where number of attempts in the examination is
less than 2 and score greater than 15.
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print("Number of attempts in the examination is less than 2 and score greater than 15 :")
print(df[(df['attempts'] <2) & (df['score'] >15)])
Number of attempts in the examination is less than 2 and score greater than 15 :
name score attempts qualify
j Jonas 19.0 1 yes
12.Write a Pandas program to change the score in row 'd' to 11.5.
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print("\nOriginal data frame:")
print(df)
print("\nChange the score in row 'd' to 11.5:")
df.loc['d', 'score'] =11.5
print(df)
i Kevin 8.0 2 no
j Jonas 19.0 1 yes
13.Write a Pandas program to calculate the sum of the examination attempts by the students
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print("\nSum of the examination attempts by the students:")
print(df['attempts'].sum())
14. Write a Pandas program to calculate the mean score for each different student in the data frame.
df = pd.DataFrame(exam_data, index=labels)
print("\nMean score for each different student in data frame:")
print(df['score'].mean())
16. Write a Pandas program to create a DataFrame from a list of lists.
import pandas as pd
age_list = [['Afghanistan', 1952, 8425333, 'Asia'],
['Australia', 1957, 9712569, 'Oceania'],
['Brazil', 1962, 76039390, 'Americas'],
['China', 1957, 637408000, 'Asia'],
['France', 1957, 44310863, 'Europe'],
['India', 1952, 3.72e+08, 'Asia'],
['United States', 1957, 171984000, 'Americas']]
df = pd.DataFrame(age_list, columns=['Country', 'Year',
'Population', 'Continent'])
df
Country Year Population Continent
0 Afghanistan 1952 8425333.0 Asia
1 Australia 1957 9712569.0 Oceania
2 Brazil 1962 76039390.0 Americas
3 China 1957 637408000.0 Asia
4 France 1957 44310863.0 Europe
5 India 1952 372000000.0 Asia
6 United States 1957 171984000.0 Americas
17. Write a Pandas program to replace the 'qualify' column's 'yes' and 'no' values with True and False.
import numpy as np
import pandas as pd
# exam_data as defined in problem 3
df = pd.DataFrame(exam_data)
r1 = df.replace('yes', 'true')
r2 = df.replace('no', 'false')
r1
name score attempts qualify
0 Anastasia 12.5 1 true
1 Dima 9.0 3 no
2 Katherine 16.5 2 true
3 James NaN 3 no
4 Emily 9.0 2 no
5 Michael 20.0 3 true
6 Matthew 14.5 1 true
7 Laura NaN 1 no
8 Kevin 8.0 2 no
9 Jonas 19.0 1 true
18. Write a Pandas program to change the name 'James' to 'Suresh' in the name column of the DataFrame.
import pandas as pd
# exam_data and labels as defined in problem 3
df = pd.DataFrame(exam_data, index=labels)
df['name'] = df['name'].replace('James', 'Suresh')
print(df)

20. Write a Pandas program to delete the 'attempts' column from the DataFrame.
df = pd.DataFrame(exam_data, index=labels)
print("Original rows:")
print(df)
df.pop('attempts')
print(df)
Original rows:
name score attempts qualify
a Anastasia 12.5 1 yes
b Dima 9.0 3 no
c Katherine 16.5 2 yes
d James NaN 3 no
e Emily 9.0 2 no
f Michael 20.0 3 yes
g Matthew 14.5 1 yes
h Laura NaN 1 no
i Kevin 8.0 2 no
j Jonas 19.0 1 yes
21. Write a Pandas program to iterate over rows in a DataFrame.
import pandas as pd
import numpy as np
exam_data = [{'name':'Anastasia', 'score':12.5}, {'name':'Dima','score':9},
{'name':'Katherine','score':16.5}]
df = pd.DataFrame(exam_data)
for index, row in df.iterrows():
print(row['name'], row['score'])
Anastasia 12.5
Dima 9.0
Katherine 16.5
22. Write a Pandas program to get a list of the DataFrame column headers.
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print(list(df.columns.values))
['name', 'score', 'attempts', 'qualify']
23. Write a Pandas program to select rows from a given DataFrame based on values in some
columns.
import pandas as pd
import numpy as np
d = {'col1': [1, 4, 3, 4, 5], 'col2': [4, 5, 6, 7, 8], 'col3': [7, 8, 9, 0, 1]}
df = pd.DataFrame(data=d)
print('Rows for column1 value == 4')
print(df.loc[df['col1'] == 4])
Rows for column1 value == 4
col1 col2 col3
1 4 5 8
3 4 7 0
25. Write a Pandas program to add one row to an existing DataFrame.
import pandas as pd
d = {'col1': [1, 4, 3, 4, 5], 'col2': [4, 5, 6, 7, 8], 'col3': [7, 8, 9, 0, 1]}
df = pd.DataFrame(data=d)
print('After add one row:')
row = {'col1': 10, 'col2': 11, 'col3': 12}
# df.append() was removed in pandas 2.x; pd.concat is the current idiom
df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
print(df)
After add one row:
col1 col2 col3
0 1 4 7
1 4 5 8
2 3 6 9
3 4 7 0
4 5 8 1
5 10 11 12
26. Write a Pandas program to write a DataFrame to CSV file using tab separator.
import pandas as pd
import numpy as np
d = {'col1': [1, 4, 3, 4, 5], 'col2': [4, 5, 6, 7, 8], 'col3': [7, 8, 9, 0, 1]}
df = pd.DataFrame(data=d)
print('Data from new_file.csv file:')
df.to_csv('new_file.csv', sep='\t', index=False)
new_df = pd.read_csv('new_file.csv')
print(new_df)
Data from new_file.csv file:
col1\tcol2\tcol3
0 1\t4\t7
1 4\t5\t8
2 3\t6\t9
3 4\t7\t0
4 5\t8\t1
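Note that the garbled columns in the output above come from reading the tab-separated file back without specifying the separator, so each line stays in a single column. A sketch of the matching read:
new_df = pd.read_csv('new_file.csv', sep='\t')  # pass the same separator used when writing
print(new_df)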
27. Write a Pandas program to count the city-wise number of people from a given data set (city,
name of the person).
import pandas as pd
df1 = pd.DataFrame({'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael',
'Matthew', 'Laura', 'Kevin', 'Jonas'],
'city': ['California', 'Los Angeles', 'California', 'California', 'California', 'Los Angeles', 'Los
Angeles', 'Georgia', 'Georgia', 'Los Angeles']})
g1 = df1.groupby(["city"]).size().reset_index(name='Number of people')
print(g1)
city  Number of people
0 California 4
1 Georgia 2
2 Los Angeles 4
28. Write a Pandas program to delete DataFrame row(s) based on given column value.
import pandas as pd
import numpy as np
d = {'col1': [1, 4, 3, 4, 5], 'col2': [4, 5, 6, 7, 8], 'col3': [7, 8, 9, 0, 1]}
df = pd.DataFrame(data=d)
df = df[df.col2 !=5]
print("New DataFrame")
print(df)
New DataFrame
col1 col2 col3
0 1 4 7
2 3 6 9
3 4 7 0
4 5 8 1
29. Write a Pandas program to widen output display to see more columns.
import pandas as pd
import numpy as np
d = {'col1': [1, 4, 3, 4, 5], 'col2': [4, 5, 6, 7, 8], 'col3': [7, 8, 9, 0, 1]}
df = pd.DataFrame(data=d)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print("Original DataFrame")
print(df)
Original DataFrame
col1 col2 col3
0 1 4 7
1 4 5 8
2 3 6 9
3 4 7 0
4 5 8 1
30. Write a Pandas program to replace all NaN values with zeros in a column of a DataFrame.
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
df = pd.DataFrame(exam_data)
df = df.fillna(0)
print("\nNew DataFrame replacing all NaN with 0:")
print(df)
1 b) Cardio dataset
import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/content/drive')
col_names = ['id', 'age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo', 'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio']
df = pd.read_csv('/content/drive/MyDrive/dataset/cardio_train.csv', delimiter=';')
df.head(5)
df.columns
df.info()
df.isnull().sum()
new_df = df[['ap_hi','cholesterol','gluc','smoke']]
new_df.tail(5)
max(new_df['gluc'])
3
max(new_df['cholesterol'])
3
# let's check the range
print(new_df[new_df['ap_hi'].between(1000,1620)])
1 c) Game Dataset
import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/dataset/metacritic_games.csv')
df.head(4)
df.columns
df.info()
df.isnull().sum()
# new_df: subset with the game name and scores (columns assumed from the dataset)
new_df = df[['name', 'meta_score', 'user_score']]
new_df.columns
new_df.groupby(['name']).mean()
new_data = df[['meta_score','user_score']]
new_data.head(5)
PlayStation= new_data.loc[new_data['user_score']==6.8]
PlayStation.tail(4)
Exercise 2
Basic plots using Matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv(r'C:\Users\21ad016\Desktop\Datasets\forestfires.csv')
df.head(5)
Output:
new_df = df[['X','Y','month']]
new_df.head(5)
Output:
data = new_df.groupby(['month']).value_counts()
data
Output:
data.sort_values(ascending = True)
Bar plots
#Bar plot
plt.bar(df['month'],df['temp'],color='y')
plt.xlabel('Month')
plt.ylabel('Temperature')
plt.title("Bar plot")
Output:
plt.bar(df['month'],df['wind'],color = 'r')
plt.xlabel('Month')
plt.ylabel('Wind')
plt.title("Bar plot")
Output:
plt.bar(df['month'],df['rain'])
plt.xlabel('Month')
plt.ylabel('Rain')
plt.title("Bar plot")
Output:
plt.bar(df['month'],df['RH'],color='g')
plt.xlabel('Month')
plt.ylabel('Temperature')
plt.title("Bar plot")
Output:
Observation: It is observed that factors like wind, temperature and RH contribute to the forest
fires; rain does not contribute much. Most forest fires have occurred in the months of March and
July, which fall near the summer season when the temperature is high. Forest fires occur in the
month of August due to wind and relative humidity.
Scatter plot
#Scatter Plot
plt.scatter(df['month'],df['area'])
plt.xlabel('Month')
plt.ylabel('Area in hectares')
plt.title('Scatter plot')
Output:
plt.scatter(df['month'],df['day'],color='#800000')
plt.xlabel('Month')
plt.ylabel('Day')
plt.title('Scatter Plot')
Output:
plt.scatter(df['month'],df['DC'])
plt.xlabel('Month')
plt.ylabel('DC')
plt.title('Scatter Plot')
Output:
Area plot
months = {"jan":1, "feb":2, "mar":3, "apr":4, "may":5, "jun":6, "jul":7, "aug":8, "sep":9,
"oct":10, "nov":11, "dec":12}
df["month_num"] = df["month"].map(months)
df = df.sort_values(by=["month_num", "day"])
plt.fill_between(range(len(df)), df["area"], color="orange")
plt.xlabel("Time")
plt.ylabel("Area")
plt.title("Distribution of Area over Time")
plt.show()
# group by month and get the mean FFMC value for each month
ffmc_month = df[['month', 'FFMC']]
ffmc_month_mean = ffmc_month.groupby('month')['FFMC'].mean()
# create the area plot
fig, ax = plt.subplots()
ax.fill_between(ffmc_month_mean.index, ffmc_month_mean.values, alpha=0.5)
ax.set(title='Average FFMC by Month', xlabel='Month', ylabel='FFMC')
plt.show()
Histogram
plt.hist(df['X'])
plt.hist(df['Y'])
Exercise 3
Frequency distributions, variability and variance
View(diabetes_data_upload)
Mean
#mean of the age
x <- diabetes_data_upload$Age
cat("Mean age of person having diabetes: ",mean(x));
Output:
Observation:
On average, people affected by diabetes are about 48 years old.
library(dplyr)
grouped = diabetes_data_upload %>% group_by(Gender) %>%
  summarise(Age = mean(Age))
grouped
Output:
Observation:
The average age of women getting affected by diabetes is 47 and 48 for men.
library(dplyr)
df_grouped = diabetes_data_upload %>% group_by(sudden.weight.loss) %>%
summarise(Age = mean(Age))
df_grouped
Output:
Variability
1. Range
range_age<- range(diabetes_data_upload$Age)
cat("Range with age grouped: ",range_age);
Output:
Observation:
It shows that people are affected by diabetes starting from the age of 16.
2. Interquartile Range
age <- diabetes_data_upload$Age
q25 <- quantile(age, 0.25)
q50 <- quantile(age, 0.5)
q75 <- quantile(age, 0.75)
q100 <- quantile(age, 1)
cat("Interquartile range of age: ", IQR(age))
3. Variance
age <- diabetes_data_upload$Age
cat("The variance of age: ",var(age))
Output:
4. Standard Deviation
age <- diabetes_data_upload$Age
cat("The standard deviation of age: ",sd(age))
Output:
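The frequency distribution itself, although named in the exercise title, is not shown above; a minimal sketch in R on the same frame:
# frequency table of Gender and a frequency histogram of Age
table(diabetes_data_upload$Gender)
hist(diabetes_data_upload$Age, main = "Frequency distribution of Age", xlab = "Age")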
Exercise 4
Normal Curves, Correlation and scatter plots and correlation
coefficient
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('/content/drive/MyDrive/datasets/accident.csv')
df.isnull().sum()
from bokeh.plotting import figure,output_file,show
output_file('person.html')
graph = figure(title="Scatter plot")
x = df['PERSONS']
y = df['VE_TOTAL']
graph.scatter(x,y)
show(graph)
from bokeh.io import show
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from bokeh.transform import linear_cmap
df.columns
Normal Curve:
import numpy as np
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure,output_file
output_file('normal.html')
accidents = df['CITY']
hist, edges = np.histogram(accidents, density=True)
mean = np.mean(accidents)
std_dev = np.std(accidents)
# figure construction reconstructed; the original plotting lines were lost
p = figure(title="Normal curve")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], fill_alpha=0.5)
x = np.linspace(accidents.min(), accidents.max(), 200)
pdf = np.exp(-(x - mean)**2 / (2 * std_dev**2)) / (std_dev * np.sqrt(2 * np.pi))
p.line(x, pdf, line_width=2, line_color='red')
show(p)
import numpy as np
from bokeh.io import show
from bokeh.models import LinearColorMapper, ColorBar
from bokeh.palettes import Viridis256
from bokeh.plotting import figure
from bokeh.transform import transform
from bokeh.plotting import figure,output_file
output_file("heatmap_persons.html")
40
color_bar = ColorBar(color_mapper=mapper, location=(0, 0))
p.add_layout(color_bar, 'right')
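The correlation coefficient named in the exercise title is not computed above; a one-line sketch on the two columns plotted earlier:
# Pearson correlation between persons involved and total vehicles
print(df['PERSONS'].corr(df['VE_TOTAL']))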
Exercise 5
Regression
5a. Regression Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
df = pd.read_csv('/content/drive/MyDrive/datasets/winequality-red.csv')
df.head(3)
df.isnull().sum()
df.columns
import seaborn as sns
sns.residplot(x="fixed acidity", y="quality", data=df)
sns.heatmap(df.corr(), annot=True, cmap='YlGnBu', linecolor='r', linewidths=0.5)
fig,[ax0,ax1] = plt.subplots(1,2)
fig.set_size_inches([12,6])
sns.regplot(x='citric acid',y='quality',data=df,color='r',ax=ax0)
sns.residplot(x='citric acid',y='quality',data=df,color='r',ax=ax1)
plt.show()
fig,[ax0,ax1] = plt.subplots(1,2)
fig.set_size_inches([12,6])
sns.regplot(x='sulphates',y='quality',data=df,color='g',ax=ax0)
sns.residplot(x='sulphates',y='quality',data=df,color='g',ax=ax1)
plt.show()
fig,[ax0,ax1] = plt.subplots(1,2)
fig.set_size_inches([12,6])
sns.regplot(x='alcohol',y='quality',data=df,color='g',ax=ax0)
sns.residplot(x='alcohol',y='quality',data=df,color='g',ax=ax1)
plt.show()
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
X = df.iloc[:,:-1].values
y = df.iloc[:,-1].values
X= sm.add_constant(X)
regression = sm.OLS(y,X).fit()
print(regression.summary())
from sklearn.metrics import mean_squared_error, mean_absolute_error
# predictions from the OLS model fitted above
y_pred = regression.predict(X)
# calculate the mean squared error and root mean squared error
mse = mean_squared_error(y, y_pred)
rmse = mean_squared_error(y, y_pred, squared=False)
print("MSE:", mse)
print("RMSE:", rmse)
5b. Logistic Regression in R
library(caret)
# load the wine quality data (path assumed; caret does not ship this dataset)
winequality <- read.csv("winequality-red.csv", sep = ";")
head(winequality)
summary(winequality)
winequality$quality_binary<- ifelse(winequality$quality> 6,1,0)
ggplot(winequality,aes(x=quality,fill=quality_binary))+geom_bar()
set.seed(123)
train_index <- sample(seq_len(nrow(winequality)), size = 0.7 * nrow(winequality))
train_data <- winequality[train_index, ]
test_data <- winequality[-train_index, ]
# fit the logistic model and score the test set (reconstructed; this step was lost)
model <- glm(quality_binary ~ . - quality, data = train_data, family = binomial)
pred <- predict(model, newdata = test_data, type = "response")
library(pROC)
roc <- roc(test_data$quality_binary, pred)
plot(roc)
auc(roc)
Exercise 6
Z – Test
6a. Z – test for one sample
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
df = pd.read_csv('/content/drive/MyDrive/datasets/student_clustering.csv')
df.head(3)
df.isnull().sum()
import numpy as np
import statsmodels.stats.api as sms
df['iq'].mean()
101.995
H0 : mean = 101    H1 : mean != 101
test_stat,p_value = sms.ztest(df['iq'],value=101)
print("Test Statistic: ",test_stat)
print("p-value: ",p_value)
print("\n ")
print("************Conclusion*************")
alpha = 0.05
if (p_value < alpha):
    print("\nReject the null hypothesis\nThe mean IQ of the sample differs from 101")
else:
    print("\nFail to reject the null hypothesis\nThe new students have the same mean IQ as the population")
6b . Z – test for two independent samples
#b) Two sample Z-test
df1 = pd.read_csv('/content/drive/MyDrive/datasets/weight-height.csv')
df1.head(3)
df2 = df1.drop('Weight',axis=1)
male_df = df2[df2['Gender'] == 'Male']
female_df = df2[df2['Gender'] == 'Female']
from statsmodels.stats.weightstats import ztest
sample1 = male_df['Height']
sample2 = female_df['Height']
mean1,mean2 = np.mean(sample1),np.mean(sample2)
std1,std2 = np.std(sample1), np.std(sample2)
z_score,p_value = ztest(x1=sample1,x2=sample2,value=0)
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis\nThe difference between the population means is not zero")
else:
    print("Fail to reject the null hypothesis\nThe difference between the population means is zero")
6c. Z – test for paired samples
from scipy.stats import norm
cgpa_1_sem = np.array([7.8,7.7,6.5,9.8,7.0,7.9,8.0,8.5,7.4,8.8])
cgpa_2_sem = np.array([8.4,8.0,6.0,7.9,8.9,7.8,8.0,7.5,6.6,8.0])
diff = np.array(cgpa_2_sem) - np.array(cgpa_1_sem)
mean_diff = np.mean(diff)
std_diff = np.std(diff,ddof=1)
n = len(diff)
se_diff = std_diff/np.sqrt(n)
null_hypothesis = 0
z_score = mean_diff/se_diff
p_value = 2 * norm.cdf(-np.abs(z_score))
print("Mean difference:", mean_diff)
print("Standard deviation of difference:", std_diff)
print("Standard error of the mean difference:", se_diff)
print("Z-score:", z_score)
print("P-value:", p_value)
print("\n*********************************")
print("************Conclusion*************")
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis\nThis indicates that the difference between the population means is not zero")
else:
    print("Fail to reject the null hypothesis\nThis indicates the difference between the population means is zero")
Exercise 7
T – test
7a. t – test for one sample
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
df = pd.read_csv('/content/drive/MyDrive/datasets/winequality-red.csv')
df.head(3)
df['fixed acidity'].mean()
8.31963727329581
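The t-test call itself is missing on this page; a minimal sketch, assuming a hypothesized population mean of 8.0 (an illustrative value, not from the original):
import scipy.stats as stats
# one-sample t-test of 'fixed acidity' against an assumed hypothesized mean
t_stat, p_value = stats.ttest_1samp(df['fixed acidity'], popmean=8.0)
print("t-statistic:", t_stat)
print("p-value:", p_value)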
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis\nThe population mean differs from the hypothesized value")
else:
    print("Fail to reject the null hypothesis\nThere is no evidence that the population mean differs from the hypothesized value")
7b t-test for two independent samples
import pandas as pd
import numpy as np
import scipy.stats as stats
data = pd.read_csv('/content/drive/MyDrive/datasets/weight-height.csv')
data.head(3)
male_weight = data[data['Gender']=='Male']
female_weight = data[data['Gender']=='Female']
sample1 = male_weight['Weight']
sample2 = female_weight['Weight']
t_stat , p_value = stats.ttest_ind(sample1,sample2)
print('t-statistic: ',t_stat)
print("p_value: ",p_value)
alpha = 0.05
print("\n*****************************")
print("\n*********Conclusion**********")
if p_value < alpha:
    print('\nReject the null hypothesis\nThere is a difference between the means of the populations')
else:
    print("\nFail to reject the null hypothesis\nThere is no difference between the means of the populations")
print("\n**************************")
7c t-test for paired samples
marks_1_sem = [8.9,8.8,7.9,6.7,8.7,7.8,9.0,6.7,8.9,8.0]
marks_2_sem = [9.0,7.8,8.0,6.7,9.0,6.8,8.9,7.4,8.0,6.7]
import scipy.stats as stats
t_stat,p_value = stats.ttest_rel(marks_1_sem,marks_2_sem)
print("t-statistic: ",t_stat)
print("p_value: ",p_value)
alpha = 0.05
print("\n*****************************")
print("\n*********Conclusion**********")
if p_value < alpha:
    print('\nReject the null hypothesis\nThere is a difference between the means of the populations')
else:
    print("\nFail to reject the null hypothesis\nThere is no difference between the means of the populations")
print("\n**************************")
Exercise 8
Anova
8a. One Way Anova Test
View(DATA_RELEASE)
sum(is.na(DATA_RELEASE))
library(ggplot2)
ggplot(DATA_RELEASE, aes(x=dnn_prediction, y=dnn_confidence)) +
geom_boxplot() +
labs(x="DNN Prediction", y="DNN Confidence") +
ggtitle("Boxplot of DNN Confidence by DNN Prediction")
Conclusion:
There is a significant difference between the means of the dnn_confidence scores across the
levels of the dnn_prediction factor (F(5, 9594) = 216, p < 0.001). The null hypothesis that there
is no difference between the means is rejected.
8b. Pairwise comparisons
# perform pairwise t-tests for the tested_occupation factor
pw1 <- pairwise.t.test(DATA_RELEASE$dnn_confidence,
DATA_RELEASE$tested_occupation, p.adjust.method = "bonferroni")
print(pw1)
# pairwise t-tests for the dnn_prediction factor (reconstructed; the definition of pw3 was lost)
pw3 <- pairwise.t.test(DATA_RELEASE$dnn_confidence,
DATA_RELEASE$dnn_prediction, p.adjust.method = "bonferroni")
print(pw3)
library(ggpubr)
ggboxplot(DATA_RELEASE, x = "tested_occupation", y = "dnn_confidence",
color = "tested_occupation", palette = "jco") +
stat_compare_means(comparisons = list(c("Accountant", "Data Analyst"),
c("Accountant", "Software Developer"),
c("Data Analyst", "Marketing Manager"),
c("Marketing Manager", "Software Developer")),
method = "t.test", label = "p.signif") +
labs(title = "Comparison plot for dnn_confidence by tested_occupation",
x = "Tested occupation", y = "DNN confidence")
Conclusion:
The conclusion is that both tested_occupation and dnn_prediction have a significant impact on
the dnn_confidence, and the interaction between them also has a significant effect. Additionally,
there are significant pairwise differences in dnn_confidence between different levels of
tested_occupation and dnn_prediction.
8c. Two-way ANOVA test
model1 <- aov(dnn_confidence ~ tested_occupation
* dnn_prediction, data=DATA_RELEASE)
summary(model1)
library(ggplot2)
ggplot(DATA_RELEASE, aes(x = tested_occupation, y = dnn_confidence, color =
dnn_prediction)) +
geom_point() +
stat_smooth(method = "lm", se = FALSE) +
labs(title = "Interaction plot for dnn_confidence by tested_occupation and dnn_prediction",
x = "Tested occupation", y = "DNN confidence")
Conclusion:
From the two-way ANOVA, we can see that both the tested_occupation factor and the dnn_prediction
factor have a significant effect on the dnn_confidence variable, as indicated by the extremely
small p-values (< 2e-16) and the significant F values. The interaction term between the
tested_occupation and dnn_prediction factors is also significant, indicating that the effect of one
factor on the dnn_confidence variable is dependent on the level of the other factor.
Exercise 9
Building and validating linear models
View(diamond)
Linear Regression:
#The built in Linear Regression model lm() is used
#to train the data
model <- lm(price ~ carat+depth+table,data=diamond)
summary(model)
66
# Make predictions using the model
new_data<- data.frame(carat = c(0.5, 0.75, 1.0), depth = c(61.0, 62.0, 63.0), table = c(58, 60,
62))
new_data$predicted_price<- predict(model,new_data)
new_data$predicted_price
summary(model)$r.squared
Polynomial Regression:
#Polynomial Regression
model1 <- lm(price ~poly(carat,2),data=diamond)
summary(model1)
new_data$prediction_poly<- predict(model1,new_data)
new_data$prediction_poly
summary(model1)$r.squared
# new data for the Bayesian model (the opening of this data.frame was lost;
# the cut values below are assumed for illustration)
new_data <- data.frame(carat = c(0.5, 0.75, 1.0),
cut = c("Ideal", "Premium", "Good"),
color = c("D", "E", "F"),
clarity = c("VS1", "VS2", "SI1"),
depth = c(61.5, 62.0, 63.5),
table = c(55, 59, 63),
x = c(4.29, 4.96, 5.67),
y = c(4.32, 4.97, 5.70),
z = c(2.65, 3.08, 3.62))
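The Bayesian fit referenced below is missing from this copy; a minimal sketch, assuming an rstanarm fit on the same three predictors used for lm():
# Bayesian linear regression (reconstructed under assumptions; not the original fit)
library(rstanarm)
model.bayes <- stan_glm(price ~ carat + depth + table, data = diamond)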
# Predict the prices for the new data using the Bayesian linear regression model
new_data$predicted_bayes<- predict(model.bayes, newdata = new_data)
library(caret)
# Predict the prices for the test data using the Bayesian linear regression model
test_data$predicted_price<- predict(model.bayes, newdata = test_data)
library(ggplot2)
# Get the predictions for each of the three models
preds1 <- predict(model, newdata = diamond)
preds2 <- predict(model1, newdata = diamond)
preds3 <- predict(model.bayes, newdata = diamond)
# Create a data frame with the actual values and the predicted values for each model
df <- data.frame(actual = diamond$price,
model = preds1,
model1 = preds2,
model.bayes = preds3)
# Reshape to long format so all three models can be plotted together
library(tidyr)
df_long <- pivot_longer(df, cols = -actual, names_to = "model", values_to = "predicted")
# Create a scatter plot of the actual values versus the predicted values for each model
ggplot(df_long, aes(x = actual, y = predicted, color = model)) +
geom_point() +
scale_color_manual(values = c("red", "green", "blue")) +
labs(x = "Actual Price", y = "Predicted Price", color = "Model")
9b) Linear Regression to predict top 10 players
View(Chess)
# Perform a linear regression of ELO on Age
model <- lm(ELO ~ Age, data = Chess)
summary(model)
library(ggplot2)
library(dplyr)
# Predict ELO rating for each player using the linear regression model
players <- Chess %>%
mutate(Predicted_ELO = predict(lm(ELO ~ Age, data = Chess)))
Exercise 10
Building and Validating Logistic models
10a. Logistic Regression on Titanic DataSet
View(titanic)
library(dplyr)
library(caret)
library(ggplot2)
titanic$Gender_Survived<- ifelse(titanic$Sex=='male',1,0)
ggplot(titanic,aes(x=Survived,fill=Gender_Survived))+geom_bar()
# titanic_updated: cleaned copy of titanic (reconstructed; its creation was lost)
titanic_updated <- na.omit(titanic)
train_index <- sample(seq_len(nrow(titanic_updated)), size = 0.7*nrow(titanic_updated))
train_data <- titanic_updated[train_index,]
test_data <- titanic_updated[-train_index,]
# fit the logistic model (reconstructed; predictors assumed from the standard Titanic columns)
model <- glm(Survived ~ Pclass + Sex + Age, data = train_data, family = binomial)
pred <- predict(model, newdata = test_data, type = "response")
pred <- head(pred, nrow(test_data))
predictions <- data.frame(actual = test_data$Survived, predicted = ifelse(pred >= 0.5, 1, 0))
predictions
library(pROC)
roc <- roc(test_data$Survived,pred)
plot(roc)
auc(roc)
10b. Logistic Regression on the Breast Cancer dataset
library(corrplot)
corr <- cor(Breast_cancer_data)
corrplot(corr, method = 'circle')
library(dplyr)
library(ggplot2)
library(caret)
set.seed(1234)
train_index<- sample(seq_len(nrow(Breast_cancer_data)),size=0.7*nrow(Breast_cancer_data))
train_data<- Breast_cancer_data[train_index,]
test_data<- Breast_cancer_data[-train_index,]
head(train_data)
head(test_data)
model <- glm(diagnosis ~., data=train_data,family=binomial)
summary(model)
# score the test set before computing the ROC curve (this step was lost)
pred <- predict(model, newdata = test_data, type = "response")
library(pROC)
roc <- roc(test_data$diagnosis, pred)
plot(roc)
auc(roc)
10c) Finding the odds ratio
df<- read.csv('C:/Users/21ad016/Desktop/Datasets/framingham.csv')
sum(is.na(df))
645
df<- na.omit(df)
head(df)
set.seed(123)
trainIndex<- sample(1:nrow(df), 0.8*nrow(df))
trainData<- df[trainIndex, ]
testData<- df[-trainIndex, ]
# fit the logistic model (reconstructed; TenYearCHD is the Framingham outcome column)
model <- glm(TenYearCHD ~ ., data = trainData, family = binomial)
summary(model)
exp(coef(model))
10d) Logistic Regression on the Iris dataset using scikit-learn
df = pd.read_csv(r'C:\Users\21ad016\Desktop\Datasets\archive_iris\Iris.csv')
df.head(3)
df.isna().sum()
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# split features and target (column names as in the Kaggle Iris.csv)
X = df.drop(columns=['Id', 'Species'])
y = df['Species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
lr_model = LogisticRegression(max_iter=200)
# Fit the model to the training data
lr_model.fit(X_train, y_train)
y_pred = lr_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Exercise 11
Time series analysis
11 Time series Analysis using R
df<- read.csv('C:/Users/21ad016/Desktop/Datasets/test.csv')
head(df)
df$Date<- as.Date(df$Date)
sales_ts<- ts(df$number_sold, start = min(df$Date), frequency = 7)
plot(sales_ts, main = "Weekly Sales Data", xlab = "Date", ylab = "Number Sold")
sales_decomp<- decompose(sales_ts)
plot(sales_decomp)
library(forecast)
sales_model<- auto.arima(sales_ts)
sales_forecast<- forecast(sales_model, h = 7)
plot(sales_forecast, main = "Sales Forecast", xlab = "Date", ylab = "Number Sold")
Exercise 12
Regression – Least Squares
from os.path import basename, exists
def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)
download(r"C:\Users\21ad016\Desktop\data")
download(r"C:\Users\21ad016\Desktop\data1")
import numpy as np
import random
import thinkstats2
import thinkplot
download("C:\Users\21ad016\Desktop\nsfg.py")
download("C:\Users\21ad016\Desktop\first.py")
download("C:\Users\21ad016\Desktop\2002FemPreg.dct")
download(
C:\Users\21ad016\Desktop\2002FemPreg.dat.gz"
)
import first
live, firsts, others = first.MakeFrames()
live = live.dropna(subset=['agepreg', 'totalwgt_lb'])
ages = live.agepreg
weights = live.totalwgt_lb
live.head(4)
def LeastSquares(xs, ys):
    # body reconstructed from ThinkStats2; only the return line survived here
    meanx, varx = thinkstats2.MeanVar(xs)
    meany = np.mean(ys)
    slope = thinkstats2.Cov(xs, ys, meanx, meany) / varx
    inter = meany - slope * meanx
    return inter, slope
inter, slope = LeastSquares(ages, weights)
inter, slope
inter + slope * 25
slope * 10
def FitLine(xs, inter, slope):
    fit_xs = np.sort(xs)
    fit_ys = inter + slope * fit_xs
    return fit_xs, fit_ys
fit_xs, fit_ys = FitLine(ages, inter, slope)
thinkplot.Scatter(ages, weights, color='red', alpha=0.1, s=10)
thinkplot.Plot(fit_xs, fit_ys, color='green', linewidth=3)
thinkplot.Plot(fit_xs, fit_ys, color='black', linewidth=2)
thinkplot.Config(xlabel="Mother's age (years)",
ylabel='Birth weight (lbs)',
axis=[10, 45, 0, 15],
legend=False)
live.info()
import seaborn as sns
# distplot is deprecated in recent seaborn; histplot with kde=True is the current equivalent
sns.histplot(live['agepreg'], kde=True)
sns.histplot(live['totalwgt_lb'], kde=True)
bins = np.arange(10, 48, 3)
indices = np.digitize(live.agepreg, bins)
groups = live.groupby(indices)
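The page that used these bins is missing here; in ThinkStats2 the usual next step is to compute the mean age and mean birth weight within each bin (a sketch; the [1:-1] slice dropping the sparse edge bins is an assumption carried over from the book):
# mean age and mean birth weight per age bin, dropping the first and last bins
age_means = [group.agepreg.mean() for _, group in groups][1:-1]
wgt_means = [group.totalwgt_lb.mean() for _, group in groups][1:-1]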
def SampleRows(df, nrows, replace=False):
    indices = np.random.choice(df.index, nrows, replace=replace)
    sample = df.loc[indices]
    return sample

def ResampleRows(df):
    return SampleRows(df, len(df), replace=True)

def SamplingDistributions(live, iters=101):
    t = []
    for _ in range(iters):
        sample = ResampleRows(live)
        ages = sample.agepreg
        weights = sample.totalwgt_lb
        estimates = LeastSquares(ages, weights)
        t.append(estimates)
    inters, slopes = zip(*t)
    return inters, slopes
def PlotConfidenceIntervals(xs, inters, slopes, percent=90, **options):
    fys_seq = []
    for inter, slope in zip(inters, slopes):
        fxs, fys = FitLine(xs, inter, slope)
        fys_seq.append(fys)
    p = (100 - percent) / 2
    percents = p, 100 - p
    low, high = thinkstats2.PercentileRows(fys_seq, percents)
    thinkplot.FillBetween(fxs, low, high, **options)

# sampling distributions of the fitted parameters (this setup line was lost)
inters, slopes = SamplingDistributions(live)
PlotConfidenceIntervals(age_means, inters, slopes, percent=90,
                        color='gray', alpha=0.3, label='90% CI')
PlotConfidenceIntervals(age_means, inters, slopes, percent=50,
                        color='gray', alpha=0.5, label='50% CI')
def ResampleRowsWeighted(df, column='finalwgt'):
    weights = df[column]
    cdf = thinkstats2.Cdf(dict(weights))
    indices = cdf.Sample(len(weights))
    sample = df.loc[indices]
    return sample
# Summarize was not defined on this page; ThinkStats2's version reports the
# mean, standard error and a 90% confidence interval of the estimates
def Summarize(estimates):
    mean = np.mean(estimates)
    stderr = np.std(estimates)
    cdf = thinkstats2.Cdf(estimates)
    ci = cdf.ConfidenceInterval(90)
    print('mean, SE, CI', mean, stderr, ci)

iters = 100
estimates = [ResampleRowsWeighted(live).totalwgt_lb.mean()
             for _ in range(iters)]
Summarize(estimates)
estimates = [thinkstats2.ResampleRows(live).totalwgt_lb.mean()
             for _ in range(iters)]
Summarize(estimates)