VI SEMESTER
REGULATION 2021
LABORATORY MANUAL
COIMBATORE-641 107.
MONTH : APRIL
Prepared By Approved By
Amendment Sheet
Doc No.: -
Name of the Document: DATA SCIENCE LABORATORY LAB MANUAL (EXP NO 1 TO 7)
Page No.: All pages
Rev no. and Date of the Previous amendment: -
Rev no. and Date of amendment: 00 / 06.12.2022
Details of amendment: Revision
Approved by: SD/HOD/Course Coordinator
LIST OF EXPERIMENTS:
1. Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and
Pandas packages.
2. Working with Numpy arrays
3. Working with Pandas data frames
4. Reading data from text files, Excel and the web and exploring various commands for
doing descriptive analytics on the Iris data set.
5. Use the diabetes data set from UCI and Pima Indians Diabetes data set for
performing the following:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard
Deviation, Skewness and Kurtosis.
b. Bivariate analysis: Linear and logistic regression modeling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets.
6. Apply and explore various plotting functions on UCI data sets.
a. Normal curves
b. Density and contour plots
c. Correlation and scatter plots
d. Histograms
e. Three dimensional plotting
7. Visualizing Geographic Data with Basemap
TOTAL: 60 PERIODS
COURSE OUTCOMES:
At the end of this course, the students will be able to:
CO1: Make use of the Python libraries for data science.
CO2: Make use of the basic statistical and probability measures for data science.
1(a). Download and install NumPy, SciPy, Jupyter, Statsmodels and Pandas packages
AIM:
To learn how to download and install the different packages of NumPy, SciPy, Jupyter,
Statsmodels and Pandas.
ALGORITHM:
1. Download Python and Jupyter.
2. Install Python and Jupyter.
3. Install the packages NumPy, SciPy, Statsmodels and Pandas.
4. Verify the proper execution of Python and Jupyter.
Python Installation
● Open the python official web site. (https://www.python.org/)
● Downloads ==> Windows ==> Select Recent Release. (Requires Windows 10 or
above versions)
● Install "python-3.10.6-amd64.exe"
Jupyter Installation
● Open the command prompt and enter “python --version” to check whether Python was installed properly.
● If the installation is proper, it returns the version of Python.
● Enter “pip --version” to check whether the Python package manager was installed properly.
● If the installation is proper, it returns the version of the Python package manager.
● Enter the following command: “pip install jupyterlab”.
● Enter the following command: “pip install jupyter notebook”.
● Copy the pip upgrade command from the output of the previous step, paste it, and execute it to complete the upgrade process.
● Create a folder and name the folder accordingly.
● Open the command prompt and change into that folder. Type “jupyter notebook” and press Enter.
● Now a new Jupyter notebook will be opened for our use.
pip Installation
Installation of NumPy
● pip install numpy
Installation of SciPy
● pip install scipy
Installation of Statsmodels
● pip install statsmodels
Installation of Pandas
● pip install pandas
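To confirm that the packages are importable, a quick check can be run in Python or a notebook cell (a minimal sketch; the version numbers printed will vary with your environment):
import numpy
import scipy
import statsmodels
import pandas
print("NumPy:", numpy.__version__)
print("SciPy:", scipy.__version__)
print("Statsmodels:", statsmodels.__version__)
print("Pandas:", pandas.__version__)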
Sample Output
RESULT:
NumPy, SciPy, Jupyter, Statsmodels and Pandas packages were installed properly and their execution was verified.
1(b). Explore the features of NumPy
AIM:
To learn the different features provided by NumPy package.
ALGORITHM:
1. Install the NumPy package
2. Study all the features of NumPy package.
NumPy
● NumPy is a Python library used for working with arrays.
● It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.
Features
These are the important features of NumPy
1. Array 2. Random 3. Universal Functions
1. Arrays
1.1 Array Slicing
● Slicing in python means taking elements from one given index to another given
index.
● We pass slice instead of index like this: [start:end].
● We can also define the step, like this: [start:end:step].
● If we don't pass start it is considered 0
● If we don't pass end it is considered the length of the array in that dimension
● If we don't pass step it is considered 1
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5:2])
2. Random
Random Permutations
A permutation refers to an arrangement of elements. e.g. [3, 2, 1] is a permutation of [1,
2, 3] and vice-versa.
The NumPy Random module provides two methods for this: shuffle() and permutation().
from numpy import random
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
random.shuffle(arr)
print(arr)
2.1 Seaborn
Seaborn is a library that uses Matplotlib underneath to plot graphs. It will be used to
visualize random distributions.
import matplotlib.pyplot as plt
import seaborn as sns
# distplot is deprecated in recent Seaborn releases; displot/histplot are its replacements
sns.distplot([0, 1, 2, 3, 4, 5])
plt.show()
3. Universal Functions
Create Your Own ufunc (Universal)
To create your own ufunc, you have to define a function, like you do with normal
functions in Python, then add it to your NumPy ufunc library with the frompyfunc() method.
The frompyfunc() method takes the following arguments:
function - the name of the function.
inputs - the number of input arguments (arrays).
outputs - the number of output arrays.
Create your own ufunc for addition:
import numpy as np

def myadd(x, y):
    return x + y

myadd = np.frompyfunc(myadd, 2, 1)
print(myadd([1, 2, 3, 4], [5, 6, 7, 8]))
3.1 Simple Arithmetic
You could use arithmetic operators + - * / directly between NumPy arrays, but this
section discusses an extension of the same where we have functions that can take any array-like
objects, e.g. lists, tuples etc., and perform arithmetic conditionally.
Addition
Add the values in arr1 to the values in arr2:
import numpy as np
arr1 = np.array([10, 11, 12, 13, 14, 15])
arr2 = np.array([20, 21, 22, 23, 24, 25])
newarr = np.add(arr1, arr2)
print(newarr)
Subtraction
Subtract the values in arr2 from the values in arr1:
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([20, 21, 22, 23, 24, 25])
newarr = np.subtract(arr1, arr2)
print(newarr)
Multiplication
Multiply the values in arr1 with the values in arr2:
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([20, 21, 22, 23, 24, 25])
newarr = np.multiply(arr1, arr2)
print(newarr)
Division
Divide the values in arr1 with the values in arr2:
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([3, 5, 10, 8, 2, 33])
newarr = np.divide(arr1, arr2)
print(newarr)
Power
Raise the values in arr1 to the power of the values in arr2:
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([3, 5, 6, 8, 2, 33])
newarr = np.power(arr1, arr2)
print(newarr)
Remainder
Return the remainders:
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([3, 7, 9, 8, 2, 33])
newarr = np.mod(arr1, arr2)
print(newarr)
Absolute Values
Return the absolute value of each element:
import numpy as np
arr = np.array([-1, -2, 1, 2, 3, -4])
newarr = np.absolute(arr)
print(newarr)
3.2 Rounding Decimals
3.2.1 Truncation
Remove the decimals, and return the float number closest to zero. Use the trunc() and fix()
functions.
Truncate elements of following array:
import numpy as np
arr = np.trunc([-3.1666, 3.6667])
print(arr)
3.2.2 Rounding
The around() function rounds a number to the given number of decimals: the last kept digit is incremented by 1 if the next digit is 5 or more, and left unchanged otherwise.
Round off 3.1666 to 2 decimal places:
import numpy as np
arr = np.around(3.1666, 2)
print(arr)
3.2.3 Floor
The floor() function rounds a decimal down to the nearest integer.
Floor the elements of following array:
import numpy as np
arr = np.floor([-3.1666, 3.6667])
print(arr)
3.2.4 Ceil
The ceil() function rounds a decimal up to the nearest integer.
Ceil the elements of following array:
import numpy as np
arr = np.ceil([-3.1666, 3.6667])
print(arr)
3.3 Logs
NumPy provides functions to perform log at base 2, base e and base 10.
We will also explore how we can take the log for any base by creating a custom ufunc. All of
the log functions will place -inf or inf in the elements if the log cannot be computed.
Find log at base 10 of all elements of following array:
import numpy as np
arr = np.arange(1, 10)
print(np.log10(arr))
3.4 Summations
Addition is done between two arguments, whereas summation happens over n elements.
Sum the values in arr1 and the values in arr2:
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([1, 2, 3])
newarr = np.sum([arr1, arr2])
print(newarr)
3.5 Products
To find the product of the elements in an array, use the prod() function.
Find the product of the elements of this array:
import numpy as np
arr = np.array([1, 2, 3, 4])
x = np.prod(arr)
print(x)
3.6 Differences
A discrete difference means subtracting two successive elements.
To find the discrete difference, use the diff() function.
Compute discrete difference of the following array:
import numpy as np
arr = np.array([10, 15, 25, 5])
newarr = np.diff(arr)
print(newarr)
3.7 GCD (Greatest Common Divisor)
The GCD is the largest number that is a common factor of both given numbers. Find the GCD of two numbers:
import numpy as np
num1 = 6
num2 = 9
x = np.gcd(num1, num2)
print(x)
Sample Output:
RESULT
Thus the feature study of NumPy was completed successfully.
1(c). Explore the features of SciPy
AIM:
To learn the different features provided by SciPy package.
ALGORITHM:
1. Install the SciPy package
2. Study all the features of SciPy package.
SciPy
SciPy stands for Scientific Python. It is a scientific computation library that uses
NumPy underneath.
Features
These are the important features of SciPy
1. Constants 2. Sparse Data 3. Graphs
4. Spatial Data 5. Matlab Arrays 6. Interpolation
1. Constants in SciPy
As SciPy is more focused on scientific implementations, it provides many built-in
scientific constants.
These constants can be helpful when you are working with Data Science.
1.1 Constants in SciPy
The constants module groups units into categories; each constant returns its value in the base SI unit (the examples below assume: from scipy import constants).
Metric: Return the specified unit in meters. ex: print(constants.milli)
Binary: Return the specified unit in bytes. ex: print(constants.kibi)
Mass: Return the specified unit in kg. ex: print(constants.stone)
Angle: Return the specified unit in radians. ex: print(constants.degree)
Time: Return the specified unit in seconds. ex: print(constants.year)
Length: Return the specified unit in meters. ex: print(constants.mile)
Pressure: Return the specified unit in pascals. ex: print(constants.bar)
Area: Return the specified unit in square meters. ex: print(constants.hectare)
Volume: Return the specified unit in cubic meters. ex: print(constants.litre)
Speed: Return the specified unit in meters per second. ex: print(constants.kmh)
Temperature: Return the specified unit in Kelvin. ex: print(constants.zero_Celsius)
Energy: Return the specified unit in joules. ex: print(constants.calorie)
Power: Return the specified unit in watts. ex: print(constants.hp)
2. Sparse Data
Sparse data is data that has mostly unused elements (elements that don't carry any
information).
It can be an array like this one:
[1, 0, 2, 0, 0, 3, 0, 0, 0, 0, 0, 0]
Sparse Data: is a data set where most of the item values are zero.
Dense Array: is the opposite of a sparse array: most of the values are not zero.
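The scipy.sparse module stores such data compactly. A minimal sketch using the CSR (Compressed Sparse Row) format on the array above:
import numpy as np
from scipy.sparse import csr_matrix
arr = np.array([1, 0, 2, 0, 0, 3, 0, 0, 0, 0, 0, 0])
mat = csr_matrix(arr)
print(mat)                  # only the non-zero entries are stored
print(mat.count_nonzero())  # 3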
3. Graphs
Graphs are an essential data structure.
SciPy provides us with the module scipy.sparse.csgraph for working with such data
structures.
Adjacency Matrix
An adjacency matrix is an n×n matrix, where n is the number of elements in a graph. The
values represent the connections between the elements.
3.1 Dijkstra
Use the dijkstra method to find the shortest path in a graph from one element to another.
It takes the following arguments:
return_predecessors: boolean (True to return whole path of traversal otherwise False).
indices: index of the element to return all paths from that element only.
limit: max weight of path.
Find the shortest path from element 1 to 2:
import numpy as np
from scipy.sparse.csgraph import dijkstra
from scipy.sparse import csr_matrix
arr = np.array([
[0, 1, 2],
[1, 0, 0],
[2, 0, 0]
])
newarr = csr_matrix(arr)
print(dijkstra(newarr, return_predecessors=True, indices=0))
4. Spatial Data
Spatial data refers to data that is represented in a geometric space.
E.g. points on a coordinate system.
We deal with spatial data problems on many tasks.
E.g. finding if a point is inside a boundary or not.
4.1 Triangulation
A triangulation of a polygon divides the polygon into multiple triangles with which
we can compute the area of the polygon.
A triangulation with points means creating a surface composed of triangles in which all of the
given points are on at least one vertex of a triangle in the surface.
One method to generate such triangulations through points is the Delaunay()
triangulation.
Example:
Create a triangulation from following points:
import numpy as np
from scipy.spatial import Delaunay
import matplotlib.pyplot as plt
points = np.array([
[2, 4],
[3, 4],
[3, 0],
[2, 2],
[4, 1]
])
simplices = Delaunay(points).simplices
plt.triplot(points[:, 0], points[:, 1], simplices)
plt.scatter(points[:, 0], points[:, 1], color='r')
plt.show()
4.3 KDTrees
KDTrees are a data structure optimized for nearest-neighbor queries.
E.g. in a set of points, using KDTrees we can efficiently ask which points are nearest to a
certain given point.
The KDTree() method returns a KDTree object.
The query() method returns the distance to the nearest neighbor and the location of the
neighbors.
Example
Find the nearest neighbor to point (1,1):
from scipy.spatial import KDTree
points = [(1, -1), (2, 3), (-2, 3), (2, -3)]
kdtree = KDTree(points)
res = kdtree.query((1, 1))
print(res)
4.4 Distance Matrix
There are many distance metrics used to find various types of distances between two
points in data science: Euclidean distance, cosine distance, etc.
The distance between two vectors may not only be the length of the straight line between
them; it can also be the angle between them from the origin, or the number of unit steps required, etc.
The performance of many machine learning algorithms depends greatly on distance
metrics, e.g. "K Nearest Neighbors" or "K Means".
Let us look at some of the distance metrics:
Hamming Distance
It is the proportion of positions at which two sequences differ: a way to
measure distance for binary sequences.
Example
Find the hamming distance between given points:
from scipy.spatial.distance import hamming
p1 = (True, False, True)
p2 = (False, True, True)
res = hamming(p1, p2)
print(res)
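Other metrics mentioned above are available from the same module; a small sketch with illustrative points:
from scipy.spatial.distance import euclidean, cityblock, cosine
p1 = (1, 0)
p2 = (10, 2)
print(euclidean(p1, p2))   # straight-line distance
print(cityblock(p1, p2))   # Manhattan (unit-step) distance
print(cosine(p1, p2))      # angle-based cosine distance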
5. Matlab Arrays
We know that NumPy provides us with methods to persist the data in readable
formats for Python. But SciPy provides us with interoperability with Matlab as well.
Exporting Data in Matlab Format
The savemat() function allows us to export data in Matlab format. The
method takes the following parameters:
filename - the file name for saving data.
mdict - a dictionary containing the data.
do_compression - a boolean value that specifies whether to compress the result or
not. Default False.
Example
Export the following array as variable name "vec" to a mat file:
from scipy import io
import numpy as np
arr = np.arange(10)
io.savemat('arr.mat', {"vec": arr})
6. Interpolation
Interpolation is a method for generating points between given points.
For example: for points 1 and 2, we may interpolate and find points 1.33 and 1.66.
Interpolation has many uses; in machine learning we often deal with missing data in a
dataset, and interpolation is often used to substitute those values.
This method of filling values is called imputation.
Apart from imputation, interpolation is often used where we need to smooth the discrete
points in a dataset.
6.1 1D Interpolation
The function interp1d() is used to interpolate a distribution with 1 variable.
It takes x and y points and returns a callable function that can be called with new x and
returns corresponding y.
Example
For given xs and ys interpolate values from 2.1, 2.2... to 2.9:
from scipy.interpolate import interp1d
import numpy as np
xs = np.arange(10)
ys = 2*xs + 1
interp_func = interp1d(xs, ys)
newarr = interp_func(np.arange(2.1, 3, 0.1))
print(newarr)
Sample Output
RESULT
Thus the feature study of SciPy was completed successfully.
1(d). Explore the features of Pandas
AIM:
To learn the different features provided by Pandas package.
ALGORITHM:
1. Install the Pandas package
2. Study all the features of Pandas package.
Pandas
● Pandas is a Python library used for working with data sets.
● It has functions for analyzing, cleaning, exploring, and manipulating data.
● Pandas allows us to analyze big data and make conclusions based on statistical
theories.
● Pandas can clean messy data sets, and make them readable and relevant.
Features
These are the important features of Pandas.
1. Series 2. DataFrames 3. Read CSV
4. Read JSON 5. Viewing the Data 6. Data Cleaning
7. Plotting
1. Series
● A Pandas Series is like a column in a table.
● It is a one-dimensional array holding data of any type.
● Create a simple Pandas Series from a list:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
2. DataFrames
A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a
table with rows and columns.
Example
Create a simple Pandas DataFrame:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
3. Read CSV
A simple way to store big data sets is to use CSV files (comma separated values). CSV files
contain plain text in a well-known format that can be read by everyone, including
Pandas.
Example
To print maximum rows in a CSV file
import pandas as pd
pd.options.display.max_rows = 9999
df = pd.read_csv('data.csv')
print(df)
4. Read JSON
● Big data sets are often stored, or extracted as JSON.
● JSON is plain text, but has the format of an object, and is well known in the world of
programming, including Pandas.
Load the JSON file into a DataFrame:
import pandas as pd
df = pd.read_json('data.json')
print(df.to_string())
5. Viewing the Data
One of the most used methods for getting a quick overview of the DataFrame is the
head() method. The head() method returns the headers and a specified number of rows, starting
from the top.
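A minimal sketch, assuming the same data.csv used in the examples above:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(10))   # first 10 rows; head() with no argument returns the first 5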
6. Data Cleaning
Data cleaning means fixing bad data in your data set.
Bad data could be:
● Empty cells
● Data in wrong format
● Wrong data
● Duplicates
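A minimal cleaning sketch for two of these cases, assuming the same data.csv as above (the method choices are illustrative):
import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()                 # drop rows with empty cells
new_df = new_df.drop_duplicates()    # drop duplicate rows
print(new_df.to_string())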
7.2 Histogram
Use the kind argument to specify that you want a histogram:
kind = 'hist'
Example
import sys
import matplotlib
matplotlib.use('Agg')
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df["Duration"].plot(kind = 'hist')
plt.show()
plt.savefig(sys.stdout.buffer)
sys.stdout.flush()
Sample Output
RESULT
Thus the feature study of Pandas was completed successfully.
1(e). Explore the features of statsmodels
AIM:
To learn the different features provided by statsmodels package.
ALGORITHM:
1. Install the statsmodels package
2. Study all the features of the statsmodels package.
Statsmodels
statsmodels is a Python module that provides classes and functions for the estimation of
many different statistical models, as well as for conducting statistical tests, and statistical data
exploration.
Features
These are the important features of statsmodels
1. Linear regression models
2. Survival analysis
Example:
# Importing libraries
import statsmodels.api as sm
X = sm.datasets.get_rdataset("Moore", "carData").data
# Filtering data of low fcategory
X = X[X['fcategory'] == "low"]
# Creating SurvfuncRight model
model = sm.SurvfuncRight(X["conformity"], X["fscore"])
# Model Summary
model.summary()
Sample Output
RESULT
Thus the study of a few important features of statsmodels was completed successfully.
2. Working with Numpy arrays
AIM:
To work with different features provided by Numpy arrays.
ALGORITHM:
1. Install the numpy package
2. Work with all the features of numpy array.
Arrays
1. Creating Arrays
● 0-D Arrays
Each value in an array is a 0-D array.
import numpy as np
arr = np.array(42)
print(arr)
● 1-D Arrays
An array that has 0-D arrays as its elements is called a 1-D array.
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
● 2-D Arrays
An array that has 1-D arrays as its elements is called a 2-D array.
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)
● 3-D arrays
An array that has 2-D arrays (matrices) as its elements is called a 3-D array.
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)
Example:
import numpy as np
a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)
2. Accessing Array Elements
Access 2-D Arrays
To access elements from 2-D arrays we can use comma separated integers representing the
dimension and the index of the element.
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('2nd element on 1st row: ', arr[0, 1])
Access 3-D Arrays
To access elements from 3-D arrays we can use comma separated integers representing the
dimensions and the index of the element.
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr[0, 1, 2])
3. Array Slicing
● Slicing in python means taking elements from one given index to another given index.
● We pass slice instead of index like this: [start:end].
● We can also define the step, like this: [start:end:step].
● If we don't pass start it is considered 0
● If we don't pass end it is considered the length of the array in that dimension
● If we don't pass step it is considered 1
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5:2])
4. Data Types
NumPy has some extra data types, and refers to data types with one character, like i for
integers, u for unsigned integers, etc.
Below is a list of all data types in NumPy and the characters used to represent them.
i - integer
b - boolean
u - unsigned integer
f - float
c - complex float
m - timedelta
M - datetime
O - object
S - string
U - unicode string
V - fixed chunk of memory for other type (void)
Example:
import numpy as np
arr = np.array([1, 2, 3, 4], dtype='S')
print(arr)
print(arr.dtype)
5. Copy & View
5.1 Copy:
Make a copy
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
x = arr.copy()
arr[0] = 42
print(arr)
print(x)
5.2 View:
Make a view
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
x = arr.view()
arr[0] = 42
print(arr)
print(x)
The copy owns its data, so changes to the original array do not affect it; the view does not own its data, so changes to the original array also appear in the view.
7. Array Iterating
● Iterating means going through elements one by one.
● As we deal with multi-dimensional arrays in numpy, we can do this using a basic Python for loop.
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
for x in arr:
print(x)
8. Joining Array
Joining means putting contents of two or more arrays in a single array.
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.concatenate((arr1, arr2))
print(arr)
9. Splitting Array
Splitting is the reverse operation of joining: joining merges multiple arrays into one, and splitting breaks one array into multiple.
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 3)
print(newarr)
Sample Output:
RESULT
Thus the study of the important features of NumPy arrays was completed successfully.
3. Working with DataFrame
AIM:
To work with dataframe provided by pandas.
ALGORITHM:
1. Install the pandas package
2. Work with all the features of dataframe.
1. DataFrame
A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a
table with rows and columns.
Example
Create a simple Pandas DataFrame:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
2. Locate Row
As you can see from the result above, the DataFrame is like a table with rows and
columns.
Pandas uses the loc attribute to return one or more specified rows.
Example
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df.loc[0])
3. Named Indexes
With the index argument, you can name your own indexes.
Example
Add a list of names to give each row a name:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
Sample Output:
RESULT
Thus the DataFrame features of Pandas were explored successfully.
4. Reading data from iris data set and doing descriptive analytics on the Iris data set
AIM:
To read data from files and exploring various commands for doing descriptive
analytics on the Iris data set.
ALGORITHM:
1. Download “Iris.csv” file from GitHub.com
2. Load the “Iris.csv” into google colab.
3. Perform descriptive analysis on the Iris file.
Importing Iris.csv
● Log in to Google Colab using Gmail.
● Log in to Google Drive and create a folder with the required name.
● Move the Iris file from the system to Google Drive.
● Click on the "Files" icon and click on "Mount Drive".
● Code will appear in a typing area; execute that code.
● It requires authentication verification; complete the authentication.
● After successful verification it shows the message "Mounted at /content/drive".
● Find the Iris.csv file and copy the path for future reference.
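Once mounted, the file can be read and summarized; a minimal sketch, assuming the copied path is /content/drive/MyDrive/Data_Science/iris.csv:
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Data_Science/iris.csv")
print(df.head())      # first rows of the data set
df.info()             # column dtypes and non-null counts
print(df.describe())  # count, mean, std, min, quartiles and max per numeric column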
Checking Missing Values
Example:
df.isnull().sum()
Checking Duplicates
Let’s see if our dataset contains any duplicates or not. Pandas drop_duplicates() method
helps in removing duplicates from the data frame.
Example:
data = df.drop_duplicates(subset="variety")
data
Handling Correlation
Pandas dataframe.corr() is used to find the pairwise correlation of all columns in the
dataframe. Any NA values are automatically excluded, and any non-numeric columns are ignored.
Example:
data.corr(method='pearson')
Sample Output
RESULT
Iris.csv file was loaded into google colab and descriptive analytics was made on the Iris
data set successfully.
5(a). Perform Univariate analysis on the diabetes data set
AIM:
Use the diabetes data set from UCI and Pima Indians Diabetes data set for Univariate
analysis.
ALGORITHM:
1. Download diabetes data set from UCI and Pima Indians Diabetes data set.
2. Load the above data files into google colab.
3. Perform analysis like Frequency, Mean, Median, Mode, Variance, Standard
Deviation, Skewness and Kurtosis.
Univariate analysis
● The term univariate analysis refers to the analysis of one variable.
● There are two common ways to perform univariate analysis on one variable:
1. Summary statistics: measure the center and spread of values.
a. Central tendency: mean, median, mode
b. Dispersion: variance, standard deviation, range, interquartile range (IQR)
c. Skewness: symmetry of the data about the mean value
d. Kurtosis: peakedness of the data at the mean value
2. Frequency table: describes how often different values occur.
File Importing:
# Reading the UCI file
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Data_Science/UCI_diabetes.csv")
# Printing top 5 rows
df.head()

# Reading the Pima file
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Printing top 5 rows
df.head()
1. Central Tendency
We can use the following syntax to calculate various summary statistics like Mean, Median
and Mode.
1.1 Mean:
It is the average of the given numeric values.
● Mean of UCI data
import pandas as pd
# Reading the UCI file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/UCI_diabetes.csv")
# Mean of UCI data
df.mean(axis=0)
● Mean of Pima data
import pandas as pd
# Reading the Pima file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Mean of Pima data
df.mean(axis=0)
1.2 Median:
It is the middle-most value of the given values.
● Median of UCI data
import pandas as pd
# Reading the UCI file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/UCI_diabetes.csv")
# Median of UCI data
df.median(axis=0)
1.3 Mode:
It is the most frequently occurring value of the given numeric variables.
● Mode of UCI data
import pandas as pd
# Reading the UCI file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/UCI_diabetes.csv")
# Mode of UCI data
df.mode(axis=0)
2. Dispersion
2.1 Variance
Variance measures how far the values of a data set are spread out from their mean.
Example
import pandas as pd
# Reading the Pima file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# variance of the BMI column
df.loc[:,"BMI"].var()
2.3 Range
Range is the simplest of the measurements but is very limited in its use. We calculate the
range by taking the largest value of the dataset and subtracting the smallest value from it; in other
words, it is the difference between the maximum and minimum values of a dataset.
Example
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
print("Range is:", df.BloodPressure.max() - df.BloodPressure.min())
3. Skewness
● Skewness essentially measures the symmetry of the distribution.
Example
# importing pandas as pd
import pandas as pd
# Creating the dataframe
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# skip the na values and find the skewness in each column
df.skew(axis = 0, skipna = True)
4. Kurtosis
Kurtosis determines the heaviness of the distribution tails.
Example
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/Data_Science/Pima_diabetes.csv')
df['BloodPressure'].kurtosis()
5. Frequency
Frequency is a count of the number of times a particular value occurs in
our data. A frequency table displays a set of values along with the frequency with which they
appear. It allows us to better understand which data values are common and which are
uncommon.
Example
# import packages
import pandas as pd
import numpy as np
# reading csv file
data = pd.read_csv('/content/drive/MyDrive/Data_Science/Pima_diabetes.csv')
# one-way frequency table for the Age column
freq_table = pd.crosstab(data['Age'], 'BMI')
# frequency table as a proportion of all rows
freq_table = freq_table/len(data)
freq_table
Sample Output
RESULT
Thus the Univariate analysis on the Diabetes data of UCI and Pima was performed
successfully.
5(b). Perform Bivariate analysis on the diabetes data set.
AIM:
To use the UCI and Pima Indians Diabetes data set for Bivariate analysis.
ALGORITHM:
1. Download diabetes data set from UCI and Pima Indians Diabetes data set.
2. Load the above data files into google colab.
3. Perform various methods of bivariate analysis.
Bivariate analysis
The term bivariate analysis refers to the analysis of two variables. The purpose of bivariate
analysis is to understand the relationship between two variables.
There are three common ways to perform bivariate analysis:
1. Scatterplots
2. Correlation Coefficients
3. Simple Linear Regression
1. Scatterplots
A scatterplot is a type of data display that shows the relationship between two numerical
variables.
Example
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Diabetes Outcome
g1 = data.loc[data.Outcome==1,:]
# Pregnancies, Glucose and Diabetes relation
g1.plot.scatter('Pregnancies', 'Glucose');
2. Correlation Coefficients
The correlation coefficient is a statistical measure of the strength of the relationship
between the relative movements of two variables. The values range between -1.0 and 1.0.
Correlation of -1.0 shows a perfect negative correlation, while a correlation of 1.0 shows a
perfect positive correlation. A correlation of 0.0 shows no linear relationship between the
movement of the two variables.
Example
# Import those libraries
import pandas as pd
from scipy.stats import pearsonr
# Import your data into Python
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Convert dataframe into series
list1 = df['BloodPressure']
list2 = df['SkinThickness']
# Apply the pearsonr()
corr, _ = pearsonr(list1, list2)
print('Pearsons correlation: %.3f' % corr)
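3. Simple Linear (and Logistic) Regression
The third method listed above can be sketched with statsmodels; the column choices (BMI, Glucose, Outcome) are illustrative:
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Simple linear regression: model Glucose as a linear function of BMI
X = sm.add_constant(df["BMI"])          # add an intercept term
lin_model = sm.OLS(df["Glucose"], X).fit()
print(lin_model.summary())
# Logistic regression: model the binary Outcome from Glucose
X2 = sm.add_constant(df["Glucose"])
log_model = sm.Logit(df["Outcome"], X2).fit()
print(log_model.summary())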
RESULT:
Thus the Bivariate analysis on the diabetes data set was executed successfully.
5(c). Perform Multiple Regression Analysis on the diabetes data set
AIM:
To use UCI and Pima Indians Diabetes data set for Multiple Regression Analysis.
ALGORITHM:
1. Download diabetes data set from UCI and Pima Indians Diabetes data set.
2. Load the above data files into google colab.
3. Perform multiple regression analysis on data sets.
# UCI-Diabetes
import pandas
from sklearn import linear_model
df = pandas.read_csv("/content/drive/MyDrive/Data_Science/UCI_diabetes.csv")
# Time is stored as "HH:MM" text; convert it to minutes past midnight
# so it can be used as a numeric regressor
df['Time'] = df['Time'].str.split(':').apply(lambda t: int(t[0])*60 + int(t[1]))
X = df[['Time', 'Code']]
y = df['Value']
regr = linear_model.LinearRegression()
regr.fit(X, y)
# predict the Value for Time 13:23 (= 803 minutes) and Code 46:
predictedValue = regr.predict([[803, 46]])
print(predictedValue)
Sample Output
RESULT
Thus the Multiple Regression analysis on the Diabetes data of UCI and Pima was
performed successfully.
6(a). Apply and explore Normal curves & Histograms
plotting functions on UCI-Iris data sets
AIM:
To apply and explore Normal curves & Histograms plotting functions on UCI-Iris
data sets.
ALGORITHM:
1. Download Iris data set from UCI.
2. Load the above Iris data files into google colab.
3. Plot the normal curve and Histograms for Iris data set.
Normal Curves
It is a probability function used in statistics that tells about how the data values are
distributed. It is the most important probability distribution function used in statistics because of
its advantages in real case scenarios.
Example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
import statistics
# import dataset
df = pd.read_csv("/content/drive/MyDrive/Data_Science/iris.csv")
# Plot between -20 and 20 with 0.01 steps.
x_axis = np.arange(-20, 20, 0.01)
# Calculating the mean and standard deviation of the sepal length
mean = df["sepal.length"].mean()
sd = df.loc[:,"sepal.length"].std()
plt.plot(x_axis, norm.pdf(x_axis, mean, sd))
plt.show()
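Histograms
A histogram shows how the observed values themselves are distributed. A minimal sketch for the same feature, assuming the iris.csv loaded above:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("/content/drive/MyDrive/Data_Science/iris.csv")
df["sepal.length"].plot(kind='hist')
plt.show()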
Sample Output
RESULT
Thus the UCI data set was plotted using Normal Curve and Histogram plotting was
executed successfully.
6(b). Density and contour plotting functions on UCI-Iris data sets.
AIM:
To apply and explore Density & Contour plotting functions on UCI-Iris data sets.
ALGORITHM:
1. Download Iris data set from UCI.
2. Load the above Iris data files into google colab.
3. Plot the density and contour plotting for Iris data sets.
Density Plotting
Density Plot is a type of data visualization tool. It is a variation of the histogram that uses
‘kernel smoothing’ while plotting the values. It is a continuous and smooth version of a
histogram inferred from a data.
Density plots use Kernel Density Estimation (so they are also known as kernel density
estimation plots, or KDE plots), which is a probability density function. The region of the plot with a
higher peak is the region with the maximum data points residing between those values.
Contour plotting
Contour plots, also called level plots, are a tool for doing multivariate analysis and
visualizing 3-D plots in 2-D space. If we consider X and Y as our variables to plot, then
the response Z will be plotted as slices on the X-Y plane, due to which contours are sometimes
referred to as Z-slices or iso-responses.
Contour plots are widely used to visualize the density, altitude or height of
mountains, as well as in meteorology.
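A minimal sketch of both plots, assuming the iris.csv path used earlier and seaborn 0.11 or newer:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("/content/drive/MyDrive/Data_Science/iris.csv")
# Density (KDE) plot of a single feature
sns.kdeplot(df["sepal.length"])
plt.show()
# Contour plot: a two-variable KDE drawn as level curves (Z-slices)
sns.kdeplot(x=df["sepal.length"], y=df["sepal.width"])
plt.show()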
Sample Output
RESULT
Thus the UCI data set was plotted using Density & Contour plotting was executed
successfully.
6(c). Correlation and scatter plotting functions on UCI data sets.
AIM:
To apply and explore correlation & scatter plotting functions on UCI data sets.
ALGORITHM:
1. Download Iris data set from UCI.
2. Load the above Iris data files into google colab.
3. Plot the correlation and scatter plotting for Iris data sets.
Example
# Correlation Matrix Plot
import matplotlib.pyplot as plt
import pandas
import numpy
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
correlations = data.corr()
# plot correlation matrix
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = numpy.arange(0,9,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show()
Scatter Plotting
A scatterplot shows the relationship between two variables as dots in two dimensions,
one axis for each attribute. You can create a scatterplot for each pair of attributes in your data.
Drawing all these scatterplots together is called a scatterplot matrix.
Scatter plots are useful for spotting structured relationships between variables, like
whether you could summarize the relationship between two variables with a line. Attributes with
structured relationships may also be correlated and good candidates for removal from your
dataset.
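A minimal scatterplot-matrix sketch using the same Pima CSV as the correlation example above:
import matplotlib.pyplot as plt
import pandas
from pandas.plotting import scatter_matrix
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
scatter_matrix(data)
plt.show()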
Sample Output
RESULT
Thus the UCI data set was plotted using Correlation and scatter plotting was executed
successfully.
7. Visualizing Geographic Data with Basemap
AIM:
To visualize geographic data with Basemap using Zomato geographic data.
ALGORITHM:
1. Study the basics of Basemap.
2. Use Zomato data to plot city names and restaurants details.
Basemap Introduction
Basemap is a toolkit under the Python visualization library Matplotlib. Its main function
is to draw 2D maps, which are important for visualizing spatial data. basemap itself does not do
any plotting, but provides the ability to transform coordinates into one of 25 different map
projections.
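A minimal sketch, assuming the basemap toolkit is installed (e.g. pip install basemap); the default cylindrical projection covers the whole globe:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
m = Basemap()        # default 'cyl' projection, whole world
m.drawcoastlines()
m.drawcountries()
plt.show()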
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from glob import glob as gb
# 'dirs' lists the per-city folders of the scraped Zomato data and 'li' collects one dataframe
# per file; the directory loop that sets 'dirs', 'dir1' and 'file' is not reproduced in this manual
len(dirs)
df_file=pd.read_csv("C:/Users/IT LAB-I/Desktop/Data_Science/zomato_data/"+dir1+"/"+file,quotechar='"',delimiter="|")
#appending the dataframe into a list
li.append(df_file.values)
len(li)
#numpy's vstack method appends all the dataframes, stacking the sequence of inputs vertically into a single array
df_np=np.vstack(li)
#the number of rows represents the total number of restaurants; the number of columns (12) matches the dataframe columns
df_np.shape
#the header column "PAGE NO" was only used for validation while scraping the data from zomato and is not required; remove it
#(df_final is the dataframe built from df_np; that construction step is omitted here)
df_final.drop(columns=["PAGE NO"],axis=1,inplace=True)
# import the json and requests libraries to use the Google APIs to get the longitude and latitude values
import requests
import json
#creating a separate array with all city names as elements of the array
city_name=df_final["CITY"].unique()
li1=[]
#Google Maps geocoding API URL
geo_s ='https://maps.googleapis.com/maps/api/geocode/json'
#iterating through a for loop for each city name
for i in range(len(city_name)):
    #the author used their own Google Maps API key; please use your own key
    param = {'address': city_name[i], 'key': 'AIzaSyD-kYTK-8FQGueJqA2028t2YHbUX96V0vk'}
    response = requests.get(geo_s, params=param)
    response = response.text
    data = json.loads(response)
    #setting up the variables with the corresponding city longitude and latitude
    lat = data["results"][0]["geometry"]["location"]["lat"]
    lng = data["results"][0]["geometry"]["location"]["lng"]
    #creating a new data frame with city, latitude and longitude as columns
    df2 = pd.DataFrame([[city_name[i], lat, lng]])
    li1.append(df2.values)
#numpy's vstack method stacks all the city dataframes vertically into a single array
df_np=np.vstack(li1)
#merge this data frame with the existing df_final data frame using pandas merge/join features, creating a new data frame
#(df_sec is the CITY/latitude/longitude dataframe built from df_np; that construction step is omitted here)
df_final2=df_final.merge(df_sec,on="CITY",how="left")
#display the contents; it will have longitude and latitude now
df_final2
#creating a pandas series holding the city names and the corresponding count of restaurants in ascending order
li2=df_final["CITY"].value_counts().sort_values(ascending=True)
li2
#merging the counts data frame (df_map, built from li2) with the df_sec data frame (city names, longitude and latitude)
df_map_final=df_map.merge(df_sec,on="CITY",how="left")
#displaying the new data frame; this frame will be used for map plotting
df_map_final
#lets take one data frame for the top 20 cities with the most restaurant counts
df_plot_top=df_map_final.tail(20)
#lets plot this inside the map at the cities' exact co-ordinates which we received from the Google API
#Basemap comes from the mpl_toolkits.basemap toolkit
from mpl_toolkits.basemap import Basemap
#plt.subplots(figsize=(20,50))
plt.figure(figsize=(50,60))
map=Basemap(width=120000,height=900000,projection="lcc",resolution="l",llcrnrlon=67,llcrnrlat=5,urcrnrlon=99,urcrnrlat=37,lat_0=28,lon_0=77)
map.drawcountries()
map.drawmapboundary(color='#f2f2f2')
map.drawcoastlines()
lg=np.array(df_plot_top["lng"])
lat=np.array(df_plot_top["lat"])
pt=np.array(df_plot_top["COUNT"])
city_name=np.array(df_plot_top["CITY"])
x,y=map(lg,lat)
#using a lambda function to create different sizes of marker as per the count
p_s=df_plot_top["COUNT"].apply(lambda x: int(x)/2)
#plt.scatter takes longitude, latitude, marker size, shape and color as parameters; in this plot the marker color is always blue
plt.scatter(x,y,s=p_s,marker="o",c='BLUE')
plt.title("TOP 20 INDIAN CITIES RESTAURANT COUNTS PLOT AS PER ZOMATO",fontsize=30,color='RED')
#the same plot, at the cities' exact co-ordinates received from the Google API; here the marker color varies with the marker size
#plt.subplots(figsize=(20,50))
plt.figure(figsize=(50,60))
map=Basemap(width=120000,height=900000,projection="lcc",resolution="l",llcrnrlon=67,llcrnrlat=5,urcrnrlon=99,urcrnrlat=37,lat_0=28,lon_0=77)
map.drawcountries()
map.drawmapboundary(color='#f2f2f2')
map.drawcoastlines()
lg=np.array(df_plot_top["lng"])
lat=np.array(df_plot_top["lat"])
pt=np.array(df_plot_top["COUNT"])
city_name=np.array(df_plot_top["CITY"])
x,y=map(lg,lat)
#using a lambda function to create different sizes of marker as per the count
p_s=df_plot_top["COUNT"].apply(lambda x: int(x)/2)
#plt.scatter takes longitude, latitude, marker size, shape and color as parameters; here the marker color follows p_s
plt.scatter(x,y,s=p_s,marker="o",c=p_s)
plt.title("TOP 20 INDIAN CITIES RESTAURANT COUNTS PLOT AS PER ZOMATO",fontsize=30,color='RED')
#lets plot the city names inside the map at the cities' exact co-ordinates received from the Google API; the marker color varies with the marker size
#plt.subplots(figsize=(20,50))
plt.figure(figsize=(50,60))
map=Basemap(width=120000,height=900000,projection="lcc",resolution="l",llcrnrlon=67,llcrnrlat=5,urcrnrlon=99,urcrnrlat=37,lat_0=28,lon_0=77)
map.drawcountries()
map.drawmapboundary(color='#f2f2f2')
map.drawcoastlines()
lg=np.array(df_plot_top["lng"])
lat=np.array(df_plot_top["lat"])
pt=np.array(df_plot_top["COUNT"])
city_name=np.array(df_plot_top["CITY"])
x,y=map(lg,lat)
#using a lambda function to create different sizes of marker as per the count
p_s=df_plot_top["COUNT"].apply(lambda x: int(x)/2)
plt.scatter(x,y,s=p_s,marker="o",c=p_s)
for a,b,c,d in zip(x,y,city_name,pt):
    #plt.text takes x position, y position, text, font size and color as arguments
    plt.text(a,b,c,fontsize=30,color="r")
plt.title("TOP 20 INDIAN CITIES RESTAURANT COUNTS PLOT AS PER ZOMATO",fontsize=30,color='RED')
#lets plot the city names and restaurant counts inside the map at the cities' exact co-ordinates received from the Google API; the marker color varies with the marker size
#plt.subplots(figsize=(20,50))
plt.figure(figsize=(50,60))
map=Basemap(width=120000,height=900000,projection="lcc",resolution="l",llcrnrlon=67,llcrnrlat=5,urcrnrlon=99,urcrnrlat=37,lat_0=28,lon_0=77)
map.drawcountries()
map.drawmapboundary(color='#f2f2f2')
map.drawcoastlines()
lg=np.array(df_plot_top["lng"])
lat=np.array(df_plot_top["lat"])
pt=np.array(df_plot_top["COUNT"])
city_name=np.array(df_plot_top["CITY"])
x,y=map(lg,lat)
#using a lambda function to create different sizes of marker as per the count
p_s=df_plot_top["COUNT"].apply(lambda x: int(x)/2)
plt.scatter(x,y,s=p_s,marker="o",c=p_s)
for a,b,c,d in zip(x,y,city_name,pt):
    #plt.text takes x position, y position, text (city name), font size and color as arguments
    plt.text(a,b,c,fontsize=30,color="r")
    #the restaurant count (d) is drawn offset from the city name to keep the labels clean and easier to read
    plt.text(a+60000,b+30000,d,fontsize=30)
plt.title("TOP 20 INDIAN CITIES RESTAURANT COUNTS PLOT AS PER ZOMATO",fontsize=30,color='RED')
Sample Output
RESULT
Thus the Zomato geographic data was visualized using Basemap.