Pds-clg
Affiliated
S. N. PATEL INSTITUTE OF TECHNOLOGY & RESEARCH
CENTRE, UMRAKH
Numbers
Number stores numeric values.
The integer, float, and complex values belong to a Python Numbers data-type.
Python provides the type() function to know the data-type of the variable.
Python creates Number objects when a number is assigned to a variable.
Python supports three types of numeric data.
o Int - Integer value can be any length such as integers 10, 2, 29, -20, -150
etc. Python has no restriction on the length of an integer. Its value belongs
to int
o Float - Float is used to store floating-point numbers like 1.9, 9.902, 15.2,
etc. It is accurate to about 15 significant decimal digits.
o Complex - A complex number contains an ordered pair, i.e., x + iy where x
and y denote the real and imaginary parts, respectively. Examples of complex
numbers are 2.14j, 2.0 + 2.3j, etc. (Note: the imaginary part in Python is
denoted with a j suffix)
Example:
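A minimal sketch of the three numeric types, using type() as described above. The variable names and values are illustrative only.

```python
# Illustrative values; the names a, b, c and big are assumptions.
a = 5            # int
b = 40.5         # float
c = 1 + 3j       # complex: real part 1, imaginary part 3

for value in (a, b, c):
    print(value, "is of type", type(value).__name__)

# Integers have no fixed size limit in Python.
big = 2 ** 100
print(big)
```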
Boolean
The Boolean type provides two built-in values, True and False. These values are used
to determine whether a given statement is true or false.
It is denoted by bool.
Any non-zero number is treated as true, whereas 0 is treated as false. True and False
behave like the integers 1 and 0.
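A short sketch of how bool values behave; the name is_adult is only an illustration.

```python
print(type(True).__name__)   # the type name is bool
print(bool(5), bool(0))      # non-zero is true, zero is false
print(True + True)           # True behaves like the integer 1

is_adult = 21 >= 18          # a comparison produces a bool
print(is_adult)
```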
Q.2. Is String a mutable data type? Also explain the string operations length, indexing and slicing in
detail with an appropriate example.
ANS:
No. String in Python is immutable.
A string is an ordered sequence of characters such as "darshan", 'college', "282",
etc.
Strings are sequences of Unicode characters.
A string can be written in single, double, or triple quotes.
A string in triple quotes can span multiple lines.
Square brackets can be used to access elements of the string, e.g. "Darshan"[1] =
a. Characters can also be accessed with a negative (reverse) index, e.g. "Darshan"[-1] = n.
A string in Python is immutable, so its characters cannot be changed in place.
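The indexing and immutability rules above can be sketched as follows. The sample string comes from the notes; new_name is an illustrative name.

```python
name = "Darshan"
print(name[1])     # second character
print(name[-1])    # last character

try:
    name[0] = "M"              # strings do not support item assignment
except TypeError as error:
    print("TypeError:", error)

# To "change" a string, build a new one instead.
new_name = "M" + name[1:]
print(new_name)
```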
x = "SNPatel"
Character      =  S  N  P  a  t  e  l
Index          =  0  1  2  3  4  5  6
Negative index = -7 -6 -5 -4 -3 -2 -1
Length:
len() is a built-in function rather than a string method, but we can use it to get the length of a string.
x = "S N Patel"
print(len(x))
Output:
9
(The two spaces are counted as characters.)
Slicing:
Syntax:
x = "our string"
subx1 = x[StartIndex]
subx2 = x[StartIndex:EndIndex]
subx3 = x[StartIndex:EndIndex:Steps]
The StartIndex, EndIndex, and Steps values must all be integers. Slicing extracts the
substring from StartIndex up to EndIndex (not including EndIndex), with Steps
defining the increment of the index. If we specify Steps as -1 and omit the start and end, it extracts the
reversed string.
Example:
x = 'We are the students of SNPITRC, Umrakh.'
subx1 = x[0:6]
subx2 = x[23:30]
subx3 = x[32:]
subx4 = x[:19]
subx5 = x[::2]
subx6 = x[::-1]
print(subx1)
print(subx2)
print(subx3)
print(subx4)
print(subx5)
print(subx6)
Output
We are
SNPITRC
Umrakh.
We are the students
W r h tdnso NIR,Urk.
.hkarmU ,CRTIPNS fo stneduts eht era eW
Write a Python program that uses slicing to fetch the first name and last name from the full name of a person and
display them.
ANS:
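One possible answer is sketched below. It assumes the full name contains exactly one space between the first and last name; the sample name and the variable names are hypothetical.

```python
full_name = "Guido Rossum"            # hypothetical sample input

space_at = full_name.find(" ")        # index of the separating space
first_name = full_name[:space_at]     # slice up to, not including, the space
last_name = full_name[space_at + 1:]  # slice from just after the space

print("First name:", first_name)
print("Last name:", last_name)
```

Names with more than one space (for example a middle name) would need a different rule, such as using rfind() to locate the last space.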
for i in range(5):
    print(i, end=" ")
print()
Output
0 1 2 3 4
In simple terms, range() allows the user to generate a series of numbers within a given range.
Depending on how many arguments the user passes to the function, the user can decide where
that series of numbers will begin and end, as well as how big the difference will be between one
number and the next. The Python range() function can be called in 3 ways.
range(stop) takes one argument:
When the user calls range() with one argument, the user gets a series of numbers that
starts at 0 and includes every whole number up to, but not including, the number that the
user has provided as the stop.
for i in range(6):
    print(i, end=" ")
print()
Output
0 1 2 3 4 5
Output
0 2 4 6 8
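The other two call forms, range(start, stop) and range(start, stop, step), can be sketched as follows. The second loop is one way to produce the output 0 2 4 6 8 shown above; the exact arguments used in the original notes are an assumption.

```python
# range(start, stop): begins at start instead of 0
for i in range(2, 6):
    print(i, end=" ")
print()

# range(start, stop, step): step sets the difference between numbers
for i in range(0, 10, 2):
    print(i, end=" ")
print()

evens = list(range(0, 10, 2))
```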
Q.5. Compare and summarize four different coding styles supported by Python language. List
Advantages of Python
ANS:
Python is a programming language that is gaining popularity throughout the
world. Some programming languages, such as Java, are designed mainly around
object-oriented coding. Unlike such languages, Python
provides flexibility for the users to opt for different coding styles.
Different coding styles can be chosen for different problems. There are four different
coding styles offered by Python. They are:
Functional
Imperative
Object-oriented
Procedural
Functional coding:
In the functional type of coding, every statement is treated as a mathematical
equation and mutable (changeable) data is avoided.
Many programmers prefer this type of coding for recursion and lambda
calculus.
The merit of functional coding is that it works well for parallel processing, as
there is no state to consider.
This style of coding is mostly preferred by academics and data scientists.
Example:
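A minimal sketch of the functional style, assuming a small sample list named my_list. The data is never modified in place; each result is produced by applying functions.

```python
from functools import reduce

my_list = [1, 2, 3, 4, 5]   # sample data, assumed for illustration

# No variable is updated in place: reduce folds the list with a pure function.
total = reduce(lambda x, y: x + y, my_list)
print(total)

# map and filter also return new data rather than modifying my_list.
squares_of_evens = list(map(lambda n: n * n,
                            filter(lambda n: n % 2 == 0, my_list)))
print(squares_of_evens)
```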
Imperative coding:
Computation is performed as a direct change to the program state.
The imperative style of coding is adopted if we have to manipulate data
structures.
This style expresses the program in a simple manner.
It is mostly used by data scientists because it is easy to demonstrate.
Example:
my_list = [1, 2, 3, 4, 5]  # sample data
total = 0  # named total to avoid shadowing the built-in sum()
for x in my_list:
    total += x  # the state (total) changes on every iteration
print(total)
Object-oriented coding:
This type of coding relies on data fields, which are treated as objects.
These can be manipulated only through prescribed methods.
Python doesn’t support this paradigm completely, because some features such as
strict data hiding (private members) cannot be enforced.
This type of coding also supports code reuse.
Other programming languages generally employ this type of coding.
Example:
class word:
    def test(self):
        print("Python")

string = word()
string.test()
Procedural coding:
Most people begin learning a language through procedural code,
where the tasks proceed a step at a time.
It is the simplest form of coding.
It is mostly used by non-programmers, as it is a simple way to accomplish simpler
and experimental tasks.
It is used for iteration, sequencing, selection, and modularization.
Example:
def add(values):  # parameter renamed from list to avoid shadowing the built-in
    total = 0
    for x in values:
        total += x
    return total

my_list = [1, 2, 3, 4, 5]  # sample data
print(add(my_list))
a = [1,2,3,4,5,6]
type(a)
Output
<class 'list'>
Q.6.
A list is preferable to a tuple when we need to add or delete items in the data structure. Because
of this dynamic nature, choosing a list over a tuple will increase the runtime of the program while
accessing or iterating over items.
Tuple: A tuple is a collection of data elements like a list. The items in a tuple are separated by a
comma. However, the major difference is that a tuple is immutable. Items in the tuple can be
accessed by index, e.g. a[0] gives 1. Items can’t be added, deleted, or updated once the tuple is defined.
The tuple can store any type of data element, e.g. int, str, etc.
a = (1,2,3,4,5,6)
type(a)
len(a)
Output
<class 'tuple'>
6
A tuple is preferred over a list when we need to deal with fixed data, e.g. ('MON',
'TUE', 'WED', 'THU', 'FRI', 'SAT', 'SUN'). Because of its immutable nature, a tuple is
efficient in terms of the runtime of the program while accessing or iterating over its items.
Set: A set is an unordered collection of data elements. Items in a set are separated by a comma.
Because a set is unordered, its items can’t be accessed by index. A set
does not allow duplicates, so adding an existing element to the set will not make any change.
However, items in a set can be added or removed.
a = {1,2,3,4,5,6}
type(a)
len(a)
Output
<class 'set'>
6
A set is really useful when a programmer needs to check whether an item exists in the data
structure. A membership test on a set takes constant time on average, whereas a tuple or list must be searched item by item.
Dictionary (dict): A dictionary is a data type that stores data as key: value pairs. A dictionary
is written in curly brackets with comma-separated key: value pairs. A dict keeps its items in insertion
order (Python 3.7 and later) and allows the addition, update, and deletion of items. A dict allows retrieving values using
keys. Keys and values can be accessed as follows: a.keys() gives all the keys and a.values()
returns all the values. A key is unique within a dictionary, meaning a key can’t be duplicated in the same
dictionary. A value can be retrieved by providing the corresponding key. A key can be any
immutable (hashable) data element, e.g. int, str, or tuple, whereas a value can be of any type, e.g. int, str, list, set, or tuple.
a = {'a':1,'b':2,'c':3}
type(a)
len(a)
Output
<class 'dict'>
3
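The differences between the four collection types described above can be sketched in a few lines. The variable names and values are illustrative only.

```python
a_list = [1, 2, 3]
a_list.append(4)            # lists can grow
del a_list[0]               # and shrink
print(a_list)

a_tuple = (1, 2, 3)
try:
    a_tuple[0] = 10         # tuples are immutable
except TypeError as error:
    print("TypeError:", error)

a_set = {1, 2, 3}
a_set.add(3)                # adding a duplicate changes nothing
print(len(a_set), 2 in a_set)

a_dict = {'a': 1, 'b': 2, 'c': 3}
a_dict['d'] = 4             # add a new key: value pair
a_dict['a'] = 100           # update an existing value
del a_dict['b']             # delete a pair
print(list(a_dict.keys()), list(a_dict.values()))
```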
CO 2 – QUESTIONS & ANSWERS
Q.7. What is the role of Python in Data science? Explain sampling in terms of data science?
Ans:
Python is flexible and capable because it supports so many third-party libraries
devoted to data science tasks.
The following points help you better understand why Python is such a good choice for many data
science needs.
1. Considering the shifting profile of data scientists
2. Working with a multipurpose, simple, and efficient language
Python is the vision of a single person, Guido van Rossum, who started the language in
December 1989 as a replacement for the ABC language.
However, it far exceeds ABC in its ability to create applications of all types and, in contrast to
ABC, boasts four programming styles (programming paradigms):
Functional
Treats every statement as a mathematical equation and avoids any form of state or
mutable data
The main advantage of this approach is having no side effects to consider
This coding style lends itself better than the others to parallel processing because there is
no state to consider
Many developers prefer this coding style for recursion and for lambda calculus
Imperative
Performs computations as a direct change to program state
This style is especially useful when manipulating data structures and produces elegant
but simple code
Object-oriented
Relies on data fields that are treated as objects and manipulated only through prescribed
methods
Python doesn’t fully support this coding form because it can’t implement features such as
data hiding
This is a useful coding style for complex applications because it supports encapsulation
and polymorphism
Procedural
Treats tasks as step by step iterations where common tasks are placed in functions that
are called as needed
Python is easy to use when it comes to quantitative and
analytical computing.
In data science, Python is widely used and is a favorite tool, being a flexible and
open-source language.
Its extensive libraries are used for data manipulation and are easy to learn even for a
beginner data analyst.
Apart from being platform-independent, it also integrates easily with existing
infrastructure, which can be used to solve the most complex problems.
Python is preferred over other data science tools because of the following features:
⮩ Powerful and Easy to use
⮩ Open Source
⮩ Choice of Libraries
⮩ Flexibility
⮩ Visualization and Graphics
⮩ Well supported
Q.9.
Whereas in other programming languages the indentation in code is for readability only, the
indentation in Python is very important.
Python provides no braces to indicate blocks of code for class and function definitions or flow
control. Blocks of code are denoted by line indentation, which is rigidly enforced.
The number of spaces in the indentation is variable, but all statements within the block must be
indented the same amount.
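A small sketch of how indentation defines blocks. The function name classify and the deliberately broken snippet in bad_source are illustrative only.

```python
def classify(number):
    # Everything indented under "def" belongs to the function body.
    if number % 2 == 0:
        # This block runs only when the condition is true.
        label = "even"
    else:
        label = "odd"
    return label

print(classify(4))
print(classify(7))

# Inconsistent indentation inside a block is rejected before the code runs.
bad_source = "if True:\n    x = 1\n      y = 2\n"
try:
    compile(bad_source, "<example>", "exec")
except IndentationError as error:
    print("IndentationError:", error.msg)
```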
Q.10. Which are the basic activities we perform as a part of the data science pipeline? Summarize and
explain in brief.
Ans:
The data science pipeline requires the data scientist to follow particular steps in the preparation,
analysis and presentation of the data.
General steps in the pipeline are
⮩ Preparing the data
▪ The data we access from various sources may not come directly in a structured format.
▪ We need to transform the data into a structured format.
▪ Transformation may require changing data types, the order in which data appears and even the
creation of missing data.
⮩ Performing data analysis
▪ Results of the data analysis should be provable and consistent.
▪ Sometimes a single approach may not provide the desired output; we need to use multiple
algorithms to get the result.
▪ The use of trial and error is part of the data science art.
⮩ Learning from data
▪ As we iterate through various statistical analysis methods and apply algorithms to detect
patterns, we begin learning from the data.
▪ The data might not tell the story that you originally thought it would.
⮩ Visualizing
⮩ Obtaining insights
Q.11. Summarize the characteristics of NumPy, Pandas, Scikit-Learn and matplotlib libraries along
with their usage in brief.
Ans:
1) NumPy
NumPy is used to perform fundamental scientific computing.
The NumPy library provides the means for performing n-dimensional array manipulation,
which is critical for data science work.
NumPy provides functions that include support for linear algebra, Fourier transformation,
random-number generation and many more.
2) pandas
pandas is a fast, powerful, flexible and easy to use open source data analysis and
manipulation tool, built on top of the Python programming language.
It offers data structures and operations for manipulating numerical tables and time series.
The library is optimized to perform data science tasks especially fast and efficiently.
The basic principle behind pandas is to provide data analysis and modelling support for
Python that is similar to other languages such as R.
3) matplotlib
The matplotlib library gives a MATLAB-like interface for creating data presentations of
the analysis.
The library was initially limited to 2-D output, but it still provides a means to express analysis
graphically.
Without such a library it is difficult to create output that people outside the data science
community can easily understand.
4) Scikit-learn
The Scikit-learn library is one of many Scikit libraries that build on the capabilities provided
by NumPy and SciPy to allow Python developers to perform domain specific tasks.
The Scikit-learn library focuses on data mining and data analysis. It provides access to the following
sorts of functionality:
⮩ Classification
⮩ Regression
⮩ Clustering
⮩ Dimensionality reduction
⮩ Model selection
⮩ Pre-processing
Scikit-learn is the most important library we are going to learn in this subject
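As an illustration of the NumPy features listed in this answer (n-dimensional arrays, linear algebra, random numbers), here is a minimal sketch. It assumes NumPy is installed, and the sample values are arbitrary.

```python
import numpy as np

# A 2-D array (matrix); the values are arbitrary sample data.
matrix = np.array([[1.0, 2.0],
                   [3.0, 4.0]])

print(matrix.shape)          # dimensions of the array
print(matrix.T)              # transpose
print(matrix @ matrix)       # matrix product

# Linear algebra: solve matrix @ x = b for x.
b = np.array([5.0, 6.0])
x = np.linalg.solve(matrix, b)
print(x)

# Random-number generation with a fixed seed for repeatability.
rng = np.random.default_rng(seed=0)
samples = rng.random(3)      # three floats in [0, 1)
print(samples)
```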
Q.12. What is HTML parsing?
Ans:
HTML parsing is the process of analysing the markup of an HTML document and building a
structured (tree) representation of it, from which data can be extracted.
BeautifulSoup is a Python library for parsing HTML and XML documents. It provides a
convenient way to extract and navigate data from HTML documents, making it a popular choice
among developers for web scraping and data extraction tasks.
One reason for its popularity is its ease of use. BeautifulSoup provides a simple and intuitive API
that makes it easy to extract data from HTML documents. It also supports several parsers
and can handle malformed HTML documents with ease.
In the following example, we show how to use BeautifulSoup to extract every quote from
a website.
Program:
import requests
from bs4 import BeautifulSoup
url = 'https://quotes.toscrape.com/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
# The span and small class names below are assumptions about the page's markup.
for quote in soup.find_all('div', class_='quote'):
    text = quote.find('span', class_='text').text
    author = quote.find('small', class_='author').text
    print(text, '-', author)
After fetching and parsing the page, we use the find_all() method to extract all div elements with a class attribute of 'quote'. For
each quote, we extract the quote text and author name using the find() method and print them to
the console.
Q.13.
1. Send an HTTP request to the URL of the webpage you want to access. The server
responds to the request by returning the HTML content of the webpage. For this task, we
will use a third-party HTTP library for Python, requests.
2. Once we have accessed the HTML content, we are left with the task of parsing the data.
Since most of the HTML data is nested, we cannot extract data simply through string
processing. One needs a parser which can create a nested/tree structure of the HTML
data. There are many HTML parser libraries available but the most advanced one is
html5lib.
3. Now, all we need to do is navigate and search the parse tree that we created, i.e. tree
traversal. For this task, we will be using another third-party Python library, Beautiful
Soup. It is a Python library for pulling data out of HTML and XML files.
Program:
# Python program to scrape a website
# and save quotes from the website
import requests
from bs4 import BeautifulSoup
import csv
URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')  # parse the page (html5lib must be installed)
# Each quote is a dict with the keys theme, url, img, lines and author.
# The step that fills this list from soup is not shown in these notes.
quotes = []
filename = 'inspirational_quotes.csv'
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['theme', 'url', 'img', 'lines', 'author'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)
Q.15. What do you mean by missing values? Explain the different ways to handle the missing value
with example. Explain how to deal with missing data in Pandas.
Q.16. What are categorical variables? Explain with an example.
Q.17. Write a python program to read data from CSV files using pandas.
Q.18. Write a python program to read data from text files in form of table.
Q.19. Explain DataFrame in Pandas with example.
Q.20. Define term n-gram. Explain the TF-IDF techniques.
Q.21. Compare the numpy and pandas on the basis of their characteristics and usage.
Q.22. Differentiate rand and randn function in Numpy.
Q.23. What kind of data is analyzed with the bag-of-words model? Explain it with an example.
Q.24. Write a brief note on NetworkX library.
Q.25. Describe date time transformation using datetime module.
Q.27. List various types of graph/chart available in the pyplot of matplotlib library for data
visualization.
Ans:
The kind of graph we choose determines how people view the associated data, so choosing
the right graph from the outset is important.
For example,
⮩ if we want to show how various data elements contribute towards a whole, we should use
a pie chart.
To find a specific pattern in the data, we can further divide the data and draw a scatter plot.
We can do this with the help of the groupby method of DataFrame, and then using tuple
unpacking while looping over the groups.
We can specify the marker, color, and size of the marker with the help of the marker, color and s
parameters respectively.
⮩ freq, to specify the frequency at which we want the date range (default is ‘D’ for days)
⮩ periods, the number of periods to generate between start and end, or from start with freq.
We can also create a date range with the help of a start date, periods and freq.
Some of the important possible values for freq are
⮩ W, for week
⮩ M, for month (ME in pandas 2.2 and later)
⮩ Y, for year (YE in pandas 2.2 and later)
⮩ H, for hour (h in pandas 2.2 and later)
⮩ S, for seconds (s in pandas 2.2 and later)
⮩ L, for milliseconds (ms in pandas 2.2 and later)
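A minimal sketch of creating a date range from a start date, periods and freq with pandas.date_range(). It assumes pandas is installed; the start date is an arbitrary sample value.

```python
import pandas as pd

# Five consecutive days starting from an arbitrary sample date.
days = pd.date_range(start="2024-01-01", periods=5, freq="D")
print(days)

# Four weekly dates; "W" anchors each date on a Sunday by default.
weeks = pd.date_range(start="2024-01-01", periods=4, freq="W")
print(weeks)
```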
Q.31. Compare bar graph, box-plot and histogram with respect to their applicability in data
visualization.
Ans:
Histogram | Bar Graph
A histogram is a graphical representation that uses bars to display the frequency of numerical data. | A bar graph is a graphical representation of data that uses bars to compare different categories of data.
Distribution of non-discrete variables. | Comparison of discrete variables.
Bars touch each other, so there are no spaces between bars. | Bars never touch each other, so there are spaces between bars.
In this type of graph, elements are grouped so that they are considered as ranges. | In this type of graph, elements are taken as individual entities.
Histogram width may vary. | The bar chart is mostly of equal width.
To display the frequency of occurrences. | To compare different categories of data.
In a histogram, the data points are grouped and rendered based on their bin value. | In a bar graph, each data point is rendered as a separate bar.
The items of a histogram are numbers, which should be categorized to represent a data range. | In a bar graph, items should be considered as individual entities.
In a histogram, we cannot rearrange the blocks. | In a bar graph, it is common to rearrange the blocks, from highest to lowest.
Q.32. What is the use of scatter-plot in data visualization? Can we draw trendline in scatter-plot?
Explain it with example.
Ans:
A scatter plot draws one point per observation to show the relationship between two numeric
variables, which helps reveal patterns, clusters and outliers.
Yes, we can draw a trend line in a scatter plot. Using matplotlib, we can use the NumPy polyfit() and poly1d() functions to get
the trend line points.
Steps
Set the figure size and adjust the padding between and around the subplots (optional).
Create x and y data points using numpy.
Create a figure and a set of subplots.
Plot x and y data points using the scatter() method.
Find the trend line data points using the polyfit() and poly1d() functions.
Plot x and p(x) data points using the plot() method.
To display the figure, use the show() method.
Example
import numpy as np
from matplotlib import pyplot as plt
x = np.random.rand(100)
y = np.random.rand(100)
fig, ax = plt.subplots()
_ = ax.scatter(x, y, c=x, cmap='plasma')
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x, p(x), "r-o")
plt.show()
Output
Q.33. Explain Labels, Annotation and Legends in MatPlotLib.
Ans:
To fully document our graph, we have to resort to labels, annotations and legends.
Each of these elements has a different purpose, as follows:
⮩ Label : provides identification of a particular data element or grouping; it makes it easy for the
viewer to know the name or kind of data illustrated.
⮩ Annotation : augments the information the viewer can immediately see about the data with
notes, sources or other useful information.
⮩ Legend : presents a listing of the data groups within the graph and often provides cues
(such as line type or color) to identify the line with the data.
Program:
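import matplotlib.pyplot as plt
# Jupyter notebook magic; omit this line when running as a plain script
%matplotlib inline
values1 = [5,8,9,4,1,6,7,2,3,8]
values2 = [8,3,2,7,6,1,4,9,8,5]
plt.plot(range(1,11),values1)
plt.plot(range(1,11),values2)
plt.xlabel('Roll No')   # label for the x-axis
plt.ylabel('CPI')       # label for the y-axis
plt.annotate('Lowest CPI', xy=(5, 1))   # note placed at the point (5, 1)
plt.legend(['CX','CY'],loc=4)   # loc=4 is the lower-right corner
plt.show()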
Output:
CO 5 – QUESTIONS & ANSWERS
⮩ Classifying
⮩ Regressing
⮩ Grouping by clusters
⮩ Transforming data
Even though each base class has specific methods and attributes, the core functionalities for
data processing and machine learning are guaranteed by one or more series of methods and
attributes.
Q.40. What is Data Wrangling process? Define exploratory data analysis. Why is EDA required
in data analysis? Explain the steps needed to perform data wrangling.
Ans:
Exploratory Data Analysis (EDA) refers to the critical process of performing initial
investigations on data so as to discover patterns, to spot anomalies, to test hypotheses and to
check assumptions with the help of summary statistics and graphical representations.
EDA was developed at Bell Labs by John Tukey, a mathematician and statistician who
wanted to promote more questions and actions on data based on the data itself.
In one of his famous writings, Tukey said:
“The only way humans can do BETTER than computers is to take a chance of doing WORSE
than them.”
The above statement explains why, as a data scientist, your role and tools aren’t limited to
automatic learning algorithms but also include manual and creative exploratory tasks.
Computers are unbeatable at optimizing, but humans are strong at discovery by taking
unexpected routes and trying unlikely but very effective solutions.
With EDA we,
Describe data
Closely explore data distributions
Understand the relationships between variables
Notice unusual or unexpected situations
Place the data into groups
Notice unexpected patterns within the group
Take note of group differences
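Several of the EDA activities listed above can be sketched with pandas. The dataset here is made up purely for illustration, and the 1.5 standard deviation cut-off for "unusual" values is an arbitrary assumption.

```python
import pandas as pd

# A tiny made-up dataset, used only to illustrate the steps.
df = pd.DataFrame({
    "group": ["A", "A", "B", "B", "B"],
    "height": [150.0, 160.0, 170.0, 175.0, 240.0],
    "weight": [50.0, 58.0, 68.0, 72.0, 75.0],
})

print(df.describe())                     # describe data and its distribution
print(df[["height", "weight"]].corr())   # relationships between variables

group_means = df.groupby("group")["height"].mean()
print(group_means)                       # place data into groups, compare them

# Flag unusual values: more than 1.5 standard deviations from the mean.
z = (df["height"] - df["height"].mean()) / df["height"].std()
unusual = df[z.abs() > 1.5]
print(unusual)
```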
Q.41. What do you mean by covariance? What is the importance of covariance in data analysis?
Explain it with example.
Ans:
Covariance
⮩ Covariance is a measure used to determine how much two variables change in tandem.
⮩ The unit of covariance is a product of the units of the two variables.
⮩ Covariance is affected by a change in scale. The value of covariance lies between -∞ and +∞.
⮩ Pandas has a built-in DataFrame method named cov() to find covariance.
E.g.,
print(df.cov())
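To make the definition concrete, the sketch below computes the sample covariance by hand, using the same n - 1 divisor that DataFrame.cov() uses by default. The hours and marks values are made up for illustration.

```python
def covariance(xs, ys):
    """Sample covariance: sum((x - mean_x) * (y - mean_y)) / (n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

hours = [1, 2, 3, 4, 5]          # made-up sample data
marks = [52, 55, 61, 64, 68]

print(covariance(hours, marks))   # positive: the two rise together

# Changing the scale of one variable changes the covariance.
minutes = [h * 60 for h in hours]
print(covariance(minutes, marks))
```

The second value is 60 times the first, which is why covariance alone is hard to compare across variables measured in different units.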
Q.42. List different ways of defining descriptive statistics for numeric data. Explain them in brief.
Ans:
We can use the pandas built-in function value_counts() to obtain the frequency of the categorical
variables.
Program:
print(df["Sex"].value_counts())
Output:
male 577
female 314
Name: Sex, dtype: int64
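For numeric columns, describe() reports the count, mean, standard deviation, minimum, quartiles and maximum.
Program: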
print(df.describe())
Output:
Q.43. What is the use of hash function in EDA? Express various hashing trick along with example.
Ans:
Most machine learning algorithms use numeric inputs. If our data contains text instead, we
need to convert the text into numeric values first; this can be done using hashing tricks.
For example, suppose our dataset contains a column with text values such as Male/Female.
We cannot apply machine learning algorithms to text like Male/Female; we need numeric
values instead. One thing we can do here is assign a number to each word
and replace the word with that number.
If we assign 1 to male and 0 to female, we can use ML algorithms on the resulting numeric
data set.
In machine learning, feature hashing, also known as the hashing trick, is a fast and space-
efficient way of vectorising features, i.e. turning arbitrary features into indices in a vector or
matrix.
When dealing with text, one of the most useful solutions provided by the Scikit‐learn package
is the hashing trick.
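The idea behind the hashing trick can be sketched in plain Python: each word is hashed to an index in a fixed-size vector, and the count at that index is incremented. This is an illustration of the idea only, not scikit-learn's implementation (scikit-learn offers the technique through HashingVectorizer). The function name, the vector size of 8 and the use of md5 are assumptions made for the example.

```python
import hashlib

def hashed_features(text, n_features=8):
    """Map each word to a slot of a fixed-size vector and count occurrences."""
    vector = [0] * n_features
    for word in text.lower().split():
        # md5 is used only because it gives the same value on every run;
        # Python's built-in hash() is randomized for strings.
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features
        vector[index] += 1
    return vector

print(hashed_features("male female male"))
print(hashed_features("Python is easy and Python is fun"))
```

Because the vector size is fixed, different words can land in the same slot (a collision); a larger n_features makes this less likely at the cost of more memory.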
Supervised learning model: takes direct feedback to check if it is predicting the correct
output or not.
Unsupervised learning model: does not take any feedback.
Output:
Q.46. Define covariance and correlation.
Ans:
Covariance
⮩ Covariance is a measure used to determine how much two variables change in tandem.
⮩ Covariance is affected by a change in scale. The value of covariance lies between -∞ and +∞.
⮩ Pandas has a built-in DataFrame method named cov() to find covariance.
Correlation
⮩ Correlation is the covariance of two variables divided by the product of their standard
deviations, so its value always lies between -1 and +1.
⮩ Once we’ve normalized the metric to the -1 to 1 scale, we can make meaningful statements
and compare correlations.
Program Code:
print(df.cov())
print(df.corr())
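The normalization step can be sketched by hand as the Pearson correlation, which is also the default method of DataFrame.corr(). The data values and the function name pearson are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Correlation = covariance / (std_x * std_y); the n - 1 factors cancel."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]            # made-up sample data
marks = [52, 55, 61, 64, 68]
minutes = [h * 60 for h in hours]  # same data on a different scale

print(round(pearson(hours, marks), 4))
print(round(pearson(minutes, marks), 4))   # unchanged by the change of scale
```

Unlike covariance, the result does not change when one variable is rescaled, which is what makes correlations comparable.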