Data Mining Python Lab
A Learner Centric
LABORATORY MANUAL & RECORD BOOK
First Edition, December 2023
Student Details
Student Name:
Roll Number:
Semester/Branch/Section:
1.
2.
Laboratory Course Faculty:
3.
4.
Course Offered by
DEPARTMENT OF INFORMATION TECHNOLOGY
KAKATIYA INSTITUTE OF TECHNOLOGY & SCIENCE, WARANGAL
(An Autonomous Institute under Kakatiya University, Warangal)
CERTIFICATE
This is to certify that it is a bonafide record of practical work done by Mr. / Kum.
………………………………………………………………………………………
………………… branch in the Design and Analysis of Algorithms Laboratory (DAA Lab)
Examiner
(Signature with Date)
PREFACE
Dear students,
This lab manual is designed and developed as “A Learner Centric Laboratory Manual and
Record Book (LMRB)” for Outcome Based Education (OBE).
Note:
a) Students should complete the Programming Task in the first 2 periods of the lab session.
b) Use the last 30 – 45 minutes of lab time to do the following tasks:
a. Complete the record write-up (W: 10 marks), and
b. Attend the viva-voce (V: 10 marks).
c) The LAB RECORD WRITE-UP is to be completed in the respective lab session itself. The lab
course faculty will complete the evaluation in the lab session itself.
d) The lab record WRITE-UP should not be carried home for completion.
NOTE: FACULTY WILL COMPLETE THE STUDENT EVALUATION IN THE LAB SESSION
ITSELF SO STUDENT SHOULD COMPLETE THE WRITE UP IN THE LAB SESSION
ITSELF.
The lab course faculty will assess and evaluate the student in four quadrants i.e. K, P, W & V
during the lab slot itself, and award the marks after conduction of viva-voce. This evaluation
gives scope for the students to improve, in the upcoming weeks of programming tasks, by
demonstrating relevant skills and the competencies in K, P, W & V.
Bottom Line:
a) A well-defined learner-centric continuous internal evaluation (CIE) will be followed in this
lab. It is expected to make students active, skilled learners who acquire several
competencies related to the programming tasks.
b) Hence, students are advised to love learning, follow the stipulated CIE and become active
learners.
c) Active learning will ensure students acquire the 21st century skills and competencies needed to be
successful in a job.
INDEX
Institute vision and mission 1
Vision of Institute:
• To make our students technologically superior and ethically strong by providing quality
education with the help of our dedicated faculty and staff and thus improve the quality of
human life.
Vision of Department:
• To become a Centre of Excellence in the Information Technology discipline with effective
teaching and strong research environment that makes our students globally competitive with
strong ethical values and leadership abilities.
Program - B.Tech. Information Technology
Within first few years after graduation, the Information Technology graduates will be able
to…
PEO1 To provide students with a sound foundation in Information Technology theory
and practices to analyze, formulate and solve engineering problems
PEO2 To develop an ability to design algorithms, implement programs and deploy
software.
PEO3 To develop Information Technology solutions with the changing needs of the
society for the career-related activities.
PO7 Environment and Understand the impact of the professional engineering
sustainability solutions in societal and environmental contexts, and
demonstrate the knowledge of, and need for sustainable
development.
PO8 Ethics Apply ethical principles and commit to professional ethics
and responsibilities and norms of the engineering practice
PO9 Individual and Function effectively as an individual, and as a member or
teamwork leader in diverse teams, and in multidisciplinary settings.
PO10 Communication Communicate effectively on complex engineering
activities with the engineering community and with
society at large, such as, being able to comprehend and
write effective reports and design documentation, make
effective presentations, and give and receive clear
instructions.
PO11 Project management Demonstrate knowledge and understanding of the
and finance engineering and management principles and apply these
to one’s own work, as a member and leader in a team, to
manage projects and in multidisciplinary environments.
PO12 Life-long learning Recognize the need for, and have the preparation and
ability to engage in independent and life-long learning in
the broadest context of technological change.
INSTRUCTIONS TO THE STUDENTS
1. This Learner Centric LABORATORY MANUAL & RECORD BOOK (LMRB) is essential for the
student and must be brought to every laboratory session.
2. This learner centric LMRB consists of At-Home Sample Programs (SPs) and In-lab Exercise
Problems (EPs)
a) At-Home Sample Programs (SPs): At-Home Sample Programs (SPs) are the HOMEWORK
programs to be completed, before attending the lab. You should execute these Sample Programs
(SPs) with the given sample test cases and check for the results. In addition, as a proof of
completion of the HOMEWORK, the student should execute the SPs, with other set of test cases
and record the answers in the space provided under Additional Test Cases
(i). You should design your own additional Test Cases and execute these SPs.
(ii). You should design the test cases, which challenge the robustness of the code. The
challenging test cases have the capacity to halt the program execution.
(iii). You should bring those challenging test cases to the notice of course faculty, so that the
code of SPs can suitably be modified to make the code robust.
b) In-Lab Exercise Problem (EPs): In-Lab Exercise Problems (EPs) are the problems to be coded
during the lab session. Student should complete all EPs in the Lab slot itself with necessary
write up in the space provided, by following the required Program Development Steps (PDS).
Therefore, students should:
(i). Work on the At-Home SPs and execute them with Additional Test Cases before attending the
lab.
(ii). Complete the In-Lab EPs in the lab session.
Prior preparation on EPs will help the student a lot in completing the EPs within the
stipulated lab session. It is always a good practice to attend the Lab with prior preparation.
c) For EP1 - All steps of PDS are mandatory: Algorithm/Pseudocode development is mandatory
only for the FIRST Exercise Problem (EP-1) of every Laboratory task.
d) For OTHER EPs (EP2 onwards): To save time during lab session, the student can skip "writing
algorithm" for other EPs. The student should focus and work on Problem Analysis (Logic
development), Coding, Testing & Debugging and execution with Test Cases.
e) Student should focus on developing code for the In-Lab EPs, which is readable (with proper
Annotations and Indentation), maintainable, extendable, testable and robust.
3. All the EPs must be completed within the stipulated time.
4. Students should demonstrate the required skills during ORAL VIVA-VOCE. It is not mandatory
to write the answers to the viva-voce questions of every lab task, but it is good practice to
write the answers in the space provided after completion of the lab session.
5. An incomplete lab record will result in reduction of marks.
Rubrics for Continuous Internal Evaluation (CIE)
Continuous Internal Evaluation (CIE) for Practical (Laboratory) Course shall carry 40% weightage.
CIE throughout the semester shall consist of the following for each experiment/lab.
Every laboratory session is evaluated for a total of 40 marks. The details have been listed below.
These Five (05) marks shall be awarded based on student's performance, as below:
% of questions answered satisfactorily Marks awarded
80-100 : 5
60-80 : 3-4
30-60 : 2
0-30 : 1
Note: Faculty will check whether the student executed the At-Home SPs with additional Test Cases
and affix signature.
Once the student is allowed to develop the program for the In-Lab EPs, marks will be awarded based on
his/her participation as an individual while developing the programs by following the PDS.
Completed the program effectively with partial assistance of faculty but able to answer
the questions related to programming tasks : 7-9
Completed the EPs only with full assistance of faculty but unable to answer the
questions on programming tasks : 3-6
Unable to complete the EPs even with assistance of the faculty : 0-2
The student should complete the write-up, related to the program conducted, in the manual itself, in the
designated space for Record Work. The write-up must be on the following:
• Problem Analysis (Logic Development)
• Program execution with Test Cases
4. Viva-voce(V): 10 Marks
After completing the write-up, the student should attend viva-voce to answer the following:
Interpretation of output: Viva-voce should not be limited to only the sample questions listed in VIVA-
VOCE questions at the end of program, but should go beyond to test the student's involvement in the
program development and also the technical competency.
(i). What did you learn from these programs based on objectives?
(ii). How will you apply knowledge gained, by performing these programs, in future?
Student should be asked to comment, on the following, specific to SPs and EPs:
(iii). Alternative Approach: Can you propose any "alternative logic/solution" to make the code
more effective? (Specific to specified EPs)
(iv). Maintainability of the code: Do you think that the code written by you is maintainable? Justify.
(v). Testability of the code: Do you think that your programs are testable? Justify.
(vi). Extensibility of the code: What are your ideas on extending the existing code with additional
features?
(vii). Readability of the code: Do you think that the code written by you is readable (easy to follow,
easy to understand)? Justify.
(viii). Robustness of the code: Is your code robust? Justify.
(ix). Any other ideas related to the specific SPs/EPs.
Marks will be awarded based on student's performance, as below:
Viva-Voce : Marks awarded
Reasonable conclusions drawn with good interpretation of results and answered 80-100% of the
viva-voce questions perfectly : 10
Reasonable conclusions drawn but answered 50-80% of the viva-voce questions : 7-9
Poor conclusions and interpretation of results with only 30-50% of viva-voce questions
answered : 3-6
Conclusions without interpretation of results and answered less than 30% of viva-voce
questions posed : 0-2
MAKE-UP LAB SESSIONS
1. Missing lab sessions due to holidays or unforeseen circumstances / disturbances will cause a
big loss to student learning.
2. To compensate for this loss, lab course faculty has to plan and conduct additional lab sessions,
called Make-up Lab Sessions, beyond working hours of the institute (or) on Saturdays /
Sundays, by giving prior information to students.
3. The lab course faculty has to ensure that Make-up Lab Sessions are arranged in the following
cases
i. to compensate for the lab sessions to be lost due to holidays
ii. to compensate for the lab sessions to be lost due to unforeseen circumstances
4. The dates for Make-up lab sessions for case (i), i.e., for the sessions expected to be lost
due to holidays, are to be announced at the beginning of the semester itself and printed
in the Lab Programs Calendar.
5. The dates for Make-up lab sessions for case (ii), i.e., for the sessions lost
due to unforeseen circumstances, are to be announced, conducted and recorded as and when the
lab sessions get disturbed.
IMPORTANT NOTE:
a) Completing all stipulated programs is mandatory for the students to appear for
Laboratory End Semester Examination (ESE).
b) It is student's responsibility to complete all programs
c) If any student is absent for any laboratory session due to valid/genuine reasons, he/she
must complete the program within a week time by seeking permission from the lab course
faculty.
d) Upon completion of the programs of lab sessions which were missed due to valid/genuine
reasons, student will be evaluated for only 50% of the maximum marks of the program and
the corresponding attendance will not be counted.
e) Students are allowed to utilize the laboratory sessions beyond the working hours.
PDS
The students should follow the following steps known as "Program Development Steps" (PDS) to
develop and execute a given programming task.
1. Problem analysis (Logic Development)
2. Algorithm/Pseudo code
3. Flowchart
4. Coding
5. Testing & Debugging
6. Programming execution with Test Cases
Week 2    25.12.2023 to 30.12.2023    2. Write a program to perform various OLAP operations.
Week 6    22.01.2024 to 27.01.2024    8. Introduction to NumPy, Operations on NumPy Arrays.
Week 7    29.01.2024 to 03.02.2024    9. Introduction to Pandas, Getting and Cleaning Data.
Week 9    12.02.2024 to 20.02.2024    MID SEMESTER EXAMINATION - 1
Week 10   21.02.2024 to 24.02.2024    No Laboratory due to MSE-I on Monday & Tuesday
Week 11   26.02.2024 to 02.03.2024    12. Plotting Data Distributions, Categorical and Time-Series Data.
Week 12   04.03.2024 to 09.03.2024    13. Generate association rules from frequent item sets.
Week 15   25.03.2024 to 30.03.2024    16. Implement K-means and hierarchical clustering algorithms.
Week 16   01.04.2024 to 06.04.2024    Makeup Laboratory
Week 17   08.04.2024 to 20.04.2024    MID SEMESTER EXAMINATION - 2
Week 18   22.04.2024 to 30.04.2024    LABORATORY END SEMESTER EXAMINATION
LAB EXPERIMENTS CALENDAR – MAKE-UP SESSIONS
Make-up Lab
S. No. Time Title of the experiment
on (Date)
Make-up lab sessions - for sessions lost due to holidays
1.
2.
3.
4.
1.
2.
3.
4.
LIST OF PROGRAMS & CIE
Exp. No.   Title of the experiment   Date of conduction   Marks awarded (40)   Signature of faculty
E1   1. Write a program to perform multidimensional data model using SQL queries (Star, snowflake and fact
constellation schemes).
Laboratory Task - 1
E1. 1. Write a program to perform multidimensional data model using SQL queries
(Star, snowflake and fact constellation schemes)
CONCEPT AT A GLANCE
Data: It is a set of facts and figures. It is like raw materials of data items with numbers,
alphabets and other symbols.
Field or Column: To prepare information, data items are organized in the form of
fields.
Data cube: A Lattice of Cuboids
Modeling data warehouses: Data warehouse is modeled using one of the following
multidimensional models which is described by dimensions & measures
1. Star schema: A fact table in the middle connected to a set of dimension tables
2. Snowflake schema: A refinement of Star schema where some dimensional hierarchy is
normalized into a set of smaller dimension tables, forming a shape similar to Snowflake
3. Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of Star
schemas, therefore called galaxy schema or fact constellation
Sales analysis modeling using Star schema w.r.t item, time, branch and location dimensions
Example of Snowflake Schema
Sales analysis modeling using Snowflake schema w.r.t item, time, branch and location dimensions.
In this example item and location tables are normalized.
Sales analysis modeling using Fact constellation schema w.r.t item, time, branch and location
dimensions. In this example two fact tables, sales and shipping fact tables are sharing common
dimensions.
Demonstrate Star schema creation for sales data analysis.
Assume Sales data is analyzed w.r.t item, location and branch and time dimensions and create the
tables and insert the sample data
Table created
Table Created
Table Created
Table Created
Table Created
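The star schema described above can be tried out without a database server. The following is a minimal sketch using Python's built-in sqlite3 module; the table name time_dim, all column names and the sample rows are assumptions made for illustration and are not taken from the lab's original SQL scripts.

```python
import sqlite3

# In-memory database, so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Four dimension tables and one fact table (names and columns are assumed).
cur.executescript("""
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- The fact table sits in the middle and references every dimension.
CREATE TABLE sales (
    item_key     INTEGER REFERENCES item(item_key),
    time_key     INTEGER REFERENCES time_dim(time_key),
    branch_key   INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")

# Sample data (invented for this sketch).
cur.executemany("INSERT INTO item VALUES (?, ?, ?)",
                [(1, "Laptop", "Acme"), (2, "Phone", "Acme")])
cur.execute("INSERT INTO time_dim VALUES (1, 25, 12, 2023)")
cur.execute("INSERT INTO branch VALUES (1, 'B1')")
cur.executemany("INSERT INTO location VALUES (?, ?, ?)",
                [(1, "Warangal", "India"), (2, "Hyderabad", "India")])
cur.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)",
                [(1, 1, 1, 1, 900.0, 1), (2, 1, 1, 1, 500.0, 2), (2, 1, 1, 2, 250.0, 1)])

# A typical star join: total dollars sold per city.
rows = cur.execute("""
    SELECT l.city, SUM(s.dollars_sold)
    FROM sales s JOIN location l ON s.location_key = l.location_key
    GROUP BY l.city
    ORDER BY l.city
""").fetchall()
print(rows)
```

A snowflake variant would split, for example, location into separate city and country tables; the fact table itself would stay unchanged.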
RECORD WORK
(a). In-Lab Exercise Problems (EPs): In-Lab Exercise Problems (EPs) are the problems to be
coded during the lab session. Students should complete all EPs in the lab slot itself with
the necessary write-up in the space provided, by following the required Program
Development Steps (PDS).
Therefore, students should
1. work on the At-Home SPs and execute those SPs with Additional Test Cases
before attending the lab.
2. complete the In-Lab EPs in the lab session.
Prior preparation on EPs will help the student a lot in completing the EPs within the
stipulated lab session. It is always good practice to attend the lab with prior
preparation.
(c). Students should focus on developing code for the In-Lab EPs which is
1. readable (with proper annotations and indentation)
2. maintainable
3. extendable
4. testable and
5. robust
RECORD WORK
Viva-Voce Questions:
Note: For Viva-voce questions, the students should demonstrate their knowledge and
skills through oral communication. Writing answers to these Questions is Not mandatory.
However, it is good practice to have answers written after completion of the lab session for future reference.
Laboratory Task - 2
CONCEPT AT A GLANCE
Group By Clause:
An aggregate function takes multiple rows of data returned by a query and aggregates them into a
single result row. Including the Group By clause limits the window of data processed by the
aggregate function: it produces one aggregated value for each distinct combination of values present
in the columns listed in the Group By clause. The number of result rows is therefore at most the
product of the numbers of distinct values of the columns listed in the Group By clause.
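As a rough illustration of how Group By limits the window of an aggregate, the sketch below runs a grouped AVG through Python's sqlite3 module. The emp table with deptno, job and sal columns mirrors the names used in this task's queries; the ename column and the sample rows are assumptions made only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (ename TEXT, deptno INTEGER, job TEXT, sal REAL)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?, ?)",
    [
        ("A", 10, "CLERK", 1000.0),    # sample rows, invented for this sketch
        ("B", 10, "CLERK", 1400.0),
        ("C", 10, "MANAGER", 2500.0),
        ("D", 20, "CLERK", 1100.0),
    ],
)

# One result row per distinct (deptno, job) combination that occurs in the data.
grouped = conn.execute(
    "SELECT deptno, job, AVG(sal) FROM emp GROUP BY deptno, job ORDER BY deptno, job"
).fetchall()
for row in grouped:
    print(row)
```

There are two departments and two jobs, so at most four groups are possible, but only three combinations occur in the data and only three rows come back.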
Rollup:
In addition to the regular aggregation results with the Group By clause, the Rollup extension
produces group subtotals from right to left and a grand total. If "n" is the number of columns listed
in the Rollup, there will be n+ 1 levels of subtotals.
Example:
GROUP BY ROLLUP(a, b, c) produces the following groupings:
(a, b, c)
(a, b)
(a)
()
Query 1:
It is possible to do a partial rollup to reduce the number of subtotals calculated.
Cube:
In addition to the subtotals generated by the Rollup extension, the Cube extension will generate
subtotals for all combinations of the dimensions specified. If "n" is the number of columns listed in
the CUBE, there will be 2^n subtotal combinations, so the number of subtotals to be calculated
grows rapidly as the number of dimensions increases.
Example:
GROUP BY CUBE(a, b) produces the following groupings:
(a, b)
(a)
(b)
()
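The n + 1 groupings of a Rollup and the 2^n groupings of a Cube can be enumerated with a few lines of Python. This is only a sketch of the counting argument, not of how a database evaluates these clauses; the function names are hypothetical.

```python
from itertools import combinations


def rollup_groupings(columns):
    """Groupings of ROLLUP(columns): drop columns one at a time from the right."""
    return [tuple(columns[:k]) for k in range(len(columns), -1, -1)]


def cube_groupings(columns):
    """Groupings of CUBE(columns): every subset of the columns."""
    result = []
    for size in range(len(columns), -1, -1):
        result.extend(combinations(columns, size))
    return result


print(rollup_groupings(["a", "b", "c"]))
print(len(cube_groupings(["a", "b", "c"])))
```

For three columns the first call lists 4 groupings (n + 1) and the second counts 8 (2^3).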
Query 2:
SQL> select deptno, job, Avg(sal) from emp group by mgr, cube(deptno,job);
Query 4:
Grouping Functions:
Grouping:
It can be quite easy to visually identify subtotals generated by rollups and cubes, but to do it
programmatically we need to detect the null values that mark them. This is where the Grouping function is
useful. It accepts a single column as a parameter and returns "1" if the column contains a null value
generated as part of a subtotal by a ROLLUP or CUBE operation, or "0" for any other value, including
stored null values.
Query 5:
Group_Id:
It's possible to write queries that return the duplicate subtotals, which can be a little confusing. The
group_id function assigns the value "0" to the first set, and all subsequent sets get assigned a higher
number.
Grouping sets:
Calculating all possible subtotals in a cube, especially one with many dimensions, can be quite an
intensive process. If only a few of these subtotals are needed, computing the full cube represents a
considerable amount of wasted effort. In that case we can use the Grouping Sets expression
to specify exactly which levels of subtotaling are required.
Composite Columns:
Rollup and Cube consider each column independently when deciding which subtotals must be
calculated. For Rollup this means stepping back through the list to determine the groupings.
Composite columns allow columns to be grouped together with parentheses so they are treated as a single
unit when determining the necessary groupings. In the following Rollup, columns "a" and "b" have
been turned into a composite column by the additional parentheses. As a result, the group of "a" is no
longer calculated, as the column "a" is only present as part of the composite column in the statement.
GROUP BY ROLLUP((a, b), c) produces:
(a, b, c)
(a, b)
()
Not considered:
(a)
In a similar way, the possible combinations of the following Cube are reduced because references to
"a" or "b" individually are not considered, as they are treated as a single column when the groupings
are determined.
GROUP BY CUBE((a, b), c) produces:
(a, b, c)
(a, b)
(c)
()
Not considered:
(a, c)
(a)
(b, c)
(b)
Query 6:
Concatenated Groupings
Concatenated groupings are defined by putting together multiple GROUPING SETS, CUBEs or
ROLLUPs separated by commas. The resulting groupings are the cross-product of all the groups
produced by the individual grouping sets.
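The cross-product behaviour can be sketched in Python. The example assumes a clause of the form GROUP BY ROLLUP(a, b), ROLLUP(c, d), which is an illustrative choice and not one of the manual's numbered queries; the helper name concatenate is hypothetical.

```python
from itertools import product


def concatenate(*grouping_lists):
    """Cross-product of grouping lists; each combination is flattened into one grouping."""
    return [sum(combo, ()) for combo in product(*grouping_lists)]


rollup_ab = [("a", "b"), ("a",), ()]   # groupings of ROLLUP(a, b)
rollup_cd = [("c", "d"), ("c",), ()]   # groupings of ROLLUP(c, d)

groupings = concatenate(rollup_ab, rollup_cd)
for grouping in groupings:
    print(grouping)
```

Each Rollup contributes 3 groupings, so the concatenation yields 3 x 3 = 9 groupings, from (a, b, c, d) down to the grand total ().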
Query 7:
RECORD WORK
EP1. Write the queries for the following using the sales fact table:
a. Display average sales for the rollup combination of location and branch.
b. Display the sum of dollars sold for the cube combination of location, time and branch.
c. Display the average of dollars sold for the cube combination location, (time, branch).
d. Demonstrate grouping sets on the sales table.
e. Demonstrate concatenated grouping sets on the sales schema.
Viva-Voce Questions:
Note: For Viva-voce questions, the students should demonstrate their knowledge and
skills through oral communication. Writing answers to these Questions is Not mandatory.
However, it is good practice to have answers written after completion of the lab session for future reference.
Laboratory Task - 3
CONCEPT AT A GLANCE
What is Python?
Python is a popular programming language. It was created by Guido van Rossum and
released in 1991. Python is a high-level, object-oriented programming language. It is also called a
general-purpose programming language, as it is used in almost every domain, as listed below:
1. Web Development
2. Software Development
3. Game Development
4. AI & ML
5. Data Analytics
Why Python?
Python works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc). Python has a
simple syntax similar to the English language.
Python has syntax that allows developers to write programs with fewer lines than some other
programming languages.
Python runs on an interpreter system, meaning that code can be executed as soon as it is
written. This means that prototyping can be very quick. Python can be treated in a procedural way,
an object-oriented way or a functional way.
A Python file is run from the command line like this:
C:\Users\Name>python helloworld.py
Let's write our first Python file, called helloworld.py, which can be done in any text editor.
helloworld.py:
print("Hello, World!")
Save the file. Open the command line, navigate to the directory where we saved the file, and run:
C:\Users\Name>python helloworld.py
Python Operators
Python operators are divided into the following groups:
• Arithmetic operators
• Assignment operators
• Comparison operators
• Logical operators
• Identity operators
• Membership operators
• Bitwise operators
Arithmetic Operators
Arithmetic operators are used to perform mathematical operations like addition, subtraction,
multiplication, and division.
# Examples of Arithmetic Operators
a = 9
b = 4
add = a + b    # Addition
sub = a - b    # Subtraction
mul = a * b    # Multiplication
div1 = a / b   # Division (float)
div2 = a // b  # Division (floor)
mod = a % b    # Modulo
p = a ** b     # Power
# print results
print(add)
print(sub)
print(mul)
print(div1)
print(div2)
print(mod)
print(p)
Output
13
5
36
2.25
2
1
6561
Comparison Operators
Comparison (relational) operators compare values. Each returns either True or False according to
the condition.
# Examples of Relational Operators
a = 13
b = 33
print(a > b)   # a > b is False
print(a < b)   # a < b is True
print(a == b)  # a == b is False
print(a != b)  # a != b is True
print(a >= b)  # a >= b is False
print(a <= b)  # a <= b is True
Output
False
True
False
True
False
True
Logical Operators
Logical operators perform Logical AND, Logical OR, and Logical NOT operations. They are used to
combine conditional statements.
a = True
b = False
print(a and b)  # Logical AND
print(a or b)   # Logical OR
print(not a)    # Logical NOT
Output
False
True
False
Bitwise Operators
Bitwise operators act on bits and perform the bit-by-bit operations. These are used to operate on
binary numbers.
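a = 10
b = 4
print(a & b)   # Bitwise AND
print(a | b)   # Bitwise OR
print(~a)      # Bitwise NOT
print(a ^ b)   # Bitwise XOR
print(a >> 2)  # Bitwise right shift
print(a << 2)  # Bitwise left shift
Output
0
14
-11
14
2
40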
Assignment Operators
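a = 10
b = a      # Assign the value of a to b
print(b)
b += a     # Add and assign
print(b)
b -= a     # Subtract and assign
print(b)
b *= a     # Multiply and assign
print(b)
b <<= a    # Left shift and assign
print(b)
Output
10
20
10
100
102400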
Identity Operators
is and is not are the identity operators. Both are used to check whether two names refer to the same
object in memory. Two variables that are equal are not necessarily identical.
a = 10
b = 20
c = a
print(a is not b)
print(a is c)
Output
True
True
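The difference between equality and identity is easiest to see with mutable objects. The short sketch below uses two lists with the same contents; the variable names are chosen only for this illustration.

```python
first = [1, 2, 3]
second = [1, 2, 3]   # equal contents, but a separate list object
alias = first        # a second name for the same object

print(first == second)   # True: the values are equal
print(first is second)   # False: two distinct objects
print(first is alias)    # True: both names refer to one object
```

Use == to compare values and reserve is for checks against a specific object such as None.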
Membership Operators
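in and not in are the membership operators; they test whether a value or variable is present in a sequence.
x = 24
y = 20
numbers = [10, 20, 30, 40, 50]  # sample list, values assumed for illustration
if x not in numbers:
    print("x is NOT present in the given list")
else:
    print("x is present in the given list")
if y in numbers:
    print("y is present in the given list")
else:
    print("y is NOT present in the given list")
Output
x is NOT present in the given list
y is present in the given list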
Functions:
A function is a block of organized, reusable code that is used to perform a single, related
action. Functions provide better modularity for the application and a high degree of code reusing.
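As we already know, Python gives us many built-in functions like print(), but we can also
create our own functions. These are called user-defined functions.
Creating a Function
In Python a function is defined using the def keyword:
Example
def my_function():
    print("Hello from a function")
Calling a Function
To call a function, use the function name followed by parentheses:
Example
def my_function():
    print("Hello from a function")
my_function()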
Arguments
Arguments are specified after the function name, inside the parentheses. We can add as many
arguments as we want, just separate them with a comma.
The following example has a function with one argument (fname). When the function is called,
we pass along a first name, which is used inside the function to print the full name:
Example
def my_function(fname):
    print(fname + " Refsnes")
my_function("Emil")
my_function("Tobias")
my_function("Linus")
Strings
Strings in python are surrounded by either single quotation marks, or double quotation marks. 'hello'
is the same as "hello".
Example
print("Hello")
Assigning a string to a variable is done with the variable name followed by an equal sign and the
string:
Example
a= "Hello"
print(a)
Multiline Strings
Example
a= """Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut
labore et dolore magna aliqua."""
print(a)
As in many other popular programming languages, strings in Python are sequences of
Unicode characters.
However, Python does not have a character data type; a single character is simply a string with a
length of 1.
Example
Get the character at position 1 (remember that the first character has the position 0):
a= "Hello,World!"
print(a[1])
Since strings are sequences, we can loop through the characters in a string with a for loop.
Example
for x in "banana":
    print(x)
String Length
Example
a= "Hello,World!"
print(len(a))
Slicing
Specify the start index and the end index, separated by a colon, to return a part of the string.
Example
b= "Hello,World!"
print(b[2:5])
Slice From the Start
By leaving out the start index, the range will start at the first character:
Example
b= "Hello,World!"
print(b[:5])
By leaving out the end index, the range will go to the end:
Example
Get the characters from position 2, and all the way to the end:
b= "Hello,World!"
print(b[2:])
RECORD WORK
Viva-Voce Questions:
Note: For Viva-voce questions, the students should demonstrate their knowledge and skills
through oral communication. Writing answers to these Questions is Not mandatory.
However, it is good practice to have answers written after completion of the lab session for future reference.
Laboratory Task – 4
CONCEPT AT A GLANCE
Lists:
Lists are used to store multiple items in a single variable. Lists are created using square brackets:
Example
Create a List:
print(thislist)
Output:
List Items
List items are indexed, the first item has index [0], the second item has index [1] etc.
Ordered
When we say that lists are ordered, it means that the items have a defined order, and that order will
not change. If we add new items to a list, the new items will be placed at the end of the list.
Changeable
The list is changeable, meaning that we can change, add, and remove items in a list after it has been
created.
Allow Duplicates
Since lists are indexed, lists can have items with the same value:
Example
print(thislist)
List Length
To determine how many items a list has, use the len() function:
Example
print(len(thislist))
Example
list2=[1, 5, 7, 9, 3]
Example
Access Items
List items are indexed and we can access them by referring to the index number:
Example
print(thislist[1])
Range of Indexes
We can specify a range of indexes by specifying where to start and where to end the range.
When specifying a range, the return value will be a new list with the specified items.
Example
print(thislist[2:5])
Example
This example returns the items from the beginning to, but NOT including, "kiwi":
print(thislist[:4])
Example
if "apple" in thislist:
    print("Yes, 'apple' is in the list")
type()
From Python's perspective, lists are defined as objects with the data type 'list':
<class 'list'>
Example
print(type(mylist))
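The list operations described in this section can be run together as one short script. The fruit names below are sample values assumed for illustration; they are chosen so that "kiwi" sits at index 4, as the slicing example expects.

```python
# Sample data, assumed for illustration.
fruits = ["apple", "banana", "cherry", "orange", "kiwi", "melon", "mango"]

print(len(fruits))         # 7
print(fruits[1])           # banana
print(fruits[2:5])         # ['cherry', 'orange', 'kiwi']
print(fruits[:4])          # ['apple', 'banana', 'cherry', 'orange']
print("apple" in fruits)   # True
print(type(fruits))        # <class 'list'>

# Lists are changeable and allow duplicates: new items go to the end.
fruits.append("apple")
print(fruits.count("apple"))   # 2
```

Note that a slice such as fruits[2:5] returns a new list and leaves the original list unchanged.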
Tuple:
Tuple is one of 4 built-in data types in Python used to store collections of data, the other 3 are List,
Set, and Dictionary, all with different qualities and usage.
A tuple is a collection which is ordered and unchangeable. Tuples are written with round brackets.
Example
Create a Tuple:
print(thistuple)
Tuple Items
Tuple items are indexed, the first item has index [0], the second item has index [1] etc.
Ordered
When we say that tuples are ordered, it means that the items have a defined order, and that order
will not change.
Unchangeable
Tuples are unchangeable, meaning that we cannot change, add or remove items after the tuple has
been created.
Allow Duplicates
Since tuples are indexed, they can have items with the same value:
Example
print(thistuple)
Tuple Length
To determine how many items a tuple has, use the len() function:
Example
print(len(thistuple))
type()
From Python's perspective, tuples are defined as objects with the data type 'tuple':
<class 'tuple'>
Example
print(type(mytuple))
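The tuple properties above (ordered, indexed, duplicates allowed, unchangeable) can be checked with a small sketch. The colour names are sample values assumed for illustration.

```python
colors = ("red", "green", "blue", "red")   # sample values; duplicates are allowed

print(colors[0])      # red
print(len(colors))    # 4
print(type(colors))   # <class 'tuple'>

# Tuples are unchangeable: item assignment raises TypeError.
try:
    colors[0] = "black"
    modified = True
except TypeError as error:
    modified = False
    print("Cannot modify a tuple:", error)

single = ("red",)     # the trailing comma makes this a one-item tuple
print(type(single))   # <class 'tuple'>
```

Without the trailing comma, ("red") is just a string in parentheses, not a tuple.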
Dictionary collection and set collection.
Dictionary
A dictionary is a collection which is ordered, changeable and does not allow duplicate keys.
Dictionaries are written with curly brackets, and have keys and values:
Example
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}
print(thisdict)
Dictionary Items
Dictionary items are ordered, changeable, and do not allow duplicate keys.
Dictionary items are presented in key:value pairs, and can be referred to by using the key name.
Example
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}
print(thisdict["brand"])
Ordered or Unordered?
As of Python version 3.7, dictionaries are ordered. In Python 3.6 and earlier, dictionaries are
unordered.
When we say that dictionaries are ordered, it means that the items have a defined order (the order
in which they were inserted), and that order will not change.
Unordered means that the items do not have a defined order, so we cannot refer to an item by using
an index.
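As a small illustration of the ordering rule (assuming Python 3.7 or later; the keys are the same sample keys used in this section), the sketch below shows that new keys are appended at the end and that updating a value does not move its key.

```python
thisdict = {"brand": "Ford", "model": "Mustang"}
thisdict["year"] = 1964          # a new key is placed at the end
thisdict["brand"] = "Mercury"    # updating a value keeps the key's position

keys_in_order = list(thisdict)   # iterating a dict yields keys in insertion order
print(keys_in_order)
```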
Changeable
Dictionaries are changeable, meaning that we can change, add or remove items after the dictionary
has been created.
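The three kinds of change mentioned above can be sketched as follows. The "color" key and the new year value are illustrative assumptions, not part of the manual's data.

```python
thisdict = {"brand": "Ford", "model": "Mustang", "year": 1964}

thisdict["year"] = 2018            # change an existing item
thisdict["color"] = "red"          # add a new item
removed = thisdict.pop("model")    # remove an item and return its value

print(thisdict)
print(removed)
```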
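Duplicates Not Allowed
Dictionaries cannot have two items with the same key. A duplicate key overwrites the earlier value:
Example
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964,
    "year": 2020
}
print(thisdict)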
type()
From Python's perspective, dictionaries are defined as objects with the data type 'dict':
Example
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}
print(type(thisdict))
Output:
<class 'dict'>
Accessing Items
We can access the items of a dictionary by referring to its key name, inside square brackets:
Example
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}
x = thisdict["model"]
There is also a method called get() that will give us the same result:
Example
x = thisdict.get("model")
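The two access styles differ when the key is missing. The sketch below uses a "color" key that is deliberately absent from the dictionary (an illustrative assumption) to show that square brackets raise a KeyError while get() returns None or a supplied default.

```python
thisdict = {"brand": "Ford", "model": "Mustang", "year": 1964}

color = thisdict.get("color")                        # missing key: returns None
color_or_default = thisdict.get("color", "unknown")  # missing key: returns the default

try:
    thisdict["color"]                                # missing key: raises KeyError
    raised = False
except KeyError:
    raised = True

print(color, color_or_default, raised)
```

get() is therefore the safer choice when a key may or may not be present.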
Nested Dictionaries
A dictionary can contain other dictionaries; this is called nesting.
Example
myfamily = {
    "child1": {
        "name": "Emil",
        "year": 2004
    },
    "child2": {
        "name": "Tobias",
        "year": 2007
    },
    "child3": {
        "name": "Linus",
        "year": 2011
    }
}
print(myfamily)
Output:
{'child1': {'name': 'Emil', 'year': 2004}, 'child2': {'name': 'Tobias', 'year': 2007}, 'child3': {'name': 'Linus',
'year': 2011}}
Sets
Set is one of 4 built-in data types in Python used to store collections of data, the other 3 are List, Tuple,
and Dictionary, all with different qualities and usage.
Example
Create a Set:
thisset = {"apple", "banana", "cherry"}
print(thisset)
Set Items
Set items are unordered, unchangeable, and do not allow duplicate values.
Unordered
Unordered means that the items in a set do not have a defined order.
Set items can appear in a different order every time we use them, and cannot be referred to by index
or key.
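Unchangeable
Set items are unchangeable, meaning that we cannot change an item after the set has been created.
We can, however, remove items and add new ones.
Duplicates Not Allowed
Sets cannot have two items with the same value; duplicate values are ignored:
Example
thisset = {"apple", "banana", "cherry", "apple"}
print(thisset)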
To determine how many items a set has, use the len() function.
Example
print(len(thisset))
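Access Items
We cannot access items in a set by referring to an index or a key.
But we can loop through the set items using a for loop, or ask if a specified value is present in a set
by using the in keyword.
Example
thisset = {"apple", "banana", "cherry"}
for x in thisset:
    print(x)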
type()
From Python's perspective, sets are defined as objects with the data type 'set':
Example
myset = {"apple", "banana", "cherry"}
print(type(myset))
Output:
<class 'set'>
We can use the union() method, which returns a new set containing all items from both sets, or the
update() method, which inserts all the items from one set into another:
Example
The union() method returns a new set with all items from both sets:
set1 = {"a", "b", "c"}
set2 = {1, 2, 3}
set3 = set1.union(set2)
print(set3)
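For comparison with union(), here is a minimal sketch of update(). The set contents are illustrative assumptions. Unlike union(), update() changes the set it is called on and returns None.

```python
set1 = {"a", "b", "c"}
set2 = {1, 2, 3}

result = set1.update(set2)   # modifies set1 in place; the return value is None

print(set1)
print(result)
```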
Control structures and functions.
Python supports the usual logical conditions from mathematics:
• Equals: a == b
• Not Equals: a != b
• Less than: a < b
• Less than or equal to: a <= b
• Greater than: a > b
• Greater than or equal to: a >= b
These conditions can be used in several ways, most commonly in "if statements" and loops.
Example
If statement:
a = 33
b = 200
if b > a:
    print("b is greater than a")
In this example we use two variables, a and b, which are used as part of the if statement to test whether
b is greater than a. As a is 33, and b is 200, we know that 200 is greater than 33, and so we print to
screen that "b is greater than a".
Indentation
Python relies on indentation (whitespace at the beginning of a line) to define scope in the code. Other
programming languages often use curly-brackets for this purpose.
Elif
The elif keyword is Python's way of saying "if the previous conditions were not true, then try this
condition".
Example
a = 33
b = 33
if b > a:
    print("b is greater than a")
elif a == b:
    print("a and b are equal")
In this example a is equal to b, so the first condition is not true, but the elif condition is true, so we
print to screen that "a and b are equal".
Else
The else keyword catches anything which isn't caught by the preceding conditions.
Example
a = 200
b = 33
if b > a:
    print("b is greater than a")
elif a == b:
    print("a and b are equal")
else:
    print("a is greater than b")
In this example a is greater than b, so the first condition is not true, also the elif condition is not true,
so we go to the else condition and print to screen that "a is greater than b".
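We can also have an else without the elif:
Example
a = 200
b = 33
if b > a:
    print("b is greater than a")
else:
    print("b is not greater than a")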
Loops in Python
Python provides two loop statements, while and for, to handle looping requirements, and loops can
be nested inside one another. While both statements provide similar basic functionality, they differ
in their syntax and in when the loop condition is checked.
While Loop:
In Python, a while loop is used to execute a block of statements repeatedly as long as a given
condition is true. When the condition becomes false, the line immediately after the loop in the
program is executed.
Example
i = 1
while i < 6:
    print(i)
    i += 1
With the break statement we can stop the loop even if the while condition is true:
Example
i = 1
while i < 6:
    print(i)
    if i == 3:
        break
    i += 1
With the continue statement we can stop the current iteration, and continue with the next:
Example
i = 0
while i < 6:
    i += 1
    if i == 3:
        continue
    print(i)
With the else statement we can run a block of code once when the condition no longer is true:
Example
i = 1
while i < 6:
    print(i)
    i += 1
else:
    print("i is no longer less than 6")
For Loop:
A for loop is used for iterating over a sequence (that is either a list, a tuple, a dictionary, a set, or a
string).
This is less like the for keyword in other programming languages, and works more like an iterator
method as found in other object-oriented programming languages.
With the for loop we can execute a set of statements, once for each item in a list, tuple, set etc.
Example
fruits = ["apple", "banana", "cherry"]
for x in fruits:
    print(x)
Example
for x in "banana":
    print(x)
The range() Function
To loop through a set of code a specified number of times, we can use the range() function.
The range() function returns a sequence of numbers, starting from 0 by default, incrementing by 1
(by default), and ending before a specified number.
Example
for x in range(6):
    print(x)
The range() function defaults to incrementing the sequence by 1; however, it is possible to specify the
increment value by adding a third parameter: range(2, 30, 3):
Example
for x in range(2, 30, 3):
    print(x)
Nested Loops
A nested loop is a loop inside a loop.
The "inner loop" will be executed one time for each iteration of the "outer loop":
Example
adj = ["red", "big", "tasty"]
fruits = ["apple", "banana", "cherry"]
for x in adj:
    for y in fruits:
        print(x, y)
RECORD WORK
(a). In-Lab Exercise Problems (EPs): In-lab Exercise Problems (EPs) are the problems to be
coded during the lab session. Students should complete all EPs in the lab slot itself, with the
necessary write-up in the space provided, by following the required Program
Development Steps (PDS).
Therefore, students should
a. work on the At-Home SPs and execute those SPs with additional test cases
before attending the lab.
b. complete the In-Lab EPs in the lab session.
Prior preparation on EPs will help the student a lot in completing the EPs within the
stipulated lab session. It is always a good practice to attend the lab with prior
preparation.
(c). Students should focus on developing code for the In-Lab EPs, which is
a. readable (with proper annotations and indentation)
b. maintainable
c. extendable
d. testable and
e. robust
Viva-Voce Questions:
Note: For Viva-voce questions, the students should demonstrate their knowledge and skills
through oral communication. Writing answers to these Questions is Not mandatory.
However, it is good practice to have answers written after completion of the lab session for future reference.
Laboratory Task – 5
CONCEPT AT A GLANCE
NumPy is used to work with arrays. The array object in NumPy is called ndarray.
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(type(arr))
OUTPUT
[1 2 3 4 5]
<class 'numpy.ndarray'>
Dimensions in Arrays
0-D Arrays
0-D arrays, or Scalars, are the elements in an array. Each value in an array is a 0-D array.
Example
import numpy as np
arr=np.array(42)
print(arr)
1-D Arrays
An array that has 0-D arrays as its elements is called uni-dimensional or 1-D array.
Example
import numpy as np
arr=np.array([1, 2, 3, 4, 5])
print(arr)
2-D Arrays
An array that has 1-D arrays as its elements is called a 2-D array.
Example
Create a 2-D array containing two arrays with the values 1,2,3 and 4,5,6:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)
3-D arrays
An array that has 2-D arrays (matrices) as its elements is called 3-D array.
Example
Create a 3-D array with two 2-D arrays, both containing two arrays with the values 1,2,3 and 4,5,6:
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)
The indexes in NumPy arrays start with 0, meaning that the first element has index 0, and the
second has index 1 etc.
Example
import numpy as np
arr=np.array([1, 2, 3, 4])
print(arr[0])
To access elements from 2-D arrays we can use comma separated integers representing the
dimension and the index of the element.
Example
import numpy as np
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print('2nd element on 1st dim: ', arr[0, 1])
To access elements from 3-D arrays we can use comma separated integers representing the
dimensions and the index of the element.
Example
Access the third element of the second array of the first array:
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr[0, 1, 2])
Slicing arrays
Slicing in python means taking elements from one given index to another given index.
Example
import numpy as np
arr=np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5])
Negative Slicing
Use the minus operator to refer to an index from the end.
Example
Slice from index 3 from the end up to (but not including) index 1 from the end:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[-3:-1])
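STEP
Use the step value to determine the step of the slicing.
Example
Return every other element from index 1 up to (but not including) index 5:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5:2])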
Example
From the second row of a 2-D array, slice elements from index 1 to index 4 (not included):
import numpy as np
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[1, 1:4])
Shape of an Array
NumPy arrays have an attribute called shape that returns a tuple whose entries give the
number of elements along each dimension.
Example
Print the shape of a 2-D array:
import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr.shape)
Joining Arrays
Joining means putting the contents of two or more arrays into a single array.
In SQL we join tables based on a key, whereas in NumPy we join arrays by axes.
We pass a sequence of arrays that we want to join to the concatenate() function, along with the axis.
If axis is not explicitly passed, it is taken as 0.
Example
import numpy as np
arr1=np.array([1, 2, 3])
arr2=np.array([4, 5, 6])
arr=np.concatenate((arr1,arr2))
print(arr)
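The effect of the axis argument is easier to see on 2-D arrays. The following is a minimal sketch with illustrative values: axis=0 (the default) stacks the rows of the second array below the first, while axis=1 places its columns to the right.

```python
import numpy as np

arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])

stacked_rows = np.concatenate((arr1, arr2))          # axis=0 by default, shape (4, 2)
stacked_cols = np.concatenate((arr1, arr2), axis=1)  # shape (2, 4)

print(stacked_rows)
print(stacked_cols)
```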
Splitting Arrays
Joining merges multiple arrays into one, and splitting breaks one array into multiple.
We use array_split() for splitting arrays; we pass it the array we want to split and the number of
splits.
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 3)
print(newarr)
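When the array length is not divisible by the number of splits, array_split() still works and makes the leading pieces one element longer. The sketch below (seven illustrative values split three ways) also shows that the stricter np.split() raises a ValueError in the same situation.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

parts = np.array_split(arr, 3)        # uneven split is allowed
sizes = [len(p) for p in parts]
print(sizes)

try:
    np.split(arr, 3)                  # requires an equal division
    split_failed = False
except ValueError:
    split_failed = True
print(split_failed)
```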
Searching Arrays
You can search an array for a certain value, and return the indexes that get a match.
Example
import numpy as np
arr=np.array([1, 2, 3, 4, 5, 4, 4])
x= np.where(arr== 4)
print(x)
Sorting Arrays
Sorting means putting elements in an ordered sequence.
An ordered sequence is any sequence that has an order corresponding to its elements, like numeric or
alphabetical, ascending or descending.
NumPy provides the np.sort() function, which returns a sorted copy of the specified array and leaves
the original array unchanged.
Example
import numpy as np
arr = np.array([3, 2, 0, 1])
print(np.sort(arr))
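As a short sketch of the difference between the two sorting calls (same sample values as above): np.sort() returns a new array, whereas the ndarray method arr.sort() sorts in place. Reversing the sorted copy with a slice is one simple way to get descending order.

```python
import numpy as np

arr = np.array([3, 2, 0, 1])

ascending = np.sort(arr)             # sorted copy; arr is untouched
descending = np.sort(arr)[::-1]      # reverse the sorted copy
original_after_np_sort = arr.tolist()

arr.sort()                           # in-place sort; returns None

print(ascending, descending, arr)
```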
RECORD WORK
EP1. Write Python code to print column wise addition of a 2-D array.
EP2. Write Python code to print row wise addition of a 2-D array.
EP3. Write Python code to print diagonal elements of a 2-D array.
Viva-Voce Questions:
Laboratory Task – 6
CONCEPT AT A GLANCE
Example
import pandas
mydataset = {
    'passings': [3, 7, 2]
}
myvar = pandas.DataFrame(mydataset)
print(myvar)
Pandas as pd
Pandas is usually imported under the pd alias:
import pandas as pd
Example
import pandas as pd
mydataset = {
    'passings': [3, 7, 2]
}
myvar = pd.DataFrame(mydataset)
print(myvar)
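What is a Series?
A Pandas Series is like a column in a table. It is a one-dimensional array holding data of any type.
Example
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)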
Labels
If nothing else is specified, the values are labeled with their index number. First value has index 0,
second value has index 1 etc.
Example
Return the first value of the Series:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar[0])
Output:
1
Create Labels
With the index argument, you can name your own labels.
Example
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index=["x", "y", "z"])
print(myvar)
Output:
x    1
y    7
z    2
dtype: int64
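Once labels exist, items can be looked up by label instead of by position. This is a minimal sketch using the same x, y, z labels; the variable names y_value and subset are illustrative.

```python
import pandas as pd

myvar = pd.Series([1, 7, 2], index=["x", "y", "z"])

y_value = myvar["y"]          # single label: returns one value
subset = myvar[["x", "z"]]    # list of labels: returns a smaller Series

print(y_value)
print(subset)
```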
What is a DataFrame?
A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a table with
rows and columns.
Example
import pandas as pd
data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)
print(df)
Output:
   calories  duration
0       420        50
1       380        40
2       390        45
Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.
Pandas uses the loc attribute to return one or more specified row(s).
Example
Return row 0:
print(df.loc[0])
Result
calories    420
duration     50
Name: 0, dtype: int64
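loc accepts more than a single row label. The sketch below rebuilds the same calories/duration table and shows two further forms: a list of labels, which returns a DataFrame, and a (row, column) pair, which returns a single cell.

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)

first_two = df.loc[[0, 1]]            # list of row labels: result is a DataFrame
single_value = df.loc[2, "calories"]  # row label and column name: result is one cell

print(first_two)
print(single_value)
```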
Named Indexes
With the index argument, you can name your own indexes.
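Example
import pandas as pd
data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])
print(df)
Output:
      calories  duration
day1       420        50
day2       380        40
day3       390        45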
Read CSV Files
A simple way to store big data sets is to use CSV files (comma-separated values files).
CSV files contain plain text in a well-known format that can be read by almost any tool, including Pandas.
Example
import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())
Read JSON
JSON is plain text, but has the format of an object. It is well known in the world of programming
and can be read by Pandas.
Example
import pandas as pd
df = pd.read_json('data.json')
print(df.to_string())
One of the most used methods for getting a quick overview of the DataFrame is the head() method.
It returns the column headers and a specified number of rows, starting from the top.
Example
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(10))
Data Cleaning
Data cleaning means fixing bad data in your data set. Bad data could be:
• Empty cells
• Data in wrong format
• Wrong data
• Duplicates
Data Set:
Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 NaN
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 NaN 100 119 282.0
In the rows listed above, the data set contains some empty cells ("Date" in row 22 and "Calories" in row 18).
Empty Cells
Empty cells can potentially give you a wrong result when you analyze data.
Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.
This is usually OK, since data sets can be very big, and removing a few rows will not have a big
impact on the result.
Example
import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())
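Replace Empty Values
Another way of dealing with empty cells is to insert a new value instead.
This way you do not have to delete entire rows just because of some empty cells.
The fillna() method replaces empty cells with a value:
Example
import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(130, inplace=True)  # 130 is an arbitrary replacement value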
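A single fixed replacement value is rarely appropriate for every column. The following is a hedged sketch on a small in-memory DataFrame (the three rows are illustrative and stand in for data.csv) that fills the empty cells of one column with that column's mean.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; the real lab data lives in data.csv.
df = pd.DataFrame({
    "Pulse": [110, 117, 90],
    "Calories": [409.1, np.nan, 282.0],
})

mean_calories = df["Calories"].mean()                  # NaN cells are skipped
df["Calories"] = df["Calories"].fillna(mean_calories)  # fill only this column

print(df)
```

The median or the mode can be used in the same way through median() or mode()[0].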
Data of Wrong Format
Cells with data of wrong format can make it difficult, or even impossible, to analyze data.
To fix it, you have two options: remove the rows, or convert all cells in the columns into the same
format.
In our DataFrame, we have two cells with the wrong format. Check rows 22 and 26 (row 26 lies beyond
the rows listed above); the 'Date' column should hold strings that represent dates.
Let's try to convert all cells in the 'Date' column into dates.
Example
Convert to date:
import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())
Wrong Data
"Wrong data" does not have to be "empty cells" or "wrong format", it can just be wrong, like if
someone registered "199" instead of "1.99".
Sometimes we can spot wrong data by looking at the data set, because we have an expectation of
what it should be.
If you take a look at our data set, you can see that in row 7, the duration is 450, but for all the other
rows the duration is between 30 and 60.
It doesn't have to be wrong, but taking into consideration that this is the data set of someone's workout
sessions, we conclude that this person did not work out for 450 minutes.
Replacing Values
One way to fix wrong values is to replace them with something else.
In our example, it is most likely a typo, and the value should be "45" instead of "450", and we could
just insert "45" in row 7:
Example
df.loc[7, 'Duration'] = 45
Discovering Duplicates
Duplicate rows are rows that have been registered more than one time.
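By taking a look at our test data set, we can assume that rows 11 and 12 are duplicates.
The duplicated() method returns a Boolean value for each row: True for every row that repeats an
earlier row, otherwise False.
Example
print(df.duplicated())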
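Once duplicates are discovered they are usually removed. The sketch below uses three illustrative rows (the first two mirror rows 11 and 12 of the data set above) to show duplicated() alongside drop_duplicates().

```python
import pandas as pd

# Illustrative rows; the first two are identical on purpose.
df = pd.DataFrame({
    "Duration": [60, 60, 60],
    "Pulse": [100, 100, 106],
    "Calories": [250.7, 250.7, 345.3],
})

flags = df.duplicated()          # True for each row that repeats an earlier row
deduped = df.drop_duplicates()   # new DataFrame without the repeated rows

print(flags)
print(deduped)
```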
EP1. Write Python code to sort values in ascending and descending order.
EP2. Create a dataset with null values and write code to remove null values from the dataset.
CONCEPT AT A GLANCE
Data visualization is the discipline of trying to understand data by placing it in a visual context so
that patterns, trends and correlations that might not otherwise be detected can be exposed.
Python offers multiple great graphing libraries that come packed with lots of different features. No
matter if you want to create interactive, live or highly customized plots python has an excellent library
for you.
What is Matplotlib?
Matplotlib is a low-level graph plotting library in python that serves as a visualization utility.
Matplotlib was created by John D. Hunter. Matplotlib is open source and we can use it freely.
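Matplotlib is mostly written in Python; a few segments are written in C, Objective-C and JavaScript
for platform compatibility.
Installation of Matplotlib
If Python and pip are already installed, Matplotlib can be installed with the command:
pip install matplotlib
If this command fails, then use a Python distribution that already has Matplotlib installed, like
Anaconda, Spyder etc.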
Import Matplotlib
Once Matplotlib is installed, import it in your applications by adding the import module statement:
import matplotlib
Pyplot
Most of the Matplotlib utilities lie under the pyplot submodule, and are usually imported under the
plt alias:
import matplotlib.pyplot as plt
Example
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([0, 6])    # sample x values
ypoints = np.array([0, 250])  # sample y values
plt.plot(xpoints, ypoints)
plt.show()
Matplotlib Plotting
The plot() function is used to draw points (markers) in a diagram. By default, the plot() function
draws a line from point to point. The function takes parameters for specifying points in the
diagram. Parameter 1 is an array containing the points on the x-axis. Parameter 2 is an array
containing the points on the y-axis. If we need to plot a line from (1, 3) to (8, 10), we have to pass
two arrays [1, 8] and [3, 10] to the plot function.
Example
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([1, 8])
ypoints = np.array([3, 10])
plt.plot(xpoints, ypoints)
plt.show()
Matplotlib Subplots
With the subplot() function we can draw multiple plots in one figure.
The subplot() function takes three arguments that describe the layout of the figure. The layout is
organized in rows and columns, which are represented by the first and second argument. The third
argument represents the index of the current plot.
plt.subplot(1, 2, 1)
#the figure has 1 row, 2 columns, and this plot is the first plot.
plt.subplot(1, 2, 2)
#the figure has 1 row, 2 columns, and this plot is the second plot.
Example
import matplotlib.pyplot as plt
import numpy as np
#plot 1:
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(1, 2, 1)
plt.plot(x,y)
#plot 2:
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])  # sample y values for the second plot
plt.subplot(1, 2, 2)
plt.plot(x,y)
plt.show()
Line plot:
import matplotlib.pyplot as plt
x = [1,2,3,4,6,8,9,10,12,15]
y = [10,20,30,40,60,80,90,100,102,150]
plt.figure(figsize=(8,5)) ## width,height
plt.plot(x,y,color='green',lw=3,linestyle="dotted",label="Line Plot")
plt.legend(loc="best")
plt.show()
Scatter plot:
x = [1,2,3,4,6,8,9,10,12,15]
y = [10,20,30,40,60,80,90,100,102,150]
plt.figure(figsize=(8,5)) ## width,height
plt.scatter(x,y,color='green',label="Scatter Plot",s=150,marker="*")
plt.legend(loc="best")
plt.show()
Bar plot:
x = [1,2,3,4,6,8,9,10,12,15]
y = [10,20,30,40,60,80,90,100,102,150]
plt.figure(figsize=(8,5)) ## width,height
plt.bar(x,y,color=['green','orange'],label="Bar Plot",width=0.6)
plt.legend(loc="best")
plt.show()
CONCEPT AT A GLANCE
Data Distribution
In the real world, the data sets are much bigger, but it can be difficult to gather real world data, at
least at an early stage of a project.
To create big data sets for testing, we use the Python module NumPy, which comes with a number
of methods to create random data sets, of any size.
Histogram
To visualize the data set we can draw a histogram with the data we collected. We will use the Python
module Matplotlib to draw a histogram.
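The two steps above (generate random data, then bin it for a histogram) can be sketched without drawing anything. The size of 250, the range 0 to 5, and the seed are illustrative assumptions. np.histogram() computes the same bin counts that plt.hist() would display.

```python
import numpy as np

rng = np.random.default_rng(seed=0)         # seeded so the run is repeatable
data = rng.uniform(0.0, 5.0, 250)           # 250 random floats in [0, 5)

# Count how many values fall into each of 5 equal-width bins.
counts, edges = np.histogram(data, bins=5, range=(0.0, 5.0))

print(counts)
print(edges)
```

Passing the same data to plt.hist(data, 5) followed by plt.show() would draw these counts as bars.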
A data set whose values are concentrated around a central value, with fewer values further away from it,
follows what probability theory calls the normal data distribution, or the
Gaussian data distribution, after the mathematician Carl Friedrich Gauss who came up with the
formula of this data distribution.
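A normally distributed data set can be generated with NumPy as well. In this sketch the mean of 5.0, the standard deviation of 1.0, the sample size and the seed are all illustrative assumptions. It checks the familiar property that roughly 68% of the values lie within one standard deviation of the mean.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
data = rng.normal(loc=5.0, scale=1.0, size=100_000)   # mean 5.0, standard deviation 1.0

sample_mean = data.mean()
sample_std = data.std()
within_one_sd = np.mean(np.abs(data - 5.0) <= 1.0)    # fraction within one standard deviation

print(sample_mean, sample_std, within_one_sd)
```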
A time series is a sequence of observations recorded at regular time intervals. Depending on the
frequency of observations, a time series may typically be hourly, daily, weekly,
monthly, quarterly or annual. Sometimes you might have second-wise and minute-wise time series as
well, such as the number of clicks and user visits every minute.
Categorical features are typically stored as text values
which represent various traits of the observations. For example, gender is described as Male (M) or
Female (F), and product type could be described as electronics, apparel, food etc.
import matplotlib.pyplot as plt
x = [1,2,3,4,6,8,9,10,12,15]
y = [10,20,30,40,60,80,90,100,102,150]
plt.figure(figsize=(8,5)) ## width,height
plt.bar(x,y,color=['green','orange'],label="Bar Plot",width=0.6)
plt.plot(x,y,color='red',lw=1,linestyle="solid",label="Line Plot")
plt.legend(loc="best")
plt.show()
x = [1,2,3,4,6,8,9,10,12,15]
y = [10,20,30,40,60,80,90,100,102,150]
plt.barh(x,y,color=['green','orange'],label="Bar Plot",height=0.6)
plt.legend(loc="best")
plt.show()
slices = [30,100,50,22,44,66,22,55]
names = ["A","B","C","D","E","F","G","H"]
cols = ["red","blue","orange","green","pink","violet","magenta","yellow"]
plt.figure(figsize=(6,6))
plt.pie(slices,labels=names,colors=cols,autopct="%0.2f%%",explode=(0.2,0,0,0.5,0,0,0,0))
plt.legend(loc=4)
plt.show()
EP1. Draw all types of charts for sales data analysis.
CONCEPT AT A GLANCE
Association mining searches for frequent items in a data set. Frequent-pattern mining usually uncovers
the interesting associations and correlations between item sets in transactional and relational
databases.
In short, frequent mining shows which items appear together in a transaction or relation.
Frequent mining is the generation of association rules from a transactional data set. If two items
X and Y are frequently purchased together, then it is good to place them together in stores or to offer
a discount on one item on purchase of the other. This can increase sales. For example, it is
likely that a customer who buys milk and bread also buys butter.
So the association rule is ['milk'] ^ ['bread'] => ['butter'], and the seller can suggest that the
customer buy butter if he/she buys milk and bread.
Important Definitions:
Support: It is one of the measures of interestingness. It reflects the usefulness of a rule.
5% support means that 5% of all transactions in the database follow the rule.
Confidence: It reflects the certainty of a rule. A confidence of 60% means that 60% of the customers
who purchased milk and bread also bought butter.
If a rule satisfies both minimum support and minimum confidence, it is a strong rule.
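The two definitions can be worked through by hand on a tiny example. The five transactions below are made up for illustration and are not the data set used in this lab. The helper functions support() and confidence() are hypothetical names, written in plain Python so that no extra package is needed.

```python
# Five illustrative transactions (not the lab's data set).
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both) / support(antecedent)

# Rule: ['milk'] ^ ['bread'] => ['butter']
rule_support = support({"milk", "bread", "butter"})          # 2 of 5 transactions
rule_confidence = confidence({"milk", "bread"}, {"butter"})  # 2 of the 3 milk-and-bread transactions

print(rule_support, rule_confidence)
```

With a minimum support of 0.4 and a minimum confidence of 0.2, this rule would count as a strong rule.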
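# The call pattern below matches the third-party apyori package (from apyori import apriori).
# The lines that load the transactions and define num_records are missing from this copy.
print(num_records)
records = []
for i in range(0, num_records):
    ...  # original line missing: append transaction i to records as a list of item names
print(records)
association_rules = apriori(records, min_support=0.4, min_confidence=0.2)
association_results = list(association_rules)
print(len(association_results))
sup_list = []
conf_list = []
for item in association_results:
    # print(pair)
    if len(list(item[2][0][1])) >= 2:
        # third index of the list located at 0th of the third index of the inner list
        conf_list.append(item[2][0][2])
    # the print statements that produce the Support/Confidence lines below are missing
    print("=====================================")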
Output:
Confidence: 0.6
=====================================
Confidence: 0.6
=====================================
Confidence: 0.4
=====================================
Confidence: 0.6
=====================================
Confidence: 0.4
=====================================
Support: 0.4
Confidence: 0.4
=====================================
Support: 0.4
Confidence: 0.4
E10. 14. Regression and Classification: Linear regression and logistic regression.
CONCEPT AT A GLANCE
Regression
Regression Analysis is a statistical process for estimating the relationships between the dependent
variables or criterion variables and one or more independent variables or predictors. Regression
analysis explains the changes in criteria in relation to changes in select predictors. The conditional
expectation of the criteria is based on predictors where the average value of the dependent variables
is given when the independent variables are changed. Three major uses for regression analysis are
determining the strength of predictors, forecasting an effect, and trend forecasting.
Simple linear regression (SLR) is the most basic version of linear regression; it predicts a response
using a single feature. The assumption in SLR is that the two variables are linearly related.
Multiple linear regression (MLR) is the extension of simple linear regression that predicts a response
using two or more features.
Mathematically we can explain it as follows.
Consider a dataset having n observations, p features (independent variables) and one response y
(dependent variable). The regression line for p features can be calculated as follows:
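$h(x_i) = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_p x_{ip}$
Here, $h(x_i)$ is the predicted response value and $b_0, b_1, b_2, \dots, b_p$ are the regression coefficients.
The model also includes a random (residual) error $e_i$ for each observation:
$y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_p x_{ip} + e_i$
$y_i = h(x_i) + e_i$ or $e_i = y_i - h(x_i)$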
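The regression equation can be checked numerically on a small example. This is a sketch under stated assumptions: the five observations are made up, the response is generated exactly from b0 = 1, b1 = 2, b2 = 3, and NumPy's least-squares solver is used here in place of scikit-learn to keep the arithmetic visible.

```python
import numpy as np

# Illustrative data: n = 5 observations, p = 2 features.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]      # generated with b0=1, b1=2, b2=3 and no noise

# Prepend a column of ones so that b0 is estimated together with b1 and b2.
A = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

h = A @ coeffs       # predicted responses h(x_i)
e = y - h            # residual errors e_i = y_i - h(x_i)

print(coeffs)
print(e)
```

Because the data contain no noise, the recovered coefficients match the generating values and every residual is close to zero.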
Logistic regression is a supervised learning classification algorithm used to predict the probability of
a target variable. The target (dependent) variable is dichotomous, which means there are only two
possible classes.
In simple words, the dependent variable is binary, with data coded as either 1 (success/yes) or
0 (failure/no).
Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of the simplest
ML algorithms and can be used for various classification problems such as spam detection, diabetes
prediction and cancer detection.
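To make "P(Y=1) as a function of X" concrete, the sketch below evaluates the logistic (sigmoid) function for a single feature. The coefficients b0 and b1 are hypothetical values chosen only for illustration; in practice they are learned from the training data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, b0, b1):
    """P(Y=1 | x) for a single feature x with coefficients b0 and b1."""
    return sigmoid(b0 + b1 * x)

# Hypothetical coefficients, chosen only for illustration
b0, b1 = -4.0, 2.0

for x in (0.0, 2.0, 4.0):
    p = predict_proba(x, b0, b1)
    label = 1 if p >= 0.5 else 0   # threshold the probability at 0.5
    print(x, round(p, 3), label)
```

The class label is obtained by thresholding the probability; 0.5 is the usual default, but another threshold can be chosen when the two kinds of error have different costs.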
Classification
Two forms of data analysis can be used to extract models that describe important
classes or to predict future data trends:
• Classification
• Prediction
Classification models predict categorical class labels, whereas prediction models predict
continuous-valued functions.
What is classification?
The following are examples of cases where the data analysis task is classification:
• A bank loan officer wants to analyze the data in order to know which customers (loan
applicants) are risky and which are safe.
• A marketing manager at a company needs to analyze a customer with a given profile to
predict whether that customer will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the categorical labels.
These labels are risky or safe for the loan application data and yes or no for the marketing data.
Linear Regression:
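import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score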
data = pd.read_csv("/content/Salary_Data.csv")
data.head()
Output:
YearsExperience Salary
0 1.1 39343.0
1 1.3 46205.0
2 1.5 37731.0
3 2.0 43525.0
4 2.2 39891.0
x = np.array(data[['YearsExperience']]) ## feature
y = np.array(data['Salary']) ## target
xtrain,xtest,ytrain,ytest = train_test_split(x,y,train_size=0.8,random_state=9014)
model = LinearRegression()
model.fit(xtrain,ytrain)
### Prediction
ypred = model.predict(xtest)
ypred
Output:
122820.04811718, 115311.59768499])
ytest
Output:
xtest
Output:
array([[ 3.2],
[ 7.9],
[ 2. ],
[ 3.9],
[10.3],
[ 9.5]])
score = r2_score(ytest,ypred)
score
Output:
0.99842716176972
m = model.coef_
c = model.intercept_
print(m,c)
Output:
[9385.56304023] 26148.74880284306
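The printed slope and intercept fully describe the fitted line, so a prediction can be reproduced by hand. The following minimal sketch plugs the two coefficients printed above into salary = m * years + c; the function name predict_salary is introduced here only for illustration. For the last two test inputs (10.3 and 9.5 years) it should reproduce the two predicted values visible in the ypred output, up to rounding of the printed coefficients.

```python
# Coefficients printed by the fitted model above
m = 9385.56304023        # slope: salary increase per year of experience
c = 26148.74880284306    # intercept: predicted salary at zero years of experience

def predict_salary(years):
    """Evaluate the fitted line salary = m * years + c."""
    return m * years + c

# The last two test inputs shown above are 10.3 and 9.5 years
for years in (10.3, 9.5):
    print(years, round(predict_salary(years), 2))
```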
plt.figure(figsize=(10,6))
plt.scatter(xtrain,ytrain,color="red",label="Actual Samples")
plt.scatter(xtrain,model.predict(xtrain),color="blue",label="Predicted Samples")
plt.plot(xtrain,model.predict(xtrain),color="yellow",label="Line of Regression")
plt.legend()
plt.show()
plt.figure(figsize=(10,6))
plt.scatter(xtest,ytest,color="red",label="Actual Samples")
plt.scatter(xtest,model.predict(xtest),color="blue",label="Predicted Samples")
plt.plot(xtest,model.predict(xtest),color="yellow",label="Line of Regression")
plt.legend()
plt.show()
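Logistic Regression:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
data = sns.load_dataset('titanic')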
data.isnull().sum()
Output:
survived 0
pclass 0
sex 0
age 177
sibsp 0
parch 0
fare 0
embarked 2
class 0
who 0
adult_male 0
deck 688
embark_town 2
alive 0
alone 0
dtype: int64
mean_age = round(data['age'].mean(),2)
data['age'] = data['age'].fillna(mean_age)
data.isnull().sum()
Output:
survived 0
pclass 0
sex 0
age 0
sibsp 0
parch 0
fare 0
embarked 2
class 0
who 0
adult_male 0
deck 688
embark_town 2
alive 0
alone 0
dtype: int64
data = data.drop(["deck","embark_town"],axis=1)
data = data.dropna()
data.isnull().sum()
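y = np.array(data['survived']) ## target
x = data[['pclass','sex','age','sibsp','parch','embarked']].copy() ## features; copy avoids SettingWithCopyWarning
x['sex'] = x['sex'].map({"male":0,"female":1}) ## encode sex as 0/1
x['embarked'] = x['embarked'].map({"S":0,"C":1,"Q":2}) ## encode port of embarkation as 0/1/2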
xtrain,xtest,ytrain,ytest = train_test_split(x,y,train_size=0.80,random_state=3)
model = LogisticRegression()
model.fit(xtrain,ytrain)
ypred = model.predict(xtest)
ypred
Output:
array([1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0])
ytest
Output:
array([1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1,
1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
0, 0])
cm = confusion_matrix(ytest,ypred)
cm
Output:
array([[96, 14],
[27, 41]])
cm.diagonal().sum()/cm.sum()
Output:
0.7696629213483146
a = accuracy_score(ytest,ypred)
a
Output:
0.7696629213483146
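Accuracy alone hides how the errors are distributed between the two classes. As a minimal sketch, the confusion matrix reported above can be unpacked into its four cells to obtain precision, recall and F1 as well; the variable names are chosen here for illustration, and the layout (rows = actual class, columns = predicted class) follows scikit-learn's confusion_matrix convention.

```python
import numpy as np

# Confusion matrix reported above: rows = actual class, columns = predicted class
cm = np.array([[96, 14],
               [27, 41]])

tn, fp = cm[0]   # actual 0: correctly rejected, falsely predicted as survived
fn, tp = cm[1]   # actual 1: missed survivors, correctly predicted survivors

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # share of predicted survivors who actually survived
recall = tp / (tp + fn)      # share of actual survivors that the model found
f1 = 2 * precision * recall / (precision + recall)

print("accuracy :", round(float(accuracy), 4))
print("precision:", round(float(precision), 4))
print("recall   :", round(float(recall), 4))
print("f1       :", round(float(f1), 4))
```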
1. In-Lab Exercise Problems (EPs): In-Lab Exercise Problems (EPs) are the problems to be
coded during the lab session. Students should complete all EPs in the lab slot itself, with the
necessary write-up in the space provided, by following the required Program
Development Steps (PDS).
Therefore, students should
a. work on the At-Home SPs and execute those SPs with additional test cases
before attending the lab.
b. complete the In-Lab EPs in the lab session.
Prior preparation on the EPs helps students complete them within the
stipulated lab session. It is always good practice to attend the lab with prior
preparation.
2. Students should focus on developing code for the In-Lab EPs that is
a. readable (with proper annotations and indentation)
b. maintainable
c. extendable
d. testable and
e. robust
EP1. Predict salary of an employee based on years of experience using linear regression.
EP2. Classify the loan applicants using logistic regression.
Note: For Viva-voce questions, the students should demonstrate their knowledge and skills
through oral communication. Writing answers to these Questions is Not mandatory.
However, it is good practice to have answers written after completion of the lab session for future reference.
E11. 15. Implement Decision tree, random forest, k-Nearest Neighbor algorithms.
CONCEPT AT A GLANCE
In general, decision tree analysis is a predictive modeling tool that can be applied across many areas.
Decision trees are constructed by an algorithmic approach that splits the dataset in different
ways based on different conditions. Decision trees are among the most widely used algorithms in
the category of supervised learning.
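A tree-building algorithm needs a way to compare candidate splits. The sketch below scores one split with Gini impurity, a common criterion and the default of scikit-learn's DecisionTreeClassifier; the text above does not name a specific criterion, so this choice is an assumption, and the labels are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Weighted Gini impurity of the two child nodes produced by a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Hypothetical class labels at a node and after one candidate split
parent = ["yes", "yes", "yes", "no", "no", "no"]
left, right = ["yes", "yes", "yes", "no"], ["no", "no"]

print("parent impurity:", gini(parent))
print("split impurity :", split_impurity(left, right))
```

A split is worth making when the weighted impurity of the children is lower than the impurity of the parent; the algorithm picks the condition with the largest reduction.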
Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both classification and regression problems in ML. It is based on the
concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex
problem and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the predictive accuracy of that
dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree
and predicts the final output based on the majority vote of those predictions. A larger number
of trees in the forest generally leads to higher accuracy and helps reduce the problem of overfitting.
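The majority-vote step can be sketched in a few lines. In the example below the per-tree predictions are hypothetical values written out by hand (no trees are actually trained); the point is only to show how the forest combines them into one prediction per sample.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions of five trees for three samples (one row per tree)
per_tree = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
    [1, 0, 0],
]

# zip(*per_tree) regroups the predictions sample by sample
forest_prediction = [majority_vote(votes) for votes in zip(*per_tree)]
print(forest_prediction)
```

Using an odd number of trees avoids ties in a two-class problem.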
The K-nearest neighbors (KNN) algorithm is a supervised ML algorithm that can be used for
both classification and regression problems. However, it is mainly used for
classification problems in industry. The following two properties define KNN well.
Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a specialized
training phase and uses all the training data at classification time.
Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm because it
does not assume anything about the underlying data distribution.
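Because KNN has no training phase, the whole algorithm fits in one function. The following is a minimal sketch, assuming Euclidean distance and a simple majority vote; the function name knn_predict and the toy data are introduced here for illustration only.

```python
import numpy as np
from collections import Counter

def knn_predict(x_train, y_train, query, k=3):
    """Classify one query point by majority vote among its k nearest neighbours."""
    # Euclidean distance from the query to every stored training point
    distances = np.linalg.norm(x_train - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most frequent label among those neighbours
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Hypothetical training data: two well-separated groups in two dimensions
x_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(x_train, y_train, np.array([1.1, 1.0])))
print(knn_predict(x_train, y_train, np.array([5.1, 5.0])))
```

All the work happens at prediction time, which is why KNN becomes slow on large training sets; features should also be on comparable scales, since the distance is dominated by the feature with the largest range.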
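Decision Tree:
import math
import xlsxwriter
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, roc_curve, auc
# Workbook in which the evaluation metrics are stored
book = xlsxwriter.Workbook("dt.xlsx")
sheet = book.add_worksheet()
r=0
sheet.write(r, 0, 'DecisionTree')
r=r+1
sheet.write(r, 0, 'Accuracy')
sheet.write(r, 1, 'Precision')
sheet.write(r, 2, 'Recall')
sheet.write(r, 3, 'F-measure')
sheet.write(r, 4,'Specificity')
sheet.write(r, 5,'GeometricMean')
sheet.write(r, 6,'AUC')
# df_final is assumed to be an already prepared DataFrame with a binary 'Defective' column
x = df_final.drop(['Defective'],axis=1)
y = df_final.Defective
# The train/test split is not shown in the original listing; the 70/30 ratio is an assumption
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)
clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)
predictions = clf.predict(x_test)
c=confusion_matrix(y_test, predictions)
print("confusion_matrix:")
print(c)
# c[0][0]=TN, c[0][1]=FP, c[1][0]=FN, c[1][1]=TP
mo=c[0][0] + c[0][1] + c[1][0] + c[1][1]   # all test samples
mo1=c[0][1] + c[1][1]   # samples predicted positive
mo2=c[1][0] + c[1][1]   # samples actually positive
if mo!=0:
    acc = round(((c[0][0] + c[1][1]) / mo),2)
else:
    acc=0
if mo1!=0:
    pre = round((c[1][1] / mo1),2)
else:
    pre =0
if mo2!=0:
    rec = round((c[1][1] / mo2),2)
else:
    rec=0
print("Accuracy=",acc)
print("Precision=",pre)
print("Recall=",rec)
mo3=pre+rec
if mo3!=0:
    fm = round((2 * pre * rec / mo3),2)
else:
    fm=0
print("F-measure=",fm)
mo4=c[0][0]+c[0][1]   # samples actually negative
if mo4!=0:
    sp = round((c[0][0] / mo4),2)
else:
    sp=0
print("Specificity=",sp)
gm = round((math.sqrt(rec * sp)),2)
print("Geometric Mean=",gm)
# ROC curve from the predicted labels; the result is stored in auc_score so that
# the imported auc() function is not overwritten
dt_fpr,dt_tpr,threshold = roc_curve(y_test,predictions)
auc_score = round((auc(dt_fpr,dt_tpr)),2)
print("AUC=",auc_score)
col=0
r=r+1
sheet.write(r, col, acc)
sheet.write(r, col + 1, pre)
sheet.write(r, col + 2, rec)
sheet.write(r, col + 3, fm)
sheet.write(r, col + 4, sp)
sheet.write(r, col + 5, gm)
sheet.write(r, col + 6, auc_score)
book.close()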
Output:
confusion_matrix:
[[252 25]
[ 26 262]]
Accuracy= 0.91
Precision= 0.91
Recall= 0.91
F-measure= 0.91
Specificity= 0.91
Geometric Mean= 0.91
AUC= 0.91
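Random Forest:
import math
import xlsxwriter
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_curve, auc
# Workbook in which the evaluation metrics are stored
book = xlsxwriter.Workbook("rf.xlsx")
sheet = book.add_worksheet()
r=0
sheet.write(r, 0, 'RandomForest')
r=r+1
sheet.write(r, 0, 'Accuracy')
sheet.write(r, 1, 'Precision')
sheet.write(r, 2, 'Recall')
sheet.write(r, 3, 'F-measure')
sheet.write(r, 4,'Specificity')
sheet.write(r, 5,'GeometricMean')
sheet.write(r, 6,'AUC')
# df_final is assumed to be an already prepared DataFrame with a binary 'Defective' column
x = df_final.drop(['Defective'],axis=1)
y = df_final.Defective
# The train/test split is not shown in the original listing; the 70/30 ratio is an assumption
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(x_train, y_train)
predictions = rfc.predict(x_test)
c=confusion_matrix(y_test, predictions)
print("confusion_matrix:")
print(c)
# c[0][0]=TN, c[0][1]=FP, c[1][0]=FN, c[1][1]=TP
mo=c[0][0] + c[0][1] + c[1][0] + c[1][1]   # all test samples
mo1=c[0][1] + c[1][1]   # samples predicted positive
mo2=c[1][0] + c[1][1]   # samples actually positive
if mo!=0:
    acc = round(((c[0][0] + c[1][1]) / mo),2)
else:
    acc=0
if mo1!=0:
    pre = round((c[1][1] / mo1),2)
else:
    pre =0
if mo2!=0:
    rec = round((c[1][1] / mo2),2)
else:
    rec=0
print("Accuracy=",acc)
print("Precision=",pre)
print("Recall=",rec)
mo3=pre+rec
if mo3!=0:
    fm = round((2 * pre * rec / mo3),2)
else:
    fm=0
print("F-measure=",fm)
mo4=c[0][0]+c[0][1]   # samples actually negative
if mo4!=0:
    sp = round((c[0][0] / mo4),2)
else:
    sp=0
print("Specificity=",sp)
gm = round((math.sqrt(rec * sp)),2)
print("Geometric Mean=",gm)
# ROC curve from the predicted labels; the result is stored in auc_score so that
# the imported auc() function is not overwritten
rf_fpr,rf_tpr,thresholds = roc_curve(y_test,predictions)
auc_score = round((auc(rf_fpr,rf_tpr)),2)
print("AUC=",auc_score)
col=0
r=r+1
sheet.write(r, col, acc)
sheet.write(r, col + 1, pre)
sheet.write(r, col + 2, rec)
sheet.write(r, col + 3, fm)
sheet.write(r, col + 4, sp)
sheet.write(r, col + 5, gm)
sheet.write(r, col + 6, auc_score)
book.close()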
Output:
confusion_matrix:
[[271 14]
[ 35 245]]
Accuracy= 0.91
Precision= 0.95
Recall= 0.88
F-measure= 0.91
Specificity= 0.95
Geometric Mean= 0.91
AUC= 0.91
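k-Nearest Neighbor:
import math
import xlsxwriter
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, roc_curve, auc
# Workbook in which the evaluation metrics are stored
book = xlsxwriter.Workbook("knn.xlsx")
sheet = book.add_worksheet()
r=0
# Title row; the label 'KNN' is an assumption, as this line is missing from the original listing
sheet.write(r, 0, 'KNN')
r=r+1
sheet.write(r, 0, 'Accuracy')
sheet.write(r, 1, 'Precision')
sheet.write(r, 2, 'Recall')
sheet.write(r, 3, 'F-measure')
sheet.write(r, 4,'Specificity')
sheet.write(r, 5,'GeometricMean')
sheet.write(r, 6,'AUC')
# df_final is assumed to be an already prepared DataFrame with a binary 'Defective' column
x = df_final.drop(['Defective'],axis=1)
y = df_final.Defective
# The train/test split is not shown in the original listing; the 70/30 ratio is an assumption
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)
predictions = knn.predict(x_test)
c=confusion_matrix(y_test, predictions)
print("confusion_matrix:")
print(c)
# c[0][0]=TN, c[0][1]=FP, c[1][0]=FN, c[1][1]=TP
mo=c[0][0] + c[0][1] + c[1][0] + c[1][1]   # all test samples
mo1=c[0][1] + c[1][1]   # samples predicted positive
mo2=c[1][0] + c[1][1]   # samples actually positive
if mo!=0:
    acc = round(((c[0][0] + c[1][1]) / mo),2)
else:
    acc=0
if mo1!=0:
    pre = round((c[1][1] / mo1),2)
else:
    pre =0
if mo2!=0:
    rec = round((c[1][1] / mo2),2)
else:
    rec=0
print("Accuracy=",acc)
print("Precision=",pre)
print("Recall=",rec)
mo3=pre+rec
if mo3!=0:
    fm = round((2 * pre * rec / mo3),2)
else:
    fm=0
print("F-measure=",fm)
mo4=c[0][0]+c[0][1]   # samples actually negative
if mo4!=0:
    sp = round((c[0][0] / mo4),2)
else:
    sp=0
print("Specificity=",sp)
gm = round((math.sqrt(rec * sp)),2)
print("Geometric Mean=",gm)
# ROC curve from the predicted labels; the result is stored in auc_score so that
# the imported auc() function is not overwritten
knn_fpr,knn_tpr,threshold = roc_curve(y_test,predictions)
auc_score = round((auc(knn_fpr,knn_tpr)),2)
print("AUC=",auc_score)
col=0
r=r+1
sheet.write(r, col, acc)
sheet.write(r, col + 1, pre)
sheet.write(r, col + 2, rec)
sheet.write(r, col + 3, fm)
sheet.write(r, col + 4, sp)
sheet.write(r, col + 5, gm)
sheet.write(r, col + 6, auc_score)
book.close()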
Output:
confusion_matrix:
[[277 14]
[ 36 238]]
Accuracy= 0.91
Precision= 0.94
Recall= 0.87
F-measure= 0.9
Specificity= 0.95
Geometric Mean= 0.91
AUC= 0.91
1. In-Lab Exercise Problems (EPs): In-Lab Exercise Problems (EPs) are the problems to be
coded during the lab session. Students should complete all EPs in the lab slot itself, with the
necessary write-up in the space provided, by following the required Program
Development Steps (PDS).
Therefore, students should
a. work on the At-Home SPs and execute those SPs with additional test cases
before attending the lab.
b. complete the In-Lab EPs in the lab session.
Prior preparation on the EPs helps students complete them within the
stipulated lab session. It is always good practice to attend the lab with prior
preparation.
2. Students should focus on developing code for the In-Lab EPs that is
a. readable (with proper annotations and indentation)
b. maintainable
c. extendable
d. testable and
e. robust
EP1. Using the weather dataset, predict the “Play” attribute using the decision tree algorithm.
EP2. Using customer transaction history, predict the customer's decision on a product purchase.
Note: For Viva-voce questions, the students should demonstrate their knowledge and skills
through oral communication. Writing answers to these Questions is Not mandatory.
However, it is good practice to have answers written after completion of the lab session for future reference.
CONCEPT AT A GLANCE
K-Means Algorithm
The K-means clustering algorithm computes the centroids and iterates until the centroids stop
changing, which gives a locally optimal solution. It assumes that the number of clusters is already
known. It is also called a flat clustering algorithm. The number of clusters to be identified from the
data is represented by ‘K’ in K-means.
In this algorithm, the data points are assigned to clusters in such a manner that the sum of the squared
distances between the data points and their centroid is minimized. Less variation within a cluster
means that the data points in that cluster are more similar to each other.
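The two alternating steps of the algorithm (assign each point to its nearest centroid, then move each centroid to the mean of its points) can be sketched directly with NumPy. This is a simplified illustration, not the scikit-learn implementation: the initial centroids are passed in by hand, empty clusters are not handled, and the data points are hypothetical.

```python
import numpy as np

def assign(points, centroids):
    """Index of the nearest centroid for every point (Euclidean distance)."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(points, centroids, n_iter=20):
    """Plain k-means: alternate assignment and centroid update steps."""
    for _ in range(n_iter):
        labels = assign(points, centroids)
        # Update step: each centroid moves to the mean of its assigned points
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(len(centroids))])
        if np.allclose(new_centroids, centroids):
            break   # centroids stopped moving: converged
        centroids = new_centroids
    labels = assign(points, centroids)
    # Sum of squared distances of the points to their own centroid
    sse = float(((points - centroids[labels]) ** 2).sum())
    return labels, centroids, sse

# Hypothetical data: two well-separated groups in two dimensions
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
                   [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])

# Use the first and fourth points as initial centroids (K = 2)
labels, centroids, sse = kmeans(points, points[[0, 3]].copy())
print(labels)
print(centroids)
print(sse)
```

The returned sum of squared distances is the quantity that scikit-learn exposes as inertia_ and that the elbow method plots against the number of clusters.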
Hierarchical Clustering
Hierarchical clustering is another unsupervised learning algorithm that is used to group together
unlabeled data points having similar characteristics. Hierarchical clustering algorithms fall into the
following two categories −
Agglomerative hierarchical algorithms − In agglomerative hierarchical algorithms, each data point is
initially treated as a single cluster, and pairs of clusters are then successively merged (bottom-up
approach).
Divisive hierarchical algorithms − On the other hand, in divisive hierarchical algorithms, all the data
points are treated as one big cluster and the process of clustering involves dividing (top-down
approach) the one big cluster into various small clusters.
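The bottom-up (agglomerative) counterpart of the top-down approach can be illustrated with a short sketch. The version below assumes single linkage, i.e. the distance between two clusters is the distance between their closest members; this linkage choice, the function name and the data are assumptions made for illustration, and the nested loops are written for clarity rather than speed.

```python
import numpy as np

def agglomerative(points, n_clusters):
    """Bottom-up clustering with single linkage (closest-pair distance)."""
    # Start with every point in its own cluster (lists of point indices)
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        # Find the two clusters whose closest members are nearest to each other
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a and drop b
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical one-dimensional data points
points = np.array([[0.0], [0.2], [5.0], [5.3], [9.0]])

print(agglomerative(points, 3))
print(agglomerative(points, 2))
```

Each pass merges exactly one pair of clusters, so stopping at different values of n_clusters reads off different levels of the same hierarchy.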
K-Means algorithm:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
df = pd.read_csv('/content/iris.csv')
x = df.iloc[:, [0,1,2,3]].values
kmeans5 = KMeans(n_clusters=5)
y_kmeans5 = kmeans5.fit_predict(x)
print(y_kmeans5)
kmeans5.cluster_centers_
Output:
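[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 3 0 0 0 3 0 3 3 0 3 0 3 0 0 3 0 3 0 3 0 0
0 0 0 0 0 3 3 3 3 0 3 0 0 0 3 3 3 0 3 3 3 3 3 0 3 3 4 0 2 4 4 2 3 2 4 2 4
4 4 0 4 4 4 2 2 0 4 0 2 0 4 2 0 0 4 2 2 2 4 0 0 2 4 4 0 4 4 4 0 4 4 4 0 4
4 0]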
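import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
Error =[]
## the loop is missing from the original listing; trying 1 to 10 clusters is an assumption
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(x)
    Error.append(kmeans.inertia_)
plt.plot(range(1, 11), Error)
plt.title('Elbow method')
plt.xlabel('No of clusters')
plt.ylabel('Error')
plt.show()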
kmeans3 = KMeans(n_clusters=3)
y_kmeans3 = kmeans3.fit_predict(x)
print(y_kmeans3)
kmeans3.cluster_centers_
Output:
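[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 2 2 2 2
2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2
2 1]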
1. In-Lab Exercise Problems (EPs): In-Lab Exercise Problems (EPs) are the problems to be
coded during the lab session. Students should complete all EPs in the lab slot itself, with the
necessary write-up in the space provided, by following the required Program
Development Steps (PDS).
Therefore, students should
a. work on the At-Home SPs and execute those SPs with additional test cases
before attending the lab.
b. complete the In-Lab EPs in the lab session.
Prior preparation on the EPs helps students complete them within the
stipulated lab session. It is always good practice to attend the lab with prior
preparation.
2. Students should focus on developing code for the In-Lab EPs that is
a. readable (with proper annotations and indentation)
b. maintainable
c. extendable
d. testable and
e. robust
EP1. Cluster the employees based on their salaries using the k-means algorithm.
1. Define a cluster.
2. List various partitioning algorithms.
3. Give the difference between agglomerative and divisive hierarchical clustering methods.
4. What is the K-Means algorithm?
5. Name the library used for the K-Means method in Python.
Note: For Viva-voce questions, the students should demonstrate their knowledge and skills
through oral communication. Writing answers to these Questions is Not mandatory.
However, it is good practice to have answers written after completion of the lab session for future reference.