NumPy, Pandas, Matplotlib, Seaborn, Scikit-learn (sklearn)
Numeric index (finding data using this index is not easy)
Index  Data
0      1776
1      1867
2      1821

Labelled index (finding data using this index is very easy)
Labelled Index  Data
USA             1776
Canada          1867
England         1821

Both indexes (we can still use the numeric index, if we want)
Numeric Index  Labelled Index  Data
0              USA             1776
1              Canada          1867
2              England         1821
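The difference between the two kinds of index can be sketched with a pandas Series. This is a minimal illustration using the values from the table above; the variable names are assumptions made for the example.

```python
import pandas as pd

# Same data twice: once with the default numeric index, once with labels.
years_numeric = pd.Series([1776, 1867, 1821])
years = pd.Series([1776, 1867, 1821], index=["USA", "Canada", "England"])

# Numeric index only: we must already know which position holds Canada.
print(years_numeric[1])

# Labelled index: look the value up directly by name.
print(years["Canada"])

# The numeric position is still available through .iloc.
print(years.iloc[1])
```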
DataFrame
• DataFrame: A table of rows and columns that can be easily restructured and filtered
Series
Index    Year
USA      1776
Canada   1867
England  1821

Multiple Series with the Same Index
Index    Year    Index    Pop    Index    GDP
USA      1776    USA      328    USA      20.5
Canada   1867    Canada   38     Canada   1.7
England  1821    England  126    England  3.9

DataFrame
Index    Year  Pop  GDP
USA      1776  328  20.5
Canada   1867  38   1.7
England  1821  126  3.9

• So, DataFrame = several Series that share the same index, like a spreadsheet
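As a rough sketch of this idea, three Series that share one index can be combined into a DataFrame. The values come from the figure above; using pd.concat is one possible way to do it, not the only one.

```python
import pandas as pd

index = ["USA", "Canada", "England"]

# Three separate Series that share the same labelled index.
year = pd.Series([1776, 1867, 1821], index=index, name="Year")
pop = pd.Series([328, 38, 126], index=index, name="Pop")
gdp = pd.Series([20.5, 1.7, 3.9], index=index, name="GDP")

# Placing them side by side (axis=1) gives one DataFrame, one column per Series.
df = pd.concat([year, pop, gdp], axis=1)
print(df)

# Each column of the DataFrame is again a Series.
print(type(df["Year"]))
```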
DataFrame
• Basic operations
• Create a dataframe
• Select a column/multiple columns
• Select a row/multiple rows
• Insert a new column/row
• Advanced Operations
• Indexing
• Filtering
• Missing data
• Grouping
• Joining
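The operations listed above could look roughly like this in pandas. This is a sketch rather than a complete tour: the "Atlantis" row and the "Continent" column are hypothetical additions made only to demonstrate missing data and grouping.

```python
import numpy as np
import pandas as pd

# Create a DataFrame (values from the slides, indexed by country).
df = pd.DataFrame(
    {"Year": [1776, 1867, 1821], "Pop": [328, 38, 126], "GDP": [20.5, 1.7, 3.9]},
    index=["USA", "Canada", "England"],
)

# Select a column (Series) or multiple columns (DataFrame).
pop = df["Pop"]
year_gdp = df[["Year", "GDP"]]

# Select a row by label, or multiple rows by position.
canada = df.loc["Canada"]
first_two = df.iloc[0:2]

# Insert a new column and a new row (hypothetical row with missing values).
df["GDP_per_Pop"] = df["GDP"] / df["Pop"]
df.loc["Atlantis"] = [1900, 10, np.nan, np.nan]

# Indexing: move the labels into a regular column and back again.
flat = df.rename_axis("Country").reset_index()
relabelled = flat.set_index("Country")

# Filtering: keep rows that satisfy a condition.
large = df[df["Pop"] > 100]

# Missing data: count missing values, then drop incomplete rows.
missing_gdp = df["GDP"].isna().sum()
cleaned = df.dropna().copy()

# Grouping: aggregate by a (hypothetical) category column.
cleaned["Continent"] = ["North America", "North America", "Europe"]
pop_by_continent = cleaned.groupby("Continent")["Pop"].sum()

# Joining: combine with another DataFrame that shares the same index.
capitals = pd.DataFrame(
    {"Capital": ["Washington, D.C.", "Ottawa", "London"]},
    index=["USA", "Canada", "England"],
)
joined = cleaned.join(capitals)
print(joined)
```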
Matplotlib
• Data visualization is very important to quickly understand trends and relationships in the data
• Matplotlib: One of the most popular plotting libraries in Python
• Grandfather of plotting and visualization libraries in Python
• Seaborn and the built-in Pandas plotting functions are both built on top of Matplotlib
• Heavily inspired by the plotting functions in MATLAB
• Two approaches: (1) Functional (2) Object-oriented (OOP)
• Main goals
• Plot a functional relationship, e.g. y = 2x
• Plot a relationship between raw data points: x = [1, 2, 3, 4] and y = [2, 4, 6, 8]
• Main types of plots
• Line Plot: Great for showing functional relationships and continuous data
• Scatter Plot: Useful for plotting raw data points and understanding the correlation between two variables
• Bar Plot: Useful for categorical data to show comparisons between different groups
• Histogram: Good for showing the distribution of a single variable
• Pie Chart: Used for showing proportions or percentages of categories
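The two main goals above can be sketched on a single plot: a line for the functional relationship y = 2x and markers for the raw data points. The Agg backend and the output file name are assumptions made so that the sketch runs without a display; in an interactive session, plt.show() would be used instead.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove to open a window
import matplotlib.pyplot as plt
import numpy as np

# Goal 1: a functional relationship, y = 2x, sampled densely so it draws as a line.
x_line = np.linspace(0, 5, 50)
plt.plot(x_line, 2 * x_line, label="y = 2x")

# Goal 2: raw data points, drawn as individual markers.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
plt.scatter(x, y, color="red", label="raw data")

plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.savefig("main_goals.png")  # hypothetical output file name
```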
Matplotlib Approaches
import matplotlib.pyplot as plt
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
• Functional
• plt.plot(x, y) # Plotting using a simple functional call
• plt.xlabel('x-axis')
• plt.ylabel('y-axis')
• plt.show()
• Object-oriented
• fig, ax = plt.subplots() # Create a figure and a set of subplots
• ax.plot(x, y) # Plot on the axes object
• ax.set_xlabel('x-axis') # Set label for x-axis
• ax.set_ylabel('y-axis') # Set label for y-axis
• plt.show()
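One situation where the object-oriented approach tends to pay off is a figure with more than one plot, because each axes object can be configured on its own. The sketch below assumes a headless run (Agg backend) and a hypothetical output file name; the titles are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove to open a window
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [2, 4, 6, 8]

# One figure with two axes side by side.
fig, (ax_line, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

# Left axes: line plot with its own title and labels.
ax_line.plot(x, y)
ax_line.set_title("Line plot")
ax_line.set_xlabel("x-axis")
ax_line.set_ylabel("y-axis")

# Right axes: scatter plot of the same data, configured independently.
ax_scatter.scatter(x, y)
ax_scatter.set_title("Scatter plot")
ax_scatter.set_xlabel("x-axis")

fig.tight_layout()
fig.savefig("two_axes.png")  # hypothetical output file name
```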
Seaborn
• Seaborn: Statistical plotting library
• Built on top of Matplotlib, but uses a simpler one-line syntax
• Can work directly with Pandas DataFrames
• Easy to use, but offers less fine-grained customization than Matplotlib
• Types of plots
• Scatter plots: Relationship between two continuous variables (Trends, correlations)
• Distribution plots: How a single variable is distributed (patterns, skew, outliers) (Histogram, KDE plot)
• Categorical plots: Categorical variables and their relationships with continuous data (Box plot, bar plot, count plot)
• Comparison plots: Compare two or more variables (pair plot)
• Matrix plots: Complex relationship in a matrix form (heatmap)
Data Visualization Summary

| Plot | Usage | Example | Code |
|---|---|---|---|
| Line plot | Trends over time periods/data points | Stock prices over a month | plt.plot([1, 2, 3], [4, 5, 6]); plt.show() |
| Scatter plot | Relationship between two numeric variables | House price versus area of the house | plt.scatter([1, 2, 3], [4, 5, 6]); plt.show() |
| Bar plot | Compare categories/groups with respect to numeric values | Sales across product categories | plt.bar(['A', 'B', 'C'], [4, 7, 1]); plt.show() |
| Histogram | Distribution of a single numeric variable | Distribution of ages in a country | plt.hist([1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5], bins=5); plt.show() |
| Box plot | Distribution of data with reference to minimum, Q1, Q2, Q3, maximum | Income distribution across professions | plt.boxplot([7, 2, 5, 13, 9, 6]); plt.show() |
| Heatmap | Matrix to display values using colour intensity | Correlation matrix between height and weight | sns.heatmap(np.random.rand(5, 5), annot=True); plt.show() |
| Pie chart | Proportions/percentages among categories | Market share of mobile phone brands | plt.pie([40, 30, 20, 10], labels=['A', 'B', 'C', 'D'], autopct='%1.1f%%'); plt.show() |