data handling module
NumPy
Debasish Dutta
July 4, 2024
Contents
1 Pandas
1.1 Basic Operations and Data Structures
1.1.1 Creating Series and DataFrames
1.1.2 Basic Indexing and Slicing
1.2 Data Exploration and Manipulation
1.2.1 Data Overview
1.2.2 Handling Missing Data
1.2.3 Sorting
1.2.4 Filtering
1.2.5 Applying Functions
1.3 Data Cleaning
1.3.1 Removing Duplicates
1.3.2 String Manipulation
1.3.3 Changing Data Types
1.4 Data Aggregation and Grouping
1.4.1 Grouping Data
1.4.2 Aggregation Functions
1.4.3 Pivot Tables
1.5 Combining DataFrames
1.5.1 Concatenation
1.5.2 Merging and Joining
1.6 Time Series Data
1.6.1 Resampling
1.6.2 Date/Time Indexing
1.7 Visualization with Pandas
1.7.1 Plotting
1.8 Exporting Data
1.8.1 Saving to CSV
2 NumPy
2.1 Basic Operations and Data Structures
2.1.1 Creating Arrays
2.1.2 Basic Indexing and Slicing
2.2 Mathematical Operations
2.2.1 Element-wise Operations
2.2.2 Matrix Operations
2.3 Statistical Functions
2.3.1 Descriptive Statistics
2.4 Linear Algebra
2.4.1 Eigenvalues and Eigenvectors
2.4.2 Solving Linear Equations
2.5 Random Sampling
2.5.1 Generating Random Numbers
2.6 Saving and Loading Data
2.6.1 Saving Arrays
1 Pandas
1.1 Basic Operations and Data Structures
1.1.1 Creating Series and DataFrames
Explanation: A Series is a one-dimensional labeled array that can hold any data
type. It can be created from a list, a dictionary, or a scalar value. A DataFrame is a
two-dimensional labeled data structure whose columns may have different types. It
can be created from a dictionary of lists, a list of dictionaries, or other data structures.
Usage:
import pandas as pd
# Series
s = pd.Series([1, 3, 5, 7, 9], index=['a', 'b', 'c', 'd', 'e'])
print(s)
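The explanation above also mentions building a Series from a dictionary and a DataFrame from a dictionary of lists or a list of dictionaries. A minimal sketch of those constructors follows; the variable names and sample values are illustrative assumptions, not taken from the original listing.

```python
import pandas as pd

# Series from a dictionary: the keys become the index labels
s_from_dict = pd.Series({'a': 1, 'b': 3, 'c': 5})

# DataFrame from a dictionary of lists: the keys become column names
df_from_lists = pd.DataFrame({'A': [1, 2, 3], 'B': ['x', 'y', 'z']})

# DataFrame from a list of dictionaries: each dictionary is one row
df_from_rows = pd.DataFrame([
    {'A': 1, 'B': 'x'},
    {'A': 2, 'B': 'y'},
    {'A': 3, 'B': 'z'},
])

print(s_from_dict)
print(df_from_lists)
print(df_from_lists.dtypes)  # each column keeps its own type
```

Both DataFrame constructions here should describe the same table; the dictionary-of-lists form is column-oriented, while the list-of-dictionaries form is row-oriented.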
Usage:
# First few rows
print(df.head())
# Descriptive statistics
print(df.describe())
1.2.3 Sorting
Explanation: Sorting organizes the data and makes it easier to analyze. Pandas can
sort a DataFrame by the values in one or more columns or by its index.
Usage:
# Sort by values in column 'A'
df_sorted_by_values = df.sort_values(by='A')
# Sort by index
df_sorted_by_index = df.sort_index()
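As a small illustration of the two sorting modes, the sketch below sorts on several columns with mixed directions and sorts the index in descending order. The sample DataFrame and variable names are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample data with a string index
df = pd.DataFrame(
    {'A': [3, 1, 2, 1], 'B': [10, 40, 30, 20]},
    index=['w', 'x', 'y', 'z'],
)

# Sort by 'A' ascending, breaking ties by 'B' descending
df_multi = df.sort_values(by=['A', 'B'], ascending=[True, False])

# Sort by index in descending order
df_index_desc = df.sort_index(ascending=False)

print(df_multi)
print(df_index_desc)
```

Both calls return a new DataFrame and leave df unchanged, which is why the results are assigned to new names.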
1.2.4 Filtering
Explanation: Filtering selects the rows that satisfy a Boolean condition, which
extracts the subset of the data that meets specific criteria.
Usage:
# Filter rows where column 'A' is greater than 1
filtered_df = df[df['A'] > 1]
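Filters often involve more than one condition. The sketch below, using a made-up DataFrame, combines conditions with the element-wise operators & and |, negates a condition with ~, and tests membership with isin. Each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': ['x', 'y', 'x', 'z']})

# Rows where both conditions hold
both = df[(df['A'] > 1) & (df['B'] == 'x')]

# Rows where at least one condition holds
either = df[(df['A'] > 3) | (df['B'] == 'y')]

# Rows where the condition does not hold
not_x = df[~(df['B'] == 'x')]

# Rows whose 'B' value is in a given set
in_set = df[df['B'].isin(['y', 'z'])]

print(both)
print(either)
```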
# Apply a function to each column
df_applied = df.apply(lambda x: x * 2)
# Replace substring
df['A'] = df['A'].str.replace('old', 'new')
1.4.2 Aggregation Functions
Explanation: Aggregation functions can be applied to grouped data to calculate
summary statistics.
Usage:
# Group by column 'A' and calculate sum and mean
aggregated_df = df.groupby('A').agg(['sum', 'mean'])
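Passing a list of functions to agg applies every function to every remaining column and produces two-level column labels. When different columns need different statistics, a dictionary or named aggregation may be easier to read. The sketch below uses invented data and output names (b_total, c_avg) purely for illustration.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'A': ['p', 'q', 'p', 'q'],
    'B': [1, 2, 3, 4],
    'C': [10.0, 20.0, 30.0, 40.0],
})

# A different aggregation for each column
per_column = df.groupby('A').agg({'B': 'sum', 'C': 'mean'})

# Named aggregation: explicit, flat output column names
named = df.groupby('A').agg(
    b_total=('B', 'sum'),
    c_avg=('C', 'mean'),
)

print(per_column)
print(named)
```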
1.6.2 Date/Time Indexing
Explanation: Using a date/time column as the index enables time series operations
such as date-based selection and resampling.
Usage:
# Set column 'date' as index
df.set_index('date', inplace=True)
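Date-based selection only works when the index holds real datetime values, so a column of date strings is usually converted first. The following sketch assumes a small made-up DataFrame with a 'date' column of ISO-formatted strings and a 'value' column.

```python
import pandas as pd

# Hypothetical sample data with dates stored as strings
df = pd.DataFrame({
    'date': ['2024-01-30', '2024-01-31', '2024-02-01', '2024-02-02'],
    'value': [1, 2, 3, 4],
})

# Convert the strings to datetime values, then use them as the index
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')

# Partial-string selection: all rows in February 2024
february = df.loc['2024-02']

# Label-based date slicing; both endpoints are included
window = df.loc['2024-01-31':'2024-02-01']

print(february)
print(window)
```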
# Scatter plot
df.plot(kind='scatter', x='A', y='B')
# Histogram
df['A'].plot(kind='hist')
2 NumPy
2.1 Basic Operations and Data Structures
2.1.1 Creating Arrays
Explanation: A NumPy array (ndarray) is an N-dimensional container whose elements
all share one data type, which makes storing and manipulating numerical data efficient.
Usage:
import numpy as np
# Create a NumPy array of ones
ones_arr = np.ones((2, 2))
# Slicing
print(arr[1:4])  # Elements from index 1 to 3
# Dot product
result = np.dot(vector1, vector2)
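To make the dot product concrete, the sketch below defines two small vectors (the values are assumptions for illustration) and contrasts np.dot with element-wise multiplication. It also shows the @ operator, which performs matrix multiplication on 2-D arrays.

```python
import numpy as np

# Hypothetical example vectors
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])

# Dot product: sum of element-wise products, 1*4 + 2*5 + 3*6
result = np.dot(vector1, vector2)

# Element-wise product: an array, not a scalar
elementwise = vector1 * vector2

# Matrix multiplication of 2-D arrays with the @ operator
m = np.array([[1, 2], [3, 4]])
matmul = m @ m

print(result)
print(elementwise)
print(matmul)
```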
# Median
median_value = np.median(arr)
# Standard deviation
std_deviation = np.std(arr)
# Variance
variance = np.var(arr)
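One detail worth knowing about these functions: by default np.std and np.var compute the population statistics (dividing by N). Passing ddof=1 gives the sample versions (dividing by N - 1). The sketch below uses an invented array to show the difference.

```python
import numpy as np

# Hypothetical sample data
arr = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

median_value = np.median(arr)

# Population standard deviation and variance (divide by N)
pop_std = np.std(arr)
pop_var = np.var(arr)

# Sample standard deviation and variance (divide by N - 1)
sample_std = np.std(arr, ddof=1)
sample_var = np.var(arr, ddof=1)

print(median_value, pop_std, pop_var)
print(sample_std, sample_var)
```

The sample versions are always at least as large as the population versions, and the gap shrinks as the array grows.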