data handling module

The document is an introduction to data handling using Pandas and NumPy, covering basic operations, data structures, data exploration, manipulation, cleaning, aggregation, and visualization techniques. It includes practical usage examples for creating Series and DataFrames, handling missing data, sorting, filtering, and applying functions. Additionally, it addresses NumPy functionalities such as mathematical operations, statistical functions, and saving/loading data.


Introduction to Data Handling with Pandas and NumPy
Debasish Dutta
July 4, 2024

Contents
1 Pandas
1.1 Basic Operations and Data Structures
1.1.1 Creating Series and DataFrames
1.1.2 Basic Indexing and Slicing
1.2 Data Exploration and Manipulation
1.2.1 Data Overview
1.2.2 Handling Missing Data
1.2.3 Sorting
1.2.4 Filtering
1.2.5 Applying Functions
1.3 Data Cleaning
1.3.1 Removing Duplicates
1.3.2 String Manipulation
1.3.3 Changing Data Types
1.4 Data Aggregation and Grouping
1.4.1 Grouping Data
1.4.2 Aggregation Functions
1.4.3 Pivot Tables
1.5 Combining DataFrames
1.5.1 Concatenation
1.5.2 Merging and Joining
1.6 Time Series Data
1.6.1 Resampling
1.6.2 Date/Time Indexing
1.7 Visualization with Pandas
1.7.1 Plotting
1.8 Exporting Data
1.8.1 Saving to CSV
2 NumPy
2.1 Basic Operations and Data Structures
2.1.1 Creating Arrays
2.1.2 Basic Indexing and Slicing
2.2 Mathematical Operations
2.2.1 Element-wise Operations
2.2.2 Matrix Operations
2.3 Statistical Functions
2.3.1 Descriptive Statistics
2.4 Linear Algebra
2.4.1 Eigenvalues and Eigenvectors
2.4.2 Solving Linear Equations
2.5 Random Sampling
2.5.1 Generating Random Numbers
2.6 Saving and Loading Data
2.6.1 Saving Arrays

1 Pandas
1.1 Basic Operations and Data Structures
1.1.1 Creating Series and DataFrames
Explanation: A Series is a one-dimensional labeled array capable of holding any
data type. It can be created from a list, dictionary, or scalar value. A DataFrame is a
two-dimensional labeled data structure with columns of potentially different types. It
can be created from a dictionary of lists, a list of dictionaries, or other data structures.
Usage:
import pandas as pd

# Series with an explicit label index
s = pd.Series([1, 3, 5, 7, 9], index=['a', 'b', 'c', 'd', 'e'])
print(s)

# DataFrame from a dictionary of lists
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)
print(df)

# DataFrame from a list of dictionaries
data = [{'A': 1, 'B': 4}, {'A': 2, 'B': 5}, {'A': 3, 'B': 6}]
df = pd.DataFrame(data)
print(df)
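The explanation also mentions building a Series from a dictionary or a scalar. A minimal sketch of both cases follows; the names s_from_dict and s_from_scalar are illustrative and do not appear in the original text.

```python
import pandas as pd

# Series from a dictionary: the keys become the index labels
s_from_dict = pd.Series({'x': 10, 'y': 20, 'z': 30})

# Series from a scalar: the value is repeated once per index label
s_from_scalar = pd.Series(5, index=['a', 'b', 'c'])

print(s_from_dict)
print(s_from_scalar)
```

When a scalar is used, the index argument determines the length of the resulting Series.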

1.1.2 Basic Indexing and Slicing


Explanation: Indexing and slicing are used to access specific elements or subsets of
data from Series or DataFrames. This can be done using labels or integer positions.
Usage:
# Series indexing by label
print(s['a'])

# Series indexing by integer position (.iloc is explicit; s[0] on a
# label-indexed Series is deprecated in recent pandas versions)
print(s.iloc[0])

# DataFrame column selection by label
print(df['A'])

# DataFrame indexing by integer position
print(df.iloc[0, 1])  # Row 0, Column 1

# DataFrame indexing by label
print(df.loc[0, 'B'])  # Row 0, Column 'B'

1.2 Data Exploration and Manipulation


1.2.1 Data Overview
Explanation: Pandas provides methods to get a quick overview of the dataset, such
as the first few rows, summary of the DataFrame, and descriptive statistics.

Usage:
# First few rows
print(df.head())

# Summary of the DataFrame (info() prints directly and returns None)
df.info()

# Descriptive statistics
print(df.describe())

1.2.2 Handling Missing Data


Explanation: Handling missing data is crucial in data analysis. Pandas provides
methods to detect, remove, or fill missing values.
Usage:
# Detect missing values
print(df.isna())

# Drop rows with missing values (returns a new DataFrame)
df_dropped = df.dropna()

# Fill missing values with 0 (returns a new DataFrame)
df_filled = df.fillna(0)
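The running example DataFrame has no missing values, so the calls above have nothing to act on. The sketch below uses a small hypothetical DataFrame (df_missing, not from the original) that contains NaN entries, and also shows filling with the column mean as one possible alternative to a constant.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
df_missing = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [4.0, 5.0, np.nan]})

# Count missing values per column
missing_per_column = df_missing.isna().sum()

# Keep only the rows that have no missing values
rows_without_missing = df_missing.dropna()

# Replace each missing value with the mean of its column
filled_with_mean = df_missing.fillna(df_missing.mean())

print(missing_per_column)
print(rows_without_missing)
print(filled_with_mean)
```

Whether to drop or fill depends on the analysis; dropping discards whole rows, while filling keeps them at the cost of introducing estimated values.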

1.2.3 Sorting
Explanation: Sorting organizes the data and makes it easier to analyze. Pandas
allows sorting by values or by index.
Usage:
# Sort by values in column 'A'
df_sorted_by_values = df.sort_values(by='A')

# Sort by index
df_sorted_by_index = df.sort_index()

1.2.4 Filtering
Explanation: Filtering data based on conditions allows extracting subsets of data
that meet specific criteria.
Usage:
# Filter rows where column 'A' is greater than 1
filtered_df = df[df['A'] > 1]
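Conditions can also be combined. The following sketch, on a hypothetical DataFrame named df_example, shows the element-wise operators & (and) and | (or), plus isin for membership tests. Each condition needs its own parentheses because of Python's operator precedence.

```python
import pandas as pd

# Hypothetical data used only for this illustration
df_example = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})

# Rows where both conditions hold
both = df_example[(df_example['A'] > 1) & (df_example['B'] < 40)]

# Rows where at least one condition holds
either = df_example[(df_example['A'] == 1) | (df_example['B'] == 40)]

# Rows whose 'A' value is in a given list
members = df_example[df_example['A'].isin([2, 4])]

print(both)
print(either)
print(members)
```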

1.2.5 Applying Functions


Explanation: Applying functions transforms data or performs operations on it.
Pandas provides apply (along an axis of a DataFrame, or over a Series), applymap
(element-wise over a DataFrame; renamed to DataFrame.map in pandas 2.1), and
map (element-wise over a Series).
Usage:
# Apply a function to each column
df_applied = df.apply(lambda x: x * 2)

# Apply a function element-wise to column 'A'
df['A'] = df['A'].map(lambda x: x * 2)
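By default, DataFrame.apply passes each column to the function; with axis=1 it passes each row instead. A short sketch of the difference, using a hypothetical df_example:

```python
import pandas as pd

# Hypothetical data used only for this illustration
df_example = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Column-wise (default axis=0): the function receives one column at a time
column_ranges = df_example.apply(lambda col: col.max() - col.min())

# Row-wise (axis=1): the function receives one row at a time
row_sums = df_example.apply(lambda row: row['A'] + row['B'], axis=1)

print(column_ranges)
print(row_sums)
```

For simple arithmetic such as the row sum, the vectorized form df_example['A'] + df_example['B'] gives the same result and is usually faster; apply is most useful when no vectorized equivalent exists.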

1.3 Data Cleaning


1.3.1 Removing Duplicates
Explanation: Removing duplicate rows from the DataFrame ensures data integrity
and consistency.
Usage:
# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()

1.3.2 String Manipulation


Explanation: Using string methods to manipulate text data is essential for cleaning
and transforming textual data.
Usage:
# Convert strings to lowercase (assumes column 'A' holds strings;
# the .str accessor raises an error on numeric columns)
df['A'] = df['A'].str.lower()

# Replace substring
df['A'] = df['A'].str.replace('old', 'new')
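Because column 'A' of the running example holds integers, the string methods are easier to see on text data. The sketch below uses a hypothetical DataFrame called names and chains several .str methods.

```python
import pandas as pd

# Hypothetical text data with inconsistent case and stray whitespace
names = pd.DataFrame({'name': ['  Alice ', 'BOB', 'old_carol']})

# Strip surrounding whitespace, then convert to lowercase
names['name'] = names['name'].str.strip().str.lower()

# Replace a literal substring (regex=False treats the pattern as plain text)
names['name'] = names['name'].str.replace('old_', 'new_', regex=False)

print(names)
```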

1.3.3 Changing Data Types


Explanation: Converting data types of DataFrame columns is necessary for ensuring
data types are appropriate for analysis.
Usage:
# Convert column 'A' to float
df['A'] = df['A'].astype('float')
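astype raises an error if a value cannot be converted. As a hedged alternative for messy input, pd.to_numeric with errors='coerce' turns unparseable entries into NaN instead. The Series raw below is a hypothetical example.

```python
import pandas as pd

# Hypothetical column read as text, with one entry that is not a number
raw = pd.Series(['1', '2.5', 'n/a'])

# Unparseable entries become NaN instead of raising an error
numeric = pd.to_numeric(raw, errors='coerce')

print(numeric)
print(numeric.dtype)
```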

1.4 Data Aggregation and Grouping


1.4.1 Grouping Data
Explanation: Grouping data by one or more columns and applying aggregation
functions helps in summarizing and analyzing data.
Usage:
# Group by column 'A' and calculate the sum
grouped_df = df.groupby('A').sum()

1.4.2 Aggregation Functions
Explanation: Aggregation functions can be applied to grouped data to calculate
summary statistics.
Usage:
# Group by column 'A' and calculate sum and mean
aggregated_df = df.groupby('A').agg(['sum', 'mean'])
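Grouping is easiest to follow when the key column has repeated values. The sketch below uses a hypothetical sales table (the column names region and amount are assumptions for illustration) and computes two statistics per group.

```python
import pandas as pd

# Hypothetical data: two regions, two rows each
sales = pd.DataFrame({
    'region': ['north', 'south', 'north', 'south'],
    'amount': [100, 200, 300, 400],
})

# One row per region, one column per aggregation function
summary = sales.groupby('region')['amount'].agg(['sum', 'mean'])

print(summary)
```

Selecting the 'amount' column before calling agg keeps the result's columns flat; aggregating a whole DataFrame with a list of functions produces a two-level column index instead.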

1.4.3 Pivot Tables


Explanation: Creating pivot tables allows summarizing data in a matrix format,
which is useful for data analysis and reporting.
Usage:
# Create a pivot table
pivot_table = df.pivot_table(values='B', index='A', aggfunc='mean')
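The matrix layout appears once a columns argument is supplied. A minimal sketch with hypothetical data (the names records, region, quarter, and amount are illustrative):

```python
import pandas as pd

# Hypothetical long-format data: one row per (region, quarter)
records = pd.DataFrame({
    'region': ['north', 'north', 'south', 'south'],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'amount': [100, 150, 200, 250],
})

# Rows are regions, columns are quarters, cells hold the mean amount
table = records.pivot_table(values='amount', index='region',
                            columns='quarter', aggfunc='mean')

print(table)
```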

1.5 Combining DataFrames


1.5.1 Concatenation
Explanation: Concatenating DataFrames along rows or columns combines multiple
DataFrames into one.
Usage:
# Concatenate along columns
concatenated_df = pd.concat([df, df], axis=1)

# Concatenate along rows
concatenated_df = pd.concat([df, df], axis=0)

1.5.2 Merging and Joining


Explanation: Merging DataFrames using a key column allows combining data based
on common columns.
Usage:
# Merge DataFrames on column 'A'
merged_df = df.merge(df, on='A')
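Merging a DataFrame with itself hides what happens when keys do not match. The sketch below merges two hypothetical DataFrames (left and right, with an assumed shared column key) and contrasts the default inner join with a left join.

```python
import pandas as pd

# Hypothetical tables that share only some key values
left = pd.DataFrame({'key': [1, 2, 3], 'left_value': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'right_value': ['x', 'y', 'z']})

# Inner join (default): only keys present in both tables are kept
inner = left.merge(right, on='key')

# Left join: every row of `left` is kept; unmatched rows get NaN
left_join = left.merge(right, on='key', how='left')

print(inner)
print(left_join)
```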

1.6 Time Series Data


1.6.1 Resampling
Explanation: Resampling time series data involves changing the frequency of the
time series, such as converting daily data to monthly data.
Usage:
# Resample data to monthly frequency and calculate the mean
# (requires a DatetimeIndex; pandas 2.2+ prefers the alias 'ME' over 'M')
resampled_df = df.resample('M').mean()
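Resampling needs a DataFrame indexed by dates. The sketch below builds hypothetical daily data for January and February 2023 and aggregates it to one value per month. It uses the 'MS' (month start) alias as an assumption for portability, since that alias labels each month by its first day and is accepted across pandas versions.

```python
import pandas as pd

# Hypothetical daily data: 59 days covering January and February 2023
dates = pd.date_range('2023-01-01', periods=59, freq='D')
daily = pd.DataFrame({'value': range(59)}, index=dates)

# One row per month, labeled by the first day of the month
monthly = daily.resample('MS').mean()

print(monthly)
```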

1.6.2 Date/Time Indexing
Explanation: Indexing data by date/time allows performing time series analysis.
Usage:
# Set column 'date' as index (the column should hold datetime values)
df.set_index('date', inplace=True)

# Select data for a specific date range (both endpoints are included)
selected_data = df.loc['2023-01-01':'2023-12-31']
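The running example has no 'date' column, so here is a self-contained sketch. It assumes dates arrive as strings, converts them with pd.to_datetime, and then slices by date range; the DataFrame events and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data with dates stored as strings
events = pd.DataFrame({
    'date': ['2022-12-31', '2023-03-15', '2023-11-02', '2024-01-01'],
    'value': [1, 2, 3, 4],
})

# Convert to datetime, index by date, and sort so that slicing is reliable
events['date'] = pd.to_datetime(events['date'])
events = events.set_index('date').sort_index()

# Label-based slicing on a DatetimeIndex includes both endpoints
in_2023 = events.loc['2023-01-01':'2023-12-31']

print(in_2023)
```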

1.7 Visualization with Pandas


1.7.1 Plotting
Explanation: Pandas provides built-in plotting methods for quick data visualization.
Usage:
# Plotting requires Matplotlib to be installed

# Line plot
df.plot(kind='line', x='A', y='B')

# Scatter plot
df.plot(kind='scatter', x='A', y='B')

# Histogram
df['A'].plot(kind='hist')

1.8 Exporting Data


1.8.1 Saving to CSV
Explanation: Saving DataFrame to CSV format allows exporting data for use in other
applications.
Usage:
# Save DataFrame to CSV file (index=False omits the row index)
df.to_csv('data.csv', index=False)

2 NumPy
2.1 Basic Operations and Data Structures
2.1.1 Creating Arrays
Explanation: A NumPy array (ndarray) is an N-dimensional container whose elements
share a single data type, which allows efficient storage and vectorized computation.
Usage:
import numpy as np

# Create a NumPy array from a list
arr = np.array([1, 2, 3, 4, 5])

# Create a 3x3 NumPy array of zeros
zeros_arr = np.zeros((3, 3))

# Create a 2x2 NumPy array of ones
ones_arr = np.ones((2, 2))

2.1.2 Basic Indexing and Slicing


Explanation: Indexing and slicing NumPy arrays allows accessing specific elements
or subsets of data.
Usage:
# Indexing
print(arr[0])  # First element

# Slicing
print(arr[1:4])  # Elements at indices 1 to 3 (the stop index is excluded)
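The same ideas extend to multi-dimensional arrays, where each axis gets its own index or slice. The sketch below uses a hypothetical 3x4 array named grid and also shows selection with a boolean mask.

```python
import numpy as np

# Hypothetical 3x4 array holding the values 0..11
grid = np.arange(12).reshape(3, 4)

element = grid[1, 2]          # Row 1, column 2
first_column = grid[:, 0]     # All rows, column 0
sub_block = grid[0:2, 1:3]    # Rows 0-1, columns 1-2

# Boolean mask: keep only the even values (result is one-dimensional)
evens = grid[grid % 2 == 0]

print(element, first_column, sub_block, evens, sep='\n')
```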

2.2 Mathematical Operations


2.2.1 Element-wise Operations
Explanation: NumPy arrays support element-wise operations, such as addition,
subtraction, multiplication, and division.
Usage:
# Example arrays of the same shape
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Element-wise addition
result = arr1 + arr2

# Element-wise multiplication
result = arr1 * arr2

2.2.2 Matrix Operations


Explanation: NumPy supports matrix operations, such as matrix multiplication and
dot product.
Usage:
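# Example inputs
matrix1, matrix2 = np.array([[1, 2], [3, 4]]), np.array([[5, 6], [7, 8]])
vector1, vector2 = np.array([1, 2, 3]), np.array([4, 5, 6])

# Matrix multiplication
result = np.matmul(matrix1, matrix2)

# Dot product
result = np.dot(vector1, vector2)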
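A common source of confusion is that * on two-dimensional arrays is element-wise, not a matrix product. The sketch below contrasts the two on small hypothetical matrices a and b, and uses the @ operator, which is equivalent to np.matmul for these inputs.

```python
import numpy as np

# Hypothetical 2x2 matrices used only for this illustration
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# Element-wise product: multiplies matching entries
elementwise = a * b

# Matrix product: rows of `a` times columns of `b`
matrix_product = a @ b

# Dot product of two vectors: sum of the pairwise products
dot = np.dot(np.array([1, 2, 3]), np.array([4, 5, 6]))

print(elementwise)
print(matrix_product)
print(dot)
```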

2.3 Statistical Functions


2.3.1 Descriptive Statistics
Explanation: NumPy provides functions for calculating descriptive statistics, such as
mean, median, standard deviation, and variance.
Usage:
# Mean
mean_value = np.mean(arr)

# Median
median_value = np.median(arr)

# Standard deviation
std_deviation = np.std(arr)

# Variance
variance = np.var(arr)
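Two details are worth illustrating: statistics can be computed along one axis of a multi-dimensional array, and np.std and np.var compute the population versions by default (dividing by N), while ddof=1 gives the sample versions (dividing by N - 1). The arrays table and values below are hypothetical.

```python
import numpy as np

# Hypothetical 2x3 array
table = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

overall_mean = table.mean()        # Mean of all elements
column_means = table.mean(axis=0)  # One mean per column
row_means = table.mean(axis=1)     # One mean per row

# Hypothetical one-dimensional sample
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_std = np.std(values)      # Divides by N
sample_std = np.std(values, ddof=1)  # Divides by N - 1

print(overall_mean, column_means, row_means)
print(population_std, sample_std)
```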

2.4 Linear Algebra


2.4.1 Eigenvalues and Eigenvectors
Explanation: NumPy allows computing eigenvalues and eigenvectors of a square
matrix.
Usage:
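# Compute eigenvalues and eigenvectors of an example square matrix
# (the eigenvectors are the columns of the second return value)
matrix = np.array([[2, 0], [0, 3]])
eigenvalues, eigenvectors = np.linalg.eig(matrix)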

2.4.2 Solving Linear Equations


Explanation: NumPy provides functions for solving systems of linear equations.
Usage:
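# Solve the example system 3x + y = 9, x + 2y = 8 (solution: x = 2, y = 3)
coeff_matrix = np.array([[3, 1], [1, 2]])
const_vector = np.array([9, 8])
solution = np.linalg.solve(coeff_matrix, const_vector)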

2.5 Random Sampling


2.5.1 Generating Random Numbers
Explanation: NumPy allows generating arrays of random numbers from various
probability distributions.
Usage:
# Generate 5 random numbers from a uniform distribution over [0, 1)
random_numbers = np.random.rand(5)

# Generate 5 random integers from 1 to 99 (the upper bound is excluded)
random_integers = np.random.randint(1, 100, size=5)
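The functions above belong to NumPy's legacy random interface. A hedged sketch of the newer Generator interface, created with np.random.default_rng, is shown below; passing a seed makes the results reproducible. The variable names are illustrative.

```python
import numpy as np

# A seeded generator produces the same sequence on every run
rng = np.random.default_rng(seed=42)

uniform = rng.random(5)                          # Uniform over [0, 1)
integers = rng.integers(1, 100, size=5)          # Integers from 1 to 99
normal = rng.normal(loc=0.0, scale=1.0, size=5)  # Standard normal

# A second generator with the same seed repeats the first draw
rng_again = np.random.default_rng(seed=42)
repeat = rng_again.random(5)

print(uniform, integers, normal, sep='\n')
print(np.array_equal(uniform, repeat))
```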

2.6 Saving and Loading Data


2.6.1 Saving Arrays
Explanation: NumPy arrays can be saved to and loaded from binary files.
Usage:
# Save array to binary file
np.save('array.npy', arr)

# Load array from binary file
loaded_arr = np.load('array.npy')
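A round trip can be checked by comparing the loaded array with the original. The sketch below writes into a temporary directory so that it leaves no files behind, and also shows np.savetxt and np.loadtxt as a plain-text alternative; the file names are illustrative.

```python
import os
import tempfile

import numpy as np

arr = np.array([1, 2, 3, 4, 5])

with tempfile.TemporaryDirectory() as tmp_dir:
    # Binary .npy format: preserves dtype and shape exactly
    binary_path = os.path.join(tmp_dir, 'array.npy')
    np.save(binary_path, arr)
    loaded = np.load(binary_path)

    # Text format: human-readable, but values are read back as floats
    text_path = os.path.join(tmp_dir, 'array.csv')
    np.savetxt(text_path, arr, delimiter=',', fmt='%d')
    loaded_text = np.loadtxt(text_path, delimiter=',')

print(np.array_equal(arr, loaded))
print(loaded.dtype, loaded_text.dtype)
```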

