Pandas Function Notes
1. Data Loading
• Read CSV File: df = pd.read_csv('filename.csv')
• Read Excel File: df = pd.read_excel('filename.xlsx')
• Read from SQL Database: df = pd.read_sql(query, connection)
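A minimal sketch of the loading calls above. To stay runnable without external files, it assumes an in-memory CSV string in place of 'filename.csv' and an in-memory SQLite database in place of a real server; the table name `people` and its columns are made up for illustration.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV content standing in for 'filename.csv'.
csv_text = "name,age\nAda,36\nLinus,54\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Hypothetical SQLite table standing in for a real database.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE people (name TEXT, age INTEGER)")
connection.executemany(
    "INSERT INTO people VALUES (?, ?)", [("Ada", 36), ("Linus", 54)]
)
connection.commit()

# read_sql takes the query text and an open connection.
df_sql = pd.read_sql("SELECT name, age FROM people ORDER BY age", connection)
connection.close()
```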
2. Basic Data Inspection
• Display Top Rows: df.head()
• Display Bottom Rows: df.tail()
• Display Data Types: df.dtypes
• Summary Statistics: df.describe()
• Display Index, Columns, Dtypes, and Non-Null Counts: df.info()
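The inspection calls above can be tried on a small frame. This is only a sketch; the columns `x` and `y` are invented, and `df.info()` is redirected into a buffer so its text can be examined instead of printed.

```python
import io

import pandas as pd

df = pd.DataFrame({"x": range(10), "y": [v * 0.5 for v in range(10)]})

top = df.head(3)         # first three rows
bottom = df.tail(2)      # last two rows
types = df.dtypes        # one dtype per column
summary = df.describe()  # count, mean, std, min, quartiles, max

# info() prints by default; buf= captures the report as text.
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()
```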
3. Data Cleaning
• Check for Missing Values: df.isnull().sum()
• Fill Missing Values: df.fillna(value)
• Drop Missing Values: df.dropna()
• Rename Columns: df.rename(columns={'old_name': 'new_name'})
• Drop Columns: df.drop(columns=['column_name'])
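A short sketch of the cleaning calls above on made-up data. One point worth noting is that each call returns a new object and leaves `df` unchanged unless the result is assigned back.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
df = pd.DataFrame({"old_name": [1.0, np.nan, 3.0], "extra": ["a", "b", None]})

missing_per_column = df.isnull().sum()   # missing count per column
filled = df.fillna({"old_name": 0.0})    # fill only the numeric column
dropped = df.dropna()                    # keep rows with no missing values
renamed = df.rename(columns={"old_name": "new_name"})
trimmed = df.drop(columns=["extra"])
```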
4. Data Transformation
• Apply Function: df['column'].apply(lambda x: function(x))
• Group By and Aggregate: df.groupby('group_column').agg({'value_column': 'sum'})
• Pivot Tables: df.pivot_table(index='column1', values='column2', aggfunc='mean')
• Merge DataFrames: pd.merge(df1, df2, on='column')
• Concatenate DataFrames: pd.concat([df1, df2])
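The transformation calls above, sketched on a hypothetical sales table. The frame names `sales` and `managers` and all column names are assumptions made for the example.

```python
import pandas as pd

sales = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south"],
        "product": ["a", "b", "a", "b"],
        "units": [10, 20, 30, 40],
    }
)
managers = pd.DataFrame({"region": ["north", "south"], "manager": ["Kim", "Lee"]})

doubled = sales["units"].apply(lambda x: x * 2)            # element-wise function
totals = sales.groupby("region").agg({"units": "sum"})     # one row per region
pivot = sales.pivot_table(index="region", values="units", aggfunc="mean")
merged = pd.merge(sales, managers, on="region")            # adds the manager column
stacked = pd.concat([sales, sales], ignore_index=True)     # rows appended
```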
5. Data Visualization Integration
• Histogram: df['column'].hist()
• Boxplot: df.boxplot(column=['column1', 'column2'])
• Scatter Plot: df.plot.scatter(x='col1', y='col2')
• Line Plot: df.plot.line()
• Bar Chart: df['column'].value_counts().plot.bar()
6. Statistical Analysis
• Value Counts: df['column'].value_counts()
• Unique Values in Column: df['column'].unique()
• Number of Unique Values: df['column'].nunique()
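A small sketch of the three counting calls above, using an invented column of letters.

```python
import pandas as pd

df = pd.DataFrame({"column": ["a", "b", "a", "c", "a"]})

counts = df["column"].value_counts()  # frequency of each distinct value
uniques = df["column"].unique()       # distinct values in order of appearance
n_unique = df["column"].nunique()     # number of distinct values
```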
7. Indexing and Selection
• Select Column: df['column']
• Select Multiple Columns: df[['col1', 'col2']]
• Select Rows by Position: df.iloc[0:5]
• Select Rows by Label: df.loc[0:5] (label slices include the end label)
• Conditional Selection: df[df['column'] > value]
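The difference between position-based and label-based slicing is easy to miss, so here is a sketch on a made-up frame with the default integer index.

```python
import pandas as pd

df = pd.DataFrame({"col1": range(10), "col2": list("abcdefghij")})

one_column = df["col1"]             # a Series
two_columns = df[["col1", "col2"]]  # a DataFrame
by_position = df.iloc[0:5]          # end-exclusive: positions 0 to 4
by_label = df.loc[0:5]              # end-inclusive: labels 0 to 5
filtered = df[df["col1"] > 6]       # boolean mask
```

With the default index the two slices look alike but return five and six rows respectively.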
8. Data Formatting and Conversion
• Convert Data Types: df['column'].astype('type')
• String Operations: df['column'].str.lower()
• Datetime Conversion: pd.to_datetime(df['column'])
• Setting Index: df.set_index('column')
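A sketch of the conversion calls above. The columns `id`, `name`, and `joined` are hypothetical, and 'int64' stands in for the 'type' placeholder.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": ["1", "2", "3"],
        "name": ["ADA", "Linus", "GRACE"],
        "joined": ["2021-01-15", "2022-06-30", "2023-12-01"],
    }
)

df["id"] = df["id"].astype("int64")          # strings to integers
df["name"] = df["name"].str.lower()          # vectorised string method
df["joined"] = pd.to_datetime(df["joined"])  # strings to timestamps
indexed = df.set_index("id")                 # returns a new frame
```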
9. Handling Time Series Data
• Set Datetime Index: df.set_index(pd.to_datetime(df['date']))
• Resampling Data: df.resample('M').mean() (requires a DatetimeIndex; pandas 2.2+ spells month-end as 'ME')
• Rolling Window Operations: df.rolling(window=5).mean()
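A sketch of the time series calls above on 60 invented daily values. It uses the month-start alias 'MS' rather than 'M', on the assumption that the code should run unchanged on both older and newer pandas releases, since recent releases deprecate 'M' in favour of 'ME'.

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=60, freq="D")
df = pd.DataFrame({"date": dates.strftime("%Y-%m-%d"), "value": range(60)})

# Parse the string column and use it as the index.
ts = df.set_index(pd.to_datetime(df["date"]))

monthly = ts["value"].resample("MS").mean()      # one mean per calendar month
smoothed = ts["value"].rolling(window=5).mean()  # NaN until 5 values are available
```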
10. File Export
• Write to CSV: df.to_csv('filename.csv')
• Write to Excel: df.to_excel('filename.xlsx')
• Write to SQL Database: df.to_sql('table_name', connection)
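A sketch of the export calls above that avoids touching the disk. It assumes a text buffer in place of 'filename.csv' and an in-memory SQLite connection; `index=False` is an added choice that keeps the row index out of the output.

```python
import io
import sqlite3

import pandas as pd

df = pd.DataFrame({"name": ["Ada", "Linus"], "age": [36, 54]})

# to_csv accepts a path or any writable text buffer.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()

# to_sql creates the table and inserts the rows.
connection = sqlite3.connect(":memory:")
df.to_sql("table_name", connection, index=False)
round_trip = pd.read_sql("SELECT * FROM table_name", connection)
connection.close()
```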
11. Advanced Data Queries
• Query Function: df.query('column > @value') (the @ prefix refers to a Python variable)
• Filtering with isin: df[df['column'].isin([value1, value2])]
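A sketch of both filters above on made-up data. Inside the query string, a bare name refers to a column and `@name` refers to a local Python variable.

```python
import pandas as pd

df = pd.DataFrame({"column": [1, 5, 10, 15], "label": ["a", "b", "c", "d"]})

value = 5
above = df.query("column > @value")        # @value is the local variable
chosen = df[df["label"].isin(["a", "d"])]  # membership test per row
```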
12. Memory Optimization
• Inspect Memory Usage: df.memory_usage(deep=True)
• Change Data Types to Save Memory: df['column'].astype('category')
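A sketch that measures a frame before and after a dtype change. The data is invented: a string column with only three distinct values, which is the case where 'category' tends to help most.

```python
import pandas as pd

df = pd.DataFrame({"column": ["red", "green", "blue"] * 10_000})

# deep=True counts the string contents, not just the references.
before = df.memory_usage(deep=True).sum()

df["column"] = df["column"].astype("category")
after = df.memory_usage(deep=True).sum()

saved_fraction = 1 - after / before
```

The saving depends on how many distinct values the column holds; a column of mostly unique strings may not shrink at all.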
13. Multi-Index Operations
• Creating MultiIndex: df.set_index(['col1', 'col2'])
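• Slicing on MultiIndex: df.loc[(slice('index1_start', 'index1_end'), slice('index2_start', 'index2_end')), :] (sort the index first; the trailing ', :' keeps the tuple from being read as row and column indexers)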
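A sketch of slicing on a two-level index with invented data. It passes an explicit column indexer (`:`) after the tuple of slices and sorts the index first, since label slicing on a MultiIndex generally expects a sorted index.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
        "col2": [1, 2, 3, 1, 2, 3, 1, 2, 3],
        "value": [10, 20, 30, 40, 50, 60, 70, 80, 90],
    }
)
indexed = df.set_index(["col1", "col2"]).sort_index()

# First level from 'a' to 'b', second level from 2 to 3 (both inclusive).
subset = indexed.loc[(slice("a", "b"), slice(2, 3)), :]

# pd.IndexSlice builds the same tuple of slices with bracket syntax.
idx = pd.IndexSlice
same = indexed.loc[idx["a":"b", 2:3], :]
```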
14. Data Merging Techniques
• Outer Join: pd.merge(df1, df2, on='column', how='outer')
• Inner Join: pd.merge(df1, df2, on='column', how='inner')
• Left Join: pd.merge(df1, df2, on='column', how='left')
• Right Join: pd.merge(df1, df2, on='column', how='right')
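The four join types above differ only in which keys survive. A sketch with two invented frames that share keys 2 and 3:

```python
import pandas as pd

df1 = pd.DataFrame({"column": [1, 2, 3], "left_val": ["a", "b", "c"]})
df2 = pd.DataFrame({"column": [2, 3, 4], "right_val": ["x", "y", "z"]})

joined = {
    how: pd.merge(df1, df2, on="column", how=how)
    for how in ("outer", "inner", "left", "right")
}

# Keys present in each result, sorted for easy comparison.
keys = {how: sorted(frame["column"].tolist()) for how, frame in joined.items()}
```

Rows without a match on the other side get missing values in that side's columns.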
15. Dealing with Duplicates
• Finding Duplicates: df.duplicated()
• Removing Duplicates: df.drop_duplicates()
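A sketch of the duplicate helpers above on made-up rows. The `subset` and `keep` arguments in the last call are an addition to show deduplicating on one column only.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2], "b": ["x", "x", "y", "z"]})

mask = df.duplicated()          # True for rows that repeat an earlier row
deduped = df.drop_duplicates()  # first occurrence of each full row
last_per_a = df.drop_duplicates(subset=["a"], keep="last")
```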
16. Specialized Data Types Handling
• Working with Categorical Data: df['column'].astype('category')
17. Advanced Grouping and Aggregation
• Group by Multiple Columns: df.groupby(['col1', 'col2']).mean()
• Aggregate with Multiple Functions: df.groupby('col').agg(['mean','sum'])
• Transform Function: df.groupby('col').transform(lambda x: x - x.mean())
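A sketch contrasting `agg` and `transform` on invented data: `agg` returns one row per group, while `transform` returns a result aligned with the original rows.

```python
import pandas as pd

df = pd.DataFrame({"col": ["a", "a", "b", "b"], "x": [1.0, 3.0, 10.0, 20.0]})

stats = df.groupby("col")["x"].agg(["mean", "sum"])  # one row per group
centered = df.groupby("col")["x"].transform(lambda s: s - s.mean())
```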
18. Time Series Specific Operations
• Time-Based Grouping: df.groupby(pd.Grouper(key='date_col',freq='M')).sum()
• Resample Time Series Data: df.resample('M', on='date_col').mean()
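A sketch of both time-based groupings above when the dates live in a column rather than the index. The data is invented, and the month-start alias 'MS' is used instead of 'M' on the assumption that the code should run on both older and newer pandas releases.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date_col": pd.to_datetime(
            ["2024-01-05", "2024-01-20", "2024-02-10", "2024-02-25"]
        ),
        "amount": [1, 2, 3, 4],
    }
)

grouped = df.groupby(pd.Grouper(key="date_col", freq="MS")).sum()
resampled = df.resample("MS", on="date_col").mean()
```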
19. Text Data Specific Operations
• String Contains: df[df['column'].str.contains('substring')]
• String Split: df['column'].str.split(' ', expand=True)
• Regular Expression Extraction: df['column'].str.extract(r'(regex)')
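A sketch of the string helpers above on made-up values. The pattern `(\d+)` stands in for the '(regex)' placeholder, and `regex=False` is an added choice that makes the substring match literal.

```python
import pandas as pd

df = pd.DataFrame({"column": ["alpha 101", "beta 202", "gamma 303"]})

matches = df[df["column"].str.contains("et", regex=False)]  # literal substring
parts = df["column"].str.split(" ", expand=True)            # one column per piece
digits = df["column"].str.extract(r"(\d+)")                 # first capture group
```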
20. Working with JSON and XML
• Reading JSON: df = pd.read_json('filename.json')
• Reading XML: df = pd.read_xml('filename.xml')
21. Advanced File Handling
• Read CSV with Specific Delimiter: df = pd.read_csv('filename.csv', delimiter=';')
• Writing to JSON: df.to_json('filename.json')
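A sketch of the delimiter and JSON calls above using in-memory text instead of files. The `orient="records"` argument is an assumption chosen for readability; the call in the notes uses the default orientation.

```python
import io
import json

import pandas as pd

# Semicolon-separated text standing in for 'filename.csv'.
csv_text = "name;age\nAda;36\nLinus;54\n"
df = pd.read_csv(io.StringIO(csv_text), delimiter=";")

# Without a path, to_json returns the JSON text.
json_text = df.to_json(orient="records")
records = json.loads(json_text)

restored = pd.read_json(io.StringIO(json_text))
```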
22. Dealing with Missing Data
• Interpolate Missing Values: df['column'].interpolate()
• Forward Fill Missing Values: df['column'].ffill()
• Backward Fill Missing Values: df['column'].bfill()
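The three fill strategies above give different results on the same gap. A sketch on an invented column with two consecutive missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"column": [1.0, np.nan, np.nan, 4.0]})

interpolated = df["column"].interpolate()  # linear by default
forward = df["column"].ffill()             # carry the last valid value forward
backward = df["column"].bfill()            # pull the next valid value back
```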
23. Data Reshaping
• Wide to Long Format: pd.wide_to_long(df, ['col'], i='id_col', j='year')
• Long to Wide Format: df.pivot(index='id_col', columns='year', values='col')
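A sketch of a round trip between the two layouts. It assumes wide columns named `col2020` and `col2021`, so that 'col' is the stub name and the numeric suffix becomes the `year` level.

```python
import pandas as pd

wide = pd.DataFrame({"id_col": [1, 2], "col2020": [10, 20], "col2021": [11, 21]})

# Result is indexed by (id_col, year) with a single 'col' column.
long = pd.wide_to_long(wide, ["col"], i="id_col", j="year").sort_index()

# pivot needs plain columns, so flatten the index first.
back = long.reset_index().pivot(index="id_col", columns="year", values="col")
```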
24. Categorical Data Operations
• Convert Column to Categorical: df['column'] = df['column'].astype('category')
• Order Categories: df['column'].cat.set_categories(['cat1', 'cat2'], ordered=True)
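A sketch showing what ordering buys you. The labels 'low', 'medium', and 'high' are made up and replace the 'cat1'/'cat2' placeholders so the declared order differs from alphabetical order.

```python
import pandas as pd

df = pd.DataFrame({"column": ["medium", "low", "high"]})

df["column"] = df["column"].astype("category")
df["column"] = df["column"].cat.set_categories(
    ["low", "medium", "high"], ordered=True
)

in_order = df["column"].sort_values().tolist()  # follows the declared order
above_low = (df["column"] > "low").tolist()     # comparisons are now allowed
largest = df["column"].max()
```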
25. Advanced Indexing
• Reset Index: df.reset_index(drop=True)
• Set Multiple Indexes: df.set_index(['col1', 'col2'])
• MultiIndex Slicing: df.xs(key='value', level='level_name')
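A sketch of the indexing calls above on invented data. Here `xs` picks every row whose `col1` level equals 'a' and drops that level from the result.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a", "a", "b"], "col2": [1, 2, 1], "v": [10, 20, 30]})

indexed = df.set_index(["col1", "col2"])
cross = indexed.xs(key="a", level="col1")  # rows where col1 == 'a'
flat = indexed.reset_index()               # index levels become columns again
renumbered = df[df["v"] > 10].reset_index(drop=True)  # discard the old index
```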
26. Handling Large Data Efficiently
• Dask Integration for Large Data: import dask.dataframe as dd; ddf = dd.from_pandas(df, npartitions=10)
• Sampling Data for Quick Insights: df.sample(n=1000)
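A sketch of the sampling call above. It leaves Dask out so that it only needs pandas, and it adds `random_state` (an assumption, not in the notes) so the same rows come back on every run.

```python
import pandas as pd

df = pd.DataFrame({"x": range(100_000)})

sample = df.sample(n=1000, random_state=42)  # rows drawn without replacement
again = df.sample(n=1000, random_state=42)   # same seed, same rows
```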
27. Advanced Data Merging
• SQL-like Joins: pd.merge(df1, df2, how='left', on='col')
• Concatenating Along a Different Axis: pd.concat([df1, df2], axis=1)
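A sketch contrasting a key-based left join with a side-by-side concatenation, on invented frames. `pd.merge` matches on the values in 'col', while `pd.concat(..., axis=1)` aligns on the row index.

```python
import pandas as pd

df1 = pd.DataFrame({"col": [1, 2, 3], "a": ["x", "y", "z"]})
df2 = pd.DataFrame({"col": [1, 2], "b": [True, False]})

left = pd.merge(df1, df2, how="left", on="col")    # keeps every row of df1
side = pd.concat([df1[["a"]], df2[["b"]]], axis=1) # columns placed side by side
```

In both results the third row has no partner in `df2`, so its `b` value is missing.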