UNIT II 2M

The document is a question bank for the course CCS346 Exploratory Data Analysis at Ramco Institute of Technology. It includes various topics related to data manipulation using Pandas in Python, covering features, data indexing, handling missing values, combining datasets, and more. Each question is designed to test knowledge on specific aspects of using Pandas for exploratory data analysis.

RAMCO INSTITUTE OF TECHNOLOGY

Department of Computer Science and Technology


Academic Year: 2024 - 2025 (Even Semester)

QUESTION BANK
Semester / Class : VI / III Year B.E CSE
Name of Subject : CCS346 Exploratory Data Analysis
Name of Faculty member : Mrs.S.Vijaya Amala Devi AP/CSE

Unit II: EDA using Python


Data Manipulation using Pandas – Pandas Objects – Data Indexing and Selection – Operating
on Data – Handling Missing Data – Hierarchical Indexing – Combining Datasets – Concat,
Append, Merge and Join – Aggregation and Grouping – Pivot Tables – Vectorized String
Operations
PART A
1. What are the key features of Pandas that make it useful for data analysis?
Pandas is an open-source data analysis and data manipulation library for
Python.
Pandas provides data structures and functions for working with structured data.
Key features of Pandas:
• DataFrames are multidimensional arrays with attached row and column
labels, often with heterogeneous types and/or missing data.
• Offers a convenient storage interface for labelled data.
• Datasets are mutable: new rows and columns can be added.
• Missing data is easy to handle.
• Datasets can be merged and joined.
• Supports indexing and subsetting of data.

2. What is meant by data indexing in Pandas?

• Data indexing in Pandas refers to selecting specific rows and columns of data
from a Series or DataFrame.
• Indexing can select all of the rows and some of the columns, some of the
rows and all of the columns, or some of each.
• Indexing is also known as subset selection.
• .loc[] : label-based
• .iloc[] : integer-position-based
• .ix[] : mixed label and integer based (deprecated and removed in pandas 1.0; use .loc[] or .iloc[] instead)
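
The difference between label-based and position-based indexing can be sketched as follows. This is a minimal illustration; the DataFrame contents and variable names are assumptions made up for the example.

```python
import pandas as pd

# Illustrative data (assumed for this sketch)
df = pd.DataFrame(
    {"Age": [28, 24, 22], "Salary": [60000, 45000, 50000]},
    index=["John", "Anna", "Peter"],
)

by_label = df.loc["Anna", "Salary"]           # row label, column label
by_position = df.iloc[1, 1]                   # row position, column position
label_slice = df.loc["John":"Anna", ["Age"]]  # label slices include the end label
position_slice = df.iloc[0:2, 0:1]            # position slices exclude the end position

print(by_label, by_position)
print(label_slice.equals(position_slice))
```

Both lookups reach the same cell, and the two slices select the same rows and column even though one is written with labels and the other with positions.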

3. What methods does Pandas provide for dealing with missing values?
• Pandas uses sentinels for missing data, relying on two already-existing Python null
values: the special floating-point NaN value and the Python None object.
• None: a Python singleton object that is often used for missing data in Python code.
• NaN: NaN (Not a Number) is a special floating-point value defined by the
standard IEEE floating-point representation.
• Missing values are detected with isnull() and notnull(), removed with dropna(), and replaced with fillna().

4. What are the two methods of combining datasets in Pandas?
There are two methods for combining datasets: concatenation and merging (or
joining).

5. List the Python operators and their equivalent Pandas methods.

Python operator    Pandas method(s)
+                  add()
-                  sub(), subtract()
*                  mul(), multiply()
/                  truediv(), div(), divide()
//                 floordiv()
%                  mod()
**                 pow()
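
One reason to prefer the method form over the operator is that the methods accept extra options such as fill_value. The sketch below uses two small Series invented for illustration.

```python
import pandas as pd

# Illustrative Series with partly overlapping indexes (assumed data)
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["x", "y"])

plain = a + b                    # "z" has no partner in b, so the result there is NaN
filled = a.add(b, fill_value=0)  # treat the missing partner as 0 instead

print(plain)
print(filled)
```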

6. List the Pandas handling of NAs by type.

Typeclass   Conversion when storing NAs   NA sentinel value
Floating    No change                     np.nan
Object      No change                     None or np.nan
Integer     Cast to float64               np.nan
Boolean     Cast to object                None or np.nan
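
The conversions in the table can be observed by building Series that contain a missing value. This is a small sketch with made-up values; it relies on the default NumPy-backed dtypes rather than the newer nullable extension types.

```python
import pandas as pd

ints = pd.Series([1, None, 3])          # integers plus a missing value
bools = pd.Series([True, None, False])  # Booleans plus a missing value
floats = pd.Series([1.5, None])         # floats plus a missing value

# Integers are upcast to float64, Booleans to object; floats stay float64
print(ints.dtype, bools.dtype, floats.dtype)
```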

7. Write a syntax for Concatenation operation in Pandas.


Pandas has a function, pd.concat(), which has a similar syntax to np.concatenate but
contains a number of options.
pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None,
levels=None, names=None, verify_integrity=False, copy=True)
Note: older releases also accepted a join_axes argument; it was removed in pandas 1.0.
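
A short sketch of the most commonly used options follows. The two DataFrames are invented for illustration.

```python
import pandas as pd

# Two small frames with the same columns (assumed data)
df1 = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"]})
df2 = pd.DataFrame({"A": ["A3", "A4"], "B": ["B3", "B4"]})

stacked = pd.concat([df1, df2])                          # keeps the original labels 0, 1, 0, 1
renumbered = pd.concat([df1, df2], ignore_index=True)    # fresh labels 0..3
keyed = pd.concat([df1, df2], keys=["first", "second"])  # adds an outer index level

print(renumbered)
print(keyed)
```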

8. Name the categories of join.

The pd.merge() function implements three types of joins:
1. one-to-one,
2. many-to-one, and
3. many-to-many.
All three types of joins are accessed via an identical call to the pd.merge() interface;
the type of join performed depends on the form of the input data.
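
The sketch below shows how the same pd.merge() call yields a one-to-one or a many-to-one join depending on the data. The tables and their values are illustrative assumptions.

```python
import pandas as pd

# Illustrative tables (assumed data)
employees = pd.DataFrame({"employee": ["Bob", "Jake", "Lisa"],
                          "group": ["Accounting", "Engineering", "Engineering"]})
hire_dates = pd.DataFrame({"employee": ["Lisa", "Bob", "Jake"],
                           "hire_date": [2004, 2008, 2012]})
supervisors = pd.DataFrame({"group": ["Accounting", "Engineering"],
                            "supervisor": ["Carly", "Guido"]})

# One-to-one: "employee" is unique in both tables
one_to_one = pd.merge(employees, hire_dates)

# Many-to-one: "group" repeats on the left but is unique on the right
many_to_one = pd.merge(one_to_one, supervisors)

print(many_to_one)
```

In both calls pd.merge() picks the shared column name as the join key; passing on="employee" or on="group" would make that choice explicit.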

9. What is Group By?


Simple aggregations can give a flavor of a dataset, but it is often preferable to aggregate
conditionally on some label or index; this is called the group by operation. The name "group
by" comes from a command in the SQL database language, but it is perhaps more
illuminating to think of it in the terms first coined by Hadley Wickham of Rstats
fame: split, apply, combine.
A canonical example of this split-apply-combine operation, where the "apply" is a
summation aggregation, makes clear what Group By accomplishes:
• The split step involves breaking up and grouping a DataFrame depending on the
value of the specified key.
• The apply step involves computing some function, usually an aggregate,
transformation, or filtering, within the individual groups.
• The combine step merges the results of these operations into an output array.
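
The split, apply, and combine steps can be sketched with a summation, using a small DataFrame made up for illustration:

```python
import pandas as pd

# Illustrative data (assumed for this sketch)
df = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                   "data": [1, 2, 3, 4, 5, 6]})

# split on "key", apply sum() within each group, combine into one Series
totals = df.groupby("key")["data"].sum()
print(totals)
```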

10. List the Pandas aggregation methods.

Aggregation         Description
count()             Total number of items
first(), last()     First and last item
mean(), median()    Mean and median
min(), max()        Minimum and maximum
std(), var()        Standard deviation and variance
mad()               Mean absolute deviation (removed in pandas 2.0)
prod()              Product of all items
sum()               Sum of all items
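
Several of these aggregations can be computed in one pass with agg(). The following is a minimal sketch; the column names and values are assumptions for illustration.

```python
import pandas as pd

# Illustrative scores per team (assumed data)
df = pd.DataFrame({"team": ["x", "x", "y", "y", "y"],
                   "score": [10, 20, 1, 2, 3]})

# One row per team, one column per aggregation
summary = df.groupby("team")["score"].agg(["count", "min", "max", "mean"])
print(summary)
```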

11. What is resampling in pandas?


The DataFrame.resample() function is used primarily for time series data.
A time series is a series of data points indexed (or listed or graphed) in time order.
Most commonly, a time series is a sequence taken at successive, equally spaced
points in time. resample() is a convenience method for frequency conversion and resampling
of time series. The object must have a datetime-like index (DatetimeIndex, PeriodIndex,
or TimedeltaIndex), or datetime-like values must be passed to the on or level keyword.
Syntax (main parameters): DataFrame.resample(rule, closed=None, label=None,
on=None, level=None, origin='start_day', offset=None)
Note: older pandas releases also accepted how, fill_method, limit, loffset and base; these
have since been removed, and the aggregation is now chosen by chaining a method such as
.mean() or .sum() after resample().
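
Downsampling a daily series into two-day bins might look like the sketch below. The dates and values are invented for illustration.

```python
import pandas as pd

# Six daily observations (assumed data)
idx = pd.date_range("2024-01-01", periods=6, freq="D")
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

two_day_sum = s.resample("2D").sum()    # total per two-day bin
two_day_mean = s.resample("2D").mean()  # average per two-day bin

print(two_day_sum)
```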

12. Define shifting in pandas.


The DataFrame.shift() function shifts the data by the desired number of periods; if an
optional time freq is given, the index is shifted instead and the data stays aligned with it.
It takes a scalar parameter, periods, which
represents the number of shifts to be made over the desired axis. This function is
very helpful when dealing with time-series data.

Syntax: DataFrame.shift(periods=1, freq=None, axis=0)

where
periods : number of periods to move; can be positive or negative.
freq : DateOffset, timedelta, or time rule string (optional); the increment to use from the
tseries module, or a time rule string.
axis : {0 or 'index', 1 or 'columns'}.
Returns : the shifted DataFrame.
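
The effect of periods and freq can be seen on a short daily series. This is a sketch with values made up for illustration.

```python
import pandas as pd

# Four daily observations (assumed data)
idx = pd.date_range("2024-01-01", periods=4, freq="D")
s = pd.Series([10, 20, 30, 40], index=idx)

lagged = s.shift(1)                 # values move down one row; the first becomes NaN
moved_index = s.shift(1, freq="D")  # values unchanged; every timestamp moves one day later
change = s - s.shift(1)             # day-over-day difference

print(lagged)
print(moved_index)
```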

13. Define data wrangling.

Data wrangling is the process of transforming data from its original "raw" form into a
more digestible format and organizing sets from various sources into a singular
coherent whole for further processing.
14. List the mapping between Pandas string methods and functions in Python's re
module.
Method       Description
match()      Call re.match() on each element, returning a Boolean.
extract()    Call re.search() on each element, returning matched groups as strings.
findall()    Call re.findall() on each element.
replace()    Replace occurrences of pattern with some other string.
contains()   Call re.search() on each element, returning a Boolean.
count()      Count occurrences of pattern.
split()      Equivalent to str.split(), but accepts regexps.
rsplit()     Equivalent to str.rsplit() (splits from the right).
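
A few of these methods in action, accessed through the .str accessor of a Series. The names and patterns below are assumptions chosen for illustration.

```python
import pandas as pd

names = pd.Series(["Graham Chapman", "John Cleese", "Eric Idle"])

starts_with_j = names.str.match(r"J")      # anchored at the start of each string
has_double_e = names.str.contains(r"ee")   # matches anywhere in the string
first_names = names.str.extract(r"([A-Za-z]+)", expand=False)  # first capture group
vowel_counts = names.str.count(r"[aeiou]") # lowercase vowels per element

print(first_names)
print(vowel_counts)
```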

15. Name the Pandas string methods and their descriptions.

Method            Description
get()             Index each element
slice()           Slice each element
slice_replace()   Replace slice in each element with passed value
cat()             Concatenate strings
repeat()          Repeat values
normalize()       Return Unicode form of string
pad()             Add whitespace to left, right, or both sides of string
wrap()            Split long strings into lines with length less than a given width
join()            Join strings in each element of the Series with passed separator
get_dummies()     Extract dummy variables as a DataFrame
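
As a rough illustration of some of the methods in this table, consider the following sketch. The strings and the "|" separator are assumptions for the example.

```python
import pandas as pd

names = pd.Series(["Peter Parker", "Mary Jane"])

last_names = names.str.split().str.get(-1)              # index into each split list
prefixes = names.str.slice(0, 3)                        # first three characters
padded = names.str.pad(14, side="left", fillchar=".")   # pad to a width of 14

tags = pd.Series(["A|B", "B", "A|C"])
dummies = tags.str.get_dummies("|")                     # one indicator column per tag

print(last_names)
print(dummies)
```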

16. What is Python?


Python is a high-level scripting language that can be used for a wide variety of text
processing, system administration, and internet-related tasks. Python is a true
object-oriented language and is available on a wide variety of platforms.

17. How do you import the necessary libraries to display plots in pandas?


To display plots in pandas, you typically need to import the pandas and matplotlib libraries:
import pandas as pd
import matplotlib.pyplot as plt

18. Name the two interfaces that are used in pandas.


Plots in pandas are drawn with Matplotlib, which offers two interfaces:
1. MATLAB-style interface
2. Object-oriented interface

19. What are the three fundamental data structures used in Pandas?
The three fundamental data structures used in Pandas are:
1. Series
2. DataFrame
3. Index

20. What is a Series in pandas?


In pandas, a Series is a one-dimensional labeled array capable of holding data of any
type (integer, float, string, Python objects, etc.). It is similar to a column in a
spreadsheet or a table.
Example of creating a Series:
import pandas as pd
# Creating a Series
s = pd.Series([1, 3, 5, 7, 9])
print(s)

21. What is a DataFrame in pandas?


A DataFrame in pandas is a two-dimensional labeled data structure with columns of
potentially different types. It can be thought of as a spreadsheet or a relational
database table, where each column is a Series.
Example of creating a DataFrame:
import pandas as pd
# Creating a DataFrame
data = {
'Name': ['John', 'Anna', 'Peter', 'Linda', 'Sophia'],
'Age': [28, 24, 22, 32, 29],
'Salary': [60000, 45000, 50000, 70000, 62000]
}
df = pd.DataFrame(data)
print(df)

22. What is an Index in pandas?


An Index in pandas is an immutable array-like structure used to label the rows and
columns of a DataFrame (and the entries of a Series). It provides metadata that helps in
identifying rows or columns.
Example of using an Index in a DataFrame:
import pandas as pd
# Creating a DataFrame with custom index
data = {
'Name': ['John', 'Anna', 'Peter', 'Linda', 'Sophia'],
'Age': [28, 24, 22, 32, 29],
'Salary': [60000, 45000, 50000, 70000, 62000]

}
df = pd.DataFrame(data, index=['A', 'B', 'C', 'D', 'E'])
print(df)

23. Explain the process of creating a DataFrame from a list of dictionaries.
Creating a DataFrame from a list of dictionaries is a common operation in pandas,
especially when the data is structured as a list of records, where each record is
represented as a dictionary with consistent keys across all dictionaries. Each dictionary
becomes one row, and the keys become the column names.
import pandas as pd
data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]
df = pd.DataFrame(data)
print(df)
Output:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago

24. What are the methods used for operating on null values in pandas?
• isnull(): Generate a Boolean mask indicating missing values
• notnull(): Opposite of isnull()
• dropna(): Return a filtered version of the data
• fillna(): Return a copy of the data with missing values filled or imputed
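
The four methods can be sketched on a small Series. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

# A Series with two missing entries (assumed data)
s = pd.Series([1.0, np.nan, 3.0, None])

mask = s.isnull()      # True where a value is missing
present = s.notnull()  # the complement of isnull()
dropped = s.dropna()   # new Series without the missing entries
filled = s.fillna(0)   # new Series with missing entries replaced by 0

print(mask)
print(filled)
```

Neither dropna() nor fillna() modifies s itself; both return new objects.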
25. Name the Pandas handling of NAs by type.
Typeclass   Conversion when storing NAs   NA sentinel value
Floating    No change                     np.nan
Object      No change                     None or np.nan
Integer     Cast to float64               np.nan
Boolean     Cast to object                None or np.nan

26. How to use Hierarchical Indexes with Pandas?


Hierarchical indexing, also known as multi-indexing, means setting more than one column
as the index. It is used to incorporate multiple index levels within a single index. In
this way, higher-dimensional data can be compactly represented within the familiar
one-dimensional Series and two-dimensional DataFrame objects.

# importing pandas library as alias pd
import pandas as pd
# calling the pandas read_csv() function.
# and storing the result in DataFrame df
df = pd.read_csv('homelessness.csv')
print(df.head())
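
Once a DataFrame is loaded, a hierarchical index can be built by passing several column names to set_index(). The sketch below uses a small in-memory DataFrame instead of a CSV file; the column names and values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: sales per region and year
df = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "year": [2023, 2024, 2023, 2024],
    "sales": [100, 120, 90, 95],
})

# Use two columns as index levels
indexed = df.set_index(["region", "year"]).sort_index()

east = indexed.loc["East"]                     # all rows of the outer level "East"
cell = indexed.loc[("East", 2024), "sales"]    # one value, addressed by both levels
wide = indexed["sales"].unstack()              # inner level becomes the columns

print(indexed)
print(wide)
```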

27. What are the fundamental Pandas time series data structures?
• For time stamps, Pandas provides the Timestamp type. It is
essentially a replacement for Python's native datetime, but is based on the more
efficient numpy.datetime64 data type. The associated index structure is
DatetimeIndex.
• For time periods, Pandas provides the Period type. This encodes a fixed-frequency
interval based on numpy.datetime64. The associated index structure is
PeriodIndex.
• For time deltas or durations, Pandas provides the Timedelta type. Timedelta is a
more efficient replacement for Python's native datetime.timedelta type, and is
based on numpy.timedelta64. The associated index structure is TimedeltaIndex.
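
The three scalar types and their index structures can be sketched as follows. The dates are arbitrary values chosen for illustration.

```python
import pandas as pd

stamp = pd.Timestamp("2024-01-15")           # a point in time
period = pd.Period("2024-01-15", freq="D")   # a one-day interval
delta = pd.Timedelta(days=2)                 # a duration

stamps = pd.to_datetime(["2024-01-15", "2024-01-16"])         # DatetimeIndex
periods = pd.period_range("2024-01-15", periods=2, freq="D")  # PeriodIndex
deltas = stamps - stamp                                       # TimedeltaIndex

later = stamp + delta
print(type(stamps).__name__, type(periods).__name__, type(deltas).__name__)
print(later)
```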

28. What is meant by hierarchical indexing?


Hierarchical indexing is a method of creating structured group relationships in data.
These hierarchical indexes, or MultiIndexes, are highly flexible and offer a range of
options when performing complex data queries. Hierarchical indexing allows us to
use multiple index levels on an axis. Hierarchical indexing is also known as multiple
indexing.

29. What is data selection in series?


The Series object provides a mapping from a collection of keys to a collection of
values, so values can be selected by key, much as in a dictionary:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
print(data)
Output:
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
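
Building on the Series above, the common selection patterns might look like this sketch (the variable names are illustrative):

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=["a", "b", "c", "d"])

value = data["b"]           # dictionary-style lookup by key
has_a = "a" in data         # membership tests check the index
by_label = data["a":"c"]    # label slice: includes the end label
by_mask = data[data > 0.5]  # Boolean mask

print(value, has_a)
print(by_label)
print(by_mask)
```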

30. List any two major advantages of data indexing and selection in EDA. (Nov/Dec 2023)
Efficiency: Data indexing allows for rapid access to specific subsets of data, which is
crucial when exploring large datasets. Operations like filtering rows based on
conditions or selecting specific columns can be performed efficiently using indexing
techniques such as .loc[], .iloc[], and Boolean indexing.

Versatility and flexibility: Pandas indexing provides versatile methods for selecting
data, such as selecting by label (.loc[]), by integer position (.iloc[]), or using Boolean
masks. This flexibility allows analysts to tailor their selections to the
requirements of their analysis, enhancing both the depth and breadth of exploration
possible during EDA.
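
Filtering rows by a condition while selecting only some columns is a typical EDA step. The sketch below reuses the sample employee data from question 21; the thresholds are arbitrary assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Anna", "Peter", "Linda", "Sophia"],
    "Age": [28, 24, 22, 32, 29],
    "Salary": [60000, 45000, 50000, 70000, 62000],
})

# Rows where Age > 25, keeping only two columns
older = df.loc[df["Age"] > 25, ["Name", "Salary"]]

# Conditions are combined with & (and) or | (or), each wrapped in parentheses
older_high_paid = df[(df["Age"] > 25) & (df["Salary"] > 61000)]

print(older)
print(older_high_paid)
```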

31. What are preprocessing and data engineering? (Nov/Dec 2024)

• Preprocessing refers to the steps taken to clean, transform, and prepare raw data
for analysis. It includes handling missing values, removing duplicates, normalizing
data, and feature scaling to improve data quality.
• Data engineering is the broader process of designing, building, and maintaining
data pipelines to collect, store, and process data efficiently. It involves ETL (Extract,
Transform, Load), database management, and integration of data from multiple
sources for analytics and machine learning.

32. How do you get the column names of a DataFrame using pandas in Python? (Nov/Dec 2024)
In Pandas, you can get the column names of a DataFrame using the .columns attribute.
import pandas as pd
# Creating a sample DataFrame
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30], 'City': ['New York', 'London']}
df = pd.DataFrame(data)
# Getting column names
column_names = df.columns
print(column_names)
Output:
Index(['Name', 'Age', 'City'], dtype='object')
