UNIT II 2M
QUESTION BANK
Semester / Class : VI / III Year B.E CSE
Name of Subject : CCS346 Exploratory Data Analysis
Name of Faculty member : Mrs.S.Vijaya Amala Devi AP/CSE
3. What methods does Pandas provide for dealing with missing values?
Pandas represents missing data with sentinels, using two existing Python null
values: the special floating-point NaN value and the Python None object.
None: None is a Python singleton object that is often used for missing data in
Python code.
NaN: NaN (Not a Number) is a special floating-point value defined by the
standard IEEE floating-point representation.
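The two sentinels can be seen side by side in a short sketch. The variable names and sample values below are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# NaN is a floating-point value, so it lives naturally in a float Series.
values = pd.Series([1.0, np.nan, 3.0])

# Plain NumPy arithmetic propagates NaN; pandas aggregations skip it by default.
total_numpy = np.sum(values.to_numpy())   # nan
total_pandas = values.sum()               # 4.0

# None is a Python object, so it is kept as-is in an object Series.
labels = pd.Series(["a", None, "c"], dtype=object)

# Both sentinels are reported as missing by isnull().
print(values.isnull().tolist())   # [False, True, False]
print(labels.isnull().tolist())   # [False, True, False]
```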
4. What are the two methods of combining datasets in Pandas?
There are two methods for combining datasets: concatenation and merging (or
joining).
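A minimal sketch of both approaches, assuming small hypothetical tables (first, second, ages) invented for illustration: pd.concat() stacks objects along an axis, while pd.merge() joins them on a shared key column.

```python
import pandas as pd

# Two tables with the same columns, plus a lookup table sharing the "id" key.
first = pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"]})
second = pd.DataFrame({"id": [3], "name": ["Charlie"]})
ages = pd.DataFrame({"id": [1, 2, 3], "age": [25, 30, 35]})

# Concatenation: stack rows of tables that share the same columns.
stacked = pd.concat([first, second], ignore_index=True)

# Merging (joining): combine columns by matching values of the "id" key.
merged = pd.merge(stacked, ages, on="id")
print(merged)
```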
A canonical example of the split-apply-combine operation, where the “apply” step is
a summation aggregation, makes clear what GroupBy accomplishes:
• The split step involves breaking up and grouping a DataFrame depending on the
value of the specified key.
• The apply step involves computing some function, usually an aggregate,
transformation, or filtering, within the individual groups.
• The combine step merges the results of these operations into an output array.
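The three steps can be sketched with a small DataFrame. The column names key and data and the sample values are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["A", "B", "C", "A", "B", "C"],
    "data": range(6),
})

# groupby("key") performs the split, sum() is the apply step,
# and pandas combines the per-group results into a single Series.
totals = df.groupby("key")["data"].sum()
print(totals)
```

Here group A collects the values 0 and 3, group B collects 1 and 4, and group C collects 2 and 5, so the combined result holds one sum per key.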
Data wrangling is the process of transforming data from its original "raw" form into a
more digestible format and organizing sets from various sources into a singular
coherent whole for further processing.
14. List the mapping between Pandas string methods and functions in Python's re
module.
Method      Description
match()     Call re.match() on each element, returning a Boolean.
extract()   Call re.search() on each element, returning matched groups as
            strings.
findall()   Call re.findall() on each element.
replace()   Replace occurrences of pattern with some other string.
contains()  Call re.search() on each element, returning a Boolean.
count()     Count occurrences of pattern.
split()     Equivalent to str.split(), but accepts regexps.
rsplit()    Equivalent to str.rsplit(); splits from the right on a literal
            separator.
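A few of these methods in use, as a minimal sketch. The sample names and patterns are assumptions made up for illustration; all methods are accessed through the .str accessor of a Series.

```python
import pandas as pd

names = pd.Series(["Alice Smith", "bob jones", "Charlie Ray"])

# match(): does each string start with an uppercase letter?
starts_upper = names.str.match(r"[A-Z]")

# contains(): does the pattern occur anywhere in the string?
has_jones = names.str.contains(r"jones")

# extract(): pull out the first capture group (here, the first word).
first_word = names.str.extract(r"(\w+)", expand=False)

# count(): number of lowercase vowels in each string.
vowels = names.str.count(r"[aeiou]")

# replace(): regex=True is passed so the pattern is treated as a regular expression.
underscored = names.str.replace(r"\s+", "_", regex=True)

print(first_word.tolist())   # ['Alice', 'bob', 'Charlie']
```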
19. What are the three fundamental data structures used in Pandas?
The three fundamental data structures used in Pandas are:
1. Series
2. DataFrame
3. Index
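A short sketch showing one object of each kind. The names ages and people and the sample values are hypothetical.

```python
import pandas as pd

# Series: a one-dimensional array of values with an attached index.
ages = pd.Series([25, 30, 35], index=["Alice", "Bob", "Charlie"])

# DataFrame: a two-dimensional table whose columns share one row index.
people = pd.DataFrame(
    {"age": [25, 30, 35], "city": ["New York", "Los Angeles", "Chicago"]},
    index=["Alice", "Bob", "Charlie"],
)

# Index: the immutable sequence of labels used by both structures.
row_labels = people.index

print(ages["Bob"])          # 30
print(list(row_labels))     # ['Alice', 'Bob', 'Charlie']
```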
}
df = pd.DataFrame(data, index=['A', 'B', 'C', 'D', 'E'])
print(df)
23. Explain the process of creating a DataFrame from a list of dictionaries.
Creating a DataFrame from a list of dictionaries is a common operation in pandas,
especially when the data is structured as a list of records, where each record is
a dictionary with consistent keys across all dictionaries. Each dictionary becomes
one row, and the dictionary keys become the column labels.
import pandas as pd
data = [
{'Name': 'Alice', 'Age': 25, 'City': 'New York'},
{'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
{'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]
df = pd.DataFrame(data)
print(df)
Output:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
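When the dictionaries do not all share the same keys, pandas still builds the DataFrame and marks the absent entries as missing. The sketch below uses hypothetical records to illustrate this case.

```python
import pandas as pd

# The second record has no 'Age' key and the first has no 'City' key.
records = [
    {"Name": "Alice", "Age": 25},
    {"Name": "Bob", "City": "Chicago"},
]

partial_df = pd.DataFrame(records)

# Missing entries are filled with NaN, so the 'Age' column becomes float64.
print(partial_df)
print(partial_df["Age"].dtype)
```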
24. What are the methods used for operating on null values in pandas?
• isnull(): Generate a Boolean mask indicating missing values.
• notnull(): Opposite of isnull().
• dropna(): Return a filtered version of the data.
• fillna(): Return a copy of the data with missing values filled or imputed.
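The four methods can be tried on a small Series. The sample values and the fill value 0 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, None])

missing_mask = s.isnull()     # True where a value is missing
present_mask = s.notnull()    # True where a value is present
cleaned = s.dropna()          # new Series without the missing entries
filled = s.fillna(0)          # new Series with missing entries replaced by 0

print(missing_mask.tolist())  # [False, True, False, True]
print(cleaned.tolist())       # [1.0, 3.0]
print(filled.tolist())        # [1.0, 0.0, 3.0, 0.0]
```

None of these calls modifies s itself; each returns a new object.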
25. How does pandas handle NAs by type?
Typeclass   Conversion when storing NAs   NA sentinel value
floating    No change                     np.nan
object      No change                     None or np.nan
integer     Cast to float64               np.nan
boolean     Cast to object                None or np.nan
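The casting rules in the table can be observed by building Series with and without a missing value. This is a minimal sketch using the default NumPy-backed dtypes; the variable names are illustrative.

```python
import pandas as pd

# Integers without missing values keep an integer dtype.
ints = pd.Series([1, 2, 3])

# Adding a missing value forces a cast to float64, with NaN as the sentinel.
ints_with_na = pd.Series([1, None, 3])

# Booleans without missing values keep the bool dtype.
bools = pd.Series([True, False])

# With a missing value, the Series falls back to the object dtype.
bools_with_na = pd.Series([True, None, False])

print(ints.dtype, ints_with_na.dtype)
print(bools.dtype, bools_with_na.dtype)
```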
# importing pandas library as alias pd
import pandas as pd
# calling the pandas read_csv() function.
# and storing the result in DataFrame df
df = pd.read_csv('homelessness.csv')
print(df.head())
27. What are the fundamental pandas time series data structures?
For time stamps, Pandas provides the Timestamp type. It is essentially a
replacement for Python's native datetime, but is based on the more efficient
numpy.datetime64 data type. The associated index structure is DatetimeIndex.
For time periods, Pandas provides the Period type. This encodes a fixed-frequency
interval based on numpy.datetime64. The associated index structure is
PeriodIndex.
For time deltas or durations, Pandas provides the Timedelta type. Timedelta is a
more efficient replacement for Python's native datetime.timedelta type, and is
based on numpy.timedelta64. The associated index structure is TimedeltaIndex.
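Each scalar type and its index structure can be created directly, as in this sketch. The dates and durations are arbitrary values chosen for illustration.

```python
import pandas as pd

# Scalar types: a point in time, a fixed-frequency interval, and a duration.
stamp = pd.Timestamp("2024-03-15 10:30")
month = pd.Period("2024-03", freq="M")
delta = pd.Timedelta(days=2, hours=6)

# Matching index structures.
dates = pd.to_datetime(["2024-01-01", "2024-01-02"])         # DatetimeIndex
months = pd.period_range("2024-01", periods=3, freq="M")     # PeriodIndex
durations = pd.to_timedelta(["1 days", "2 days"])            # TimedeltaIndex

# A Timedelta can be added to a Timestamp to shift it.
shifted = stamp + delta
print(shifted)            # 2024-03-17 16:30:00
print(month.start_time)   # 2024-03-01 00:00:00
```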
30. List any two major advantages of data indexing and selection in EDA.
Nov/Dec 2023
Efficiency: Data indexing allows for rapid access to specific subsets of data, which is
crucial when exploring large datasets. Operations like filtering rows based on
conditions or selecting specific columns can be performed efficiently using indexing
techniques, such as .loc[], .iloc[], and boolean indexing.
Versatility and Flexibility: Pandas indexing provides versatile methods for selecting
data, such as selecting by label (loc[]), by integer position (iloc[]), or using boolean
masks. This flexibility allows analysts to tailor their selections based on the
requirements of their analysis, enhancing both the depth and breadth of exploration
possible during EDA.
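The three selection styles mentioned above can be compared on one small DataFrame. The data and labels are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [25, 30, 35], "city": ["New York", "Los Angeles", "Chicago"]},
    index=["Alice", "Bob", "Charlie"],
)

# Label-based selection: row "Bob", column "age".
by_label = df.loc["Bob", "age"]

# Position-based selection: first row, second column.
by_position = df.iloc[0, 1]

# Boolean mask: keep only the rows where age is above 28.
over_28 = df[df["age"] > 28]

print(by_label)                 # 30
print(by_position)              # New York
print(list(over_28.index))      # ['Bob', 'Charlie']
```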
32. How do you get the column names of a DataFrame using pandas in Python?
Nov/Dec 2024
In Pandas, you can get the column names of a DataFrame using the .columns attribute.
import pandas as pd
# Creating a sample DataFrame
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30], 'City': ['New York', 'London']}
df = pd.DataFrame(data)
# Getting column names
column_names = df.columns
print(column_names)
Output:
Index(['Name', 'Age', 'City'], dtype='object')