FDS Pyq2
Data Science is the field that uses scientific methods, processes, and algorithms to
extract insights and knowledge from structured and unstructured data.
Missing values refer to data points that are absent or not recorded.
Matplotlib
Seaborn
Plotly
Bokeh
ggplot
Data transformation is the process of converting data from its original format into a
format that is suitable for analysis.
In a bubble chart, the size of each bubble represents a third variable, so more than
two variables can be shown in a 2D plot.
Standard deviation measures how spread out the values in a dataset are.
A high standard deviation means the data points are spread out; a low one means they
are close to the mean.
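As a small illustration of high versus low spread, the sketch below compares two datasets that share the same mean. The sample values are made up for this example, and `statistics.pstdev` is used on the assumption that each list is the whole population.

```python
import statistics

# Two hypothetical datasets with the same mean (50) but different spread.
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

# pstdev treats the list as the full population (divides by N).
tight_sd = statistics.pstdev(tight)
wide_sd = statistics.pstdev(wide)

print(f"tight: mean={statistics.mean(tight)}, sd={tight_sd:.2f}")
print(f"wide:  mean={statistics.mean(wide)}, sd={wide_sd:.2f}")
```

The second list has the larger standard deviation because its values sit much further from the mean.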
b) Define Statistical Data Analysis?
Statistical data analysis involves using statistical methods to interpret, analyze, and
summarize data.
A data cube helps in organizing and summarizing large datasets, typically for OLAP
(Online Analytical Processing).
Each dimension represents a different variable, and the cube stores aggregated data
points.
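The idea of dimensions and aggregated cells can be sketched with a plain dictionary. The sales records, the dimension names (year, region, product) and the helper `roll_up_by_year` are all hypothetical, chosen only to illustrate the concept.

```python
from collections import defaultdict

# Hypothetical sales facts: (year, region, product, amount).
sales = [
    (2023, "North", "Pen", 100),
    (2023, "North", "Book", 250),
    (2023, "South", "Pen", 80),
    (2024, "North", "Pen", 120),
]

# Each cell is keyed by one value per dimension and stores an aggregate (a sum).
cube = defaultdict(int)
for year, region, product, amount in sales:
    cube[(year, region, product)] += amount

def roll_up_by_year(cube):
    """Aggregate away the region and product dimensions."""
    totals = defaultdict(int)
    for (year, _region, _product), amount in cube.items():
        totals[year] += amount
    return dict(totals)

print(roll_up_by_year(cube))
```

Rolling up along one dimension like this is the kind of summary an OLAP tool computes over far larger data.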
Data preprocessing is the process of cleaning, transforming, and organizing data before
analysis.
It helps in improving data quality by handling missing values, removing duplicates, and
standardizing formats.
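The three steps named above can be sketched in plain Python. The records, the field names and the choice to fill a missing age with the mean of the known ages are assumptions made for illustration; real projects often use a library such as pandas instead.

```python
# Hypothetical raw records: one missing age, one duplicate, inconsistent city text.
raw = [
    {"name": "Asha", "age": 30, "city": "pune"},
    {"name": "Ravi", "age": None, "city": "Mumbai "},
    {"name": "Asha", "age": 30, "city": "pune"},
]

def preprocess(records):
    # Standardize formats: trim whitespace and use title case for city names.
    cleaned = [{**r, "city": r["city"].strip().title()} for r in records]

    # Handle missing values: fill a missing age with the mean of the known ages.
    # Assumes at least one age is known.
    known = [r["age"] for r in cleaned if r["age"] is not None]
    fill = sum(known) / len(known)
    for r in cleaned:
        if r["age"] is None:
            r["age"] = fill

    # Remove duplicates, keeping the first occurrence of each record.
    seen, unique = set(), []
    for r in cleaned:
        key = (r["name"], r["age"], r["city"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

clean = preprocess(raw)
print(clean)
```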
Data visualization presents data in graphical formats like charts, graphs, and maps.
a) What are the Measures of Central Tendency? Explain Any Two of Them in Brief.
Measures of central tendency are statistics that describe the center or typical value of a
dataset.
1. Mean:
Calculated by summing all values and dividing by the number of values.
2. Median:
With the values sorted, if the dataset has an odd number of elements, the median is
the exact middle value. If even, it is the average of the two middle values.
Example: For the dataset [1, 3, 5], Median = 3. For [1, 2, 3, 4], Median = (2 + 3) / 2 =
2.5.
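Both measures can be checked with Python's standard `statistics` module. This is a minimal sketch that reuses the two datasets from the example above; the variable names are illustrative.

```python
import statistics

odd = [1, 3, 5]
even = [1, 2, 3, 4]

# Median: middle value (odd count) or average of the two middle values (even count).
median_odd = statistics.median(odd)
median_even = statistics.median(even)

# Mean: sum of all values divided by the number of values.
mean_odd = statistics.mean(odd)

print(median_odd, median_even, mean_odd)
```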
b) What Are the Various Types of Data Available? Give Examples of Each.
1. Nominal Data:
2. Ordinal Data:
Categorical data with a meaningful order, but the differences between the
categories are not meaningful.
3. Interval Data:
Numerical data where the difference between values is meaningful, but there is no
true zero.
4. Ratio Data:
A Venn diagram is a diagram that shows all possible logical relations between different
sets.
It uses circles or other shapes to represent the sets and overlaps to show relationships
between them.
1. Identify the sets: Determine what groups or categories you are comparing.
3. Label the sets: Label each circle with the name of the set.
4. Show relationships: Overlap the circles to show common elements between the
sets.
Example:
In the Venn diagram, the intersection (overlap) of Set A and Set B will contain {2, 3},
while the unique elements of Set A and Set B will be outside the overlap.
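The regions of a two-set Venn diagram map directly onto Python set operations. The contents of the two sets below are an assumption, chosen only so that the overlap comes out as {2, 3}.

```python
# Hypothetical sets chosen so that the overlap is {2, 3}.
set_a = {1, 2, 3}
set_b = {2, 3, 4}

overlap = set_a & set_b   # elements in both sets (the intersection)
only_a = set_a - set_b    # elements unique to Set A
only_b = set_b - set_a    # elements unique to Set B

print(overlap, only_a, only_b)
```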
Stores tabular data as plain text with each line representing a data row.
4. Parquet:
Commonly used with big data tools like Hadoop and Spark.
Example: Optimized for reading specific columns without scanning the entire
dataset.
5. Excel (XLSX):
A spreadsheet format used by Microsoft Excel, stores data in rows and columns.
Data Quality refers to the condition of data based on factors like accuracy,
completeness, consistency, and reliability. High-quality data is accurate, consistent, and
suitable for analysis.
5. Validity: Data should follow the correct format or range (e.g., age must be a positive
number).
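A validity rule like the age example can be expressed as a small check. The function name `is_valid_age` and the upper limit of 120 are assumptions for illustration, not part of any standard.

```python
def is_valid_age(value):
    """Return True if value is a positive number within a plausible range."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return 0 < value <= 120  # 120 is an assumed sanity limit

ages = [25, -3, "forty", 0, 67.5]
print([a for a in ages if is_valid_age(a)])
```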
1. Matplotlib:
Overview: A basic Python library for creating static, animated, and interactive plots.
Strengths: Easy to use, highly customizable.
Example:
python
2. Seaborn:
Example:
python
3. Plotly:
Strengths: Interactive plots with hover information, zoom-in features, and ability to
embed in web apps.
Example:
python
import plotly.express as px
fig = px.scatter(x=[1, 2, 3], y=[10, 20, 30])
fig.show()
4. Tableau:
Overview: A powerful data visualization tool used for creating interactive
dashboards and reports.
Types of Visualizations: Bar charts, line graphs, pie charts, geographic maps.
5. Power BI:
Overview: A business analytics tool from Microsoft that provides interactive data
visualization and business intelligence capabilities.
Strengths: Integration with various data sources (Excel, SQL, Azure), easy-to-use
interface.
Example: Users can connect to data sources, create reports, and share dashboards.
These tools help in transforming complex data into clear visual representations that support
better understanding and decision-making.
Outlier: An outlier is a data point that significantly differs from other values in the
dataset. It can distort statistical analyses and models, making them less reliable.
Types of Outliers:
1. Global Outliers:
These are values that are significantly different from the rest of the data points
across the entire dataset.
2. Contextual Outliers:
These values may be normal in some context but appear unusual when
considered in a different context.
3. Collective Outliers:
These are a group of data points that together are abnormal, but individually
may not appear as outliers.
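One common way to flag global outliers is the interquartile range (IQR) rule. The notes do not name a detection method, so the rule, the multiplier `k=1.5` and the sample readings below are assumptions used for illustration.

```python
import statistics

def find_outliers_iqr(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

# Hypothetical readings with one value far from the rest.
readings = [10, 12, 11, 13, 12, 11, 95]
print(find_outliers_iqr(readings))
```

This rule looks at each value on its own, so it does not detect contextual or collective outliers.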
Explanation: Normalization scales the data into a fixed range, usually [0, 1], by
subtracting the minimum value and dividing by the range of values.
Formula:
$X_{\text{norm}} = \dfrac{X - \min(X)}{\max(X) - \min(X)}$
Use: Useful when features have different units or scales, e.g., height in cm and
weight in kg.
2. Standardization:
Formula:
$X_{\text{std}} = \dfrac{X - \mu}{\sigma}$
where μ is the mean and σ is the standard deviation.
Use: Puts features with different units on a comparable scale, which matters for
distance- or margin-based algorithms like k-means or SVM.
3. Log Transformation:
Explanation: Log transformation applies the logarithm to each data point to reduce
skewness and make the data more normally distributed.
Formula:
$X_{\log} = \log(X)$
Use: Useful when data has a large range or exponential growth (e.g., population
growth, financial data).
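The three transformations above can be sketched with the standard library. The sample data and function names are hypothetical. The sketch assumes the values are not all equal (otherwise the divisions fail) and that the log input is strictly positive. It uses the population standard deviation and the natural logarithm, since the notes do not specify either choice.

```python
import math
import statistics

def min_max_normalize(values):
    """Scale values into [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def standardize(values):
    """Z-score: (x - mean) / standard deviation (population form)."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]

def log_transform(values):
    """Natural log of each value; every value must be greater than zero."""
    return [math.log(x) for x in values]

heights_cm = [150, 160, 170, 180]
weights_kg = [50, 60, 70, 80, 90]
population = [1_000, 10_000, 100_000, 1_000_000]

print(min_max_normalize(heights_cm))
print(standardize(weights_kg))
print(log_transform(population))
```

After the log transform, each tenfold jump in `population` becomes the same fixed step, which is why the technique compresses data with a very large range.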
These techniques are often applied in data preprocessing to prepare data for machine
learning models, ensuring better model performance and more reliable results.