FDS Pyq2

Here are the answers in simple points:

a) What is Data Science?

Data Science is the field that uses scientific methods, processes, and algorithms to
extract insights and knowledge from structured and unstructured data.

It involves data analysis, machine learning, and statistical modeling.

b) Define Data Source?

A data source is any place or system from which data is collected.

Examples: databases, websites, sensors, or files like CSV.

c) What are Missing Values?

Missing values refer to data points that are absent or not recorded.

They occur in datasets because of recording errors or because the information was unavailable.
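
A minimal sketch of spotting missing values, assuming missing entries are stored as `None` in a list of records. The `records` data and the `count_missing` helper are made up for illustration:

```python
# Hypothetical records; None marks a value that was never recorded.
records = [
    {"name": "Asha", "age": 25, "city": "Pune"},
    {"name": "Ravi", "age": None, "city": "Mumbai"},
    {"name": "Meera", "age": 31, "city": None},
]

def count_missing(rows):
    """Count missing (None) values per field."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            # (value is None) is True/False, which adds as 1/0.
            counts[field] = counts.get(field, 0) + (value is None)
    return counts

missing = count_missing(records)
print(missing)
```

Here `age` and `city` each have one missing value, while `name` has none.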

d) List the Visualization Libraries in Python.

Matplotlib

Seaborn

Plotly

Bokeh

plotnine (a ggplot-style library)

e) List Applications of Data Science.

Predictive analytics (e.g., weather forecasting, sales predictions)

Recommendation systems (e.g., Netflix, Amazon)

Fraud detection (e.g., credit card fraud)

Image recognition (e.g., self-driving cars)

Healthcare (e.g., medical diagnoses)

f) What is Data Transformation?

Data transformation is the process of converting data from its original format into a
format that is suitable for analysis.

Examples include scaling, encoding, and normalizing data.
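
As a small illustration of encoding, here is a sketch of one-hot encoding in plain Python. The `colors` list is an assumed example, not data from any real dataset:

```python
# Hypothetical categorical column.
colors = ["Red", "Blue", "Green", "Blue"]

# One column per distinct category, in alphabetical order.
categories = sorted(set(colors))

# Each value becomes a row with a 1 in its category's column and 0 elsewhere.
one_hot = [[int(c == cat) for cat in categories] for c in colors]

print(categories)
for color, row in zip(colors, one_hot):
    print(color, row)
```

The text labels are converted into numeric columns that a model can use.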

g) Define Hypothesis Testing?

Hypothesis testing is a statistical method used to decide whether the data provide enough evidence to reject a null hypothesis (the default assumption) in favor of an alternative hypothesis.

It involves comparing the observed data to a theory or assumption.
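
A rough sketch of a one-sample t-test using only the standard library. The sample values and the null-hypothesis mean of 500 are assumptions for illustration, and the critical value 2.365 is the usual two-sided 5% t-table value for 7 degrees of freedom:

```python
import math
import statistics

# Hypothetical sample of 8 measurements; null hypothesis: true mean is 500.
sample = [502, 505, 499, 503, 501, 504, 498, 504]
mu0 = 500

n = len(sample)
mean = statistics.mean(sample)
s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)

# t statistic: how many standard errors the sample mean is from mu0.
t_stat = (mean - mu0) / (s / math.sqrt(n))

# Two-sided 5% critical value for n - 1 = 7 degrees of freedom (from a t-table).
t_critical = 2.365
reject_null = abs(t_stat) > t_critical

print(round(t_stat, 3), reject_null)
```

Since the t statistic is below the critical value, this sample does not give enough evidence to reject the null hypothesis at the 5% level.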

h) What is the Use of Bubble Plot?

A bubble plot is used to visualize relationships between three variables.

The position of each bubble shows two variables and its size represents the third, so a 2D plot can display three variables at once.

i) Define Data Cleaning?

Data cleaning is the process of identifying and correcting or removing errors,
inconsistencies, and inaccuracies in data.

It ensures the data is accurate and ready for analysis.

j) Define Standard Deviation?

Standard deviation measures how spread out the values in a dataset are.

A high standard deviation means the data points are spread out; a low one means they
are close to the mean.
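
A short sketch comparing two made-up datasets with the same mean but different spread, using Python's `statistics` module:

```python
import statistics

tight = [48, 49, 50, 51, 52]    # values close to the mean
spread = [10, 30, 50, 70, 90]   # values far from the mean

# pstdev treats the data as the whole population (divides by n);
# statistics.stdev would divide by n - 1 for a sample instead.
sd_tight = statistics.pstdev(tight)
sd_spread = statistics.pstdev(spread)

print(round(sd_tight, 2), round(sd_spread, 2))
```

Both lists have a mean of 50, but the second has a much larger standard deviation.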

Here are the answers in simple points:

a) List the Tools for Data Scientists.

Programming Languages: Python, R, SQL

Libraries: Pandas, NumPy, Scikit-learn, TensorFlow

Data Visualization Tools: Matplotlib, Seaborn, Tableau

Databases: MySQL, MongoDB

Big Data Tools: Hadoop, Spark

Cloud Services: AWS, Google Cloud

b) Define Statistical Data Analysis?

Statistical data analysis involves using statistical methods to interpret, analyze, and
summarize data.

It helps in identifying patterns, trends, and relationships within the data.

Common techniques: mean, median, mode, variance, regression analysis.

c) What is Data Cube?

A data cube is a multi-dimensional array of values used in data analysis.

It helps in organizing and summarizing large datasets, typically for OLAP (Online
Analytical Processing).

Each dimension represents a different variable, and the cube stores aggregated data
points.
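
A simplified sketch of the idea, assuming a tiny sales dataset with three dimensions (product, region, quarter). The data and the `roll_up` helper are hypothetical; real OLAP systems are far more elaborate:

```python
from collections import defaultdict

# Hypothetical sales facts: (product, region, quarter, amount).
sales = [
    ("Laptop", "North", "Q1", 100),
    ("Laptop", "South", "Q1", 150),
    ("Phone", "North", "Q1", 80),
    ("Laptop", "North", "Q2", 120),
    ("Phone", "South", "Q2", 60),
]

# The cube maps each (product, region, quarter) cell to an aggregated total.
cube = defaultdict(int)
for product, region, quarter, amount in sales:
    cube[(product, region, quarter)] += amount

def roll_up(cube, axis):
    """Sum the cube over every dimension except `axis`
    (0 = product, 1 = region, 2 = quarter)."""
    totals = defaultdict(int)
    for key, amount in cube.items():
        totals[key[axis]] += amount
    return dict(totals)

by_product = roll_up(cube, 0)
by_quarter = roll_up(cube, 2)
print(by_product)
print(by_quarter)
```

Rolling up along different dimensions gives totals per product or per quarter from the same cube.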

d) Give the Purpose of Data Preprocessing?

Data preprocessing is the process of cleaning, transforming, and organizing data before
analysis.

It helps in improving data quality by handling missing values, removing duplicates, and
standardizing formats.

It ensures better accuracy and efficiency in machine learning models.

e) What is the Purpose of Data Visualization?

Data visualization presents data in graphical formats like charts, graphs, and maps.

It helps in understanding complex data and patterns more easily.

It supports decision-making by making data insights clearer and more accessible.

Here are the answers in simple points:

a) What are the Measures of Central Tendency? Explain Any Two of Them in Brief.

Measures of central tendency are statistics that describe the center or typical value of a
dataset.

Common measures: Mean, Median, Mode.

1. Mean:

The average of all data points.

Calculated by summing all values and dividing by the number of values.

Example: For the dataset [1, 2, 3], Mean = (1+2+3) / 3 = 2.

2. Median:

The middle value when the data is sorted in order.

If the dataset has an odd number of elements, it is the exact middle value. If even,
it’s the average of the two middle values.

Example: For the dataset [1, 3, 5], Median = 3. For [1, 2, 3, 4], Median = (2 + 3) / 2 =
2.5.
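
The examples above can be checked with Python's built-in `statistics` module. The list used for the mode is an extra assumed example:

```python
import statistics

mean_value = statistics.mean([1, 2, 3])          # (1 + 2 + 3) / 3
median_odd = statistics.median([1, 3, 5])        # middle value
median_even = statistics.median([1, 2, 3, 4])    # average of 2 and 3
mode_value = statistics.mode([1, 2, 2, 3])       # most frequent value

print(mean_value, median_odd, median_even, mode_value)
```

This prints 2, 3, 2.5 and 2, matching the hand calculations.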

b) What Are the Various Types of Data Available? Give Examples of Each.

1. Nominal Data:

Categorical data without any order or ranking.

Example: Colors (Red, Blue, Green), Gender (Male, Female).

2. Ordinal Data:

Categorical data with a meaningful order, but the differences between the
categories are not meaningful.

Example: Education level (High School, Bachelor's, Master's).

3. Interval Data:

Numerical data where the difference between values is meaningful, but there is no
true zero.

Example: Temperature in Celsius or Fahrenheit (0°C does not mean no temperature).

4. Ratio Data:

Numerical data with meaningful differences and a true zero.

Example: Height, Weight, Age (0 means no height, no weight, or no age).

c) What Is a Venn Diagram? How to Create It? Explain with Example.

A Venn diagram is a diagram that shows all possible logical relations between different
sets.

It uses circles or other shapes to represent the sets and overlaps to show relationships
between them.

Steps to create a Venn diagram:

1. Identify the sets: Determine what groups or categories you are comparing.

2. Draw the circles: Each set is represented by a circle.

3. Label the sets: Label each circle with the name of the set.

4. Show relationships: Overlap the circles to show common elements between the
sets.

Example:

Set A = {1, 2, 3}, Set B = {2, 3, 4}

In the Venn diagram, the intersection (overlap) of Set A and Set B will contain {2, 3},
while the unique elements of Set A and Set B will be outside the overlap.
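
The regions of this Venn diagram can be computed with Python sets. This is only a sketch of the set logic, not a drawing of the diagram:

```python
set_a = {1, 2, 3}
set_b = {2, 3, 4}

both = set_a & set_b      # overlap of the two circles
only_a = set_a - set_b    # part of circle A outside the overlap
only_b = set_b - set_a    # part of circle B outside the overlap

print(sorted(both), sorted(only_a), sorted(only_b))
```

The overlap holds 2 and 3, while 1 belongs only to Set A and 4 only to Set B.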

Here are the answers in simple points:

a) Explain Different Data Formats in Brief.

1. CSV (Comma Separated Values):

Stores tabular data as plain text with each line representing a data row.

Values are separated by commas.

Example: name,age,city\nJohn,25,New York

2. JSON (JavaScript Object Notation):

Stores data in key-value pairs, useful for nested structures.

Commonly used in web applications and APIs.

Example: {"name": "John", "age": 25, "city": "New York"}

3. XML (Extensible Markup Language):

A markup language that stores data in a hierarchical structure with tags.

Example: <person><name>John</name><age>25</age><city>New York</city></person>

4. Parquet:

A columnar storage file format optimized for large-scale data processing.

Commonly used with big data tools like Hadoop and Spark.

Example: Optimized for reading specific columns without scanning the entire
dataset.

5. Excel (XLSX):

A spreadsheet format used by Microsoft Excel, stores data in rows and columns.

Supports various data types, formulas, and charts.

Example: Data arranged in cells within worksheets.

6. SQL Database (Relational Databases):

Data is stored in tables with rows and columns.

Structured Query Language (SQL) is used to query the data.

Example: A table like students(name, age, grade) .
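
A small sketch that reads the same record from the CSV, JSON, and XML examples above using only Python's standard library. The variable names are chosen for illustration:

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

csv_text = "name,age,city\nJohn,25,New York\n"
json_text = '{"name": "John", "age": 25, "city": "New York"}'
xml_text = "<person><name>John</name><age>25</age><city>New York</city></person>"

# CSV: every value is read as a string.
from_csv = next(csv.DictReader(io.StringIO(csv_text)))

# JSON: numbers keep their type (age is an int).
from_json = json.loads(json_text)

# XML: child tags become fields; element text is always a string.
root = ET.fromstring(xml_text)
from_xml = {child.tag: child.text for child in root}

print(repr(from_csv["age"]), repr(from_json["age"]), repr(from_xml["age"]))
```

Note that CSV and XML give the age as the text "25", while JSON gives the number 25, so type conversion is often needed after loading.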

b) What is Data Quality? Which Factors Affect Data Quality?

Data Quality refers to the condition of data based on factors like accuracy,
completeness, consistency, and reliability. High-quality data is accurate, consistent, and
suitable for analysis.

Factors Affecting Data Quality:

1. Accuracy: Data should represent real-world values correctly (no errors).

2. Completeness: All required data should be present (no missing values).

3. Consistency: Data should be consistent across different datasets (no contradictions).

4. Timeliness: Data should be up-to-date and relevant.

5. Validity: Data should follow the correct format or range (e.g., age must be a positive
number).

6. Uniqueness: Data should not have duplicate entries.

7. Relevance: Data should be suitable for the specific analysis or purpose.
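
A minimal sketch of checking three of these factors (completeness, validity, uniqueness) in plain Python. The `rows` data and the rule that age must be positive are assumptions for illustration:

```python
# Hypothetical records with deliberate quality problems.
rows = [
    {"id": 1, "name": "Asha", "age": 25},
    {"id": 2, "name": "Ravi", "age": None},   # incomplete: age missing
    {"id": 3, "name": "Meera", "age": -4},    # invalid: negative age
    {"id": 1, "name": "Asha", "age": 25},     # duplicate of the first row
]

# Completeness: rows with any missing (None) value.
incomplete = [r["id"] for r in rows if any(v is None for v in r.values())]

# Validity: age, when present, must be a positive number.
invalid = [r["id"] for r in rows if r["age"] is not None and r["age"] <= 0]

# Uniqueness: rows that repeat an earlier row exactly.
seen, duplicates = set(), []
for r in rows:
    key = tuple(sorted(r.items()))
    if key in seen:
        duplicates.append(r["id"])
    seen.add(key)

print(incomplete, invalid, duplicates)
```

Each list holds the ids of the rows that fail the corresponding check.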

c) Write Detailed Notes on Basic Data Visualization Tools.

1. Matplotlib:

Overview: A basic Python library for creating static, animated, and interactive plots.

Types of Visualizations: Line plots, bar charts, scatter plots, histograms.

Strengths: Easy to use, highly customizable.

Example:

python

import matplotlib.pyplot as plt


plt.plot([1, 2, 3, 4], [10, 20, 25, 30])
plt.show()

2. Seaborn:

Overview: Built on top of Matplotlib, provides a higher-level interface for creating
attractive statistical graphics.

Types of Visualizations: Heatmaps, pair plots, violin plots, box plots.

Strengths: Simplifies complex visualizations, integrates well with Pandas
DataFrames.

Example:

python

import matplotlib.pyplot as plt
import seaborn as sns
sns.heatmap([[1, 2], [3, 4]], annot=True)
plt.show()

3. Plotly:

Overview: A library for creating interactive visualizations that can be embedded in
web applications.

Types of Visualizations: Line charts, scatter plots, 3D plots, maps.

Strengths: Interactive plots with hover information, zoom-in features, and ability to
embed in web apps.

Example:

python

import plotly.express as px
fig = px.scatter(x=[1, 2, 3], y=[10, 20, 30])
fig.show()

4. Tableau:

Overview: A powerful data visualization tool used for creating interactive
dashboards and reports.

Types of Visualizations: Bar charts, line graphs, pie charts, geographic maps.

Strengths: Drag-and-drop interface, real-time data connection, and dashboard
creation.

Example: Users can create interactive reports without coding.

5. Power BI:

Overview: A business analytics tool from Microsoft that provides interactive data
visualization and business intelligence capabilities.

Types of Visualizations: Charts, tables, and maps for data reporting.

Strengths: Integration with various data sources (Excel, SQL, Azure), easy-to-use
interface.

Example: Users can connect to data sources, create reports, and share dashboards.

These tools help in transforming complex data into clear visual representations that support
better understanding and decision-making.

Here are the answers in simple points:

a) What is Outlier? State Types of Outliers.

Outlier: An outlier is a data point that significantly differs from other values in the
dataset. It can distort statistical analyses and models, making them less reliable.

Types of Outliers:

1. Global Outliers:

These are values that are significantly different from the rest of the data points
across the entire dataset.

Example: A person’s age recorded as 200 in a dataset of ages.

2. Contextual Outliers:

These values may be normal in some context but appear unusual when
considered in a different context.

Example: A temperature of 30°C is normal in summer but is an outlier if
recorded in winter in a cold region.

3. Collective Outliers:

These are a group of data points that together are abnormal, but individually
may not appear as outliers.

Example: A sudden change in a stock's price over several days might be
considered a collective outlier.
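
One common way to flag global outliers is the 1.5 × IQR rule. The sketch below applies it to a made-up list of ages that includes the value 200 from the example above. The rule itself is a swapped-in technique, not something the answer above prescribes:

```python
import statistics

# Hypothetical ages; 200 is clearly a recording error.
ages = [22, 25, 27, 30, 31, 33, 35, 38, 40, 200]

# Quartiles: Q1, median, Q3.
q1, _, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = [a for a in ages if a < lower or a > upper]
print(outliers)
```

Only the age of 200 falls outside the fences. Contextual and collective outliers need extra information (such as season or time order) and are not caught by this simple rule.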

b) State and Explain Any Three Data Transformation Techniques.

1. Normalization (Min-Max Scaling):

Explanation: Normalization scales the data into a fixed range, usually [0, 1], by
subtracting the minimum value and dividing by the range of values.

Formula:

$$X_{\text{norm}} = \frac{X - \min(X)}{\max(X) - \min(X)}$$

Use: Useful when features have different units or scales, e.g., height in cm and
weight in kg.

2. Standardization (Z-Score Normalization):

Explanation: Standardization transforms the data to have a mean of 0 and a
standard deviation of 1 by subtracting the mean and dividing by the standard
deviation.

Formula:

$$X_{\text{std}} = \frac{X - \mu}{\sigma}$$

where μ is the mean and σ is the standard deviation.

Use: Helps in handling features with different units and is recommended for
scale-sensitive algorithms such as k-means or SVM.

3. Log Transformation:

Explanation: Log transformation applies the logarithm to each data point to reduce
skewness and make the data more normally distributed.

Formula:

$$X_{\text{log}} = \log(X)$$

Use: Useful when data has a large range or exponential growth (e.g., population
growth, financial data).

These techniques are often applied in data preprocessing to prepare data for machine
learning models, ensuring better model performance and more reliable results.
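
The three techniques can be sketched in plain Python as follows. The `values` list is an assumed example, and the population standard deviation is used for standardization:

```python
import math
import statistics

# Hypothetical feature values with one large value.
values = [10, 20, 30, 40, 100]

# Min-max normalization: (x - min) / (max - min), result lies in [0, 1].
lo, hi = min(values), max(values)
normalized = [(x - lo) / (hi - lo) for x in values]

# Standardization: (x - mean) / std, result has mean 0 and std 1.
mu = statistics.mean(values)
sigma = statistics.pstdev(values)
standardized = [(x - mu) / sigma for x in values]

# Log transformation: only defined for positive values.
logged = [math.log(x) for x in values]

print([round(v, 3) for v in normalized])
print([round(v, 3) for v in standardized])
print([round(v, 3) for v in logged])
```

After normalization the smallest value maps to 0 and the largest to 1. The log transformation pulls the large value 100 much closer to the rest.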

