FDS - 1 SOLVED
a) What is the volume characteristic of data?
Sol:
The volume characteristic of data refers to the sheer amount of data generated and stored
in data systems. It is a key aspect of Big Data, signifying the massive quantities of data that
organizations collect, process, and analyze. The volume characteristic highlights the need
for scalable storage solutions and advanced data processing techniques to handle large
datasets efficiently.
b) What is JSON?
Sol:
• JSON (JavaScript Object Notation) is a lightweight data interchange format that is easy for humans to read and write, and easy for machines to parse and generate.
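A minimal sketch using Python's built-in json module (the record shown is made up for illustration):
import json

raw = '{"name": "Asha", "age": 28, "city": "Pune"}'  # hypothetical record
record = json.loads(raw)        # parse JSON text into a Python dict
print(record["name"])           # Asha
print(json.dumps(record))       # serialize the dict back to a JSON string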
c) What is data discretization?
Sol:
Data discretization is the process of converting continuous data attributes into discrete
intervals or categories. This technique is used to simplify data analysis and mining by
reducing the number of possible values, making it easier to identify patterns and trends.
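As a small illustrative sketch, pandas' cut function can discretize a continuous attribute into labelled intervals (the bin edges and labels below are assumptions):
import pandas as pd

ages = pd.Series([5, 17, 25, 42, 61, 78])          # continuous values (made up)
groups = pd.cut(ages, bins=[0, 18, 40, 60, 100],   # interval boundaries (assumed)
                labels=['child', 'young', 'middle-aged', 'senior'])
print(groups)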
d) What is a quartile?
Sol:
A quartile is a type of quantile that divides a dataset into four equal parts. The three
quartiles are:
1. First Quartile (Q1): The median of the lower half of the dataset.
2. Second Quartile (Q2): The median of the entire dataset.
3. Third Quartile (Q3): The median of the upper half of the dataset.
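A quick sketch of computing the quartiles with NumPy (the sample data is assumed):
import numpy as np

data = [7, 9, 12, 13, 14, 16, 25]                 # made-up sample
q1, q2, q3 = np.percentile(data, [25, 50, 75])    # 25th/50th/75th percentiles
print("Q1:", q1, "Q2 (median):", q2, "Q3:", q3)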
e) Explain the types of attributes
Sol:
1. Nominal Attributes: Categorical data with no intrinsic ordering (e.g., gender, colors).
2. Ordinal Attributes: Categorical data with an intrinsic ordering (e.g., rankings, grades).
3. Interval Attributes: Numeric data with meaningful differences but no true zero point
(e.g., temperature in Celsius).
4. Ratio Attributes: Numeric data with meaningful differences and a true zero point (e.g.,
height, weight).
f) What is a data object?
Sol:
A data object is an entity that encapsulates data and its associated attributes. It represents
a single record in a dataset, which can include various attributes describing the object. For
example, in a customer dataset, each data object represents an individual customer with
attributes like name, age, and address.
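As a minimal sketch, one data object from such a customer dataset could be represented in Python as a dictionary of attribute-value pairs (values made up):
customer = {"name": "Ravi", "age": 34, "address": "Pune"}  # one data object
print(customer["age"])  # access a single attribute of the object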
g) What is Data Transformation?
Sol:
Data transformation is the process of converting data from one format or structure into
another. This process includes various techniques such as normalization, aggregation, and
scaling to prepare data for analysis, improve its quality, and ensure consistency across
datasets.
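A small sketch of one such transformation, min-max normalization, applied with pandas (the column name and values are assumptions):
import pandas as pd

df = pd.DataFrame({"income": [20000, 35000, 50000, 80000]})  # made-up values
# Rescale into [0, 1]: (x - min) / (max - min)
df["income_norm"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())
print(df)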
h) Name a tool used for working with geographical data
Sol:
• ArcGIS: A powerful platform for working with maps and geographic information. It
allows users to create, share, and analyze spatial data.
i) What are the methods of feature selection?
Sol:
1. Filter Methods: Select features using statistical measures computed independently of
any model (e.g., correlation, chi-square test; see the sketch after this list).
2. Wrapper Methods: Evaluate candidate feature subsets by training a model on each
subset (e.g., recursive feature elimination).
3. Embedded Methods: Perform feature selection during the model training process
(e.g., Lasso, Ridge Regression).
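A minimal sketch of a filter method using scikit-learn's SelectKBest (the iris dataset and k=2 are arbitrary choices for illustration):
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
# Filter method: keep the 2 features with the highest ANOVA F-scores
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)
print(X_selected.shape)  # (150, 2)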
j) List any two libraries used in Python for data analysis
Sol:
• Pandas: This library provides data structures and data analysis tools for handling
and manipulating numerical tables and time series data. It is particularly useful for
data wrangling and transformation, allowing for easy data cleaning, preparation, and
aggregation.
• NumPy: This library supports large, multi-dimensional arrays and matrices, along
with a large collection of high-level mathematical functions to operate on these
arrays. It is essential for scientific computing and is often used as a foundation for
other data analysis libraries.
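A brief sketch showing both libraries in use (the data is made up):
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4]])   # NumPy: multi-dimensional array
print(arr.mean(axis=0))            # column-wise means

df = pd.DataFrame({"city": ["Pune", "Delhi"], "sales": [120, 95]})
print(df.describe())               # Pandas: summary statistics of numeric columns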
Q2) Attempt any FOUR of the following:
a) Explain text files and binary files
Sol:
1. Text Files:
• Usage: Commonly used for logs, configuration files, and simple data storage.
• Example: .txt, .log, and .csv files.
• Advantages: Easy to read and edit using any text editor. Lightweight and compatible
with most software.
2. Binary Files:
• Format: Stores data in binary form, which is not human-readable. Data is encoded
in bytes.
• Usage: Used for storing data that requires efficient read/write operations, such as
images, executable files, and custom data structures.
• Advantages: More compact and efficient for large datasets. Can store complex data
structures.
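A minimal sketch of the difference in Python (the file names are hypothetical):
# Text mode: data written as human-readable characters
with open("notes.txt", "w") as f:
    f.write("hello, world\n")

# Binary mode: data written as raw bytes, not human-readable
with open("blob.bin", "wb") as f:
    f.write(bytes([0x48, 0x69]))   # the two bytes for "Hi"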
b) Explain the role of statistics in data science
Sol:
Statistics plays a crucial role in data science by providing the tools and methodologies
needed to analyze and interpret data.
1. Descriptive Statistics:
o Measures such as mean, median, mode, standard deviation, and variance help
in understanding the central tendency and dispersion of the data (a small sketch
follows this list).
2. Inferential Statistics:
o Techniques such as hypothesis testing, confidence intervals, and regression make it
possible to draw conclusions about a population from a sample.
3. Data Cleaning and Validation:
o Statistical methods help in imputing missing data and validating the integrity of
the data.
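As a small sketch of the descriptive measures listed above (the sample data is assumed):
import numpy as np

data = [14, 9, 13, 16, 25, 7, 12]      # made-up sample
print("mean:", np.mean(data))
print("median:", np.median(data))
print("std dev:", np.std(data))
print("variance:", np.var(data))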
c) Explain the methods of handling missing values in a dataset
Sol:
1. Imputation:
• Mean/Median Imputation: Replace missing values with the mean or median of the
non-missing values in the same column.
• Mode Imputation: Replace missing categorical values with the mode (most frequent
value) of the column (see the sketch after this answer).
2. Deletion:
• Listwise Deletion: Remove entire rows that contain any missing values.
Example: data.dropna(inplace=True)
• Pairwise Deletion: Remove only the specific missing values without deleting entire
rows, often used in correlation and regression analysis.
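A minimal sketch of mean and mode imputation with pandas (column names and values are assumptions):
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 32, 41],
                   "city": ["Pune", "Delhi", None, "Delhi"]})

df["age"] = df["age"].fillna(df["age"].mean())        # mean imputation
df["city"] = df["city"].fillna(df["city"].mode()[0])  # mode imputation
print(df)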
d) Explain any two tools in the data scientist's toolbox
Sol:
1. Jupyter Notebook:
• An open-source web application for creating and sharing documents that combine
live code, equations, visualizations, and narrative text.
• Widely used for data cleaning, transformation, visualization, and machine learning
model development.
2. TensorFlow:
• An open-source machine learning framework developed by Google.
• Supports deep learning and neural networks, with extensive tools for model
building, training, and deployment.
e) Write a short note on word clouds
Sol:
Word clouds are visual representations of text data where the importance or frequency of
each word is shown with varying sizes and colors. The more frequently a word appears in
the text data, the larger and more prominent it is displayed in the word cloud.
• Purpose:
o To quickly identify the most common words or themes in a large text dataset.
• Construction:
o Words are extracted from the text data, and their frequencies are calculated.
text = "Data science is a multi-disciplinary field that uses scientific methods, processes,
algorithms and systems to extract knowledge and insights from structured and
unstructured data."
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Q3) Attempt any TWO of the following:
a) Explain the data science life cycle with a diagram
Sol:
1. Problem Definition: Understand and define the problem you want to solve.
2. Data Collection: Gather relevant data from various sources.
3. Data Cleaning: Process and clean the data to handle missing values and
inconsistencies.
4. Data Exploration: Perform exploratory data analysis (EDA) to uncover patterns and
insights.
5. Data Modeling: Apply statistical and machine learning models to the data.
6. Model Evaluation: Assess model performance using appropriate metrics.
7. Deployment: Put the validated model into production.
8. Monitoring and Maintenance: Track the model's performance over time and update
it as needed.
Diagram:
[Problem Definition] -> [Data Collection] -> [Data Cleaning] -> [Data Exploration] ->
[Data Modeling] -> [Model Evaluation] -> [Deployment] -> [Monitoring and Maintenance]
b) Explain the concept and use of data visualization
Sol:
Data visualization is the graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualization tools provide an accessible way
to see and understand trends, outliers, and patterns in data. The primary goal is to make
complex data more understandable and interpretable.
Uses:
1. Making Informed Decisions:
o By converting data into visual formats, data visualization aids in making informed
decisions. It helps stakeholders to see the bigger picture and make data-driven
decisions.
2. Communicating Insights:
o Visualizations are powerful tools for communicating data insights to others. They
make it easier to convey information to non-technical audiences, making data-driven
discussions more effective.
3. Identifying Outliers and Anomalies:
o Data visualization makes it easy to spot outliers and anomalies that might indicate
issues or opportunities. For example, a spike in a line chart might highlight a
significant event that needs further investigation.
Examples of Data Visualization Tools:
• Tableau: A popular data visualization tool that enables users to create interactive
and shareable dashboards.
• Matplotlib: A plotting library for Python that is widely used for creating static,
interactive, and animated visualizations.
Examples of Visualizations:
• Heatmaps: Represent data values as colors, useful for identifying high and low
values.
• Bar Charts: Compare values across discrete categories.
Example:
import matplotlib.pyplot as plt

# Sample data (values assumed for illustration)
categories = ['A', 'B', 'C', 'D']
values = [10, 24, 36, 18]

plt.bar(categories, values)
plt.xlabel('Categories')
plt.ylabel('Values')
plt.show()
c) Find the variance of the data: 14, 9, 13, 16, 25, 7, 12
Sol:
Mean = (14 + 9 + 13 + 16 + 25 + 7 + 12) / 7 = 96 / 7 ≈ 13.71
Variance = ((14 - 13.71)² + (9 - 13.71)² + (13 - 13.71)² + (16 - 13.71)² + (25 - 13.71)² +
(7 - 13.71)² + (12 - 13.71)²) / 7
= (0.08 + 22.18 + 0.50 + 5.24 + 127.46 + 45.02 + 2.92) / 7
≈ 203.40 / 7 ≈ 29.06
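The result can be checked with NumPy, whose var function computes the same population variance:
import numpy as np

data = [14, 9, 13, 16, 25, 7, 12]
print(np.var(data))   # ~29.06 (divides by n, matching the calculation above)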
Q4) Attempt any TWO of the following:
a) Explain the procedure of hypothesis testing
Sol:
1. Formulating Null and Alternative Hypotheses: The null hypothesis (H0) represents
the default assumption, while the alternative hypothesis (H1) represents the
assertion being tested.
2. Choosing a Significance Level (α): The threshold (commonly 0.05) for deciding
when to reject H0.
3. Calculating a Test Statistic: Based on sample data, a test statistic (e.g., t-test, z-test)
is computed.
4. Making a Decision: Comparing the test statistic to a critical value or using the p-
value to decide whether to reject or fail to reject the null hypothesis.
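A minimal sketch of these steps with SciPy's one-sample t-test (the sample values and the hypothesized mean of 15 are assumptions):
from scipy import stats

sample = [14, 9, 13, 16, 25, 7, 12]               # made-up sample
t_stat, p_value = stats.ttest_1samp(sample, popmean=15)

alpha = 0.05                                      # chosen significance level
print("t =", t_stat, ", p =", p_value)
print("Reject H0" if p_value < alpha else "Fail to reject H0")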
b) Differentiate between structured data and unstructured data
Sol:
• Structured data: Organized according to a predefined schema of rows and columns
(e.g., relational database tables, spreadsheets). Easy to store, search, and analyze
with tools such as SQL.
• Unstructured data: Has no predefined format or data model (e.g., free text, images,
audio, video, social media posts). Requires specialized techniques such as text
mining or image processing to analyze.
c) Explain the data visualization libraries in Python
Sol:
Python offers several powerful libraries for data visualization. Some of the most commonly
used libraries are:
1. Matplotlib:
o Description: Matplotlib is one of the most widely used data visualization libraries in
Python. It provides a flexible platform for creating static, animated, and interactive
visualizations.
o Capabilities: Line plots, scatter plots, bar charts, histograms, pie charts, and more.
o Example:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]   # sample values (assumed)

plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
2. Seaborn:
o Description: Seaborn is built on top of Matplotlib and provides a high-level interface
for drawing attractive statistical graphics.
o Capabilities: Enhanced visualizations, including heatmaps, violin plots, box plots, and
pair plots.
o Example:
import seaborn as sns
import matplotlib.pyplot as plt

data = sns.load_dataset("iris")
sns.pairplot(data, hue="species")
plt.show()
3. Plotly:
o Description: Plotly is a library for creating interactive, web-based visualizations.
o Example:
import plotly.express as px

df = px.data.iris()
fig = px.scatter(df, x='sepal_width', y='sepal_length',
                 color='species')   # example chart choice
fig.show()
4. Bokeh:
o Description: Bokeh creates interactive visualizations that render in modern web
browsers.
o Example:
from bokeh.plotting import figure, show

p = figure(title="Simple line plot", x_axis_label='x', y_axis_label='y')
p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], line_width=2)  # sample values (assumed)
show(p)
5. Altair:
o Description: Altair is a declarative statistical visualization library based on Vega-Lite.
o Example:
import altair as alt
import pandas as pd

data = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],   # sample x-values (assumed)
    'b': [3, 4, 5, 6, 7]
})
chart = alt.Chart(data).mark_line().encode(
    x='a',
    y='b'
)
chart.show()   # requires a viewer/renderer such as altair_viewer
Q5) Attempt any ONE of the following:
a) What is data normalization? Explain with an example.
Sol:
Data normalization is a data transformation technique that rescales numeric attribute
values to a common range, such as [0, 1], so that attributes with large ranges do not
dominate those with smaller ranges.
Example:
from sklearn.preprocessing import MinMaxScaler
import numpy as np

data = np.array([[1.0], [5.0], [10.0]])       # sample values (assumed)
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)  # rescales each column into [0, 1]
print(normalized_data)
b) State the applications of data science
Sol:
2. Finance: Data science is applied for fraud detection, credit risk assessment,
algorithmic trading, and customer segmentation. Financial institutions use data
science to analyze transaction data and detect fraudulent activities.
c) What is an outlier? How can it be detected?
Sol:
Z-Score Outliers: An outlier can be identified using the Z-score method, which measures
the number of standard deviations a data point is from the mean. If a data point's Z-score is
greater than a certain threshold (commonly 3 or -3), it is considered an outlier.
Example:
import numpy as np

data = [10, 12, 14, 15, 18, 21, 100]  # 100 is an outlier
mean = np.mean(data)
std_dev = np.std(data)
z_scores = [(x - mean) / std_dev for x in data]

# Threshold of 2 used here: with only 7 points, the outlier's Z-score
# (about 2.4) stays below the common threshold of 3.
outliers = [x for x, z in zip(data, z_scores) if abs(z) > 2]
print("Outliers:", outliers)