FDS - 1 SOLVED

TYBCS Foundations of Data Science: solved question paper

Q1) Attempt any EIGHT of the following:

a) Define volume characteristic of data in reference to data science

Sol:

The volume characteristic of data refers to the sheer amount of data generated and stored
in data systems. It is a key aspect of Big Data, signifying the massive quantities of data that
organizations collect, process, and analyze. The volume characteristic highlights the need
for scalable storage solutions and advanced data processing techniques to handle large
datasets efficiently.

b) Give examples of semi-structured data

Sol:

Examples of Semi-structured Data:

1. XML (eXtensible Markup Language):

o XML files contain data in a hierarchical structure using custom tags.

2. JSON (JavaScript Object Notation):

o JSON is a lightweight data interchange format that is easy for humans to read
and write, and easy for machines to parse and generate.
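For instance, a small JSON record can be parsed in Python with the standard json module (a minimal illustration; the field names are made up):

import json

# a semi-structured record: fields are named, but there is no fixed schema
record = '{"name": "Asha", "age": 25, "skills": ["Python", "SQL"]}'

parsed = json.loads(record)          # parse the JSON string into a Python dict
print(parsed["name"], parsed["skills"])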

c) Define Data Discretization

Sol:

Data discretization is the process of converting continuous data attributes into discrete
intervals or categories. This technique is used to simplify data analysis and mining by
reducing the number of possible values, making it easier to identify patterns and trends.
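As a small sketch, continuous ages can be binned into discrete categories with pandas (the bin edges and labels here are illustrative):

import pandas as pd

ages = pd.Series([5, 17, 23, 41, 58, 72])
# discretize the continuous values into three labelled intervals
groups = pd.cut(ages, bins=[0, 18, 60, 100], labels=['child', 'adult', 'senior'])
print(groups)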
d) What is a quartile?

Sol:

A quartile is a type of quantile that divides a dataset into four equal parts. The three
quartiles are:

1. First Quartile (Q1): The median of the lower half of the dataset.

2. Second Quartile (Q2): The median of the dataset.

3. Third Quartile (Q3): The median of the upper half of the dataset.
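Quartiles can be computed with NumPy's percentile function (a small illustration with made-up values; note that NumPy's default interpolation can give slightly different results than the median-of-halves definition above):

import numpy as np

data = [7, 9, 12, 13, 14, 16, 25]
q1, q2, q3 = np.percentile(data, [25, 50, 75])   # 25th, 50th and 75th percentiles
print(q1, q2, q3)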

e) List different types of attributes

Sol:

1. Nominal Attributes: Categorical data with no intrinsic ordering (e.g., gender, colors).

2. Ordinal Attributes: Categorical data with an intrinsic ordering (e.g., rankings, grades).

3. Interval Attributes: Numeric data with meaningful differences but no true zero point
(e.g., temperature in Celsius).

4. Ratio Attributes: Numeric data with meaningful differences and a true zero point (e.g.,
height, weight).

f) Define Data object

Sol:

A data object is an entity that encapsulates data and its associated attributes. It represents
a single record in a dataset, which can include various attributes describing the object. For
example, in a customer dataset, each data object represents an individual customer with
attributes like name, age, and address.
g) What is Data Transformation?

Sol:

Data transformation is the process of converting data from one format or structure into
another. This process includes various techniques such as normalization, aggregation, and
scaling to prepare data for analysis, improve its quality, and ensure consistency across
datasets.
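As a small illustration of one such technique, aggregation with pandas summarises many rows into one value per group (the column names here are hypothetical):

import pandas as pd

sales = pd.DataFrame({'region': ['East', 'East', 'West', 'West'],
                      'amount': [100, 150, 200, 250]})
# aggregate: total amount per region
totals = sales.groupby('region')['amount'].sum()
print(totals)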

h) Write the tools used for geospatial data

Sol:

• ArcGIS: A powerful platform for working with maps and geographic information. It
allows users to create, share, and analyze spatial data.

• QGIS (Quantum GIS): An open-source GIS application that provides data visualization, editing, and analysis capabilities.

i) State the methods of feature selection

Sol:

1. Filter Methods: Use statistical techniques to evaluate the relevance of features independently of any learning algorithm (e.g., Chi-square test, correlation coefficient); a small sketch follows this list.

2. Wrapper Methods: Use a predictive model to evaluate combinations of features and select the best subset (e.g., Recursive Feature Elimination).

3. Embedded Methods: Perform feature selection during the model training process (e.g., Lasso, Ridge Regression).
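A minimal sketch of a filter method using scikit-learn's SelectKBest with the chi-square test (the dataset and value of k are illustrative):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
# keep the two features most associated with the target according to the chi-square test
X_selected = SelectKBest(chi2, k=2).fit_transform(X, y)
print(X_selected.shape)   # (150, 2)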
j) List any two libraries used in Python for data analysis

Sol:

• Pandas: This library provides data structures and data analysis tools for handling
and manipulating numerical tables and time series data. It is particularly useful for
data wrangling and transformation, allowing for easy data cleaning, preparation, and
aggregation.
• NumPy: This library supports large, multi-dimensional arrays and matrices, along
with a large collection of high-level mathematical functions to operate on these
arrays. It is essential for scientific computing and is often used as a foundation for
other data analysis libraries.
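A brief illustration of the two libraries together (the values are made up):

import numpy as np
import pandas as pd

arr = np.array([1, 2, 3, 4])                    # NumPy array with vectorised operations
df = pd.DataFrame({'x': arr, 'y': arr ** 2})    # Pandas DataFrame built from the array
print(arr.mean())
print(df.describe())                            # summary statistics for each column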

Q2) Attempt any FOUR of the following:

a) Explain any two ways in which data is stored in files

Sol:

1. Text Files:

• Format: Plain text with no special formatting. Data is stored as sequences of characters, typically in ASCII or UTF-8 encoding.

• Usage: Commonly used for logs, configuration files, and simple data storage.

• Example:

John, 25, New York

Jane, 30, Los Angeles

• Advantages: Easy to read and edit using any text editor. Lightweight and compatible
with most software.

• Disadvantages: Limited in handling complex data structures and large datasets efficiently.
2. Binary Files:

• Format: Stores data in binary format, which is not human-readable. Data is encoded
in bytes.

• Usage: Used for storing data that requires efficient read/write operations, such as
images, executable files, and custom data structures.

• Example: (Binary data representation of an image or a serialized object)

• Advantages: More compact and efficient for large datasets. Can store complex data
structures.

• Disadvantages: Not human-readable. Requires specific software or tools to interpret the data.
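The difference can be seen in Python, where text files are written as strings and binary files as bytes (a minimal sketch; the file names are hypothetical):

# text file: human-readable lines
with open('people.txt', 'w', encoding='utf-8') as f:
    f.write('John, 25, New York\n')
    f.write('Jane, 30, Los Angeles\n')

# binary file: raw bytes, not human-readable
with open('data.bin', 'wb') as f:
    f.write(bytes([0x89, 0x50, 0x4E, 0x47]))   # e.g., the first bytes of a PNG header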

b) Explain role of statistics in data science

Sol:

Statistics plays a crucial role in data science by providing the tools and methodologies
needed to analyze and interpret data.

Key roles include:

1. Descriptive Statistics:

o Summarizes and describes the main features of a dataset.

o Measures such as mean, median, mode, standard deviation, and variance help
in understanding the central tendency and dispersion of the data.

2. Inferential Statistics:

o Allows making predictions or inferences about a population based on a sample of data.

o Techniques like hypothesis testing, confidence intervals, and regression analysis are used to draw conclusions and make decisions based on data.

3. Data Analysis and Modeling:

o Helps in identifying patterns, trends, and relationships within the data.

o Statistical models are used to predict future outcomes and optimize processes, such as in machine learning algorithms.

4. Data Validation and Cleaning:

o Ensures data quality by detecting outliers, missing values, and inconsistencies.

o Statistical methods help in imputing missing data and validating the integrity of the data.
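As a small illustration of the descriptive side (point 1 above), common summary measures can be computed with NumPy (the values are made up):

import numpy as np

marks = [45, 52, 61, 68, 74, 80, 95]
print(np.mean(marks))     # central tendency: mean
print(np.median(marks))   # central tendency: median
print(np.std(marks))      # dispersion: standard deviation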

c) Explain two methods of data cleaning for missing values

Sol:

1. Imputation:

• Mean/Median Imputation: Replace missing values with the mean or median of the
non-missing values in the same column.

Example: - data['column'].fillna(data['column'].mean(), inplace=True)

• Mode Imputation: Replace missing categorical values with the mode (most frequent
value) of the column.

Example: - data['column'].fillna(data['column'].mode()[0], inplace=True)

2. Deletion:

• Listwise Deletion: Remove entire rows that contain any missing values.

Example: - data.dropna(inplace=True)

• Pairwise Deletion: Remove only the specific missing values without deleting entire
rows, often used in correlation and regression analysis.
d) Explain any two tools in data scientist tool box

Sol:

1. Jupyter Notebook:

• An open-source web application that allows creating and sharing documents containing live code, equations, visualizations, and narrative text.

• Widely used for data cleaning, transformation, visualization, and machine learning
model development.

• Supports multiple programming languages, including Python, R, and Julia.

2. TensorFlow:

• An open-source machine learning framework developed by Google.

• Provides a comprehensive ecosystem for building and deploying machine learning models.

• Supports deep learning and neural networks, with extensive tools for model
building, training, and deployment.
e) Write a short note on word clouds

Sol:

Word clouds are visual representations of text data where the importance or frequency of
each word is shown with varying sizes and colors. The more frequently a word appears in
the text data, the larger and more prominent it is displayed in the word cloud.

• Purpose:

o To quickly identify the most common words or themes in a large text dataset.

o Used for exploratory data analysis, text summarization, and presentations.

• Construction:

o Words are extracted from the text data, and their frequencies are calculated.

o Visualization libraries, such as WordCloud in Python, generate the word cloud based on these frequencies.

• Example Code (Python):

from wordcloud import WordCloud

import matplotlib.pyplot as plt

text = "Data science is a multi-disciplinary field that uses scientific methods, processes,
algorithms and systems to extract knowledge and insights from structured and
unstructured data."

wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

plt.figure(figsize=(10, 5))

plt.imshow(wordcloud, interpolation='bilinear')

plt.axis('off')

plt.show()
Q3) Attempt any TWO of the following:

a) Explain data science life cycle with suitable diagram

Sol:

Data Science Life Cycle:

1. Problem Definition: Understand and define the problem you want to solve.

2. Data Collection: Gather relevant data from various sources.

3. Data Cleaning: Process and clean the data to handle missing values and
inconsistencies.

4. Data Exploration: Perform exploratory data analysis (EDA) to uncover patterns and
insights.

5. Data Modeling: Apply statistical and machine learning models to the data.

6. Model Evaluation: Evaluate the model's performance using appropriate metrics.

7. Deployment: Deploy the model to a production environment.

8. Monitoring and Maintenance: Continuously monitor and maintain the model to ensure its effectiveness over time.

Diagram:

[Problem Definition] -> [Data Collection] -> [Data Cleaning] -> [Data Exploration] -> [Data Modeling] -> [Model Evaluation] -> [Deployment] -> [Monitoring and Maintenance]
b) Explain the concept and use of data visualization

Sol:

Data visualization is the graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualization tools provide an accessible way
to see and understand trends, outliers, and patterns in data. The primary goal is to make
complex data more understandable and interpretable.

Uses:

1. Simplifying Complex Data:

o Data visualization helps in understanding complex data by presenting it in a visual context. This makes it easier to grasp insights and relationships within the data, which might be missed if the data were presented in a tabular or textual format.

2. Identifying Trends and Patterns:

o Visual representations allow users to quickly identify trends, correlations, and patterns. For example, line charts can show trends over time, while scatter plots can reveal relationships between variables.

3. Facilitating Decision Making:

o By converting data into visual formats, data visualization aids in making informed
decisions. It helps stakeholders to see the bigger picture and make data-driven
decisions.

4. Communicating Insights:

o Visualizations are powerful tools for communicating data insights to others. They
make it easier to convey information to non-technical audiences, making data-driven
discussions more effective.

5. Spotting Outliers and Anomalies:

o Data visualization makes it easy to spot outliers and anomalies that might indicate
issues or opportunities. For example, a spike in a line chart might highlight a
significant event that needs further investigation.
Examples of Data Visualization Tools:

• Tableau: A popular data visualization tool that enables users to create interactive
and shareable dashboards.

• Power BI: A business analytics tool by Microsoft that provides interactive visualizations and business intelligence capabilities.

• Matplotlib: A plotting library for Python that is widely used for creating static,
interactive, and animated visualizations.

Examples of Visualizations:

• Bar Charts: Useful for comparing quantities across different categories.

• Pie Charts: Ideal for showing proportions and percentages.

• Heatmaps: Represent data values as colors, useful for identifying high and low
values.

• Histograms: Display the distribution of data over a continuous interval.

Example Code (Python using Matplotlib):

import matplotlib.pyplot as plt

# Sample data

categories = ['A', 'B', 'C', 'D']

values = [10, 15, 7, 10]

# Create a bar chart

plt.bar(categories, values)

plt.xlabel('Categories')

plt.ylabel('Values')

plt.title('Sample Bar Chart')

plt.show()


c) Calculate the variance and standard deviation for the following data: X: 14, 9, 13, 16,
25, 7, 12

Sol:

1. Calculate the mean (average) of the data:

Mean (μ) = (14 + 9 + 13 + 16 + 25 + 7 + 12) / 7 = 96 / 7 = 13.71

2. Calculate the variance:

Variance = [(14 − 13.71)² + (9 − 13.71)² + (13 − 13.71)² + (16 − 13.71)² + (25 − 13.71)² + (7 − 13.71)² + (12 − 13.71)²] / 7

= (0.0841 + 22.1841 + 0.5041 + 5.2441 + 127.4641 + 45.0241 + 2.9241) / 7 = 203.4287 / 7

Variance ≈ 29.06

3. Calculate the standard deviation:

Standard Deviation = √Variance = √29.06 ≈ 5.39
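The same result can be checked in Python with NumPy (population variance, i.e., dividing by n):

import numpy as np

x = [14, 9, 13, 16, 25, 7, 12]
print(np.var(x))   # population variance, about 29.06
print(np.std(x))   # population standard deviation, about 5.39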

Q4) Attempt any TWO of the following:

a) Write short note on hypothesis testing

Sol:

Hypothesis testing is a statistical method used to make inferences or draw conclusions about a population based on sample data. It involves:

1. Formulating Null and Alternative Hypotheses: The null hypothesis (H0) represents
the default assumption, while the alternative hypothesis (H1) represents the
assertion being tested.

2. Selecting a Significance Level (α): Typically set at 0.05, it represents the probability of rejecting the null hypothesis when it is true.

3. Calculating a Test Statistic: Based on sample data, a test statistic (e.g., t-test, z-test) is computed.

4. Making a Decision: Comparing the test statistic to a critical value or using the p-value to decide whether to reject or fail to reject the null hypothesis.
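A minimal sketch of a one-sample t-test with SciPy (the sample values and hypothesised mean are made up):

from scipy import stats

sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2]
# H0: the population mean is 12.0; H1: it is not
t_stat, p_value = stats.ttest_1samp(sample, popmean=12.0)
print(t_stat, p_value)
# reject H0 at the 0.05 level only if p_value < 0.05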
b) Differentiate between structured data and unstructured data

Sol:

Feature-wise comparison of structured and unstructured data:

• Definition: Structured data is organized and formatted in a fixed schema or structure; unstructured data lacks a predefined format or organization.

• Storage: Structured data is stored in relational databases, spreadsheets, and data warehouses; unstructured data is stored in data lakes, NoSQL databases, and file systems.

• Examples: Structured data includes tables in databases, Excel sheets, and SQL databases; unstructured data includes text documents, images, videos, emails, and social media posts.

• Querying: Structured data is easily queried using SQL and other structured query languages; unstructured data requires more complex processing and analytics tools.

• Ease of Analysis: Structured data is easier to analyze due to its well-defined structure; unstructured data is more challenging to analyze and requires advanced processing tools.

• Data Types: Structured data is typically numeric or categorical; unstructured data can include text, multimedia, and mixed types.

• Data Size: Structured data is usually smaller and manageable; unstructured data is often larger in size and complexity.

• Flexibility: Structured data is less flexible, as the schema must be defined beforehand; unstructured data is more flexible and can accommodate various types of data without a predefined schema.
c) Explain data visualization libraries in Python

Sol:

Python offers several powerful libraries for data visualization. Some of the most commonly
used libraries are:

1. Matplotlib:

o Description: Matplotlib is one of the most widely used data visualization libraries in
Python. It provides a flexible platform for creating static, animated, and interactive
visualizations.

o Capabilities: Line plots, scatter plots, bar charts, histograms, pie charts, and more.

o Example:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]

y = [10, 20, 25, 30, 35]

plt.plot(x, y)

plt.xlabel('X-axis')

plt.ylabel('Y-axis')

plt.title('Sample Line Plot')

plt.show()

2. Seaborn:

o Description: Built on top of Matplotlib, Seaborn offers a high-level interface for drawing attractive and informative statistical graphics.

o Capabilities: Enhanced visualizations, including heatmaps, violin plots, box plots, and
pair plots.

o Example:

import seaborn as sns

import matplotlib.pyplot as plt

data = sns.load_dataset("iris")
sns.pairplot(data, hue="species")

plt.show()

3. Plotly:

o Description: Plotly is an interactive, open-source plotting library that supports a wide range of visualization types and interactive features.

o Capabilities: 3D plots, geographic maps, interactive charts, and dashboards.

o Example:

import plotly.express as px

df = px.data.iris()

fig = px.scatter(df, x='sepal_width', y='sepal_length', color='species')

fig.show()

4. Bokeh:

o Description: Bokeh provides an elegant and concise way to create interactive visualizations for modern web browsers.

o Capabilities: Interactive plots, dashboards, and data applications.

o Example:

from bokeh.plotting import figure, show

p = figure(title="Bokeh Plot Example", x_axis_label='x', y_axis_label='y')

p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], legend_label="Line", line_width=2)

show(p)

5. Altair:

o Description: Altair is a declarative statistical visualization library based on Vega and Vega-Lite, designed for creating simple yet powerful visualizations.

o Capabilities: Interactive visualizations with concise code.

o Example:
import altair as alt

import pandas as pd

data = pd.DataFrame({

'a': [1, 2, 3, 4, 5],

'b': [3, 4, 5, 6, 7]

})

chart = alt.Chart(data).mark_line().encode(
    x='a',
    y='b'
)

chart.show()  # opens the chart in a viewer; in a Jupyter notebook, evaluating `chart` renders it

Q5) Attempt any ONE of the following:

a) i) Define data science

Sol:

Data science is an interdisciplinary field that combines domain expertise, programming skills, and knowledge of mathematics and statistics to extract meaningful insights from data. It involves processes such as data collection, data cleaning, data analysis, data visualization, and the application of machine learning algorithms to solve complex problems and make data-driven decisions.

a) ii) Explain any one technique of data transformation

Sol:

Normalization is a data transformation technique used to scale numeric data to a standard range, typically between 0 and 1. It helps in bringing all features to the same scale, which can improve the performance of machine learning algorithms. Normalization is commonly used when the data contains features with different units or scales.

Example:

from sklearn.preprocessing import MinMaxScaler


data = [[100], [200], [300], [400], [500]]

scaler = MinMaxScaler()

normalized_data = scaler.fit_transform(data)

print(normalized_data)

b) i) Write any two applications of data science

Sol:

1. Healthcare: Data science is used for predictive analytics, disease diagnosis, personalized treatment plans, and drug discovery. For example, machine learning models can predict patient outcomes and identify high-risk patients.

2. Finance: Data science is applied for fraud detection, credit risk assessment,
algorithmic trading, and customer segmentation. Financial institutions use data
science to analyze transaction data and detect fraudulent activities.

b) ii) Explain any one type of outliers in detail

Sol:

Z-Score Outliers: An outlier can be identified using the Z-score method, which measures
the number of standard deviations a data point is from the mean. If a data point's Z-score is
greater than a certain threshold (commonly 3 or -3), it is considered an outlier.

Example:

import numpy as np

data = [10, 12, 14, 15, 18, 21, 100] # 100 is an outlier

mean = np.mean(data)

std_dev = np.std(data)

z_scores = [(x - mean) / std_dev for x in data]

# with only seven values the z-score of 100 is about 2.4, so a threshold of 2 is used here;
# a cut-off of 3 is more common for larger samples
outliers = [x for x, z in zip(data, z_scores) if np.abs(z) > 2]

print("Outliers:", outliers)
