Answer Key Split Up Fds

The document outlines the course details for 'Foundation of Data Science' at Indra Ganesan College of Engineering, including course outcomes and examination structure. It covers key concepts in data science, such as data types, data mining, visualization, and the data science process, along with specific questions and answers related to these topics. Additionally, it provides practical applications of statistical measures and data analysis techniques.


Register Number:

INDRA GANESAN COLLEGE OF ENGINEERING


(AN AUTONOMOUS INSTITUTION)
IG Valley, Manikandam, Tiruchirappalli, Tamil Nadu – 620 012, India
(Approved by AICTE, New Delhi and affiliated to Anna University, Chennai)
University Question        Date: 30.11.2024         Marks: 100
Course Code: CCS3352       Course Title: Foundation of Data Science
Regulation: 2021           Duration: 3 hrs          Academic Year: 2024-2025
Year: II                   Semester: III            Department: IT
COURSE OUTCOMES
CO1 To understand the data science fundamentals and process.
CO2 To learn to describe the data for the data science process.
CO3 To learn to describe the relationship between data.
CO4 To utilize the Python libraries for Data Wrangling.
CO5 To present and interpret data using visualization libraries in Python
PART A
(Answer all the Questions 10 x 2 = 20 Marks)
1 Outline the difference between structured data and unstructured data
Definition: Structured data refers to data that is highly organized and stored in a fixed format, typically within
relational databases or spreadsheets. It follows a clear structure, usually represented by rows and columns.
Definition: Unstructured data refers to data that does not follow a specific format or organization. It can include
a wide range of information that is not easily categorized or stored in traditional databases.

2 Define data mining

Data mining is the process of discovering patterns, correlations, anomalies and other useful information in large datasets by applying statistical, machine-learning and database techniques. It turns raw data into actionable insight and is a core step of the data science process.
3 Compare and contrast qualitative data and quantitative data with examples
Qualitative Data: describes categories or qualities rather than numbers. Nominal data, for instance, categorizes variables without any order (e.g., gender, eye colour, city of residence).
Quantitative Data: describes numeric quantities. Discrete data, for instance, can take only specific, countable values (e.g., number of children, number of transactions).

4 List the difference between a discrete variable and a continuous variable with an example
 Discrete Variable:
o A discrete variable takes on distinct, separate values; the values are countable (e.g., the number of students in a class).
 Continuous Variable:
o A continuous variable can take on any value within a range and can be measured to any degree of precision (e.g., height or temperature).

5 What is the use of a scatter plot?


A scatter plot (also known as a scatter chart or scatter graph) is a graphical representation used to display the
relationship between two continuous variables. Each point in the scatter plot represents an individual data point
based on two variables.
6 Summarize correlation coefficient
The correlation coefficient is a statistical measure that quantifies the strength and direction of the relationship
between two variables. It is commonly denoted as r and can range from -1 to +1. It helps to understand whether
and how strongly two variables are related.
7 What are the advantages of using NumPy arrays?
NumPy arrays are designed for handling large datasets more efficiently. Whether you're working with large
matrices, scientific data, or massive numerical datasets, NumPy provides high-performance capabilities that can
handle arrays with millions of elements without running into memory issues.
8 Explain the two types of NumPy arrays
1. 1D Arrays (One-Dimensional Arrays)
2. 2D Arrays (Two-Dimensional Arrays)
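A short sketch of the two array types (the example values are arbitrary):

```python
import numpy as np

# 1D array: a single axis of values
a1 = np.array([10, 20, 30, 40])
print(a1.ndim, a1.shape)   # 1 (4,)

# 2D array: rows and columns, like a matrix
a2 = np.array([[1, 2, 3],
               [4, 5, 6]])
print(a2.ndim, a2.shape)   # 2 (2, 3)
```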

9 Identify the two possible options in IPython notebook used to embed graphics directly in the notebook
1. %matplotlib inline
 Description: This magic command embeds static graphics (such as plots and charts) directly in the notebook. When it is run, any plots generated by Matplotlib are automatically displayed inline (i.e., directly below the code cell that produced them).

2. %matplotlib notebook
 Description: This magic command provides interactive graphics within the notebook. Unlike
%matplotlib inline, which creates static images, %matplotlib notebook creates interactive plots that
allow zooming, panning, and other interactive features.

10 How does the scatter() function differ from the plot() function?


 Purpose: The scatter() function creates scatter plots, which represent data as individual, unconnected points on a two-dimensional plane; it is best for visualizing the relationship between two variables. The plot() function, by contrast, draws the points joined by line segments in x order, which suits ordered data such as trends over time.
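A minimal side-by-side sketch of the two functions (the data values are made up for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(x, y)           # unconnected points: relationship between x and y
ax1.set_title("scatter()")
ax2.plot(x, y)              # points joined by line segments: trend over x
ax2.set_title("plot()")
fig.savefig("scatter_vs_plot.png")
```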
PART B
(Answer all the Questions 5 x 13 = 65 Marks)
11a Explain about the steps in the data science process with a diagram
Steps in the Data Science Process
1. Problem Definition
2. Data Collection
3. Data Cleaning (Data Preprocessing)
4. Exploratory Data Analysis (EDA)
5. Feature Engineering
6. Modeling
7. Model Evaluation
8. Deployment
9. Monitoring and Maintenance

Diagram of the Data Science Process


Below is a high-level representation of the steps involved in the data science process, along with their typical
flow:
+-----------------------+
|  Problem Definition   |
| - Define the problem  |
| - Align with business |
|   objectives          |
+-----------------------+
           |
           v
+-----------------------+
|   Data Collection     |
| - Gather the data     |
| - External/internal   |
|   data                |
+-----------------------+
           |
           v
+-----------------------+
|    Data Cleaning      |
| - Handle missing data |
| - Remove duplicates   |
| - Format corrections  |
+-----------------------+
|
v
+-----------------------+
| Exploratory Data |
| Analysis (EDA) |
| - Visualize data |
| - Identify trends |
| - Detect outliers |
+-----------------------+
|
v
+-----------------------+
| Feature Engineering |
| - Create new features |
| - Select relevant features|
+-----------------------+
|
v
+-----------------------+
| Modeling |
| - Choose model |
| - Train model |
| - Hyperparameter tuning|
+-----------------------+
|
v
+-----------------------+
| Model Evaluation |
| - Assess performance |
| - Use metrics |
| - Cross-validation |
+-----------------------+
|
v
+-----------------------+
| Deployment |
| - Deploy model |
| - Monitor performance |
+-----------------------+
|
v
+-----------------------+
| Monitoring and |
| Maintenance |
| - Track performance |
| - Retrain model |
+-----------------------+

OR
11b Explain about the steps in the data science process with a diagram
1. Problem Definition 2. Data Collection 3. Data Cleaning (Data Preprocessing) 4. Exploratory Data
Analysis (EDA) 5. Feature Engineering 6. Model Building 7. Model Evaluation 8. Deployment
9. Monitoring and Maintenance
Diagram of the Data Science Process
+-----------------------+
| Problem Definition |
| - Understand goals |
| - Define objectives |
+-----------------------+
|
v
+-----------------------+
| Data Collection |
| - Gather data |
| - Data from various |
| sources |
+-----------------------+
|
v
+-----------------------+
| Data Cleaning |
| - Handle missing data |
| - Remove duplicates |
| - Format corrections |
+-----------------------+
|
v
+-----------------------+
| Exploratory Data |
| Analysis (EDA) |
| - Visualize data |
| - Discover patterns |
| - Find correlations |
+-----------------------+
|
v
+-----------------------+
| Feature Engineering |
| - Create new features |
| - Select important |
| features |
+-----------------------+
|
v
+-----------------------+
| Model Building |
| - Train the model |
| - Choose algorithm |
+-----------------------+
|
v
+-----------------------+
| Model Evaluation |
| - Assess model |
| - Use metrics to |
| validate |
+-----------------------+
|
v
+-----------------------+
| Deployment |
| - Deploy model into |
| production |
+-----------------------+
|
v
+-----------------------+
| Monitoring & |
| Maintenance |
| - Track model's |
| performance |
| - Retrain if needed |
+-----------------------+
12a What is a frequency distribution? Customers who purchased a particular product rated its usability on a 10-point scale, ranging from 1 (Poor) to 10 (Excellent), as follows:
3 7 2 7 8
3 1 4 10 3
2 5 3 5 3
9 7 6 3 7
8 9 7 3 6

What is Frequency Distribution?
A frequency distribution is a way to organize and summarize data in a table, showing how often each unique
value or range of values (called "bins") appears in the dataset. It helps in understanding the spread, patterns, and
frequency of data points within a given dataset.
In the case of the product usability ratings you have provided (on a 10-point scale from 1 to 10), we can create a
frequency distribution by counting how many times each rating (from 1 to 10) occurs in the dataset.
Given Data:
The usability ratings provided by customers are:
3, 7, 2, 7, 8, 3, 1, 4, 10, 3, 2, 5, 3, 5, 3, 9, 7, 6, 3, 7, 8, 9, 7, 3, 6

Steps to create the Frequency Distribution:


1. List all the unique values (ratings from 1 to 10).
2. Count the frequency of each value in the dataset.
3. Create a table summarizing the counts of each rating.

Explanation:
 Rating 1: Appears 1 time.
 Rating 2: Appears 2 times.
 Rating 3: Appears 7 times.
 Rating 4: Appears 1 time.
 Rating 5: Appears 2 times.
 Rating 6: Appears 2 times.
 Rating 7: Appears 5 times.
 Rating 8: Appears 2 times.
 Rating 9: Appears 2 times.
 Rating 10: Appears 1 time.

Conclusion:
From the frequency distribution, you can quickly see that:
 The most common rating is 3, which occurred 7 times.
 The least common ratings are 1, 4, and 10, each appearing only once.
This distribution helps in understanding how customers rated the product's usability on a 10-point scale, and it
gives a clear picture of the overall satisfaction of the customers.
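The three counting steps above can be sketched with Python's collections.Counter (the counts below are computed from the data given in the question):

```python
from collections import Counter

ratings = [3, 7, 2, 7, 8, 3, 1, 4, 10, 3, 2, 5, 3, 5, 3,
           9, 7, 6, 3, 7, 8, 9, 7, 3, 6]

freq = Counter(ratings)
for rating in range(1, 11):
    print(rating, freq[rating])        # frequency table, ratings 1..10

print(freq.most_common(1))             # [(3, 7)] -> rating 3, 7 occurrences
```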

OR
12b (i) What is a Z-score? Outline the steps to obtain a Z-score.
(ii) Express each of the following scores as a Z-score: first, Mary's intelligence quotient is 135, given a mean of 100 and an SD of 15; second, Mary obtained a score of 470 in the competitive exam conducted in April 2022, given a mean of 500 and an SD of 100.

(i) What is a Z-score?

A Z-score (also known as a standard score or z-value) represents how many standard deviations a particular data point lies from the mean:

Z = (X − μ) / σ

Where:
 Z = Z-score
 X = the data point (or observation) for which the Z-score is being calculated
 μ = mean of the dataset
 σ = standard deviation of the dataset
The Z-score tells you whether the data point is above or below the mean, and by how many standard deviations.
 If Z > 0, the data point is above the mean.
 If Z < 0, the data point is below the mean.
 If Z = 0, the data point is exactly at the mean.

Steps to Calculate a Z-Score


To calculate a Z-score, follow these steps:
1. Find the mean of the dataset (μ)
2. Find the standard deviation of the dataset (σ)
3. Subtract the mean from the data point (X − μ)
4. Divide by the standard deviation (σ)
5. Interpret the Z-score
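For part (ii), the two scores can be standardized directly; a short sketch (the numeric results are computed here from the values in the question):

```python
def z_score(x, mean, sd):
    """Standard score: how many SDs x lies from the mean."""
    return (x - mean) / sd

# Mary's IQ: X = 135, mean = 100, SD = 15
z_iq = z_score(135, 100, 15)
print(round(z_iq, 2))    # 2.33 -> 2.33 SDs above the mean

# Mary's exam score: X = 470, mean = 500, SD = 100
z_exam = z_score(470, 500, 100)
print(round(z_exam, 2))  # -0.3 -> 0.3 SDs below the mean
```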
13a The values of X and their corresponding values of Y are presented below
X 66 68 68 70 71 72 72
Y 68 70 69 72 72 72 74

Given the values of X and Y, a natural task is to explore the relationship between these two variables. Common analyses include:
1. Correlation, to assess how strongly X and Y are related.
2. Regression, to model the relationship between X and Y.
3. Z-scores, for normalization of individual data points.
Let's work through the correlation between X and Y, as it is the usual request for such paired values.
Given Data:
X: 66 68 68 70 71 72 72
Y: 68 70 69 72 72 72 74
Step 1: Calculate the Mean of X and Y
Mean of X:
μ_X = (66 + 68 + 68 + 70 + 71 + 72 + 72) / 7 = 487 / 7 ≈ 69.57
Mean of Y:
μ_Y = (68 + 70 + 69 + 72 + 72 + 72 + 74) / 7 = 497 / 7 = 71.00

OR
13b The values of x and their corresponding values of y are presented below
X 0.5 1.5 2.5 3.5 4.5 5.5 6.5
Y 2.5 3.5 5.5 4.5 6.5 8.5 10.5
(i) Find the least squares regression line y = ax + b
(ii) Estimate the value of y when X = 10

To find the least squares regression line y = ax + b, we need to calculate the slope a and the intercept b using the following formulas:
1. Formula for the slope (a):
a = [n(ΣXY) − (ΣX)(ΣY)] / [n(ΣX²) − (ΣX)²]
2. Formula for the intercept (b):
b = [ΣY − a(ΣX)] / n
Where:
 n is the number of data points,
 X and Y are the individual data points,
 ΣX, ΣY, ΣXY, and ΣX² are the summations of X, Y, XY, and X² respectively.
Given Data:
X = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5]
Y = [2.5, 3.5, 5.5, 4.5, 6.5, 8.5, 10.5]
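Carrying the formulas through for this data (the estimate at X = 10 answers part (ii); the numeric results are computed here):

```python
X = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5]
Y = [2.5, 3.5, 5.5, 4.5, 6.5, 8.5, 10.5]
n = len(X)

sx  = sum(X)                            # ΣX  = 24.5
sy  = sum(Y)                            # ΣY  = 41.5
sxy = sum(x * y for x, y in zip(X, Y))  # ΣXY = 180.25
sxx = sum(x * x for x in X)             # ΣX² = 113.75

a = (n * sxy - sx * sy) / (n * sxx - sx**2)  # slope = 245/196 = 1.25
b = (sy - a * sx) / n                        # intercept = 10.875/7 ≈ 1.5536

y_at_10 = a * 10 + b
print(round(a, 4), round(b, 4), round(y_at_10, 2))  # 1.25 1.5536 14.05
```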

14a What is an aggregate function? Elaborate on the aggregate functions in NumPy

An aggregate function reduces an array to a single summary value. Common NumPy aggregates include:
1. numpy.sum()  2. numpy.mean()  3. numpy.median()  4. numpy.min()  5. numpy.max()  6. numpy.std()  7. numpy.var()  8. numpy.prod()  9. numpy.count_nonzero()  10. numpy.ptp() (peak to peak)
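A sketch of these aggregates applied to a small sample array (the values are arbitrary):

```python
import numpy as np

data = np.array([4, 8, 15, 16, 23, 42])

print(np.sum(data))                 # 108: total of all elements
print(np.median(data))              # 15.5: middle of the sorted values
print(np.min(data), np.max(data))   # 4 42
print(np.prod(data))                # 7418880: product of all elements
print(np.ptp(data))                 # 38: peak to peak (max - min)
print(np.count_nonzero(data))       # 6
print(round(np.std(data), 2))       # population standard deviation
```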
OR
14b What is broadcasting? Explain the rules of broadcasting with an example.
Summarize the mapping between Python operators and Pandas methods.

Broadcasting is a powerful feature in NumPy that allows arithmetic operations on arrays of different shapes and
sizes, without needing to explicitly replicate data. It involves "stretching" the smaller array across the larger
array, so they have compatible shapes. This allows for element-wise operations on arrays of different dimensions
without making copies of data, thus improving efficiency.
The rule of broadcasting is that the smaller array is "broadcast" over the larger one, so that they have compatible
shapes. This works as long as the dimensions of the arrays are compatible, and it happens according to certain
broadcasting rules.
Rules of Broadcasting
There are three main rules that govern broadcasting:
1. Rule 1: If the arrays have a different number of dimensions, pad the smaller-dimensional array
with ones on the left side until both arrays have the same number of dimensions.
Example:
o Shape of A is (3, 5) (a 2D array)
o Shape of B is (5,) (a 1D array)
o The 1D array B is treated as having shape (1, 5), with the leading dimension padded with 1. It
will be broadcast to match the shape (3, 5).
2. Rule 2: If the sizes of the dimensions are not the same, the dimension with size 1 is stretched to
match the size of the larger array.
Example:
o Shape of A is (3, 5)
o Shape of B is (1, 5)
o The 1st dimension of B is stretched to match the size of A, i.e., B is treated as having shape
(3, 5).
3. Rule 3: If the dimensions are not compatible (i.e., the sizes in corresponding dimensions are neither the same nor one of them is 1), broadcasting will fail.
Example:
o Shape of A is (3, 5)
o Shape of B is (4, 5)
o Broadcasting will fail because the sizes of the 1st dimension (3 vs. 4) are not compatible and
neither dimension is 1.
Example of Broadcasting
Let’s look at an example of broadcasting in action:
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])  # Shape (3, 3)
B = np.array([1, 2, 3])                          # Shape (3,)

# Adding the arrays
result = A + B
print(result)
Explanation:
 A is a 3x3 array, and B is a 1D array of length 3.
 Since the shape of B is (3,), it is automatically broadcast across the rows of A to make it compatible
with A's shape (3, 3).
 The result is:
[[ 2  4  6]
 [ 5  7  9]
 [ 8 10 12]]
In this case, B was "broadcast" across the rows of A.

Mapping Between Python Operators and Pandas Methods


Pandas, like NumPy, allows for efficient data manipulation and analysis. Below is a summary of how some of
the commonly used Python operators are mapped to Pandas methods for data frames (and Series).
1. Arithmetic Operators
Python Operator  Pandas Method  Description
+   .add()       Adds the values of two DataFrames/Series.
-   .sub()       Subtracts the values of two DataFrames/Series.
*   .mul()       Multiplies the values of two DataFrames/Series.
/   .div()       Divides the values of two DataFrames/Series.
//  .floordiv()  Performs floor division on two DataFrames/Series.
%   .mod()       Calculates the modulus (remainder of division).
**  .pow()       Raises one DataFrame/Series to the power of the other.
Example:
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]])
df2 = pd.DataFrame([[5, 6], [7, 8]])

result_add = df1.add(df2)  # Equivalent to df1 + df2
result_sub = df1.sub(df2)  # Equivalent to df1 - df2
2. Comparison Operators
Python Operator  Pandas Method  Description
==  .eq()  Checks for equality between DataFrames/Series.
!=  .ne()  Checks for inequality between DataFrames/Series.
>   .gt()  Checks if values in one DataFrame/Series are greater than the other.
<   .lt()  Checks if values in one DataFrame/Series are less than the other.
>=  .ge()  Checks if values in one DataFrame/Series are greater than or equal to the other.
<=  .le()  Checks if values in one DataFrame/Series are less than or equal to the other.
Example:
s1 = pd.Series([1, 2, 3])
s2 = pd.Series([1, 2, 4])

comparison_result = s1.eq(s2)  # Equivalent to s1 == s2
(Both operands are Series here so the comparison aligns element-wise by index.)


3. Logical Operators
Python Operator  Pandas Method  Description
&  (no named method)  Element-wise logical AND for DataFrames/Series.
|  (no named method)  Element-wise logical OR for DataFrames/Series.
~  (no named method)  Element-wise logical NOT for DataFrames/Series.
Note: and, or and not are Python keywords, so Pandas cannot define .and()/.or()/.not() methods; use the &, | and ~ operators (or numpy.logical_and, numpy.logical_or, numpy.logical_not) instead.
Example:
df1 = pd.DataFrame([True, False, True])
df2 = pd.DataFrame([False, False, True])

result_and = df1 & df2  # element-wise AND
result_or = df1 | df2   # element-wise OR
4. Aggregation Methods
Python Function  Pandas Method  Description
sum() .sum() Returns the sum of the DataFrame/Series.
mean() .mean() Returns the mean of the DataFrame/Series.
count() .count() Returns the count of non-null values in DataFrame/Series.
min() .min() Returns the minimum value of the DataFrame/Series.
max() .max() Returns the maximum value of the DataFrame/Series.
std() .std() Returns the standard deviation of the DataFrame/Series.
Example:
df = pd.DataFrame([[1, 2], [3, 4]])

total_sum = df.sum()    # column-wise sums
mean_value = df.mean()  # column-wise means
5. String Methods
Python String Method  Pandas Method  Description
.lower() .str.lower() Converts string to lowercase.
.upper() .str.upper() Converts string to uppercase.
.replace() .str.replace() Replaces occurrences of a substring.
Example:
df = pd.DataFrame(['apple', 'banana', 'cherry'])

lower_case = df[0].str.lower()  # element-wise lowercase on the Series df[0]


Conclusion
Broadcasting in NumPy allows for efficient element-wise operations on arrays of different shapes, reducing
memory overhead by eliminating the need for explicit copying. The rules of broadcasting ensure that arrays are
compatible for mathematical operations, making it easier to work with arrays of different dimensions.
On the other hand, Pandas provides a high-level interface to manipulate data with DataFrames and Series. Many
common Python operators have corresponding Pandas methods that allow for powerful and intuitive data
manipulation, including arithmetic, logical, comparison, and aggregation operations. These methods make
working with large datasets in a concise and efficient manner much easier.
15a Explain various visualization charts like line plots, scatter plots and histograms with an example
1. Line Plot 2. Scatter Plot 3. Histogram
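A sketch of the three chart types in Matplotlib (the data values are made up for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 11)
y = x ** 2
samples = np.random.default_rng(0).normal(50, 10, 500)

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))
ax1.plot(x, y)                      # line plot: trend of y over x
ax1.set_title("Line plot")
ax2.scatter(x, y)                   # scatter plot: individual (x, y) points
ax2.set_title("Scatter plot")
counts, bin_edges, patches = ax3.hist(samples, bins=20)  # histogram: distribution
ax3.set_title("Histogram")
fig.savefig("charts.png")
```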
OR
15b Explain any two three-dimensional plotting techniques in Matplotlib with an example
1. 3D Line Plot 2. 3D Surface Plot
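A sketch of the two 3D techniques (the helix and the sin-of-radius surface are my illustrative choices, not from the source):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure(figsize=(10, 4))

# 1. 3D line plot: a helix traced by (x, y, z) points
ax1 = fig.add_subplot(1, 2, 1, projection="3d")
t = np.linspace(0, 4 * np.pi, 200)
ax1.plot(np.cos(t), np.sin(t), t)
ax1.set_title("3D Line Plot")

# 2. 3D surface plot: z = sin(sqrt(x^2 + y^2)) over a grid
ax2 = fig.add_subplot(1, 2, 2, projection="3d")
X, Y = np.meshgrid(np.linspace(-5, 5, 50), np.linspace(-5, 5, 50))
Z = np.sin(np.sqrt(X**2 + Y**2))
ax2.plot_surface(X, Y, Z, cmap="viridis")
ax2.set_title("3D Surface Plot")

fig.savefig("three_d_plots.png")
```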
PART C
(Answer all the Questions 1 x 15 = 15 Marks)
16a What is the mode? Can there be distributions with no mode or more than one mode?
The owner of a new car conducts six gas mileage tests and obtains the following results, expressed in miles per gallon: 26.3, 28, 27.4, 26.9. Find the median. Also find the median for the following scores: first set of five scores 2, 8, 2, 7, 6 and set of six scores 3, 8, 9, 3, 1, 9, with steps.

The mode is the value that occurs most often in a distribution. A distribution can have no mode (when every value occurs with equal frequency) and can have more than one mode (bimodal or multimodal) when two or more values tie for the highest frequency.
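A sketch of the median/mode computations for the sets as given (the mileage list is used exactly as printed in the question):

```python
from statistics import median, multimode

mileage = [26.3, 28, 27.4, 26.9]    # values exactly as listed in the question
m_mileage = median(mileage)          # even count: mean of the two middle values
print(m_mileage)                     # (26.9 + 27.4) / 2 = 27.15

first = [2, 8, 2, 7, 6]              # five scores: median is the middle value
m_first, mode_first = median(first), multimode(first)
print(m_first, mode_first)           # 6 [2]

second = [3, 8, 9, 3, 1, 9]          # six scores: two values tie -> bimodal
m_second, mode_second = median(second), multimode(second)
print(m_second, mode_second)         # 5.5 [3, 9]
```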

OR
16b Consider the following dataset with one response variable y and predictor variables x1 and x2
y  140 155 159 179 192 200 212 215
X1 60 62 67 70 71 72 75 78
X2 22 25 24 20 15 14 14 11
Fit a multiple linear regression model to this dataset.
To fit a multiple linear regression model to the given dataset, we aim to predict the response variable y using the predictor variables x1 and x2. Multiple linear regression models the relationship between one dependent variable and two or more independent variables as a linear equation:
y = b0 + b1·x1 + b2·x2
Dataset
y  140 155 159 179 192 200 212 215
x1 60 62 67 70 71 72 75 78
x2 22 25 24 20 15 14 14 11
Explanation of Code:
1. Data Preparation:
2. Model Fitting:
3. Model Evaluation:
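The three steps above can be sketched with numpy.linalg.lstsq (the fitted coefficients come out of the solver, not from the source):

```python
import numpy as np

# 1. Data preparation: design matrix with an intercept column
y  = np.array([140, 155, 159, 179, 192, 200, 212, 215], dtype=float)
x1 = np.array([60, 62, 67, 70, 71, 72, 75, 78], dtype=float)
x2 = np.array([22, 25, 24, 20, 15, 14, 14, 11], dtype=float)
A = np.column_stack([np.ones_like(x1), x1, x2])

# 2. Model fitting: least squares solution of A @ coef ≈ y
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
b0, b1, b2 = coef
print(f"y = {b0:.2f} + {b1:.2f}*x1 + {b2:.2f}*x2")

# 3. Model evaluation: coefficient of determination R²
pred = A @ coef
r2 = 1 - ((y - pred)**2).sum() / ((y - y.mean())**2).sum()
print(round(r2, 3))
```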

