Answer Key Split Up Fds
4 List the difference between a discrete variable and a continuous variable, with an example
Discrete Variable:
o A discrete variable is a variable that takes on distinct, separate values. These values are countable (for example, the number of students in a class).
Continuous Variable:
o A continuous variable can take on any value within a range and can be measured to any degree of precision (for example, a person's height).
9 Identify the two possible options in IPython notebook used to embed graphics directly in the notebook
1. %matplotlib inline
Description: This magic command is used to embed static graphics (such as plots and charts) directly in
the notebook. When this command is run, any plots generated by Matplotlib will automatically be
displayed inline (i.e., directly below the code cell that produced them).
2. %matplotlib notebook
Description: This magic command provides interactive graphics within the notebook. Unlike
%matplotlib inline, which creates static images, %matplotlib notebook creates interactive plots that
allow zooming, panning, and other interactive features.
+-----------------------+
|
v
+-----------------------+
| Model Evaluation |
| - Assess performance |
| - Use metrics |
| - Cross-validation |
+-----------------------+
|
v
+-----------------------+
| Deployment |
| - Deploy model |
| - Monitor performance |
+-----------------------+
|
v
+-----------------------+
| Monitoring and |
| Maintenance |
| - Track performance |
| - Retrain model |
+-----------------------+
OR
11b Explain about the steps in the data science process with a diagram
1. Problem Definition
2. Data Collection
3. Data Cleaning (Data Preprocessing)
4. Exploratory Data Analysis (EDA)
5. Feature Engineering
6. Model Building
7. Model Evaluation
8. Deployment
9. Monitoring and Maintenance
Diagram of the Data Science Process
+-----------------------+
| Problem Definition |
| - Understand goals |
| - Define objectives |
+-----------------------+
|
v
+-----------------------+
| Data Collection |
| - Gather data |
| - Data from various |
| sources |
+-----------------------+
|
v
+-----------------------+
| Data Cleaning |
| - Handle missing data |
| - Remove duplicates |
| - Format corrections |
+-----------------------+
|
v
+-----------------------+
| Exploratory Data |
| Analysis (EDA) |
| - Visualize data |
| - Discover patterns |
| - Find correlations |
+-----------------------+
|
v
+-----------------------+
| Feature Engineering |
| - Create new features |
| - Select important |
| features |
+-----------------------+
|
v
+-----------------------+
| Model Building |
| - Train the model |
| - Choose algorithm |
+-----------------------+
|
v
+-----------------------+
| Model Evaluation |
| - Assess model |
| - Use metrics to |
| validate |
+-----------------------+
|
v
+-----------------------+
| Deployment |
| - Deploy model into |
| production |
+-----------------------+
|
v
+-----------------------+
| Monitoring & |
| Maintenance |
| - Track model's |
| performance |
| - Retrain if needed |
+-----------------------+
12a What is a frequency distribution? Customers who purchased a particular product rated its usability on a
10-point scale, ranging from 1 (Poor) to 10 (Excellent), as follows:
3 7 2 7 8
3 1 4 10 3
2 5 3 5 3
9 7 6 3 7
8 9 7 3 6
The Data Science Process is an iterative and cyclical approach to solving complex problems using data. The
steps involve defining the problem, collecting and preparing the data, performing analysis, building models,
evaluating them, deploying them in production, and continuously monitoring and improving them. By following
this process, data scientists can develop valuable insights and create models that drive data-driven decisions
across various industries.
What is Frequency Distribution?
A frequency distribution is a way to organize and summarize data in a table, showing how often each unique
value or range of values (called "bins") appears in the dataset. It helps in understanding the spread, patterns, and
frequency of data points within a given dataset.
For the product usability ratings in the question (on a 10-point scale from 1 to 10), we can create a
frequency distribution by counting how many times each rating (from 1 to 10) occurs in the dataset.
Given Data:
The usability ratings provided by customers are:
3, 7, 2, 7, 8, 3, 1, 4, 10, 3, 2, 5, 3, 5, 3, 9, 7, 6, 3, 7, 8, 9, 7, 3, 6
Explanation:
Rating 1: Appears 1 time.
Rating 2: Appears 2 times.
Rating 3: Appears 7 times.
Rating 4: Appears 1 time.
Rating 5: Appears 2 times.
Rating 6: Appears 2 times.
Rating 7: Appears 5 times.
Rating 8: Appears 2 times.
Rating 9: Appears 2 times.
Rating 10: Appears 1 time.
Total: 25 ratings.
Conclusion:
From the frequency distribution, you can quickly see that:
The most common rating is 3, which occurred 7 times.
The least common ratings are 1, 4, and 10, each appearing only once.
This distribution helps in understanding how customers rated the product's usability on a 10-point scale, and it
gives a clear picture of the overall satisfaction of the customers.
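The counting step can be checked with a short script. This is a minimal sketch using Python's standard `collections.Counter`; the variable names (`ratings`, `frequency_table`) are illustrative and not part of the original answer.

```python
from collections import Counter

# Usability ratings from the question (25 customers, scale 1 to 10).
ratings = [3, 7, 2, 7, 8, 3, 1, 4, 10, 3, 2, 5, 3,
           5, 3, 9, 7, 6, 3, 7, 8, 9, 7, 3, 6]

counts = Counter(ratings)

# One row per point on the scale, so a rating nobody gave would show as 0.
frequency_table = {rating: counts.get(rating, 0) for rating in range(1, 11)}

print("Rating  Frequency")
for rating, freq in frequency_table.items():
    print(f"{rating:>6}  {freq:>9}")

total = sum(frequency_table.values())
most_common_rating, most_common_count = counts.most_common(1)[0]
print("Total responses:", total)
print("Most common rating:", most_common_rating)
```

Checking that the frequencies sum to the number of responses (25) is a quick way to catch a miscount.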
OR
12b (i) What is a Z-score? Outline the steps to obtain a Z-score.
(ii) Express each of the following scores as a Z-score: first, Mary's intelligence quotient is 135, given a
mean of 100 and a standard deviation of 15; second, Mary obtained a score of 470 in the competitive exam
conducted in April 2022, given a mean of 500 and a standard deviation of 100.
A Z-score (also known as a standard score or z-value) represents how many standard deviations a particular
data point lies from the mean of the dataset. It is calculated as:
$Z = \frac{X - \mu}{\sigma}$
Where:
$Z$ = Z-score
$X$ = The data point (or observation) for which the Z-score is being calculated
$\mu$ = Mean of the dataset
$\sigma$ = Standard deviation of the dataset
The Z-score tells you if the data point is above or below the mean, and by how many standard deviations.
If $Z > 0$, the data point is above the mean.
If $Z < 0$, the data point is below the mean.
If $Z = 0$, the data point is exactly at the mean.
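The steps to obtain a Z-score follow directly from the formula: take the raw score, subtract the mean, and divide the result by the standard deviation. The sketch below applies these steps to the two scores in part (ii); the helper name `z_score` is illustrative.

```python
def z_score(x, mean, sd):
    """Return how many standard deviations x lies from the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


# Values taken from part (ii) of the question.
iq_z = z_score(135, 100, 15)     # Mary's intelligence quotient
exam_z = z_score(470, 500, 100)  # Mary's competitive exam score

print(f"IQ Z-score:   {iq_z:.2f}")
print(f"Exam Z-score: {exam_z:.2f}")
```

Mary's IQ works out to about 2.33 standard deviations above the mean, and her exam score to 0.30 standard deviations below the mean.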
Given paired values of X and Y, the relationship between the two variables can be examined in several ways:
1. Correlation to assess how strongly the values of X and Y are related.
2. Regression to model the relationship between X and Y.
3. Z-scores for normalization of individual data points.
The steps below work through the correlation between X and Y.
Given Data:
X    Y
66   68
68   70
68   69
70   72
71   72
72   72
72   74
Step 1: Calculate the Mean of X and Y
We need to find the mean values for both X and Y.
Mean of X:
$\mu_X = \frac{66 + 68 + 68 + 70 + 71 + 72 + 72}{7} = \frac{487}{7} \approx 69.571$
Mean of Y:
$\mu_Y = \frac{68 + 70 + 69 + 72 + 72 + 72 + 74}{7} = \frac{497}{7} = 71$
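A minimal sketch of the full calculation follows, assuming the quantity wanted is the Pearson correlation coefficient r (the text names correlation but does not say which coefficient). It computes the means, the sums of squared deviations, and the sum of cross-products, then divides. The variable names are illustrative.

```python
from math import sqrt

# Paired observations from the table above.
x = [66, 68, 68, 70, 71, 72, 72]
y = [68, 70, 69, 72, 72, 72, 74]
n = len(x)

# Step 1: means of X and Y.
mean_x = sum(x) / n
mean_y = sum(y) / n

# Step 2: sums of squared deviations and of cross-products of deviations.
ss_xx = sum((xi - mean_x) ** 2 for xi in x)
ss_yy = sum((yi - mean_y) ** 2 for yi in y)
ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Step 3: Pearson correlation coefficient.
r = ss_xy / sqrt(ss_xx * ss_yy)

print(f"mean_x = {mean_x:.3f}, mean_y = {mean_y:.3f}, r = {r:.3f}")
```

For this data r comes out close to 0.94, indicating a strong positive linear relationship between X and Y.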
OR
13b The values of x and their corresponding values of y are presented below
X  0.5  1.5  2.5  3.5  4.5  5.5  6.5
Y  2.5  3.5  5.5  4.5  6.5  8.5  10.5
(i) Find the least-squares regression line y = ax + b
(ii) Estimate the value of y when X = 10
To find the least-squares regression line $y = ax + b$, we need to calculate the slope $a$ and the
intercept $b$. These can be found using the following formulas:
1. Formula for the Slope ($a$):
$a = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}$
2. Formula for the Intercept ($b$):
$b = \frac{\sum Y - a\sum X}{n}$
Where:
$n$ is the number of data points,
$X$ and $Y$ are the individual data points,
$\sum X$, $\sum Y$, $\sum XY$, and $\sum X^2$ are the sums of $X$, $Y$, $XY$, and $X^2$ respectively.
Given Data:
X = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5]
Y = [2.5, 3.5, 5.5, 4.5, 6.5, 8.5, 10.5]
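The slope and intercept formulas can be applied to this data as sketched below. The code uses plain Python sums so that each term of the formulas is visible; the names `sum_xy`, `sum_x2`, and `predict` are illustrative.

```python
# Data from the question.
x = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5]
y = [2.5, 3.5, 5.5, 4.5, 6.5, 8.5, 10.5]
n = len(x)

# The four sums that appear in the slope and intercept formulas.
sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# (i) Slope a and intercept b of the line y = ax + b.
a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - a * sum_x) / n


def predict(x_new):
    """Estimate y at x_new from the fitted line."""
    return a * x_new + b


# (ii) Estimate y when X = 10.
y_at_10 = predict(10)

print(f"sum_x = {sum_x}, sum_y = {sum_y}, sum_xy = {sum_xy}, sum_x2 = {sum_x2}")
print(f"a = {a:.4f}, b = {b:.4f}")
print(f"Estimated y at X = 10: {y_at_10:.4f}")
```

With the sums 24.5, 41.5, 180.25, and 113.75, the slope is a = 245/196 = 1.25 and the intercept is b ≈ 1.554, so the fitted line is y = 1.25x + 1.554 and the estimate at X = 10 is about 14.05. Since X = 10 lies outside the observed range of 0.5 to 6.5, this estimate is an extrapolation.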
14a What is an aggregate function? Elaborate on the aggregate functions in NumPy
Broadcasting is a powerful feature in NumPy that allows arithmetic operations on arrays of different shapes and
sizes, without needing to explicitly replicate data. It involves "stretching" the smaller array across the larger
array, so they have compatible shapes. This allows for element-wise operations on arrays of different dimensions
without making copies of data, thus improving efficiency.
The rule of broadcasting is that the smaller array is "broadcast" over the larger one, so that they have compatible
shapes. This works as long as the dimensions of the arrays are compatible, and it happens according to certain
broadcasting rules.
Rules of Broadcasting
There are three main rules that govern broadcasting:
1. Rule 1: If the arrays have a different number of dimensions, pad the smaller-dimensional array
with ones on the left side until both arrays have the same number of dimensions.
Example:
o Shape of A is (3, 5) (a 2D array)
o Shape of B is (5,) (a 1D array)
o The 1D array B is treated as having shape (1, 5), with the leading dimension padded with 1. It
will be broadcast to match the shape (3, 5).
2. Rule 2: If the sizes of the dimensions are not the same, the dimension with size 1 is stretched to
match the size of the larger array.
Example:
o Shape of A is (3, 5)
o Shape of B is (1, 5)
o The 1st dimension of B is stretched to match the size of A, i.e., B is treated as having shape
(3, 5).
3. Rule 3: If the dimensions are not compatible (i.e., the sizes in corresponding dimensions are
neither the same nor one of them is 1), broadcasting will fail.
Example:
o Shape of A is (3, 5)
o Shape of B is (4, 5)
o Broadcasting will fail because the sizes of the 1st dimension (3 vs. 4) are not compatible and
neither dimension is 1.
Example of Broadcasting
Let’s look at an example of broadcasting in action:
import numpy as np
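A minimal sketch of the three rules, using the same shapes as the rule examples above. The array names and contents are illustrative assumptions, not the original example.

```python
import numpy as np

A = np.ones((3, 5))   # 2D array, shape (3, 5)
B = np.arange(5)      # 1D array, shape (5,)

# Rules 1 and 2: B is treated as shape (1, 5), then stretched to (3, 5),
# so B is added to every row of A.
C = A + B
print("Shape of A + B:", C.shape)
print(C)

# Rule 3: shapes (3, 5) and (4, 5) differ in the first dimension and
# neither size is 1, so NumPy raises a ValueError.
D = np.ones((4, 5))
try:
    A + D
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print("Broadcasting failed:", err)
```

Catching the `ValueError` here is only for demonstration; in real code an incompatible shape usually signals a bug that should be fixed with an explicit reshape.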
OR
16b Consider the following dataset with one response variable y and predictor variables x1 and x2
y   140 155 159 179 192 200 212 215
x1  60  62  67  70  71  72  75  78
x2  22  25  24  20  15  14  14  11
Fit a multiple linear regression model to this dataset
To fit a multiple linear regression model to the given dataset, we aim to predict the response variable $y$ using the predictor
variables $x_1$ and $x_2$. Multiple linear regression models the relationship between one dependent variable and
two or more independent variables as a linear equation, here $\hat{y} = b_0 + b_1 x_1 + b_2 x_2$.
Dataset
y    140 155 159 179 192 200 212 215
x_1  60  62  67  70  71  72  75  78
x_2  22  25  24  20  15  14  14  11
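One way to fit the model is ordinary least squares with NumPy, sketched below. The choice of `np.linalg.lstsq` and the names `b0`, `b1`, `b2`, and `r_squared` are assumptions made for illustration; the answer does not state which library or method it used.

```python
import numpy as np

# 1. Data preparation: the dataset from the question.
y = np.array([140, 155, 159, 179, 192, 200, 212, 215], dtype=float)
x1 = np.array([60, 62, 67, 70, 71, 72, 75, 78], dtype=float)
x2 = np.array([22, 25, 24, 20, 15, 14, 14, 11], dtype=float)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(x1), x1, x2])

# 2. Model fitting: least-squares solution of X @ coef ~ y.
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coef

# 3. Model evaluation: fitted values and coefficient of determination.
y_hat = X @ coef
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"y_hat = {b0:.4f} + {b1:.4f} * x1 + {b2:.4f} * x2")
print(f"R^2 = {r_squared:.4f}")
```

The printed coefficients give the fitted equation, and R^2 reports the share of the variation in y that the two predictors explain on this sample.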
Explanation of Code:
1. Data Preparation:
2. Model Fitting:
3. Model Evaluation: