fds-two-marks
TWO MARKS
UNIT – I
A data lake stores an organization's raw and processed (unstructured and structured) data at both large
and small scales.
10. List the issues with the real world data.
Issues with the real world data:
Incomplete data: Some data lack attribute values.
Noisy data: Some data contain errors.
Inconsistent: Some data contain discrepancies in codes and names.
11. Mention the benefits of data preparation phase.
Benefits of Data Preparation
Fix errors quickly
Produce good-quality data
Make better, more accurate decisions
Data preparation also involves finding relevant data to ensure actionable insights for business
decision-making.
Reduce data management and analytics costs
Avoid duplication of effort in preparing data for use in multiple applications
12. What is meant by data cleaning?
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate,
or incomplete data within a dataset. When combining multiple data sources, there are many
opportunities for data to be duplicated or mislabeled.
13. Mention any 4 types of common errors that occur in data.
Mistakes during data entry
Redundant white space
Impossible values
Missing values
Deviations from a code book
Different units of measurement
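A few of these error types can be detected with simple checks. The following is a minimal sketch in plain Python; the records, the code book {"F", "M"}, and the plausible age range are hypothetical values chosen only for illustration.

```python
# Hypothetical raw records: (city, age, gender code)
raw_records = [
    ("  Chennai ", 34, "F"),
    ("Mumbai", 350, "M"),       # impossible age, likely a data-entry mistake
    ("Delhi  ", 27, "female"),  # deviates from the assumed code book
]

CODE_BOOK = {"F", "M"}  # assumed set of valid gender codes
AGE_RANGE = (0, 120)    # assumed plausible age range

cleaned, problems = [], []
for city, age, gender in raw_records:
    city = city.strip()  # remove redundant white space
    if not AGE_RANGE[0] <= age <= AGE_RANGE[1]:
        problems.append((city, "impossible age"))
    elif gender not in CODE_BOOK:
        problems.append((city, "code not in code book"))
    else:
        cleaned.append((city, age, gender))

print(cleaned)
print(problems)
```

Records that fail a check are collected separately instead of being silently dropped, so they can be corrected at the source.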
14. What is an outlier?
An outlier is a data object that deviates significantly from the rest of the data objects and behaves in a
different manner. It is an observation that follows a different logic or generative process than the other
observations.
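One common way to flag such observations is the 1.5 x IQR rule. This rule is not part of the definition above; it is used here only as an illustrative sketch with NumPy, and the data values are made up.

```python
import numpy as np

# Made-up data with one value far away from the rest
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])

q1, q3 = np.percentile(values, [25, 75])  # first and third quartiles
iqr = q3 - q1                             # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are flagged as outliers
outliers = values[(values < lower) | (values > upper)]
print(outliers)
```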
15. How can you handle missing values in the dataset?
Various ways to handle missing values are:
Omit the values
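Omitting missing values can be sketched with NumPy as follows, assuming missing entries are encoded as NaN. The readings are made-up values.

```python
import numpy as np

# Made-up readings; np.nan marks a missing value
readings = np.array([4.2, np.nan, 5.1, 4.8, np.nan])

# Keep only the entries that are not NaN (omit the missing values)
complete = readings[~np.isnan(readings)]

print(complete)
print(complete.mean())  # the mean is computed on the remaining values only
```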
where Y is the dependent variable, X_{1}, X_{2}, X_{3} are the independent variables, m is the slope of
the regression, and b is the constant value.
14. What is Regression towards the Mean?
Regression towards the Mean refers to a tendency for scores, particularly extreme scores, to shrink
towards the mean. It refers to the fact that if one sample of a random variable is extreme, the next
sampling of the same random variable is likely to be closer to its mean.
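The effect can be illustrated with a small simulation. This is only a sketch: it assumes each score is a stable component plus independent chance variation, and all numbers (mean 100, standard deviation 10, top 10 percent) are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

ability = rng.normal(100, 10, n)        # stable component of each score
test1 = ability + rng.normal(0, 10, n)  # first score = ability + chance
test2 = ability + rng.normal(0, 10, n)  # second score with new, independent chance

# Select the extreme scorers on the first test (top 10 percent)
top = test1 >= np.percentile(test1, 90)

mean_top_test1 = test1[top].mean()
mean_top_test2 = test2[top].mean()  # same people, second test
overall_mean = test2.mean()

print(mean_top_test1, mean_top_test2, overall_mean)
```

The group that was extreme on the first test still scores above average on the second test, but its mean lies closer to the overall mean, even though nothing about the group has changed.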
15. When does Regression Fallacy occur?
Regression fallacy occurs whenever regression towards the mean is interpreted as a real effect rather than
as chance. The regression fallacy can be avoided by splitting the subset of extreme observations into two
groups.
16. Indicate whether the following statements suggest a positive or negative relationship:
(a) More densely populated areas have higher crime rates.
(b) School children who often watch TV perform more poorly on academic achievement tests.
(c) Heavier automobiles yield poorer gas mileage.
(d) Better-educated people have higher incomes.
(e) More anxious people voluntarily spend more time performing a simple repetitive task.
Answer
(a) Positive. The crime rate is higher, square mile by square mile, in densely populated cities than in
sparsely populated rural areas.
(b) Negative. As TV viewing increases, performance on academic achievement tests tends to decline.
(c) Negative. Increases in car weight are accompanied by decreases in miles per gallon.
(d) Positive. Increases in educational level (grade school, high school, college) tend to be associated
with increases in income.
(e) Positive. Highly anxious people willingly spend more time performing a simple repetitive task than
less anxious people.
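The sign of a relationship can also be read from the correlation coefficient. The sketch below uses np.corrcoef on made-up values for statement (c); the numbers are hypothetical and only illustrate the direction of the relationship.

```python
import numpy as np

# Made-up values: car weight (kg) and gas mileage (miles per gallon)
weight = np.array([900, 1100, 1300, 1500, 1700])
mpg = np.array([42, 38, 33, 29, 24])

# Pearson correlation coefficient between the two variables
r = np.corrcoef(weight, mpg)[0, 1]

relationship = "positive" if r > 0 else "negative"
print(relationship)
```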
UNIT – IV
1. What is NumPy? List its uses.
NumPy is a general-purpose array-processing package that provides a high-performance multidimensional
array object and tools for working with these arrays. It is the fundamental package for scientific
computing with Python. It provides an N-dimensional array object and supports many sophisticated
(broadcasting) functions.
Uses of NumPy
NumPy is a package in Python used for scientific computing. The NumPy package is used to perform
different operations on arrays. The ndarray (NumPy array) is a multidimensional array used to store
values of the same datatype. These arrays are indexed just like sequences, starting from zero.
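A minimal sketch of these points (array creation, zero-based indexing, and broadcasting) is given below; the array contents are arbitrary.

```python
import numpy as np

# 2-D ndarray; all elements share one datatype
a = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a.shape)  # (2, 3)
print(a[0, 0])  # zero-based indexing: first row, first column, prints 1

# Broadcasting: the 1-D row is stretched across both rows of a
row = np.array([10, 20, 30])
b = a + row
print(b)
```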
Scatter plots are used to observe the relationship between variables. A scatter plot is a type of plot in
which the points are represented individually with a dot, circle or other shape. The scatter() method in
the matplotlib library is used to draw a scatter plot.
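A short sketch of scatter() is shown below. The data are randomly generated for illustration, and the axis labels are placeholders.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: y depends roughly linearly on x, plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 2 * x + rng.normal(0, 2, 50)

fig, ax = plt.subplots()
points = ax.scatter(x, y, marker="o")  # one marker per (x, y) pair
ax.set_xlabel("x")
ax.set_ylabel("y")
```

In an interactive session, calling plt.show() afterwards displays the figure.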
4. What is the significance of Error bar?
Error bars indicate the estimated error or uncertainty and show how precise a measurement is. Error bars
are a graphical enhancement that visualizes the variability of the plotted data on a Cartesian
graph.
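Error bars can be drawn with Matplotlib's errorbar() function. The sketch below uses made-up measurements and assumes the same uncertainty of 0.4 for every point.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 6)
y = np.array([2.0, 2.9, 4.1, 5.2, 5.8])  # made-up measurements
yerr = np.full_like(y, 0.4)              # assumed uncertainty of each point

fig, ax = plt.subplots()
# fmt="o" draws markers only; capsize adds small caps to the bars
container = ax.errorbar(x, y, yerr=yerr, fmt="o", capsize=3)
ax.set_xlabel("x")
ax.set_ylabel("measurement")
```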
5. What is contour plot?
Contour plots are a way to show a three-dimensional surface on a two-dimensional plane. A contour plot
graphs two predictor variables, X and Y, on the x- and y-axes and a response variable Z as contours.
These contours are sometimes called z-slices or iso-response values.
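A contour plot can be sketched with Matplotlib's contour() function. The response surface Z below is a made-up function of X and Y, chosen only for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Grid of predictor values
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)

# Made-up response surface
Z = np.exp(-(X**2 + Y**2) / 2)

fig, ax = plt.subplots()
cs = ax.contour(X, Y, Z, levels=8)      # each contour line joins points of equal Z
ax.clabel(cs, inline=True, fontsize=8)  # label the contour lines with their Z value
ax.set_xlabel("X")
ax.set_ylabel("Y")
```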
6. What is a histogram?
A histogram is a graph showing frequency distributions. It shows the number of observations within
each given interval. A simple histogram is useful in understanding a dataset. Matplotlib's hist()
function creates a basic histogram in one line, once the usual boilerplate imports are done.
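For example, a basic histogram of a made-up sample could be drawn as follows; the choice of 20 bins is arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up sample of 1000 values
rng = np.random.default_rng(2)
data = rng.normal(0, 1, 1000)

fig, ax = plt.subplots()
# hist() returns the count per bin, the bin edges, and the drawn bars
counts, bin_edges, patches = ax.hist(data, bins=20)
ax.set_xlabel("value")
ax.set_ylabel("frequency")
```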
7. Mention the significance of subplots.
Subplots are used to compare different views of data side by side. Subplots are groups of smaller axes
that can exist together within a single figure. These subplots might be insets, grids of plots or other
more complicated layouts.
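A simple grid of two subplots can be sketched with plt.subplots(); the plotted functions and the figure size are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)

# One figure with a 1 x 2 grid of axes sharing the x-axis
fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharex=True)

axes[0].plot(x, np.sin(x))
axes[0].set_title("sin(x)")

axes[1].plot(x, np.cos(x))
axes[1].set_title("cos(x)")
```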
8. List the ways to customize Matplotlib.
Setting rcParams at runtime.
Using style sheets.
Changing your matplotlibrc file.
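The three approaches can be sketched as follows. The line width of 2.5 and the "ggplot" style are arbitrary choices for illustration.

```python
import matplotlib as mpl
import matplotlib.pyplot as plt

# 1. Set rcParams at runtime: affects plots created afterwards
mpl.rcParams["lines.linewidth"] = 2.5

# 2. Use a style sheet, here applied only inside the with-block
with plt.style.context("ggplot"):
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])

# 3. matplotlibrc file: matplotlib_fname() reports which file is currently in use;
#    editing that file (or a copy in the working directory) changes the defaults
rc_path = mpl.matplotlib_fname()
print(rc_path)
```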