
fds-two-marks

The document provides an overview of key concepts in Data Science, including definitions, components, applications, and data types. It covers topics such as data cleaning, data preparation, correlation, regression, and data visualization using Python libraries like NumPy and Matplotlib. Additionally, it discusses the significance of setting project goals and the importance of data quality in analysis.

Uploaded by

P SANTHIYA


FDS TWO Marks

Foundations of Datascience (Anna University)



Downloaded by P. SANTHIYA - CSE (psa.cse@builderscollege.edu.in)

CS3352 – FOUNDATIONS OF DATA SCIENCE

TWO MARKS

UNIT – I

1. Define Data science.
Data Science is the area of study that involves extracting insights from vast amounts of data using
various scientific methods, algorithms, and processes. It is useful for discovering hidden patterns in
voluminous raw data.
2. What is big data? List the V's of big data.
Big data refers to data sets that are so large and complex that they are difficult to process using
traditional data-processing application software.
The four V's of big data are:
Volume - How much data is there?
Variety - How diverse are the different types of data?
Velocity - At what speed is new data generated?
Veracity - How accurate is the data?
3. Identify the components of data science.
The main components of Data Science are
(a) Statistics
Statistics is a way to collect and analyze numerical data in large amounts and find meaningful
insights from it.
(b) Domain Expertise
Expert knowledge or skills in a particular area, such as health care or the automobile industry.
(c) Data engineering
Data engineering involves acquiring, storing, retrieving, and transforming the data.
(d) Visualization
Representing data in a visual context so that the data and its insights are easy to understand, even
for huge amounts of data.
(e) Advanced computing
Advanced computing involves designing, writing, debugging, and maintaining the source code of
computer programs to perform complex analysis.
(f) Mathematics
Mathematics involves the study of quantity, structure, space, and change.

4. List a few applications of data science.
• Fraud and risk detection in banking
• Healthcare prediction and analysis
• Internet search and recommendations
• Market analysis
• Customer analysis
• Targeted advertising
5. Enumerate the categories of data used in data science.
(a) Structured
(b) Unstructured
(c) Natural language
(d) Machine-generated
(e) Graph-based
(f) Audio, video, and images
(g) Streaming
6. Mention the significance of setting goal in data science.
The goal of a data science project is to fulfill a precise and measurable goal that is clearly connected
to the purposes, workflows, and decision-making processes of the business. This step defines the scope
of the project.
7. What is project charter?
A project charter is a document that lays out the project vision, scope, objectives, project team and
their responsibilities, key stakeholders, and how the project will be carried out (the implementation
plan).
8. Identify the important contents of a project charter.
A project charter must include
• A clear research goal
• The project mission and context
• How you're going to perform your analysis
• What resources you expect to use
• Proof that it's an achievable project, or proof of concepts
• Deliverables and a measure of success
• A timeline
9. Define data warehouse, data mart and data lake.
A data warehouse is constructed by integrating data from multiple heterogeneous sources to support
analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves
data cleaning, data integration, and data consolidation.
A data mart supplies subject-oriented data necessary to support a specific business unit.

A data lake stores an organization's raw and processed (unstructured and structured) data at both large
and small scales.
10. List the issues with the real world data.
Issues with the real world data:
Incomplete data: Some data lack attribute values.
Noisy: Some data contains errors.
Inconsistent: Some data contain discrepancies in codes and names.
11. Mention the benefits of data preparation phase.
Benefits of Data Preparation
• Fix errors quickly
• Produce good-quality data
• Make better, more accurate decisions
• Find relevant data to ensure actionable insights for business decision-making
• Reduce data management and analytics costs
• Avoid duplication of effort in preparing data for use in multiple applications
12. What is meant by data cleaning?
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate,
or incomplete data within a dataset. When combining multiple data sources, there are many
opportunities for data to be duplicated or mislabeled.
13. Mention any 4 types of common error that occur in data.
• Mistakes during data entry
• Redundant white space
• Impossible values
• Missing values
• Deviations from a code book
• Different units of measurement
14. What is outlier?
An outlier is a data object that deviates significantly from the rest of the data objects and behaves in a
different manner. It is an observation that follows a different logic or generative process than the other
observations.
15. How can you handle missing values in the dataset?
Various ways to handle missing values are:
• Omit the values
• Set value to null
• Impute a static value such as 0 or the mean
• Impute a value from an estimated or theoretical distribution
• Model the value (nondependent)
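The first few options can be illustrated with a minimal Python sketch. The sample list `ages` and the use of `None` as the missing-value marker are assumptions made only for this illustration.

```python
from statistics import mean

# Hypothetical sample: None marks a missing measurement.
ages = [25, None, 31, 40, None, 28]

# Option 1: omit the missing values.
observed = [a for a in ages if a is not None]

# Option 2: impute a static value such as 0.
filled_zero = [0 if a is None else a for a in ages]

# Option 3: impute the mean of the observed values.
avg = mean(observed)
filled_mean = [avg if a is None else a for a in ages]

print(observed)
print(filled_zero)
print(filled_mean)
```

Omitting values shrinks the sample, while imputing 0 can distort averages; which option is appropriate depends on the analysis.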
UNIT – II
1. Define data. What are the types of data?
Data is a collection of actual observations or scores in a survey or an experiment. Any statistical analysis
is performed on data. Data can be broadly classified as qualitative and quantitative.
2. What is Qualitative data? Give example.
Qualitative or categorical data is a set of observations where any single observation is a word, letter, or
numerical code that represents a class or a category.
Examples: Words - Yes or No. Letters - Y or N. Numerical code - 0 or 1.
3. What is quantitative data? Give an example.
Quantitative data is a set of observations where any single observation is a number that represents an
amount or a count. It is expressed in numerical values, which makes it countable and suitable for
statistical data analysis. It is also known as numerical data.
Example: Weights: 35, 56, 70 kg.
4. Compare Discrete and Continuous Variables.
A discrete variable is a variable that consists of isolated numbers separated by gaps.
Example: Number of students in a class
A continuous variable is a variable that consists of numbers whose values, at least in theory, have no
restrictions.
Example: Height of students in a class
5. Define approximate numbers. What is the approximate number of 7.2 and 8.8?
Approximate numbers are numbers that are rounded off, as is always the case with values for
continuous variables.
The approximate numbers of 7.2 and 8.8 are 7 and 9.
6. What is meant by frequency distribution?
A frequency distribution is a collection of observations produced by sorting observations into classes
and showing their frequency (f) of occurrence in each class.
7. What are the types of Frequency Distribution?
• Grouped frequency distribution
• Ungrouped frequency distribution
• Cumulative frequency distribution
• Relative frequency distribution
• Relative cumulative frequency distribution
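Several of these distributions can be derived from the same raw scores. The following is a minimal sketch using only the Python standard library; the `scores` list is made-up illustration data.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical quiz scores.
scores = [3, 5, 3, 4, 5, 5, 2, 4, 5, 3]

freq = Counter(scores)
values = sorted(freq)            # distinct score values, smallest first
n = len(scores)

f = [freq[v] for v in values]    # ungrouped frequency of each value
rel = [c / n for c in f]         # relative frequency (proportion)
cum = list(accumulate(f))        # cumulative frequency
rel_cum = [c / n for c in cum]   # relative cumulative frequency

# One row per value: (value, f, relative, cumulative, relative cumulative)
for row in zip(values, f, rel, cum, rel_cum):
    print(row)
```

A grouped frequency distribution would follow the same pattern after first mapping each score to a class interval.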
8. Define an outlier.
An outlier is a data point that differs significantly from other observations. An outlier can occur due to
variability in the measurement or it may indicate an experimental error.
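9. What is percentile rank?
The percentile rank (PR) of a score is the percentage of scores in the entire distribution with equal or
smaller values than that score. Its mathematical formula is
PR = ((CF - 0.5 * F) / N) * 100
where CF is the cumulative frequency up to and including the score, F is the frequency of the score
itself, and N is the total number of scores.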
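As a rough illustration, the sketch below computes a percentile rank under the common midpoint convention PR = ((CF - 0.5 * F) / N) * 100, where CF counts scores less than or equal to the given score and F counts scores equal to it. The function name `percentile_rank` and the sample data are assumptions for this example only.

```python
def percentile_rank(scores, x):
    """Percentile rank of x, counting half of the scores tied with x."""
    cf = sum(1 for s in scores if s <= x)   # cumulative frequency up to x
    f = scores.count(x)                     # frequency of x itself
    return (cf - 0.5 * f) / len(scores) * 100

# Hypothetical quiz scores.
scores = [3, 5, 3, 4, 5, 5, 2, 4, 5, 3]
pr_of_4 = percentile_rank(scores, 4)
print(pr_of_4)
```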
10. What are the measures of central tendency?
Mean, median, and mode.
11. Define Mode.
The mode represents the value of the most frequently occurring score.
12. Define Median.
Median represents the middle value when observations are ordered from least to most.
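The three measures can be compared on a small sample with Python's standard `statistics` module. The data below is invented for illustration and deliberately contains one extreme value.

```python
from statistics import mean, median, mode

# Hypothetical scores; 30 is an extreme value.
data = [2, 3, 3, 4, 5, 5, 5, 6, 30]

avg = mean(data)      # arithmetic average, pulled upward by the extreme score
mid = median(data)    # middle value of the ordered scores
top = mode(data)      # most frequently occurring score

print(avg, mid, top)
```

In this sample the mean is larger than the median and mode because it is sensitive to the extreme score.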
UNIT – III
1. What is Correlation?
Correlation measures the relationship between two variables. Example: The relationship between the
computer skills and GPA of a student.
2. What are the types of correlation?
• Positive correlation
• Negative correlation
• No correlation
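The sign of the Pearson correlation coefficient r distinguishes these cases. The following is a minimal sketch; the helper name `pearson_r` and the sample lists are assumptions for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp_xy / sqrt(ss_x * ss_y)

x = [1, 2, 3, 4]
r_pos = pearson_r(x, [2, 4, 6, 8])    # Y rises with X: positive correlation
r_neg = pearson_r(x, [8, 6, 4, 2])    # Y falls as X rises: negative correlation
r_none = pearson_r(x, [1, 3, 3, 1])   # no linear trend: r is zero

print(r_pos, r_neg, r_none)
```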
3. What is the need for correlation?
• Prediction
• Validity
• Reliability
• Theory verification
4. What is a scatterplot?
A scatterplot is a graph containing a cluster of dots that represents all pairs of scores. Scatter plots are
used to observe relationships between variables.
5. What is causation?
Causation indicates that one event is the result of the occurrence of the other event, i.e. there is a causal
relationship between the two events. This is also referred to as the cause and effect.

6. What is a linear relationship?
Linear relationship (or linear association) is a statistical term used to describe a straight-line
relationship between two variables.
7. What is Nonlinear Relationship?
A nonlinear relationship between variables is a relationship whose scatter plot does not resemble a
straight line. It could resemble a curve or not really resemble anything. An increase in one variable does
not result in a proportional increase or decrease in the other variable.
8. List the types of nonlinear relationship.
• Quadratic relationship
• Cubic relationship
• Exponential relationship
• Logarithmic relationship
9. What is the interpretation of r^2?
The squared correlation coefficient, r^2, provides us with not only a key interpretation of the correlation
coefficient but also a measure of predictive accuracy that supplements the standard error of estimate,
S_{y|x}.
10. What is Standard Error of Estimate?
The standard error of estimate (S_{y|x}) is a rough measure of the average amount of predictive error,
i.e., a rough measure of the average amount by which known Y values deviate from their predicted Y values.
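Both quantities can be computed from the same sums of squares. The sketch below assumes one common textbook form of the standard error of estimate, sqrt(SS_y * (1 - r^2) / (n - 2)); the n - 2 denominator and the sample data are assumptions, so check them against the course text.

```python
from math import sqrt

# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
ss_x = sum((x - mean_x) ** 2 for x in xs)
ss_y = sum((y - mean_y) ** 2 for y in ys)

r = sp_xy / sqrt(ss_x * ss_y)
r_squared = r ** 2                       # share of Y variability predictable from X
ss_error = ss_y * (1 - r_squared)        # variability left unexplained
std_error = sqrt(ss_error / (n - 2))     # assumed n - 2 denominator

print(r_squared, std_error)
```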
11. What is a Correlation matrix?
A correlation matrix is a table that displays the correlation coefficients for different variables. The
matrix depicts the correlation between all possible pairs of variables in a table.
20. Give the Least Squares Regression Equation.
Y = bX + a
where
Y represents the predicted value
X represents the known value
a and b represent numbers calculated from the original correlation analysis
12. State the desirable property of least squares regression.
The desirable property is that the least squares regression line automatically minimizes the total of all
squared predictive errors for known Y scores in the original correlation analysis.
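This property can be checked numerically. The sketch below fits Y = bX + a with the usual least squares formulas (b = SP_xy / SS_x, a = mean_Y - b * mean_X) and compares its total squared error with that of slightly different lines. The data and the helper name `sum_squared_error` are assumptions for illustration.

```python
# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
ss_x = sum((x - mean_x) ** 2 for x in xs)

b = sp_xy / ss_x            # slope
a = mean_y - b * mean_x     # intercept

def sum_squared_error(slope, intercept):
    """Total squared predictive error of the line Y = slope * X + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

best = sum_squared_error(b, a)
steeper = sum_squared_error(b + 0.1, a)     # nudged slope
shifted = sum_squared_error(b, a + 0.1)     # nudged intercept

print(b, a, best, steeper, shifted)
```

Any change to the fitted slope or intercept increases the total squared error on these points.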
13. State the Multiple Regression Equation.
Y = m_{1}*X_{1} + m_{2}*X_{2} + m_{3}*X_{3} + b
where Y is the dependent variable, X_{1}, X_{2}, X_{3} are independent variables, m_{1}, m_{2}, m_{3}
are the slopes (regression coefficients) of the corresponding independent variables, and b is the
constant value.
14. What is Regression towards the Mean?
Regression towards the Mean refers to a tendency for scores, particularly extreme scores, to shrink
towards the mean. It refers to the fact that if one sample of a random variable is extreme, the next
sampling of the same random variable is likely to be closer to its mean.
15. When does Regression Fallacy occur?
Regression fallacy occurs whenever regression towards the mean is interpreted as a real effect rather
than a chance effect. The regression fallacy can be avoided by splitting the subset of extreme
observations into two groups.
16. Indicate whether the following statements suggest a positive or negative relationship:
(a) More densely populated areas have higher crime rates.
(b) School children who often watch TV perform more poorly on academic achievement tests.
(c) Heavier automobiles yield poorer gas mileage.
(d) Better-educated people have higher incomes.
(e) More anxious people voluntarily spend more time performing a simple repetitive task.
Answer
(a) Positive. The crime rate is higher, square mile by square mile, in densely populated cities than in
sparsely populated rural areas.
(b) Negative. As TV viewing increases, performance on academic achievement tests tends to decline.
(c) Negative. Increases in car weight are accompanied by decreases in miles per gallon.
(d) Positive. Increases in educational level (grade school, high school, college) tend to be associated
with increases in income.
(e) Positive. Highly anxious people willingly spend more time performing a simple repetitive task than
less anxious people.
UNIT – IV
1. What is NumPy? List its uses.
NumPy is a general-purpose array-processing package that provides a high-performance multidimensional
array object and tools for working with these arrays. It is the fundamental package for scientific
computing with Python. It provides an N-dimensional array object and sophisticated (broadcasting)
functions.
Uses of NumPy
NumPy is a Python package used for scientific computing. It is used to perform different operations on
arrays. The ndarray (NumPy array) is a multidimensional array used to store values of the same
datatype. These arrays are indexed just like sequences, starting at zero.

2. Where is NumPy used?
NumPy is an open source numerical Python library. It provides multidimensional array and matrix data
structures. It can be used to perform mathematical operations on arrays, such as trigonometric,
statistical, and algebraic routines.
3. Identify the details maintained by Python to store an Integer.
An integer in Python contains four pieces:
ob_refcnt: a reference count that helps Python handle memory allocation and deallocation
ob_type: encodes the type of the variable
ob_size: specifies the size of the following data members
ob_digit: contains the actual integer value that the Python variable represents
4. Write Python code to create 1D, 2D and 3D NumPy arrays.
1D array:
import numpy as np
a1 = np.array([1, 2, 3])
2D array:
a2 = np.array([[1, 2, 3], [2, 2, 2]])
3D array:
a3 = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
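To see how the three arrays differ, their `ndim` and `shape` attributes can be inspected. This is a small sketch that assumes NumPy is installed; it also shows one broadcasting operation, where the 1D array is added to every row of the 2D array.

```python
import numpy as np

a1 = np.array([1, 2, 3])                                # 1D: 3 elements
a2 = np.array([[1, 2, 3], [2, 2, 2]])                   # 2D: 2 rows x 3 columns
a3 = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])     # 3D: 2 x 2 x 2

for name, arr in [("a1", a1), ("a2", a2), ("a3", a3)]:
    # ndim is the number of axes; shape is the size along each axis
    print(name, arr.ndim, arr.shape)

# Broadcasting: a1 (shape (3,)) is added to each row of a2 (shape (2, 3)).
shifted = a2 + a1
print(shifted)
```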
5. Write short note on python array object.
• The array module allows us to store a collection of numeric values.
• To create an array of numeric values, we need to import the array module.
• Indices are used to access elements of an array.
• The slicing operator is used to access a range of items in an array.
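These points can be sketched with the standard library `array` module. The type code `'i'` (signed integer) and the sample values are choices made only for this illustration.

```python
from array import array

# Create an array of signed integers (type code 'i').
nums = array('i', [10, 20, 30, 40, 50])

first = nums[0]       # indexing starts at zero
last = nums[-1]       # negative indices count from the end
middle = nums[1:4]    # slicing returns a new array with items 1 to 3

nums.append(60)       # arrays grow like lists but hold a single numeric type

print(first, last, middle.tolist(), len(nums))
```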
UNIT – V
1. Write the significance of Data visualization.
Data visualization is of great significance in data science. It is used for many tasks such as exploratory
data analysis, model evaluation, storytelling and so on.
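2. Write python code to plot the sine and cos wave.
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 100)      # 100 evenly spaced points from 0 to 10
fig = plt.figure()
plt.plot(x, np.sin(x), '-')      # sine wave, solid line
plt.plot(x, np.cos(x), '--')     # cosine wave, dashed line
plt.show()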
3. What is scatterplot?
Scatter plots are used to observe relationships between variables. A scatter plot is a type of plot in
which the points are represented individually with a dot, circle or other shape. The scatter() function
in the matplotlib library is used to draw a scatter plot.
4. What is the significance of Error bar?
Error bars indicate the estimated error or uncertainty of a measurement and so show how precise it is.
Error bars are a graphical enhancement that visualizes the variability of the plotted data on a
Cartesian graph.
5. What is contour plot?
Contour plots are a way to show a three-dimensional surface on a two-dimensional plane. A contour plot
graphs two predictor variables, X and Y, on the x-axis and y-axis and a response variable Z as contours.
These contours are sometimes called z-slices or iso-response values.
6. What is a histogram?
A histogram is a graph showing frequency distributions. It shows the number of observations within
each given interval. A simple histogram is useful in understanding a dataset. Matplotlib's hist()
function creates a basic histogram in one line, once the normal boilerplate imports are done.
7. Mention the significance of subplots?
Subplots are used to compare different views of data side by side. A subplot is one of a group of smaller
axes that can exist together within a single figure. These subplots might be insets, grids of plots or
other more complicated layouts.
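As a brief illustration, the sketch below places a histogram and an error bar plot side by side using `plt.subplots`. It assumes Matplotlib is installed; the sample values and variable names are invented for this example.

```python
import matplotlib.pyplot as plt

# Hypothetical data for the two views.
heights = [150, 152, 155, 158, 160, 160, 162, 165, 168, 170, 171, 175]
x = [1, 2, 3, 4]
y = [2.0, 4.1, 5.9, 8.2]
errors = [0.3, 0.4, 0.2, 0.5]

# One figure with two axes in a 1 x 2 grid.
fig, (ax_hist, ax_err) = plt.subplots(1, 2, figsize=(8, 3))

ax_hist.hist(heights, bins=5)                   # counts per interval
ax_hist.set_title("Histogram")

ax_err.errorbar(x, y, yerr=errors, fmt="o")     # points with vertical error bars
ax_err.set_title("Error bars")

fig.tight_layout()
# Call plt.show() in an interactive session to display the figure.
```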
8. List the ways to customize Matplotlib.
• Setting rcParams at runtime
• Using style sheets
• Changing your matplotlibrc file
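The first two approaches can be sketched in a few lines, assuming Matplotlib is installed. The style name `ggplot` and the chosen line width are arbitrary examples; `matplotlib.matplotlib_fname()` only reports where the active matplotlibrc file is located and does not change it.

```python
import matplotlib
import matplotlib.pyplot as plt

# Using a style sheet: apply one of the built-in styles.
plt.style.use("ggplot")

# Setting rcParams at runtime: override a single default.
plt.rcParams["lines.linewidth"] = 2.5

fig, ax = plt.subplots()
line, = ax.plot([0, 1, 2], [0, 1, 4])   # picks up the runtime line width

# Path of the matplotlibrc file that would be edited for permanent changes.
rc_path = matplotlib.matplotlib_fname()
print(line.get_linewidth(), rc_path)
```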
