Unit I 2 Marks
PART A
1 What is Data Science?
• Data Science is the area of study which involves extracting insights from vast amounts of data using various scientific methods, algorithms, and processes.
• It helps you to discover hidden patterns from the raw data.
• Data Science is an interdisciplinary field that allows you to extract knowledge from structured or unstructured data.
• Data science enables you to translate a business problem into a research project and then translate it back into a practical solution.
7 List out the tools for Data Science.
• Data Analysis – R, Python, Spark, and SAS
• Data Warehousing – Hadoop, SQL
• Data Visualization – R, Tableau
• Machine Learning – Spark, Azure ML Studio
2 Why is Data Science needed?
• It helps you to recommend the right product to the right customer to enhance your business.
• It allows you to build intelligent capabilities into machines.
• It enables you to take better and faster decisions.
• Data Science can help you to detect fraud using advanced machine learning algorithms.
• It helps you to prevent any significant financial losses.
3 What are the components of data science?
• Domain expertise
• Data engineering
• Statistics
• Visualization
• Advanced computing
8 List out some applications of Data Science.
• Internet Search Results (Google)
• Recommendation Engine (Spotify)
• Intelligent Digital Assistants (Google Assistant)
• Autonomous Driving Vehicle (Waymo, Tesla)
• Spam Filter (Gmail)
• Abusive Content and Hate Speech Filter (Facebook)
• Robotics (Boston Dynamics)
• Automatic Piracy Detection (YouTube)
9 What are the skills required to become a data scientist?
4 List out the data science jobs.
Most prominent Data Scientist job titles are:
• Data Scientist
• Data Engineer
• Data Analyst
• Statistician
• Data Architect
• Data Admin
• Business Analyst
• Data/Analytics Manager
10 What are the challenges of Data Science technology?
• A high variety of information and data is required for accurate analysis
• Not adequate data science talent pool available
• Management does not provide financial support for a data science team
• Unavailability of/difficult access to data
• Business decision-makers do not effectively use data science results
• Explaining data science to others is difficult
• Privacy issues
• Lack of significant domain expertise
• If an organization is very small, it can't have a Data Science team
5 What is Euclidean distance?
Ans. Euclidean distance is used to measure the similarity between observations. It is calculated as the square root of the sum of the squared differences between the corresponding coordinates of two points.
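The Euclidean distance from question 5 can be computed directly from its definition. The following is a minimal sketch in plain Python; the function name `euclidean_distance` and the sample observations are illustrative assumptions, not part of the original notes.

```python
import math

def euclidean_distance(p, q):
    """Return the Euclidean distance between two equal-length numeric sequences."""
    if len(p) != len(q):
        raise ValueError("observations must have the same number of features")
    # Square each coordinate difference, sum the squares, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two hypothetical observations with two features each.
print(euclidean_distance([1, 2], [4, 6]))  # 5.0
```

A smaller distance means the two observations are more similar; a distance of 0 means they are identical on every feature.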
6 List the data cleaning tasks.
Ans. The data cleaning tasks are as follows:
1. Data acquisition and metadata
2. Fill in missing values
3. Unified date format
4. Converting nominal to numeric
5. Identify outliers and smooth out noisy data
11 What is a Project Charter?
Clients like to know upfront what they are paying for, so after getting a good understanding of the business problem, try to get a formal agreement on the deliverables. The outcome should be a clear research goal, a good understanding of the context, well-defined deliverables, and a plan of action with a timetable. All this information is collected in a project charter.
12 List the steps involved in data cleansing.
• Errors from data entry
• Physically impossible values
• Missing values
• Outliers
• Spaces and typos
• Errors against the codebook
13 What do you mean by Outliers?
An outlier is an observation that seems to be distant from other observations or, more specifically, one observation that follows a different logic or generative process than the other observations. The easiest way to find outliers is to use a plot or a table with the minimum and maximum values.
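The minimum/maximum table mentioned in question 13 can be sketched as follows. This is an illustrative example in plain Python; the helper `min_max_summary` and the sample values are assumptions made for the demonstration.

```python
def min_max_summary(table):
    """Return {column: (minimum, maximum)} for a dict of column name -> list of numbers."""
    return {name: (min(values), max(values)) for name, values in table.items()}

# Hypothetical data: one age looks as if it was mistyped during data entry.
observations = {
    "age": [23, 35, 41, 29, 350],
    "height_cm": [172, 165, 180, 158, 169],
}

summary = min_max_summary(observations)
for name, (low, high) in summary.items():
    print(f"{name}: min={low}, max={high}")
```

In this sample the maximum age of 350 stands out immediately as a physically impossible value, which is exactly the kind of observation the table is meant to expose.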
14 What are the two operations used to combine information from different datasets?
• The first operation is joining: enriching an observation from one table with information
from another table.
• The second operation is appending or stacking: adding the observations of one table to those of
another table.
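The two operations from question 14 can be sketched with plain Python lists of dictionaries, where each dictionary is one observation. The function names `join` and `append` and the sample tables are illustrative assumptions; the sketch also assumes the key is unique in the second table. In practice a data-frame library would normally be used.

```python
def join(left, right, key):
    """Enrich each row of `left` with the matching row of `right` (matched on `key`)."""
    lookup = {row[key]: row for row in right}
    # Rows of `left` without a match are kept unchanged (a left join).
    return [{**row, **lookup.get(row[key], {})} for row in left]

def append(first, second):
    """Stack the observations of `second` under those of `first`."""
    return first + second

# Hypothetical tables.
clients = [{"client_id": 1, "name": "Ann"}, {"client_id": 2, "name": "Bob"}]
regions = [{"client_id": 1, "region": "North"}, {"client_id": 2, "region": "South"}]
new_clients = [{"client_id": 3, "name": "Cy"}]

joined = join(clients, regions, "client_id")   # same rows, more columns
stacked = append(clients, new_clients)         # same columns, more rows
print(joined)
print(stacked)
```

Joining widens the table by adding columns to existing observations, while appending lengthens it by adding observations.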
▶ Exploratory Data Analysis (EDA) is an approach to analyse the data using visual techniques.
▶ Information becomes much easier to grasp when shown in a picture, therefore we mainly
use graphical techniques to gain an understanding of data and the interactions between variables.
▶ The visualization techniques used in this phase range from simple line graphs or histograms
to more complex diagrams such as Sankey and network graphs.
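As a rough illustration of the simplest technique named above, the histogram, the following sketch prints a text-based histogram using only the Python standard library. It stands in for a graphical plot; the function `text_histogram`, the bin width, and the sample ages are assumptions, and the values are assumed to be integers.

```python
from collections import Counter

def text_histogram(values, bin_width):
    """Count integer values per bin and return one text line per bin."""
    # Map each value to the lower edge of its bin, e.g. 35 -> 30 for width 10.
    counts = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    for start in range(min(counts), max(counts) + bin_width, bin_width):
        label = f"{start}-{start + bin_width - 1}"
        lines.append(f"{label:>7} | {'#' * counts[start]}")
    return lines

# Hypothetical sample of ages.
ages = [23, 25, 31, 35, 36, 38, 41, 47, 52]
for line in text_histogram(ages, 10):
    print(line)
```

Even in this crude form the shape of the distribution is visible at a glance, which is the point of using visual techniques during exploration.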
• Data mining provides tools to discover knowledge from data and it turns a large collection of
data into knowledge.
19 What is a data warehouse?
• A data warehouse is a repository of information collected from multiple sources stored under
a unified schema and usually residing at a single site.
• Data warehouses are constructed via a process of data cleaning, data integration, data
transformation, data loading, and periodic data refreshing.
• Although certain companies consider data a highly valuable asset, more and
more governments and organizations share their data for free with the world.
• This data can be of excellent quality; how good it is depends on the institution that creates and manages it.
• The information they share covers a broad range of topics in a certain region and its demographics.
22 What is the need for basic statistical descriptions of data?