Data Science

The document provides a comprehensive overview of data science, including its definition, the Venn diagram illustrating its components, and various concepts such as statistical inference, R programming basics, and machine learning algorithms like KNN and Naïve Bayes. It also covers data visualization techniques, the Data Science process, and frameworks like MapReduce for big data processing. Additionally, it discusses the limitations of certain algorithms for specific tasks, such as spam filtering, and the significance of feature selection and generation in data analysis.


1a Define Data Science. Explain the Venn diagram of Data Science.

10
Definition of Data Science  2M
Data Science is an interdisciplinary field that uses scientific methods, algorithms, and
systems to extract insights from structured and unstructured data. It combines elements of
statistics, mathematics, programming, and domain expertise to analyze and interpret
complex data.

Venn Diagram of Data Science  4M


The Venn diagram of Data Science consists of three overlapping components:
1. Mathematics & Statistics
2. Computer Science (Programming & Data Engineering)
3. Domain Knowledge (Business/Industry Expertise)
Intersections in the Venn Diagram 4M
 Machine Learning (Mathematics + Computer Science)
 Traditional Research (Mathematics + Domain Knowledge)
 Software Development (Computer Science + Domain Knowledge)
 Data Science (Core Area)

b Write a short note about the basics of R. Write an R program to print the first 10 terms of the Fibonacci Series.
R is a powerful programming language and environment primarily used for statistical computing,
data analysis, and graphical representation. It is widely used in data science, machine learning,
and bioinformatics.  2M
Basic Features of R:  2M
 Open-source and free to use.
 Provides built-in functions for statistical analysis.
 Supports data visualization using packages like ggplot2.
 Uses vectors, matrices, and data frames for data manipulation.
 Offers extensive libraries such as dplyr, tidyr, and caret for data science applications.
Code with sample output :  5 + 1 M
fibonacci <- function(n)
{
  a <- 0
  b <- 1
  cat(a, b, "")        # print the first two terms
  for (i in 3:n)
  {
    c <- a + b         # next term is the sum of the previous two
    cat(c, "")
    a <- b
    b <- c
  }
}

n_terms <- 10
cat("Fibonacci Series:\n")
fibonacci(n_terms)

# Sample output:
# Fibonacci Series:
# 0 1 1 2 3 5 8 13 21 34

2a Explain the following concepts with examples: 10


i. Statistical Inference
ii. Population
iii. Samples
iv. Types of data
v. Big Data
i. Statistical Inference  2M
 Statistical inference is the process of drawing conclusions about a population based on a sample of data. It involves techniques like estimation, hypothesis testing, and confidence intervals.
Any suitable Example:
 A researcher collects the test scores of 100 students from a university and uses statistical methods to infer the average test score of all students at the university (a brief R sketch of this idea appears after part v).
ii. Population
 A population is the entire group of individuals or observations that a study aims to analyze. It
can be finite or infinite.
Any suitable Example:
All citizens of a country when conducting a national census.
iii. Samples
A sample is a subset of the population selected for analysis. It is used when studying the entire
population is impractical.
Example:
Surveying 1,000 voters (sample) to predict the outcome of a national election.
iv. Types of Data
Data can be categorized into different types based on its nature and usage:
Quantitative Data (Numerical) – Represents measurable quantities.
Example: Heights of students (in cm), annual income (in dollars).
Qualitative Data (Categorical) – Represents characteristics or categories.
Example: Gender (Male/Female), Car brands (Toyota, Ford).
Further Classification:
Discrete Data: Countable values (e.g., Number of students in a class).
Continuous Data: Measurable values (e.g., Temperature, Weight).
v. Big Data
Big Data refers to extremely large datasets that cannot be processed using traditional data management tools. It is characterized by the 5 Vs: Volume, Velocity, Variety, Veracity, and Value.
Example:
Data generated by social media platforms like Facebook and Twitter.
Real-time transaction data from e-commerce websites like Amazon.
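Returning to part (i), the following is a minimal R sketch of statistical inference: a sample of 100 scores is drawn from a simulated population (the numbers are purely illustrative), and the population mean is estimated with a 95% confidence interval.

# Simulate a population of 10,000 student test scores (illustrative values only)
set.seed(42)
population <- rnorm(10000, mean = 70, sd = 10)

# Draw a random sample of 100 students
sample_scores <- sample(population, 100)

# Point estimate and 95% confidence interval for the population mean
sample_mean <- mean(sample_scores)
ci <- t.test(sample_scores)$conf.int

cat("Sample mean:", round(sample_mean, 2), "\n")
cat("95% CI for population mean:", round(ci[1], 2), "to", round(ci[2], 2), "\n")
cat("True population mean:", round(mean(population), 2), "\n")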
b Write a program to find the Sum, Mean and Product of a Vector in R programming. 10
# Define a vector
vec <- c(2, 4, 6, 8, 10)

# Calculate Sum
sum_vec <- sum(vec)

# Calculate Mean
mean_vec <- mean(vec)

# Calculate Product
product_vec <- prod(vec)

# Print results
cat("Vector:", vec, "\n")
cat("Sum:", sum_vec, "\n")
cat("Mean:", mean_vec, "\n")
cat("Product:", product_vec, "\n")

3a Briefly explain the Data Science process with a neat diagram. 10

Explanation of Data Science Process  5M
Neat and Clear Diagram  5M
Data Science Process
The Data Science process consists of several stages that transform raw data into valuable insights.
Below are the key steps:
Problem Definition
Understanding the problem and defining objectives.
Example: Predicting customer churn for a telecom company.
Data Collection
Gathering relevant data from various sources (databases, APIs, web scraping).
Example: Collecting transaction data from an e-commerce platform.
Data Cleaning & Preprocessing
Handling missing values, duplicates, and outliers.
Example: Removing inconsistent records from customer datasets.
Exploratory Data Analysis (EDA)
Understanding patterns and relationships in data using visualization and summary statistics.
Example: Using histograms and scatter plots to explore sales trends.
Feature Engineering & Selection
Creating new features and selecting important ones for better model performance.
Example: Creating a “customer loyalty score” based on past purchases.
Model Building & Training
Applying machine learning algorithms to train predictive models.
Example: Using logistic regression to predict whether a customer will churn (see the sketch after these steps).
Model Evaluation & Optimization
Checking model accuracy using metrics like precision, recall, and RMSE.
Example: Comparing different machine learning models to find the best one.
Deployment & Monitoring
Deploying the model into a real-world system and tracking its performance.
Example: Integrating a fraud detection model into an online banking system.
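As an illustration of the Model Building & Training step, here is a minimal R sketch that fits a logistic regression to a tiny, hypothetical churn dataset (the variables and values are made up for illustration, not actual telecom data).

# Hypothetical churn data: monthly charges, tenure in months, and a churn flag
churn_data <- data.frame(
  monthly_charges = c(70, 20, 95, 40, 85, 30, 50, 100, 50, 90),
  tenure          = c(2, 48, 5, 36, 8, 60, 20, 3, 20, 25),
  churn           = factor(c(1, 0, 1, 0, 1, 0, 1, 1, 0, 0))
)

# Fit a logistic regression model: churn as a function of charges and tenure
model <- glm(churn ~ monthly_charges + tenure, data = churn_data, family = binomial)
summary(model)

# Predict the churn probability for a new customer
new_customer <- data.frame(monthly_charges = 80, tenure = 4)
predict(model, new_customer, type = "response")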
b What is Matplotlib? Write an R program to plot a line chart by assuming your own data. 10
Matplotlib is a popular data visualization library in Python used for creating static, animated, and
interactive plots. It provides various plotting functions like line charts, bar charts, histograms, and
scatter plots.
Since the question asks for an R program, the line chart below is drawn with ggplot2 (base R's plot() function would also work).
# Load the necessary library
library(ggplot2)

# Create a sample dataset
data <- data.frame(
  Year  = c(2015, 2016, 2017, 2018, 2019, 2020),
  Sales = c(500, 700, 900, 1100, 1300, 1500)
)

# Create a line chart with a point marking each year
ggplot(data, aes(x = Year, y = Sales)) +
  geom_line(color = "blue", size = 1) +
  geom_point(color = "red", size = 3) +
  ggtitle("Yearly Sales Growth") +
  xlab("Year") +
  ylab("Sales") +
  theme_minimal()
4a Explain the concept of the KNN algorithm. What are the modelling assumptions of the KNN algorithm? Explain with examples. 10
K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for both
classification and regression tasks. It is a non-parametric and instance-based learning algorithm,
meaning it does not make strong assumptions about the underlying data distribution and directly
stores training data for making predictions.
Similarity-Based Learning

Assumes similar data points are close to each other in feature space.
Example: If most nearby points are apples, a new point near them is likely an apple.
Choice of K Value Matters
A small K (e.g., 1 or 3) may lead to noise affecting predictions.
A large K (e.g., 20 or 50) may oversmooth and ignore local variations.
Feature Scaling is Important
Distance calculations are affected by different feature scales.
Example: If height is in cm and weight in kg, height will dominate. Normalization (Min-Max
Scaling or Standardization) is required.
Assumes a Meaningful Distance Metric
Uses distance measures like Euclidean, Manhattan, or Cosine similarity.
Example: Euclidean distance works well for continuous data, while Hamming distance is used for
categorical data.
Non-Parametric Nature
Does not assume an underlying distribution (unlike linear regression).
Learns only when making predictions (lazy learning).
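A minimal R sketch of KNN classification, using the class package and the built-in iris dataset; the features are scaled first, as noted above, so that no single variable dominates the distance calculation.

# KNN classification on the built-in iris dataset
library(class)

set.seed(1)
features <- scale(iris[, 1:4])   # feature scaling before distance-based learning
labels   <- iris$Species

# Hold out 30 observations as a test set
test_idx <- sample(nrow(iris), 30)
train_x <- features[-test_idx, ]; train_y <- labels[-test_idx]
test_x  <- features[test_idx, ];  test_y  <- labels[test_idx]

# Classify each test point by majority vote among its k = 5 nearest neighbours
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)

# Accuracy on the held-out test set
mean(pred == test_y)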
b What is ggplot2 in R? Write an R program to plot a bar chart using some sample data. 10
ggplot2 is a powerful and widely used data visualization package in R. It is based on the
Grammar of Graphics and allows users to create complex visualizations easily by layering
different elements like axes, colors, and labels.
Features of ggplot2:
 Provides high-quality, customizable plots.
 Supports multiple chart types (bar charts, line charts, histograms, etc.).
 Uses a layered approach for plotting.
 Works well with dplyr and tidyverse for data manipulation.
# Load the ggplot2 library
library(ggplot2)

# Create sample data
data <- data.frame(
  Category = c("A", "B", "C", "D", "E"),
  Value    = c(10, 25, 15, 30, 20)
)

# Create a bar chart (stat = "identity" uses the Value column as bar heights)
ggplot(data, aes(x = Category, y = Value, fill = Category)) +
  geom_bar(stat = "identity") +
  ggtitle("Bar Chart Example") +
  xlab("Categories") +
  ylab("Values") +
  theme_minimal()

5a Explain the Naïve Bayes algorithm for filtering spam with an example. 10


The Naïve Bayes algorithm is a probabilistic machine learning model used for classification
tasks, including spam filtering. It is based on Bayes' Theorem and assumes that features are
independent of each other (hence the term "naïve").
Bayes' Theorem Formula:
P(Spam | Words) = P(Words | Spam) × P(Spam) / P(Words)
How Naïve Bayes is Used for Spam Filtering
Training Phase:
Collect a dataset of emails labeled as spam or ham (not spam).
Extract features like word frequency (e.g., "free", "win", "money").
Calculate probabilities of each word appearing in spam and non-spam emails.
Prediction Phase:
For a new email, check the presence of words and compute probabilities using Bayes' theorem.
If the probability of spam is higher than ham, classify the email as spam, otherwise classify it as
ham.
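A minimal R sketch of this idea using the e1071 package (a CRAN package installed separately); the emails, words, and labels below are purely illustrative word-presence features, not a real corpus.

# Naive Bayes spam classifier on toy word-presence features
library(e1071)

# Each row is an email; columns record whether a keyword occurs ("yes"/"no")
emails <- data.frame(
  free  = c("yes", "yes", "no", "no",  "yes", "no"),
  win   = c("yes", "no",  "no", "yes", "yes", "no"),
  money = c("yes", "yes", "no", "no",  "no",  "no"),
  label = c("spam", "spam", "ham", "ham", "spam", "ham"),
  stringsAsFactors = TRUE
)

# Train with Laplace smoothing (laplace = 1) to avoid zero probabilities
model <- naiveBayes(label ~ ., data = emails, laplace = 1)

# Classify a new email containing "free" and "money" but not "win"
new_email <- data.frame(free = "yes", win = "no", money = "yes")
predict(model, new_email)                 # predicted class (spam or ham)
predict(model, new_email, type = "raw")   # posterior probabilities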
b Compare the Naïve Bayes algorithm with the KNN algorithm. Explain Laplace smoothing. 10

Why is Laplace Smoothing Needed?


In Naïve Bayes, if a word (feature) never appears in the training data for a class, its probability becomes zero, making the entire probability calculation zero. This is called the zero-frequency problem. Laplace smoothing fixes it by adding 1 to every word count:
P(word | class) = (count(word, class) + 1) / (total words in class + V), where V is the vocabulary size.
 Prevents zero probabilities.
 Makes the model more robust, especially for text classification.
 Works well for handling rare words or unseen features in test data.
6a Why are Linear Regression and KNN poor choices for filtering spam? Discuss. 10
Why Linear Regression and KNN are Poor Choices for Spam Filtering?
Spam filtering is a classification problem where emails are categorized as spam or ham (not
spam). While algorithms like Naïve Bayes are well-suited for text classification, Linear
Regression and K-Nearest Neighbors (KNN) have significant drawbacks when applied to spam
filtering.
Why Linear Regression is a Poor Choice?
1. Not Designed for Classification
Linear Regression is meant for continuous output (regression tasks), not categorical classification. It produces continuous values instead of distinct class labels (spam or ham).
2. Decision Boundary Issues
A linear model assumes a straight-line relationship between inputs and output, which does not fit the complex decision boundaries in text classification. Emails have non-linear relationships between words and spam probability.
3. Probabilities May Go Out of Range
Linear Regression may predict values outside the valid probability range (0 to 1).
Example: It may assign an email a spam score of 1.5, which is not a valid probability.
Why KNN is a Poor Choice?
1. High Computational Cost
KNN is a lazy learner, meaning it stores all training data and makes predictions by computing distances between the new data point and all stored emails. With a large dataset, computing distances between emails is slow and inefficient.
2. Curse of Dimensionality
In spam filtering, emails are represented as high-dimensional feature vectors (e.g., thousands of words). KNN performs poorly in high dimensions because distances become less meaningful.
3. Does Not Handle Text Data Well
KNN requires a distance metric (e.g., Euclidean distance), which is not ideal for categorical/text data. Text data is better treated probabilistically (as in Naïve Bayes) than with distance calculations.
Alternative: Naïve Bayes is better suited since it uses word probabilities instead of distances to classify emails.

b Explain scraping the web with APIs and other tools. 10


Scraping the Web with APIs and Other Tools
Web scraping is the process of extracting data from websites. It can be done using APIs or web
scraping tools when APIs are not available.

Scraping the Web with APIs


APIs (Application Programming Interfaces) allow structured access to web data without parsing
HTML. Many websites provide APIs to fetch data efficiently.
Steps for Web Scraping Using APIs
Find the API – Check if the website provides a public API (e.g., Twitter API, OpenWeather API).
Get API Key – Many APIs require authentication with an API key.
Send Requests – Use GET or POST requests to fetch data.
Process JSON/XML Response – Extract and analyze data from the response (see the example below).
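A minimal R sketch of these steps using the httr and jsonlite packages (both CRAN packages); the URL and API key below are placeholders, not a real endpoint.

# Fetch JSON data from a (placeholder) REST API and parse it
library(httr)
library(jsonlite)

api_url <- "https://api.example.com/v1/weather"   # placeholder endpoint
api_key <- "YOUR_API_KEY"                          # placeholder key

# Send a GET request with query parameters, including the API key
response <- GET(api_url, query = list(city = "London", appid = api_key))

# Check the HTTP status before using the body
if (status_code(response) == 200) {
  parsed <- fromJSON(content(response, as = "text", encoding = "UTF-8"))
  str(parsed)   # inspect the parsed result
} else {
  cat("Request failed with status:", status_code(response), "\n")
}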
7a Explain and construct a Decision Tree with an example. 10
b Explain the concept of Principal Component Analysis (PCA) and evaluate its significance in dimensionality reduction. 10

What is PCA?
Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction
while retaining the most important information in a dataset. It transforms the original features into
a new set of uncorrelated variables called Principal Components (PCs), which capture the
maximum variance in the data.
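A minimal R sketch of PCA using base R's prcomp() on the four numeric iris measurements, reducing the data from four dimensions to two while keeping most of the variance.

# PCA on the four numeric iris measurements
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
summary(pca)

# Dimensionality reduction: keep only the first two principal components
reduced <- pca$x[, 1:2]
head(reduced)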
8a Write short notes on: 10
i) Feature selection criteria
ii) Random Forest
iii) The three primary methods of Regression.
iv) The Kaggle model

b Discuss feature generation. Describe feature selection using the SVD algorithm. 10
9a Explain the concept of data visualization. What are its key objectives? Discuss the various techniques used to visualize time series data. 10
Data visualization is the graphical representation of data to help users understand patterns, trends,
and insights. It transforms raw data into visual formats like charts, graphs, and maps, making
complex data easier to interpret and analyze.
Key Objectives of Data Visualization
Simplify Data Interpretation – Makes complex data more accessible and understandable.
Identify Trends and Patterns – Helps detect trends, correlations, and anomalies in data.
Support Decision-Making – Assists businesses and researchers in making informed decisions.
Enhance Communication – Provides a clear way to present data-driven insights.
Detect Outliers – Helps in identifying unusual data points or inconsistencies.
Techniques for Visualizing Time Series Data
Line Chart – Best for showing trends over time.
Area Chart – Similar to a line chart but with the area filled, emphasizing volume.
Bar Chart – Useful for comparing values over time, especially with categorical data.
Scatter Plot – Shows relationships between time and another variable.
Heatmap – Displays variations in data using color intensity.
Box Plot – Helps in understanding the distribution and variability of data over time.
Moving Averages & Smoothing – Used to highlight trends by reducing short-term fluctuations.
Seasonal Decomposition Plot – Separates time series data into trend, seasonal, and residual components.
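As a small illustration of the line chart and moving-average techniques above, the following R sketch plots a made-up monthly series and overlays a 3-month moving average (the data are invented for illustration).

# Made-up monthly sales figures for two years
sales <- c(120, 135, 150, 145, 160, 175, 170, 185, 200, 195, 210, 230,
           225, 240, 255, 250, 265, 280, 275, 290, 305, 300, 315, 335)
sales_ts <- ts(sales, start = c(2021, 1), frequency = 12)

# 3-month centred moving average to smooth short-term fluctuations
ma3 <- stats::filter(sales_ts, rep(1/3, 3), sides = 2)

# Line chart of the raw series with the moving average overlaid
plot(sales_ts, type = "l", col = "blue",
     main = "Monthly Sales with 3-Month Moving Average",
     xlab = "Time", ylab = "Sales")
lines(ma3, col = "red", lwd = 2)
legend("topleft", legend = c("Sales", "3-month MA"),
       col = c("blue", "red"), lty = 1)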
b Explain the MapReduce framework in data engineering. How can it be applied to solve the word frequency problem in large text datasets? 10
MapReduce is a distributed data processing framework used in big data engineering to process
large datasets across multiple machines in a parallel manner. It is primarily used in Hadoop
ecosystems and follows a divide-and-conquer approach to process data efficiently.
Components of MapReduce
Map Phase
Splits the input data into smaller chunks and processes them in parallel.
Each chunk is processed by a mapper function, which transforms the input into key-value pairs.
Shuffle & Sort Phase
The intermediate key-value pairs from the mappers are grouped by key.
Data is then sorted and sent to the reducers.
Reduce Phase
The reducers process grouped data and aggregate results.
The final output is stored in HDFS (Hadoop Distributed File System) or another storage system.
Example: Word Frequency (Word Count)
The word frequency problem involves counting the occurrences of each word in a large text dataset. The mapper emits a (word, 1) pair for every word it sees, the shuffle phase groups these pairs by word, and the reducer sums the counts for each word.
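The sketch below simulates the three phases in plain R on a small in-memory example; in a real Hadoop or Spark cluster the mapper and reducer would run in parallel across many machines, but the logic is the same.

# Word count with MapReduce-style phases, simulated in base R
lines_in <- c("big data needs big tools",
              "map reduce splits big jobs",
              "reduce combines the results")

# Map phase: turn each line into (word, 1) key-value pairs
map_fn <- function(line) {
  words <- unlist(strsplit(tolower(line), "\\s+"))
  lapply(words, function(w) list(key = w, value = 1))
}
mapped <- unlist(lapply(lines_in, map_fn), recursive = FALSE)

# Shuffle & sort phase: group the values by key (word)
keys    <- sapply(mapped, `[[`, "key")
grouped <- split(sapply(mapped, `[[`, "value"), keys)

# Reduce phase: sum the counts for each word
word_counts <- sapply(grouped, sum)
word_counts   # e.g. big = 3, reduce = 2, ...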
10a Discuss the essential characteristics of a social network with a suitable example. Explain the concept of a social graph and its relevance in the field of data science. 10
Essential Characteristics of a Social Network
A social network is a structure made up of individuals or organizations that are connected through
relationships such as friendship, professional connections, or shared interests. The key
characteristics of social networks include:
Nodes (Entities) – Individuals, groups, or organizations that form the network.
Edges (Connections) – Relationships between nodes, which can be directed (one-way) or
undirected (mutual).
Community Structure – Groups of highly connected nodes within the network.
Degree Centrality – The number of direct connections a node has.
Homophily (Similarity) – The tendency of similar individuals to form connections (e.g., people
with common interests forming clusters).
Influence & Virality – Information spreads quickly through key influencers (viral marketing,
trends, etc.).
Dynamic Nature – Networks evolve as new connections form and old ones disappear.
Concept of a Social Graph
A social graph is a graphical representation of relationships between users in a social network. It
consists of:
Nodes (Users/Entities) representing people or groups.
Edges (Connections) depicting friendships, follows, or interactions.
Relevance of Social Graph in Data Science
Recommendation Systems – Used by platforms like Netflix and Amazon to suggest content based
on social connections.
Influencer Identification – Helps in marketing campaigns by identifying key users who influence
others.
Fraud Detection – Banks use social graphs to detect suspicious activities by analyzing transaction
networks.
Sentiment Analysis – Understanding public opinion on social media by analyzing interaction
patterns.
Disease Spread Analysis – Predicting how diseases (like COVID-19) spread through human
interactions.
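A minimal R sketch of a social graph using the igraph package (a CRAN package); the people and friendships below are hypothetical.

# Build a small social graph from an edge list and inspect it
library(igraph)

# Hypothetical friendship edges (undirected)
edges <- data.frame(
  from = c("Alice", "Alice", "Bob",   "Carol", "Dave", "Eve"),
  to   = c("Bob",   "Carol", "Carol", "Dave",  "Eve",  "Alice")
)
g <- graph_from_data_frame(edges, directed = FALSE)

# Nodes, edges, and degree centrality (number of direct connections per person)
V(g)$name
ecount(g)
degree(g)

# Simple visualization of the social graph
plot(g, vertex.color = "lightblue", vertex.size = 30)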
b Explain the Girvan-Newman algorithm with an example. 10
The Girvan-Newman algorithm is a community detection algorithm used in social network
analysis to identify clusters (communities) by iteratively removing edges with the highest
betweenness centrality.
Steps of the Girvan-Newman Algorithm
Calculate Edge Betweenness
Betweenness centrality measures how often an edge appears in the shortest paths between
nodes. Higher betweenness means the edge acts as a "bridge" between communities.
Remove the Edge with Highest Betweenness
The most influential edge is deleted.
The network may split into smaller groups.
Recalculate Betweenness
After removing an edge, betweenness is recalculated for the remaining edges.
Repeat Until Communities Emerge
The process continues until a clear separation of clusters is visible.
Applications of the Girvan-Newman Algorithm
Social Network Analysis – Finding friend groups in social media.
Biological Networks – Identifying functional modules in protein interactions.
Fraud Detection – Discovering fraudulent activity groups.
Marketing & Targeting – Understanding customer clusters for better recommendations.
