Data Science
Definition of Data Science 2M
Data Science is an interdisciplinary field that uses scientific methods, algorithms, and
systems to extract insights from structured and unstructured data. It combines elements of
statistics, mathematics, programming, and domain expertise to analyze and interpret
complex data.
# Print the first n terms of the Fibonacci series
fibonacci <- function(n)
{
a <- 0
b <- 1
cat(a, " ")
cat(b, " ")
for (i in 3:n)
{
c <- a + b
cat(c, " ")
a <- b
b <- c
}
}
n_terms <- 10
cat("Fibonacci Series:\n")
fibonacci(n_terms)
# Create a sample vector (example values)
vec <- c(2, 4, 6, 8, 10)
# Calculate Sum
sum_vec <- sum(vec)
# Calculate Mean
mean_vec <- mean(vec)
# Calculate Product
product_vec <- prod(vec)
# Print results
cat("Vector:", vec, "\n")
cat("Sum:", sum_vec, "\n")
cat("Mean:", mean_vec, "\n")
cat("Product:", product_vec, "\n")
Locality Assumption
Assumes similar data points are close to each other in feature space.
Example: If most nearby points are apples, a new point near them is likely an apple.
Choice of K Value Matters
A small K (e.g., 1 or 3) may lead to noise affecting predictions.
A large K (e.g., 20 or 50) may oversmooth and ignore local variations.
Feature Scaling is Important
Distance calculations are affected by different feature scales.
Example: If height is in cm and weight in kg, height will dominate. Normalization (Min-Max
Scaling or Standardization) is required.
Assumes a Meaningful Distance Metric
Uses distance measures like Euclidean, Manhattan, or Cosine similarity.
Example: Euclidean distance works well for continuous data, while Hamming distance is used for
categorical data.
Non-Parametric Nature
Does not assume an underlying distribution (unlike linear regression).
Learns only when making predictions (lazy learning).
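The points above can be sketched with a minimal KNN classifier in R (assumed setup: the recommended class package and the built-in iris data; k = 5 and the train/test split are arbitrary choices for illustration):

```r
# Minimal KNN sketch: standardize features first so no single
# scale dominates the distance calculation
library(class)

X <- scale(iris[, 1:4])
y <- iris$Species

# Simple train/test split
set.seed(42)
idx   <- sample(seq_len(nrow(X)), 100)
train <- X[idx, ];  test  <- X[-idx, ]
cl    <- y[idx];    truth <- y[-idx]

# Classify each test point by majority vote among its 5 nearest neighbours
pred <- knn(train, test, cl, k = 5)
cat("Accuracy:", mean(pred == truth), "\n")
```

Note that all "training" here is just storing the data; the distance computations happen only at prediction time, which is the lazy-learning behaviour described above.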
b What is ggplot2 in R? Write an R program to plot a bar chart using some sample data. [3 1 10]
ggplot2 is a powerful and widely used data visualization package in R. It is based on the
Grammar of Graphics and allows users to create complex visualizations easily by layering
different elements like axes, colors, and labels.
Features of ggplot2:
Provides high-quality, customizable plots.
Supports multiple chart types (bar charts, line charts, histograms, etc.).
Uses a layered approach for plotting.
Works well with dplyr and the tidyverse for data manipulation.
# Load ggplot2 library
library(ggplot2)
# Create sample data
data <- data.frame(
Category = c("A", "B", "C", "D", "E"),
Value = c(10, 25, 15, 30, 20)
)
# Create a bar chart
ggplot(data, aes(x = Category, y = Value, fill = Category)) +
geom_bar(stat = "identity") +
ggtitle("Bar Chart Example") +
xlab("Categories") +
ylab("Values") +
theme_minimal()
Example: averaging the labels of nearby emails may produce a spam score such as 1.5, which is not a valid class label.
With a large dataset, computing distances between emails is slow and inefficient.
Curse of Dimensionality
In spam filtering, emails are represented as high-dimensional feature vectors (e.g., thousands of
words).
KNN performs poorly in high dimensions because distances become less meaningful.
Text data should be treated probabilistically (like in Naïve Bayes) rather than relying on distances.
Alternative: Naïve Bayes is better suited, since it uses word probabilities instead of distances to classify emails.
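The probabilistic alternative can be illustrated with a toy word-probability scorer in base R (a hedged sketch, not a full Naïve Bayes implementation: the word lists and add-one smoothing are illustrative assumptions):

```r
# Toy word-probability spam scorer: compare log-likelihoods of a message's
# words under spam vs. ham word frequencies, with add-one (Laplace) smoothing
spam_words <- c("win", "prize", "free", "win", "money")
ham_words  <- c("meeting", "report", "free", "schedule", "project")

word_logprob <- function(words, corpus) {
  vocab  <- unique(c(spam_words, ham_words))
  counts <- table(factor(corpus, levels = vocab))
  # Smoothed log-probability of each word, summed over the message
  sum(log((counts[words] + 1) / (length(corpus) + length(vocab))))
}

classify <- function(message) {
  words <- intersect(strsplit(tolower(message), "\\s+")[[1]],
                     unique(c(spam_words, ham_words)))
  if (word_logprob(words, spam_words) > word_logprob(words, ham_words))
    "spam" else "ham"
}

classify("win free money")   # -> "spam"
classify("project meeting")  # -> "ham"
```

Unlike KNN, this approach never computes a distance: each word contributes a probability, so the cost of classifying a message grows with its length, not with the size of the training corpus.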
What is PCA?
Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction
while retaining the most important information in a dataset. It transforms the original features into
a new set of uncorrelated variables called Principal Components (PCs), which capture the
maximum variance in the data.
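As a sketch, PCA can be run with base R's prcomp on the built-in iris measurements (the dataset choice is just for illustration; scale. = TRUE standardizes the features before the variance decomposition):

```r
# PCA on the four iris measurements, with standardization
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Proportion of total variance captured by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# The scores on the first two PCs give a reduced 2-D view of the data
scores <- pca$x[, 1:2]
head(scores)
```

The components in pca$x are uncorrelated by construction, and the variance proportions show how few components are needed to retain most of the information.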
8a Write short notes on: [3 3 10]
i) Feature selection criteria
ii) Random Forest
iii) The three primary methods of Regression.
iv) The Kaggle model