Data Science

The document provides a comprehensive overview of data science, including its definition, the Venn diagram illustrating its components, and various concepts such as statistical inference, R programming basics, and machine learning algorithms like KNN and Naïve Bayes. It also covers data visualization techniques, the Data Science process, and frameworks like MapReduce for big data processing. Additionally, it discusses the limitations of certain algorithms for specific tasks, such as spam filtering, and the significance of feature selection and generation in data analysis.


1a Define Data Science. Explain the Venn diagram of Data Science.

10
Definition of Data Science  2M
Data Science is an interdisciplinary field that uses scientific methods, algorithms, and
systems to extract insights from structured and unstructured data. It combines elements of
statistics, mathematics, programming, and domain expertise to analyze and interpret
complex data.

Venn Diagram of Data Science  4M


The Venn diagram of Data Science consists of three overlapping components:
1. Mathematics & Statistics
2. Computer Science (Programming & Data Engineering)
3. Domain Knowledge (Business/Industry Expertise)
Intersections in the Venn Diagram 4M
 Machine Learning (Mathematics + Computer Science)
 Traditional Research (Mathematics + Domain Knowledge)
 Software Development (Computer Science + Domain Knowledge)
 Data Science (Core Area)

b Write a short note about the basics of R. Write an R program to print the first 10 terms of the Fibonacci Series.
R is a powerful programming language and environment primarily used for statistical computing,
data analysis, and graphical representation. It is widely used in data science, machine learning,
and bioinformatics.  2M
Basic Features of R:  2M
 Open-source and free to use.
 Provides built-in functions for statistical analysis.
 Supports data visualization using packages like ggplot2.
 Uses vectors, matrices, and data frames for data manipulation.
 Offers extensive libraries such as dplyr, tidyr, and caret for data science applications.
Code with sample output :  5 + 1 M
fibonacci <- function(n)
{
  a <- 0
  b <- 1
  cat(a, b, "")        # print the first two terms
  for (i in 3:n)
  {
    c <- a + b         # next term is the sum of the previous two
    cat(c, "")
    a <- b
    b <- c
  }
}

n_terms <- 10
cat("Fibonacci Series:\n")
fibonacci(n_terms)

# Sample output:
# Fibonacci Series:
# 0 1 1 2 3 5 8 13 21 34

2a Explain the following concepts with examples: 10


i. Statistical Inference
ii. Population
iii. Samples
iv. Types of data
v. Big Data
i. Statistical Inference  2M
 Statistical inference is the process of drawing conclusions about a population based on a sample of data. It involves techniques like estimation, hypothesis testing, and confidence intervals.
Any suitable Example:
 A researcher collects the test scores of 100 students from a university and uses statistical methods to infer the average test score of all students at the university (a brief R sketch of this idea appears after part v).
ii. Population
 A population is the entire group of individuals or observations that a study aims to analyze. It
can be finite or infinite.
Any suitable Example:
All citizens of a country when conducting a national census.
iii. Samples
A sample is a subset of the population selected for analysis. It is used when studying the entire
population is impractical.
Example:
Surveying 1,000 voters (sample) to predict the outcome of a national election.
iv. Types of Data
Data can be categorized into different types based on its nature and usage:
Quantitative Data (Numerical) – Represents measurable quantities.
Example: Heights of students (in cm), annual income (in dollars).
Qualitative Data (Categorical) – Represents characteristics or categories.
Example: Gender (Male/Female), Car brands (Toyota, Ford).
Further Classification:
Discrete Data: Countable values (e.g., Number of students in a class).
Continuous Data: Measurable values (e.g., Temperature, Weight).
v. Big Data
Big Data refers to extremely large datasets that cannot be processed using traditional data management tools. It is characterized by the 5 Vs: Volume, Velocity, Variety, Veracity, and Value.
Example:
Data generated by social media platforms like Facebook and Twitter.
Real-time transaction data from e-commerce websites like Amazon.
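Returning to part (i), the following is a minimal R sketch of statistical inference: a sample of 100 scores is drawn from a simulated population (the numbers are purely illustrative), and the population mean is estimated with a 95% confidence interval.

# Simulate a population of 10,000 student test scores (illustrative values only)
set.seed(42)
population <- rnorm(10000, mean = 70, sd = 10)

# Draw a random sample of 100 students
sample_scores <- sample(population, 100)

# Point estimate and 95% confidence interval for the population mean
sample_mean <- mean(sample_scores)
ci <- t.test(sample_scores)$conf.int

cat("Sample mean:", round(sample_mean, 2), "\n")
cat("95% CI for population mean:", round(ci[1], 2), "to", round(ci[2], 2), "\n")
cat("True population mean:", round(mean(population), 2), "\n")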
b Write a program to find the Sum, Mean and Product of a Vector in R programming. 10
# Define a vector
vec <- c(2, 4, 6, 8, 10)

# Calculate Sum
sum_vec <- sum(vec)

# Calculate Mean
mean_vec <- mean(vec)

# Calculate Product
product_vec <- prod(vec)

# Print results
cat("Vector:", vec, "\n")
cat("Sum:", sum_vec, "\n")
cat("Mean:", mean_vec, "\n")
cat("Product:", product_vec, "\n")

3a Briefly explain the Data Science process with a neat diagram. 10

Explanation of Data Science Process  5M
Neat and Clear Diagram  5M
Data Science Process
The Data Science process consists of several stages that transform raw data into valuable insights.
Below are the key steps:
Problem Definition
Understanding the problem and defining objectives.
Example: Predicting customer churn for a telecom company.
Data Collection
Gathering relevant data from various sources (databases, APIs, web scraping).
Example: Collecting transaction data from an e-commerce platform.
Data Cleaning & Preprocessing
Handling missing values, duplicates, and outliers.
Example: Removing inconsistent records from customer datasets.
Exploratory Data Analysis (EDA)
Understanding patterns and relationships in data using visualization and summary statistics.
Example: Using histograms and scatter plots to explore sales trends.
Feature Engineering & Selection
Creating new features and selecting important ones for better model performance.
Example: Creating a “customer loyalty score” based on past purchases.
Model Building & Training
Applying machine learning algorithms to train predictive models.
Example: Using logistic regression to predict whether a customer will churn (see the sketch after these steps).
Model Evaluation & Optimization
Checking model accuracy using metrics like precision, recall, and RMSE.
Example: Comparing different machine learning models to find the best one.
Deployment & Monitoring
Deploying the model into a real-world system and tracking its performance.
Example: Integrating a fraud detection model into an online banking system.
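As an illustration of the Model Building & Training step, here is a minimal R sketch that fits a logistic regression to a tiny, hypothetical churn dataset (the variables and values are made up for illustration, not actual telecom data).

# Hypothetical churn data: monthly charges, tenure in months, and a churn flag
churn_data <- data.frame(
  monthly_charges = c(70, 20, 95, 40, 85, 30, 50, 100, 50, 90),
  tenure          = c(2, 48, 5, 36, 8, 60, 20, 3, 20, 25),
  churn           = factor(c(1, 0, 1, 0, 1, 0, 1, 1, 0, 0))
)

# Fit a logistic regression model: churn as a function of charges and tenure
model <- glm(churn ~ monthly_charges + tenure, data = churn_data, family = binomial)
summary(model)

# Predict the churn probability for a new customer
new_customer <- data.frame(monthly_charges = 80, tenure = 4)
predict(model, new_customer, type = "response")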
b What is Matplotlib? Write an R program to plot a line chart by assuming your own data. 10
Matplotlib is a popular data visualization library in Python used for creating static, animated, and
interactive plots. It provides various plotting functions like line charts, bar charts, histograms, and
scatter plots.
Since the question asks for an R program, the line chart below is drawn with ggplot2 (base R's plot() function would also work).
# Load the necessary library
library(ggplot2)

# Create a sample dataset
data <- data.frame(
  Year  = c(2015, 2016, 2017, 2018, 2019, 2020),
  Sales = c(500, 700, 900, 1100, 1300, 1500)
)

# Create a line chart with a point marking each year
ggplot(data, aes(x = Year, y = Sales)) +
  geom_line(color = "blue", size = 1) +
  geom_point(color = "red", size = 3) +
  ggtitle("Yearly Sales Growth") +
  xlab("Year") +
  ylab("Sales") +
  theme_minimal()
4a Explain the concept of the KNN algorithm. What are the modelling assumptions of the KNN algorithm? Explain with examples. 10
K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for both
classification and regression tasks. It is a non-parametric and instance-based learning algorithm,
meaning it does not make strong assumptions about the underlying data distribution and directly
stores training data for making predictions.
Similarity-Based Learning

Assumes similar data points are close to each other in feature space.
Example: If most nearby points are apples, a new point near them is likely an apple.
Choice of K Value Matters
A small K (e.g., 1 or 3) may lead to noise affecting predictions.
A large K (e.g., 20 or 50) may oversmooth and ignore local variations.
Feature Scaling is Important
Distance calculations are affected by different feature scales.
Example: If height is in cm and weight in kg, height will dominate. Normalization (Min-Max
Scaling or Standardization) is required.
Assumes a Meaningful Distance Metric
Uses distance measures like Euclidean, Manhattan, or Cosine similarity.
Example: Euclidean distance works well for continuous data, while Hamming distance is used for
categorical data.
Non-Parametric Nature
Does not assume an underlying distribution (unlike linear regression).
Learns only when making predictions (lazy learning).
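A minimal R sketch of KNN classification, using the class package and the built-in iris dataset; the features are scaled first, as noted above, so that no single variable dominates the distance calculation.

# KNN classification on the built-in iris dataset
library(class)

set.seed(1)
features <- scale(iris[, 1:4])   # feature scaling before distance-based learning
labels   <- iris$Species

# Hold out 30 observations as a test set
test_idx <- sample(nrow(iris), 30)
train_x <- features[-test_idx, ]; train_y <- labels[-test_idx]
test_x  <- features[test_idx, ];  test_y  <- labels[test_idx]

# Classify each test point by majority vote among its k = 5 nearest neighbours
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)

# Accuracy on the held-out test set
mean(pred == test_y)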
b What is ggplot2 in R? Write an R program to plot a bar chart using some sample data. 10
ggplot2 is a powerful and widely used data visualization package in R. It is based on the
Grammar of Graphics and allows users to create complex visualizations easily by layering
different elements like axes, colors, and labels.
Features of ggplot2:
 Provides high-quality, customizable plots.
 Supports multiple chart types (bar charts, line charts, histograms, etc.).
 Uses a layered approach for plotting.
 Works well with dplyr and tidyverse for data manipulation.
# Load the ggplot2 library
library(ggplot2)

# Create sample data
data <- data.frame(
  Category = c("A", "B", "C", "D", "E"),
  Value    = c(10, 25, 15, 30, 20)
)

# Create a bar chart (stat = "identity" uses the Value column as bar heights)
ggplot(data, aes(x = Category, y = Value, fill = Category)) +
  geom_bar(stat = "identity") +
  ggtitle("Bar Chart Example") +
  xlab("Categories") +
  ylab("Values") +
  theme_minimal()

5a Explain the Naïve Bayes algorithm for filtering spam with an example. 10


The Naïve Bayes algorithm is a probabilistic machine learning model used for classification
tasks, including spam filtering. It is based on Bayes' Theorem and assumes that features are
independent of each other (hence the term "naïve").
Bayes' Theorem Formula:
P(Spam | Words) = P(Words | Spam) × P(Spam) / P(Words)
How Naïve Bayes is Used for Spam Filtering
Training Phase:
Collect a dataset of emails labeled as spam or ham (not spam).
Extract features like word frequency (e.g., "free", "win", "money").
Calculate probabilities of each word appearing in spam and non-spam emails.
Prediction Phase:
For a new email, check the presence of words and compute probabilities using Bayes' theorem.
If the probability of spam is higher than ham, classify the email as spam, otherwise classify it as
ham.
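A minimal R sketch of this idea using the e1071 package (a CRAN package installed separately); the emails, words, and labels below are purely illustrative word-presence features, not a real corpus.

# Naive Bayes spam classifier on toy word-presence features
library(e1071)

# Each row is an email; columns record whether a keyword occurs ("yes"/"no")
emails <- data.frame(
  free  = c("yes", "yes", "no", "no",  "yes", "no"),
  win   = c("yes", "no",  "no", "yes", "yes", "no"),
  money = c("yes", "yes", "no", "no",  "no",  "no"),
  label = c("spam", "spam", "ham", "ham", "spam", "ham"),
  stringsAsFactors = TRUE
)

# Train with Laplace smoothing (laplace = 1) to avoid zero probabilities
model <- naiveBayes(label ~ ., data = emails, laplace = 1)

# Classify a new email containing "free" and "money" but not "win"
new_email <- data.frame(free = "yes", win = "no", money = "yes")
predict(model, new_email)                 # predicted class (spam or ham)
predict(model, new_email, type = "raw")   # posterior probabilities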
b Compare the Naïve Bayes algorithm with the KNN algorithm. Explain Laplace smoothing. 10

Why is Laplace Smoothing Needed?


In Naïve Bayes, if a word (feature) never appears in the training data for a class, its probability becomes zero, making the entire probability calculation zero. This is called the zero-frequency problem. Laplace smoothing fixes it by adding 1 to every word count:
P(word | class) = (count(word, class) + 1) / (total words in class + V), where V is the vocabulary size.
 Prevents zero probabilities.
 Makes the model more robust, especially for text classification.
 Works well for handling rare words or unseen features in test data.
6a Why are Linear Regression and KNN poor choices for filtering spam? Discuss. 10
Why Linear Regression and KNN are Poor Choices for Spam Filtering?
Spam filtering is a classification problem where emails are categorized as spam or ham (not
spam). While algorithms like Naïve Bayes are well-suited for text classification, Linear
Regression and K-Nearest Neighbors (KNN) have significant drawbacks when applied to spam
filtering.
Why Linear Regression is a Poor Choice?
1. Not Designed for Classification
Linear Regression is meant for continuous output (regression tasks), not categorical classification. It produces continuous values instead of distinct class labels (spam or ham).
2. Decision Boundary Issues
A linear model assumes a straight-line relationship between inputs and output, which does not fit the complex decision boundaries in text classification. Emails have non-linear relationships between words and spam probability.
3. Probabilities May Go Out of Range
Linear Regression may predict values outside the valid probability range (0 to 1).
Example: It may assign an email a spam score of 1.5, which is not a valid probability.
Why KNN is a Poor Choice?
1. High Computational Cost
KNN is a lazy learner, meaning it stores all training data and makes predictions by computing distances between the new data point and all stored emails. With a large dataset, computing distances between emails is slow and inefficient.
2. Curse of Dimensionality
In spam filtering, emails are represented as high-dimensional feature vectors (e.g., thousands of words). KNN performs poorly in high dimensions because distances become less meaningful.
3. Does Not Handle Text Data Well
KNN requires a distance metric (e.g., Euclidean distance), which is not ideal for categorical/text data. Text data is better treated probabilistically (as in Naïve Bayes) than with distance calculations.
Alternative: Naïve Bayes is better suited since it uses word probabilities instead of distances to classify emails.

b Explain scraping the web with APIs and other tools. 10


Scraping the Web with APIs and Other Tools
Web scraping is the process of extracting data from websites. It can be done using APIs or web
scraping tools when APIs are not available.

Scraping the Web with APIs


APIs (Application Programming Interfaces) allow structured access to web data without parsing
HTML. Many websites provide APIs to fetch data efficiently.
Steps for Web Scraping Using APIs
Find the API – Check if the website provides a public API (e.g., Twitter API, OpenWeather API).
Get API Key – Many APIs require authentication with an API key.
Send Requests – Use GET or POST requests to fetch data.
Process JSON/XML Response – Extract and analyze data from the response (see the example below).
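A minimal R sketch of these steps using the httr and jsonlite packages (both CRAN packages); the URL and API key below are placeholders, not a real endpoint.

# Fetch JSON data from a (placeholder) REST API and parse it
library(httr)
library(jsonlite)

api_url <- "https://api.example.com/v1/weather"   # placeholder endpoint
api_key <- "YOUR_API_KEY"                          # placeholder key

# Send a GET request with query parameters, including the API key
response <- GET(api_url, query = list(city = "London", appid = api_key))

# Check the HTTP status before using the body
if (status_code(response) == 200) {
  parsed <- fromJSON(content(response, as = "text", encoding = "UTF-8"))
  str(parsed)   # inspect the parsed result
} else {
  cat("Request failed with status:", status_code(response), "\n")
}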
7a Explain and construct a Decision Tree with an example. 10
b Explain the concept of Principal Component Analysis (PCA) and evaluate its significance in dimensionality reduction. 10

What is PCA?
Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction
while retaining the most important information in a dataset. It transforms the original features into
a new set of uncorrelated variables called Principal Components (PCs), which capture the
maximum variance in the data.
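A minimal R sketch of PCA using base R's prcomp() on the four numeric iris measurements, reducing the data from four dimensions to two while keeping most of the variance.

# PCA on the four numeric iris measurements
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
summary(pca)

# Dimensionality reduction: keep only the first two principal components
reduced <- pca$x[, 1:2]
head(reduced)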
8a Write short notes on: 10
i) Feature selection criteria
ii) Random Forest
iii) The three primary methods of Regression.
iv) The Kaggle model

b Discuss feature generation. Describe feature selection using the SVD algorithm. 10
9a Explain the concept of data visualization. What are its key objectives? Discuss the various techniques used to visualize time series data. 10
Data visualization is the graphical representation of data to help users understand patterns, trends,
and insights. It transforms raw data into visual formats like charts, graphs, and maps, making
complex data easier to interpret and analyze.
Key Objectives of Data Visualization
Simplify Data Interpretation – Makes complex data more accessible and understandable.
Identify Trends and Patterns – Helps detect trends, correlations, and anomalies in data.
Support Decision-Making – Assists businesses and researchers in making informed decisions.
Enhance Communication – Provides a clear way to present data-driven insights.
Detect Outliers – Helps in identifying unusual data points or inconsistencies.
Techniques for Visualizing Time Series Data
Line Chart – Best for showing trends over time.
Area Chart – Similar to a line chart but with the area filled, emphasizing volume.
Bar Chart – Useful for comparing values over time, especially with categorical data.
Scatter Plot – Shows relationships between time and another variable.
Heatmap – Displays variations in data using color intensity.
Box Plot – Helps in understanding the distribution and variability of data over time.
Moving Averages & Smoothing – Used to highlight trends by reducing short-term fluctuations.
Seasonal Decomposition Plot – Separates time series data into trend, seasonal, and residual components.
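As a small illustration of the line chart and moving-average techniques above, the following R sketch plots a made-up monthly series and overlays a 3-month moving average (the data are invented for illustration).

# Made-up monthly sales figures for two years
sales <- c(120, 135, 150, 145, 160, 175, 170, 185, 200, 195, 210, 230,
           225, 240, 255, 250, 265, 280, 275, 290, 305, 300, 315, 335)
sales_ts <- ts(sales, start = c(2021, 1), frequency = 12)

# 3-month centred moving average to smooth short-term fluctuations
ma3 <- stats::filter(sales_ts, rep(1/3, 3), sides = 2)

# Line chart of the raw series with the moving average overlaid
plot(sales_ts, type = "l", col = "blue",
     main = "Monthly Sales with 3-Month Moving Average",
     xlab = "Time", ylab = "Sales")
lines(ma3, col = "red", lwd = 2)
legend("topleft", legend = c("Sales", "3-month MA"),
       col = c("blue", "red"), lty = 1)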
b Explain the MapReduce framework in data engineering. How can it be applied to solve the word frequency problem in large text datasets? 10
MapReduce is a distributed data processing framework used in big data engineering to process
large datasets across multiple machines in a parallel manner. It is primarily used in Hadoop
ecosystems and follows a divide-and-conquer approach to process data efficiently.
Components of MapReduce
Map Phase
Splits the input data into smaller chunks and processes them in parallel.
Each chunk is processed by a mapper function, which transforms the input into key-value pairs.
Shuffle & Sort Phase
The intermediate key-value pairs from the mappers are grouped by key.
Data is then sorted and sent to the reducers.
Reduce Phase
The reducers process grouped data and aggregate results.
The final output is stored in HDFS (Hadoop Distributed File System) or another storage system.
Example: Word Frequency (Word Count)
The word frequency problem involves counting the occurrences of each word in a large text dataset. The mapper emits a (word, 1) pair for every word it sees, the shuffle phase groups these pairs by word, and the reducer sums the counts for each word.
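The sketch below simulates the three phases in plain R on a small in-memory example; in a real Hadoop or Spark cluster the mapper and reducer would run in parallel across many machines, but the logic is the same.

# Word count with MapReduce-style phases, simulated in base R
lines_in <- c("big data needs big tools",
              "map reduce splits big jobs",
              "reduce combines the results")

# Map phase: turn each line into (word, 1) key-value pairs
map_fn <- function(line) {
  words <- unlist(strsplit(tolower(line), "\\s+"))
  lapply(words, function(w) list(key = w, value = 1))
}
mapped <- unlist(lapply(lines_in, map_fn), recursive = FALSE)

# Shuffle & sort phase: group the values by key (word)
keys    <- sapply(mapped, `[[`, "key")
grouped <- split(sapply(mapped, `[[`, "value"), keys)

# Reduce phase: sum the counts for each word
word_counts <- sapply(grouped, sum)
word_counts   # e.g. big = 3, reduce = 2, ...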
10a Discuss the essential characteristics of a social network with a suitable example. Explain the concept of a social graph and its relevance in the field of data science. 10
Essential Characteristics of a Social Network
A social network is a structure made up of individuals or organizations that are connected through
relationships such as friendship, professional connections, or shared interests. The key
characteristics of social networks include:
Nodes (Entities) – Individuals, groups, or organizations that form the network.
Edges (Connections) – Relationships between nodes, which can be directed (one-way) or
undirected (mutual).
Community Structure – Groups of highly connected nodes within the network.
Degree Centrality – The number of direct connections a node has.
Homophily (Similarity) – The tendency of similar individuals to form connections (e.g., people
with common interests forming clusters).
Influence & Virality – Information spreads quickly through key influencers (viral marketing,
trends, etc.).
Dynamic Nature – Networks evolve as new connections form and old ones disappear.
Concept of a Social Graph
A social graph is a graphical representation of relationships between users in a social network. It
consists of:
Nodes (Users/Entities) representing people or groups.
Edges (Connections) depicting friendships, follows, or interactions.
Relevance of Social Graph in Data Science
Recommendation Systems – Used by platforms like Netflix and Amazon to suggest content based
on social connections.
Influencer Identification – Helps in marketing campaigns by identifying key users who influence
others.
Fraud Detection – Banks use social graphs to detect suspicious activities by analyzing transaction
networks.
Sentiment Analysis – Understanding public opinion on social media by analyzing interaction
patterns.
Disease Spread Analysis – Predicting how diseases (like COVID-19) spread through human
interactions.
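A minimal R sketch of a social graph using the igraph package (a CRAN package); the people and friendships below are hypothetical.

# Build a small social graph from an edge list and inspect it
library(igraph)

# Hypothetical friendship edges (undirected)
edges <- data.frame(
  from = c("Alice", "Alice", "Bob",   "Carol", "Dave", "Eve"),
  to   = c("Bob",   "Carol", "Carol", "Dave",  "Eve",  "Alice")
)
g <- graph_from_data_frame(edges, directed = FALSE)

# Nodes, edges, and degree centrality (number of direct connections per person)
V(g)$name
ecount(g)
degree(g)

# Simple visualization of the social graph
plot(g, vertex.color = "lightblue", vertex.size = 30)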
b Explain the Girvan-Newman algorithm with an example. 10
The Girvan-Newman algorithm is a community detection algorithm used in social network
analysis to identify clusters (communities) by iteratively removing edges with the highest
betweenness centrality.
Steps of the Girvan-Newman Algorithm
Calculate Edge Betweenness
Betweenness centrality measures how often an edge appears in the shortest paths between
nodes. Higher betweenness means the edge acts as a "bridge" between communities.
Remove the Edge with Highest Betweenness
The most influential edge is deleted.
The network may split into smaller groups.
Recalculate Betweenness
After removing an edge, betweenness is recalculated for the remaining edges.
Repeat Until Communities Emerge
The process continues until a clear separation of clusters is visible.
Applications of the Girvan-Newman Algorithm
Social Network Analysis – Finding friend groups in social media.
Biological Networks – Identifying functional modules in protein interactions.
Fraud Detection – Discovering fraudulent activity groups.
Marketing & Targeting – Understanding customer clusters for better recommendations.
