Data science
Answer:
Pandas is a powerful, open-source Python library primarily used for data manipulation and analysis.
It provides two main data structures:
Series: a 1-dimensional labeled array.
DataFrame: a 2-dimensional labeled data structure similar to a table (like an Excel sheet or SQL table).
Answer:
Tools for reading and writing data between in-memory data structures and different formats
(CSV, Excel, SQL).
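A rough sketch of this I/O in practice (the file names are made up):
import pandas as pd
df = pd.read_csv("sales.csv")               # read a CSV file into a DataFrame
df.to_csv("sales_clean.csv", index=False)   # write the DataFrame back out as CSV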
Answer:
NumPy (Numerical Python) is a library used for numerical computing. It provides support for:
Multi-dimensional arrays (ndarray)
Mathematical and statistical functions
Linear algebra
Random number generation
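A small illustration of typical NumPy usage (the values are made up):
import numpy as np
a = np.array([[1, 2], [3, 4]])   # 2x2 array
b = np.array([1, 1])
print(a.dot(b))                  # matrix-vector product -> [3 7]
print(a.mean())                  # mean of all elements -> 2.5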
Answer:
Matplotlib is a Python library used for creating static, animated, and interactive visualizations. It is
often used with NumPy and Pandas for plotting data. The most commonly used module in Matplotlib
is pyplot.
Example:
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [2, 4, 6])   # simple line plot
plt.show()
Answer:
Syntax: Matplotlib needs more manual styling, while Seaborn is easier to use and comes with built-in themes.
✅ 6) Are Sklearn and Scikit-learn the same? What is its use in data science?
Answer:
Yes, Sklearn and Scikit-learn are the same. sklearn is the importable module name for Scikit-learn, a
popular library for machine learning.
It provides tools for:
Classification
Regression
Clustering
Model selection and evaluation
Data preprocessing
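For example, a minimal classification sketch (the toy data is made up):
from sklearn.linear_model import LogisticRegression
X = [[0], [1], [2], [3]]        # toy feature values
y = [0, 0, 1, 1]                # toy labels
model = LogisticRegression()
model.fit(X, y)                 # train the classifier
print(model.predict([[1.5]]))   # predict the class of a new sample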
Answer:
Pandas: built on top of NumPy; works with labeled, tabular data (Series and DataFrames) and mixed column types.
NumPy: works with homogeneous, multi-dimensional numerical arrays and is faster for purely numeric computation.
Answer:
A DataFrame is a 2D labeled data structure with columns of potentially different types. It's part of
Pandas and resembles an Excel spreadsheet or a SQL table.
Example:
import pandas as pd
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
print(df)
Answer:
df.describe() provides a statistical summary of the numeric columns (or all columns, if specified) in a DataFrame.
It includes:
Count
Mean
Standard deviation
Min, Max
Quartiles (25%, 50%, 75%)
Example:
df.describe(include='all')
Answer:
A confusion matrix is a performance measurement tool for classification models. It shows how many predictions were:
True Positives (TP)
True Negatives (TN)
False Positives (FP)
False Negatives (FN)
From these counts, metrics such as the following are computed:
Accuracy
Precision
Recall
F1 Score
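A quick sketch with scikit-learn (the labels are made up):
from sklearn.metrics import confusion_matrix
y_true = [1, 0, 1, 1, 0, 1]   # actual labels
y_pred = [1, 0, 0, 1, 0, 1]   # model predictions
print(confusion_matrix(y_true, y_pred))   # rows = actual class, columns = predicted class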
Answer:
Recall (also called Sensitivity or True Positive Rate) is the ratio of correctly predicted positive
observations to all actual positives.
It answers: Out of all actual positives, how many did we correctly predict?
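In confusion-matrix terms: Recall = TP / (TP + FN)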
✅ 14) What is Precision?
Answer:
Precision is the ratio of correctly predicted positive observations to the total predicted positives.
It answers: Out of all predicted positives, how many were actually positive?
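In confusion-matrix terms: Precision = TP / (TP + FP)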
Answer:
The F1 Score is the harmonic mean of precision and recall. It balances the two metrics.
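Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)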
Answer:
Data visualization helps in:
Identifying trends, patterns, and relationships in data
Spotting outliers and anomalies
Communicating insights clearly to stakeholders
Answer:
An outlier is a data point that differs significantly from other observations in a dataset.
Common causes include:
Measurement errors
True variability
Outliers can skew statistical results and should be handled carefully (e.g., removed or capped).
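A common way to flag outliers is the 1.5 × IQR rule; a rough pandas sketch (assuming a DataFrame df with a numeric 'Salary' column):
q1, q3 = df['Salary'].quantile(0.25), df['Salary'].quantile(0.75)
iqr = q3 - q1   # interquartile range
outliers = df[(df['Salary'] < q1 - 1.5 * iqr) | (df['Salary'] > q3 + 1.5 * iqr)]
print(outliers)   # rows whose Salary falls far outside the quartiles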
Answer:
Example:
import seaborn as sns
# assumes a DataFrame df with a numeric 'Salary' column
sns.displot(df['Salary'], kde=True)   # histogram of Salary with a KDE curve
21) What are the tools used for Data Visualization?
Answer:
Popular tools for data visualization include:
🔹 Python Libraries:
Matplotlib
Seaborn
Plotly
🔹 BI Tools:
Tableau
Power BI
Answer:
Data Wrangling (also known as Data Munging) is the process of cleaning, transforming, and
organizing raw data into a usable format.
Typical steps include:
Removing duplicates
Handling missing values
Normalizing data
Feature engineering
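A small pandas sketch of such steps (the data is made up):
import pandas as pd
df = pd.DataFrame({'name': ['Ann', 'Ann', 'Bob'], 'age': [28, 28, None]})
df = df.drop_duplicates()                        # remove duplicate rows
df['age'] = df['age'].fillna(df['age'].mean())   # fill missing values with the mean
print(df)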
Answer:
Data Transformation is the process of converting data from one format or structure into another. Common transformations include:
Normalization/Standardization
Aggregating values
Scaling numerical values
Answer:
StandardScaler from sklearn.preprocessing standardizes features by removing the mean and scaling
to unit variance (Z-score normalization).
Example:
from sklearn.preprocessing import StandardScaler
# X is a 2D array or DataFrame of numeric features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Answer:
Hadoop is an open-source framework used for storing and processing large datasets using a
distributed computing model.
Key components:
HDFS (Storage)
MapReduce (Processing)
YARN (Resource management)
Common (Libraries)
Answer:
HDFS (Hadoop Distributed File System): A distributed file system that stores data across
multiple machines. It breaks large files into blocks (default 128MB) and stores them
redundantly.
MapReduce: A programming model for processing large data in parallel. It consists of:
o Map step: Processes and filters data
o Reduce step: Aggregates the intermediate results (a conceptual sketch follows below)
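The word-count idea behind MapReduce, sketched in plain Python (this only illustrates the concept, not the actual Hadoop API):
from collections import Counter
lines = ["big data is big", "data is everywhere"]
pairs = [(word, 1) for line in lines for word in line.split()]   # Map: emit (word, 1) pairs
counts = Counter()
for word, n in pairs:                                            # Reduce: sum counts per word
    counts[word] += n
print(counts)   # Counter({'big': 2, 'data': 2, 'is': 2, 'everywhere': 1})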
Answer:
The Hadoop ecosystem includes:
Hive (SQL-like querying)
Pig (data-flow scripting)
HBase (NoSQL database)
Sqoop (transfer between Hadoop and relational databases)
Spark (in-memory processing)
Answer:
Scala is a general-purpose programming language that combines object-oriented and functional
programming paradigms. It runs on the Java Virtual Machine (JVM) and is used heavily with Apache
Spark.
Answer:
Key features of Scala include:
Type inference
Concise syntax
Answer:
Big Data refers to extremely large datasets that are too complex or massive for traditional data
processing tools. It includes structured, semi-structured, and unstructured data from various sources
like social media, sensors, logs, and transactions.
Answer:
1. Problem Understanding
2. Data Collection
3. Data Cleaning and Preparation
4. Exploratory Data Analysis (EDA)
5. Feature Engineering
6. Model Building
7. Model Evaluation
8. Deployment
Answer:
Central Tendency refers to the measure that identifies the center of a dataset. The most common
measures are:
Mean (average)
Median (middle value)
Mode (most frequent value)
Answer:
Dispersion measures how spread out the data is. It helps understand variability. Common measures:
Range
Variance
Standard Deviation
✅ 36) What are Mean, Median, Mode, Mid-range? Calculate for: 10, 22, 13, 10, 21, 43, 77, 21, 10
Answer:
Data: 10, 22, 13, 10, 21, 43, 77, 21, 10
Mean:
(10 + 22 + 13 + 10 + 21 + 43 + 77 + 21 + 10) / 9 = 227 / 9 ≈ 25.22
Median (sorted: 10, 10, 10, 13, 21, 21, 22, 43, 77):
Middle value = 21
Mode: 10 (it appears three times)
Mid-Range: (min + max) / 2 = (10 + 77) / 2 = 43.5
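These values can be verified quickly in Python:
import statistics
data = [10, 22, 13, 10, 21, 43, 77, 21, 10]
print(round(statistics.mean(data), 2))   # 25.22
print(statistics.median(data))           # 21
print(statistics.mode(data))             # 10
print((min(data) + max(data)) / 2)       # mid-range: 43.5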
Answer:
Standard deviation is the square root of variance. It shows how much the values deviate from the
mean.
σ = √Variance
Answer:
Posterior Probability is the probability of a class (e.g., spam) given a set of features (e.g., words in an
email).
P(Class | Data)
Answer:
Likelihood is the probability of the features (data) given a class.
P(Data | Class)
Example: In spam detection, it's the probability that certain words appear given that the email is
spam.
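The two are connected by Bayes' theorem: P(Class | Data) = P(Data | Class) × P(Class) / P(Data).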
NLTK (Natural Language Toolkit) is a powerful Python library used for working with human language
data (text). It provides easy-to-use interfaces to:
Text processing libraries for classification, tokenization, stemming, tagging, parsing, and
semantic reasoning.
✅ Key Features:
Written in Python.
Includes many corpora and lexical resources (e.g., WordNet).
Provides tokenizers, stemmers, lemmatizers, and POS taggers.
Tokenization is the process of splitting text into smaller units called tokens. Tokens can be:
Words
Sentences
Subwords
✅ Types of Tokenization:
Word tokenization (split text into words)
Sentence tokenization (split text into sentences)
Subword tokenization (split words into smaller units)
Why Tokenization?
It’s the first step in NLP to break down raw text for further processing like parsing, tagging, etc.
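A short NLTK sketch (the sentence is made up; the 'punkt' tokenizer models must be downloaded once):
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize
text = "Tokenization splits text. It is the first NLP step."
print(sent_tokenize(text))   # ['Tokenization splits text.', 'It is the first NLP step.']
print(word_tokenize(text))   # ['Tokenization', 'splits', 'text', '.', 'It', ...]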
Stemming is the process of reducing a word to its root form by chopping off derivational affixes.
✅ Example:
"playing", "played", "plays" → "play"
Note: Stemming is a rule-based process and may not always result in a real word.
Example: "studies" → "studi"
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem("studies"))   # 'studi'
Lemmatization is the process of reducing a word to its base or dictionary form (the lemma), using vocabulary and grammatical context rather than simple suffix chopping.
✅ Example:
"better" → "good"
✅ Lemmatization in NLTK:
from nltk.stem import WordNetLemmatizer
# requires nltk.download('wordnet') once
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'
🔁 Stemming vs Lemmatization:
Stemming applies crude suffix-chopping rules and may produce non-words (faster, less accurate), while lemmatization uses a vocabulary and grammatical context to return valid dictionary forms (slower, more accurate).
A Corpus is a large collection of text data used for training and evaluating NLP models.
✅ Types of Corpora:
Monolingual or multilingual
Annotated (e.g., POS-tagged) or plain raw text
✅ Examples:
Brown Corpus
Gutenberg Corpus
NLTK Example:
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg
print(gutenberg.fileids()[:3])   # a few texts available in the corpus
Apache Spark is an open-source distributed computing framework used for big data processing. It
supports:
Batch processing
Stream processing (Spark Streaming)
SQL queries (Spark SQL)
Machine learning (MLlib)
Graph processing (GraphX)
✅ Why Spark?
In-memory computing (much faster than disk-based MapReduce for iterative workloads)
APIs in Scala, Python (PySpark), Java, and R
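The same engine is also usable from Python via PySpark; a minimal sketch (assuming PySpark is installed):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("demo").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()     # distributed DataFrame processed in memory
spark.stop()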
o In File Explorer, open the folder containing your Scala file, click the address bar, type cmd, and press Enter. This opens CMD in that path.
o Type:
spark-shell
o Then, inside the Spark shell, load the file:
:load sum.scala
✅ Example:
If sum.scala contains:
val a = 5
val b = 10
val sum = a + b
println("Sum is: " + sum)
Output:
Sum is: 15