Data Science

The document provides an overview of various Python libraries used in data science, including Pandas, NumPy, Matplotlib, and Scikit-learn, along with their features and functionalities. It also covers data visualization, big data, and the data science life cycle, as well as statistical measures such as mean, median, and standard deviation. Additionally, it discusses machine learning metrics such as precision, recall, and F1 score, and introduces tools and techniques for data wrangling and transformation.

✅ 1) What is Pandas library in Python?

Answer:
Pandas is a powerful, open-source Python library primarily used for data manipulation and analysis.
It provides two main data structures:

 Series: 1-dimensional labeled arrays.

 DataFrame: 2-dimensional labeled data structure similar to a table (like Excel or SQL table).

It is widely used in data science, machine learning, and data engineering.
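
A quick sketch of both structures (the values are illustrative):

import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])              # 1-D labeled array
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})  # 2-D labeled table
print(s['b'])            # 20 (label-based access)
print(df['Age'].mean())  # 27.5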

✅ 2) List some key features of Pandas.

Answer:

 Fast and efficient data manipulation using DataFrames.

 Tools for reading and writing data between in-memory data structures and different formats
(CSV, Excel, SQL).

 Label-based indexing for rows and columns.

 Handling missing data.

 Grouping and aggregation.

 Time series functionality.

 Built-in plotting using Matplotlib.
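
A minimal sketch exercising a few of these features (the column names and the sales.csv filename are made up for illustration):

import pandas as pd

df = pd.DataFrame({'city': ['Pune', 'Pune', 'Mumbai'],
                   'sales': [100, None, 300]})
df['sales'] = df['sales'].fillna(0)        # handle missing data
print(df.groupby('city')['sales'].sum())   # grouping and aggregation
df.to_csv('sales.csv', index=False)        # write to CSV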

✅ 3) What is NumPy library in Python?

Answer:
NumPy (Numerical Python) is a library used for numerical computing. It provides support for:

 N-dimensional arrays (ndarray)

 Mathematical functions (e.g., mean, sum, std)

 Linear algebra

 Random number generation


It forms the foundation for libraries like Pandas and SciPy.
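
A short sketch of these capabilities (values are illustrative):

import numpy as np

a = np.array([[1, 2], [3, 4]])   # 2-D ndarray
print(a.mean(), a.std())         # 2.5  ~1.118
print(a.dot(a))                  # matrix product (linear algebra)
print(np.random.rand(2))         # two random numbers in [0, 1)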

✅ 4) What is Matplotlib library?

Answer:
Matplotlib is a Python library used for creating static, animated, and interactive visualizations. It is
often used with NumPy and Pandas for plotting data. The most commonly used module in Matplotlib
is pyplot.

Example:

import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [4, 5, 6])

plt.show()

✅ 5) What is the difference between Seaborn and Matplotlib?

Answer:

Feature | Matplotlib                  | Seaborn
--------|-----------------------------|------------------------------------------
Purpose | Low-level, general plotting | High-level interface built on Matplotlib
Syntax  | More manual styling needed  | Easier to use, with built-in themes
Data    | Works with arrays           | Works directly with Pandas DataFrames
Example | plt.plot()                  | sns.lineplot()
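
A side-by-side sketch of the two styles (the tiny DataFrame is made up):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
plt.plot(df['x'], df['y'])           # Matplotlib: pass arrays/Series explicitly
sns.lineplot(data=df, x='x', y='y')  # Seaborn: refer to DataFrame columns by name
plt.show()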

✅ 6) Are Sklearn and Scikit-learn the same? What is its use in data science?

Answer:
Yes, Sklearn and Scikit-learn are the same. sklearn is the importable module name for Scikit-learn, a
popular library for machine learning.
It provides tools for:

 Classification (e.g., Naive Bayes, SVM)

 Regression (e.g., Linear Regression)

 Clustering (e.g., K-Means)

 Model selection and evaluation
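
A minimal sketch of the typical fit/predict workflow, using Linear Regression on toy data:

from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]     # feature matrix with one feature
y = [2, 4, 6, 8]             # target values
model = LinearRegression().fit(X, y)
print(model.predict([[5]]))  # ≈ [10.]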

✅ 7) What functions come in Pandas and Numpy library?

Answer:
Pandas:

 read_csv(), head(), info(), describe()

 groupby(), merge(), dropna(), fillna(), value_counts()

NumPy:

 array(), arange(), linspace()


 mean(), sum(), std(), reshape(), dot()

✅ 8) What is a DataFrame in Python?

Answer:
A DataFrame is a 2D labeled data structure with columns of potentially different types. It's part of
Pandas and resembles an Excel spreadsheet or a SQL table.

Example:


import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

✅ 9) How to find duplicates in Python? (Python Command)

Answer:


df.duplicated() # Returns True for duplicate rows

df[df.duplicated()] # Filters duplicate rows

df.drop_duplicates() # Removes duplicates

✅ 10) What is the use of describe() command?

Answer:
df.describe() provides a statistical summary of the numeric columns in a DataFrame (or of all columns, if specified).

It includes:

 Count

 Mean

 Standard deviation

 Min, Max

 25%, 50%, and 75% percentiles

Example:

df.describe(include='all')

✅ 11) What is the significance of Confusion Matrix?

Answer:
A confusion matrix is a performance measurement tool for classification models. It shows how many
predictions were:

 True Positives (TP): Correctly predicted positive class

 True Negatives (TN): Correctly predicted negative class

 False Positives (FP): Incorrectly predicted as positive

 False Negatives (FN): Incorrectly predicted as negative

It helps in calculating metrics like:

 Accuracy

 Precision

 Recall

 F1 Score

✅ 12) What is TP, TN, FP, FN in Confusion Matrix?

Answer:

Term                | Description
--------------------|----------------------------------------------------------
TP (True Positive)  | Model correctly predicts the positive class
TN (True Negative)  | Model correctly predicts the negative class
FP (False Positive) | Model predicts positive when the actual class is negative
FN (False Negative) | Model predicts negative when the actual class is positive
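
A small sketch using scikit-learn's confusion_matrix on made-up labels:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(confusion_matrix(y_true, y_pred))
# [[1 1]     rows = actual 0/1, columns = predicted 0/1
#  [1 3]]    i.e., TN=1, FP=1, FN=1, TP=3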

✅ 13) What is Recall?

Answer:
Recall (also called Sensitivity or True Positive Rate) is the ratio of correctly predicted positive
observations to all actual positives.

Recall = TP / (TP + FN)

It answers: Out of all actual positives, how many did we correctly predict?
✅ 14) What is Precision?

Answer:
Precision is the ratio of correctly predicted positive observations to the total predicted positives.

Precision = TP / (TP + FP)

It answers: Out of all predicted positives, how many were actually positive?

✅ 15) What is F1 Score?

Answer:
The F1 Score is the harmonic mean of precision and recall. It balances the two metrics.

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

Useful when you need a balance between Precision and Recall.
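
A worked example with assumed counts (TP=8, FP=2, FN=4):

tp, fp, fn = 8, 2, 4
precision = tp / (tp + fp)                          # 0.8
recall = tp / (tp + fn)                             # ≈ 0.667
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.727
print(precision, recall, f1)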

✅ 16) What is the need for Data Visualization in Data Science?

Answer:
Data visualization helps in:

 Understanding trends, patterns, and outliers

 Communicating insights effectively

 Making data-driven decisions

 Validating assumptions and hypotheses

Tools: Matplotlib, Seaborn, Tableau, PowerBI

✅ 17) What is an Outlier?

Answer:
An outlier is a data point that differs significantly from other observations in a dataset.

They can arise due to:

 Measurement errors

 Data entry errors

 True variability

Outliers can skew statistical results and should be handled carefully (e.g., removed or capped).
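
One common approach is the IQR rule; here is a sketch on made-up values:

import pandas as pd

s = pd.Series([10, 12, 11, 13, 95])   # 95 looks like an outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(s[(s < lower) | (s > upper)])   # flags 95
s_capped = s.clip(lower, upper)       # cap outliers instead of removing them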

✅ 18) When to use Histogram and Pie Chart?


Answer:

Chart     | Use Case
----------|--------------------------------------------------------------------------------
Histogram | Shows the distribution of a continuous variable (e.g., Age, Salary)
Pie Chart | Shows the proportion/percentage of categories in a dataset (e.g., Gender, City)

✅ 19) What are the challenges in Big Data Visualization?

Answer:

 Scalability: Standard tools may not handle billions of rows.

 Speed: Rendering large datasets takes time.

 Interactivity: Real-time filtering and zooming becomes hard.

 Storage: Large visual files consume memory.

 Data Cleaning: Big data may have missing, inconsistent entries.

✅ 20) What is Joint Plot and Dist Plot?

Answer:

🔹 jointplot() – from Seaborn:

 Combines scatter plot and histograms.

 Useful for visualizing the relationship between two variables + distribution.

Example:


sns.jointplot(x='Age', y='Salary', data=df)

🔹 distplot() (deprecated, use displot()):

 Plots a histogram + KDE (Kernel Density Estimate).

 Shows distribution of a single variable.

Example:


sns.displot(df['Salary'], kde=True)
✅ 21) What are the tools used for Data Visualization?

Answer:
Popular tools for data visualization include:

🔹 Python Libraries:

 Matplotlib – Basic plots (line, bar, scatter).

 Seaborn – Statistical visualizations with better styling.

 Plotly – Interactive plots.

 Bokeh – Web-based visualizations.

 Altair – Declarative charts.

🔹 BI Tools:

 Tableau

 Power BI

 Google Data Studio

✅ 22) What is Data Wrangling?

Answer:
Data Wrangling (also known as Data Munging) is the process of cleaning, transforming, and
organizing raw data into a usable format.

Typical steps include:

 Handling missing values

 Converting data types

 Removing duplicates

 Normalizing data

 Feature engineering
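
A compact sketch of these steps in Pandas (the columns are made up):

import pandas as pd

raw = pd.DataFrame({'age': ['25', '30', '30', None],
                    'city': ['Pune', 'Delhi', 'Delhi', 'Pune']})
raw = raw.drop_duplicates()                           # remove duplicates
raw['age'] = pd.to_numeric(raw['age'])                # convert data type
raw['age'] = raw['age'].fillna(raw['age'].mean())     # handle missing values
raw['is_pune'] = (raw['city'] == 'Pune').astype(int)  # simple feature engineering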

✅ 23) What is Data Transformation?

Answer:
Data Transformation is the process of converting data from one format or structure into another. It's
often used in:

 Normalization/Standardization

 Encoding categorical data

 Aggregating values
 Scaling numerical values

It prepares data for analysis or modeling.
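
A small sketch of two common transformations, one-hot encoding and min-max scaling (the columns are illustrative):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'red'], 'size': [10, 20, 30]})
encoded = pd.get_dummies(df, columns=['color'])   # one-hot encode a categorical column
df['size_scaled'] = (df['size'] - df['size'].min()) / (df['size'].max() - df['size'].min())  # scale to [0, 1]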

✅ 24) What is the use of StandardScaler function in Python?

Answer:
StandardScaler from sklearn.preprocessing standardizes features by removing the mean and scaling
to unit variance (Z-score normalization).

Z = (x − μ) / σ

It ensures that all features contribute equally to the model.

Example:


from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # toy feature matrix

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and unit variance

✅ 25) What is Hadoop?

Answer:
Hadoop is an open-source framework used for storing and processing large datasets using a
distributed computing model.

Key components:

 HDFS (Storage)

 MapReduce (Processing)

 YARN (Resource management)

 Common (Libraries)

It enables parallel processing across multiple computers.

✅ 26) What is HDFS and MapReduce?

Answer:

 HDFS (Hadoop Distributed File System): A distributed file system that stores data across
multiple machines. It breaks large files into blocks (default 128MB) and stores them
redundantly.

 MapReduce: A programming model for processing large data in parallel. It consists of:
o Map step: Processes and filters data

o Reduce step: Aggregates results
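
A conceptual simulation of MapReduce word count in plain Python (real MapReduce distributes these phases across machines):

from collections import defaultdict

lines = ["big data is big", "data is valuable"]

# Map step: emit (word, 1) pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle/Reduce step: group by key and aggregate
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))   # {'big': 2, 'data': 2, 'is': 2, 'valuable': 1}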

✅ 27) What are the components of the Hadoop Ecosystem?

Answer:
The Hadoop ecosystem includes:

 HDFS – Storage layer

 MapReduce – Processing layer

 YARN – Resource manager

 Hive – SQL-like querying

 Pig – Data flow scripting

 HBase – NoSQL database

 Sqoop – Transfers data between Hadoop and RDBMS

 Flume – Collects and transports log data

 Zookeeper – Coordination service

 Oozie – Workflow scheduler

✅ 28) What is Scala?

Answer:
Scala is a general-purpose programming language that combines object-oriented and functional
programming paradigms. It runs on the Java Virtual Machine (JVM) and is used heavily with Apache
Spark.

✅ 29) What are the features of Scala?

Answer:

 Statically typed (like Java)

 Supports functional and object-oriented programming

 Type inference

 Concise syntax

 Interoperable with Java

 Concurrency support (via Akka)

 Used in big data frameworks like Spark


✅ 30) How is Scala different from Java?

Feature           | Scala              | Java
------------------|--------------------|--------------------------------------
Programming Style | Functional + OOP   | Primarily OOP (lambdas since Java 8)
Code Length       | Concise            | Verbose
Type Inference    | Yes                | Limited (var since Java 10)
Concurrency       | Actor model (Akka) | Threads
Use in Big Data   | Apache Spark       | Limited

✅ 31) What is Big Data?

Answer:
Big Data refers to extremely large datasets that are too complex or massive for traditional data
processing tools. It includes structured, semi-structured, and unstructured data from various sources
like social media, sensors, logs, and transactions.

✅ 32) What are the characteristics of Big Data? (The 5 V's)

Answer:

1. Volume – Huge amount of data.

2. Velocity – Speed at which data is generated and processed.

3. Variety – Different forms: text, image, video, etc.

4. Veracity – Accuracy and trustworthiness of data.

5. Value – Extracting useful insights from the data.

✅ 33) List phases in Data Science Life Cycle.

Answer:

1. Problem Understanding

2. Data Collection

3. Data Cleaning / Wrangling

4. Exploratory Data Analysis (EDA)

5. Feature Engineering

6. Model Building
7. Model Evaluation

8. Deployment

9. Monitoring and Maintenance

✅ 34) What is Central Tendency?

Answer:
Central Tendency refers to the measure that identifies the center of a dataset. The most common
measures are:

 Mean (average)

 Median (middle value)

 Mode (most frequent value)

✅ 35) What is Dispersion?

Answer:
Dispersion measures how spread out the data is. It helps understand variability. Common measures:

 Range

 Variance

 Standard Deviation

 Interquartile Range (IQR)

✅ 36) What are Mean, Median, Mode, Mid-range? Calculate for: 10, 22, 13, 10, 21, 43, 77, 21, 10

Answer:
Data: 10, 22, 13, 10, 21, 43, 77, 21, 10

 Mean:

Mean = (10 + 22 + 13 + 10 + 21 + 43 + 77 + 21 + 10) / 9 = 227 / 9 ≈ 25.22

 Median (sorted: 10, 10, 10, 13, 21, 21, 22, 43, 77):
Middle value = 21

 Mode: 10 (appears 3 times)

 Mid-Range:

Mid-Range = (Min + Max) / 2 = (10 + 77) / 2 = 43.5
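
The same values can be checked with Python's statistics module:

import statistics

data = [10, 22, 13, 10, 21, 43, 77, 21, 10]
print(statistics.mean(data))        # ≈ 25.22
print(statistics.median(data))      # 21
print(statistics.mode(data))        # 10
print((min(data) + max(data)) / 2)  # 43.5 (mid-range)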

✅ 37) What is Variance?


Answer:
Variance measures the average squared deviation from the mean. It shows how much the data
spreads out.

σ² = (1/n) Σ (xᵢ − μ)²

✅ 38) What is Standard Deviation?

Answer:
Standard deviation is the square root of variance. It shows how much the values deviate from the
mean.

σ = √Variance

If data is tightly clustered, SD is low; if spread out, SD is high.
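
For the dataset from question 36, a quick check with NumPy (np.var and np.std compute the population versions by default):

import numpy as np

data = [10, 22, 13, 10, 21, 43, 77, 21, 10]
print(np.var(data))   # ≈ 431.95 (population variance)
print(np.std(data))   # ≈ 20.78 (population standard deviation)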

✅ 39) What is Posterior Probability in Naive Bayes?

Answer:
Posterior Probability is the probability of a class (e.g., spam) given a set of features (e.g., words in an
email).

P(Class | Data)

It is calculated using Bayes’ Theorem:

P(C | X) = P(X | C) · P(C) / P(X)

✅ 40) What is Likelihood Probability in Naive Bayes?

Answer:
Likelihood is the probability of the features (data) given a class.

P(Data | Class)

Example: In spam detection, it's the probability that certain words appear given that the email is
spam.
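
A worked example with assumed probabilities for a single word "free":

p_spam = 0.4          # prior P(spam)
p_free_spam = 0.3     # likelihood P("free" | spam)
p_free_ham = 0.05     # likelihood P("free" | not spam)

p_free = p_free_spam * p_spam + p_free_ham * (1 - p_spam)  # evidence P("free") = 0.15
posterior = p_free_spam * p_spam / p_free                  # P(spam | "free") by Bayes' Theorem
print(posterior)                                           # 0.8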

✅ 41) What is NLTK?

NLTK (Natural Language Toolkit) is a powerful Python library used for working with human language
data (text). It provides easy-to-use interfaces to:

 Over 50 corpora and lexical resources such as WordNet.

 Text processing libraries for classification, tokenization, stemming, tagging, parsing, and
semantic reasoning.

 Wrappers for industrial-strength NLP libraries.

✅ Key Features:
 Written in Python.

 Good for educational and research purposes in NLP.

 Helps in building Python programs to work with human language data.

✅ 42) What is Tokenization in NLP?

Tokenization is the process of splitting text into smaller units called tokens. Tokens can be:

 Words

 Sentences

 Subwords

✅ Types of Tokenization:

 Word Tokenization: Splits text into words.


Example: "I love Python" → ["I", "love", "Python"]

 Sentence Tokenization: Splits text into sentences.


Example: "I love Python. It is powerful." → ["I love Python.", "It is powerful."]

Why Tokenization?
It’s the first step in NLP to break down raw text for further processing like parsing, tagging, etc.
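
A short NLTK sketch of both kinds (the punkt models must be downloaded once):

import nltk
nltk.download('punkt')   # one-time download of tokenizer models
from nltk.tokenize import word_tokenize, sent_tokenize

text = "I love Python. It is powerful."
print(word_tokenize(text))   # ['I', 'love', 'Python', '.', 'It', 'is', 'powerful', '.']
print(sent_tokenize(text))   # ['I love Python.', 'It is powerful.']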

✅ 43) What is Stemming?

Stemming is the process of reducing a word to its root form by chopping off derivational affixes.

✅ Example:

 "playing", "played", "plays" → "play"

 "running", "runner" → "run"

Note: Stemming is a rule-based process and may not always result in a real word.
Example: "studies" → "studi"

✅ Common Stemmer in NLTK:


from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

stemmer.stem("playing") # Output: 'play'

✅ 44) What is Lemmatization?


Lemmatization is the process of reducing a word to its base or dictionary form, called a lemma.
Unlike stemming, lemmatization returns a valid word and considers the context (POS).

✅ Example:

 "running", "ran" → "run"

 "better" → "good"

✅ Lemmatization in NLTK:


from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

lemmatizer.lemmatize("running", pos="v") # Output: 'run'

🔁 Stemming vs Lemmatization:

 Stemming is faster, less accurate.

 Lemmatization is slower, but more accurate.

✅ 45) What is Corpus in NLP?

A Corpus is a large collection of text data used for training and evaluating NLP models.

✅ Types of Corpora:

 Annotated corpora (tagged with POS, syntax)

 Raw corpora (plain text)

 Monolingual or multilingual

✅ Examples:

 Brown Corpus

 Gutenberg Corpus

 WordNet (lexical database)

NLTK Example:


import nltk

nltk.download('gutenberg')

from nltk.corpus import gutenberg


gutenberg.fileids() # Lists files in the Gutenberg corpus

✅ 46) What is Spark Framework?

Apache Spark is an open-source distributed computing framework used for big data processing. It
supports:

 Batch processing

 Real-time stream processing

 Machine learning

 Graph processing

✅ Languages Supported: Scala, Python (PySpark), Java, R

✅ Why Spark?

 Processes data faster than Hadoop MapReduce

 In-memory computing

 Built-in libraries for ML (MLlib), graph (GraphX), SQL (Spark SQL)
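
A minimal PySpark sketch, assuming pyspark is installed (the data is made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()
df = spark.createDataFrame([("Alice", 25), ("Bob", 30)], ["name", "age"])
df.filter(df.age > 26).show()   # distributed filter; prints Bob's row
spark.stop()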

Steps to Run Scala Program in Windows Using Spark Framework:

1. Copy Scala File:

o Save your .scala file (e.g., sum.scala) in the Spark folder:


C:\Program Files\Big Data\Spark

2. Open CMD in Spark Folder:

o Open the Spark folder.

o In the address bar, type cmd and press Enter. This opens CMD in that path.

3. Start Spark Shell:

o Type:


spark-shell

o Press Enter. This starts the interactive Spark Scala shell.

4. Load Scala File in Spark Shell:

o Use the :load command to load your Scala file:

:load sum.scala

o This will run the code inside sum.scala.

✅ Example:
If sum.scala contains:


val a = 5

val b = 10

val sum = a + b

println("Sum is: " + sum)

Output will be:


Sum is: 15
