ADS Imp Ans
3. Variance
Variance is another measure of the spread of data: it is simply the square of the
standard deviation, i.e., the average of the squared deviations of the data points from the mean.
Formula for Variance:
Variance (σ²) = Σ(xᵢ − x̄)² / n
(Note that the variance is the same as the quantity before taking the square root for the
standard deviation.)
In the example above:
Variance=138/5=27.6
So, the variance is 27.6.
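As a quick sketch of the formula (using a small hypothetical dataset, since the original example's raw values are not repeated here), the population variance can be computed directly:

```python
data = [4, 8, 6, 5, 12]  # hypothetical dataset

mean = sum(data) / len(data)                               # 7.0
variance = sum((x - mean) ** 2 for x in data) / len(data)  # average squared deviation

print(variance)  # 8.0
```

The same result is returned by `statistics.pvariance(data)` from the standard library (the "p" marks it as the population variance, dividing by n rather than n − 1).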
5. Outliers
An outlier is a data point that is significantly different from other data points in the dataset.
To detect outliers, we use the Interquartile Range (IQR) method:
Formula to identify outliers:
Lower Bound = Q1−1.5×IQR
Upper Bound = Q3+1.5×IQR
Steps to Find Outliers:
1. Lower Bound = 6.5 − 1.5 × 11 = 6.5 − 16.5 = −10
2. Upper Bound = 17.5 + 1.5 × 11 = 17.5 + 16.5 = 34 (since Q1 = 6.5 and IQR = Q3 − Q1 = 11, Q3 = 17.5)
Any value below −10 or above 34 is treated as an outlier.
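These steps can be sketched in code. The dataset below is hypothetical; NumPy's percentile function supplies Q1 and Q3:

```python
import numpy as np

data = np.array([3, 7, 8, 5, 12, 14, 21, 13, 18, 45])  # hypothetical values

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)  # [45]
```

Here the extreme value 45 falls above the upper bound and is flagged; every other point lies inside the fences.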
Explain the graphical method in applied data science?
In Applied Data Science, the graphical method refers to using various types of visual
representations to analyze and interpret data. These methods help data scientists, analysts,
and decision-makers understand complex datasets, identify patterns, trends, and
relationships, and communicate insights effectively. Visualizations play a critical role in the
data analysis process, making it easier to detect outliers, distributions, correlations, and
other important patterns that are often difficult to discern from raw data alone.
1. Histograms
A histogram is a graphical representation of the distribution of a dataset. It is used to show
how often different values occur within a dataset and helps to visualize the frequency
distribution.
When to use:
To understand the distribution of numerical data (e.g., whether the data is normally
distributed or skewed).
To check for outliers or gaps in the data.
Example:
If you have a dataset of exam scores, a histogram will show how many students fall into
each score range (e.g., 0-10, 10-20, etc.).
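A minimal sketch with a hypothetical set of exam scores; `np.histogram` returns the per-bin counts that a histogram would draw (a plotting library such as matplotlib renders the same counts as bars):

```python
import numpy as np

scores = [48, 55, 62, 67, 71, 74, 78, 81, 85, 92]  # hypothetical exam scores

# counts per 10-point bin: [0,10), [10,20), ..., [90,100]
counts, edges = np.histogram(scores, bins=range(0, 101, 10))
print(counts)  # three students fall in the 70-80 range
```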
2. Box Plots (Box-and-Whisker Plots)
A box plot is a summary visualization that shows the distribution of a dataset based on five
summary statistics: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
It helps detect outliers and understand data spread and skewness.
When to use:
To understand the spread and variability of a dataset.
To compare distributions across different groups or categories.
To identify outliers.
Example:
A box plot for a dataset of salaries across various job titles would show the range of salaries,
the median, and any extreme outliers (e.g., very high or low salaries).
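The five summary statistics behind a box plot can be computed directly. The salary data below is hypothetical (in thousands), with one deliberately extreme value:

```python
import numpy as np

salaries = np.array([40, 45, 48, 50, 52, 55, 60, 150])  # hypothetical, $1000s

q1, median, q3 = np.percentile(salaries, [25, 50, 75])
iqr = q3 - q1

# points beyond the 1.5*IQR whiskers are drawn individually as outliers
outliers = salaries[(salaries < q1 - 1.5 * iqr) | (salaries > q3 + 1.5 * iqr)]

print(salaries.min(), q1, median, q3, salaries.max())  # five-number summary
print(outliers)  # the 150 salary is flagged
```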
3. Scatter Plots
A scatter plot is used to visualize the relationship between two continuous variables. It
displays points on a 2D plane with each point representing a pair of values (x, y).
When to use:
To check for correlations between two variables (e.g., height vs. weight, age vs.
income).
To identify trends, clusters, or outliers in the data.
To perform regression analysis (visualizing the fit of a regression line).
Example:
If you're trying to predict house prices based on their square footage, a scatter plot can
show how square footage correlates with price, and whether there's a linear relationship.
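A sketch with hypothetical square-footage and price data; the Pearson correlation coefficient quantifies the linear relationship that the scatter plot would show visually:

```python
import numpy as np

sqft = np.array([1000, 1500, 2000, 2500])  # hypothetical square footage
price = np.array([200, 280, 370, 450])     # hypothetical price, $1000s

r = np.corrcoef(sqft, price)[0, 1]  # Pearson correlation coefficient
print(r)  # close to 1: strong positive linear relationship
```

A call like `plt.scatter(sqft, price)` would draw the corresponding points; with r this close to 1, they lie almost exactly on a straight line.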
4. Line Graphs
A line graph is used to visualize trends over time. It connects data points with a line and is
typically used to represent time series data.
When to use:
To visualize trends over time (e.g., stock prices, sales over months, temperature over
days).
To identify periodicity, seasonality, and growth patterns.
Example:
A line graph can track the number of new website visitors over the past year, helping
identify peak periods or trends in user growth.
5. Heatmaps
A heatmap is a graphical representation where individual values are represented by colors.
It is typically used to visualize correlations or patterns in large datasets, and it can be
especially helpful in understanding the structure of complex data like correlations,
confusion matrices, or geographic distributions.
When to use:
To visualize correlation matrices between different variables.
To show the intensity of values in a matrix form (e.g., website clicks, geographic
regions).
To visualize the distribution of values in large data grids.
Example:
In a dataset with customer interaction across multiple product categories, a heatmap can
show which categories have the highest or lowest engagement.
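A heatmap typically visualizes a matrix such as the correlation matrix below (the engagement counts are hypothetical; a call like matplotlib's `plt.imshow(corr)` would render the matrix as colors):

```python
import numpy as np

# hypothetical data: rows = customers, columns = product categories
engagement = np.array([[5, 3, 8],
                       [2, 7, 4],
                       [9, 1, 6],
                       [4, 6, 3]])

# 3x3 matrix of pairwise correlations between categories
corr = np.corrcoef(engagement, rowvar=False)
print(corr.shape)  # (3, 3), with 1.0 along the diagonal
```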
6. Pair Plots (Scatterplot Matrix)
A pair plot (or scatterplot matrix) is a collection of scatter plots used to visualize pairwise
relationships between multiple variables in a dataset. It is particularly useful when analyzing
high-dimensional data.
When to use:
To visualize relationships between multiple variables.
Explain the rank method in applied data science?
In applied data science, the rank method refers to a technique used to transform or
organize data based on the relative ranking or order of values within a dataset. Ranking is a
fundamental concept in data analysis and is especially useful when dealing with data that
needs to be compared or when it's not necessary to work with the actual values but rather
their relative positions.
The rank method can be used for various purposes such as feature engineering, handling
ordinal data, statistical analysis, and data transformation. Here’s a breakdown of how and
why ranking is applied in data science:
1. What is Ranking in Data Science?
Ranking assigns a numerical order to items in a dataset based on their magnitude. For a
given dataset, the smallest value is assigned the rank of 1, the next smallest the rank of 2,
and so on. When there are ties (i.e., two or more identical values), different methods can be
used to assign ranks.
Example:
For the dataset 12,17,9,15
Rank(9) = 1
Rank(12) = 2
Rank(15) = 3
Rank(17) = 4
If there are duplicate values, they are assigned the average rank of their positions.
2. Rank Transformation in Data Science
In data preprocessing, the rank transformation technique involves replacing the raw data
values with their corresponding ranks. This is often done to:
Normalize or scale the data: ranking transforms data onto an ordinal scale, which
can be useful for certain algorithms.
Reduce the impact of outliers: By replacing raw values with ranks, the influence of
outliers is minimized.
Handle non-linear relationships: Ranking can sometimes make relationships between
variables more linear and easier to model.
Example of Rank Transformation:
Original Data: 12,17,9,15
After Rank Transformation: 2,4,1,3
This transformation helps to standardize the data by focusing on its relative positions rather
than the absolute values.
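The transformation above can be sketched in plain Python. The helper below assigns 1-based ranks and gives tied values the average of their sorted positions (equivalent to `scipy.stats.rankdata` with its default "average" method):

```python
from collections import defaultdict

def average_ranks(values):
    """1-based ranks; tied values share the average of their sorted positions."""
    positions = defaultdict(list)
    for pos, v in enumerate(sorted(values), start=1):
        positions[v].append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

print(average_ranks([12, 17, 9, 15]))  # [2.0, 4.0, 1.0, 3.0]
print(average_ranks([12, 15, 9, 15]))  # ties: [2.0, 3.5, 1.0, 3.5]
```

In pandas the same transformation is `Series.rank()`, which also averages tied ranks by default.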
3. Rank-Based Methods in Applied Data Science
a. Spearman’s Rank Correlation
One of the most common rank-based statistical methods is Spearman’s Rank Correlation,
used to measure the strength and direction of the association between two variables.
Unlike Pearson correlation (which measures linear relationships), Spearman's correlation
assesses monotonic relationships (relationships where one variable tends to increase as the
other increases, though not necessarily in a linear fashion).
Formula: ρ = 1 − (6Σdᵢ²) / (n(n² − 1)), where dᵢ is the difference between the ranks of each pair and n is the number of observations.
Example:
For two variables:
Variable 1: 12,17,9,15
Variable 2: 5,3,6,2
You would rank both variables and then calculate the differences in ranks, square them, sum
them up, and use the formula to find the Spearman rank correlation. This method helps
identify monotonic relationships in the data, even if those relationships are non-linear.
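Working through the two variables above: the ranks of Variable 1 are [2, 4, 1, 3] and of Variable 2 are [3, 2, 4, 1], giving rank differences d = [−1, 2, −3, 2] and Σd² = 18. The manual formula and scipy's implementation agree:

```python
from scipy.stats import spearmanr

x = [12, 17, 9, 15]  # Variable 1
y = [5, 3, 6, 2]     # Variable 2

# manual calculation: sum(d^2) = 18, n = 4
rho_manual = 1 - (6 * 18) / (4 * (4 ** 2 - 1))  # -0.8

rho, _ = spearmanr(x, y)  # spearmanr ranks both variables internally
print(rho)  # -0.8
```

The negative value reflects that larger values of Variable 1 tend to pair with smaller values of Variable 2.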
b. Rank-based Models
In some machine learning algorithms, such as RankNet (used in learning-to-rank problems),
rank-based models are used to predict the order of items. These models are used
extensively in recommendation systems (e.g., ranking movies or products), search engine
results, or any domain where the goal is to rank items in a specific order.
5. Ranking in Data Transformation and Normalization
In some situations, the rank method can be used as a form of data transformation. This can
be particularly useful when you want to make your data more robust to outliers or when
working with non-normal distributions. Ranking helps make the data more uniform,
especially for skewed or non-linear datasets.
Example of Ranking for Normalization:
Given a dataset of incomes: 1000,5000,3000,2500,10000
Ranking these values: 1,4,3,2,5
This transformation reduces the influence of large numbers (like the outlier of 10000) and
brings the dataset into a uniform scale based on relative positions.
Explain the two lines of regression?
In statistics and data science, two lines of regression refer to the regression models that
involve two variables: one independent variable (predictor) and one dependent variable
(response). The two lines of regression are used to predict the relationship between two
variables, and these lines are the least squares lines that best fit the data points in a scatter
plot.
The two primary types of regression lines are:
1. Regression of Y on X (or Y on X line)
2. Regression of X on Y (or X on Y line)
Both regression lines are used to describe the relationship between two variables, but they
are derived differently. Here’s an explanation of each:
1. Regression of Y on X (Y on X Line)
The regression of Y on X is the line that minimizes the sum of squared vertical deviations
(residuals) between the observed values of Y and the predicted values of Y based on X. In
other words, it tries to predict the value of the dependent variable Y using the independent
variable X.
Formula:
Y = a + bX
Where:
Y is the dependent variable (response variable).
X is the independent variable (predictor variable).
a is the intercept of the regression line (the value of Y when X = 0).
b is the slope of the regression line (the change in Y for each one-unit change in X).
Interpretation:
The slope b represents the amount by which Y changes for a one-unit change in X.
The intercept a represents the value of Y when X is zero.
2. Regression of X on Y (X on Y Line)
The regression of X on Y is the line that minimizes the sum of squared horizontal deviations
(residuals) between the observed values of X and the predicted values of X based on Y. It
predicts the value of the independent variable X based on the dependent variable Y.
Formula:
X = c + dY
Where:
X is the independent variable (predictor variable).
Y is the dependent variable (response variable).
c is the intercept of the regression line.
d is the slope of the regression line.
Interpretation:
The slope d represents the amount by which X changes for a one-unit change in Y.
The intercept c represents the value of X when Y is zero.
Key Differences between the Two Lines of Regression
Direction of Prediction:
o Y on X predicts Y (dependent) based on X (independent).
o X on Y predicts X (independent) based on Y (dependent).
Slope Difference:
o The slopes of Y on X and X on Y are generally different, but they are
related: their product equals r², the square of the correlation coefficient.
Minimizing Deviations:
o Y on X minimizes vertical deviations (residuals) between the observed and
predicted values of Y.
o X on Y minimizes horizontal deviations (residuals) between the observed and
predicted values of X.
Graphical Representation of the Two Lines of Regression
1. Regression of Y on X: The line of best fit passes through the data points, with vertical
residuals (the difference between actual Y values and predicted Y values from the
regression line).
2. Regression of X on Y: The line of best fit is also plotted, but now the residuals are
horizontal (the difference between actual X values and predicted X values from the
regression line).
Here’s a simple example to illustrate:
Example Dataset:
X (Independent) | Y (Dependent)
1 | 2
2 | 3
3 | 5
4 | 4
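Using this dataset, both regression lines can be computed directly from the least-squares formulas (a sketch in NumPy):

```python
import numpy as np

x = np.array([1, 2, 3, 4], dtype=float)
y = np.array([2, 3, 5, 4], dtype=float)

sxy = np.sum((x - x.mean()) * (y - y.mean()))  # 4.0
sxx = np.sum((x - x.mean()) ** 2)              # 5.0
syy = np.sum((y - y.mean()) ** 2)              # 5.0

b = sxy / sxx                # slope of Y on X: 0.8
a = y.mean() - b * x.mean()  # intercept: 1.5
d = sxy / syy                # slope of X on Y: 0.8
c = x.mean() - d * y.mean()  # intercept: -0.3 (up to float rounding)

r = np.sign(b) * np.sqrt(b * d)  # since b*d = r^2, here r = 0.8
print(f"Y on X: Y = {a} + {b}X")
print(f"X on Y: X = {c} + {d}Y")
```

Note that b × d = 0.64 = r², consistent with r = 0.8, and that both lines pass through the point of means (x̄, ȳ) = (2.5, 3.5).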