
ADS Imp Ans

The document explains statistical concepts including mean, median, mode, standard deviation, variance, quartiles, and outliers using both grouped and individual data examples. It also covers graphical methods in applied data science, such as histograms, box plots, scatter plots, line graphs, heatmaps, and pair plots, which help visualize data relationships and trends. Additionally, it discusses the rank method for data transformation and normalization, including its applications in statistical analysis and machine learning.

Uploaded by

abhishekmore1322

Explain mean, median, and mode in grouped data with an example.

1. Mean (Arithmetic Average)


The mean is the average of the data, calculated by summing the products of the class
midpoints and their frequencies, then dividing by the total frequency.
Formula for Mean in Grouped Data:
$\text{Mean} = \frac{\sum (f \times x)}{\sum f}$
where f is the frequency of a class and x is its midpoint.
Example:
Suppose you have the following grouped data:
Class Interval    Frequency (f)
10 - 20           5
20 - 30           8
30 - 40           12
40 - 50           10
Step 1: Find the class midpoints (x):
• Midpoint for 10-20: (10+20)/2 = 15
• Midpoint for 20-30: (20+30)/2 = 25
• Midpoint for 30-40: (30+40)/2 = 35
• Midpoint for 40-50: (40+50)/2 = 45
Step 2: Multiply each midpoint by its corresponding frequency:
• 15×5 = 75
• 25×8 = 200
• 35×12 = 420
• 45×10 = 450
Step 3: Sum the frequencies and the products:
• Total frequency: 5+8+12+10 = 35
• Sum of products: 75+200+420+450 = 1145
Step 4: Calculate the mean:
Mean = 1145/35 ≈ 32.71
So, the mean of the data is 32.71.
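The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `grouped_mean` and the list layout are assumptions made for the example, not part of the original notes.

```python
# Class intervals and frequencies from the example table.
classes = [(10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [5, 8, 12, 10]


def grouped_mean(classes, freqs):
    """Mean of grouped data: sum(f * midpoint) / sum(f)."""
    # Step 1: class midpoints.
    midpoints = [(lo + hi) / 2 for lo, hi in classes]
    # Steps 2 and 3: sum of frequency * midpoint, and total frequency.
    total_fx = sum(f * x for f, x in zip(freqs, midpoints))
    total_f = sum(freqs)
    # Step 4: divide.
    return total_fx / total_f


mean = grouped_mean(classes, freqs)
print(round(mean, 2))  # 32.71
```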

2. Median (Middle Value)


The median is the value that divides the data into two equal parts. In grouped data, the
median is found using the following steps:
Formula for Median in Grouped Data:
$\text{Median} = L + \left(\frac{\frac{N}{2} - F}{f}\right) \times h$
Example:
Let's find the median using the same grouped data.
Step 1: Find the cumulative frequencies:
Class Interval    Frequency (f)    Cumulative Frequency
10 - 20           5                5
20 - 30           8                13
30 - 40           12               25
40 - 50           10               35
• N = 35 (total frequency)
Step 2: Find the median class. The median class is the first class whose cumulative frequency
reaches or exceeds N/2 = 17.5. Here that is 30 - 40, because the cumulative frequency rises
from 13 to 25 in this class.
Step 3: Apply the median formula:
• L = 30 (lower boundary of the median class)
• N = 35
• F = 13 (cumulative frequency before the median class)
• f = 12 (frequency of the median class)
• h = 10 (class width)
Median = 30 + ((35/2 − 13)/12) × 10
Median = 30 + ((17.5 − 13)/12) × 10
Median = 30 + (4.5/12) × 10
Median = 30 + 3.75 = 33.75
So, the median is 33.75.
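As a rough sketch, the median steps can be written as a short Python function. The name `grouped_median` is hypothetical, and the code assumes the classes are listed in ascending order with equal or unequal widths taken from each interval.

```python
classes = [(10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [5, 8, 12, 10]


def grouped_median(classes, freqs):
    """Median = L + ((N/2 - F) / f) * h, using the first class that reaches N/2."""
    half = sum(freqs) / 2          # N/2
    cum_before = 0                 # F: cumulative frequency before the current class
    for (lo, hi), f in zip(classes, freqs):
        if cum_before + f >= half:
            # lo is L, f is the median-class frequency, (hi - lo) is the width h.
            return lo + (half - cum_before) / f * (hi - lo)
        cum_before += f
    raise ValueError("empty frequency table")


print(grouped_median(classes, freqs))  # 33.75
```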

3. Mode (Most Frequent Value)


The mode is the value that occurs most frequently in the data. In grouped data, the mode
can be approximated using the following formula:
Formula for Mode in Grouped Data:
$\text{Mode} = L + \left(\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right) \times h$
Example:
From the frequency table, we can see that the highest frequency is 12 for the class interval
30 - 40, so this is the modal class.
Step 1: Apply the mode formula:
• L = 30 (lower boundary of the modal class)
• f1 = 12 (frequency of the modal class)
• f0 = 8 (frequency of the class before the modal class)
• f2 = 10 (frequency of the class after the modal class)
• h = 10 (class width)
Mode = 30 + ((12 − 8)/(2(12) − 8 − 10)) × 10
Mode = 30 + (4/(24 − 8 − 10)) × 10
Mode = 30 + (4/6) × 10
Mode = 30 + (2/3) × 10
Mode = 30 + 6.67 = 36.67
So, the mode is approximately 36.67.
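The same calculation as a minimal Python sketch. The function name `grouped_mode` is an assumption for illustration, and treating a missing neighbouring class as frequency 0 is a choice made here, not something the notes specify. The formula is undefined when the modal class and both neighbours have equal frequencies.

```python
classes = [(10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [5, 8, 12, 10]


def grouped_mode(classes, freqs):
    """Mode = L + ((f1 - f0) / (2*f1 - f0 - f2)) * h for the modal class."""
    # Index of the modal class (highest frequency; first one wins on ties).
    i = max(range(len(freqs)), key=lambda k: freqs[k])
    f1 = freqs[i]
    # Assumption: a neighbour that does not exist counts as frequency 0.
    f0 = freqs[i - 1] if i > 0 else 0
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0
    lo, hi = classes[i]
    return lo + (f1 - f0) / (2 * f1 - f0 - f2) * (hi - lo)


print(round(grouped_mode(classes, freqs), 2))  # 36.67
```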
Explain mean derivation, S.D., variance, quartile derivation, and outliers with an example on
individual data (without frequencies).
Let's go through each concept (mean derivation, standard deviation (S.D.), variance, quartiles,
and outliers) step by step using individual data, without a frequency distribution.
1. Mean Derivation (Arithmetic Mean)
The mean (or arithmetic average) of a set of values is the sum of all the values divided by
the number of values. It represents the "central" or "average" value of the dataset.
Formula for Mean:
$\text{Mean}(\bar{x}) = \frac{\sum x_i}{n}$
Where:
• $\sum x_i$ = sum of all individual data points
• n = number of data points
Example:
Consider the data: 5,8,12,15,20
Step 1: Add the data points:
∑xi=5+8+12+15+20=60
Step 2: Count the number of data points:
n=5
Step 3: Calculate the mean:
Mean=60/5=12
So, the mean is 12.

2. Standard Deviation (S.D.)


The standard deviation measures how spread out the values in a dataset are from the
mean. A smaller standard deviation indicates that the values are closer to the mean, while a
larger standard deviation indicates more spread.
Formula for Standard Deviation:
$\text{S.D.} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$
Where:
• $x_i$ = individual data points
• $\bar{x}$ = mean
• n = number of data points
Steps to Calculate Standard Deviation:
Step 1: Find the mean (already calculated above: x̄ = 12).
Step 2: Subtract the mean from each data point and square the result:
• (5 − 12)² = 49
• (8 − 12)² = 16
• (12 − 12)² = 0
• (15 − 12)² = 9
• (20 − 12)² = 64
Step 3: Sum the squared differences:
∑(xi − x̄)² = 49+16+0+9+64 = 138
Step 4: Divide by the number of data points, n = 5:
∑(xi − x̄)²/n = 138/5 = 27.6
Step 5: Take the square root:
S.D. = √27.6 ≈ 5.25
So, the standard deviation is approximately 5.25.

3. Variance
Variance is another measure of the spread of data, and it is simply the square of the
standard deviation. It is the average of the squared differences between each data point and
the mean.
Formula for Variance:
$\text{Variance}(\sigma^2) = \frac{\sum (x_i - \bar{x})^2}{n}$
(Note that the variance is the same as the quantity before taking the square root for the
standard deviation.)
In the example above:
Variance=138/5=27.6
So, the variance is 27.6.
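The mean, variance, and standard deviation of the example can be checked with Python's standard `statistics` module. The notes divide by n, so this sketch uses the population versions `pvariance` and `pstdev`; the sample versions `variance` and `stdev` divide by n − 1 and would give different numbers.

```python
import statistics

data = [5, 8, 12, 15, 20]

mean = statistics.mean(data)           # sum of values / n
variance = statistics.pvariance(data)  # population variance: divides by n
sd = statistics.pstdev(data)           # square root of the population variance

print(mean)             # 12
print(variance)         # 27.6
print(round(sd, 2))     # 5.25
```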

4. Quartiles (Q1, Q2, Q3) and Interquartile Range (IQR)


Quartiles divide the data into four equal parts. The first quartile (Q1) is the value below
which 25% of the data lies, second quartile (Q2) is the median (middle value), and third
quartile (Q3) is the value below which 75% of the data lies.
Steps to Calculate Quartiles:
1. Arrange the data in ascending order (already done in this case): 5,8,12,15,20.
2. Q2 (Median): The median is the middle value of the dataset, which is 12.
3. Q1 (First Quartile): The median of the lower half of the data (excluding the median):
o Lower half: 5,8
o Median of lower half: (5+8)/2 = 6.5
o So, Q1 = 6.5
4. Q3 (Third Quartile): The median of the upper half of the data (excluding the median):
o Upper half: 15,20
o Median of upper half: (15+20)/2 = 17.5
o So, Q3 = 17.5
Interquartile Range (IQR):
IQR=Q3−Q1=17.5−6.5=11
So, the quartiles for this data are:
• Q1 = 6.5
• Q2 (Median) = 12
• Q3 = 17.5
And the IQR is 11.
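A small Python sketch of the "median of each half" method used above. The function name `quartiles` is hypothetical, and the sketch assumes at least two data points. Other quartile conventions exist (software packages often interpolate), so results from a library may differ slightly from this method on other datasets.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    For an odd number of points the middle value is excluded from both halves,
    matching the worked example.
    """
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower = s[:half]
    upper = s[half + 1:] if n % 2 else s[half:]
    return statistics.median(lower), statistics.median(s), statistics.median(upper)


q1, q2, q3 = quartiles([5, 8, 12, 15, 20])
print(q1, q2, q3)   # 6.5 12 17.5
print(q3 - q1)      # 11.0 (the IQR)
```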

5. Outliers
An outlier is a data point that is significantly different from other data points in the dataset.
To detect outliers, we use the Interquartile Range (IQR) method:
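Formula to identify outliers:
• Lower Bound = Q1 − 1.5×IQR
• Upper Bound = Q3 + 1.5×IQR
Steps to Find Outliers:
1. Lower Bound = 6.5 − 1.5×11 = 6.5 − 16.5 = −10
2. Upper Bound = 17.5 + 1.5×11 = 17.5 + 16.5 = 34
3. Every value in 5,8,12,15,20 lies between −10 and 34, so this dataset has no outliers.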
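The IQR rule can be sketched in Python as follows. The helper name `iqr_bounds` and the parameter `k` are assumptions for illustration; the quartile values 6.5 and 17.5 are taken from the quartile section above.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


data = [5, 8, 12, 15, 20]
lower, upper = iqr_bounds(6.5, 17.5)

# Any value outside the fences is flagged as an outlier.
outliers = [x for x in data if x < lower or x > upper]

print(lower, upper)  # -10.0 34.0
print(outliers)      # []
```

Adding a hypothetical value such as 50 to the data would place it above the upper fence of 34, so it would be flagged.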
Explain the graphical method in applied data science.
In Applied Data Science, the graphical method refers to using various types of visual
representations to analyze and interpret data. These methods help data scientists, analysts,
and decision-makers understand complex datasets, identify patterns, trends, and
relationships, and communicate insights effectively. Visualizations play a critical role in the
data analysis process, making it easier to detect outliers, distributions, correlations, and
other important patterns that are often difficult to discern from raw data alone.
1. Histograms
A histogram is a graphical representation of the distribution of a dataset. It is used to show
how often different values occur within a dataset and helps to visualize the frequency
distribution.
When to use:
• To understand the distribution of numerical data (e.g., whether the data is normally
distributed or skewed).
• To check for outliers or gaps in the data.
Example:
If you have a dataset of exam scores, a histogram will show how many students fall into
each score range (e.g., 0-10, 10-20 etc.).
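The counting behind a histogram can be sketched without any plotting library. The scores below are made-up values used only for illustration, and the text bars stand in for the rectangles a plotting tool would draw.

```python
from collections import Counter

# Hypothetical exam scores, invented for this example.
scores = [3, 12, 15, 27, 8, 19, 22, 35, 31, 14]
bin_width = 10

# Map each score to the lower edge of its bin (0, 10, 20, ...), then count per bin.
counts = Counter((s // bin_width) * bin_width for s in scores)

# Print one row per bin; the number of '#' marks is the frequency.
for start in sorted(counts):
    print(f"{start:>2}-{start + bin_width:<3} {'#' * counts[start]}")
```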
2. Box Plots (Box-and-Whisker Plots)
A box plot is a summary visualization that shows the distribution of a dataset based on five
summary statistics: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
It helps detect outliers and understand data spread and skewness.
When to use:
• To understand the spread and variability of a dataset.
• To compare distributions across different groups or categories.
• To identify outliers.
Example:
A box plot for a dataset of salaries across various job titles would show the range of salaries,
the median, and any extreme outliers (e.g., very high or low salaries).
3. Scatter Plots
A scatter plot is used to visualize the relationship between two continuous variables. It
displays points on a 2D plane with each point representing a pair of values (x, y).
When to use:
• To check for correlations between two variables (e.g., height vs. weight, age vs.
income).
• To identify trends, clusters, or outliers in the data.
• To perform regression analysis (visualizing the fit of a regression line).
Example:
If you're trying to predict house prices based on their square footage, a scatter plot can
show how square footage correlates with price, and whether there's a linear relationship.
4. Line Graphs
A line graph is used to visualize trends over time. It connects data points with a line and is
typically used to represent time series data.
When to use:
• To visualize trends over time (e.g., stock prices, sales over months, temperature over
days).
• To identify periodicity, seasonality, and growth patterns.
Example:
A line graph can track the number of new website visitors over the past year, helping
identify peak periods or trends in user growth.
5. Heatmaps
A heatmap is a graphical representation where individual values are represented by colors.
It is typically used to visualize correlations or patterns in large datasets, and it can be
especially helpful in understanding the structure of complex data like correlations,
confusion matrices, or geographic distributions.
When to use:
• To visualize correlation matrices between different variables.
• To show the intensity of values in a matrix form (e.g., website clicks, geographic
regions).
• To visualize the distribution of values in large data grids.
Example:
In a dataset with customer interaction across multiple product categories, a heatmap can
show which categories have the highest or lowest engagement.
6. Pair Plots (Scatterplot Matrix)
A pair plot (or scatterplot matrix) is a collection of scatter plots used to visualize pairwise
relationships between multiple variables in a dataset. It is particularly useful when analyzing
high-dimensional data.
When to use:
• To visualize relationships between multiple variables.
Explain the rank method in applied data science.
In applied data science, the rank method refers to a technique used to transform or
organize data based on the relative ranking or order of values within a dataset. Ranking is a
fundamental concept in data analysis and is especially useful when dealing with data that
needs to be compared or when it's not necessary to work with the actual values but rather
their relative positions.
The rank method can be used for various purposes such as feature engineering, handling
ordinal data, statistical analysis, and data transformation. Here’s a breakdown of how and
why ranking is applied in data science:
1. What is Ranking in Data Science?
Ranking assigns a numerical order to items in a dataset based on their magnitude. For a
given dataset, the smallest value is assigned the rank of 1, the next smallest the rank of 2,
and so on. When there are ties (i.e., two or more identical values), different methods can be
used to assign ranks.
Example:
For the dataset 12,17,9,15:
• Rank(9) = 1
• Rank(12) = 2
• Rank(15) = 3
• Rank(17) = 4
If there are duplicate values, they are assigned the average rank of their positions.
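A minimal Python sketch of ranking with average ranks for ties. The function name `average_ranks` is hypothetical, and the second dataset (with a duplicated 20) is invented here to show the tie rule.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) upward; tied values share their average rank."""
    # Indices of the values in ascending order of value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend `end` over the run of values equal to the one at `pos`.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Average of the 1-based positions pos+1 .. end+1.
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


print(average_ranks([12, 17, 9, 15]))   # [2.0, 4.0, 1.0, 3.0]
print(average_ranks([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]
```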
2. Rank Transformation in Data Science
In data preprocessing, the rank transformation technique involves replacing the raw data
values with their corresponding ranks. This is often done to:
• Normalize or scale the data: Ranking maps the data onto an ordinal scale, which can
be useful for certain algorithms.
• Reduce the impact of outliers: By replacing raw values with ranks, the influence of
outliers is minimized.
• Handle non-linear relationships: Ranking can sometimes make relationships between
variables more linear and easier to model.
Example of Rank Transformation:
Original Data: 12,17,9,15
After Rank Transformation: 2,4,1,3
This transformation helps to standardize the data by focusing on its relative positions rather
than the absolute values.
3. Rank-Based Methods in Applied Data Science
a. Spearman’s Rank Correlation
One of the most common rank-based statistical methods is Spearman’s Rank Correlation,
used to measure the strength and direction of the association between two variables.
Unlike Pearson correlation (which measures linear relationships), Spearman's correlation
assesses monotonic relationships (relationships where one variable tends to increase as the
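other increases, though not necessarily in a linear fashion).
• Formula: $\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$
where $d_i$ is the difference between the two ranks of observation i and n is the number of
observations.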
Example:
For two variables:
• Variable 1: 12,17,9,15
• Variable 2: 5,3,6,2
You would rank both variables and then calculate the differences in ranks, square them, sum
them up, and use the formula to find the Spearman rank correlation. This method helps
identify monotonic relationships in the data, even if those relationships are non-linear.
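Worked through for the two variables above, the ranks are 2,4,1,3 and 3,2,4,1, the rank differences are −1, 2, −3, 2, and their squares sum to 18, giving ρ = 1 − (6×18)/(4×15) = −0.8. The sketch below reproduces this in Python; the helper names are hypothetical, and `simple_ranks` assumes there are no tied values.

```python
def simple_ranks(values):
    """Rank 1 for the smallest value, 2 for the next, and so on (assumes no ties)."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def spearman_rho(x, y):
    """Spearman's rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), for untied data."""
    n = len(x)
    rx, ry = simple_ranks(x), simple_ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


rho = spearman_rho([12, 17, 9, 15], [5, 3, 6, 2])
print(round(rho, 2))  # -0.8
```

A value of −0.8 indicates a strong decreasing monotonic relationship between the two variables in this small example.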
b. Rank-based Models
In some machine learning algorithms, such as RankNet (used in learning-to-rank problems),
rank-based models are used to predict the order of items. These models are used
extensively in recommendation systems (e.g., ranking movies or products), search engine
results, or any domain where the goal is to rank items in a specific order.
4. Ranking in Data Transformation and Normalization
In some situations, the rank method can be used as a form of data transformation. This can
be particularly useful when you want to make your data more robust to outliers or when
working with non-normal distributions. Ranking helps make the data more uniform,
especially for skewed or non-linear datasets.
Example of Ranking for Normalization:
Given a dataset of incomes: 1000,5000,3000,2500,10000
Ranking these values: 1,4,3,2,5
This transformation reduces the influence of large numbers (like the outlier of 10000) and
brings the dataset into a uniform scale based on relative positions.
Explain the two lines of regression.
In statistics and data science, two lines of regression refer to the regression models that
involve two variables: one independent variable (predictor) and one dependent variable
(response). The two lines of regression are used to predict the relationship between two
variables, and these lines are the least squares lines that best fit the data points in a scatter
plot.
The two primary types of regression lines are:
1. Regression of Y on X (or Y on X line)
2. Regression of X on Y (or X on Y line)
Both regression lines are used to describe the relationship between two variables, but they
are derived differently. Here’s an explanation of each:
1. Regression of Y on X (Y on X Line)
The regression of Y on X is the line that minimizes the sum of squared vertical deviations
(residuals) between the observed values of Y and the predicted values of Y based on X. In
other words, it tries to predict the value of the dependent variable Y using the independent
variable X.
Formula:
Y = a + bX
Where:
• Y is the dependent variable (response variable).
• X is the independent variable (predictor variable).
• a is the intercept of the regression line (the value of Y when X = 0).
• b is the slope of the regression line (the change in Y for each one-unit change in X).
Interpretation:
• The slope b represents the amount by which Y changes for a one-unit change in X.
• The intercept a represents the value of Y when X is zero.
2. Regression of X on Y (X on Y Line)
The regression of X on Y is the line that minimizes the sum of squared horizontal deviations
(residuals) between the observed values of X and the predicted values of X based on Y. It
predicts the value of the independent variable X based on the dependent variable Y.
Formula:
X = c + dY
Where:
• X is the independent variable (predictor variable).
• Y is the dependent variable (response variable).
• c is the intercept of the regression line.
• d is the slope of the regression line.
Interpretation:
• The slope d represents the amount by which X changes for a one-unit change in Y.
• The intercept c represents the value of X when Y is zero.
Key Differences between the Two Lines of Regression
• Direction of Prediction:
o Y on X predicts Y (dependent) based on X (independent).
o X on Y predicts X (independent) based on Y (dependent).
• Slope Difference:
o The slope of Y on X and X on Y is generally not the same, though they are
related.
• Minimizing Deviations:
o Y on X minimizes vertical deviations (residuals) between the observed and
predicted values of Y.
o X on Y minimizes horizontal deviations (residuals) between the observed and
predicted values of X.
Graphical Representation of the Two Lines of Regression
1. Regression of Y on X: The line of best fit passes through the data points, with vertical
residuals (the difference between actual Y values and predicted Y values from the
regression line).
2. Regression of X on Y: The line of best fit is also plotted, but now the residuals are
horizontal (the difference between actual X values and predicted X values from the
regression line).
Here’s a simple example to illustrate:
Example Dataset:
X (Independent)    Y (Dependent)
1                  2
2                  3
3                  5
4                  4
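As a hedged sketch, both regression lines for this dataset can be computed with the usual least-squares formulas: slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and intercept = ȳ − slope × x̄. The function name `least_squares` is an assumption for illustration; swapping the arguments gives the X on Y line.

```python
def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line predicting y from x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


X = [1, 2, 3, 4]
Y = [2, 3, 5, 4]

a, b = least_squares(X, Y)  # Y on X: Y = a + bX
c, d = least_squares(Y, X)  # X on Y: X = c + dY

print(round(a, 2), round(b, 2))  # 1.5 0.8
print(round(c, 2), round(d, 2))  # -0.3 0.8
```

For this data the lines are Y = 1.5 + 0.8X and X = −0.3 + 0.8Y. The two slopes happen to be equal here because X and Y have the same spread; in general they differ, and their product b × d equals the squared correlation coefficient (0.64 here, so r = 0.8).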
