FDA End Sem
Descriptive Statistics
Descriptive statistics involves methods for summarizing and organizing data so that it can be
easily understood. These statistics describe the main features of a dataset, providing simple
summaries and graphical representations of the data.
Inferential Statistics
Inferential statistics involves methods that take a sample from a population and make
inferences or predictions about the larger population. These methods help in making
generalizations beyond the immediate data available.
1. Mean (Average): The sum of all values divided by the number of values. It provides a
measure of the central tendency of the data.
o Example: If we have test scores of 70, 80, 90, and 100, the mean is (70 + 80 +
90 + 100) / 4 = 85.
o Use: The mean can be used to summarize the average performance of students
in a class.
2. Median: The middle value when the data is ordered from least to greatest. If there is
an even number of observations, the median is the average of the two middle
numbers.
o Example: For the test scores 70, 80, 90, and 100, the median is (80 + 90) / 2 =
85.
o Use: The median is useful in understanding the center of the data, especially
when the data is skewed.
3. Standard Deviation: A measure of the amount of variation or dispersion in a set of values.
o Example: If the test scores are closely packed around the mean, the standard
deviation will be low; if they are spread out, it will be high.
o Use: The standard deviation helps in understanding the variability of the data.
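As a quick illustration, here is a minimal Python sketch computing all three statistics for the example scores above (NumPy is an assumption; the notes do not name a library):

import numpy as np

scores = np.array([70, 80, 90, 100])  # example test scores from above
print(np.mean(scores))    # 85.0 -> mean (sum / count)
print(np.median(scores))  # 85.0 -> median (average of the two middle values)
print(np.std(scores))     # ~11.18 -> population standard deviation (spread around the mean)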
1. Readability
Readable Text: Ensure that all text, including labels, titles, and annotations, is easily readable.
2. Accuracy
Truthful Representation: Ensure the visualization reflects the data accurately, without distorting scales or proportions.
3. Consistency
Uniform Design: Use consistent colors, fonts, and shapes across similar types of data
to avoid confusion.
4. Relevance
Focus on Key Information: Highlight the most important data points and trends
relevant to the audience.
5. Efficiency
Quick Insights: Design visualizations that allow viewers to quickly grasp the main
message or insights without extensive explanation.
6. Aesthetics
Color Use: Use color effectively to highlight key data points and differentiate
categories. Be mindful of colorblindness and choose palettes that are accessible to all
viewers.
7. Simplicity
Minimal Clutter: Remove unnecessary elements so the data itself stands out.
8. Functionality
Fit for Purpose: Choose chart types that suit the data and the question being answered.
9. Hierarchy
Visual Hierarchy: Use size, color, and positioning to create a visual hierarchy that
guides viewers through the information in a logical order.
10. Storytelling
Narrative Flow: Create a logical flow that tells a story with the data, leading the
viewer from introduction to conclusion.
Q4. What are some popular Python libraries used for data visualization
during EDA?
Matplotlib:
Description: The foundational Python plotting library; it offers fine-grained control over figures, axes, labels, and styles.
Pandas:
Description: Built into the Pandas library, it allows quick and easy plotting directly from DataFrame and Series objects.
data['column'].plot(kind='hist')
GGplot:
Description: A Python take on R's ggplot2 grammar of graphics (commonly used via the plotnine package), building plots from layered components.
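A minimal runnable sketch of the Pandas plotting call above (the DataFrame and the column name 'value' are hypothetical placeholders):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical example data; 'value' is a placeholder column name
data = pd.DataFrame({'value': [70, 80, 80, 90, 100, 95, 85]})

# Pandas plotting wraps Matplotlib under the hood
data['value'].plot(kind='hist', title='Distribution of value')
plt.show()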
In a one-way Analysis of Variance (ANOVA), the null hypothesis (H₀) being tested is that
the means of the different groups are equal. This hypothesis can be formally stated as:
H₀: μ₁ = μ₂ = μ₃ = ⋯ = μₖ
Alternative Hypothesis (H₁): At least one group mean is different from the others.
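As a sketch, a one-way ANOVA can be run with SciPy's f_oneway (the three groups below are made-up illustration data):

from scipy.stats import f_oneway

# Made-up scores for three independent groups
group1 = [85, 90, 88, 92, 86]
group2 = [78, 82, 80, 79, 81]
group3 = [90, 94, 91, 93, 89]

f_stat, p_value = f_oneway(group1, group2, group3)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# If p < 0.05, reject H₀: at least one group mean differs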
Data visualization has a rich history that spans centuries. It evolved from
simple charts and diagrams to sophisticated digital graphics. Here's a
brief overview:
6. Big Data Era: In the 21st century, the explosion of data from
various sources necessitated more sophisticated visualization
techniques to make sense of complex datasets. Visualization tools
and techniques continue to evolve to handle the challenges posed
by big data.
Q8. Discuss the advantages and disadvantages of using data visualization for
data analysis.
Advantages:
Simplification of complex data
Pattern recognition
Faster processing
Customization
Quick response
Aesthetics
Efficiency
Disadvantages:
Over-simplification
Bias
Technical expertise
Tool limitations
Time and cost
Maintenance
Q9. What is the role of features in machine learning? Can you give an
example of a feature and its target variable?
Features play a crucial role as they are the input variables a model uses to
make predictions or decisions. Features are the measurable properties or
characteristics of the phenomenon being observed.
1. Data Representation: Features represent the data points that the model
will use to learn. They help in capturing the underlying patterns in the
data.
2. Model Training: During the training process, the machine learning
algorithm uses features to understand how different values of the features
correlate with the target variable.
3. Prediction: Once the model is trained, it uses the features from new,
unseen data to make predictions about the target variable.
4. Feature Engineering: The process of selecting, modifying, or creating
new features can significantly improve model performance. Feature
engineering is a critical step in the machine learning workflow.
Consider a scenario where we are building a machine learning model to predict house prices.
In this case:
Feature: The size of the house in square feet.
Target Variable: The price of the house.
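A minimal sketch of this feature/target setup using scikit-learn (the sizes and prices are made-up illustration values):

from sklearn.linear_model import LinearRegression

# Made-up data: feature = house size (sq ft), target = house price ($)
X = [[1000], [1500], [2000], [2500]]   # feature matrix
y = [200000, 290000, 410000, 500000]   # target variable

model = LinearRegression().fit(X, y)
print(model.predict([[1800]]))  # predicted price for an unseen 1800 sq ft house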
Q10. List two types of non-parametric tests used to compare two
independent groups.
1. Mann-Whitney U Test:
Purpose: This test is used to determine whether there is a significant difference
between the distributions of two independent groups.
Assumption: the data need not be normally distributed.
Example: Comparing the median incomes of two different cities to determine if
one city has a significantly different income distribution compared to the other.
2. Kolmogorov-Smirnov Test:
Purpose: This test compares the distributions of two independent samples,
testing the null hypothesis that both samples come from the same distribution.
It is sensitive to differences in both the shape and the location of the distributions.
Example: Comparing the distribution of daily temperatures between two
different time periods
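Both tests are available in SciPy; a minimal sketch (the two samples are made-up illustration data):

from scipy.stats import mannwhitneyu, ks_2samp

# Made-up incomes (in thousands) for two cities
city_a = [32, 45, 51, 28, 60, 39, 47]
city_b = [55, 62, 48, 70, 66, 58, 61]

u_stat, p_u = mannwhitneyu(city_a, city_b)   # Mann-Whitney U test
ks_stat, p_ks = ks_2samp(city_a, city_b)     # Kolmogorov-Smirnov test
print(f"Mann-Whitney U: p = {p_u:.4f}")
print(f"Kolmogorov-Smirnov: p = {p_ks:.4f}")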
Q11. Compare and contrast the assumptions of ANOVA with those of a non-
parametric test like the Mann-Whitney U test.
Assumptions of ANOVA:
Normality: the data within each group are normally distributed.
Equal variance: variances are equal across the groups (homogeneity of variance).
Independence: observations should be independent.
ANOVA therefore requires roughly normal, interval-scale data and is sensitive to outliers.
In contrast, the Mann-Whitney U test also assumes independent observations, but it does not assume normality or equal variances: it works on ranks, handles ordinal data, and is robust to outliers.
Q12. List common hyperparameter tuning techniques.
1. Grid Search:
Description: Exhaustively evaluates every combination of hyperparameter values in a predefined grid.
2. Random Search:
Description: Samples hyperparameter combinations at random from specified ranges, often finding good settings with far fewer trials than a full grid.
3. Bayesian Optimization:
Description: Builds a probabilistic model of the objective function and uses it to choose the most promising configurations to evaluate next.
4. Hyperband:
Description: Dynamically allocates resources to different configurations, allowing early stopping of poorly performing ones.
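A minimal sketch contrasting grid search and random search with scikit-learn (the model, parameter grid, and dataset are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5, None]}

# Grid search: tries all 6 combinations exhaustively
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3).fit(X, y)

# Random search: samples 4 combinations at random
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_grid,
                          n_iter=4, cv=3, random_state=0).fit(X, y)

print(grid.best_params_, rand.best_params_)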
Q13. In simple terms, explain the concept of a decision tree. How does it
make decisions for classifying data points?
Decision tree in machine learning:
1. Prepare the data.
2. Start with a root node containing the full dataset.
3. Find the best feature/split, e.g., the one that most reduces entropy.
4. Create branches for each outcome of the split.
5. Repeat recursively to build the tree until nodes are pure or a stopping rule is met.
High entropy = more disorder (mixed classes).
Low entropy = more purity (mostly one class).
To classify a new data point, the tree follows the branches from the root according to the point's feature values until it reaches a leaf, whose class is the prediction (see the sketch below).
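A minimal sketch with scikit-learn's decision tree on the classic iris dataset (the dataset choice and max_depth are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Entropy-based splits, as described in the steps above
tree = DecisionTreeClassifier(criterion='entropy', max_depth=3).fit(X, y)

# Classify a new data point: it walks from the root down the branches
sample = [[5.1, 3.5, 1.4, 0.2]]  # sepal/petal measurements
print(tree.predict(sample))      # predicted class label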
Q14. Differentiate between descriptive, predictive, and prescriptive analytics.
1. Descriptive Analytics
Purpose: Summarizes historical data to describe what has happened, typically through reports, dashboards, and summary statistics.
2. Predictive Analytics
Purpose: Uses historical data and statistical or machine learning models to forecast what is likely to happen.
3. Prescriptive Analytics
Purpose: Prescriptive analytics goes a step further by not only predicting future
outcomes but also recommending actions to achieve desired results. It combines
predictive analytics with optimization techniques to suggest the best course of
action.
Q15. Describe the basic idea behind linear regression. How can you interpret
the coefficients in a linear regression model?
Linear regression is a statistical method used to model and analyze the
relationship between a dependent variable and one or more independent
variables. It finds the best-fitting linear relationship that predicts the
dependent variable from the independent variables:
y = β0 + β1x1 + β2x2 + ⋯ + βnxn + ϵ
y = dependent variable
β0 = intercept (the value of y when all xi = 0)
x1, …, xn = independent variables
β1, …, βn = coefficients of the independent variables
ϵ = error term (the difference between the observed and predicted values)
Interpreting the Coefficients in a Linear Regression Model:
1. Intercept (β0): value of the dependent variable when all independent
variables are equal to zero.
2. Slope Coefficients (β1,β2,…,βn): represents the expected change in the
dependent variable for a one-unit increase in the corresponding independent
variable.
For example, if β1 = 2, then for every one-unit increase in x1 the dependent
variable y is expected to increase by 2 units, holding the other variables constant.
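A minimal sketch fitting a linear regression and reading off the coefficients (the data are generated to roughly follow y = 10 + 2x; all values are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data roughly following y = 10 + 2*x + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 10 + 2 * X[:, 0] + rng.normal(0, 1, size=100)

model = LinearRegression().fit(X, y)
print(f"Intercept (β0): {model.intercept_:.2f}")  # ~10: value of y when x = 0
print(f"Coefficient (β1): {model.coef_[0]:.2f}")  # ~2: change in y per unit increase in x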
Q16. Explain the role of the p-value in hypothesis testing and how to interpret it (one-tailed vs. two-tailed tests).
The p-value is a key concept in hypothesis testing: it is the probability of
obtaining results at least as extreme as those observed, assuming the null
hypothesis is true. A small p-value (commonly below 0.05) means the observed
data are unlikely under the null hypothesis, so we reject it.
One-Tailed Test:
Purpose: Tests if the effect is in one specific direction (either greater than
or less than a certain value).
Example:
Testing if a new drug increases recovery rates more than the existing drug:
H₀: μnew ≤ μold; H₁: μnew > μold.
Two-Tailed Test:
Purpose: Tests if the effect differs in either direction (greater or less than a certain value).
Example:
Testing if a new drug has a different effect (either more or less effective) than
the existing drug:
H₀: μnew = μold; H₁: μnew ≠ μold.
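A sketch contrasting the two with SciPy's independent-samples t-test (the recovery-rate samples are made up; the alternative= keyword needs SciPy 1.6 or later):

from scipy.stats import ttest_ind

# Made-up recovery rates (%) for the new and existing drugs
new_drug = [78, 82, 85, 80, 88, 84]
old_drug = [75, 79, 77, 74, 80, 76]

# One-tailed: is the new drug's mean recovery rate GREATER?
t1, p_one = ttest_ind(new_drug, old_drug, alternative='greater')

# Two-tailed: is the new drug's mean DIFFERENT in either direction?
t2, p_two = ttest_ind(new_drug, old_drug, alternative='two-sided')

print(f"one-tailed p = {p_one:.4f}, two-tailed p = {p_two:.4f}")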
Q17. Describe the various stages of the Data Analysis Process (e.g., Data
Collection, Cleaning, Exploration, Analysis, Visualization). Briefly explain the
importance of each stage:
The data analysis process extracts meaningful insights from raw data through the following stages:
Data Collection: Sets the stage for analysis by ensuring relevant and comprehensive data.
Data Cleaning: Ensures the accuracy and reliability of the data.
Data Exploration: Provides initial insights and guides further analysis (e.g., mean, median, mode, plots, graphs).
Data Analysis: Extracts meaningful and actionable insights from the data.
Data Visualization: Communicates findings effectively and aids in decision-making.
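A toy end-to-end sketch of these stages with Pandas (the file name 'sales.csv' and the column names are hypothetical):

import pandas as pd
import matplotlib.pyplot as plt

# Collection: load raw data (hypothetical file and columns)
df = pd.read_csv('sales.csv')

# Cleaning: remove duplicates and fill missing values
df = df.drop_duplicates()
df['revenue'] = df['revenue'].fillna(df['revenue'].median())

# Exploration: summary statistics for an initial look
print(df['revenue'].describe())

# Analysis: average revenue per region
summary = df.groupby('region')['revenue'].mean()

# Visualization: communicate the finding
summary.plot(kind='bar', title='Average revenue by region')
plt.show()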
Q18. Generate simulated time series data from a specific ARIMA model and
analyze its ACF and PACF plots.
ACF (Autocorrelation Function):
Measures the correlation between the current value and past values of the series.
Considers more than one time lag.
Does not remove the effect of shorter (intermediate) lags.
Captures indirect as well as direct impact.
Its pattern helps identify the MA(q) order of an ARIMA model.
PACF (Partial Autocorrelation Function):
Measures the correlation between the current value and one specific past value.
Considers only one time lag at a time.
Removes the effect of shorter lags.
Captures only the direct impact (a partial correlation coefficient).
Its pattern helps identify the AR(p) order of an ARIMA model.
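Since the question asks for a simulation, here is a minimal sketch using statsmodels (the AR(2) coefficients 0.6 and 0.3 are arbitrary illustration values):

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Simulate an AR(2) process: y_t = 0.6*y_{t-1} + 0.3*y_{t-2} + e_t
# ArmaProcess takes lag-polynomial form, so AR coefficients are negated
ar = np.array([1, -0.6, -0.3])
ma = np.array([1])
series = ArmaProcess(ar, ma).generate_sample(nsample=500)

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(series, lags=20, ax=axes[0])   # ACF: tails off gradually for an AR process
plot_pacf(series, lags=20, ax=axes[1])  # PACF: cuts off sharply after lag 2
plt.tight_layout()
plt.show()

For this AR(2) series, the PACF cutting off after lag 2 suggests p = 2, matching the identification rules described above.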