
Q1. Explain the difference between descriptive and inferential statistics. Provide examples of common descriptive statistics (e.g., mean, median, standard deviation) and how they can be used to summarize data.

Descriptive Statistics

Descriptive statistics involves methods for summarizing and organizing data so that it can be
easily understood. These statistics describe the main features of a dataset, providing simple
summaries and graphical representations of the data.

Inferential Statistics

Inferential statistics involves methods that take a sample from a population and make
inferences or predictions about the larger population. These methods help in making
generalizations beyond the immediate data available.

Common Descriptive Statistics:

1. Mean (Average): The sum of all values divided by the number of values. It provides a
measure of the central tendency of the data.
o Example: If we have test scores of 70, 80, 90, and 100, the mean is (70 + 80 +
90 + 100) / 4 = 85.
o Use: The mean can be used to summarize the average performance of students
in a class.
2. Median: The middle value when the data is ordered from least to greatest. If there is
an even number of observations, the median is the average of the two middle
numbers.
o Example: For the test scores 70, 80, 90, and 100, the median is (80 + 90) / 2 =
85.
o Use: The median is useful in understanding the center of the data, especially
when the data is skewed.
3. Standard Deviation: A measure of the amount of variation or dispersion in a set of
values.
o Example: If the test scores are closely packed around the mean, the standard
deviation will be low; if they are spread out, it will be high.
o Use: The standard deviation helps in understanding the variability of the data.
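
As a quick illustration, these three summary statistics can be computed directly in Python; a minimal sketch using the standard library's statistics module on the example test scores:

import statistics

scores = [70, 80, 90, 100]

print(statistics.mean(scores))     # (70 + 80 + 90 + 100) / 4 = 85
print(statistics.median(scores))   # average of the two middle values = 85
print(statistics.stdev(scores))    # sample standard deviation (spread around the mean)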

Q2. Key principles of data visualization?

Data visualization is an essential tool for communicating information effectively. To create clear, accurate, and compelling visualizations, it's important to follow key principles that guide their design and use. Here are some of the key principles of data visualization:
1. Clarity

 Readable Text: Ensure that all text, including labels, titles, and annotations, is easily
readable.

2. Accuracy

 Represent Data Honestly: Avoid distorting the data or creating misleading representations. For example, use appropriate scales and avoid manipulating axis ranges to exaggerate trends.

3. Consistency

 Uniform Design: Use consistent colors, fonts, and shapes across similar types of data
to avoid confusion.

4. Relevance

 Focus on Key Information: Highlight the most important data points and trends
relevant to the audience.

5. Efficiency

 Quick Insights: Design visualizations that allow viewers to quickly grasp the main
message or insights without extensive explanation.

6. Aesthetics

 Color Use: Use color effectively to highlight key data points and differentiate
categories. Be mindful of colorblindness and choose palettes that are accessible to all
viewers.

7. Simplicity

 Direct Presentation: Present information in a straightforward manner without unnecessary embellishments.

8. Functionality

 Purpose-Driven Design: Ensure each element of the visualization serves a clear purpose and contributes to the overall message.

9. Hierarchy

 Visual Hierarchy: Use size, color, and positioning to create a visual hierarchy that
guides viewers through the information in a logical order.

10. Storytelling
 Narrative Flow: Create a logical flow that tells a story with the data, leading the
viewer from introduction to conclusion.

Q3. Explain the concept of data visualization in EDA. Why is it important?


Key Concepts of Data Visualization in EDA (Exploratory Data Analysis)

1. Graphical Representation: Transforming data into visual formats such as charts, graphs, and plots, making complex data more accessible.
2. Pattern Recognition: Visual tools like scatter plots, histograms, and box plots help
identify patterns, trends, and relationships in the data.
3. Anomaly Detection: Visualization helps in spotting outliers and anomalies, which
might indicate errors or significant insights.
4. Hypothesis Testing: Visual comparison of data subsets can help in forming and
testing hypotheses about the data.
5. Data Distribution: Charts like histograms and density plots illustrate the distribution
of data, showing how values are spread.
6. Data Cleaning: Visualizations can reveal missing values, duplicates, or other data
quality issues that need addressing.
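
A minimal sketch of what this looks like in practice (the DataFrame and its 'value' column are made up purely for illustration):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative data: 500 roughly normal values
df = pd.DataFrame({'value': np.random.normal(50, 10, 500)})

# Data distribution: a histogram shows how the values are spread
df['value'].plot(kind='hist', bins=30, title='Distribution of value')
plt.show()

# Anomaly detection: a box plot makes outliers easy to spot
df['value'].plot(kind='box', title='Outliers in value')
plt.show()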

Importance of Data Visualization in EDA

1. Enhanced Understanding: Visual representations simplify complex data, making it easier to grasp and interpret. This is crucial for both analysts and non-technical stakeholders.
2. Effective Communication: Visuals convey insights more effectively than raw data or
summary statistics alone, aiding in clear communication of findings.
3. Quick Insights: Visualization allows for rapid identification of key trends and
patterns, speeding up the analysis process and enabling faster decision-making.
4. Data Quality Assessment: By visualizing data, analysts can quickly spot and address
data quality issues, ensuring more accurate and reliable analyses.
5. Guidance for Further Analysis: Initial visual exploration can highlight areas that
warrant deeper investigation, guiding subsequent analytical efforts.
6. Interactive Exploration: Interactive visual tools enable users to explore data
dynamically, asking new questions and uncovering deeper insights through direct
manipulation of the visuals.

Q4. What are some popular Python libraries used for data visualization
during EDA?

Matplotlib:

 Description: A comprehensive library for creating static, animated, and interactive visualizations in Python.

import matplotlib.pyplot as plt
plt.hist(data['column'])
plt.show()
Seaborn:

 Description: Built on top of Matplotlib, Seaborn provides a high-level interface for drawing attractive and informative statistical graphics.

import seaborn as sns
sns.histplot(data['column'], kde=True)  # histplot replaces the deprecated distplot

Pandas :

 Description: Plotting is built directly into the Pandas library, allowing quick and easy plots from DataFrame and Series objects.

data['column'].plot(kind='hist')

ggplot:

 Description: A Python implementation of the grammar of graphics, inspired by ggplot2 in R.

from ggplot import ggplot, aes, geom_point

Q5. What is the null hypothesis tested in a one-way ANOVA?

In a one-way Analysis of Variance (ANOVA), the null hypothesis (H₀) being tested is that the means of the different groups are equal. This hypothesis can be formally stated as:

H0: μ1 = μ2 = μ3 = ... = μk

where μ1, μ2, ..., μk are the population means of the k different groups being compared.

 Null Hypothesis (H₀): All group means are equal.
 Alternative Hypothesis (H₁): At least one group mean differs from the others, i.e. μi ≠ μj for at least one pair i ≠ j.
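
A minimal sketch of a one-way ANOVA in Python with SciPy (the three group samples are made up for illustration):

from scipy import stats

group_a = [70, 80, 90, 85]   # illustrative scores for three groups
group_b = [65, 75, 88, 92]
group_c = [60, 70, 72, 68]

# f_oneway tests H0: all group means are equal
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)       # reject H0 if p_value is below the chosen significance level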

Q6. Describe the situations where a non-parametric test would be preferred over a parametric test.

 Non-normal Distributions: When the data does not follow a normal distribution, non-parametric tests, which do not assume normality, are more suitable.
 Categorical Data: For categorical data that does not involve numerical values, non-parametric tests are used.
 Presence of Outliers: Non-parametric tests are less sensitive to outliers than parametric tests.
 Lack of Information: When there is insufficient information about the population parameters (mean, variance), non-parametric tests are advantageous.
 Unequal Variances: When the homogeneity-of-variances assumption is violated, non-parametric tests can be useful.
 Non-linearity: When the relationship between variables is non-linear, non-parametric tests can be useful.
Q7. Briefly explain the evolution of data visualization. When and why did
data visualization become important?

Data visualization has a rich history that spans centuries. It evolved from
simple charts and diagrams to sophisticated digital graphics. Here's a
brief overview:

1. Early Visualizations: The roots of data visualization can be traced back to ancient times with maps, cave paintings, and other rudimentary forms of visual representation.

2. Statistical Graphics: In the 18th and early 19th centuries, pioneers like William Playfair and Joseph Priestley introduced statistical graphs, laying the foundation for modern data visualization. Playfair, for instance, created line graphs, bar charts, and pie charts.

3. Technological Advancements: The Industrial Revolution brought about advancements in printing technology, enabling the mass production of visual materials like charts and graphs.

4. Computer Age: The advent of computers in the 20th century revolutionized data visualization. Tools like spreadsheets and graphing software made it easier to create and analyze visualizations.

5. Interactive Visualization: With the rise of the internet and interactive technologies, data visualization became more dynamic and user-friendly. Websites and software applications allowed users to explore data in real time.

6. Big Data Era: In the 21st century, the explosion of data from various sources necessitated more sophisticated visualization techniques to make sense of complex datasets. Visualization tools and techniques continue to evolve to handle the challenges posed by big data.

Importance of Data Visualization:

1. Cognitive Efficiency: Humans process visual information more efficiently than text. Visualization helps in quickly understanding complex data.
2. Pattern Recognition: Visualizations aid in identifying trends, patterns, and outliers in datasets that might not be apparent through numerical analysis alone.
3. Communication: Effective visualizations communicate data-driven insights clearly and concisely to a broad audience, including non-specialists.
4. Decision Making: In business, science, and public policy, visualizations
support informed decision-making by presenting data in an actionable
format.
5. Engagement: Interactive and dynamic visualizations engage users,
allowing them to explore data and gain insights actively.

Q8. Discuss the advantages and disadvantages of using data visualization for
data analysis.

Advantages:
 Simplification of complex data
 Pattern recognition
 Faster processing
 Customization
 Quick insights
 Aesthetics
 Efficiency

Disadvantages:
 Over-simplification
 Bias
 Requires technical expertise
 Tool limitations
 Time and cost
 Maintenance

Q9. What is the role of features in machine learning? Can you give an
example of a feature and its target variable?
In machine learning, features play a crucial role as they are the input variables used by the model to make predictions or decisions. Features are the measurable properties or characteristics of the phenomenon being observed.

Role of Features in Machine Learning

1. Data Representation: Features represent the data points that the model
will use to learn. They help in capturing the underlying patterns in the
data.
2. Model Training: During the training process, the machine learning
algorithm uses features to understand how different values of the features
correlate with the target variable.
3. Prediction: Once the model is trained, it uses the features from new,
unseen data to make predictions about the target variable.
4. Feature Engineering: The process of selecting, modifying, or creating
new features can significantly improve model performance. Feature
engineering is a critical step in the machine learning workflow.

Example of a Feature and Its Target Variable:

Consider a scenario where we are building a machine learning model to predict house prices.
In this case:

 Feature: The size of the house in square feet.
 Target Variable: The price of the house.
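
A minimal sketch of this feature/target pairing in Python with scikit-learn (all numbers are made up for illustration):

from sklearn.linear_model import LinearRegression

# Feature: house size in square feet; target: house price
X = [[1000], [1500], [2000], [2500]]     # feature matrix with one feature per house
y = [200000, 280000, 360000, 440000]     # target variable: price of each house

model = LinearRegression().fit(X, y)
print(model.predict([[1800]]))           # predict the price of an unseen 1800 sq ft house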
Q10. List two types of non-parametric tests used to compare two
independent groups.
1. Mann-Whitney U Test:
Purpose: This test is used to determine whether there is a significant difference between the distributions of two independent groups. It is appropriate when the data are not normally distributed.
Example: Comparing the median incomes of two different cities to determine if one city has a significantly different income distribution compared to the other.

2. Kolmogorov-Smirnov Test:
Purpose: This test compares the distributions of two independent samples, testing whether both samples come from the same distribution. It is sensitive to differences in both the shape and the location of the distributions.
Example: Comparing the distribution of daily temperatures between two different time periods.
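
A minimal sketch of both tests using SciPy (city_a and city_b are made-up income samples, in thousands, used only for illustration):

from scipy import stats

city_a = [32, 45, 28, 51, 39, 44, 30]
city_b = [55, 60, 48, 52, 63, 58, 50]

# Mann-Whitney U test: do the two independent groups differ in their distributions?
u_stat, p_mw = stats.mannwhitneyu(city_a, city_b, alternative='two-sided')

# Kolmogorov-Smirnov test: do the two samples come from the same distribution?
ks_stat, p_ks = stats.ks_2samp(city_a, city_b)

print(p_mw, p_ks)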

Q11. Compare and contrast the assumptions of ANOVA with those of a non-
parametric test like the Mann-Whitney U test.
Assumptions of ANOVA:
 Normality: the data in each group are normally distributed.
 Equal variances: variances are equal across the groups (homogeneity of variance).
 Independence: observations are independent of one another.
 Requires numeric (interval/ratio) data.
 Sensitive to outliers.

Assumptions of the Mann-Whitney U Test:
 No normality assumption: suitable for data that are not normally distributed.
 Variances are not required to be equal across the groups.
 Independence: observations are independent.
 Works with ordinal (or non-normal numeric) data.
 Less sensitive to outliers.

Q12. Discuss the role of hyperparameters in machine learning models. How are they tuned for optimal performance?

Role of Hyperparameters:
 Model Capacity / Complexity: they control how complex the model is allowed to become (e.g. tree depth, number of layers).
 Training Process: they govern how the model is trained (e.g. learning rate, batch size).
 Regularization: they set the strength of regularization, helping to prevent overfitting.
 Model-Specific Settings: each machine learning algorithm has its own set of hyperparameters.

Tuning Hyperparameters for Optimal Performance:

1. Grid Search:

 Description: Define a set of possible values for each hyperparameter and then train the model for every possible combination.

2. Random Search:

 Description: Instead of trying all possible combinations, random search randomly samples from the hyperparameter space.

3. Bayesian Optimization:

 Description: Uses the results of previous evaluations to select the most promising hyperparameters to evaluate next.

4. Hyperband:
 Description: dynamically allocates resources to different configurations,
allowing for early stopping of poor configurations.
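
A minimal sketch of grid search with scikit-learn's GridSearchCV (the model, parameter grid, and dataset are illustrative choices, not the only possibility):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate values for two hyperparameters of an SVM
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}

# Grid search trains and cross-validates a model for every combination
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)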

Q13. In simple terms, explain the concept of a decision tree. How does it
make decisions for classifying data points?
A decision tree classifies data points by asking a sequence of simple questions about their features. Building and using one works roughly as follows:
1. Prepare the data.
2. Start with a root node containing all the data.
3. Find the best split/feature, i.e. the one that best separates the classes (reducing entropy or impurity the most).
4. Create branches for each outcome of the split.
5. Repeat on each branch to build the tree, stopping when nodes are pure or a limit is reached.
6. To classify a new data point, follow the branches that match its feature values down to a leaf; the leaf's majority class is the prediction.

High entropy = more disorder (mixed classes); low entropy = more purity (mostly one class).
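
A minimal sketch with scikit-learn's DecisionTreeClassifier (the iris dataset is used purely for illustration):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Fit a shallow tree so the learned rules stay readable
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the sequence of feature questions the tree asks
print(export_text(tree, feature_names=load_iris().feature_names))

# Classify one new data point by following the branches to a leaf
print(tree.predict([[5.1, 3.5, 1.4, 0.2]]))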

Q14. Discuss the differences between descriptive, predictive, and prescriptive analytics. Provide examples of each.

1. Descriptive Analytics

Purpose: Descriptive analytics focuses on summarizing historical data to understand what has happened in the past. It provides insights into past performance and trends by interpreting raw data and transforming it into meaningful information.

Key Features:

 Summarizes past events.


 Uses data aggregation and data mining techniques.
 Provides insights through reports, dashboards, and data visualization.

Examples: Annual revenue reports, year-over-year sales reports.

2. Predictive Analytics

Purpose: Predictive analytics aims to forecast future events by analyzing historical data and identifying patterns. It uses statistical models and machine learning algorithms to predict outcomes.
Key Features:

 Focuses on predicting future events based on historical data.


 Utilizes techniques such as regression analysis, time series analysis, and
machine learning.
 Helps in risk assessment, demand forecasting, and identifying future
opportunities.

Example: Stock price forecasting.

3. Prescriptive Analytics

Purpose: Prescriptive analytics goes a step further by not only predicting future
outcomes but also recommending actions to achieve desired results. It combines
predictive analytics with optimization techniques to suggest the best course of
action.

Key Features:

 Provides actionable recommendations.


 Uses optimization and simulation algorithms.
 Focuses on decision-making and improving outcomes.

Example: Healthcare treatment plans.

Q15. Describe the basic idea behind linear regression. How can you interpret
the coefficients in a linear regression model?
Linear regression is a statistical method used to model and analyze the relationship between a dependent variable and one or more independent variables. It finds the best-fitting linear relationship that predicts the dependent variable from the independent variables:

y = β0 + β1x1 + β2x2 + ⋯ + βnxn + ϵ

where:
y = dependent variable
β0 = intercept (the value of y when all xi = 0)
x1, ..., xn = independent variables
β1, ..., βn = coefficients (slopes) of the independent variables
ϵ = error term, the difference between the observed and predicted values
Interpreting the Coefficients in a Linear Regression Model:
1. Intercept (β0): value of the dependent variable when all independent
variables are equal to zero.
2. Slope Coefficients (β1, β2, …, βn): each represents the expected change in the dependent variable for a one-unit increase in the corresponding independent variable, holding the other variables constant.
For example, if β1 = 2, then for every one-unit increase in x1 the dependent variable y is expected to increase by 2 units.
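
A minimal sketch of fitting a linear regression and reading off its coefficients (the house-price data are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Features: house size (sq ft) and number of bedrooms; target: price
X = np.array([[1000, 2], [1500, 3], [2000, 3], [2500, 4], [3000, 4]])
y = np.array([200000, 270000, 330000, 410000, 470000])

model = LinearRegression().fit(X, y)

print(model.intercept_)   # β0: predicted price when all features are zero
print(model.coef_)        # β1, β2: expected change in price per one-unit increase
                          # in each feature, holding the other feature constant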

Q16. Explain the role of the p-value in hypothesis testing. How is it interpreted in one-tailed and two-tailed tests?
The p-value is a key concept in hypothesis testing that determines how consistent the observed data are with the null hypothesis.

Role of the p-value in Hypothesis Testing

1. Definition: The p-value measures the probability of obtaining a test result at least as extreme as the one observed, assuming that the null hypothesis (H0) is true.
2. Decision Rule:
o Low p-value (typically ≤ 0.05): suggesting that there is enough
evidence to reject H0.
o High p-value (> 0.05): suggesting that there is not enough
evidence to reject H0.

One-Tailed Test:

 Purpose: Tests if the effect is in one specific direction (either greater than
or less than a certain value).

Example:
Testing if a new drug increases recovery rates more than the existing
drug:

 H0: The new drug is not more effective.


 H1: The new drug is more effective.
 If the p-value is 0.03 you reject H0 and conclude that the new drug is
significantly more effective.

Two-Tailed Test:

 Purpose: Tests if the effect is in either direction (either greater than or less than a certain value).

Example:

Testing if a new drug has a different effect (either more or less effective) than
the existing drug:

 H0: The new drug has the same effect. μ=μ0


 H1: The new drug has a different effect.
 If the p-value is 0.03 you reject H0 and conclude that the new drug's
effect is significantly different from the existing drug.
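
A minimal sketch contrasting one-tailed and two-tailed p-values with a t-test in SciPy (the recovery-time samples are made up for illustration):

from scipy import stats

new_drug = [12, 10, 9, 11, 10, 8]       # illustrative recovery times in days
old_drug = [14, 13, 12, 15, 13, 14]

# Two-tailed: is the effect different in either direction?
t_stat, p_two = stats.ttest_ind(new_drug, old_drug, alternative='two-sided')

# One-tailed: is the new drug's mean recovery time specifically lower?
t_stat, p_one = stats.ttest_ind(new_drug, old_drug, alternative='less')

print(p_two, p_one)   # reject H0 when the p-value falls below the significance level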

Q17. Describe the various stages of the Data Analysis Process (e.g., Data
Collection, Cleaning, Exploration, Analysis, Visualization). Briefly explain the
importance of each stage:
The data analysis process is about extracting meaningful insights from raw data. Its main stages are:

 Data Collection: Sets the stage for analysis by ensuring relevant and comprehensive data.
 Data Cleaning: Ensures the accuracy and reliability of the data.
 Data Exploration: Provides initial insights and guides further analysis (e.g., mean, median, mode, plots, graphs).
 Data Analysis: Extracts meaningful and actionable insights from the data.
 Data Visualization: Communicates findings effectively and aids in decision-making.

Q18. Generate simulated time series data from a specific ARIMA model and analyze its ACF and PACF plots.

ACF (Autocorrelation Function):
 Measures the correlation between the series and its past values.
 Considers every time lag (lag 1, 2, 3, ...).
 Does not remove the effect of shorter lags, so it captures both direct and indirect impacts.
 Does not use partial regression coefficients.

PACF (Partial Autocorrelation Function):
 Measures the correlation between the current value and one past value at a given lag.
 Considers one lag at a time.
 Removes the effect of shorter lags, so it captures only the direct impact.
 Based on partial regression coefficients.
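
The question also asks for a simulation; a minimal sketch with statsmodels, simulating an AR(1) process (an ARIMA(1,0,0) model with an assumed coefficient of 0.7) and plotting its ACF and PACF:

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

np.random.seed(0)

# ARIMA(1,0,0): AR coefficient 0.7, no MA terms (values assumed for illustration)
ar = np.array([1, -0.7])     # statsmodels uses the lag-polynomial sign convention
ma = np.array([1])
series = ArmaProcess(ar, ma).generate_sample(nsample=500)

# For an AR(1) process the ACF should decay gradually,
# while the PACF should cut off after lag 1
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
plot_acf(series, lags=20, ax=axes[0])
plot_pacf(series, lags=20, ax=axes[1])
plt.show()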
