Introduction To Data Analysis
A Beginner's Guide
➢ Types of Data
➢ Data Visualization
➢ Real-World Applications
1. Structured Data
Examples:
• Relational database tables
• Spreadsheets
• CSV files
2. Unstructured Data
Examples:
• Emails
• Social media posts
• Audio/video files
• PDF documents
• Images
3. Semi-Structured Data
• JSON
• XML
• Log files
• Email headers
2.2 Based on Nature / Measurement Scale (Data Types in Statistics)
1. Nominal Data
Examples:
• Gender
• Blood group
• Eye color
2. Ordinal Data
Examples:
• Education level (High School, Bachelor’s, Master’s)
• Customer satisfaction ratings (Poor, Fair, Good,
Excellent)
3. Interval Data
Examples:
• Temperature in Celsius or Fahrenheit
• Calendar dates
4. Ratio Data
Examples:
• Height
• Weight
• Income
• Age
2.3 Based on Source
1. Primary Data
Examples:
• Questionnaire responses
• Sensor data
• Lab experiment results
2. Secondary Data
Examples:
• Government reports
• Published research papers
• Census records
1. Cross-sectional Data
Example:
• A survey of customer incomes collected at one point in time
"Why did customer churn increase last quarter?"
"Which product is performing best among young
customers?"
4.2. Data Collection
Common Tasks:
Techniques Used:
• Regression analysis
• Hypothesis testing
• Machine learning algorithms
• Time series forecasting
4.6 Interpretation of Results
Key Questions:
Tools:
• Tableau, Power BI
• Python libraries (Matplotlib, Seaborn, Plotly)
• Excel, Google Data Studio
Purpose:
2. Removing Duplicates
5. Type Conversion
7. Validating Data
Purpose of EDA
The primary goals of EDA are:
1. Understand the data : Know what each variable
represents and how it's distributed.
2. Identify patterns and trends : Detect underlying
structures or behaviors in the data.
3. Detect anomalies : Find outliers, missing values, or
errors.
4. Test assumptions : Check if the data meets the
requirements for certain statistical models.
5. Formulate hypotheses : Generate questions or
insights for further analysis.
6. Choose appropriate models : Inform which machine
learning or statistical techniques to use.
1. Data Overview
Start by understanding the basic structure of your dataset.
• How many rows and columns?
• What do the variables represent?
• Are there any obvious issues?
Tools:
df.shape, df.head(), df.info() in Python (Pandas)
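These Pandas calls can be tried on any DataFrame; a minimal sketch with a made-up two-column dataset (the column names are purely illustrative):

```python
import pandas as pd

# Made-up dataset for illustration only
df = pd.DataFrame({
    "age": [25, 32, 47],
    "city": ["Pune", "Delhi", "Mumbai"],
})

print(df.shape)   # (rows, columns)
print(df.head())  # first rows of the table
df.info()         # column names, dtypes, non-null counts
```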
2. Summary Statistics
Check:
• Frequency counts
• Unique values
• Mode
3. Missing Values & Duplicates
Check for:
• Missing (NaN, null) values
• Duplicate rows
Techniques:
• Drop or impute missing values
• Remove duplicates
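The drop/impute choices above map directly onto Pandas methods; a minimal sketch (the names and scores are made up):

```python
import numpy as np
import pandas as pd

# Made-up data with a missing score and a duplicate row
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Ben", "Cara"],
    "score": [85.0, np.nan, np.nan, 72.0],
})

filled = df.fillna({"score": df["score"].mean()})  # impute missing scores with the mean
dropped = df.dropna()                              # or simply drop rows containing NaN
deduped = filled.drop_duplicates()                 # remove exact duplicate rows
```

Whether to drop or impute depends on how much data is missing and whether the gaps are random; imputing keeps rows but can bias summaries.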
4. Outlier Detection
Use:
• Cross-tabulations
• Bar charts
• Stacked bar charts
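For numeric columns, a common outlier check (alongside boxplots) is the 1.5 × IQR rule; a minimal sketch using only the standard library, on made-up values:

```python
import statistics

values = [10, 12, 12, 13, 12, 11, 14, 13, 15, 100]  # made-up data with one extreme value

# Quartiles via the standard library (Python 3.8+)
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside [lower, upper] is flagged as a potential outlier
outliers = [v for v in values if v < lower or v > upper]
print(outliers)  # [100]
```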
8. Numerical vs Categorical Relationships
Median
• Middle value when data is sorted.
• Less affected by outliers.
Mode
• Most frequently occurring value.
• Useful for categorical data.
Range
• Difference between maximum and minimum values.
Range = Max − Min
Variance
• Measures how far each number in the set is from the
mean.
Variance (σ²) = Σ(xᵢ − μ)² / n
Standard Deviation
• Square root of variance.
• Easier to interpret because it's in the same unit as
data.
Standard Deviation (σ) = √Variance
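All of the measures above (using the population variance with an n divisor, as in the formula) can be checked with the standard library on a small made-up dataset:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values

mean = statistics.mean(data)      # 5
median = statistics.median(data)  # 4.5 (average of the two middle values)
mode = statistics.mode(data)      # 4 (most frequent)
rng = max(data) - min(data)       # 9 - 2 = 7
var = statistics.pvariance(data)  # population variance, divides by n -> 4
std = statistics.pstdev(data)     # square root of variance -> 2.0
```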
4. Data Distributions
Understanding how your data is distributed helps in
choosing the right statistical tests.
Normal Distribution (Gaussian)
• Bell-shaped curve.
• Symmetrical around the mean.
• Mean = Median = Mode.
Many statistical tests assume normality.
Skewness
• Measure of asymmetry in distribution.
• Positive skew : Tail on right side (mean > median)
• Negative skew : Tail on left side (mean < median)
Kurtosis
• Measures the "tailedness" of the distribution.
• Leptokurtic : Heavy tails, sharp peak
• Mesokurtic : Normal kurtosis
• Platykurtic : Light tails, flat peak
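Both skewness and kurtosis are defined from central moments; a minimal sketch computing the usual moment-based formulas directly (libraries such as SciPy provide these, but the arithmetic is simple):

```python
import statistics

def central_moment(data, k):
    m = statistics.mean(data)
    return sum((x - m) ** k for x in data) / len(data)

def skewness(data):
    # Third standardized moment: 0 for symmetric data, > 0 for a right tail
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data):
    # Fourth standardized moment: 3 for a normal distribution (mesokurtic)
    return central_moment(data, 4) / central_moment(data, 2) ** 2

symmetric = [1, 2, 3, 4, 5]       # made-up symmetric data
right_skewed = [1, 1, 1, 2, 10]   # made-up data with a long right tail

print(skewness(symmetric))     # 0.0
print(skewness(right_skewed))  # positive: tail on the right
print(kurtosis(symmetric))     # below 3: platykurtic
```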
5. Hypothesis Testing
Used to make decisions about population parameters
based on sample data.
Steps in Hypothesis Testing:
1. Formulate Hypotheses
• Null Hypothesis (H₀) : No effect or difference.
• Alternative Hypothesis (H₁) : There is an effect
or difference.
2. Choose Significance Level (α)
• Usually 0.05 or 5%.
3. Calculate Test Statistic
• Depends on test type (e.g., t-test, z-test, chi-
square).
4. Determine p-value
• Probability of observing the result if H₀ is true.
5. Make Decision
• If p-value < α → Reject H₀
• If p-value ≥ α → Fail to reject H₀
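The steps above can be sketched with a two-sample t statistic computed by hand (Welch's form). The data and the critical value of roughly 2 are illustrative stand-ins; a real analysis would look up the exact p-value in a t table or compute it with SciPy:

```python
import math
import statistics

group_a = [5.1, 4.9, 5.0, 5.2, 4.8]  # made-up measurements, group A
group_b = [6.0, 6.2, 5.9, 6.1, 6.3]  # made-up measurements, group B

# H0: the two groups have the same mean; H1: the means differ
m1, m2 = statistics.mean(group_a), statistics.mean(group_b)
v1, v2 = statistics.variance(group_a), statistics.variance(group_b)  # sample variances
n1, n2 = len(group_a), len(group_b)

# Welch's t statistic
t = (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

# Rough decision at alpha = 0.05: |t| well beyond ~2 rejects H0 for these sample sizes
reject_h0 = abs(t) > 2.0
print(t, reject_h0)
```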
6. Correlation vs Causation
Correlation
• Measures the strength and direction of a linear
relationship between two variables.
• Ranges from -1 to +1:
• +1 : Perfect positive correlation
• 0 : No correlation
• -1 : Perfect negative correlation
Causation
• Means a change in one variable directly produces a change in the other.
• Correlation alone does not prove causation: a lurking third variable or coincidence can produce correlated data (e.g., ice cream sales and drowning rates both rise in summer because of hot weather, not because one causes the other).
8. Data Visualization
Data Visualization is the graphical representation of
information and data. By using visual elements like
charts, graphs, and maps, you can more easily understand
trends, patterns, and outliers in data.
It plays a critical role in data analysis , helping both
analysts and stakeholders interpret complex datasets
quickly and make informed decisions.
4. Be Accurate
6. Ensure Accessibility
1. Spreadsheet Tools
Perfect for beginners and small-scale analysis.
Microsoft Excel
• Most widely used tool for basic to intermediate data
analysis.
• Features:
• Built-in formulas (SUM, VLOOKUP, IF, etc.)
• Pivot tables for summarizing data
• Charts and graphs
• Conditional formatting
• What-if analysis and macros (VBA)
Google Sheets
• Similar to Excel but cloud-based.
• Great for collaboration and sharing.
Use Cases:
• Budgeting and forecasting
• Sales tracking
• Basic statistical analysis
2. Programming Languages
Used for more complex and scalable data analysis.
Python
• One of the most popular languages in data science
and analytics.
• Libraries:
• Pandas : Data manipulation and analysis
• NumPy : Numerical computing
• Matplotlib & Seaborn : Data visualization
• SciPy : Scientific computing and advanced math
• Scikit-learn : Machine learning
• Statsmodels : Statistical modeling
Why Python?
• Easy to learn syntax
• Large community support
• Works well with AI/ML pipelines
R
• Designed specifically for statistical analysis and
graphics.
• Packages:
• dplyr, tidyr : Data wrangling
• ggplot2 : Advanced visualizations
• caret, randomForest : Machine learning
• shiny : Build interactive dashboards
Why R?
• Excellent for statistical modeling
• Strong academic and research use
1. Descriptive Analysis
What It Is:
Descriptive analysis summarizes what happened by
organizing and presenting data in an understandable
format.
Purpose:
• Understand patterns and trends
• Provide insights into past performance
Common Methods:
• Mean, median, mode
• Standard deviation, variance
• Frequency distributions
• Visualizations (charts, histograms)
Example:
A company uses descriptive analysis to understand
monthly sales trends over the last year.
2. Diagnostic Analysis
What It Is:
Diagnostic analysis focuses on why something happened
by examining patterns and relationships in the data.
Purpose:
• Identify root causes of outcomes
• Detect anomalies or outliers
Common Methods:
• Drill-down, data discovery
• Correlation and regression analysis
• Time-series analysis
Example:
A drop in website traffic is analyzed using diagnostic
analysis to find out if it was due to a technical issue,
change in SEO strategy, or external factors.
3. Predictive Analysis
What It Is:
Predictive analysis uses historical data to forecast future
outcomes using statistical modeling and machine learning
algorithms.
Purpose:
• Anticipate trends
• Make informed decisions based on likely outcomes
Common Methods:
• Regression analysis (linear, logistic)
• Time series forecasting
• Machine learning models (e.g., Random Forest,
Decision Trees)
• Neural networks
Example:
A bank uses predictive analytics to assess the likelihood
of loan defaults based on customer data.
4. Prescriptive Analysis
What It Is:
Prescriptive analysis suggests actions to achieve desired
outcomes. It's the most advanced form of analysis and
often involves optimization and simulation.
Purpose:
• Recommend optimal actions
• Simulate different scenarios
Common Methods:
• Optimization algorithms
• Simulation models
• Decision modeling
• Machine learning with reinforcement learning
Example:
An airline uses prescriptive analysis to determine the best
pricing strategy for maximizing profits.
5. Exploratory Data Analysis (EDA)
What It Is:
EDA involves analyzing datasets to summarize their main
characteristics, often with visual methods.
Purpose:
• Discover patterns, trends, and relationships
• Test assumptions
• Identify outliers
Common Methods:
• Histograms, scatter plots, boxplots
• Correlation matrices
• Summary statistics
Example:
Before building a machine learning model, analysts use
EDA to understand variable distributions and detect
skewness.
6. Regression Analysis
What It Is:
Regression analysis helps understand the relationship
between one dependent variable and one or more
independent variables.
Purpose:
• Predict numerical outcomes
• Understand impact of input variables
Types:
• Linear Regression : One continuous outcome
• Logistic Regression : Binary classification (Yes/No)
• Polynomial Regression : Non-linear relationships
Example:
A real estate company uses linear regression to predict
house prices based on size, location, and number of
bedrooms.
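A one-variable linear regression can be fit with the closed-form least-squares formulas; a minimal sketch with made-up sizes and prices (in practice a library such as scikit-learn's LinearRegression handles multiple variables):

```python
import statistics

size_sqft = [1000, 1500, 2000, 2500]  # made-up house sizes
price = [200.0, 275.0, 350.0, 425.0]  # made-up prices in thousands

# Least squares: slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
mx, my = statistics.mean(size_sqft), statistics.mean(price)
slope = sum((x - mx) * (y - my) for x, y in zip(size_sqft, price)) / \
        sum((x - mx) ** 2 for x in size_sqft)
intercept = my - slope * mx

def predict(x):
    return intercept + slope * x

print(predict(1800))  # estimated price for a 1800 sq ft house
```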
7. Classification Analysis
What It Is:
Classification is a predictive modeling technique where
the output is a category or class label.
Purpose:
• Group data into predefined classes
• Predict categorical outcomes
Common Algorithms:
• Decision Trees
• Naive Bayes
• Support Vector Machines (SVM)
• K-Nearest Neighbors (KNN)
• Neural Networks
Example:
Email filters use classification to determine whether a
message is spam or not.
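The idea behind K-Nearest Neighbors can be shown in a few lines; a minimal 1-NN sketch on made-up two-feature points (real work would use a library classifier such as scikit-learn's KNeighborsClassifier):

```python
import math

# Made-up training data: (feature vector, label)
training = [
    ((1.0, 1.0), "spam"),
    ((1.2, 0.9), "spam"),
    ((6.0, 6.2), "not_spam"),
    ((5.8, 6.1), "not_spam"),
]

def classify(point):
    # Assign the label of the single closest training point (k = 1)
    _, label = min(training, key=lambda item: math.dist(item[0], point))
    return label

print(classify((1.1, 1.0)))  # "spam"
print(classify((6.1, 6.0)))  # "not_spam"
```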
8. Clustering Analysis
What It Is:
Clustering groups similar items together without prior
knowledge of categories (unsupervised learning).
Purpose:
• Segment customers or users
• Discover hidden structures in data
Common Algorithms:
• K-Means Clustering
• Hierarchical Clustering
• DBSCAN (Density-Based Spatial Clustering of
Applications with Noise)
Example:
Retailers use clustering to segment customers based on
purchasing behavior for targeted marketing.
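K-Means alternates between assigning points to the nearest centroid and recomputing each centroid as its cluster's mean; a minimal one-dimensional sketch on made-up spend values (production code would use a library implementation such as scikit-learn's KMeans):

```python
import statistics

def kmeans_1d(points, centroids, iterations=10):
    """Tiny K-Means on 1-D data; `centroids` are the initial guesses."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            clusters[i].append(p)
        # Update step: each centroid moves to the mean of its cluster
        centroids = [statistics.mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

spend = [10, 12, 11, 90, 95, 92]  # made-up monthly spend per customer
centroids, clusters = kmeans_1d(spend, centroids=[0, 100])
print(centroids)  # roughly [11, 92.3]: low spenders vs high spenders
```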
9. Time Series Analysis
What It Is:
Time series analysis deals with data indexed in
chronological order to identify trends, cycles, and
seasonal patterns.
Purpose:
• Forecast future values
• Analyze temporal patterns
Common Methods:
• Moving averages
• Exponential smoothing
• ARIMA (AutoRegressive Integrated Moving
Average)
• Prophet (by Facebook)
Example:
A stock market analyst uses time series analysis to
forecast future stock prices.
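A simple moving average, the first method listed above, smooths a series by averaging a sliding window; a minimal sketch on made-up monthly sales:

```python
def moving_average(series, window):
    # Average of each consecutive slice of length `window`
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

sales = [100, 120, 110, 130, 150, 140]  # made-up monthly figures
print(moving_average(sales, window=3))  # [110.0, 120.0, 130.0, 140.0]
```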
10. Sentiment Analysis
What It Is:
Sentiment analysis determines the emotional tone behind
words — whether the sentiment is positive, negative, or
neutral.
Purpose:
• Monitor brand reputation
• Understand customer opinions
Common Methods:
• Natural Language Processing (NLP)
• Lexicon-based scoring
• Machine learning classifiers
Example:
A hotel chain analyzes guest reviews using sentiment
analysis to improve service quality.
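Lexicon-based scoring, the second method listed above, assigns each word a polarity and sums the scores; a minimal sketch with a tiny made-up lexicon (real systems use large curated lexicons such as VADER):

```python
# Tiny illustrative lexicon; real lexicons contain thousands of scored words
LEXICON = {"great": 1, "clean": 1, "friendly": 1,
           "dirty": -1, "rude": -1, "noisy": -1}

def sentiment(review):
    # Sum the polarity of each known word; unknown words score 0
    score = sum(LEXICON.get(w, 0) for w in review.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("Great clean room but noisy street"))  # "positive" (net score +1)
print(sentiment("Rude staff and dirty lobby"))         # "negative"
```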
11. Hypothesis Testing
What It Is:
Hypothesis testing evaluates two mutually exclusive
statements about a population to determine which one is
supported by the sample data.
Purpose:
• Validate assumptions
• Make statistically sound decisions
Common Tests:
• t-test : Compare means of two groups
• ANOVA : Compare means across more than two
groups
• Chi-Square Test : Analyze categorical data
• Z-test : For large samples
Example:
A pharmaceutical company tests whether a new drug
significantly improves recovery rates compared to a
placebo.
12. Text Mining / NLP
What It Is:
Text mining extracts valuable information from
unstructured text data using natural language processing
techniques.
Purpose:
• Extract keywords, topics, or entities
• Transform text into structured data
Common Tasks:
• Tokenization
• Stemming & Lemmatization
• Named Entity Recognition (NER)
• Topic Modeling (LDA)
Example:
Customer feedback forms are analyzed using text mining
to identify common complaints or suggestions.
13. Data Wrangling / Cleaning
What It Is:
Data wrangling involves transforming raw data into a
clean, usable format.
Purpose:
• Prepare data for analysis
• Improve data quality
Common Tasks:
• Handling missing values
• Removing duplicates
• Converting data types
• Normalizing and standardizing data
Example:
Before running any analysis, a dataset with inconsistent
date formats is cleaned and standardized.
14. Correlation Analysis
What It Is:
Correlation analysis measures the strength and direction
of the relationship between two or more variables.
Purpose:
• Identify variables that move together
• Inform feature selection in modeling
Common Metrics:
• Pearson correlation coefficient
• Spearman rank correlation
• Heatmaps
Example:
An economist studies the correlation between income and
spending habits.
Choosing the Right Technique
Application:
Companies use data analysis to understand customer
behavior, improve marketing strategies, and optimize
operations.
Use Cases:
• Customer Segmentation : Group customers based on
purchasing behavior using clustering.
• Sales Forecasting : Predict future sales using time
series analysis or regression.
• A/B Testing : Compare two versions of a
website/email to see which performs better.
• Churn Prediction : Identify customers likely to leave
using classification models.
Tools/Techniques:
• Python (Pandas, Scikit-learn)
• SQL for querying customer databases
• Tableau / Power BI for dashboards
• Logistic regression, decision trees
Example:
Amazon uses predictive analytics to recommend products
based on browsing history and purchase behavior.
2. Healthcare
Application:
Data analysis helps improve patient care, reduce costs,
and support medical research.
Use Cases:
• Disease Prediction : Predict risk of heart disease or
diabetes using machine learning.
• Drug Discovery : Analyze clinical trial data to
evaluate drug effectiveness.
• Patient Monitoring : Real-time monitoring using IoT
sensors and anomaly detection.
• Hospital Resource Planning : Optimize bed allocation
and staff scheduling.
Tools/Techniques:
• R for statistical modeling
• NLP for analyzing medical notes
• Machine learning algorithms (e.g., Random Forest,
SVM)
• Time series forecasting
Example:
Johns Hopkins University uses data analysis to track
global pandemic spread and model outbreak scenarios.
3. Government & Public Policy
Application:
Governments use data to make informed decisions about
infrastructure, crime, education, and social services.
Use Cases:
• Crime Mapping : Use GIS and spatial analysis to
identify high-crime areas.
• Census Data Analysis : Understand population trends
and allocate resources.
• Policy Evaluation : Measure the impact of new laws
or programs.
• Fraud Detection : Detect anomalies in tax filings or
benefit claims.
Tools/Techniques:
• GIS tools like QGIS or ArcGIS
• SQL for large datasets
• Regression analysis
• Dashboard tools (Power BI)
4. Education
Application:
Educational institutions and EdTech companies analyze
student performance and learning behaviors.
Use Cases:
• Student Performance Prediction : Identify at-risk
students using classification.
• Curriculum Optimization : Use feedback and test
scores to improve course design.
• Personalized Learning : Recommend content based
on learner progress.
• Online Learning Analytics : Track engagement and
completion rates.
Tools/Techniques:
• Python for data cleaning and modeling
• Learning Management Systems (LMS) data
• Clustering and sentiment analysis
5. Retail & E-commerce
Application:
Retailers use data analysis to improve inventory
management, pricing, and customer experience.
Use Cases:
• Inventory Optimization : Predict demand to avoid
stockouts or overstocking.
• Dynamic Pricing : Adjust prices based on demand
and competitor pricing.
• Basket Analysis : Discover frequently bought items
together (market basket analysis).
• Store Layout Optimization : Analyze foot traffic
using heatmaps.
Tools/Techniques:
• Association rule mining (Apriori algorithm)
• Time series forecasting
• Excel and Python
• Heatmap visualization tools
6. Finance & Banking
Application:
Banks and financial institutions rely heavily on data to
manage risks, detect fraud, and offer better services.
Use Cases:
• Credit Risk Assessment : Evaluate loan applicants
using logistic regression.
• Fraud Detection : Detect unusual transactions with
anomaly detection.
• Portfolio Management : Optimize investment
portfolios using Monte Carlo simulation.
• Algorithmic Trading : Use predictive models to
execute trades automatically.
Tools/Techniques:
• Python (NumPy, Pandas, Scikit-learn)
• SQL for transactional data
• Sentiment analysis on news feeds
• Deep learning for fraud detection
7. Technology & Startups
Application:
Startups and tech companies analyze user data to improve
product features and grow their businesses.
Use Cases:
• User Behavior Analysis : Track how users interact
with apps/websites.
• Growth Hacking : Use funnel analysis to improve
conversion rates.
• Product Recommendation Engines : Suggest relevant
products/content.
• Churn Reduction : Predict and prevent user drop-offs.
Tools/Techniques:
• Google Analytics, Mixpanel
• Funnel analysis tools
• Collaborative filtering
• Survival analysis
8. Transportation & Logistics
Application:
Data analysis optimizes routes, improves supply chain
efficiency, and enhances customer satisfaction.
Use Cases:
• Route Optimization : Minimize delivery times using
shortest path algorithms.
• Fleet Management : Monitor vehicle health and
driver behavior.
• Demand Forecasting : Plan capacity based on
historical shipment data.
• Supply Chain Analytics : Track inventory levels and
supplier performance.
Tools/Techniques:
• Geospatial analysis (GPS data)
• Linear programming for optimization
• Machine learning for demand prediction
• Dashboards for KPI tracking
9. Science & Research
Application:
Researchers across disciplines use data analysis to
validate theories, discover patterns, and publish findings.
Use Cases:
• Genomics : Analyze DNA sequences to find genetic
markers for diseases.
• Climate Modeling : Predict climate change effects
using environmental datasets.
• Physics Experiments : Analyze particle collision data
from accelerators.
• Social Science Surveys : Extract insights from survey
responses.
Tools/Techniques:
• R for statistical analysis
• Python (SciPy, NumPy)
• MATLAB for simulations
• Data visualization (ggplot2, Matplotlib)
10. Artificial Intelligence & Machine Learning
Application:
Data analysis is foundational to training AI models that
power everything from chatbots to self-driving cars.
Use Cases:
• Image Recognition : Classify images using
convolutional neural networks.
• Natural Language Processing : Build chatbots and
voice assistants.
• Recommendation Systems : Personalize experiences
on platforms like YouTube or Spotify.
• Autonomous Vehicles : Process sensor data to make
driving decisions.
Tools/Techniques:
• TensorFlow, PyTorch
• Feature engineering
• Supervised and unsupervised learning
• Reinforcement learning