Comp Notes
Informatics is the science of processing data for storage and retrieval. It blends fields like computer
science, information science, and business.
Business Informatics deals with the integration of IT and business processes. It focuses on how
information systems can support business operations and decision-making.
Business Analytics (BA), on the other hand, is the use of statistical tools, technologies, and
quantitative methods to analyze data and make informed business decisions.
Difference:
Business Informatics centers on designing and integrating the information systems themselves, while Business Analytics focuses on analyzing the data those systems produce to support decisions.
Example:
A retail chain may use analytics to:
• Summarize last year's sales by store and product (descriptive)
• Forecast demand for the coming season (predictive)
• Recommend stock levels and pricing (prescriptive)
Analytics has shifted from reactive ("What happened?") to proactive ("What should we do next?").
5. Analytic Foundations
A solid foundation in these areas ensures analytics can be implemented correctly and results
interpreted accurately.
• Cloud computing
Excel/Spreadsheet Technology:
• Easy-to-use interface
Type | Key Question | Typical Tools | Example
Descriptive | What has happened? | Charts, reports, dashboards | Monthly sales reports
These analytics build upon each other—you need good descriptive analytics to move into predictive
and prescriptive.
Types of Data:
Data Sources:
• Internal: ERP, CRM, transaction systems
• External: social media, market research, public datasets
Important aspects:
1. Big Data
Definition:
Big Data refers to extremely large and complex datasets that traditional data processing tools (like
Excel or basic SQL) cannot manage effectively.
It is not just about the size of data, but also about the complexity, speed, and variety of data being
generated.
V | Explanation
Volume | The sheer size of data being generated and stored.
Variety | The different forms data takes: structured, semi-structured, and unstructured.
Velocity | The speed at which data is created and processed, e.g., stock market data updates in milliseconds.
Veracity | The uncertainty and accuracy of data; big data can be messy, incomplete, or contain noise.
Value | The worth or usefulness of the data when processed and analyzed; not all big data is valuable.
Common sources and applications:
• Online transactions
• Fraud detection
• Personalized marketing
• Predictive maintenance
2. Data Reliability
Definition:
Data reliability refers to the consistency and dependability of data over time and across different
systems or processes.
It answers the question: “Can we trust the data to be accurate and consistent every time we use
it?”
Key Aspects:
• Reproducibility: The same process should yield the same results every time.
• System Reliability: Databases and sources are stable, not prone to failure or data loss.
3. Data Validity
Definition:
Data validity is about how accurately data represents the real-world scenario it is supposed to
model. It answers: “Is this data correct and relevant for the intended analysis?”
Types of Validity:
Type | Meaning | Example
Construct Validity | Measures what it's supposed to measure | A performance score actually reflects employee productivity
Criterion Validity | Correlates with external benchmarks | Predicting sales from past trends proves accurate when compared with actual results
Reliability vs. Validity:
Aspect | Reliability | Validity
Example | A sensor gives the same reading every hour | The sensor measures the correct temperature, not humidity
Summary:
• Big Data provides vast opportunities for analysis, but requires special tools to manage its volume, variety, and velocity.
• Data Reliability ensures the data is consistent and dependable across time and systems.
• Data Validity ensures that the data actually reflects the truth and is meaningful for decision-making.
Descriptive Analytics
Definition:
Descriptive analytics summarizes historical data to understand what has happened in the past. It is
the starting point of data analysis and involves organizing, tabulating, and visualizing data.
Purpose:
• To understand past performance and trends
Common techniques:
• Data aggregation
• Data mining
Examples:
• Monthly sales reports
Output:
• Charts, reports, dashboards
Use Case:
A retailer analyzes sales data from last year to identify which products sold the most during festive
seasons.
Predictive Analytics
Definition:
Predictive analytics uses historical data + statistical models + machine learning techniques to
forecast future outcomes.
Purpose:
• To forecast what is likely to happen next
Examples:
• Demand forecasting, credit scoring, churn prediction
Output:
• Probability scores
• Predictive models
Use Case:
An e-commerce company uses purchase history and browsing behavior to predict which products a
customer is likely to buy next.
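A minimal sketch of how such a probability score might be produced, using a logistic regression on invented features and data (none of this is the company's actual model):
```python
# Sketch: toy purchase-propensity model; features and data are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features per customer: [number of past purchases, minutes browsed]
X = np.array([[5, 30], [1, 5], [8, 45], [2, 10], [7, 40], [0, 2]])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = bought the product on next visit

model = LogisticRegression().fit(X, y)

# Probability score for a new customer (4 purchases, 25 minutes browsed)
print("Purchase probability:", model.predict_proba([[4, 25]])[0, 1])
```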
Prescriptive Analytics
Definition:
Prescriptive analytics suggests optimal actions by analyzing data, predictions, and constraints. It goes
beyond forecasting and tells what decisions should be made to achieve desired outcomes.
Purpose:
• To optimize outcomes
Common techniques:
• Simulation models
• Optimization models
Examples:
• Delivery route optimization, resource allocation
Output:
• "What-if" scenarios
• Optimization recommendations
Use Case:
A logistics company uses prescriptive analytics to find the best delivery routes that minimize time
and cost, considering fuel prices and traffic.
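A minimal sketch of the kind of optimization involved, using scipy's linear programming solver with invented costs, capacities, and demand (not any firm's real data):
```python
# Sketch: a tiny cost-minimization LP of the kind used in routing/logistics.
from scipy.optimize import linprog

c = [4, 6]                 # cost per unit shipped on routes 1 and 2 (assumed)
A_ub = [[-1, -1]]          # x1 + x2 >= 10 (total demand), written as -x1 - x2 <= -10
b_ub = [-10]
bounds = [(0, 7), (0, 8)]  # per-route capacity limits (assumed)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print("Optimal loads per route:", res.x)   # -> [7, 3]
print("Minimum total cost:", res.fun)      # -> 46.0
```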
Comparison Table:
Aspect | Descriptive | Predictive | Prescriptive
Data Used | Historical | Historical + new data | Historical + predictions + constraints
Summary:
• Descriptive Analytics shows what has happened.
• Predictive Analytics helps you forecast what will likely happen next.
• Prescriptive Analytics tells you the best course of action to take based on predictions and constraints.
Model assumptions are the conditions or rules under which a model is expected to perform
accurately. If these assumptions are violated, the model's results may become unreliable, biased, or
misleading.
Each type of analytics—descriptive, predictive, and prescriptive—may have its own set of
assumptions, especially in predictive modeling, where statistical techniques are heavily involved.
• Linearity (when applicable): Some models assume a linear relationship between variables.
• No Multicollinearity: Predictor variables should not be highly correlated with each other
(important in regression models).
• Stationarity (for time series): The data should have consistent statistical properties over
time.
Descriptive analytics often involves basic statistical summaries, so assumptions are minimal, but
include:
• Data Accuracy and Completeness: Descriptive statistics only describe what is in the data. If
the data is flawed, the description will be misleading.
Predictive models use statistical methods like regression, classification, or machine learning
algorithms. Common assumptions include:
Assumption | Meaning
Linearity | The relationship between the independent and dependent variables is linear.
Homoscedasticity | The variance of errors is constant across all levels of the independent variable.
• Independence of observations
Most machine learning models (like decision trees, random forests, neural networks) are non-parametric, meaning they make fewer statistical assumptions. However, they still assume the training data is representative of the data the model will encounter in use.
Optimization (prescriptive) models, such as linear programming, commonly assume:
• Deterministic Models: All parameters are known and fixed (e.g., cost, resource availability).
• Divisibility: Decision variables can take fractional values (not always realistic).
• Certainty: All data and parameters are known without error (in basic models).
In real life, stochastic or probabilistic models are used when assumptions of certainty cannot be
met.
Importance of Assumptions
o Test assumptions before applying models (e.g., using residual plots, VIF, normality tests; see the sketch below)
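A minimal sketch of such checks in Python on simulated data; the VIF and p-value cutoffs in the comments are common rules of thumb, not fixed standards:
```python
# Sketch: testing regression assumptions (multicollinearity, normality of residuals).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                         # two simulated predictors
y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(size=200)

Xc = sm.add_constant(X)
model = sm.OLS(y, Xc).fit()

# Multicollinearity: VIF above roughly 5-10 flags highly correlated predictors.
for i in range(1, Xc.shape[1]):
    print(f"VIF predictor {i}: {variance_inflation_factor(Xc, i):.2f}")

# Normality of residuals: Shapiro-Wilk (p < 0.05 suggests non-normality).
print("Shapiro-Wilk p-value:", stats.shapiro(model.resid).pvalue)
# Residual-vs-fitted plots would be the visual counterpart to these tests.
```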
Example of a violated assumption: a predictive model assumes linearity, but the actual relationship is curved. Result: the forecast will be systematically wrong, leading to poor decisions.
Summary:
Uncertainty:
• Refers to situations where future outcomes are unknown or unpredictable due to a lack of
complete information.
• It involves imperfect or missing knowledge about the environment, future events, or model
behavior.
Example: Launching a new product in a market where no historical data exists—it's hard to estimate
customer response or market size.
Risk:
• Refers to situations where the probabilities of different outcomes are known or can be
estimated.
• With risk, you can use tools like probability distributions, expected values, and simulations.
Example: A bank offering loans knows from historical data that there's a 5% risk of default. This is
risk, not uncertainty.
Aspect | Risk | Uncertainty
Example | Stock market volatility with historical data | Regulatory changes with no precedent
a. Data Uncertainty
c. Parameter Uncertainty
• The model parameters (like coefficients or weights) may not be precisely known.
d. Environmental/External Uncertainty
a. Operational Risk
b. Market Risk
c. Credit Risk
d. Strategic Risk
e. Model Risk
a. Scenario Analysis
b. Sensitivity Analysis
c. Monte Carlo Simulation
• Runs thousands of simulations to estimate the range of outcomes and their probabilities (see the sketch after this list).
d. Decision Trees
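A minimal Monte Carlo sketch with invented distributions for demand and price; all figures are illustrative assumptions:
```python
# Sketch: Monte Carlo simulation of monthly profit under uncertain demand and price.
import numpy as np

rng = np.random.default_rng(42)
n_trials = 10_000

demand = rng.normal(loc=1000, scale=100, size=n_trials)  # units sold (assumed)
price = rng.uniform(low=9.0, high=11.0, size=n_trials)   # selling price (assumed)
unit_cost = 6.0                                          # fixed cost per unit

profit = demand * (price - unit_cost)

print(f"Expected profit: {profit.mean():,.0f}")
print(f"5th-95th percentile range: {np.percentile(profit, 5):,.0f}"
      f" to {np.percentile(profit, 95):,.0f}")
print(f"P(profit < 3000): {(profit < 3000).mean():.1%}")
```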
Function | Contribution
Predictive Analytics | Estimates future probabilities to convert uncertainty into measurable risk
7. Real-Life Examples
• Finance: Credit scoring models convert uncertainty of default into a measurable risk.
• Supply Chain: Demand forecasting models help reduce uncertainty in inventory decisions.
• Marketing: Conversion models estimate the risk of losing customers if prices change.
Summary:
• Uncertainty is when we don't know the outcome and can't quantify it.
• Risk is when we don't know the outcome but can estimate the probability of each.
• Business analytics helps convert uncertainty into risk through data, models, and simulations.
• Tools like Monte Carlo simulation, sensitivity analysis, and decision trees are widely used to
manage both.
1. Interpreting Results
2. Making a Decision
3. Implementing the Decision
1. Interpreting Results
Definition:
Once an analytical model is run (descriptive, predictive, or prescriptive), results must be interpreted
to understand what they mean and how reliable they are.
Key Aspects:
o Regression coefficients
o Probability scores
o Clusters or segments
o Optimized solutions
• Contextual Interpretation:
o Ask: “What does this mean for the problem I’m trying to solve?”
o Graphs, heat maps, dashboards to help stakeholders easily grasp the insights
Example:
A model predicts that customers under age 30 have a 75% chance of switching to a competitor.
Interpretation: This segment needs targeted retention efforts.
2. Making a Decision
Definition:
After interpreting the results, the next step is to choose the best course of action based on evidence
and insights.
Decision-Making Strategies:
• Cost-Benefit Analysis:
o Comparing the costs of possible actions with the expected benefits from analytical
predictions.
• Scenario/Sensitivity Analysis:
o Testing different scenarios to see how outcomes change with different inputs.
Factors to Consider:
• Stakeholder impact
Example:
A logistics firm finds from predictive analytics that rescheduling deliveries from evenings to mornings
can reduce costs by 12%. Decision: Implement the new delivery schedule in low-risk areas first.
3. Implementing the Decision
Definition:
This is the process of putting the chosen analytical decision into action within the business
operations.
Steps in Implementation:
o Explain the rationale behind the decision using clear visuals and summaries.
5. Refine or Adjust:
Challenges in Implementation:
• Resistance to change
• Technical constraints
Example:
A bank implements a new credit scoring model. After 3 months, they observe a 20% reduction in
default rate among approved customers—validating the analytics-based decision.
Summary Table:
Conclusion:
This cycle ensures that analytics delivers real business value, not just technical results.
What is Excel?
• Microsoft Excel is a powerful spreadsheet application used widely in business analytics for
organizing, analyzing, and visualizing data.
• It provides a grid of rows and columns to input and manipulate data using formulas,
functions, and charts.
Feature | Use
Charts | Visualize data with bar charts, pie charts, line graphs
Conditional Formatting | Highlight cells based on rules (e.g., highlight values > 100)
Data Filters & Sorts | Filter and sort data for quick insights
What-If Analysis Tools | Includes Goal Seek, Scenario Manager, and Data Tables
What is a Dataset?
• A dataset is a collection of related data.
• Typically organized in a tabular format, with rows as records and columns as variables or fields.
Types of Datasets:
• Structured Data – Tabular format with defined rows and columns (Excel, SQL tables)
• Unstructured Data – No predefined format (text documents, images, videos)
What is a Database?
• A database is an organized collection of data, stored and managed by a database management system (DBMS).
Common DBMS examples:
• Microsoft Access
• MySQL
• SQL Server
• Oracle
• PostgreSQL
What are Range Names?
• A range name is a descriptive label assigned to a cell or range of cells.
• Instead of using cell references like A1:A10, you can name that range Sales2024.
Then you can write:
=SUM(Sales2024)
instead of:
=SUM(A1:A10)
Best Practices:
Summary Table:
Concept | Description
Excel | Used for data entry, analysis, visualization; supports formulas, charts, pivot tables
Databases | Systematic data storage and management systems for large data
Range Names | Labels for cell ranges that simplify and clarify formulas in Excel
• A data query refers to the process of retrieving, extracting, or viewing specific data from a
dataset or table based on certain criteria.
In Excel and other data tools, this is commonly done through tables, sorting, filtering, and query
tools like Power Query.
1. Tables in Excel
What is a Table?
• A table is a structured range in Excel where data is organized in rows and columns with
special features that support easier management and analysis.
• When data is converted into a table format, Excel enables sorting, filtering, dynamic ranges,
and easy formula applications.
• Select the data range → Go to Insert > Table → Check “My table has headers” → Click OK.
Example Table:
2. Sorting Data
Definition:
Sorting arranges rows in ascending or descending order based on the values of one or more columns.
Types of Sorting:
Type | Description
Ascending | Smallest to largest (A→Z, 0→9, oldest→newest)
Descending | Largest to smallest (Z→A, 9→0, newest→oldest)
How to Sort:
• For a single column: click any cell in that column → Data > Sort A to Z (or Z to A)
• For multi-level sorting: Use Data > Sort and add multiple sort levels (e.g., first by Region, then by Price)
Why Sort?
• Helps identify top/bottom values, rankings, or patterns (e.g., highest sales, most frequent
customers).
3. Filtering Data
Definition:
Filtering is the process of displaying only rows that meet specific criteria, while hiding the rest
temporarily.
• Select data → Go to Data > Filter or click the filter icon in a Table header.
Types of Filters:
Filter Type | Description
Color Filter | Filter based on cell or font color (useful with conditional formatting)
Benefits:
Practical Example:
• Filter Region = South and Sales > 5000 to focus on high-performing areas.
• Use structured table references to apply formulas only to the filtered subset.
Tool | Function
Power Query (Get & Transform) | Load, clean, reshape, and combine data from multiple sources
Summary Table:
Task | Purpose | Tools
Querying | Extract and analyze data based on criteria | Power Query, PivotTable, Advanced Filter
Logical functions are used to make decisions within formulas. They return values like TRUE, FALSE, or
user-defined outputs based on conditions.
Example:
=IF(A2>100, "High", "Low") → Returns "High" if the value in A2 is greater than 100, otherwise "Low"
These functions help extract data from tables or cross-reference different datasets.
Function | Purpose
XLOOKUP | Replaces VLOOKUP and HLOOKUP; works both vertically and horizontally (Excel 365)
Example:
=VLOOKUP(101, A2:D10, 3, FALSE) → Looks up ID 101 in the first column of A2:D10 and returns the matching row's 3rd-column value (FALSE = exact match)
3. Template Design
Templates are pre-formatted Excel workbooks designed to be reused for similar tasks like budgeting,
invoicing, or reporting.
Save as Template:
Features:
How to Use:
Data Validation & Drop-Down Lists
Types:
How to Add:
Use Case:
Use a drop-down to choose product name → linked cell updates formulas for price and stock.
6. Pivot Tables
A Pivot Table is a powerful Excel feature for summarizing, analyzing, exploring, and presenting large
datasets.
o Rows – Categories
o Columns – Subcategories
o Values – Numeric fields to summarize (sum, count, average)
o Filters – Restrict which records are included
Use Cases:
• Sales by Region
• Product-wise revenue
Customization Options:
Example:
Show % of Total Sales by Region using Value Field Settings → Show Values As → % of Grand Total.
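For intuition, a minimal pandas sketch of the same summarization on toy data (pandas' pivot_table is the closest programmatic analog to an Excel PivotTable):
```python
# Sketch: "% of total sales by region", mirroring the Excel PivotTable above.
import pandas as pd

df = pd.DataFrame({
    "Region": ["North", "South", "North", "East", "South"],
    "Sales":  [100, 250, 150, 200, 300],
})

pivot = df.pivot_table(values="Sales", index="Region", aggfunc="sum")
pivot["% of Total"] = (pivot["Sales"] / pivot["Sales"].sum() * 100).round(1)
print(pivot)
```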
8. Slicers
Slicers are on-sheet buttons that filter PivotTables (and Tables) visually and interactively.
Features:
How to Insert:
1. Click on the PivotTable
2. Go to PivotTable Analyze > Insert Slicer
3. Select the field(s) to slice by and click OK
Advantages:
Summary Table:
Topic | Description
1. Data Visualization
• Definition:
Data visualization is the graphical representation of information and data using visual
elements like charts, graphs, maps, and dashboards.
• Purpose:
It helps to communicate data insights clearly and effectively, making complex data easier to
understand and interpret.
• Key Elements:
o Visual formats: Bar charts, line graphs, scatter plots, pie charts, heat maps,
dashboards, geographic maps.
Benefit | Explanation
Improved Understanding | Visuals make data patterns, trends, and outliers easier to identify.
Error Detection | Spot anomalies or errors that may be hidden in raw data.
Tool | Strengths | Notes
Google Data Studio | Cloud-based, easy sharing & collaboration | Free and user-friendly
QlikView / Qlik Sense | Self-service BI with associative data indexing | Good for large datasets
D3.js | JavaScript library for custom web visualizations | Requires coding skills
4. Creating Charts
Chart Type | Purpose | Example
Line Chart | Show trends over time | Stock prices over months
• PivotCharts:
Dynamic charts linked directly to PivotTables that update automatically when the PivotTable
changes.
• Benefits:
• How to create:
1. Create a PivotTable
2. Select it → Insert > PivotChart
3. Choose a chart type and click OK
• Definition:
Visualizing data that has a geographic or spatial component, often on maps.
• Tools:
o Tableau maps
o Power BI maps
• Use Cases:
o Customer locations
Tool | Strengths | Best Suited For
Excel | Easy, familiar, good for small-medium data | Business analysts, students
QlikView / Qlik Sense | Fast data processing, associative modeling | Large enterprises
Summary Table
Topic | Key Points
Tools & Software | Excel, Tableau, Power BI, Google Data Studio, etc.
Creating Charts | Bar, line, pie, scatter, histogram; customized for clarity
Data Visualization Tools | Wide range for different needs and expertise
Descriptive Statistics
Descriptive statistics summarize and describe the main features of a dataset. It provides simple
summaries about the sample and the measures.
Types of Data:
Type | Description | Examples
Quantitative (Numerical) | Represents numeric values | Age, Salary, Sales Amount
Qualitative (Categorical) | Represents categories or labels | Gender, Region, Product Type
Measurement Scales:
Descriptive Metrics
2. Frequency Distributions
A frequency distribution is a summary that shows the number of occurrences of each unique value
or category in a dataset.
Purpose:
• To show how often each value or category occurs, making common and rare outcomes easy to spot.
Example:
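A minimal pandas sketch of building a frequency distribution from toy survey responses:
```python
# Sketch: frequency distribution of toy survey responses.
import pandas as pd

responses = pd.Series(
    ["Agree", "Agree", "Neutral", "Disagree", "Agree", "Neutral", "Agree"]
)
freq = responses.value_counts()                     # counts per category
pct = responses.value_counts(normalize=True) * 100  # share per category

print(pd.DataFrame({"Count": freq, "Percent": pct.round(1)}))
```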
Percentiles
• The nth percentile is the value below which n% of the data falls.
• Example: The 90th percentile means 90% of data points are below this value.
Use in business:
Understanding customer spending patterns — e.g., top 10% spenders (90th percentile).
Quartiles
• Quartiles divide sorted data into four equal parts: Q1 (25th percentile), Q2 (the median), and Q3 (75th percentile).
Use in business:
Identify sales distribution or employee performance spread, excluding outliers.
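A minimal NumPy sketch of computing percentiles and quartiles on toy customer-spend data:
```python
# Sketch: percentiles and quartiles of customer spend (toy data).
import numpy as np

spend = np.array([20, 35, 40, 55, 60, 75, 90, 120, 150, 400])

p90 = np.percentile(spend, 90)                  # threshold for top-10% spenders
q1, q2, q3 = np.percentile(spend, [25, 50, 75])

print(f"90th percentile: {p90}")
print(f"Q1={q1}, Q2 (median)={q2}, Q3={q3}, IQR={q3 - q1}")
```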
Cross-tabulation example:
• Helps identify patterns, such as which product sells best in which region.
Measures of Central Tendency:
Measure | Definition | How It Is Found
Median | Middle value when data is sorted | Middle data point, or average of the two middle values
Measures of Dispersion:
• Segment Customers:
Use percentiles/quartiles to group customers (e.g., high, medium, low spenders).
• Risk Management:
Detect variability in sales or production, enabling better forecasting.
• Performance Evaluation:
Analyze employee performance using central tendency and dispersion.
• Product Strategy:
Cross tabulations show which products perform better in specific regions or demographics.
• Quality Control:
Monitor consistency in processes using statistical measures.
Summary Table
Metric | Describes | Business Use
Dispersion Measures | Data variability | Risk, quality, and performance analysis
Measures of Dispersion
Measures of dispersion tell us how much the data values vary or spread out around the central
value (like the mean). This helps understand the consistency, risk, and reliability of data.
1. Range
• Definition:
The difference between the maximum and minimum values in the dataset.
• Formula:
Range = Maximum value − Minimum value
• Example:
If sales figures range from 50 to 200 units, Range = 200 − 50 = 150 units.
• Limitations:
Only depends on two values (max & min), so it can be affected by outliers.
2. Variance
• Definition:
Measures the average squared deviation of each data point from the mean. It shows how
data points spread around the mean.
• Formula (Population Variance):
\( \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \)
where \( N \) = number of data points, \( x_i \) = each value, and \( \mu \) = population mean.
• Interpretation:
Higher variance means more spread out data.
3. Standard Deviation
• Definition:
The square root of variance, giving spread in the original units of data.
• Formula:
\( \sigma = \sqrt{\sigma^2} \)
• Importance:
More interpretable than variance since it’s in the same unit as the data.
• Example:
If the average sales are 100 units with standard deviation 15, typical sales deviate about 15
units from the average.
4. Interquartile Range (IQR)
• Definition:
Measures the spread of the middle 50% of data.
• Formula:
\( \text{IQR} = Q_3 - Q_1 \)
where \( Q_1 \) and \( Q_3 \) are the first and third quartiles.
5. Mean Absolute Deviation (MAD)
• Definition:
Average of the absolute differences between each data point and the mean.
• Formula:
\( \text{MAD} = \frac{1}{N} \sum_{i=1}^{N} |x_i - \mu| \)
• Interpretation:
Indicates average deviation from mean without squaring deviations.
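A minimal NumPy sketch computing all the dispersion measures above on toy sales data:
```python
# Sketch: range, variance, standard deviation, IQR, and MAD on toy data.
import numpy as np

sales = np.array([80, 95, 100, 105, 110, 120, 150])

data_range = sales.max() - sales.min()       # Range
var_pop = sales.var()                        # Population variance (ddof=0)
std_pop = sales.std()                        # Population standard deviation
q1, q3 = np.percentile(sales, [25, 75])
iqr = q3 - q1                                # Interquartile range
mad = np.abs(sales - sales.mean()).mean()    # Mean absolute deviation

print(data_range, round(var_pop, 1), round(std_pop, 1), iqr, round(mad, 1))
```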
Business Applications:
• Risk Assessment:
High dispersion in sales or production indicates uncertainty or volatility.
• Quality Control:
Low variation in product measurements means better consistency.
• Customer Behavior:
Variability in customer purchase amounts can help segment customers.
• Decision Making:
Helps managers understand if average values are reliable or if data is too spread out.
Summary Table
Measure | What It Measures | Interpretation | Notes
Mean Absolute Deviation | Average absolute deviations | Average deviation from the mean | Less sensitive to extreme values
Empirical Rule (68-95-99.7 Rule)
The Empirical Rule applies to data that follows a normal distribution (bell-shaped curve). It
describes how data values are spread around the mean using standard deviations.
Explanation:
A bell curve centered at the mean (μ), with shaded regions showing:
• μ ± 1σ covers about 68% of the data
• μ ± 2σ covers about 95% of the data
• μ ± 3σ covers about 99.7% of the data
• Quality Control:
Products within ±3σ limits are considered acceptable; outliers may indicate defects.
• Risk Management:
Understanding variability and extremes in financial returns.
• Customer Behavior:
Identifying typical vs. extreme customer spending.
• Forecasting:
Setting realistic expectations on outcomes and variations.
Limitations:
• Applies only to normally distributed data or data that roughly approximates normal
distribution.
Example:
Assume average monthly sales = 1000 units, standard deviation = 100 units.
Interval | Interpretation | Sales Range
±1σ (≈68%) | Most months' sales fall here | 900 to 1100 units
±2σ (≈95%) | Almost all months fall here | 800 to 1200 units
±3σ (≈99.7%) | Nearly all months fall here | 700 to 1300 units
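A minimal sketch verifying the rule on simulated normal sales data with the same mean and standard deviation:
```python
# Sketch: checking the 68-95-99.7 rule on simulated normal data.
import numpy as np

rng = np.random.default_rng(1)
sales = rng.normal(loc=1000, scale=100, size=100_000)

for k in (1, 2, 3):
    within = np.mean(np.abs(sales - 1000) <= k * 100)
    print(f"Within ±{k}σ: {within:.1%}")   # ~68.3%, ~95.4%, ~99.7%
```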
Measures of Association
Measures of association quantify the strength and direction of a relationship between two or more
variables. They help understand how variables move together or influence each other.
a) Pearson Correlation Coefficient
• Purpose: Measures strength and direction of the linear relationship between two continuous variables.
• Range: −1 to +1
• Formula:
\( r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2 \sum_{i}(y_i - \bar{y})^2}} \)
b) Covariance
• Interpretation: Positive covariance means the variables move in the same direction; negative means they move in opposite directions. Its magnitude is scale-dependent, which is why correlation is easier to compare across datasets.
d) Cramér’s V
• High positive correlation: Variables increase together (e.g., marketing budget and sales).
• High negative correlation: One variable increases as other decreases (e.g., price and
demand).
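A minimal NumPy sketch of both measures on toy marketing-vs-sales data (the numbers are invented for illustration):
```python
# Sketch: covariance and Pearson correlation for two toy variables.
import numpy as np

marketing = np.array([10, 12, 15, 18, 20])    # budget, in thousands (assumed)
sales = np.array([100, 115, 140, 160, 180])   # units sold (assumed)

print("Covariance:", np.cov(marketing, sales, ddof=1)[0, 1])
print("Correlation:", np.corrcoef(marketing, sales)[0, 1])  # close to +1
```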
Summary Table
Descriptive statistics help summarize and organize survey data so you can easily understand patterns,
trends, and key insights from respondents' answers.
• Simplify complex data: Surveys often collect large amounts of data; descriptive stats reduce
it to manageable summaries.
• Prepare data for further analysis: Like hypothesis testing or predictive modeling.
a) Measures of Central Tendency
• Mean: Average response (good for numerical scale data like ratings 1-5).
• Median: Middle response (useful for ordinal data, like ranked satisfaction).
• Mode: Most frequent response (useful for categorical or nominal data, like favorite brand).
b) Measures of Dispersion
• Range: Difference between highest and lowest responses.
• Variance and Standard Deviation: Spread of numerical responses around the mean.
c) Frequency Distributions
Scale Type | Appropriate Statistics | Examples
Interval/Ratio | Mean, median, mode, standard deviation, range | Age, income, rating scores
• Bar Charts & Pie Charts: For categorical data to show proportions.
Statistical thinking is a mindset that emphasizes understanding and managing variability through
data to make informed business decisions. It helps managers and analysts interpret data correctly,
avoid misleading conclusions, and improve processes.
Key Concepts:
• Focus on processes: Analyze and improve the system producing the data.
Importance in Business:
• Forecasting & Planning: Use statistical models to predict future trends reliably.
2. Variability in Samples
Variability refers to how data points in a sample differ from each other and from the population
parameters.
Types of Variability:
• Natural variability: Intrinsic differences in data (e.g., daily sales fluctuate).
• Sample variance and standard deviation: Show spread within sample data.
• Standard error: Measures variability of a sample statistic (e.g., the sample mean) around the population parameter.
• Formula: \( SE = \frac{s}{\sqrt{n}} \)
where \( s \) = sample standard deviation and \( n \) = sample size.
• Recognize that sample statistics (mean, proportion) are estimates with uncertainty.
• Use confidence intervals and hypothesis tests to make decisions acknowledging variability.
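A minimal sketch of the standard error and a t-based 95% confidence interval on toy sample data:
```python
# Sketch: standard error and a 95% confidence interval for a sample mean.
import numpy as np
from scipy import stats

sample = np.array([98, 102, 95, 110, 105, 99, 101, 97, 104, 100])
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error of the mean
low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=se)

print(f"Mean = {mean:.1f}, SE = {se:.2f}, 95% CI = ({low:.1f}, {high:.1f})")
```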
Summary
Concept | Description | Business Benefit
Statistical Thinking | Data-driven, focus on variability and process improvement | Better quality, risk management, forecasting
Variability in Samples | Differences due to sampling, sample size, and population diversity | Understand uncertainty, improve sampling design