Ds unit 1 notes

Uploaded by

Priyanka Patil
Copyright: © All Rights Reserved

Unit-1
Introduction to Data Science
Data in Data Science

1. Raw Information:
o Basic, unprocessed information collected from various sources.
2. Collection and Storage:
o Data is gathered from multiple sources and stored for processing.
3. Sources of Data:
o Sensors: Devices collecting environmental or operational data.
o Social Media: Posts, interactions, and user-generated content.
o Transactions: Financial and business transaction records.
o Other Digital Means: Various online and offline data generation activities.
4. Formats of Data:
o Structured Data: Organized in a specific format, like databases (e.g., SQL
tables).
o Unstructured Data: Not organized in a predefined manner, like text, images,
and videos.

Science in Data Science

1. Systematic Study:
o This means carefully investigating something step-by-step using scientific
methods.
2. Scientific Methods and Principles:
o Data science uses proven ways that scientists follow to explore and understand
data.
3. Components of Scientific Investigation:
o Forming Hypotheses:
▪ Making educated guesses or predictions based on what you initially
observe.
o Conducting Experiments:
▪ Setting up and carrying out tests to see if your hypotheses are correct.
o Analyzing Data:
▪ Using tools like statistics and computer programs to look at the results
from your experiments.
o Drawing Conclusions:
▪ Making decisions or conclusions based on what you learned from the
data analysis.

Data Science Explained in Detail

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms,
and systems to extract knowledge and insights from structured and unstructured data. Here
are its key components:

1. Data Collection
Prepared by Mrs. Shah S. S. (CSE-AI & ML)

Definition: Gathering data from various sources.

Examples:

• Databases: Information stored in structured formats like SQL databases.


• Web Scraping: Extracting data from websites using tools like Beautiful Soup or
Scrapy.
• Sensors: Collecting data from IoT devices, like temperature sensors or fitness
trackers.
• Digital Means: Gathering data from social media, transaction logs, or mobile apps.

Illustration: Imagine collecting sales data from an e-commerce website, user reviews from
social media, and product usage data from IoT devices.

2. Data Processing and Cleaning

Definition: Transforming raw data into a usable format by handling missing values, outliers,
and inconsistencies.

Examples:

• Handling Missing Values: Filling in missing values using methods like mean
imputation or predicting them using machine learning models.
• Outlier Detection: Identifying and dealing with outliers that can skew analysis
results.
• Data Transformation: Converting data into the required format, such as normalizing
numerical data or encoding categorical variables.

Illustration: Cleaning a dataset of customer reviews by removing duplicates, handling
missing ratings, and standardizing the text format.
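
The cleaning steps in the illustration above can be sketched in a few lines of Python. This is only a minimal sketch: the review records, the field names (id, rating, text), and the choice of mean imputation are assumptions made for the example, not part of the original notes.

```python
from statistics import mean

# Hypothetical raw reviews: one duplicate, one missing rating, messy text.
raw_reviews = [
    {"id": 1, "rating": 5, "text": "  Great product!! "},
    {"id": 2, "rating": None, "text": "okay, nothing SPECIAL"},
    {"id": 1, "rating": 5, "text": "  Great product!! "},
    {"id": 3, "rating": 2, "text": "Stopped working after a week"},
]

def clean_reviews(reviews):
    # Step 1: remove duplicates, keeping the first record seen for each id.
    seen, unique = set(), []
    for review in reviews:
        if review["id"] not in seen:
            seen.add(review["id"])
            unique.append(dict(review))
    # Step 2: mean imputation, using the mean of the known ratings.
    known = [r["rating"] for r in unique if r["rating"] is not None]
    fill_value = mean(known)
    for r in unique:
        if r["rating"] is None:
            r["rating"] = fill_value
        # Step 3: standardize text (trim, collapse whitespace, lowercase).
        r["text"] = " ".join(r["text"].split()).lower()
    return unique

cleaned = clean_reviews(raw_reviews)
for row in cleaned:
    print(row)
```

Mean imputation is the simplest option; for skewed ratings the median may be a safer fill value.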

3. Data Analysis

Definition: Applying statistical methods and exploratory data analysis (EDA) to understand
data patterns, trends, and relationships.

Examples:

• Statistical Methods: Using mean, median, mode, standard deviation, and correlation.
• EDA Techniques: Visualizing data distributions with histograms, exploring
relationships with scatter plots, and summarizing data with box plots.

Illustration: Analyzing sales data to find seasonal trends, peak sales periods, and the
correlation between marketing spend and sales.
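
As a rough sketch of the statistical methods listed above, the following Python snippet computes basic summary statistics and the Pearson correlation between marketing spend and sales. The numbers are made up for illustration, and the helper name pearson is our own.

```python
from math import sqrt
from statistics import mean, median, mode, stdev

# Hypothetical monthly figures (in thousands).
marketing_spend = [10, 20, 30, 40, 50]
sales = [12, 25, 29, 41, 52]
ratings = [4, 5, 3, 5, 4, 5]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print("mean sales:", mean(sales))
print("median sales:", median(sales))
print("std dev of sales:", stdev(sales))
print("most common rating:", mode(ratings))
r = pearson(marketing_spend, sales)
print("correlation spend vs sales:", round(r, 3))
```

A correlation close to 1 suggests that sales rise with marketing spend, but it does not by itself prove that the spend causes the sales.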

4. Data Visualization

Definition: Creating visual representations of data through charts, graphs, and dashboards to
communicate findings effectively.

Examples:

• Charts and Graphs: Bar charts, line graphs, pie charts, and scatter plots.
• Dashboards: Interactive dashboards using tools like Tableau, Power BI, or Google
Data Studio (now called Looker Studio).

Illustration: Creating a dashboard to show the monthly sales trends, top-selling products, and
customer demographics for an e-commerce platform.

5. Machine Learning and Modeling

Definition: Building and training algorithms to make predictions or classify data. This
includes supervised learning, unsupervised learning, and reinforcement learning.

Examples:

• Supervised Learning: Algorithms like linear regression, decision trees, and neural
networks used for predicting house prices or classifying emails as spam.
• Unsupervised Learning: Algorithms like K-means clustering or principal component
analysis (PCA) used for customer segmentation or dimensionality reduction.
• Reinforcement Learning: Algorithms like Q-learning or deep Q-networks (DQNs)
used for training AI in games or robotics.

Illustration: Developing a recommendation system for a streaming service that suggests
movies based on user preferences and viewing history.
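
To give a feel for unsupervised learning, here is a minimal sketch of K-means clustering on one-dimensional data, such as grouping customers by monthly spend. The data, the starting centers, and the function name kmeans_1d are assumptions for illustration; real projects would normally use a library such as Scikit-learn.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small K-means for 1-D data with user-supplied starting centers."""
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical monthly spend of six customers.
spend = [5, 7, 6, 48, 52, 50]
final_centers, groups = kmeans_1d(spend, centers=[5, 52])
print(final_centers)
print(groups)
```

The two resulting groups can be read as "low spenders" and "high spenders", which is the idea behind customer segmentation.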

6. Interpretation and Communication

Definition: Interpreting the results and translating technical findings into actionable insights
for stakeholders.

Examples:

• Reports: Summarizing findings in clear and concise reports.


• Presentations: Using slides to explain data insights and recommendations.
• Stakeholder Meetings: Discussing the implications of data findings and deciding on
action plans.

Illustration: Presenting the results of a customer satisfaction survey to the marketing team,
highlighting areas for improvement and suggesting strategies.

7. Deployment and Monitoring

Definition: Implementing models into production systems and monitoring their performance
over time to ensure they provide accurate and reliable results.

Examples:

• Model Deployment: Using tools like Flask, Docker, or cloud services to deploy
machine learning models.
• Performance Monitoring: Tracking model performance using metrics like accuracy,
precision, recall, and F1 score.

Illustration: Deploying a fraud detection model in a banking system and continuously
monitoring its accuracy and false positive rate to ensure it effectively identifies fraudulent
transactions.
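
The monitoring metrics named above can be computed directly from predicted and actual labels. The sketch below assumes binary labels (1 = fraud, 0 = legitimate) and uses invented sample data; the function name classification_metrics is our own.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and false positive rate for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarm
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }

# Hypothetical labels for eight transactions.
actual    = [1, 0, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 1, 0, 0, 0, 0]
metrics = classification_metrics(actual, predicted)
print(metrics)
```

For fraud detection, accuracy alone can be misleading because fraud is rare, so precision, recall, and the false positive rate are usually tracked alongside it.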

Summary

Data science combines statistics, mathematics, and computer science with domain
expertise to analyze and interpret complex data sets, enabling organizations
to make data-driven decisions. Each component plays a crucial role in turning raw data into
valuable insights and actionable strategies.

Data Science Venn Diagram


Drew Conway's popular Venn diagram presents data science as the intersection of three
areas: substantive expertise, hacking skills, and math & statistics knowledge.

How to Read the Data Science Venn Diagram (Simplified)

1. Three Key Skills:


o Hacking Skills: Knowing how to handle and manipulate data.
o Math and Statistics Knowledge: Understanding and applying mathematical
methods and statistical techniques.
o Substantive Expertise: Having in-depth knowledge of a specific field or
industry.
2. Skill Combinations:
o Hacking Skills + Math and Statistics: You can work with data and apply
statistical methods. Conway labels this overlap "Machine Learning". It is useful,
but without domain knowledge it is not yet data science.
o Math and Statistics + Substantive Expertise: You can analyze data using
statistical methods and understand the field's context. Conway labels this overlap
"Traditional Research". Without hacking skills, you might struggle with data
handling and manipulation.
o Hacking Skills + Substantive Expertise: You can collect and structure data
from a specific field but might lack the ability to interpret the data correctly
using math and statistics. Conway labels this overlap the "Danger Zone" (see point 3).
3. Danger Zone:
o Hacking Skills + Substantive Expertise (without Math and Statistics): This
is risky because you might be able to extract and organize data, but without
understanding the math and statistics, you might misinterpret results or
produce misleading analysis. This can lead to incorrect conclusions or "lies,
damned lies, and statistics."
4. Why All Three Are Important:
o Hacking Skills: Essential for collecting and manipulating data.
o Math and Statistics: Crucial for analyzing data and deriving meaningful
insights.
o Substantive Expertise: Helps to understand the context and relevance of the
data.
5. Interdisciplinary Nature:
o Combining all three skills (hacking, math/statistics, and domain knowledge)
provides a well-rounded approach to data science.
o Each skill alone is valuable, but combining them is key to effective data
science work.

1. Coding/Hacking Skills in Data Science

1. What is Coding in Data Science?


o Coding is the first essential skill in data science. It is needed to collect and
prepare data.
2. Data Collection:

o Data is everywhere, for example on websites and in apps.
o Sometimes, this data is in different formats or needs to be extracted from
various places.
o Coding helps us gather this data.
3. Data Preparation:
o After collecting the data, coding is used to clean and organize it so it can be
used for analysis.
4. Building Prediction Models:
o Coding is also used to create models that can predict future trends based on the
data.
o These models help data scientists find insights and make predictions.
5. Tools and Languages:
o Scikit-learn: A popular Python library for building prediction models.
o Python and R: The most common programming languages in data science.
o SQL: A query language used for managing and querying databases.
6. Why Are Coding Skills Important?
o Coding skills are essential for manipulating and gathering data.
o They are also crucial for building prediction models to gain insights.

Summary

Coding is a key skill in data science. It helps in collecting, cleaning, and preparing data, as
well as building prediction models using libraries like Scikit-learn, programming languages
like Python and R, and SQL for querying databases.

2. Mathematical and Statistical Skills in Data Science

1. Basic Foundation:
o Math is the basic foundation for all tech-related fields.

o Statistics, which is a branch of math, is crucial for data processing.
2. Role in Data Science:
o Statistics helps us understand data through concepts like probability,
distribution, and regression.
o These concepts are useful for getting insights from data before making
prediction models.
3. Insight and Analysis:
o Math and statistics help us understand the type of data we have.
o They are important for data pre-processing and feature engineering.
4. Problem Identification:
o Math and statistical knowledge help in identifying problems in the data.
o They are essential for analyzing data and preparing it for further use.
5. Importance in Data Science:
o Math and statistics are crucial for understanding and working with data.
o They provide the tools needed to analyze data and make data-driven decisions.

Summary

Mathematics and statistics are essential for data science. They provide the foundation for
understanding and processing data, helping to gain insights, identify problems, and prepare
data for prediction models.

3. Domain Knowledge or Substantive Expertise in Data Science

1. What is Domain Knowledge?


o Domain knowledge is the understanding of a specific field or industry.
o It means knowing a lot about a particular area, like healthcare, finance, or
marketing.
2. Why is Domain Knowledge Important?

o It helps make sure that data science models and insights are relevant and
accurate for that field.
o Without domain knowledge, you might misinterpret data or make incorrect
conclusions.
3. What Should You Know About the Domain?
o Goals: Understand the main objectives of the field.
o Methods: Be familiar with the common techniques and practices used.
o Constraints: Know any limitations or special rules of the field.
4. How Does It Help in Data Science?
o Ensures that the data science models you build fit well with the specific field.
o Helps in making accurate predictions and useful insights that can be
practically applied.
5. Value of Domain Knowledge:
o It makes the implementation of data science solutions more effective and
accurate.
o It is crucial for achieving meaningful and practical results from data science
projects.

Summary

Domain knowledge is about understanding a specific field, which is crucial for data science.
It helps ensure that data science models and insights are accurate and relevant for that
particular area, preventing mistakes and making the implementation of solutions more
effective.

Key Data Science Terminologies and Their Connections

1. Artificial Intelligence (AI):

o Definition: AI involves creating systems that can perform tasks requiring human-like
intelligence, such as learning, reasoning, and problem-solving.
o Example: Google’s AI-powered search engine can understand and respond to
complex queries.
o Connection: AI is a broad field that encompasses various techniques used in data
science to create intelligent systems and make data-driven decisions.

2. Data Science:

o Definition: An interdisciplinary field that extracts knowledge and insights from
structured and unstructured data using scientific methods, processes, and algorithms.
o Example: Analyzing customer behavior data to recommend products.
o Connection: Data science is the central discipline that applies various techniques and
tools, including AI, machine learning, and statistics, to solve real-world problems
using data.
3. Data Mining:

o Definition: The process of discovering patterns and knowledge from large datasets.
o Example: Mining retail transaction data to find patterns in customer purchasing
behavior.

o Connection: Data mining techniques are used in data science to extract meaningful
insights from large volumes of data.
4. Machine Learning (ML):

o Definition: A subset of AI where systems learn from data to improve their
performance over time without being explicitly programmed.
o Example: A recommendation system that learns user preferences and suggests new
products.
o Connection: ML is a key technique within data science used to build models that can
make predictions or classifications based on data.
5. Algorithm:

o Definition: A set of rules or steps for solving a specific problem or performing a task.
o Example: A sorting algorithm like quicksort that organizes data in a specific order.

o Connection: Algorithms are fundamental to data science as they are used to process
data, build models, and analyze results.
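
Since quicksort is given as the example algorithm, here is a minimal sketch of it in Python. This version builds new lists rather than sorting in place, which is easier to read but uses more memory; it is one common way to write quicksort, not the only one.

```python
def quicksort(items):
    """Return a new sorted list using the quicksort idea (divide and conquer)."""
    if len(items) <= 1:
        return list(items)  # a list of 0 or 1 items is already sorted
    pivot = items[len(items) // 2]
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    # Sort the two sides recursively and join them around the pivot values.
    return quicksort(smaller) + equal + quicksort(larger)

print(quicksort([5, 3, 8, 1, 9, 2, 5]))
```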
6. Big Data:

o Definition: Extremely large and complex datasets that traditional data processing
tools cannot handle efficiently.
o Example: Social media platforms processing petabytes of user-generated content.
o Connection: Data science often deals with big data to uncover insights and patterns
that would be difficult to detect with smaller datasets.
7. Business Intelligence (BI):

o Definition: Tools and techniques for analyzing business data to support decision-
making.
o Example: A BI dashboard displaying key performance indicators like sales revenue
and customer satisfaction scores.
o Connection: BI uses data science techniques to analyze business data and provide
actionable insights.
8. Computer Cluster:

o Definition: A group of interconnected computers that work together to perform tasks
more efficiently than a single machine.
o Example: A cluster used to process large-scale data analytics tasks in parallel.
o Connection: Data science projects that involve large datasets often use computer
clusters for efficient data processing and analysis.
9. Data Engineering:

o Definition: The practice of designing and building systems for collecting, storing,
and processing data.
o Example: Creating ETL (Extract, Transform, Load) pipelines to integrate data from
multiple sources into a data warehouse.
o Connection: Data engineering is crucial for preparing and managing data before it is
analyzed by data scientists.
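
An ETL pipeline can be sketched with Python's built-in csv and sqlite3 modules. In this minimal example the CSV text, the table name orders, and the cleaning rules are all assumptions chosen for illustration; an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with messy names and one missing amount.
RAW_CSV = "order_id,customer,amount\n1, alice ,19.99\n2,BOB,5.00\n3,alice,\n"

def extract(text):
    """Extract: read CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows without an amount and standardize names."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue  # skip incomplete records
        cleaned.append((int(row["order_id"]),
                        row["customer"].strip().lower(),
                        float(row["amount"])))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders "
                 "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer").fetchall()
print(totals)
```

Keeping extract, transform, and load as separate functions makes each stage easy to test and replace on its own.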
10. Deep Learning:

o Definition: A subset of machine learning that uses neural networks with many layers
to analyze complex patterns in data.
o Example: Image recognition systems that can identify objects in photos.
o Connection: Deep learning is used in data science for tasks that involve complex data
such as images and speech.

11. Predictive Analytics:

o Definition: Using historical data and statistical algorithms to forecast future events or
behaviors.
o Example: Predicting customer churn based on past behavior and usage patterns.
o Connection: Predictive analytics is a key application of data science techniques to
anticipate future trends and outcomes.
12. Statistics:

o Definition: The study of data collection, analysis, interpretation, and presentation.


o Example: Calculating the average salary of employees in a company.
o Connection: Statistics provides the foundation for data analysis and interpretation in
data science.
13. Classification:

o Definition: A machine learning task that assigns data points to predefined categories.
o Example: Classifying emails as spam or not spam.
o Connection: Classification is a common technique in data science used for
categorizing data into different classes.
14. Descriptive Analytics:

o Definition: Analyzing historical data to understand what has happened in the past.
o Example: Generating reports on past sales performance to identify trends.
o Connection: Descriptive analytics is used to summarize and understand past data,
providing a basis for further analysis.
15. Regression:

o Definition: A statistical method for predicting a continuous outcome based on one or
more predictors.
o Example: Predicting house prices based on features like size and location.
o Connection: Regression analysis is a technique used in data science to model and
predict continuous outcomes.
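
A minimal sketch of simple linear regression (one predictor, fitted by ordinary least squares) is shown below. The house sizes and prices are invented, and the function name fit_line is our own.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: house size (square metres) and price (thousands).
sizes = [50, 80, 100, 120, 150]
prices = [145, 220, 270, 320, 395]

slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 110 + intercept  # prediction for a 110 m^2 house
print(slope, intercept, predicted_price)
```

Real house prices depend on many features (such as location), which calls for multiple regression; the single-predictor case only shows the basic idea.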

16. Bias:

o Definition: Systematic error or unfairness in data or algorithms that can lead to
inaccurate or prejudiced results.
o Example: A hiring algorithm that favors candidates from a specific demographic
group.
o Connection: Data scientists must be aware of and address bias to ensure fair and
accurate results.
17. Data and Information Visualization:

o Definition: Creating graphical representations of data to help users understand and
interpret information.
o Example: Using charts and graphs to display sales data trends.
o Connection: Visualization is a crucial part of data science for communicating
insights and findings effectively.
18. Bayesian Inference:

o Definition: A statistical method for updating the probability of a hypothesis as more
evidence or information becomes available.
o Example: Adjusting the probability of a disease diagnosis based on new test results.
o Connection: Bayesian inference is used in data science for probabilistic modeling
and decision-making.
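
The disease-diagnosis example can be worked through with Bayes' theorem. In this sketch the prevalence (1%), sensitivity (95%), and specificity (90%) are assumed values chosen only for illustration, and the second update assumes the two test results are independent given the patient's true condition.

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) using Bayes' theorem."""
    p_pos_if_sick = sensitivity
    p_pos_if_healthy = 1 - specificity
    # Total probability of seeing a positive test result.
    evidence = p_pos_if_sick * prior + p_pos_if_healthy * (1 - prior)
    return p_pos_if_sick * prior / evidence

# Assumed values: 1% prevalence, 95% sensitivity, 90% specificity.
after_first_test = posterior(0.01, 0.95, 0.90)
# The first posterior becomes the prior for a second positive test.
after_second_test = posterior(after_first_test, 0.95, 0.90)
print(round(after_first_test, 3), round(after_second_test, 3))
```

Even with a fairly accurate test, one positive result leaves the probability of disease below 10% here because the disease is rare; each further positive result raises it.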
19. Architecture:
o Definition: The design and structure of data systems and processes, including
hardware and software.
o Example: Designing a data warehouse architecture for efficient data storage and
retrieval.
o Connection: Data science relies on robust architecture to support data storage,
processing, and analysis.
20. Data Warehouse:

o Definition: A centralized repository that stores data from multiple sources for
analysis and reporting.
o Example: A data warehouse that integrates sales, finance, and customer data for
business intelligence.
o Connection: Data warehouses are used in data science to consolidate and analyze
large volumes of data.
21. Database:
o Definition: An organized collection of data that is easily accessible, managed, and
updated.
o Example: A relational database that stores customer contact details and order history.
o Connection: Databases are essential for storing and retrieving data used in data
science projects.
22. Diagnostic Analytics:
o Definition: Analyzing data to understand why something happened by examining
causes and relationships.
o Example: Investigating a drop in sales by analyzing customer feedback and sales
data.
o Connection: Diagnostic analytics helps data scientists understand the reasons behind
observed trends and anomalies.
23. Data Governance:
o Definition: The management of data availability, usability, integrity, and security
within an organization.
o Example: Establishing policies for data access and quality control to ensure
compliance and accuracy.
o Connection: Data governance is crucial for maintaining the quality and security of
data used in data science.
24. Data:
o Definition: Raw facts and figures collected from various sources that can be analyzed
to gain insights.
o Example: Transaction records from an e-commerce site showing customer purchases.
o Connection: Data is the foundation of data science; all analyses and insights are
derived from it.

Data Science Case Studies: Summaries and Explanations

1. Case Study: Netflix Recommendations
Summary: Netflix uses data science to provide personalized content recommendations to its users.
By analyzing viewing history, user ratings, and search behaviors, Netflix's algorithm suggests movies
and TV shows that users are likely to enjoy.

Key Points:

• Data Collection: Netflix collects data on viewing habits, search queries, and user ratings.
• Data Analysis: Uses collaborative filtering and content-based filtering algorithms to analyze
user preferences and predict what they might like.
• Outcome: Improved user engagement and satisfaction by delivering personalized
recommendations, leading to higher retention rates and subscription growth.

How Data Science Helps:

• Personalization: Enhances user experience by tailoring content to individual preferences.


• Increased Engagement: Drives higher user interaction with the platform.
• Data-Driven Decisions: Helps Netflix in content acquisition and production strategies based
on user interests.
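
Collaborative filtering, mentioned under Data Analysis above, can be illustrated with a toy user-based version. This is a simplified teaching sketch with invented users, titles, and ratings; it is not Netflix's actual system, and the similarity is computed only over titles both users have rated.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ana":  {"Film A": 5, "Film B": 4, "Film C": 1},
    "ben":  {"Film A": 4, "Film B": 5, "Film D": 5},
    "cara": {"Film A": 1, "Film C": 5, "Film D": 2},
}

def cosine(u, v):
    """Cosine similarity over the titles both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def recommend(user, all_ratings):
    """Predict ratings for unseen titles as a similarity-weighted average."""
    scores, weights = {}, {}
    for other, theirs in all_ratings.items():
        if other == user:
            continue
        sim = cosine(all_ratings[user], theirs)
        for title, rating in theirs.items():
            if title not in all_ratings[user]:
                scores[title] = scores.get(title, 0.0) + sim * rating
                weights[title] = weights.get(title, 0.0) + sim
    predicted = {t: scores[t] / weights[t] for t in scores if weights[t] > 0}
    return sorted(predicted.items(), key=lambda kv: kv[1], reverse=True)

suggestions = recommend("ana", ratings)
print(suggestions)
```

Because ana's tastes are closer to ben's than to cara's, ben's high rating of the unseen title counts for more in the prediction.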

2. Case Study: Amazon Fraud Detection


Summary: Amazon employs data science to detect and prevent fraudulent activities on its platform.
By analyzing transaction data and user behavior, Amazon's fraud detection system identifies
potentially fraudulent transactions and accounts.

Key Points:

• Data Collection: Collects data on transaction history, user behavior, and device information.
• Data Analysis: Uses machine learning models to detect anomalies and patterns indicative of
fraud.
• Outcome: Reduced fraudulent transactions and enhanced security measures, leading to
increased trust and reduced financial losses.

How Data Science Helps:

• Fraud Detection: Identifies suspicious activities and prevents financial losses.


• Enhanced Security: Strengthens the overall security of the platform.
• Customer Trust: Maintains customer confidence by protecting against fraud.


A batch fraud-scoring pipeline built on AWS services can be read in five stages:

• CSV File: This is the starting point where data is stored in a CSV (Comma-Separated
Values) format.
• Amazon S3 Input Bucket: The CSV file is uploaded to an Amazon S3 bucket, which is a
storage service provided by AWS.
• AWS Lambda: This service retrieves the file contents from the S3 bucket. AWS Lambda
allows you to run code without provisioning or managing servers.
• Amazon Fraud Detector: The data retrieved by AWS Lambda is then processed by
Amazon Fraud Detector, which analyzes the data to detect potential fraudulent activities.
• Amazon S3 Output Bucket: Finally, the results (scored events) are written to another S3
bucket for storage.

3. Case Study: Uber Surge Pricing


Summary: Uber uses data science to implement surge pricing during high-demand periods. By
analyzing data on ride requests, weather conditions, and local events, Uber adjusts prices to balance
supply and demand.

Key Points:

• Data Collection: Gathers data on ride requests, location, time, and demand patterns.
• Data Analysis: Applies predictive modeling and real-time analysis to adjust pricing
dynamically.
• Outcome: Optimizes driver availability and balances supply with demand, ensuring efficient
service and maximizing revenue.

How Data Science Helps:

• Dynamic Pricing: Adjusts prices based on real-time data to manage demand and supply.
• Service Efficiency: Ensures that more drivers are available during peak times.
• Revenue Optimization: Maximizes earnings for Uber and drivers during high-demand
periods.

4. Case Study: Healthcare Predictive Analytics


Summary: Hospitals and healthcare providers use data science to predict patient outcomes and
improve treatment plans. By analyzing patient data, medical records, and historical health trends,
predictive models help in early diagnosis and personalized treatment.

Key Points:

• Data Collection: Collects data from patient records, lab results, and medical history.
• Data Analysis: Uses machine learning models to predict health risks and patient outcomes.
• Outcome: Enhances patient care by providing early warnings and personalized treatment
plans, leading to better health outcomes.

How Data Science Helps:

• Early Diagnosis: Identifies potential health issues before they become severe.
• Personalized Treatment: Tailors treatment plans based on individual patient data.
• Improved Outcomes: Leads to better patient health and reduced hospital readmissions.

5. Case Study: Target's Customer Purchase Prediction


Summary: Target used data science to predict customer purchasing behavior and target promotions
effectively. By analyzing purchase history and customer demographics, Target's algorithms identified
buying patterns and preferences.

Key Points:

• Data Collection: Gathers data on customer purchases, demographics, and browsing behavior.
• Data Analysis: Applies predictive analytics to identify trends and target marketing efforts.

• Outcome: Improved marketing strategies by personalizing promotions and increasing sales
through targeted offers.

How Data Science Helps:

• Targeted Marketing: Delivers personalized promotions to customers based on their
purchasing behavior.
• Increased Sales: Boosts sales through effective marketing strategies.
• Customer Insights: Provides valuable insights into customer preferences and behavior.

Types of Data in Data Science and Their Usefulness


1. Structured Data
o Definition: Data that is organized into rows and columns, often stored in databases or
spreadsheets. It is highly organized and easily searchable.
o Examples: Customer names, transaction amounts, dates.
o How It Helps:
▪ Easy Analysis: Structured data can be easily queried using SQL or analyzed
with statistical tools.
▪ Predictive Modeling: Ideal for building predictive models as it is well-
organized and standardized.
▪ Efficient Processing: Can be quickly processed and analyzed due to its
organized format.
2. Unstructured Data
o Definition: Data that does not have a predefined structure or format. It includes a
wide range of formats and requires more advanced techniques for analysis.
o Examples: Text documents, social media posts, video content.
o How It Helps:
▪ Rich Insights: Provides in-depth insights into customer sentiments, trends,
and behaviors.
▪ Advanced Analysis: Requires natural language processing (NLP) or
computer vision for analysis, helping uncover valuable information not
available in structured data.
▪ Contextual Understanding: Offers additional context that structured data
may not capture.
3. Semi-Structured Data
o Definition: Data that does not fit neatly into a table but still has some organizational
properties, such as tags or metadata.
o Examples: JSON files, XML data, email headers.
o How It Helps:
▪ Flexibility: Easier to parse and process compared to unstructured data,
allowing for more straightforward analysis.
▪ Integration: Often used for data interchange between systems, facilitating
data integration and analysis.
▪ Enhanced Analysis: Provides a middle ground between structured and
unstructured data, making it easier to extract insights.
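
To show why semi-structured data is easier to process than unstructured data, the sketch below parses a JSON document and flattens it into table-like rows. The order record and its field names are invented for the example.

```python
import json

# Hypothetical semi-structured record: nested fields and a list of items.
payload = """
{
  "order_id": 42,
  "customer": {"name": "Asha", "city": "Pune"},
  "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]
}
"""

def flatten_order(text):
    """Turn one nested order into flat rows, one row per item."""
    order = json.loads(text)
    return [
        {
            "order_id": order["order_id"],
            "customer_name": order["customer"]["name"],
            "city": order["customer"]["city"],
            "sku": item["sku"],
            "qty": item["qty"],
        }
        for item in order["items"]
    ]

rows = flatten_order(payload)
for row in rows:
    print(row)
```

The keys in the JSON act as the "tags" mentioned in the definition: they tell the program where each value lives, so no text mining is needed.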
4. Quantitative Data
o Definition: Data that is numerical and can be measured or counted. It provides the
ability to perform mathematical operations.
o Examples: Sales revenue, temperature readings, number of products sold.
o How It Helps:
▪ Numerical Analysis: Allows for statistical analysis, trend identification, and
quantitative modeling.

▪ Predictive Modeling: Essential for creating regression models and
forecasting future trends.
▪ Data Aggregation: Facilitates aggregation and summarization of data for
better insights.
5. Qualitative Data
o Definition: Non-numerical data that describes attributes or qualities. It is often
descriptive and categorical.
o Examples: Customer feedback, product reviews, interview transcripts.
o How It Helps:
▪ Contextual Insights: Provides deeper understanding of customer
experiences, opinions, and preferences.
▪ Complementary Analysis: Enhances quantitative analysis by adding context
and detail.
▪ Thematic Analysis: Useful for identifying themes and patterns in textual or
verbal data.
6. Categorical Data
o Definition: Data that represents distinct categories or groups, often used to classify
data into specific types.
o Examples: Gender, product types, customer segments.
o How It Helps:
▪ Segmentation: Useful for grouping and segmenting data, allowing for
targeted analysis and reporting.
▪ Pattern Recognition: Helps in identifying patterns and relationships between
different categories.
▪ Classification Tasks: Essential for building classification models that predict
category membership.
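
Most models need numbers as input, so categorical data is usually encoded before modeling. A minimal sketch of one-hot encoding follows; the customer segments are invented and the function name one_hot is our own.

```python
def one_hot(values):
    """Encode each category as a 0/1 vector with one position per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

# Hypothetical customer segments.
segments = ["gold", "silver", "gold", "bronze"]
categories, encoded = one_hot(segments)
print(categories)
for segment, vector in zip(segments, encoded):
    print(segment, vector)
```

One-hot encoding avoids implying an order between categories, which is what would happen if they were simply numbered 1, 2, 3.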
7. Time Series Data
o Definition: Data collected at successive points in time, often used to analyze trends
over time.
o Examples: Stock prices, weather data, website traffic logs.
o How It Helps:
▪ Trend Analysis: Allows for the analysis of trends, seasonal patterns, and
cyclical behaviors.
▪ Forecasting: Essential for predicting future values based on historical data.
▪ Anomaly Detection: Helps in identifying unusual patterns or deviations from
expected trends.
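
Two of the uses above, trend analysis and anomaly detection, can be sketched with a moving average and a simple z-score rule. The traffic numbers and the threshold of 2 standard deviations are assumptions for illustration.

```python
from statistics import mean, stdev

def moving_average(series, window):
    """Average of each run of `window` consecutive values (smooths the trend)."""
    return [mean(series[i - window + 1:i + 1])
            for i in range(window - 1, len(series))]

def zscore_anomalies(series, threshold=2.0):
    """Indexes of points more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(series), stdev(series)
    if sigma == 0:
        return []
    return [i for i, x in enumerate(series) if abs(x - mu) / sigma > threshold]

# Hypothetical daily website visits with one unusual spike.
traffic = [100, 102, 98, 101, 99, 180, 100, 103]
smoothed = moving_average(traffic, window=3)
anomalies = zscore_anomalies(traffic)
print(smoothed)
print(anomalies)
```

The z-score rule is a rough first check; series with strong seasonality usually need methods that account for the seasonal pattern.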
8. Spatial Data
o Definition: Data related to geographic locations and spatial relationships, often used
in mapping and location analysis.
o Examples: GPS coordinates, land use data, geographic boundaries.
o How It Helps:
▪ Geographic Analysis: Provides insights into geographic distributions and
patterns.
▪ Mapping: Useful for creating visual maps and spatial visualizations to
understand data location.
▪ Location-Based Insights: Helps in making location-based decisions and
analysis.
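
A common first step with GPS coordinates is computing the distance between two points. The sketch below uses the haversine formula, which treats the Earth as a sphere with a mean radius of about 6371 km; that is an approximation, and the function name haversine_km is our own.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean radius; the Earth is not a perfect sphere

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = radians(lat2 - lat1)
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# One degree of longitude along the equator.
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
print(round(one_degree, 2))
```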
9. Transactional Data
o Definition: Data related to transactions or interactions, often capturing details about
purchases, sales, or other business activities.
o Examples: Purchase receipts, online transactions, financial records.
o How It Helps:
▪ Behavior Analysis: Provides insights into transactional behaviors and
spending patterns.

▪ Financial Analysis: Useful for understanding revenue streams, profit
margins, and financial performance.
▪ Fraud Detection: Helps in identifying unusual or suspicious transactions that
may indicate fraud.
10. Big Data
o Definition: Extremely large datasets that cannot be easily managed or analyzed using
traditional tools. They often require distributed computing systems.
o Examples: Social media data, sensor data from IoT devices, large-scale customer
data.
o How It Helps:
▪ Scalability: Enables analysis of vast amounts of data that traditional systems
cannot handle.
▪ Advanced Analytics: Facilitates the use of big data technologies like
Hadoop and Spark for complex analyses.
▪ Comprehensive Insights: Provides a more detailed and nuanced view of
data for more accurate insights.
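
The trend-analysis and anomaly-detection uses listed under Time Series Data can be sketched in a few lines. This is only an illustrative sketch: the visit counts are made-up numbers, the helper names rolling_mean and flag_anomalies are hypothetical, and the "more than 2 standard deviations from the mean" rule is one simple convention among many.

```python
from statistics import mean, stdev

def rolling_mean(values, window):
    """Mean of each consecutive window; smooths noise to expose the trend."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]

def flag_anomalies(values, threshold=2.0):
    """Indices of values more than `threshold` sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > threshold * sigma]

# Hypothetical daily website visits (time series data, one value per day).
daily_visits = [100, 102, 98, 101, 99, 250, 100, 103]

trend = rolling_mean(daily_visits, window=3)
anomalies = flag_anomalies(daily_visits)
print("3-day rolling mean:", trend)
print("Anomalous day indices:", anomalies)
```

On real data a forecasting or anomaly model would normally account for seasonality as well; the fixed threshold here is kept deliberately simple.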

The four levels of data: Nominal level, Ordinal level, Interval level, and Ratio level
Nominal Scale

• Definition: The nominal scale categorizes data into distinct groups with no inherent order.
• Characteristics:
o Mutually Exclusive: Each observation fits into only one category.
o Exhaustive: All possible categories are included.
o No Mathematical Operations: Arithmetic operations such as addition, subtraction,
and finding averages are not applicable; only counting and the mode are meaningful.
• Examples:
o Types of Cars: Categories like sedan, SUV, and truck.
o Marital Status: Categories such as single, married, and divorced.
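
As a small illustration of what can be done with nominal data, the sketch below counts categories and finds the mode. The list of responses is made up for the example.

```python
from collections import Counter

# Hypothetical survey responses (nominal data: labels with no order).
marital_status = ["single", "married", "single", "divorced", "married", "single"]

# Counting is the basic operation available at the nominal level.
counts = Counter(marital_status)

# The mode is the most frequent category.
mode_category, mode_count = counts.most_common(1)[0]

# Proportions describe the distribution across categories.
proportions = {category: n / len(marital_status) for category, n in counts.items()}

print(counts)
print("Mode:", mode_category, "with", mode_count, "responses")
print(proportions)
```

Note that nothing here adds or averages the labels themselves; only their frequencies are used.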

Ordinal Scale

• Definition: The ordinal scale arranges data into ordered categories, but the intervals between
categories are not necessarily equal or known.
• Characteristics:
o Meaningful Order: Categories are ranked (e.g., low, medium, high).
o Non-Uniform Intervals: The difference between ranks is not uniform or
quantifiable.
o Limited Arithmetic: Arithmetic operations are not meaningful, but non-arithmetic
measures like median and mode are applicable.
• Examples:
o Educational Level: Categories such as high school diploma, bachelor’s degree, and
master’s degree.
o Likert Scale Responses: Responses like strongly disagree, disagree, neutral, agree,
strongly agree.
o Net Promoter Score (NPS): Ratings from 0 to 10 on likelihood to recommend a
product or service.
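
A minimal sketch of working with ordinal data, assuming a five-point Likert scale and made-up responses: each label is mapped to its rank so that the responses can be sorted and a median found, without treating the gaps between ranks as equal.

```python
from statistics import median_low

# The order of the labels is the only information the ordinal scale provides.
LIKERT_ORDER = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
RANK = {label: position for position, label in enumerate(LIKERT_ORDER)}

# Hypothetical survey responses.
responses = ["agree", "neutral", "strongly agree", "disagree", "agree"]

# Sorting by rank is meaningful because the categories are ordered.
sorted_responses = sorted(responses, key=RANK.get)

# median_low always returns one of the observed ranks, so it maps back to a label.
ranks = [RANK[r] for r in responses]
median_response = LIKERT_ORDER[median_low(ranks)]

print(sorted_responses)
print("Median response:", median_response)
```

The ranks are used only for ordering; averaging them would assume equal spacing between categories, which the ordinal scale does not guarantee.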

Interval Scale

• Definition: The interval scale is numerical and has equal intervals between values, but it does
not have a true zero point.
• Characteristics:
o Equal Intervals: The difference between values is consistent (e.g., the gap between
10°C and 20°C is the same as between 20°C and 30°C).
o Arbitrary Zero: Zero does not signify the absence of the attribute (e.g., 0°C does not
mean ‘no temperature’).
o Arithmetic Operations: Addition and subtraction are meaningful, but multiplication
and division are not (e.g., 20°C is not ‘twice as hot’ as 10°C).
• Examples:
o Temperature: Measurements in Celsius or Fahrenheit (e.g., 0°C, 10°C, 20°C).
o Dates: Specific dates like January 1st, February 1st, March 1st.
o IQ Scores: Numerical values such as 80, 100, 120.
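
The following sketch illustrates why differences are meaningful on an interval scale while ratios are not. It converts the same temperatures from Celsius to Fahrenheit: equal gaps stay equal after conversion, but the ratio between two readings changes with the unit. The helper name c_to_f is chosen for this example.

```python
def c_to_f(celsius):
    """Convert a Celsius temperature to Fahrenheit."""
    return celsius * 9 / 5 + 32

temps_c = [10.0, 20.0, 30.0]
temps_f = [c_to_f(t) for t in temps_c]

# Differences: equal gaps in Celsius remain equal gaps in Fahrenheit.
gaps_c = [temps_c[1] - temps_c[0], temps_c[2] - temps_c[1]]
gaps_f = [temps_f[1] - temps_f[0], temps_f[2] - temps_f[1]]

# Ratios: the result depends on the unit, so it carries no real meaning.
ratio_c = temps_c[1] / temps_c[0]
ratio_f = temps_f[1] / temps_f[0]

print("Gaps in Celsius:", gaps_c)
print("Gaps in Fahrenheit:", gaps_f)
print("Ratio in Celsius:", ratio_c)
print("Ratio in Fahrenheit:", ratio_f)
```

Because the zero point is arbitrary, shifting it (as the Fahrenheit conversion does) changes every ratio while leaving the equality of the gaps intact.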

Ratio Scale

• Definition: The ratio scale has all the properties of the interval scale, but it also includes a
true zero point.
• Characteristics:
o True Zero: Zero indicates the complete absence of the attribute being measured (e.g.,
0 cm means no height).
o All Arithmetic Operations: Addition, subtraction, multiplication, and division are
all meaningful.
• Examples:
o Height: Measurements in centimeters or inches (e.g., 150 cm, 170 cm, 190 cm).
o Weight: Measurements in kilograms or pounds (e.g., 50 kg, 70 kg, 90 kg).
o Time: Time taken to complete tasks (e.g., 0 seconds, 30 seconds, 60 seconds).
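
By contrast with the interval scale, a ratio on the ratio scale does not depend on the unit of measurement. The sketch below checks this for two weights taken from the examples above; the conversion factor is an approximate value assumed for the illustration.

```python
import math

KG_TO_LB = 2.20462  # approximate pounds per kilogram (assumed for this example)

weight_a_kg, weight_b_kg = 50.0, 70.0
weight_a_lb = weight_a_kg * KG_TO_LB
weight_b_lb = weight_b_kg * KG_TO_LB

# With a true zero, "b is 1.4 times as heavy as a" holds in any unit.
ratio_kg = weight_b_kg / weight_a_kg
ratio_lb = weight_b_lb / weight_a_lb

print("Ratio in kilograms:", ratio_kg)
print("Ratio in pounds:", ratio_lb)
print("Same ratio in both units:", math.isclose(ratio_kg, ratio_lb))
```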

Importance in Data Science

Understanding these levels of measurement is crucial for selecting the right statistical techniques and
correctly interpreting data:

• Nominal Data: Useful for classification and counting. Helps in understanding distribution
and categorical comparisons.
• Ordinal Data: Important for ranking and order-based analysis. Helps in understanding
preferences and levels of agreement.
• Interval Data: Useful for comparing differences and trends over time. Enables detailed
statistical analysis like mean and variance.
• Ratio Data: Allows for comprehensive analysis involving all arithmetic operations. Essential
for precise measurements and ratio comparisons.

These levels help data scientists choose appropriate methods for data analysis, ensuring accurate and
meaningful insights.
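
One way to picture how the level of measurement guides the choice of method is a small helper that computes only the summary statistics appropriate to each level. This is an illustrative sketch: the function name summarize and the PERMITTED table are hypothetical, and ordinal values are assumed to be already encoded as numeric ranks.

```python
from statistics import mean, median, mode

# Summary statistics considered meaningful at each level of measurement.
PERMITTED = {
    "nominal": {"mode"},
    "ordinal": {"mode", "median"},
    "interval": {"mode", "median", "mean"},
    "ratio": {"mode", "median", "mean"},
}

def summarize(values, level):
    """Return only the summary statistics that suit the given level."""
    allowed = PERMITTED[level]
    summary = {"mode": mode(values)}  # the mode is valid at every level
    if "median" in allowed:
        summary["median"] = median(values)
    if "mean" in allowed:
        summary["mean"] = mean(values)
    return summary

print(summarize(["sedan", "suv", "sedan"], "nominal"))
print(summarize([1, 2, 2, 3], "ordinal"))       # ranks, e.g. low=1, medium=2, high=3
print(summarize([10, 20, 30], "interval"))      # e.g. temperatures in °C
print(summarize([150, 170, 190], "ratio"))      # e.g. heights in cm
```

The table treats interval and ratio data alike for these three statistics; the difference between them shows up in operations such as ratios, which are meaningful only at the ratio level.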
