Ds unit 1 notes
Unit-1
Introduction to Data Science
Data in Data Science
1. Raw Information:
o Basic, unprocessed information collected from various sources.
2. Collection and Storage:
o Data is gathered from multiple sources and stored for processing.
3. Sources of Data:
o Sensors: Devices collecting environmental or operational data.
o Social Media: Posts, interactions, and user-generated content.
o Transactions: Financial and business transaction records.
o Other Digital Means: Various online and offline data generation activities.
4. Formats of Data:
o Structured Data: Organized according to a predefined schema, such as rows and
columns in a database (e.g., SQL tables).
o Unstructured Data: Not organized in a predefined manner, like text, images,
and videos.
1. Systematic Study:
o This means carefully investigating something step-by-step using scientific
methods.
2. Scientific Methods and Principles:
o Data science uses proven ways that scientists follow to explore and understand
data.
3. Components of Scientific Investigation:
o Forming Hypotheses:
▪ Making educated guesses or predictions based on what you initially
observe.
o Conducting Experiments:
▪ Setting up and carrying out tests to see if your hypotheses are correct.
o Analyzing Data:
▪ Using tools like statistics and computer programs to look at the results
from your experiments.
o Drawing Conclusions:
▪ Making decisions or drawing conclusions based on what you learned
from the data analysis.
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms,
and systems to extract knowledge and insights from structured and unstructured data. Here
are its key components:
1. Data Collection
Prepared by Mrs. Shah S. S. (CSE-AI & ML)
Examples:
Illustration: Imagine collecting sales data from an e-commerce website, user reviews from
social media, and product usage data from IoT devices.
2. Data Cleaning and Preprocessing
Definition: Transforming raw data into a usable format by handling missing values, outliers,
and inconsistencies.
Examples:
• Handling Missing Values: Filling in missing values using methods like mean
imputation or predicting them using machine learning models.
• Outlier Detection: Identifying and dealing with outliers that can skew analysis
results.
• Data Transformation: Converting data into the required format, such as normalizing
numerical data or encoding categorical variables.
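The three cleaning steps above can be sketched in a few lines of standard-library Python. This is only an illustration: the sample values, the 1.5 × IQR outlier rule, and the helper names `min_max` and `one_hot` are assumptions made for this sketch, not part of the notes.

```python
from statistics import mean, quantiles

# Hypothetical raw records: None marks a missing value.
ages = [25, 31, None, 42, 29, None, 38]

# Mean imputation: replace each missing value with the mean of the observed ones.
observed = [a for a in ages if a is not None]
fill_value = mean(observed)
ages_imputed = [a if a is not None else fill_value for a in ages]

# Outlier detection with the 1.5 * IQR rule (one common convention among several).
order_values = [120, 135, 128, 131, 5000, 125, 140]
q1, _, q3 = quantiles(order_values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in order_values if v < lower or v > upper]

def min_max(values):
    """Normalize numbers to the range [0, 1]; assumes not all values are equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(labels):
    """Encode categorical labels as 0/1 vectors, one column per sorted category."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]

print(ages_imputed)
print(outliers)
print(min_max([10, 20, 30]))
print(one_hot(["red", "blue", "red"]))
```

In practice these steps are usually done with a data-frame library, but the logic is the same.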
3. Data Analysis
Definition: Applying statistical methods and exploratory data analysis (EDA) to understand
data patterns, trends, and relationships.
Examples:
• Statistical Methods: Using mean, median, mode, standard deviation, and correlation.
• EDA Techniques: Visualizing data distributions with histograms, exploring
relationships with scatter plots, and summarizing data with box plots.
Illustration: Analyzing sales data to find seasonal trends, peak sales periods, and the
correlation between marketing spend and sales.
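As a rough sketch of the statistical methods listed above, the snippet below computes summary statistics and a Pearson correlation between marketing spend and sales. The numbers are made up for illustration, and `pearson` is a helper written for this example.

```python
from math import sqrt
from statistics import mean, median, mode, stdev

# Hypothetical monthly figures (arbitrary units).
marketing_spend = [10, 20, 30, 40, 50]
sales = [25, 41, 62, 79, 103]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mx) ** 2 for x in xs))
    spread_y = sqrt(sum((y - my) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

print("mean sales:", mean(sales))
print("median sales:", median(sales))
print("sample standard deviation:", stdev(sales))
print("mode of [3, 5, 5, 7]:", mode([3, 5, 5, 7]))
print("correlation:", pearson(marketing_spend, sales))
```

A coefficient close to 1 suggests that sales rise with marketing spend in this sample; it does not by itself show that the spend causes the sales.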
4. Data Visualization
Definition: Creating visual representations of data through charts, graphs, and dashboards to
communicate findings effectively.
Examples:
• Charts and Graphs: Bar charts, line graphs, pie charts, and scatter plots.
• Dashboards: Interactive dashboards using tools like Tableau, Power BI, or Google
Data Studio (now Looker Studio).
Illustration: Creating a dashboard to show the monthly sales trends, top-selling products, and
customer demographics for an e-commerce platform.
5. Machine Learning
Definition: Building and training algorithms to make predictions or classify data. This
includes supervised learning, unsupervised learning, and reinforcement learning.
Examples:
• Supervised Learning: Algorithms like linear regression, decision trees, and neural
networks used for predicting house prices or classifying emails as spam.
• Unsupervised Learning: Algorithms like K-means clustering or principal component
analysis (PCA) used for customer segmentation or dimensionality reduction.
• Reinforcement Learning: Algorithms like Q-learning or deep Q-networks (DQNs)
used for training AI in games or robotics.
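To make the supervised-learning case concrete, here is a minimal sketch of linear regression fitted by ordinary least squares for the house-price example. The data points and the function names `fit_line` and `predict` are assumptions for illustration; real projects would typically use a library such as Scikit-learn.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Predict the target for a new input using the fitted line."""
    return slope * x + intercept

# Hypothetical training data: house size in square metres and price in thousands.
sizes = [50, 80, 100, 120]
prices = [110, 170, 210, 250]

slope, intercept = fit_line(sizes, prices)
print("slope:", slope, "intercept:", intercept)
print("predicted price for 90 square metres:", predict(90, slope, intercept))
```

The labelled examples (size and known price) are what make this supervised: the model learns the mapping from inputs to known outputs.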
6. Communication of Results
Definition: Interpreting the results and translating technical findings into actionable insights
for stakeholders.
Examples:
Illustration: Presenting the results of a customer satisfaction survey to the marketing team,
highlighting areas for improvement and suggesting strategies.
7. Deployment and Monitoring
Definition: Implementing models into production systems and monitoring their performance
over time to ensure they provide accurate and reliable results.
Examples:
• Model Deployment: Using tools like Flask, Docker, or cloud services to deploy
machine learning models.
• Performance Monitoring: Tracking model performance using metrics like accuracy,
precision, recall, and F1 score.
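The monitoring metrics named above can be computed directly from true and predicted labels. The following is a small sketch for binary classification; the label lists are invented and `classification_metrics` is a helper defined only for this example.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: 1 = spam, 0 = not spam.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Tracking these values over time, rather than only at deployment, is what reveals when a model's performance starts to drift.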
Summary
Data science combines knowledge from fields like statistics, computer science, domain
expertise, and mathematics to analyze and interpret complex data sets, enabling organizations
to make data-driven decisions. Each component plays a crucial role in turning raw data into
valuable insights and actionable strategies.
Summary
Coding is a key skill in data science. It helps in collecting, cleaning, and preparing data, as
well as building prediction models using tools like Scikit-learn and programming languages
like Python, R, and SQL.
1. Basic Foundation:
o Math is the basic foundation for all tech-related fields.
Summary
Mathematics and statistics are essential for data science. They provide the foundation for
understanding and processing data, helping to gain insights, identify problems, and prepare
data for prediction models.
o It helps make sure that data science models and insights are relevant and
accurate for that field.
o Without domain knowledge, you might misinterpret data or draw incorrect
conclusions.
3. What Should You Know About the Domain?
o Goals: Understand the main objectives of the field.
o Methods: Be familiar with the common techniques and practices used.
o Constraints: Know any limitations or special rules of the field.
4. How Does It Help in Data Science?
o Ensures that the data science models you build fit well with the specific field.
o Helps in making accurate predictions and producing useful insights that can be
practically applied.
5. Value of Domain Knowledge:
o It makes the implementation of data science solutions more effective and
accurate.
o It is crucial for achieving meaningful and practical results from data science
projects.
Summary
Domain knowledge is about understanding a specific field, which is crucial for data science.
It helps ensure that data science models and insights are accurate and relevant for that
particular area, preventing mistakes and making the implementation of solutions more
effective.
1. Artificial Intelligence (AI):
o Definition: AI involves creating systems that can perform tasks requiring human-like
intelligence, such as learning, reasoning, and problem-solving.
o Example: Google’s AI-powered search engine can understand and respond to
complex queries.
o Connection: AI is a broad field that encompasses various techniques used in data
science to create intelligent systems and make data-driven decisions.
2. Data Science:
3. Data Mining:
o Definition: The process of discovering patterns and knowledge from large datasets.
o Example: Mining retail transaction data to find patterns in customer purchasing
behavior.
o Connection: Data mining techniques are used in data science to extract meaningful
insights from large volumes of data.
4. Machine Learning (ML):
5. Algorithm:
o Definition: A set of rules or steps for solving a specific problem or performing a task.
o Example: A sorting algorithm like quicksort that organizes data in a specific order.
o Connection: Algorithms are fundamental to data science as they are used to process
data, build models, and analyze results.
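The quicksort example mentioned above can be written as a short recursive function. This is a simple, readable variant that builds new lists rather than sorting in place; it is meant as a teaching sketch, and in everyday Python code the built-in `sorted` would be used instead.

```python
def quicksort(items):
    """Return a new list with the items in ascending order."""
    if len(items) <= 1:
        return list(items)
    pivot = items[len(items) // 2]
    # Partition the items around the pivot, then sort each side recursively.
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    return quicksort(smaller) + equal + quicksort(larger)

print(quicksort([5, 2, 9, 1, 5, 6]))
```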
6. Big Data:
o Definition: Extremely large and complex datasets that traditional data processing
tools cannot handle efficiently.
o Example: Social media platforms processing petabytes of user-generated content.
o Connection: Data science often deals with big data to uncover insights and patterns
that would be difficult to detect with smaller datasets.
7. Business Intelligence (BI):
o Definition: Tools and techniques for analyzing business data to support
decision-making.
o Example: A BI dashboard displaying key performance indicators like sales revenue
and customer satisfaction scores.
o Connection: BI uses data science techniques to analyze business data and provide
actionable insights.
8. Computer Cluster:
9. Data Engineering:
o Definition: The practice of designing and building systems for collecting, storing,
and processing data.
o Example: Creating ETL (Extract, Transform, Load) pipelines to integrate data from
multiple sources into a data warehouse.
o Connection: Data engineering is crucial for preparing and managing data before it is
analyzed by data scientists.
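A very small ETL pipeline might look like the sketch below. It assumes an in-memory CSV string as the source and an in-memory SQLite database standing in for the data warehouse; the column names, the table name `orders`, and the cleaning rule (skip rows with no amount) are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source data; the second row has a missing amount.
RAW_CSV = """order_id,customer,amount
1,alice,19.99
2,bob,
3,carol,5.50
"""

def extract(text):
    """Extract: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing amount and convert column types."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue
        cleaned.append(
            (int(row["order_id"]), row["customer"].title(), float(row["amount"]))
        )
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(row_count, total)
```

Keeping extract, transform, and load as separate functions makes each stage easy to test and replace on its own.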
10. Deep Learning:
o Definition: A subset of machine learning that uses neural networks with many layers
to analyze complex patterns in data.
o Example: Image recognition systems that can identify objects in photos.
o Connection: Deep learning is used in data science for tasks that involve complex data
such as images and speech.
11. Predictive Analytics:
o Definition: Using historical data and statistical algorithms to forecast future events or
behaviors.
o Example: Predicting customer churn based on past behavior and usage patterns.
o Connection: Predictive analytics is a key application of data science techniques to
anticipate future trends and outcomes.
12. Statistics:
13. Classification:
o Definition: A machine learning task that assigns data points to predefined categories.
o Example: Classifying emails as spam or not spam.
o Connection: Classification is a common technique in data science used for
categorizing data into different classes.
14. Descriptive Analytics:
o Definition: Analyzing historical data to understand what has happened in the past.
o Example: Generating reports on past sales performance to identify trends.
o Connection: Descriptive analytics is used to summarize and understand past data,
providing a basis for further analysis.
15. Regression:
16. Bias:
20. Data Warehouse:
o Definition: A centralized repository that stores data from multiple sources for
analysis and reporting.
o Example: A data warehouse that integrates sales, finance, and customer data for
business intelligence.
o Connection: Data warehouses are used in data science to consolidate and analyze
large volumes of data.
21. Database:
o Definition: An organized collection of data that can be easily accessed, managed, and
updated.
o Example: A relational database that stores customer contact details and order history.
o Connection: Databases are essential for storing and retrieving data used in data
science projects.
22. Diagnostic Analytics:
o Definition: Analyzing data to understand why something happened by examining
causes and relationships.
o Example: Investigating a drop in sales by analyzing customer feedback and sales
data.
o Connection: Diagnostic analytics helps data scientists understand the reasons behind
observed trends and anomalies.
23. Data Governance:
o Definition: The management of data availability, usability, integrity, and security
within an organization.
o Example: Establishing policies for data access and quality control to ensure
compliance and accuracy.
o Connection: Data governance is crucial for maintaining the quality and security of
data used in data science.
24. Data:
o Definition: Raw facts and figures collected from various sources that can be analyzed
to gain insights.
o Example: Transaction records from an e-commerce site showing customer purchases.
o Connection: Data is the foundation of data science; all analyses and insights are
derived from it.
Key Points:
• Data Collection: Netflix collects data on viewing habits, search queries, and user ratings.
• Data Analysis: Uses collaborative filtering and content-based filtering algorithms to analyze
user preferences and predict what they might like.
• Outcome: Improved user engagement and satisfaction by delivering personalized
recommendations, leading to higher retention rates and subscription growth.
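The collaborative filtering idea can be illustrated with a toy user-based recommender. This is a sketch under stated assumptions, not Netflix's actual system: the users, titles, and ratings are invented, similarity is cosine similarity over titles both users have rated, and candidates are ranked by a similarity-weighted sum of other users' ratings.

```python
from math import sqrt

# Hypothetical ratings on a 1 to 5 scale.
ratings = {
    "ana":   {"Drama A": 5, "Comedy B": 1, "Thriller C": 4},
    "ben":   {"Drama A": 4, "Comedy B": 2, "Thriller C": 5, "Drama D": 5},
    "chloe": {"Drama A": 1, "Comedy B": 5, "Comedy E": 4},
}

def cosine(u, v):
    """Cosine similarity computed over the titles both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[t] * v[t] for t in shared)
    norm_u = sqrt(sum(u[t] ** 2 for t in shared))
    norm_v = sqrt(sum(v[t] ** 2 for t in shared))
    return dot / (norm_u * norm_v)

def recommend(user, all_ratings):
    """Rank unseen titles by the similarity-weighted ratings of other users."""
    seen = all_ratings[user]
    scores = {}
    for other, their_ratings in all_ratings.items():
        if other == user:
            continue
        similarity = cosine(seen, their_ratings)
        for title, rating in their_ratings.items():
            if title not in seen:
                scores[title] = scores.get(title, 0.0) + similarity * rating
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("ana", ratings))
```

Here "ana" rates titles much like "ben" does, so the title that "ben" liked and "ana" has not seen ranks first.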
Key Points:
• Data Collection: Collects data on transaction history, user behavior, and device information.
• Data Analysis: Uses machine learning models to detect anomalies and patterns indicative of
fraud.
• Outcome: Reduced fraudulent transactions and enhanced security measures, leading to
increased trust and reduced financial losses.
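As a simplified stand-in for the machine learning models mentioned above, the sketch below flags unusual transaction amounts with a robust statistical rule (the modified z-score based on the median and the median absolute deviation). The amounts, the 3.5 threshold, and the function name `flag_anomalies` are assumptions for illustration; production fraud systems use many more signals than the amount alone.

```python
from statistics import median

def flag_anomalies(amounts, threshold=3.5):
    """Return the amounts whose modified z-score exceeds the threshold."""
    center = median(amounts)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = median([abs(a - center) for a in amounts])
    if mad == 0:
        return []  # No spread to compare against, so nothing can be flagged.
    return [a for a in amounts if 0.6745 * abs(a - center) / mad > threshold]

# Hypothetical transaction amounts for one account.
amounts = [20, 35, 18, 42, 27, 31, 25, 950]
print(flag_anomalies(amounts))
```

The median-based rule is used here because a single extreme value inflates the ordinary mean and standard deviation enough to hide itself in a small sample.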
• CSV File: This is the starting point where data is stored in a CSV (Comma-Separated
Values) format.
• Amazon S3 Input Bucket: The CSV file is uploaded to a bucket in Amazon S3, the object
storage service provided by AWS.
• AWS Lambda: This service retrieves the file contents from the S3 bucket. AWS Lambda
allows you to run code without provisioning or managing servers.
• Amazon Fraud Detector: The data retrieved by AWS Lambda is then processed by
Amazon Fraud Detector, which analyzes the data to detect potential fraudulent activities.
• Amazon S3 Output Bucket: Finally, the results (scored events) are written to another S3
bucket for storage.
Key Points:
• Data Collection: Gathers data on ride requests, location, time, and demand patterns.
• Data Analysis: Applies predictive modeling and real-time analysis to adjust pricing
dynamically.
• Outcome: Optimizes driver availability and balances supply with demand, ensuring efficient
service and maximizing revenue.
• Dynamic Pricing: Adjusts prices based on real-time data to manage demand and supply.
• Service Efficiency: Ensures that more drivers are available during peak times.
• Revenue Optimization: Maximizes earnings for Uber and drivers during high-demand
periods.
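The idea of adjusting price to balance demand and supply can be shown with a deliberately simple rule. This is a hypothetical formula written for illustration, not Uber's pricing algorithm: the multiplier is the ratio of ride requests to available drivers, never below 1.0 and capped at an assumed maximum of 3.0.

```python
def surge_multiplier(ride_requests, available_drivers, cap=3.0):
    """Hypothetical rule: price multiplier from the demand-to-supply ratio."""
    if available_drivers <= 0:
        return cap  # No supply at all, so apply the maximum multiplier.
    ratio = ride_requests / available_drivers
    return round(min(max(ratio, 1.0), cap), 2)

def fare(base_fare, ride_requests, available_drivers):
    """Apply the multiplier to a base fare."""
    return round(base_fare * surge_multiplier(ride_requests, available_drivers), 2)

print(surge_multiplier(50, 100))    # demand below supply
print(surge_multiplier(150, 100))   # demand above supply
print(surge_multiplier(1000, 100))  # extreme demand, capped
print(fare(10.0, 150, 100))
```

A real system would combine such a signal with predictive models of where and when demand will appear, as the key points above describe.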
Key Points:
• Data Collection: Collects data from patient records, lab results, and medical history.
• Data Analysis: Uses machine learning models to predict health risks and patient outcomes.
• Outcome: Enhances patient care by providing early warnings and personalized treatment
plans, leading to better health outcomes.
• Early Diagnosis: Identifies potential health issues before they become severe.
• Personalized Treatment: Tailors treatment plans based on individual patient data.
• Improved Outcomes: Leads to better patient health and reduced hospital readmissions.
Key Points:
• Data Collection: Gathers data on customer purchases, demographics, and browsing behavior.
• Data Analysis: Applies predictive analytics to identify trends and target marketing efforts.
Nominal Scale
• Definition: The nominal scale categorizes data into distinct groups with no inherent order.
• Characteristics:
o Mutually Exclusive: Each observation fits into only one category.
o Exhaustive: All possible categories are included.
o No Mathematical Operations: Arithmetic operations such as addition, subtraction,
and finding averages are not applicable.
• Examples:
o Types of Cars: Categories like sedan, SUV, and truck.
o Marital Status: Categories such as single, married, and divorced.
Ordinal Scale
• Definition: The ordinal scale arranges data into ordered categories, but the intervals between
categories are not consistent.
• Characteristics:
o Meaningful Order: Categories are ranked (e.g., low, medium, high).
o Non-Uniform Intervals: The difference between ranks is not uniform or
quantifiable.
o Limited Arithmetic: Arithmetic operations are not meaningful, but rank-based
measures like the median, along with the mode, are applicable.
• Examples:
o Educational Level: Categories such as high school diploma, bachelor’s degree, and
master’s degree.
o Likert Scale Responses: Responses like strongly disagree, disagree, neutral, agree,
strongly agree.
o Net Promoter Score (NPS): Ratings from 0 to 10 on likelihood to recommend a
product or service.
Interval Scale
• Definition: The interval scale is numerical and has equal intervals between values, but it does
not have a true zero point.
• Characteristics:
o Equal Intervals: The difference between values is consistent (e.g., the gap between
10°C and 20°C is the same as between 20°C and 30°C).
o Arbitrary Zero: Zero does not signify the absence of the attribute (e.g., 0°C does not
mean ‘no temperature’).
o Arithmetic Operations: Addition and subtraction are meaningful, but multiplication
and division are not (e.g., 20°C is not ‘twice as hot’ as 10°C).
• Examples:
o Temperature: Measurements in Celsius or Fahrenheit (e.g., 0°C, 10°C, 20°C).
o Dates: Specific dates like January 1st, February 1st, March 1st.
o IQ Scores: Numerical values such as 80, 100, 120.
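A short calculation shows why ratios are not meaningful on an interval scale. The sketch below compares 10°C and 20°C: the naive ratio changes when the same temperatures are expressed in Fahrenheit, and only the Kelvin scale, which has a true zero, gives a ratio with physical meaning. The helper function names are chosen for this example.

```python
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

def celsius_to_kelvin(c):
    return c + 273.15

cold_c, warm_c = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference_c = warm_c - cold_c

# Ratios are not: the answer depends on where the scale puts its zero.
celsius_ratio = warm_c / cold_c
fahrenheit_ratio = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cold_c)
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

print("difference in Celsius:", difference_c)
print("ratio in Celsius:", celsius_ratio)
print("ratio in Fahrenheit:", fahrenheit_ratio)
print("ratio in Kelvin:", kelvin_ratio)
```

The same pair of temperatures gives three different ratios, which is the practical sign that Celsius and Fahrenheit are interval scales rather than ratio scales.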
Ratio Scale
• Definition: The ratio scale has all the properties of the interval scale, but it also includes a
true zero point.
• Characteristics:
o True Zero: Zero indicates the complete absence of the attribute being measured (e.g.,
0 cm means no height).
o All Arithmetic Operations: Addition, subtraction, multiplication, and division are
all meaningful.
• Examples:
o Height: Measurements in centimeters or inches (e.g., 150 cm, 170 cm, 190 cm).
o Weight: Measurements in kilograms or pounds (e.g., 50 kg, 70 kg, 90 kg).
o Time: Time taken to complete tasks (e.g., 0 seconds, 30 seconds, 60 seconds).
Understanding these levels of measurement is crucial for selecting the right statistical techniques and
correctly interpreting data:
• Nominal Data: Useful for classification and counting. Helps in understanding distribution
and categorical comparisons.
• Ordinal Data: Important for ranking and order-based analysis. Helps in understanding
preferences and levels of agreement.
• Interval Data: Useful for comparing differences and trends over time. Enables detailed
statistical analysis like mean and variance.
• Ratio Data: Allows for comprehensive analysis involving all arithmetic operations. Essential
for precise measurements and ratio comparisons.
These levels help data scientists choose appropriate methods for data analysis, ensuring accurate and
meaningful insights.
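The mapping from measurement level to a suitable summary can be sketched as follows. The sample data and the category ordering are assumptions made for illustration.

```python
from statistics import mean, median_low, mode

# Nominal: only counting is meaningful, so summarize with the mode.
car_types = ["sedan", "SUV", "sedan", "truck"]
most_common_type = mode(car_types)

# Ordinal: order is meaningful, so summarize with the median of the ranks.
SATISFACTION_ORDER = ["low", "medium", "high"]
responses = ["low", "high", "medium", "medium", "high"]
ranks = [SATISFACTION_ORDER.index(r) for r in responses]
# median_low always returns an actual data value, so it maps back to a category.
median_response = SATISFACTION_ORDER[median_low(ranks)]

# Interval: differences and means are meaningful.
temperatures_c = [10.0, 20.0, 30.0]
mean_temperature = mean(temperatures_c)
temperature_range = max(temperatures_c) - min(temperatures_c)

# Ratio: a true zero makes "twice as much" a valid statement.
weights_kg = [45.0, 90.0]
weight_ratio = weights_kg[1] / weights_kg[0]

print(most_common_type, median_response, mean_temperature,
      temperature_range, weight_ratio)
```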