
Unit I: Data Science Fundamentals

Introduction to Data Science

Data science is an interdisciplinary field that uses tools from mathematics, statistics, and computer
science to extract knowledge and insights from data. In practice, data scientists prepare and analyze
large, often complex datasets (which can be structured or unstructured) to answer questions or solve
problems. For example, astronomers discovered the existence of Comet NEOWISE by analyzing massive
astronomical survey data from a space telescope. Such applications show how data science combines
domain expertise (e.g. astronomy), statistical analysis, and computing methods to turn raw data into
discoveries. In general, a data scientist is someone who writes code and applies statistical techniques
to summarize data and build predictive models. This role requires not only technical skill but also an
understanding of the real-world context of the data.

Data science differs from simple data analysis in scope: while data analysts might answer specific
questions using clean, structured data, data scientists typically work with larger, messier datasets and
use advanced methods. They often handle unstructured data (like text or images) and apply machine
learning algorithms to make predictions. For instance, a data scientist might use machine learning to
build a model that predicts customer behavior, diagnoses medical conditions from images, or forecasts
sales. All of these tasks start with formulating a problem, gathering data, cleaning it, and then using
statistical models or algorithms to extract patterns and insights.

The Case for Data Science


Data science has become crucial because modern organizations generate and collect vast amounts of data and need to turn it into actionable knowledge. Data-driven decision making leads to better business outcomes: one study found that companies using data to guide decisions were three times more likely to report significant improvements in decision-making than those relying on intuition [1]. Moreover, data-driven strategies can dramatically boost growth and efficiency. For example, a McKinsey report shows that companies using advanced data and analytics saw above-market growth and a 15–25% increase in profitability compared to peers [2][3].

These numbers reflect real-world impacts. Data science is applied across industries to optimize operations and create value. For instance, manufacturers use predictive analytics to forecast demand, adjusting production schedules in advance to meet spikes in customer orders and avoid overstock [4]. In retail and tech, personalization is another success story: streaming and e-commerce services analyze user behavior data to recommend products or content. An online streaming platform, for example, uses viewing and rating data to suggest shows to each subscriber; this personalization increases user engagement and subscription renewals [5]. Other applications include fraud detection in finance (identifying suspicious transactions), optimizing routes in logistics, improving patient outcomes in healthcare, and more.

• Better decisions: Data-driven companies make more informed choices. One survey found that organizations relying heavily on data were three times more likely to improve decision quality [1].
• Increased growth: Analytics fuel innovation and efficiency. Leading firms report 15–25% higher growth and profit margins by leveraging data science [2][3].
• Practical examples: Data science applications are everywhere. Examples include demand forecasting in manufacturing [4], content recommendation in media [5], spam filtering in email systems, credit scoring in finance, and many others.

These successes make the case for data science: by uncovering patterns and making accurate predictions, data science helps organizations solve problems and seize opportunities that would be impractical to tackle manually.

Data Science Classification


In data science, classification refers to a type of predictive modeling task where the goal is to assign
each input (data point) to one of several discrete categories or classes. It is a form of supervised
learning, meaning the model is trained on labeled examples. As one source puts it, "Classification is a supervised learning technique… that predicts the category (also called the class) of new data points based on input features" [6]. During training, the algorithm learns patterns that distinguish each class; once trained, it can classify new, unseen instances.

Common classification algorithms include k-nearest neighbors, decision trees, logistic regression, naive Bayes, and support vector machines [7]. For example, a decision tree might learn rules like "if an email contains certain keywords, classify it as spam." In the real world, classification is used in many contexts:
- Email spam filtering: classifying incoming email as "spam" or "not spam" based on content and metadata [8]. A model is trained on past emails labeled by users, then predicts labels for new ones.
- Medical diagnosis: classifying medical images or patient data as indicating "disease" vs. "no disease." For example, an algorithm might classify tumors in an X-ray as malignant or benign.
- Credit scoring: predicting whether a loan applicant is a "good" or "bad" credit risk.
- Image recognition: classifying pictures (cats vs. dogs, handwritten digit recognition, etc.).
- Other domains: document categorization, sentiment analysis (positive/negative), and many more.

Classification algorithms learn from examples: logistic regression or a decision tree, for instance, can be trained on past data to separate classes. Overall, classification allows data scientists to automate decision-making when each outcome fits one of several categories [6][8].

Data Science Process
A structured process helps ensure data science projects are effective. One widely used framework is the CRISP-DM process model [9]. This cyclical model defines six major phases:
- Business Understanding: Define project objectives from a business perspective (what problem are we solving?).
- Data Understanding: Gather and explore the available data. Check quality, detect outliers, and build intuition about what the data contains.
- Data Preparation: Clean and transform the raw data into a suitable form for analysis (see the Data Preparation section below).
- Modeling: Apply analytical or machine learning techniques to build predictive or descriptive models, e.g. select algorithms and train them on the prepared data.
- Evaluation: Assess the model's performance against the business objectives. Use metrics (accuracy, error rates, etc.) to decide whether the model is good enough or needs refining.
- Deployment: Integrate the final model into production so it can make predictions or inform decisions in the real world (e.g. a scoring system in software, a regular report, etc.).

In diagrams of CRISP-DM, arrows show that phases often loop back (for example, a poor evaluation may send the team back to improve data preparation or modeling) [9][10]. In practice, data science is iterative. The process starts with understanding the goal and the data, then moves through preparation and modeling. After evaluation, one might return to earlier steps (e.g. gather more data or engineer new features) before final deployment. This structured, cyclic process keeps the project aligned with real needs and ensures the data is handled carefully at each step [10].

Following a process like CRISP-DM helps teams stay organized. For example, a team might first meet
with stakeholders to clarify goals, then spend weeks cleaning and preparing data. They would then try
different algorithms, evaluate results, and finally implement the winning model into business systems.
Even after deployment, data scientists monitor the model and may iterate as new data arrives.
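The loop-back structure described above can be sketched as a toy pipeline. Everything here is illustrative: the function names are not part of CRISP-DM, and the "model" is just a mean baseline so the example runs end to end:

```python
def prepare(rows):
    # Data Preparation: drop rows whose target value is missing
    return [r for r in rows if r["y"] is not None]

def train(rows):
    # Modeling: a trivial baseline "model" that always predicts the mean
    mean = sum(r["y"] for r in rows) / len(rows)
    return lambda row: mean

def evaluate(model, rows, tolerance=2.0):
    # Evaluation: mean absolute error compared against a business tolerance
    mae = sum(abs(model(r) - r["y"]) for r in rows) / len(rows)
    return mae <= tolerance

# Business/Data Understanding happen before any code: here the (invented)
# objective is to predict y within +/- 2 on average.
data = [{"y": 4.0}, {"y": 6.0}, {"y": None}, {"y": 5.0}]

prepared = prepare(data)
model = train(prepared)
status = "deploy" if evaluate(model, prepared) else "iterate"
print(status)  # a failed evaluation would loop back to earlier phases instead
```

In a real project each phase is far richer, but the control flow (prepare, model, evaluate, and either deploy or loop back) is the same.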

Prior Knowledge
To succeed in data science, practitioners need a mix of technical skills, domain knowledge, and soft
skills. Key prerequisites include:

• Programming & Tools: Proficiency in programming languages such as Python or R is essential for manipulating data and implementing algorithms [11]. You should also be familiar with data tools (e.g. SQL databases, data visualization libraries, spreadsheet software).
• Mathematics & Statistics: A solid understanding of statistics and probability (mean, variance, distributions, hypothesis testing, regression, etc.) is crucial for analyzing data and validating models [12][13]. Linear algebra and calculus concepts help in understanding how many algorithms work.
• Data Manipulation: Skills in data wrangling are important. This means knowing how to clean, merge, and transform large datasets (handling missing values, encoding categories, aggregating data, etc.) [11].
• Machine Learning Basics: Familiarity with basic machine learning concepts and algorithms (e.g. how classification and regression work, what overfitting is, evaluation metrics) helps in model building [14].
• Domain Knowledge: Understanding the specific field you're working in greatly improves your insights. For instance, a data scientist in healthcare should know medical terminology and workflows [15]; in finance, one should grasp financial concepts. Domain knowledge guides meaningful feature selection and helps interpret results.
• Soft Skills: Communication and problem-solving abilities are also critical [16]. A data scientist must explain complex results to non-technical stakeholders (using clear language or visualizations) and work collaboratively in interdisciplinary teams.

In short, a well-rounded data scientist combines programming and statistical skills with an understanding of the problem domain and good communication. As one tutorial notes, data science is multidisciplinary and "combines statistics, computer science, and domain expertise" [17]. Building these foundations (often through coursework or practical projects) provides the prior knowledge needed to tackle data projects.
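The core statistics named above (mean, variance, correlation) can be computed with Python's standard library alone. The monthly figures below are invented, just to exercise the formulas:

```python
import statistics

sales    = [12.0, 15.0, 11.0, 14.0, 18.0, 13.0]  # invented monthly sales
ad_spend = [2.0, 3.0, 1.5, 2.5, 4.0, 2.0]        # invented ad spend

mean_sales = statistics.mean(sales)       # central tendency
var_sales = statistics.variance(sales)    # sample variance (n - 1 denominator)

# Pearson correlation, built from sample covariance and standard deviations
n = len(sales)
cov = sum((x - statistics.mean(ad_spend)) * (y - mean_sales)
          for x, y in zip(ad_spend, sales)) / (n - 1)
corr = cov / (statistics.stdev(ad_spend) * statistics.stdev(sales))

print(round(mean_sales, 2), round(var_sales, 2), round(corr, 2))
```

A correlation near 1 here would suggest spend and sales move together, which is exactly the kind of observation that motivates the regression models discussed later.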

Data Preparation
Data preparation (also called data cleaning or wrangling) is the process of transforming raw data into a clean, usable form before analysis. It is a critical early step because most real-world data is messy: it may contain errors, missing values, inconsistent formats, or irrelevant fields. As one resource explains, "Data preparation is the process of cleaning and transforming raw data prior to processing and analysis" [18]. This can include standardizing data formats, correcting mistakes, and combining multiple datasets [18][19].

Common data preparation tasks include:


- Cleaning: Fix or remove incorrect or irrelevant records (e.g. remove duplicate entries, correct typos,
handle missing values).
- Transforming: Convert data into the needed format or scale (e.g. parse dates consistently, normalize
numeric ranges, encode categorical variables).
- Integrating: Merge data from different sources or tables (for example, joining customer data with
transaction data by ID).
- Feature Engineering: Create new useful variables from existing ones (e.g. combine date and time into
a single timestamp feature, or extract “age” from a birth date).
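These tasks can be sketched in plain Python on a tiny invented dataset (the field names, date formats, and imputation rule are all hypothetical choices for illustration):

```python
from datetime import datetime

raw = [
    {"id": 1, "signup": "2024-01-05", "age": "34", "country": "IN"},
    {"id": 1, "signup": "2024-01-05", "age": "34", "country": "IN"},  # duplicate
    {"id": 2, "signup": "05/02/2024", "age": None, "country": "in"},  # messy row
]

def parse_date(s):
    """Transforming: accept either of the two date formats seen in the raw data."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            pass
    raise ValueError(f"unrecognised date: {s}")

# Cleaning: impute missing ages with the mean of the known ones
ages = [int(r["age"]) for r in raw if r["age"] is not None]
default_age = round(sum(ages) / len(ages))

seen, clean = set(), []
for r in raw:
    if r["id"] in seen:                        # cleaning: drop duplicate records
        continue
    seen.add(r["id"])
    clean.append({
        "id": r["id"],
        "signup": parse_date(r["signup"]),     # transforming: one date format
        "age": int(r["age"]) if r["age"] else default_age,
        "country": r["country"].upper(),       # transforming: consistent casing
    })

print(clean)
```

Real projects would typically do this with a library such as pandas, but the logic (dedupe, normalize formats, impute gaps) is the same.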

Though often tedious, good data preparation is crucial for accurate results. Surveys show that data scientists spend a large share of their time on this step; one found that 76% of data scientists consider data preparation the most difficult or time-consuming part of their work [20]. However, thorough cleaning makes downstream analysis much more reliable. For example, removing outliers and ensuring data consistency can prevent a model from learning incorrect patterns. In practice, data preparation often loops back with modeling: after evaluating a model, a data scientist might realize they need to engineer new features or gather more data, then repeat the modeling step.

Data Modeling
Data modeling is the stage where statistical or machine learning models are built from the prepared data. The goal is to capture patterns in the data so we can make predictions or draw insights. In predictive modeling, the data scientist fits a model to historical data to forecast future outcomes. As TechTarget defines it, "Predictive modeling is a mathematical process used to predict future events or outcomes by analyzing patterns in a given set of input data" [21]. In other words, we use algorithms to answer questions like "What will happen?" or "Which category does this belong to?"

During modeling, one chooses algorithms based on the problem type (regression, classification,
clustering, etc.) and trains models on the data. For instance, a linear regression model might be used to
predict sales revenue from advertising spend, or a decision tree might classify customers into
segments. Clustering algorithms (unsupervised learning) can find natural groupings in data without
predefined labels. The key point is using mathematical models (statistics or ML) to approximate the real-world process that generated the data.
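The sales-from-advertising example above can be made concrete with ordinary least squares, the closed-form fit for a one-variable linear regression. The numbers are invented; this is a sketch of the method, not production code:

```python
# Ordinary least squares fit of y = a + b*x (one predictor, closed form)
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]    # invented historical spend
revenue  = [2.1, 3.9, 6.2, 8.0, 9.8]    # invented revenue for those periods

n = len(ad_spend)
mean_x = sum(ad_spend) / n
mean_y = sum(revenue) / n

# slope = covariance(x, y) / variance(x); intercept from the means
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(ad_spend, revenue))
     / sum((x - mean_x) ** 2 for x in ad_spend))
a = mean_y - b * mean_x

def predict(x):
    return a + b * x

print(round(b, 2), round(predict(6.0), 2))  # slope, and a forecast at spend=6
```

Here the fitted slope is about 1.95, i.e. each extra unit of spend adds roughly 1.95 units of predicted revenue; a real project would use a library and validate the model on held-out data.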

For example, a data scientist might build a fraud-detection model that learns from past transaction data. By fitting a model, the algorithm can learn patterns that indicate fraud and predict the probability that new transactions are fraudulent [22]. Or a sales forecasting model could learn from historical sales and economic indicators to predict next quarter's demand [21]. In both cases, the trained model encapsulates learned relationships. According to one overview, knowledge of machine learning algorithms for building predictive models is crucial for a data scientist [23].

Once a model is built, it must be evaluated (the next process step) to ensure it meets the project goals (accuracy, reliability, etc.). A good model can then be used to make decisions automatically: for instance, tagging future emails as spam, flagging suspicious bank transactions, or optimizing inventory orders. In summary, data modeling is where the analytical power lies: by creating and refining models, data scientists turn cleaned data into actionable predictions and insights [21][23].

Sources: Authoritative data science resources and industry articles were used to compile these explanations [9][18][21]. Each section includes examples and citations to help beginners understand the concepts.

[1] The Advantages of Data-Driven Decision-Making | HBS Online
https://online.hbs.edu/blog/post/data-driven-decision-making

[2][3] Insights to impact: Creating and sustaining data-driven commercial growth | McKinsey
https://www.mckinsey.com/capabilities/growth-marketing-and-sales/our-insights/insights-to-impact-creating-and-sustaining-data-driven-commercial-growth

[4][5] The Role of Data Science in Business Decision-Making
https://www.exelatech.com/blog/role-data-science-business-decision-making

[6] What is Classification in Machine Learning? | Grammarly
https://www.grammarly.com/blog/ai/what-is-classification/

[7] Classification in Data Science
https://www.almabetter.com/bytes/tutorials/data-science/classification-in-data-science

[8][21][22] What is Predictive Modeling?
https://www.techtarget.com/searchenterpriseai/definition/predictive-modeling

[9][10] Cross-industry standard process for data mining - Wikipedia
https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining

[11][13][14][15][16][17][23] Introduction to Data Science: Skills Required | GeeksforGeeks
https://www.geeksforgeeks.org/introduction-data-science-skills-required/

[12] 7 Skills Every Data Scientist Should Have | Coursera
https://www.coursera.org/articles/data-scientist-skills

[18][19][20] What is Data Preparation? Processes and Example | Talend
https://www.talend.com/resources/what-is-data-preparation/
