Unit I: Data Science Fundamentals
Data science is an interdisciplinary field that uses tools from mathematics, statistics, and computer
science to extract knowledge and insights from data. In practice, data scientists prepare and analyze
large, often complex datasets (which can be structured or unstructured) to answer questions or solve
problems. For example, astronomers discovered the existence of Comet NEOWISE by analyzing massive
astronomical survey data from a space telescope. Such applications show how data science combines
domain expertise (e.g. astronomy), statistical analysis, and computing methods to turn raw data into
discoveries. In general, a data scientist is someone who writes code and applies statistical techniques
to summarize data and build predictive models. This role requires not only technical skill but also an
understanding of the real-world context of the data.
Data science differs from simple data analysis in scope: while data analysts might answer specific
questions using clean, structured data, data scientists typically work with larger, messier datasets and
use advanced methods. They often handle unstructured data (like text or images) and apply machine
learning algorithms to make predictions. For instance, a data scientist might use machine learning to
build a model that predicts customer behavior, diagnoses medical conditions from images, or forecasts
sales. All of these tasks start with formulating a problem, gathering data, cleaning it, and then using
statistical models or algorithms to extract patterns and insights.
Moreover, data-driven strategies can dramatically boost growth and efficiency. For example, a McKinsey
report shows that companies using advanced data and analytics saw above-market growth and a 15–25%
increase in profitability compared to peers [2][3].
These numbers reflect real-world impacts. Data science is applied across industries to optimize
operations and create value. For instance, manufacturers use predictive analytics to forecast demand,
adjusting production schedules in advance to meet spikes in customer orders and avoid overstock [4].
In retail and tech, personalization is another success story: streaming and e-commerce services analyze
user behavior data to recommend products or content. An online streaming platform, for example,
uses viewing and rating data to suggest shows to each subscriber; this personalization increases user
engagement and subscription renewals [5]. Other applications include fraud detection in finance
(identifying suspicious transactions), optimizing routes in logistics, improving patient outcomes in
healthcare, and more.
• Better decisions: Data-driven companies make more informed choices. One survey found that
organizations relying heavily on data were three times more likely to improve decision quality [1].
• Increased growth: Analytics fuel innovation and efficiency. Leading firms report 15–25% higher
growth and profit margins by leveraging data science [2][3].
• Practical examples: Data science applications are everywhere. Examples include demand
forecasting in manufacturing [4], content recommendation in media [5], spam filtering in email
systems, credit scoring in finance, and many others.
These successes make the case for data science: by uncovering patterns and making accurate
predictions, data science helps organizations solve problems and seize opportunities that would be
impractical to tackle manually.
Common classification algorithms include k-nearest neighbors, decision trees, logistic regression, naive
Bayes, and support vector machines [7]. For example, a decision tree might learn rules like “if an email
contains certain keywords, classify it as spam.” In the real world, classification is used in many contexts:
- Email spam filtering: Classifying incoming email as “spam” or “not spam” based on content and
metadata [8]. A model is trained on past emails labeled by users, then predicts new ones.
- Medical diagnosis: Classifying medical images or patient data as indicating “disease” vs. “no disease.”
For example, an algorithm might classify tumors in an X-ray as malignant or benign.
- Credit scoring: Predicting whether a loan applicant should be classified as a “good” or “bad” credit
risk.
- Image recognition: Classifying pictures (cats vs. dogs, handwritten digit recognition, etc.).
- Other domains: Document categorization, sentiment analysis (positive/negative), and many more.
Classification algorithms learn from examples [6]. For instance, logistic regression or a decision tree
can be trained on past data to separate the classes. Overall, classification allows data scientists to
automate decision-making when each outcome fits one of several categories [6][8].
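To make "learning from examples" concrete, here is a minimal 1-nearest-neighbor classifier (one of the algorithms named above) on invented email features: each email is reduced to a count of the word "free" and a count of the word "meeting". The training data and features are made up for illustration.

```python
import math

# Hypothetical training set: ((count of "free", count of "meeting"), label).
train = [
    ((3, 0), "spam"),
    ((2, 0), "spam"),
    ((0, 2), "not spam"),
    ((0, 3), "not spam"),
]

def classify(features):
    """Label a new email with the class of its closest training example."""
    nearest = min(train, key=lambda example: math.dist(features, example[0]))
    return nearest[1]

print(classify((2, 1)))   # near the spam examples -> spam
print(classify((0, 1)))   # near the legitimate examples -> not spam
```

Real spam filters use far richer features and larger models, but the principle is the same: labeled past examples determine how new inputs are classified.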
Data Science Process
A structured process helps ensure data science projects are effective. One widely used framework is the
CRISP-DM process model [9]. This cyclical model defines six major phases:
- Business Understanding: Define project objectives from a business perspective (what problem are we solving?).
- Data Understanding: Gather and explore the available data. Check quality, detect outliers, and gain intuition about what’s in the data.
- Data Preparation: Clean and transform the raw data into a form suitable for analysis (see the Data Preparation section below).
- Modeling: Apply analytical or machine learning techniques to build predictive or descriptive models. For example, select algorithms and train models on the prepared data.
- Evaluation: Assess the model’s performance against the business objectives. Use metrics (accuracy, error rates, etc.) to decide if the model is good enough or needs refining.
- Deployment: Integrate the final model into production so that it can make predictions or inform decisions in the real world (e.g. a scoring system in software, a regular report, etc.).
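The cycle of phases can be sketched as a schematic loop. Every function below is a placeholder with invented names and return values; it shows only the control flow of CRISP-DM, not a real project.

```python
# Schematic sketch of the CRISP-DM cycle; all phase bodies are placeholders.

def business_understanding():
    return "objective: reduce customer churn"      # define the goal

def data_understanding(objective):
    return [" Raw Record 1 ", " Raw Record 2 "]    # gather and explore data

def data_preparation(data):
    return [r.strip().lower() for r in data]       # clean and transform

def modeling(prepared):
    return {"model": "placeholder", "trained_on": len(prepared)}

def evaluation(model):
    return 0.9                                     # pretend accuracy score

def deployment(model):
    return "deployed"

objective = business_understanding()
data = data_understanding(objective)
score, attempts = 0.0, 0
while score < 0.8 and attempts < 3:                # loop back if evaluation is poor
    model = modeling(data_preparation(data))
    score = evaluation(model)
    attempts += 1
status = deployment(model)
```

The `while` loop is the essential point: evaluation can send the project back to preparation and modeling before anything is deployed.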
The CRISP-DM diagram illustrates how these phases connect: arrows show that tasks often loop
back (for example, a poor evaluation may send the team back to improve data preparation or modeling) [9][10].
In practice, data science is iterative. The process starts with understanding the goal and data, then
moves through preparation and modeling. After evaluation, one might return to earlier steps (e.g.
gather more data or engineer new features) before final deployment. This structured, cyclic process
ensures that the project stays aligned with real needs and that the data is handled carefully at each step [10].
Following a process like CRISP-DM helps teams stay organized. For example, a team might first meet
with stakeholders to clarify goals, then spend weeks cleaning and preparing data. They would then try
different algorithms, evaluate results, and finally implement the winning model into business systems.
Even after deployment, data scientists monitor the model and may iterate as new data arrives.
Prior Knowledge
To succeed in data science, practitioners need a mix of technical skills, domain knowledge, and soft
skills. Key prerequisites include:
• Statistics and mathematics: the foundation for analyzing data and reasoning about uncertainty.
• Programming: writing code (e.g. in Python or R) to manipulate data and build models.
• Domain knowledge: understanding the real-world context the data comes from.
• Communication: explaining findings clearly to stakeholders.
In short, a well-rounded data scientist combines programming and statistical skills with an
understanding of the problem domain and good communication. As one tutorial notes, data science is
multidisciplinary and “combines statistics, computer science, and domain expertise” [17]. Building these
foundations (often through coursework or practical projects) provides the prior knowledge needed to
tackle data projects.
Data Preparation
Data preparation (also called data cleaning or wrangling) is the process of transforming raw data into a
clean, usable form before analysis. It is a critical early step because most real-world data is messy: it
may contain errors, missing values, inconsistent formats, or irrelevant fields. As one resource explains,
“Data preparation is the process of cleaning and transforming raw data prior to processing and
analysis.” 18 . This can include standardizing data formats, correcting mistakes, and combining multiple
datasets 18 19 .
Though often tedious, good data preparation is crucial for accurate results. Statistics show that many
data scientists spend a large portion of their time on this step – one survey found that 76% of data
scientists consider data preparation the most difficult or time-consuming part of their work [20].
However, thorough cleaning makes downstream analysis much more reliable. For example, removing
outliers and ensuring data consistency can prevent a model from learning incorrect patterns. In
practice, data preparation often loops back into modeling: after evaluating a model, a data scientist
might realize they need to engineer new features or gather more data, then repeat the modeling step.
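As a small illustration of the cleaning steps named above (standardizing formats, dropping missing values, removing outliers), here is a sketch on a made-up raw field of survey ages:

```python
# Hypothetical raw survey field: inconsistent whitespace, a missing value,
# a malformed entry, and an implausible outlier.
raw_ages = [" 34", "29", "", "31 ", "thirty", "420"]

cleaned = []
for value in raw_ages:
    value = value.strip()        # standardize the format
    if not value.isdigit():
        continue                 # drop missing or malformed entries
    age = int(value)
    if not 0 < age < 120:
        continue                 # drop implausible outliers
    cleaned.append(age)

print(cleaned)                   # -> [34, 29, 31]
```

Each discarded record is a deliberate decision; in practice such rules are documented so the analysis stays reproducible.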
Data Modeling
Data modeling is the stage where statistical or machine learning models are built from the prepared
data. The goal is to capture patterns in the data so we can make predictions or draw insights. In
predictive modeling, the data scientist fits a model to historical data to forecast future outcomes. As
TechTarget defines it, “Predictive modeling is a mathematical process used to predict future events or
outcomes by analyzing patterns in a given set of input data.” [21]. In other words, we use algorithms to
answer questions like “What will happen?” or “Which category does this belong to?”
During modeling, one chooses algorithms based on the problem type (regression, classification,
clustering, etc.) and trains models on the data. For instance, a linear regression model might be used to
predict sales revenue from advertising spend, or a decision tree might classify customers into
segments. Clustering algorithms (unsupervised learning) can find natural groupings in data without
predefined labels. The key point is using mathematical models (statistics or ML) to approximate the real-
world process that generated the data.
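Of the model families just mentioned, clustering is the easiest to show without labeled data. The following toy k-means sketch (with invented one-dimensional customer-spend values) finds two natural groups:

```python
# Toy k-means: two clusters in made-up 1-D customer-spend data (thousands).
points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centers = [points[0], points[3]]          # naive initial guesses

for _ in range(10):                       # a few refinement iterations
    groups = [[], []]
    for p in points:
        # assign each point to its nearest center
        nearest = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
        groups[nearest].append(p)
    # move each center to the mean of its assigned points
    centers = [sum(g) / len(g) for g in groups]

print(sorted(round(c, 1) for c in centers))   # -> [1.0, 8.1]
```

No labels were supplied, yet the algorithm recovers the two spending segments; that is the sense in which unsupervised methods "find natural groupings."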
For example, a data scientist might build a fraud-detection model that learns from past transaction
data. By fitting a model, the algorithm can learn patterns that indicate fraud and predict the probability
that new transactions are fraudulent [22]. Or a sales forecasting model could learn from historical sales
and economic indicators to predict next quarter’s demand [21]. In both cases, the trained model
encapsulates learned relationships. According to one overview, knowledge of machine learning
algorithms for building predictive models is crucial for a data scientist [23].
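To show how a fitted fraud model turns a transaction into a probability, here is a sketch using the logistic function. The feature names and weights are invented for illustration; a real model would learn the weights from labeled transactions.

```python
import math

# Invented weights -- a real model would learn these from labeled data.
weights = {"amount_zscore": 1.5, "foreign_country": 2.0, "late_night": 0.8}
bias = -4.0

def fraud_probability(tx):
    """Combine features into a score, then squash it into (0, 1)."""
    score = bias + sum(weights[name] * tx[name] for name in weights)
    return 1 / (1 + math.exp(-score))

low_risk  = {"amount_zscore": 0.1, "foreign_country": 0, "late_night": 0}
high_risk = {"amount_zscore": 3.0, "foreign_country": 1, "late_night": 1}
```

An ordinary domestic purchase scores a probability near zero, while a large, foreign, late-night transaction scores near one, so a bank could flag only transactions above a chosen threshold.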
Once a model is built, it must be evaluated (next process step) to ensure it meets the project goals
(accuracy, reliability, etc.). A good model can then be used to make decisions automatically – for
instance, tagging future emails as spam, flagging bank transactions, or optimizing inventory orders. In
summary, data modeling is where the analytical power lies: by creating and refining models, data
scientists turn cleaned data into actionable predictions and insights [21][23].
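The evaluation step mentioned above can be as simple as comparing predictions against known labels. A minimal accuracy check, on made-up spam-filter results:

```python
# Made-up evaluation data: true labels vs. a model's predictions.
actual    = ["spam", "spam", "ham", "ham", "spam"]
predicted = ["spam", "ham",  "ham", "ham", "spam"]

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(accuracy)   # -> 0.8
```

In practice accuracy alone can mislead (e.g. on imbalanced classes such as rare fraud), which is why the evaluation phase uses several metrics before a model is deployed.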
Sources: Authoritative data science resources and industry articles were used to compile these
explanations [9][18][21]. Each section includes examples and citations to help beginners understand
the concepts. The accompanying diagrams visualize key ideas (the Venn diagram of data science
disciplines and the CRISP-DM process model).
[4][5] The Role of Data Science in Business Decision-Making –
https://www.exelatech.com/blog/role-data-science-business-decision-making