
Chapter 1

Basics of Machine Learning

Machine Learning (ML) is a branch of artificial intelligence that enables computers to learn
from data automatically, identifying patterns and making predictions without being explicitly
programmed. It involves developing algorithms that can learn from and make decisions based
on data, rather than following predetermined rules.
In the early days of machine learning, the field was often referred to as pattern recog-
nition (PR) in engineering, focusing on specific applications such as optical character
recognition, speech recognition, and face recognition. These tasks, while intuitive for
humans, pose significant challenges for computers because the underlying cognitive pro-
cesses are not fully understood, making it difficult to manually program computers to
perform them. Machine learning offers a solution by designing algorithms that allow
computers to learn from labeled examples and generalize to new, unseen data.
To illustrate the concept, consider the task of handwritten digit recognition. The
variability and uniqueness of human handwriting make it challenging for computers to
accurately identify digits. Machine learning tackles this problem by training a model on
a large dataset of labeled handwritten digits. The process involves several key steps:

1. Feature Extraction: The algorithm identifies and extracts relevant features from
the images, such as edges, angles, and shapes associated with each digit. Advanced
models like Convolutional Neural Networks (CNNs) can automatically learn the
most pertinent features.

2. Learning from Data: The model adjusts its internal parameters to minimize the
difference between its predictions and the actual labels, learning to recognize digits
more accurately.

3. Generalization: The trained model can then be applied to new, unseen examples of
handwritten digits, identifying them based on the patterns learned during training.

This approach, based on learning from examples, highlights the power and flexibility
of machine learning. It mimics the way humans learn, improving through exposure to
diverse examples. Handwritten digit recognition serves as a practical introduction to the
broader principles and potential of machine learning.
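To make the three steps concrete, the short Python sketch below trains a simple classifier on
scikit-learn's bundled 8×8 digits dataset. The library, dataset, and model choice are
illustrative assumptions of the sketch, not part of the chapter's setup; any comparable tools
would serve the same purpose.

# Minimal sketch of the three steps above, assuming scikit-learn and its
# small bundled digits dataset (8x8 images); an illustration, not a recipe.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()                      # samples: 8x8 images, labels: 0-9
X, y = digits.data, digits.target           # raw pixels used directly as features

# Learning from data: fit the model's parameters on labeled examples.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Generalization: evaluate on digits the model has never seen.
print("test accuracy:", model.score(X_test, y_test))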


1.1 Understanding Machine Learning through SORA


SORA, a "world-simulating video generation model" introduced in early 2024, has gar-
nered significant attention for its ambitious claims. It represents a potential paradigm
shift in fields like computer graphics and entertainment. This section explores the tech-
nological advancements and challenges presented by SORA, fostering a discussion on the
intersection of machine learning and physical realities.

Manifold Distribution Theorem and Low-Dimensional Data The Manifold Distribution
Theorem suggests that natural datasets can be viewed as probability distribu-
tions over manifolds. Combined with the observation that data tends to occupy low-
dimensional spaces, this theorem forms the foundation for machine learning’s ability to
discover patterns in complex datasets. It emphasizes the importance of designing algo-
rithms that can uncover the underlying structure of data, improving learning efficiency
and generalization.
The low-dimensionality of data arises from the various natural laws governing our
universe. These laws limit the possible variations in data samples, preventing them from
spanning the entire space. For example, in a dataset of human faces, each image repre-
sents a point in the high-dimensional pixel space. However, only a small subset of these
points corresponds to valid human faces, constrained by physiological laws. As a result,
the manifold of human face images has a much lower dimension than the original pixel
space.

Physical Laws, Differentiability, and SORA’s Limitations SORA’s approach highlights
a fundamental challenge in machine learning: accurately capturing physical causality
and the global context of data. Relying solely on statistical correlations often fails to fully
represent the deterministic nature of physical laws, leading to outputs that may be locally
consistent but globally incoherent.
One of the key issues is the differentiability of physical laws. Many physical processes
are governed by smooth, continuous, differentiable functions, meaning that small changes
in input produce correspondingly small, locally linear changes in output. However, machine
learning models, particularly those based on neural networks, often struggle to learn and
represent such functions faithfully across their whole domain. This limitation can result
in generated outputs that violate physical constraints or exhibit unrealistic behavior.
Moreover, SORA’s approach of learning from local patterns and correlations may not
capture the global context and long-range dependencies present in physical systems.
Physical phenomena often involve complex interactions and feedback loops that span
different scales and time frames. Capturing these global dependencies requires models
that can effectively propagate information across the entire system and maintain coher-
ence over extended periods.

Critical States and Physical Paradoxes Another challenge faced by SORA and similar
machine learning models is the difficulty in simulating critical transitions between stable
states. Many physical systems exhibit phase transitions or tipping points, where a small
change in conditions can lead to a dramatic shift in behavior: for example, the transition
from water to ice, or the onset of turbulence in fluid flow.
Capturing these critical states and the associated physical paradoxes poses a signif-
icant challenge for machine learning models. The inherent stochasticity and sensitivity
to initial conditions in these systems make them difficult to model accurately. Machine
learning algorithms, which rely on learning from data, may struggle to capture the un-
derlying dynamics and predict the occurrence of critical transitions.
This issue highlights the need for models that can faithfully represent the complex
interplay between deterministic and stochastic aspects of physical systems. Incorporating
prior knowledge about the governing physical laws and the characteristic behavior of
critical phenomena could help improve the ability of machine learning models to simulate
these transitions more accurately.

Geometric Methods for Advancing Machine Learning To address the limitations of
current machine learning techniques, particularly in modeling physical systems, explor-
ing geometric methods holds promise. Techniques from differential geometry and opti-
mal transport theory offer powerful tools for understanding data manifolds and gaining
deeper insights into the underlying physical phenomena.
Differential geometry provides a framework for studying the intrinsic properties of
manifolds, such as curvature and geodesics, which can shed light on the structure and
dynamics of data. By incorporating geometric information into machine learning models,
it may be possible to capture the essential features of physical systems more accurately
and efficiently.
Optimal transport theory, on the other hand, focuses on the problem of finding the
most efficient way to transform one probability distribution into another. It provides a
principled approach for comparing and interpolating between distributions, which can be
useful for understanding the evolution of physical systems over time. Optimal transport-
based methods have shown promise in applications such as fluid dynamics and image
processing, suggesting their potential for enhancing machine learning’s ability to model
physical phenomena.
The integration of machine learning with fundamental mathematical principles presents
an exciting frontier, with the potential to overcome current limitations and deepen our
understanding of the natural world. By embracing these challenges and exploring new
theoretical avenues, machine learning can make significant advancements and contribute
to our comprehension of complex systems.
However, realizing the full potential of geometric methods in machine learning re-
quires further research and development. Key challenges include the scalability of ge-
ometric algorithms to high-dimensional data, the integration of geometric constraints
into existing machine learning frameworks, and the interpretation and visualization of
geometric insights in a way that is accessible to domain experts.
Despite these challenges, the fusion of machine learning and geometry holds great
promise for advancing our understanding of the physical world. By leveraging the strengths
of both fields, we can develop more accurate, interpretable, and robust models that cap-
ture the essential features of complex systems. This interdisciplinary approach has the

potential to unlock new insights and drive innovation across a wide range of applications,
from scientific simulations to engineering design and beyond.

1.2 Basic Concepts and Elements


Machine learning is founded on several key concepts, including samples, features, labels,
models, and learning algorithms. These elements form the backbone of the machine
learning process, enabling computers to extract meaningful patterns from data and make
predictions about new, unseen instances.

1.2.1 Understanding the Elements


Samples and Features In machine learning, a sample (or instance) represents an in-
dividual unit of observation, such as a single data point in a dataset. Each sample is
characterized by a set of features, which are measurable properties or attributes of the
sample. Features can be quantitative, such as numerical values representing size, weight,
or intensity, or qualitative, such as categorical variables representing color, type, or loca-
tion. The selection and engineering of informative and discriminative features are crucial
steps in the machine learning pipeline, as they directly influence the model’s ability to
learn and make accurate predictions. Effective feature representation can capture the
underlying patterns and relationships in the data, facilitating the learning process.

Labels Labels are the target values or desired outputs that a model aims to predict. In
supervised learning tasks, labels are provided along with the input features during the
training phase. The nature of the labels depends on the type of problem being addressed.
For regression tasks, labels are continuous values, such as predicting house prices based
on features like area, number of rooms, and location. For classification tasks, labels are
discrete categories, such as classifying emails as spam or non-spam based on features like
word frequencies and sender information. Labels serve as the ground truth against which
the model’s predictions are compared and optimized, allowing the model to learn the
underlying patterns and relationships between the input features and the target outputs.

Models and Learning Algorithms A model in machine learning is a mathematical or
computational representation of the relationship between the input features and the target
labels. It encapsulates the learned patterns and can be used to make predictions on
new, unseen data. The choice of model depends on the nature of the problem, the type
of data, and the desired level of complexity. Common types of models include linear
models, decision trees, support vector machines, and neural networks, each with its own
assumptions, strengths, and limitations.
Learning algorithms are the methods used to train the model on the available data.
They adjust the model’s internal parameters or weights to minimize the discrepancy be-
tween the model’s predictions and the true labels. This process is typically formulated as

an optimization problem, where the objective is to find the set of parameters that mini-
mizes a predefined loss function. Popular learning algorithms include gradient descent,
stochastic gradient descent, and backpropagation (for training neural networks). The
choice of learning algorithm depends on the model architecture, the size of the dataset,
and the computational resources available.

Learning or Training The process of learning or training involves using a learning al-
gorithm to find the optimal model parameters based on the given training data. The
training data consists of a set of input-label pairs, where each pair represents a sample
and its corresponding target value. During training, the learning algorithm iteratively
updates the model’s parameters to minimize the loss function, which quantifies the dif-
ference between the predicted labels and the true labels. The assumption is that the
training samples are independently and identically distributed (i.i.d.), meaning that they
are drawn from the same underlying probability distribution and are mutually indepen-
dent.
The goal of training is to find a model that can accurately predict the labels for new,
unseen samples. This is typically achieved by optimizing the model’s parameters to min-
imize the empirical risk, which is the average loss over the training set. However, min-
imizing the empirical risk alone may lead to overfitting, where the model learns to fit
the noise and peculiarities of the training data, resulting in poor generalization to new
data. To mitigate overfitting, various regularization techniques can be employed, such
as adding penalty terms to the loss function or using techniques like early stopping or
dropout.

1.2.2 The Goal of Machine Learning


The ultimate goal of machine learning is to develop models that can generalize well to
new, unseen data. Generalization refers to the model’s ability to make accurate predic-
tions on samples that were not part of the training set. Achieving good generalization
requires careful consideration of several factors:

• Model Complexity: The complexity of the model should be appropriate for the
task at hand. Overly simple models may underfit the data, failing to capture the
underlying patterns, while overly complex models may overfit the data, memorizing
noise and peculiarities of the training set. Striking the right balance between model
complexity and generalization is crucial. Techniques like regularization and cross-
validation can help in controlling model complexity and selecting the appropriate
level of complexity for a given problem.

• Representativeness of Training Data: The training data should be representative
of the problem domain and the distribution of unseen data. If the training data
is biased or lacks diversity, the model may not generalize well to new instances.
Techniques like stratified sampling, where the data is split into subsets that
maintain the overall class distribution, can help ensure the representativeness of
the training data (see the sketch after this list). Data augmentation techniques,
such as applying random transformations to the training samples, can also help in
increasing the diversity and representativeness of the training set.

• Choice of Learning Algorithm: Different learning algorithms have different inductive
biases and assumptions about the data. Some algorithms may be more suitable
for certain types of problems or data distributions than others. Selecting an ap-
propriate learning algorithm that aligns with the characteristics of the problem and
the data can improve generalization. For example, decision trees may work well
for problems with categorical features and clear decision boundaries, while neural
networks may be more suitable for problems with complex, non-linear relationships
between the features and the labels.

• Evaluation and Validation: Evaluating the model’s performance on an independent
test set provides an estimate of its generalization ability. However, using a
single test set may not provide a reliable estimate, especially if the test set is small
or not representative of the true data distribution. Techniques like k-fold cross-
validation, where the data is split into k subsets and the model is trained and evalu-
ated k times using different subsets as the validation set, can provide a more robust
estimate of the model’s generalization performance. Hold-out validation, where a
separate validation set is used for model selection and hyperparameter tuning, can
also help in assessing the model’s performance on unseen data.
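As a small sketch of the stratified sampling mentioned above, assuming scikit-learn and toy
data, the stratify argument preserves class proportions across a split:

# Stratified splitting keeps class proportions in both subsets (a sketch).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)            # 10 toy samples, 2 features
y = np.array([0] * 8 + [1] * 2)             # imbalanced labels (80% / 20%)

# stratify=y keeps the 80/20 class ratio in both the training and test halves.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)
print(np.bincount(y_tr), np.bincount(y_te))  # each split keeps the 4:1 ratio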

Through iterative refinement and validation, machine learning seeks to bridge the
gap between empirical data and theoretical understanding, enabling computers to make
predictions and decisions with increasing accuracy and confidence. By leveraging the
power of data and algorithms, machine learning aims to uncover patterns, extract in-
sights, and solve complex problems across various domains, from image recognition and
natural language processing to recommender systems and autonomous vehicles.

1.2.3 Evaluating Performance


Once a model is trained, its performance is evaluated on a test set, which is assumed to
be independent and identically distributed (i.i.d.) with respect to the training set. The
test set should consist of samples that were not used during the training process, allowing
for an unbiased assessment of the model’s generalization ability.
Various performance metrics can be used to evaluate the model’s predictions, depending
on the type of problem and the nature of the labels; a small computational sketch of several
of these metrics follows the lists below. For classification tasks, common metrics include:

• Accuracy: The proportion of correctly classified samples out of the total number of
samples in the test set.

• Precision: The proportion of true positive predictions among all positive predic-
tions.

• Recall: The proportion of true positive predictions among all actual positive sam-
ples.

• F1-score: The harmonic mean of precision and recall, providing a balanced mea-
sure of the model’s performance.

• Area Under the Receiver Operating Characteristic (ROC) Curve (AUC-ROC):
A measure of the model’s ability to discriminate between positive and negative
samples, independent of the classification threshold.

For regression tasks, common metrics include:

• Mean Squared Error (MSE): The average of the squared differences between the
predicted and true values.

• Mean Absolute Error (MAE): The average of the absolute differences between the
predicted and true values.

• R-squared (R²): The proportion of the variance in the target variable that is pre-
dictable from the input features.
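The following sketch computes several of these metrics by hand with NumPy on hypothetical
predictions; libraries such as scikit-learn provide equivalent ready-made functions.

# Hand-computed versions of several metrics above (a sketch, assuming NumPy).
import numpy as np

# Classification: compare predicted and true class labels.
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])
tp = np.sum((y_pred == 1) & (y_true == 1))        # true positives
accuracy = np.mean(y_pred == y_true)
precision = tp / np.sum(y_pred == 1)
recall = tp / np.sum(y_true == 1)
f1 = 2 * precision * recall / (precision + recall)

# Regression: compare predicted and true continuous values.
t = np.array([3.0, 5.0, 2.5])
p = np.array([2.5, 5.0, 3.0])
mse = np.mean((t - p) ** 2)
mae = np.mean(np.abs(t - p))
r2 = 1 - np.sum((t - p) ** 2) / np.sum((t - t.mean()) ** 2)
print(accuracy, precision, recall, f1, mse, mae, r2)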

It is important to note that the performance on the test set is an estimate of the
model’s generalization ability, and it may not always reflect its performance in real-world
scenarios. Factors such as data drift, where the distribution of the data changes over
time, or the presence of outliers or noisy samples can affect the model’s performance
in practice. Therefore, it is crucial to continuously monitor and validate the model’s
performance on new, unseen data to ensure its robustness and reliability over time.
In summary, understanding the basic concepts and elements of machine learning,
including samples, features, labels, models, and learning algorithms, is essential for de-
veloping effective machine learning solutions. By carefully considering factors such as
model complexity, data representativeness, choice of learning algorithm, and evaluation
metrics, practitioners can design and implement machine learning systems that can learn
from data, generalize to new instances, and drive innovation across various domains.

1.3 The Three Elements of Machine Learning


Machine learning is fundamentally about extracting generalizable patterns from limited
observational data and applying these patterns to make predictions on unseen data. This
process hinges on three critical elements: the model, the learning criterion, and the
optimization algorithm.

1.3.1 Model
The first step in any machine learning endeavor is to define the input space X and the
output space Y, which together form the framework for the problem at hand. The nature

of the output space determines the type of machine learning task: binary classification
(Y = {+1, −1}), multi-class classification (Y = {1, 2, . . . , C}), or regression (Y = R).
The input space X, often referred to as the feature space, and the output space Y
constitute the sample space. Individual observations are represented as (x, y) ∈ X × Y.
It is assumed that there exists an unknown but true mapping function y = g(x) or a
true conditional probability distribution p_r(y|x) that accurately captures the relationship
between x and y. The goal of machine learning is to approximate this true mapping
g : X → Y or the distribution p_r(y|x) as closely as possible.
Given the unknown nature of g(x) or p_r(y|x), we propose a hypothesis space F, a set
of candidate functions from which we aim to select the most suitable hypothesis f∗ ∈ F
based on empirical evidence from a training set D.
This hypothesis space F is typically represented as a parametrized function family:

F = { f(x; θ) | θ ∈ Θ ⊆ R^D },  (2.5)

where each function f(x; θ) is parameterized by θ, representing the model, with D
denoting the dimensionality of the parameter space.
Models can be categorized as linear or nonlinear, depending on the structure of the
hypothesis space.

Linear Models A linear model assumes a direct linear relationship between the input
features and the output:

f(x; θ) = w⊤x + b,  (2.6)
where θ consists of the weight vector w and the bias term b. Linear models form
the basis for many classification and regression tasks, providing a simple yet effective
approach to learning linear relationships.

Nonlinear Models Nonlinear models extend linear models by incorporating nonlinear
transformations of the input features:

f(x; θ) = w⊤φ(x) + b,  (2.7)


where φ(x) represents a vector of nonlinear basis functions that transform the input
space into a higher-dimensional feature space. This allows for the learning of complex
patterns beyond linear separability.
In models where φ(x) includes learnable parameters, such as neural networks, we
have:
φ_k(x) = h( w_k⊤ φ′(x) + b_k ),  ∀ 1 ≤ k ≤ K,  (2.8)
where h(·) is a nonlinear activation function. This makes f (x; θ) resemble the archi-
tecture of a neural network, capable of capturing highly intricate relationships between
input and output spaces through multiple layers of computation.
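To ground Eq. (2.7), here is a minimal sketch with a hand-chosen, fixed basis φ(x); the basis
functions and weight values are hypothetical, assuming NumPy:

# A sketch of Eq. (2.7) with a fixed nonlinear basis phi(x), assuming NumPy.
import numpy as np

def phi(x):
    """Map a scalar input to a vector of nonlinear basis functions."""
    return np.array([x, x**2, x**3, np.sin(x)])

w = np.array([0.5, -0.2, 0.01, 1.0])   # weight vector (illustrative values)
b = 0.3                                 # bias term

def f(x):
    # f(x; theta) = w^T phi(x) + b  -- linear in w, nonlinear in x.
    return w @ phi(x) + b

print(f(2.0))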

Nonlinear models offer greater flexibility and expressiveness compared to linear mod-
els, enabling them to capture intricate patterns and relationships in the data. However,
this increased complexity comes at the cost of higher computational requirements and
potentially reduced interpretability. Nonlinear models are also more prone to overfitting,
especially when the number of parameters is large relative to the size of the training set.
The choice between linear and nonlinear models depends on the characteristics of the
problem, the complexity of the underlying relationships, and the available computational
resources. In practice, it is common to start with simpler models and gradually increase
complexity as needed, guided by performance metrics and domain knowledge.

1.3.2 Learning Criteria


The foundation of machine learning relies on the assumption that we can learn general
rules from a limited set of observational data, D = {(x^(n), y^(n))}_{n=1}^{N}, consisting
of N samples that are independent and identically distributed (IID). These samples are
drawn from a joint space (X × Y) under an unknown but fixed distribution p_r(x, y). The
constancy of p_r(x, y) over time is crucial; any fluctuation would compromise the integrity
of learning.
A competent model, f(x; θ∗), should closely approximate the true mapping function
y = g(x) or align with the true conditional probability distribution p_r(y|x), implying

|f(x; θ∗) − y| < ϵ,  ∀(x, y) ∈ X × Y,  (2.9)


or, equivalently, for the model’s probability predictions,

|f_y(x; θ∗) − p_r(y|x)| < ϵ,  ∀(x, y) ∈ X × Y,  (2.10)


where ϵ is a small positive threshold, ensuring model predictions are within an ac-
ceptable margin of error from actual outcomes or the true probability distributions.
The discrepancy between the model’s predictions and the actual labels is quantified
through loss functions, which are non-negative real-valued metrics. Several common loss
functions include:

• 0-1 Loss Function: Directly measures the error rate, defined as

L(y, f(x; θ)) = I(y ≠ f(x; θ)),  (2.13)

where I(·) is the indicator function. Although intuitively appealing, the 0-1 loss
function is discontinuous and not differentiable, making optimization challenging.

• Quadratic Loss Function: Used for regression problems, measuring the squared
difference between predictions and true values,

L(y, f(x; θ)) = (1/2) (y − f(x; θ))².  (2.14)

• Cross-Entropy Loss Function: Commonly used in classification tasks, measuring
the divergence between the true and predicted probability distributions,

L(y, f(x; θ)) = − Σ_{c=1}^{C} y_c log f_c(x; θ),  (2.18)

where y is a one-hot vector of the true labels, and f_c(x; θ) represents the model’s
predicted probability for class c.

• Hinge Loss Function: Used for binary classification, encouraging a margin of safety
in predictions,

L(y, f(x; θ)) = max(0, 1 − y · f(x; θ)),  (2.20)

promoting not only correct classification but also confidence in the decision.

The choice of loss function significantly influences the model’s learning dynamics and
performance. The loss function should be selected based on the nature of the problem
(regression or classification) and the desired properties of the solution (e.g., robustness,
sparsity, or interpretability).
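The four loss functions above translate directly into code. The following NumPy sketch is a
plain transcription of Eqs. (2.13)–(2.20), not an optimized implementation:

# Direct NumPy transcriptions of the four loss functions above (a sketch).
import numpy as np

def zero_one_loss(y, f_x):                 # Eq. (2.13)
    return float(y != f_x)

def quadratic_loss(y, f_x):                # Eq. (2.14)
    return 0.5 * (y - f_x) ** 2

def cross_entropy_loss(y_onehot, probs):   # Eq. (2.18)
    return -np.sum(y_onehot * np.log(probs))

def hinge_loss(y, f_x):                    # Eq. (2.20), y in {+1, -1}
    return max(0.0, 1.0 - y * f_x)

print(cross_entropy_loss(np.array([0, 1, 0]), np.array([0.2, 0.7, 0.1])))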

Risk Minimization Criterion The primary goal in machine learning is to develop a
model, f(x; θ), that accurately predicts outcomes for unseen data. This requires estimating
the model’s performance based on the available data, despite the true data distribution,
p_r(x, y), being unknown. We achieve this through the concept of risk minimization.
Given a training dataset D = {(x^(n), y^(n))}_{n=1}^{N}, we define the empirical risk as
the average loss incurred by the model on these training samples:

R_D^emp(θ) = (1/N) Σ_{n=1}^{N} L(y^(n), f(x^(n); θ)).  (2.22)
The task then becomes an Empirical Risk Minimization (ERM) problem, where we
seek parameters θ∗ that minimize this empirical risk:

θ∗ = arg min_θ R_D^emp(θ).  (2.23)

ERM is a fundamental principle in machine learning, as it provides a practical approach
to learning from finite data samples. By minimizing the empirical risk, we aim
to find a model that performs well on the training data, with the hope that it will also
generalize well to unseen data.
However, ERM alone does not guarantee good generalization performance. Overfit-
ting and underfitting are two common pitfalls that can occur when the model’s complex-
ity is not properly controlled.

Understanding Overfitting and Underfitting Overfitting occurs when a model learns the
noise and idiosyncrasies of the training data too well, to the point that it negatively
impacts the model’s performance on new data. An overfit model has high complexity
and low bias but high variance, meaning it fits the training data very closely but fails to
generalize well to unseen data.
On the other hand, underfitting happens when the model is too simple to capture
the underlying structure of the data, resulting in poor performance on both the training
data and new, unseen data. An underfit model has low complexity and high bias but low
variance, meaning it fails to capture the relevant patterns in the data and has limited
predictive power.

Definition 1.1 (Overfitting) A model is said to overfit the training data if it performs sig-
nificantly better on the training data than on any other dataset drawn from the same distri-
bution, indicating a loss of generalization ability.

To mitigate overfitting and promote generalization, regularization techniques are applied.
This approach, known as Structural Risk Minimization (SRM), adds a complexity penalty
to the empirical risk:

θ∗ = arg min_θ [ R_D^emp(θ) + (1/2) λ∥θ∥² ],  (1.1)

where λ is a regularization parameter that balances the model’s fit to the data against
its complexity, and ∥θ∥² is the squared L2 norm of the model parameters, which serves as
a measure of the model’s complexity.
Regularization helps to prevent overfitting by constraining the model’s complexity
and encouraging simpler, more generalizable solutions. The regularization parameter λ
controls the trade-off between fitting the training data and keeping the model simple. A
higher value of λ places more emphasis on simplicity, while a lower value allows for more
complex models.
Figure 1.1 visually summarizes the concepts of underfitting, optimal fit, and overfit-
ting. It emphasizes the delicate balance required in machine learning to achieve models
that are complex enough to capture the essential patterns in the training data without
being overly complex and capturing noise, thereby ensuring effective generalization to
new data.
In summary, the learning criteria in machine learning involve defining an appropriate
loss function to quantify the discrepancy between the model’s predictions and the true la-
bels, and then minimizing the empirical risk (ERM) or the regularized risk (SRM) to find
the optimal model parameters. The choice of loss function and regularization technique
depends on the nature of the problem and the desired properties of the solution. Strik-
ing the right balance between model complexity and generalization ability is crucial for
developing models that can learn meaningful patterns from the training data and make
accurate predictions on unseen data.

Figure 1.1: Illustrative examples of underfitting, good fit, and overfitting, demonstrating
the balance required between model complexity and generalization capability.

1.3.3 Optimization Algorithms


Optimization algorithms play a crucial role in machine learning, enabling the identifica-
tion of optimal model parameters that minimize the loss function given a specific training
set, hypothesis space, and learning criterion. The essence of training a machine learning
model lies in effectively solving this optimization problem.

Gradient Descent (GD) Gradient Descent is a fundamental optimization technique that
iteratively refines the parameters θ to minimize the loss function. The process iterates as
follows:

θ_{t+1} = θ_t − α ∇_θ R_D(θ_t),  (1.2)
where α represents the learning rate, controlling the step size towards the minimum of
the loss function. While robust, gradient descent can be computationally inefficient for
large-scale data due to the need to compute gradients across the entire dataset. This has
led to the development of more efficient variants, such as stochastic gradient descent and
mini-batch gradient descent.

Stochastic Gradient Descent (SGD) Stochastic Gradient Descent improves upon tra-
ditional gradient descent by updating parameters using the gradient computed from a
randomly selected sample or a small subset of the training data:
θ_{t+1} = θ_t − α ∇_θ L(y^(i), f(x^(i); θ_t)),  (1.3)

where i represents the index of the selected sample, and ∇_θ L(y^(i), f(x^(i); θ_t)) denotes
the gradient of the loss function with respect to the parameters θ for the i-th sample at
iteration t.
This approach significantly reduces the computational cost per iteration and is effective
at navigating complex, non-convex optimization landscapes. The randomness SGD introduces
into the optimization process can help it escape local minima and saddle points.

Algorithm 1 Stochastic Gradient Descent

Require: Training set D, learning rate α
 1: Initialize θ randomly
 2: repeat
 3:   Select (x^(i), y^(i)) randomly from D
 4:   Compute the gradient ∇_θ L(y^(i), f(x^(i); θ))
 5:   Update θ: θ ← θ − α ∇_θ L(y^(i), f(x^(i); θ))
 6: until the convergence criterion is met
Ensure: Optimized θ
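A direct NumPy transcription of Algorithm 1 for linear regression with the quadratic loss
might look as follows; the fixed iteration budget standing in for the open convergence
criterion is a simplifying assumption of the sketch:

# A NumPy sketch of Algorithm 1 for linear regression with squared loss.
import numpy as np

def sgd(X, y, alpha=0.01, epochs=100, seed=0):
    """X: (N, D) samples, y: (N,) targets. Returns learned theta (D,)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    theta = rng.normal(size=D)                 # 1: initialize randomly
    for _ in range(epochs):                    # 2: repeat (fixed budget here)
        i = rng.integers(N)                    # 3: pick one random sample
        residual = X[i] @ theta - y[i]
        grad = residual * X[i]                 # 4: gradient of (1/2)(y - x^T theta)^2
        theta -= alpha * grad                  # 5: update
    return theta

X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])   # last column = bias feature
print(sgd(X, np.array([3.0, 5.0, 7.0]), epochs=5000))  # approaches [2, 1]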

Mini-Batch Gradient Descent Mini-Batch Gradient Descent strikes a balance between
the computational efficiency of SGD and the stability of full-batch gradient descent by
using small subsets of the training data, called mini-batches, for parameter updates:

θ_{t+1} = θ_t − α ∇_θ (1/m) Σ_{i=1}^{m} L(y^(i), f(x^(i); θ_t)),  (1.4)

where m denotes the mini-batch size.


This approach reduces the variance in the parameter updates compared to SGD while
being more computationally efficient than using the entire dataset. Mini-batch gradient
descent allows for a trade-off between the speed of convergence and the stability of the
optimization process.

Early Stopping Early Stopping is a practical technique to prevent overfitting by stopping
the training process when the performance on a validation set starts to degrade.
It involves monitoring the validation loss and halting training when this loss begins to
increase, indicating that the model is starting to memorize noise rather than learning
generalizable patterns.
Early stopping can be seen as a form of regularization, as it limits the model’s capacity
to overfit the training data by restricting the number of training iterations. The optimal
stopping point is determined by evaluating the model’s performance on a separate vali-
dation set, which is not used for training.

Hyperparameter Tuning
The effectiveness of optimization algorithms heavily depends on the choice of hyperpa-
rameters, such as the learning rate α and the mini-batch size m in Mini-Batch Gradient
Descent. Tuning these hyperparameters often requires careful experimentation or ad-
vanced optimization techniques to achieve optimal training performance.

Common approaches to hyperparameter tuning include:

• Grid Search: Exhaustively searching through a predefined set of hyperparameter
values to find the best combination.

• Random Search: Randomly sampling hyperparameter values from a specified
range or distribution.

• Bayesian Optimization: Adaptively selecting hyperparameter values based on a
probabilistic model of the objective function.

• Gradient-Based Optimization: Treating hyperparameters as learnable parameters
and optimizing them using gradient descent alongside the model parameters.

The choice of hyperparameter tuning method depends on the complexity of the model,
the size of the search space, and the available computational resources. Efficient hyper-
parameter tuning can significantly improve the performance and generalization ability of
machine learning models.
In summary, optimization algorithms are essential for training machine learning mod-
els by minimizing the chosen loss function and finding the optimal model parameters.
Gradient descent and its variants, such as stochastic gradient descent and mini-batch
gradient descent, are widely used optimization techniques that iteratively update the
parameters based on the gradients of the loss function. Early stopping and hyperparam-
eter tuning are important practices that help prevent overfitting and improve the model’s
generalization performance.
As machine learning models become more complex and datasets grow larger, the de-
velopment of efficient and scalable optimization algorithms remains an active area of
research. Techniques like momentum, adaptive learning rates, and second-order opti-
mization methods have been proposed to accelerate convergence and improve the ro-
bustness of the optimization process. Additionally, distributed and parallel optimization
algorithms have been developed to handle large-scale datasets and take advantage of
modern computing architectures.
Understanding the principles and practical considerations of optimization algorithms
is crucial for successfully training machine learning models and achieving state-of-the-art
performance on real-world tasks. By carefully selecting and tuning the optimization algo-
rithm, practitioners can effectively navigate the complex landscape of model parameters
and find solutions that generalize well to unseen data.
In conclusion, a solid grasp of the three elements of machine learning (model, learning
criteria, and optimization algorithms) is essential for anyone seeking to harness the power
of this transformative technology. By understanding the strengths and limitations of each
element and how they interact, practitioners can design and implement machine learning
solutions that are both effective and efficient. As the field continues to advance, staying
up-to-date with the latest developments in these areas will be crucial for staying at the
forefront of this exciting and rapidly evolving discipline.

1.4 A Simple Example of Machine Learning – Linear Regression

Linear Regression is a foundational model in both machine learning and statistics, pro-
viding a framework for understanding the relationship between a set of independent
variables and a dependent variable. This model is known for its simplicity and wide ap-
plicability, making it an ideal starting point for exploring the general process of machine
learning and the interplay among various learning criteria, including Empirical Risk Min-
imization, Structural Risk Minimization, Maximum Likelihood Estimation, and Maximum
A Posteriori Estimation.
Linear Regression can be categorized based on the number of independent variables
involved: simple regression when there is a single independent variable, and multiple
regression when there are multiple independent variables.
In the context of machine learning, the independent variables are represented as feature
vectors x ∈ R^D, where D is the dimensionality of the feature space, corresponding to
the number of independent variables. The dependent variable, or the label y, is a
continuous value, y ∈ R. The goal is to model the relationship between x and y through
a set of parameterized linear functions defined as:

f(x; w, b) = w⊤x + b,  (2.30)

where w ∈ R^D represents the weight vector, and b ∈ R signifies the bias. This
formulation encapsulates the linear model f(x; w, b) ∈ R.
To streamline notation and simplify further analysis, we introduce augmented vectors
for both the weights and features:

x̂ = x ⊕ 1 = [x_1, ⋯, x_D, 1]⊤,  (2.32)

ŵ = w ⊕ b = [w_1, ⋯, w_D, b]⊤,  (2.33)

where ⊕ denotes vector concatenation. This allows us to express the linear regression
model succinctly as f (x; w) = w⊤ x, using augmented vectors for simplicity. Henceforth,
w and x will refer to these augmented vectors, simplifying expressions and computations
within the linear regression framework.
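A two-line check (assuming NumPy; the numbers are arbitrary) confirms that the augmented
inner product ŵ⊤x̂ reproduces w⊤x + b exactly:

# A sketch of the augmented notation in Eqs. (2.32)-(2.33), assuming NumPy.
import numpy as np

x = np.array([0.5, -1.2, 3.0])      # original features, D = 3
w = np.array([2.0, 0.1, -0.7])      # weights
b = 4.0                              # bias

x_hat = np.append(x, 1.0)            # x concatenated with 1
w_hat = np.append(w, b)              # w concatenated with b

# The augmented inner product reproduces w^T x + b exactly.
assert np.isclose(w_hat @ x_hat, w @ x + b)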

1.4.1 Parameter Learning


The objective in parameter learning for the linear regression model is to identify the
optimal set of model parameters w that best fit the training data D = {(x^(n),
y^(n))}_{n=1}^{N}, comprising N samples. We explore four principal methods of parameter
estimation: Empirical Risk Minimization (ERM), Structural Risk Minimization (SRM),
Maximum Likelihood Estimation (MLE), and Maximum A Posteriori Estimation (MAP).

Empirical Risk Minimization (ERM) In linear regression, where both the predicted
outputs and the true labels are continuous, the squared loss function effectively quantifies
the discrepancy between predictions and actual values. Under the ERM framework, the
empirical risk is defined as the sum of squared losses across all training samples:

R(w) = Σ_{n=1}^{N} L(y^(n), f(x^(n); w))
     = (1/2) Σ_{n=1}^{N} (y^(n) − w⊤x^(n))²
     = (1/2) ∥y − X⊤w∥²,
omitting the constant 1/N for brevity. Here, y = [y^(1), ⋯, y^(N)]⊤ is the vector of true
labels, and X, the matrix of (augmented) input features whose n-th column is x^(n), is
structured as:

X = ⎡ x_1^(1)   x_1^(2)   ⋯   x_1^(N) ⎤
    ⎢    ⋮         ⋮      ⋱      ⋮    ⎥
    ⎢ x_D^(1)   x_D^(2)   ⋯   x_D^(N) ⎥
    ⎣    1         1      ⋯      1    ⎦ .
The risk function R(w) is convex in w. By setting its gradient with respect to w to
zero, we derive the optimal parameters w∗ as:

w∗ = (X X⊤)^(−1) X y,

known as the solution via the Least Squares Method (LSM). This approach is visually
depicted in Figure 1.2 for an intuitive understanding of linear regression parameter
learning.

Figure 1.2: Illustrative Example of the Least Squares Method.
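The closed-form solution can be computed directly. The sketch below (assuming NumPy, with
synthetic data) follows the chapter's convention that X stores one augmented sample per
column; np.linalg.solve is used instead of an explicit inverse for numerical stability:

# The least-squares solution w* = (X X^T)^(-1) X y, a sketch on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.uniform(-1, 1, size=N)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=N)   # true w = 2, b = 1, plus noise

X = np.vstack([x, np.ones(N)])                  # augmented features, shape (2, N)
# Solving the normal equations (X X^T) w = X y directly.
w_star = np.linalg.solve(X @ X.T, X @ y)
print(w_star)                                   # approximately [2.0, 1.0]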

Structural Risk Minimization (SRM) SRM tackles the critical issue of overfitting in lin-
ear regression by integrating a regularization term into the loss function. This approach
not only aims to minimize empirical risk but also constrains model complexity, thereby
enhancing the model’s ability to generalize to unseen data.
A commonly employed variant within this framework is Ridge Regression, which
incorporates regularization directly into the linear regression model. The regularization
term specifically targets large coefficients, thus mitigating multicollinearity and bolstering
the model’s stability and predictive performance on new datasets.
The Ridge Regression solution is mathematically represented as:
w∗ = (X X⊤ + λI)^(−1) X y,  (2.43)

where λ > 0 denotes the regularization strength, and I is the identity matrix. This
formulation guarantees the invertibility of the matrix XX ⊤ + λI, ensuring a solution
even in the presence of linearly dependent features.
The Ridge Regression objective function, inclusive of the regularization component,
is defined as:

R(w) = (1/2) ∥y − X⊤w∥² + (1/2) λ∥w∥².  (2.44)

The equation comprises two terms: the Residual Sum of Squares (RSS) and the
regularization penalty. The latter, (1/2) λ∥w∥², actively encourages smaller coefficient
magnitudes, effectively diminishing model complexity and aiding in overfitting prevention.
By judiciously balancing empirical risk minimization with model complexity control,
the SRM principle significantly contributes to achieving improved generalization capabil-
ities, ensuring the model remains robust and effective across new, unseen data scenarios.
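A sketch of Eq. (2.43) on synthetic data (assuming NumPy; the value of λ is illustrative)
shows that ridge regression is a one-line change to the normal equations:

# Ridge regression, Eq. (2.43): the normal equations with lambda*I added
# to guarantee invertibility (a sketch on synthetic data).
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.uniform(-1, 1, size=N)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=N)
X = np.vstack([x, np.ones(N)])                 # shape (D+1, N) = (2, N)

lam = 0.1
w_ridge = np.linalg.solve(X @ X.T + lam * np.eye(2), X @ y)
print(w_ridge)   # coefficients shrunk slightly toward zero vs. lam = 0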

Maximum Likelihood Estimation (MLE) MLE is a fundamental approach in both statistics
and machine learning for parameter estimation, particularly effective in scenarios
where the relationship between feature vectors x and labels y is governed by proba-
bilistic models. Unlike methods that assume a deterministic function y = h(x), MLE
deals with models where the conditional probability p(y|x) is subject to an underlying,
unknown distribution.
Consider the scenario in linear regression where the label y is modeled as the output
of a linear function f (x; w) = w⊤ x, perturbed by random noise ϵ:

y = f (x; w) + ϵ, (2.45)

with ϵ adhering to a Gaussian distribution with mean zero and variance σ². This
assumption implies that y itself is normally distributed with mean w⊤x and the same
variance σ²:

p(y|x; w, σ) = N(y; w⊤x, σ²)  (1.5)
            = (1/(√(2π) σ)) exp( −(y − w⊤x)² / (2σ²) ).  (2.48)

Given the training set D, the likelihood of observing the dataset for a specific param-
eter set w, assuming independence among samples, is:

p(y|X; w, σ) = Π_{n=1}^{N} p(y^(n)|x^(n); w, σ)  (1.6)
            = Π_{n=1}^{N} N(y^(n); w⊤x^(n), σ²).  (2.50)

To facilitate analysis and optimization, we consider the log-likelihood:

log p(y|X; w, σ) = Σ_{n=1}^{N} log N(y^(n); w⊤x^(n), σ²).  (2.51)

MLE aims to identify the parameter set w that maximizes this log-likelihood, thereby
most likely explaining the observed data. Solving for w by setting the derivative of the
log-likelihood to zero reveals:
w_ML = (X X⊤)^(−1) X y,  (2.52)

demonstrating that the MLE solution coincides with that derived via the Least Squares
Method. This convergence underscores the Least Squares Method’s probabilistic under-
pinnings when assumptions about normality in errors are made.

Maximum A Posteriori Estimation (MAP) While Maximum Likelihood Estimation (MLE)
offers a powerful framework for parameter estimation, it is susceptible to overfitting,
especially with limited training data. MAP estimation extends MLE by incorporating prior
knowledge about the parameter distribution, thereby mitigating overfitting and improv-
ing parameter accuracy.
Assuming the parameter vector w is a random vector following a prior distribution
p(w; v), it is common to model this prior as an isotropic Gaussian distribution for
simplicity:

p(w; v) = N(w; 0, v²I),  (2.53)

where v² represents the variance of each dimension. Through Bayes’ theorem, the
posterior distribution of w, given the training data X and labels y, and assuming a
variance v for the prior and σ for the likelihood, is expressed as:

p(w | X, y; v, σ) = p(w, y | X; v, σ) / ∫_w p(w, y | X; v, σ) dw  (2.54)
                 ∝ p(y | X, w; σ) p(w; v),  (2.55)

where p(y | X, w; σ) is the likelihood of w, and p(w; v) serves as the prior.


MAP estimation seeks the parameter set w that maximizes this posterior distribution.
The log-posterior, combining the log-likelihood with the log-prior, is optimized to find
w_MAP, the parameter values at the peak of the posterior density:

w_MAP = arg max_w p(y | X, w; σ) p(w; v).  (2.56)

Simplifying further, the optimization objective can be articulated as minimizing a
regularized risk, effectively blending empirical risk with model complexity control through
regularization:

log p(w | X, y; v, σ) ∝ −(1/(2σ²)) ∥y − X⊤w∥² − (1/(2v²)) w⊤w,  (2.59)
where the regularization coefficient λ = σ 2 /v 2 aligns the structural risk minimization
objective with Bayesian principles.
MAP and MLE are reflective of the Bayesian and frequentist interpretation paradigms,
respectively. As v → ∞, indicating minimal prior influence, MAP estimation converges to
MLE, illustrating a transition from a Bayesian approach to a frequentist one in the limit
of an uninformative prior.
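The correspondence λ = σ²/v² can be checked numerically. In the sketch below (assuming
NumPy; the noise levels are illustrative), gradient descent on the negative log-posterior
of Eq. (2.59) recovers the closed-form ridge solution:

# Numeric sketch of the MAP/ridge correspondence: minimizing the negative
# log-posterior (2.59) matches the ridge solution with lambda = sigma^2 / v^2.
import numpy as np

rng = np.random.default_rng(1)
N, sigma, v = 100, 0.5, 2.0
x = rng.uniform(-1, 1, size=N)
y = 2.0 * x + 1.0 + sigma * rng.normal(size=N)
X = np.vstack([x, np.ones(N)])                  # (2, N)

lam = sigma**2 / v**2
w_closed = np.linalg.solve(X @ X.T + lam * np.eye(2), X @ y)

# Gradient descent on (1/(2 sigma^2))||y - X^T w||^2 + (1/(2 v^2)) w^T w.
w = np.zeros(2)
for _ in range(5000):
    grad = (X @ (X.T @ w - y)) / sigma**2 + w / v**2
    w -= 1e-4 * grad

print(np.allclose(w, w_closed, atol=1e-3))      # True: the two solutions agree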
The linear regression example showcases the intricacies and connections among var-
ious learning criteria and parameter estimation methods. Empirical Risk Minimization,
with the squared loss function, leads to the familiar Least Squares solution. Structural
Risk Minimization introduces regularization to mitigate overfitting and enhance gener-
alization. Maximum Likelihood Estimation uncovers the probabilistic foundations of the
Least Squares approach under Gaussian noise assumptions. Finally, Maximum A Poste-
riori Estimation incorporates prior knowledge to regularize the solution and bridge the
gap between frequentist and Bayesian perspectives.
This example underscores the importance of understanding the interplay between
different learning criteria and their implications for model performance and generaliza-
tion. It also highlights the role of probabilistic modeling in machine learning, where
assumptions about the data generation process can lead to principled and interpretable
solutions.
As we delve deeper into more complex models and learning algorithms, the insights
gained from this linear regression example will serve as a foundation for understanding
the broader landscape of machine learning. The principles of risk minimization, regular-
ization, and probabilistic inference will recur in various guises, guiding the development
and analysis of more sophisticated techniques.
In the next section, we will explore the concept of generalization in more detail,
discussing the factors that influence a model’s ability to perform well on unseen data and
the techniques used to estimate and improve generalization performance. We will also
introduce the bias-variance tradeoff, a fundamental concept that underlies the balance
between model complexity and generalization ability.

1.5 Generalization and Model Selection


The ultimate goal of machine learning is to develop models that can accurately predict
outcomes for new, unseen data. This ability to generalize beyond the training set is what
distinguishes a successful model from one that merely memorizes the training examples.
In this section, we will delve into the concept of generalization, discussing the factors
that influence a model’s generalization performance and the techniques used to estimate
and improve it.

1.5.1 Generalization Error and Overfitting


Generalization error, also known as the out-of-sample error, is the expected error of a
model on new, unseen data drawn from the same distribution as the training data. It is
defined as:

R(θ) = E_{(x,y)∼p_r(x,y)} [L(y, f(x; θ))],  (3.1)

where L(·, ·) is the loss function, and p_r(x, y) is the true data distribution.
In practice, the true data distribution is unknown, making it impossible to directly
compute the generalization error. Instead, we rely on estimates based on the model’s
performance on a held-out test set or through techniques like cross-validation.
Overfitting, as discussed earlier, occurs when a model learns the noise and idiosyn-
crasies of the training data to the extent that it negatively impacts the model’s perfor-
mance on new data. Overfitting is characterized by a model that has low training error
but high generalization error.

The risk of overfitting increases with model complexity. As the number of parameters
or the flexibility of the model grows, it becomes more capable of fitting the training data
perfectly, but at the cost of learning spurious patterns that do not generalize. This high-
lights the need for controlling model complexity and finding the right balance between
bias and variance.

1.5.2 Bias-Variance Tradeoff


The bias-variance tradeoff is a fundamental concept in machine learning that character-
izes the sources of error in a model’s predictions. The tradeoff arises from the fact that
the generalization error can be decomposed into three components: bias, variance, and
irreducible noise.

• Bias: Bias refers to the error introduced by approximating a real-world problem
with a simplified model. High bias models are overly simplistic and tend to underfit
the data, missing relevant patterns and relationships.

• Variance: Variance refers to the error introduced by a model’s sensitivity to small
fluctuations in the training data. High variance models are overly complex and tend
to overfit the data, learning noise and spurious patterns.

• Irreducible Noise: Irreducible noise is the inherent uncertainty or randomness in
the data that cannot be reduced by any model. It represents the lower bound on
the achievable generalization error.

The goal of machine learning is to find the sweet spot between bias and variance that
minimizes the overall generalization error. This is often achieved through techniques like
regularization, which constrains model complexity, or by using ensemble methods that
combine multiple models to reduce variance.
Figure 1.3 illustrates the bias-variance tradeoff, showing how the generalization error
changes as a function of model complexity. As the complexity increases, the bias
decreases, but the variance increases. The optimal model complexity is found at the
minimum of the generalization error curve.

Figure 1.3: The bias-variance tradeoff and its relationship to model complexity and
generalization error.

1.5.3 Techniques for Estimating Generalization Performance


Given the importance of generalization in machine learning, it is crucial to have reliable
techniques for estimating a model’s generalization performance. Two commonly used
approaches are holdout validation and cross-validation.

Holdout Validation In holdout validation, the available data is split into three subsets:
a training set, a validation set, and a test set. The training set is used to fit the model, the
validation set is used to tune the model’s hyperparameters and assess its generalization
performance, and the test set is used for a final, unbiased evaluation of the model’s
performance.

The holdout method is computationally efficient, as the model is only trained once.
However, its estimates of generalization performance can be sensitive to the specific split
of the data, especially when the data is limited.

Cross-Validation Cross-validation is a more robust technique for estimating generalization
performance that makes better use of the available data. In k-fold cross-validation,
the data is split into k equally sized subsets, or folds. The model is trained and evaluated
k times, each time using a different fold as the validation set and the remaining k-1 folds
as the training set. The final estimate of the generalization performance is the average of
the k validation scores.
Cross-validation provides a more reliable estimate of generalization performance, as
it reduces the variability introduced by a single split of the data. It is particularly useful
when the data is limited, as it allows for a more efficient use of the available samples.
Figure 1.4 illustrates the process of 5-fold cross-validation, showing how the data is
split into folds and how the model is trained and evaluated iteratively.

Figure 1.4: The process of 5-fold cross-validation for estimating generalization
performance.
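In practice, k-fold cross-validation is only a few lines with common libraries. The sketch
below assumes scikit-learn and uses its bundled digits dataset purely as a placeholder:

# A sketch of k-fold cross-validation, assuming scikit-learn.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5 folds: train on 4, validate on the held-out fold, rotate, then average.
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())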

1.5.4 Model Selection and Hyperparameter Tuning


Model selection is the process of choosing the best model from a set of candidate models
based on their estimated generalization performance. This can involve comparing models
with different architectures, regularization techniques, or hyperparameter settings.


Hyperparameters are parameters of the learning algorithm itself, such as the learning
rate in gradient descent, the regularization strength, or the number of hidden units in
a neural network. Unlike the model parameters, which are learned from the training
data, hyperparameters are typically set before training and remain fixed throughout the
process.
Hyperparameter tuning is the process of finding the optimal hyperparameter settings
for a given model and dataset. This is typically done using a validation set or through
cross-validation, where the model is trained and evaluated multiple times with different
hyperparameter settings. The hyperparameters that yield the best generalization perfor-
mance are then selected for the final model.
Grid search and random search are two common strategies for hyperparameter tun-
ing. Grid search exhaustively evaluates all combinations of hyperparameters from a pre-
defined set of values, while random search samples hyperparameter settings from a spec-
ified distribution. Random search has been shown to be more efficient than grid search
in high-dimensional hyperparameter spaces.
More advanced techniques, such as Bayesian optimization, can further improve the
efficiency of hyperparameter tuning by intelligently searching the hyperparameter space
based on the performance of previous settings.
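As a sketch of grid search (assuming scikit-learn; the hyperparameter grid itself is
illustrative), each combination is scored by cross-validation and the best is retained:

# A sketch of grid search with cross-validation, assuming scikit-learn.
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
grid = {"C": [0.1, 1, 10], "gamma": [1e-3, 1e-2]}   # 6 combinations

# Each hyperparameter combination is scored by 5-fold cross-validation.
search = GridSearchCV(SVC(), grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)

Swapping in RandomizedSearchCV with a parameter distribution gives random search through
the same interface.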

1.5.5 Regularization Techniques

Regularization is a key technique for controlling model complexity and preventing over-
fitting. By adding a penalty term to the loss function, regularization constrains the model
parameters and encourages simpler, more generalizable solutions.

L1 and L2 Regularization L1 and L2 regularization are two common forms of regularization
that differ in the type of penalty they impose on the model parameters.
L1 regularization, also known as Lasso regression, adds the absolute values of the
model parameters to the loss function:

R(θ) = L(θ) + λ Σ_{i=1}^{D} |θ_i|,  (3.2)

where λ is the regularization strength, and D is the number of model parameters.


L1 regularization has the effect of shrinking some model parameters exactly to zero,
leading to sparse solutions. This can be useful for feature selection, as it effectively
identifies the most relevant features for the task.
L2 regularization, also known as Ridge regression, adds the squared values of the
model parameters to the loss function:
R(θ) = L(θ) + λ Σ_{i=1}^{D} θ_i²,  (3.3)

L2 regularization shrinks the model parameters towards zero, but does not force them
to be exactly zero. This can be useful for stabilizing the solution and reducing the impact
of highly correlated features.
The choice between L1 and L2 regularization depends on the specific characteristics
of the problem and the desired properties of the solution. In some cases, a combination of
both, known as Elastic Net regularization, can be used to balance the benefits of sparsity
and stability.
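The contrast between the two penalties is easy to observe empirically. The sketch below, assuming scikit-learn and a synthetic dataset in which only a few features carry signal, shows how the L1 penalty drives coefficients exactly to zero while the L2 penalty does not; the regularization strengths are arbitrary illustrative values.

```python
# Sketch contrasting L1 (Lasso) and L2 (Ridge) regularization
# on a dataset where only 3 of 20 features are informative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: lambda * sum |theta_i|
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: lambda * sum theta_i^2

# L1 yields a sparse solution; L2 only shrinks coefficients towards zero.
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0), "of 20")
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0), "of 20")
```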

Early Stopping Early stopping is a simple yet effective regularization technique that
involves monitoring the model’s performance on a validation set during training and
stopping the training process when the performance starts to degrade. By preventing the
model from overfitting to the training data, early stopping can improve generalization
performance without requiring an explicit regularization term in the loss function.
Early stopping can be seen as a form of implicit regularization, as it constrains the
model complexity by limiting the number of training iterations. The optimal stopping
point is typically determined by monitoring the validation error and selecting the model
parameters that yield the lowest error.
Figure 1.5 illustrates the concept of early stopping, showing how the training and validation errors evolve during training and how the optimal stopping point is determined.

Figure 1.5: Early stopping as a regularization technique. The optimal stopping point is determined by the lowest validation error.
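A minimal early-stopping loop might look like the following sketch; the model, patience value, and data split are illustrative assumptions, and a full implementation would also checkpoint the parameters at the best epoch.

```python
# A minimal early-stopping loop (illustrative setup).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = SGDRegressor(learning_rate="constant", eta0=1e-3, random_state=0)
best_val, best_epoch, patience, wait = np.inf, 0, 10, 0

for epoch in range(200):
    model.partial_fit(X_train, y_train)  # one pass over the training data
    val_error = mean_squared_error(y_val, model.predict(X_val))
    if val_error < best_val:
        best_val, best_epoch, wait = val_error, epoch, 0
    else:
        wait += 1
        if wait >= patience:  # validation error has not improved for
            break             # `patience` consecutive epochs: stop
# (in practice, restore the parameters saved at best_epoch here)
print("Stopped at epoch %d; best validation MSE %.2f at epoch %d"
      % (epoch, best_val, best_epoch))
```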

1.5.6 Ensemble Methods


Ensemble methods are a powerful class of techniques that combine multiple models to
improve generalization performance. By aggregating the predictions of several base mod-
els, ensemble methods can reduce the variance and bias of the final predictions, leading
to more robust and accurate results.

Bagging Bootstrap Aggregating, or Bagging, is an ensemble method that trains multiple instances of the same base model on different subsets of the training data, obtained
through random sampling with replacement. The final predictions are obtained by av-
eraging the predictions of the base models (for regression) or by majority voting (for
classification).
Bagging can significantly reduce the variance of the base models, making it partic-
ularly effective for high-variance models like decision trees. Random Forests, one of
the most popular ensemble methods, combine bagging with random feature selection to
further improve the diversity and robustness of the ensemble.
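The variance-reduction effect of bagging can be seen in a small experiment such as the following sketch, which compares a single decision tree, a bagged ensemble, and a random forest; the dataset and all hyperparameters are illustrative assumptions, using scikit-learn.

```python
# Sketch of bagging vs. a single high-variance decision tree.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
# Bagging: 50 trees, each trained on a bootstrap sample (sampling with
# replacement); predictions are combined by majority vote.
bag = BaggingClassifier(tree, n_estimators=50, random_state=0)
# Random Forest: bagging plus random feature selection at each split.
forest = RandomForestClassifier(n_estimators=50, random_state=0)

for name, clf in [("single tree", tree), ("bagging", bag),
                  ("random forest", forest)]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print("%-13s mean accuracy: %.3f" % (name, acc))
```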

Boosting Boosting is another ensemble method that combines multiple weak learners
(models that perform only slightly better than random guessing) into a strong learner.
The key idea behind boosting is to train the base models sequentially, each time focusing
on the samples that were misclassified by the previous models.
AdaBoost, short for Adaptive Boosting, is one of the most widely used boosting al-
gorithms. It assigns weights to the training samples based on their difficulty and trains
the base models to minimize the weighted error. The final predictions are obtained by a
weighted sum of the base model predictions, where the weights are determined by the
performance of each base model.
Gradient Boosting is another popular boosting algorithm that fits each new base model to the residual errors (more generally, the negative gradient of the loss) of the ensemble built so far. By iteratively fitting the residuals, Gradient Boosting can capture complex non-linear relationships and achieve high predictive performance.
Figure 1.6 illustrates the concept of ensemble methods, showing how multiple base
models are combined to produce the final predictions.

Figure 1.6: Ensemble methods combine multiple base models to improve generalization
performance.
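Both boosting variants are available in common libraries; the sketch below, assuming scikit-learn and illustrative hyperparameters, compares AdaBoost and Gradient Boosting on a synthetic classification task.

```python
# Sketch of AdaBoost and Gradient Boosting (illustrative setup).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# AdaBoost: reweights misclassified samples at each round; the final
# prediction is a weighted vote of the weak learners.
ada = AdaBoostClassifier(n_estimators=100, random_state=0)

# Gradient Boosting: each new tree is fit to the residual errors of
# the ensemble built so far.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 random_state=0)

print("AdaBoost accuracy: %.3f" % cross_val_score(ada, X, y, cv=5).mean())
print("GBM accuracy:      %.3f" % cross_val_score(gbm, X, y, cv=5).mean())
```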

1.5.7 Model Interpretation and Explainability


As machine learning models become more complex and are applied to increasingly high-stakes domains, the ability to interpret and explain their predictions has become ever more important. Model interpretation and explainability techniques aim to provide insights into how a model makes its predictions and what factors influence its decisions.

Feature Importance Feature importance is a common technique for understanding the relative contribution of each input feature to the model's predictions. It can be computed
globally, for the entire dataset, or locally, for individual predictions.
For linear models, feature importance can be directly derived from the model coef-
ficients. For more complex models, such as decision trees or neural networks, feature
importance can be estimated through techniques like permutation importance or SHAP
(Shapley Additive Explanations).
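As an example of the model-agnostic approach, the following sketch estimates permutation importance for a random forest; the dataset and number of repeats are illustrative assumptions.

```python
# Sketch of permutation importance for a fitted random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle one feature at a time and measure the drop in test accuracy;
# a large drop means the model relies heavily on that feature.
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=0)
for i in result.importances_mean.argsort()[::-1][:3]:
    print("feature %d: importance %.3f" % (i, result.importances_mean[i]))
```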

Partial Dependence Plots Partial dependence plots (PDPs) show the marginal effect of one or two features on the model's predictions, averaging out the effects of all other features. PDPs can be used to visualize the relationship between a feature and the predicted outcome, and to identify non-linear or interaction effects.
PDPs are particularly useful for understanding the behavior of complex models, such
as random forests or gradient boosting machines, where the relationship between the
features and the predictions may be difficult to interpret from the model parameters
alone.
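A PDP can be produced in a few lines; the sketch below assumes scikit-learn (whose inspection module provides a PDP utility) and matplotlib, with an illustrative synthetic dataset.

```python
# Sketch of partial dependence plots for a gradient boosting model.
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_regression(n_samples=500, n_features=5, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Marginal effect of features 0 and 1 on the prediction: for each grid
# value, the feature is fixed and predictions are averaged over the data.
PartialDependenceDisplay.from_estimator(model, X, features=[0, 1])
plt.show()
```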

Local Interpretable Model-agnostic Explanations (LIME) LIME is a technique for explaining the predictions of any machine learning model by approximating it locally with
an interpretable model, such as a linear regression or a decision tree. By perturbing the
input features and observing the effect on the model’s predictions, LIME can identify the
features that are most important for a specific prediction.
LIME explanations are presented as a weighted list of features, where the weights in-
dicate the importance of each feature for the prediction. This allows users to understand
the key factors that influenced the model’s decision for a particular instance.
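A minimal usage sketch is shown below; it assumes the third-party lime package is installed, and the dataset, model, and feature names are all illustrative.

```python
# Sketch using the third-party `lime` package (must be installed
# separately) to explain one prediction of a random forest.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(X, mode="classification",
                                 feature_names=[f"f{i}" for i in range(6)])

# Perturb the chosen instance, fit a local linear model to the black-box
# predictions, and report the features with the largest local weights.
exp = explainer.explain_instance(X[0], model.predict_proba, num_features=3)
print(exp.as_list())  # weighted list of locally important features
```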

1.6 Conclusion
In this chapter, we have explored the fundamental concepts and techniques of machine
learning, focusing on the key elements of models, learning criteria, and optimization
algorithms. We have seen how these elements work together to enable the extraction
of meaningful patterns from data and the generation of accurate predictions on unseen
instances.
Through the linear regression example, we have demonstrated the interplay between
different learning criteria, such as empirical risk minimization, structural risk minimiza-
tion, maximum likelihood estimation, and maximum a posteriori estimation. This exam-
ple has highlighted the connections between these approaches and their implications for
model performance and generalization.
We have also discussed the importance of generalization in machine learning and the
techniques used to estimate and improve generalization performance, such as holdout
validation, cross-validation, and regularization. The bias-variance tradeoff has been in-
troduced as a fundamental concept that underlies the balance between model complexity
and generalization ability.
Furthermore, we have explored advanced topics such as ensemble methods, which
combine multiple models to improve generalization performance, and model interpreta-
tion techniques, which provide insights into the behavior of complex machine learning
models.
As machine learning continues to evolve and be applied to an ever-growing range of
domains, it is essential for practitioners to have a solid understanding of these fundamen-
tal concepts and techniques. By mastering the principles of model selection, regulariza-
tion, and evaluation, and by staying up-to-date with the latest advancements in the field,
machine learning practitioners can develop models that are both accurate and robust,
and that can be applied with confidence to real-world problems.
Looking ahead, the field of machine learning is poised for continued growth and
innovation. With the increasing availability of large-scale datasets and the development
of more powerful computational resources, machine learning models are becoming more
sophisticated and capable of tackling ever-more complex tasks.
At the same time, there is a growing recognition of the importance of responsible and
ethical machine learning practices. As machine learning models are applied to sensitive
domains such as healthcare, criminal justice, and finance, it is crucial to ensure that
these models are fair, unbiased, and transparent. This has led to an increased focus
on techniques for detecting and mitigating bias in machine learning models, as well as
on developing frameworks for ensuring the accountability and explainability of these
models.
Another key trend in machine learning is the integration of domain knowledge and
expert insights into the model development process. While machine learning models
are capable of automatically extracting patterns from data, they can often benefit from
the incorporation of prior knowledge and expertise. This has led to the development
of techniques such as knowledge distillation, domain adaptation, and transfer learning,
which allow machine learning models to leverage existing knowledge and adapt to new
domains and tasks.
Finally, the field of machine learning is becoming increasingly interdisciplinary, with
researchers and practitioners from a wide range of backgrounds contributing to its de-
velopment. From computer science and statistics to psychology and neuroscience, the
insights and techniques from multiple disciplines are being brought to bear on the chal-
lenges of machine learning. This interdisciplinary approach is crucial for addressing the
complex and varied nature of real-world problems and for ensuring that machine learn-
ing models are both technically sound and socially responsible.
In conclusion, machine learning is a powerful and rapidly evolving field that has the
potential to transform a wide range of industries and domains. By understanding the
fundamental concepts and techniques of machine learning, and by staying abreast of the
latest developments in the field, practitioners can position themselves to make significant
contributions to this exciting and impactful area of research and application.
