5th Unit Answer Bank AIML
Explain nonparametric density estimation
Nonparametric Density Estimation
Density Estimation: Given a sample S = {xi}, i = 1..N, from a distribution, obtain an estimate of the density function at any point.
Parametric: Assume a parametric density family f(·|θ), e.g. N(µ, σ²), and obtain the best estimator of θ.
Advantages:
• Efficient
• Robust to noise: robust estimators can be used
Problem with parametric methods: an incorrectly specified parametric model has a bias that cannot be removed even by a large number of samples.
Nonparametric: directly obtain a good estimate of the entire density from the sample. The most famous example is the histogram.
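As a minimal sketch of the idea, assuming a synthetic Gaussian sample and an arbitrary choice of 20 bins, a histogram density estimate can be computed directly from the data (Python):

import numpy as np

# Draw a sample from an (unknown to the estimator) distribution.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=500)

# Histogram density estimate: partition the range into bins and use
# (count in bin) / (N * bin width) as the density estimate in that bin.
counts, edges = np.histogram(sample, bins=20)
density = counts / (sample.size * np.diff(edges))

def hist_density(x):
    """Histogram estimate of the density at point x."""
    idx = np.searchsorted(edges, x, side="right") - 1
    return density[idx] if 0 <= idx < len(density) else 0.0

print(hist_density(0.0))  # estimate near the centre of the sample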
Analyze the K-nearest neighbor estimator
K-Nearest Neighbors
The KNN algorithm assumes that similar things exist in close proximity. In other words,
similar things are near to each other.
There are other ways of calculating distance, and one way might be preferable depending
on the problem we are solving. However, the straight-line distance (also called the
Euclidean distance) is a popular and familiar choice.
The full KNN algorithm can be summarized as follows:
1. Load the data.
2. Initialize K to the chosen number of neighbors.
3. For each example in the data:
3.1 Calculate the distance between the query example and the current example from the data.
3.2 Add the distance and the index of the example to an ordered collection.
4. Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances.
5. Pick the first K entries from the sorted collection.
6. Get the labels of the selected K entries.
7. If regression, return the mean of the K labels.
8. If classification, return the mode of the K labels.
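A minimal from-scratch sketch of these steps, assuming Euclidean distance and a majority vote for classification (all names here are illustrative, not taken from any particular library):

import math
from collections import Counter

def euclidean(a, b):
    """Straight-line (Euclidean) distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train_X, train_y, query, k):
    # Steps 3.1-3.2: distance from the query to every training example,
    # stored together with the example's index.
    distances = [(euclidean(x, query), i) for i, x in enumerate(train_X)]
    # Step 4: sort by distance, ascending.
    distances.sort(key=lambda pair: pair[0])
    # Steps 5-6: keep the first K entries and collect their labels.
    top_k_labels = [train_y[i] for _, i in distances[:k]]
    # Step 8: majority vote (mode) for classification.
    return Counter(top_k_labels).most_common(1)[0][0]

# Tiny usage example with made-up 2-D points.
X = [(1, 1), (2, 1), (8, 9), (9, 8)]
y = ["red", "red", "green", "green"]
print(knn_classify(X, y, query=(1.5, 1.2), k=3))  # -> "red"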
To select the K that’s right for your data, we run the KNN algorithm several times with
different values of K and choose the K that reduces the number of errors we encounter
while maintaining the algorithm’s ability to accurately make predictions when it’s given
data it hasn’t seen before.
1. As we decrease the value of K to 1, our predictions become less stable. Imagine K = 1 and a query point surrounded by several red points and a single green point, where that green point happens to be the single nearest neighbor. Reasonably, we would say the query point is most likely red, but because K = 1, KNN incorrectly predicts that the query point is green.
2. Inversely, as we increase the value of K, our predictions become more stable due to majority voting / averaging, and thus more likely to be accurate (up to a certain point). Eventually, however, we begin to witness an increasing number of errors; it is at this point that we know we have pushed the value of K too far.
3. In cases where we are taking a majority vote (e.g. picking the mode in a classification
problem) among labels, we usually make K an odd number to have a tiebreaker.
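To put the selection of K into practice, a hedged sketch using scikit-learn’s KNeighborsClassifier and cross-validated accuracy might look like the following; the data set and the range of K values are arbitrary choices for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try a range of odd K values (odd to avoid ties in the majority vote)
# and keep the one with the best cross-validated accuracy.
scores = {}
for k in range(1, 22, 2):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))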
Advantages
1. The algorithm is simple and easy to implement.
2. There is no need to build a model, tune several parameters, or make additional assumptions.
3. The algorithm is versatile: it can be used for classification, regression, and search.
Disadvantages
1. The algorithm gets significantly slower as the number of examples and/or independent variables increases, because every prediction requires computing the distance to all stored training examples.
Nonparametric methods are good when you have a lot of data and no prior knowledge, and when you don’t want to worry
too much about choosing just the right features.
Nonparametric methods seek to best fit the training data in constructing the mapping function, whilst maintaining some
ability to generalize to unseen data. As such, they are able to fit a large number of functional forms.
An easy to understand nonparametric model is the k-nearest neighbors algorithm that makes predictions based on the k
most similar training patterns for a new data instance. The method does not assume anything about the form of the
mapping function other than patterns that are close are likely to have a similar output variable.
Some examples of popular nonparametric machine learning algorithms are:
k-Nearest Neighbors
Decision Trees like CART and C4.5
Support Vector Machines
Limitations of Nonparametric Machine Learning Algorithms:
More data: They require a lot more training data to estimate the mapping function.
Slower: They are a lot slower to train, as they often have far more parameters to train.
Overfitting: There is more risk of overfitting the training data, and it is harder to explain why specific predictions are made.
Nonparametric machine learning algorithms are those which do not make specific assumptions about the type of the mapping function. By not making such assumptions, they are free to learn any functional form from the training data.
The word nonparametric does not mean that the model has no parameters at all, but rather that the number and nature of the parameters are flexible and not fixed in advance. When dealing with ranked data, for example, one may turn to nonparametric modeling, in which the order of the observations carries much of the information.
An easy-to-understand nonparametric model is the k-nearest neighbors algorithm, which makes predictions for a new data instance based on the k most similar training patterns. The only assumption it makes about the data set is that the training patterns that are most similar are most likely to have a similar result.
Examples of such nonparametric algorithms include:
k-Nearest Neighbors
Decision Trees like CART and C4.5
Some differences between parametric and nonparametric models are:
1. Parametric models deal with discrete values, and nonparametric models use continuous values.
2. Parametric models are able to infer the traditional measurements associated with normal distributions, including mean, median, and mode. While some nonparametric distributions are normally oriented, often one cannot assume the data comes from a normal distribution.
4. A parametric model can predict future values using only its parameters. While nonparametric machine learning algorithms are often slower and require large amounts of data, they are rather flexible as they minimize the assumptions they make about the data.
Parameters
sampling_strategy : str, list or callable
Sampling information to sample the data set.
When str, specify the class targeted by the resampling. Note that the number of samples will not be equal in each class. Possible choices are:
'majority': resample only the majority class;
'not minority': resample all classes but the minority class;
'not majority': resample all classes but the majority class;
'all': resample all classes;
'auto': equivalent to 'not minority'.
When list, the list contains the classes targeted by the resampling.
When callable, a function taking y and returning a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples for each class.
random_state : int, RandomState instance, default=None
Control the randomization of the algorithm.
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used by np.random.
n_neighbors : int or estimator object, default=None
If int, size of the neighbourhood to consider to compute the nearest neighbors. If object, an estimator that inherits from KNeighborsMixin that will be used to find the nearest neighbors. If None, a KNeighborsClassifier with a 1-NN rule will be used.
n_seeds_S : int, default=1
Number of samples to extract in order to build the set S.
n_jobs : int, default=None
Number of CPU cores used during the cross-validation loop. None means 1 unless in
a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
Attributes
sampling_strategy_ : dict
Dictionary containing the information to sample the dataset. The keys correspond to the class labels from which to sample and the values are the number of samples to sample.
estimator_ : estimator object
The validated K-nearest neighbor estimator created from the n_neighbors parameter.
sample_indices_ : ndarray of shape (n_new_samples,)
Indices of the samples selected.
New in version 0.4.
n_features_in_ : int
Number of features in the input dataset.
New in version 0.9.
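The parameters and attributes listed above match the signature of imbalanced-learn’s CondensedNearestNeighbour under-sampler. Assuming that is the estimator being documented, a minimal usage sketch (with a synthetic imbalanced data set chosen purely for illustration) could look like this:

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import CondensedNearestNeighbour

# Imbalanced synthetic data set: roughly 90% majority class, 10% minority class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))

# 'auto' (the default) resamples all classes but the minority class;
# n_neighbors=None means a 1-NN rule is used internally.
cnn = CondensedNearestNeighbour(sampling_strategy="auto", n_neighbors=None,
                                random_state=42)
X_res, y_res = cnn.fit_resample(X, y)
print("after:", Counter(y_res))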
An animal agent has eyes, ears, and other organs for sensors and mouth, legs, wings, and so on for
actuators. A software agent receives data as sensory inputs and acts on the environment by displaying
on the screen, writing files, and sending network packets.
We use the term percept to refer to the agent’s perceptual inputs at any given instant. An agent’s choice of action at any given time depends on what it perceives or has perceived until now, but not on anything it has not perceived. The agent’s behavior is therefore described by a function that maps any given percept sequence to an action.
Specifying the agent’s response to every possible percept sequence defines the agent function, which maps any given percept sequence to an action. This function can be thought of as a very large table (infinite, in fact). It could, in principle, be constructed by trying out all possible percept sequences and specifying the agent’s response to each. In practice, this table is implemented by the agent program.
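As a toy sketch of such an agent program, the lookup table below maps hypothetical percept sequences to actions; the percepts, actions and table entries are all made up for illustration, and a real table would be far too large to enumerate:

# Toy table-driven agent program: the agent function is stored as a lookup
# table from the entire percept sequence seen so far (as a tuple) to an action.
table = {
    (("A", "Dirty"),): "Suck",
    (("A", "Clean"),): "Right",
    (("A", "Clean"), ("B", "Dirty")): "Suck",
}

percepts = []  # the percept sequence observed so far

def table_driven_agent(percept):
    percepts.append(percept)
    # Look up the action for the whole percept sequence seen so far.
    return table.get(tuple(percepts), "NoOp")

print(table_driven_agent(("A", "Clean")))  # -> "Right"
print(table_driven_agent(("B", "Dirty")))  # -> "Suck"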
The Elements of Reinforcement Learning
Beyond the agent and the environment, there are four main elements of a reinforcement learning
system: a policy, a reward, a value function, and, optionally, a model of the environment.
A policy defines the way the agent behaves at a given time. Roughly speaking, a policy is a mapping from the states of the environment to the actions the agent takes in those states. The policy can be a simple function or lookup table in the simplest cases, or it may involve complex function computations. The policy is the core of what the agent learns.
A reward defines the goal of a reinforcement learning problem. On each time step, the action of the agent results in a reward. The agent’s final objective is to maximize the total reward it receives. The reward thus distinguishes between good and bad action results for the agent. In a natural system, we might think of rewards as experiences of pleasure and pain.
The reward is the primary means of influencing the policy: if an action selected by the policy results in a low reward, then the policy can be changed to select some other action in the same situation.
Whereas the reward signal indicates what is good in an immediate sense (each action immediately results in a reward), a value function defines what is good in the long run.
The value of a state is the total amount of reward that the agent can expect to accumulate in the future if it starts from that state. Values indicate the long-term desirability of states, taking into account the states that are likely to follow and the rewards yielded by those states. Even if a state yields a low immediate reward, it can still have a high value because it is regularly followed by other states that yield higher rewards.
The interplay between rewards and values is often confusing for beginners, as one is an aggregation of the other. Rewards are primary and immediate; values, on the other hand, are predictions of rewards and are therefore secondary. Without rewards there are no values, and the only purpose of estimating values is to achieve more reward. Nevertheless, it is values that we consider when making and evaluating decisions; action choices are ultimately made based on value judgments.
The agent will seek actions that bring about states of highest value, not highest reward, because these states lead to actions that earn the greatest amount of reward over the long run.
How, then, do we determine values and rewards? Rewards are given directly by the environment, but values must be estimated and re-estimated from the sequences of observations the agent makes at each interaction. This makes efficient value estimation the most important component of reinforcement learning algorithms.
Another important element of some reinforcement learning systems is the model of the environment. This is something that mimics the behavior of the environment and allows inferences to be made about how the environment will react. The model helps the agent predict the next reward if an action is taken, and hence base the current action selection on the predicted future reaction of the environment.
Exploitation vs. exploration
A reinforcement learning agent gradually learns the best (or near-best) policy essentially through trial and error: it interacts with the environment, partly at random, and incorporates the responses to these interactions in order to improve its overall performance. The agent’s actions serve both as a means to explore (learn better strategies) and as a way to exploit (greedily use the best available strategy). Since exploration is costly in terms of resources, time and opportunity, a crucial question in reinforcement learning is how to handle the dichotomy between exploration of uncharted territory and exploitation of existing proven strategies. Specifically, the agent has to balance greedily exploiting what it has learned so far, by choosing the actions that currently yield the highest reward, against continuously exploring the environment to acquire more information and potentially achieve a higher value in the long term.
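One common and simple way to strike this balance is ε-greedy action selection: with a small probability ε the agent explores by choosing a random action, and otherwise it exploits the action with the highest estimated value. The sketch below is generic and assumes a hypothetical dictionary of action-value estimates:

import random

def epsilon_greedy(action_values, epsilon=0.1):
    """Pick an action: explore with probability epsilon, otherwise exploit."""
    actions = list(action_values)
    if random.random() < epsilon:
        return random.choice(actions)            # exploration
    return max(actions, key=action_values.get)   # exploitation

# Hypothetical current value estimates for four actions.
values = {"left": 0.2, "right": 0.5, "up": 0.1, "down": 0.4}
print(epsilon_greedy(values, epsilon=0.1))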
An example
To illustrate these ideas, let’s use a simple example: the vacuum-cleaner world.
This is a simple, made-up world, so we can describe everything that happens in it and consider several variations. This particular world has 9 locations: squares labelled by coordinates (i, j), where i = 1, 2, 3 and j = 1, 2, 3. The vacuum agent perceives which square it is in and whether there is dirt in the square. It can choose to move left, right, up or down, suck up the dirt, or do nothing. One very simple agent function is the following: if the current square is dirty, then suck; otherwise, move to the next square.
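A minimal sketch of that simple agent function, assuming percepts of the form ((i, j), status) and a row-by-row scan of the grid as one arbitrary choice of “next square”:

def vacuum_agent(percept):
    """Simple reflex agent: percept = ((i, j), status), with i, j in {1, 2, 3}."""
    (i, j), status = percept
    if status == "Dirty":
        return "Suck"
    # Otherwise move to the "next" square, scanning each row left to right.
    if j < 3:
        return "Right"
    if i < 3:
        return "Down"
    return "NoOp"  # the last square has been reached and is clean

print(vacuum_agent(((1, 1), "Dirty")))  # -> "Suck"
print(vacuum_agent(((1, 3), "Clean")))  # -> "Down"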
It’s important to define when a reward is given to the agent and whether it is positive or negative. A naive approach would be to give a positive reward only when the agent has cleaned all the squares. However, as the agent explores randomly, the chance of receiving that reward by cleaning all the squares is small. To guide the agent towards the desired goal, a better strategy is to give a small positive reward whenever it cleans a square, a small negative reward if it attempts to clean an already cleaned square, and a big positive reward when all squares are cleaned.
Here is how the vacuum-cleaner problem would be approached by making use of value functions. First, we set up a table of numbers, one for each possible state of this small world. Each number is the latest estimate of the probability of finishing the cleaning from that state. We treat this estimate as the state’s value, and the whole table is the learned value function. State A has a higher value than state B, or is considered “better” than state B, if the current estimate of the probability of finishing the cleaning from A is higher than it is from B. All the states in which every square is clean have a probability of finishing of 1, because the cleaning is already complete.
To select the next move, the agent examines the states that would result from each of its possible moves (one for each of the four directions, plus the two options of sucking up the dirt or not) and looks up their current values in the table. Most of the time the agent moves greedily, selecting the move that leads to the state with the greatest value, that is, with the highest estimated probability of finding a dirty square. Occasionally, however, the agent chooses randomly from among the other moves instead. These are called exploratory moves because they cause the agent to experience states that it might otherwise never see. The value of this exploration becomes apparent if we add to the value function a reward related to how fast the space is cleaned; this allows the agent to select a better traversal strategy for the space.
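A highly simplified sketch of this value-table idea, assuming a shrunken 2×2 version of the world in which the state is just the set of squares that are still dirty. The step size, exploration rate and initial values are arbitrary choices, and after each greedy move the previous state’s value is nudged toward the value of the state reached (a temporal-difference style update):

import random

# Tiny version of the world: the state is the tuple of squares still dirty.
SQUARES = ((1, 1), (1, 2), (2, 1), (2, 2))   # a 2x2 grid for brevity

ALPHA, EPSILON = 0.1, 0.1                    # step size, exploration rate
values = {}                                  # the learned value table

def value(state):
    # All-clean states count as finished (probability 1 of finishing);
    # every other state starts at an arbitrary initial guess of 0.5.
    return values.setdefault(state, 1.0 if not state else 0.5)

def successors(state):
    # Cleaning any one remaining dirty square yields a possible next state.
    return [tuple(s for s in state if s != sq) for sq in state]

for episode in range(1000):
    state = SQUARES                          # start with every square dirty
    while state:                             # loop until everything is clean
        options = successors(state)
        if random.random() < EPSILON:
            nxt = random.choice(options)     # exploratory move
        else:
            nxt = max(options, key=value)    # greedy move
            # Nudge the previous state's value toward the value of the
            # state we moved to (temporal-difference style update).
            values[state] = value(state) + ALPHA * (value(nxt) - value(state))
        state = nxt

print(round(value(SQUARES), 3))              # learned estimate for the start state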
The goal of MBML is "to provide a single development framework which supports the creation
of a wide range of bespoke models". This framework emerged from an important convergence
of three key ideas:
The core idea is that all assumptions about the problem domain are made explicit in the form of a
model. In this framework, a model is simply a set of assumptions about the world expressed in a
probabilistic graphical format with all the parameters and variables expressed as random
components.
The first key idea enabling this different framework for machine learning is Bayesian
inference/learning. In MBML, latent/hidden parameters are expressed as random variables with
probability distributions. This allows for a coherent and principled manner of quantification of
uncertainty in the model parameters. Once the observed variables in the model are fixed to their
observed values, the initially assumed probability distributions (i.e. priors) are updated using Bayes' theorem.
Factor Graphs
The second cornerstone to MBML is the use of Probabilistic Graphical Models (PGM),
particularly factor graphs. A PGM is a diagrammatic representation of the joint probability
distribution over all random variables in a model, expressed as a graph. Factor graphs are a type of PGM that consist of circular nodes representing random variables, square nodes for the conditional probability distributions (factors), and edges for the conditional dependencies between nodes (Figure 1). They provide a general framework for modeling the joint distribution of a set of random variables.
The joint probability P(μ, X) over the whole model in Figure 1 is factorized as:
P(μ, X) = P(μ) · P(X | μ)
where μ is the model parameter and X is the set of observed variables.
Figure 1: A Factor Graph
In factor graphs, we treat the latent parameters as random variables and learn their probability distributions using Bayesian inference algorithms along with the graph structure. Inference/learning is carried out on the product of factors over subsets of variables in the graph, which allows for easy implementation of local message-passing algorithms.
Stages of MBML
There are 3 steps to model-based machine learning, namely:
1. Describe the Model: Describe the process that generated the data using factor graphs.
2. Condition on Observed Data: Fix the observed variables to their known values.
3. Perform Inference: Perform backward reasoning to update the prior distribution over the latent
variables or parameters. In other words, calculate the posterior probability distributions of latent
variables conditioned on observed variables.
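These three stages can be illustrated with the simplest possible model. The sketch below is not tied to any particular MBML toolkit: it describes a coin-flip model with a Beta prior over the latent bias μ, conditions on a handful of assumed observations X, and performs the inference in closed form thanks to conjugacy:

# Stage 1 - describe the model:  P(mu, X) = P(mu) * P(X | mu)
#   prior:      mu ~ Beta(a, b)
#   likelihood: each flip X_i ~ Bernoulli(mu)
a, b = 2.0, 2.0                      # assumed prior pseudo-counts

# Stage 2 - condition on observed data (1 = heads, 0 = tails).
X = [1, 1, 0, 1, 1, 0, 1]

# Stage 3 - perform inference: with a conjugate Beta prior the posterior is
# again a Beta distribution, so the "backward reasoning" is a closed-form update.
heads, tails = sum(X), len(X) - sum(X)
post_a, post_b = a + heads, b + tails

posterior_mean = post_a / (post_a + post_b)
print(f"posterior over mu: Beta({post_a}, {post_b}), mean = {posterior_mean:.3f}")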
Write in detail about partially observable states in learning.
A partially observable Markov decision process (POMDP) is a combination of an MDP and a hidden
Markov model. Instead of the state being observable, there are partial and/or noisy observations of the state that the
agent gets to observe before it has to act.
A POMDP consists of:
• S, a set of states of the world
• A, a set of actions
• O, a set of possible observations
• P(S0), which gives the probability distribution of the starting state
• P(S′ | S, A), which specifies the dynamics – the probability of getting to state S′ by doing action A when in state S