Week-12 - Introduction To ML-NN-CNN

Dr. Ahmet Esad TOP
ahmetesadtop@aybu.edu.tr
o Formal definition by Tom M. Mitchell
o "A computer program is said to learn from experience E with respect to some class of tasks T and performance
measure P if its performance at tasks in T, as measured by P, improves with experience E."
o Learning from experience is natural
o For humans and animals
o Humans collect information from events or observations of facts
o When a new event occurs and the result is unknown,
o The collected knowledge is used

o ML enables machines to learn from previous experiences


o ML techniques learn directly from the data itself
o No pre-determined equations or explicitly programmed decisions
o They essentially build the path to the answer using just the data
o ML algorithms find patterns or regularities in data
o They can make judgments, estimations, or take actions

o Data size is crucial for ML


o As the number of samples increases, performance improves
Traditional Programming: Data + Program → Computer → Output

Machine Learning: Data + Output → Computer → Program
o ML is used when:
o Human expertise does not exist (navigating on Mars)
o Humans can’t explain their expertise (speech recognition)
o Models must be customized (personalized medicine)
o Models are based on huge amounts of data (genomics)

o Learning isn’t always useful:


o There is no need to “learn” to calculate payroll
o Examples: Network intrusion detection, e-mail filtering, speech recognition,
bioinformatics, and computer vision
o When developing explicit algorithms is challenging (or they fail)
o ML is employed
o ML is divided into two categories
o supervised learning
o unsupervised learning
o Unsupervised learning has no labels
o without the corresponding output
o some patterns and relations can be found
o It learns the underlying structure and distribution of the data (to model them)
o No correct answer or teacher (supervisor) is available in this learning type
o After the similarities or differences have been revealed
o the data can be grouped
o If grouped according to some rules → association solution
o association detects sets of items that frequently occur together in dataset
o If grouped according to inherent groupings in the data → clustering solution
o Clustering splits the dataset into groups according to similarities
o Common unsupervised learning algorithms are K-Means Clustering, Principal Component
Analysis (PCA), Hidden Markov Model, and Apriori algorithm
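As a rough illustration (not from the slides), here is a minimal clustering sketch in Python, assuming scikit-learn is available; the toy 2-D points are invented:

# Minimal unsupervised clustering sketch (illustrative toy data).
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled 2-D points: two loose groups around (0, 0) and (5, 5).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
               rng.normal(5, 0.5, size=(20, 2))])

# K-Means groups the data purely by similarity -- no labels are given.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # learned group centers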
o Learns from labeled data
o Data must be composed of pairs
o It learns a generalized mapping rule from the pairs
o from sample inputs to their desired outputs
o After training with samples
o it produces a function (mapping)
o that maps new inputs to their unknown outputs
o the intention is to accurately discover the labels
o success depends on the generalization capacity of the algorithm
o Y=f(X) → Y is output, X is input, and f() is the learned mapping function
o In supervised learning
o a supervisor assigns labels to data,
o then it is processed by one of the supervised learning algorithms
o to generate the desired function
o Supervised learning uses two techniques:
o classification
o predicts discrete responses
o their outputs are categorical such as "black" or "white"
o E.g.: "yes" or "no"
o regression
o predicts continuous responses
o their outputs are real numbers such as "temperature"
o they are used to generate predictive models
o Given (x_1, y_1), (x_2, y_2), ..., (x_n, y_n)
o Learn a function f(x) to predict y given x
o y is real-valued == regression

[Figure: September Arctic Sea Ice Extent (millions of sq km) vs. Year, 1970-2020]
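As a rough sketch of the regression idea above (the numbers below are invented placeholders, not the real measurements), one can fit a line f(x) to (year, extent) pairs and predict y for a new x:

# Minimal regression sketch: learn f(x) to predict a real-valued y.
import numpy as np

years = np.array([1980, 1990, 2000, 2010, 2020], dtype=float)
extent = np.array([7.8, 6.2, 6.3, 4.9, 3.9])   # hypothetical values, 10^6 sq km

# Fit a straight line f(x) = a*x + b by least squares.
a, b = np.polyfit(years, extent, deg=1)

# Predict y for a new, unseen x.
print(a * 2025 + b)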
o Given (x_1, y_1), (x_2, y_2), ..., (x_n, y_n)
o Learn a function f(x) to predict y given x
o y is categorical == classification

[Figure: Breast Cancer (Malignant / Benign) - tumor size on the x-axis, label (0 = Benign, 1 = Malignant) on the y-axis, with a threshold separating "Predict Benign" from "Predict Malignant"]
Classification: (2.1, 1.8) ⇒ good
Regression: (2.1, 1.8) ⇒ 0.9
With a likelihood of 90%, this email is good
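A minimal sketch of this email example (all feature values and training data below are invented for illustration): logistic regression returns both a discrete label and the probability behind it, assuming scikit-learn is available:

# Predict a discrete label ("good"/"spam") and the probability behind it.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is an email described by two numeric features; 1 = good, 0 = spam.
X_train = np.array([[2.0, 1.7], [2.3, 1.9], [0.4, 0.2], [0.5, 0.6]])
y_train = np.array([1, 1, 0, 0])

clf = LogisticRegression().fit(X_train, y_train)

x_new = np.array([[2.1, 1.8]])
print(clf.predict(x_new))        # discrete response, e.g. [1] -> "good"
print(clf.predict_proba(x_new))  # probabilities, e.g. roughly [[0.1, 0.9]]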
o One of the most popular ML approaches is Artificial Neural Networks (ANN)
o also referred to as neural networks (NN)
o It is an information processing system
o It can be employed for both unsupervised and supervised learning
o There is no pre-knowledge or set of programmed rules for the task expected
to be performed by the ANN before the training
o An ANN simply takes input data (i.e., example data)
o and learns the ability to perform the required task
o by parsing the data and detecting patterns inside the data
o ANN learns (e.g., categorizing animals from images like ‘wolf’, ’giraffe’, or ‘dog’) by using
sample images
o it gets the image with a corresponding output (i.e., labels) as its training data
o they should be in the form of described pairs
o Initially, ANN uses the first sample image as its input
o feeds forward, and then receives the output of the first image
o According to the output, it measures the error
o analyzes how close it is to the intended result
o Then, it makes some adjustments to the weights
o by using the gradient descent algorithm
o Its weights are more accurately adjusted after several iterations
o using various samples
o An ANN is a network made up of several nodes
o each node communicates with linked nodes
o the receiving node processes what it gets and sends the new information to the next linked nodes
o Nodes are referred to as "artificial neurons" and are comparable to biological neurons
o Each node uses a nonlinear function to produce its output
o the function’s input is the sum of all the inputs of the node
o Edges, which are like "neurotransmitters", are the connections between nodes
o Each edge has its own weight, which is going to be updated after a backpropagation pass
o Layers are groups of neurons that are on the same level
o Each layer waits until the preceding layer has completed all of its computations
o In 1958, Rosenblatt invented the perceptron
o his single-layer perceptron was unable to solve the XOR problem
o until the backpropagation method was created in 1975
o One-layered perceptrons (excluding the input layer) can be used to solve "AND" and
"OR" gates
o but a one-layered perceptron cannot be used to create an "XOR" gate
o as a single line is not enough to split "XOR" in a Cartesian plane
o The structure and function of the human brain (i.e., biological neural networks and
neurons) serve as an inspiration for ANN
o Inputs are feature values
o Each feature has a weight
o Sum is the activation: activation = w1·f1 + w2·f2 + w3·f3
o If the activation is:
o Positive, output +1
o Negative, output -1
[Diagram: inputs f1, f2, f3 weighted by w1, w2, w3 feed a summation node followed by a ">0?" threshold]
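A minimal sketch of this perceptron unit (the feature values and weights below are arbitrary illustrative numbers):

# Perceptron unit: activation = sum_i w_i * f_i, output = sign of the activation.
import numpy as np

def perceptron_output(features, weights):
    activation = np.dot(weights, features)   # weighted sum of the inputs
    return 1 if activation > 0 else -1       # positive -> +1, negative -> -1

# Arbitrary illustrative numbers for (f1, f2, f3) and (w1, w2, w3).
f = np.array([0.5, -1.0, 2.0])
w = np.array([0.8, 0.3, -0.2])
print(perceptron_output(f, w))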
o "XOR" must be separated by using at least 2 lines
o at least a 2-layered perceptron network (excluding the input layer) is required
o 2 hidden layers, as layers between the input and the output are known as hidden layers
o This system is called as "Multilayer Perceptron (MLP)"
o Every node except the input layer uses a nonlinear activation function for its output
o e.g., sigmoid function or hyperbolic tangent
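As a sketch of why one hidden layer is enough for XOR, here is a hand-wired 2-layer perceptron (the weights are chosen by hand for illustration, not learned):

# A hand-wired MLP (one hidden layer + output) that computes XOR,
# something a single-layer perceptron cannot do.
import numpy as np

def step(x):
    return (x > 0).astype(int)   # threshold activation

def xor_mlp(x1, x2):
    x = np.array([x1, x2])
    # Hidden layer: first unit acts as OR, second as AND (hand-chosen weights).
    h = step(np.array([[1.0, 1.0], [1.0, 1.0]]) @ x + np.array([-0.5, -1.5]))
    # Output unit: OR AND (NOT AND)  ->  XOR.
    return int(step(np.array([1.0, -1.0]) @ h + np.array([-0.5]))[0])

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_mlp(a, b))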
o MLPs employ the backpropagation approach for their training phase
o While training an ANN, the gradient needs to be calculated to update the weights
o after a forward pass
o this is done by backpropagation
o Gradient descent is employed
o it determines the gradient of the loss function
o The error is propagated to previous layers and neurons
o directly or indirectly connected neurons to the output neuron
o Each neuron’s net (i.e., incoming) values are calculated
o Each neuron’s out (i.e., outgoing) values are calculated
o The squared errors for the outputs are then determined
o The squared error cost function is minimized using
gradient descent
o weights are revised after each iteration, and this process
continues until the cost is as low as possible
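A minimal gradient-descent sketch for a single linear unit with a squared-error cost (the data and learning rate are illustrative, not from the lecture):

# Gradient descent on a squared-error cost for one linear unit.
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])         # true relation: y = 2x + 1

w, b = 0.0, 0.0                            # initial weights
lr = 0.05                                  # learning rate (must be small enough)

for _ in range(2000):
    error = (w * X + b) - y                # prediction error on all samples
    grad_w = 2 * np.mean(error * X)        # gradient of the squared-error cost w.r.t. w
    grad_b = 2 * np.mean(error)            # gradient w.r.t. b
    w -= lr * grad_w                       # step against the gradient
    b -= lr * grad_b

print(round(w, 2), round(b, 2))            # approaches 2 and 1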
o Gradient descent is guaranteed to converge to a
hypothesis with minimum squared error
o If the given learning rate is sufficiently small
o Gradient descent has the risk that it can over-step
the minimum in the error surface
o If the learning rate is too large
o Gradient descent may not find the global optimum
o If there are multiple local optima in the error surface
o Converging to a local optimum is sometimes slow
o To overcome these issues, variants of gradient descent were developed
o batch gradient descent tends to overshoot the global optimum
o as it updates only after seeing the whole dataset
o stochastic gradient descent tends to get stuck at local optima
o as it updates after seeing each sample
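A small sketch of the difference in update schedule only, reusing the toy linear-unit setup from above (data and learning rate invented):

# Batch vs. stochastic gradient descent: when the update happens.
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
lr = 0.05

# Batch gradient descent: one update per pass over the whole dataset.
w = b = 0.0
for _ in range(500):
    error = (w * X + b) - y
    w -= lr * 2 * np.mean(error * X)
    b -= lr * 2 * np.mean(error)

# Stochastic gradient descent: one update per individual sample.
w_s = b_s = 0.0
for _ in range(500):
    for xi, yi in zip(X, y):
        error = (w_s * xi + b_s) - yi
        w_s -= lr * 2 * error * xi
        b_s -= lr * 2 * error

print(round(w, 2), round(b, 2), "|", round(w_s, 2), round(b_s, 2))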
o Deep learning (DL) is a subset of ML where the learning procedure occurs in deeper
structures
o Deeper structures → the presence of several hidden layers
o Deep networks may contain tens or even hundreds of hidden layers
o traditional NNs only have one or two hidden layers
o Unlike classical ML, DL eliminates the requirement for manual feature extraction
o by converting the data into intermediate feature representations
o it extracts features first-hand from the data itself
o Another standout benefit of DL is its capability to continuously enhance its performance
o it keeps getting better performance as the size of the data increases
o improvements in technology → huge amounts of data and powerful GPUs have become available
o this situation has made DL very popular recently
o "Deep Learning" term came out to AI community in 1986 by Rina Dechter
o In 1989, Yann LeCun et al. developed a DNN that could read handwritten ZIP codes from mail using
the backpropagation approach
o ML approaches were more popular back in the day because of the high processing cost
of ANNs
o Later, advances in GPU technology became more significant
o making DL considerably more popular than other methods
o The "Big Bang of DL" occurred in 2009, when Nvidia trained DNNs on Nvidia GPUs
o NNs form the basis of the majority of all DL methods
o DNN is an ANN with multiple hidden layers
o As DNNs are feedforward networks, data travels
from the input layer to the output layer
o DNNs have a strong modeling capability since they can capture linear or non-linear relations
o When modeling complex data, adding more hidden layers may reduce the number of units required in each hidden layer
o since deeper layers can build combinations of the features from previous layers
o In general, DNNs are very challenging to train
o CNN is a notable exception for training deep networks
o A 7-layered CNN called LeNet-5 was introduced in 1998 by LeCun et al. to recognize digits from 32x32 images
o The resolution was limited to 32x32 due to the limited hardware capabilities available at the time
o CNNs attracted great attention after the computing industry acquired advanced
hardware capabilities.
o Several studies have utilized and demonstrated how to train CNNs on GPUs and their appropriate
approaches
o Nowadays, CNN is one of the most prominent and effective DNN types
o It is generally applied to computer vision such as image/video recognition
o The CNN is designed to minimize the need for pre-processing
o making it a very convenient choice for many applications
o CNN does not require manual feature engineering
o CNNs have also achieved state-of-the-art results
o can be re-trained for new data or tasks
o When people view a picture of a cat
o they can identify it based on its unique features
o such as its claws, four legs, tail, and whiskers
o Similarly, a CNN can classify a picture of a cat by processing the low-level features
o such as curves and edges
o and then creating more abstract concepts using multiple convolutional layers
o Traditional MLP architectures suffer from not scaling well to higher-resolution images
o Due to the "curse of dimensionality"
o a phenomenon in which the number of weights required by the model grows rapidly with the input size
o The reason behind this is full connectivity between nodes
o a fully connected neuron on a 32x32 input image requires 1,024 weights
o a fully connected neuron on a 224x224 input image requires 50,176 weights
o However, a convolutional layer in a CNN can operate with a much smaller number of weights
o E.g., a 7x7 filter that convolves on a 32x32 or 224x224 image will always require only 49 learnable parameters
o regardless of the size of the input image
o This efficiency makes CNNs a more practical choice for image classification tasks
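A back-of-the-envelope sketch of the weight counts quoted above (weights per single fully connected output neuron vs. per single convolutional filter, biases ignored):

# Weight counts behind the numbers quoted above.
def fc_weights_per_neuron(height, width):
    return height * width             # full connectivity to every input pixel

def conv_weights_per_filter(filter_size):
    return filter_size * filter_size  # shared weights, independent of image size

print(fc_weights_per_neuron(32, 32))     # 1024
print(fc_weights_per_neuron(224, 224))   # 50176
print(conv_weights_per_filter(7))        # 49, for 32x32 and 224x224 alike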
o CNNs are composed of layers with three-dimensional neurons
o each of which is connected to a small region of the previous layer known as the receptive field
o this structure allows CNNs to operate with fewer weights compared to traditional MLP architectures
o as the connections between neurons are more localized and not fully connected
o CNNs differ from ANNs in the types of operations performed by their hidden layers
o They include a combination of
o convolutional layers,
o pooling layers,
o a softmax layer,
o fully connected layers,
o and Rectified Linear Units (ReLUs)
o The convolutional layer is always the primary component of every CNN
o The other layers are inserted between convolutional layers
o The fully connected layers are placed at the end of the network
o These hidden layers serve to introduce non-linearity and maintain the dimensions of the input data
o The convolutional layer is characterized by a set of learnable filters
o These filters are used to scan the receptive field to search for matching patterns
o The filters are represented as rectangular arrays of numbers that serve as feature identifiers
o helping the CNN to identify and extract important features from the input data
o the filters slide across the width and height of the input image
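A minimal sketch of a filter sliding over an input in plain NumPy (cross-correlation with "valid" padding; in a real CNN the filter values are learned, the ones below are hand-picked):

# A filter sliding over an image, producing a feature map.
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Each output value is the filter applied to one receptive field.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)     # simple vertical-edge detector
print(conv2d(image, edge_filter))                  # 4x4 feature map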
o ReLU is a type of activation function that is
widely used in deep learning models
o introduces non-linearity to the model
o generally follows each convolutional layer
o One of the main advantages of ReLU is its
simplicity
o its implementation is pretty straightforward and
does not require additional hyperparameters
o ReLU has improved the training speed of DNNs
o compared to other activation functions (e.g.,
sigmoid or hyperbolic tangent).
o ReLU also helps to alleviate the vanishing
gradient problem
o which occurs when the gradients of the weights
in the network become too small (i.e., training
becomes slow).
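A one-function sketch of ReLU, next to sigmoid for comparison:

# ReLU: max(0, x), applied element-wise after a convolutional layer.
import numpy as np

def relu(x):
    return np.maximum(0, x)         # negatives become 0, positives pass through

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))     # [0. 0. 0. 0.5 2.]
print(sigmoid(x))  # saturates toward 0 or 1, which can shrink gradients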
o Pooling layer is an essential component of CNN
architectures
o the majority of them are used right after the convolutional
layers
o Their duty is to reduce the spatial dimensions of the results
generated by the convolutional layers
o by merging the outputs of multiple neurons into a single neuron
that utilizes non-linear functions
o This simplification (i.e., downsampling) operation reduces
the number of parameters
o hence reduces computational overhead, as well as helping to
prevent overfitting
o Overfitting occurs when a model becomes too closely tuned
(i.e., close to ideal or fully ideal) to the training data but fails
on the test data (i.e., a generalization issue)
o pooling layers also help to maintain the spatial invariance
of the network
o i.e., can recognize an object regardless of its position in the
image
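A minimal 2x2 max-pooling sketch (stride 2) showing the downsampling on a toy feature map:

# 2x2 max pooling: each 2x2 block collapses to its maximum value.
import numpy as np

def max_pool_2x2(x):
    h, w = x.shape
    # Reshape into 2x2 blocks and take the max of each block.
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 0],
                        [4, 6, 1, 2],
                        [7, 2, 9, 4],
                        [3, 1, 5, 8]], dtype=float)
print(max_pool_2x2(feature_map))   # [[6. 2.] [7. 9.]]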
o Dropout is a handy regularization technique used to
reduce overfitting in neural networks
o It can be applied at different levels of the network
o This method consists of randomly dropping out certain
activations in a layer, with a probability commonly set at
0.5
o half of the hidden neurons are dropped out randomly
o once the training is completed, these neurons are recovered
with their weights
o This technique is beneficial for preventing overfitting
o It can also improve the model’s generalization
capabilities
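A minimal dropout sketch; the 1/p ("inverted dropout") scaling used here is one common convention so that no change is needed at test time, and the activation values are invented:

# Dropout during training: keep each activation with probability p, zero it otherwise.
import numpy as np

def dropout(activations, p=0.5, training=True, seed=0):
    if not training:
        return activations                     # at test time all neurons are used
    rng = np.random.default_rng(seed)
    mask = rng.random(activations.shape) < p   # randomly keep roughly half of the neurons
    return activations * mask / p              # scale so the expected activation is unchanged

layer_out = np.array([0.2, 1.5, 0.7, 2.0, 0.1, 0.9])
print(dropout(layer_out, training=True))
print(dropout(layer_out, training=False))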
o FC layers, also known as dense layers, are typically
placed at the end of a CNN
o Learns non-linear combinations of high-level feature
activations that have been extracted through a series
of convolutional and pooling layers
o FC layers are also used to map high-level feature
activations to the final output, making predictions or
classifications
o Combines and mixes important information from all
preceding convolutional layers
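A minimal dense-layer sketch: flatten the pooled feature maps, then apply weights, a bias, and a non-linearity (all shapes and values are illustrative):

# Fully connected (dense) layer: outputs = activation(W @ x + b).
import numpy as np

rng = np.random.default_rng(0)
pooled = rng.normal(size=(8, 4, 4))     # e.g. 8 feature maps of size 4x4
x = pooled.reshape(-1)                  # flatten to a 128-dimensional vector

W = rng.normal(scale=0.1, size=(10, x.size))   # 10 output units, fully connected
b = np.zeros(10)

dense_out = np.maximum(0, W @ x + b)    # ReLU on the dense outputs
print(dense_out.shape)                  # (10,)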
o The Softmax layer’s main duty is to perform multi-class classification
o It is typically placed at the end of a CNN as the last layer
o The softmax function (i.e., a probability distribution function) inspired the name of this layer
o through the softmax function, it produces probabilities of each class
o indicates the output class that the input is most likely to belong to
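A minimal softmax sketch turning the final scores into class probabilities (the class names and scores are invented):

# Softmax: turn raw class scores into a probability distribution.
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.1])      # e.g. scores for "cat", "dog", "wolf"
probs = softmax(scores)
print(probs)            # sums to 1; the highest score gets the highest probability
print(probs.argmax())   # index of the predicted class (0 here)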
Thanks for your attention!
